Deck 9: Predictive Data Mining

1
______________ is NOT a step of the data mining process.

A)Data sampling
B)Data partitioning
C)Model construction
D)Supervised learning
Answer: D) Supervised learning
2
Determine a freshman's likely first-year grade point average from the student's Scholastic Aptitude Test (SAT) score, high school grade point average, and number of extra-curricular activities. This is an example of

A)classification of a categorical outcome.
B)estimation of a continuous outcome.
C)prediction of a categorical outcome.
D)unsupervised learning.
Answer: B) estimation of a continuous outcome.
3
___________ is dividing the sample data into three sets for training, validation, and testing of the data-mining algorithm performance.

A)Data sampling
B)Data partitioning
C)Data preparation
D)Model assessment
Answer: B) Data partitioning
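As a rough illustration of the partitioning described in this card, the sketch below shuffles a sample and splits it into training, validation, and test sets. The function name `partition`, the 50/30/20 proportions, and the fixed seed are assumptions made for the example; the deck does not specify them.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=0):
    """Split records into (training, validation, test) lists.

    The fractions and seed are illustrative assumptions, not values from the deck.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)                      # randomize order before splitting
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                 # used to build candidate models
    valid = shuffled[n_train:n_train + n_valid]  # used to compare candidate models
    test = shuffled[n_train + n_valid:]        # held back to assess the final model
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Every record lands in exactly one of the three sets, so no observation used to build a model is reused to assess it.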
4
__________ is one minus the Class 0 error rate.

A)Sensitivity
B)Specificity
C)Accuracy
D)Cutoff value
5
Estimation methods are also referred to as

A)prediction methods.
B)clustering methods.
C)association methods.
D)supervised methods.
6
A characteristic or quantity of interest that can take on different values is a(n)

A)variable.
B)observation.
C)record.
D)quality.
7
A(n) _______________ is often displayed as a row of values in a spreadsheet or database in which the columns correspond to the variables.

A)record
B)data point
C)classification
D)location
8
____________ is a category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.

A)Supervised Learning
B)Unsupervised Learning
C)Dimension Reduction
D)Data Sampling
9
Data used to build a data mining model is called

A)validation data.
B)training data.
C)test data.
D)exploration data.
10
____________ is a method of extracting data relevant to the business problem under consideration. It is the first step in the data mining process.

A)Data sampling
B)Data partitioning
C)Model construction
D)Model assessment
11
As the cutoff value increases, the _______ error rate decreases and the _________ error rate rises.

A)Class 0, Class 1
B)Class 1, Class 0
C)error, accuracy
D)false, true
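The effect of the cutoff value can be checked numerically. The sketch below uses ten invented (estimated probability, actual class) pairs and assumes a record is classified as Class 1 when its estimated probability is at least the cutoff; both the data and the helper name `class_error_rates` are hypothetical.

```python
# Invented (estimated probability of Class 1, actual class) pairs, for illustration only.
records = [
    (0.95, 1), (0.85, 1), (0.70, 0), (0.65, 1), (0.55, 0),
    (0.45, 1), (0.30, 0), (0.20, 0), (0.10, 0), (0.05, 0),
]

def class_error_rates(records, cutoff):
    """Return (Class 0 error rate, Class 1 error rate) at the given cutoff."""
    n0 = sum(1 for _, actual in records if actual == 0)
    n1 = sum(1 for _, actual in records if actual == 1)
    # Assumption: predict Class 1 when the probability is at least the cutoff.
    false_pos = sum(1 for p, actual in records if actual == 0 and p >= cutoff)
    false_neg = sum(1 for p, actual in records if actual == 1 and p < cutoff)
    return false_pos / n0, false_neg / n1

for cutoff in (0.25, 0.50, 0.75):
    e0, e1 = class_error_rates(records, cutoff)
    print(f"cutoff={cutoff:.2f}  Class 0 error={e0:.2f}  Class 1 error={e1:.2f}")
```

Comparing the printed rows shows how the two error rates move in opposite directions as the cutoff changes.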
12
____________ is the manipulation of the data with the goal of putting it in a form suitable for formal modeling.

A)Data sampling
B)Data partitioning
C)Data preparation
D)Model assessment
13
Data-mining methods for predicting an outcome based on a set of input variables are referred to as

A)supervised learning.
B)unsupervised learning.
C)dimension reduction.
D)data sampling.
14
Classifying a record as belonging to one class when it belongs to another class is referred to as a(n)

A)overall error rate.
B)error.
C)accuracy.
D)class.
15
Misclassifying an actual ______ observation as a(n) ______ observation is known as a false positive.

A)Class 0, Class 1
B)Class 1, Class 0
C)error, accuracy
D)false, true
16
The set of recorded values of variables associated with a single entity is a(n)

A)observation.
B)data point.
C)classification.
D)location.
17
_____________ is the step in data mining that includes addressing missing and erroneous data, reducing the number of variables, defining new variables, and data exploration.

A)Data sampling
B)Data partitioning
C)Data preparation
D)Model assessment
18
______________ involves descriptive statistics, data visualization, and clustering.

A)Data exploration
B)Data partitioning
C)Data preparation
D)Model assessment
19
Applying descriptive statistics and data visualization to the training set to understand the data and assist in the selection of an appropriate technique is a part of

A)data exploration.
B)data partitioning.
C)data preparation.
D)model assessment.
20
The percent of misclassified records out of the total records in the validation data is known as the

A)overall error rate.
B)error.
C)accuracy.
D)class.
21
___________ is a generalization of linear regression for predicting a categorical outcome variable.

A)Multiple linear regression
B)Logistic regression
C)Discriminant analysis
D)Cluster analysis
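For intuition on how a linear function of explanatory variables can be turned into a probability for a categorical outcome, the sketch below applies the logistic function to a linear score with a single explanatory variable. The coefficients `b0` and `b1` are hypothetical values chosen for the example, not estimates from any data set.

```python
import math

def logistic_probability(x, b0, b1):
    """Estimated probability of Class 1 for one explanatory variable x."""
    linear_score = b0 + b1 * x                       # the linear-regression-style part
    return 1.0 / (1.0 + math.exp(-linear_score))    # squashed into the interval (0, 1)

b0, b1 = -4.0, 0.8   # hypothetical coefficients, for illustration only
for x in (2.0, 5.0, 8.0):
    print(x, round(logistic_probability(x, b0, b1), 3))
```

Comparing the resulting probability with a cutoff value then yields a class label.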
22
Separate error rates with respect to the false negative and false positive cases are computed to take into account the

A)asymmetric costs in misclassification.
B)symmetric weights of these two cases.
C)distortions due to outliers.
D)effect of sampling error.
23
The X axis of a lift chart shows

A)number of actual Class 1 records identified.
B)ratio of decile mean to overall mean.
C)the number of actual Class 1 records.
D)the ratio of the overall mean to the decile mean.
24
Given the following classification confusion matrix, what is the accuracy?
[Confusion matrix image missing from the source.]
25
Given the following classification confusion matrix, what is the overall error rate?
[Confusion matrix image missing from the source.]
26
A _____ classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules.

A)regression tree
B)scatter chart
C)classification tree
D)classification confusion matrix
27
The impurity of a group of observations is based on the variance of the outcome value for the observations in the group for

A)regression trees.
B)time-series plots.
C)classification trees.
D)cumulative lift charts.
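A variance-based impurity measure can be sketched in a few lines. The example below computes the variance of the outcome values in a group and the size-weighted variance after a candidate split; the outcome values, the split, and the helper name `weighted_variance` are invented for illustration.

```python
from statistics import pvariance

def weighted_variance(groups):
    """Size-weighted average of the outcome variance within each group."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * pvariance(g) for g in groups)

# Invented outcome values for six observations in one group.
parent = [10, 12, 11, 30, 32, 31]
# A hypothetical split that separates the low outcomes from the high ones.
left, right = [10, 12, 11], [30, 32, 31]

print("impurity before split:", pvariance(parent))
print("impurity after split: ", weighted_variance([left, right]))
```

A split that lowers the weighted variance produces groups whose outcome values are more homogeneous.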
28
_______ compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability with the number identified when observations are randomly selected.

A)Cumulative lift
B)Classification confusion
C)Decile-wise lift chart
D)ROC curve
29
Which of the following is a commonly used supervised learning method?

A)k-means clustering
B)k-nearest neighbors
C)hierarchical clustering
D)association rule development
30
_____ is a measure of the heterogeneity of observations in a classification tree.

A)Sensitivity
B)Specificity
C)Accuracy
D)Impurity
31
One minus the overall error rate is often referred to as the _____ of the model.

A)sensitivity
B)accuracy
C)specificity
D)cutoff value
32
How many Class 1's are correctly classified as Class 1 in the table below?

Classification Confusion Matrix
                Predicted Class
Actual Class    1        0
1               221      100
0               30       3,000

A)221
B)100
C)30
D)3,000
33
An observation classified as part of a group with a characteristic when it actually does not have the characteristic is termed as a(n)

A)false negative.
B)false positive.
C)residual.
D)outlier.
34
The Y axis of a decile chart shows

A)number of important class records identified.
B)ratio of decile mean to overall mean.
C)the number of actual Class 1 records.
D)the ratio of the overall mean to the decile mean.
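One way to compute the bars of a decile-wise lift chart is sketched below: records are ranked by estimated probability, cut into ten equal groups, and each group's mean actual outcome is compared with the overall mean. The twenty records are invented, and the function assumes the number of records is a multiple of ten.

```python
def decile_wise_lift(probs, actual):
    """Return ten ratios comparing each decile's mean outcome with the overall mean.

    Assumes len(probs) == len(actual) and that the length is a multiple of 10.
    """
    # Rank actual outcomes by decreasing estimated probability.
    ranked = [a for _, a in sorted(zip(probs, actual), key=lambda pa: pa[0], reverse=True)]
    overall_mean = sum(ranked) / len(ranked)
    size = len(ranked) // 10
    lifts = []
    for d in range(10):
        group = ranked[d * size:(d + 1) * size]
        lifts.append((sum(group) / len(group)) / overall_mean)
    return lifts

# Invented estimated probabilities and actual classes for 20 records.
probs = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50,
         0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.01]
actual = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

for decile, lift in enumerate(decile_wise_lift(probs, actual), start=1):
    print(decile, round(lift, 2))
```

A value above 1 for the early deciles indicates that the model concentrates Class 1 records among its highest-probability predictions.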
35
In the k-nearest neighbors method, when the value of k is set to 1

A)the classification or prediction of a new observation is based solely on the single most similar observation from the training set.
B)the new observation's class is naïvely assigned to the most common class in the training set.
C)the new observation's prediction is used to estimate the anticipated error rate on future data over the entire training set.
D)the classification or prediction of a new observation is subject to the smallest possible classification error.
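The role of k can be seen in a minimal k-nearest neighbors sketch. The version below assumes Euclidean distance and a majority vote among the k closest training observations; the five training records and the function name `knn_classify` are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(train, new_point, k=1):
    """Classify new_point by majority vote among its k nearest training records.

    train is a list of (features, label) pairs; distance is Euclidean (an assumption).
    """
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Invented training observations: (features, class label).
train = [
    ((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.5), 0),
    ((2.4, 2.4), 1), ((5.0, 5.0), 1),
]
new_point = (2.5, 2.5)

print("k=1:", knn_classify(train, new_point, k=1))
print("k=3:", knn_classify(train, new_point, k=3))
```

With k = 1 the result depends only on the closest training record, so a single unusual neighbor can decide the class; a larger k averages over more neighbors.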
36
_______ attempts to classify a categorical outcome as a linear function of explanatory variables.

A)Linear regression
B)Logistic regression
C)Classification model
D)Supervised learning
37
How many Class 1's are incorrectly classified as Class 0 in the table below?

Classification Confusion Matrix
                Predicted Class
Actual Class    1        0
1               221      100
0               30       3,000

A)221
B)100
C)30
D)3,000
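The measures used throughout this deck can all be read off a classification confusion matrix. The sketch below uses the counts from the table in this card; the variable names are chosen for the example, and the formulas follow the usual definitions (sensitivity as one minus the Class 1 error rate, specificity as one minus the Class 0 error rate).

```python
# Counts from the confusion matrix in this card (rows: actual, columns: predicted).
actual1_pred1 = 221    # Class 1 correctly classified as Class 1
actual1_pred0 = 100    # Class 1 misclassified as Class 0 (false negatives)
actual0_pred1 = 30     # Class 0 misclassified as Class 1 (false positives)
actual0_pred0 = 3000   # Class 0 correctly classified as Class 0

total = actual1_pred1 + actual1_pred0 + actual0_pred1 + actual0_pred0

overall_error_rate = (actual1_pred0 + actual0_pred1) / total
accuracy = 1 - overall_error_rate

class1_error_rate = actual1_pred0 / (actual1_pred1 + actual1_pred0)
class0_error_rate = actual0_pred1 / (actual0_pred1 + actual0_pred0)

sensitivity = 1 - class1_error_rate
specificity = 1 - class0_error_rate

print(f"overall error rate: {overall_error_rate:.4f}")
print(f"accuracy:           {accuracy:.4f}")
print(f"sensitivity:        {sensitivity:.4f}")
print(f"specificity:        {specificity:.4f}")
```

Reporting the two class error rates separately is useful when one type of misclassification is costlier than the other.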
38
_____ refers to the scenario in which the analyst builds a model that does a great job of explaining the sample of data on which it is based but fails to accurately predict outside the sample data.

A)Underfitting
B)Overfitting
C)Oversampling
D)Undersampling
39
A(n) __________ matrix displays a model's correct and incorrect classifications.

A)cumulative lift
B)classification confusion
C)decile-wise lift chart
D)ROC curve
40
The test set is the data set used to

A)build the data mining model.
B)estimate accuracy of candidate models on unseen data.
C)estimate accuracy of final model on unseen data.
D)show counts of actual versus predicted class values.