Question 1

In which of the following scenarios would it be appropriate to use hierarchical clustering?

Accepted Answer

A)  When the number of observations in the dataset is relatively high 
B)  When it is not necessary to know the nesting of clusters 
C)  When the number of clusters is known beforehand 
D)  When binary or ordinal data needs to be clustered

Question 2

Jaccard's coefficient is different from the matching coefficient in that the:

Accepted Answer

A)  former measures overlap while the latter measures dissimilarity.
B)  former does not count matching zero entries while the latter does.
C)  former deals with categorical variables while the latter deals with continuous variables.
D)  former is affected by the scale used to measure variables while the latter is not.
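
To make the two similarity measures concrete, here is a minimal sketch for binary (0/1) observations. The function names and the sample vectors are illustrative assumptions, not part of the exam material.

```python
def matching_coefficient(u, v):
    """Share of variables on which two binary observations agree (1-1 or 0-0)."""
    agree = sum(1 for a, b in zip(u, v) if a == b)
    return agree / len(u)


def jaccard_coefficient(u, v):
    """Share of agreeing variables after the 0-0 matches are left out."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    considered = len(u) - both_zero
    return both_one / considered if considered else 0.0


# Hypothetical observations: five yes/no attributes each.
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]

print(matching_coefficient(u, v))  # 4 agreements out of 5 variables: 0.8
print(jaccard_coefficient(u, v))   # 1 shared "1" out of 2 non-(0,0) variables: 0.5
```

The three shared zeros raise the matching coefficient but are ignored by Jaccard's coefficient.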

Question 3

A bank is interested in identifying different attributes of its customers; sample data for 150 customers are given below. In the data table, for the dummy variable Gender, 0 represents Male and 1 represents Female. For the dummy variable Personal loan, 0 represents a customer who has not taken a personal loan and 1 represents a customer who has taken a personal loan.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Classify the data using k-nearest neighbors with up to k = 10. Use Age, Gender, Work experience, Income (in 1000 $), and Family size as input variables and Personal loan as the output variable. In Step 2 of XLMiner's k-nearest neighbors Classification procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate lift charts for both the validation data and test data. 
a. For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the validation data? Explain the difference in the overall error rate on the training, validation, and test data.
b. Examine the decile-wise lift chart on the test data. Identify and interpret the first decile lift. 
c. For cutoff probability values of 0.5, 0.4, 0.3, and 0.2, what are the corresponding Class 1 error rates and Class 0 error rates on the validation data?

Accepted Answer

a. The overall error rate is minimized a
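
The procedure in this question (normalize the inputs, score each k, keep the k with the lowest validation error at a 0.5 cutoff) can be sketched in plain Python. This is only an illustration under stated assumptions: the observations are made up, only two of the input variables are used, and z-score normalization fitted on the training rows is assumed, which may differ from what XLMiner does internally.

```python
import math


def zscore_params(rows):
    """Column means and standard deviations, computed on the training rows only."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return means, stds


def normalize(rows, means, stds):
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def class1_probability(train_x, train_y, query, k):
    """Fraction of the k nearest training observations that belong to Class 1."""
    nearest = sorted(zip((math.dist(x, query) for x in train_x), train_y))[:k]
    return sum(label for _, label in nearest) / k


def overall_error_rate(train_x, train_y, data_x, data_y, k, cutoff=0.5):
    wrong = sum(
        (1 if class1_probability(train_x, train_y, x, k) >= cutoff else 0) != y
        for x, y in zip(data_x, data_y)
    )
    return wrong / len(data_y)


# Made-up observations: (Income in $1000s, Family size) -> Personal loan.
train_raw = [(30, 1), (35, 2), (40, 1), (90, 3), (100, 4), (110, 3)]
train_y = [0, 0, 0, 1, 1, 1]
valid_raw = [(38, 2), (95, 4), (50, 2)]
valid_y = [0, 1, 0]

means, stds = zscore_params(train_raw)
train_x = normalize(train_raw, means, stds)
valid_x = normalize(valid_raw, means, stds)

# Validation error for each candidate k; ties go to the smallest k.
errors = {k: overall_error_rate(train_x, train_y, valid_x, valid_y, k)
          for k in range(1, 6)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

Because the validation set is used to choose k, its error rate tends to be optimistic; the test set gives the less biased estimate.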

Question 4

_____ is a measure of the heterogeneity of observations in a classification tree.

Accepted Answer

A)  Sensitivity 
B)  Specificity 
C)  Accuracy 
D)  Impurity
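
One common way to quantify how mixed the observations in a tree node are is the Gini index. The sketch below assumes that measure; the exam does not say which one it has in mind, and the function name is illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a perfectly pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


print(gini_impurity([1, 1, 1, 1]))  # pure node
print(gini_impurity([0, 0, 1, 1]))  # evenly mixed two-class node
print(gini_impurity([1, 1, 1, 0]))  # mostly one class
```

A split is attractive when the weighted impurity of the child nodes is lower than the impurity of the parent.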

Question 5

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using multiple linear regression. Use Sale Price as the output variable and all the other variables as input variables. To generate a pool of models to consider, execute the following steps. In Step 2 of XLMiner's Multiple Linear Regression procedure, click the Best subset option. In the Best Subset dialog box, check the box next to Perform best subset selection, enter 6 in the box next to Maximum size of best subset:, enter 1 in the box next to Number of best subsets:, and check the box next to Exhaustive search. 
a. From the generated set of multiple linear regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. For your model, what is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

a. Using goodness-of-fit measures such a
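
Parts b and c rely on two different summaries of prediction error. The sketch below uses made-up sale prices and assumes the convention error = actual minus predicted; the function names are illustrative.

```python
import math


def average_error(actual, predicted):
    """Mean of (actual - predicted); the sign shows systematic under- or over-prediction."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def rmse(actual, predicted):
    """Root mean squared error; positive and negative errors cannot cancel."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical sale prices in $1000s.
actual = [200, 150, 300, 250]
predicted = [210, 140, 290, 260]

print(average_error(actual, predicted))  # errors of -10, 10, 10, -10 cancel out
print(rmse(actual, predicted))           # every error has magnitude 10
```

An average error near zero suggests the model is not biased in one direction, while the RMSE still reports the typical size of a miss.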

Question 6

A bank is interested in identifying different attributes of its customers; sample data for 150 customers are given below. In the data table, for the dummy variable Gender, 0 represents Male and 1 represents Female. For the dummy variable Personal loan, 0 represents a customer who has not taken a personal loan and 1 represents a customer who has taken a personal loan.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Use logistic regression to classify observations as Personal loan taken (or not taken) using Age, Gender, Work experience, Income (in 1000 $), and Family size as input variables and Personal loan as the output variable. Perform an exhaustive-search best subset selection with the number of best subsets equal to 2. 
a. From the generated set of logistic regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. Increases in which variables increase the chance that a customer has taken the personal loan? Increases in which variables decrease the chance that a customer has not taken the personal loan?
c. Using the default cutoff value of 0.5 for your logistic regression model, what is the overall error rate on the test data?

Accepted Answer

a. Using Mallows' Cp statistic to gui

Question 7

A _____ refers to the number of times that a collection of items occurs together in a transaction data set.

Accepted Answer

A)  test set 
B)  validation count 
C)  support count 
D)  training set
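
As a small illustration of counting how often an item set appears across transactions, consider the sketch below. The transactions and the function name are made up for this example.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in the item set."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

print(support_count(transactions, {"bread", "milk"}))    # 3
print(support_count(transactions, {"diapers", "beer"}))  # 2
```

Dividing the count by the number of transactions gives the proportion of baskets that contain the item set.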

Question 8

An observation classified as part of a group with a characteristic when it actually does not have the characteristic is termed a(n) _____.

Accepted Answer

A)  false negative 
B)  false positive 
C)  residual 
D)  outlier

Question 9

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below.

For the above data, apply k-means clustering using Price ($) as the variable with k = 3. Be sure to Normalize input data, and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Then create one distinct data set for each of the three resulting clusters of price. 
a. For the observations composing the cluster with low home price, apply hierarchical clustering with Ward's method to form three clusters using Acres and Age as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters1, report the characteristics of each cluster.
b. For the observations composing the cluster with medium home price, apply hierarchical clustering with Ward's method to form three clusters using Acres and Age as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters1, report the characteristics of each cluster.
c. Comment on the cluster with high home price.

Accepted Answer

Below is the Pivot table on the data in

Question 10

_____ can be used to partition observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.

Accepted Answer

A)  Single linkage 
B)  Ward's method 
C)  Average group linkage 
D)  Dendrogram
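
The idea of merging clusters so that as little information as possible is lost can be sketched with the error sum of squares (ESS) criterion usually associated with Ward's method. The points and function names below are illustrative assumptions, and the sketch performs only a single merge step.

```python
def ess(cluster):
    """Error sum of squares: squared distance of each point to the cluster centroid."""
    n = len(cluster)
    dims = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / n for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in cluster for d in range(dims))


def merge_cost(a, b):
    """Increase in total ESS caused by merging clusters a and b."""
    return ess(a + b) - ess(a) - ess(b)


def best_merge(clusters):
    """Return (cost, i, j) for the pair whose merge increases ESS the least."""
    return min(
        (merge_cost(clusters[i], clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )


# Hypothetical normalized observations grouped into three current clusters.
clusters = [[(1.0, 1.0)], [(1.0, 2.0)], [(5.0, 5.0), (5.0, 6.0)]]
cost, i, j = best_merge(clusters)
print(cost, i, j)  # the two nearby singletons are the cheapest merge
```

Repeating this step until one cluster remains yields the full hierarchy.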

Question 11

Separate error rates with respect to the false negative and false positive cases are computed to take into account the:

Accepted Answer

A)  asymmetric costs in misclassification.
B)  symmetric weights of these two cases. 
C)  distortions due to outliers. 
D)  effect of sampling error.
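
To see why the two kinds of error are tracked separately, the sketch below computes the Class 1 error rate (false negatives among actual 1s) and the Class 0 error rate (false positives among actual 0s) at several cutoff values. The labels and probabilities are made up, and the function assumes both classes are present.

```python
def class_error_rates(actual, probabilities, cutoff):
    """Return (Class 1 error rate, Class 0 error rate) for a given cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    false_neg = sum(1 for a, y in zip(actual, predicted) if a == 1 and y == 0)
    false_pos = sum(1 for a, y in zip(actual, predicted) if a == 0 and y == 1)
    n_class1 = sum(actual)
    n_class0 = len(actual) - n_class1
    return false_neg / n_class1, false_pos / n_class0


# Hypothetical validation data: actual class and estimated probability of Class 1.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.45, 0.25, 0.6, 0.35, 0.1, 0.05, 0.3]

for cutoff in (0.5, 0.4, 0.3, 0.2):
    class1_err, class0_err = class_error_rates(actual, probabilities, cutoff)
    print(cutoff, round(class1_err, 3), round(class0_err, 3))
```

Lowering the cutoff trades false negatives for false positives, which matters when one kind of mistake is costlier than the other.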

Question 12

_____ methods do not attempt to predict an output value but are rather used to detect patterns and relationships in the data.

Accepted Answer

A)  Supervised learning 
B)  Machine learning 
C)  Artificial intelligence 
D)  Unsupervised learning

Question 13

The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is _____.

Accepted Answer

A)  data visualization 
B)  cluster analysis 
C)  market analysis 
D)  supervised learning

Question 14

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below.

a. Apply hierarchical clustering with 10 clusters using LandValue ($), BuildingValue ($), Acres, Age, and Price ($) as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure, and specify complete linkage as the clustering method. Analyze the resulting clusters by computing the cluster size. It may be helpful to use a PivotTable on the data in the HC_Clusters worksheet generated by XLMiner. You can also visualize the clusters by creating a scatter plot with Acres as the x-variable and Price ($) as the y-variable.
b. Repeat part a using average group linkage as the clustering method. Compare the clusters to the previous method.

Accepted Answer

a. Complete linkage results in clusters

Question 15

A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a _____.

Accepted Answer

A)  dendrogram 
B)  scatter chart 
C)  decile-wise lift chart 
D)  cumulative lift tree

Question 16

Which of the following methods is used by the analyst to decide if a particular variable needs to be retained in the sample during the sampling process?

Accepted Answer

A)  Descriptive statistics and data visualization 
B)  Regression 
C)  Outlier analysis 
D)  Data Testing

Question 17

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below.

Apply k-means clustering with k = 10 using LandValue ($), BuildingValue ($), Acres, Age, and Price ($) as variables. Be sure to Normalize input data, and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. What is the smallest cluster? What is the least dense cluster (as measured by the average distance in the cluster)?

Accepted Answer

We specify # Iterations = 50 and # Start

Question 18

_____ is the process of estimating the value of a categorical outcome variable.

Accepted Answer

A)  Sampling 
B)  Prediction 
C)  Classification 
D)  Validation

Question 19

Which of the following is a commonly used supervised learning method?

Accepted Answer

A)  k-means clustering 
B)  k-nearest neighbors 
C)  hierarchical clustering 
D)  association rule development

Question 20

Which of the following reasons is responsible for the increase in the use of data-mining techniques in business?

Accepted Answer

A)  The lack of methods to electronically track data 
B)  The dearth of information to analyze and interpret 
C)  The ability to electronically warehouse data 
D)  The ability to manually analyze all the data


Exam 6: Data Mining
