Question 1

In which of the following data-mining process steps is the data manipulated to make it suitable for formal modeling?

Accepted Answer

A)  Data sampling 
B)  Data preparation 
C)  Model construction 
D)  Model assessment 

Question 2

The impurity of a group of observations is based on the variance of the outcome value for the observations in the group for _____.

Accepted Answer

A)  regression trees 
B)  time-series plots 
C)  classification trees 
D)  cumulative lift charts
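
The variance-based impurity mentioned in this question can be sketched in a few lines. This is a minimal illustration, assuming population variance as the impurity of one group and a size-weighted average for a split; the prices are made up, and XLMiner's exact formula may differ.

```python
from statistics import pvariance

def variance_impurity(outcomes):
    """Impurity of one group: population variance of its outcome values."""
    return pvariance(outcomes)

def split_impurity(left, right):
    """Size-weighted average impurity of the two groups created by a split."""
    n = len(left) + len(right)
    return (len(left) * variance_impurity(left)
            + len(right) * variance_impurity(right)) / n

# Hypothetical sale prices (in $1000s), made up for illustration only.
prices = [100, 110, 120, 300, 310, 320]
before = variance_impurity(prices)
after = split_impurity(prices[:3], prices[3:])
print(before, after)
```

A split that separates the low-priced homes from the high-priced ones leaves each group far more homogeneous, so the weighted impurity drops sharply.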

Question 3

A sample is representative of the entire data population only if it:

Accepted Answer

A)  includes all the observations as the original data repository. 
B)  can be used to draw the same conclusions as the database. 
C)  is drawn sequentially from the given database. 
D)  is small enough to be manipulated quickly. 

Question 4

An analysis of items frequently co-occurring in transactions is known as _____.

Accepted Answer

A)  market segmentation 
B)  market basket analysis 
C)  regression analysis 
D)  cluster analysis
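
Counting how often items co-occur in transactions can be sketched as follows. The transactions are hypothetical, and "support" is used here in its usual sense: the share of transactions that contain both items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, made up for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support = share of transactions that contain both items of the pair.
support = {pair: count / len(transactions) for pair, count in pair_counts.items()}
top_pair, top_support = max(support.items(), key=lambda kv: kv[1])
print(top_pair, top_support)
```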

Question 5

Observation refers to the:

Accepted Answer

A)  estimated continuous outcome variable. 
B)  set of recorded values of variables associated with a single entity. 
C)  goal of predicting a categorical outcome based on a set of variables. 
D)  mean of all variable values associated with one particular entity. 

Question 6

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using a regression tree. Use Sale Price as the output variable and all the other variables as input variables. In Step 2 of XLMiner's Regression Tree procedure, be sure to Normalize input data, to set the Maximum #splits for input variables to 59, to set the Minimum #records in a terminal node to 1, and specify Using Best prune tree as the scoring option. In Step 3 of XLMiner's Regression Tree procedure, set the maximum number of levels to 7. Generate the Full tree and Best pruned tree. 
a. In terms of number of decision nodes, compare the size of the full tree to the size of the best pruned tree.
b. What is the root mean squared error (RMSE) of the best pruned tree on the validation data and on the test data?
c. What is the average error on the validation data and test data? What does this suggest?
d. By examining the best pruned tree, what are the critical variables in predicting the sale price of a home?

Accepted Answer

a. There are 59 decision nodes in the full t
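
The 50/30/20 partition requested in this exercise can be sketched outside XLMiner as below. This is only an illustration: the function name `partition` and the fixed seed are assumptions, and XLMiner's own random partitioning will generally assign different records to each set.

```python
import random

def partition(records, train=0.5, validation=0.3, seed=0):
    """Shuffle the records, then cut them into training, validation, and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

homes = list(range(120))  # stand-in record IDs, not the exercise data
train_set, valid_set, test_set = partition(homes)
print(len(train_set), len(valid_set), len(test_set))
```

With 120 records, the three sets hold 60, 36, and 24 records, and every record lands in exactly one set.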

Question 7

Single linkage is a measure of calculating dissimilarity between clusters by:

Accepted Answer

A)  considering only the two most dissimilar observations in the two clusters. 
B)  computing the average dissimilarity between every pair of observations between the two clusters. 
C)  considering only the two closest observations in the two clusters. 
D)  considering the distance between the cluster centroids.
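
Single linkage can be sketched as the minimum pairwise distance between two clusters. The points are hypothetical, and Euclidean distance is assumed as the dissimilarity measure.

```python
from math import dist

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest observations, one from each cluster."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical 2-D observations, made up for illustration only.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (9.0, 0.0)]
print(single_linkage(cluster_a, cluster_b))
```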

Question 8

Which of the following is true of Euclidean distances?

Accepted Answer

A)  It is used to measure dissimilarity between categorical variable observations. 
B)  It is not affected by the scale on which variables are measured. 
C)  It increases with the increase in similarity between variable values. 
D)  It is susceptible to distortions from outlier measurements.
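
The effect of measurement scale on Euclidean distance can be seen in a small sketch. The observations are made up, and z-score standardization is assumed as the normalization; other scalings are possible.

```python
from math import dist
from statistics import mean, pstdev

def z_scores(column):
    """Standardize a column to mean 0 and standard deviation 1."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

# Hypothetical observations: age in years and income in dollars.
ages = [25, 30, 60]
incomes = [50_000, 52_000, 51_000]

raw = list(zip(ages, incomes))
scaled = list(zip(z_scores(ages), z_scores(incomes)))

# On the raw data the income column dominates the distance.
raw_01, raw_02 = dist(raw[0], raw[1]), dist(raw[0], raw[2])
# After standardizing, both variables contribute on a comparable scale.
scaled_01, scaled_02 = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])
print(raw_01, raw_02, scaled_01, scaled_02)
```

On the raw values, the 25-year-old appears closer to the 60-year-old than to the 30-year-old because income is measured in much larger units. After standardizing, the ordering reverses, which is why the exercises in this set ask to normalize input data.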

Question 9

Which of the following is true of unsupervised learning?

Accepted Answer

A)  Its objective is to predict the outcome of a variable. 
B)  Its error tolerance is tightly controlled by accuracy measures. 
C)  Qualitative assessments are used to confirm the definite accuracy measures. 
D)  It detects patterns and relationships in the data. 

Question 10

k-means clustering is the process of:

Accepted Answer

A)  agglomerating observations into a series of nested groups based on a measure of similarity. 
B)  organizing observations into one of a number of groups based on a measure of similarity. 
C)  reducing the number of variables to consider in a data-mining approach. 
D)  estimating the value of a continuous outcome variable.
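
A minimal sketch of k-means follows, assuming Euclidean distance and fixed starting centroids chosen by hand. Real implementations usually pick starting centroids randomly; the points here are made up.

```python
from math import dist

def k_means(points, initial_centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute the centroids,
    and repeat until the assignments stop changing."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation changed cluster
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Hypothetical 2-D observations, made up for illustration only.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
labels, centers = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```

The number of clusters is fixed in advance by the number of starting centroids, and each observation ends up in exactly one group.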

Question 11

Average linkage is a measure of calculating dissimilarity between clusters by:

Accepted Answer

A)  considering only the two most dissimilar observations in the two clusters. 
B)  computing the average dissimilarity between every pair of observations between the two clusters. 
C)  considering only the two closest observations in the two clusters. 
D)  considering the distance between the cluster centroids.
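
Average linkage can be sketched as the mean of all pairwise distances between two clusters. The points are hypothetical, and Euclidean distance is assumed as the dissimilarity measure.

```python
from math import dist
from statistics import mean

def average_linkage(cluster_a, cluster_b):
    """Mean distance over every pair with one observation from each cluster."""
    return mean(dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical 2-D observations, made up for illustration only.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (9.0, 0.0)]
print(average_linkage(cluster_a, cluster_b))
```

The four pairwise distances here are 4, 9, 3, and 8, so the average linkage is 6.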

Question 12

_____ is a generalization of linear regression for predicting a categorical outcome variable.

Accepted Answer

A)  Multiple linear regression 
B)  Logistic regression 
C)  Discriminant analysis 
D)  Cluster analysis
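
How a linear predictor is turned into a class prediction in logistic regression can be sketched as below. The intercept and slope are hypothetical values chosen for illustration, not fitted to any data, and the cutoff convention (class 1 when the probability is at least the cutoff) is an assumption.

```python
from math import exp

def logistic(z):
    """Map a linear predictor z to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def predict_class(x, intercept, slope, cutoff=0.5):
    """Return (predicted class, probability of class 1) for one input value."""
    p = logistic(intercept + slope * x)
    return (1 if p >= cutoff else 0), p

# Hypothetical coefficients, with x standing for income in $1000s.
low = predict_class(20, intercept=-4.0, slope=0.1)
high = predict_class(60, intercept=-4.0, slope=0.1)
print(low, high)
```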

Question 13

A bank is interested in identifying different attributes of its customers, and below is the sample data of 150 customers. In the data table, for the dummy variable Gender, 0 represents Male and 1 represents Female. For the dummy variable Personal loan, 0 represents a customer who has not taken a personal loan and 1 represents a customer who has taken a personal loan.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Fit a classification tree using Age, Gender, Work experience, Income (in 1000 $), and Family size as input variables and Personal loan as the output variable. In Step 2 of XLMiner's Classification Tree procedure, be sure to Normalize input data and to set the Minimum #records in a terminal node to 1. In Step 3 of XLMiner's Classification Tree procedure, set the maximum number of levels to seven. Generate the Full tree, Best pruned tree, and Minimum error tree. Generate lift charts for both the validation data and the test data. 
a. Interpret the set of rules implied by the best pruned tree that characterize the customers who have taken a personal loan.
b. For the default cutoff value of 0.5, what are the overall error rate, Class 1 error rate, and Class 0 error rate of the best pruned tree on the test data? Interpret these respective measures. 
c. Examine the decile-wise lift chart for the best pruned tree on the test data. What is the first decile lift? Interpret this value.

Accepted Answer

b. For the default cutoff value of 0.5
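
The three error rates asked for in part b can be sketched as follows. The actual classes and predicted probabilities are made up and are not the exercise's results; the rule "predict class 1 when the probability is at least the cutoff" is an assumption about the convention.

```python
def error_rates(actual, probabilities, cutoff=0.5):
    """Overall, Class 1, and Class 0 error rates at a given cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    errors = [a != p for a, p in zip(actual, predicted)]
    class1 = [e for e, a in zip(errors, actual) if a == 1]
    class0 = [e for e, a in zip(errors, actual) if a == 0]
    return {
        "overall": sum(errors) / len(errors),
        "class_1": sum(class1) / len(class1),  # share of actual 1s predicted as 0
        "class_0": sum(class0) / len(class0),  # share of actual 0s predicted as 1
    }

# Hypothetical test-set records, made up for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.6, 0.3, 0.2, 0.1, 0.7, 0.4, 0.1, 0.2, 0.3]
rates = error_rates(actual, probs)
print(rates)
```

Here one of the three actual borrowers is missed and one of the seven non-borrowers is flagged, so the class-specific error rates differ noticeably from the overall rate of 0.2.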

Question 14

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using k-nearest neighbors with up to k = 10. Use Sale Price as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a Detailed Scoring report for all three sets of data. 
a. What value of k minimizes the root mean squared error (RMSE) on the validation data?
b. What is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

a. A value of k = 2 minimizes the RMSE o
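
Choosing k by validation RMSE, as in part a, can be sketched with a tiny k-nearest-neighbors predictor. The records below are toy values unrelated to the housing data, the function names are illustrative, and ties in distance are broken by the order of the training records, which is an arbitrary choice.

```python
from math import dist, sqrt

def knn_predict(train, x, k):
    """Average the outcomes of the k training records nearest to x."""
    nearest = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train, data, k):
    """Root mean squared error of the k-NN predictions on a data set."""
    squared = [(y - knn_predict(train, x, k)) ** 2 for x, y in data]
    return sqrt(sum(squared) / len(data))

# Toy (inputs, outcome) records, made up for illustration only.
train = [((1.0,), 10.0), ((2.0,), 12.0), ((3.0,), 14.0),
         ((8.0,), 30.0), ((9.0,), 32.0)]
valid = [((2.5,), 13.0), ((8.5,), 31.0)]

# Score every candidate k on the validation set and keep the best one.
best_k = min(range(1, 5), key=lambda k: rmse(train, valid, k))
print(best_k, rmse(train, valid, best_k))
```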

Question 15

A cluster's _____ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.

Accepted Answer

A)  dimension 
B)  affordability 
C)  durability 
D)  span 

Question 16

The test set is the data set used to:

Accepted Answer

A)  build the data mining model. 
B)  estimate accuracy of candidate models on unseen data. 
C)  estimate accuracy of final model on unseen data. 
D)  show counts of actual versus predicted class values. 

Question 17

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below.

Apply hierarchical clustering with 10 clusters using LandValue ($), BuildingValue ($), Acres, Age, and Price ($) as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Use Ward's method as the clustering method. 
a. Use a PivotTable on the data in the HC_Clusters1 worksheet to compute the cluster centers for the clusters in the hierarchical clustering.
b. Identify the cluster with the largest average price. Using all the variables, how would you characterize this cluster?
c. Identify the smallest cluster.

Accepted Answer

a. Below is the PivotTable obtained on t
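
What the PivotTable computes in part a, the per-cluster average of each variable, can be sketched as below. The rows are made up and use only two variables for brevity; they are not the exercise's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (cluster ID, acres, price in $) rows, made up for illustration.
rows = [
    (1, 0.5, 200_000),
    (1, 0.7, 220_000),
    (2, 2.0, 450_000),
    (2, 2.4, 470_000),
    (2, 2.2, 460_000),
]

groups = defaultdict(list)
for cluster_id, *values in rows:
    groups[cluster_id].append(values)

# Cluster center = per-variable mean over the cluster's members.
centers = {cid: [mean(col) for col in zip(*members)]
           for cid, members in groups.items()}
sizes = {cid: len(members) for cid, members in groups.items()}

largest_price = max(centers, key=lambda cid: centers[cid][1])
smallest = min(sizes, key=sizes.get)
print(centers, largest_price, smallest)
```

The same grouped means answer parts b and c: the cluster with the highest average price and the cluster with the fewest members.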

Question 18

The endpoint of a k-means clustering algorithm occurs when:

Accepted Answer

A)  Euclidean distance between clusters is minimum. 
B)  Euclidean distance between observations in a cluster is maximum. 
C)  no further changes are observed in cluster structure and number. 
D)  all of the observations are encompassed within a single large cluster with mean k. 

Question 19

_____ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters.

Accepted Answer

A)  Single linkage 
B)  Complete linkage 
C)  Average linkage 
D)  Average group linkage
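
The measure described in this question, the distance between the two most dissimilar observations, can be sketched as a maximum over all pairs. The points are hypothetical, and Euclidean distance is assumed as the dissimilarity measure.

```python
from math import dist

def farthest_pair_distance(cluster_a, cluster_b):
    """Distance between the two most dissimilar observations, one from each cluster."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical 2-D observations, made up for illustration only.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (9.0, 0.0)]
print(farthest_pair_distance(cluster_a, cluster_b))
```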

Question 20

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke is selected and the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a Detailed Scoring report for all three sets of data. 
a. What value of k minimizes the root mean squared error (RMSE) on the validation data?
b. What is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

a. A value of k = 10 minimizes the RMSE
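
The two accuracy measures in parts b and c can be sketched as follows. The values are made up and are not the exercise's results; the error is taken as actual minus predicted, which is an assumption about the sign convention.

```python
from math import sqrt

def average_error(actual, predicted):
    """Mean of (actual - predicted); its sign indicates the direction of bias."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error; measures the typical size of the errors."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical risk values, made up for illustration only.
actual = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 33.0, 41.0, 52.0]
print(average_error(actual, predicted), rmse(actual, predicted))
```

Under this sign convention, a negative average error means the model over-predicts on average, while an average error near zero with a sizable RMSE means the errors are large but cancel out.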

Exam 6: Data Mining
