Question 1

Jaccard's coefficient is different from the matching coefficient in that the:
A) former measures overlap while the latter measures dissimilarity.
B) former does not count matching zero entries while the latter does.
C) former deals with categorical variable while the latter deals with continuous variables.
D) former is affected by the scale used to measure variables while the latter is not.

Accepted Answer

former does not count matching zero entries while the latter does.
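
The distinction can be sketched in a few lines of Python. This is a minimal illustration for binary (0/1) variables only; the function names and the two observations are hypothetical and not taken from the question's source.

```python
def matching_coefficient(u, v):
    """Share of variables on which two binary observations agree.

    Both 1-1 and 0-0 agreements count as matches.
    """
    agree = sum(1 for a, b in zip(u, v) if a == b)
    return agree / len(u)


def jaccard_coefficient(u, v):
    """Share of agreement after discarding variables where both values are 0."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    considered = len(u) - both_zero
    return both_one / considered if considered else 0.0


# Hypothetical observations on five binary variables
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]

print(matching_coefficient(u, v))  # 4 agreements out of 5 variables -> 0.8
print(jaccard_coefficient(u, v))   # 1 shared one out of 2 non-(0,0) variables -> 0.5
```

The three 0-0 agreements raise the matching coefficient but are ignored by Jaccard's coefficient, which is why the two values differ.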

Question 2

_____ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters.
A) Single linkage
B) Complete linkage
C) Average linkage
D) Average group linkage

Accepted Answer

Complete linkage
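
For comparison, the cluster-dissimilarity measures that appear in the answer options can be sketched as follows. This is an illustrative sketch using Euclidean distance; the function names and the two small clusters are assumptions made for the example.

```python
from itertools import product
from math import dist


def pairwise_distances(c1, c2):
    """Euclidean distance for every pair with one observation from each cluster."""
    return [dist(p, q) for p, q in product(c1, c2)]


def single_linkage(c1, c2):
    """Distance between the two closest observations."""
    return min(pairwise_distances(c1, c2))


def complete_linkage(c1, c2):
    """Distance between the two most dissimilar observations."""
    return max(pairwise_distances(c1, c2))


def average_linkage(c1, c2):
    """Average distance over every between-cluster pair."""
    d = pairwise_distances(c1, c2)
    return sum(d) / len(d)


def centroid_distance(c1, c2):
    """Distance between the cluster centroids (coordinate-wise means)."""
    centroid = lambda c: tuple(sum(x) / len(c) for x in zip(*c))
    return dist(centroid(c1), centroid(c2))


# Hypothetical clusters of two-dimensional observations
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(3, 0), (3, 2)]

print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
print(centroid_distance(cluster_a, cluster_b))
```

By construction, the single-linkage value can never exceed the average-linkage value, which in turn can never exceed the complete-linkage value.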

Question 3

Which of the following reasons is responsible for the increase in the use of data-mining techniques in business?
A) The lack of methods to electronically track data
B) The dearth of information to analyze and interpret
C) The ability to electronically warehouse data
D) The ability to manually analyze all the data

Accepted Answer

The ability to electronically warehouse data

Question 4

A sample is representative of the entire data population only if it:
A) includes all the observations as the original data repository.
B) can be used to draw the same conclusions as the database.
C) is drawn sequentially from the given database.
D) is small enough to be manipulated quickly.

Accepted Answer

The answer of A sample is representative of the entire...

Question 5

The simplest measure of similarity between observations consisting solely of categorical variables is given by _____.
A) the Euclidean distance
B) the Ward's distance
C) matching coefficient
D) Jaccard's coefficient

Accepted Answer

The answer of The simplest measure of similarity between observations...

Question 6

Single linkage is a measure of calculating dissimilarity between clusters by:
A) considering only the two most dissimilar observations in the two clusters.
B) computing the average dissimilarity between every pair of observations between the two clusters.
C) considering only the two closest observations in the two clusters.
D) considering the distance between the cluster centroids.

Accepted Answer

The answer of Single linkage is a measure of calculating...

Question 7

The process of reducing the number of variables to consider in a data-mining approach without losing any crucial information is termed as _____.
A) dimension reduction
B) data sampling
C) data reduction
D) aggregation

Accepted Answer

The answer of The process of reducing the number of...

Question 8

Which of the following methods is used by the analyst to decide if a particular variable needs to be retained in the sample during the sampling process?
A) Descriptive statistics and data visualization
B) Regression
C) Outlier analysis
D) Data Testing

Accepted Answer

The answer of Which of the following methods is used...

Question 9

In which of the following data-mining process steps is the data manipulated to make it suitable for formal modeling?
A) Data sampling
B) Data preparation
C) Model construction
D) Model assessment

Accepted Answer

The answer of In which of the following data-mining process...

Question 10

Which of the following is true of Euclidean distances?
A) It is used to measure dissimilarity between categorical variable observations.
B) It is not affected by the scale on which variables are measured.
C) It increases with the increase in similarity between variable values.
D) It is susceptible to distortions from outlier measurements.

Accepted Answer

The answer of Which of the following is true of...
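
As a rough illustration of how measurement scale feeds into Euclidean distance, the sketch below computes the distance between the same two hypothetical customers twice: once with income recorded in dollars and once in thousands of dollars. The variables and values are made up for the example.

```python
from math import dist

# Hypothetical observations: (age in years, income)
a_dollars, b_dollars = (30, 50_000), (40, 51_000)
a_thousands, b_thousands = (30, 50), (40, 51)

# With income in dollars, the income gap dominates the distance.
d_dollars = dist(a_dollars, b_dollars)

# With income in thousands of dollars, age and income contribute on a similar scale.
d_thousands = dist(a_thousands, b_thousands)

print(d_dollars, d_thousands)
```

The two results differ by roughly a factor of 100 even though the customers are the same, which is presumably why the XLMiner exercises in this set ask to Normalize input data before clustering or classifying.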

Question 11

_____ is a category of data-mining techniques in which an algorithm learns how to predict or classify an outcome variable of interest.
A) Supervised learning
B) Unsupervised learning
C) Dimension reduction
D) Data sampling

Accepted Answer

The answer of _____ is a category of data-mining techniques...

Question 12

The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is _____.
A) data visualization
B) cluster analysis
C) market analysis
D) supervised learning

Accepted Answer

The answer of The data preparation technique used in market...

Question 13

Which of the following is true of unsupervised learning?
A) Its objective is to predict the outcome of a variable.
B) Its error tolerance is tightly controlled by accuracy measures.
C) Qualitative assessments are used to confirm the definite accuracy measures.
D) It detects patterns and relationships in the data.

Accepted Answer

The answer of Which of the following is true of...

Question 14

Which of the following is true of hierarchical clustering?
A) All observations are put in a mega-cluster to begin with.
B) Each of the large clusters is broken down iteratively.
C) It is a bottom-up approach to clustering.
D) At the end of the process, observations in the same cluster have maximum distance.

Accepted Answer

The answer of Which of the following is true of...

Question 15

_____ is the process of estimating the value of a categorical outcome variable.
A) Sampling
B) Prediction
C) Classification
D) Validation

Accepted Answer

The answer of _____ is the process of estimating the...

Question 16

k-means clustering is the process of:
A) agglomerating observations into a series of nested groups based on a measure of similarity.
B) organizing observations into one of a number of groups based on a measure of similarity.
C) reducing the number of variables to consider in a data-mining approach.
D) estimating the value of a continuous outcome variable.

Accepted Answer

The answer of k-means clustering is the process of: A) agglomerating...

Question 17

Observation refers to the:
A) estimated continuous outcome variable.
B) set of recorded values of variables associated with a single entity.
C) goal of predicting a categorical outcome based on a set of variables.
D) mean of all variable values associated with one particular entity.

Accepted Answer

The answer of Observation refers to the: A) estimated continuous outcome...

Question 18

The estimation of the value for a continuous outcome is done during _____.
A) classification
B) prediction
C) data preparation
D) data sampling

Accepted Answer

The answer of The estimation of the value for a...

Question 19

_____ methods do not attempt to predict an output value but are rather used to detect patterns and relationships in the data.
A) Supervised learning
B) Machine learning
C) Artificial intelligence
D) Unsupervised learning

Accepted Answer

The answer of _____methods do not attempt to predict an...

Question 20

Average linkage is a measure of calculating dissimilarity between clusters by:
A) considering only the two most dissimilar observations in the two clusters.
B) computing the average dissimilarity between every pair of observations between the two clusters.
C) considering only the two closest observations in the two clusters.
D) considering the distance between the cluster centroids.

Accepted Answer

The answer of Average linkage is a measure of calculating...

Question 21

A _____ refers to the number of times that a collection of items occurs together in a transaction data set.
A) test set
B) validation count
C) support count
D) training set

Accepted Answer

The answer of A _____ refers to the number of...

Question 22

A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a _____.
A) dendrogram
B) scatter chart
C) decile-wise lift chart
D) cumulative lift tree

Accepted Answer

The answer of A tree diagram used to illustrate the...

Question 23

The lift ratio of an association rule with a confidence value of 0.43 and in which the consequent occurs in 6 out of 10 cases is:
A) 1.40
B) 0.54
C) 1.00
D) 0.72

Accepted Answer

The answer of The lift ratio of an association rule...
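
The arithmetic behind this question can be worked through with a short sketch. It assumes the usual definition of the lift ratio as the rule's confidence divided by the proportion of transactions that contain the consequent; the function name is hypothetical.

```python
def lift_ratio(confidence, consequent_count, total_transactions):
    """Confidence divided by the share of transactions containing the consequent."""
    consequent_share = consequent_count / total_transactions
    return confidence / consequent_share


# Values taken from the question: confidence 0.43, consequent in 6 of 10 cases
print(round(lift_ratio(0.43, 6, 10), 2))  # 0.43 / 0.6, rounded to two decimals
```

A lift ratio below 1 suggests that the antecedent makes the consequent less likely than it is in the data overall.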

Question 24

The endpoint of a k-means clustering algorithm occurs when:
A) Euclidean distance between clusters is minimum.
B) Euclidean distance between observations in a cluster is maximum.
C) no further changes are observed in cluster structure and number.
D) all of the observations are encompassed within a single large cluster with mean k.

Accepted Answer

The answer of The endpoint of a k-means clustering algorithm...
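
A bare-bones, one-variable version of the k-means loop may help make the stopping condition concrete. This is a sketch under simplifying assumptions (a single numeric variable, fixed starting centroids, no normalization or random restarts); it is not the XLMiner procedure, and the data are hypothetical.

```python
def k_means_1d(values, initial_centroids, max_iter=50):
    """Cluster one-dimensional values; stop when assignments no longer change."""
    centroids = list(initial_centroids)  # copy so the caller's list is untouched
    assignment = None
    for _ in range(max_iter):
        # Assign each value to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values
        ]
        if new_assignment == assignment:
            break  # cluster membership is stable, so the algorithm ends
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return assignment, centroids


# Hypothetical wait times (minutes) and deliberately poor starting centroids
wait_times = [1, 2, 3, 10, 11, 12]
labels, centers = k_means_1d(wait_times, [1.0, 2.0])
print(labels, centers)
```

With these inputs the assignments settle after a couple of passes, at which point a further pass changes nothing and the loop exits.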

Question 25

A cluster's _____ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.

A) dimension
B) affordability
C) durability
D) span

Accepted Answer

The answer of A cluster's _____ can be measured by...

Question 26

Which of the following is a commonly used supervised learning method?
A) k-means clustering
B) k-nearest neighbors
C) hierarchical clustering
D) association rule development

Accepted Answer

The answer of Which of the following is a commonly...

Question 27

Test set is the data set used to:
A) build the data mining model.
B) estimate accuracy of candidate models on unseen data.
C) estimate accuracy of final model on unseen data.
D) show counts of actual versus predicted class values.

Accepted Answer

The answer of Test set is the data set used...

Question 28

_____ can be used to partition observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.
A) Single linkage
B) Ward's method
C) Average group linkage
D) Dendrogram

Accepted Answer

The answer of _____ can be used to partition observations...

Question 29

_____ refers to the scenario in which the analyst builds a model that does a great job of explaining the sample of data on which it is based but fails to accurately predict outside the sample data.
A) Underfitting
B) Overfitting
C) Oversampling
D) Undersampling

Accepted Answer

The answer of _____ refers to the scenario in which...

Question 30

An analysis of items frequently co-occurring in transactions is known as _____.
A) market segmentation
B) market basket analysis
C) regression analysis
D) cluster analysis

Accepted Answer

The answer of An analysis of items frequently co-occurring in...

Question 31

Separate error rates with respect to the false negative and false positive cases are computed to take into account the:
A) asymmetric costs in misclassification.
B) symmetric weights of these two cases.
C) distortions due to outliers.
D) effect of sampling error.

Accepted Answer

The answer of Separate error rates with respect to the...

Question 32

The impurity of a group of observations is based on the variance of the outcome value for the observations in the group for _____.
A) regression trees
B) time-series plots
C) classification trees
D) cumulative lift charts

Accepted Answer

The answer of The impurity of a group of observations...

Question 33

One minus the overall error rate is often referred to as the _____ of the model.
A) sensitivity
B) accuracy
C) specificity
D) cutoff value

Accepted Answer

The answer of One minus the overall error rate is...
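
The classification measures named in these questions can be computed from actual and predicted class labels as in the sketch below. It assumes class 1 is the "positive" class; the function name and the eight labels are hypothetical.

```python
def classification_metrics(actual, predicted):
    """Overall error rate, accuracy, sensitivity, and specificity for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    error_rate = (fp + fn) / len(pairs)
    return {
        "error_rate": error_rate,
        "accuracy": 1 - error_rate,
        "sensitivity": tp / (tp + fn),  # share of actual class 1 identified
        "specificity": tn / (tn + fp),  # share of actual class 0 identified
    }


# Hypothetical labels: 3 actual positives, 5 actual negatives
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

Here one false negative and one false positive give an overall error rate of 2/8, while the class-specific rates differ because the two classes have different sizes.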

Question 34

In the k-nearest neighbors method, when the value of k is set to 1:
A) the classification or prediction of a new observation is based solely on the single most similar observation from the training set.
B) the new observation's class is naïvely assigned to the most common class in the training set.
C) the new observation's prediction is used to estimate the anticipated error rate on future data over the entire training set.
D) the classification or prediction of a new observation is subject to the smallest possible classification error.

Accepted Answer

The answer of In the k-nearest neighbors method, when the...
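
To see how the choice of k changes a k-nearest neighbors classification, consider this minimal sketch. It uses Euclidean distance and a simple majority vote, omits input normalization, and relies on a hypothetical four-row training set.

```python
from collections import Counter
from math import dist


def knn_classify(training_set, new_point, k):
    """Classify new_point by majority vote among its k nearest training rows."""
    neighbors = sorted(training_set, key=lambda row: dist(row[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (input values, class label)
training_set = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((4.0, 4.0), "B"),
    ((2.0, 2.0), "B"),
]
new_point = (1.9, 1.9)

print(knn_classify(training_set, new_point, k=1))  # decided by the single closest row
print(knn_classify(training_set, new_point, k=3))  # decided by a vote of three rows
```

With k = 1 the label comes entirely from the one closest training observation, so a single unusual neighbor can flip the result; with k = 3 the two nearby "A" rows outvote it.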

Question 35

_____ is a measure of the heterogeneity of observations in a classification tree.
A) Sensitivity
B) Specificity
C) Accuracy
D) Impurity

Accepted Answer

The answer of _____ is a measure of the heterogeneity...

Question 36

_____ is a generalization of linear regression for predicting a categorical outcome variable.
A) Multiple linear regression
B) Logistic regression
C) Discriminant analysis
D) Cluster analysis

Accepted Answer

The answer of _____ is a generalization of linear regression...

Question 37

In which of the following scenarios would it be appropriate to use hierarchical clustering?
A) When the number of observations in the dataset is relatively high
B) When it is not necessary to know the nesting of clusters
C) When the number of clusters is known beforehand
D) When binary or ordinal data needs to be clustered

Accepted Answer

The answer of In which of the following scenarios would...

Question 38

_____ is a measure of calculating dissimilarity between clusters by considering the distance between the cluster centroids.
A) Single linkage
B) Complete linkage
C) Average linkage
D) Average group linkage

Accepted Answer

The answer of _____ is a measure of calculating dissimilarity...

Question 39

An observation classified as part of a group with a characteristic when it actually does not have the characteristic is termed as a(n) _____.
A) false negative
B) false positive
C) residual
D) outlier

Accepted Answer

The answer of An observation classified as part of a...

Question 40

A _____ classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules.
A) regression tree
B) scatter chart
C) classification tree
D) classification confusion matrix

Accepted Answer

The answer of A _____ classifies a categorical outcome variable...

Question 41

A bank wants to better understand the details of customers who are likely to default on a loan. To analyze this, the data from a random sample of 200 customers are given below:

In XLMiner's Partition with Oversampling procedure, partition the data so there is 50 percent successes (Loan default) in the training set and 40 percent of the validation data is taken away as test data. Classify the data using k-nearest neighbors with up to k = 10. Use Loan default as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-nearest neighbors Classification procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate lift charts for both the validation data and test data.
a. For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the validation data?
b. What is the overall error rate on the test data? Interpret this measure.
c. What are the Class 1 error rate and the Class 0 error rate on the test data?
d. Compute and interpret the sensitivity and specificity for the test data.
e. Examine the decile-wise lift chart on the test data. What is the first decile lift on the test data? Interpret this value.

Accepted Answer

The answer of A bank wants to understand better the...

Question 42

A retailer is interested in analyzing the shopping trends of men with respect to the following items: shirts, pants, jeans, t-shirts, shoes, and belts. A sample of 50 male customers is selected and the data are given below.

a. Using a minimum support of 20 transactions and a minimum confidence of 50 percent, use XLMiner to generate a list of association rules. How many rules satisfy this criterion?
b. Using the list of rules from part a, consider the rule with the largest lift ratio. Interpret what this rule is saying about the relationship between the antecedent item set and consequent item set.
c. Interpret the support count of the item set composed of all the items involved in the rule with the largest lift ratio.
d. Interpret the confidence of the rule with the largest lift ratio.
e. Interpret the lift ratio of the rule with the largest lift ratio.

Accepted Answer

The answer of A retailer is interested in analyzing the...

Question 43

As part of the quarterly reviews, the manager of a retail store analyzes the quality of customer service based on the periodic customer satisfaction ratings (on a scale of 1 to 10 with 1 = Poor and 10 = Excellent). To understand the level of service quality, which includes the waiting times of the customers in the checkout section, he collected the following data on 100 customers who visited the store.

For the above data, apply k-means clustering using Wait time (min) as the variable with k = 3. Be sure to Normalize input data, and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Then create one distinct data set for each of the three resulting clusters for waiting time.
a. For the observations composing the cluster which has the low waiting time, apply hierarchical clustering with Ward's method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.
b. For the observations composing the cluster which has the medium waiting time, apply hierarchical clustering with Ward's method to form three clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.
c. For the observations composing the cluster which has the high waiting time, apply hierarchical clustering with Ward's method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.

Accepted Answer

The answer of As part of the quarterly reviews, the...

Question 44

A bank is interested in identifying different attributes of its customers; sample data on 150 customers are given below. In the data table, the dummy variable Gender is 0 for Male and 1 for Female, and the dummy variable Personal loan is 0 for a customer who has not taken a personal loan and 1 for a customer who has taken a personal loan.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Use logistic regression to classify observations as Personal loan taken (or not taken) using Age, Gender, Work experience, Income (in 1000 $), and Family size as input variables and Personal loan as the output variable. Perform an exhaustive-search best subset selection with the number of best subsets equal to 2.
a. From the generated set of logistic regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. Increases in which variables increase the chance that a customer has taken the personal loan? Increases in which variables decrease the chance that a customer has not taken the personal loan?
c. Using the default cutoff value of 0.5 for your logistic regression model, what is the overall error rate on the test data?

Accepted Answer

The answer of A bank is interested in identifying different...

Question 45

A bank wants to better understand the details of customers who are likely to default on a loan. To analyze this, the data from a random sample of 200 customers are given below:

In XLMiner's Partition with Oversampling procedure, partition the data so there is 50 percent successes (Loan default) in the training set and 40 percent of the validation data is taken away as test data. Classify the data using k-nearest neighbors with up to k = 10. Use Loan default as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-nearest neighbors Classification procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate lift charts for both the validation data and test data.
a. For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the validation data?
b. What is the overall error rate on the test data? Interpret this measure.
c. What are the Class 1 error rate and the Class 0 error rate on the test data?
d. Compute and interpret the sensitivity and specificity for the test data.
e. Examine the decile-wise lift chart on the test data. What is the first decile lift on the test data? Interpret this value.

Accepted Answer

The answer of A bank wants to understand better the...

Question 46

A bank wants to better understand the details of customers who are likely to default on a loan. To analyze this, the data from a random sample of 200 customers are given below:

In XLMiner's Partition with Oversampling procedure, partition the data so there is 50 percent successes (Loan default) in the training set and 40 percent of the validation data is taken away as test data. Classify the data using k-nearest neighbors with up to k = 10. Use Loan default as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-nearest neighbors Classification procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate lift charts for both the validation data and test data.
a. For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the validation data?
b. What is the overall error rate on the test data? Interpret this measure.
c. What are the Class 1 error rate and the Class 0 error rate on the test data?
d. Compute and interpret the sensitivity and specificity for the test data.
e. Examine the decile-wise lift chart on the test data. What is the first decile lift on the test data? Interpret this value.

Accepted Answer

The answer of A bank wants to understand better the...

Question 47

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

a. Apply hierarchical clustering with 10 clusters using LandValue ($), BuildingValue ($), Acres, Age, and Price ($) as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure, and specify complete linkage as the clustering method. Analyze the resulting clusters by computing the cluster size. It may be helpful to use a PivotTable on the data in the HC_Clusters worksheet generated by XLMiner. You can also visualize the clusters by creating a scatter plot with Acres as the x-variable and Price ($) as the y-variable.
b. Repeat part a using average group linkage as the clustering method. Compare the clusters to the previous method.

Accepted Answer

The answer of To examine the local housing market in...

Question 48

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using multiple linear regression. Use Sale Price as the output variable and all the other variables as input variables. To generate a pool of models to consider, execute the following steps. In Step 2 of XLMiner's Multiple Linear Regression procedure, click the Best subset option. In the Best Subset dialog box, check the box next to Perform best subset selection, enter 6 in the box next to Maximum size of best subset:, enter 1 in the box next to Number of best subsets:, and check the box next to Exhaustive search.
a. From the generated set of multiple linear regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. For your model, what is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

The answer of To examine the local housing market in...
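
The two error measures asked about in parts b and c can be sketched as follows. The sketch assumes the error for each observation is actual minus predicted; the sale prices (in thousands of dollars) are hypothetical and are not output from the exercise's data.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses, ignores their sign."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def average_error(actual, predicted):
    """Mean of signed errors: positive and negative misses can cancel out."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical sale prices in thousands of dollars
actual_prices = [200, 250, 300, 350]
predicted_prices = [210, 240, 310, 340]

print(rmse(actual_prices, predicted_prices))
print(average_error(actual_prices, predicted_prices))
```

In this made-up case every prediction misses by 10, yet the average error is zero because over- and under-predictions offset each other. An average error near zero therefore points to a lack of systematic bias rather than to accurate individual predictions.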

Question 49

As part of the quarterly reviews, the manager of a retail store analyzes the quality of customer service based on the periodic customer satisfaction ratings (on a scale of 1 to 10 with 1 = Poor and 10 = Excellent). To understand the level of service quality, which includes the waiting times of the customers in the checkout section, he collected the following data on 100 customers who visited the store.

a. Apply hierarchical clustering with 5 clusters using Wait Time (min) and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure, and specify single linkage as the clustering method. Analyze the resulting clusters by computing the cluster size. It may be helpful to use a PivotTable on the data in the HC_Clusters worksheet generated by XLMiner to compute descriptive measures of the Wait Time and Customer Satisfaction Rating variables in each cluster. You can also visualize the clusters by creating a scatter plot with Wait Time (min) as the x-variable and Customer Satisfaction Rating as the y-variable.
b. Repeat part a using average linkage as the clustering method. Compare the clusters to the previous method.

Accepted Answer

The answer of As part of the quarterly reviews, the...

Question 50

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using multiple linear regression. Use Sale Price as the output variable and all the other variables as input variables. To generate a pool of models to consider, execute the following steps. In Step 2 of XLMiner's Multiple Linear Regression procedure, click the Best subset option. In the Best Subset dialog box, check the box next to Perform best subset selection, enter 6 in the box next to Maximum size of best subset:, enter 1 in the box next to Number of best subsets:, and check the box next to Exhaustive search.
a. From the generated set of multiple linear regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. For your model, what is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

The answer of To examine the local housing market in...

Question 51

As part of the quarterly reviews, the manager of a retail store analyzes the quality of customer service based on the periodic customer satisfaction ratings (on a scale of 1 to 10 with 1 = Poor and 10 = Excellent). To understand the level of service quality, which includes the waiting times of the customers in the checkout section, he collected the following data on 100 customers who visited the store.

For the above data, apply k-means clustering using Wait time (min) as the variable with k = 3. Be sure to Normalize input data, and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Then create one distinct data set for each of the three resulting clusters for waiting time.
a. For the observations composing the cluster which has the low waiting time, apply hierarchical clustering with Ward's method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.
b. For the observations composing the cluster which has the medium waiting time, apply hierarchical clustering with Ward's method to form three clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.
c. For the observations composing the cluster which has the high waiting time, apply hierarchical clustering with Ward's method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters, report the characteristics of each cluster.

Accepted Answer

The answer of As part of the quarterly reviews, the...

Question 52

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using multiple linear regression. Use Sale Price as the output variable and all the other variables as input variables. To generate a pool of models to consider, execute the following steps. In Step 2 of XLMiner's Multiple Linear Regression procedure, click the Best subset option. In the Best Subset dialog box, check the box next to Perform best subset selection, enter 6 in the box next to Maximum size of best subset:, enter 1 in the box next to Number of best subsets:, and check the box next to Exhaustive search.
a. From the generated set of multiple linear regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. For your model, what is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

The answer of To examine the local housing market in...

Question 53

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using multiple linear regression. Use Sale Price as the output variable and all the other variables as input variables. To generate a pool of models to consider, execute the following steps. In Step 2 of XLMiner's Multiple Linear Regression procedure, click the Best subset option. In the Best Subset dialog box, check the box next to Perform best subset selection, enter 6 in the box next to Maximum size of best subset:, enter 1 in the box next to Number of best subsets:, and check the box next to Exhaustive search.
a. From the generated set of multiple linear regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. For your model, what is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

The answer of To examine the local housing market in...

Question 54

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using multiple linear regression. Use Sale Price as the output variable and all the other variables as input variables. To generate a pool of models to consider, execute the following steps. In Step 2 of XLMiner's Multiple Linear Regression procedure, click the Best subset option. In the Best Subset dialog box, check the box next to Perform best subset selection, enter 6 in the box next to Maximum size of best subset:, enter 1 in the box next to Number of best subsets:, and check the box next to Exhaustive search.
a. From the generated set of multiple linear regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. For your model, what is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

The answer of To examine the local housing market in...
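
Parts b and c ask for the RMSE and the average error. A minimal sketch of both measures follows, assuming the error for each record is defined as actual minus predicted; the sale prices below are made-up numbers used only to show the arithmetic.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def average_error(actual, predicted):
    """Mean of (actual - predicted): signed, so over- and underpredictions cancel."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices (in $1,000s) and model predictions
actual = [200.0, 250.0, 310.0, 180.0]
predicted = [210.0, 240.0, 300.0, 200.0]

print(round(rmse(actual, predicted), 4))  # 13.2288
print(average_error(actual, predicted))   # -2.5
```

Under this sign convention, an average error near zero suggests little systematic bias, a negative value suggests the model tends to overpredict, and a positive value suggests it tends to underpredict. RMSE on the validation and test sets that is much larger than on the training set points to overfitting.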

Question 55

A bank is interested in identifying different attributes of its customers. Sample data on 150 customers are given below. In the data table, for the dummy variable Gender, 0 represents Male and 1 represents Female. For the dummy variable Personal loan, 0 represents a customer who has not taken a personal loan and 1 represents a customer who has taken a personal loan.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Use logistic regression to classify observations as Personal loan taken (or not taken) using Age, Gender, Work experience, Income (in 1000 $), and Family size as input variables and Personal loan as the output variable. Perform an exhaustive-search best subset selection with the number of best subsets equal to 2.
a. From the generated set of logistic regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. Increases in which variables increase the chance that a customer has taken a personal loan? Increases in which variables decrease that chance?
c. Using the default cutoff value of 0.5 for your logistic regression model, what is the overall error rate on the test data?

Accepted Answer

The answer of A bank is interested in identifying different...
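
Part c turns on how a cutoff converts predicted probabilities into classes. The sketch below assumes a fitted model with hypothetical coefficients (the intercept and the Income and Family size coefficients are invented for illustration, not estimated from the bank's data) and assumes a record is classified as 1 when its probability is at least the cutoff.

```python
import math

def loan_probability(row, intercept, coefs):
    """Logistic model: P(Personal loan = 1) for one customer."""
    logit = intercept + sum(coefs[name] * row[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-logit))

def overall_error_rate(rows, intercept, coefs, cutoff=0.5):
    """Share of records whose predicted class differs from the actual class."""
    wrong = 0
    for row in rows:
        predicted = 1 if loan_probability(row, intercept, coefs) >= cutoff else 0
        wrong += predicted != row["Personal loan"]
    return wrong / len(rows)

# Hypothetical coefficients and test records, for illustration only
intercept = -6.0
coefs = {"Income": 0.05, "Family size": 0.5}
test_rows = [
    {"Income": 40, "Family size": 2, "Personal loan": 0},
    {"Income": 120, "Family size": 4, "Personal loan": 1},
    {"Income": 100, "Family size": 3, "Personal loan": 0},
    {"Income": 60, "Family size": 1, "Personal loan": 1},
    {"Income": 30, "Family size": 3, "Personal loan": 0},
]
print(overall_error_rate(test_rows, intercept, coefs))  # 0.4
```

Two of the five hypothetical records are misclassified at the 0.5 cutoff, which gives the 40 percent overall error rate. Raising or lowering the cutoff trades one kind of misclassification for the other.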

Question 56

A bank is interested in identifying different attributes of its customers. Sample data on 150 customers are given below. In the data table, for the dummy variable Gender, 0 represents Male and 1 represents Female. For the dummy variable Personal loan, 0 represents a customer who has not taken a personal loan and 1 represents a customer who has taken a personal loan.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Use logistic regression to classify observations as Personal loan taken (or not taken) using Age, Gender, Work experience, Income (in 1000 $), and Family size as input variables and Personal loan as the output variable. Perform an exhaustive-search best subset selection with the number of best subsets equal to 2.
a. From the generated set of logistic regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. Increases in which variables increase the chance that a customer has taken a personal loan? Increases in which variables decrease that chance?
c. Using the default cutoff value of 0.5 for your logistic regression model, what is the overall error rate on the test data?

Accepted Answer

The answer of A bank is interested in identifying different...
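
For parts a and b, the general form of a logistic regression model on these five inputs can be written as below. The symbols $p$ and $b_0, \dots, b_5$ are notation introduced here for illustration; the actual coefficient values depend on the fitted model and on which inputs the chosen subset retains.

```latex
\ln\!\left(\frac{p}{1-p}\right)
  = b_0 + b_1\,\text{Age} + b_2\,\text{Gender} + b_3\,\text{Work experience}
        + b_4\,\text{Income} + b_5\,\text{Family size},
\qquad
p = \frac{1}{1 + e^{-\left(b_0 + b_1\,\text{Age} + \cdots + b_5\,\text{Family size}\right)}}
```

Here $p$ is the estimated probability that Personal loan equals 1. Holding the other inputs fixed, an input with a positive coefficient raises $p$ as it increases, and an input with a negative coefficient lowers it.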

Question 57

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke is selected and the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using k-nearest neighbors with up to k = 20. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a Detailed Scoring report for all three sets of data.
a. What value of k minimizes the root mean squared error (RMSE) on the validation data?
b. What is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

The answer of A research team wanted to assess the...
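
The "score on best k" idea in part a can be sketched as follows: normalize the inputs, predict each validation record as the average of its k nearest training records, and keep the k with the lowest validation RMSE. This is a sketch under assumptions, not XLMiner's implementation: it uses z-score normalization with training-set statistics and Euclidean distance, and the patient records are invented.

```python
import math

def zscore_params(rows):
    """Column means and standard deviations, computed from the training rows."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]  # fall back to 1.0 for constant columns
    return means, stds

def normalize(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def knn_predict(train_x, train_y, query, k):
    """Average the output of the k training records closest to the query."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / k

def best_k(train_x, train_y, valid_x, valid_y, max_k):
    """Validation RMSE for k = 1..max_k, and the k that minimizes it."""
    results = {}
    for k in range(1, min(max_k, len(train_x)) + 1):
        preds = [knn_predict(train_x, train_y, q, k) for q in valid_x]
        results[k] = math.sqrt(
            sum((a - p) ** 2 for a, p in zip(valid_y, preds)) / len(valid_y))
    return min(results, key=results.get), results

# Hypothetical records: (Age, Systolic blood pressure, Smoker) and Risk
train_raw = [[45, 120, 0], [50, 130, 1], [62, 150, 0],
             [70, 160, 1], [55, 140, 1], [66, 155, 0]]
train_y = [10, 14, 30, 38, 20, 33]
valid_raw = [[48, 125, 0], [68, 158, 1]]
valid_y = [12, 36]

means, stds = zscore_params(train_raw)
train_x = normalize(train_raw, means, stds)
valid_x = normalize(valid_raw, means, stds)
k_star, rmse_by_k = best_k(train_x, train_y, valid_x, valid_y, max_k=20)
print(k_star, {k: round(v, 3) for k, v in rmse_by_k.items()})
```

Normalizing matters here because blood pressure and age are on much larger scales than the 0/1 Smoker variable and would otherwise dominate the distance calculation.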

Question 58

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The data are given below:

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the sale price using multiple linear regression. Use Sale Price as the output variable and all the other variables as input variables. To generate a pool of models to consider, execute the following steps. In Step 2 of XLMiner's Multiple Linear Regression procedure, click the Best subset option. In the Best Subset dialog box, check the box next to Perform best subset selection, enter 6 in the box next to Maximum size of best subset:, enter 1 in the box next to Number of best subsets:, and check the box next to Exhaustive search.
a. From the generated set of multiple linear regression models, select one that you believe is a good fit. Express the model as a mathematical equation relating the output variable to the input variables.
b. For your model, what is the RMSE on the validation data and test data?
c. What is the average error on the validation data and test data? What does this suggest?

Accepted Answer

The answer of To examine the local housing market in...

Question 59

A research team wanted to assess the relationship between age, systolic blood pressure, smoking, and risk of stroke. A sample of 150 patients who had a stroke is selected and the data collected are given below. Here, for the variable Smoker, 1 represents smokers and 0 represents nonsmokers.

Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the Risk of stroke using a regression tree. Use Risk as the output variable and all the other variables as input variables. In Step 2 of XLMiner's Regression Tree procedure, be sure to Normalize input data, to set the Maximum #splits for input variables to 74, to set the Minimum #records in a terminal node to 1, and specify Using Best prune tree as the scoring option. In Step 3 of XLMiner's Regression Tree procedure, set the maximum number of levels to 7. Generate the Full tree, Best pruned tree, and Minimum error tree. Generate a Detailed Scoring report for all three sets of data.
a. In terms of number of decision nodes, compare the size of the full tree to the size of the best pruned tree.
b. What is the root mean squared error (RMSE) of the best pruned tree on the validation data and on the test data?
c. What is the average error on the validation data and test data? What does this suggest?
d. By examining the best pruned tree, what are the critical variables in predicting the risk?

Accepted Answer

The answer of A research team wanted to assess the...
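
Part d asks which variables the tree splits on. The sketch below shows how a single split of a regression tree can be chosen, by trying every midpoint between adjacent values of each input and keeping the one that minimizes the combined sum of squared errors of the two resulting groups. It illustrates only the split criterion, not XLMiner's tree growing or pruning, and the four patient records are invented.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, y):
    """Return (score, column index, threshold) of the split with the lowest total SSE."""
    best = None
    for col in range(len(rows[0])):
        levels = sorted({row[col] for row in rows})
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2  # midpoint between adjacent distinct values
            left = [t for row, t in zip(rows, y) if row[col] <= threshold]
            right = [t for row, t in zip(rows, y) if row[col] > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, col, threshold)
    return best

# Hypothetical records: (Age, Smoker) and Risk
rows = [[45, 0], [50, 1], [62, 0], [70, 1]]
risk = [10, 12, 30, 34]
print(best_split(rows, risk))  # (10.0, 0, 56.0)
```

In this toy data the first split is on Age at 56, since that separates low-risk from high-risk records far better than splitting on Smoker. Variables chosen for splits near the root of the best pruned tree are the natural candidates for "critical variables."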

Question 60

As part of the quarterly reviews, the manager of a retail store analyzes the quality of customer service based on the periodic customer satisfaction ratings (on a scale of 1 to 10 with 1 = Poor and 10 = Excellent). To understand the level of service quality, which includes the waiting times of the customers in the checkout section, he collected the following data on 100 customers who visited the store.

a. Apply hierarchical clustering with 5 clusters using Wait Time (min) and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure, and specify single linkage as the clustering method. Analyze the resulting clusters by computing the cluster size. It may be helpful to use a PivotTable on the data in the HC_Clusters worksheet generated by XLMiner to compute descriptive measures of the Wait Time and Customer Satisfaction Rating variables in each cluster. You can also visualize the clusters by creating a scatter plot with Wait Time (min) as the x-variable and Customer Satisfaction Rating as the y-variable.
b. Repeat part a using average linkage as the clustering method. Compare the clusters to the previous method.

Accepted Answer

The answer of As part of the quarterly reviews, the...
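
The difference between the single linkage of part a and the average linkage of part b lies only in how the distance between two clusters is measured. A minimal agglomerative sketch follows, assuming the inputs have already been normalized and using Euclidean distance; the function names and the five points are hypothetical, and this is not XLMiner's implementation.

```python
import math
from itertools import combinations

def linkage_distance(a, b, points, method):
    """Dissimilarity between clusters a and b (lists of point indices)."""
    dists = [math.dist(points[i], points[j]) for i in a for j in b]
    if method == "single":    # two closest observations
        return min(dists)
    if method == "complete":  # two most dissimilar observations
        return max(dists)
    if method == "average":   # mean over every cross-cluster pair
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerate(points, n_clusters, method="single"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage_distance(clusters[p[0]], clusters[p[1]], points, method),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Hypothetical (Wait Time, Customer Satisfaction Rating) pairs, assumed pre-normalized
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
for method in ("single", "average"):
    clusters = agglomerate(points, n_clusters=3, method=method)
    print(method, clusters, sorted(len(c) for c in clusters))
```

On well-separated toy data both methods give the same clusters. On real data, single linkage tends to chain observations into a few large, elongated clusters with several very small ones, which is why comparing cluster sizes across the two methods is informative.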

Deck 6: Data Mining