Machine Learning

1. Which of the following is a widely used and effective machine learning algorithm based on the idea of bagging?
a. Decision Tree
b. Regression
c. Classification
d. Random Forest
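
As a minimal sketch (assuming scikit-learn and its bundled iris dataset are available; illustrative, not part of the original question set): a Random Forest trains each tree on a bootstrap sample of the data, which is exactly the bagging idea.

    # Random Forest = bagging of decision trees: each tree sees a bootstrap
    # sample, and the forest aggregates the trees' votes.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # bootstrap=True (the default) is what makes this a bagging ensemble
    clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))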

2. To find the minimum or the maximum of a function, we set the gradient to zero because:
a. The value of the gradient at extrema of a function is always zero
b. Depends on the type of problem
c. Both A and B
d. None of the above

3. The most widely used metrics and tools to assess a classification model are:
a. Confusion matrix
b. Cost-sensitive accuracy
c. Area under the ROC curve
d. All of the above

4. Which of the following is a good test dataset characteristic?
a. Large enough to yield meaningful results
b. Is representative of the dataset as a whole
c. Both A and B
d. None of the above

5. Which of the following is a disadvantage of decision trees?
a. Factor analysis
b. Decision trees are robust to outliers
c. Decision trees are prone to overfitting
d. None of the above

6. How do you handle missing or corrupted data in a dataset?
a. Drop missing rows or columns
b. Replace missing values with mean/median/mode
c. Assign a unique category to missing values
d. All of the above
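
For illustration (a hedged sketch assuming pandas is available, using a made-up DataFrame), the three strategies listed above look like this in code:

    import numpy as np
    import pandas as pd

    # hypothetical toy frame with missing entries
    df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NY", None, "LA"]})

    dropped = df.dropna()                             # a. drop missing rows
    imputed = df.fillna({"age": df["age"].median()})  # b. impute with the median
    coded = df.fillna({"city": "Missing"})            # c. unique category for missing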

7. What is the purpose of performing cross-validation?
a. To assess the predictive performance of the models
b. To judge how the trained model performs outside the sample on test data
c. Both A and B
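
As a sketch (assuming scikit-learn and its iris dataset), 5-fold cross-validation estimates out-of-sample performance by repeatedly holding out one fold:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())  # average accuracy over the 5 held-out folds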

8. Why is second order differencing in time series needed?
a. To remove non-stationarity (make the series stationary)
b. To find the maxima or minima at the local point
c. Both A and B
d. None of the above
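
A quick numeric illustration of differencing (assuming numpy): a quadratic trend becomes constant after two differences, i.e. the trend is removed.

    import numpy as np

    t = np.arange(6)
    series = t ** 2              # non-stationary: quadratic trend [0 1 4 9 16 25]
    print(np.diff(series, n=2))  # [2 2 2 2] -- second differencing removes the trend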

9. When performing regression or classification, which of the following is the correct way to preprocess the data?
a. Normalize the data → PCA → training
b. PCA → normalize PCA output → training
c. Normalize the data → PCA → normalize PCA output → training
d. None of the above
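
Option a expressed as a scikit-learn Pipeline (a sketch assuming scikit-learn and its iris data):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    pipe = make_pipeline(
        StandardScaler(),                    # normalize the data
        PCA(n_components=2),                 # project onto principal components
        LogisticRegression(max_iter=1000),   # training
    )
    pipe.fit(X, y)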

10. Which of the following is an example of feature extraction?
a. Constructing a bag-of-words vector from an email
b. Applying a PCA projection to high-dimensional data
c. Removing stopwords in a sentence
d. All of the above

11. What is pca.components_ in sklearn?
a. Set of all eigenvectors for the projection space
b. Matrix of principal components
c. Result of the multiplication matrix
d. None of the above options
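
To see what pca.components_ contains (a sketch with random data, assuming numpy and scikit-learn):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))

    pca = PCA(n_components=2).fit(X)
    # one eigenvector of the projection space per row:
    print(pca.components_.shape)  # (2, 5) -- 2 components in a 5-dim feature space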

12. Which of the following is true about Naive Bayes?
a. Assumes that all the features in a dataset are equally important
b. Assumes that all the features in a dataset are independent
c. Both A and B
d. None of the above options

13. Which of the following statements about regularization is not correct?
a. Using too large a value of lambda can cause your hypothesis to underfit the data.
b. Using too large a value of lambda can cause your hypothesis to overfit the data.
c. Using a very large value of lambda cannot hurt the performance of your hypothesis.
d. None of the above

14. How can you prevent a clustering algorithm from getting stuck in bad local optima?
a. Set the same seed value for each run
b. Use multiple random initializations
c. Both A and B
d. None of the above
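
In scikit-learn this is the n_init parameter: K-means is restarted from several random initializations and the best run (lowest inertia) is kept. A sketch on random data:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))

    # 10 random restarts; the run with the lowest inertia wins
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.inertia_)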

15. Which of the following techniques can be used for normalization in text mining?
a. Stemming
b. Lemmatization
c. Stop Word Removal
d. Both A and B
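
A sketch of stemming versus lemmatization with NLTK (assumes nltk is installed and the 'wordnet' corpus has been downloaded):

    # requires: nltk.download('wordnet')
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("studies"))          # 'studi' -- crude suffix stripping
    print(lemmatizer.lemmatize("studies"))  # 'study' -- dictionary base form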

16. In which of the following cases will K-means clustering fail to give good results?
1) Data points with outliers
2) Data points with different densities
3) Data points with non-convex shapes
a. 1 and 2
b. 2 and 3
c. 1, 2, and 3
d. 1 and 3

17. Which of the following is a reasonable way to select the number of principal components "k"?
a. Choose k to be the smallest value so that at least 99% of the variance is retained.
b. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer).
c. Choose k to be the largest value so that 99% of the variance is retained.
d. Use the elbow method
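
Option a in code (a sketch with random data, assuming numpy and scikit-learn; note that PCA(n_components=0.99) performs the same selection directly):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))

    pca = PCA().fit(X)
    cumvar = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.argmax(cumvar >= 0.99)) + 1  # smallest k retaining >= 99% variance
    print(k)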

18. You run gradient descent for 15 iterations with alpha = 0.3 and compute J(theta) after each iteration. You find that the value of J(theta) decreases quickly and then levels off. Based on this, which of the following conclusions seems most plausible?
a. Rather than using the current value of alpha, use a larger value (say alpha = 1.0)
b. Rather than using the current value of alpha, use a smaller value (say alpha = 0.1)
c. alpha = 0.3 is an effective choice of learning rate
d. None of the above

19. What is a sentence parser typically used for?
a. It is used to parse sentences to check if they are utf-8 compliant.
b. It is used to parse sentences to derive their most likely syntax tree structures.
c. It is used to parse sentences to assign POS tags to all tokens.
d. It is used to check if sentences can be parsed into meaningful tokens.

20. Suppose you have trained a logistic regression classifier and, for a new example x, it outputs a prediction h(x) = 0.2. This means:
a. Our estimate for P(y=1 | x) is 0.2
b. Our estimate for P(y=0 | x) is 0.2
c. Our estimate for P(y=1 | x) is 0.8
d. Our estimate for P(y=0 | x) is 0.8
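
A sketch of what the 0.2 means in scikit-learn terms (assumes scikit-learn and its breast-cancer dataset): predict_proba returns [P(y=0|x), P(y=1|x)], and the two always sum to 1, so h(x) = 0.2 for class 1 implies 0.8 for class 0.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    clf = LogisticRegression(max_iter=5000).fit(X, y)

    p0, p1 = clf.predict_proba(X[:1])[0]
    print(p0 + p1)  # 1.0: P(y=0|x) = 1 - P(y=1|x)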

21. Which of the following clustering types has the characteristic shown in the figure below?

[Figure omitted: a cluster tree (dendrogram) that groups data over a variety of scales]
a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned
Explanation: Hierarchical clustering groups data over a variety of scales by creating a cluster tree or dendrogram.

22. Hierarchical clustering is an agglomerative approach, whereas K-means clustering follows a partitioning approach.

23. Hierarchical clustering is deterministic, whereas K-means is not (its result depends on the random initial centroids).


24. The process of forming general concept definitions from examples of concepts to be learned.
A. Deduction
B. Abduction
C. Induction
D. Conjunction

25. Computers are best at learning
A. facts.
B. concepts.
C. procedures.
D. principles.

26. Data used to build a data mining model.
A. validation data
B. training data
C. test data
D. hidden data

27. Supervised learning and unsupervised clustering both require at least one
A. hidden attribute.
B. output attribute.
C. input attribute.
D. categorical attribute.

28. Supervised learning differs from unsupervised clustering in that supervised learning requires
A. at least one input attribute.
B. input attributes to be categorical.
C. at least one output attribute.
D. output attributes to be categorical.

29. A regression model in which more than one independent variable is used to predict the dependent variable is called
A. a simple linear regression model
B. a multiple regression model
C. an independent model
D. none of the above

30. A term used to describe the case when the independent variables in a multiple regression model are correlated is
A. regression
B. correlation
C. multicollinearity
D. none of the above

31. A multiple regression model has the form: y = 2 + 3x1 + 4x2. As x1 increases by 1 unit (holding x2 constant), y will
A. increase by 3 units
B. decrease by 3 units
C. increase by 4 units
D. decrease by 4 units

32. A multiple regression model has
A. only one independent variable
B. more than one dependent variable
C. more than one independent variable
D. none of the above

33. A measure of goodness of fit for the estimated regression equation is the
A. multiple coefficient of determination
B. mean square due to error
C. mean square due to regression
D. none of the above

34. The adjusted multiple coefficient of determination accounts for
A. the number of dependent variables in the model
B. the number of independent variables in the model
C. unusually large predictors
D. none of the above

35. The multiple coefficient of determination is computed by
A. dividing SSR by SST
B. dividing SST by SSR
C. dividing SST by SSE
D. none of the above

36. For a multiple regression model, SST = 200 and SSE = 50. The multiple coefficient of determination is
A. 0.25
B. 4.00
C. 0.75
D. none of the above
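
Worked arithmetic for question 36 (plain Python, for illustration): R-squared = SSR/SST = (SST - SSE)/SST.

    SST, SSE = 200, 50
    r_squared = (SST - SSE) / SST
    print(r_squared)  # 0.75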

37. A nearest neighbor approach is best used
A. with large-sized datasets.
B. when irrelevant attributes have been removed from the data.
C. when a generalized model of the data is desirable.
D. when an explanation of what has been found is of primary importance.

38. Determine which is the best approach for each problem.
A. supervised learning
B. unsupervised clustering
C. data query


38.1. What is the average weekly salary of all female employees under forty years of age? (C)
38.2. Develop a profile for credit card customers likely to carry an average monthly balance of more than $1000.00. (A)
38.3. Determine the characteristics of a successful used car salesperson. (A)
38.4. What attribute similarities group customers holding one or several insurance policies? (A)
38.5. Do meaningful attribute relationships exist in a database containing information about credit card customers? (B)
38.6. Do single men play more golf than married men? (C)
38.7. Determine whether a credit card transaction is valid or fraudulent (A)


39. Another name for an output attribute.
A. predictive variable
B. independent variable
C. estimated variable
D. dependent variable

40. Classification problems are distinguished from estimation problems in that
A. classification problems require the output attribute to be numeric.
B. classification problems require the output attribute to be categorical.
C. classification problems do not allow an output attribute.
D. classification problems are designed to predict future outcome.

41. Which statement is true about prediction problems?
A. The output attribute must be categorical.
B. The output attribute must be numeric.
C. The resultant model is designed to determine future outcomes.
D. The resultant model is designed to classify current behavior.

42. Which statement about outliers is true?
A. Outliers should be identified and removed from a dataset.
B. Outliers should be part of the training dataset but should not be present in the test data.
C. Outliers should be part of the test dataset but should not be present in the training data.
D. The nature of the problem determines how outliers are used.
E. More than one of a,b,c or d is true.

43. Which statement is true about neural network and linear regression models?
A. Both models require input attributes to be numeric.
B. Both models require numeric attributes to range between 0 and 1.
C. The output of both models is a categorical attribute value.
D. Both techniques build models whose output is determined by a linear sum of weighted input attribute values.
E. More than one of a,b,c or d is true.

44. Which of the following is a common use of unsupervised clustering?
A. detect outliers
B. determine a best set of input attributes for supervised learning
C. evaluate the likely performance of a supervised learner model
D. determine if meaningful relationships can be found in a dataset
E. All of a,b,c, and d are common uses of unsupervised clustering.

45. The average positive difference between computed and desired outcome values is called
A. root mean squared error
B. mean squared error
C. mean absolute error
D. mean positive error

46. Selecting data so as to assure that each class is properly represented in both the training and test set.
A. cross validation
B. stratification
C. verification
D. bootstrapping
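
A sketch of stratification with scikit-learn (illustrative): stratify=y keeps the class proportions identical in the training and test splits.

    from collections import Counter
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
    print(Counter(y_tr), Counter(y_te))  # class ratios match the full dataset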

47. The standard error is defined as the square root of this computation.
A. The sample variance divided by the total number of sample instances.
B. The population variance divided by the total number of sample instances.
C. The sample variance divided by the sample mean.
D. The population variance divided by the sample mean.

48. Data used to optimize the parameter settings of a supervised learner model.
A. training
B. test
C. verification
D. validation

49. Bootstrapping allows us to
A. choose the same training instance several times.
B. choose the same test set instance several times.
C. build models with alternative subsets of the training data several times.
D. test a model with alternative subsets of the test data several times.
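
A sketch of the key property (assuming numpy): a bootstrap sample draws training instances with replacement, so the same instance can appear several times.

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(10)
    sample = rng.choice(data, size=10, replace=True)  # sampling WITH replacement
    print(sample)  # duplicates are expected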

50. The correlation between the number of years an employee has worked for a company and the salary of the employee is 0.75. What can be said about employee salary and years worked?
A. There is no relationship between salary and years worked.
B. Individuals that have worked for the company the longest have higher salaries.
C. Individuals that have worked for the company the longest have lower salaries.
D. The majority of employees have been with the company a long time.
E. The majority of employees have been with the company a short period of time.

51. The correlation coefficient for two real-valued attributes is –0.85. What does this value tell you?
A. The attributes are not linearly related.
B. As the value of one attribute increases the value of the second attribute also increases.
C. As the value of one attribute decreases the value of the second attribute increases.
D. The attributes show a curvilinear relationship.

52. The average squared difference between classifier predicted output and actual output.
A. mean squared error
B. root mean squared error
C. mean absolute error
D. mean relative error

53. Simple regression assumes a __________ relationship between the input attribute and output attribute.
A. linear
B. quadratic
C. reciprocal
D. inverse

54. Regression trees are often used to model _______ data.
A. linear
B. nonlinear
C. categorical
D. symmetrical

55. The leaf nodes of a model tree are
A. averages of numeric output attribute values.
B. nonlinear regression equations.
C. linear regression equations.
D. sums of numeric output attribute values.

56. Logistic regression is a ________ regression technique that is used to model data having a _____outcome.
A. linear, numeric
B. linear, binary
C. nonlinear, numeric
D. nonlinear, binary

57. This technique associates a conditional probability value with each data instance.
A. linear regression
B. logistic regression
C. simple regression
D. multiple linear regression

58. This supervised learning technique can process both numeric and categorical input attributes.
A. linear regression
B. Bayes classifier
C. logistic regression
D. backpropagation learning

59. With Bayes classifier, missing data items are
A. treated as equal compares.
B. treated as unequal compares.
C. replaced with a default value.
D. ignored.

60. This clustering algorithm merges and splits nodes to help modify nonoptimal partitions.
A. agglomerative clustering
B. expectation maximization
C. conceptual clustering
D. K-Means clustering

61. This clustering algorithm initially assumes that each data instance represents a single cluster.
A. agglomerative clustering
B. conceptual clustering
C. K-Means clustering
D. expectation maximization

62. This unsupervised clustering algorithm terminates when mean values computed for the current iteration of the algorithm are identical to the computed mean values for the previous iteration.
A. agglomerative clustering
B. conceptual clustering
C. K-Means clustering
D. expectation maximization

63. Machine learning techniques differ from statistical techniques in that machine learning methods
A. typically assume an underlying distribution for the data.
B. are better able to deal with missing and noisy data.
C. are not able to explain their behavior.
D. have trouble with large-sized datasets.


2 Marks Questions:

64. How do you handle missing or corrupted data in a dataset?
A) Drop missing rows or columns
B) Replace missing values with mean/median/mode
C) Assign a unique category to missing values
D) All of the above

65. What is the purpose of performing cross validation?
A) To assess the predictive performance of the models
B) To judge how the trained model performs outside the sample on test data.
C) Both A & B

66. Why is second order differencing in time series needed?
A) To remove non-stationarity (make the series stationary)
B) To find the maxima or minima at the local point
C) Both A & B

67. When performing regression or classification which of the following is the correct way to pre-process the data?
Answer: Normalize the data → PCA (Principal Component Analysis) → training

68. Which of the following is an example of feature extraction?
A) Constructing a bag-of-words vector from an email
B) Applying a PCA projection to high-dimensional data
C) Removing stopwords in a sentence
D) All of the above

69. Which of the following is true about the Naive Bayes algorithm?
A) Assumes that all the features in a dataset are equally important
B) Assumes that all the features in a dataset are independent
C) Both A & B

70. Which of the following statements about regularization is not correct?
A) Using too large a value of lambda can cause your hypothesis to underfit the data
B) Using too large a value of lambda can cause your hypothesis to overfit the data
C) Using a very large value of lambda cannot hurt the performance of your hypothesis
D) None of the above

71. How can you prevent a clustering algorithm from getting stuck in bad local optima?
A) Set the same seed value for each run
B) Use multiple random initializations
C) Both A & B

72. Which of the following techniques can be used for normalization in text mining?
A) Stemming
B) Lemmatization
C) Stop word removal
D) Both A & B

73. In which of the following cases will K-means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with non-convex shapes
Answer: All three cases

74. What is a sentence parser typically used for?
Answer: It is used to parse sentences to derive their most likely syntax tree structures.


75. Suppose you have trained a logistic regression classifier and, for a new example x, it outputs a prediction h(x) = 0.2. What does this mean?
Answer: Our estimate for P(y = 0 | x) is 0.8.

76. What is pca.components_ in sklearn?
Answer: The set of all eigenvectors for the projection space.


77. Which of the following is an example of a deterministic algorithm?
Answer: PCA

78. True or False: the Pearson correlation between two variables is 0, but their values can still be related to each other.
Answer: True


79. Imagine you are solving a classification problem with a highly imbalanced class: the majority class is observed 99% of the time in the training data. Your model has 99% accuracy on the test data. Which of the following is true in such a case?
1. The accuracy metric is not a good idea for imbalanced class problems
2. The accuracy metric is a good idea for imbalanced class problems
3. Precision and recall metrics are good for imbalanced class problems
4. Precision and recall metrics are not good for imbalanced class problems
Answer: Options 1 and 3 are correct (see the sketch below).
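
A sketch of why accuracy misleads here (assuming scikit-learn): a classifier that always predicts the majority class scores 99% accuracy yet has zero recall on the minority class.

    from sklearn.metrics import accuracy_score, recall_score

    y_true = [0] * 99 + [1]   # 99% majority class
    y_pred = [0] * 100        # always predict the majority

    print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
    print(recall_score(y_true, y_pred))    # 0.0  -- minority class never found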

80. Which of the following options is true for the overall execution time of 5-fold cross-validation with 10 different values of max_depth?
Answer: More than 600 secs

81. What would you do in PCA to get the same projection as SVD (singular value decomposition)?
Answer: Transform the data to zero mean.

82. Which of the following values of K will have the least leave-one-out cross-validation accuracy?
Answer: 1-NN

83. Which of the following options can be used to help the K-means algorithm reach the global minimum?
A) Try running the algorithm with different centroid initializations
B) Adjust the number of iterations
C) Find the optimal number of clusters
D) All of the above

84. Imagine you have a 28 * 28 image and you run a 3 * 3 convolution on it with an input depth of 3 and an output depth of 8.
Note: Stride is 1 and you are using same padding. What is the dimension of the output feature map?

A) 28 width, 28 height and 8 depth
B) 13 width, 13 height and 8 depth
C) 28 width, 13 height and 8 depth
D) 13 width, 28 height and 8 depth
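
The "same" padding arithmetic as a small helper (plain Python sketch): the spatial size is ceil(input / stride), and the depth equals the number of filters.

    import math

    def conv_output_shape(h, w, stride, out_depth):
        # 'same' padding keeps the spatial dimensions at ceil(size / stride)
        return math.ceil(h / stride), math.ceil(w / stride), out_depth

    print(conv_output_shape(28, 28, stride=1, out_depth=8))  # (28, 28, 8)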

85. A feature F1 can take the values A, B, C, D, E, and F, and represents the grade of students from a college.
Which of the following statements is true for the above case?

1. Feature F1 is an example of a nominal variable
2. Feature F1 is an example of an ordinal variable
3. It doesn't belong to either of the above categories
4. Both of these

86. Assume that there is a blackbox algorithm which takes training data with multiple observations T1, T2, T3, ..., Tn and a new observation Q1. The blackbox outputs the nearest neighbour of Q1, say Ti, and its corresponding class label Ci. Assume that this blackbox algorithm is the same as 1-NN.
Is it possible to construct a k-NN classification algorithm based on this blackbox alone, where the number of training observations is very large compared to k?

Answer: True


87. Assume that there is a blackbox algorithm which takes training data with multiple observations T1, T2, T3, ..., Tn and a new observation Q1. The blackbox outputs the nearest neighbour of Q1, say Ti, and its corresponding class label Ci. Assume that this blackbox algorithm is the same as 1-NN.
Instead of the 1-NN blackbox, we want to use a J-NN algorithm as the blackbox, where J > 1.
Which of the following options is correct for finding K-NN using J-NN?
A) J must be a proper factor of K.
B) J must be greater than K.
C) Not possible.

3 Marks Questions:

88. Which of the following statements is true?
A. In Gradient Descent (GD) and Stochastic Gradient Descent (SGD), we update a set of parameters in an iterative manner.
B. In Stochastic Gradient Descent (SGD), we have to run through all the samples for a single update of a parameter in each iteration.
C. In Gradient Descent (GD), we either use the entire data or a subset of the training data to update a parameter in each iteration.

89. Which hyperparameter of Random Forest, when increased, causes the model to overfit the data?
Answer: Depth of tree

90. Imagine you are working with Analytics Vidhya and you want to develop a machine learning algorithm that predicts the number of views on an article. Your analysis is based on features such as the author's name and the number of articles written by the same author on the Analytics Vidhya platform in the past. Which of the following evaluation metrics would you choose in that case?
Answer: Mean squared error

91. Let's say you are using an activation function X in the hidden layer of a neural network. At a particular neuron, for a given input, you get the output -0.001. Which of the following activation functions could X represent?
Answer: tanh (of the common activations, only tanh can output a negative value; sigmoid and ReLU outputs are non-negative)

92. Which of the following are important steps to preprocess the text in an NLP-based project?
A. Stemming
B. Stop word removal
C. Object standardization
D. All of these

93. Adding a non-important feature to a linear regression model may result in:
Answer: An increase in R^2 (R-squared)

94. A KNN model is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to handle this problem?
Answer: Dimensionality reduction and feature selection

95. Which of the following is true about gradient boosted trees?
A. In each stage, a regression tree is introduced to compensate for the shortcomings of the existing model.
B. We can use the gradient descent (GD) method to minimize the loss function.
C. Both of these

96. To apply bagging to regression trees, which of the following are true in that case?
A. We build n regression trees on n bootstrap samples
B. We take the average of the n regression trees
C. Each tree has high variance and low bias
D. All of the above

97. When you find noise in the data, which of the following options would you consider in KNN?
Answer: Increase the value of k.

98. Suppose you want to predict the class of the data point x = 1, y = 1 using Euclidean distance in 3-NN. Which class does this data point belong to?
Answer: Positive (+) class

99. What is the Euclidean distance between the two data points A(1,3) and B(2,3)?
Answer: 1
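
A quick check (assuming numpy):

    import numpy as np

    a, b = np.array([1, 3]), np.array([2, 3])
    print(np.linalg.norm(a - b))  # sqrt((2-1)^2 + (3-3)^2) = 1.0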

100. Suppose you are working on a binary classification problem with three input features, and you choose to apply a bagging algorithm X to this data. You choose max_features = 2 and n_estimators = 3, and assume that each estimator has 70% accuracy. Note that algorithm X aggregates the results of the individual estimators by majority voting. What is the maximum accuracy you can get?
Answer: 100%

101. In random forest or gradient boosting algorithms, features can be of any type: for example, continuous or categorical. Which of the following options is true when you consider these feature types?
Answer: Both algorithms can handle real-valued attributes by discretizing them.

102. Suppose you want to apply the AdaBoost algorithm on data D which has T observations. You initially set half of the data for training and half for testing. Now you want to increase the number of training data points T1, T2, ..., Tn, where T1 < T2 < ... < Tn. Which of the following is true about the training and testing error in this case?
Answer: The difference between training error and testing error decreases as the number of observations increases.

103. Suppose you are given 3 variables x, y, and z, and the Pearson correlation coefficients for (x,y), (y,z), and (x,z) are C1, C2, and C3 respectively. Now you add 2 to all values of x and subtract 2 from all values of y, while z remains the same. The new coefficients for (x,y), (y,z), and (x,z) are D1, D2, and D3 respectively. How do the values of D1, D2, D3 relate to C1, C2, C3?
Answer: D1 = C1, D2 = C2, D3 = C3.
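
A sketch verifying the shift-invariance of Pearson correlation (assuming numpy): adding or subtracting a constant leaves the deviations from the mean, and hence the correlation, unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=50), rng.normal(size=50)

    c1 = np.corrcoef(x, y)[0, 1]
    d1 = np.corrcoef(x + 2, y - 2)[0, 1]
    print(np.isclose(c1, d1))  # True: correlation is unchanged by constant shifts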

104. Which of the following techniques can be used for the purpose of keyword normalization, the process of converting a keyword into its base form?
1. Lemmatization
2. Levenshtein
3. Stemming
4. Soundex

A) 1 and 2
B) 2 and 4
C) 1 and 3
D) 1, 2 and 3
E) 2, 3 and 4
F) 1, 2, 3 and 4

105. Which of the following models can perform tweet classification, given only a corpus of tweets and no other information (see the explanation below)?
A) Naive Bayes
B) SVM
C) None of the above

Explanation: You are given only the tweet data and no other information, which means there is no target variable present. One cannot train a supervised learning model; both SVM and Naive Bayes are supervised learning techniques.

106. What is the major difference between CRF (Conditional Random Field) and HMM (Hidden Markov Model)?
A) CRF is a generative model whereas HMM is a discriminative model
B) CRF is a discriminative model whereas HMM is a generative model
C) Both CRF and HMM are generative models
D) Both CRF and HMM are discriminative models