
Q. You are given a data set. The data set has missing values that spread along 1
standard deviation from the median. What percentage of data would remain
unaffected? Why?
This question has enough hints for you to start thinking! Since the missing values spread symmetrically around the median, let's assume the data follows a normal distribution. We know that in a normal distribution ~68% of the data lies within 1 standard deviation of the mean (which coincides with the median and mode), so the missing values affect ~68% of the data. Therefore, ~32% of the data would remain unaffected by missing values.
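As a quick sanity check, the 68% figure can be verified numerically, for example with scipy (a minimal sketch, assuming scipy is installed):

# Fraction of a standard normal distribution within 1 standard deviation of the mean
from scipy.stats import norm

within_one_sd = norm.cdf(1) - norm.cdf(-1)
print(round(within_one_sd, 4))  # ~0.6827, leaving ~32% of the data outside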
Q. What are PCA, KPCA, and ICA used for?
PCA (Principal Component Analysis), KPCA (Kernel Principal Component Analysis), and ICA (Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.
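As an illustration, all three are available in scikit-learn (a minimal sketch on synthetic data; the component counts and kernel are arbitrary choices):

# Reducing a synthetic dataset to 2 components with PCA, KPCA, and ICA
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, FastICA

X = np.random.RandomState(0).randn(100, 5)

X_pca = PCA(n_components=2).fit_transform(X)                       # linear directions of maximum variance
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)  # nonlinear PCA via a kernel
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # statistically independent components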
Q. What are support vector machines?
Support vector machines are supervised learning algorithms used for classification and
regression analysis.
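For instance, scikit-learn exposes both uses (a minimal sketch on a toy dataset; the RBF kernel is an arbitrary choice):

# SVM for classification (SVC) and for regression (SVR)
from sklearn.datasets import load_iris
from sklearn.svm import SVC, SVR

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf").fit(X, y)               # classification on the iris labels
reg = SVR(kernel="rbf").fit(X[:, :3], X[:, 3])  # regression: predict one feature from the others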
Q. What is batch statistical learning?
Statistical learning techniques allow learning a function or predictor from a set of observed data
that can make predictions about unseen or future data. These techniques provide guarantees
on the performance of the learned predictor on the future unseen data based on a statistical
assumption on the data generating process.
Q. What is the bias-variance decomposition of classification error in the
ensemble method?
The expected error of a learning algorithm can be decomposed into bias and variance. A bias
term measures how closely the average classifier produced by the learning algorithm matches
the target function. The variance term measures how much the learning algorithm’s prediction
fluctuates for different training sets.
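For squared loss, the cleanest case, the decomposition can be written explicitly. Writing $f(x)$ for the target, $h_D(x)$ for the predictor learned from training set $D$, and $\bar{h}(x) = E_D[h_D(x)]$ for the average predictor:

$E_D\big[(h_D(x) - f(x))^2\big] = \underbrace{(\bar{h}(x) - f(x))^2}_{\text{bias}^2} + \underbrace{E_D\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{variance}}$

(For 0-1 classification loss an analogous decomposition exists, but its form is less clean.)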
Q. When is Ridge regression favorable over Lasso regression?
You can quote the authors of ISLR (Hastie and Tibshirani among them), who suggest that in the presence of a few variables with medium/large-sized effects, lasso regression should be used, while in the presence of many variables with small/medium-sized effects, ridge regression works better.
Conceptually, lasso regression (L1) performs both variable selection and parameter shrinkage, whereas ridge regression (L2) only performs parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have high variance. Therefore, it depends on our model objective.
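A quick way to see the difference in practice (a minimal sketch; the alpha value and data are arbitrary):

# Lasso zeroes out coefficients (variable selection); Ridge only shrinks them
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3 * X[:, 0] + 0.5 * rng.randn(200)   # only the first feature matters

print(Lasso(alpha=0.5).fit(X, y).coef_)  # irrelevant coefficients driven exactly to 0
print(Ridge(alpha=0.5).fit(X, y).coef_)  # all coefficients kept, just shrunk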
Q. You've built a random forest model with 10,000 trees and were delighted to see a training error of 0.00. But the validation error is 34.23. What is going on? Haven't you trained your model perfectly?
The model has overfitted. A training error of 0.00 means the classifier has mimicked the training data patterns so closely that those exact patterns are not present in unseen data. Hence, when the classifier was run on an unseen sample, it couldn't find those patterns and returned predictions with a much higher error. In a random forest, this happens when we use a larger number of trees than necessary. Hence, to avoid this situation, we should tune the number of trees using cross-validation.
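One way to tune the number of trees is a cross-validated grid search (a minimal sketch; the grid values and dataset are arbitrary):

# Choosing n_estimators by validation score instead of fixing it at 10,000
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [50, 100, 300, 500]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)  # tree count picked by cross-validation, not training error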
Q. What is a convex hull?
In the case of linearly separable data, the convex hull represents the outer boundary of each of the two groups of data points. Once the convex hulls are created, the maximum margin hyperplane (MMH) is obtained as a perpendicular bisector between the two hulls. The MMH is the hyperplane that creates the greatest separation between the two groups.
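scipy can compute the hull of a point cloud directly (a minimal sketch on random 2-D points):

# Convex hull of a 2-D point cloud
import numpy as np
from scipy.spatial import ConvexHull

points = np.random.RandomState(0).rand(30, 2)
hull = ConvexHull(points)
print(hull.vertices)  # indices of the points forming the outer boundary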
Q. What do you understand by Type I vs Type II error?
A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a 'False Positive'. A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also known as a 'False Negative'.
In the context of the confusion matrix, a Type I error occurs when we classify a value as positive (1) when it is actually negative (0), and a Type II error occurs when we classify a value as negative (0) when it is actually positive (1).
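In code, the two error types are the off-diagonal cells of a binary confusion matrix (a minimal sketch with hand-made labels):

# False positives (Type I) and false negatives (Type II) in a confusion matrix
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)  # fp = Type I error count, fn = Type II error count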
Q. In k-means or KNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?
We don't use Manhattan distance because it calculates distance along horizontal and vertical directions only; it has dimension restrictions. The Euclidean metric, on the other hand, can be used in any space to calculate distance. Since the data points can be present in any dimension, Euclidean distance is a more viable option.
Example: think of a chessboard. The movement of a rook is measured by Manhattan distance, because it moves only along rows and columns.
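The two metrics are easy to compare directly (a minimal sketch using scipy's distance functions):

# Euclidean vs Manhattan distance between two points
from scipy.spatial.distance import euclidean, cityblock

a, b = [0, 0], [3, 4]
print(euclidean(a, b))  # 5.0 (straight-line distance)
print(cityblock(a, b))  # 7   (horizontal + vertical steps only)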
Q. Do you suggest that treating a categorical variable as a continuous variable
would result in a better predictive model?
For better predictions, a categorical variable should be treated as a continuous variable only when it is ordinal in nature, i.e., when the ordering of its classes carries numeric meaning.
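A minimal sketch of why the ordinal case is safe (the category names and ordering are invented for illustration):

# An ordinal variable carries meaning as numbers; a nominal one does not
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])
order = {"small": 0, "medium": 1, "large": 2}  # the ordering is meaningful
print(sizes.map(order))                        # usable as a continuous input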
Q. OLS is to linear regression what maximum likelihood is to logistic regression. Explain the statement.
OLS and maximum likelihood are the methods the respective regression techniques use to estimate the unknown parameter (coefficient) values. In simple words:
Ordinary least squares (OLS) is the method used in linear regression that chooses the parameters minimizing the sum of squared differences between the actual and predicted values. Maximum likelihood chooses the parameter values under which the observed data are most probable.
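OLS even has a closed-form solution, which maximum likelihood for logistic regression does not (a minimal sketch with numpy on synthetic data):

# OLS: the coefficients minimizing the sum of squared residuals
import numpy as np

rng = np.random.RandomState(0)
X = np.column_stack([np.ones(50), rng.randn(50)])  # intercept column + one feature
y = 2 + 3 * X[:, 1] + 0.1 * rng.randn(50)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [2, 3]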
Q. When does regularization become necessary in Machine Learning?
Regularization becomes necessary when the model begins to overfit. The technique adds a penalty term to the objective function, so that bringing in more features (or larger coefficients) carries a cost. It therefore pushes the coefficients of many variables toward zero, which reduces model complexity so that the model becomes better at predicting (generalizing).
Q. What is Linear Regression?
Linear Regression is a supervised Machine Learning algorithm. It is used to find the linear
relationship between the dependent and the independent variables for predictive analysis.
Q. What is the Variance Inflation Factor?
Variance Inflation Factor (VIF) estimates the severity of multicollinearity among a collection of regression variables. For the i-th predictor, it is the ratio of the variance of its coefficient in the full model to the variance of its coefficient in a model containing that predictor alone; equivalently,
VIF_i = 1 / (1 - R_i^2),
where R_i^2 is obtained by regressing the i-th predictor on all the others. We calculate this ratio for every independent variable; a high VIF indicates high collinearity of that variable with the rest.
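statsmodels provides this computation directly (a minimal sketch; the synthetic data deliberately makes two columns collinear):

# VIF for each predictor
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
X[:, 2] = X[:, 0] + 0.1 * rng.randn(100)  # column 2 nearly duplicates column 0

for i in range(X.shape[1]):
    print(i, variance_inflation_factor(X, i))  # columns 0 and 2 show high VIF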
Q. We know that one hot encoding increases the dimensionality of a dataset,
but label encoding doesn’t. How?
When we use one-hot encoding, there is an increase in the dimensionality of the dataset because a separate binary variable is created for every class of the categorical variable. Label encoding, by contrast, replaces the classes with integers within the same single column, so the number of columns stays unchanged.
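A minimal sketch of both encodings with pandas (the color column is invented for illustration):

# One-hot encoding adds a column per class; label encoding keeps one column
import pandas as pd

colors = pd.Series(["red", "green", "blue", "red"])
print(pd.get_dummies(colors))               # three new binary columns, one per class
print(colors.astype("category").cat.codes)  # a single integer column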
Q. What is a Decision Tree?
A decision tree is a supervised learning model that can be read as a sequence of tests leading to a decision. It is a hierarchical diagram in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf holds the predicted output.
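For instance, scikit-learn's decision tree can be printed as exactly such a hierarchy of tests (a minimal sketch; the depth limit is arbitrary):

# Fitting a small decision tree and reading its sequence of tests
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # the learned hierarchy of feature tests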
Q. What is the Binarizing of data? How to Binarize?
In most Machine Learning interviews, apart from theoretical questions, interviewers focus on the implementation side, so this question tests the practical application of a theoretical concept.
Converting data into binary values on the basis of a threshold is known as binarizing the data. Values at or below the threshold are set to 0 and values above the threshold are set to 1. This process is useful when we have to perform feature engineering, and we can also use it for adding new features.
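One way to binarize is scikit-learn's Binarizer (a minimal sketch; the threshold is arbitrary):

# Binarizing features against a threshold
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.2, -0.5],
              [3.0, 0.4],
              [0.1, 2.2]])
print(Binarizer(threshold=0.5).fit_transform(X))  # 1 above the threshold, 0 otherwise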
Q. What is cross-validation?
Cross-validation is essentially a technique used to assess how well a model performs on new, independent data. The simplest version is the holdout method: you split your data into two groups, training data and testing data, build the model on the training data, and evaluate it on the testing data. k-fold cross-validation generalizes this by rotating which part of the data is held out.
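A minimal sketch with scikit-learn (the model and fold count are arbitrary choices):

# 5-fold cross-validation: five train/test splits instead of one
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # average accuracy over the five held-out folds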
Q. When would you use random forests vs SVM and why?
There are a couple of reasons why a random forest can be a better choice of model than a support vector machine:
● Random forests allow you to determine feature importance; SVMs can't do this directly.
● Random forests are much quicker and simpler to build than an SVM.
● For multi-class classification problems, SVMs require a one-vs-rest (or one-vs-one) scheme, which is less scalable and more memory-intensive.
Q. What are the drawbacks of a linear model?
There are a couple of drawbacks of a linear model:
● A linear model makes some strong assumptions that may not hold in the application: a linear relationship, multivariate normality, no or little multicollinearity, no autocorrelation, and homoscedasticity.
● A linear model can't be used for discrete or binary outcomes.
● You can't vary the model flexibility of a linear model.


