Data Science Interview Questions

By Pr Data · Published 2 years ago · 6 min read

Q. You are given a data set. The data set has missing values that spread along 1 standard deviation from the median. What percentage of the data would remain unaffected? Why?

This question has enough hints to get you started. Since the data is spread around the median, let's assume it is normally distributed. In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which coincides with the median and mode), leaving ~32% outside that range. Therefore, ~32% of the data would remain unaffected by the missing values.
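The ~68% figure can be checked numerically; a minimal sketch using only the standard library (the normal CDF within ±1 standard deviation equals erf(1/√2)):

```python
import math

# Fraction of a normal distribution within 1 standard deviation of the mean
within_one_sd = math.erf(1 / math.sqrt(2))   # ~0.6827
unaffected = 1 - within_one_sd               # ~0.3173

print(f"affected: {within_one_sd:.1%}, unaffected: {unaffected:.1%}")
```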

Q. What are PCA, KPCA, and ICA used for?

PCA (Principal Component Analysis), KPCA (Kernel Principal Component Analysis), and ICA (Independent Component Analysis) are important feature extraction techniques used for dimensionality reduction.
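As a quick illustration of the dimensionality-reduction idea, here is a minimal scikit-learn sketch of PCA on synthetic data (the data shapes and noise level are hypothetical choices for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 samples with 5 correlated features driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # 5 features -> 2 principal components

print(X_reduced.shape)                        # (200, 2)
print(pca.explained_variance_ratio_.sum())    # close to 1 for this toy data
```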

Q. What are support vector machines?

Support vector machines are supervised learning algorithms used for classification and regression analysis.
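A minimal classification sketch with scikit-learn's SVC on synthetic data (dataset parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification problem
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = SVC(kernel="rbf")        # RBF-kernel support vector classifier
clf.fit(X, y)

print(clf.score(X, y))         # training accuracy
```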

Q. What is batch statistical learning?

Statistical learning techniques allow learning a function or predictor from a set of observed data that can then make predictions about unseen or future data. These techniques provide guarantees on the performance of the learned predictor on future unseen data, based on statistical assumptions about the data-generating process.

Q. What is the bias-variance decomposition of classification error in the ensemble method?

The expected error of a learning algorithm can be decomposed into bias and variance. The bias term measures how closely the average classifier produced by the learning algorithm matches the target function. The variance term measures how much the learning algorithm's prediction fluctuates across different training sets.

Q. When is Ridge regression favorable over Lasso regression?

You can quote the ISLR authors (Hastie et al.), who advise: in the presence of a few variables with medium/large effects, use lasso regression; in the presence of many variables with small/medium effects, use ridge regression.

Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) does only parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Ridge regression also works best in situations where the least-squares estimates have high variance. Therefore, it depends on our model objective.
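The variable-selection difference shows up directly in the fitted coefficients. A minimal sketch on synthetic data (only 2 of 10 features truly matter; the regularization strengths are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only features 0 and 1 carry signal; the other 8 are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso drives irrelevant coefficients exactly to zero (variable selection);
# ridge only shrinks them, keeping every feature in the model
print("lasso zero coefficients:", (lasso.coef_ == 0).sum())
print("ridge zero coefficients:", (ridge.coef_ == 0).sum())
```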

Q. You've built a random forest model with 10,000 trees. You were delighted to get a training error of 0.00, but the validation error is 34.23. What is going on? Haven't you trained your model perfectly?

The model has overfitted. A training error of 0.00 means the classifier has mimicked the training data's patterns so closely that those patterns do not carry over to unseen data. Hence, when this classifier was run on an unseen sample, it could not find those patterns and returned predictions with a much higher error. In a random forest, this happens when we use a larger number of trees than necessary; to avoid it, we should tune the number of trees using cross-validation.
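A sketch of that tuning step: compare cross-validated scores across tree counts instead of trusting training error (synthetic data; the candidate tree counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Evaluate each candidate number of trees with 5-fold cross-validation
for n in (10, 100, 500):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"n_estimators={n}: mean CV accuracy {scores.mean():.3f}")
```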

Q. What is a convex hull?

In the case of linearly separable data, the convex hull represents the outer boundary of each of the two groups of data points. Once the convex hulls are created, we get the maximum margin hyperplane (MMH) as a perpendicular bisector between the two convex hulls. The MMH is the hyperplane that attempts to create the greatest separation between the two groups.

Q. What do you understand by Type I vs Type II error?

A Type I error is committed when the null hypothesis is true and we reject it; it is also known as a 'false positive'. A Type II error is committed when the null hypothesis is false and we fail to reject it; it is also known as a 'false negative'.

In the context of the confusion matrix, a Type I error occurs when we classify a value as positive (1) when it is actually negative (0), and a Type II error occurs when we classify a value as negative (0) when it is actually positive (1).
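This mapping can be read directly off scikit-learn's confusion matrix (the labels below are toy values for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

# confusion_matrix rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Type I errors (false positives):", fp)   # actual 0, predicted 1
print("Type II errors (false negatives):", fn)  # actual 1, predicted 0
```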

Q. In k-means or kNN, we use Euclidean distance to calculate the distance between nearest neighbors. Why not Manhattan distance?

We don't use Manhattan distance because it measures distance along horizontal and vertical directions only, which restricts it. Euclidean distance, on the other hand, can be used in any space to calculate distance. Since the data points can be present in any dimension, Euclidean distance is the more viable option.

Example: think of a chessboard. The movement of a rook is naturally measured by Manhattan distance, because it moves only vertically and horizontally, whereas a bishop's diagonal path is closer to a straight-line (Euclidean) distance.
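A quick comparison of the two metrics (the points are arbitrary):

```python
import math

p, q = (1, 2), (4, 6)

# Euclidean: straight-line distance, valid in any number of dimensions
euclidean = math.dist(p, q)                        # sqrt(3^2 + 4^2) = 5.0

# Manhattan: sum of horizontal and vertical ("rook-like") moves
manhattan = sum(abs(a - b) for a, b in zip(p, q))  # 3 + 4 = 7

print(euclidean, manhattan)
```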

Q. Do you suggest that treating a categorical variable as a continuous variable would result in a better predictive model?

For better predictions, a categorical variable can be treated as a continuous variable only when the variable is ordinal in nature.

Q. OLS is to linear regression what maximum likelihood is to logistic regression. Explain the statement.

OLS and maximum likelihood are the methods that the respective regression techniques use to estimate the unknown parameter (coefficient) values. In simple words: ordinary least squares (OLS) is a method used in linear regression that chooses the parameters minimizing the distance between actual and predicted values, while maximum likelihood chooses the parameter values that maximize the likelihood of producing the observed data.
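A minimal sketch contrasting the two estimators on synthetic data (the true coefficient of 2.0 and noise levels are hypothetical choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))

# Linear regression: coefficients fit by OLS (minimise squared residuals)
y_cont = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
ols = LinearRegression().fit(X, y_cont)
print("OLS coefficient:", ols.coef_[0])        # close to the true 2.0

# Logistic regression: coefficients fit by maximum likelihood
y_bin = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
mle = LogisticRegression().fit(X, y_bin)
print("MLE coefficient:", mle.coef_[0, 0])     # positive, matching the signal
```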

Q. When does regularization become necessary in Machine Learning?

Regularization becomes necessary when the model begins to overfit. The technique adds a penalty (cost) term to the objective function that discourages large coefficients, pushing the coefficients of many variables toward zero. This reduces model complexity so that the model becomes better at predicting (generalizing).

Q. What is Linear Regression?

Linear regression is a supervised machine learning algorithm. It is used to find the linear relationship between the dependent and the independent variables for predictive analysis.

Q. What is the Variance Inflation Factor?

Variance Inflation Factor (VIF) is an estimate of the amount of multicollinearity in a collection of regression variables.

For each independent variable: VIF = 1 / (1 − R²), where R² is obtained by regressing that variable on all the other independent variables.

We calculate this ratio for every independent variable. A high VIF indicates that the independent variable is highly collinear with the others.
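The definition above can be computed directly; a minimal sketch on deliberately collinear synthetic data (the helper name `vif` and the data are invented for this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # strongly collinear with x1
x3 = rng.normal(size=200)              # independent of the others
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing x_j on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # x1, x2 high; x3 near 1
```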

Q. We know that one-hot encoding increases the dimensionality of a dataset, but label encoding doesn't. How?

When we use one-hot encoding, the dimensionality of the dataset increases because a separate binary variable is created for every class of the categorical variable. Label encoding, by contrast, replaces the classes with integer codes in a single column, so the number of columns stays the same.
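A small pandas illustration (the column name and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "SF", "LA", "NY", "SF"]})

# One-hot encoding: one new column per class -> dimensionality grows
one_hot = pd.get_dummies(df["city"])
print(one_hot.shape)   # (5, 3): three classes become three columns

# Label encoding: a single column of integer codes -> dimensionality unchanged
codes = df["city"].astype("category").cat.codes
print(codes.shape)     # (5,): still one column
```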

Q. What is a Decision Tree?

A decision tree is a hierarchical diagram used to explain the sequence of decisions that must be made to reach the desired output.

Q. What is the Binarizing of data? How to Binarize?

In most machine learning interviews, interviewers focus on the implementation side as well as theory, so this question tests a practical concept.

Converting data into binary values on the basis of a threshold is known as binarizing the data: values below the threshold are set to 0 and values above it are set to 1. This process is useful when we perform feature engineering, and we can also use it to add new binary features.
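A minimal NumPy sketch of the thresholding step (the threshold and values are arbitrary):

```python
import numpy as np

X = np.array([[0.2, 1.4, -0.5],
              [2.1, 0.0,  0.9]])

threshold = 0.5
# Values above the threshold become 1, all others 0
X_bin = (X > threshold).astype(int)

print(X_bin)
```

scikit-learn also provides `sklearn.preprocessing.Binarizer(threshold=...)` for the same operation inside a pipeline.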

Q. What is cross-validation?

Cross-validation is essentially a technique used to assess how well a model performs on a new, independent dataset. The simplest example of cross-validation is splitting your data into two groups: training data, which you use to build the model, and testing data, which you use to test the model.
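The simple split described above, sketched with scikit-learn (the dataset and split ratio are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on held-out data
```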

Q. When would you use random forests vs SVM and why?

There are a couple of reasons why a random forest can be a better choice of model than a support vector machine:

● Random forests let you determine feature importance; SVMs can't do this.

● Random forests are much quicker and simpler to build than an SVM.

● For multi-class classification problems, SVMs require a one-vs-rest approach, which is less scalable and more memory-intensive.

Q. What are the drawbacks of a linear model?

There are a couple of drawbacks of a linear model:

● A linear model makes strong assumptions that may not hold in the application: a linear relationship, multivariate normality, no or little multicollinearity, no autocorrelation, and homoscedasticity.

● A linear model can't be used for discrete or binary outcomes.

● You can't vary the model flexibility of a linear model.
