Top 29 Interview Questions For Machine Learning Engineer

Interview questions for machine learning jobs cover the various forms of machine learning (ML). These roles fall under the broader tech umbrella, but the positions vary from data scientist to data analyst to engineer. Whatever your level of experience in the field, there is some information you should always keep handy for an interview or a casual conversation with a fellow expert. We’ve grouped the most frequently asked questions into two parts: Theoretical and Scenario-Based.

Theoretical

1. Name the different types of Machine Learning.


|  | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Definition | The machine learns by using labeled data | The machine is trained on unlabeled data without any guidance | An agent learns by interacting with its environment |
| Type of problems | Regression & classification | Association & clustering | Reward-based |
| Type of data | Labeled data | Unlabeled data | No pre-defined data |
| Training | External supervision | No supervision | No supervision |
| Approach | Map labeled input to known output | Understand patterns and discover output | Follow a trial-and-error method |
| Popular algorithms | Linear regression, logistic regression, support vector machines, KNN, etc. | K-means, C-means, etc. | Q-learning, SARSA, etc. |

2. What are the different languages used for machine learning?

The most popular language for machine learning is Python. Other languages used for machine learning include:

  • C++
  • JavaScript
  • Java
  • C#
  • Julia
  • Shell
  • R
  • TypeScript
  • Scala

3. What is the meaning of Variance Error in ML algorithms?

Variance error appears in machine learning models that are overly complex. Such a model is highly sensitive to small fluctuations in the training data, so it ends up overfitting: it captures noise from the training set that does not carry over to the test data.

4. What’s the trade-off between bias and variance?

Bias is an error due to erroneous or overly simplistic assumptions in the learning algorithm you’re using. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and for you to generalize your knowledge from the training set to the test set.

Variance is an error due to too much complexity in the learning algorithm you’re using. This leads to the algorithm being highly sensitive to high degrees of variation in your training data, which can lead your model to overfit the data. You’ll be carrying too much noise from your training data for your model to be very useful for your test data.

The bias-variance decomposition essentially decomposes the learning error from any algorithm into the sum of the bias, the variance, and a bit of irreducible error due to noise in the underlying dataset. Essentially, if you make the model more complex and add more variables, you’ll lose bias but gain some variance; to get the optimally reduced amount of error, you’ll have to trade off bias and variance. You don’t want either high bias or high variance in your model.
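To see the trade-off concretely, here is a minimal sketch (the noisy sine-wave dataset and the chosen polynomial degrees are illustrative assumptions): the degree-1 fit underfits (high bias), while the degree-15 fit overfits (high variance), which shows up as a low training error but a high test error.

```python
# A minimal sketch of the bias-variance trade-off: as polynomial degree grows,
# training error keeps falling while test error eventually rises (overfitting).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # noisy sine wave (assumed data)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):                                  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),  # training error falls with degree
          mean_squared_error(y_test, model.predict(X_test)))    # test error rises past a point
```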

5. How is KNN different from k-means clustering?

K-Nearest Neighbors is a supervised classification algorithm, while k-means clustering is an unsupervised clustering algorithm. While the mechanisms may seem similar at first, what this means is that for K-Nearest Neighbors to work, you need labelled data against which to classify an unlabelled point (thus the “nearest neighbour” part).

K-means clustering requires only a set of unlabeled points and the number of clusters k: the algorithm takes the unlabeled points and gradually learns to group them by repeatedly assigning each point to its nearest cluster centre and recomputing each centre as the mean of its assigned points.

The main difference here is that KNN requires labelled points and is thus supervised learning, while k-means doesn’t and is, therefore, unsupervised learning.
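As a quick illustration of the difference, here is a minimal sketch using scikit-learn; the four toy points and their labels are assumptions chosen only to show that KNN needs labels while k-means does not.

```python
# KNeighborsClassifier needs labels; KMeans only needs the points and k.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
y = np.array([0, 0, 1, 1])                 # labels required for KNN (supervised)

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2]]))               # classifies a new point using labelled neighbours

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # no labels needed (unsupervised)
print(km.labels_)                          # cluster assignments discovered from the data alone
```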

6. Explain how a ROC curve works.

A receiver operating characteristic (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance. The ROC curve plots the true positive rate against the false positive rate at various classification thresholds. ROC graphs have long been used in signal detection theory to depict the trade-off between the hit rate and the false alarm rate of a classifier.
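Below is a minimal sketch of how an ROC curve is typically produced with scikit-learn and matplotlib; the synthetic dataset and the logistic regression classifier are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FPR vs TPR at every threshold
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, scores):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")          # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```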

7. What is the importance of Bayes’ theorem in ML algorithms?

Bayes’ theorem gives the posterior probability of an event based on prior knowledge. It can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of evidence. It has been used in a wide variety of contexts, ranging from marine biology to the development of “Bayesian” spam filters for email systems. The formula for Bayes’ theorem is,

P (A|B) = [P(B|A) P(A)] / [P(B)]

where A and B are events and P(B) ≠ 0

P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.

P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.

P(A) and P(B) are the probabilities of observing A and B independently of each other; this is known as the marginal probability.
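As a quick worked example (the prevalence, sensitivity and false positive rate below are made-up numbers), Bayes’ theorem can be applied to a medical test:

```python
# How likely is a disease given a positive test? All numbers are illustrative assumptions.
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.99  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# P(B): total probability of a positive test (marginal probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # ~0.167, far lower than the 99% sensitivity suggests
```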

8. What are precision and recall?

Recall is the fraction of actual positives that the model correctly identifies, TP / (TP + FN), also known as the true positive rate. Precision is the fraction of the model’s positive predictions that are actually positive, TP / (TP + FP). Both can be read as conditional probabilities: recall is the probability of a positive prediction given a truly positive case, and precision is the probability of a truly positive case given a positive prediction.
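A small sketch of how the two metrics are computed in practice, using made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.67
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.5
```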

9. What is bagging and boosting in Machine Learning?

Bagging and Boosting are two types of ensemble learning. Both combine several estimates from different models, which decreases the variance of a single estimate, so the result is usually a model with higher stability.

Bagging is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. 

Boosting, on the other hand, is an ensemble meta-algorithm primarily for reducing bias (and also variance) in supervised learning; it is a family of machine learning algorithms that convert weak learners into strong ones.

The similarities between the two forms of ensemble learning are that both:

  1. are ensemble methods to get N learners from 1 learner.
  2. generate several training data sets by random sampling.
  3. make the final decision by averaging the N learners (or taking the majority of them, i.e., majority voting).
  4. are good at reducing variance and provide higher stability.

The differences between Bagging and Boosting:

| Bagging | Boosting |
| --- | --- |
| The simplest way of combining predictions that belong to the same type. | A way of combining predictions that belong to different types. |
| Aims to decrease variance, not bias. | Aims to decrease bias, not variance. |
| Each model receives equal weight. | Models are weighted according to their performance. |
| Each model is built independently. | New models are influenced by the performance of previously built models. |
| Different training data subsets are randomly drawn with replacement from the entire training dataset. | Every new subset contains the elements that were misclassified by previous models. |
| Bagging tries to solve the overfitting problem. | Boosting tries to reduce bias. |
| If the classifier is unstable (high variance), apply bagging. | If the classifier is stable and simple (high bias), apply boosting. |
| Example: Random forest. | Example: Gradient boosting. |
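For a concrete comparison, here is a minimal sketch that trains one popular representative of each family on a synthetic dataset; the data, the choice of Random Forest and Gradient Boosting, and the hyperparameters are all illustrative assumptions rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = RandomForestClassifier(n_estimators=200, random_state=0)       # parallel, equal-weight trees
boosting = GradientBoostingClassifier(n_estimators=200, random_state=0)  # sequential, error-driven trees

print("Bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```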

10. Can you explain the difference between L1 and L2 regularization?

This is a common machine learning interview question. L2 regularization tends to spread error across all terms: it shrinks every coefficient a little but keeps them non-zero. L1 regularization, on the other hand, produces sparse solutions: many weights are driven exactly to zero, so it effectively performs feature selection. From a Bayesian point of view, L1 regularization corresponds to placing a Laplace prior on the coefficients, whereas L2 corresponds to placing a Gaussian prior on them.
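A minimal sketch of the practical consequence, assuming a synthetic regression dataset: the L1-penalized model (Lasso) zeroes out many coefficients, while the L2-penalized model (Ridge) shrinks them all but keeps them non-zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("Zero coefficients with L1:", np.sum(lasso.coef_ == 0))   # sparse solution
print("Zero coefficients with L2:", np.sum(ridge.coef_ == 0))   # typically 0
```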

11. What is Naive Bayes?

Naive Bayes is ideal for practical applications such as text mining. However, it relies on an assumption that rarely holds in real data: it computes the conditional probability of a class as the pure product of the individual probabilities of the different features, which implies complete independence between the features, something that is practically impossible or at least very rare.
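Here is a minimal text-classification sketch with a multinomial Naive Bayes model; the tiny spam/ham corpus is an assumption used only to show the typical workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "limited offer, click here",
         "meeting agenda attached", "see you at lunch tomorrow"]
labels = [1, 1, 0, 0]                       # 1 = spam, 0 = ham (toy labels)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)                    # word counts treated as independent features
print(model.predict(["free prize offer"]))  # likely [1]
```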

12. Is it possible to manage an imbalanced dataset? How?

This is one of the more challenging machine learning interview questions in data scientist interviews. An imbalanced dataset arises in classification tasks when, say, 90% of the data falls into a single class. This causes problems: a model with no predictive power over the other classes can still reach roughly 90% accuracy, so the accuracy figure is misleading. However, it is possible to manage an imbalanced dataset.

You can try collecting more data to compensate for the imbalance in the dataset. You can also re-sample the dataset to correct the imbalance, or try a completely different algorithm on the dataset. The critical factor is understanding the negative impact of an imbalanced dataset and the approaches for balancing the irregularities.
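As one concrete option, here is a minimal oversampling sketch; the 90/10 toy DataFrame is an assumption, and dedicated libraries such as imbalanced-learn provide SMOTE and other techniques.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(100),
                   "label": [0] * 90 + [1] * 10})   # 90/10 imbalance (assumed data)

majority = df[df.label == 0]
minority = df[df.label == 1]

minority_up = resample(minority, replace=True,      # sample with replacement
                       n_samples=len(majority),     # match the majority class size
                       random_state=0)
balanced = pd.concat([majority, minority_up])
print(balanced.label.value_counts())                # 90 vs 90
```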

13. How is Type I error different from Type II error?

A Type I error is a false positive, and a Type II error is a false negative. Claiming that something has happened when it hasn’t is a Type I error.

A Type II error is the opposite: claiming that something is not happening when it actually is. Consider this example: a shepherd who thinks a wolf is present when no wolf is actually there commits a Type I error; a shepherd who thinks a wolf is NOT present when a wolf actually is there commits a Type II error.

14. What are collinearity and multicollinearity?

Collinearity occurs when two predictor variables (e.g., x1 and x2) in a multiple regression have some correlation.

Multicollinearity occurs when more than two predictor variables (e.g., x1, x2, and x3) are inter-correlated.

15. What is A/B Testing?

A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. It is used to compare two models that use different predictor variables in order to check which one fits a given sample of data best.

Consider a scenario where you’ve created two models (using different predictor variables) that can be used to recommend products for an e-commerce platform.

A/B Testing can be used to compare these two models to check which one best recommends products to a customer.
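A minimal sketch of how such a comparison might be evaluated statistically, assuming we have recorded purchase counts for customers served by each model; the counts and the choice of a chi-square test are illustrative assumptions.

```python
from scipy.stats import chi2_contingency

#                purchases, no purchase (made-up counts)
conversions = [[120, 880],   # model A: 12.0% conversion
               [150, 850]]   # model B: 15.0% conversion

chi2, p_value, dof, expected = chi2_contingency(conversions)
print(p_value)   # a small p-value (e.g. < 0.05) suggests the difference is not due to chance
```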

16. Which library would you prefer for plotting in Python language: Seaborn or Matplotlib or Bokeh?

It depends on the visualization you’re trying to achieve. Each of these libraries is used for a specific purpose:

  1. Matplotlib: Used for basic plotting like bars, pies, lines, scatter plots, etc
  2. Seaborn: Is built on top of Matplotlib and Pandas to ease data plotting. It is used for statistical visualizations like creating heatmaps or showing the distribution of your data
  3. Bokeh: Used for interactive visualization. In case your data is too complicated, and you haven’t found any “message” in the data, then use Bokeh to create interactive visualizations that will allow your viewers to explore the data themselves

17. How are NumPy and SciPy related?

NumPy is part of the SciPy ecosystem: SciPy is built on top of NumPy.

NumPy defines arrays along with some essential numerical functions like indexing, sorting, reshaping, etc.

SciPy implements computations such as numerical integration, optimization and machine learning using NumPy’s functionality.
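A small sketch of that relationship in practice: NumPy supplies the arrays and basic numerics, while SciPy provides higher-level routines built on them.

```python
import numpy as np
from scipy import integrate, optimize

x = np.linspace(0, np.pi, 5)                   # NumPy: array creation and vectorized math
print(np.sort(x), x.reshape(5, 1).shape)       # indexing, sorting, reshaping live in NumPy

area, err = integrate.quad(np.sin, 0, np.pi)   # SciPy: numerical integration of a NumPy ufunc
minimum = optimize.minimize_scalar(np.cos, bounds=(0, 2 * np.pi), method="bounded")
print(area, minimum.x)                         # ~2.0 and ~pi
```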

18. You are given a data set in which some variables have more than 30% missing values. Let’s say, out of 50 variables, 8 variables have missing values higher than 30%. How will you deal with them?

  • Assign a unique category to the missing values; the missing values themselves might uncover a trend.
  • Remove the affected variables outright.
  • Or, sensibly check their distribution against the target variable: if a pattern is found, keep those missing values and assign them a new category, while removing the rest.

19. How do you handle missing or corrupted data in a dataset?

You could find missing/corrupted data in a dataset and either drop those rows or columns or decide to replace them with another value.

In Pandas, there are two beneficial methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.
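A minimal pandas sketch of those three methods, using a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "salary": [50000, 60000, np.nan]})

print(df.isnull().sum())        # count missing values per column
print(df.dropna())              # drop rows containing any missing value
print(df.fillna(0))             # or replace missing values with a placeholder such as 0
```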

20. How is a decision tree pruned?

Pruning is what happens in decision trees when branches that have weak predictive power are removed to reduce the complexity of the model and increase the predictive accuracy of a decision tree model. Pruning can happen bottom-up and top-down, with approaches such as reduced error pruning and cost complexity pruning.

Reduced error pruning is perhaps the simplest version: starting from the leaves, replace each node with its most popular class; if the change does not decrease predictive accuracy on a validation set, keep it. While simple, this heuristic actually comes pretty close to an approach that would optimize for maximum accuracy.
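As an illustration, scikit-learn exposes cost complexity pruning through the ccp_alpha parameter; the dataset and the alpha value below are assumptions chosen only to show the effect.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)  # larger alpha prunes more

print(unpruned.tree_.node_count, unpruned.score(X_test, y_test))  # many nodes, may overfit
print(pruned.tree_.node_count, pruned.score(X_test, y_test))      # fewer nodes, often generalizes better
```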

21. What is an F1 Score and how is it used?

The F1 score is a measure of a model’s performance. It is the harmonic mean of the model’s precision and recall, F1 = 2 x (precision x recall) / (precision + recall), with results tending towards 1 being the best and those tending towards 0 being the worst. You would use it in classification tests where true negatives don’t matter much.

22. What is Cluster Sampling?

It is a process of randomly selecting intact groups within a defined population, sharing similar characteristics. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.

For example, if you’re clustering the total number of managers in a set of companies, in that case, managers (samples) will represent elements and companies will represent clusters.

23. Which is more important: model accuracy or model performance?

A model with higher accuracy may not perform well in terms of predictive power. Model accuracy is only one aspect of model performance, and it can be misleading at times. If you have to detect fraud in a dataset with millions of samples, a model that never predicts fraud can still be highly accurate.

This happens whenever the cases of interest, such as fraud, form only a small minority of the data. For a predictive model, that behaviour is useless: a fraud-detection model that shows no sign of fraud anywhere has failed at its job, however accurate it looks. Therefore, model accuracy is not the sole determinant of model performance; both elements are highly significant in machine learning.

24. In which situation is classification better than regression?

Classification produces discrete values and sorts the dataset into specific categories, whereas regression produces continuous outputs. Classification is better than regression when you need the results to reflect the membership of data points in specific categories, for example when you just want to determine whether a name is female or male. Regression is the better choice when the quantity you want is continuous, for example the probability that a given name belongs to a male or a female person.

Scenario-based

25. You are given a data set on cancer detection. You’ve built a classification model and achieved an accuracy of 96%. Why shouldn’t you be happy with your model performance? What can you do about it?

Cancer detection data is typically imbalanced. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might correspond to predicting only the majority class correctly, while our class of interest is the minority class (4%): the people who were actually diagnosed with cancer. Hence, to evaluate model performance, we should use sensitivity (true positive rate), specificity (true negative rate) and the F measure to determine the class-wise performance of the classifier. If the minority class performance is found to be poor, we can undertake the following steps (a short sketch of steps 2 and 3 follows the list):

  1. We can use undersampling, oversampling or SMOTE to make the data balanced.
  2. We can alter the prediction threshold value by doing probability calibration and finding an optimal threshold using the AUC-ROC curve.
  3. We can assign a weight to classes such that the minority classes get more considerable weight.
  4. We can also use anomaly detection.
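Here is a minimal sketch of steps 2 and 3, assuming a synthetic 96/4 dataset and a logistic regression classifier standing in for the real cancer-detection model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, recall_score

X, y = make_classification(n_samples=5000, weights=[0.96], random_state=0)  # assumed 96/4 imbalance
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")  # step 3: larger weight on minority class
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
best = thresholds[np.argmax(tpr - fpr)]                 # step 2: threshold from the ROC curve
y_pred = (probs >= best).astype(int)
print(recall_score(y_test, y_pred))                     # sensitivity for the minority (cancer) class
```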

26. You find that your model suffers from low bias and high variance. Which algorithm should you use to tackle it? Why?

Low bias occurs when the model’s predicted values are close to the actual values; in other words, the model is flexible enough to mimic the training data distribution. While that sounds like a great achievement, such a flexible model has poor generalization capability: when it is tested on unseen data, it gives disappointing results.

In such situations, we can use a bagging algorithm (like random forest) to tackle the high variance problem. Bagging algorithms divide the data set into subsets created with repeated randomized sampling. These samples are then used to generate a set of models with a single learning algorithm, and the model predictions are finally combined by voting (classification) or averaging (regression).

Other ways to combat high variance:

  • Use a regularization technique, where larger model coefficients are penalized, thereby lowering model complexity.
  • Use the top n features from the variable importance chart. With all the variables in the data set, the algorithm may have difficulty finding the meaningful signal.

27. When is Ridge regression favourable over Lasso regression?

To quote the ISLR authors Hastie and Tibshirani: in the presence of a few variables with medium/large effect sizes, use lasso regression; in the presence of many variables with small/medium effect sizes, use ridge regression.

Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Ridge regression also works best in situations where the least squares estimates have high variance. Therefore, it depends on our model objective.

28. How would you predict who will renew their subscription next month? What data would you need to solve this? What analysis would you do? Would you build predictive models? If so, which algorithms?

  • Let’s assume that we’re trying to predict renewal rate for Netflix subscription. So our problem statement is to predict which users will renew their subscription plan for the next month.
  • Next, we must understand the data that is needed to solve this problem. In this case, we need to check the number of hours the channel is active for each household, the number of adults in the household, number of kids, which channels are streamed the most, how much time is spent on each channel, how much has the watch rate varied from last month, etc. Such data is needed to predict whether or not a person will continue the subscription for the upcoming month.
  • After collecting this data, it is crucial that you find patterns and correlations. For example, we know that if a household has kids, then they are more likely to subscribe. Similarly, by studying the watch rate of the previous month, you can predict whether a person is still interested in a subscription. Such trends must be studied.
  • The next step is analysis. For this kind of problem statement, you must use a classification algorithm that classifies customers into 2 groups:
    • Customers who are likely to subscribe next month
    • Customers who are not likely to subscribe next month
  • Would you build predictive models? Yes, to achieve this, you must build a predictive model that classifies the customers into two classes like those mentioned above.
  • You can choose classification algorithms such as Logistic Regression, Random Forest, Support Vector Machine, etc.
  • Once you choose the right algorithm, you must perform a model evaluation to calculate the efficiency of the algorithm. This is followed by deployment.

29. What cross-validation technique would you use on a time series dataset?

More reading: Using k-fold cross-validation for time-series model selection (CrossValidated)

Instead of using standard k-fold cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data; it is inherently ordered chronologically. If a pattern emerges in later periods, for example, your model may still pick up on it even if that effect doesn’t hold in earlier years.

You’ll want to do something like forward chaining where you’ll be able to model on past data then look at forward-facing data.

fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
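This is exactly the scheme that scikit-learn’s TimeSeriesSplit produces; the six-point toy series below is an assumption chosen to reproduce the folds listed above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # chronologically ordered observations 1..6

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # each fold trains only on observations that precede the test period
    print(f"fold {fold}: training {train_idx + 1}, test {test_idx + 1}")
```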

We hope our list met your expectations and that we were able to prepare you well for your interview, or at least made a few things clearer. With the rise in machine learning jobs and the demand for experts, having clear knowledge of these fundamentals is of the utmost importance for professionals.