机器学习与数据挖掘笔试面试题

本文来自：http://analyticscosm.com/machine-learning-interview-questions-for-data-scientist-interview/ 希望大家学习有个条理，但是就算知道了这些，这一行最重要的还是实践！自己敲代码，跑数据，理解业务关系！

The Machine Learning part of the interview is usually the most elaborate one. That's the reason we have dedicated a complete post to the interview questions from ML. We've also provided, wherever possible, the link to Suggested Reading material that will be helpful in answering these questions.

We update these links from time to time and if you have any solution you can suggest please feel free to post it. You should explore these questions thoroughly, especially the ones that may relate to your previous experience and projects.

General ML Questions

Here is a nice post covering various aspects of machine learning that'll be a good starting point.

How will you differentiate a machine learning algorithm from other algorithms?
Suggested reading
What's the difference between data mining and machine learning?
What are the advantages of machine learning?
Often much more accurate than human-crafted rules (since data driven) • humans often incapable of expressing what they know (e.g., rules of English, or how to recognize letters), but can easily classify examples • don't need a human expert or programmer • cheap and flexible — can apply to any learning task
Describe some popular machine learning methods.
Suggested Reading
How will you differentiate between supervised and unsupervised learning? Give few examples of algorithms for supervised learning?
Suggested reading
What is your favorite ML algorithm? How will you explain it to a layman? Why is it your favourite?

Regression

 Is regression some type of supervised learning? Why?

 Explain the tradeoff between bias and variance in a regression problem.

 A learning algorithm with low bias and high variance may be suitable under what circumstances?

 What is regression analysis?

 What do coefficient estimates mean?

 How do you measure fit of the model? What do R and D mean?

 What are some possible problems with regression models? How do you avoid or compensate for them?

 Name a few types of regression you are familiar with? What are the differences?

 What are the downfalls of using too many or too few variables for performing regression?

Linear Regression

Suggested reading on difference between linear and non-linear regression

 What is linear regression? Why is it called linear?

 What are the constraints you need to keep in mind when using a linear regression?

 How does the variance of the error term change with the number of predictors, in OLS?

 In linear regression, under what condition R^2 always equals a perfect 1?

 Do you consider the models Y~X1+X2+X1X2 and Y~X1+X2+X1X2 to be linear? Why?

Suggested reading

 Do we always need the intercept term? When do we need it and when do we not?

Suggested reading

 What is collinearity and what to do with it?

Suggested reading

 How to remove multicollinearity?

Suggested reading

 What is overfitting a regression model? What are ways to avoid it?

Suggested reading

 What is Ridge Regression? How is it different from OLS Regression? Why do we need it?

 What is Lasso regression? How is it different from OLS and Ridge?

 What are the assumptions that standard linear regression models with standard estimation techniques make?

 How can some of these assumptions be relaxed?

 You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. How will you explain it?

 Your model considers the feature X significant, and Z is not, but you expected the opposite result. How will you explain it?

 How to check if the regression model fits the data well?

 When to use k-Nearest Neighbors for regression?

 Could you explain some of the extension of linear models like Splines or LOESS/LOWESS?

Classification

Basic Questions

 State some real life problems where classification algorithms can be used?

Text categorization (e.g., spam filtering) • fraud detection • optical character recognition • machine vision (e.g., face detection) • natural-language processing (e.g., spoken language understanding) • market segmentation (e.g.: predict if customer will respond to promotion) • bioinformatics (e.g., classify proteins according to their function) etc.

 What is the simplest classification algorithm?

Many consider Logistic Regression as a simple approach to begin with in order to to set a baseline and only make it more complicated if need be.

 What is your favourite ML algorithm? Why is it your favourite? How will you describe it to a non-technical person.

Decision Trees

To answer questions on decision trees here are some useful links:
Youtube video tutorial
This article covers decision tree in depth
Other suggested reading

 What is a decision tree?

 What are some business reasons you might want to use a decision tree model?

 How do you build a decision tree model?

 What impurity measures do you know?

 Describe some of the different splitting rules used by different decision tree algorithms.

 Is a big brushy tree always good?

 How will you compare a decision tree to a logistic regression? Which is more suitable under different circumstances?

 What is pruning and why is it important?

Ensemble models:
To answer questions on ensemble models here is a useful link:

 Why do we combine multiple trees?

 What is Random Forest? Why would you prefer it to SVM?

Logistic regression:
Link to understand basics of Logistic regression
Here's a nice tutorial from Khan Academy

 What is logistic regression?

 How do we train a logistic regression model?

 How do we interpret its coefficients?

Support Vector Machines
A tutorial on SVM can be found here and here

 What is the maximal margin classifier? How this margin can be achieved and why is it beneficial?

 How do we train SVM? What about hard SVM and soft SVM?

 What is a kernel? Explain the Kernel trick

 Which kernels do you know? How to choose a kernel?

Neural Networks
Here's a link to Neural Network course from Hinton on Coursera

 What is an Artificial Neural Network?

 How to train an ANN? What is back propagation?

 How does a neural network with three layers (one input layer, one inner layer and one output layer) compare to a logistic regression?

 What is deep learning? What is CNN (Convolution Neural Network) or RNN (Recurrent Neural Network)?

Other models:

 What other models do you know?

 How can we use Naive Bayes classifier for categorical features? What if some features are numerical?

 Tradeoffs between different types of classification models. How to choose the best one?

 Compare logistic regression with decision trees and neural networks.

Regularization

 What is Regularization?

 Which problem does Regularization try to solve?

Ans. used to address the overfitting problem, it penalizes your loss function by adding a multiple of an L1 (LASSO) or an L2 (Ridge) norm of your weights vector w (it is the vector of the learned parameters in your linear regression).

 What does it mean (practically) for a design matrix to be "ill-conditioned"?

 When might you want to use ridge regression instead of traditional linear regression?

 What is the difference between the L1 and L2 regularization?

 Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?

Dimensionality Reduction

Suggested Reading

 Explain the Newton's method?

Suggested Reading

 What kind of problems are well suited for Newton's method? BFGS? SGD?

 What are "slack variables"?

 Describe a constrained optimization problem and how you would tackle it.

Recommendation

some good examples of recommender models can be found here

 What is a recommendation engine? How does it work?

 How to do customer recommendation?

 What is Collaborative Filtering?

 How would you generate related searches for a search engine?

 How would you suggest followers on Twitter?

 Do you know about the Netflix Prize problem? How would you approach it?

Here is a nice post on the Netflix challenge

Feature Engineering

Here is a good article on feature engineering

 What is Feature Engineering?

How predictors are encoded in a model can have a signi?cant impact on model performance and we achieve such encoding through feature engineering. Sometimes using combinations of predictors can be more e?ective than using the individual values: the product of two predictors may be more e?ective than using two independent predictors. Often the most e?ective encoding of the data is captured by the modeler's understanding of the problem and thus is not derived from any mathematical technique.
These features can be extracted in two ways: 1. By a human expert (known as hand-crafted) or 2. By using automated feature extraction methods such as PCA, or Deep Learning tools such as DBN. Both 1 and 2 can be used on top of each other.

 Give an example where feature example can be very useful in predicting results from data and explain with reason why it is so effective in some cases?

 What are some good ways for performing feature selection that do not involve exhaustive search?

 How to convert categorical variables to numerical for extracting features?

Feature Selection

Here is a nice post on feature selection,
also known as variable selection, attribute selection or variable subset selection

 Explain feature selection and its importance with examples.

 What is variance threshold approach?

 How Univariate feature selection works?

 Is there any negative impact of using too many or too few variables?

 Is there any thumb rule for the number of features that should be used? How do you select the best features?

 What will be your approach to recursive feature elimination?

 Describe some feature selection methods.

 Does the model affect the choice of feature selection method?

Natural Language Processing (NLP)

For basic introduction visit the wiki page.
Here is the link to coursera course for NLP
Pick the software from the The Stanford NLP (Natural Language Processing) Group and input some text to view its parse tree, named entities, part of speech tags, etc.
If the company deals with text data, you can expect some questions on NLP and Information Retrieval:

 Explain NLP to a non-technical person.

 What's the use of NLP in Machine Learning?

Some interesting usages are in areas like sentiment analysis, spam detecting, POS, Text summarization, Language translation etc.

 How unstructured text data can be converted into structured data for the purpose of ML models?

 Explain Vector Space Model and its use?

 Explain the distances and similarity measures that can be used to compare documents?

 Explain cosine similarity in a simple way?

Suggested Reading

 Why and when stop words are removed? In which situation we do not remove them?

Image processing and Text mining

 What tool would you prefer for image processing?

 What parameters would you consider while selecting a tool for image processing?

Ease of use, speed and resources needed are some of the common parameters

 How to apply Machine Learning to images?

 What are the text mining tools you are familiar with?

Some example are:
Commercial: Autonomy, Lexalytics , SAS/SPSS, SQLServer 2008+
OpenSource: RapidMiner , NClassifier, OpenTextSumarizer, WordNet, OpenNLP/SharpNLP, Lucene/Lucene.NET, LingPipe, Weka

 What techniques do you apply for processing texts? Explain with an example.

 How to apply Machine Learning to audio data?

Meta Learning

Wiki link on meta learning

 How will you differentiate between boosting and inductive transfer?

Model selection

 What criteria would you use while selecting the best model from many different models?

 You have one model and want to find the best set of parameters for this model. How would you do that?

 How would you use model tuning for arriving at the best parameters?

Suggested Reading

 Explain grid search and how you would use it?

 What is Cross-Validation?