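Machine Learning is the decisive part of a Data Scientist interview, because it is the Data Scientist's bread-and-butter skill. Every interview round will include Machine Learning questions of one kind or another.
A Machine Learning interview covers four dimensions:
- Breadth of Machine Learning
- Depth of Machine Learning
- Experience with Machine Learning
- Application of Machine Learning
First: Breadth of Machine Learning (a main interview focus)
Overview: in a few sentences, describe what problem the algorithm solves, under what assumptions, and in what steps, along with its strengths and weaknesses.
I have compiled 15 commonly used Machine Learning algorithms: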
- Linear Regression
- Regression with Lasso
- Regression with Ridge
- Stepwise Regression
- Logistic Regression
- Naïve Bayes
- K-Nearest Neighbors
- K-means Clustering
- Decision Tree
- Random Forest
- AdaBoost
- Gradient Boosting
- SVM (Support Vector Machine)
- PCA (Principal Component Analysis)
- Neural Networks
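(If you are applying to IT companies in California, you also need to know Deep Learning and TensorFlow.)
I have also distilled five general questions to ask about any Machine Learning algorithm, which I call The Big Five: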
- What are the basic concepts? What problem does it solve?
- What are the assumptions?
- What are the steps of the algorithm?
- What is the cost function?
- What are the advantages/disadvantages?
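As a minimal sketch of how The Big Five might be answered for one algorithm, the snippet below takes simple Linear Regression: the problem is predicting a numeric value, the cost function is mean squared error, and the steps are plain gradient descent. The function names, toy data, learning rate, and epoch count are illustrative assumptions, not something prescribed by the list above.

```python
def mse_cost(w, b, xs, ys):
    """Cost function: J(w, b) = (1/n) * sum((w*x + b - y)^2)."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def fit_linear_regression(xs, ys, lr=0.05, epochs=2000):
    """Steps of the algorithm: start at zero, then repeatedly move
    (w, b) against the gradient of the MSE cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear_regression(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, cost={mse_cost(w, b, xs, ys):.6f}")
```

The remaining two questions can be answered in words: the usual assumptions include a linear relationship and independent errors with constant variance; the main advantage is interpretability, and the main disadvantage is sensitivity to outliers and to nonlinear structure.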
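Second: Depth of Machine Learning (usually tested only by FLAG companies)
Overview: put bluntly, this means deriving an algorithm's math by hand. It is the hardest part of the interview; if you master it, you are at a PhD level.
Finance companies generally do not test this; top IT companies in California might.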
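To give a feel for what a hand derivation looks like, here is one common example: the closed-form solution of ordinary least squares. The notation is my own assumption: $X$ is the $n \times p$ design matrix, $y$ the response vector, $\beta$ the coefficient vector, and $X^\top X$ is assumed to be invertible.

```latex
\begin{aligned}
J(\beta) &= \lVert y - X\beta \rVert^2 = (y - X\beta)^\top (y - X\beta) \\
\nabla_\beta J(\beta) &= -2 X^\top (y - X\beta) = 0 \\
X^\top X \beta &= X^\top y \\
\hat{\beta} &= (X^\top X)^{-1} X^\top y
\end{aligned}
```

An interviewer may then ask what happens when $X^\top X$ is not invertible, which leads naturally to Ridge regression.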
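Third: Machine Learning Experience (a main interview focus)
Overview: the focus is on things that come up in real projects but are rarely covered in class or in textbooks. For example:
- How do you do feature engineering?
- What do you do when you have fewer samples than features?
- How do you handle classification on imbalanced data?
- What do you do when the model's performance falls short of expectations?
- How do you handle missing data?
- How do you detect outliers, and what do you do about them?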
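Three of the questions above (missing data, outliers, imbalanced classes) have simple baseline answers that are easy to sketch. The code below is only one possible starting point, using the Python standard library: median imputation, the 1.5 × IQR rule, and inverse-frequency class weights. All function names and the toy data are hypothetical, and real projects would usually reach for a library and a more careful method.

```python
import statistics
from collections import Counter


def impute_median(values):
    """Missing data: replace None with the median of the observed values.
    The median is used instead of the mean because it is robust to outliers."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Outlier detection: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def balanced_class_weights(labels):
    """Imbalanced data: weight each class by n_samples / (n_classes * count),
    so that rare classes count more in the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Toy data, made up for illustration.
raw = [10, 12, None, 11, 13, 12, 95, None, 11]
filled = impute_median(raw)
outliers = iqr_outliers(filled)

labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

print(filled)
print(outliers)
print(weights)
```

What to do with a flagged outlier (drop it, cap it, or keep it) depends on whether it is a data error or a genuine extreme value, which is exactly the kind of judgment interviewers want to hear you reason about.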
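Read the write-ups on expert blogs and on data science websites.
Fourth: Application of Machine Learning (the key to landing interviews)
Overview: these are the Machine Learning projects you have done.
Knowing the Machine Learning concepts and derivations alone is not enough. Companies hire people to solve real problems, so you must have solid project development experience!
For high-quality projects, such as a high ranking or an award on Kaggle, most companies will give you an interview. Interviews generally do not hand you a pile of data and ask you to build a model on the spot; instead, before the on-site you are given a Data Challenge and a week to build a model and write up some insights.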
Here is a short list of common Data Scientist deliverables:
- Prediction (predict a value based on inputs)
- Classification (e.g., spam or not spam)
- Recommendations (e.g., Amazon, Netflix, Spotify recommendations)
- Pattern detection and grouping (e.g., classification without known classes)
- Anomaly detection (e.g., fraud detection)
- Recognition (image, text, audio, video, facial, …)
- Actionable insights (via dashboards, reports, visualizations, …)
- Automated processes and decision-making (e.g., credit card approval)
- Scoring and ranking (e.g., FICO score)
- Segmentation (e.g., demographic-based marketing)
- Optimization (e.g., risk management)
- Forecasts (e.g., sales and revenue)
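Four types of project I recommend you have:
- Regression: a prediction model, e.g. predicting house prices or stock prices
- Classification: image classification, e.g. given a pile of images, classify which are cats and which are dogs, or classify dog breeds. These are classic projects.
- Recommendation System: collaborative filtering is usually enough, e.g. movie recommendations based on ratings like Netflix, or music recommendations based on historical streaming data like Spotify
- NLP (Natural Language Processing): build a spam email detection model, or predict market movements from news or from tweets by Twitter celebrities such as Donald Trump.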
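For the Recommendation System project, the core idea of collaborative filtering fits in a few lines. The sketch below is a user-based variant that predicts a missing rating as a similarity-weighted average of other users' ratings. The users, ratings, and function names are invented for illustration; a real project would work from a proper ratings dataset and would likely use matrix factorization or a dedicated library.

```python
import math

# Hypothetical ratings on a 1-5 scale, made up for illustration.
ratings = {
    "alice": {"Inception": 5, "Titanic": 1, "Matrix": 4},
    "bob": {"Inception": 4, "Titanic": 2, "Matrix": 5, "Avatar": 4},
    "carol": {"Inception": 1, "Titanic": 5, "Avatar": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_rating(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`.
    Returns None when no other user has rated the item."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None


pred = predict_rating("alice", "Avatar", ratings)
print(f"Predicted rating for alice on Avatar: {pred:.2f}")
```

Because alice's tastes are closer to bob's than to carol's, the prediction lands nearer bob's rating. Being able to explain that behaviour, and its weaknesses such as the cold-start problem, is what makes the project useful in an interview.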
Reference:
An Introduction to Statistical Learning
https://www.innoarchitech.com/what-is-data-science-does-data-scientist-do/