# The Breadth of Machine Learning

## The Breadth of Machine Learning: Part I

Posted by Bruce Yang on July 14, 2018

# Part I: The Big Five

In the first few rounds of a Data Scientist interview, most companies focus heavily on your machine learning knowledge (the breadth of machine learning) to test how many algorithms you are familiar with and have used in real-life projects. I summarize the top 15 algorithms and the 5 most commonly asked questions (the Big Five).

1. What are the basic concepts? What problem does it solve?
2. What are the assumptions?
3. What are the steps of the algorithm?
4. What is the cost function?
5. What are the pros and cons?

## Top 15 Machine Learning Algorithms Summary

1. Linear Regression
2. Regression with Lasso
3. Regression with Ridge
4. Stepwise Regression
5. Logistic Regression
6. Naïve Bayes
7. K-Nearest Neighbors
8. K-means Clustering
9. Decision Tree
10. Random Forest
11. AdaBoost
12. Gradient Boosting
13. SVM (Support Vector Machine)
14. PCA (Principal Component Analysis)
15. Neural Networks

## 1. Linear Regression

### 1.1 What are the basic concepts? What problem does it solve?

Basic Concept: linear regression models the relationship between continuous variables. The Xs are the predictor, explanatory, or independent variables; Y is the response, outcome, or dependent variable.

### 1.2 What are the assumptions?

• The mean function is linear
• The variance function is constant
• Residuals are statistically independent, have uniform variance, are normally distributed
• The epsilon $\epsilon$ term is assumed to be a random variable that has a mean of 0 and is normally distributed
• The $\epsilon$ term has constant variance $\sigma^2$ at every value of X (Homoscedastic)
• The error terms are also assumed to be independent. $\epsilon$ ~ $N(0,\sigma^2)$

### 1.3 What are the steps of the algorithm?

We want to find the best fit line: $\hat{y_i} = b_0 + b_1 * x_i$

We can achieve that by using least squares estimation, which minimizes the sum of the squared predicted errors.

We need to find the values of $b_0$ and $b_1$ that minimize $Q = \sum_{i=1}^{n} (y_i-\hat{y_i})^2$, which gives $b_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$ and $b_0 =\bar{y}- b_1\bar{x}$

### 1.4 What is the cost function?

$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y_i})^2$
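As a quick check of the closed-form estimates above, here is a minimal numpy sketch with made-up data (the slope 3, intercept 2, and noise level are illustrative assumptions):

```python
import numpy as np

# Hypothetical data: y is roughly 2 + 3x plus a little noise
rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2 + 3 * x + rng.normal(0, 0.1, size=10)

# b1 and b0 from the least squares formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
mse = np.mean((y - y_hat) ** 2)
print(b0, b1, mse)
```

The recovered slope and intercept land close to the true 3 and 2, and the MSE is on the order of the noise variance.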

Pros:

• Linear regression is an extremely simple method. It is very easy to use, understand, and explain.
• The best-fit line minimizes the total squared error across all points, and the closed-form solution makes it computationally efficient

Cons:

• Linear regression only models relationships between dependent and independent variables that are linear. It assumes a straight-line relationship between them, which is sometimes incorrect.
• Linear regression is very sensitive to anomalies/outliers in the data
• If the number of parameters is greater than the number of samples, the model starts to fit noise rather than the underlying relationship

## 2. Regression with Lasso

### 2.1 What are the basic concepts? What problem does it solve?

Lasso is not a type of model; it is a regularization method for selecting variables and shrinking coefficients to control model complexity.

Lasso uses an L1 penalty when fitting the model using least squares

It can force regression coefficients to be exactly 0

When $\lambda = 0$, then the lasso simply gives the least squares fit

When $\lambda$ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal 0

### 2.2 What are the assumptions?

Same as linear regression

### 2.3 What are the steps of the algorithm?

Lasso uses the L1 penalty: the sum of the absolute values of the coefficients.

### 2.4 What is the cost function?

$$\hat{\beta}^{lasso} = \arg\min_{\beta} \{\sum_{i=1}^{n} (y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j)^2 + \lambda\sum_{j=1}^{p}|\beta_j| \}$$
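To illustrate how the L1 penalty drives coefficients to exactly zero, here is a minimal coordinate-descent sketch (using the common $\frac{1}{2}\mathrm{RSS} + \lambda\|\beta\|_1$ convention; `lasso_cd`, the data, and the choice of $\lambda$ are illustrative assumptions, not a production solver):

```python
import numpy as np

def soft_threshold(rho, lam):
    # Shrink rho toward zero; exactly zero when |rho| <= lam
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2)||y - Xb||^2 + lam*||b||_1 (a sketch)."""
    beta = np.zeros(X.shape[1])
    z = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]   # residual ignoring feature j
            beta[j] = soft_threshold(X[:, j] @ r, lam) / z[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, 0.0, 0.0]) + rng.normal(0, 0.1, 100)

beta = lasso_cd(X, y, lam=30.0)
print(beta)  # at this lambda the two irrelevant coefficients hit exactly 0
```

The soft-thresholding step is what makes coefficients land on 0 exactly, which is the feature-selection behavior described above.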

Pros:

• Feature selection
• Much easier to interpret and produces simple models
• Lasso tends to perform better in settings where a relatively small number of predictors have substantial coefficients and the remaining predictors have coefficients that are very small or equal to zero
• In such sparse settings, lasso can outperform ridge regression in terms of bias, variance, and MSE

Cons:

• For the n < p case (the high-dimensional case), lasso can select at most n features. This follows from the nature of the convex optimization problem lasso solves.
• When features are correlated, which is usually the case for real-world datasets, lasso will select only one feature from a group of correlated features, and that selection is essentially arbitrary. Often this is not the behavior you want: in gene expression analysis, for example, the ideal selection method would eliminate the trivial genes and automatically include the whole group once one gene in it is selected ('grouped selection'). Lasso does not support grouped selection.
• Even in the n > p case, ridge (Tikhonov) regression tends to have better predictive power than lasso when features are correlated, although ridge does not help with feature selection and its models are harder to interpret.

## 3. Regression with Ridge

### 3.1 What are the basic concepts? What problem does it solve?

Ridge uses an L2 penalty when fitting the model using least squares

When lambda = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates.

When lambda becomes infinitely large, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach 0.

### 3.2 What are the assumptions?

Same as linear regression

### 3.3 What are the steps of the algorithm?

Ridge uses the L2 penalty: the sum of the squares of the coefficients. It cannot zero out coefficients, but it forces them to be relatively small.

### 3.4 What is the cost function?

$$\hat{\beta}^{ridge} = \arg\min_{\beta} \{\sum_{i=1}^{n} (y_i-\beta_0-\sum_{j=1}^{p}x_{ij}\beta_j)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \}$$
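Unlike lasso, ridge has a closed-form solution, $\hat{\beta} = (X^TX + \lambda I)^{-1}X^Ty$. A minimal numpy sketch (intercept omitted for simplicity; the data are made up):

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution (intercept omitted for simplicity)
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(0, 0.1, 50)

b_ols = ridge(X, y, 0.0)   # lambda = 0 recovers the least squares fit
b_big = ridge(X, y, 1e6)   # a huge lambda shrinks coefficients toward 0
print(b_ols, b_big)
```

The two extreme lambdas demonstrate the behavior described above: lambda = 0 gives least squares, and a very large lambda pushes all coefficients toward (but never exactly to) zero.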

Pros:

• As lambda increases, the shrinkage of the ridge coefficient estimates leads to a substantial reduction in the variance of the predictions, at the expense of a slight increase in bias
• Ridge regression works best in situations where the least squares estimates have high variance. Meaning that a small change in the training data can cause a large change in the least squares coefficient estimates
• Ridge will perform better when the response is a function of many predictors, all with coefficients of roughly equal size
• Ridge regression also has substantial computational advantages over best subset selection.

Cons:

• Ridge regression is not able to shrink coefficients to exactly zero. As a result, it cannot perform variable selection

### 3.5 What is lambda $\lambda$?

$\lambda$-Lambda is the tuning parameter that decides how much we want to penalize the flexibility of our model

As lambda increases, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero.

Selecting a good value of lambda is critical, and cross-validation comes in handy for this purpose. The penalty used by this method is also known as the L2 norm.

## 4. Stepwise Regression

### 4.1 What are the basic concepts? What problem does it solve?

#### Best Subset Selection

Idea: we fit a separate least squares regression for each possible combination of the p predictors.

That is:

1. We fit all p models that contain exactly 1 predictor
2. We fit all $\binom{p}{2} = p(p-1)/2$ models that contain exactly 2 predictors
3. so on and so forth
4. Selecting the best model from among the 2^p possibilities

#### Forward Stepwise Selection

Idea: Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model

That is: for example p =10

1. We fit each of the 10 variables (one at a time) against the response to determine the best one-variable model.
2. We fit each of the remaining 9 variables together with the best variable from step 1 to determine the best two-variable model
3. so on and so forth
4. Selecting the best model from among the $S_n = p(p+1)/2 + 1$ possibilities (including the null model)

Forward stepwise selection can be applied even in the high-dimensional setting where n < p; however, in this case, it is possible to construct submodels $M_0$,…,$M_{n-1}$ only, since each submodel is fit using least squares, which will not yield a unique solution if p >= n

#### Backward stepwise selection

Idea: Backward stepwise selection begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor one-at-a-time.

That is: for example p =10

1. We start with all 10 variables in the model and try removing each of the 10 (one at a time) to determine the best 9-variable model.
2. With 9 variables left (the worst one removed), we try removing each of the 9 (one at a time) to determine the best 8-variable model
3. so on and so forth
4. Selecting the best model from among the $S_n = p(p+1)/2 + 1$ possibilities (including the null model)

Backward selection requires that the number of samples n is larger than the number of variables p

### 4.2 What are the assumptions?

Same as linear regression

### 4.3 What are the steps of the algorithm?

#### Algorithm for Best Subset Selection

1. Let $M_0$ denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation
2. For k = 1,2,… p: a. Fit all $\binom{p}{k}$ models that contain exactly k predictors. b. Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
3. Select a single best model from among $M_0$,…,$M_p$ using cross-validated prediction error, Cp (AIC), BIC, or adjusted $R^2$.

#### Algorithm for Forward Stepwise Selection

1. Let $M_0$ denote the null model, which contains no predictors
2. For k=0,….,p-1: a. Consider all p-k models that augment the predictors in $M_k$ with one additional predictor. b. Choose the best among these p-k models, and call it $M_{k+1}$. Here best is defined as having smallest RSS or Highest $R^2$
3. Select a single best model from among $M_0$,…,$M_p$ using cross-validated prediction error, Cp (AIC), BIC, or adjusted $R^2$
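The forward stepwise loop above can be sketched as follows (a minimal numpy sketch that scores candidate models by RSS only, without the final cross-validation step; `forward_stepwise` and the toy data are illustrative assumptions):

```python
import numpy as np

def forward_stepwise(X, y):
    """Greedy forward selection scored by RSS (a sketch; no stopping rule)."""
    p = X.shape[1]
    chosen, remaining, path = [], list(range(p)), []
    for _ in range(p):
        def rss_with(j):
            # RSS of the least squares fit using the chosen columns plus j
            cols = X[:, chosen + [j]]
            beta, *_ = np.linalg.lstsq(cols, y, rcond=None)
            r = y - cols @ beta
            return r @ r
        best = min(remaining, key=rss_with)   # the predictor that helps most
        chosen.append(best)
        remaining.remove(best)
        path.append(list(chosen))             # M_1, M_2, ..., M_p
    return path

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = 5 * X[:, 2] + rng.normal(0, 0.1, 200)  # only feature 2 truly matters

path = forward_stepwise(X, y)
print(path[0])  # feature 2 enters the model first
```

In practice you would then pick one model along `path` using cross-validation, Cp (AIC), BIC, or adjusted $R^2$, as step 3 describes.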

#### Algorithm for Backward stepwise selection

1. Let $M_p$ denote the full model, which contains all p predictors
2. For k=p, p-1, …,1: a. Consider all k models that contain all but one of the predictors in $M_k$, for a total of k-1 predictors. b. Choose the best among these k models, and call it $M_{k-1}$. Here best is defined as having smallest RSS or highest $R^2$
3. Select a single best model from among $M_0$,…$M_p$ using cross-validated prediction error, Cp (AIC), BIC, or adjusted $R^2$

### 4.4 What is the cost function?

Same as linear regression

Pros:

• Forward and backward selection have a computational advantage: the total number of models considered is $S_n = p(p+1)/2 + 1$ (including the null model)

Cons:

• Best subset selection is a simple and conceptually appealing approach, but it suffers from computational limitations. The number of possible models grows rapidly with p: if p = 10, there are approximately 1,000 possible models to consider; if p = 20, over 1 million. If p > 40, it is computationally infeasible

## 5. Logistic Regression

### 5.1 What are the basic concepts? What problem does it solve?

Logistic Regression is a classification method, usually used for binary classification (0 or 1).

A logistic model is one where the log-odds of the probability of an event is a linear combination of independent or predictor variables

### 5.2 What are the assumptions?

• The outcome is a binary or dichotomous variable like yes vs no, positive vs negative, 1 vs 0.
• There is a linear relationship between the logit of the outcome and each predictor variables. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the probabilities of the outcome.
• There are no influential values (extreme values or outliers) in the continuous predictors
• There are no high intercorrelations (i.e. multicollinearity) among the predictors.

### 5.3 What are the steps of the algorithm?

• Sigmoid function: $p=\frac{1}{1+\exp(-y)}$, where $y = wx+b$; the sigmoid turns the linear score into the probability that the observation belongs to the positive class (> 50% means Yes)
• Linear relationship between the logit of the outcome probability and the predictors: $\ln(\frac{p}{1-p})=wx+b$

### 5.4 What is the cost function?

We use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.

The cost function is derived from maximum likelihood. We try to find $\beta_0$ and $\beta_1$ such that plugging these estimates into the model for p(x) yields a number close to 1 for all individuals who defaulted (class 1), and a number close to 0 for all individuals who did not (class 0).

The difference between the cost function and the loss function: the loss function computes the error for a single training example; the cost function is the average of the loss function of the entire training set.
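The sigmoid and the cross-entropy cost can be written directly (a minimal numpy sketch; the clipping constant and the toy probability vectors are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    # Map the linear score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p):
    # Average log loss over the training set (the cost function above)
    eps = 1e-12                     # assumption: clip to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.2])   # confident, correct probabilities
p_bad = np.array([0.5, 0.5, 0.5, 0.5])    # coin-flip probabilities

print(sigmoid(0.0))  # 0.5: a score of 0 sits on the decision boundary
print(cross_entropy(y, p_good), cross_entropy(y, p_bad))
```

Coin-flip probabilities give a loss of $\ln 2 \approx 0.693$; confident correct predictions score lower, which is exactly the direction maximum likelihood pushes the coefficients.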

Pros:

• Outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting. Logistic models can be updated easily with new data using stochastic gradient descent.
• Linear combination of parameters β and the input vector will be incredibly easy to compute.
• Convenient probability scores for observations
• Efficient implementations available across tools
• Multi-collinearity is not really an issue and can be countered with L2 regularization to an extent
• Widespread industry comfort with logistic regression solutions (oh, that's important too!)

Cons:

• Logistic regression tends to underperform when there are multiple or non-linear decision boundaries. It is not flexible enough to naturally capture more complex relationships.
• Doesn't handle a large number of categorical features/variables well

## 6. Naïve Bayes

### 6.1 What are the basic concepts? What problem does it solve?

The Naïve Bayes algorithm is a machine learning algorithm for classification problems. It is primarily used for text classification, which involves high-dimensional training datasets. A few examples are spam filtering, sentiment analysis, and classifying news articles.

This algorithm learns the probability of an object with certain features belonging to a particular group or class; in short, it is a probabilistic classifier.

### 6.2 What are the assumptions?

The Naïve Bayes algorithm is called “naïve” because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of other features.

For instance, if you are trying to identify a fruit based on its color, shape, and taste, then an orange-colored, spherical, tangy fruit would most likely be an orange. Even if these features depend on each other or on the presence of other features, each property individually contributes to the probability that this fruit is an orange; that is why the method is called naïve.

### 6.3 What are the steps of the algorithm?

Example 1:

There are 13 observations in total in a training dataset:

• In the first observation, both features are 1.
• In the next 4 observations, the two features are 1 and 0.
• In the next 4 observations, the two features are 0 and 1.
• In the last 4 observations, both features are 0.
• We label the first observation as class 1, and the other 12 observations as class 2

You are provided one test point: both features are 1.

The Question is: Do you think this testing data belongs to class 1 or class 2?

A lot of people will say it belongs to class 1 by instinct.

Let's see what we can get by using Naive Bayes:

Step 1: Assume independent

Assume the features are conditionally independent given the class, so $P(x|C_i) = P(x_1|C_i)P(x_2|C_i)$

Step 2: Calculate prior and likelihood

$P(C_1) = \frac{1}{13}, P(C_2) = \frac{12}{13}$ because there are 13 observations in total: only 1 observation is labeled class 1 and the other 12 are labeled class 2

$P(x_1|C_1) = 1, P(x_2|C_1) = 1$ because both features x1 and x2 are 1 in class 1

So $P(x|C_1) = P(x_1|C_1)P(x_2|C_1) = 1$

$P(x_1|C_2) = \frac{1}{3}$ because in class 2, there are four 1s in the first feature $x_1$; the other eight are 0 (12 total)

$P(x_2|C_2) = \frac{1}{3}$ because in class 2, there are four 1s in the second feature $x_2$; the other eight are 0 (12 total)

So $P(x|C_2) = P(x_1|C_2)P(x_2|C_2) = \frac{1}{9}$

Step 3: Bayes Theorem

$P(C_1|x) = \frac{ P(x|C_1)P(C_1) }{P(x)} = \frac{ P(x|C_1)P(C_1) }{P(x|C_1)P(C_1)+P(x|C_2)P(C_2)}$

$P(C_1|x) = \frac{ 1 \cdot \frac{1}{13} }{1 \cdot \frac{1}{13} +\frac{1}{9} \cdot \frac{12}{13} } = \frac{3}{7} < 0.5$

$P(C_2|x) = \frac{ \frac{1}{9} \cdot \frac{12}{13} }{1 \cdot \frac{1}{13} +\frac{1}{9} \cdot \frac{12}{13} } = \frac{4}{7} > 0.5$

Therefore, based on Naive Bayes, we classify the testing data as class 2
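The worked example above can be verified with exact fractions (a small Python sketch reproducing the same priors, likelihoods, and Bayes step):

```python
from fractions import Fraction

# Priors: 1 of 13 observations is class 1, 12 of 13 are class 2
p_c1, p_c2 = Fraction(1, 13), Fraction(12, 13)

# Likelihoods for the test point x = (1, 1), using the independence assumption
p_x_c1 = Fraction(1, 1) * Fraction(1, 1)   # P(x1=1|C1) * P(x2=1|C1)
p_x_c2 = Fraction(1, 3) * Fraction(1, 3)   # P(x1=1|C2) * P(x2=1|C2)

# Bayes theorem: posterior = likelihood * prior / evidence
evidence = p_x_c1 * p_c1 + p_x_c2 * p_c2
post_c1 = p_x_c1 * p_c1 / evidence
post_c2 = p_x_c2 * p_c2 / evidence
print(post_c1, post_c2)  # 3/7 and 4/7, matching the derivation above
```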

### 6.4 What is the cost function?

Pros:

• It is easy and fast to predict the class of a test dataset. It also performs well in multi-class prediction
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
• It performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).

Cons:

• If a categorical variable has a category in the test dataset that was not observed in the training dataset, the model will assign it zero probability and will be unable to make a prediction. This is often known as the "zero frequency" problem. To solve it, we can use a smoothing technique; one of the simplest is Laplace estimation.
• On the other side, naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba should not be taken too seriously.
• Another limitation of naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.

### 6.5 What are the variations of Naive Bayes?

• Gaussian: It is used in classification and it assumes that features follow a normal distribution.
• Multinomial: It is used for discrete counts. For example, in a text classification problem, instead of the Bernoulli question "does the word occur in the document," we count how often the word occurs in the document; you can think of it as "the number of times outcome $x_i$ is observed over the n trials".
• Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with ‘bag of words’ model where the 1s & 0s are “word occurs in the document” and “word does not occur in the document” respectively.

### 6.6 What are the applications of Naive Bayes?

• Real-time prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it can be used for making predictions in real time.
• Multi-class prediction: This algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable.
• Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence assumption) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, identifying positive and negative customer sentiment)
• Recommendation systems: a Naive Bayes classifier together with collaborative filtering builds a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource

## 7. K-Nearest Neighbors

### 7.1 What are the basic concepts? What problem does it solve?

K-Nearest Neighbors: Given N training vectors, the KNN algorithm identifies the K nearest neighbors of a query point (regardless of labels) and predicts from those neighbors' labels.

### 7.2 What are the steps of the algorithm? What is the cost function?

KNN is a lazy learner: there is no training phase and no cost function to optimize. At prediction time, compute the distance from the query point to every training point, take the K closest, and aggregate their labels (majority vote for classification, average for regression).

Pros:

• Easy to interpret output
• Naturally handles multi-class cases
• No training time (all computation is deferred to prediction)
• Predictive power: can do well in practice with enough representative data

Cons:

• Large search problem to find nearest neighbors
• Storage of data
• Must know we have a meaningful distance function
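The prediction step can be sketched in a few lines (a minimal numpy sketch with hypothetical 2-D data; Euclidean distance and majority voting are the assumed choices):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to all points
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated toy clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.05, 0.05])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.0])))    # -> 1
```

Note that the full distance scan per query is exactly the "large search problem" listed in the cons; tree- or hash-based indexes are the usual remedy.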

## 8. K-means Clustering

### 8.1 What are the basic concepts? What problem does it solve?

K-means is the most popular and widely used clustering method. It is essentially an optimization problem. It is an unsupervised learning. K-means is an iterative algorithm and it does two things, first is a cluster assignment step and second is a move centroid step.

Cluster assignment step: go through each example and, depending on which cluster centroid it is closer to (say k=2), assign the data point to one of the two cluster centroids.

Move Centroid step: compute the average of all the points that belong to each centroid, and then move the centroid to the average point location.
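The two alternating steps above can be sketched as follows (a minimal numpy implementation with a toy two-cluster dataset; the empty-cluster guard is a practical assumption, not part of the description above):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: each centroid moves to the mean of its points
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)  # the first three points share one label, the last three the other
```

Rerunning with different seeds illustrates the "initial seeds have a strong impact" con: the assignment this converges to depends on which points are drawn as starting centroids.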

### 8.2 What are the assumptions?

Data has to be numeric not categorical.

### 8.3 What is the cost function?

Input: finite set $S \subset \mathbb{R}^{d}$; integer k

Output: $T \subset \mathbb{R}^{d}$ with $|T| = k$

Goal: Minimize Squared Distance $$Cost(T) = \sum_{x \in S} \min_{z \in T} ||x-z||^2$$

Pros:

• Easy to implement; easy to interpret the clustering results.
• With a large number of variables, K-means is computationally fast and efficient.

Cons:

• Sensitive to outliers
• Initial seeds have a strong impact on the final results
• The K value is not known in advance; it is difficult to predict the right number of clusters.
• The order of the data has an impact on the final results
• Sensitive to scale: rescaling your datasets (normalization or standardization) will completely change results.
• It does not work well when the true clusters (in the original data) have different sizes and different densities

## 9. Decision Tree

### 9.1 What are the basic concepts? What problem does it solve?

• A decision tree uses a tree structure to specify sequences of decisions and consequences.
• The idea is simple: it breaks down a dataset into smaller subsets.
• There are Regression Tree and Classification Tree.
• A decision tree employs a structure of nodes and branches: Root node, Branch Node, Internal Node, and Leaf Node.
• The depth of a node is the minimum number of steps required to reach the node from the root.
• Tree-based methods tend to perform well on unprocessed data (i.e. without normalizing, centering, scaling features).

### 9.2 What are the steps of the algorithm?

#### How to implement a Regression Tree?

1. Perform recursive binary splitting: we first select the predictor $X_{j}$ and the cutpoint $s$ such that splitting the predictor space into the Region $\{X|X_{j} < s\}$ and $\{X|X_{j} >= s\}$ leads to the greatest possible reduction in RSS.
2. So we will have 2 partitions: R1 and R2, $R_{1}(j,s) = \{X|X_{j} < s\}$ and $R_{2}(j,s) = \{X|X_{j} >= s\}$, and we seek the values of $j$ and $s$ that minimize the equation: $$\sum_{i: x_{i} \in R_{1}(j,s)} (y_{i}-\hat{y}_{R_1})^2 + \sum_{i: x_{i} \in R_{2}(j,s)} (y_{i}-\hat{y}_{R_2})^2$$ where $\hat{y}_{R_1}$ is the mean response for the training observations in $R_{1}(j,s)$, and $\hat{y}_{R_2}$ is the mean response for the training observations in $R_{2}(j,s)$. Finding the values of j and s that minimize the equation can be done quite quickly, especially when the number of features p is not too large.
3. Repeat this process, look for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions.
4. The process continues until a stopping criterion is reached: e.g. no region contains more than 5 observations.
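The split search in steps 1-2 can be sketched for a single predictor (a minimal numpy sketch with toy data; `best_split` and the step-shaped target are illustrative assumptions):

```python
import numpy as np

def best_split(x, y):
    """Scan cutpoints of one predictor, minimizing the two-region RSS."""
    best = (None, np.inf)
    for s in np.unique(x)[1:]:               # candidate cutpoints
        left, right = y[x < s], y[x >= s]    # the two regions R1 and R2
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[1]:
            best = (s, rss)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.1, 4.9, 20.0, 20.2, 19.8])   # a clear step at x = 10
s, rss = best_split(x, y)
print(s, rss)  # the cutpoint lands on the step
```

A full tree repeats this scan over every predictor in every region, which is why the search stays fast when p is small.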

#### How to Implement a Classification Tree?

1. Perform recursive binary splitting: we first select the predictor $X_{j}$ and the cutpoint $s$ such that splitting the predictor space into the regions $R = \{X|X_{j} < s\}$ and $R' = \{X|X_{j} >= s\}$ leads to the greatest possible reduction in the classification error rate: $$E_R = \min_{y} \frac{1}{N_R} \sum_{i:X_i \in R}I(y_i \neq y)$$ $$N_R = \#\{i: X_i \in R\}$$
2. So we will have 2 partitions: $R$ and $R'$, and we seek the values of $j$ and $s$ that minimize the equation: $$N_{R(j,s)} E_{R(j,s)} + N_{R'(j,s)} E_{R'(j,s)}$$ where $E_{R(j,s)}$ is the misclassification rate, and $N_{R(j,s)}$ is the total number of points in the partition.
3. Repeat this process: look for the best predictor and best cutpoint to split the data further so as to minimize the classification error rate within each of the resulting regions.
4. The process continues until a stopping criterion is reached: e.g. no region contains more than 5 observations, or a region contains points of only one class.

### 9.3 What is the cost function?

Regression Tree: minimize RSS $$\sum_{j=1}^{J} \sum_{i \in R_{j}} (y_{i}-\hat{y}_{R_j})^2$$

Classification Tree: minimize the classification error rate $$E_R = \min_{y} \frac{1}{N_R} \sum_{i:X_i \in R}I(y_i \neq y)$$

Pros:

• Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
• Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
• Allow the addition of new possible scenarios.
• Help determine worst, best and expected values for different scenarios.
• Use a white box model: if a given situation is observable in the model, the explanation for the result is easily reproduced.
• Can be combined with other decision techniques.

Cons:

• They are unstable, meaning that a small change in the data can lead to a large change in the structure of the optimal decision tree.
• They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to interpret as a single decision tree.
• For data including categorical variables with different number of levels, information gain in decision trees is biased in favor of those attributes with more levels.
• Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.

## 10. Random Forest

### 10.1 What are the basic concepts? What problem does it solve?

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

### 10.2 What are the steps of the algorithm?

Random Forest is bagging applied to decision trees:

• Bagging: random sampling with replacement from the original set to generate additional training data. The purpose of bagging is to reduce variance while retaining bias; it is effective because you improve the accuracy of a single model by training multiple copies of it on different subsets of the data. Bagging is not recommended for models that already have high bias.
• Random selection of $m$ predictors: used at each split; a rule of thumb for the number of features selected is $m=\sqrt{p}$. This process decorrelates the trees.
• Each tree is grown to the largest extent possible and there is no pruning.
• Predict new data by aggregating the predictions of the ntree trees (majority votes for classification, average for regression).
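The bagging and vote-aggregation pieces of the steps above can be sketched as follows (a minimal numpy sketch; `bootstrap_sample`, the toy data, and the hypothetical tree votes are illustrative, not a full forest):

```python
import numpy as np

rng = np.random.default_rng(4)

def bootstrap_sample(X, y):
    # Bagging: draw n rows with replacement from the original training set
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def majority_vote(tree_preds):
    # Aggregate the trees' class votes per sample (mode of the classes)
    votes = np.asarray(tree_preds)                  # shape (ntree, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)   # binary case

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
Xb, yb = bootstrap_sample(X, y)   # same size as the original; rows may repeat

# Hypothetical votes from 5 trees on 4 test points
preds = [[1, 0, 1, 0], [1, 0, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
print(majority_vote(preds))
```

A real forest would grow one unpruned tree per bootstrap sample, restricting each split to a random subset of $m \approx \sqrt{p}$ features, then aggregate exactly as `majority_vote` does.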

### 10.3 What is the cost function?

The same as Decision Tree.

Pros:

• As we mentioned earlier a single decision tree tends to overfit the data. The process of averaging or combining the results of different decision trees helps to overcome the problem of overfitting.
• Random forests also have less variance than a single decision tree, meaning they work well across a larger range of data than a single decision tree does.
• Random forests are extremely flexible and have very high accuracy.
• They also do not require preparation of the input data. You do not have to scale the data.
• It also maintains accuracy even when a large proportion of the data are missing.

Cons:

• The main disadvantage of Random forests is their complexity. They are much harder and time-consuming to construct than decision trees.
• They also require more computational resources and are also less intuitive. When you have a large collection of decision trees it is hard to have an intuitive grasp of the relationship existing in the input data.
• In addition, the prediction process using random forests is more time-consuming than with other algorithms.

## 11. AdaBoost

### 11.1 What are the basic concepts? What problem does it solve?

It focuses on classification problems and aims to convert a set of weak classifiers into a strong one.

It retrains the algorithm iteratively by choosing the training set based on accuracy of previous training.

The weight of each trained classifier at any iteration depends on the accuracy achieved.

Aggregates weak hypotheses for strength; scales up the incorrect, scales down the correct; two heads are better than one, theoretically.

### 11.2 What are the steps of the algorithm?

1. Train a weak classifier $f_1$ on the original training dataset; we will have an error rate $\epsilon < 0.5$, since 0.5 is random guessing.
2. Introduce new weights $u_2$ to make a new training dataset, such that the error rate of $f_1$ becomes $\epsilon = 0.5$; that is, we make $f_1$ no better than random guessing, "failing" $f_1$.
3. Train another classifier $f_2$ on the new training dataset generated by $u_2$, so that $f_2$ will be complementary to $f_1$.
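Step 2's reweighting can be sketched as follows (a minimal numpy sketch using the standard AdaBoost update $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$, an assumption consistent with the exponential loss below; the round's data are hypothetical). After the update, $f_1$'s weighted error is exactly 0.5:

```python
import numpy as np

def reweight(weights, correct, eps):
    # Standard AdaBoost update: alpha = (1/2) ln((1 - eps) / eps);
    # misclassified points are scaled up, correct ones scaled down
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = weights * np.exp(np.where(correct, -alpha, alpha))
    return w / w.sum()  # renormalize to a distribution

# Hypothetical round: 10 points with uniform weights; f1 misclassifies 3 (eps = 0.3)
w = np.full(10, 0.1)
correct = np.array([True] * 7 + [False] * 3)

w2 = reweight(w, correct, eps=0.3)
print(w2[~correct].sum())  # f1's weighted error under the new weights: exactly 0.5
```

This is precisely the sense in which $f_1$ is "failed": on the reweighted data it looks like random guessing, so $f_2$ is forced to learn something complementary.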

### 11.3 What is the cost function?

Exponential Loss: $$L(y,f(x)) = \exp(-yf(x))$$

Pros:

• AdaBoost is a powerful classification algorithm that has enjoyed practical success with applications in a wide variety of fields, such as biology, computer vision, and speech processing.
• Unlike other powerful classifiers, such as SVM, AdaBoost can achieve similar classification results with much less tweaking of parameters or settings (unless of course you choose to use SVM with AdaBoost).
• The user only needs to choose: (1) which weak classifier might work best to solve their given classification problem; (2) the number of boosting rounds that should be used during the training phase. The GRT enables a user to add several weak classifiers to the family of weak classifiers that should be used at each round of boosting. The AdaBoost algorithm will select the weak classifier that works best at that round of boosting.

Cons:

• AdaBoost can be sensitive to noisy data and outliers. In some problems, however, it can be less susceptible to the overfitting problem than most learning algorithms. The GRT AdaBoost algorithm does not currently support null rejection, although this will be added at some point in the near future.

## 12. Gradient Boosting

### 12.1 What are the basic concepts? What problem does it solve?

The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow and weak successive trees, each tree learning from and improving on the previous one. When combined, these many weak successive trees produce a powerful “committee” that is often hard to beat with other algorithms.

### 12.2 What are the steps of the algorithm?

1. First, learn a regression predictor with simple models.
2. Compute the error residuals.
3. Learn to predict the residual.
4. Repeat this process; in the end, combine all the predictors by giving each predictor a weight.
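The four steps can be sketched in a few lines. This is an illustrative numpy version (not from the original post) that uses depth-1 regression stumps as the weak learners and squared loss, with a shrinkage (learning-rate) weight on each predictor; it assumes a single feature for simplicity:

```python
import numpy as np

def fit_stump(x, r):
    """Step 3: learn to predict the residual r with a simple model,
    here a depth-1 regression tree (stump) on a single feature x."""
    best = None
    for t in np.unique(x)[:-1]:                # candidate split points
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda x_new: np.where(x_new <= t, lv, rv)

def gradient_boost(x, y, n_rounds=100, lr=0.1):
    pred = np.full(len(y), y.mean())           # step 1: simple initial model
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred                    # step 2: compute error residuals
        stump = fit_stump(x, residual)         # step 3: learn to predict them
        pred += lr * stump(x)                  # step 4: weighted combination
        stumps.append(stump)
    return y.mean(), stumps

def gb_predict(model, x, lr=0.1):
    base, stumps = model
    return base + lr * sum(s(x) for s in stumps)
```

For squared loss the residual $y - \hat{y}$ is exactly the negative gradient, which is why fitting residuals and "gradient" boosting coincide in this case.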

### 12.3 What is the cost function?

It depends on the model, could be square loss or exponential loss. For any loss function, we can derive a gradient boosting algorithm. Absolute loss and Huber loss are more robust to outliers than square loss.
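For reference, the standard forms of these losses for a prediction $f = f(x)$ are:

```latex
% Square loss: smooth, but sensitive to outliers
L(y, f) = \tfrac{1}{2}\,(y - f)^2
% Absolute loss: more robust to outliers
L(y, f) = |y - f|
% Huber loss: quadratic near zero, linear in the tails
L_{\delta}(y, f) =
\begin{cases}
  \tfrac{1}{2}(y - f)^2 & \text{if } |y - f| \le \delta \\
  \delta\left(|y - f| - \tfrac{\delta}{2}\right) & \text{otherwise}
\end{cases}
```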

Pros:

• Gradient boosted methods generally have three parameters to tune: the shrinkage parameter, the depth of the trees, and the number of trees. Each should be tuned to get a good fit; you cannot simply take the maximum number of trees, since a GBM can overfit with more trees. With well-chosen tuning parameters, however, GBMs generally give somewhat better results than random forests.
• Often provides predictive accuracy that is hard to beat.
• Lots of flexibility: can optimize different loss functions and provides several hyperparameter tuning options that make the function fit very flexible.
• No data preprocessing required; often works great with categorical and numerical values as is.
• Handles missing data; imputation is not required.

Cons:

• GBMs will continue improving to minimize all errors. This can overemphasize outliers and cause overfitting; cross-validation must be used to neutralize this.
• Computationally expensive: GBMs often require many trees (>1000), which can be time and memory exhaustive.
• The high flexibility results in many parameters that interact and heavily influence the behavior of the approach. This requires a large grid search during tuning.
• Less interpretable, although this is easily addressed with various tools (variable importance, partial dependence plots, LIME, etc.).

## 13. Support Vector Machine (SVM)

### 13.1 What are the basic concepts? What problem does it solve?

SVM: the goal is to design a hyperplane that classifies all training vectors into 2 classes. The best choice is the hyperplane that leaves the maximum margin from both classes. As a rule of thumb, SVMs are great for relatively small data sets with few outliers.

Hyperplane: is a linear decision surface that splits the space into 2 parts; a hyperplane is therefore a binary classifier. A hyperplane has dimension $p-1$: in two dimensions a hyperplane is a line, in three dimensions it is a plane.

### 13.2 What are the steps of the algorithm?

Maximal Margin Hyperplane: is the separating hyperplane for which the margin is largest. If you create a decision boundary with a large margin, i.e. a large buffer zone between the boundary and the nearest training data points, then you are probably going to do better on new data.

Support Vectors: are the data points nearest to the hyperplane, the points of a dataset that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a dataset; they are what help us build our SVM.

Support Vector Classifier: sometimes called a soft margin classifier. Rather than seeking the largest possible margin so that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane.

Kernel Trick: the idea is that data which isn't linearly separable in the original $n$-dimensional space may be linearly separable in a higher dimensional space.

### 13.3 What is the cost function?

Hinge Loss: $$L(y,f(x)) = \max(0,\, 1-yf(x))$$
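The hinge loss $\max(0, 1-yf(x))$ can be minimized directly for a linear SVM by subgradient descent. A minimal numpy sketch (not from the original post); labels are assumed to be in {-1, +1}, and the kernel trick is omitted:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y*(Xw + b)))
    by full-batch subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                         # points violating the margin
        # subgradient of the hinge term is -y_i x_i on active points, 0 elsewhere
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Note that only the margin-violating points contribute to the gradient, which is the optimization-side view of why only the support vectors matter.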

Pros:

• SVM models generalize well in practice; the risk of overfitting is lower in SVM.
• The kernel trick is the real strength of SVM. With an appropriate kernel function, we can solve very complex problems.
• An SVM is defined by a convex optimization problem (no local minima) for which there are efficient methods; unlike a neural network, an SVM does not get stuck in local optima.
• Works well even with unstructured and semi-structured data like text, images, and trees.
• High dimensionality: the SVM is an effective tool in high-dimensional spaces, which is particularly applicable to document classification and sentiment analysis where the dimensionality can be extremely large ($\geq 10^6$).
• It scales relatively well to high dimensional data.

Cons:

• Choose a "good" kernel function is not easy.
• Long training time for large datasets.
• Difficult to understand and interpret the final model, varaible weights and individual impact.
• p > n - In situations where the number of features for each object (p) exceeds the number of training data samples (n), SVMs can perform poorly.

## 14. Principal Component Analysis (PCA)

### 14.1 What are the basic concepts? What problem does it solve?

PCA: is a dimension reduction method for compressing a lot of data into something that captures the essence of the original data. It takes a dataset with many dimensions and flattens it to 2 or 3 dimensions so we can look at it. PC1 (the first principal component) is the axis that spans the most variation; PC2 is the axis that spans the second most variation.

PCA: is a tool to reduce multidimensional data to lower dimensions while retaining most of the information. It covers standard deviation, covariance, and eigenvectors.

PCA: is a statistical method that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components.

PCA: is a variance maximizing exercise, it projects the original data onto a direction which maximizes variance.

Principal components: are eigen-pairs; they describe the directions in the original feature space with the greatest variances.

PCA: is an orthogonal, linear transformation that transforms data to a new coordinate system such that the greatest variance by some projection of data lies on the first principal component. The second greatest variance lies on the second component. The variance is the measurement of how spread out some data is.

PCA: is a classical feature extraction and data representation technique widely used in the areas of pattern recognition and computer vision, such as face recognition.

### 14.2 What are the assumptions?

1. You have multiple variables that should be measured at the continuous level.
2. There needs to be a linear relationship between all variables. The reason for this assumption is that a PCA is based on Pearson correlation coefficients, and as such, there needs to be a linear relationship between the variables.
3. You should have sampling adequacy, which simply means that for PCA to produce a reliable result, large enough sample sizes are required.
4. Your data should be suitable for data reduction. Effectively, you need to have adequate correlations between the variables in order for variables to be reduced to a smaller number of components.
5. There should be no significant outliers. Outliers are important because these can have a disproportionate influence on your results.

### 14.3 What are the steps of the algorithm?

1. Step 1: Normalize the data to values between 0 and 1: $X' = \frac{X-X_{min}}{X_{max}-X_{min}}$
2. Step 2: Perform eigen-decomposition: obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform Singular Value Decomposition. Covariance is the measure of how two variables change with respect to each other.
3. Step 3: Sort eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues where k is the number of dimensions of the new feature subspace.
4. Step 4: Construct the projection matrix W from the selected k eigenvectors. Transform the original dataset X via W to obtain a k-dimensional feature subspace Y = X.dot(W).
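The four steps map directly onto a few lines of numpy. A sketch (not from the original post); note that it mean-centers the data, the more common convention, rather than min-max scaling as in step 1:

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x p features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # covariance matrix (p x p)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
    W = eigvecs[:, order[:k]]                  # projection matrix (p x k)
    return Xc @ W, eigvals[order]              # k-dim subspace Y = X.dot(W)
```

Each eigenvalue is the variance captured along its eigenvector, so `eigvals[:k].sum() / eigvals.sum()` gives the fraction of total variance retained.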

### 14.4 What is the cost function?
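The heading's question can be answered in equation form. As standard background (not stated in the original post): the first principal component maximizes the projected variance, which for $k$ components is equivalent to minimizing the reconstruction error on centered data $x_i$:

```latex
% Variance maximization for the first component (Sigma = covariance matrix)
w_1 = \arg\max_{\|w\| = 1} \; w^{\top} \Sigma \, w
% Equivalent reconstruction-error view for k components (W has orthonormal columns)
\min_{W^{\top} W = I_k} \; \sum_{i=1}^{n} \left\| x_i - W W^{\top} x_i \right\|^2
```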

Pros:

• Reflects our intuition about the data.
• Allows estimating probabilities in high-dimensional data, no need to assume independence.
• Dramatic reduction in size of data, faster processing, smaller storage.

Cons:

• Too expensive for many applications.
• Disastrous for tasks with fine-grained classes.
• Requires understanding the assumptions behind the method (e.g. linearity).

## 15. Neural Network

### 15.1 What are the basic concepts? What problem does it solve?

Neural Networks: are computing systems vaguely inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images. (Wikipedia Definition)

Neuron: a thing that holds a number, specifically a number between 0 and 1

Input Layer: $x$

Output Layer: $\hat{y}$

Hidden Layer: arbitrary

Activation Function: connects the hidden layers; it squishes the real number line into the range between 0 and 1. A common function that does this is the sigmoid function, also known as a logistic curve: very negative inputs end up close to 0 and very positive inputs end up close to 1. The activation of a neuron is basically a measure of how positive the relevant weighted sum is.

Weights and Biases: between each layer, $W$ and $b$. Weights tell you what kind of pattern a neuron in the next layer is picking up on. Bias is for inactivity: maybe you only want the neuron to be active when the sum is bigger than, say, 10; that is, you want some bias toward inactivity. Just add some number like negative 10 to the weighted sum before plugging it into the sigmoid function. The bias tells you how high the weighted sum needs to be before the neuron starts getting meaningfully active.

### 15.2 What are the steps of the algorithm?

Train the Neural Network: the process of fine-tuning the weights and biases from the input data. Our goal in training is to find the best set of weights and biases that minimizes the loss function.

1. Calculating the predicted output $\hat{y}$, known as feedforward.
2. Updating the weights and biases, known as backpropagation.
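The two steps can be sketched end to end. Below is a minimal numpy network (not from the original post): one hidden layer of sigmoid neurons trained on the XOR toy problem with squared loss; the layer sizes and learning rate are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR toy data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden -> output

lr = 0.5
for _ in range(10000):
    # 1. Feedforward: calculate the predicted output y_hat
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # 2. Backpropagation: chain rule for squared loss sum((y_hat - y)^2)
    d_out = 2 * (y_hat - y) * y_hat * (1 - y_hat)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid
    b1 -= lr * d_hid.sum(axis=0)
```

The `y_hat * (1 - y_hat)` and `h * (1 - h)` factors are the derivative of the sigmoid, which is why this activation makes backpropagation easy to write by hand.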

### 15.3 What is the cost function?

Pros:

• ANNs (Artificial Neural Network) have the ability to learn and model non-linear and complex relationships, which is really important because in real-life, many of the relationships between inputs and outputs are non-linear as well as complex.
• ANNs can generalize, after learning from the initial inputs and their relationships, it can infer unseen relationships on unseen data as well, thus making the model generalize and predict on unseen data.
• Unlike many other prediction techniques, ANN does not impose any restrictions on the input variables (like how they should be distributed). Additionally, many studies have shown that ANNs can better model heteroskedasticity, i.e. data with high volatility and non-constant variance, given their ability to learn hidden relationships in the data without imposing any fixed relationships. This is very useful in financial time series forecasting where data volatility is very high.
• Can significantly outperform other models when the conditions are right (lots of high quality labeled data).

Cons:

• Hard to interpret the model. ANNs are a black box model once they are trained.
• Don't perform well on small data sets. The Bayesian approaches do have an advantage here.
• Hard to tune to ensure they learn well, and therefore hard to debug.
• ANNs are computationally intensive to train; i.e. you need a lot of chips and a distributed run-time to train on very large datasets.

References:

Linear Regression
http://r-statistics.co/Assumptions-of-Linear-Regression.html

Random Forest
Random Forest Wikipedia

Tree Based Model Summary