Hui Lin @Google

Ming Li @Amazon

- Can be used for both regression and classification problems
- Do not require user to specify the form of the relationship between predictors and response
- Do not require (or if they do, very limited) data preprocessing and can handle different types of predictors (sparse, skewed, continuous, categorical, etc.)
- Robust to co-linearity
- Can handle missing data
- Many pre-built packages make implementation as easy as a button push

**Classification tree**: the outcome is discrete**Regression tree**: the outcome is continuous (e.g. the price of a house, or a patient’s length of stay in a hospital)**Non-leaf node (or split node)**: the algorithm needs to decide a split at each non-leaf node (eg: \(age \leq 25\), \(25 < age \leq 35\), \(age > 35\))**Root node**：the beginning node where the tree starts**Leaf node (or Terminal node)**: the node stops splitting. It has the final decision of the model**Degree of the node**: the number of subtrees of a node**Degree of the tree**: the maximum degree of a node in the tree**Pruning**: remove parts of the tree that do not provide power to classify instances**Branch (or Subtree)**: the whole part under a non-leaf node**Child**: the node directly after and connected to another node**Parent**: the converse notion of a child

The goal of splitting is to get homogenous groups.

**How to define homogenous for a classification problem?**

**Gini impurity**(two-class problem):

\[p_{1}(1-p_{1})+p_{2}(1-p_{2})\]

**Entropy**:

\[Entropy=-plog_{2}p-(1-p)log_{2}(1-p)\]

**How to define homogenous for a regression problem?**

**Sum of Square Error (SSE)**: Suppose you want to divide the data set S into two groups of S1 and S2, where the selection of S1 and S2 needs to minimize the sum of squared errors

\[SSE=\Sigma_{i\in S_{1}}(y_{i}-\bar{y}_{1})^{2}+\Sigma_{i\in S_{2}}(y_{i}-\bar{y}_{2})^{2}\]

**Gini impurity**(two-class problem): \(p_{1}(1-p_{1})+p_{2}(1-p_{2})\)

- Gini impurity for “Female” = \(\frac{1}{6}\times\frac{5}{6}+\frac{5}{6}\times\frac{1}{6}=\frac{5}{18}\)
- Gini impurity for “Male” = \(0\times1+1\times 0=0\)

The Gini impurity for the node “Gender” is the following weighted average of the above two scores:

\[\frac{3}{5}\times\frac{5}{18}+\frac{2}{5}\times 0=\frac{1}{6}\]

**Entropy**: \(Entropy=-plog_{2}p-(1-p)log_{2}(1-p)\)

- Root node: \(-\frac{25}{50}log_{2}\frac{25}{50}-\frac{25}{50}log_{2}\frac{25}{50}=1\)

The entropy of the split using variable “gender” can be calculated in three steps:

- Entropy for “Female” = \(-\frac{5}{30}log_{2}\frac{5}{30}-\frac{25}{30}log_{2}\frac{25}{30}=0.65\)
- Entropy for “Male” = \(0\times1+1\times 0=0\)
- Entropy for the node “Gender” is the weighted average of the above two entropy numbers: \(\frac{3}{5}\times 0.65+\frac{2}{5}\times 0=0.39\)

**Sum of Square Error (SSE)**: \(SSE=\Sigma_{i\in S_{1}}(y_{i}-\bar{y}_{1})^{2}+\Sigma_{i\in S_{2}}(y_{i}-\bar{y}_{2})^{2}\)

- Root SSE: 522.9
- SSE for “Female” is 136
- SSE for “Male” is 32
- SSE for splitting node “Gender” is the sum of the above two numbers which is 168

Pruning is the process that reduces the size of decision trees. It reduces the risk of overfitting by limiting the size of the tree or removing sections of the tree that provide little power.

- Limit the size
- Minimum sample size at each node
- Maximum depth of the tree
- Maximum number of terminal nodes
- The number of variables considered for each split

- Remove branches
- cost/complexity penalty
- Error-based pruning
- Error-complexity pruning
- Minimum error pruning

Refer to this section of the book for more detail.

A single tree is unstable, and Bootstrap aggregation (Bagged) is an ensemble method that can effectively stabilize the model.

**Steps of Bagging**

- Build a model on different bootstrap samples to form an ensemble, say \(B\) samples
- For a new sample, each model will give a prediction: \(\hat{f}^1(x),\hat{f}^2(x)\dots,\hat{f}^B(x)\)
- The bagged model’s prediction is the average of all the predictions:

\[\hat{f}_{avg}(x)=\frac{1}{B}\Sigma^B_{b=1}\hat{f}^b(x)\]

**Advantages**- stabilizes the model predictions by averaging the results.
- provides more accurate predictions.
- can use out-of-bag samples to evaluate model performance.

**Disadvantages**- Computation and memory requirements
- The bagged model is difficult to explain
- Since the bagging tree uses all of the original predictors, trees are correlated, which prevents bagging from optimally reducing the variance of the predicted values.

To solve one of the disadvantage of bagging tree (i.e. correlation among these trees), random forest was introduced.

- Steps of Random Forest

- Select the number of trees, B
**for***i=1 to B***do**- generate a bootstrap sample of the original data
- train a tree on this sample
**for***each split***do**- randomly select m（<p）predictors
- choose the best one out of the \(m\) and partition the data

**end**

- use typical tree model stopping criteria to determine when a tree is complete without pruning

**end**

The ultimate commonly used none-deep-learning method that wins Kaggle competitions. Using sequence of weaker learner to build a strong learner.

- Steps of Gradient Boosted Tree

- Initialize all predictions to the sample log-odds: \(f_{i} = log \frac{\hat{p}}{1- \hat{p}}\)
**for**j=1 … M**do**- Compute predicted event probability: \(\hat{p}_i=\frac{1}{1+exp[-f_{i}(x)]}\).
- Compute the residual (i.e. gradient): \(z_i=y_i-\hat{p}_i\)
- Randomly sample the training data
- Train a tree model on the random subset using the residuals as the outcome
- Compute the terminal node estimates of the Pearson residuals: \(r_i=\frac{1/n\Sigma_i^n(y_i-\hat{p}_i)}{1/n\Sigma_i^n\hat{p}_i(1-\hat{p}_i)}\)
- Update f：\(f_i=f_i+\lambda f_i^{(j)}\)

- end