Tree-Based Models

Hui Lin @Google

Ming Li @Amazon

Why Tree-Based Models

A Few Definitions

Splitting Criteria

The goal of splitting is to get homogeneous groups.

How to define homogeneity for a classification problem?

\[Gini=p_{1}(1-p_{1})+p_{2}(1-p_{2})\]

\[Entropy=-p\log_{2}p-(1-p)\log_{2}(1-p)\]

How to define homogeneity for a regression problem?

\[SSE=\Sigma_{i\in S_{1}}(y_{i}-\bar{y}_{1})^{2}+\Sigma_{i\in S_{2}}(y_{i}-\bar{y}_{2})^{2}\]
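
A minimal sketch of these three criteria in Python (the function names are illustrative; labels are assumed to be coded 0/1 for the two-class case):

```python
import numpy as np

def gini(y):
    # Two-class Gini impurity: p1*(1 - p1) + p2*(1 - p2)
    p = np.mean(y)
    return p * (1 - p) + (1 - p) * p

def entropy(y):
    # Two-class entropy: -p*log2(p) - (1 - p)*log2(1 - p), with 0*log2(0) taken as 0
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def split_sse(y1, y2):
    # SSE of a candidate split: squared deviations from each child's own mean
    return np.sum((y1 - np.mean(y1)) ** 2) + np.sum((y2 - np.mean(y2)) ** 2)
```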

Splitting Criteria

  1. Gini impurity for “Female” = \(\frac{1}{6}\times\frac{5}{6}+\frac{5}{6}\times\frac{1}{6}=\frac{5}{18}\)
  2. Gini impurity for “Male” = \(0\times1+1\times 0=0\)

The Gini impurity for the node “Gender” is the following weighted average of the above two scores:

\[\frac{3}{5}\times\frac{5}{18}+\frac{2}{5}\times 0=\frac{1}{6}\]
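
A quick check of this arithmetic (the 1/6 vs. 5/6 class proportions and the 3/5 : 2/5 group weights are taken from the example above):

```python
gini_female = (1/6) * (5/6) + (5/6) * (1/6)      # 5/18
gini_male = 0 * 1 + 1 * 0                        # 0: the node is pure
gini_gender = (3/5) * gini_female + (2/5) * gini_male
print(gini_gender)                               # 0.1666... = 1/6
```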

Splitting Criteria

The entropy of the split using variable “gender” can be calculated in three steps:

  1. Entropy for “Female” = \(-\frac{5}{30}log_{2}\frac{5}{30}-\frac{25}{30}log_{2}\frac{25}{30}=0.65\)
  2. Entropy for “Male” = \(-1\log_{2}1-0\log_{2}0=0\) (the node is pure; \(0\log_{2}0\) is taken as 0)
  3. Entropy for the node “Gender” is the weighted average of the above two entropy numbers: \(\frac{3}{5}\times 0.65+\frac{2}{5}\times 0=0.39\)
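
The same calculation in code (numbers taken from the example; the male node is pure, so its entropy is zero):

```python
import numpy as np

entropy_female = -(5/30) * np.log2(5/30) - (25/30) * np.log2(25/30)
entropy_male = 0.0                                           # pure node
entropy_gender = (3/5) * entropy_female + (2/5) * entropy_male
print(round(entropy_female, 2), round(entropy_gender, 2))    # 0.65 0.39
```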

Splitting Criteria

  1. Root SSE: 522.9
  2. SSE for “Female” is 136
  3. SSE for “Male” is 32
  4. SSE for splitting on “Gender” is the sum of the above two numbers: \(136+32=168\)
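
Compared with the root SSE of 522.9, splitting on “Gender” drops the SSE to 168; this reduction is what the tree compares across candidate splits. A quick check:

```python
sse_female, sse_male = 136, 32
sse_split = sse_female + sse_male     # 168
sse_reduction = 522.9 - sse_split     # 354.9
```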

Tree Pruning

Pruning is the process of reducing the size of a decision tree. It reduces the risk of overfitting by limiting the size of the tree or removing sections of the tree that provide little predictive power.

Refer to this section of the book for more detail.
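
One common approach is cost-complexity pruning; here is a minimal sketch using scikit-learn's `ccp_alpha` (not the book's code; `X` and `y` are assumed to be an existing feature matrix and label vector):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Candidate complexity parameters along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Keep the alpha with the best cross-validated accuracy, then refit the pruned tree
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```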

Bagging Tree

A single tree is unstable, and bootstrap aggregation (bagging) is an ensemble method that can effectively stabilize the model.

  1. Build a model on each of \(B\) bootstrap samples of the original data to form an ensemble
  2. For a new sample, each model will give a prediction: \(\hat{f}^1(x),\hat{f}^2(x)\dots,\hat{f}^B(x)\)
  3. The bagged model’s prediction is the average of all the predictions:

\[\hat{f}_{avg}(x)=\frac{1}{B}\Sigma^B_{b=1}\hat{f}^b(x)\]
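
A minimal bagging sketch in Python (assuming `X_train`, `y_train`, and `X_new` are NumPy arrays and a regression tree is the base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_new, B=100, seed=0):
    """Average the predictions of B trees, each fit on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)                     # (1/B) * sum_b f_hat^b(x)
```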

Random Forest

To address one of the disadvantages of bagged trees (i.e., the trees are correlated with each other), random forests were introduced.

  1. Select the number of trees, B
  2. for i=1 to B do
    • generate a bootstrap sample of the original data
    • train a tree on this sample
      • for each split do
        • randomly select \(m\) (\(m<p\)) predictors
        • choose the best one out of the \(m\) and partition the data
      • end
    • use typical tree model stopping criteria to determine when a tree is complete without pruning
  3. end
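
The same recipe as a Python sketch (integer class labels are assumed; in practice `sklearn.ensemble.RandomForestClassifier` bundles all of these steps). Here `max_features=m` restricts each split to \(m\) randomly chosen predictors:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, B=500, m=None, seed=0):
    # Fit B unpruned trees; each split considers only m (< p) randomly selected predictors
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    m = m or max(1, int(np.sqrt(p)))                  # common default for classification
    forest = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))    # bootstrap sample of the original data
        forest.append(DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx]))
    return forest

def forest_predict(forest, X_new):
    # Majority vote over the B trees
    votes = np.stack([tree.predict(X_new) for tree in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```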

Gradient Boosted Machine

Gradient boosting is one of the most commonly used non-deep-learning methods among winning Kaggle solutions. It builds a strong learner from a sequence of weak learners.

  1. Initialize all predictions to the sample log-odds: \(f_{i}=\log\frac{\hat{p}}{1-\hat{p}}\)
  2. for j=1 … M do
    • Compute the predicted event probability: \(\hat{p}_i=\frac{1}{1+\exp[-f_{i}(x)]}\)
    • Compute the residual (i.e., the gradient): \(z_i=y_i-\hat{p}_i\)
    • Randomly sample the training data
    • Train a tree model on the random subset using the residuals as the outcome
    • Compute the terminal node estimates of the Pearson residuals: \(r_i=\frac{\frac{1}{n}\Sigma_i^n(y_i-\hat{p}_i)}{\frac{1}{n}\Sigma_i^n\hat{p}_i(1-\hat{p}_i)}\)
    • Update \(f\): \(f_i=f_i+\lambda f_i^{(j)}\)
  3. end
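
A usage sketch with scikit-learn's `GradientBoostingClassifier`, mapping its parameters onto the steps above (`X`, `y`, and `X_new` are assumed to exist; the library's internals differ in detail from this pseudocode):

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=100,   # M: number of boosting iterations
                                 learning_rate=0.1,  # lambda: shrinkage applied at each update
                                 subsample=0.5,      # random subset of the training data per tree
                                 max_depth=3)        # depth of each weak learner
gbm.fit(X, y)
event_prob = gbm.predict_proba(X_new)[:, 1]          # predicted event probability p_hat
```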