Tree-Based Models

Hui Lin @Google

Ming Li @Amazon

Why Tree-Based Models

A Few Definitions

Splitting Criteria

The goal of splitting is to get homogeneous groups.

How to define homogeneity for a classification problem?

\[Gini=p_{1}(1-p_{1})+p_{2}(1-p_{2})\]

\[Entropy=-p\log_{2}p-(1-p)\log_{2}(1-p)\]

How to define homogeneity for a regression problem?

\[SSE=\Sigma_{i\in S_{1}}(y_{i}-\bar{y}_{1})^{2}+\Sigma_{i\in S_{2}}(y_{i}-\bar{y}_{2})^{2}\]
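
A minimal sketch of these three criteria in Python (the function names are illustrative; labels are assumed to be coded 0/1 for the two-class case):

```python
import numpy as np

def gini(y):
    # Two-class Gini impurity: p1*(1 - p1) + p2*(1 - p2)
    p = np.mean(y)
    return p * (1 - p) + (1 - p) * p

def entropy(y):
    # Two-class entropy: -p*log2(p) - (1 - p)*log2(1 - p), with 0*log2(0) taken as 0
    p = np.mean(y)
    if p == 0 or p == 1:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def split_sse(y1, y2):
    # SSE of a candidate split: squared deviations from each child's own mean
    return np.sum((y1 - np.mean(y1)) ** 2) + np.sum((y2 - np.mean(y2)) ** 2)
```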

Splitting Criteria

  1. Gini impurity for “Female” = \(\frac{1}{6}\times\frac{5}{6}+\frac{5}{6}\times\frac{1}{6}=\frac{5}{18}\)
  2. Gini impurity for “Male” = \(0\times1+1\times 0=0\)

The Gini impurity for the node “Gender” is the following weighted average of the above two scores:

\[\frac{3}{5}\times\frac{5}{18}+\frac{2}{5}\times 0=\frac{1}{6}\]
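
A quick check of this arithmetic (the 1/6 vs. 5/6 class proportions and the 3/5 : 2/5 group weights are taken from the example above):

```python
gini_female = (1/6) * (5/6) + (5/6) * (1/6)      # 5/18
gini_male = 0 * 1 + 1 * 0                        # 0: the node is pure
gini_gender = (3/5) * gini_female + (2/5) * gini_male
print(gini_gender)                               # 0.1666... = 1/6
```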

Splitting Criteria

The entropy of the split using variable “gender” can be calculated in three steps:

  1. Entropy for “Female” = \(-\frac{5}{30}log_{2}\frac{5}{30}-\frac{25}{30}log_{2}\frac{25}{30}=0.65\)
  2. Entropy for “Male” = \(-1\log_{2}1-0\log_{2}0=0\) (the node is pure; \(0\log_{2}0\) is taken as 0)
  3. Entropy for the node “Gender” is the weighted average of the above two entropy numbers: \(\frac{3}{5}\times 0.65+\frac{2}{5}\times 0=0.39\)
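
The same calculation in code (numbers taken from the example; the male node is pure, so its entropy is zero):

```python
import numpy as np

entropy_female = -(5/30) * np.log2(5/30) - (25/30) * np.log2(25/30)
entropy_male = 0.0                                           # pure node
entropy_gender = (3/5) * entropy_female + (2/5) * entropy_male
print(round(entropy_female, 2), round(entropy_gender, 2))    # 0.65 0.39
```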

Splitting Criteria

  1. Root SSE: 522.9
  2. SSE for “Female” is 136
  3. SSE for “Male” is 32
  4. SSE for splitting on “Gender” is the sum of the above two numbers: \(136+32=168\)
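
Compared with the root SSE of 522.9, splitting on “Gender” drops the SSE to 168; this reduction is what the tree compares across candidate splits. A quick check:

```python
sse_female, sse_male = 136, 32
sse_split = sse_female + sse_male     # 168
sse_reduction = 522.9 - sse_split     # 354.9
```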

Tree Pruning

Pruning is the process of reducing the size of a decision tree. It reduces the risk of overfitting by limiting the size of the tree or removing sections of the tree that provide little predictive power.

Refer to this section of the book for more detail.
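
One common approach is cost-complexity pruning; here is a minimal sketch using scikit-learn's `ccp_alpha` (not the book's code; `X` and `y` are assumed to be an existing feature matrix and label vector):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Candidate complexity parameters along the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Keep the alpha with the best cross-validated accuracy, then refit the pruned tree
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```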

Bagging Tree

A single tree is unstable, and bootstrap aggregation (bagging) is an ensemble method that can effectively stabilize the model.

  1. Build a model on each of \(B\) bootstrap samples of the original data to form an ensemble
  2. For a new sample, each model will give a prediction: \(\hat{f}^1(x),\hat{f}^2(x)\dots,\hat{f}^B(x)\)
  3. The bagged model’s prediction is the average of all the predictions:

\[\hat{f}_{avg}(x)=\frac{1}{B}\Sigma^B_{b=1}\hat{f}^b(x)\]
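
A minimal bagging sketch in Python (assuming `X_train`, `y_train`, and `X_new` are NumPy arrays and a regression tree is the base learner):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_new, B=100, seed=0):
    """Average the predictions of B trees, each fit on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)                     # (1/B) * sum_b f_hat^b(x)
```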

Random Forest

To address one of the disadvantages of bagged trees (i.e., the trees are correlated with each other), random forests were introduced.

  1. Select the number of trees, B
  2. for i=1 to B do
    • generate a bootstrap sample of the original data
    • train a tree on this sample
      • for each split do
        • randomly select \(m\) (\(m<p\)) predictors
        • choose the best one out of the \(m\) and partition the data
      • end
    • use typical tree model stopping criteria to determine when a tree is complete without pruning
  3. end
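
The same recipe as a Python sketch (integer class labels are assumed; in practice `sklearn.ensemble.RandomForestClassifier` bundles all of these steps). Here `max_features=m` restricts each split to \(m\) randomly chosen predictors:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, B=500, m=None, seed=0):
    # Fit B unpruned trees; each split considers only m (< p) randomly selected predictors
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    m = m or max(1, int(np.sqrt(p)))                  # common default for classification
    forest = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))    # bootstrap sample of the original data
        forest.append(DecisionTreeClassifier(max_features=m).fit(X[idx], y[idx]))
    return forest

def forest_predict(forest, X_new):
    # Majority vote over the B trees
    votes = np.stack([tree.predict(X_new) for tree in forest])
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```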

Gradient Boosted Machine

Gradient boosting is one of the most commonly used non-deep-learning methods among winning Kaggle solutions. It builds a strong learner from a sequence of weak learners.

  1. Initialize all predictions to the sample log-odds: \(f_{i}=\log\frac{\hat{p}}{1-\hat{p}}\)
  2. for j=1 … M do
    • Compute the predicted event probability: \(\hat{p}_i=\frac{1}{1+\exp[-f_{i}(x)]}\)
    • Compute the residual (i.e., the gradient): \(z_i=y_i-\hat{p}_i\)
    • Randomly sample the training data
    • Train a tree model on the random subset using the residuals as the outcome
    • Compute the terminal node estimates of the Pearson residuals: \(r_i=\frac{\frac{1}{n}\Sigma_i^n(y_i-\hat{p}_i)}{\frac{1}{n}\Sigma_i^n\hat{p}_i(1-\hat{p}_i)}\)
    • Update \(f\): \(f_i=f_i+\lambda f_i^{(j)}\)
  3. end
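
A usage sketch with scikit-learn's `GradientBoostingClassifier`, mapping its parameters onto the steps above (`X`, `y`, and `X_new` are assumed to exist; the library's internals differ in detail from this pseudocode):

```python
from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=100,   # M: number of boosting iterations
                                 learning_rate=0.1,  # lambda: shrinkage applied at each update
                                 subsample=0.5,      # random subset of the training data per tree
                                 max_depth=3)        # depth of each weak learner
gbm.fit(X, y)
event_prob = gbm.predict_proba(X_new)[:, 1]          # predicted event probability p_hat
```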