# Gradient Boosting from Theory to Practice (Part 1) | by Dr. Roi Yehoshua | Jul, 2023

Contents

The generalization of gradient boosting to other types of problems (e.g., classification problems) and other loss functions follows from the observation that the residuals hₘ(x) are proportional to the negative gradients of the squared loss function with respect to Fₘ₋₁(x):

Therefore, we can generalize this technique to any differentiable loss function by using the negative gradients of the loss function instead of the residuals.

We will now derive the general gradient boosting algorithm for any differentiable loss function.

Boosting approximates the true mapping from the features to the labels y = f(x) using an additive expansion (ensemble) of the form:

where hₘ(x) are weak learners (or base learners) from some class H (usually decision trees of a fixed size), and M represents the number of learners.

Given a loss function L(y, F(x)), our goal is to find an approximation F(x) that minimizes the average loss on the training set:

Gradient boosting uses an iterative approach to find this approximation. It starts from a model F₀ of a constant function that minimizes the loss:

For example, if the loss function is squared loss (used in regression problems), F₀(x) would be the mean of the target values.

Then, it incrementally expands the model in a greedy fashion:

where the newly added base learner hₘ is fitted to minimize the sum of losses of the ensemble Fₘ:

Finding the best function hₘ at each iteration for an arbitrary loss function L is computationally infeasible. Therefore, we use an iterative optimization approach to get closer to the minimum loss in every iteration. In every iteration, we choose a weak learner that points in the negative gradient direction.

This process is similar to gradient descent, but it operates in the function space rather than the parameter space, i.e., in every iteration we move to a different function in the hypothesis space H, rather than making a step in the parameter space of a specific function h. This allows h to be a non-parametric machine learning model, such as a decision tree. This process is called functional gradient descent.

In functional gradient descent, our parameters are the values of F(x) evaluated at each point x, and we seek to minimize L(yᵢ, F(x)) at each individual x. The best steepest-descent direction of the loss function at every point xis its negative gradient:

gₘ(x) is the derivative of the loss with respect to its second parameter, evaluated at Fₘ₋₁(x).

Therefore, the vector

gives the best steepest-descent direction in the N-dimensional data space at Fₘ₋₁(x). However, this gradient is defined only at the data points (x₁, …, x), and cannot be generalized to other x-values.

In the continuous case, where H is the set of arbitrary differentiable functions on R, we could have simply chosen a function hₘH where hₘ(x) = –gₘ(x).

In the discrete case (i.e., when the set H is finite), we choose hₘ as a function in H that is closest to gₘ(x) at the data points x, i.e., hₘ that is most parallel to the vector –g in Rⁿ. This function can be obtained by fitting a base learner hₘ to a training set {(x, ỹᵢₘ)}, with the labels

These labels are called pseudo-residuals. In other words, in every boosting iteration, we are fitting a base learner to predict the negative gradients of the loss function with respect to the ensemble’s predictions from the previous iteration.

Note that this approach is heuristic, and does not necessarily yield an exact solution to the optimization problem.

The complete pseudocode of the algorithm is shown below:

Gradient tree boosting is a specialization of the gradient boosting algorithm to the case where the base learner h(x) is a fixed-size regression tree.

In each iteration, a regression tree hₘ(x) is fit to the pseudo-residuals. Let Kₘ be the number of its leaves. The tree partitions the input space into Kₘ disjoint regions: R, …, R, and predicts a constant value in each region j, which is the mean of the pseudo-residuals in that region:

Therefore, the function hₘ(x) can be written as the following sum:

These regression trees are built in a top-down greedy fashion using mean squared error as the splitting criterion (see this article for more details on regression trees).

The same algorithm of gradient boosting can also be used for classification tasks. However, since the sum of the trees Fₘ(x) can be any continuous value, it needs to be mapped to a class or a probability. This mapping depends on the type of the classification problem:

1. In binary classification problems, we use the sigmoid function to model the probability that x belongs to the positive class (similar to logistic regression):

The initial model in this case is given by the prior probability of the positive class, and the loss function is the binary log loss.

2. In multiclass classification problems, K trees (for K classes) are built at each of the M iterations. The probability that x belongs to class k is modeled as the softmax of the Fₘ,ₖ(x) values:

The initial model in this case is given by the prior probability of each class, and the loss function is the cross-entropy loss.

As with other ensemble methods based on decision trees, we need to control the complexity of the model in order to avoid overfitting. Several regularization techniques are commonly used with gradient-boosted trees.

First, we can use the same regularization techniques that we have in standard decision trees, such as limiting the depth of the tree, the number of leaves, or the minimum number of samples required to split a node. We can also use post-pruning techniques to remove branches from the tree that fail to reduce the loss by a predefined threshold.

Second, we can control the number of boosting iterations (i.e., the number of trees in the ensemble). Increasing the number of trees reduces the ensemble’s error on the training set, but may also lead to overfitting. The optimal number of trees is typically found by early stopping, i.e., the algorithm is terminated once the score on the validation set does not improve for a specified number of iterations.

Lastly, Friedman [1, 2] has suggested the following regularization techniques, which are more specific to gradient-boosted trees:

## Shrinkage

Shrinkage [1] scales the contribution of each base learner by a constant factor ν:

The parameter ν (0 < ν ≤ 1) is called the learning rate, as it controls the step size of the gradient descent procedure.

Empirically, it has been found that using small learning rates (e.g., ν ≤ 0.1) can significantly improve the model’s generalization ability. However, smaller learning rates also require more boosting iterations in order to maintain the same training error, thereby increasing the computational time during both training and prediction.

In a follow-up paper [2], Friedman proposed stochastic gradient boosting, which combines gradient boosting with bagging.

In each iteration, a base learner is trained only on a fraction (typically 0.5) of the training set, drawn at random without replacement. This subsampling procedure introduces randomness into the algorithm and helps prevent the model from overfitting.

Like in bagging, subsampling also allows us to use the out-of-bag samples (samples that were not involved in building the next base learner) in order to evaluate the performance of the model, instead of having an independent validation data set. Out-of-bag estimates often underestimate the real performance of the model, thus they are used only if cross-validation takes too much time.

Another strategy to reduce the variance of the model is to randomly sample the features considered for split in each node of the tree (similar to random forests).