The generalization of gradient boosting to other types of problems (e.g., classification problems) and other loss functions follows from the observation that the residuals *hₘ*(**x***ᵢ*) are proportional to the negative gradients of the squared loss function with respect to *Fₘ*₋₁(**x***ᵢ*):

$$h_m(\mathbf{x}_i) = y_i - F_{m-1}(\mathbf{x}_i) = -\frac{1}{2}\frac{\partial \left(y_i - F_{m-1}(\mathbf{x}_i)\right)^2}{\partial F_{m-1}(\mathbf{x}_i)}$$

Therefore, we can generalize this technique to any differentiable loss function by using the negative gradients of the loss function instead of the residuals.

We will now derive the general gradient boosting algorithm for any differentiable loss function.

Boosting approximates the true mapping from the features to the labels *y* = *f*(**x**) using an **additive expansion** (ensemble) of the form:

$$F(\mathbf{x}) = \sum_{m=1}^{M} h_m(\mathbf{x})$$

where *hₘ*(**x**) are weak learners (or base learners) from some class *H* (usually decision trees of a fixed size), and *M* represents the number of learners.

Given a loss function *L*(*y*, *F*(**x**)), our goal is to find an approximation *F*(**x**) that minimizes the average loss on the training set:

$$F^* = \arg\min_{F} \frac{1}{n}\sum_{i=1}^{n} L(y_i, F(\mathbf{x}_i))$$

Gradient boosting uses an iterative approach to find this approximation. It starts from a model *F*₀ of a constant function that minimizes the loss:

$$F_0(\mathbf{x}) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$$

For example, if the loss function is squared loss (used in regression problems), *F*₀(**x**) would be the mean of the target values.
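To see why, set the derivative of the total squared loss with respect to the constant *γ* to zero:

$$\frac{d}{d\gamma}\sum_{i=1}^{n}(y_i - \gamma)^2 = -2\sum_{i=1}^{n}(y_i - \gamma) = 0 \;\;\Rightarrow\;\; \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$$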

Then, it incrementally expands the model in a greedy fashion:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + h_m(\mathbf{x})$$

where the newly added base learner *hₘ* is fitted to minimize the sum of losses of the ensemble *Fₘ*:

$$h_m = \arg\min_{h \in H} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(\mathbf{x}_i) + h(\mathbf{x}_i)\right)$$

Finding the best function *hₘ* at each iteration for an arbitrary loss function *L* is computationally infeasible. Therefore, we use an iterative optimization approach to get closer to the minimum loss in every iteration. In every iteration, we choose a weak learner that points in the negative gradient direction.

This process is similar to gradient descent, but it operates in the function space rather than the parameter space, i.e., in every iteration we move to a different function in the hypothesis space *H*, rather than making a step in the parameter space of a specific function *h*. This allows *h* to be a non-parametric machine learning model, such as a decision tree. This process is called **functional gradient descent**.

In functional gradient descent, our parameters are the values of *F*(**x**) evaluated at each point **x**, and we seek to minimize *L*(*yᵢ*, *F*(**x***ᵢ*)) at each individual **x***ᵢ*. The steepest-descent direction of the loss function at every point **x***ᵢ* is its negative gradient:

$$g_m(\mathbf{x}_i) = \left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{m-1}}$$

*gₘ*(**x***ᵢ*) is the derivative of the loss with respect to its second parameter, evaluated at *Fₘ*₋₁(**x***ᵢ*).

Therefore, the vector

$$-\mathbf{g}_m = -\left(g_m(\mathbf{x}_1), \ldots, g_m(\mathbf{x}_n)\right)$$

gives the best steepest-descent direction in the *n*-dimensional data space at *Fₘ*₋₁. However, this gradient is defined only at the data points (**x**₁, …, **x***ₙ*), and cannot be generalized to other **x**-values.

In the continuous case, where *H* is the set of arbitrary differentiable functions on *R*, we could have simply chosen a function *hₘ* ∈ *H* where *hₘ*(**x***ᵢ*) = –*gₘ*(**x***ᵢ*).

In the discrete case (i.e., when the set *H* is finite), we choose *hₘ* as a function in *H* that is closest to *gₘ*(**x***ᵢ*) at the data points **x***ᵢ*, i.e., *hₘ* that is most parallel to the vector –**g***ₘ* in *Rⁿ*. This function can be obtained by fitting a base learner *hₘ* to a training set {(**x***ᵢ*, *ỹᵢₘ*)}, with the labels

$$\tilde{y}_{im} = -g_m(\mathbf{x}_i) = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{m-1}}$$

These labels are called **pseudo-residuals**. In other words, in every boosting iteration, we are fitting a base learner to predict the negative gradients of the loss function with respect to the ensemble’s predictions from the previous iteration.

Note that this approach is heuristic, and does not necessarily yield an exact solution to the optimization problem.

The complete pseudocode of the algorithm is shown below:

1. Initialize the model with a constant value: *F*₀(**x**) = argmin*ᵧ* Σ*ᵢ* *L*(*yᵢ*, *γ*).
2. For *m* = 1 to *M*:
   - Compute the pseudo-residuals *ỹᵢₘ* = –*gₘ*(**x***ᵢ*) for *i* = 1, …, *n*.
   - Fit a base learner *hₘ*(**x**) to the training set {(**x***ᵢ*, *ỹᵢₘ*)}.
   - Update the model: *Fₘ*(**x**) = *Fₘ*₋₁(**x**) + *hₘ*(**x**).
3. Output the final model *F*ₘ(**x**).

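As a concrete illustration, here is a minimal from-scratch sketch of this algorithm for the squared loss, using scikit-learn's `DecisionTreeRegressor` as the base learner (the class and variable names below are my own, not taken from the article's repository):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoostingRegressor:
    """A minimal sketch of gradient boosting for the squared loss,
    where the pseudo-residuals are the ordinary residuals y - F(x)."""

    def __init__(self, n_estimators=100, max_depth=3):
        self.n_estimators = n_estimators
        self.max_depth = max_depth

    def fit(self, X, y):
        # F0 is the constant that minimizes the squared loss: the mean of y
        self.f0_ = np.mean(y)
        self.trees_ = []
        f = np.full(len(y), self.f0_)
        for _ in range(self.n_estimators):
            residuals = y - f  # pseudo-residuals for the squared loss
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            f += tree.predict(X)  # F_m = F_{m-1} + h_m
            self.trees_.append(tree)
        return self

    def predict(self, X):
        f = np.full(X.shape[0], self.f0_)
        for tree in self.trees_:
            f += tree.predict(X)
        return f
```

Each iteration fits a new tree to the current residuals and adds its predictions to the running ensemble, exactly mirroring the update rule above.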
Gradient tree boosting is a specialization of the gradient boosting algorithm to the case where the base learner *h*(**x**) is a fixed-size regression tree.

In each iteration, a regression tree *hₘ*(**x**) is fit to the pseudo-residuals. Let *Kₘ* be the number of its leaves. The tree partitions the input space into *Kₘ* disjoint regions *R*₁*ₘ*, …, *R*ₖₘ*ₘ*, and predicts a constant value in each region *j*, which is the mean of the pseudo-residuals in that region:

$$w_{jm} = \frac{1}{|\{i : \mathbf{x}_i \in R_{jm}\}|} \sum_{\mathbf{x}_i \in R_{jm}} \tilde{y}_{im}$$

Therefore, the function *hₘ*(**x**) can be written as the following sum:

$$h_m(\mathbf{x}) = \sum_{j=1}^{K_m} w_{jm} \mathbf{1}\left(\mathbf{x} \in R_{jm}\right)$$

These regression trees are built in a top-down greedy fashion using mean squared error as the splitting criterion (see this article for more details on regression trees).

The same algorithm of gradient boosting can also be used for classification tasks. However, since the sum of the trees *Fₘ*(**x**) can be any continuous value, it needs to be mapped to a class or a probability. This mapping depends on the type of the classification problem:

1. In binary classification problems, we use the sigmoid function to model the probability that **x***ᵢ* belongs to the positive class (similar to logistic regression):

$$P(y_i = 1 \mid \mathbf{x}_i) = \sigma(F_m(\mathbf{x}_i)) = \frac{1}{1 + e^{-F_m(\mathbf{x}_i)}}$$

The initial model in this case is given by the log-odds of the prior probability of the positive class, and the loss function is the binary log loss.
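For the binary log loss, the pseudo-residual works out to the difference between the true label and the predicted probability, *yᵢ* – *σ*(*F*(**x***ᵢ*)). A short numerical check of this claim (my own sketch, not code from the article's repository):

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def log_loss(y, f):
    # Binary log loss as a function of the raw score f (before the sigmoid)
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Numerical gradient of the loss with respect to the raw score f
y, f, eps = 1.0, 0.3, 1e-6
num_grad = (log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)

# Analytically, dL/df = sigmoid(f) - y, so the negative gradient
# (the pseudo-residual) is y - sigmoid(f)
print(np.isclose(-num_grad, y - sigmoid(f)))
```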

2. In multiclass classification problems, *K* trees (one per class) are built at each of the *M* iterations. The probability that **x***ᵢ* belongs to class *k* is modeled as the softmax of the *Fₘ,ₖ*(**x***ᵢ*) values:

$$P(y_i = k \mid \mathbf{x}_i) = \frac{e^{F_{m,k}(\mathbf{x}_i)}}{\sum_{l=1}^{K} e^{F_{m,l}(\mathbf{x}_i)}}$$

The initial model in this case is given by the prior probability of each class, and the loss function is the cross-entropy loss.

As with other ensemble methods based on decision trees, we need to control the complexity of the model in order to avoid overfitting. Several regularization techniques are commonly used with gradient-boosted trees.

First, we can apply the same regularization techniques used in standard decision trees, such as limiting the depth of the tree, the number of leaves, or the minimum number of samples required to split a node. We can also use post-pruning techniques to remove branches that fail to reduce the loss by a predefined threshold.

Second, we can control the number of boosting iterations (i.e., the number of trees in the ensemble). Increasing the number of trees reduces the ensemble’s error on the training set, but may also lead to overfitting. The optimal number of trees is typically found by **early stopping**, i.e., the algorithm is terminated once the score on the validation set does not improve for a specified number of iterations.
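For example, scikit-learn's gradient boosting estimators support early stopping via the `n_iter_no_change` and `validation_fraction` parameters (a minimal sketch on a synthetic dataset; the parameter values are my own choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Stop once the validation score has not improved for 10 consecutive iterations
clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    validation_fraction=0.1,  # 10% of the data held out for early stopping
    n_iter_no_change=10,
    random_state=42,
)
clf.fit(X, y)
print(clf.n_estimators_)  # actual number of trees built
```

After fitting, the `n_estimators_` attribute reports how many trees were actually built, which may be far fewer than the upper bound of 500.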

Lastly, Friedman [1, 2] has suggested the following regularization techniques, which are more specific to gradient-boosted trees:

## Shrinkage

Shrinkage [1] scales the contribution of each base learner by a constant factor *ν*:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu h_m(\mathbf{x})$$

The parameter *ν* (0 < *ν* ≤ 1) is called the **learning rate**, as it controls the step size of the gradient descent procedure.

Empirically, it has been found that using small learning rates (e.g., *ν* ≤ 0.1) can significantly improve the model’s generalization ability. However, smaller learning rates also require more boosting iterations in order to maintain the same training error, thereby increasing the computational time during both training and prediction.
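In scikit-learn this corresponds to the `learning_rate` parameter. A small sketch of the trade-off on synthetic data (the parameter pairings are my own, chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A smaller learning rate typically needs more trees to reach the same training error
scores = {}
for lr, n_trees in [(1.0, 50), (0.1, 500)]:
    gbr = GradientBoostingRegressor(learning_rate=lr, n_estimators=n_trees, random_state=0)
    gbr.fit(X_train, y_train)
    scores[lr] = gbr.score(X_test, y_test)
    print(f"learning_rate={lr}, n_estimators={n_trees}: test R^2 = {scores[lr]:.3f}")
```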

## Stochastic Gradient Boosting (Subsampling)

In a follow-up paper [2], Friedman proposed stochastic gradient boosting, which combines gradient boosting with bagging.

In each iteration, a base learner is trained only on a fraction (typically 0.5) of the training set, drawn at random without replacement. This subsampling procedure introduces randomness into the algorithm and helps prevent the model from overfitting.

As in bagging, subsampling also allows us to use the **out-of-bag samples** (samples that were not used to fit a given base learner) to evaluate the performance of the model, instead of having an independent validation data set. Out-of-bag estimates often underestimate the real performance of the model, so they are typically used only when cross-validation takes too much time.

Another strategy to reduce the variance of the model is to randomly sample the features considered for split in each node of the tree (similar to random forests).
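Both forms of randomization are exposed in scikit-learn's implementation via the `subsample` and `max_features` parameters; when `subsample` < 1.0, the fitted model also records out-of-bag estimates (a sketch with synthetic data and my own parameter choices):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

gbr = GradientBoostingRegressor(
    n_estimators=100,
    subsample=0.5,        # stochastic gradient boosting: each tree sees half the samples
    max_features="sqrt",  # random feature subsampling at each split
    random_state=0,
)
gbr.fit(X, y)

# Out-of-bag improvement in loss, one entry per boosting iteration
# (available only when subsample < 1.0)
print(gbr.oob_improvement_.shape)
```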

You can find the code examples of this article on my github: https://github.com/roiyeho/medium/tree/main/gradient_boosting

Thanks for reading!

[1] Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.

[2] Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367–378.