Learn to optimize hyperparameters effectively and avoid overtrained models for XGBoost, CatBoost, and LightBoost
Gradient boosting techniques such as XGBoost, CatBoost, and LightBoost have gained much popularity in recent years for both classification and regression tasks. An important part of the process is tuning the hyperparameters to get the best model performance. The key is to optimize over the hyperparameter search space while also finding a model that generalizes to new, unseen data. In this blog, I will demonstrate (1) how to learn a boosted decision tree regression model with optimized hyperparameters using Bayesian optimization, (2) how to select a model that generalizes (and is not overtrained), and (3) how to interpret and visually explain the optimized hyperparameter space together with the model performance. The HGBoost library is ideal for this task: among other things, it performs a double-loop cross-validation to protect against overtraining.
Gradient boosting algorithms such as Extreme Gradient Boosting (XGBoost), Light Gradient Boosting (LightBoost, also known as LightGBM), and CatBoost are powerful ensemble machine learning algorithms for predictive modeling (classification and regression tasks) that can be applied to tabular, continuous, and mixed data sets [1, 2, 3]. Here I will focus on the regression task. In the following sections, we will train a boosted decision tree model using double-loop cross-validation. We will carefully split the data set, set up the search space, and perform Bayesian optimization using the Hyperopt library. After training the model, we can interpret the results in more depth by creating insightful plots.
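To make the idea of double-loop (nested) cross-validation concrete before we get to the real libraries, here is a minimal sketch that uses only the Python standard library. It rests on loud assumptions: a one-dimensional ridge regression stands in for the boosted decision tree model, a plain random search stands in for the Bayesian optimization that Hyperopt performs, and every function name (`make_folds`, `cv_score`, `nested_cv`, and so on) is hypothetical and not part of the HGBoost API. The point is only the structure: the inner loop selects the hyperparameter, and the outer loop scores that choice on data the search never saw.

```python
import random
import statistics


def make_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cv_score(xs, ys, lam, folds):
    """Mean validation MSE of one hyperparameter value across the folds."""
    scores = []
    for val in folds:
        val_set = set(val)
        train = [i for fold in folds for i in fold if i not in val_set]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(w, [xs[i] for i in val], [ys[i] for i in val]))
    return statistics.mean(scores)


def nested_cv(xs, ys, outer_k=5, inner_k=4, n_trials=20, seed=0):
    """Return one (best_lam, test_mse) pair per outer fold."""
    rng = random.Random(seed)
    results = []
    for test in make_folds(len(xs), outer_k, rng):
        test_set = set(test)
        dev = [i for i in range(len(xs)) if i not in test_set]
        dev_x, dev_y = [xs[i] for i in dev], [ys[i] for i in dev]

        # Inner loop: random search over a log-uniform range (an assumed
        # stand-in for Bayesian optimization), scored by cross-validation
        # on the development data only.
        inner_folds = make_folds(len(dev), inner_k, rng)
        candidates = [10 ** rng.uniform(-3, 3) for _ in range(n_trials)]
        best_lam = min(
            candidates, key=lambda lam: cv_score(dev_x, dev_y, lam, inner_folds)
        )

        # Outer loop: refit on all development data, then evaluate once on
        # the held-out test fold that the search never touched.
        w = fit_ridge_1d(dev_x, dev_y, best_lam)
        test_mse = mse(w, [xs[i] for i in test], [ys[i] for i in test])
        results.append((best_lam, test_mse))
    return results


# Synthetic toy data (an assumption for illustration): y = 2x + noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(100)]
ys = [2.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

results = nested_cv(xs, ys)
for best_lam, test_mse in results:
    print(f"best lambda: {best_lam:.4f}  held-out MSE: {test_mse:.4f}")
```

If the held-out error in the outer loop is much worse than the inner cross-validation error, the selected hyperparameters are likely overtrained on the development data, which is exactly what the double loop is designed to reveal.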
If you need more background or are not entirely familiar with these concepts, I recommend reading this blog: