High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set. In Part I of this series, we did an empirical comparison of different machine learning methods and found that random effects are an effective tool for handling high-cardinality categorical variables, with the GPBoost algorithm [Sigrist, 2022, 2023] achieving the highest prediction accuracy. In this article, we demonstrate how the GPBoost algorithm, which combines tree-boosting with random effects, can be applied with the Python and R packages of the GPBoost library. GPBoost version 1.2.1 is used in this demo.
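The code examples below use the Python package. As a minimal setup sketch (assuming a standard pip environment, and assuming the package exposes a __version__ attribute, as most Python packages do):

```python
# Install the GPBoost Python package (version used in this demo):
#   pip install gpboost==1.2.1
import gpboost as gpb

print(gpb.__version__)  # should print 1.2.1
```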
Table of contents
∘ 1 Introduction
∘ 2 Data: description, loading, and sample split
∘ 3 Training a GPBoost model
∘ 4 Choosing tuning parameters
∘ 5 Prediction
∘ 6 Interpretation
∘ 7 Further modeling options
· · 7.1 Interaction between categorical variables and other predictor variables
· · 7.2 (Generalized) linear mixed effects models
∘ 8 Conclusion and references
Applying a GPBoost model involves the following main steps:
- Define a GPModel in which one specifies the following:
  — A random effects model: grouped random effects via group_data and/or Gaussian processes via gp_coords
  — The likelihood (= distribution of the response variable conditional on fixed and random effects)
- Create a Dataset containing the response variable (label) and fixed effects predictor variables (data)
- Choose tuning parameters, e.g., using the function gpb.grid.search.tune.parameters
- Train the model
- Make predictions and/or interpret the trained model
In the following, we go through these points step-by-step.
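As a preview of how these steps fit together, here is a minimal end-to-end sketch with the Python package on simulated data. The simulated data, the grouping variable, and the parameter values are illustrative assumptions, not the data set or tuning choices used in this article:

```python
import gpboost as gpb
import numpy as np

# Simulate illustrative data (hypothetical): two numeric predictors and
# a high-cardinality categorical variable with 100 levels
np.random.seed(1)
n = 1000
X = np.random.rand(n, 2)
group = np.random.randint(0, 100, size=n)
b = 0.5 * np.random.normal(size=100)          # random effects for the 100 levels
y = X[:, 0] + b[group] + 0.1 * np.random.normal(size=n)

# 1. Define a GPModel: grouped random effects and the likelihood
gp_model = gpb.GPModel(group_data=group, likelihood="gaussian")

# 2. Create a Dataset with the response variable (label)
#    and the fixed effects predictor variables (data)
data_train = gpb.Dataset(data=X, label=y)

# 3. Tuning parameters (set by hand here for brevity; see Section 4
#    for systematic tuning)
params = {"learning_rate": 0.1, "max_depth": 3, "verbose": 0}

# 4. Train the model
bst = gpb.train(params=params, train_set=data_train,
                gp_model=gp_model, num_boost_round=100)

# 5. Predict the response for given feature and group values
#    (in-sample here, purely for illustration)
pred = bst.predict(data=X, group_data_pred=group, pred_latent=False)
y_pred = pred["response_mean"]
```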