High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set. In Part I of this series, we empirically compared different machine learning methods and found that random effects are an effective tool for handling high-cardinality categorical variables, with the GPBoost algorithm [Sigrist, 2022, 2023] having the highest prediction accuracy. In this article, we demonstrate how the GPBoost algorithm, which combines tree-boosting with random effects, can be applied with the Python and R packages of the `GPBoost` library. `GPBoost` version 1.2.1 is used in this demo.

## Table of contents

- 1 Introduction
- 2 Data: description, loading, and sample split
- 3 Training a GPBoost model
- 4 Choosing tuning parameters
- 5 Prediction
- 6 Interpretation
- 7 Further modeling options
  - 7.1 Interaction between categorical variables and other predictor variables
  - 7.2 (Generalized) linear mixed effects models
- 8 Conclusion and references

Applying a GPBoost model involves the following main steps:

- Define a `GPModel` in which one specifies the following:
  - A random effects model: grouped random effects via `group_data` and/or Gaussian processes via `gp_coords`
  - The `likelihood` (= distribution of the response variable conditional on fixed and random effects)
- Create a `Dataset` containing the response variable (`label`) and fixed effects predictor variables (`data`)
- Choose tuning parameters, e.g., using the function `gpb.grid.search.tune.parameters`
- Train the model
- Make predictions and/or interpret the trained model

In the following, we go through these points step-by-step.