Mixed Effects Machine Learning for High-Cardinality Categorical Variables — Part II: A Demo of the GPBoost Library

A demo of GPBoost in Python & R using real-world data

Fabio Sigrist

Towards Data Science

Illustration of high-cardinality categorical data: box plots and raw data (red points) of the response variable for different levels of a categorical variable — Image by author

High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set. In Part I of this series, we empirically compared different machine learning methods and found that random effects are an effective tool for handling high-cardinality categorical variables, with the GPBoost algorithm [Sigrist, 2022, 2023] achieving the highest prediction accuracy. In this article, we demonstrate how to apply the GPBoost algorithm, which combines tree-boosting with random effects, using the Python and R packages of the GPBoost library. This demo uses GPBoost version 1.2.1.

Table of contents

1 Introduction
2 Data: description, loading, and sample split
3 Training a GPBoost model
4 Choosing tuning parameters
5 Prediction
6 Interpretation
7 Further modeling options
· · 7.1 Interaction between categorical variables and other predictor variables
· · 7.2 (Generalized) linear mixed effects models
8 Conclusion and references

Applying a GPBoost model involves the following main steps:

  1. Define a GPModel in which one specifies the following:
    — A random effects model: grouped random effects via group_data and/or Gaussian processes via gp_coords
    — The likelihood (= distribution of the response variable conditional on fixed and random effects)
  2. Create a Dataset containing the response variable (label) and fixed effects predictor variables (data)
  3. Choose tuning parameters, e.g., using the function gpb.grid_search_tune_parameters (Python) or gpb.grid.search.tune.parameters (R)
  4. Train the model
  5. Make predictions and/or interpret the trained model

In the following, we go through these points step-by-step.
