High-Cardinality Categorical Covariates in Network Regressions

A major challenge in actuarial modelling is how to deal with categorical variables with many levels (i.e. high cardinality). This often arises with a rating factor such as car model, which can take one of thousands of values, some with significant exposure and others with exposure close to zero.

In a new paper with Mario Wüthrich, we show how to incorporate these variables into neural networks using different types of regularized embeddings, including embeddings fitted with variational inference. We consider both standalone variables and variables with a natural hierarchy, which lend themselves to being modelled with recurrent neural networks or Transformers. On a synthetic dataset, the proposed methods provide a significant gain in performance compared to other techniques.
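To give a feel for what regularization does to a sparsely observed level, here is a minimal sketch in plain Python. It is not the architecture from the paper: it assumes a Poisson claims model with a single scalar effect per level on the log scale (roughly what a one-dimensional embedding feeding straight into a log-link output would reduce to) and an L2 penalty on that effect. The function name, the penalty value and the claim counts are all hypothetical and chosen only for illustration.

```python
import math


def fit_penalized_effect(claims, expected_claims, penalty, n_iter=50):
    """Fit one log-scale level effect b by Newton's method.

    Maximizes the L2-penalized Poisson log-likelihood
        claims * b - expected_claims * exp(b) - 0.5 * penalty * b**2,
    where expected_claims is exposure times a baseline frequency (assumed known).
    """
    b = 0.0
    for _ in range(n_iter):
        fitted = expected_claims * math.exp(b)
        gradient = claims - fitted - penalty * b
        hessian = -fitted - penalty  # always negative, so the objective is concave
        b -= gradient / hessian
    return b


# Two hypothetical levels with the same observed-to-expected ratio of 2,
# but very different volumes of data behind that ratio.
effect_dense = fit_penalized_effect(claims=200, expected_claims=100.0, penalty=10.0)
effect_sparse = fit_penalized_effect(claims=2, expected_claims=1.0, penalty=10.0)

print(f"unpenalized effect for both levels: {math.log(2):.3f}")
print(f"penalized effect, high-exposure level: {effect_dense:.3f}")
print(f"penalized effect, low-exposure level:  {effect_sparse:.3f}")
```

Both levels show the same raw ratio of observed to expected claims, but the penalty pulls the low-exposure level much closer to zero (the portfolio average) while leaving the high-exposure level nearly untouched. This is the behaviour one would want from a regularized embedding when exposure varies widely across levels.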

The image below illustrates the problem we are trying to solve: for the most detailed covariate in the synthetic dataset, Vehicle Detail, the observed values can differ vastly from the true values because of sampling error.
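The sampling-error effect described above can be reproduced with a small simulation. The sketch below uses only the Python standard library and does not use the synthetic dataset from the paper. It assumes Poisson claim counts, a hypothetical true claim frequency of 0.10 shared by every level, and two made-up exposure sizes. It then measures how far the observed frequency per level lands from the truth.

```python
import math
import random


def sample_poisson(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def observed_frequency_rmse(rng, exposure, true_frequency, n_levels):
    """Root-mean-square error of observed frequency around the true frequency."""
    squared_errors = []
    for _ in range(n_levels):
        claims = sample_poisson(rng, exposure * true_frequency)
        squared_errors.append((claims / exposure - true_frequency) ** 2)
    return math.sqrt(sum(squared_errors) / n_levels)


rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_FREQUENCY = 0.10    # hypothetical value, identical for every level

# 200 simulated levels with little exposure, and 200 with a lot of exposure.
rmse_low = observed_frequency_rmse(rng, 5.0, TRUE_FREQUENCY, 200)
rmse_high = observed_frequency_rmse(rng, 1000.0, TRUE_FREQUENCY, 200)

print(f"RMSE of observed frequency, exposure 5:    {rmse_low:.4f}")
print(f"RMSE of observed frequency, exposure 1000: {rmse_high:.4f}")
```

Every level has the same true frequency, so all of the spread in observed frequencies is noise. Under the Poisson assumption that noise scales with one over the square root of exposure, which is why the sparsely observed levels look far more extreme than they really are.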

A special thank you to Michael Mayer, PhD for input into the paper and interesting discussions on the topic!