Ideas from IDSC 2019

About a week ago, I attended the second Insurance Data Science Conference held at ETH Zürich. On a personal note, I am very grateful to the conference organizers for inviting me to give a keynote, and my deck from that presentation is here. Making the conference extra special for me was the opportunity to meet the faculty of ETH Zürich’s RiskLab, who have written some of the best textbooks and papers on the actuarial topics that I deal with in my professional capacity.

This was one of the best organized events I have attended, from the beautiful location of the conference dinner at the Zürich guild house (shown below) to the hard choices between simultaneous sessions. It was great to see the numerous insurance professionals, academics and students who were present – the growth in the number of conference attendees from previous years is witness to the huge current interest in data science in insurance, which I am sure will help create tangible benefits for the industry, and the policyholders it serves.

In this post I will discuss some of the interesting ideas presented at IDSC 2019 that stand out in my memory. If any of these snippets spark interest, the full presentations can be found at the conference website here.

Evolution of Insurance Modelling

It is interesting to observe the impact on modelling techniques caused by the availability of data at a more granular level than previously, or by a recognition of the potential benefits of better exploiting traditional data. I would categorize this impact as a move towards more empirical modelling, still framed within the classical actuarial models, and I explain this by examining some of the standout talks for me that fell into this category. Within my talk, I showed the following slide, which discusses the split between those actuarial tasks driven primarily by models, versus those driven by empirical relationships found within datasets. Many of the talks I discuss cover proposals to make tasks that are today more model driven, more empirically driven.

One of the sessions was structured with a focus on reserving techniques. Alessandro Carrato presented an interesting technique that adapts the chain ladder method within an unsupervised learning framework. This technique is used for reserving for IBNeR on reported claims and works by clustering claims trajectories in a 2d space comprised of claims paid and outstanding loss reserves. Loss development factors are then calculated from the more developed claims in each cluster. Thus, the traditional approach of finding “homogeneous” lines of business, which is usually done subjectively, is here replaced by unsupervised learning. Another reserving talk, by Jonas Crevecoeur, investigated the possibility of reserving at a more granular level using several GLMs, which were shown to reduce to more traditional techniques depending on the choice of GLM covariates.
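The mechanics can be sketched as follows (a toy illustration of the idea, not Carrato’s implementation – the simulated claims, the use of k-means and the cluster count are all my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 60 claims, cumulative paid amounts over 4 development
# periods. Two latent claim types with different payout speeds stand in for
# the "homogeneous groups" usually chosen by hand.
slow = np.cumsum(rng.gamma(2.0, 100, size=(30, 4)), axis=1)
fast = np.cumsum(rng.gamma(8.0, 100, size=(30, 4)), axis=1)
paid = np.vstack([slow, fast])

# Crude k-means (k=2) on the trajectories, in place of judgemental segmentation.
def kmeans(X, k=2, iters=20, seed=1):
    r = np.random.default_rng(seed)
    centres = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        centres = np.array(
            [X[labels == j].mean(0) if (labels == j).any() else centres[j]
             for j in range(k)]
        )
    return labels

labels = kmeans(paid)

# Chain-ladder development factors per cluster: total paid in period d+1
# divided by total paid in period d, across the cluster's claims.
for j in range(2):
    grp = paid[labels == j]
    if len(grp):
        factors = grp[:, 1:].sum(0) / grp[:, :-1].sum(0)
        print(j, np.round(factors, 3))
```

Less developed claims in a cluster would then be projected forward with that cluster’s factors, exactly as in the classical chain ladder.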

Within the field of mortality modelling, Andrew Cairns presented on a new dataset covering mortality in the UK split by small geographic areas. This dataset also includes several static variables describing the circumstances of each of these areas, such as deprivation index, education, weekly income and nursing homes, allowing for the modelling of granular mortality rates depending on these covariates. This presentation took a very interesting approach – firstly, an overall national mortality rate was calculated, and then the mortality rate in each area was compared to the national rate in a typical “actual versus expected” analysis. Models were then estimated to explain this AvE analysis in terms of the covariates, as well as in terms of the geographic location of each area. An interesting finding was that income deprivation is an important indicator of excess mortality at the older ages, whereas unemployment is more important at the younger ages.
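The two-step AvE approach can be sketched in a few lines (all numbers invented for illustration): compute national age-specific rates, then the ratio of each area’s actual deaths to the deaths expected at national rates.

```python
# Hypothetical deaths and central exposures for three small areas at two ages.
deaths = {"A": {"65": 30, "85": 120}, "B": {"65": 22, "85": 95}, "C": {"65": 45, "85": 150}}
expos  = {"A": {"65": 3000, "85": 800}, "B": {"65": 2500, "85": 700}, "C": {"65": 3200, "85": 850}}

ages = ["65", "85"]

# Step 1: national mortality rate per age (total deaths over total exposure).
national = {a: sum(deaths[k][a] for k in deaths) / sum(expos[k][a] for k in expos)
            for a in ages}

# Step 2: actual-versus-expected ratio per area; covariates such as income
# deprivation would then be used to explain these ratios in a model.
ave = {k: sum(deaths[k].values()) / sum(national[a] * expos[k][a] for a in ages)
       for k in deaths}
print(ave)
```

By construction the expected deaths sum to the actual deaths across all areas, so the AvE ratios average out to one when weighted by expected deaths.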

Another talk on mortality modelling was given by Andrés Villegas, who cast traditional mortality models into what I would call a feature engineering context. In other words, many traditional mortality models, such as the Cairns-Blake-Dowd model, can be expressed as a regression of the mortality rate on a number of features, or basis functions, which represent different combinations of age, period and cohort effects. The method basically proceeds by setting up a very large number of potential features, and then selecting among them using the grouped lasso technique (which gives zero weight to most features, i.e. performs feature selection). A very similar idea has appeared in the reserving literature from Gráinne McGuire, Greg Taylor and Hugh Miller (link). This talk epitomized for me the shift to more empirical techniques, within a field that has traditionally been defined by models and competing model specifications (Gompertz vs Kannisto, Lee-Carter vs Cairns-Blake-Dowd etc).
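A minimal sketch of the idea, using a plain lasso via proximal gradient rather than the grouped lasso of the talk; the basis functions, synthetic data and penalty level are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log mortality rates over ages 0..9 and years 0..19, generated
# from an age effect plus a linear period decline, with a little noise.
ages, years = np.arange(10), np.arange(20)
A, T = np.meshgrid(ages, years, indexing="ij")
y = (-8 + 0.08 * A - 0.02 * T).ravel() + rng.normal(0, 0.02, 200)

# Candidate features: age dummies, period trend, age-period interaction, plus
# deliberately irrelevant noise columns that the lasso should discard.
X = np.column_stack(
    [(A == a).ravel().astype(float) for a in ages]
    + [T.ravel(), (A * T).ravel()]
    + [rng.normal(size=200) for _ in range(5)]
)
X = (X - X.mean(0)) / X.std(0)
yc = y - y.mean()

# Plain lasso via ISTA (proximal gradient descent with soft thresholding).
def lasso(X, y, lam=1.0, iters=2000):
    b = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant
    for _ in range(iters):
        g = b - step * X.T @ (X @ b - y)
        b = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
    return b

beta = lasso(X, yc, lam=5.0)
print("non-zero features:", np.flatnonzero(np.abs(beta) > 1e-8))
```

The grouped lasso used in the talk works the same way, except that whole blocks of related basis functions are shrunk to zero together rather than individual columns.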

Keeping it safe

A topic touched on by some speakers was the need to manage new, emerging risks arising due to advanced algorithms and open source software. Jürg Schelldorfer presented an excellent view of how to apply machine learning models within a highly regulated industry such as insurance. Some of his ideas were to focus on prediction uncertainty, and to provide questions to be answered when peer reviewing ML models. I highly recommend this presentation if you are going on the ML journey within an established company!

Jeffrey Bohn also spoke about this theme, emphasizing “algorithmic risks” – risks arising from poor data used to calibrate ML algorithms, or from malpractice during algorithmic design and calibration.

Within this section, I would also mention the amazing morning keynote by Professor Buhmann, who presented on an alternative to the paradigm of empirical risk minimization, used often to train ML models. The extent of the knowledge of ML theory shown in this talk was breathtaking, and I am excited to delve into Professor Buhmann’s work in more detail (link). The lesson here for me was that it is a mistake to assume that ML methodology is “cut and dried”, and that by building more knowledge about alternative methods, one can hopefully understand some of the risks implied by these techniques.

R – the language for insurance data science

The IDSC began life as the R in Insurance conference, and in this respect, many interesting talks covered innovative R packages. Within the sessions I attended, Daphné Giorgi presented an R package for simulating human populations based on individuals, which showed excellent performance due to the implementation of some of the algorithms in C++. Kornelius Rohmeyer presented a very promising package called DistrFit, which, as the name implies, is helpful for fitting distributions to insurance claims. This package is a very neat Shiny app, which automates some of the drudge work of fitting claims distributions in R. I hope this one gets a public release soon! Other notable packages are Silvana Pesenti’s SWIM package, which implements methods for sensitivity analysis of stochastic models, and the interesting use of Hawkes processes by Alexandre Boumezoued for predicting cyber claims.

I would also mention the excellent presentation on TensorFlow Probability by Roland Schmid. TF Probability offers many possibilities of incorporating a probabilistic view into Keras deep learning models (amongst other things) and it is exciting that RStudio is in the process of porting this package from Python to R.


The above is a sample of the excellent talks presented (biased towards my own interests), and I have not done justice to the rest of the talks on the day.

I look forward to IDSC 2020 and wish the organizers every success as this conference grows from strength to strength!

Industry publications – Sigma/Lloyd’s

Some of my favorite reading on insurance related topics comes from Swiss Re’s Sigma series and Lloyd’s of London’s emerging risk teams.

The latest Swiss Re Sigma publication covers the CAT events of 2018, which were driven mainly by “secondary perils”:

The Lloyd’s of London reports cover the impacts of AI on insurance and the risks of robotics:

Some interesting new articles

– An excellent tutorial article by Jürg Schelldorfer and Mario Wüthrich showing how to apply a hybrid GLM/neural net for pricing. The paper is here:

– This paper uses a recurrent neural network (LSTM) to forecast the time parameters of a Lee-Carter model, and the results look very promising – much better than using an ARIMA model:

– Lastly, this paper proposes an interesting combination of a decision tree model with Bühlmann-Straub credibility:

Great to see the state of the art being advanced on so many fronts!

Neural networks, the Lee-Carter Model and Large Scale Mortality Forecasting

This post discusses a new paper that I am very glad to have co-authored with Mario Wüthrich, in which we apply deep neural networks to mortality forecasting. The draft can be found here:


A topic that I have been interested in for a long time is forecasting mortality rates, perhaps because this is one of the interesting intersections of statistics (and, these days, machine learning) with the field of actuarial science in life insurance. Several methods to model mortality rates over time have been proposed, ranging from the relatively simple method of extrapolating mortality rates directly using time series, to more complicated statistical approaches.

One of the most famous of these is the Lee-Carter method, which models mortality as an average mortality rate that changes over time. The change over time is governed by a time-based mortality index, which is common to all ages, and an age-specific rate of change factor:

ln(m_x,t) = a_x + b_x · κ_t

where m_x,t is the force of mortality at age x in year t, a_x is the average log mortality rate at age x during the period, κ_t is the time index in year t and b_x is the rate of change of log mortality with respect to the time index at age x.

How are these quantities derived? There are two methods prominent in the literature – applying Principal Components Analysis, or Generalized Non-linear Models, which differ from GLMs in that the user can specify non-additive relationships between two or more terms. To forecast mortality, models are first fit to historical mortality data and the coefficients (in the case of the Lee-Carter model, the vector κ) are then forecast using a time series model, in a second step.
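The classical two-step fit can be sketched with the SVD (the workhorse behind PCA) on synthetic data; this is a minimal illustration of the textbook procedure, not the GNM approach and not the code behind the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log mortality surface: 50 ages by 40 years, generated from a
# Lee-Carter-like structure plus noise.
n_age, n_year = 50, 40
ax_true = -8 + 0.09 * np.arange(n_age)
bx_true = np.full(n_age, 1.0 / n_age)
kt_true = np.cumsum(rng.normal(-1.0, 0.3, n_year))
log_m = ax_true[:, None] + bx_true[:, None] * kt_true[None, :]
log_m += rng.normal(0, 0.01, (n_age, n_year))

# Step one: a_x is the row mean, and (b_x, k_t) come from the leading
# singular vectors of the centred surface, with the usual identifiability
# constraints sum(b_x) = 1 and sum(k_t) = 0.
ax = log_m.mean(axis=1)
U, S, Vt = np.linalg.svd(log_m - ax[:, None], full_matrices=False)
bx = U[:, 0] / U[:, 0].sum()
kt = S[0] * Vt[0] * U[:, 0].sum()

# Step two: forecast k_t as a random walk with drift (the standard choice)
# and rebuild the projected mortality surface.
drift = np.diff(kt).mean()
kt_forecast = kt[-1] + drift * np.arange(1, 11)
log_m_forecast = ax[:, None] + bx[:, None] * kt_forecast[None, :]
print(log_m_forecast.shape)
```

In practice an ARIMA model is often fitted to κ_t instead of the plain random walk with drift, but the two-step structure is the same.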

In the current age of big data, relatively high quality mortality data spanning an extended period are available for many countries from the excellent Human Mortality Database, which is a resource that anyone with an interest in the study of mortality can benefit from. Other interesting sources are databases containing sub-national rates for the USA, Japan and Canada. The challenge, though, is how to model all of these data simultaneously to improve mortality forecasts. While some extensions of basic models like the Lee-Carter model have been proposed, these rely on assumptions that might not necessarily be applicable in the case of large scale mortality forecasting. For example, some of the common multi-population mortality models rely on the assumption of a common mortality trend for all countries, which is likely not the case.

In the paper, we tackle this problem in a novel way – feed all the variables to a deep neural network and let it figure out how exactly to model the mortality rates over time. This speaks to the idea of representation learning that is central to modern deep learning, which is that many datasets, such as large collections of images as in the ImageNet dataset, are too complicated to model by hand-engineering features, or it is too time consuming to perform the modelling. Rather, the strategy in deep learning is to define a neural network architecture that expresses useful priors about the data, and allow the network to learn how the raw data relates to the problem at hand. In the example of modelling mortality rates, we use two architectural elements that are common in applications of neural networks to tabular data:

  • We use a deep network; in other words, the network consists of multiple layers of variables, expressing the prior that complex features can be represented by a hierarchy of simpler representations learned in the model.
  • Instead of using one-hot encoding to signify to the network when we are modelling a particular country, or gender, we use embedding layers. When applied to many categories, one-hot encoding produces a high-dimensional feature vector that is sparse (i.e. many of the entries are zero), leading to potential difficulties in fitting a model as there might not be enough data to derive credible estimates for each category. Even if there is enough data, as in our case of mortality rates, each parameter is learned in isolation, and the estimated parameters do not share information, unless the modeller explicitly chooses to use something like credibility or mixed models. The insight of Bengio et al. (2003) to solve these problems is that categorical data can successfully be encoded into low dimensional, dense numerical vectors, so, for example, in our model, country is encoded into a five-dimensional vector.
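The difference between the two encodings can be sketched in a few lines (a toy illustration; the country list is made up, and the table is randomly initialised where training would actually learn it, with five dimensions chosen to match the paper’s country embedding):

```python
import numpy as np

rng = np.random.default_rng(0)
countries = ["CHE", "DEU", "FRA", "GBR", "JPN", "USA", "ZAF"]

# One-hot: a sparse 7-dimensional vector per country, giving one independent
# parameter per category in the downstream regression.
one_hot = np.eye(len(countries))

# Embedding: a dense lookup table mapping each country to a low-dimensional
# numerical vector. Training moves the vectors of similarly-behaving
# countries close together, so information is shared across categories.
emb_dim = 5
embedding = rng.normal(0, 0.1, size=(len(countries), emb_dim))

def encode(name):
    return embedding[countries.index(name)]

print(one_hot[countries.index("CHE")])   # sparse, mostly zeros
print(encode("CHE"))                     # dense, low-dimensional
```

With hundreds of categories the one-hot matrix becomes very wide and sparse, while the embedding table stays at a fixed, small width.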

In the paper, we also show how the original Lee-Carter model can be expressed as a neural network with embeddings!

Here is a picture of the network we have just described:

In the paper, we also employ one of the most interesting techniques to emerge from the computer vision literature in the past several years. The original insight is due to the authors of the ResNet paper, who analysed the well-known problem that it is often difficult to train deep neural networks. They considered that a deep neural network should be no more difficult to train than a shallow network, since the deeper layers could simply learn the identity function, and thus be equivalent to a shallow network. Without going too far off track into these details, their solution is simple – add skip layers that connect the deep layers to more shallow layers in the network. This idea is expanded on in the DenseNet architectures. We simply added a connection between the feature layer, and the fifth layer of the network, connecting the embedding layers almost to the deepest layer of the network. This boosted the performance of the networks considerably.
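The shape of this skip connection can be shown with a toy forward pass (a sketch only – the layer sizes are invented, and the real model concatenates layers inside Keras rather than in raw numpy):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    # A fully-connected layer with ReLU activation.
    return np.maximum(x @ W + b, 0.0)

# Hypothetical five-layer network on a 10-dimensional feature vector
# (think of it as the concatenated embeddings), hidden width 16.
d, width = 10, 16
Ws = [rng.normal(0, 0.3, (d, width))] + [rng.normal(0, 0.3, (width, width)) for _ in range(4)]
bs = [np.zeros(width) for _ in range(5)]

def forward(x, skip=True):
    h = x
    for W, b in zip(Ws, bs):
        h = dense(h, W, b)
    if skip:
        # Skip connection: concatenate the raw feature layer onto the
        # deepest hidden layer, giving the output direct access to the
        # inputs rather than forcing everything through five layers.
        h = np.concatenate([x, h])
    return h

x = rng.normal(size=d)
print(forward(x, skip=False).shape, forward(x, skip=True).shape)
```

The gradient can then flow straight from the output to the embeddings through the concatenation, which is the mechanism behind the easier training that ResNet-style architectures report.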

We found that the deep neural networks dramatically outperformed the competing methods that we tested, forecasting with the lowest MSE in 51 out of the 76 instances we tested! Here is a table comparing the methods, and see the paper for more details:

Lastly, an interesting property of the embedding layers learned by neural networks is the fact that the parameters of these layers are often interpretable as so-called “relativities” (to use some actuarial jargon), in other words, as defining the relationship between the different values that a variable may take.  Here is a picture of the age embedding, which shows that the main relationship learned by the network is the characteristic shape of a modern life table:

This is a rather striking result, since at no time did we specify this to the network! Also, once the architecture was specified, the network learned to forecast mortality rates more successfully than human-specified models, reminding me of one of the desiderata for AI systems listed by Bengio (2009):

“Ability to learn with little human input the low-level, intermediate, and high-level abstractions that would be useful to represent the kind of complex functions needed for AI tasks.”

We would value any feedback on the paper you might have.


Bengio, Y. 2009. “Learning deep architectures for AI”, Foundations and trends® in Machine Learning 2(1):1-127.

AI in Actuarial Science – #ASSA2018

I will be speaking about deep learning and its applications to actuarial work at the 2018 ASSA Convention in Cape Town, this Thursday 25 October. Hope to see you there!

Here are:

  • the slides of the talk:

AI in Act Sci – slides

  • accompanying paper:


  • code to reproduce some of the results:



Thoughts on writing “AI in Actuarial Science”

This is a follow up post to something I wrote a few months ago, on the topic of AI in Actuarial Science. Over the intervening time, I have been writing a paper for the ASSA 2018 Convention in Cape Town on this topic, a draft of which can be found here:

and code here:

I would value feedback from anyone who has time to read the paper.

This post, though, is about the process of writing the paper and some of the issues I encountered. Within the confines of an academic paper it is often hard, and perhaps mostly irrelevant, to express some thoughts and opinions, and in this blog post I hope to share some of these ideas that did not make it into the paper. I am not going to spend much time defining terms; if some terminology is unclear, referring to the paper should help to clarify it.

Vanilla ML techniques might not work

Within the paper I try to apply deep learning to the problem addressed in the excellent tutorial paper of Noll, Salzmann and Wüthrich (2018) which is about applying machine learning techniques to a French Motor 3rd Party Liability (MTPL) dataset. They achieve some nice performance boosts over a basic GLM with some hand engineered features using a boosted tree and a neural network.

One of the biggest shocks when I decided to try this problem myself was that off-the-shelf tools like XGBoost did not work well at all – in fact, the GLM was far better, despite the many hyper-parameter settings that I tried. I also tried the mboost package, but the dataset was too big for the 16 GB of RAM on my laptop.

So the first mini-conclusion is that just because you have tabular data (i.e. structured data with rows for observations and columns for variables, as in SQL), you should not automatically assume that a fancy ML approach will outperform a basic statistical one. Anecdotally, I am hearing from several different people that applying vanilla techniques to pricing problems doesn’t provide much of a performance boost.

To this point, I recommend Frank Harrell’s excellent blog post on ML versus statistical techniques, and about when to apply which:

Vanilla DL techniques might not work either

This was perhaps the most vexing part of the process. Fitting deep networks with ReLU activations to the French dataset, as the more up-to-date sources on deep learning seem to suggest, also did not work all that well! In fact, I achieved only poor performance on a network fit to data without manual feature engineering. Another issue is that depth didn’t seem to help all that much.

Similarly, naively coded deep autoencoders for the mortality data discussed in the paper just did not converge, despite many attempts at tuning the hyperparameters – a major lesson learned while writing the paper. I only managed to find a decent solution using greedy unsupervised learning (Hinton and Salakhutdinov 2006) of the autoencoder layers.

Therefore, a conclusion: if you encounter a problem to which you want to apply deep learning, be aware that ReLUs plus depth might not work, and you might need to dig into the literature a bit!

When DL works, it really works!

This is connected to the next idea. Once I found a way of training the autoencoders, the results were fantastic and by far exceeded my expectations (and the performance of the Lee-Carter benchmark for mortality forecasting). Also, once I had the embedding layers working on the French MTPL dataset, the results were better than any other technique I could (or can) find. I was also impressed by the intuitive meaning of the learned embeddings, which I discuss in some detail in the paper, and the fact that plugging these embeddings back into the vanilla GLM resulted in a substantial performance boost.

The flexibility of the neural networks that can be fit with modern software, like Keras, is almost unlimited. Below is what I call a “learned exposure” network which has a sub-network to learn an optimal exposure measure for each MTPL policy. I have not encountered a similarly flexible and powerful system in any other field of statistics or machine learning.

Is this really AI?

One potential criticism of the title of the paper is that this isn’t really AI, but rather fancy regression modelling. I try to argue in Section 3 of the paper that Deep Learning is an approach to Machine Learning whereby you allow the algorithm to figure out the features that are important (instead of designing them by hand).

This is one of the desiderata for AI listed by Bengio (2009) on page 10 of that work – “Ability to learn with little human input the low-level, intermediate, and high-level abstractions that would be useful to represent the kind of complex functions needed for AI tasks.”

Do I think that my trained Keras models are AI? Absolutely not. But the fact that the mortality model has figured out the shape of a life table (i.e. the function ax in the Lee-Carter model) without any inputs besides year/age/gender/region and target mortality rates should make us pause to think about the meaningful features captured by deep neural nets. Here is the relevant plot from the paper – consider “dim1”:

This gets even more interesting in NLP applications, such as in Mikolov, Sutskever, Chen et al. (2013) who provide this image which shows that their deep network has captured the semantic meaning of English words:

Also, that deep nets seem to be able to perform “AI tasks” (the term used by Bengio, Courville and Vincent (2013) to mean tasks “which are challenging for current (shallow, my addition) machine learning algorithms, and involve complex but highly structured dependencies”) such as describing images indicates that something more than simple regression is happening in these models.

DL is empirical, not yet scientific

An in-joke that seems to have made the rounds is so-called “gradient descent by grad student” – in other words, it is difficult to find optimal deep learning models, and one needs to fiddle around with designs and optimizers until something that works is found. This is much easier if you have a team of graduate students who can do this for you, hence the phrase quoted above. What this means in practice is that there is often no off-the-shelf solution, and little or no theory to guide you in what might work or not, leading to lots of experimenting with different ideas until the networks perform well.

AI in Actuarial Science is a new topic but there are some pioneers

The traditional actuarial literature has not seen many contributions dealing with deep neural networks, yet. Some of the best work I found, which I highly recommend to anyone interested in this topic, is a series of papers by Mario Wüthrich and his collaborators (Gabrielli and Wüthrich 2018; Gao, Meng and Wüthrich 2018; Gao and Wüthrich 2017; Noll, Salzmann and Wüthrich 2018; Wüthrich 2018a, b; Wüthrich and Buser 2018; Wüthrich 2017). What is great about these papers is that the ideas are put on a firm mathematical basis and discussed within the context of profound traditional actuarial knowledge. I have little doubt that once these ideas take hold within the mainstream of the actuarial profession, they will have a huge impact on the practical work performed by actuaries, as well as on the insurance industry.

Compared to statistical methods, though, there are still big gaps in understanding the parameter/model risk of these deep neural networks, and an obvious next step is to try to apply some of the techniques used for parameter risk of statistical models to deep nets.

The great resources available to learn about and apply ML and DL

There are many excellent resources available to learn about Machine and Deep Learning that I discuss in the resources sections of the paper, and, best of all, most of these are free, except for opportunity costs.

Lastly, a word about Keras, which is the high-level API that makes fitting deep neural models easy. This is a phenomenally well put together package, and the R interface makes it much more accessible to actuaries who might not be familiar with Python. I highly recommend Keras to anyone interested in experimenting with these models, and Keras will be able to handle most tasks thrown at it, as long as you don’t try anything too fancy. One thing I wanted to try but couldn’t figure out was how to add an autoencoder layer to a supervised model where the inputs are the outputs of a previous layer – one of the few examples where I ran into a limitation in Keras.


Bengio, Y. 2009. “Learning deep architectures for AI”, Foundations and trends® in Machine Learning 2(1):1-127.

Bengio, Y., A. Courville and P. Vincent. 2013. “Representation learning: A review and new perspectives”, IEEE transactions on pattern analysis and machine intelligence 35(8):1798-1828.

Gabrielli, A. and M. Wüthrich. 2018. “An Individual Claims History Simulation Machine”, Risks 6(2):29.

Gao, G., S. Meng and M. Wüthrich. 2018. Claims Frequency Modeling Using Telematics Car Driving Data. SSRN. Accessed: 29 June 2018.

Gao, G. and M. Wüthrich. 2017. Feature Extraction from Telematics Car Driving Heatmaps. SSRN. Accessed: June 29 2018.

Hinton, G. and R. Salakhutdinov. 2006. “Reducing the dimensionality of data with neural networks”, Science 313(5786):504-507.

Mikolov, T., I. Sutskever, K. Chen, G. Corrado et al. 2013. “Distributed representations of words and phrases and their compositionality,” Paper presented at Advances in neural information processing systems. 3111-3119.

Noll, A., R. Salzmann and M. Wüthrich. 2018. Case Study: French Motor Third-Party Liability Claims. SSRN. Accessed: 17 June 2018.

Wüthrich, M. 2018a. Neural networks applied to chain-ladder reserving. SSRN. Accessed: 1 July 2018.

Wüthrich, M. 2018b. v-a Heatmap Simulation Machine. Accessed: 1 July 2018.

Wüthrich, M. and C. Buser. 2018. Data analytics for non-life insurance pricing. Swiss Finance Institute Research Paper. Accessed: 17 June 2018.

Wüthrich, M.V. 2017. “Covariate selection from telematics car driving data”, European Actuarial Journal 7(1):89-108.


Thoughts on the International Congress of Actuaries 2018

I had to get a couple more CPD hours done and the ICA 2018 conference came along at exactly the right time! This time around, a virtual option was offered and the Actuarial Society of South Africa (ASSA) organized access for all of its members – a really great move by ASSA in my opinion, and I hope that others benefited as much as I did from this intellectually stimulating event.

I listened with a focus on P&C insurance (I prefer the American term, but in other jurisdictions: SA – short-term, UK – GI, Europe – non-Life), so my comments that follow don’t take account of the sessions on other actuarial areas that I have no doubt were also very worthwhile.

In a previous post I advanced the view that actuarial science is not standing still as a discipline, and that comments such as “AI instead of actuaries” are short-sighted. I am glad to report that the discipline is moving forward rapidly to incorporate machine learning and data science into its toolbox – of the 28 sessions I listened to (I needed a lot of CPD!), 9 mentioned machine learning/data science and had some advice on methods or integrating ML into actuarial practice. Another good sign is that some of the leading researchers speaking at the conference – such as Paul Embrechts and Mario Wüthrich – provided their (positive) thoughts on integrating ML and data science into actuarial science. The actuarial world is moving forward rapidly and I think the prospects for the profession are good, if the actuarial associations around the world recognize the trends quickly and incorporate ML/data science and more into the curricula.

My standout favourite session was by Mario Wüthrich (whom some actuaries will recognize as one of the co-authors of the Merz-Wüthrich one-year reserve risk formula), who presented on his paper “Neural networks applied to Chain-ladder reserving”, available on SSRN. Besides the new method he suggests, which I think is one of the best solutions when an actuary needs to reserve for IBNR by sub-category (such as LOB/injury code etc), I found the perspectives on ML interspersed through his talk fascinating, an example being the connection of neural networks to Hilbert’s 13th problem. One point made was that claims that new algorithms can reserve with much less uncertainty than the chain-ladder need to be treated with caution until the issue of model risk is dealt with, and the underlying assumptions are brought out into the open.

A brilliant session was given by Paul Glasserman on “Robust model risk assessment” (the paper is ungated if you google it). At the heart of the idea is that model risk could be defined as an alternative probability measure (i.e. if the model generates a baseline probability distribution on a random variable of interest, then model risk could arise if in fact the RV followed a different probability distribution) attached to the simulations presented by a stochastic model (instead of an issue with the model parameters or structure). With this idea in hand, the presentation carried on to show how to find the maximally damaging alternative probability measure for a given level of model risk, as measured by the relative entropy between the baseline and alternative models. The major benefit for actuaries is that this is simple to implement in practice and gives rise to some interesting insights into what can go wrong with a model.
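The core mechanism can be illustrated with a small sketch (my own toy version, not Glasserman’s implementation): for a relative-entropy penalty, the worst-case alternative measure is an exponential tilting of the baseline simulation, so reweighting simulated losses by exp(θ · loss) traces out the worst-case expectation as θ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline model: 100k simulated losses from a (hypothetical) lognormal.
losses = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def tilted(theta):
    """Worst-case expected loss and relative entropy for tilt parameter theta.

    The maximally damaging alternative measure at a given entropy level puts
    weight on each simulated scenario proportional to exp(theta * loss).
    """
    w = np.exp(theta * (losses - losses.max()))  # shifted for numerical stability
    w /= w.sum()
    mean = (w * losses).sum()
    # KL divergence of the tilted weights versus the uniform baseline weights.
    entropy = (w * np.log(w * len(losses))).sum()
    return mean, entropy

base = losses.mean()
for theta in [0.0, 0.05, 0.1]:
    m, kl = tilted(theta)
    print(f"theta={theta}: worst-case mean={m:.3f} (baseline {base:.3f}), KL={kl:.4f}")
```

Sweeping θ and plotting the worst-case expectation against the entropy budget gives exactly the kind of simple, simulation-based model risk diagnostic the talk advocated.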

Another session that stood out for me was Pietro Parodi and Peter Watson’s session “Property Graphs: A Statistical Model for Fire Losses Based on Graph Theory”. The idea here is to find a model that helps to explain why commercial property losses follow the heavy-tailed severity distributions observed in practice. Often, in practice, algorithms/distributions are applied because they work and not because there is a good logical basis derived from first principles. Along these lines, I am reminded of some of the work of Perks/Beard, who proposed first-principle models to explain their mortality laws (my small contribution along these lines is an explanation of the chain-ladder algorithm as a life-table estimator, on this blog). Parodi and Watson use graph theory to represent properties (houses/factories etc) and define statistical models on the graph of fire events. These models lead, after simulation and in aggregate, to curves defining the overall severity of a fire event that are not radically different from the current set of curves used by actuaries.

Paul Embrechts’ sessions were amazing to listen to, because of his ability to tie together so many disparate strands in actuarial science and quantitative risk management. It was particularly meaningful to see Paul, who is close to retirement, showcasing some of the work of Mario Wüthrich on telematics on stage, and providing his view that the novel application of data science, as embodied by the work on telematics, is a direction for the discipline to take.

I also enjoyed Hansjörg Albrecher’s session “Flood risk modelling” which was a tour-de-force of various statistical modelling techniques applied to some novel data.

Some other noteworthy sessions which spring to mind:

  • “Data based storm modelling” – this was a demonstration of how a storm model was built for Germany by a medium size consulting firm. I liked the application of mathematics (Delauney triangulation for mapping the wind events and Singular Value Decomposition for dimensionality reduction) to a relatively big data problem (40m records).
  • “Using Risk Factors in Insurance Analytics: Data Driven Strategies” – how to apply GLMs to sparse and high-dimensional data without breaking the linearity assumptions.
  • “Trend in Marine Insurance” – a great overview of this specialty line.

I am not sure if access to the VICA is still open, but if you have any interest in the topics above, I would strongly recommend you try and view some of the sessions.