Industry publications – Sigma/Lloyd’s

Some of my favorite reading on insurance-related topics comes from Swiss Re’s Sigma series and Lloyd’s of London’s emerging risk teams.

The latest Swiss Re Sigma publication covers the CAT events of 2018, which were driven mainly by “secondary perils”:

https://www.swissre.com/institute/research/sigma-research/sigma-2019-02.html

The Lloyd’s of London reports cover the impacts of AI on insurance and the risks of robotics:

https://www.lloyds.com/news-and-risk-insight/risk-reports/library/technology/taking-control

Some interesting new articles

– An excellent tutorial article by Jürg Schelldorfer and Mario Wüthrich showing how to apply a hybrid GLM/neural net for pricing. The paper is here: https://lnkd.in/edv5s9k

– This paper uses a recurrent neural network (LSTM) to forecast the time parameters of a Lee-Carter model, and the results look very promising – much better than using an ARIMA model: https://lnkd.in/eRAddBd

– Lastly, this paper proposes an interesting combination of a decision tree model with Bühlmann-Straub credibility: https://lnkd.in/eJts5Mp

Great to see the state of the art being advanced on so many fronts!

Advances in time series forecasting – M4 and what it means for insurance

[Image: Not necessarily the best way to forecast! Photo by Jenni Jones on Unsplash]

In a previous post I discussed the M4 conference and what my key takeaways were. In this post I plan to focus the discussion on insurance, and then specifically on actuarial work, and think about what the advances in time series forecasting might mean for actuaries and other professionals in insurance.

This post starts off by discussing the traditional time series forecasting problem, where it appears in the context of insurance, and how insurers could benefit from recent advances, and then narrows in to focus on actuarial work.

Let’s quickly cover what is meant by time series forecasting. Quite often, the only data available for a problem consists of the past values that a series took, measured at regular points in time. In other words, associated variables which would help to explain the past values of the series are not available, and the exercise needs to be informed only by the history of the series itself. For example, one might have data on the number of various insurance products sold monthly for the past five years (in this case, associated variables such as the number of salespeople or advertising spend might not be available), and to understand revenue, one might need to forecast the number of products that will be sold over the next quarter or year.
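
To make this concrete, here is a tiny sketch in Python (the monthly sales figures are made up) of the two simplest benchmark forecasts: the naive forecast, which repeats the last observed value, and the seasonal naive forecast, which repeats the value from the same month one year earlier.

```python
import numpy as np

# Hypothetical monthly unit sales for the past five years (60 observations)
rng = np.random.default_rng(1)
months = np.arange(60)
sales = 1000 + 5 * months + 100 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 30, 60)

horizon = 3  # forecast the next quarter

# Naive forecast: repeat the last observed value
naive = np.repeat(sales[-1], horizon)

# Seasonal naive forecast: repeat the corresponding months from one year earlier
seasonal_naive = sales[-12:-12 + horizon]

print(naive.round(0))
print(seasonal_naive.round(0))
```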

Some more examples of this are given in a fantastic online book on forecasting by Rob J Hyndman and George Athanasopoulos over here. I would recommend this book to anyone interested in time series forecasting!

Insurance and forecasting 

Compared to more traditional industries, insurance is interesting in that there is no physical product being sold, and insurers do not need to maintain or forecast inventories. Having said that, the familiar time series forecasting problem pops up in the context of insurance in other areas, for example:

  • Forecasting the number of sales or claims and the associated resourcing requirements
  • Forecasting revenue, losses, expenses and profits

Perhaps surprisingly, revenue forecasts play a major role in determining the capital requirements of insurers under Solvency II, which is the European insurance legislation, as well as under SAM, its South African variant. In fact, part of the capital requirement for insurance risk is often directly proportional to forecast premiums; see, for example, Article 116.3.a of the Solvency II Directive.

So, besides insurers, regulators around the world also have an interest in ensuring that revenue forecasts are accurate, and advances in time series forecasting, such as those presented at the M4 conference, should see wider applications in insurance. One advance to consider is Microsoft’s extensive use of machine learning to determine revenue forecasts, as described in this paper by Jocelyn Barker and others. At the M4 conference (and in the paper), Jocelyn noted that these forecasts are used widely, from providing Wall Street guidance to managing global sales performance.

Some of the other ideas expressed at the M4 conference that could also be of benefit, and which are now clearly established in the time series literature, are understanding:

  • when to make changes to statistical forecasts (summary here)
  • the value of aggregating forecasts from different methods (an insightful presentation from Bob Winkler at M4 on the topic is here)

A peculiarity of insurance forecasting is that often insurance professionals will not aim to forecast the actual value of losses and expenses, but rather will focus on ratios that express these quantities in terms of revenue (or a close proxy to revenue). For example, if one wants to forecast losses, then one would try to forecast loss ratios, which express how many cents are paid in losses for every dollar of revenue. In the next section, I will discuss how these ratios are often currently forecast in insurance companies. 
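
As a small illustration of the ratio approach (the premium and loss figures below are hypothetical), one forecasts the loss ratio rather than the loss amount itself, and then applies the forecast ratio to planned premium:

```python
import numpy as np

# Hypothetical annual earned premium and incurred losses (in millions)
premium = np.array([100.0, 110.0, 118.0, 130.0, 142.0])
losses  = np.array([ 62.0,  70.0,  73.0,  84.0,  90.0])

loss_ratio = losses / premium           # cents of loss per dollar of revenue
print(loss_ratio.round(3))              # approximately [0.62, 0.636, 0.619, 0.646, 0.634]

# Forecast the ratio (here, a simple average of recent years) and
# apply it to next year's planned premium to obtain the loss forecast
planned_premium = 155.0
forecast_ratio = loss_ratio[-3:].mean()
forecast_losses = forecast_ratio * planned_premium
print(round(forecast_ratio, 3), round(forecast_losses, 1))
```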

Forecasting in Actuarial Work

For the main topic of this post, I want to examine the work that actuaries do for insurers, which often consists of, or contains, forecasts of some kind.

In life insurance, these forecasts are often the key variables underlying pricing and reserving such as:

  • Mortality
  • Morbidity
  • Withdrawal or lapse rates
  • Expenses

In P&C insurance (or general or short-term insurance, if you are in the UK or South Africa), these forecasts often comprise:

  • Loss ratios
  • Frequency rates and average cost per claim
  • Premium rates
  • Claims development patterns

As an aside, not so long ago, these lists would have included investment returns, but a large swathe of the actuarial profession has more or less adopted market-consistent valuation practices, which dictate that all cashflows should be valued like bond cashflows, with the implication that investment returns can simply be read off from market yield curves. One currently controversial discussion here is the valuation of no-negative-equity guarantees on equity release (reverse) mortgages in the UK; see here from Dean Buckner and Kevin Dowd.

A common assumption that is made for some of these variables is that whatever experience has occurred over the past few years will repeat itself in the future – in time series jargon, actuaries often use so-called “naive” forecasts (please read the conclusion though, where I note that this is not always the case). Here are some examples of naive forecasts in current actuarial work:

  • When determining (P&C) claims reserves, an allowance must be made for the costs of managing claims (to be precise, here I refer to claims department and associated costs, or ULAE), in addition to the cost of indemnifying policyholders. The South African SAM regulations allow actuaries to forecast these costs on the basis of the average claims management costs over the past two years.
  • Also on P&C reserving, a very common approach to determining claims development patterns (which are then used to forecast the extent of the outstanding claims that are still to be reported) is to rely on averages of recent experience (see the small sketch after this list).
  • Mortality analysis often consists of comparing an assumed mortality table to recent experience. The assumed mortality table is then adjusted to match the recent experience more closely, and only rarely will a trend over time be allowed for. 
  • When pricing P&C insurance with a GLM, a dataset of recent claims experience is used to derive factors which define how different policies are likely to perform. For example, how much more likely are claims if the policyholder is a new driver, compared to an experienced driver. These factors are most often based on the recent past, with no allowance for trend over the years.
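
To make the development-pattern example concrete, here is a toy run-off triangle (the figures are hypothetical): the chain-ladder development factors are volume-weighted averages of all past experience, and projecting with them implicitly assumes that the recent past will repeat itself.

```python
import numpy as np

# Hypothetical cumulative claims triangle: rows = accident years, cols = development years
tri = np.array([
    [1000., 1800., 2100., 2200.],
    [1100., 2000., 2350., np.nan],
    [1200., 2150., np.nan, np.nan],
    [1300., np.nan, np.nan, np.nan],
])

n = tri.shape[1]
factors = []
for j in range(n - 1):
    mask = ~np.isnan(tri[:, j + 1])
    # Volume-weighted average development factor over all available accident years
    factors.append(tri[mask, j + 1].sum() / tri[mask, j].sum())
print([round(f, 3) for f in factors])

# Project the triangle to ultimate by repeatedly applying the factors
proj = tri.copy()
for j in range(n - 1):
    missing = np.isnan(proj[:, j + 1])
    proj[missing, j + 1] = proj[missing, j] * factors[j]
print(proj.round(0))
```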

In all these examples, the recent past is taken as representative of the future. The reasons for this are probably a general lack of sufficient data to do better, and the difficulties in specifying a suitable model that can capture these changes over time adequately. However, as data quality (and quantity) improves, and especially, as the options for modelling increase (for example, using neural nets instead of GLMs), I think there are ample opportunities to improve on some parts of current practice. 

Two potential paths to achieve this stand out for me from the M4 conference:

  • One way to improve forecasts is to come up with a smart way of ensembling multiple models (as opposed to coming up with new, more complicated models), as done by the runners-up to the M4 competition (link); a small sketch of this idea follows this list. Of course, this needs to be done in a scientific manner, and very little research has been performed on how this could be achieved with traditional actuarial models. The advantage of this approach is that the building blocks remain the same traditional models, and a meta-model works out which of these models is best and when.
  • Another way is more or less to forget about model specification, and let a neural net find an optimal model automatically, as was done in Slawek Smyl’s winning solution (link). To do this, one generally needs more data than in traditional modelling approaches, but the results can be impressive. I particularly favor this latter approach, and for examples of applications to population mortality forecasting and claims reserving, I would point to two recent papers I co-authored that are up on SSRN and demonstrate this approach.
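
As a minimal sketch of the first of these paths (using a made-up series and a crude inverse-error weighting, not the actual M4 meta-learner), one can hold out a validation window, score several simple methods on it, and weight their forecasts accordingly:

```python
import numpy as np

rng = np.random.default_rng(0)
y = 100 + 0.5 * np.arange(48) + rng.normal(0, 3, 48)   # hypothetical series
train, valid = y[:-6], y[-6:]

def naive(history, h):
    return np.repeat(history[-1], h)

def drift(history, h):
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope * np.arange(1, h + 1)

def mean_forecast(history, h):
    return np.repeat(history.mean(), h)

methods = {"naive": naive, "drift": drift, "mean": mean_forecast}

# Score each method on the held-out validation window
errors = {name: np.mean(np.abs(valid - f(train, len(valid)))) for name, f in methods.items()}

# Weight forecasts by inverse validation error (a very crude meta-model)
weights = {name: 1 / e for name, e in errors.items()}
total = sum(weights.values())
h = 6
ensemble = sum(w / total * methods[name](y, h) for name, w in weights.items())
print(ensemble.round(1))
```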

Having noted some of the above areas that can be improved, it is important to end by stating that often, data simply isn’t available to do much better than the simplest forecasts, and, indeed, in cases where the data is available, actuaries will try to use more sophisticated modelling. One example is mortality improvement modelling, generally undertaken by providers of annuities and other products exposed to longevity risk, where actuaries apply mortality models from both the actuarial and demographic “schools”, most often to population level data. Another example is claims reserving, where there is increasing attention being placed on developing reserving models that allow for trends in claims development assumptions over time, though I have not yet seen one of these in practice.

In conclusion, I think it is an exciting time to be involved in actuarial work and insurance more broadly, and I look forward to seeing how advances made in other areas will influence the insurance industry.

Thoughts on the M4 Conference

I had the opportunity to attend the M4 Conference held last week in NYC, which focused on the results of the recent M4 forecasting competition, as well as more generally on the state of the art in time series forecasting. In this post, I plan to summarize some of the key ideas that were presented at the conference and point out some of the thoughts that have occurred to me since.

There were a number of excellent speakers whose key points (from my perspective) I summarize very briefly later on in this blog, with the standout ones for me being:

  • Slawek Smyl (winner of the competition with his “hybrid” method)
  • Spyros Makridakis (M competitions)
  • Nassim Taleb 
  • Pablo Montero-Manso (representing the runner-up team in the competition with a boosting meta-learning method)
  • Andrea Pasqua

The rest of this post will discuss:

  • Big Ideas of the M4 Conference
  • Summaries of some of the talks

In a follow-up post I hope to discuss what actuaries can learn from the M4 competition.

The Big Ideas of the M4 Conference

There were several recurring themes at the conference, each addressed by a number of the speakers. Of these, the one that came up most often was the difference between statistics and machine learning.

Stats vs ML
It was fascinating to see the back and forth between the speakers and the audience on exactly what defines machine learning, and how this is different from statistics. Two of the different viewpoints were:

  • Statistical methods generally do not learn across different time series and datasets, whereas ML methods do. (This first viewpoint makes sense in that most methods used for time series forecasting focus on the univariate case, i.e. where there is only one sequence; techniques to leverage information across series are newer in this field, although obviously not a new concept in more traditional applications of statistics.)
  • There is no difference between statistics and ML, and in fact neural networks are a generalization of GLMs, which are a basic statistical tool; in other words, the distinction is arbitrary.

Interestingly, there was also not much consensus on whether the field of forecasting should be classified as a traditional statistical discipline or not. One good point that was made is that one of the basic time series methods – exponential smoothing – was always used as an algorithm, until statistical justification in the state-space framework was given by Rob Hyndman et al. 

One amusing debate focussed on whether Slawek’s method was in fact a statistical or machine learning approach, with different participants arguing for their perspectives, and being somewhat averse to the idea of a hybrid approach. This carried on, until Slawek himself was asked to clarify, at which point he confirmed that his method is a “hybrid” of statistical and machine learning approaches. 

My perspective is that some of these issues can be tied up quite neatly using the distinction between prediction and inference given by Shmueli (2010). A significant part of statistical practice is focussed on defining models and then working out whether or not the observed data could have been generated by the model, and, within this framework, one generally does not have concepts such as out-of-sample predictive accuracy. Machine learning, on the other hand, focuses on achieving good out-of-sample performance of models, whether these have been specified using some stochastic data generating procedure, or on an algorithmic basis. From this perspective, the field of forecasting is not a traditional statistical discipline, as the focus is on prediction!

Complexity
A recurring theme of the M competitions is that more complex models are usually outperformed by simple methods; for example, in the original M1 competition it was shown that exponential smoothing was better than ARIMA models. In the M4 competition, this became much more nuanced. On the one hand, “vanilla” machine learning techniques performed poorly, and worse than the benchmark, mirroring the findings in Makridakis, Spiliotis and Assimakopoulos (2018). On the other hand, the winners of the M4 competition used relatively more complex machine learning methods to great success. The difference seems to be that the complexity of the methods is in how they learn to generalize across time series (Slawek’s LSTM model and Pablo’s meta-learning algorithm), instead of trying to apply especially sophisticated methods to single time series.

Triumph of Deep Learning
As I have written about several times on this blog, the big advantage of deep learning over traditional machine learning approaches is that feature engineering gets performed automatically (i.e. this is the paradigm of representation learning, in that the model learns the features), and therefore, when dealing with large and very complex datasets, suitable neural network architectures can provide a massive performance boost over other approaches. I think this was clearly part of the “secret sauce” of Slawek’s winning solution, in that he very neatly specified a neural network combined with exponential smoothing, thus obviating the need to try to derive features from each time series. This is in contrast to the runner-up solution presented by Pablo, which involved a substantial feature engineering step, in which many features were calculated for each time-series, after which a boosted tree model was fit on these features to work out how to weight the various time series methods.

More to learn
Although forecasting is not a new field, it seemed to me that many participants at the conference felt that there is much more to learn to advance the state of the art of forecasting, especially as machine learning methods get adapted to time series forecasting. The amazing and unanticipated success of Slawek’s hybrid method will no doubt lead many researchers to try similar methods on other datasets.

This also manifested in the advance detail given on the upcoming M5 competition, which is going to focus on the role of explanatory variables in forecasting time series, as well as feature online learning as more data become available. I think many people felt that the techniques incorporating explanatory variables are not yet optimal and represent an opportunity to advance the state of the art.

Ensembling of methods
A famous finding in the forecasting literature is that combinations of methods usually do better than single methods, and that held true in the M4 competition. Slawek’s winning approach consisted of an ensemble of LSTM models (I discuss the very smart idea of a so-called Mixture of Specialists later) and Pablo’s method used a boosting algorithm to assign weights to different simple methods, which were then combined to produce the final forecasts.

Summaries of talks
Here are some summaries of my favourite talks of the conference.

Slawek Smyl (Uber Technologies): A Hybrid Approach to Forecasting
Slawek won the M4 competition by a large margin over the next best entry. His method, described in a short note here, essentially did two things:

  • Firstly, the neural net learns optimal coefficients of the Holt-Winters algorithm, which are then used to normalize each time series
  • Secondly, the normalized series are forecast using the neural net and then restored to the original scale using the Holt-Winters parameters (a rough sketch of this two-step flavour follows below)
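
The following is only a rough sketch of that two-step flavour, not Slawek's actual ES-RNN: in the real method the smoothing coefficients are learned jointly with the network, whereas here the smoothing parameter, network size and data are all made up, a small scikit-learn net stands in for the LSTM stack, the series is normalized by a simple exponentially smoothed level, and the forecast is rescaled back at the end.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = np.arange(120)
y = np.exp(0.01 * t) * (10 + np.sin(2 * np.pi * t / 12)) + rng.normal(0, 0.2, 120)

# Step 1: normalize the series by an exponentially smoothed level (alpha fixed here)
alpha, level = 0.3, y[0]
levels = []
for value in y:
    level = alpha * value + (1 - alpha) * level
    levels.append(level)
levels = np.array(levels)
z = y / levels                                      # normalized series

# Step 2: fit a small neural net on lagged windows of the normalized series
lags = 12
X = np.array([z[i:i + lags] for i in range(len(z) - lags)])
target = z[lags:]
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0).fit(X, target)

# Step 3: forecast one step ahead on the normalized scale and rescale by the last level
z_hat = net.predict(z[-lags:].reshape(1, -1))[0]
y_hat = z_hat * levels[-1]
print(round(y_hat, 2))
```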

The network design was a stack of various types of Long Short Term Memory cells (with skip connections and dilation).

Slawek also used ensembling at several levels to produce the forecast. I found one ensembling method which he proposed to be particularly interesting, the Ensemble of Specialists, which is described in more detail here.

Basically, the idea is to take several of the same neural net architectures and allow them to train for a single epoch on some of the training data. Then, allocate each time series to the top-2 neural nets and repeat both steps until the validation error increases. Once the nets are trained, one applies different ensemble methods to derive the final forecasts. This seems like a very smart way of ensuring optimal performance on all types of series – in my own research, I have encountered situations when neural nets trained to a global optimum do not perform as well as would be expected on some time series and I am excited to try out this approach.
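
A schematic of that training loop might look as follows; this is purely illustrative, with small scikit-learn MLPs standing in for the LSTMs of the winning solution, made-up series, and arbitrary lag, horizon and size choices:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
LAGS, H, K, TOP = 6, 6, 3, 2        # lag window, validation horizon, number of nets, top-K

# A pool of hypothetical series with differing seasonal behaviour
series = [10 + i + np.sin(2 * np.pi * np.arange(60) / (6 + 2 * i)) + rng.normal(0, 0.1, 60)
          for i in range(8)]

def windows(y):
    X = np.array([y[t:t + LAGS] for t in range(len(y) - LAGS)])
    return X, y[LAGS:]

def val_error(net, y):
    X, target = windows(y[-(H + LAGS):])       # the last H points, predicted from their lags
    return np.mean((net.predict(X) - target) ** 2)

nets = [MLPRegressor(hidden_layer_sizes=(8,), random_state=k) for k in range(K)]
for net in nets:                               # warm up every net so predictions are defined
    for y in series:
        X, target = windows(y[:-H])
        net.partial_fit(X, target)

assignments = [list(range(K)) for _ in series] # initially every net sees every series
best = np.inf
for epoch in range(200):                       # capped for the sketch
    # One epoch: each net trains on the series currently assigned to it
    for k, net in enumerate(nets):
        for y, assigned in zip(series, assignments):
            if k in assigned:
                X, target = windows(y[:-H])
                net.partial_fit(X, target)
    # Re-assign each series to its TOP best nets by validation error
    errs = np.array([[val_error(net, y) for net in nets] for y in series])
    assignments = [list(np.argsort(row)[:TOP]) for row in errs]
    total = errs.min(axis=1).sum()             # overall validation error
    if total >= best:                          # stop once validation error stops improving
        break
    best = total

print("final assignments:", assignments)
```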

Spyros Makridakis (University of Nicosia): The contributions of the M4 Competition to the Theory and Practice of Forecasting

The slides have been made available here.

What stood out most for me about Spyros’ talk was the focus on improving the state of the art of time series forecasting using hard evidence, and that seems to be the key theme running throughout his work on the M competitions and even before. As easy as it might be to favour a method based on how pleasing it is theoretically, the approach during the M competitions has been simply to check what works, and what doesn’t on out of sample error. This created what seems to be a huge amount of work in the M4 competition, in that Spyros and his team have replicated every submission (even those that take upwards of a month to run in full!) and I admire the dedication to advancing the state of the art!

Some of the major findings that Spyros discussed are:

  • Improving accuracy via combining methods
  • Superiority of Slawek’s hybrid method
  • The improved precision of prediction intervals in Slawek’s and Pablo’s methods – these had a coverage ratio very close to the required 95%
  • Increased complexity, as measured by compute time, led to increased accuracy, which I think is a first for the M competitions.
  • Learning across time series in the winning methods
  • Poor performance of pure ML methods, which was attributed to these methods overfitting on the univariate time series i.e. not learning across series

Spyros then ended with two challenges where improvement is needed – improving the measurement of uncertainty (where there is great potential for ML/DL methods) and improving explanatory models of time series.

Nassim Taleb (New York University): Forecasting and Uncertainty: The Challenge of Fat Tailedness

I enjoyed hearing Nassim explain some of his ideas in the context of forecasting. My key takeaway here was that when forecasting, one might not be as interested in the underlying random variable being forecast, call it x, but rather in the payoff function of x, which is f(x). The payoff function can be manipulated in various ways by taking positions against the underlying x; for example, one could hedge out tail risks. Nassim was therefore effectively offering a way of dealing with uncertainty in x, which is to manipulate your payoffs so that you are not hurt by, and ideally gain from, the parts of x that you do not know about or are at most risk from.

One interesting connection that he made was between the way options traders have always approximated payoff functions using European options, which effectively comes down to function approximation using the ReLU activation in deep learning.
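
The connection is easy to verify numerically: the payoff of a European call with strike K, max(S - K, 0), is exactly a ReLU applied to S - K, so a portfolio of calls across strikes amounts to a one-hidden-layer ReLU approximation of a payoff function. A tiny check:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

S = np.linspace(50.0, 150.0, 5)        # underlying prices
K = 100.0                              # strike

call_payoff = np.maximum(S - K, 0.0)   # European call payoff max(S - K, 0)
print(np.allclose(call_payoff, relu(S - K)))   # True: the call payoff is a ReLU of S - K
```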

Andrea Pasqua (Data Science Manager, Uber): Forecasting at Uber: Machine Learning Approaches

Andrea’s talk covered how time series forecasting is done at Uber, with their own set of interesting and challenging issues, such as a huge number of series to forecast, dealing with extreme events, and the cold-start problem when services are launched in a new city. He gave a very nice walkthrough of how Uber arrived at the solutions currently in production, by going through each stage of model choice and development. It seems as if this team has benefited from Quantile Random Forests and I plan to read up more about these.

Conclusion

It was refreshing to see how approachable the speakers at the M4 conference were, and how willing the winners of the competition were to share their expertise and knowledge. The organizers of the conference put together a great event and well done to them!

In the next post I hope to discuss some of what I believe the actuarial profession could learn from the advances in the state of the art of forecasting that were shown at the M4 conference.

References
Makridakis, S., E. Spiliotis and V. Assimakopoulos. 2018. “Statistical and Machine Learning forecasting methods: Concerns and ways forward”, PLOS ONE 13(3):e0194889.
Shmueli, G. 2010. “To explain or to predict?”, Statistical Science:289-310.

Neural Network Embedding of the Over-Dispersed Poisson Reserving Model

Claims reserving for non-life (i.e. GI or P&C) companies is a core activity of actuaries working in these companies, and a huge academic literature on the subject has been produced (Schmidt 2017). Recently, there has been more focus on how machine learning can be applied to claims reserving and some examples of studies are Kuo (2018); Wüthrich (2018a); (Wüthrich 2018b); Zarkadoulas (2017).

When I think about the literature that has sprung up on the claims reserving problem, one issue that has always bothered me is that actuaries in practice will often be forced to depart from the theoretical methods, because the triangles that they encounter do not conform to the assumptions of the theory. For example, one will often observe that the claims development pattern is not constant over time, and then averaging over all accident years will produce inaccurate reserves. Thus, in practice, actuaries apply all sorts of heuristics to derive a hopefully less biased set of assumptions that are then applied to derive reserves. This becomes very problematic when the actuaries are then required to derive uncertainty estimates, which are used in Solvency II/SAM for setting capital, because the methods for deriving the uncertainty estimates generally are unable to cater for the heuristics that were applied to derive the best estimate of the reserves. Some approaches that have emerged recently apply non-linear mixed models or fully Bayesian models to allow for changing claims development patterns, but I have not yet seen someone derive the uncertainty of the reserves using these methods.

So, with this background in mind, this post is about a new approach to the claims reserving problem that solves these issues very neatly using the paradigm of representation learning (i.e. allowing a neural network to figure out the optimal way to use the input features within the model structure). The approach appears in a new paper applying neural networks to the claims reserving problem that I am delighted to have worked on together with Andrea Gabrielli and Mario Wüthrich, which is available here:

Paper on SSRN

In this paper, we show how a traditional IBNR model – the over dispersed Poisson model (Renshaw and Verrall 1998), which uses a GLM to model the claims run off triangle – can be embedded into a neural network, which is then allowed to learn additional model structure, automatically enhancing the accuracy of the claims reserving model. The underlying claims data was simulated from the individual claims simulation machine developed by my co-authors (Gabrielli and Wüthrich 2018) and aggregated into six triangles representing different lines of business. One very nice feature of these data is that we also have the claims runoff and we can thus compare the predicted claims (derived using our reserving method) to the actual claims development.

This paper features the following ideas, which are discussed next:

  • Residual learning
  • Learning over multiple lines of business
  • Uncertainty prediction

In this paper, we are building on an idea that was used in our recent paper on mortality forecasting using neural networks (Richman and Wüthrich 2018), in which we showed how the Lee-Carter mortality model can be expressed and extended to multiple populations within a neural network framework, leading to accurate mortality forecasts at a large scale.

However, whereas in the previous paper, we did not maintain the structure of the Lee-Carter model, in the current paper, we have maintained the ODP reserving model, which is a familiar reference point for actuaries, and allowed the network to enhance the familiar model; thus the network is learning about whatever residual structure remains after the application of the ODP model. Here is a view of the neural network used in the current paper:

This is a similar concept to the very successful class of computer vision models called ResNets (He, Zhang, Ren et al. 2016), which consist of very deep neural networks, where each set of layers learns a residual function. This concept was shown to be successful in allowing the training of exceptionally deep networks on the ImageNet dataset, and in the Lee-Carter paper, we showed how including a residual connection improved the performance of our deep network. Here, we use this idea a little differently, not to calibrate a very deep network, but to improve the calibration times by providing the ODP model to the network within a skip connection, dramatically reducing the time taken to calibrate the final neural network. Using the flexibility of the neural networks, we also calibrate the model on six triangles simultaneously, and these results are shown in the paper to be more accurate than either the original ODP model (which produces biased predictions that are too low across all lines), or the neural network calibrated to a single triangle. In fact, comparing the predicted claims to the actual claims, we find that the neural network calibrated to the six triangles produces exceptionally accurate predictions!
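
The following is not the architecture from the paper, just a minimal Keras-style sketch (with made-up dimensions and input names) of the general idea: the log of the ODP/GLM fitted value enters through a skip connection, and the network only has to learn a correction on the log scale for whatever the GLM misses.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

n_ay, n_dev = 10, 10   # numbers of accident and development years (illustrative)

ay_in   = layers.Input(shape=(1,), dtype="int32", name="accident_year")
dev_in  = layers.Input(shape=(1,), dtype="int32", name="development_year")
log_odp = layers.Input(shape=(1,), name="log_odp_fitted")  # skip connection input

# Learn low-dimensional embeddings of the two categorical inputs
ay_emb  = layers.Flatten()(layers.Embedding(n_ay, 2)(ay_in))
dev_emb = layers.Flatten()(layers.Embedding(n_dev, 2)(dev_in))

x = layers.Concatenate()([ay_emb, dev_emb])
x = layers.Dense(16, activation="tanh")(x)
residual = layers.Dense(1)(x)          # correction learned on the log scale

# Add the GLM prediction back in and exponentiate (log link), so the network
# only has to learn whatever structure the ODP model misses
out = layers.Activation("exponential")(layers.Add()([residual, log_odp]))

model = Model([ay_in, dev_in, log_odp], out)
model.compile(optimizer="adam", loss="poisson")
model.summary()
```

In a sketch like this, the ODP fitted values would be computed separately, for example from a standard chain-ladder or GLM fit, and supplied as the third input.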

Why is this model more accurate? We show in the paper that the network has learned additional structure that has picked up automatically on a shift in the claims development patterns over time. Here is a view of the claims development patterns for each of the accident years relating to one of the lines of business:

Thus, the network automatically has learned to vary the assumptions applied to each accident year, resulting in more accurate predictions. This is the paradigm of representation learning that was mentioned above – we have not specified to the model exactly how the claims development assumptions should vary by accident year, but fed information regarding accident and development year into the neural network, and allowed it to figure out how to combine this information optimally.

Perhaps most importantly, since each network is quick to calibrate, we then apply the bootstrap to derive the uncertainty of the predictions of the network, which interestingly is similar to the aggregate uncertainty of the ODP model. This is one of the first examples in the literature that I have seen whereby a model that is complex enough to be applied to real life triangles is also amenable to uncertainty analysis. This work therefore is likely to be an important step to advancing the state of the art of claims reserving models!

Please feel free to contact us if you have any feedback, which we would value!

References

Gabrielli, A. and M. Wüthrich. 2018. “An Individual Claims History Simulation Machine”, Risks 6(2):29.

He, K., X. Zhang, S. Ren and J. Sun. 2016. “Deep residual learning for image recognition,” Paper presented at Proceedings of the IEEE conference on computer vision and pattern recognition. 770-778.

Kuo, K. 2018. “DeepTriangle: A Deep Learning Approach to Loss Reserving”, arXiv arXiv:1804.09253

Renshaw, A.E. and R.J. Verrall. 1998. “A stochastic model underlying the chain-ladder technique”, British Actuarial Journal 4(04):903-923.

Richman, R. and M. Wüthrich. 2018. “A Neural Network Extension of the Lee-Carter Model to Multiple Populations”, SSRN

Schmidt, K. 2017. A Bibliography on Loss Reserving. https://www.math.tu-dresden.de/sto/schmidt/dsvm/reserve.pdf. Accessed: 8 July 2018.

Wüthrich, M. 2018a. “Machine learning in individual claims reserving”, Scandinavian Actuarial Journal:1-16.

Wüthrich, M. 2018b. “Neural networks applied to chain–ladder reserving”, European Actuarial Journal 8(2):407-436.

Zarkadoulas, A. 2017. “Neural network algorithms for the development of individual losses.” Unpublished thesis, Lausanne: University of Lausanne.

Neural networks, the Lee-Carter Model and Large Scale Mortality Forecasting

This post discusses a new paper that I am very glad to have co-authored with Mario Wüthrich, in which we apply deep neural networks to mortality forecasting. The draft can be found here:

Paper

A topic that I have been interested in for a long time is forecasting mortality rates, perhaps because this is one of the interesting intersections of statistics (and, these days, machine learning) with the field of actuarial science in life insurance. Several methods to model mortality rates over time have been proposed, ranging from the relatively simple method of extrapolating mortality rates directly using time series, to more complicated statistical approaches.

One of the most famous of these is the Lee-Carter method, which models mortality as an average mortality rate that changes over time. The change over time is governed by a time-based mortality index, which is common to all ages, and an age-specific rate of change factor:

\ln(m_{x,t}) = a_x + k_t \, b_x

where m_{x,t} is the force of mortality at age x in year t, a_x is the average log mortality rate over the period at age x, k_t is the time index in year t, and b_x is the rate of change of log mortality with respect to the time index at age x.

How are these quantities derived? There are two methods prominent in the literature – applying Principal Components Analysis, or Generalized Non-linear Models, which are different from GLMs in the sense that the user can specify non-additive relationships between two or more terms. To forecast mortality, models are first fit to historical mortality data and the coefficients (in the case of the Lee-Carter model, the time index k_t) are then forecast using a time series model, in a second step.
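
For readers who want to see the mechanics, here is a minimal sketch of the classical two-step fit on synthetic data, using the SVD for the first principal component and a random walk with drift for the time index; the data, dimensions and constants are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ages, years = 50, 40

# Synthetic log mortality surface: an age profile plus an improving trend
true_ax = np.linspace(-7.0, -1.0, ages)
true_bx = np.linspace(0.5, 1.5, ages) / ages
true_kt = -0.8 * np.arange(years)
log_m = true_ax[:, None] + np.outer(true_bx, true_kt) + rng.normal(0, 0.02, (ages, years))

# Step 1: a_x is the average log rate by age; b_x and k_t come from the first SVD component
ax = log_m.mean(axis=1)
U, S, Vt = np.linalg.svd(log_m - ax[:, None], full_matrices=False)
bx, kt = U[:, 0], S[0] * Vt[0, :]

# Usual identification constraints: sum(b_x) = 1 and sum(k_t) = 0
scale = bx.sum()
bx, kt = bx / scale, kt * scale
shift = kt.mean()
ax, kt = ax + bx * shift, kt - shift

# Step 2: forecast k_t with a random walk with drift and rebuild the mortality surface
drift = np.diff(kt).mean()
kt_forecast = kt[-1] + drift * np.arange(1, 11)
log_m_forecast = ax[:, None] + np.outer(bx, kt_forecast)
print(log_m_forecast.shape)   # (50, 10)
```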

In the current age of big data, relatively high quality mortality data spanning an extended period are available for many countries from the excellent Human Mortality Database, which is a resource that anyone with an interest in the study of mortality can benefit from. Other interesting sources are databases containing sub-national rates for the USA, Japan and Canada. The challenge, though, is how to model all of these data simultaneously to improve mortality forecasts? While some extensions of basic models like the Lee-Carter model have been proposed, these rely on assumptions that might not necessarily be applicable in the case of large scale mortality forecasting. For example, some of the common multi-population mortality models rely on the assumption of a common mortality trend for all countries, which is likely not the case.

In the paper, we tackle this problem in a novel way – feed all the variables to a deep neural network and let it figure out how exactly to model the mortality rates over time. This speaks to the idea of representation learning that is central to modern deep learning, which is that many datasets, such as large collections of images as in the ImageNet dataset, are too complicated to model by hand-engineering features, or it is too time consuming to perform the modelling. Rather, the strategy in deep learning is to define a neural network architecture that expresses useful priors about the data, and allow the network to learn how the raw data relates to the problem at hand. In the example of modelling mortality rates, we use two architectural elements that are common in applications of neural networks to tabular data:

  • We use a deep network; in other words, the network consists of multiple layers of variables, which expresses the prior that complex features can be represented by a hierarchy of simpler representations learned in the model.
  • Instead of using one-hot encoding to signify to the network when we are modelling a particular country, or gender, we use embedding layers. When applied to many categories, one-hot encoding produces a high-dimensional feature vector that is sparse (i.e. many of the entries are zero), leading to potential difficulties in fitting a model as there might not be enough data to derive credible estimates for each category. Even if there is enough data, as in our case of mortality rates, each parameter is learned in isolation, and the estimated parameters do not share information, unless the modeller explicitly chooses to use something like credibility or mixed models. The insight of Bengio et al. (2003) to solve these problems is that categorical data can successfully be encoded into low dimensional, dense numerical vectors, so, for example, in our model, country is encoded into a five-dimensional vector.

In the paper, we also show how the original Lee-Carter model can be expressed as a neural network with embeddings!

Here is a picture of the network we have just described:

In the paper, we also employ one of the most interesting techniques to emerge from the computer vision literature in the past several years. The original insight is due to the authors of the ResNet paper, who analysed the well-known problem that it is often difficult to train deep neural networks. They considered that a deep neural network should be no more difficult to train than a shallow network, since the deeper layers could simply learn the identity function, and thus be equivalent to a shallow network. Without going too far off track into these details, their solution is simple – add skip layers that connect the deep layers to more shallow layers in the network. This idea is expanded on in the DenseNet architectures. We simply added a connection between the feature layer and the fifth layer of the network, connecting the embedding layers almost to the deepest layer of the network. This boosted the performance of the networks considerably.
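
A stripped-down Keras sketch of these two elements, embedding layers for the categorical inputs and a skip connection from the feature layer to the deepest hidden layer, might look as follows; the dimensions, activations and input names are illustrative and not those of the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

n_countries, n_ages = 40, 100   # illustrative sizes

year    = layers.Input(shape=(1,), name="year")
age     = layers.Input(shape=(1,), dtype="int32", name="age")
country = layers.Input(shape=(1,), dtype="int32", name="country")
gender  = layers.Input(shape=(1,), dtype="int32", name="gender")

# Embedding layers: each categorical level is mapped to a small dense vector
age_emb     = layers.Flatten()(layers.Embedding(n_ages, 5)(age))
country_emb = layers.Flatten()(layers.Embedding(n_countries, 5)(country))
gender_emb  = layers.Flatten()(layers.Embedding(2, 1)(gender))

features = layers.Concatenate()([year, age_emb, country_emb, gender_emb])

# A deep stack of hidden layers on top of the feature layer
x = features
for _ in range(4):
    x = layers.Dense(128, activation="tanh")(x)

# Skip connection: concatenate the feature layer with the deepest hidden layer
x = layers.Concatenate()([features, x])
out = layers.Dense(1, activation="sigmoid")(x)   # mortality rate in (0, 1)

model = Model([year, age, country, gender], out)
model.compile(optimizer="adam", loss="mse")
model.summary()
```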

We found that the deep neural networks dramatically outperformed the competing methods that we tested, forecasting with the lowest MSE in 51 out of the 76 instances we tested! Here is a table comparing the methods, and see the paper for more details:

Lastly, an interesting property of the embedding layers learned by neural networks is the fact that the parameters of these layers are often interpretable as so-called “relativities” (to use some actuarial jargon), in other words, as defining the relationship between the different values that a variable may take.  Here is a picture of the age embedding, which shows that the main relationship learned by the network is the characteristic shape of a modern life table:

This is a rather striking result, since at no time did we specify this to the network! Moreover, once the architecture was specified, the network learned to forecast mortality rates more successfully than human-specified models, reminding me of one of the desiderata for AI systems listed by Bengio (2009):

“Ability to learn with little human input the low-level, intermediate, and high-level abstractions that would be useful to represent the kind of complex functions needed for AI tasks.”

We would value any feedback on the paper you might have.

References

Bengio, Y. 2009. “Learning deep architectures for AI”, Foundations and trends® in Machine Learning 2(1):1-127.

AI in Actuarial Science – #ASSA2018

I will be speaking about deep learning and its applications to actuarial work at the 2018 ASSA Convention in Cape Town, this Thursday 25 October. Hope to see you there!

Here are:

  • the slides of the talk:

AI in Act Sci – slides

  • accompanying paper:

2018-Richman-FIN

  • code to reproduce some of the results:

https://github.com/RonRichman/AI_in_Actuarial_Science/

Thoughts on writing “AI in Actuarial Science”

This is a follow-up post to something I wrote a few months ago, on the topic of AI in Actuarial Science. Over the intervening time, I have been writing a paper for the ASSA 2018 Convention in Cape Town on this topic, a draft of which can be found here:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3218082

and code here:

https://github.com/RonRichman/AI_in_Actuarial_Science

I would value feedback on the paper from anyone who has time to read it.

This post, though, is about the process of writing the paper and some of the issues I encountered. Within the confines of an academic paper it is often hard, and perhaps mostly irrelevant, to express some thoughts and opinions, and in this blog post I hope to share some of these ideas that did not make it into the paper. I am not going to spend much time defining terms; if some terminology is unclear, referring to the paper should help to clarify it.

Vanilla ML techniques might not work

Within the paper I try to apply deep learning to the problem addressed in the excellent tutorial paper of Noll, Salzmann and Wüthrich (2018) which is about applying machine learning techniques to a French Motor 3rd Party Liability (MTPL) dataset. They achieve some nice performance boosts over a basic GLM with some hand engineered features using a boosted tree and a neural network.

One of the biggest shocks I had when I decided to try this problem myself was that off-the-shelf tools like XGBoost did not work well at all – in fact, the GLM was far better, despite the many hyper-parameter settings that I tried. I also tried out the mboost package, but the dataset was too big for the 16GB of RAM on my laptop.

So the first mini-conclusion is that just because you have tabular data (i.e. structured data with rows for observations and columns for variables, like in SQL), you should not automatically assume that a fancy ML approach is going to outperform a basic statistical one. Anecdotally, I am hearing from several different people that applying vanilla techniques to pricing problems doesn’t provide much performance boost.

To this point, I recommend Frank Harrell’s excellent blog post on ML versus statistical techniques, and about when to apply which:

http://www.fharrell.com/post/stat-ml/

Vanilla DL techniques might not work either

This was perhaps the most vexing part of the process. Fitting deep networks with ReLU activations to the French dataset, as the more up-to-date sources on deep learning seem to suggest, also did not work all that well! In fact, I achieved only poor performance on a network fit to data without manual feature engineering. Another issue is that depth didn’t seem to help all that much.

Similarly, naively coding up deep autoencoders for the mortality data that is also discussed in the paper turned out to be a major lesson while writing the paper – these just did not converge, despite many attempts at tuning the hyperparameters. I only managed to find a decent solution using greedy unsupervised learning (Hinton and Salakhutdinov 2006) of autoencoder layers.
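
For readers unfamiliar with the greedy trick, the rough idea is: train a shallow autoencoder, encode the data, train the next autoencoder on the codes, then stack the pretrained encoders and fine-tune the whole thing. Below is a minimal sketch on random placeholder data; the layer sizes and epochs are arbitrary, and the decoder of the stacked model is re-initialized for simplicity.

```python
import numpy as np
from tensorflow.keras import layers, Model

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100)).astype("float32")   # random placeholder data

def shallow_autoencoder(input_dim, code_dim):
    inp = layers.Input(shape=(input_dim,))
    code = layers.Dense(code_dim, activation="tanh")(inp)
    out = layers.Dense(input_dim)(code)
    ae = Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    return ae, Model(inp, code)                       # full autoencoder and its encoder

# Greedy step 1: 100 -> 20, trained on the raw data
ae1, enc1 = shallow_autoencoder(100, 20)
ae1.fit(X, X, epochs=5, batch_size=64, verbose=0)
X1 = enc1.predict(X, verbose=0)

# Greedy step 2: 20 -> 5, trained on the codes from step 1
ae2, enc2 = shallow_autoencoder(20, 5)
ae2.fit(X1, X1, epochs=5, batch_size=64, verbose=0)

# Stack the pretrained encoders and fine-tune the whole deep autoencoder
inp = layers.Input(shape=(100,))
code = enc2(enc1(inp))
x = layers.Dense(20, activation="tanh")(code)
out = layers.Dense(100)(x)
deep_ae = Model(inp, out)
deep_ae.compile(optimizer="adam", loss="mse")
deep_ae.fit(X, X, epochs=5, batch_size=64, verbose=0)
```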

Therefore, a conclusion if you encounter a problem to which you want to apply Deep Learning – be aware that ReLUs plus depth might not work, and you might need to dig into the literature a bit!

When DL works, it really works!

This is connected to the next idea. Once I found a way of training the autoencoders, the results were fantastic and by far exceeded my expectations (and the performance of the Lee-Carter benchmark for mortality forecasting). Also, once I had the embedding layers working on the French MTPL dataset, the results were better than any other technique I could (or can) find. I was also impressed by the intuitive meaning of the learned embeddings, which I discuss in some detail in the paper, and the fact that plugging these embeddings back into the vanilla GLM resulted in a substantial performance boost.

The flexibility of the neural networks that can be fit with modern software, like Keras, is almost unlimited. Below is what I call a “learned exposure” network which has a sub-network to learn an optimal exposure measure for each MTPL policy. I have not encountered a similarly flexible and powerful system in any other field of statistics or machine learning.

Is this really AI?

One potential criticism of the title of the paper is that this isn’t really AI, but rather fancy regression modelling. I try to argue in Section 3 of the paper that Deep Learning is an approach to Machine Learning whereby you allow the algorithm to figure out the features that are important (instead of designing them by hand).

This is one of the desiderata for AI listed by Bengio (2009) on page 10 of that work – “Ability to learn with little human input the low-level, intermediate, and high-level abstractions that would be useful to represent the kind of complex functions needed for AI tasks.”

Do I think that my trained Keras models are AI? Absolutely not. But the fact that the mortality model has figured out the shape of a life table (i.e. the function ax in the Lee-Carter model) without any inputs besides year/age/gender/region and target mortality rates should make us pause to think about the meaningful features captured by deep neural nets. Here is the relevant plot from the paper – consider “dim1”:

This gets even more interesting in NLP applications, such as in Mikolov, Sutskever, Chen et al. (2013) who provide this image which shows that their deep network has captured the semantic meaning of English words:

Also, that deep nets seem to be able to perform “AI tasks” (the term used by Bengio, Courville and Vincent (2013) to mean tasks “which are challenging for current (shallow, my addition) machine learning algorithms, and involve complex but highly structured dependencies”) such as describing images indicates that something more than simple regression is happening in these models.

DL is empirical, not yet scientific

An in-joke that seems to have made the rounds is so-called “gradient descent by grad student” – in other words, it is difficult to find optimal deep learning models, and one needs to fiddle around with designs and optimizers until something that works is found. This is much easier if you have a team of graduate students who can do this for you, thus the phrase quoted above. What this means in practice is that there is often no off-the-shelf solution, and little or no theory to guide you in what might work or not, leading to lots of experimenting with different ideas until the networks perform well.

AI in Actuarial Science is a new topic but there are some pioneers

The traditional actuarial literature has not seen many contributions dealing with deep neural networks, yet. Some of the best work I found, which I highly recommend to anyone interested in this topic, is a series of papers by Mario Wüthrich and his collaborators (Gabrielli and Wüthrich 2018; Gao, Meng and Wüthrich 2018; Gao and Wüthrich 2017; Noll, Salzmann and Wüthrich 2018; Wüthrich 2018a, b; Wüthrich and Buser 2018; Wüthrich 2017). What is great about these papers is that the ideas are put on a firm mathematical basis and discussed within the context of profound traditional actuarial knowledge. I have little doubt that once these ideas take hold within the mainstream of the actuarial profession, they will have a huge impact on the practical work performed by actuaries, as well as on the insurance industry.

Compared to statistical methods, though, there are still big gaps in understanding the parameter/model risk of these deep neural networks, and an obvious next step is to try to apply some of the techniques used for the parameter risk of statistical models to deep nets.

The great resources available to learn about and apply ML and DL

There are many excellent resources available to learn about Machine and Deep Learning that I discuss in the resources sections of the paper, and, best of all, most of these are free, except for opportunity costs.

Lastly, a word about Keras, which is the high-level API that makes fitting deep neural models easy. This is a phenomenally well put together package, and the R interface makes it much more accessible to actuaries who might not be familiar with Python. I highly recommend Keras to anyone interested in experimenting with these models, and Keras will be able to handle most tasks thrown at it, as long as you don’t try anything too fancy. One thing I wanted to try, but couldn’t figure out, was how to add an autoencoder layer to a supervised model where the inputs are the outputs of a previous layer; this is one of the few examples where I ran into a limitation in Keras.

References

Bengio, Y. 2009. “Learning deep architectures for AI”, Foundations and trends® in Machine Learning 2(1):1-127.

Bengio, Y., A. Courville and P. Vincent. 2013. “Representation learning: A review and new perspectives”, IEEE transactions on pattern analysis and machine intelligence 35(8):1798-1828.

Gabrielli, A. and M. Wüthrich. 2018. “An Individual Claims History Simulation Machine”, Risks 6(2):29.

Gao, G., S. Meng and M. Wüthrich. 2018. Claims Frequency Modeling Using Telematics Car Driving Data. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3102371. Accessed: 29 June 2018.

Gao, G. and M. Wüthrich. 2017. Feature Extraction from Telematics Car Driving Heatmaps. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3070069. Accessed: June 29 2018.

Hinton, G. and R. Salakhutdinov. 2006. “Reducing the dimensionality of data with neural networks”, Science 313(5786):504-507.

Mikolov, T., I. Sutskever, K. Chen, G. Corrado et al. 2013. “Distributed representations of words and phrases and their compositionality,” Paper presented at Advances in neural information processing systems. 3111-3119.

Noll, A., R. Salzmann and M. Wüthrich. 2018. Case Study: French Motor Third-Party Liability Claims. SSRN. https://ssrn.com/abstract=3164764 Accessed: 17 June 2018.

Wüthrich, M. 2018a. Neural networks applied to chain-ladder reserving. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2966126. Accessed: 1 July 2018.

Wüthrich, M. 2018b. v-a Heatmap Simulation Machine. https://people.math.ethz.ch/~wueth/simulation.html. Accessed: 1 July 2018.

Wüthrich, M. and C. Buser. 2018. Data analytics for non-life insurance pricing. Swiss Finance Institute Research Paper. https://ssrn.com/abstract=2870308. Accessed: 17 June 2018.

Wüthrich, M.V. 2017. “Covariate selection from telematics car driving data”, European Actuarial Journal 7(1):89-108.