Ronald Richman

Thoughts on writing “AI in Actuarial Science”

This is a follow up post to something I wrote a few months ago, on the topic of AI in Actuarial Science. Over the intervening time, I have been writing a paper for the ASSA 2018 Convention in Cape Town on this topic, a draft of which can be found here:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3218082

and code here:

https://github.com/RonRichman/AI_in_Actuarial_Science

I would value feedback on the paper from anyone who has time to read the paper.

This post, though, is about the process of writing the paper and some of the issues I encountered. Within the confines of an academic paper it is often hard, and perhaps mostly irrelevant, to express some thoughts and opinions and in this blog post I hope to share some of these ideas that did not make it into the paper. I am not going to spend much time defining the terms too much, and if you refer to the paper if some terminology is unclear I think it will help to clarify.

Vanilla ML techniques might not work

Within the paper I try to apply deep learning to the problem addressed in the excellent tutorial paper of Noll, Salzmann and Wüthrich (2018) which is about applying machine learning techniques to a French Motor 3^rd Party Liability (MTPL) dataset. They achieve some nice performance boosts over a basic GLM with some hand engineered features using a boosted tree and a neural network.

One of the biggest shocks that I had was when I decided to try this problem myself is that off the shelf tools like XGboost did not work well at all – in fact, the GLM was by far better despite the many hyper-parameter settings that I tried. I also tried out the mboost package but the dataset was too big for the 16gb of RAM on my laptop.

So the first mini-conclusion is that just because you have tabular data (i.e. structured data with rows for observations and columns for variables, like in SQL), you should not automatically assume that a fancy ML approach is going to outperform a basic statistical one. Anecdotally, I am hearing from several different people that applying vanilla techniques to pricing problems doesn’t provide much performance boost.

To this point, I recommend Frank Harell’s excellent blog post on ML versus statistical techniques, and about when to apply which:

http://www.fharrell.com/post/stat-ml/

Vanilla DL techniques might not work either

This was perhaps the most vexing part of the process. Fitting deep networks with ReLu activations to the French dataset, like the more up to date sources on deep learning seem suggest, also did not work all that well! In fact, I achieved only poor performance on a network fit to data without manual feature engineering. Another issue is that depth didn’t seem to help all that much.

Similarly, naively coding up deep autoencoders for the mortality data that is also discussed in the paper turned out to be a major learning when writing the paper – these just did not converge despite the many attempts at tuning the hyperparameters. I only managed to find a decent solution using greedy unsupervised learning (Hinton and Salakhutdinov 2006) of autoencoder layers.

Therefore a conclusion if you encounter a problem to which you want to apply Deep Learning – be aware that ReLus plus depth might not work and you might need to dig into the literature a bit!

When DL works, it really works!

This is connected to the next idea. Once I found a way of training the autoencoders, the results were fantastic and by far exceeded my expectations (and the performance of the Lee-Carter benchmark for mortality forecasting). Also, once I had the embedding layers working on the French MTPL dataset, the results were better than any other technique I could (or can) find. I was also impressed by the intuitive meaning of the learned embeddings, which I discuss in some detail in the paper, and the fact that plugging these embeddings back into the vanilla GLM resulted in a substantial performance boost.

The flexibility of the neural networks that can be fit with modern software, like Keras, is almost unlimited. Below is what I call a “learned exposure” network which has a sub-network to learn an optimal exposure measure for each MTPL policy. I have not encountered a similarly flexible and powerful system in any other field of statistics or machine learning.

Is this really AI?

One potential criticism of the title of the paper is that this isn’t really AI, but rather fancy regression modelling. I try to argue in Section 3 of the paper that Deep Learning is an approach to Machine Learning whereby you allow the algorithm to figure out the features that are important (instead of designing them by hand).

This is one of the desiderata for AI listed by Bengio (2009) on page 10 of that work – “Ability to learn with little human input the low-level, intermediate, and high-level abstractions that would be useful to represent the kind of complex functions needed for AI tasks.”

Do I think that my trained Keras models are AI? Absolutely not. But, the fact that the mortality model has figured out the shape of a life table (i.e. the function ax in the Lee-Carter model) without any inputs besides for year/age/gender/region and target mortality rates should make us pause to think about the meaningful features captured by deep neural nets. Here is the relevant plot from the paper – consider “dim1”:

This gets even more interesting in NLP applications, such as in Mikolov, Sutskever, Chen et al. (2013) who provide this image which shows that their deep network has captured the semantic meaning of English words:

Also, that deep nets seem to be able to perform “AI tasks” (the term used by Bengio, Courville and Vincent (2013) to mean tasks “which are challenging for current (shallow, my addition) machine learning algorithms, and involve complex but highly structured dependencies”) such as describing images indicates that something more than simple regression is happening in these models.

DL is empirical, not yet scientific

An in joke that seems to have made the rounds is so-called “gradient descent by grad student” – in other words, it is difficult to find optimal deep learning models and one needs to fiddle around with designs and optimizers until something that works is found. This is much easier if you have a team of graduate students who can do this for you, thus the phrase quoted above. What this means in practice is that there is often no off the shelf solution, and little or no theory to guide you in what might work or not, leading to lots of experimenting with different ideas until the networks perform well.

AI in Actuarial Science is a new topic but there are some pioneers

The traditional actuarial literature has not seen many contributions dealing with deep neural networks, yet. Some of the best work I found, which I highly recommend to anyone interested in this topic, is a series of papers by Mario Wüthrich and his collaborators (Gabrielli and Wüthrich 2018; Gao, Meng and Wüthrich 2018; Gao and Wüthrich 2017; Noll, Salzmann and Wüthrich 2018; Wüthrich 2018a, b; Wüthrich and Buser 2018; Wüthrich 2017). What is great about these papers is that the ideas are put on a firm mathematical basis and discussed within the context of profound traditional actuarial knowledge. I have little doubt that once these ideas take hold within the mainstream of the actuarial profession, they will have a huge impact on the practical work performed by actuaries, as well as on the insurance industry.

Compared to statistical methods, though, there are still big gaps in understanding the parameter/model risk of these deep neural networks and an obvious next step is to try apply some of the techniques used for parameter risk of statistical models to deep nets.

The great resources available to learn about and apply ML and DL

There are many excellent resources available to learn about Machine and Deep Learning that I discuss in the resources sections of the paper, and, best of all, most of these are free, except for opportunity costs.

Lastly, a word about Keras , which is the high level API that makes fitting deep neural models easy. This is a phenomenally well put together package, and the R interface makes it much more accessible to actuaries who might not be familiar with Python. I highly recommend Keras to anyone interested in experimenting with these models, and, Keras will be able to handle most tasks thrown at it, as long as you don’t try anything too fancy. One thing I wanted to try, but couldn’t figure out was how to add an autoencoder layer to a supervised model where the inputs are the outputs of a previous layer, and this is one of the few examples where I ran into a limitation in Keras.

References

Bengio, Y. 2009. “Learning deep architectures for AI”, Foundations and trends® in Machine Learning 2(1):1-127.

Bengio, Y., A. Courville and P. Vincent. 2013. “Representation learning: A review and new perspectives”, IEEE transactions on pattern analysis and machine intelligence 35(8):1798-1828.

Gabrielli, A. and M. Wüthrich. 2018. “An Individual Claims History Simulation Machine”, Risks 6(2):29.

Gao, G., S. Meng and M. Wüthrich. 2018. Claims Frequency Modeling Using Telematics Car Driving Data. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3102371. Accessed: 29 June 2018.

Gao, G. and M. Wüthrich. 2017. Feature Extraction from Telematics Car Driving Heatmaps. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3070069. Accessed: June 29 2018.

Hinton, G. and R. Salakhutdinov. 2006. “Reducing the dimensionality of data with neural networks”, Science 313(5786):504-507.

Mikolov, T., I. Sutskever, K. Chen, G. Corrado et al. 2013. “Distributed representations of words and phrases and their compositionality,” Paper presented at Advances in neural information processing systems. 3111-3119.

Noll, A., R. Salzmann and M. Wüthrich. 2018. Case Study: French Motor Third-Party Liability Claims. SSRN. https://ssrn.com/abstract=3164764 Accessed: 17 June 2018.

Wüthrich, M. 2018a. Neural networks applied to chain-ladder reserving. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2966126. Accessed: 1 July 2018.

Wüthrich, M. 2018b. v-a Heatmap Simulation Machine. https://people.math.ethz.ch/~wueth/simulation.html. Accessed: 1 July 2018.

Wüthrich, M. and C. Buser. 2018. Data analytics for non-life insurance pricing. Swiss Finance Institute Research Paper. https://ssrn.com/abstract=2870308. Accessed: 17 June 2018.

Wüthrich, M.V. 2017. “Covariate selection from telematics car driving data”, European Actuarial Journal 7(1):89-108.

Thoughts on the International Congress of Actuaries 2018

I had to get a couple more CPD hours done and the ICA 2018 conference came along at exactly the right time! This time around, a virtual option was offered and the Actuarial Society of South Africa (ASSA) organized access for all of its members – a really great move by ASSA in my opinion, and I hope that others benefited as much as I did from this intellectually stimulating event.

I listened with a focus on P&C insurance (I prefer the American term, but in other jurisdictions: SA – short-term, UK – GI, Europe – non-Life), so my comments that follow don’t take account of the sessions on other actuarial areas that I have no doubt were also very worthwhile.

In a previous post I advanced the view that actuarial science is not standing still as a discipline, and that comments such as “AI instead of actuaries” are short-sighted. I am glad to report that the discipline is moving forward rapidly to incorporate machine learning and data science into its toolbox – of the 28 sessions I listened to (I needed a lot of CPD!), 9 mentioned machine learning/data science and had some advice on methods or integrating ML into actuarial practice. Another good sign is that some of the leading researchers speaking at the conference – such as Paul Embrechts and Mario Wüthrich – provided their (positive) thoughts on integrating ML and data science into actuarial science. The actuarial world is moving forward rapidly and I think the prospects for the profession are good, if the actuarial associations around the world recognize the trends quickly and incorporate ML/data science and more into the curricula.

My standout favourite session was by Mario Wüthrich (who some actuaries will recognize as one of the co-authors of the Merz-Wüthrich one year reserve risk formula), who presented on his paper “Neural networks applied to Chain-ladder reserving”, available on SSRN. Besides for the new method he suggests, which I think is one of the best solutions when an actuary needs to reserve for IBNR by sub-category (such as LOB/injury code etc), I found the perspectives on ML that he interspersed his talk with fascinating, an example of which is connecting neural networks to Hilbert’s 13^th problem. One point made was that the new claims of algorithms potentially reserving with much less uncertainty than the chain-ladder need to be treated with caution until the issue of model risk is dealt with, and the underlying assumptions brought out into the open.

A brilliant session was given by Paul Glasserman on “Robust model risk assessment” (the paper is ungated if you google it). At the heart of the idea is that model risk could be defined as an alternative probability measure (i.e. if the model generates a baseline probability distribution on a random variable of interest, then model risk could arise if in fact the RV followed a different probability distribution) attached to the simulations presented by a stochastic model (instead of an issue with the model parameters or structure). With this idea in hand, the presentation carried on to show how to find the maximally damaging alternative probability measure for a given level of model risk, as measured by the relative entropy between the baseline and alternative models. The major benefit for actuaries is that this is simple to implement in practice and gives rise to some interesting insights into what can go wrong with a model.

Another session that stood out for me is Pietro Parodi and Peter Watson’s session ”Property Graphs: A Statistical Model for Fire Losses Based on Graph Theory”. The idea here is to find a model that helps to explain why commercial property follows the heavy tail severity distributions observed in practice. Often, in practice, algorithms/distributions are applied because they work and not because there is a good logical basis based on first principles. Along these lines, I am reminded of some of the work of Perks/Beard who proposed first principle models to explain their mortality laws (my small contribution along these lines is an explanation of the chain-ladder algorithm as a life-table estimator, on this blog). Parodi and Watson use graph theory to represent properties (houses/factories etc) and define statistical models on the graph of fire events. These models lead, after simulation and in aggregate, to curves defining the overall severity of a fire event that are not radically different from the current set of curves used by actuaries.

Paul Embrecht’s sessions were amazing to listen to, because of his ability to tie together so many disparate strands in actuarial science and quantitative risk management. It was particularly meaningful to see Paul, who is close to retirement, show-casing some of the work of Mario Wüthrich on telematics on stage, and providing his view that the novel application of data science, as embodied by the work on telematics, is a direction for the discipline to take.

I also enjoyed Hansjörg Albrecher’s session “Flood risk modelling” which was a tour-de-force of various statistical modelling techniques applied to some novel data.

Some other noteworthy sessions which spring to mind:

“Data based storm modelling” – this was a demonstration of how a storm model was built for Germany by a medium size consulting firm. I liked the application of mathematics (Delauney triangulation for mapping the wind events and Singular Value Decomposition for dimensionality reduction) to a relatively big data problem (40m records).
“Using Risk Factors in Insurance Analytics: Data Driven Strategies” – how to apply GLMs to sparse and high-dimensional data without breaking the linearity assumptions.
“Trend in Marine Insurance” – a great overview of this specialty line.

I am not sure if access to the VICA is still open, but if you have any interest in the topics above, I would strongly recommend you try and view some of the sessions.

Naked – South Africa’s Lemonade?

A start-up that I’ve been excited to watch for a while now – enigmatically called Naked – launched their product to market late last week. Naked offers car insurance over an app with quote and bind functionality. This post is going to discuss my thoughts on their offering, draw a parallel to the well-known Lemonade start-up in the USA and briefly touch on my experience with the app (spoiler – good price and fun experience).

The basis of Naked’s offering seems to be the following points:

The policy on offer is primarily Car insurance (Auto), plus a couple of extras (wheels, lights etc) over an app using a chatbot instead of traditional forms.
A neat feature is a pause in accident coverage when your car is stationary, leaving only the flood, theft, fire and other relevant perils on the policy.
They take a fixed fee upfront (20%) of premium to cover expenses and profit leaving the rest to cover claims
If claims are less than the 80% of premium written, then the extra portion is paid out to a charity of the policyholder’s choice
Policies are underwritten by Hollard, which is a big player in the SA insurance markets. Although not explicit on the website, I assume this means that if claims are more than the claims “pot”, then Hollard and the reinsurers on this product (Munich Re and Scor, according to the Naked website) need to pick up the bill
In a legal sense, Naked is an intermediary (broker) operating with what I assume is a binder agreement with Hollard, so actually, they could not bank the underwriting profit even if they wanted to

Anyone following the Insurtech scene will notice that this setup is very similar to Lemonade, which is a licensed insurer in the USA offering Renters and Homeowners insurance. Some of the similarities are using a chatbot instead of traditional forms, the fixed fee for expenses and profit, leaving a pot behind for claims, and the extra funds in the pot going to a charitable cause as a Giveback (Lemonade seems to have come up with this terminology). Some of the differences are that Lemonade is an insurance company and could participate in underwriting profit and loss if they wanted, Lemonade seems to have allowed themselves more discretion as to when they “Giveback” (Board needs to approve, they build up a rainy-day fund etc) and Lemonade pushed an emphasis on data science and behavioural economists, whereas Naked discusses their “AI” technology which they hope will bring down costs.

Using the app is a pleasant and short experience, with an emphasis on ease of use and quickly speeding you along the process. I was pleasantly surprised that the pricing is actually quite reasonable on the vehicle I chose to request a quote for (which is the same vehicle I tried to insure online a few weeks ago and discussed in this post). To actually get the cover, one takes a few photos of your vehicle, and I wonder what exactly the process is on the back-end after that – are these just used at the claims stage or is there something else happening here (deep learning pre-inspection maybe )?

The only point that left me wondering a little bit is a particular claim made on the Naked website when it is explained that A “… flat fee means that our income doesn’t depend on how much we pay out in claims, so we have no reason to make claiming difficult.” I don’t think this is entirely true, despite the claim to the contrary – at the end of the day, if there are claims greater than 80% of written premium, someone will need to foot the bill, whether the insurer or the reinsurers. Therefore, claims will still need to be managed carefully to avoid upsetting these parties (insurer and reinsurers), and claims overruns of the 80% in the pot for an extended period of time will mean prices will need to go up. So I don’t think this has solved the “the age-old cycle of distrust between insurers and their customers” as neatly as one might think from the website.

As a conclusion, it is exciting to see a well-built startup bring an innovative insurance business model to South African shores!

Motor policy price comparisons – comparing apples with oranges

Introduction

I recently tried to obtain a quote for comprehensive motor insurance from a price comparison website. The quote was on an older car, worth approximately R70k. After asking for some of my details, the comparison website presented me with something quite similar the following table of premiums and excesses.

Note that these are not the actual premiums and excesses quoted (due to copyright issues) but are modified by adding a normal random variable and then rounding the excesses. I don’t think these changes distort the economic reality of what I was quoted, but, nonetheless, these are not the actual numbers.

	Premium	Excess
1	458	9845
2	514	4840
3	534	7620
4	532	4580
5	544	4580
6	584	4580
7	571	4580
8	767	3920
9	894	4515

Most of the policies presented had similar terms and conditions – some sort of cashback benefit, hail cover and car rental. The distinguishing features seemed to be premium and excess. However, as a consumer, in this case, I found it difficult to compare these premiums, except for those with a R4.58k excess. What is a good deal and which of these is more overpriced? It makes some sense that policy number 9 is overpriced – I can get a lower excess for a lower premium, so this policy is definitely sub-optimal. But what about policy 8 – this has a low excess, but seems very expensive compared to the policies with only a slightly higher excess. Is this reasonable? Intuitively, and having some idea how motor policies are priced, my answer is no, but can we show this from the numbers presented?

Moral Soap Box (feel free to skip)

Before getting into the details of how I tried to work with these numbers, I think it is important to stop and consider the public interest. Would the general consumer of insurance have any idea how to compare these different premiums given the different excesses? Probably not, in my opinion, leading to the title of this post. I guess that some rational consumers would be ‘herded’ into comparing policies 4-7, since they have the same excess, and maybe go for the cheapest one of those. But this is perhaps only a “local minimum” – maybe, in fact, one of the other policies offers better value. Also, one has to rely on the good faith of those running the comparison website to present policies with only the same terms and conditions, or else this supposedly rational strategy might backfire if policy number 4 has worse terms. Lastly, this all makes sense on day one – what will the insurer offering such a generous premium do over the lifetime of the policy – will they keep being so generous or will the consumer be horrified after a couple of steep price hikes.

Hence, this set of quotes seems to me a “comparison of apples with oranges”.

The code

As usual, the code for this post is on my Github, over here:

https://github.com/RonRichman/ABC_pricing/

Note that code is under the open-source MIT License, so please read that if you want to use it!

The theory

Of course, if we had access to the pricing models underlying these premiums then it would be a simple matter to work out what is expensive and what is not, but the companies quoting were not so kind as to share these and only provided these point estimates. I have some ideas about the frequency of occurrence of motor claims and the average cost per claim, so ideally I would want to incorporate this information into whatever calculations I perform, pointing to the need for some sort of Bayesian approach to the problem. However, the issue here is that the price of a general non-Life/P&C policy is really the outcome of a complicated mathematical function – the collective risk process – often represented by a compound Poisson distribution, which, to my knowledge, does not have an explicit likelihood function (which is why, in practice, actuaries will use Monte Carlo simulation or other approaches like the Panjer approximation or the Fast Fourier Transform to simulate from the distribution). Since most Bayesian techniques require an explicit likelihood function (or the ability to decompose the likelihood function into a bunch of simpler distributions), it would therefore be difficult to build a Bayesian model with standard methods like Markov Chain Monte Carlo (MCMC).

So, in this blog post I share an approach to this problem that I took using an amazing technique called Approximate Bayesian Computation (‘ABC’). To explain the basic idea, it is worth going back to the basics of Bayesian calculations, which try to make direct inferences about parameters in a statistical problem. These calculations generally progress in three steps

Prior information on the problem at hand is encoded in a statistical distribution for the parameters we are interested in. For example, the average cost per claim might be distributed as a Gamma distribution.
The data likelihood is then calculated based on a realization of the parameters from the prior distribution.
The likelihood of a set of parameters is then assessed as the product of a) the likelihood of getting that parameter set multiplied with b) the data likelihood divided by c) the total probability of all parameter sets and data likelihoods.

In this case, the data likelihood is not available easily. The basic idea of ABC is that in models with an intractable likelihood function, one can use a different method of ascertaining whether or not a parameter set is “likely” or not. That is, by generating data based on the prior distribution and comparing how “close” this generated data is to the actual data, one can get a feel for which parts of the prior distribution make sense in the context of the data, and which do not.

For some more information on ABC, have a look at this blog post and the sources it quotes:

http://www.sumsar.net/blog/2014/10/tiny-data-and-the-socks-of-karl-broman/

The generative model and priors

I assumed that the number of claims, N, claims are distributed as a Poisson distribution, with a frequency parameter drawn from a beta distribution:

I selected the parameters of the Beta distribution to produce a mean frequency of .25 (i.e. a claim every four years) with a standard deviation of .075.

Cost per claim was modelled as a log-normal distribution:

Instead of putting priors on and , which do not have an easy real world interpretation, instead I chose priors for the average cost per claim (ACPC) and the standard deviation of the cost per claim (SDCPC) , and, for each draw from these prior distributions, found the matching parameters for the log-normal. Both of these priors were modelled as Gamma distributions:

with the parameters of the gamma chosen so that the average cost per claim is R20k with a standard deviation of R2.5k and the standard deviation of the cost per claim is R10k with a standard deviation of R2.5k.

The code to find the corresponding log-normal parameters, once we have an ACPC and SDCPC is:

[sourcecode language=”r”]

lnorm_par = function(mean, sd) {

cv = sd/mean

sigma2 = log(cv^2+1)

mu = log(mean)-sigma2/2

results = list(mu,sigma2)

results

}

[/sourcecode]

Lastly, I assumed that the insurers are working to a target loss ratio of 70% (i.e. for every 70c of claims paid, the insurers will bring in R1 of income), with a standard deviation of 2.5%. This distribution also followed a beta, similar to the frequency rate.

The following algorithm was then run 100 000 times:

Draw a frequency parameter from the Beta prior
Simulate the number of claims from the Poisson distribution, using the frequency parameter
Draw an average cost per claim and it’s standard deviation, and find the corresponding log-normal distribution
For each claim, simulate a claim severity from the log-normal
For each excess with a corresponding premium quote, subtract the excess from the claims and add these up
The implied premium is the sum of the claims net of the excess divided by:
- 12, since we are interested in comparing monthly premiums
- the target loss ratio of the insurers, to gross up the premium for expenses and profit margins

Inference

So far we have generated lots of data from our priors. Now it is time to see which of the parameter combinations actually produce premiums reasonably in line with the quotes on the website. To simplify things, I put each of the simulated parameters into one of nine “buckets” depending on the percentile of the parameter within its prior distribution.

[sourcecode language=”r”]

claims[,freq_bin :=ntile(freq,9)]

claims[,sev_bin :=ntile(acpc,9)]

claims[,sev_sd_bin :=ntile(acpc_sd,9)]

claims[,lr_bin :=ntile(LR,9)]

claims[,id:=paste0(freq_bin, sev_bin, sev_sd_bin, lr_bin)]

[/sourcecode]

Then, indicative premiums for each bucket were derived by averaging the premiums derived in the previous section for each parameter “bucket”. The distance between the generated data and the actual quoted premiums was taken as the absolute percentage error:

And for the very last step, the median distance between the generated and quoted premiums was found for each parameter bucket. I only selected those “buckets” which produced a median distance of less than 8%. The median was used, instead of the mean, since I believe that some of the quotes are actually unreasonable, and I do not want to move the posterior distance too much in their favour by using a distance metric that is sensitive to outliers.

Now we have everything we need to show the posterior distributions of the parameters:

Some observations are that the prices I was quoted implies both a frequency and severity of claims that are a little bit higher than I assumed, but with a lower average cost per claim. The standard deviation of the average cost per claim is lower as well, with less weight given to the tails than I had assumed. Lastly, the loss ratio distribution matches the prior quite well.

Prices

Lastly, the implied prices are shown in red the next image.

Bearing in mind that this is all based on the assumption of actuarially unfair premiums – in other words, allowing the insurer to add a substantial profit to the actual risk premium by targeting a loss ratio of 70% – only three of the quotes are reasonable (two of those with an excess of R4.58k and the one with an excess of R4.84k). The rest of the quotes are significantly higher than can be justified by my priors on the key elements of the claims process, and it would seem irrational for a consumer with similar priors to take out one of these policies.

Conclusion

This post showed how it is possible to back out the parameters that underlie an insurance quote using prior information and Approximate Bayesian Computation. Based on the analysis, we can go back to the original question I asked at the beginning of the post – is the low excess policy number 8 priced reasonably? The answer, based on my priors, seems to be “no”, and the excesses quoted here do not seem to be all that useful when it comes to explaining the prices of each quote.

What could be modelled more accurately? Some of the policies include a cashback, which we could priced explicitly using the posterior parameter distributions, but I personally attach very little utility to cashback benefits and would not pay more for one. So this is a more minor limitation, in my opinion.

I would love to hear your thoughts on this.

Bridging between the tribes – chain-ladder and lifetables

Introduction

During the last few years of my career I have had the opportunity to work in two of the major fields of practice for actuaries – life insurance and non-life insurance. Something that always bothered me is that actuaries who perform reserving work in either of these two areas use totally different techniques from each other.

Life actuaries will generally build cash-flow models to project out expected income and outgo to derive the expected profit for each policy they are called on to reserve for, which is then discounted back to produce the reserve amount. One of the key inputs into this type of reserving model is a life table which tabulates mortality rates which apply to the insured population that is being reserved for.

Non-life actuaries, on the other hand, almost never build cash-flow models, but will apply a range of techniques to past claims information (arranged into a “triangle”, see later in the post for a famous example) to derive expected claims amounts that are held as an incurred but not reported reserve (IBNR). Some of these techniques are the chain-ladder, the Bornhuetter-Ferguson (Bornhuetter and Ferguson 1973) and Cape-Cod techniques (Bühlmann and Straub 1983). Lifetables are never considered.

It would make sense intuitively that there is some connection between these two “tribes” of actuaries who, after all, are both trying to do the same things, but for different types of company – make sure that the companies have enough funds held back to fund claims payments. This post tries to illustrate that in fact, hidden away in the chain-ladder method, there is an implicit life table calculation and that IBNR calculations can be cast in a life table setup. The key idea was actually expressed in a paper I wrote for the 2016 ASSA convention with Professor Rob Dorrington and appeared as an appendix in the paper.

Something else that the idea helps with is that it provides an explanation why the chain-ladder is so popular and seems to work well. The chain-ladder method remains the most popular choice of method for actuaries reserving for short term insurance liabilities globally and in South Africa (Dal Moro, Cuypers and Miehe 2016). Although stochastic models have been proposed for the chain-ladder method by Mack (1993) and Renshaw and Verrall (1998), the underlying chain-ladder algorithm is still described in the literature as an heuristic, see for example Frees, Derrig and Meyers (2014).

The simple explanation for the success of the chain-ladder method is that underlying the estimates of reserves produced by the chain-ladder method is a life table and that the chain-ladder method is actually a type of life-table estimator.

The rest of the post shows the simple maths and some R code to “pull out” a life table for the chain-ladder calculation. In a future post, I hope to discuss some other helpful intuitions that can be built once the basic idea is established.

The code for this post is available on my GitHub account here:

Code

Chain-ladder calculations

Define:

as the claims amount relating to accident year i in development period J, where there are I accident years and J development years. An example claims triangle is shown below, that appears in Mack (1993). This triangle can easily be pulled up in R by running the following code that references the excellent Chainladder package:

[sourcecode language=”r”]

require(ggplot2)

require(ChainLadder)

require(data.table)

require(reshape2)

require(magrittr)

GenIns

[/sourcecode]

i	C(i,1)	C(i,2)	C(i,3)	C(i,4)	C(i,5)	C(i,6)	C(i,7)	C(i,8)	C(i,9)	C(i,10)
1	357 848	1 124 788	1 735 330	2 218 270	2 745 596	3 319 994	3 466 336	3 606 286	3 833 515	3 901 463
2	352 118	1 236 139	2 170 033	3 353 322	3 799 067	4 120 063	4 647 867	4 914 039	5 339 085
3	290 507	1 292 306	2 218 525	3 235 179	3 985 995	4 132 918	4 628 910	4 909 315
4	310 608	1 418 858	2 195 047	3 757 447	4 029 929	4 381 982	4 588 268
5	443 160	1 136 350	2 128 333	2 897 821	3 402 672	3 873 311
6	396 132	1 333 217	2 180 715	2 985 752	3 691 712
7	440 832	1 288 463	2 419 861	3 483 130
8	359 480	1 421 128	2 864 498
9	376 686	1 363 294
10	344 014

The chain-ladder algorithm predicts the next claims amount in the table:

as:

where f is the so called loss development factor in development period j.

The volume weighted estimator of the loss development factor is defined in Mack (1993) as:

The estimate of the ultimate claims – the claims amount after all of the claims development is finished – for accident year i is given by:

In R, most of the chain-ladder calculations have been helpfully automated. To produce the loss development factors and an estimate of the IBNR, one runs the following code:

[sourcecode language=”r”]

fit = ChainLadder::MackChainLadder(GenIns)

plot(fit)

[/sourcecode]

Estimating the life table

Now for the lifetable. The percentage of claims developed by development period j is defined as:

and the percentage of claims developed in period j is:

The claims development can be cast in demographic terms as follows. Assume that for each accident year i, a population of claims:

will eventually be reported. In each development period j:

of the claims will be reported, or will “die”. The term is therefore comparable to the demographic quantity:

which is the probability of death in the period j, after surviving to time j. A full lifetable can then be derived from:

This is shown in the next table and plot, followed by the R code to produce the numbers.

j	1	2	3	4	5	6	7	8	9	10
C(i,j+1)	11 614 543	17 912 342	21 930 921	21 654 971	19 828 268	17 331 381	13 429 640	9 172 600	3 901 463
C(i,j)	3 327 371	10 251 249	15 047 844	18 447 791	17 963 259	15 954 957	12 743 113	8 520 325	3 833 515
f(j)	3.49	1.75	1.46	1.17	1.10	1.09	1.05	1.08	1.02	1.00
F(j,J)	14.45	4.14	2.37	1.63	1.38	1.25	1.15	1.10	1.02	1.00
tq0	0.07	0.24	0.42	0.62	0.72	0.80	0.87	0.91	0.98	1.00
t\|q0	0.07	0.17	0.18	0.19	0.11	0.07	0.07	0.05	0.07	0.02
qx	0.07	0.19	0.24	0.33	0.28	0.27	0.34	0.35	0.80	1.00
tpx	0.93	0.76	0.58	0.38	0.28	0.20	0.13	0.09	0.02	–
px	0.93	0.81	0.76	0.67	0.72	0.73	0.66	0.65	0.20	–

[sourcecode language=”r”]

t_prime_qx = c(PERC_DEV[1], diff(PERC_DEV))

max_age = length(PERC_DEV)

px = numeric(10)

tpx = numeric(10)

qx = numeric(10)

px[1] = (1-t_prime_qx[1])

tpx[1] = px[1]

qx[1]= t_prime_qx[1]

for (i in 2:length(PERC_DEV)){

print(i)

qx[i] = t_prime_qx[i]/tpx[i-1]

px[i] = (1-qx[i])

tpx[i] = tpx[i-1]* px[i]

}

lifetable = data.table(t = seq(1,max_age), PERC_DEV=PERC_DEV,px = px, tpx=tpx, qx=qx,t_prime_qx=t_prime_qx )

lifetable_melt = lifetable %>% melt(id.var="t") %>% data.table()

lifetable_melt %>% ggplot(aes(x=t, y=value)) + geom_line(aes(group = variable, colour = variable)) + facet_wrap(~variable)

[/sourcecode]

Conclusion

When will the above calculations work well? These calculations make sense when dealing with triangles that increase monotonically i.e. do not allow for over-reserving or salvage and recoveries. A good example is on count triangles of paid claims.

Now that we have shown that the chain-ladder estimates a lifetable, the question is whether this is just an interesting idea that lets one connect two diverse areas of actuarial practice, or if any significant insights with practical implications can be derived. That will be the subject of the next post.

References

Bornhuetter, R. and R. Ferguson. 1973. “The Actuary and IBNR”, Proceedings of the Casualty Actuary Society Volume LX, Numbers 113 & 114

Bühlmann, H. and E. Straub. 1983. “Estimation of IBNR reserves by the methods chain-ladder, Cape Cod and complementary loss ratio,” Paper presented at International Summer School. Vol. 1983:

Dal Moro, E., F. Cuypers and P. Miehe. 2016. Non-life Reserving Practices. ASTIN.

Frees, E.W., R.A. Derrig and G. Meyers. 2014. “Predictive Modeling in Actuarial Science”, Predictive Modeling Applications in Actuarial Science 1:1.

Mack, T. 1993. “Distribution-free calculation of the standard error of chain-ladder reserve estimates”, Astin Bulletin 23(02):213-225.

Renshaw, A.E. and R.J. Verrall. 1998. “A stochastic model underlying the chain-ladder technique”, British Actuarial Journal 4(4):903-923.

AI in Actuarial Science

The CEO of Lemonade, Dan Schreiber, made the statement in a recent talk that:

“The future of insurance will be staffed by bots rather than brokers and AI in favor of actuaries.”

Despite sounding rather impressive, the Youtube video of the rest of Dan’s talk doesn’t really go into much detail to explain his thinking. That being said, this statement seems to be predicated on the view that actuarial science won’t evolve to embrace AI and bring the tools of modern statistical and machine learning into the everyday practice of actuaries. I personally think this view is inaccurate, and that actuaries practicing in “the future” will just as easily turn to a machine learning algorithm as they will to the traditional tools of the trade. Together with the domain specific knowledge of insurance that actuarial training brings, I think this will be a powerful combination that will serve the insurance industry well.

One reason I feel comfortable making this prediction is that the actuarial literature is starting to examine AI and machine learning, and how it can be applied to traditional actuarial problems. Many of the best examples that I have seen are from Professor Mario Wüthrich (who, together with his colleagues, is credited with the thinking and formula behind the Solvency II Reserving Risk formula). Some of his work includes applying deep neural networks to telematics data and machine learning approaches to the problem of IBNR reserving. Other recent papers include one applying deep auto-encoders to analyse population mortality in a Lee-Carter setup and another examining gradient boosted Tweedie models for pricing.

I plan to examine some of these new ideas in actuarial science on this blog in the next couple of posts and also provide code in R and Python on my Github account that will allow anyone who is interested to see how to apply these ideas practically. First up will be auto-encoders, which are a form of dimensionality reduction used to summarize high dimensional data in a low dimensional form. As an example of high dimensional data, think of a life table that has entries for the mortality rate for each age in the table. A life table with rates up to age 110 can be thought of a 110 dimensional vector. Although it is not a new idea to summarize a life table with a only a few parameters (the mortality laws, the Lee-Carter model using SVD and Brass’ Logit Transform spring to mind), recent work has shown that neural networks can estimate these summaries more accurately than traditional methods.

I also plan to build a section on my website that acts as a guide to this emerging field of actuarial science that I hope will be useful to other actuaries (and professionals) who want to understand how AI and machine learning can be applied to actuarial problems. As a start, some of the excellent material from Prof Wüthrich is available on his website.

Insurance seems to have become an exciting industry to work in these days and I am equally excited about the opportunities that lie ahead for actuaries and other insurance professionals.

Self-driving cars and the impact of Motor Accidents on Mortality – The Case of South Africa

Introduction

This post is a continuation of the last two weeks and tries to find an estimate of impact of motor accident deaths on life expectancy in South Africa. A significant part of this impact could be eliminated by self-driving cars and the post two weeks ago looked at the possible benefits in the UK and the USA. Last week I discussed some of the issues encountered when dealing with demographic data in South Africa and this week proposes a simple approach that tries to avoid some of the issues that were discussed to derive the impact of motor accidents on life expectancy.

Caveat: Digging into these numbers, it seems to me that dealing with this properly needs much more than a blog post and could be the subject of detailed research. Indeed, much of the work has been done already in the National Burden of Disease study (Pillay-Van Wyk, Laubscher, Msemburi et al. 2014; Pillay-van Wyk, Msemburi, Laubscher et al. 2016) and the most this post can attempt to do is see what can be derived from the publicly available information. I recommend that anyone interested in mortality in South Africa go through this fantastically detailed study.

My guess is that the numbers in this post are a lower bound on what the true reduction in life expectancy due to motor accidents is.

If you want to consider the appropriateness of these numbers, please also read the section below “Conclusions and Limitations”.

The code for this post has been uploaded to my Github here, in the file “traffic mort – RSA.r”:

https://github.com/RonRichman/traffic_mortality

Approach

For the purpose of this post, I am going to try avoid the issue of incomplete reporting of deaths as much as possible. I am not aware of any demographic (i.e. mathematical) method that can correct incomplete reporting of deaths by cause, I am going to make the strong assumption that the level of completeness of reporting of deaths by cause is constant in each year i.e. there is no greater propensity to report a death due to one cause more than another. If this is the case, then some simple arithmetic shows that the ratio of deaths due to one cause to deaths due to another cause is an unbiased estimate of the true ratio and therefore we don’t need to correct the death data.

Important to note here is the study by Matzopoulos, Prinsloo, Wyk et al. (2015) who went through mortuary records in 2009 to work out the true number and cause of deaths due to injuries, including motor accident deaths. The Burden of Disease study (Pillay-van Wyk, Msemburi, Laubscher et al. 2016) used these numbers as an input into a calculation whereby they corrected for injury-specific completeness of reporting, which implies that the assumption made above is a little questionable.

To deal with the issue of mislabelled cause of death data, I am going to take the following approaches:

Firstly, hunt through the data to find causes of death that are not in the CDC list but probably represent motor accident deaths
Using these deaths, establish a cause-specific age profile and hunt though the rest of the data to see if we find any matches.
Try to cross-reference the WHO data with the Road Traffic Management reports on accident fatalities in South Africa.

I am then going to derive a set of factors for each age group and year which explain how many of the reported deaths are due to motor accidents.

Lastly, I am not going to try rederive my own set of mortality rates given the uncertainties in both the death and population data, but I am rather going to rely on modelled estimates of mortality from the Thembisa model (https://www.thembisa.org/). The Thembisa model seems to me to be the best publicly available model for this purpose and the project maintainers have made a very commendable effort to make the model and documentation available at their website. Using these estimates and the reduction factors discussed in the step before this, we will have everything we need to work out an adjusted life expectancy.

Data

I used three main sources of data. Like last week, the cause of death data is from the WHO Mortality database (http://www.who.int/healthinfo/statistics/mortality_rawdata/en/), which compiles death counts in 5-year bands by the ICD10 classification for a large number of countries around the world.

Secondly, I used the reports from the Road Traffic Management Corporation to compare the number of reported motor accident deaths in the WHO data to an external source. The reports are available here and contain many other interesting pieces of data:

http://www.rtmc.co.za/index.php/reports/traffic-reports

Lastly, I used the Tembisa model outputs to provide mortality rates.

Coding

Similar to the last post, I used the Centre for Disease Controls classification of the ICD-10 codes to identify the deaths due to motor accidents. This classification can be found here: https://www.cdc.gov/nchs/data/nvsr/nvsr66/nvsr66_05.pdf

However, this seemed to capture an unrealistically small number of deaths. When I looked at the data a bit more, I realized that many of the counts in the ICD10 codes relating to traffic deaths were under less informative codes than the CDC coding allowed for:

Cause	Deaths	ICD Title
*V89*	162598	Motor- or nonmotor-vehicle accident, type of vehicle unspecified
*V09*	11890	Pedestrian injured in other and unspecified transport accidents
*V19*	708	Pedal cyclist injured in other and unspecified transport accidents

I added these three to the CDC list on the basis that most of the recorded motor accident deaths are probably lurking in these codes. I then worked out the percentage of deaths at each age accounted for by these deaths, producing the following plot:

The shape of these curves is quite different from those for the UK and the USA, which peak at the ages when people begin to drive:

Looking more closely at the plot for South Africa, one can see that in recent years, the pattern is shifting towards a peak at these ages too. This makes sense – as fewer AIDS deaths get recorded in recent years (with the impact of AIDS mortality falling in recent years, probably due to ARVs as discussed in Pillay-van Wyk, Msemburi, Laubscher et al. (2016)[1]), the impact of other causes is increasing.

More disturbing, though, is the fact that motor accidents don’t seem to be as much of an issue in South Africa, compared to the USA and UK, accounting for a maximum of about 7.5% of deaths compared to upwards of 40% in the USA at some ages.

Some more prior knowledge comes from Pillay-van Wyk, Msemburi, Laubscher et al. (2016) who show on page 646 of their study that road injuries were the ninth largest cause of death in South Africa in both 1997 and 2012.

A final hint that we are missing some deaths comes from the Road Traffic Management Corporation reports, which contain fatality numbers for each of the years since 2004. I pulled these numbers out of the reports, and produced the comparison shown below:

Year	Deaths	Crash Fatalities	Proportion
*2004*	5 026	12 778	39%
*2005*	5 279	14 135	37%
*2006*	5 546	15 419	36%
*2007*	5 995	14 920	40%
*2008*	5 470	13 875	39%
*2009*	5 550	13 768	40%
*2010*	5 511	13 967	39%
*2011*	5 027	13 954	36%
*2012*	5 250	13 528	39%
*2013*	5 544	11 844	47%
*2014*	5 786	12 702	46%
*2015*	6 171	12 944	48%

The table shows that only about 40-45% of the deaths registered by the RTMC are showing up in the WHO data. The RTMC uses a different reporting process than the deaths going into the WHO data and relies on reports issued by the police in the case of accidents. Could we perhaps be missing some of the deaths from the RTMC because we are missing some deaths in the WHO data due to poor coding?

Searching through the data

To look for some of the missing deaths, I calculated the age “signature” of the traffic deaths that we have already found, which I defined as the proportion of deaths in each age bucket for each sex in each year that we have coded as being due to motor accidents. This signature looked like the following plot.

I then searched through the WHO data and calculated the distance between the age signature for the motor deaths and each cause of death labelled by ICD10 code. The table below shows the results:

Country	Cause	Deaths	distance
*South Africa*	Y34	273496	30%
UK	X969	41	32%
*USA*	X940	1965	38%
*USA*	X930	2188	40%
*USA*	X708	3377	47%
*USA*	O960	490	51%
*USA*	X701	2454	51%
*USA*	X804	1052	53%
*USA*	X730	9945	53%
*USA*	X744	2488	55%
*USA*	X740	35822	56%
*USA*	X808	1033	57%
*USA*	X816	805	57%
*USA*	W776	73	57%
*USA*	X748	5357	57%
*USA*	O961	596	58%
*USA*	X718	1207	58%
*USA*	X702	667	59%
*USA*	X728	2077	60%
*USA*	W875	63	61%

It turns out that the closest match amongst the SA, USA and UK data is code Y34, which stands for “Unspecified event, undetermined intent”. The correspondence is quite good for both sexes, but a little bit out for females at the younger ages. The match is shown in the following plot (the lines represent the age signature of Y34 and the dots represent the signature shown above):

So I think it is a fair conclusion that some of the motor related deaths in South Africa land up in the WHO data under a “garbage” code. This is also in line with Matzopoulos, Prinsloo, Wyk et al. (2015) who found that the aggregate number of deaths in their study was not significantly different from the aggregate number due to external deaths in the Stats SA data (which feeds into the WHO dataset) but that deaths had been mislabelled.

Therefore, I transferred some of the deaths from cause Y34 into those related to motor accidents. I used the RTMC reports as the “true” number of deaths, which is another questionable assumption since Matzopoulos, Prinsloo, Wyk et al. (2015) actually found more motor accident related deaths than those reported by the RTMC. For this reason I view the number produced next as a lower bound, and discuss more in the conclusion.

The final proportions of deaths due to motor accidents I derived are as follows:

It can be seen that these are significantly higher than the proportions in the previous section.

Impact on Life Expectancy

The next step is to calculate the impact on life expectancy. I extended out the Thembisa mortality rates to age 110 using a Gompertz curve and then reduced the mortality rates in the Thembisa model by the proportions of the deaths due to motor accident discussed above. These curves for 2015 are shown in the following plot (the blip at age 90 is where the Gompertz curve joins the data and should be smoothed out), together with the curves adjusted for the impact of motor accidents:

The impact on life expectancy at birth is as follows:

Sex	Year	e0	e0 – no motor accidents	Increase
Male	2004	51.72	52.51	0.79
Male	2005	51.71	52.58	0.86
Male	2006	52.07	52.96	0.89
Male	2007	52.93	53.82	0.89
Male	2008	51.24	52.06	0.81
Male	2009	55.41	56.25	0.83
Male	2010	57.20	58.04	0.84
Male	2011	58.50	59.34	0.85
Male	2012	59.20	60.05	0.85
Male	2013	59.74	60.54	0.80
Male	2014	60.15	61.03	0.88
Male	2015	60.47	61.34	0.88
Female	2004	55.77	56.09	0.32
Female	2005	55.66	56.01	0.35
Female	2006	56.32	56.71	0.40
Female	2007	57.83	58.18	0.35
Female	2008	56.35	56.66	0.31
Female	2009	61.15	61.47	0.32
Female	2010	63.04	63.38	0.33
Female	2011	64.71	65.08	0.37
Female	2012	65.79	66.13	0.34
Female	2013	66.81	67.13	0.33
Female	2014	67.66	68.00	0.34
Female	2015	68.00	68.35	0.35

The gain in life expectancy for males is much higher than for females, which is due to two factors:

The higher mortality rates due to accidental death for males, compared to females
The bigger impact of motor deaths for males compared to females, as shown above

Translating these numbers into years of life lost due to motor accidents, using the reported 2015 birth cohorts from Stats SA, we get an 417 124 years of life for males and 163 157 for females.

Conclusion and Limitations

This post examined the impact of motor accident related deaths on mortality and life expectancy in South Africa. Like most exercises focussing on South African mortality that I have been involved in, it comes down to trying to work out how deaths have been reported and recorded.

The key assumptions that were made are:

some of the motor accident related deaths have been misreported under Y34
the RTMC reports are the true number of these deaths
all causes of death are reported with the same level of completeness

Matzopoulos, Prinsloo, Wyk et al. (2015) found that in fact, more deaths had been recorded by mortuary reports in 2009 than appeared in the RTMC reports. The difficulty I have in using their number in this type of armchair analysis is that we know the WHO data is not completely reported, so some part of the deaths that they found relates to the normal under-reporting of deaths in South Africa, and not the cause specific reporting issues. The fact that they found more deaths also invalidates the assumption that all deaths are reported at the same level of completeness but it is unclear to me how to correct the WHO data using their finding.

This represents a limitation of the analysis performed above and it seems to me that the gain in life expectancy derived in this analysis is probably too low.

This post showed that eliminating deaths due to motor accidents would be a big win for public health. The problem is that I imagine self-driving cars will not come to South Africa nearly as quickly as more developed countries and also I don’t imagine that the whole population would benefit immediately. Other challenges for self-driving cars in South Africa are likely to arise from the relatively poor road infrastructure. Therefore, the potential benefits to mortality will not be realized anytime soon.

References

Matzopoulos, R., M. Prinsloo, V.P.-v. Wyk, N. Gwebushe et al. 2015. “Injury-related mortality in South Africa: a retrospective descriptive study of postmortem investigations”, Bulletin of the World Health Organization 93(5):303-313.

Pillay-Van Wyk, V., R. Laubscher, W. Msemburi, R.E. Dorrington et al. 2014. “Second South African National Burden of Disease Study: Data Cleaning, Validation and SA NBD List”, Cape Town: Burden of Disease Research Unit, South African Medical Research Council

Pillay-van Wyk, V., W. Msemburi, R. Laubscher, R.E. Dorrington et al. 2016. “Mortality trends and differentials in South Africa from 1997 to 2012: second National Burden of Disease Study”, The Lancet Global Health 4(9):e642-e653.

[1] Quoting from page 651 of their study “We report a marked decline in HIV/AIDS and tuberculosis mortality since 2006, which can be attributed to the intensified antiretroviral treatment rollout for adults since 2005. According to the National Department of Health, more than 2 million people received antiretroviral therapy in 201231 versus an estimated 47 500 in 2004. The rollout of the prevention of mother-to-child transmission programme since 2002 has reduced infections and hence deaths in infants.”

Working with Demographic Data in South Africa

Last week I posted about the impact of motor related accidents on mortality in the USA and UK and derived the increased life expectancy that would occur if self-driving cars eliminated the extra mortality from motor accidents. There was some interest expressed to me in considering the case of South Africa, and this post attempts to document some of the issues that need to be dealt with before this can be done successfully (next week’s post will cover the actual numbers for South Africa). These are some of the lessons I learned when I ventured out to explore mortality improvement in South Africa and I hope these will be useful for other actuaries or anyone else interested in this data, who may attempt to do similar work.

As an introduction to the challenges of working with demographic data in South Africa (and many developing countries, such as those in South America), it is worth spending some time considering the perfect counterpoint to the demographic data – which is, why AI and machine learning are enjoying their moment in the spot light. Most of the advances we see today in AI and machine learning are driven by the huge and relatively accurate datasets that are now common in most fields. These datasets might consist of labelled images of objects or user data collected from websites. Two of the major appeals of deep learning (by which I mean the modern approach to neural networks) are:

the performance of deep learning algorithms scales with the amount of data fed in, whereas other approaches to machine learning often don’t benefit from adding extra data once a certain amount of data has been used.
with enough data (and computing power), all one needs to do is feed data through an appropriate neural network architecture to get world class predictive performance, without spending time and effort trying to hand derive useful representations of the data.

Almost the exact opposite case is the demographic data that are used to derive mortality and other measures. In developing countries, this type of data is often inaccurate, incomplete, hard to access and available long after it is relevant. Before once can derive any sort of value from it, a large amount of data clean-up needs to be performed, generally assisted by ingenious demographic methods developed for these specific types of problems.

As a reminder of last week’s post, the plan eventually is to use the cause of death data to isolate deaths due to motor accidents and then rederive life expectancy without the impact of these deaths. To do this successfully, we will need to understand some of the issues with the death data in South Africa.

The code for this post has been uploaded to my Github here, in the file ‘garbage codes.r’:

https://github.com/RonRichman/traffic_mortality

What are the data issues?

In the rest of this post I am going to focus on some of the issues with the South African death and population data that need to be dealt with before a valid set of results can be derived. The post next week will discuss how I attempted to solve these problems.

The first issue is that deaths in South Africa are not all reported. The image that follows shows various estimates of the completeness of reporting of deaths over time, as derived by various different studies. These studies have all applied a set of techniques known as the death distribution methods which use mathematical demography to reconcile the death data to population data and work out how many deaths are missing.

Most sources agree that completeness has improved over time, reaching about 90% towards 2010. Before the death data can be used for any purpose, the missing deaths need to be added back into the mix.

If anyone wants to find out more, a fantastic resource on the death distribution methods and other demographic techniques is available here:

http://demographicestimation.iussp.org/

There is also a recent R package here:

https://cran.r-project.org/web/packages/DDM/index.html

The next major issue to tackle is that the cause of death data in South Africa suffer from mislabelling and garbage codes. As in the last post, I pulled the WHO cause of death dataset from the WHO website. In the following, I tabulated the major causes of death appearing in the data for South Africa:

Cause	Deaths	ICD Title	Garbage code
*R99*	2 268 174	Other ill-defined and unspecified causes of mortality	x
*A16*	1 787 058	Respiratory tuberculosis, not confirmed bacteriologically or histologically
*J18*	1 257 426	Pneumonia, organism unspecified
*A09*	815 196	Diarrhea and gastroenteritis of infectious origin
*I64*	767 534	Stroke, not specified as hemorrhage or infarction
*E14*	655 406	Unspecified diabetes mellitus
*I50*	563 446	Heart failure
*Y34*	553 978	Unspecified event, undetermined intent	x
*D84*	430 476	Other immunodeficiencies
*I21*	344 358	Acute myocardial infarction
*B33*	343 688	Other viral diseases, not elsewhere classified
*B20*	340 528	Human immunodeficiency virus [HIV] disease with infectious and parasitic diseases
*X59*	296 970	Exposure to unspecified factor	x
*J44*	240 726	Other chronic obstructive pulmonary disease
*I11*	234 216	Hypertensive heart disease
*I10*	225 752	Essential (primary) hypertension
*C34*	184 868	Malignant neoplasm of bronchus and lung
*G03*	175 520	Meningitis due to other and unspecified causes
*J45*	169 680	Asthma
*R54*	167 222	Senility

The largest number of deaths fall under R99, which stands for “Other ill-defined and unspecified causes of mortality”. Other suspicious codes are X59 and Y34. With our prior knowledge of the HIV/AIDS epidemic in South Africa, it is a fair guess that most of the deaths in the top 20 causes actually are from HIV/AIDS, and Birnbaum, Murray and Lozano (2011) and Bradshaw, Msemburi, Dorrington et al. (2016) attempt to quantify this with some clever work on the data. In next weeks post, I will discuss why I think quite a few traffic related deaths are actually sitting in code Y34.

Turning to the population data, there is disagreement amongst the various available estimates as to the size of the South African population. Below I show a plot from my research with estimates of the population aged 70 and older over time from various sources.

AltMYE – Alternative mid-year estimates, Dorrington (2013); ASSA – Actuarial Society of South Africa (2009); Stats SA – Statistics South Africa (2015); UNPD – United Nations Population Division (2013); USCB – United States Census Bureau (2015)

The black squares represent estimates from the model developed in my thesis, which estimates the population and mortality rates at the same time, using the death data as an input. A good resource to get acquainted with some of the issues at play here is Dorrington (2013). This work also contains a set of population estimates that are the most consistent with the last two censuses in South Africa, amongst those that I considered.

An obvious thing to do to try work out the size of the population is to look at the census data, but one has to be wary of census undercount. For example, I showed in my thesis that the censuses in South Africa before 2011 were undercounted relative to 2011:

	National Males	National Females
*1985*	81%	88%
*1991*	90%	88%
*1996*	92%	94%
*2001*	97%	99%

This is why the estimates from my research in the previous plot appear higher than the earlier censuses.

Some miscellaneous issues:

Both the population and death data exaggerate the number of the elderly in South Africa, see for examples Machemedze (2009) and Richman (2017).
Both datasets suffer from age and year of birth heaping.
Birth registration is also uncertain in developing countries.
There are specific problems to consider when working with infant and maternal mortality.

After all of these problems, it will be time to discuss some solutions next week!

*** Footnote added on 14/1/2018

Diego Iturralde from Stats SA pointed out to me that the population estimates released by Stats SA in 2017 have been improved. I think it is fantastic that Stats SA has been working on improving these estimates, which (together with a document discussing them) are available from Stats SA on their website. A quick comparison of the estimates with the others shown above for the population aged 70+ is shown below.

These numbers are closer to those estimated in the censuses of 2001 and 2011 than the previous set of mid-year estimates I considered, but do not appear to be consistent with the 2011 census numbers for either sex, or with 2001 for males.

References

Actuarial Society of South Africa. 2009. Aids and Demographic Model 2008. Cape Town: ASSA. http://www.actuarialsociety.org.za/Societyactivities/CommitteeActivities/DemographyEpidemiologyCommittee/Models.aspx.

Birnbaum, J.K., C.J. Murray and R. Lozano. 2011. “Exposing misclassified HIV/AIDS deaths in South Africa”, Bulletin of the World Health Organization 89(4):278-285.

Bradshaw, D., W. Msemburi, R. Dorrington, V. Pillay-van Wyk et al. 2016. “HIV/AIDS in South Africa: how many people died from the disease between 1997 and 2010?”, AIDS 30(5):771-778.

Dorrington, R.E. 2013. Alternative South African mid-year estimates, 2013. CARe Monograph No. 13. Cape Town: Centre for Actuarial Research, University of Cape Town. http://www.commerce.uct.ac.za/Research_Units/CARE/Monographs/Monographs/Mono13.pdf.

Machemedze, T. 2009. “Old age mortality in South Africa.” Unpublished thesis, Cape Town: University of Cape Town.

Richman, R.D. 2017. “Old age mortality in South Africa, 1985-2011.” Unpublished thesis, Cape Town: University of Cape Town.

Stats SA. 2015. Mid-year population estimates. P0302. Pretoria: Statistics South Africa.

United Nations. 2013. World Population Prospects: The 2012 Revision. New York: Population Division, Department of Economic and Social Affairs.

US Census Bureau. 2015. International Database. Washington DC: International Programs Center.

AIG Report on Self-driving Cars

I may be biased (for obvious reasons if you look at my LinkedIn profile), but I thought the new report on consumer perceptions of self-driving cars from @AIG is excellent. The report is here:

https://www.aig.com/content/dam/aig/america-canada/us/documents/insights/aig-the-future-of-mobility-and-shifting-risk.pdf

Some of my key takeaways are:

– Consumers in different countries have significantly different attitudes to self-driving cars – USA and UK vs Singapore

– Unlike some articles which assume that consumers will favor a subscription model for self-driving cars (a recent one from the FT is here – https://www.ft.com/content/c97eaa72-eaf8-11e7-bd17-521324c81e23), the most popular response to the question of ownership in the survey was that consumers would like to own a self-driving car!

– The article quotes a statistic that vehicle autonomy will reduce accidents by 90% by 2050 (interesting consequences for the change to life expectancy that I discussed in an earlier post)

Self-driving cars and the impact of Motor Accidents on Mortality

An interesting development to watch is self-driving cars which I believe will have a massive impact on many areas of our day to day lives in the near future (especially for those of us working in Personal Lines insurance). Some of the interesting recent developments in this area have been:

Waymo has put part of their fleet on fully autonomous mode i.e. no driver in the driver’s seat. https://medium.com/waymo/with-waymo-in-the-drivers-seat-fully-self-driving-vehicles-can-transform-the-way-we-get-around-75e9622e829a
Waymo seems to be gearing up to launch a public service with the announcement of insurance with Trov. https://techcrunch.com/2017/12/19/waymo-teams-with-trov-on-passenger-insurance-for-self-driving-service/
There are large commercial interests at play, with most of the major car manufacturers running some sort of autonomous driving initiative.
- Two current court cases are Waymo vs Uber and Didi versus one of their previous employees.
An impressive video of a self driving car in the rain from drive.ai is here – https://youtu.be/GMvgtPN2IBU

My interest in the subject was sparked by some of the interesting deep learning applications for self-driving cars that Andrew Ng talks about in his recent Coursera course on Convolutional Neural Networks for computer vision.

This all got me wondering what the impact on mortality would be if self-driving cars reduced or eliminated the extra deaths caused by cars every year. In particular, who would this matter the most for – the young or the old, males or females – and what impact would this have on the shape of the mortality curve. It makes sense that it would take some time for self-driving cars to begin to have a noticeable effect on mortality, but if cars on the road would be autonomous, then it would be fair to assume that the majority of deaths relating to cars that currently appear in mortality data would be avoided.

To quantify the extent of the possible impact, I recalculated a mortality curve (more formally, a lifetable) for the USA and the UK with and without the impact of car related deaths. Obviously, self-driving cars will not immediately eliminate the entire burden of car related deaths, but this number represents an upper bound on the possible beneficial impact. The rest of this post will present the data sources used, the methodology I followed and the results.

The code for this post is available on my Github here:

https://github.com/RonRichman/traffic_mortality

Data

I used two main sources of data. Firstly the cause of death data is from the WHO Mortality database (http://www.who.int/healthinfo/statistics/mortality_rawdata/en/), which compiles death counts in 5-year bands by the ICD10 classification for a large number of countries around the world.

For exposure data, I used the Human Mortality Database – the HMD – (http://www.mortality.org/) which has population numbers (as well as death counts and lifetables) for countries with relatively high quality demographic data. (Notably there is now also a Human Cause of Death database, but the information was not available at the level of granularity needed for these calculations).

Coding

After experimenting a bit, I landed on using the Centre for Disease Controls classification of the ICD-10 codes to identify the deaths due to motor accidents. This classification can be found here: https://www.cdc.gov/nchs/data/nvsr/nvsr66/nvsr66_05.pdf

Some high level reconciliations to other data sources indicate that the USA numbers are a little higher than those reported to the NHSTA (https://www-fars.nhtsa.dot.gov/Main/index.aspx) and the UK numbers are also higher (https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/665162/ras40001.ods). One would need to have significant understanding of how each of these reporting systems work (in particular, how many deaths are reported to the NHSTA versus how many go into the USA vital registration and the same for the UK), so if anyone reading the post has some input, then please let me know!

Methodology

I performed the analysis in two parts. Firstly, I worked out a set of ratios indicating the percentage reduction in deaths in each five-year age band from the WHO data. Secondly, I applied these ratios to the death data at each individual age from the HMD and then calculated mortality rates using the HMD population data.

The reason two steps were needed is because I couldn’t find a comprehensive database with cause of death information in single year age bands.

Results

The following plot shows the percentage of total deaths attributable to motor accidents.

Both the USA and UK have fewer motor deaths over time at the relatively younger ages. I wonder if this is a real effect (due perhaps to improving safety technology in cars), or something not captured completely in the coding I used. The percentage of deaths attributable to motor accidents peak at the ages around which people first begin to drive. One possible insight here is that this is also an age when deaths due to “natural” causes are low, so “extra” accidental deaths at these ages will contribute significantly to the total number of deaths, whereas at the older ages, where “natural deaths” are high, the effect is less. Something else to consider is that driving ability probably improves with time.

After stripping the motor accident deaths out of the total deaths, I produced mortality rates (qx) as shown in the following plot. To smooth these out a little, I averaged the curves over five years (note that 2017 actually consists of data from only the 2015 year):

The major effect seems to be a flattening of the so-called accident hump between 20 and 30, with more impact in the USA than the UK. The declining impact of motor accidents over time is visible in the plots for both the USA and the UK.

The impact on life expectancy at birth is as follows:

Country_Name	Sex	Year_centre	ex	ex_no_traffic	Increase
UK	1	2002	76.20	76.44	0.24
UK	1	2007	77.49	77.71	0.22
UK	1	2012	78.94	79.06	0.12
UK	2	2002	80.77	80.85	0.07
UK	2	2007	81.75	81.82	0.07
UK	2	2012	82.78	82.82	0.04
USA	1	1997	74.13	74.68	0.55
USA	1	2002	74.64	75.21	0.58
USA	1	2007	75.70	76.24	0.54
USA	1	2012	76.65	77.08	0.43
USA	1	2017	76.62	77.06	0.44
USA	2	1997	79.56	79.85	0.29
USA	2	2002	79.86	80.15	0.28
USA	2	2007	80.74	80.98	0.25
USA	2	2012	81.45	81.65	0.19
USA	2	2017	81.47	81.67	0.20

Immediately noticeable is the declining impact of motor accidents on life expectancy with time, for both countries and sexes. If we take 2012 as the most robust recent estimate, then the biggest beneficiaries of eliminating motor accidents would be males in the USA.

I recently read a blog post from Bill Gardner (https://theincidentaleconomist.com/wordpress/us-life-expectancy-declined-again-how-much-does-that-matter/) who frames a change in life expectancy in a clever way, by working out the number of years of life gained or lost for the current birth cohort due to a change in life expectancy. With this idea in mind, a gain in life expectancy of 0.44 is a highly significant improvement representing about 870 000 years of life gained for the 2012 male birth cohort.

Conclusion

This post examined the impact of motor accident related deaths on mortality and life expectancy in the USA and the UK, to provide a view of what the maximum possible impact of introducing self-driving cars, which presumably would eliminate most of the burden of motor accident related deaths, on public health would be. Of the four groups considered, the biggest beneficiaries would be males in the USA, with about 870 000 years of life gained if motor accidents were eliminated completely.

The post did not try to quantify how much or how quickly these benefits would be realized and it seems this would be quite speculative sitting here at the beginning of 2018. The post also ignored second order effects on mortality, such as the fact that people would probably have more time to spend on pursuits other than driving when self-driving cars become a reality and the fact that the gains in life expectancy could be partially offset by other competing risks.

I found it interesting that the impact on life expectancy is lower in the UK than the USA, and it would be worthwhile to reproduce the analysis for all of the countries in the HMD.

Thoughts on writing “AI in Actuarial Science”

Like this:

Thoughts on the International Congress of Actuaries 2018

Like this:

Naked – South Africa’s Lemonade?

Like this:

Motor policy price comparisons – comparing apples with oranges

Like this:

Bridging between the tribes – chain-ladder and lifetables

Like this:

AI in Actuarial Science

Like this:

Self-driving cars and the impact of Motor Accidents on Mortality – The Case of South Africa

Like this:

Working with Demographic Data in South Africa

Like this:

AIG Report on Self-driving Cars

Like this:

Self-driving cars and the impact of Motor Accidents on Mortality

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this: