I had the opportunity to attend the M4 Conference held last week in NYC, which focused on the results of the recent M4 forecasting competition, as well as more generally on the state of the art in time series forecasting. In this post, I plan to summarize some of the key ideas that were presented at the conference and point out some of the thoughts that have occurred to me since.

There were a number of excellent speakers whose key points (from my perspective) I summarize very briefly later on in this blog, with the standout ones for me being:

- Slawek Smyl (winner of the competition with his “hybrid” method)
- Spyros Makridakis (M competitions)
- Nassim Taleb
- Pablo Montero-Manso (representing the runner-up team in the competition with a boosting meta-learning method)
- Andrea Pasqua

The rest of this post will discuss:

- Big Ideas of the M4 Conference
- Summaries of some of the talks

In a follow-up post I hope to discuss what actuaries can learn from the M4 competition.

**The Big Ideas of the M4 Conference**

There were several recurring themes at the conference. Of these, the one that came up most often was the difference between statistics and machine learning.

*Stats vs ML*

It was fascinating to see the back and forth between the speakers and the audience on exactly what defines machine learning, and how this is different from statistics. Two of the different viewpoints were:

- Statistical methods generally do not learn across different time series and datasets, whereas ML methods do. (This view made sense to me because most methods used for time series forecasting focus on the univariate case, i.e. where there is only one sequence, and techniques to leverage information across series are newer in this field, although obviously not a new concept in more traditional applications of statistics.)
- There is no difference between statistics and ML, and in fact neural networks are a generalization of GLMs, which are a basic statistical tool. In other words, the distinction is arbitrary.

Interestingly, there was also not much consensus on whether the field of forecasting should be classified as a traditional statistical discipline or not. One good point that was made is that one of the basic time series methods – exponential smoothing – was always used as an algorithm, until statistical justification in the state-space framework was given by Rob Hyndman et al.
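To show what "used as an algorithm" means here, below is a minimal sketch of simple exponential smoothing written purely as a recursion, with no statistical model behind it. The function names, the choice to initialise the level at the first observation, and the fixed `alpha` are assumptions for illustration; in practice the smoothing parameter would be estimated from the data.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    Recursion: level_t = alpha * y_t + (1 - alpha) * level_{t-1},
    with the level initialised at the first observation (an assumption).
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    levels = [series[0]]
    for y in series[1:]:
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels


def ses_forecast(series, alpha):
    """The forecast for every future horizon is the last smoothed level."""
    return simple_exponential_smoothing(series, alpha)[-1]


print(ses_forecast([10.0, 12.0, 14.0], alpha=0.5))  # 12.5
```

The recursion only ever needs the previous level and the new observation, which is why the method was usable long before it was given a state-space justification.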

One amusing debate focussed on whether Slawek’s method was in fact a statistical or machine learning approach, with different participants arguing for their perspectives, and being somewhat averse to the idea of a hybrid approach. This carried on, until Slawek himself was asked to clarify, at which point he confirmed that his method is a “hybrid” of statistical and machine learning approaches.

My perspective is that some of these issues can be tied up quite neatly using the distinction between prediction and inference given by Shmueli (2010). A significant part of statistical practice is focussed on defining models and then working out whether or not the observed data could have been generated by the model, and, within this framework, one generally does not have concepts such as out-of-sample predictive accuracy. Machine learning, on the other hand, focuses on achieving good out-of-sample performance of models, whether these have been specified using some stochastic data generating procedure, or on an algorithmic basis. From this perspective, the field of forecasting is not a traditional statistical discipline, as the focus is on prediction!

*Complexity*

A recurring theme of the M competitions is that more complex models are usually outperformed by simple methods; for example, the original M1 competition showed that exponential smoothing was better than ARIMA models. In the M4 competition, this became much more nuanced. On the one hand, “vanilla” machine learning techniques performed poorly, and worse than the benchmark, mirroring the findings in Makridakis, Spiliotis and Assimakopoulos (2018). On the other hand, the winners of the M4 competition used relatively more complex machine learning methods to great success. The difference seems to be that the complexity of the methods is in how they learn to generalize across time series (Slawek’s LSTM model and Pablo’s meta-learning algorithm), instead of trying to apply especially sophisticated methods to single time series.

*Triumph of Deep Learning*

As I have written about several times on this blog, the big advantage of deep learning over traditional machine learning approaches is that feature engineering gets performed automatically (i.e. this is the paradigm of representation learning, in that the model learns the features), and therefore, when dealing with large and very complex datasets, suitable neural network architectures can provide a massive performance boost over other approaches. I think this was clearly part of the “secret sauce” of Slawek’s winning solution, in that he very neatly specified a neural network combined with exponential smoothing, thus obviating the need to try to derive features from each time series. This is in contrast to the runner-up solution presented by Pablo, which involved a substantial feature engineering step, in which many features were calculated for each time series, after which a boosted tree model was fit on these features to work out how to weight the various time series methods.

*More to learn*

Although forecasting is not a new field, it seemed to me that many participants at the conference felt that there is much more to learn to advance the state of the art of forecasting, especially as machine learning methods get adapted to time series forecasting. The amazing and unanticipated success of Slawek’s hybrid method will no doubt lead many researchers to try similar methods on other datasets.

This also manifested in the advance detail given on the upcoming M5 competition, which is going to focus on the role of explanatory variables in forecasting time series, and will also feature online learning as more data become available. I think many people felt that the techniques incorporating explanatory variables are not yet optimal and represent an opportunity to advance the state of the art.

*Ensembling of methods*

A famous finding in the forecasting literature is that combinations of methods usually do better than single methods, and that held true in the M4 competition. Slawek’s winning approach consisted of an ensemble of LSTM models (I discuss the very smart idea of a so-called Ensemble of Specialists later) and Pablo’s method used a boosting algorithm to assign weights to different simple methods, which were then combined to produce the final forecasts.
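The basic mechanics of combining forecasts can be sketched in a few lines. This is only an illustration of a weighted average over forecast paths. The function name and the example numbers are made up, and the weights here are supplied by hand rather than learned by a boosting model as in the runner-up method.

```python
def combine_forecasts(forecasts, weights=None):
    """Weighted average of several forecast paths of equal length.

    `forecasts` is a list of lists, one inner list per method.
    With no weights given, every method gets the same weight.
    """
    n = len(forecasts)
    if weights is None:
        weights = [1.0] * n
    if len(weights) != n:
        raise ValueError("need exactly one weight per forecast")
    total = sum(weights)
    horizon = len(forecasts[0])
    return [
        sum(w * f[h] for w, f in zip(weights, forecasts)) / total
        for h in range(horizon)
    ]


# Two hypothetical methods forecasting two steps ahead.
method_a = [10.0, 10.0]
method_b = [20.0, 30.0]
print(combine_forecasts([method_a, method_b]))              # [15.0, 20.0]
print(combine_forecasts([method_a, method_b], [3.0, 1.0]))  # [12.5, 15.0]
```

If the true first value turned out to be 15, each method alone would be off by 5 while the equal-weight combination would be exact, which is the intuition behind the finding: errors in opposite directions partly cancel.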

**Summaries of talks**

Here are some summaries of my favourite talks of the conference.

__Slawek Smyl (Uber Technologies): A Hybrid Approach to Forecasting__

Slawek won the M4 competition by a large margin over the next best entry. His method, described in a short note here, essentially did two things:

- Firstly, allow the neural net to learn optimal coefficients of the Holt-Winters algorithm which were then used to normalize each time series
- Secondly, forecast the normalized series using the neural net and then restore the series using the Holt-Winters parameters
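The normalize, forecast and restore steps above can be sketched roughly as follows. This is not Slawek’s implementation: the smoothing coefficient `alpha` is fixed rather than learned jointly with the network, seasonality is ignored, and `stand_in_model` is a hypothetical placeholder for the neural net. All function names are assumptions, and the series is assumed to be strictly positive so that division by the level is safe.

```python
def smooth_levels(series, alpha):
    """Exponentially smoothed level for each time step (level_0 = y_0)."""
    levels = [series[0]]
    for y in series[1:]:
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels


def normalize(series, levels):
    """Divide each observation by its level; assumes positive values."""
    return [y / level for y, level in zip(series, levels)]


def stand_in_model(normalized_window):
    """Hypothetical placeholder for the neural net: mean of the window."""
    return sum(normalized_window) / len(normalized_window)


def hybrid_forecast(series, alpha, window=3):
    levels = smooth_levels(series, alpha)
    normalized = normalize(series, levels)
    # Forecast on the normalized scale, then restore using the last level.
    prediction = stand_in_model(normalized[-window:])
    return prediction * levels[-1]


small = [10.0, 12.0, 11.0, 13.0]
large = [100.0, 120.0, 110.0, 130.0]
print(hybrid_forecast(small, alpha=0.5))
print(hybrid_forecast(large, alpha=0.5))
```

The point of the normalization is visible in the example: the two series differ only in scale, so the model sees the same normalized inputs for both, and the forecasts differ only by the factor of ten restored at the end. That is what lets a single network be trained across many series of very different magnitudes.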

The network design was a stack of various types of Long Short Term Memory cells (with skip connections and dilation).

Slawek also used ensembling at several levels to produce the forecast. I found one ensembling method which he proposed to be particularly interesting, the Ensemble of Specialists, which is described in more detail here.

Basically, the idea is to take several of the same neural net architectures and allow them to train for a single epoch on some of the training data. Then, allocate each time series to the top-2 neural nets and repeat both steps until the validation error increases. Once the nets are trained, one applies different ensemble methods to derive the final forecasts. This seems like a very smart way of ensuring optimal performance on all types of series – in my own research, I have encountered situations when neural nets trained to a global optimum do not perform as well as would be expected on some time series and I am excited to try out this approach.
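Here is a toy sketch of the allocation loop as I understand it from the description above. Everything in it is a simplifying assumption: `TinyModel` is a one-parameter stand-in for a neural net, each "series" is reduced to a single (input, target) pair, the initial allocation is round-robin, training runs for a fixed number of rounds instead of stopping when validation error increases, and the final ensemble is a plain average of each series’ top-2 models.

```python
class TinyModel:
    """Hypothetical stand-in for a neural net: predicts y = w * x."""

    def __init__(self, w):
        self.w = w

    def predict(self, x):
        return self.w * x

    def train_one_epoch(self, examples, lr=0.1):
        # One pass of stochastic gradient descent on squared error.
        for x, y in examples:
            self.w -= lr * 2.0 * (self.predict(x) - y) * x


def top_k_models(models, example, k=2):
    """Indices of the k models with the smallest absolute error."""
    x, y = example
    ranked = sorted(range(len(models)),
                    key=lambda i: abs(models[i].predict(x) - y))
    return ranked[:k]


def train_specialists(examples, initial_weights, rounds=20, k=2):
    models = [TinyModel(w) for w in initial_weights]
    n = len(models)
    # Assumed start: round-robin, so each model sees only part of the data.
    allocation = [[(i + j) % n for j in range(k)]
                  for i in range(len(examples))]
    for _ in range(rounds):
        # Step 1: each model trains for one epoch on its own examples.
        for m, model in enumerate(models):
            own = [ex for ex, owners in zip(examples, allocation)
                   if m in owners]
            model.train_one_epoch(own)
        # Step 2: re-allocate every example to its current top-k models.
        allocation = [top_k_models(models, ex, k) for ex in examples]
    return models, allocation


def ensemble_predict(models, owners, x):
    """Average the predictions of the models an example is allocated to."""
    return sum(models[i].predict(x) for i in owners) / len(owners)


# Two regimes: targets of 0.5 * x and 2.0 * x.
examples = [(1.0, 0.5), (1.0, 0.5), (1.0, 2.0), (1.0, 2.0)]
models, allocation = train_specialists(examples, [0.4, 0.6, 1.8, 2.2])
print(allocation)
```

In this toy setup the four models split into two pairs, one pair specialising in each regime, which is the behaviour the approach is meant to encourage on real collections of heterogeneous series.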

__Spyros Makridakis (University of Nicosia): The contributions of the M4 Competition to the Theory and Practice of Forecasting__

The slides have been made available here.

What stood out most for me about Spyros’ talk was the focus on improving the state of the art of time series forecasting using hard evidence, and that seems to be the key theme running throughout his work on the M competitions and even before. As easy as it might be to favour a method based on how pleasing it is theoretically, the approach during the M competitions has been simply to check what works, and what doesn’t, on out-of-sample error. This created what seems to be a huge amount of work in the M4 competition, in that Spyros and his team have replicated every submission (even those that take upwards of a month to run in full!) and I admire the dedication to advancing the state of the art!

Some of the major findings that Spyros discussed are:

- Improving accuracy via combining methods
- Superiority of Slawek’s hybrid method
- The improved precision of prediction intervals in Slawek’s and Pablo’s methods – these had a coverage ratio very close to the required 95%
- Increased complexity, as measured by compute time, led to increased accuracy, which I think is a first for the M competitions.
- Learning across time series in the winning methods
- Poor performance of pure ML methods, which was attributed to these methods overfitting on the univariate time series i.e. not learning across series
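For readers unfamiliar with the coverage ratio mentioned in the list above, it is simply the share of actual values that fall inside their prediction intervals. A minimal sketch follows; the function name and the numbers are made up for illustration and are not M4 data.

```python
def interval_coverage(actuals, lowers, uppers):
    """Fraction of actual values lying inside [lower, upper]."""
    inside = sum(
        1 for y, lo, hi in zip(actuals, lowers, uppers) if lo <= y <= hi
    )
    return inside / len(actuals)


actuals = [1.0, 2.0, 3.0, 4.0]
lowers = [0.0, 0.0, 0.0, 5.0]
uppers = [2.0, 3.0, 4.0, 6.0]
print(interval_coverage(actuals, lowers, uppers))  # 0.75
```

For nominal 95% intervals, a well-calibrated method should produce a value close to 0.95 over a large number of forecasts. Values well below that mean the intervals are too narrow.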

Spyros then ended with two challenges where improvement is needed – improving the measurement of uncertainty (where there is great potential for ML/DL methods) and improving explanatory models of time series.

__Nassim Taleb (New York University): Forecasting and Uncertainty: The Challenge of Fat Tailedness__

I enjoyed hearing Nassim explain some of his ideas in the context of forecasting. My key takeaway here was that when forecasting, one might not be as interested in the underlying random variable being forecast, call it x, but rather the payoff function of x, which is f(x). The payoff function can be manipulated in various ways by taking positions against the underlying x; for example, one could hedge out tail risks. Nassim was therefore effectively offering a way of dealing with uncertainty in x: manipulate your payoffs so that you are not hurt by, and ideally gain from, the parts of x that you do not know about or are most at risk from.

One interesting connection that he made was that options traders have always approximated payoff functions using European options, which effectively comes down to function approximation using the ReLU activation in deep learning.
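The connection is easy to see in code: the payoff at expiry of a European call with strike K is max(x - K, 0), which is exactly ReLU(x - K). A portfolio of calls at different strikes is then a piecewise linear function, i.e. a one-hidden-layer ReLU network with unit input weights. The sketch below is my own illustration, not something shown in the talk; the function names and the choice of x squared as the target payoff are assumptions.

```python
def relu(z):
    return max(0.0, z)


def call_payoff(spot, strike):
    """Payoff at expiry of a European call: identical to relu(spot - strike)."""
    return max(spot - strike, 0.0)


def portfolio_payoff(spot, strikes, quantities):
    """Payoff of a portfolio holding quantities[i] calls at strikes[i]."""
    return sum(q * relu(spot - k) for k, q in zip(strikes, quantities))


# Approximate f(x) = x^2 on [0, 3] with calls struck at 0, 1 and 2.
# The quantities are the changes in slope of the piecewise linear
# interpolant at each strike: slopes 1, 3, 5 give quantities 1, 2, 2.
strikes = [0.0, 1.0, 2.0]
quantities = [1.0, 2.0, 2.0]

for x in [0.0, 1.0, 1.5, 2.0, 3.0]:
    print(x, portfolio_payoff(x, strikes, quantities), x ** 2)
```

The approximation is exact at the strikes and linear in between (at 1.5 it gives 2.5 against a true value of 2.25). Adding more strikes tightens the fit, in the same way that adding hidden units does for a ReLU network.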

__Andrea Pasqua (Data Science Manager, Uber): Forecasting at Uber: Machine Learning Approaches__

Andrea’s talk covered how time series forecasting is done at Uber, with their own set of interesting and challenging issues, such as a huge number of series to forecast, dealing with extreme events, and the cold-start problem when services are launched in a new city. He gave a very nice walkthrough of how Uber arrived at the solutions currently in production, by going through each stage of model choice and development. It seems as if this team has benefited from Quantile Random Forests and I plan to read up more about these.

**Conclusion**

It was refreshing to see how approachable the speakers at the M4 conference were, and how willing the winners of the competition were to share their expertise and knowledge. The organizers of the conference put together a great event and well done to them!

In the next post I hope to discuss some of what I believe the actuarial profession could learn from the advances in the state of the art of forecasting that were shown at the M4 conference.

**References**

Makridakis, S., E. Spiliotis and V. Assimakopoulos. 2018. “Statistical and Machine Learning forecasting methods: Concerns and ways forward”, PLOS ONE 13(3):e0194889.

Shmueli, G. 2010. “To explain or to predict?”, Statistical Science 25(3):289-310.

Good article. On what sample do the organizers evaluate the accuracy of the forecasts? Do they have a hidden sample for that? I saw on their site that the train and test samples were made available to contest participants.

The test data was only released after the competition ended. Quite a debate currently going on on Twitter about this…