A guide on how to use counterfactual forecasting to estimate the cost-effectiveness of past in-store promotions in retail.
During a 3-month real-world project, we developed and industrialised a counterfactual forecasting model (first using Prophet, then XGBoost) to assess the performance of a store chain's past in-store promotions and help demand planners choose their promotional campaigns.
The model is trained on historical data and then forecasts the hypothetical sales (called the baseline) that would have occurred in the past had there been no promotion. The difference between the actual promotional sales and this baseline gives the incremental sales, which we call uplift.
Thanks to hand-crafted temporal features, we reached a forecast accuracy of almost 90%.
When planning future promotional campaigns, demand planners need to decide which product assortments will be discounted, and with which promotional mechanism (e.g. “−15%”, “buy 2, get 1 free”, etc.).
These are difficult decisions as:
For most retail companies, campaign choices are made based on business knowledge and the performance of past promotions. However, the “performance of past promotions” is difficult to estimate. Indeed, promotional campaigns do increase sales (in most cases), but how can we estimate the efficiency or Return On Investment (ROI) if we do not know what the sales would have been without a promotion? This hypothetical value of the sales with no promotion is called the baseline. In other words, it all comes down to estimating the incremental sales (or uplift) of a promotional campaign: the actual sales minus the baseline.
To answer this question, we built a tool which was able to estimate the promotional sales uplift of past campaigns, with an accuracy of almost 90%.
This task is quite challenging, as the goal is to forecast hypothetical sales in an alternative situation (here, if the promotional campaign had not occurred for a given product). This is called “counterfactual forecasting”. This article is mainly based on our experience on a project we carried out for a French store chain.
Its purpose is to describe the approach we used, give tips and caveats when implementing a counterfactual forecasting solution (data preparation, modelling), explain the evaluation process and finally discuss the limits and next steps to this approach.
What is counterfactual forecasting and why is it difficult?
Counterfactual forecasting is the process of predicting something in the form: what would X be like had there been no Y. In our use case, X would be the sales and Y would be a promotional campaign.
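As a toy illustration, once the baseline has been forecasted, the uplift computation itself is just a subtraction between two series (all numbers below are made up):

```python
import pandas as pd

# Actual sales observed during a (hypothetical) promotion week
actual = pd.Series([120, 130, 125, 140, 150], name="actual_sales")
# Counterfactual baseline: what the model predicts sales would have
# been on those same days without the promotion
baseline = pd.Series([100, 100, 105, 110, 115], name="baseline_sales")

# Uplift = incremental sales attributable to the promotion
uplift = actual - baseline
total_uplift = uplift.sum()
uplift_rate = total_uplift / baseline.sum()  # relative incremental effect
```

The uplift rate (incremental sales relative to the baseline) is what ultimately feeds a ROI estimate once costs and margins are known.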
There are actually multiple fields where this process can be applied: stock shortages (estimating the shortfall due to out-of-stock items), or any special event that does not last too long (Covid, for instance, does not qualify!), so that enough surrounding data remains to estimate the counterfactual.
The promotion problem can be tackled from three angles (sorted by ascending difficulty):
In this article, we will focus on the first one, as it was the objective of our project. However, we will give a few insights on how to tackle the other two in the following sections.
There are two main reasons which make the task of counterfactual forecasting a challenging process:
The approach we used to build our tool is the following:
Important note: the goal is to use the forecasts during the promotion periods, which lie in the past. Because this task is an a posteriori analysis, and contrary to classical forecasting, it is possible to train on dates that come after the inference period (the promotional campaign). There is no data leakage here, since we are trying to explain a phenomenon that already happened. Thus, the training and inference workflow looks like this:
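A minimal sketch of this split, assuming a daily sales table with a promotion flag (column names are our own, not the project's actual schema):

```python
import pandas as pd

# Daily sales with a promotion flag (column names are assumptions)
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=10, freq="D"),
    "sales": [100, 102, 98, 180, 190, 185, 101, 99, 103, 100],
    "on_promo": [False, False, False, True, True, True,
                 False, False, False, False],
})

# Contrary to classical forecasting, training data may come from BOTH
# sides of the inference window: the analysis is a posteriori, so using
# dates after the promotion is not leakage.
train = df[~df["on_promo"]]       # non-promo days, before AND after
inference = df[df["on_promo"]]    # promo days to rebuild the baseline for
```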
Preparing the data
To tackle the promotion problem, one must use the proper data format. Usually, we have access to two types of data:
1. Promotional data (descriptive information related to promotions)
2. Sales data.
The preprocessed data is essentially the sales data enriched with the promotion information (left join, see figure above). Every row with a non-null “Promo type” corresponds to a day when the product is on promotion.
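With pandas, this enrichment is a plain left join (table and column names here are illustrative, not the project's actual schema):

```python
import pandas as pd

# Daily sales data (granularity: product x day)
sales = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-03"]),
    "product_id": ["A", "A", "A"],
    "sales_qty": [10, 25, 11],
})

# Promotional data: one row per (product, promo day)
promos = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-02"]),
    "product_id": ["A"],
    "promo_type": ["-15%"],
})

# Left join: keep every sales row, attach promo info where it exists.
# A non-null promo_type marks a day where the product is on promotion.
enriched = sales.merge(promos, on=["date", "product_id"], how="left")
enriched["on_promo"] = enriched["promo_type"].notna()
```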
Before doing the first implementation, it is important to assess the data quality. Here are a few guidelines for the checks to perform:
1. Look for major issues in the time series:
2. Define a granularity for the use case:
So, if the time series are clean enough, a good starting point is the most granular level (e.g. product × day), especially when working with Prophet, as we did in this project.
3. Define a clear promotion scope: which products/families of products are part of a given promotion? Are promotions planned at a national level? (If not, one cannot, for instance, aggregate the sales of a product across all the stores in a country.)
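The first set of checks can be scripted; here is a minimal sketch of such sanity checks, with assumed column names:

```python
import pandas as pd

def check_time_series(df, date_col="date", value_col="sales_qty"):
    """A few basic sanity checks before modelling (column names assumed)."""
    issues = {}
    dates = pd.DatetimeIndex(pd.to_datetime(df[date_col]))
    # 1. Missing days in the series (gaps break lag/rolling features)
    full_range = pd.date_range(dates.min(), dates.max(), freq="D")
    issues["missing_days"] = len(full_range.difference(dates))
    # 2. Duplicated dates (would double-count sales after a join)
    issues["duplicate_days"] = int(dates.duplicated().sum())
    # 3. Negative sales (often returns or data errors)
    issues["negative_values"] = int((df[value_col] < 0).sum())
    return issues

df = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "2022-01-02", "2022-01-05"],
    "sales_qty": [10, 12, 12, -3],
})
report = check_time_series(df)
```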
Once the data has been checked and prepared, it is time for modelling.
First iterations and key takeaways
We started our first iterations with Prophet because it let us get a first working model very quickly, easily add regressors, and interpret the results naturally (thanks to its additive decomposition).
Here is a summary of the main iteration improvements we had during the project:
Basically, the main improvements came from the regressors we added:
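To give a flavour of the kind of hand-crafted temporal features we mean (the specific lags and windows here are illustrative, not the project's exact set):

```python
import pandas as pd

def add_temporal_features(df, value_col="sales_qty"):
    """Hand-crafted temporal regressors (a sketch; lags/windows assumed)."""
    out = df.copy()
    out["dayofweek"] = out["date"].dt.dayofweek
    out["month"] = out["date"].dt.month
    # Lag feature: shift avoids using the current day's own target
    out["lag_7"] = out[value_col].shift(7)
    # Rolling mean over the previous 28 days (shifted by 1 to stay causal)
    out["rolling_28"] = (out[value_col].shift(1)
                         .rolling(28, min_periods=7).mean())
    return out

df = pd.DataFrame({"date": pd.date_range("2022-01-01", periods=60, freq="D")})
df["sales_qty"] = 100 + df.index % 7  # toy weekly pattern
feats = add_temporal_features(df)
```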
Finally, adapting the way we measured forecast accuracy (see the Evaluation section below) also gave us a more reliable assessment of performance.
Why did we switch to XGBoost?
Despite Prophet's good performance and interpretability, we realised that XGBoost was more suitable, for multiple reasons:
Evaluation and limits
As written above, there is no ground truth in counterfactual forecasting, which makes the performance assessment more complex than for classical forecasting.
However, we found a way to measure our performance, or rather estimate it as precisely as possible. Here is how:
In classical forecasting, performance is typically measured with a cross-validation strategy (here, an expanding window) over a validation period (e.g. the last year of available data). Within this validation period, the window where performance is measured shifts at each fold (the “evaluation window”), and the anterior data is used for the lag features (“data used to make predictions”). In a promotion use case, we also add some data after the evaluation window, to reproduce the training and inference workflow described in the “Proposed approach” section.
We can thus apply this cross-validation strategy on the subset of data where there is no promotion, with the Forecast Accuracy (FA) as a metric.
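One common definition of Forecast Accuracy is FA = 1 − WMAPE; take that, and the fold layout below, as assumptions rather than our exact setup. A sketch of the metric plus an expanding-window fold generator:

```python
import numpy as np

def forecast_accuracy(y_true, y_pred):
    """FA = 1 - WMAPE, a common retail definition (an assumption here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.abs(y_true - y_pred).sum() / y_true.sum()

def expanding_window_folds(n, n_folds=3, eval_size=7):
    """Yield (train_idx, eval_idx): eval window shifts, train set expands."""
    for k in range(n_folds):
        end = n - (n_folds - 1 - k) * eval_size
        eval_idx = np.arange(end - eval_size, end)
        train_idx = np.arange(0, end - eval_size)
        yield train_idx, eval_idx

folds = list(expanding_window_folds(100, n_folds=3, eval_size=7))
```

In the promotion variant described above, each fold would additionally expose some post-window data for the lag features, which this standard sketch omits.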
With this approach, we were able to reach a forecast accuracy of almost 90% at the family × day granularity, which is a decent performance, comparable to what we achieved on other classical forecasting projects.
Even though this performance can be satisfying, our approach has some limitations.
Going further & next steps
Improving the modelling
Several effects could be added to measure the net impact of a promotion:
The first two effects were not included in our analysis because of the chosen granularity (family level), and the last two were hard to quantify thoroughly in the time we had for this project.
To summarise, the net additional sales of a promotion could be represented with this waterfall:
Going beyond the a posteriori analysis
As stated earlier, once the (posterior) analysis of past promotions has been done (stage A), it is then possible to go further by predicting the profitability of future promotions (stage B) and finally propose an optimisation of the promotions plan (stage C).
Of course, predicting (an estimate of) the future profitability of a promotion is harder than estimating that of a past promotion, because no actual data is available around the promotion. The idea is to reuse the model developed in stage A, feeding it not historical data but forecasts from a classical forecasting model, as follows:
First, train the classical forecasting model on the available data (until today):
Then, make predictions with this model (the forecasted period must cover the range of temporal features that the “baseline model” will use):
Finally, use the trained baseline model using temporal features based on the forecasts of the first model and estimate the baseline, which will give the sales uplift:
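The stacking can be sketched end to end with trivial stand-in models (both “models” below are deliberately naive placeholders, not the project's actual models):

```python
import numpy as np

# Stand-in "classical" forecaster: naive seasonal (repeat last week)
def classical_forecast(history, horizon):
    last_week = history[-7:]
    reps = int(np.ceil(horizon / 7))
    return np.tile(last_week, reps)[:horizon]

# Stand-in "baseline" model from stage A: here, just echo the 7-day lag
def baseline_model(lag_7):
    return lag_7  # trivial placeholder for the trained stage-A model

history = np.array([100, 102, 98, 101, 99, 120, 125] * 10, dtype=float)

# 1. Forecast far enough ahead to cover the lags the baseline model needs
future = classical_forecast(history, horizon=14)

# 2. Feed the *forecasted* values (not actuals) as temporal features
planned_promo_days = np.arange(7, 14)          # second forecasted week
lag_features = future[planned_promo_days - 7]  # 7-day lag from forecasts
baseline = baseline_model(lag_features)

# 3. Expected uplift = predicted promo sales - baseline (promo-sales
#    prediction omitted here)
```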
Of course, this process has more uncertainty by construction, given that the errors from the two stacked models will be correlated.
Finally, to optimise the promotions plan, the strategy consists of reusing what was done in the previous stage to choose the combination of promotion parameters that optimises a business metric such as ROI.
Using counterfactual forecasting to solve business problems is not a task commonly covered in the literature.
However, we saw that it could be a powerful tool to tackle the problem of assessing thoroughly the performance of past promotions, by forecasting hypothetical sales (baseline) if there had not been any promotion. We also explored recommendations for feature engineering for an autoregressive (Prophet) or gradient boosting (XGBoost) model. Finally, we detailed some guidelines to refine the analysis even more and also go further than just doing an a posteriori analysis.
Thanks to the fellow data scientists who worked with me on this project: Kasra & Ombeline. Thanks also to the Artefactors who proofread this article.