In September of 2015, a competition was launched to see which Data Science team could generate the most accurate predictions for the next six weeks of sales for each of the 1,115 Rossmann drugstores located throughout Germany. With a bounty of $30,000 in prize money for the winners, this challenge attracted some of the best researchers on the Kaggle Data Science website.
That competition has since ended, but it now provides a rich data set and research studies on some of the best modeling techniques currently in use. This presented an opportunity for us, since we have been learning statistical modeling in the rigorous Harvard Extension School course CS109A: Introduction to Data Science.
In the spirit of the Rossmann competition, we took up the challenge to see how our newly acquired modeling skills matched up against the best. The video below provides a high-level overview of our project.
Sales forecasts are important tools for managing businesses of any size, and essential for managing large chain stores. Data-driven sales forecasts can be used to manage inventory, set staff schedules, and avoid potential cash flow problems. A study conducted by economists at MIT and the University of Pennsylvania found that companies that are one standard deviation higher on a data-driven decision-making scale have 5% higher productivity, 6% higher profit, and up to 50% higher market value. For a company the size of Rossmann, with 3,500 stores, 47,500 employees, and sales revenue of €7.9 billion in 2015, having accurate sales predictions is invaluable.
When training our prediction models, we were mindful of the need to minimize both variance and bias errors at the same time, and of the inherent dangers of overfitting and underfitting. Our goal was to perform well in general cases, not only on data similar to the training data provided on Kaggle. Within the competition, teams heavily tuned their models for the specific weeks they needed to predict, and even included features that would not be practical to use in general, such as future weather events. We chose not to do this.
When fitting our models, we concentrated on measuring the overall prediction error rather than its specific decomposition into bias and variance. Using an appropriate error metric when tuning models is highly beneficial. For projections of the uncorrelated sales responses at different stores, we found that commonly used performance measures such as R² and Mean Squared Error (MSE) were not well suited. For this reason, we decided to use Root Mean Square Percentage Error (RMSPE) as our performance metric. This allowed us to compare the relative performance of our models, and it is the metric that was used for the Kaggle competition.
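To make the metric concrete, here is a minimal sketch of RMSPE in Python with NumPy. The function name `rmspe` and the toy numbers are ours for illustration, and skipping rows with zero actual sales is an assumption about how undefined percentages are handled. The example shows why a percentage-based metric behaves differently from MSE: the same absolute miss of 500 is a large error for a small store and a small one for a large store.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root Mean Square Percentage Error over rows with non-zero actual sales."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: rows with zero actual sales are skipped, because the
    # percentage error is undefined there.
    mask = y_true != 0
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_err ** 2)))

# Same absolute error (500) at a small and a large store (made-up values).
small_store = rmspe([1000.0], [1500.0])    # off by 50% of actual sales
large_store = rmspe([20000.0], [20500.0])  # off by 2.5% of actual sales
```

Under MSE both stores would contribute the same squared error of 250,000, whereas RMSPE scores them very differently.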
We observed that the sales patterns at many stores deviate significantly from the average behavior. A single model would be hard pressed to handle all of the special cases these stores present, so we felt that methods that ensemble a large set of models would be best equipped to make predictions in our situation. One such technique is bagging (bootstrap aggregating), where multiple models are trained on bootstrap samples of the data and their predictions are pooled. Another approach is to build multiple models on different subsets of the predictors, so that not every model focuses on the most influential predictors and some can instead capture the nuances of lesser predictors. A third method is boosting, where each model is tasked not with predicting the final answer but with correcting the residual errors of the models before it. These techniques form the basis for several of the model types we used, such as Random Forests and Boosted Trees.
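The difference between bagging and boosting can be sketched in a few lines of NumPy. This is a toy illustration, not the code behind our models: the one-split "stump" learner, the helper names, the learning rate, and the synthetic sine-shaped sales curve are all assumptions chosen to keep the example short.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree (a stump) on a single feature."""
    best = None
    for t in np.unique(x)[:-1]:  # every value except the max is a valid split
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    if best is None:  # all x identical: fall back to predicting the mean
        return (x[0], y.mean(), y.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def bagging(x, y, n_models=25, seed=0):
    """Train each stump on a bootstrap sample and average the predictions."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))  # sample rows with replacement
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda x_new: np.mean([predict_stump(s, x_new) for s in stumps], axis=0)

def boosting(x, y, n_models=25, learning_rate=0.5):
    """Train each stump on the residuals left by the models before it."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_models):
        stump = fit_stump(x, y - pred)  # target is the current residual
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)

    def predict(x_new):
        out = np.full(len(x_new), base)
        for s in stumps:
            out = out + learning_rate * predict_stump(s, x_new)
        return out

    return predict

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Synthetic, cyclic "sales" curve (made-up data).
x = np.arange(40, dtype=float)
y = 5000.0 + 1000.0 * np.sin(x / 6.0)

const_mse = mse(y, np.full(len(y), y.mean()))
stump_mse = mse(y, predict_stump(fit_stump(x, y), x))
bag_mse = mse(y, bagging(x, y)(x))
boost_mse = mse(y, boosting(x, y)(x))
```

In bagging the models are independent and are averaged to reduce variance. In boosting they are built sequentially, each one chipping away at the error left by its predecessors. Random Forests add the predictor-subset idea on top of bagging, which this sketch leaves out because it uses a single feature.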
The sales amounts we need to predict are quantitative values, so we used regression models rather than classification models. The inputs to the models are a mix of quantitative and categorical predictors, but the features are predominantly categorical. We first tried a basic linear regression approach, but quickly saw that the sales values are cyclic, fall into different groups based on the categories, and are not governed by linear relationships. Tree models were ideal for the patterns we needed to model and could easily handle the clustering of high-cardinality categories, such as the store number. This allowed us to fit a single model that could make accurate predictions for all the stores, rather than creating a separate model for each store.
We studied the work done by previous teams in terms of their empirical results, quantitative performance, and how they managed the bias-variance trade-off. This allowed us to focus our efforts on the most promising methods.
Engaging with the data was crucial to understanding the issues we would face when modeling the data. We visualized the available features and looked for patterns in the sales. We cleaned and corrected missing and damaged data.
Based on our research and data exploration, we created baseline models ranging from Linear Regression and Ridge Regression to Decision Trees and Random Forests. We evaluated the models with the R², MSE, and RMSPE metrics.
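As a rough sketch of this kind of baseline comparison, the snippet below fits a linear and a ridge regression in closed form with NumPy and scores both on a held-out split with all three metrics. The synthetic data, the simple 150/50 split, the ridge penalty of 10, and the helper names are assumptions made for illustration. Tree baselines are left out to keep the example free of extra dependencies.

```python
import numpy as np

def fit_ridge(X, y, alpha=0.0):
    """Closed-form ridge regression; alpha=0 reduces to ordinary least squares."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                         # do not penalize the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def evaluate(y_true, y_pred):
    """Return R2, MSE, and RMSPE (assumes no zero values in y_true)."""
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    r2 = 1.0 - float(np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    rmspe = float(np.sqrt(np.mean((resid / y_true) ** 2)))
    return {"R2": r2, "MSE": mse, "RMSPE": rmspe}

# Synthetic stand-in for the sales data (made-up coefficients and noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 6000.0 + X @ np.array([800.0, -300.0, 150.0]) + rng.normal(scale=200.0, size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

results = {
    name: evaluate(y_test, predict(fit_ridge(X_train, y_train, alpha), X_test))
    for name, alpha in [("linear", 0.0), ("ridge", 10.0)]
}
```

Scoring every baseline with the same `evaluate` helper on the same held-out rows keeps the comparison fair when models are refit later.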
After evaluating the baseline models, we identified ways to add computed predictors to the train and test data to improve the model building process and produce better sales forecasts. After adding these features, we went back and refit our baseline models.
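The sketch below shows what computed predictors of this kind might look like, using only the Python standard library. The specific features (calendar fields and a per-store average of past sales), the column names `Store`, `Date`, and `Sales`, and the sample rows are assumptions for illustration and are not a list of the features we actually built.

```python
from collections import defaultdict
from datetime import date

# Made-up sample rows; the column names are assumed.
rows = [
    {"Store": 1, "Date": date(2015, 7, 30), "Sales": 5000},
    {"Store": 1, "Date": date(2015, 7, 31), "Sales": 5263},
    {"Store": 2, "Date": date(2015, 7, 31), "Sales": 6064},
]

def add_features(rows, train_rows):
    """Return copies of rows with calendar fields and a per-store sales mean."""
    # The store mean is computed from training rows only, so that the same
    # function can be applied to test rows without using their targets.
    totals = defaultdict(lambda: [0.0, 0])
    for r in train_rows:
        totals[r["Store"]][0] += r["Sales"]
        totals[r["Store"]][1] += 1
    store_mean = {store: total / count for store, (total, count) in totals.items()}

    enriched = []
    for r in rows:
        d = r["Date"]
        enriched.append({
            **r,
            "Year": d.year,
            "Month": d.month,
            "DayOfWeek": d.isoweekday(),        # Monday=1 ... Sunday=7
            "WeekOfYear": d.isocalendar()[1],   # ISO week number
            "StoreMeanSales": store_mean.get(r["Store"]),  # None if store unseen
        })
    return enriched

features = add_features(rows, train_rows=rows)
```

Calendar fields give tree models an easy way to split on cyclic patterns, and a per-store average gives them a compact summary of each store's typical level.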
We combined the enhanced modeling power from feature engineering with the error reduction of boosted tree ensembles. We trained an XGBoost model directly on our objective of minimizing RMSPE. This resulted in our best performing model.
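One way a percentage-error objective could be expressed for gradient boosting is sketched below. Boosting libraries that accept a custom objective typically ask for the first and second derivatives of the per-row loss with respect to the prediction. Here the loss is the squared percentage error ((y - p) / y)², whose mean is RMSPE squared, so minimizing it also minimizes RMSPE. The function name and the zero-sales handling are assumptions, and the snippet uses only NumPy rather than reproducing our actual training code.

```python
import numpy as np

def squared_pct_error_objective(preds, labels):
    """Gradient and hessian of ((label - pred) / label)**2 with respect to pred.

    Assumption: rows with a zero label get zero gradient and hessian, so
    they do not influence training.
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    nonzero = labels != 0
    safe = np.where(nonzero, labels, 1.0)  # avoid dividing by zero
    grad = np.where(nonzero, 2.0 * (preds - safe) / safe ** 2, 0.0)
    hess = np.where(nonzero, 2.0 / safe ** 2, 0.0)
    return grad, hess

# Made-up predictions and actual sales, including one zero-sales row.
labels = np.array([5000.0, 8000.0, 0.0])
preds = np.array([5500.0, 7600.0, 100.0])
grad, hess = squared_pct_error_objective(preds, labels)

# With XGBoost, a thin wrapper of the form
#     def obj(preds, dtrain):
#         return squared_pct_error_objective(preds, dtrain.get_label())
# could be passed as the `obj` argument of `xgboost.train`.
```

Because the hessians scale with 1 / y², they are very small for sales in the thousands, so hessian-based settings such as a minimum child weight would likely need to be adjusted when using an objective like this.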
Math Nerd in Chicago
Mad Scientist in Cambridge
Data Countess in Transylvania