Limits of Linear Models for Forecasting
This article was written by Blaine Bateman
In this post, I will demonstrate the use of nonlinear models for time series analysis and contrast them with linear models. I will take a (simulated) noisy, nonlinear time series of sales data, fit a multiple linear regression and a small neural network to the training data, then predict 90 days forward. I implemented all of this in R, although it could be done in a number of coding environments. (Specifically, I used R 3.4.2 in RStudio 1.1.183 on Windows 10.)
It is worth noting that much of what is presented in the literature and trade media regarding neural networks concerns classification problems. Classification means there are a finite number of correct answers given a set of inputs. In image recognition, an application well served by neural networks, classification would include dog/not dog. This is a simplistic example and such methods can predict a very large number of classes, such as reading addresses on mail with machine vision and automatically sorting for delivery. In this post, I am exploring models that produce continuous outputs instead of a finite number of discrete outputs. Neural networks and other methods are very applicable to continuous prediction.
Another point to note is that there are many empirical methods available for time series analysis. For example, ARIMA (autoregressive integrated moving average) and related methods use a combination of time-lagged data to predict the future. Often these approaches are used for relatively short-term prediction. In this post, I want to use business knowledge and data in a model to predict future sales. My view is that such models are more likely to behave well over time, and can be adapted to business changes by adding factors deemed newly important or removing factors found unimportant.
The linear regression method is available in base R using the lm() function. For the neural network, I used the RPROP Algorithm, published by Martin Riedmiller and Heinrich Braun of the University of Karlsruhe. RPROP is a useful variation of neural network modeling that, in some forms, automatically finds the appropriate learning rate.
For my purposes, I mainly use the rprop+ version of the algorithm, very nicely implemented in the R neuralnet package by Stefan Fritsch & Frauke Guenther with contributors Marc Suling & Sebastian M. Mueller. The implementation is available as a library as well as source code. rprop+ appears to be quite resilient in that it converges easily without a lot of hyperparameter tuning. This is important to my point here, which is that implementing a nonlinear model isn’t necessarily more difficult than implementing a linear one.
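To make the idea of an automatically adapted step size concrete, here is a minimal base R sketch of an Rprop-style update on a toy function. It is an illustration under stated assumptions, not the library's code: it uses the simpler variant without weight backtracking (so it is not rprop+ itself), and the function name `rprop_minimise`, the toy objective, and the parameter values are all hypothetical choices of mine.

```r
# Minimal Rprop-style minimiser (no weight backtracking), for illustration only.
# Each weight has its own step size, adapted from the SIGN of its gradient.
rprop_minimise <- function(grad_fn, w, n_iter = 100,
                           eta_plus = 1.2, eta_minus = 0.5,
                           step_init = 0.1, step_min = 1e-6, step_max = 50) {
  step <- rep(step_init, length(w))
  prev_grad <- rep(0, length(w))
  for (iter in seq_len(n_iter)) {
    g <- grad_fn(w)
    same_sign <- g * prev_grad
    # Grow the step where the gradient kept its sign, shrink it where it flipped.
    step <- ifelse(same_sign > 0, pmin(step * eta_plus, step_max),
            ifelse(same_sign < 0, pmax(step * eta_minus, step_min), step))
    # Move each weight against the sign of its gradient; the magnitude is ignored.
    w <- w - sign(g) * step
    prev_grad <- g
  }
  w
}

# Toy objective (hypothetical): f(w) = (w1 - 3)^2 + 10 * (w2 + 1)^2,
# whose two directions have very different curvature.
grad_fn <- function(w) c(2 * (w[1] - 3), 20 * (w[2] + 1))

w_hat <- rprop_minimise(grad_fn, w = c(0, 0))
print(w_hat)  # should end up close to c(3, -1)
```

Because only the sign of each gradient component is used, the two weights converge at a similar pace despite the tenfold difference in curvature, and no global learning rate had to be chosen by hand.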
The data are shown here:
Figure 1. Data used in this analysis. Data are synthesized time series data representing sales. Periodic spikes are noted along with long term nonlinear behavior.
The data are a simulated time series of sales data, which has spikes at quarterly and shorter periods, as well as longer-term variations. There are about 3¼ years of data at daily granularity, and I want to test the potential to use the first 3 years as training data, then predict another 90 days into the future. The business case is that various factors are believed to influence sales, some internal to our business and some external. We have a set of 8 factors, one of which is past sales, the remainder being market factors (such as GDP, economic activity, etc.) and internal data (such as sales pipeline, sales incentive programs, new product introductions (NPI), etc.). The past sales are used with a phasing (lag) of one year, which I arrived at by noting that there are annual business cycles. (Note: there are many more rigorous ways to determine phasing; I’ll address that in another post.) These factors are labeled a, c, f, g, h, i, j, and k in what follows. The sales values are labeled Y. For each model, then, I provide the 1210 daily values of the 8 factors plus the actual sales results, and build a model that fits the historical data as well as possible.
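As a rough illustration of the one-year phasing, the base R sketch below builds a lagged copy of a sales series so that each day is paired with sales from 365 days earlier. The series here is synthetic and the names (`sales`, `past_sales`, `lag_days`) are hypothetical; I also assume an extra year of history exists so that all 1210 modeled days have a lagged value.

```r
set.seed(7)

lag_days <- 365             # one-year phasing, motivated by annual business cycles
n_days   <- 1210 + lag_days # assumption: one extra year of history is available

# Synthetic stand-in for daily sales (not the article's data).
sales <- 100 + 10 * sin(2 * pi * seq_len(n_days) / 365) + rnorm(n_days)

# Shift the series forward by lag_days; the first year has no lagged value.
past_sales <- c(rep(NA, lag_days), head(sales, -lag_days))

df <- data.frame(day = seq_len(n_days), Y = sales, past_sales = past_sales)
df <- df[complete.cases(df), ]  # keep only days that have a lagged value

print(nrow(df))  # number of usable daily rows
print(head(df, 3))
```

Because the lagged factor is already known a year ahead, a factor built this way does not itself need to be forecast when predicting 90 days forward.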
Using the lm() function in R, I fit a linear model to the data. Linear means that each factor is multiplied by a coefficient (determined by the fit process) and these are simply added together to estimate the resulting sales. The equation looks as follows:
Y = C1*a + C2*c + C3*f + C4*g + C5*h + C6*i + C7*j + C8*k + C0
where, as noted, Y is the sales. Note that C0 is a constant value that is also determined by the regression modeling. Once I have such a model, sales are predicted by simply multiplying an instance of the factors by the coefficients and summing to get a prediction for sales. To predict future sales, values of the factors are needed. If the factors are not time lagged from the sales, then, for example, a forecast for GDP or the future NPI plan would be needed. Depending on the specific case, all the factors might be time-lagged values, so that future sales can be predicted from known data. In other cases, a forecast is needed for some of the factors. These details are not important for the evaluation here.
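A minimal sketch of this fit with lm() is shown below. The data are randomly generated stand-ins (the article's data set is not reproduced here), and the "true" coefficients are arbitrary values chosen only so there is something to recover. The sketch also checks that a prediction really is just the coefficients times the factors, plus the constant.

```r
set.seed(42)
n <- 200
factor_names <- c("a", "c", "f", "g", "h", "i", "j", "k")

# Hypothetical factor values and coefficients, for illustration only.
X <- matrix(rnorm(n * 8), ncol = 8, dimnames = list(NULL, factor_names))
true_coef <- c(2, -1, 0.5, 3, 0, 1.5, -2, 1)
Y <- as.numeric(10 + X %*% true_coef + rnorm(n, sd = 0.1))
sales <- data.frame(X, Y = Y)

# Fit Y = C1*a + C2*c + ... + C8*k + C0.
fit <- lm(Y ~ a + c + f + g + h + i + j + k, data = sales)
print(round(coef(fit), 2))

# Predict one instance two ways: with predict(), and by hand.
new_day <- sales[1, factor_names]
pred    <- predict(fit, newdata = new_day)
manual  <- coef(fit)["(Intercept)"] + sum(coef(fit)[factor_names] * unlist(new_day))

print(c(predict = unname(pred), manual = unname(manual)))
```

The two numbers agree, which is the whole content of a linear model: no factor can change how another factor affects the result.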
Neural network tests
As a first step, I will use a simple neural network that has 8 input nodes (one for each factor) plus a “bias” node. The bias node is motivated by the behavior of a single unit, also known as a perceptron: including a bias lets a single perceptron shift its decision threshold, so it can represent linearly separable logical operators such as AND and OR (XOR is not linearly separable and requires a hidden layer). A bias is therefore usually included in the network architecture. These 9 nodes are fed into a single hidden layer of 3 nodes, which, along with a bias node, are fed into the output node. The network is shown here:
Figure 2. Simple neural network with one hidden layer comprising three nodes. Values for the eight predictors are presented at the left; the outputs of those nodes are multiplied by weights (determined by the modeling process) and sent to the hidden layer. The values shown are the weights determined using the training data (see below).
There are (at least) two ways that a neural network representation can model nonlinear behavior. First, every node from a given layer is connected to every node of the next layer. The values passed along these connections are multiplied by weights before feeding into the next node, where the weights are determined in the modeling process. Combined with the nonlinear node functions described next, these cross connections can model interactions between factors that the linear model does not. Second, a typical neural network node uses a nonlinear function to determine the node output from the node inputs. These functions are often called activation functions, another reference to organic neuron behavior. A common activation function is the sigmoid or logistic function.
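The forward pass of such an 8-3-1 network can be sketched in a few lines of base R. The weights below are random placeholders, not the fitted values from Figure 2, and I am assuming a logistic activation in the hidden layer with a linear output node, which the article does not state explicitly. The function and variable names are hypothetical.

```r
logistic <- function(z) 1 / (1 + exp(-z))

# Forward pass: 8 inputs -> 3 logistic hidden nodes -> 1 linear output.
forward_pass <- function(x, net) {
  hidden <- logistic(net$W_hidden %*% x + net$b_hidden)  # 3 x 1 activations
  as.numeric(net$w_out %*% hidden + net$b_out)           # single output value
}

set.seed(1)
net <- list(
  W_hidden = matrix(rnorm(3 * 8), nrow = 3),  # input-to-hidden weights (placeholders)
  b_hidden = rnorm(3),                        # hidden-layer bias weights
  w_out    = matrix(rnorm(3), nrow = 1),      # hidden-to-output weights
  b_out    = rnorm(1)                         # output bias weight
)

x    <- rnorm(8)                       # one day's values of the eight factors
y_0  <- forward_pass(rep(0, 8), net)
y_x  <- forward_pass(x, net)
y_2x <- forward_pass(2 * x, net)

# For a linear model, doubling the inputs doubles the change from the baseline,
# so this gap would be exactly zero. For the network it generally is not.
affine_gap <- (y_2x - y_0) - 2 * (y_x - y_0)
print(c(y_x = y_x, y_2x = y_2x, affine_gap = affine_gap))
```

The nonzero gap comes entirely from the logistic function: remove it and the two layers of weights collapse into a single set of linear coefficients.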
The two cases were run as indicated, with the results summarized in the following charts. In each case, the model was trained using the training data, excluding the last 90 days to be used as test data. Predictions were then made out to 90 days in the future from the end of the training data.
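The holdout scheme can be sketched as follows, again in base R with synthetic data. The single predictor `x`, the coefficients, and the use of RMSE as the error measure are my assumptions for illustration; the article compares the models visually and does not name a metric.

```r
set.seed(3)
n_days  <- 1210
horizon <- 90   # last 90 days are held out as test data

# Synthetic stand-in data with one hypothetical predictor.
df <- data.frame(day = seq_len(n_days), x = rnorm(n_days))
df$Y <- 5 + 2 * df$x + rnorm(n_days, sd = 0.5)

# Split by time order, never at random, so the test period lies strictly
# after the training period.
train <- head(df, n_days - horizon)
test  <- tail(df, horizon)

fit  <- lm(Y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error over the 90-day prediction window.
rmse <- sqrt(mean((test$Y - pred)^2))
print(c(train_rows = nrow(train), test_rows = nrow(test), rmse = rmse))
```

The same split would apply to the neural network: only the fitting and prediction calls change, which keeps the comparison between the two models fair.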
Figure 3. Training and test data overlaid with model predictions. The linear model exhibits short term features not representative of the data. The nonlinear model appears to perform better.
Figure 4. The same data only over the test/prediction range.
The linear model produces a stair-stepped output, and its spikes are exaggerated relative to the original data. The nonlinear model appears to do a better job. Figure 4 zooms in on only the prediction period, where the differences in model performance are clearer.