## Blog: ARIMA/SARIMA vs LSTM with Ensemble learning Insights for Time Series Data

### Motivation

Five types of traditional time series models are most commonly used in epidemic time series forecasting:

- Autoregressive (AR),
- Moving Average (MA),
- Autoregressive Moving Average (ARMA),
- Autoregressive Integrated Moving Average (ARIMA), and
- Seasonal Autoregressive Integrated Moving Average (SARIMA) models.

AR models express the current value of the time series linearly in terms of its previous values and the current residual, whereas MA models express the current value of the time series linearly in terms of its current and previous residual series.

ARMA models are a combination of AR and MA models, in which the current value of the time series is expressed linearly in terms of its previous values and in terms of current and previous residual series. The time series defined in AR, MA, and ARMA models are stationary processes, which means that the mean of the series of any of these models and the covariance among its observations do not change with time.

A non-stationary time series must first be transformed into a stationary one. An ARIMA model fits a non-stationary series by building on the ARMA model and adding a differencing step, which effectively transforms the non-stationary data into stationary data. SARIMA models, which combine seasonal differencing with an ARIMA model, are used to model time series with periodic characteristics.
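As a rough illustration of the differencing idea described above, the sketch below applies ordinary and seasonal differencing to a made-up series. The `difference` helper, the period of 4, and the data are assumptions chosen for the example; they do not come from the studies discussed here.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical series: a linear trend plus a pattern that repeats every 4 steps.
pattern = [0, 3, 1, 5]
series = [2 * t + pattern[t % 4] for t in range(12)]

first_diff = difference(series, lag=1)     # removes the trend, keeps the seasonal swings
seasonal_diff = difference(series, lag=4)  # removes the period-4 pattern
flattened = difference(seasonal_diff)      # one more ordinary difference

print(first_diff)
print(seasonal_diff)
print(flattened)
```

On this toy series the seasonal difference is constant, which is the kind of stationary behavior an ARMA model expects after the "I" step of ARIMA or SARIMA.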

In a comparison of a wide range of forecasting models for time series, the machine learning methods were all outperformed by simple classical methods, with ETS and ARIMA performing best overall. The following figure represents the model comparisons.

Apart from traditional time series forecasting, however, deep learning for time series prediction has advanced considerably. Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks have gained a lot of attention in recent years, with applications in many disciplines including computer vision, natural language processing, and finance. Deep learning methods can identify structure and patterns in data, such as non-linearity and complexity, in time series forecasting.

It remains an open research question whether, and by how much, newly developed deep learning algorithms for forecasting time series data, such as Long Short-Term Memory (LSTM), are superior to the traditional algorithms.

The blog is structured as follows:

- Understanding deep learning algorithms RNN, LSTM and the role of ensemble learning with LSTM to aid in performance improvement.
- Understanding conventional time series modeling technique ARIMA and how it helps to improve time series forecasting in ensembling methods when used in conjunction with MLP and multiple linear regression.
- Understanding problems and scenarios where ARIMA can be used vs LSTM and the pros and cons behind adopting one against the other.
- Understanding how time series modeling with SARIMA can be combined with other spatial, decision-based, and event-based models using ensemble learning.

The study does not look at more complex time series problems, such as datasets with complex irregular temporal structures, missing observations, heavy noise, and complex interrelationships between multiple variates.

### LSTM

LSTM is a special kind of RNN composed of a set of cells that can memorize sequences of data. Each cell captures and stores the data stream, and the cells link one module from the past to the present one, conveying information from several past time instants to the current one. Thanks to the gates in each cell, data can be discarded, filtered, or added before it is passed on to the next cells.

The gates are based on sigmoid neural network layers and let the cells optionally pass data through or discard it. Each sigmoid layer yields numbers between zero and one that describe how much of each segment of data should be let through. A value of zero means “let nothing pass through”, whereas a value of one means “let everything pass through.” Three types of gates are involved in each LSTM cell to control its state:

**Forget Gate** outputs a number between 0 and 1, where 1 means “completely keep this” and 0 means “completely ignore this.”

**Memory Gate** chooses which new data to store in the cell, using a sigmoid layer followed by a tanh layer. The sigmoid layer, called the “input gate layer”, chooses which values will be modified. Next, a tanh layer creates a vector of new candidate values that could be added to the state.

**Output Gate** decides what each cell outputs. The output is based on the cell state, together with the filtered and newly added data.
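The three gates can be sketched as a single time step of a textbook LSTM cell. This is a minimal pure-Python sketch, not the implementation used in any of the cited studies: the function names, the `params` layout (one weight matrix and bias per gate, acting on the concatenation of the previous hidden state and the input), and the all-zero weights are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v, b):
    """Compute W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def gate(params, name, z, activation):
    W, b = params[name]
    return [activation(a) for a in affine(W, z, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x                               # concatenate [h_prev, x]
    f = gate(params, "forget", z, sigmoid)       # how much of the old state to keep
    i = gate(params, "input", z, sigmoid)        # which candidate values to write
    g = gate(params, "candidate", z, math.tanh)  # new candidate values
    o = gate(params, "output", z, sigmoid)       # how much of the state to expose
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical cell with 2 hidden units and 1 input feature; all weights are zero,
# so every sigmoid gate outputs exactly 0.5 and the candidate values are 0.
hidden, inputs = 2, 1
zero_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
zero_b = [0.0] * hidden
params = {name: (zero_W, zero_b) for name in ("forget", "input", "candidate", "output")}

h, c = lstm_step([0.3], [0.0, 0.0], [1.0, -2.0], params)
print(h, c)
```

With these zero weights the forget gate halves the previous cell state and nothing new is written, which makes the role of each gate easy to check by hand.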

Both in terms of learning how it works and in terms of implementation, the LSTM model provides considerably more options for fine-tuning than ARIMA.

**Ensembles of LSTM for time-series Forecasting**

Several studies have found that a single LSTM network trained on a particular dataset is very likely to perform poorly on an entirely different time series unless rigorous parameter optimization is performed. Because LSTM is very successful in the forecasting domain, researchers use a so-called **stacking ensemble** approach, in which **multiple LSTM networks are stacked** and combined to provide a more accurate prediction, with the aim of proposing a more general model for forecasting problems.

Research on four different forecasting problems concluded that the stacked LSTM networks outperformed both the regular LSTM networks and the ARIMA model in terms of RMSE.

The general quality of the ensemble method studied could be increased by tuning the parameters of each individual LSTM. A single LSTM network tends to perform poorly for two reasons: it needs heavy parameter tuning, and it generalizes badly when used on a dataset other than the one it was trained on. Hence the concept of **ensembling LSTM networks** evolved as a better choice for forecasting problems, reducing the need to heavily optimize parameters and increasing the quality of the predictions.

Among other ensembling techniques, **hybrid ensemble learning** with **Long Short-Term Memory (LSTM)**, as depicted in the above figure, can be used to forecast financial time series. The **AdaBoost** algorithm is used to combine predictions from several individual LSTM networks.

First, the AdaBoost algorithm generates the training samples by sampling with replacement from the original dataset. Second, an LSTM is trained to forecast each training sample separately. Third, the AdaBoost algorithm integrates the forecasts of all the LSTM predictors to generate the ensemble result. Empirical results on two major daily exchange rate datasets and two stock market index datasets show that the AdaBoost-LSTM ensemble learning approach outperforms other single forecasting models and ensemble learning approaches.

The AdaBoost-LSTM ensemble learning approach looks promising for forecasting financial time series with nonlinearity and irregularity, such as exchange rates and stock indexes.

In another example of ensemble learning with LSTM, depicted in the above figure, the input layer contains inputs from time *t1* to *tn*, and the input for each time instant is fed to its own LSTM layer. The output *hk* of each LSTM layer, which represents the information at time *k*, is fed to the final output layer, which aggregates all of the outputs it receives by computing their mean. The mean is then fed into a logistic regression layer to predict the label of the sample.
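The aggregation step described above (mean over the per-time-step outputs, followed by logistic regression) can be sketched as follows. The hidden outputs, weights, and bias below are made-up numbers, and the helper names are hypothetical; in a real network the outputs would come from the LSTM layers and the weights would be learned.

```python
import math

def mean_pool(hidden_outputs):
    """Average the per-time-step outputs h_1..h_n element-wise."""
    n = len(hidden_outputs)
    return [sum(column) / n for column in zip(*hidden_outputs)]

def logistic(features, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical outputs h_1, h_2, h_3 of three LSTM layers (2 units each).
hidden_outputs = [[0.2, -0.4], [0.6, 0.0], [0.1, 0.4]]

pooled = mean_pool(hidden_outputs)
probability = logistic(pooled, weights=[1.0, -1.0], bias=0.0)
label = int(probability >= 0.5)
print(pooled, probability, label)
```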

### ARIMA

The ARIMA algorithm is a class of models that captures temporal structures in time series data. With an ARIMA model alone, however, it is hard to model nonlinear relationships between variables.

The Autoregressive Integrated Moving Average (ARIMA) model generalizes the Autoregressive Moving Average (ARMA) model, which combines Autoregressive (AR) and Moving Average (MA) processes into a composite model of the time series.

**AR: Autoregression**. A regression model that uses the dependencies between an observation and a number of lagged observations.

**I: Integrated**. Makes the time series stationary by taking differences between observations at different times.

**MA: Moving Average**. An approach that takes into account the dependency between an observation and the residual error terms of a moving average model applied to lagged observations (q).

A simple form of an AR model of order p, i.e., AR(p), can be written as a linear process given by:

$$x_t = c + \sum_{i=1}^{p} \phi_i \, x_{t-i} + \xi_t$$

Here *xt* represents the stationary variable, *c* is a constant, the terms *φi* are the autocorrelation coefficients at lags 1, 2, …, p, and *ξt*, the residuals, are a Gaussian white noise series with mean zero and variance *σ²*.
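To make the AR(p) form concrete, here is a minimal sketch of a one-step forecast: the constant plus a weighted sum of the p most recent values, with the noise term replaced by its mean of zero. The function name `ar_predict` and the coefficient values are assumptions for illustration, not fitted parameters.

```python
def ar_predict(history, c, phi):
    """One-step AR(p) forecast: c + phi[0]*x[t-1] + ... + phi[p-1]*x[t-p]."""
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    recent = history[-p:][::-1]  # most recent value first, matching lags 1..p
    return c + sum(coef * value for coef, value in zip(phi, recent))

# Hypothetical AR(2) model with made-up coefficients.
history = [1.0, 2.0, 3.0]
forecast = ar_predict(history, c=0.5, phi=[0.6, 0.3])
print(forecast)  # 0.5 + 0.6 * 3.0 + 0.3 * 2.0
```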

With seasonal time series data, it is likely that short-run non-seasonal components also contribute to the model. The non-seasonal part is an ARIMA model, typically written ARIMA(p, d, q), where:

- p is the number of lag observations utilized in training the model (i.e., lag order).
- d is the number of times differencing is applied (i.e. the degree of differencing).
- q is known as the size of the moving average window (i.e., order of moving average).

For example, ARIMA(5,1,0) sets the lag order to 5 for autoregression, uses a difference order of 1 to make the time series stationary, and does not use a moving average window (i.e., a window of size zero). RMSE can be used as the error metric to evaluate the model and assess the accuracy of its forecasts.

For seasonal data, we therefore need to estimate a seasonal ARIMA model, which incorporates both non-seasonal and seasonal factors in a multiplicative model. The general form of a seasonal ARIMA model is denoted (p, d, q) × (P, D, Q)S, where p is the non-seasonal AR order, d is the non-seasonal differencing, q is the non-seasonal MA order, P is the seasonal AR order, D is the seasonal differencing, Q is the seasonal MA order, and S is the time span of the repeating seasonal pattern. The most important step in estimating a seasonal ARIMA model is to identify the values of (p, d, q) and (P, D, Q).

Based on the time plot of the data, if, for instance, the variance grows with time, we should apply variance-stabilizing transformations and differencing.

Then we can identify preliminary values of the autoregressive order p, the order of differencing d, the moving average order q, and their seasonal counterparts P, D, and Q. The autocorrelation function (ACF) measures the amount of linear dependence between observations that are separated by a lag k and helps suggest the moving average order q; the partial autocorrelation function (PACF) helps determine how many autoregressive terms p are necessary; and the inverse autocorrelation function (IACF) helps detect over-differencing. The parameter d is the number of times the series is differenced to turn a non-stationary time series into a stationary one.
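As a rough sketch of what the ACF measures, the function below computes the sample autocorrelation at each lag for a made-up series with a period of 4. The function name and the data are assumptions for illustration; a statistics library would normally be used for this and for the PACF.

```python
def acf(series, max_lag):
    """Sample autocorrelation of the series for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denominator = sum(v * v for v in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denominator
        for k in range(max_lag + 1)
    ]

# Hypothetical series that repeats every 4 steps.
series = [1, 0, -1, 0] * 6
correlations = acf(series, max_lag=4)
print([round(r, 3) for r in correlations])
```

The strong positive value at lag 4 and the strong negative value at lag 2 reflect the period-4 pattern, which is the kind of signal that suggests a seasonal term.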

In the popular univariate “Auto-Regressive Moving Average (ARMA)” method for a single time series, Auto-Regressive (AR) and Moving Average (MA) models are combined. Univariate “Auto-Regressive Integrated Moving Average (ARIMA)” extends ARMA by taking differencing into account in the model.

Multivariate ARIMA models and Vector Auto-Regression (VAR) models are other popular forecasting models; they generalize the univariate ARIMA and univariate autoregressive (AR) models, respectively, by allowing for more than one evolving variable.

ARIMA is a linear-regression-based forecasting approach, best suited to one-step out-of-sample forecasts. Here, the algorithm performs a **multi-step out-of-sample forecast** with re-estimation, i.e., the model is re-fitted at each step to build the best estimation model. The algorithm takes a time series dataset as input, builds a forecast model, and reports the root mean square error of the prediction. It keeps two data structures: “history”, which accumulates the training data added at each iteration, and “prediction”, which holds the successive predicted values for the test data.
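The re-estimation loop with its “history” and “prediction” structures can be sketched as below. To keep the example self-contained, the model is an assumed stand-in: an ARIMA(1,1,0)-style forecaster without a constant, fitted by least squares on the first differences. In practice a library ARIMA implementation would take its place; all function names and the toy data are hypothetical.

```python
import math

def fit_ar1_on_differences(history):
    """Least-squares AR(1) coefficient for the first differences (no constant)."""
    diffs = [b - a for a, b in zip(history, history[1:])]
    numerator = sum(prev * cur for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    return numerator / denominator if denominator else 0.0

def forecast_one_step(history):
    """Forecast the next value from the last value and the last difference."""
    phi = fit_ar1_on_differences(history)
    return history[-1] + phi * (history[-1] - history[-2])

def walk_forward(train, test):
    """Re-fit before every forecast, then add the true value to the history."""
    history = list(train)  # needs at least 3 points for the fit above
    predictions = []
    for actual in test:
        predictions.append(forecast_one_step(history))
        history.append(actual)
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, test)]
    rmse = math.sqrt(sum(squared_errors) / len(test))
    return predictions, rmse

# Hypothetical data: a perfectly linear series, so the stand-in model is exact.
series = [2.0 * t for t in range(10)]
predictions, rmse = walk_forward(series[:6], series[6:])
print(predictions, rmse)
```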

**Ensemble learning with ARIMA**

Three prediction models, namely ARIMA, Multilayer Perceptron (MLP), and Multiple Linear Regression (MLR), are trained, validated, and tested individually to predict the target pollutant concentration. To train and fit the ARIMA model, the p, d, and q values are estimated from the autocorrelation function (ACF) and partial autocorrelation function (PACF). The MLP model uses the ‘lbfgs’ solver for weight optimization, as it can converge faster and perform better on low-dimensional data.

This solver gives better results than the stochastic gradient descent optimizer. The activation function is ‘relu’, the Rectified Linear Unit (ReLU) function, which helps avoid the vanishing gradient problem. The predictions from each model are then combined into a final prediction using a weighted average ensemble: each model's prediction is multiplied by its weight and the weighted predictions are then averaged. The weight of each base model is adjusted according to that model's performance.

In this weighted average, the model with better performance is given more weight, and the weights are assigned so that they sum to 1.
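A weighted average ensemble of this kind can be sketched as follows. The text only says that better models get more weight and that the weights sum to 1; weighting by inverse validation error is one assumed way to do that, and the error values and predictions below are made up.

```python
def inverse_error_weights(errors):
    """Assumed scheme: weight each model by 1/error, normalized to sum to 1."""
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]

def weighted_average(predictions_by_model, weights):
    """Combine per-model forecasts step by step using the given weights."""
    return [
        sum(w * p for w, p in zip(weights, step_predictions))
        for step_predictions in zip(*predictions_by_model)
    ]

# Hypothetical validation RMSE for ARIMA, MLP, and MLR (lower is better).
validation_rmse = [1.0, 2.0, 4.0]
weights = inverse_error_weights(validation_rmse)

# Hypothetical two-step forecasts from the three models.
arima_pred = [10.0, 12.0]
mlp_pred = [14.0, 12.0]
mlr_pred = [17.0, 12.0]
combined = weighted_average([arima_pred, mlp_pred, mlr_pred], weights)
print(weights, combined)
```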

### SARIMA

ARIMA is one of the most widely used forecasting methods for univariate time series data, but it does not support time series with a seasonal component. SARIMA (Seasonal Autoregressive Integrated Moving Average) extends the ARIMA model to support the seasonal component of the series; it is used for univariate data containing both trend and seasonality. A SARIMA model is composed of trend elements and seasonal elements.

Three trend parameters are the same as in the ARIMA model:

- **p**: Trend autoregression order.
- **d**: Trend difference order.
- **q**: Trend moving average order.

Four seasonal elements are not part of ARIMA:

- **P**: Seasonal autoregressive order.
- **D**: Seasonal difference order.
- **Q**: Seasonal moving average order.
- **m**: The number of time steps for a single seasonal period.

Thus SARIMA model can be specified as:

*SARIMA (p, d, q) (P,D,Q) m*

If m is 12, for example, the data are monthly and the seasonal cycle is yearly.
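For readers who want the notation spelled out, the multiplicative SARIMA (p, d, q) (P, D, Q) m model is commonly written with the backshift operator B, where B x_t = x_{t-1}. The operator and the polynomial symbols below are standard textbook notation rather than something taken from this post, and the constant term is omitted for brevity:

$$\phi_p(B)\,\Phi_P(B^m)\,(1 - B)^d\,(1 - B^m)^D\,x_t = \theta_q(B)\,\Theta_Q(B^m)\,\xi_t$$

$$\phi_p(B) = 1 - \phi_1 B - \dots - \phi_p B^p, \qquad \theta_q(B) = 1 + \theta_1 B + \dots + \theta_q B^q$$

The seasonal polynomials Φ_P and Θ_Q have the same shape but act on B^m, so with m = 12 the factor (1 − B^12) subtracts the value observed in the same month of the previous year.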

SARIMA time series models can also be combined with spatial and event-based models to yield ensemble models that solve multi-dimensional ML problems. Such an ML model can be designed to predict cell load in cellular networks at different times of the day throughout the year, as illustrated in the sample figure below.

*Autocorrelation, trend, and seasonality (weekday and weekend effects) from time series analysis can be used to interpret temporal influence.*

*Regional and cell wise load distribution can be used to predict sparse and over loaded cells in varying intervals of time.*

*Events (holidays, special mass gatherings and others) can be predicted using decision trees.*

### Dataset, Problem and Model Selection

When analyzing the domain of a problem to be solved by either classical machine learning or deep learning, certain factors need to be taken into consideration before conclusively choosing the right model:

- The amount by which performance metrics differ in classical time-series models (ARIMA/SARIMA) vs deep learning models.
- The long-term or short-term business impact of the model selection.
- Design, implementation, and maintenance cost of the more complex model.
- The loss of interpretability.

First, the data are highly dynamic. It is often difficult to tease out the structure that is embedded in time series data. Second, time series data can be nonlinear and contain highly complex autocorrelation structure. Data points across different periods of time can be correlated with each other and a linear approximation sometimes fails to model all the structure in the data. Traditional methods such as autoregressive models attempt to estimate parameters of a model that can be viewed as a smooth approximation to the structure that generated the data.

Given the above factors, ARIMA has been found to better model data that follow linear relationships, while RNNs (depending on the activation function) better model data with non-linear relationships. ARIMA therefore offers data scientists a good first choice. The same datasets can be further processed with non-linear models like RNNs when the Lee, White and Granger (LWG) test shows that the residuals still contain non-linear relationships.

When LSTM and ARIMA were applied to a set of financial data, the results indicated that LSTM was superior to ARIMA: the LSTM-based algorithm improved the prediction by 85% on average compared to ARIMA.

### Conclusion

The study concludes with some case studies of why specific machine learning methods perform so poorly in practice, given their impressive performance in other areas of artificial intelligence. It remains open to evaluate the reasons for poor performance of ARIMA/SARIMA and LSTM models and to devise mechanisms to improve their accuracy. Some of the areas of application of the models, and their performance, are listed below:

- ARIMA yields better results in forecasting short term, whereas LSTM yields better results for long term modeling.
- Traditional time series forecasting methods (ARIMA) focus on univariate data with linear relationships and fixed and manually-diagnosed temporal dependence.
- On machine learning problems with substantial datasets, the average reduction in error rates obtained by LSTM is between 84 and 87 percent compared to ARIMA, indicating the superiority of LSTM over ARIMA.
- The number of training passes, known as “epochs” in deep learning, has no effect on the performance of the trained forecast model, which exhibits truly random behavior.
- LSTMs when compared to simpler NNs like RNN and MLP
- Neural networks (LSTMs and other deep learning methods) offer ways to divide huge datasets into several smaller batches and train the network in multiple stages. The batch size (the size of each chunk) is the number of training examples used in one batch. The term iteration refers to the number of batches needed to complete one pass of training over the entire dataset.
- LSTM is undoubtedly more complicated and difficult to train, and in most cases it does not exceed the performance of a simple ARIMA model.
- Classical methods like ETS and ARIMA out-perform machine learning and deep learning methods for one-step forecasting on univariate datasets.
- Classical methods like Theta and ARIMA out-perform machine learning and deep learning methods for multi-step forecasting on univariate datasets.
- Classical methods like ARIMA focus on fixed temporal dependence: the relationship between observations at different times, which necessitates analysis and specification of the number of lag observations provided as input.
- Machine learning and deep learning methods do not yet deliver on their promise for univariate time series forecasting and there is much research left to be done.
- Neural networks add the capability to learn possibly noisy and nonlinear relationships with arbitrarily defined but fixed numbers of inputs. In addition, NNs can output multivariate and multi-step forecasts.
- Recurrent neural networks (RNNs) add explicit handling of ordered observations and can adapt to learn temporal dependencies from context. Given one observation at a time from a sequence, an RNN can learn which previously seen observations are relevant and how relevant they are for forecasting.
- As LSTMs are equipped to learn long-term correlations in a sequence, they can model complex multivariate sequences without the need to specify any time window.
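To make the batch size, iteration, and epoch terminology from the list above concrete, here is a small sketch of the arithmetic. The sample count, batch size, and helper names are made-up values for illustration.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed for one full pass (epoch) over the data."""
    return math.ceil(num_samples / batch_size)

def batches(data, batch_size):
    """Split the data into consecutive chunks of at most batch_size items."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Hypothetical training set of 1000 examples with a batch size of 64.
steps = iterations_per_epoch(1000, 64)
chunks = batches(list(range(10)), 4)
print(steps)   # 15 full batches plus one partial batch
print(chunks)
```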

**References**

- https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0194889
- https://machinelearningmastery.com/findings-comparing-classical-and-machine-learning-methods-for-time-series-forecasting/
- https://arxiv.org/pdf/1803.06386.pdf
- https://pdfs.semanticscholar.org/e58c/7343ea25d05f6d859d66d6bb7fb91ecf9c2f.pdf
- S. Krstanovic and H. Paulheim, “Ensembles of recurrent neural networks for robust time series forecasting”, in Artificial Intelligence XXXIV, M. Bramer and M. Petridis, Eds., Cham: Springer International Publishing, 2017, pp. 34– 46, ISBN: 978–3–319–71078–5.
- https://link.springer.com/chapter/10.1007/978-3-319-93713-7_55
- http://nebula.wsimg.com/5b49ad24a16af2a07990487493272154?AccessKeyId=DFB1BA3CED7E7997D5B1&disposition=0&alloworigin=1
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3641111/
- Traffic Prediction Based Power Saving in Cellular Networks: A Machine Learning Method https://www.cse.cuhk.edu.hk/lyu/_media/conference/slzhao_sigspatial17.pdf?id=publications%3Aall_by_year&cache=cache

*Source: Artificial Intelligence on Medium*