Last Updated on March 12, 2023 by Jay
Before diving straight into time series modeling in Python, let’s try to understand what a time series is. Most non-time series datasets include observations (or rows) that are independent of each other. For example, the dataset containing employees profile data such as employee id, years of experience (YOE), etc. as below. Each user has a unique id and is independent of the other user. Therefore, the order is not important for table #1 and table #2.
However, in time series data, observations are dependent of each other as they are measured over time. For example, the daily sales data for a store would contain records where each row would represent a day and their sales volume. Here, the rows are somewhat linked with the previous row. Therefore, the order of the data becomes very important.
Introduction to time series modeling
Since the basic rules of a general ML-based model are that the observations should be independent of each other; therefore, the ML models can’t be applied directly to these datasets. This is where time series modeling (aka forecasting) comes into the picture.
Time series modeling or forecasting is used to predict events through a sequence of time. It predicts future events by analyzing the trends of the past, on the assumption that future trends will hold similar to historical trends. Therefore, the application of time series modeling is almost everywhere in the real world, such as forecasting company sales/revenue, predicting stock prices, weather forecasts, etc.
Let’s look at some common approaches or modeling techniques used to forecast time series data.
- Moving averages (MA): This is the simplest way to forecast any time series data. As the name suggests, we use moving averages as the forecast value. For example, the sales of the next day would be roughly the average of the last three days.
- EWMA/Exponential Smoothening: Exponentially weighted moving average or Exponential Smoothening is similar to the concept of the Moving average, except that it gives more weight to the most recent observations.
- ARIMA stands for Autoregressive Integrated Moving Average. This is the first step towards supervised modeling (training model on the historical data to learn patterns). We will deep-dive more into this in the following section.
- Statistical models (such as TBATS, TSLM, ETS, etc.) have been in focus for many years and are still actively used in research.
- Prophet is an open-source Python (and R) library developed by Facebook to forecast time series data
- LSTM (Long Short-Term Memory) was the major breakthrough when Deep learning picked up the pace. It is a type of RNN (recurrent neural network) that learn the order dependence between items in a sequence.
In this article, we will deep-dive into time series modeling using ARIMA.
Introduction to ARIMA
ARIMA (AutoRegressive Integrated Moving Average) is the most common and widely used time series modeling technique.
Let’s break down these individual terms:
- AR (Auto Regressive): This means that the model uses the dependent relationship between an observation and some predefined number of lagged observations (also known as “time lag”)
- I (Integrated) means that the model subtracts an observation from observation at the previous time step in order to make the time-series stationary, i.e., constant mean and standard deviation.
- MA (Moving average) means that the model uses the relationship between the residual error (difference of forecasted value minus the actual value) and the observations
Overall, there are three main parameters for the ARIMA model which we need to tune:
- p is the number of lag observations (AR)
- d is the degree of differencing (I)
- q is the size/width of the moving average window (MA)
Earlier, we needed to test whether the time series is stationary or not and then accordingly identify the optimal set of (p,d,q) values. But with the evolution of Auto ARIMA, these steps are no longer required.
To understand more, let’s quickly get started by using ARIMA for forecasting stock prices.
Install the following Python libraries to follow along with this tutorial:
pip install numpy pip install pandas pip install sklearn pip install yfinance pip install pmdarima pip install statsmodels
Getting the Data
Getting any stock price is simply using the Yahoo finance API via the yfinance library.
Once installed, let’s import these packages into our environment. Please ensure you’re using any Python 3.x environment as the base. (if not, please install Python or directly install using Anaconda)
import numpy as np import pandas as pd import matplotlib.pyplot as plt from statsmodels.tsa.arima_model import ARIMA from sklearn.metrics import mean_squared_error import yfinance as yf import warnings warnings.filterwarnings('ignore')
Let’s load the stock data for Apple using the Yahoo finance API. We have filtered the data from 2012 as 10 years of data is sufficient for our experimentation. As observed, we have the daily stock price (open, close, high, low) and the volume traded.
df = yf.download('AAPL').reset_index() df = df[(df['Date'] >= "2012-01-01") & (df['Date'] <= "2022-08-09")] df.head() Date Open High Low Close Adj Close \ 7835 2012-01-03 14.621429 14.732143 14.607143 14.686786 12.540044 7836 2012-01-04 14.642857 14.810000 14.617143 14.765714 12.607436 7837 2012-01-05 14.819643 14.948214 14.738214 14.929643 12.747406 7838 2012-01-06 14.991786 15.098214 14.972143 15.085714 12.880666 7839 2012-01-09 15.196429 15.276786 15.048214 15.061786 12.860235 Volume 7835 302220800 7836 260022000 7837 271269600 7838 318292800 7839 394024400
Data Preparation for modeling
For the modeling purpose, we will train/predict the stock closing price. We all know that Apple has grown significantly over the last decade, which is reflected in the chart below.
plt.plot(df["Date"], df["Close"]) plt.title("Apple stock price over time") plt.xlabel("time") plt.ylabel("price") plt.show()
Before we put any bet on our models, it is very important to backtest our models on the past data. Therefore, we will divide the data into training and validation (or testing data), which is unseen by the model. For simplicity, let’s keep last year’s data for testing and remaining as training.
# split data into train and training set train_data = df[df['Date'] < '2021-08-01'] test_data = df[df['Date'] >= '2021-08-01'] # plotting the data plt.figure(figsize=(10,6)) plt.grid(True) plt.xlabel('Dates') plt.ylabel('Closing Prices') plt.plot(df['Date'], df['Close'], 'green', label='Train data') plt.plot(test_data['Date'], test_data['Close'], 'blue', label='Test data') plt.legend()
Auto ARIMA has made things easy by automatically identifying the optimal values for p, d, and q parameters. It starts by using a statistical test (such as Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller or Phillips–Perron) first to determine the value of d (differencing) to make the time series stationary. Then, it uses a step-wise search with multiple p and q parameters combinations to identify the best combination.
Let’s quickly train our first model on the Apple dataset as below using the pmdarima library (wrapper for statsmodel). Also, let’s understand the parameters used for modeling.
- y: time series to fit (array object)
- start_p, start_q, max_p, max_q: Starting and maximum values of p and q that would be used in the search
- test: Statistical test used to determine the best value for d (here, we are using adf (Augmented Dickey Fuller test)
- d: set as None to let the model determine the value, else we can share the value to be used
- m: Number of periods in each season (example, m=4 for quarterly data, 12 for monthly data, 1 for annual/non-seasonal data)
- seasonal: Boolean to decide whether to fit seasonal ARIMA
- start_P: Starting value for the seasonal model
- D: Differencing order for seasonal, if None, it will automatically select based on the seasonal tests
- trace: Boolean to decide whether to print status on the fits (generally used for debugging)
- stepwise: Boolean to decide whether to use stepwise search to identify model parameters. The stepwise approach is generally faster than fitting all combinations and is less likely to overfit on the training data
from pmdarima.arima import auto_arima model_autoARIMA = auto_arima(y = train_data['Close'], start_p=0, start_q=0, test='adf', max_p=3, max_q=3, m=1, d=None, start_P=0, D=0, trace=True, # error_action='ignore', # suppress_warnings=True, stepwise=True)
Performing stepwise search to minimize aic ARIMA(0,1,0)(0,0,0) intercept : AIC=7353.829, Time=0.04 sec ARIMA(1,1,0)(0,0,0) intercept : AIC=7327.125, Time=0.11 sec ARIMA(0,1,1)(0,0,0) intercept : AIC=7328.467, Time=0.12 sec ARIMA(0,1,0)(0,0,0) : AIC=7357.594, Time=0.04 sec ARIMA(2,1,0)(0,0,0) intercept : AIC=7328.058, Time=0.14 sec ARIMA(1,1,1)(0,0,0) intercept : AIC=7327.547, Time=0.31 sec ARIMA(2,1,1)(0,0,0) intercept : AIC=7330.789, Time=0.32 sec ARIMA(1,1,0)(0,0,0) : AIC=7332.276, Time=0.06 sec Best model: ARIMA(1,1,0)(0,0,0) intercept Total fit time: 1.147 seconds
Interpret The Stepwise Search Result
The model starts with performing a stepwise search to identify the optimal p and q values (note that the d value is fixed to 1, which is coming from ADF test). The model utilizes different combinations and tries to minimize the AIC. Therefore, the combination giving the lowest AIC would be the optimal parameter. In this case, (p=1, d=1, q=0) and (P=0, D=0, Q=0) is coming out to be the best parameters.
SARIMAX Results ============================================================================== Dep. Variable: y No. Observations: 2410 Model: SARIMAX(1, 1, 0) Log Likelihood -3660.563 Date: Wed, 10 Aug 2022 AIC 7327.125 Time: 23:03:44 BIC 7344.486 Sample: 0 HQIC 7333.440 - 2410 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 0.0604 0.023 2.675 0.007 0.016 0.105 ar.L1 -0.1088 0.009 -12.784 0.000 -0.125 -0.092 sigma2 1.2228 0.011 107.552 0.000 1.201 1.245 =================================================================================== Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 29919.10 Prob(Q): 0.91 Prob(JB): 0.00 Heteroskedasticity (H): 26.79 Skew: -0.13 Prob(H) (two-sided): 0.00 Kurtosis: 20.26 =================================================================================== Warnings:  Covariance matrix calculated using the outer product of gradients (complex-step).
The “auto_arima” function gives the best-fitted object, and the summary function helps us summarize all the findings from the model. We can use the different statistics of the model, but these are generally relevant when comparing two models. To understand if the model is fitted correctly or not, it is better to look at the diagnostics charts, which are directly available as part of the model object.
Let’s try to understand these charts and check if any of these are violating any assumptions.
1) Standardized residual: Residuals are the errors (forecasting values – actual values). The charts show that the residuals fluctuate around the zero mean and have a uniform variance which is ideal for any linear model.
2) Histogram plus estimated density: The KDE (kernel density estimate) plot suggests normal distribution with a mean of zero
3) Normal Q-Q: This check is important for any linear model to check if any outliers in the data impact the model fitting. The blue dots falling the majority over the red line show that the data is acceptable.
4) Correlogram or ACF (auto-correlation): The plot shows the residual errors are not autocorrelated. Any autocorrelation would imply that there is some pattern in the residual errors which are not currently explained in the model.
Overall, the model seems like a good fit without violating any assumptions. Therefore, we can train the final ARIMA model using the optimal parameters available from the Auto ARIMA.
Training The Model
# Fitting Model model = ARIMA(train_data['Close'], order=(1,1,0)) # using p=1, d=1, q=0 fitted = model.fit(disp=-1) # display < 0 means no output to be printed
Using the fitted object, let’s forecast for the next 256 days (equal to the dates available in the test data). We will use the alpha value of 5% (95% confidence level) to get a lower and upper band as well, along with the point forecast.
# Forecast fc, se, conf = fitted.forecast(test_data.shape, alpha=0.05) # 95% conf
To compare the forecast value, let’s plot them together with the actual values to understand if the model can learn the patterns from the historical data.
# converting to series fc_series = pd.Series(fc, index=test_data['Date']) lower_series = pd.Series(conf[:, 0], index=test_data['Date']) upper_series = pd.Series(conf[:, 1], index=test_data['Date']) # plot all the series together plt.figure(figsize=(10,5), dpi=100) plt.plot(train_data['Date'], train_data['Close'], label='Training data') plt.plot(test_data['Date'], test_data['Close'], color = 'blue', label='Actual Stock Price') plt.plot(fc_series, color = 'orange',label='Predicted Stock Price') plt.fill_between(lower_series.index, lower_series, upper_series, color='k', alpha=.10) plt.title('Apple Stock Price Prediction') plt.xlabel('Time') plt.ylabel('Apple Stock Price') plt.legend(loc='upper left', fontsize=8) plt.show()
Woah! As observed, our model is able to capture the trend pretty well. Also, the actual values are well within our confidence band as well. This is some good performance by our model.
Although to compare the actual accuracy/error of the model, we can utilize the below performance matrices to understand the model performance.
from sklearn.metrics import mean_absolute_error import math # report performance mse = mean_squared_error(test_data['Close'], fc) print('MSE: '+str(mse)) mae = mean_absolute_error(test_data['Close'], fc) print('MAE: '+str(mae)) rmse = math.sqrt(mean_squared_error(test_data['Close'], fc)) print('RMSE: '+str(rmse)) mape = np.mean(np.abs(fc - test_data['Close'])/np.abs(test_data['Close'])) print('MAPE: '+str(mape))
MSE: 181.22031923233524 MAE: 10.673073757623026 RMSE: 13.461809656667088 MAPE: 0.06684075156735672
Our model’s point forecast gives us 6.7% MAPE (mean absolute percentage error), which is not bad. We can improve more on the model as we know only historical prices don’t drive the stock market. There are a lot of internal/external factors that define the market.
Hope that you were able to get the basics of time series modeling and how you could use that for forecasting the stock prices. However, you should do your own research and testing before putting any real money based on these algorithms.
Predicting Bitcoin Price Using Stock To Flow Model & Linear Regression