Time series forecasting is an important area of data science and machine learning that deals with predicting future values based on historical data. One of the most commonly used models for time series forecasting is the ARIMA (AutoRegressive Integrated Moving Average) model. In this guide, we will explore ARIMA models and how they can be used for time series forecasting.
What is Time Series Forecasting?
Time series forecasting uses observations recorded in time order to predict values that have not yet been observed. A model is fit to patterns in the historical data, such as trend, seasonality, and autocorrelation, and those patterns are then extrapolated forward. Time series forecasting is used in many different fields, including finance, economics, and marketing.
What is an ARIMA Model?
An ARIMA model is a type of time series model that combines three components: Autoregression (AR), Integration (I), and Moving Average (MA). Together, these components describe how a value depends on its own recent past and on recent forecast errors, once differencing has removed trends from the data.
Autoregression (AR)
Autoregression refers to the use of past values of the variable to predict future values. In an AR model, the value of the variable at time t is predicted using a linear combination of the past p values of the variable. The order of the autoregression model is denoted by p.
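As a rough illustration of the AR idea, the sketch below computes a one-step AR(p) prediction as a constant plus a weighted sum of the last p observations. The function name `ar_predict`, the coefficients, and the sample values are made up for illustration; in practice the coefficients would be estimated from data.

```python
def ar_predict(history, coeffs, const=0.0):
    """One-step AR(p) prediction: const + sum(coeffs[i] * y[t-1-i])."""
    p = len(coeffs)
    if len(history) < p:
        raise ValueError("need at least p past values")
    # Put the most recent value first, so coeffs[0] multiplies y[t-1].
    recent = history[-p:][::-1]
    return const + sum(c * y for c, y in zip(coeffs, recent))


# Hypothetical AR(2) coefficients and a short made-up history.
history = [10.0, 12.0, 11.0, 13.0]
prediction = ar_predict(history, coeffs=[0.5, 0.25], const=1.0)
print(prediction)  # 1.0 + 0.5 * 13.0 + 0.25 * 11.0 = 10.25
```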
Moving Average (MA)
Moving Average refers to the use of past prediction errors to predict future values. In an MA model, the value of the variable at time t is predicted using a linear combination of the past q prediction errors. The order of the moving average model is denoted by q.
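The "past prediction errors" are not observed directly; they are built up step by step as the model runs through the data. The sketch below shows this for an MA(1) model. The function name `ma1_one_step`, the coefficient `theta`, the `mean`, and the sample series are all hypothetical, and the error before the first observation is assumed to be zero.

```python
def ma1_one_step(series, theta, mean):
    """Run an MA(1) model through a series and predict the next value.

    Each prediction is mean + theta * (previous prediction error).
    """
    predictions, errors = [], []
    prev_error = 0.0  # assumption: no error before the first observation
    for y in series:
        pred = mean + theta * prev_error
        error = y - pred
        predictions.append(pred)
        errors.append(error)
        prev_error = error
    next_prediction = mean + theta * prev_error
    return predictions, errors, next_prediction


# Made-up values for illustration only.
preds, errs, next_pred = ma1_one_step([10.0, 12.0, 9.0], theta=0.5, mean=10.0)
print(preds)      # [10.0, 10.0, 11.0]
print(errs)       # [0.0, 2.0, -2.0]
print(next_pred)  # 9.0
```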
Integrated (I)
Integrated refers to the use of differencing to make the time series stationary. A stationary time series has a constant mean and variance, and its statistical properties do not change over time. Differencing replaces each value with its change from the previous value, that is, it subtracts the previous value from the current one. The order of differencing, denoted by d, is the number of times this operation is applied.
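A small example may make the order of differencing concrete. The series below is made up: its increments grow by one each step, so one round of differencing leaves a trend and a second round leaves a constant.

```python
import pandas as pd

# Made-up series whose increments grow steadily.
levels = pd.Series([100, 103, 107, 112, 118])

# d = 1: change from the previous value (the first entry has no predecessor).
first_diff = levels.diff().dropna()

# d = 2: difference of the differences.
second_diff = levels.diff().diff().dropna()

print(first_diff.tolist())   # [3.0, 4.0, 5.0, 6.0]
print(second_diff.tolist())  # [1.0, 1.0, 1.0]
```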
ARIMA Model
An ARIMA model combines these three components into a single model. The notation for an ARIMA model is ARIMA(p, d, q), where p is the order of the autoregression, d is the order of differencing, and q is the order of the moving average.
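Written out, one common way to state the model is shown below. Here $B$ is the backshift operator ($B y_t = y_{t-1}$), $\varepsilon_t$ is the error at time $t$, and $w_t$ is the series after differencing $d$ times. The symbols $c$, $\phi_i$, $\theta_j$, and $w_t$ are notation introduced for this sketch, and some textbooks use the opposite sign for the MA coefficients.

```latex
\begin{aligned}
w_t &= (1 - B)^d \, y_t \\
w_t &= c + \sum_{i=1}^{p} \phi_i \, w_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
\end{aligned}
```

With $d = 0$ and $q = 0$ this reduces to a pure AR(p) model, and with $d = 0$ and $p = 0$ to a pure MA(q) model.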
How to Build an ARIMA Model
Building an ARIMA model involves several steps:
- Load the time series data.
- Check for stationarity.
- Make the time series stationary (if necessary).
- Determine the values of p, d, and q.
- Fit the ARIMA model.
- Make predictions.
Step 1: Load the Time Series Data
The first step is to load the time series data into Python. The data can be loaded from a CSV file or from a database. Once loaded, it is convenient to hold the data in a pandas DataFrame indexed by date. The examples in this guide assume a file named data.csv with a date column and a value column.
import pandas as pd
import matplotlib.pyplot as plt
# Load the time series data from a CSV file
data = pd.read_csv('data.csv', index_col='date', parse_dates=True)
# Plot the time series data
plt.plot(data)
plt.show()
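After loading, it can help to confirm that the index really is a date index, that it is sorted, and that it has a regular frequency. The sketch below does this on a tiny in-memory CSV so that it runs without a file. The sample values, the column names `date` and `value`, and the monthly frequency ("MS", month start) are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for a real file.
csv_text = """date,value
2023-01-01,112
2023-02-01,118
2023-03-01,132
2023-04-01,129
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# Sort by date and declare a regular monthly (month-start) frequency.
# asfreq inserts NaN rows for any missing months, which makes gaps visible.
sample = sample.sort_index().asfreq("MS")

print(type(sample.index).__name__)
print(sample.index.freqstr)
print(int(sample["value"].isna().sum()), "missing values")
```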
Step 2: Check for Stationarity
The second step is to check if the time series is stationary. A stationary time series has a constant mean and variance, and the properties of the time series do not change over time. There are several methods for checking for stationarity, including:
- Plotting the rolling mean and rolling standard deviation.
- Performing the Augmented Dickey-Fuller (ADF) test.
The rolling mean and rolling standard deviation can be computed using the Pandas rolling function.
# Compute the rolling mean and rolling standard deviation
rolling_mean = data.rolling(window=12).mean()
rolling_std = data.rolling(window=12).std()
# Plot the time series data, rolling mean, and rolling standard deviation
plt.plot(data, label='Data')
plt.plot(rolling_mean, label='Rolling Mean')
plt.plot(rolling_std, label='Rolling Std')
plt.legend()
plt.show()
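If the rolling mean and rolling standard deviation stay roughly constant over time, the time series is likely stationary. If either one drifts or changes level, the time series probably needs to be made stationary. This visual check is informal, so it is usually paired with a statistical test.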
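The same check can be reduced to numbers by comparing the rolling mean at the start and end of the series. The sketch below uses two made-up series, one with a steady upward trend and one that oscillates around a fixed level; the variable names and the window of 12 are illustrative choices.

```python
import pandas as pd

# Made-up trending series: 0, 1, 2, ..., 35.
trending = pd.Series(range(36), dtype=float)
trend_mean = trending.rolling(window=12).mean().dropna()
trend_drift = trend_mean.iloc[-1] - trend_mean.iloc[0]

# Made-up series that alternates around a constant level of 6.
flat = pd.Series([5.0, 7.0] * 18)
flat_mean = flat.rolling(window=12).mean().dropna()
flat_drift = flat_mean.iloc[-1] - flat_mean.iloc[0]

# A large drift in the rolling mean suggests a non-stationary series.
print("trending drift:", trend_drift)
print("flat drift:", flat_drift)
```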
The ADF test is a statistical test that can be used to determine if a time series is stationary. The null hypothesis of the test is that the time series is non-stationary. If the p-value of the test is less than a chosen significance level (e.g., 0.05), then the null hypothesis is rejected and the time series is considered stationary.
from statsmodels.tsa.stattools import adfuller
# Perform the ADF test
result = adfuller(data['value'])
# Print the p-value
print('p-value:', result[1])
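With its default arguments, adfuller returns a tuple whose first five entries are the test statistic, the p-value, the number of lags used, the number of observations, and a dictionary of critical values. The helper below is a hypothetical convenience wrapper, not part of statsmodels, that unpacks that tuple and applies the decision rule described above. The example tuple is made up so the sketch runs on its own; it is not the output of a real test.

```python
def summarize_adf(result, alpha=0.05):
    """Unpack an adfuller-style result tuple and apply the p-value rule."""
    statistic, p_value, used_lag, n_obs, critical_values = result[:5]
    return {
        "statistic": statistic,
        "p_value": p_value,
        "used_lag": used_lag,
        "n_obs": n_obs,
        "critical_values": critical_values,
        # Rejecting the null (unit root) is read as evidence of stationarity.
        "looks_stationary": p_value < alpha,
    }


# Made-up values in the same layout as an adfuller result.
fake_result = (-3.2, 0.02, 1, 98, {"1%": -3.5, "5%": -2.89, "10%": -2.58}, 250.0)

summary = summarize_adf(fake_result)
print(summary["p_value"], summary["looks_stationary"])
```

In real use the call would look like summarize_adf(adfuller(data['value'])).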
Step 3: Make the Time Series Stationary
If the time series is not stationary, then it may need to be made stationary. This can be done by taking first differences, second differences, or seasonal differences of the time series. The differenced time series can then be checked for stationarity using the methods described in Step 2.
# Take the first difference of the time series
diff = data.diff().dropna()
# Check for stationarity
rolling_mean = diff.rolling(window=12).mean()
rolling_std = diff.rolling(window=12).std()
result = adfuller(diff['value'])
# Plot the differenced time series, rolling mean, and rolling standard deviation
plt.plot(diff, label='Differenced Data')
plt.plot(rolling_mean, label='Rolling Mean')
plt.plot(rolling_std, label='Rolling Std')
plt.legend()
plt.show()
# Print the p-value of the ADF test
print('p-value:', result[1])
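Two details of differencing are easy to overlook: a seasonal difference subtracts the value one full season earlier rather than the previous value, and a first difference can be undone by a cumulative sum if the starting value is kept. The sketch below illustrates both on made-up series; the season length of 4 is an assumption for the example.

```python
import pandas as pd

# Seasonal differencing: made-up quarterly pattern that rises by 2 each year.
quarterly = pd.Series([10.0, 20.0, 30.0, 40.0, 12.0, 22.0, 32.0, 42.0])
seasonal_diff = quarterly.diff(periods=4).dropna()
print(seasonal_diff.tolist())  # [2.0, 2.0, 2.0, 2.0]

# Undoing a first difference: cumulative sum plus the first original value.
original = pd.Series([100.0, 103.0, 107.0, 112.0, 118.0])
differenced = original.diff().dropna()
restored = original.iloc[0] + differenced.cumsum()
print(restored.tolist())  # [103.0, 107.0, 112.0, 118.0]
```

The second part is what the "integrated" step does when forecasts made on the differenced scale are mapped back to the original scale.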
Step 4: Determine the Values of p, d, and q
The next step is to determine the values of p, d, and q for the ARIMA model. This can be done using the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots.
The ACF plot shows the correlation between the time series and its lagged values. The PACF plot shows the correlation between the time series and its lagged values, after removing the effects of the intermediate lags.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Plot the ACF and PACF of the differenced series
plot_acf(diff['value'], lags=20)
plot_pacf(diff['value'], lags=20)
plt.show()
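The value of p can be read from the PACF plot and the value of q from the ACF plot: for a pure AR(p) process the PACF cuts off after lag p, and for a pure MA(q) process the ACF cuts off after lag q. When both plots tail off gradually instead of cutting off, a mixed model with both AR and MA terms may be needed. The value of d is the order of differencing used to make the time series stationary.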
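To make "cuts off" a little less subjective, the sample ACF can be computed directly and compared against a rule-of-thumb significance band of 1.96/sqrt(N). The sketch below does this with NumPy. The helper names `sample_acf` and `significant_lags` are hypothetical, the simulated MA(1) series and its coefficient 0.8 are made up, and the band used here may differ from the shaded region that statsmodels draws.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a 1-D series."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[k:] * centered[:len(x) - k]) / denom
        for k in range(max_lag + 1)
    ])


def significant_lags(x, max_lag):
    """Lags 1..max_lag whose |ACF| exceeds the band 1.96 / sqrt(N)."""
    acf_values = sample_acf(x, max_lag)
    band = 1.96 / np.sqrt(len(x))
    return [k for k in range(1, max_lag + 1) if abs(acf_values[k]) > band]


# Simulated MA(1) series: each value mixes the current and previous noise term.
rng = np.random.default_rng(0)
noise = rng.normal(size=501)
ma1_series = noise[1:] + 0.8 * noise[:-1]

# For an MA(1) process, lag 1 is expected to stand out; a few higher lags
# can still cross the band by chance.
print(significant_lags(ma1_series, max_lag=10))
```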
Step 5: Fit the ARIMA Model
The next step is to fit the ARIMA model using the statsmodels library in Python.
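from statsmodels.tsa.arima.model import ARIMA
# Example orders only; replace them with the values chosen in Step 4
p, d, q = 1, 1, 1
# Fit the ARIMA model to the original (undifferenced) series;
# the model applies the differencing itself through d
model = ARIMA(data['value'], order=(p, d, q))
results = model.fit()
# Print the summary of the model
print(results.summary())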
The output of the summary() method provides information about the fitted model, including the coefficient estimates, standard errors, z-statistics, and p-values, along with fit measures such as AIC and BIC.
Step 6: Make Predictions
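Once the ARIMA model has been fit, predictions for future time points can be made with the predict() method of the fitted results. Passing start=len(data) begins the predictions at the first time point after the end of the observed data.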
# Make predictions for the next 12 time points
predictions = results.predict(start=len(data), end=len(data)+11)
# Plot the original time series data and the predicted values
plt.plot(data, label='Data')
plt.plot(predictions, label='Predictions')
plt.legend()
plt.show()
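To show what happens behind a multi-step forecast, the sketch below iterates a simplified ARIMA(1,1,0) model by hand: each step predicts the next change from the previous change, then adds it to the last level. The function name `forecast_arima_110`, the coefficient `phi`, the constant, and the two starting values are made up for illustration, and this is not how statsmodels computes its forecasts internally.

```python
def forecast_arima_110(series, phi, const, steps):
    """Iterate an AR(1) model on first differences and integrate back to levels."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    last_level = series[-1]
    last_diff = series[-1] - series[-2]
    forecasts = []
    for _ in range(steps):
        # Predict the next change from the previous change (AR part).
        next_diff = const + phi * last_diff
        # Add the change to the last level (undo the differencing).
        last_level = last_level + next_diff
        forecasts.append(last_level)
        last_diff = next_diff
    return forecasts


# Made-up inputs: the last observed change is 4, and phi halves it each step.
forecasts = forecast_arima_110([100.0, 104.0], phi=0.5, const=0.0, steps=3)
print(forecasts)  # [106.0, 107.0, 107.5]
```

Because each forecast feeds the next one, errors compound and uncertainty grows with the forecast horizon.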
Conclusion
In this article, we have discussed the basics of time series forecasting with ARIMA models. We covered the key steps involved in the process, including data preparation, stationarity testing, model parameter selection, model fitting, and prediction. ARIMA models are a powerful tool for time series forecasting, and with the right data and careful parameter selection, they can produce accurate and useful predictions.