This data report analyzes Australian beer production from the first quarter of 1956 to the second quarter of 1994 and forecasts the next 16 quarters, or four years. The data exhibits both a clear trend and seasonality, meaning that it must be transformed before a model can be fitted. Several models are fitted and their residuals tested against standard diagnostic procedures before the best is chosen and used to forecast future production.
The data suggests that a downwards trend may be emerging near the end of the series; while the point forecasts indicate a slight rise, the 95% confidence intervals leave room for continued decline, so production may still need to be scaled back over the coming years. Quarter to quarter, the seasonal pattern is expected to persist with roughly the same variance.
Between the first quarter of 1956 and the second quarter of 1994, Australian beer was produced in a strongly seasonal manner. Quarter 1 consistently has the highest production of each year, while the other quarters remain at a much lower level. A likely explanation for the seasonality is that quarter 1 falls during the Australian summer, prompting higher sales and thus higher production. Additionally, this data set spans much of the Cold War. While the upwards trend in the data cannot be specifically attributed to it, the increase in production coincides with rising tensions, and the downwards trend that appears to start towards the end of the series coincides with the end of the Cold War.
The data appears to begin a downwards trend around 1992, and the purpose of this report is to forecast the next 16 quarters, or four years, and determine whether this trend will continue. The point forecasts actually suggest the opposite, but the 95% confidence intervals show that continued decline is still possible. The data analyzed is “Quarterly beer production in Australia: megalitres. March 1956 - June 1994”, recorded by the Australian Bureau of Statistics and sourced from the Time Series Data Library by Rob Hyndman and Yangzhuoran Yang.
It is clear from the graph that the data has both a trend and a seasonal component. There is a pronounced upwards trend in production between 1961 and 1974. The seasonality appears as a sharp spike in production during quarter 1 of each year, followed by a sharp drop to a consistently lower level through quarters 2, 3, and 4. A milder upwards trend continues between 1974 and 1991, and the seasonal component is present throughout. See Figure 1.
Before settling on a specific method, multiple techniques were tried to make the data stationary, including log and Box-Cox transforms, differencing at various lags, and combinations of these. The final method chosen was differencing once at lag 1 and once at lag 4. The plot after differencing at lag 1 can be seen in Figure 2.
Differencing at lag 1 removes the trend component of the data, as shown by the red mean line, which is nearly zero. The next step is to difference the data at lag 4, which removes the seasonal component and makes the data stationary. As Figure 3 suggests, the twice-differenced data has a mean of nearly zero and a roughly constant variance. This is the final transformation performed on the data.
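A minimal sketch of this transformation, using base R's diff() as in the appendix code:
# REMOVING THE TREND AND SEASONAL COMPONENTS BY DIFFERENCING
beerdata.d1 <- diff(beerdata.ts, lag = 1) # LAG-1 DIFFERENCE REMOVES THE TREND
beerdata.d14 <- diff(beerdata.d1, lag = 4) # LAG-4 DIFFERENCE REMOVES THE QUARTERLY SEASONALITY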
The autocorrelation and partial autocorrelation functions can be used to identify a potential ARMA or SARIMA model. The ACF, found in Figure 4, suggests that the transformed data is stationary. Additionally, based on the significant lags at 1, 3, 4, and 8, a SARIMA model with q = 3 and Q = 2 is worth considering.
The partial autocorrelation function in Figure 5 also supports stationarity and suggests an AR component with p = 3 or 4, based on the number of significant lags before the values drop off. Combining these observations, the following models should be tested: AR(3), AR(4), ARMA(3,3), ARMA(4,3), and MA(3). Additionally, there is a seasonal component such that Q = 2, so a SARIMA model will be chosen.
To determine which models are worth running through diagnostics, corrected Akaike Information Criterion (AICc) testing is automated over a large grid of potential SARIMA models, as in the R code below.
# FITTING A MODEL
# THIS GRID SEARCH WAS USED TO SELECT A MODEL BY AICc
df <- expand.grid(p=0:4, q=0:4, P=0:4, Q=0:4)
df <- cbind(df, AICc=NA)
# AICc TESTING
for (i in 1:nrow(df)) {
  arima.obj <- NULL # RESET SO A FAILED FIT IS NOT MISTAKEN FOR A SUCCESS
  try(arima.obj <- arima(beerdata.d14, order=c(df$p[i], 1, df$q[i]),
                         seasonal=list(order=c(df$P[i], 1, df$Q[i]), period=4),
                         method="ML"), silent=TRUE)
  if (!is.null(arima.obj)) {
    df$AICc[i] <- AICc(arima.obj)
  }
}
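Once the grid has been filled in, the strongest candidates can be read off by sorting on AICc; a minimal sketch:
# INSPECTING THE TWO LOWEST-AICc CANDIDATES (FAILED FITS REMAIN NA AND SORT LAST)
head(df[order(df$AICc), ], 2)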
AICc testing is done to determine the most suitable values of p, q, P, and Q, each ranging from zero to four inclusive. Testing indicates that the two best candidate models are SARIMA\((0, 1, 3)\times(0, 1, 2)_4\) and SARIMA\((0, 1, 3)\times(1, 1, 3)_4\). Diagnostic testing can then begin on each of the two models: the first round of diagnostics is performed on the first model, referred to as fit one, and the second round on fit two. The first model is fitted in the code chunk below.
# FITTING THE FIRST MODEL
fit1.arima = arima(beerdata.d14, order = c(0, 1, 3),
seasonal = list(order = c(0, 1, 2), period = 4),
method = "ML")
Before beginning the diagnostic tests, the first step is to ensure that the roots of the estimated polynomials lie outside of the unit circle. Using uc.check() from the UnitCircle package, the MA and seasonal MA coefficients of the SARIMA model are checked against the unit circle. The output below shows that all roots lie outside of it, so the fitted model is invertible.
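The checks are invoked as in the appendix code, where ma.coef1 and sma.coef1 are the MA and seasonal MA coefficients extracted from fit1.arima:
# CHECKING THE ROOTS OF THE MA AND SEASONAL MA POLYNOMIALS
uc.check(c(1, ma.coef1), plot_output = FALSE) # MA PART
uc.check(c(1, sma.coef1), plot_output = FALSE) # SEASONAL MA PART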
## real complex outside
## 1 1.050924 0.000000 TRUE
## 2 1.178085 0.979962 TRUE
## 3 1.178085 -0.979962 TRUE
## *Results are rounded to 6 digits.
## real complex outside
## 1 1.000495 0 TRUE
## 2 1.156976 0 TRUE
## *Results are rounded to 6 digits.
The residuals are then extracted with the command res1 <- residuals(fit1.arima) and plotted in Figure 6, where they can be seen to resemble white noise. While the mean is not exactly zero, it is close enough to be consistent with a white noise process.
The next step in the diagnostic process is plotting the ACF and PACF of the residuals, shown in Figure 7 and Figure 8. Both resemble the ACF and PACF of a white noise process, further supporting the adequacy of the model.
Next, the independence of the residuals is tested using both the Box-Pierce and Ljung-Box tests, with hypotheses \(H_0\): the residuals are independent vs. \(H_a\): the residuals exhibit correlation. Both tests return p-values greater than 0.05, so the null hypothesis is not rejected. The Shapiro-Wilk test is then used to test \(H_0\): the residuals are normally distributed vs. \(H_a\): the residuals are not normally distributed. This test also fails to reject the null hypothesis, so the residuals can be treated as normally distributed. Altogether, these tests indicate that the residuals resemble white noise.
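The tests are run as in the appendix code; fitdf = 5 discounts the five estimated coefficients (three MA and two seasonal MA) from the degrees of freedom:
# TESTS FOR INDEPENDENCE AND NORMALITY OF THE RESIDUALS
Box.test(res1, lag = 12, type = "Box-Pierce", fitdf = 5)
Box.test(res1, lag = 12, type = "Ljung-Box", fitdf = 5)
shapiro.test(res1)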
Next, the histogram and normal Q-Q plot of the residuals can be examined. The histogram in Figure 9 shows, by the overlaid curve and the shape of the bars, that the residuals are approximately normally distributed. The Q-Q plot in Figure 10 shows that the residuals follow the normal line closely.
This shows that the first fit is a suitable model for the data: the data is stationary and the residuals resemble white noise. Next, the model SARIMA\((0,1,3)\times(1,1,3)_4\) will be tested. It is fitted in the step below and, per uc.check(), is stationary and invertible.
fit2.arima = arima(beerdata.d14, order = c(0, 1, 3), # MODEL 2
seasonal = list(order = c(1, 1, 3), period = 4),
method = "ML")
## real complex outside
## 1 1.056998 0.000000 TRUE
## 2 1.296814 0.991182 TRUE
## 3 1.296814 -0.991182 TRUE
## *Results are rounded to 6 digits.
## real complex outside
## 1 1.123885 0 TRUE
## *Results are rounded to 6 digits.
## real complex outside
## 1 1.004734 0 TRUE
## 2 -1.000151 0 TRUE
## 3 1.140861 0 TRUE
## *Results are rounded to 6 digits.
After extracting the residuals with the same command used for model one, plotting them shows that they also resemble a white noise process, as seen in Figure 11. Figures 12 and 13 show that their ACF and PACF likewise resemble those of white noise, so the residuals can be tested using the Box-Pierce and Ljung-Box tests.
The Box-Pierce and Ljung-Box tests fail to reject the null hypothesis that the residuals are independent, and the Shapiro-Wilk test fails to reject normality. The next step is plotting the residuals as histogram and normal Q-Q plots. These plots, seen in Figure 14 and Figure 15, continue to support the conclusion that the residuals resemble white noise. This model is therefore also suitable for forecasting.
Based on the diagnostics, either model is a reasonable choice, but the better model can be determined from the histograms presented in Figure 9 and Figure 14. While there is no clear differentiation in most respects, the histogram in Figure 9 is closer to normal than that of Figure 14, so model one is chosen as the final model.
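As a numerical cross-check, the two fits can also be compared directly, assuming the same AICc() helper used in the grid-search code is available:
# COMPARING THE TWO CANDIDATE MODELS BY AICc
AICc(fit1.arima)
AICc(fit2.arima)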
Based on the ACF and PACF of the transformed data in Section 2, both an MA(3) component and a seasonal component with Q = 2 were proposed, and in Section 3 the selected SARIMA model had q = 3 and Q = 2, making model one SARIMA\((0,1,3)\times(0,1,2)_4\). This can be represented algebraically as \((1-B)(1-B^4)X_t = (1 - 1.955B + 1.381B^2 - 0.405B^3)(1 - 1.864B^4 + 0.864B^8)Z_t\). Now that the residuals have been analyzed and a model chosen, future values can be forecast.
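The numeric coefficients in this expression come from the fitted object and can be inspected with, for example:
# INSPECTING THE ESTIMATED MA AND SEASONAL MA COEFFICIENTS
round(fit1.arima$coef, 3)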
To forecast future values, the model is first refitted to the original data set before differencing. Predictions are then obtained with the R command predict() and plotted in Figure 16 below. The 95% confidence interval, calculated as \(\text{Predicted Value} \pm 1.96\times\text{Standard Error}\), shows that the forecasts carry considerable uncertainty. Despite this, the values will most likely not stray far from the predictions or from those of the previous years.
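A minimal sketch of the forecast step, mirroring the appendix code, where fit1.orig is model one refitted to the undifferenced series:
# FORECASTING 16 QUARTERS AHEAD WITH 95% BOUNDS
pred1 <- predict(fit1.orig, n.ahead = 16)
upper <- pred1$pred + 1.96 * pred1$se # UPPER 95% BOUND
lower <- pred1$pred - 1.96 * pred1$se # LOWER 95% BOUND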
Contrary to the expectation that the downwards trend would continue, the forecasts indicate a slight rise in production over the next four years. The 95% confidence interval, however, leaves room for a continued decline, consistent with the slight downwards trend after 1992; this is what I would consider the most likely scenario. The model chosen to produce these forecasts is given below. \[
(1-B)(1-B^4)X_t = (1 - 1.955B + 1.381B^2 - 0.405B^3)(1 - 1.864B^4 + 0.864B^8)Z_t
\]
Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang.com/tsdl/
Australian Bureau of Statistics. Quarterly beer production in Australia: megalitres, March 1956 - June 1994.
David Stoffer (2021). astsa: Applied Statistical Time Series Analysis. v1.14. https://github.com/nickpoison/astsa/
Brian Ripley, Bill Venables, Douglas M. Bates, Kurt Hornik, Albrecht Gebhardt, and David Firth (2022). MASS: Support Functions and Datasets for Venables and Ripley’s MASS. v7.3-55. http://www.stats.ox.ac.uk/pub/MASS4/
Hadley Wickham and RStudio (2021). tidyverse: Easily Install and Load the ‘Tidyverse’. v1.3.1. https://tidyverse.tidyverse.org
Jonathan Berrisch (2018). UnitCircle: Check if Roots of a Polynomial Lie Outside of the Unit Circle. v0.1.3. https://github.com/BerriJ/UnitCircle
Yeuk Yu Lee and Samuel Ventura (2017). lindia: Automated Linear Regression Diagnostic. v0.9. https://github.com/yeukyul/lindia
Rob Hyndman, George Athanasopoulos, Christoph Bergmeir, G. Caceres, L. Chhay, M. O’Hara-Wild, F. Petropoulos, S. Razbash, E. Wang, F. Yasmeen, R Core Team, R. Ihaka, D. Reid, D. Shaub, Y. Tang, and Z. Zhou (2022). forecast: Forecasting Functions for Time Series and Linear Models. v8.16. https://pkg.robjhyndman.com/forecast/
Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. v2.3. https://CRAN.R-project.org/package=gridExtra
knitr::opts_chunk$set(echo = TRUE)
library(tsdl)
library(astsa)
library(MASS)
library(tidyverse)
library(UnitCircle)
library(lindia)
library(forecast)
library(gridExtra)
# SELECTING THE DATA FROM THE TSDL
beerdata <- tsdl[[99]]
beerdata.ts <- ts(beerdata, frequency = 4, start = c(1956, 1))
# CONVERTING THE TIME SERIES INTO A DATA FRAME AND PLOTTING WITH GGPLOT
beerdf <- data.frame(Y=as.matrix(beerdata.ts), date=time(beerdata.ts))
ggplot(beerdf, aes(x = date, y = Y)) + # PLOTTING THE DATA USING GGPLOT
geom_line() + # AS A LINE
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression("Beer Production in Megaliters" ~ (X[t])),
caption = "Figure 1: Quarterly Beer Production per Year") +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
# MAKING THE DATA STATIONARY
beerdata.d1 <- diff(beerdata.ts, 1) # DIFFERENCING BY 1
beerdf.d1 <- data.frame(Y=as.matrix(beerdata.d1), date=time(beerdata.d1))
ggplot(beerdf.d1, aes(x = date, y = Y)) + # PLOTTING THE DATA
geom_line() + # AS A LINE
geom_abline(aes(slope = 0, intercept = 0)) +
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla ~ X[t]),
caption = expression("Figure 2: " ~ X[t] ~ " Differenced at Lag 1")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
beerdata.d14 <- diff(beerdata.d1, 4) # DIFFERENCING BY 4
beerdf.d14 <- data.frame(Y=as.matrix(beerdata.d14), date=time(beerdata.d14))
ggplot(beerdf.d14, aes(x = date, y = Y)) + # PLOTTING THE DATA
geom_line() + # AS A LINE
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_abline(aes(slope = 0, intercept = 0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla^4 ~ nabla ~ X[t]),
caption = expression("Figure 3: " ~ X[t] ~ " Differenced at Lags 1 and 4")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
ggAcf(beerdata.d14) + # PLOTTING THE ACF USING GGPLOT
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'ACF',
caption = expression("Figure 4: Autocorrelation Function for" ~ nabla^4 ~ nabla ~ X[t])) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
ggPacf(beerdata.d14) + # PLOTTING THE PACF USING GGPLOT
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'PACF',
caption = expression("Figure 5: Partial Autocorrelation Function for" ~ nabla^4 ~ nabla ~ X[t])) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
# FITTING A MODEL
# THIS GRID SEARCH WAS USED TO SELECT A MODEL BY AICc
df <- expand.grid(p=0:4, q=0:4, P=0:4, Q=0:4)
df <- cbind(df, AICc=NA)
# AICc TESTING
for (i in 1:nrow(df)) {
  arima.obj <- NULL # RESET SO A FAILED FIT IS NOT MISTAKEN FOR A SUCCESS
  try(arima.obj <- arima(beerdata.d14, order=c(df$p[i], 1, df$q[i]),
                         seasonal=list(order=c(df$P[i], 1, df$Q[i]), period=4),
                         method="ML"), silent=TRUE)
  if (!is.null(arima.obj)) {
    df$AICc[i] <- AICc(arima.obj)
  }
}
# FITTING THE FIRST MODEL
fit1.arima = arima(beerdata.d14, order = c(0, 1, 3),
seasonal = list(order = c(0, 1, 2), period = 4),
method = "ML")
# FITTING VALUES TO THE MODEL
ma.coef1 <- fit1.arima$coef[1:3]
sma.coef1 <- fit1.arima$coef[4:5]
# CHECKING THE ROOTS
uc.check(c(1, ma.coef1), plot_output = FALSE)
uc.check(c(1, sma.coef1), plot_output = FALSE)
# DIAGNOSTIC TESTING
res1 <- residuals(fit1.arima)
# PLOTTING FITTED RESIDUALS
res1df <- data.frame(Y=as.matrix(res1), date=time(res1))
ggplot(res1df, aes(x = date, y = Y)) +
geom_line() +
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_abline(aes(slope = 0, intercept = 0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla^4 ~ nabla ~ X[t]),
caption = expression("Figure 6: Fitted Residuals of Model 1")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
res1ACF <- ggAcf(res1) + # ACF OF RESIDUALS OF MODEL 1
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'ACF',
caption = expression("Figure 7: Autocorrelation Function of the Residuals of Model 1")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
res1PACF <- ggPacf(res1) + # PACF OF RESIDUALS OF MODEL 1
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'PACF',
caption = expression("Figure 8: Partial Autocorrelation Function of the Residuals of Model 1")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(res1ACF, res1PACF)
# TEST FOR INDEPENDENCE OF RESIDUALS
Box.test(res1, lag = 12, type = c("Box-Pierce"), fitdf = 5) # BOX PIERCE
Box.test(res1, lag = 12, type = c("Ljung-Box"), fitdf = 5) # BOX LJUNG
# NORMALITY ASSUMPTION
shapiro.test(res1)
# HISTOGRAM AND QQ-PLOT
hist1 <- gghistogram(res1, add.normal = TRUE) +
ggplot2::theme_bw() +
labs(x = "Residuals",
y = "Frequency",
caption = "Figure 9: Histogram of the Residuals of Model 1") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
n = 1:length(res1)
res1.fit = lm(res1 ~ n)
qq1 <- gg_qqplot(res1.fit) +
ggplot2::theme_bw() +
labs(y = "Sample Quantile",
caption = "Figure 10: Normal-QQ Plot of the Residuals of Model 1") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(hist1, qq1, ncol = 2)
fit2.arima = arima(beerdata.d14, order = c(0, 1, 3), # MODEL 2
seasonal = list(order = c(1, 1, 3), period = 4),
method = "ML")
# FITTING VALUES TO THE MODEL
ma.coef2 <- fit2.arima$coef[1:3]
sar.coef2 <- fit2.arima$coef[4]
sma.coef2 <- fit2.arima$coef[5:7]
# CHECKING THE ROOTS
uc.check(c(1, ma.coef2), plot_output = FALSE)
uc.check(c(1, sar.coef2), plot_output = FALSE)
uc.check(c(1, sma.coef2), plot_output = FALSE)
#DIAGNOSTIC TESTING
res2 <- residuals(fit2.arima)
# PLOTTING FITTED RESIDUALS
res2df <- data.frame(Y=as.matrix(res2), date=time(res2))
ggplot(res2df, aes(x = date, y = Y)) +
geom_line() +
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_abline(aes(slope = 0, intercept = 0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla^4 ~ nabla ~ X[t]),
caption = expression("Figure 11: Fitted Residuals of Model 2")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
res2ACF <- ggAcf(res2) + # ACF OF RESIDUALS OF MODEL 2
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'ACF',
caption = expression("Figure 12: Autocorrelation Function of the Residuals of Model 2")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
res2PACF <- ggPacf(res2) + # PACF OF RESIDUALS OF MODEL 2
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'PACF',
caption = expression("Figure 13: Partial Autocorrelation Function of the Residuals of Model 2")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(res2ACF, res2PACF) # PLOTTING AS A GRID
Box.test(res2, lag = 12, type = c("Box-Pierce"), fitdf = 6) # BOX PIERCE
Box.test(res2, lag = 12, type = c("Ljung-Box"), fitdf = 6) # LJUNG BOX
shapiro.test(res2)
# HISTOGRAM AND QQ-PLOT
hist2 <- gghistogram(res2, add.normal = TRUE) +
ggplot2::theme_bw() +
labs(x = "Residuals",
y = "Frequency",
caption = "Figure 14: Histogram of the Residuals of Model 2") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
n = 1:length(res2)
res2.fit = lm(res2 ~ n)
qq2 <- gg_qqplot(res2.fit) +
ggplot2::theme_bw() +
labs(y = "Sample Quantile",
caption = "Figure 15: Normal-QQ Plot of the Residuals of Model 2") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(hist2, qq2, ncol = 2)
fit1.orig <- arima(beerdata.ts, order = c(0, 0, 3),
seasonal = list(order = c(0, 1, 2), period = 4),
method = "ML") # FITTING THE MODEL TO ORIGINAL DATA
pred1 <- predict(fit1.orig, n.ahead=16) # FORECASTING n = 16
pred1 <- data.frame(pred1, "date" = tail(beerdf$date, 16) + 4)
newrow <- data.frame(pred = beerdata.ts[154], se = 0, date = 1994.25)
pred1 <- rbind(newrow, pred1)
ggplot() + # PLOTTING THE FORECASTED DATA
geom_line(beerdf, mapping = aes(x = date, y = Y)) + # ORIGINAL
scale_x_continuous(breaks = scales::pretty_breaks((1998.25 - 1956)/2),
expand = c(.01,0)) +
geom_line(pred1, mapping = aes(x = date, y = pred),
color = 'red') + # FORECAST
geom_line(pred1[2:17,], mapping = aes(x = date, y = pred + 1.96*se),
linetype = "dashed", col = "gray38") + # 95% CONFIDENCE INCREASE
geom_line(pred1[2:17,], mapping = aes(x = date, y = pred - 1.96*se),
linetype = "dashed", col = "gray38") + # 95% CONFIDENCE DECREASE
labs(x = "Year",
y = expression("Beer Production in Megaliters" ~ (X[t])),
caption = "Figure 16: Quarterly Beer Production per Year with a Four Year Forecast (16 Quarters) Using Model 1") +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))