This data report analyzes Australian beer production from the first quarter of 1956 to the second quarter of 1994 and forecasts the next 16 quarters, or four years. The data exhibits both a clear trend and seasonality, meaning that it must be transformed before a model can be fitted. Several models are fitted and their residuals tested against standard diagnostic procedures before the best is chosen and used to forecast future production.
The data suggests that a downwards trend may be emerging near the end of the series; while the point forecasts indicate a slight rise, the 95% confidence intervals leave room for continued decline, so production may still need to be scaled back over the coming years. Quarter to quarter, the seasonal pattern is expected to persist with roughly the same variance.
Between the first quarter of 1956 and the second quarter of 1994, Australian beer was produced in a strongly seasonal manner. Quarter 1 consistently has the highest production of each year, while the other quarters remain at a much lower level. A likely explanation for the seasonality is that quarter 1 falls during the Australian summer, prompting higher sales and thus higher production. Additionally, this data set spans much of the Cold War. While the upwards trend in the data cannot be specifically attributed to it, the increase in production coincides with rising tensions, and the downwards trend that appears to start towards the end of the series coincides with the end of the Cold War.
The data appears to begin a downwards trend around 1992, and the purpose of this report is to forecast the next 16 quarters, or four years, and determine whether this trend will continue. The point forecasts actually suggest the opposite, but the 95% confidence intervals show that continued decline is still possible. The data analyzed is “Quarterly beer production in Australia: megalitres. March 1956 - June 1994”, recorded by the Australian Bureau of Statistics and sourced from the Time Series Data Library by Rob Hyndman and Yangzhuoran Yang.
It is clear from the graph that the data has both a trend and a seasonal component. There is a pronounced upwards trend in production between 1961 and 1974. The seasonality appears as a sharp spike in production during quarter 1 of each year, followed by a sharp drop to a consistently lower level through quarters 2, 3, and 4. A milder upwards trend continues between 1974 and 1991, and the seasonal component is present throughout. See Figure 1.
Before settling on a specific method, multiple techniques were tried to make the data stationary, including log and Box-Cox transforms, differencing at various lags, and combinations of these. The final method chosen was differencing once at lag 1 and once at lag 4. The plot after differencing at lag 1 can be seen in Figure 2.
Differencing at lag 1 removes the trend component of the data, as shown by the red mean line, which is nearly zero. The next step is to difference the data at lag 4, which removes the seasonal component and makes the data stationary. As Figure 3 suggests, the twice-differenced data has a mean of nearly zero and a roughly constant variance. This is the final transformation performed on the data.
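A minimal sketch of this transformation, using base R's diff() as in the appendix code:
# REMOVING THE TREND AND SEASONAL COMPONENTS BY DIFFERENCING
beerdata.d1 <- diff(beerdata.ts, lag = 1) # LAG-1 DIFFERENCE REMOVES THE TREND
beerdata.d14 <- diff(beerdata.d1, lag = 4) # LAG-4 DIFFERENCE REMOVES THE QUARTERLY SEASONALITY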
The autocorrelation and partial autocorrelation functions can be used to identify a potential ARMA or SARIMA model. The ACF, found in Figure 4, suggests that the transformed data is stationary. Additionally, based on the significant lags at 1, 3, 4, and 8, a SARIMA model with q = 3 and Q = 2 is worth considering.
The partial autocorrelation function in Figure 5 also supports stationarity and suggests an AR component with p = 3 or 4, based on the number of significant lags before the values drop off. Combining these observations, the following models should be tested: AR(3), AR(4), ARMA(3,3), ARMA(4,3), and MA(3). Additionally, there is a seasonal component such that Q = 2, so a SARIMA model will be chosen.
To determine which models are worth running through diagnostics, corrected Akaike Information Criterion (AICc) testing is automated over a large grid of potential SARIMA models, as in the R code below.
# FITTING A MODEL
# THIS GRID SEARCH WAS USED TO SELECT A MODEL BY AICc
df <- expand.grid(p=0:4, q=0:4, P=0:4, Q=0:4)
df <- cbind(df, AICc=NA)
# AICc TESTING
for (i in 1:nrow(df)) {
  arima.obj <- NULL # RESET SO A FAILED FIT IS NOT MISTAKEN FOR A SUCCESS
  try(arima.obj <- arima(beerdata.d14, order=c(df$p[i], 1, df$q[i]),
                         seasonal=list(order=c(df$P[i], 1, df$Q[i]), period=4),
                         method="ML"), silent=TRUE)
  if (!is.null(arima.obj)) {
    df$AICc[i] <- AICc(arima.obj)
  }
}
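Once the grid has been filled in, the strongest candidates can be read off by sorting on AICc; a minimal sketch:
# INSPECTING THE TWO LOWEST-AICc CANDIDATES (FAILED FITS REMAIN NA AND SORT LAST)
head(df[order(df$AICc), ], 2)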
AICc testing is done to determine the most suitable values of p, q, P, and Q, each ranging from zero to four inclusive. Testing indicates that the two best candidate models are SARIMA\((0, 1, 3)\times(0, 1, 2)_4\) and SARIMA\((0, 1, 3)\times(1, 1, 3)_4\). Diagnostic testing can then begin on each of the two models: the first round of diagnostics is performed on the first model, referred to as fit one, and the second round on fit two. The first model is fitted in the code chunk below.
# FITTING THE FIRST MODEL
fit1.arima = arima(beerdata.d14, order = c(0, 1, 3),
seasonal = list(order = c(0, 1, 2), period = 4),
method = "ML")
Before beginning the diagnostic tests, the first step is to ensure that the roots of the estimated polynomials lie outside of the unit circle. Using uc.check() from the UnitCircle package, the MA and seasonal MA coefficients of the SARIMA model are checked against the unit circle. The output below shows that all roots lie outside of it, so the fitted model is invertible.
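The checks are invoked as in the appendix code, where ma.coef1 and sma.coef1 are the MA and seasonal MA coefficients extracted from fit1.arima:
# CHECKING THE ROOTS OF THE MA AND SEASONAL MA POLYNOMIALS
uc.check(c(1, ma.coef1), plot_output = FALSE) # MA PART
uc.check(c(1, sma.coef1), plot_output = FALSE) # SEASONAL MA PART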
## real complex outside
## 1 1.050924 0.000000 TRUE
## 2 1.178085 0.979962 TRUE
## 3 1.178085 -0.979962 TRUE
## *Results are rounded to 6 digits.
## real complex outside
## 1 1.000495 0 TRUE
## 2 1.156976 0 TRUE
## *Results are rounded to 6 digits.
The residuals are then extracted with the command res1 <- residuals(fit1.arima) and plotted in Figure 6, where they can be seen to resemble white noise. While the mean is not exactly zero, it is close enough to be consistent with a white noise process.
The next step in the diagnostic process is plotting the ACF and PACF of the residuals, shown in Figure 7 and Figure 8. Both resemble the ACF and PACF of a white noise process, further supporting the adequacy of the model.
Next, the independence of the residuals is tested using both the Box-Pierce and Ljung-Box tests, with hypotheses \(H_0\): the residuals are independent vs. \(H_a\): the residuals exhibit correlation. Both tests return p-values greater than 0.05, so the null hypothesis is not rejected. The Shapiro-Wilk test is then used to test \(H_0\): the residuals are normally distributed vs. \(H_a\): the residuals are not normally distributed. This test also fails to reject the null hypothesis, so the residuals can be treated as normally distributed. Altogether, these tests indicate that the residuals resemble white noise.
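The tests are run as in the appendix code; fitdf = 5 discounts the five estimated coefficients (three MA and two seasonal MA) from the degrees of freedom:
# TESTS FOR INDEPENDENCE AND NORMALITY OF THE RESIDUALS
Box.test(res1, lag = 12, type = "Box-Pierce", fitdf = 5)
Box.test(res1, lag = 12, type = "Ljung-Box", fitdf = 5)
shapiro.test(res1)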
Next, the histogram and normal Q-Q plot of the residuals can be examined. The histogram in Figure 9 shows, by the overlaid curve and the shape of the bars, that the residuals are approximately normally distributed. The Q-Q plot in Figure 10 shows that the residuals follow the normal line closely.
This shows that the first fit is a suitable model for the data: the data is stationary and the residuals resemble white noise. Next, the model SARIMA\((0,1,3)\times(1,1,3)_4\) will be tested. It is fitted in the step below and, per uc.check(), is stationary and invertible.
fit2.arima = arima(beerdata.d14, order = c(0, 1, 3), # MODEL 2
seasonal = list(order = c(1, 1, 3), period = 4),
method = "ML")
## real complex outside
## 1 1.056998 0.000000 TRUE
## 2 1.296814 0.991182 TRUE
## 3 1.296814 -0.991182 TRUE
## *Results are rounded to 6 digits.
## real complex outside
## 1 1.123885 0 TRUE
## *Results are rounded to 6 digits.
## real complex outside
## 1 1.004734 0 TRUE
## 2 -1.000151 0 TRUE
## 3 1.140861 0 TRUE
## *Results are rounded to 6 digits.
After extracting the residuals with the same command used for model one, plotting them shows that they also resemble a white noise process, as seen in Figure 11. Figures 12 and 13 show that their ACF and PACF likewise resemble those of white noise, so the residuals can be tested using the Box-Pierce and Ljung-Box tests.
The Box-Pierce and Ljung-Box tests fail to reject the null hypothesis that the residuals are independent, and the Shapiro-Wilk test fails to reject normality. The next step is plotting the residuals as histogram and normal Q-Q plots. These plots, seen in Figure 14 and Figure 15, continue to support the conclusion that the residuals resemble white noise. This model is therefore also suitable for forecasting.
Based on the diagnostics, either model is a reasonable choice, but the better model can be determined from the histograms presented in Figure 9 and Figure 14. While there is no clear differentiation in most respects, the histogram in Figure 9 is closer to normal than that of Figure 14, so model one is chosen as the final model.
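As a numerical cross-check, the two fits can also be compared directly, assuming the same AICc() helper used in the grid-search code is available:
# COMPARING THE TWO CANDIDATE MODELS BY AICc
AICc(fit1.arima)
AICc(fit2.arima)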
Based on the ACF and PACF of the transformed data in Section 2, both an MA(3) component and a seasonal component with Q = 2 were proposed, and in Section 3 the selected SARIMA model had q = 3 and Q = 2, making model one SARIMA\((0,1,3)\times(0,1,2)_4\). This can be represented algebraically as \((1-B)(1-B^4)X_t = (1 - 1.955B + 1.381B^2 - 0.405B^3)(1 - 1.864B^4 + 0.864B^8)Z_t\). Now that the residuals have been analyzed and a model chosen, future values can be forecast.
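The numeric coefficients in this expression come from the fitted object and can be inspected with, for example:
# INSPECTING THE ESTIMATED MA AND SEASONAL MA COEFFICIENTS
round(fit1.arima$coef, 3)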
To forecast future values, the model is first refitted to the original data set before differencing. Predictions are then obtained with the R command predict() and plotted in Figure 16 below. The 95% confidence interval, calculated as \(\text{Predicted Value} \pm 1.96\times\text{Standard Error}\), shows that the forecasts carry considerable uncertainty. Despite this, the values will most likely not stray far from the predictions or from those of the previous years.
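A minimal sketch of the forecast step, mirroring the appendix code, where fit1.orig is model one refitted to the undifferenced series:
# FORECASTING 16 QUARTERS AHEAD WITH 95% BOUNDS
pred1 <- predict(fit1.orig, n.ahead = 16)
upper <- pred1$pred + 1.96 * pred1$se # UPPER 95% BOUND
lower <- pred1$pred - 1.96 * pred1$se # LOWER 95% BOUND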
Contrary to the expectation that the downwards trend would continue, the forecasts indicate a slight rise in production over the next four years. The 95% confidence interval, however, leaves room for a continued decline, consistent with the slight downwards trend after 1992; this is what I would consider the most likely scenario. The model chosen to produce these forecasts is given below. \[
(1-B)(1-B^4)X_t = (1 - 1.955B + 1.381B^2 - 0.405B^3)(1 - 1.864B^4 + 0.864B^8)Z_t
\]
Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang.com/tsdl/
Australian Bureau of Statistics. Quarterly beer production in Australia: megalitres, March 1956 - June 1994.
David Stoffer (2021). astsa: Applied Statistical Time Series Analysis. v1.14. https://github.com/nickpoison/astsa/
Brian Ripley, Bill Venables, Douglas M. Bates, Kurt Hornik, Albrecht Gebhardt, and David Firth (2022). MASS: Support Functions and Datasets for Venables and Ripley’s MASS. v7.3-55. http://www.stats.ox.ac.uk/pub/MASS4/
Hadley Wickham and RStudio (2021). tidyverse: Easily Install and Load the ‘Tidyverse’. v1.3.1. https://tidyverse.tidyverse.org
Jonathan Berrisch (2018). UnitCircle: Check if Roots of a Polynomial Lie Outside of the Unit Circle. v0.1.3. https://github.com/BerriJ/UnitCircle
Yeuk Yu Lee and Samuel Ventura (2017). lindia: Automated Linear Regression Diagnostic. v0.9. https://github.com/yeukyul/lindia
Rob Hyndman, George Athanasopoulos, Christoph Bergmeir, G. Caceres, L. Chhay, M. O’Hara-Wild, F. Petropoulos, S. Razbash, E. Wang, F. Yasmeen, R Core Team, R. Ihaka, D. Reid, D. Shaub, Y. Tang, and Z. Zhou (2022). forecast: Forecasting Functions for Time Series and Linear Models. v8.16. https://pkg.robjhyndman.com/forecast/
Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. v2.3. https://CRAN.R-project.org/package=gridExtra
knitr::opts_chunk$set(echo = TRUE)
library(tsdl)
library(astsa)
library(MASS)
library(tidyverse)
library(UnitCircle)
library(lindia)
library(forecast)
library(gridExtra)
# SELECTING THE DATA FROM THE TSDL
beerdata <- tsdl[[99]]
beerdata.ts <- ts(beerdata, frequency = 4, start = c(1956, 1))
# CONVERTING THE TIME SERIES INTO A DATA FRAME AND PLOTTING WITH GGPLOT
beerdf <- data.frame(Y=as.matrix(beerdata.ts), date=time(beerdata.ts))
ggplot(beerdf, aes(x = date, y = Y)) + # PLOTTING THE DATA USING GGPLOT
geom_line() + # AS A LINE
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression("Beer Production in Megaliters" ~ (X[t])),
caption = "Figure 1: Quarterly Beer Production per Year") +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
# MAKING THE DATA STATIONARY
beerdata.d1 <- diff(beerdata.ts, 1) # DIFFERENCING BY 1
beerdf.d1 <- data.frame(Y=as.matrix(beerdata.d1), date=time(beerdata.d1))
ggplot(beerdf.d1, aes(x = date, y = Y)) + # PLOTTING THE DATA
geom_line() + # AS A LINE
geom_abline(aes(slope = 0, intercept = 0)) +
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla ~ X[t]),
caption = expression("Figure 2: " ~ X[t] ~ " Differenced at Lag 1")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
beerdata.d14 <- diff(beerdata.d1, 4) # DIFFERENCING BY 4
beerdf.d14 <- data.frame(Y=as.matrix(beerdata.d14), date=time(beerdata.d14))
ggplot(beerdf.d14, aes(x = date, y = Y)) + # PLOTTING THE DATA
geom_line() + # AS A LINE
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_abline(aes(slope = 0, intercept = 0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla^4 ~ nabla ~ X[t]),
caption = expression("Figure 3: " ~ X[t] ~ " Differenced at Lags 1 and 4")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
ggAcf(beerdata.d14) + # PLOTTING THE ACF USING GGPLOT
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'ACF',
caption = expression("Figure 4: Autocorrelation Function for" ~ nabla^4 ~ nabla ~ X[t])) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
ggPacf(beerdata.d14) + # PLOTTING THE PACF USING GGPLOT
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'PACF',
caption = expression("Figure 5: Partial Autocorrelation Function for" ~ nabla^4 ~ nabla ~ X[t])) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
# FITTING A MODEL
# THIS GRID SEARCH WAS USED TO SELECT A MODEL BY AICc
df <- expand.grid(p=0:4, q=0:4, P=0:4, Q=0:4)
df <- cbind(df, AICc=NA)
# AICc TESTING
for (i in 1:nrow(df)) {
  arima.obj <- NULL # RESET SO A FAILED FIT IS NOT MISTAKEN FOR A SUCCESS
  try(arima.obj <- arima(beerdata.d14, order=c(df$p[i], 1, df$q[i]),
                         seasonal=list(order=c(df$P[i], 1, df$Q[i]), period=4),
                         method="ML"), silent=TRUE)
  if (!is.null(arima.obj)) {
    df$AICc[i] <- AICc(arima.obj)
  }
}
# FITTING THE FIRST MODEL
fit1.arima = arima(beerdata.d14, order = c(0, 1, 3),
seasonal = list(order = c(0, 1, 2), period = 4),
method = "ML")
# FITTING VALUES TO THE MODEL
ma.coef1 <- fit1.arima$coef[1:3]
sma.coef1 <- fit1.arima$coef[4:5]
# CHECKING THE ROOTS
uc.check(c(1, ma.coef1), plot_output = FALSE)
uc.check(c(1, sma.coef1), plot_output = FALSE)
# DIAGNOSTIC TESTING
res1 <- residuals(fit1.arima)
# PLOTTING FITTED RESIDUALS
res1df <- data.frame(Y=as.matrix(res1), date=time(res1))
ggplot(res1df, aes(x = date, y = Y)) +
geom_line() +
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_abline(aes(slope = 0, intercept = 0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla^4 ~ nabla ~ X[t]),
caption = expression("Figure 6: Fitted Residuals of Model 1")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
res1ACF <- ggAcf(res1) + # ACF OF RESIDUALS OF MODEL 1
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'ACF',
caption = expression("Figure 7: Autocorrelation Function of the Residuals of Model 1")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
res1PACF <- ggPacf(res1) + # PACF OF RESIDUALS OF MODEL 1
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'PACF',
caption = expression("Figure 8: Partial Autocorrelation Function of the Residuals of Model 1")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(res1ACF, res1PACF)
# TEST FOR INDEPENDENCE OF RESIDUALS
Box.test(res1, lag = 12, type = c("Box-Pierce"), fitdf = 5) # BOX PIERCE
Box.test(res1, lag = 12, type = c("Ljung-Box"), fitdf = 5) # BOX LJUNG
# NORMALITY ASSUMPTION
shapiro.test(res1)
# HISTOGRAM AND QQ-PLOT
hist1 <- gghistogram(res1, add.normal = TRUE) +
ggplot2::theme_bw() +
labs(x = "Residuals",
y = "Frequency",
caption = "Figure 9: Histogram of the Residuals of Model 1") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
n = 1:length(res1)
res1.fit = lm(res1 ~ n)
qq1 <- gg_qqplot(res1.fit) +
ggplot2::theme_bw() +
labs(y = "Sample Quantile",
caption = "Figure 10: Normal-QQ Plot of the Residuals of Model 1") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(hist1, qq1, ncol = 2)
fit2.arima = arima(beerdata.d14, order = c(0, 1, 3), # MODEL 2
seasonal = list(order = c(1, 1, 3), period = 4),
method = "ML")
# FITTING VALUES TO THE MODEL
ma.coef2 <- fit2.arima$coef[1:3]
sar.coef2 <- fit2.arima$coef[4]
sma.coef2 <- fit2.arima$coef[5:7]
# CHECKING THE ROOTS
uc.check(c(1, ma.coef2), plot_output = FALSE)
uc.check(c(1, sar.coef2), plot_output = FALSE)
uc.check(c(1, sma.coef2), plot_output = FALSE)
#DIAGNOSTIC TESTING
res2 <- residuals(fit2.arima)
# PLOTTING FITTED RESIDUALS
res2df <- data.frame(Y=as.matrix(res2), date=time(res2))
ggplot(res2df, aes(x = date, y = Y)) +
geom_line() +
scale_x_continuous(breaks = scales::pretty_breaks((1994.25 - 1956)/2),
expand = c(.01,0)) +
geom_abline(aes(slope = 0, intercept = 0)) +
geom_line(aes(x = date, y = mean(Y)), col = 'red') +
labs(x = "Year",
y = expression(nabla^4 ~ nabla ~ X[t]),
caption = expression("Figure 11: Fitted Residuals of Model 2")) +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))
res2ACF <- ggAcf(res2) + # ACF OF RESIDUALS OF MODEL 2
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'ACF',
caption = expression("Figure 12: Autocorrelation Function of the Residuals of Model 2")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
res2PACF <- ggPacf(res2) + # PACF OF RESIDUALS OF MODEL 2
ggplot2::theme_bw() +
labs(x = "Lag",
y = 'PACF',
caption = expression("Figure 13: Partial Autocorrelation Function of the Residuals of Model 2")) +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(res2ACF, res2PACF) # PLOTTING AS A GRID
Box.test(res2, lag = 12, type = c("Box-Pierce"), fitdf = 6) # BOX PIERCE
Box.test(res2, lag = 12, type = c("Ljung-Box"), fitdf = 6) # LJUNG BOX
shapiro.test(res2)
# HISTOGRAM AND QQ-PLOT
hist2 <- gghistogram(res2, add.normal = TRUE) +
ggplot2::theme_bw() +
labs(x = "Residuals",
y = "Frequency",
caption = "Figure 14: Histogram of the Residuals of Model 2") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
n = 1:length(res2)
res2.fit = lm(res2 ~ n)
qq2 <- gg_qqplot(res2.fit) +
ggplot2::theme_bw() +
labs(y = "Sample Quantile",
caption = "Figure 15: Normal-QQ Plot of the Residuals of Model 2") +
theme(plot.title = element_blank(),
plot.caption = element_text(size = 10))
grid.arrange(hist2, qq2, ncol = 2)
fit1.orig <- arima(beerdata.ts, order = c(0, 0, 3),
seasonal = list(order = c(0, 1, 2), period = 4),
method = "ML") # FITTING THE MODEL TO ORIGINAL DATA
pred1 <- predict(fit1.orig, n.ahead=16) # FORECASTING n = 16
pred1 <- data.frame(pred1, "date" = tail(beerdf$date, 16) + 4)
newrow <- data.frame(pred = beerdata.ts[154], se = 0, date = 1994.25)
pred1 <- rbind(newrow, pred1)
ggplot() + # PLOTTING THE FORECASTED DATA
geom_line(beerdf, mapping = aes(x = date, y = Y)) + # ORIGINAL
scale_x_continuous(breaks = scales::pretty_breaks((1998.25 - 1956)/2),
expand = c(.01,0)) +
geom_line(pred1, mapping = aes(x = date, y = pred),
color = 'red') + # FORECAST
geom_line(pred1[2:17,], mapping = aes(x = date, y = pred + 1.96*se),
linetype = "dashed", col = "gray38") + # 95% CONFIDENCE INCREASE
geom_line(pred1[2:17,], mapping = aes(x = date, y = pred - 1.96*se),
linetype = "dashed", col = "gray38") + # 95% CONFIDENCE DECREASE
labs(x = "Year",
y = expression("Beer Production in Megaliters" ~ (X[t])),
caption = "Figure 16: Quarterly Beer Production per Year with a Four Year Forecast (16 Quarters) Using Model 1") +
ggplot2::theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
plot.caption = element_text(size = 10))