Advanced Data Analysis in R organised by NUGS-China

Session: Four (last session for the program)

Date: 20/03/2021

Instructor: Clement Twumasi

Website: https://twumasiclement.wixsite.com/website

YouTube channel: https://www.youtube.com/channel/UCxrpjLYi_Akmbc2QICDvKaQ

The most important info you need to know about statistics and data analysis: https://www.youtube.com/watch?v=SRGpeSYyOoA&t=3467s

Main objective:

Time series analysis in R (we shall simulate time series data, learn how to declare time series data and fit its model).

Other objectives:

Introduction to other advanced statistical tests (univariate and multivariate) & different classes of models (e.g. machine learning algorithms) in R

Objectives: Inferential tests and model fitting in R

NB: The class/type of statistical tests and models to fit depends predominantly on the type/nature of your data.

  1. Statistical tests:
    • Mean tests (parametric tests: paired t-test, independent t-test, ANOVA, Bonferroni pairwise comparison tests, repeated-measures ANOVA, MANOVA, etc.).
    • Median tests (non-parametric tests: Wilcoxon signed-rank test, Mann-Whitney test, Kruskal-Wallis test, Bonferroni-Dunn's test, Friedman test, multivariate Kruskal-Wallis test, etc.).
    • Tests of proportions for one or more groups, etc.
    • Correlation test/analysis (the informative correlation matrix plot from the PerformanceAnalytics package should be enough).
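As a quick illustration of a few of the tests listed above, here is a minimal sketch using R's built-in sleep and mtcars datasets (the datasets and the proportion counts are illustrative choices, not data from this session):

```r
# Split the built-in sleep data into its two drug groups (10 paired measurements each)
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

t.test(g1, g2, paired = TRUE)        # paired t-test (parametric mean test)
wilcox.test(g1, g2, paired = TRUE)   # Wilcoxon signed-rank test (non-parametric counterpart)

summary(aov(mpg ~ factor(cyl), data = mtcars))  # one-way ANOVA across 3 groups

prop.test(x = c(45, 60), n = c(100, 100))  # two-sample test of proportions (made-up counts)

cor.test(mtcars$mpg, mtcars$wt)      # Pearson correlation test
```

Each test returns a p-value you compare against your chosen significance level (commonly 0.05).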

Link to some of the aforementioned univariate tests in R: https://rpubs.com/sujith/IS

Multivariate analysis in R: https://little-book-of-r-for-multivariate-analysis.readthedocs.io/en/latest/src/multivariateanalysis.html

http://www.sthda.com/english/wiki/manova-test-in-r-multivariate-analysis-of-variance

PCA, SVD, Correspondence Analysis and Multidimensional Scaling:

https://web.stanford.edu/class/bios221/labs/multivariate/lab_5_multivariate.html

  1. Mixed-effect regression models (we shall fit a Generalized Linear Mixed Model based on an empirical dataset).

Note some of the classes of regression models available below. Knowing which one is appropriate for a given dataset and research problem is imperative.

  • GLMs (OLS regression, binomial & multinomial regression, Poisson regression, quasi-Poisson regression, negative binomial regression, etc.) [cross-sectional data]
  • Generalized Linear Mixed Models (for all types of dependent variables) [this class of models is for longitudinal data]
  • Generalized Additive Models (fixed- and mixed-effect types; both cross-sectional & longitudinal data) and panel regression models [for longitudinal data]
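A minimal sketch of the GLM families above, fitted to data simulated here purely for illustration (the coefficients 1, 2, 0.5, 0.3 are made up); the mixed-effect analogues would come from packages such as lme4:

```r
set.seed(42)
n <- 200
x <- rnorm(n)

# OLS / Gaussian regression (equivalent to lm())
y_gauss <- 1 + 2 * x + rnorm(n)
ols_fit <- glm(y_gauss ~ x, family = gaussian)

# Binomial (logistic) regression for a 0/1 outcome
y_bin <- rbinom(n, 1, plogis(0.5 + x))
logit_fit <- glm(y_bin ~ x, family = binomial)

# Poisson regression for count data
y_pois <- rpois(n, exp(0.3 + 0.5 * x))
pois_fit <- glm(y_pois ~ x, family = poisson)

# Quasi-Poisson relaxes the mean = variance assumption (allows overdispersion)
qpois_fit <- glm(y_pois ~ x, family = quasipoisson)

coef(ols_fit)  # the slope estimate should be near the true value 2
```

The `family` argument is what switches between these GLM classes; everything else in the call stays the same.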

Introduction to linear regression: https://www.machinelearningplus.com/machine-learning/complete-introduction-linear-regression-r/

OLS linear regression in R: http://r-statistics.co/Linear-Regression.html

Different classes of linear models in R: https://www.statmethods.net/advstats/glm.html

  1. Time series analysis (we shall simulate time series data, learn how to declare time series data and fit models to it).

Time series analysis in R:

https://ourcodingclub.github.io/tutorials/time/

https://www.analyticsvidhya.com/blog/2015/12/complete-tutorial-time-series-modeling/

Other classes of models: machine learning algorithms

  • Machine learning algorithms (classification trees, random forests, gradient boosting machines, etc.) [both cross-sectional & longitudinal data]

Variable selection using penalized regression & other methods

  • Methods of variable selection: stepwise regression, penalized regression (ridge, LASSO and elastic net), recursive feature elimination.
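A minimal sketch of AIC-based stepwise selection with base R's step(), using the built-in mtcars data as an illustrative example; the penalized methods (ridge/LASSO/elastic net) live in the glmnet package and are shown only as a comment:

```r
# Start from a model containing all candidate predictors of mpg
full_model <- lm(mpg ~ ., data = mtcars)

# Stepwise search in both directions, scored by AIC (trace = 0 silences the log)
step_model <- step(full_model, direction = "both", trace = 0)
names(coef(step_model))  # the predictors the search retained

# The LASSO analogue with glmnet would look like this (not run here):
# library(glmnet)
# X <- model.matrix(mpg ~ ., mtcars)[, -1]
# cv_fit <- cv.glmnet(X, mtcars$mpg, alpha = 1)  # alpha = 1 -> LASSO, alpha = 0 -> ridge
```

Stepwise selection is convenient but can overfit; the penalized methods trade a little bias for more stable variable selection.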
In [3]:
options(repr.plot.width=8, repr.plot.height=8,repr.plot.res = 300) #Setting plot size

#setting working directory to import and/or export data
setwd("C:/Users/user/Desktop/DataAnalysis_results_R/NUGSChina_R_class")

Time series analysis based on simulated data

We shall simulate our own time series data for this session.

Simulating a time series with 300 data points from an ARIMA(2,1,1) model

In [224]:
#Simulating  data from ARIMA(2,1,1)
#You should set a working directory when importing 
set.seed(1)
chosen_length <- 300; burnin <- 100
n <- chosen_length + burnin #n=400 (the first 100 burn-in values are discarded below)
e <- rnorm(n,0,1)
y <- e
mu <- 0.2; phi1 <- 0.3; phi2 <- 0.4; theta1 <- 0.5; sigma <- 1.2

for (i in 3:n) {
  y[i] <- mu + phi1*y[i-1] + phi2*y[i-2] + sigma*e[i] - theta1*e[i-1]
}

S_y <- cumsum(y)
data<- S_y[(burnin+1):n]

print(data)
  [1]  89.28906  89.88158  88.96761  89.77552  88.98778  91.72266  92.40444
  [8]  94.63678  95.78513  99.04909  99.08364 100.36349 102.91082 102.88998
 [15] 104.17916 104.38990 104.98120 105.26795 106.52308 106.75442 107.00739
 [22] 109.24045 109.28254 110.28021 110.76590 112.41597 112.86065 113.84571
 [29] 113.71992 114.22791 114.76433 114.59169 115.88671 114.31834 115.69290
 [36] 113.68091 114.23418 113.31192 112.93818 112.91492 110.68967 112.58187
 [43] 109.87317 110.29369 108.42902 107.89481 110.06869 109.68446 109.08648
 [50] 107.62782 108.51156 108.14585 108.21723 107.33616 105.98013 104.87438
 [57] 105.93788 104.76909 103.69318 106.03825 106.08688 106.74058 108.54564
 [64] 110.08310 110.28006 114.11108 114.13008 114.28631 115.07975 115.90152
 [71] 119.33123 119.86183 122.08839 122.84751 123.80364 124.71946 126.53919
 [78] 129.74792 131.83367 134.87868 134.74495 137.91917 138.78990 138.65014
 [85] 140.51536 140.76801 143.62677 143.13387 144.19629 143.62163 144.32473
 [92] 145.07676 144.70451 146.45597 145.16762 145.02815 146.92436 145.69784
 [99] 147.29066 146.81462 148.19075 150.43512 152.91836 153.57069 152.38286
[106] 156.62725 157.17709 159.55586 160.40268 162.42706 163.12080 164.92569
[113] 165.25402 164.83034 166.90508 168.88781 169.38216 169.17398 170.90660
[120] 171.16834 170.08241 170.93049 170.19311 170.41712 169.17197 171.83009
[127] 171.03053 170.29286 170.99113 171.32275 170.58697 167.72509 167.44809
[134] 167.42509 167.15047 167.17094 167.98930 166.73883 168.80040 168.56388
[141] 170.36900 171.90320 173.03663 173.02415 175.50869 173.46738 174.39514
[148] 174.02253 174.41034 175.88526 176.33609 177.68180 178.17868 178.80365
[155] 180.34839 182.28950 180.23290 182.48120 182.69634 183.16252 184.94229
[162] 184.92010 185.67877 187.26855 189.88380 190.96854 191.89841 191.59543
[169] 192.27380 191.59384 192.02039 192.67910 192.02791 195.90097 195.86504
[176] 198.88173 196.66031 199.43432 197.62796 200.15755 200.41173 201.01173
[183] 203.28627 202.90502 203.55435 202.88567 202.84351 204.23171 204.87915
[190] 206.81800 206.88791 208.53113 209.35688 208.62829 211.78726 211.91564
[197] 214.26924 215.99003 217.10946 217.99191 220.12974 219.62047 223.41207
[204] 223.09982 226.89957 229.10218 230.82632 233.06379 233.11161 235.12083
[211] 237.03356 238.20815 238.93111 239.25797 240.13000 242.02330 242.02476
[218] 243.07929 242.10377 243.67317 245.27611 247.72725 249.53289 248.60394
[225] 250.76084 251.54268 252.18784 253.27541 252.93766 254.71846 254.58546
[232] 253.73853 253.97486 255.72803 255.46516 257.33400 258.75904 259.66607
[239] 260.28321 260.57169 262.25744 261.43262 262.88665 262.72243 260.88166
[246] 259.90556 260.88007 260.29444 261.76814 264.03932 266.33508 268.20806
[253] 270.00558 271.07268 274.30169 275.82377 276.06559 277.34699 275.80326
[260] 276.77785 273.63957 276.16089 274.44232 274.93707 274.60988 275.72894
[267] 276.64168 277.90550 277.87872 277.22653 277.43534 277.96497 277.28072
[274] 277.81623 276.53641 277.14053 277.16862 277.37922 277.52988 280.05753
[281] 281.10942 283.58847 283.28938 285.05008 286.96228 287.79498 286.48270
[288] 288.10059 285.80267 285.93953 287.25581 287.98220 289.92881 290.82544
[295] 291.78758 291.58078 294.47725 294.72102 295.27209 297.13075

Saving and exporting data as a csv file into your working directory for use later if you wish

In [225]:
#Saving and exporting data as a csv file into your working directory
write.csv(data,"Simulated_ts_data.csv")

Declaring the series as monthly data in R

In [227]:
data_monthly<- ts(data,start=c(1997,1),end=c(2021,12),frequency=12)
data_monthly
A Time Series: 25 × 12 (monthly, Jan-Dec, 1997-2021: the same 300 values printed above, arranged as one row per year)
In [ ]:
write.csv(data_monthly,"Simulated_ts_Monthlydata.csv")
In [11]:
plot(data_monthly,type="l",lwd=3,ylab="Monthly times series",col="blue")

Suppose you wish to convert the time series from monthly to quarterly & yearly frequency in R

In [10]:
data_quartely<- aggregate(data_monthly,nfrequency=4)
data_quartely
A Time Series: 25 × 4
         Qtr1     Qtr2     Qtr3     Qtr4
1997 268.1383 270.4860 282.8263 298.4962
1998 309.9800 314.6390 320.2849 328.8032
1999 336.0425 341.7935 345.2427 343.6921
2000 340.4843 336.1865 328.5959 327.6480
2001 325.2259 323.6993 316.7924 314.5005
2002 321.3731 334.4742 343.4961 355.0946
2003 368.7395 381.0066 401.4573 415.3592
2004 424.9101 430.9518 434.1060 436.6517
2005 439.9129 445.4405 458.8719 473.3602
2006 485.9505 495.0101 505.1750 511.2489
2007 511.2060 511.4192 512.3145 509.6348
2008 502.0237 501.8991 507.7333 517.9640
2009 523.3712 524.3181 532.1966 541.4415
2010 545.4104 553.0249 562.8311 574.4624
2011 575.8880 580.6080 591.4071 597.2198
2012 604.7097 609.3450 611.9544 622.2371
2013 629.7724 642.1749 655.2311 666.1324
2014 686.8281 701.2962 714.1728 721.4113
2015 727.2078 736.6765 748.8977 757.0059
2016 762.2416 763.4414 771.5582 780.5210
2017 786.5767 783.5097 782.9427 798.5825
2018 815.3800 829.2364 826.2207 825.5403
2019 826.9805 833.0107 832.6810 831.4932
2020 832.0777 844.7554 855.3017 862.3783
2021 858.9980 868.7364 877.8456 887.1239
In [15]:
data_yearly<- aggregate(data_monthly,nfrequency=1)
data_yearly
A Time Series (yearly aggregates, 1997-2021):
1997  1119.94677174861
1998  1273.70710914111
1999  1366.77094093738
2000  1332.91458791689
2001  1280.21803484857
2002  1354.43805845798
2003  1566.56262379654
2004  1726.61967019594
2005  1817.58545037134
2006  1997.38456192752
2007  2044.57450757903
2008  2029.61999826195
2009  2121.32744723246
2010  2235.72884151672
2011  2345.1229060832
2012  2448.24619019725
2013  2593.31080515406
2014  2823.70839618042
2015  2969.78792047236
2016  3077.76215619353
2017  3151.6114870293
2018  3296.37726058855
2019  3324.16545659238
2020  3394.51313306045
2021  3492.70391182474
In [46]:
par(mfrow=c(2,2),mar=c(4,4,1,1))

#First plot (top left)
plot(data_monthly,type="l",lwd=3,ylab="Time series",xlab="Time",col="blue",las=1)
text(2000,300,"Monthly series",col="blue",font=1)

#Second plot (top right)
plot(data_quartely,type="l",lwd=3,ylab="Time series",col="red",las=1)
text(2000,890,"Quarterly series",col="red",font=1)

#Third plot (bottom left)
plot(data_yearly,type="l",lwd=3,ylab="Time series",col="green",las=1)
text(2000,3500,"Yearly series",col="green",font=1)

#Fourth plot (bottom right): all three series overlaid
plot(data_yearly,type="l",lwd=3,ylab="Time series",col="green",ylim=c(0,3600),las=1)
lines(data_quartely,lwd=3,col="red")
lines(data_monthly,lwd=3,col="blue")

legend(x=1997,y=3650,legend = c("Yearly","Quarterly","Monthly"),col=c("green","red","blue"),
      cex=1, horiz = FALSE,pt.cex = 1,fill=c("green","red","blue"),ncol=1,
      title="Time frequency",box.lwd = 2)
In [ ]:
#You can save the quarterly and yearly data into your working directory for later use if you want
write.csv(data_quartely,"Simulated_ts_Quartelydata.csv")
write.csv(data_yearly,"Simulated_ts_Yearlydata.csv")

Fitting univariate classical time series models (we shall use the monthly data)

Practice task: Fit the same class of time series models using the quarterly and yearly ts data respectively.

NB: There are lots of other classes of machine learning algorithms for time series prediction.
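For the practice task, the same pipeline can be sketched on a quarterly series. The sketch below simulates its own toy quarterly series and uses base R's arima() as a stand-in for auto.arima() so it runs without extra packages; in the exercise you would use data_quartely and the forecast-package functions from this session instead:

```r
set.seed(1)
# Toy quarterly series with an upward drift, 1997 Q1 to 2021 Q4 (100 quarters)
toy_quarterly <- ts(cumsum(rnorm(100, mean = 1)), start = c(1997, 1), frequency = 4)

train_q <- window(toy_quarterly, end = c(2015, 4))    # 1997-2015 for training
test_q  <- window(toy_quarterly, start = c(2016, 1))  # 2016-2021 for testing

fit_q <- arima(train_q, order = c(1, 1, 0))  # base-R stand-in for auto.arima()
fc_q  <- predict(fit_q, n.ahead = length(test_q))$pred

sqrt(mean((fc_q - test_q)^2))  # RMSE on the held-out test window
```

The order (1,1,0) here is an arbitrary illustrative choice; auto.arima() would select the order for you by AICc.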

In [208]:
#Don't forget to install each package with install.packages() before loading it with library()
#install.packages("tseries")  #installing package tseries

library(tseries) #for time series
library("tsoutliers")#identifying outliers in time series
library(ggplot2) #its for other types of customized plots
library(forecast) # forecasting methods
library(gridExtra)# for combining ggplots using the grid.arrange() function
library("seastests")# for testing seasonality
library("TTR")#for simple moving average model
library(zoo)

Finding mean values for each year

In [228]:
year<- as.character(1997:2021)
means_year<- aggregate(data_monthly, nfrequency = 1, FUN = mean)
data.frame(year=year,mean=means_year)
A data.frame: 25 × 2
 year     mean
 1997  93.3289
 1998 106.1423
 1999 113.8976
 2000 111.0762
 2001 106.6848
 2002 112.8698
 2003 130.5469
 2004 143.8850
 2005 151.4655
 2006 166.4487
 2007 170.3812
 2008 169.1350
 2009 176.7773
 2010 186.3107
 2011 195.4269
 2012 204.0205
 2013 216.1092
 2014 235.3090
 2015 247.4823
 2016 256.4802
 2017 262.6343
 2018 274.6981
 2019 277.0138
 2020 282.8761
 2021 291.0587

Step 1: Detecting outliers in the series using an Automatic ARIMA procedure

By default it detects "AO" additive outliers, "LS" level shifts, and "TC" temporary changes; "IO" innovative outliers and "SLS" seasonal level shifts can also be selected.

In [71]:
?tso #to see the arguments of the function "tso" if you wish
In [74]:
outlier.monthlydata<- tso(data_monthly,maxit.iloop=10,tsmethod = "auto.arima")

outlier.monthlydata
Series:  
ARIMA(2,1,1) 

Coefficients:
         ar1     ar2      ma1
      0.4292  0.4681  -0.5681
s.e.  0.0802  0.0579   0.0889

sigma^2 estimated as 1.494:  log likelihood=-483.22
AIC=974.44   AICc=974.58   BIC=989.24

No outliers were detected.

If the data had outliers, you could use tsclean() to replace them with reasonable estimates

NB: For this data, data cleaning is not necessary since there are no outliers.

In [116]:
cleaned_tseries<- tsclean(data_monthly)
cleaned_tseries
A Time Series: 25 × 12 (identical to data_monthly above, since no outliers needed replacing)
In [118]:
#Check whether the original series without outliers have the same values in the cleaned data
all(cleaned_tseries==data_monthly)
TRUE

Step 2: Identify patterns in time series and decomposition

Why do we need to identify patterns?: Because they can also suggest the class of models to use/consider.

A time series with additive trend, seasonal, and irregular components can be decomposed using the stl() function.

NB: STL decomposition stands for Seasonal-Trend decomposition using LOESS

In [75]:
decomp_ts <- stl(data_monthly, s.window = "periodic", robust = TRUE)$time.series
#print(decomp_ts)

Seasonal_component<- decomp_ts[,1]
Trend_component<- decomp_ts[,2]
Random_component  <- decomp_ts[,3]
Deseasonalized_data <- data_monthly - Seasonal_component

par(mfrow=c(4,1),mar=c(4,4,1,1))
plot(data_monthly,ylab="Observed data",col="blue")
plot(Seasonal_component,col="blue",ylab="Seasonal")
plot(Trend_component,col="blue",ylab="Trend")
plot(Random_component,col="blue",ylab="Random")

Testing for stationarity using the Augmented Dickey-Fuller unit root test (ADF test)

H0: The series is non-stationary.

H1: The series is stationary.

Conclusion: The original monthly series is non-stationary (i.e., at lag 0), since the p-value (0.4083) exceeds 0.05.

Clearly, the time series is non-stationary: its level changes over time.

In [113]:
adf.test(data_monthly)
	Augmented Dickey-Fuller Test

data:  data_monthly
Dickey-Fuller = -2.3985, Lag order = 6, p-value = 0.4083
alternative hypothesis: stationary

Notice that the series is stationary after first differencing ($\Delta Y_t$) according to the ADF test (p-value = 0.01)

where $\Delta Y_t=Y_t-Y_{t-1}$

In [119]:
adf.test(diff(data_monthly,lag=1)) # NB: diff(data_monthly,lag=1) gives first differenced series Y_t-Y_{t-1}
	Augmented Dickey-Fuller Test

data:  diff(data_monthly, lag = 1)
Dickey-Fuller = -4.7266, Lag order = 6, p-value = 0.01
alternative hypothesis: stationary
In [124]:
first_diff_series<- diff(data_monthly,lag=1)

par(mfrow=c(2,1),mar=c(4,4,1,1))

plot(data_monthly,type="l",lwd=2,col="blue",ylab="Original series")
plot(first_diff_series,type="l",lwd=2,col="blue",ylab="First differenced series")
abline(h=0,col="black",lwd=3)

Seasonality test (WO test)

By default, the WO test is used to assess the seasonality of a time series and returns a boolean. Alternatively, the QS test (test='qs'), Friedman test (test='fried'), Kruskal-Wallis test (test='kw'), F-test on seasonal dummies (test='seasdum') or the Welch test (test='welch') can be used.

Webel, K. and Ollech, D. (2019). An overall seasonality test. Deutsche Bundesbank’s Discussion Paper series.
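The idea behind test='kw' can be sketched in base R without the seastests package: group the observations by calendar month and run a Kruskal-Wallis test. A large p-value means no evidence that the monthly distributions differ, i.e. no seasonality. The white-noise series below is a made-up example, not the session data:

```r
set.seed(1)
toy_series <- ts(rnorm(120), start = c(1997, 1), frequency = 12)  # white noise: no seasonality

month <- factor(cycle(toy_series))  # calendar-month index (1..12) of each observation
kw_test <- kruskal.test(as.numeric(toy_series) ~ month)
kw_test
```

The packaged tests add refinements (e.g. testing residuals after trend removal), so this sketch is only the core idea.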

In [109]:
#The p-values indicate the p-values of the underlying test, i.e. the QS-test, the QS-R test and the KW-R-test.

#Testing for monthly seasonality
summary(wo(data_monthly,freq=12))
Test used:  WO 
 
Test statistic:  0 
P-value:  1 1 0.5132214 
 
The WO - test does not identify  seasonality
In [178]:
#Clearly there are no seasonal patterns
#But the series generally increases over time each year
seasonplot(data_monthly, col=rainbow(12), year.labels=TRUE,main="Seasonal plots across years")
In [187]:
ggseasonplot(data_monthly, col=rainbow(n=25),polar=TRUE) +
  ylab("Series") +
  ggtitle("Polar seasonal plot")

ACF and PACF plots using ggplot with forecast package

In [126]:
options(warn=-1)#avoid title warning message in ggAcf or ggPacf function

acf_plot1=ggAcf(data_monthly,main="Original series")

pacf_plot1=ggPacf(data_monthly,main="Original series")


acf_plot2=ggAcf(first_diff_series,main="First differenced series")

pacf_plot2=ggPacf(first_diff_series,main="First differenced series")

grid.arrange(acf_plot1,pacf_plot1,acf_plot2,pacf_plot2,ncol=2)

Divide the data into training and testing sets

In [189]:
train_set<- window(data_monthly, start=c(1997, 1), end=c(2015, 12)) #from 1997 to 2015 for training
test_set<- window(data_monthly, start=c(2016, 1), end=c(2021, 12)) # from 2016 to 2021 for testing

Fitting an ARIMA time series model

In [230]:
Automatic_ARIMA_model<- auto.arima(y=train_set,seasonal = FALSE)
Automatic_ARIMA_model

AutoArimaModel_forecasts=forecast(Automatic_ARIMA_model, PI=TRUE,h=length(test_set),level = c(95),interval = "c")


#accuracy measures
accuracy(AutoArimaModel_forecasts,test_set)[,c(2,3,5)]
Series: train_set 
ARIMA(2,1,1) with drift 

Coefficients:
         ar1     ar2      ma1   drift
      0.3831  0.4496  -0.6130  0.7160
s.e.  0.0964  0.0600   0.1004  0.1765

sigma^2 estimated as 1.415:  log likelihood=-359.77
AIC=729.54   AICc=729.81   BIC=746.67
A matrix: 2 × 3 of type dbl
                RMSE       MAE      MAPE
Training set 1.17651 0.9369767 0.6344676
Test set     7.75878 6.5982500 2.3671166

Fitting a (seasonal) naive time series model

In [204]:
naive_model<- snaive(y=train_set,h=length(test_set))

#accuracy measures
accuracy(naive_model,test_set)[,c(2,3,5)]
A matrix: 2 × 3 of type dbl
                 RMSE       MAE     MAPE
Training set 11.02135  9.693043 6.118867
Test set     29.20448 26.644526 9.556393

Fitting a non-seasonal Holt-Winters (HW) model

In [211]:
HW_model <- HoltWinters(train_set, beta=FALSE, gamma=FALSE)

HW_model 

HW_forecasts=forecast(HW_model , PI=TRUE,h=length(test_set),level = c(95),interval = "c")


#accuracy measures
accuracy(HW_forecasts,test_set)[,c(2,3,5)]
Holt-Winters exponential smoothing without trend and without seasonal component.

Call:
HoltWinters(x = train_set, beta = FALSE, gamma = FALSE)

Smoothing parameters:
 alpha: 0.999943
 beta : FALSE
 gamma: FALSE

Coefficients:
      [,1]
a 253.2753
A matrix: 2 × 3 of type dbl
                  RMSE       MAE      MAPE
Training set  1.515392  1.201059 0.7990761
Test set     24.052452 20.860884 7.4321282

Fitting a Simple Exponential Smoothing model

In [222]:
ses_model <- ses(train_set, h = length(test_set))
#accuracy measures
accuracy(ses_model,test_set)[,c(2,3,5)]
A matrix: 2 × 3 of type dbl
                 RMSE       MAE      MAPE
Training set  1.51207  1.195812 0.7955904
Test set     24.05249 20.860930 7.4321448

Plotting forecasts

In [223]:
p1=autoplot(AutoArimaModel_forecasts,ylab="Series",xlab="Time",main="Non-seasonal ARIMA(2,1,1) model") 

p2=autoplot(naive_model,ylab="Series",xlab="Time",main="Naive model")

p3=autoplot(HW_forecasts,ylab="Series",xlab="Time",main="Non-seasonal Holt-Winters model")

p4=autoplot(ses_model,ylab="Series",xlab="Time",main="Simple Exponential Smoothing model")


grid.arrange(p1,p2,p3,p4,ncol=2)