ARIMA and fable - Forecasting

ksingh19 · February 2, 2021, 3:27pm

Hi,

I am trying to forecast Sales for next 5 years and I am also considering five other factors for now. The list of factors might increase later where I will try to find and use best 5 factors. But I am trying to figure out an efficient way to forecast these factors first as that is required before I can perform forecasting for Sales. As of now I was forecasting one factor at a time. But I would like to be able to forecast all the factors at once, probably using a function, or some other method that I am not aware of. Below was my trial on creating a function to do the task. But its not working the way I would like and throwing errors. Any help would be appreciated here.

Sample data --- Original data starts from 1996

#Libraries
library(tidyverse)
library(lubridate)
library(tsibble)
library(fable)

# Sample Data
df <- data.frame(
  check.names = FALSE,
                          date = c("2019-01-01","2019-02-01","2019-03-01",
                                   "2019-04-01","2019-05-01","2019-06-01","2019-07-01",
                                   "2019-08-01","2019-09-01","2019-10-01",
                                   "2019-11-01","2019-12-01","2020-01-01",
                                   "2020-02-01","2020-03-01","2020-04-01",
                                   "2020-05-01","2020-06-01","2020-07-01",
                                   "2020-08-01","2020-09-01","2020-10-01","2020-11-01",
                                   "2020-12-01"),
                         Sales = c(7183,
                                   6610,8206,8565,9670,9787,9115,11883,
                                   9466,9862,6275,9298,9010,8944,7036,2783,
                                   5497,8595,7792,8965,12561,11450,8756,
                                   9473),
                    `Factor 1` = c(19070,
                                   20301,19056,18915,19546,20399,20388,
                                   19126,18174,17104,17536,15443,15488,16651,
                                   14724,15082,14371,14416,14865,14525,
                                   13887,15968,18994,25454),
                    `Factor 2` = c(20364,
                                   20545,20109,20080,21193,22256,22755,
                                   25551,27215,28535,27251,28197,29524,29921,
                                   34792,28715,26983,27009,26988,27232,
                                   29185,30070,29144,27335),
                    `Factor 4` = c(423,
                                   548,365,452,479,486,487,570,563,642,
                                   904,730,994,1432,1846,1926,997,464,385,
                                   331,470,328,405,412),
                    `Factor 3` = c(7099,
                                   7830,6955,8396,10306,10054,9235,10597,
                                   8507,8788,6705,9208,9057,10078,5327,
                                   3185,4709,8895,8564,9496,12277,14540,
                                   11984,15992),
                    `Factor 5` = c(5726,
                                   6419,8685,8760,8639,8703,8407,9150,7849,
                                   8721,7009,10320,7527,8518,7244,8512,
                                   7181,8721,7467,9342,10824,11753,9449,
                                   11373)
                 )

I would like to forecast all 5 factors at once.

My list of factors might increase later, but I will find the best 5 factors that can have impact on Sales. So, I will still need to use 3- 5 factors before predicting Sales

Below is my trial to forecast multiple factors at once - not correct. Any help here would be highly appreciated as this will reduce repetitive forecasts on 5 factors.

factor_fc <- function(x){

  x <- x%>%
   mutate(date = yearmonth(date))%>%
   as_tsibble(index = date)%>%
    model(factor_model = ARIMA(across(`Factor 1`: `Factor 2`))) %>%
    forecast(h=60)

return(x)
}

fact_fc <- map(df, ~factor_fc(.))

Initial Working Approach -

Forecasting Individual Factors first - Repetitive and tedious if more factors are involved. Looking for a better way to handle all factors at once

#Forecasts for factor 1 and similarly for other 4 factors
factor_1 <- df%>%
  select(date, `Factor 1`)

factor_1_fc <- factor_1%>%
  model(factor_1_model = ARIMA(`Factor 1`)) %>%
  forecast(h=60)

#Forecast plot for factor_1
factor_1_fc %>%
  autoplot(factor_1, level = NULL) +
  ggtitle("Forecasts - factor_1") +
  xlab("Year") +
  guides(colour = guide_legend(title = "Forecast - Factor 1"))+
  theme_minimal()

factor_1_fc <- factor_1_fc%>%
  rename("factor_11" = .mean)%>%
  select(date, factor_11)%>%
  rename("factor_1" = factor_11)

Thanks for your help!

technocrat · February 3, 2021, 7:17am

For multiple regressive analysis, Hyndman recommends backward step selection:

Start with the model containing all potential predictors.
Remove one predictor at a time. Keep the model if it improves the measure of predictive accuracy.
Iterate until no further improvement.

In the snippet workflow below, that requires only adapting the fit_sat function.

# Libraries
suppressPackageStartupMessages({
  library(feasts)
  library(fpp3)
  library(GGally)
  library(ggplot2)
  library(patchwork)
})

# functions

fit_sat <- function(x) {
  x %>% model(tslm = TSLM(Sales ~ 
  Factor_1 + Factor_2 + Factor_3 + Factor_4 + Factor_5),
  tslmplus = TSLM(Sales ~ 
  Factor_1 + Factor_2 + Factor_3 + Factor_4 + Factor_5 +
  trend() + season()),
  arima = ARIMA(Sales ~ Factor_1 + Factor_2 + Factor_3 + Factor_4 + Factor_5))
} 

make_DF <- function(x) left_join(tsib, residuals(fit %>% select(x)), by = "Month")

plot_resids <- function(DF, Factor_1, .resid, Factor_2, Factor_3, Factor_4, Factor_5) {
  p1 <- ggplot(DF, aes(x = Factor_1, y = .resid)) +
    geom_point() + labs(y = "Residuals") + theme_minimal()
  p2 <- ggplot(DF, aes(x = Factor_2, y = .resid)) +
    geom_point() + labs(y = "Residuals") + theme_minimal()
  p3 <- ggplot(DF, aes(x = Factor_3, y = .resid)) +
    geom_point() + labs(y = "Residuals") + theme_minimal()
  p4 <- ggplot(DF, aes(x = Factor_4, y = .resid)) +
    geom_point() + labs(y = "Residuals") + theme_minimal()
  p5 <- ggplot(DF, aes(x = Factor_5, y = .resid)) +
    geom_point() + labs(y = "Residuals") + theme_minimal() 
  p6 <- ggplot(DF) + theme_minimal()
  (p1 | p2) / (p3 | p4) / (p5 | p6)
}

show_lb <- function(x) {
  fit %>% 
    select(x) %>%
    augment() %>% 
    features(.innov, ljung_box, lag = 6, dof = 5)
}

show_metrics <- function(x) {
  fit %>% 
    select(x) %>%
    glance() %>%
    t()
}

vis_corr <- function(x) x %>% ggpairs(columns = 2:6) + theme_minimal()

viz_fit <- function(x) {
  fit %>% 
    select(x) %>%
    augment() %>%
    ggplot(aes(x = Month)) +
    geom_line(aes(y = Sales, colour = "Data")) +
    geom_line(aes(y = .fitted, colour = "Fitted")) +
    labs(y = NULL, title = "Sales") +
    scale_color_manual(values=c(Data="black",Fitted="red")) +
    guides(colour = guide_legend(title = NULL)) +
    theme_minimal()
}

vis_init <- function( x) autoplot(x) + theme_minimal()

vis_resids <- function(x) {
  fit %>% 
    select(x) %>%
    gg_tsresiduals() 
}

# Sample Data:
# tips:
# don't use df, data and date because these are built-in funtion names; some operations may treat them as closures
# avoid having to tick-quote names of variables; when time comes for presentation, the column headers can easily be changed.

tsib <- tsibble(
  Month = yearmonth("2019 Jan") + 0:23,
  Sales = c(
    7183,
    6610, 8206, 8565, 9670, 9787, 9115, 11883,
    9466, 9862, 6275, 9298, 9010, 8944, 7036, 2783,
    5497, 8595, 7792, 8965, 12561, 11450, 8756,
    9473
  ),
  Factor_1 = c(
    19070, 
    20301, 19056, 18915, 19546, 20399, 20388,
    19126, 18174, 17104, 17536, 15443, 15488, 16651,
    14724, 15082, 14371, 14416, 14865, 14525,
    13887, 15968, 18994, 25454
  ),
  Factor_2 = c(
    20364,
    20545, 20109, 20080, 21193, 22256 , 22755,
    25551, 27215, 28535, 27251, 28197, 29524, 29921,
    34792, 28715, 26983, 27009, 26988, 27232,
    29185, 30070, 29144, 27335
  ),
  Factor_4 = c(
    423,  
    548, 365, 452, 479, 486, 487, 570, 563, 642,
    904, 730, 994, 1432, 1846, 1926, 997, 464, 385,
    331, 470, 328, 405, 412 
  ),
  Factor_3 = c(
    7099,
    7830, 6955, 8396, 10306, 10054, 9235, 10597,
    8507, 8788, 6705, 9208, 9057, 10078, 5327,
    3185, 4709, 8895, 8564, 9496, 12277, 14540,
    11984, 15992
  ),
  Factor_5 = c(
    5726,
    6419, 8685, 8760, 8639, 8703, 8407, 9150, 7849,
    8721, 7009, 10320, 7527, 8518, 7244, 8512,
    7181, 8721, 7467, 9342, 10824, 11753, 9449,
    11373
  )
)
#> Using `Month` as index variable.


# initial visualization

vis_init(tsib)
#> Plot variable not specified, automatically selected `.vars = Sales`

# correlations

vis_corr(tsib)

# saturated models

fit_sat(tsib) -> fit

# visualize fits

viz_fit("tslm")
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(x)` instead of `x` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.

viz_fit("tslmplus")

viz_fit("arima")

# show metrics

show_metrics("tslm")
#>               [,1]          
#> .model        "tslm"        
#> r_squared     "0.7024474"   
#> adj_r_squared "0.6197938"   
#> sigma2        "1645370"     
#> statistic     "8.498699"    
#> p_value       "0.0002827262"
#> df            "6"           
#> log_lik       "-202.3641"   
#> AIC           "350.6191"    
#> AICc          "357.6191"    
#> BIC           "358.8654"    
#> CV            "3119113"     
#> deviance      "29616668"    
#> df.residual   "18"          
#> rank          "6"

show_metrics("tslmplus")
#>               [,1]        
#> .model        "tslmplus"  
#> r_squared     "0.9721229" 
#> adj_r_squared "0.8931377" 
#> sigma2        "462454.4"  
#> statistic     "12.30767"  
#> p_value       "0.00261472"
#> df            "18"        
#> log_lik       "-173.9506" 
#> AIC           "317.7922"  
#> AICc          "507.7922"  
#> BIC           "340.1752"  
#> CV            "2599013"   
#> deviance      "2774726"   
#> df.residual   "6"         
#> rank          "18"

show_metrics("arima")
#>          [,1]     
#> .model   "arima"  
#> sigma2   924925.9 
#> log_lik  -95.29308
#> AIC      204.5862 
#> AICc     232.5862 
#> BIC      207.9805 
#> ar_roots Complex,0
#> ma_roots Complex,0

# visualize residuals

vis_resids("tslm")

vis_resids("tslmplus")

vis_resids("arima")

# portmanteau check

show_lb("tslm")
#> # A tibble: 1 x 3
#>   .model lb_stat lb_pvalue
#>   <chr>    <dbl>     <dbl>
#> 1 tslm      7.41   0.00649

show_lb("tslmplus")
#> # A tibble: 1 x 3
#>   .model   lb_stat  lb_pvalue
#>   <chr>      <dbl>      <dbl>
#> 1 tslmplus    20.1 0.00000729

show_lb("arima")
#> # A tibble: 1 x 3
#>   .model lb_stat lb_pvalue
#>   <chr>    <dbl>     <dbl>
#> 1 arima     15.8 0.0000699

# actual vs residuals plots

plot_resids(make_DF("tslm"))

plot_resids(make_DF("tslmplus"))

plot_resids(make_DF("arima"))

ksingh19 · February 3, 2021, 3:40pm

Thank you so much @technocrat. Great explanation! I will try to work on using backward step selection instead. But for that also, I guess, we still need to know the forecast values for all of these factors. I think at this time, I am struggling to get the forecasts on all these factors for next five years at once because we can only forecast Sales for next 5 years if we have all values for these factors as well.

Thanks again for your help!

technocrat · February 5, 2021, 1:18am

Here's a snippet to construct the 31 backward steps.

DescTools::CombSet(1:5, 1:5, as.list = TRUE) -> a

mk_steps <- function(x) paste0(stringr::str_replace_all(x,"(\\d)","Factor_\\1"), collapse = " + ")

rev(unlist(lapply(a,mk_steps)))
#>  [1] "Factor_1 + Factor_2 + Factor_3 + Factor_4 + Factor_5"
#>  [2] "Factor_2 + Factor_3 + Factor_4 + Factor_5"           
#>  [3] "Factor_1 + Factor_3 + Factor_4 + Factor_5"           
#>  [4] "Factor_1 + Factor_2 + Factor_4 + Factor_5"           
#>  [5] "Factor_1 + Factor_2 + Factor_3 + Factor_5"           
#>  [6] "Factor_1 + Factor_2 + Factor_3 + Factor_4"           
#>  [7] "Factor_3 + Factor_4 + Factor_5"                      
#>  [8] "Factor_2 + Factor_4 + Factor_5"                      
#>  [9] "Factor_2 + Factor_3 + Factor_5"                      
#> [10] "Factor_2 + Factor_3 + Factor_4"                      
#> [11] "Factor_1 + Factor_4 + Factor_5"                      
#> [12] "Factor_1 + Factor_3 + Factor_5"                      
#> [13] "Factor_1 + Factor_3 + Factor_4"                      
#> [14] "Factor_1 + Factor_2 + Factor_5"                      
#> [15] "Factor_1 + Factor_2 + Factor_4"                      
#> [16] "Factor_1 + Factor_2 + Factor_3"                      
#> [17] "Factor_4 + Factor_5"                                 
#> [18] "Factor_3 + Factor_5"                                 
#> [19] "Factor_3 + Factor_4"                                 
#> [20] "Factor_2 + Factor_5"                                 
#> [21] "Factor_2 + Factor_4"                                 
#> [22] "Factor_2 + Factor_3"                                 
#> [23] "Factor_1 + Factor_5"                                 
#> [24] "Factor_1 + Factor_4"                                 
#> [25] "Factor_1 + Factor_3"                                 
#> [26] "Factor_1 + Factor_2"                                 
#> [27] "Factor_5"                                            
#> [28] "Factor_4"                                            
#> [29] "Factor_3"                                            
#> [30] "Factor_2"                                            
#> [31] "Factor_1"

technocrat · February 5, 2021, 7:16am

Final thought: A forecast horizon of h = 60 is probably not feasible. The confidence bands just grow too quickly.

# Libraries
suppressPackageStartupMessages({
  library(feasts)
  library(fpp3)
  library(GGally)
  library(ggplot2)
  library(patchwork)
})

tsib <- tsibble(
  Month = yearmonth("2019 Jan") + 0:23,
  Sales = c(
    7183,
    6610, 8206, 8565, 9670, 9787, 9115, 11883,
    9466, 9862, 6275, 9298, 9010, 8944, 7036, 2783,
    5497, 8595, 7792, 8965, 12561, 11450, 8756,
    9473
  ),
  Factor_1 = c(
    19070,
    20301, 19056, 18915, 19546, 20399, 20388,
    19126, 18174, 17104, 17536, 15443, 15488, 16651,
    14724, 15082, 14371, 14416, 14865, 14525,
    13887, 15968, 18994, 25454
  ),
  Factor_2 = c(
    20364,
    20545, 20109, 20080, 21193, 22256, 22755,
    25551, 27215, 28535, 27251, 28197, 29524, 29921,
    34792, 28715, 26983, 27009, 26988, 27232,
    29185, 30070, 29144, 27335
  ),
  Factor_4 = c(
    423,
    548, 365, 452, 479, 486, 487, 570, 563, 642,
    904, 730, 994, 1432, 1846, 1926, 997, 464, 385,
    331, 470, 328, 405, 412
  ),
  Factor_3 = c(
    7099,
    7830, 6955, 8396, 10306, 10054, 9235, 10597,
    8507, 8788, 6705, 9208, 9057, 10078, 5327,
    3185, 4709, 8895, 8564, 9496, 12277, 14540,
    11984, 15992
  ),
  Factor_5 = c(
    5726,
    6419, 8685, 8760, 8639, 8703, 8407, 9150, 7849,
    8721, 7009, 10320, 7527, 8518, 7244, 8512,
    7181, 8721, 7467, 9342, 10824, 11753, 9449,
    11373
  )
)
#> Using `Month` as index variable.


tsib %>%
  model(stl_ets = my_dcmp_spec) %>%
  forecast(h = "60 months") %>%
  autoplot(tsib) +
  labs(y = "Sales") +
  theme_minimal()
#> Error in dots_list(...): object 'my_dcmp_spec' not found

tsib %>%
  model(
    snaive = SNAIVE(Sales),
    ets = TSLM(log(Sales) ~ trend() + season()),
  ) %>%
  forecast(h = "2 years 6 months") %>%
  autoplot(tsib, level = 90) +
  theme_minimal()

ksingh19 · February 5, 2021, 8:04pm

Thank you so much @ technocrat! Great details! After performing the backward steps, whats your approach on forecasting selected Factor. Would you still use TSLM or ARIMA on Factor as well or would you use some other methodology before forecasting Sales.

My original data starts from 1996. So, forecasting next 5 years shouldn't be too bad in terms of confidence bands.

Thanks again for your help!

technocrat · February 5, 2021, 10:02pm

Glad that's working for you.

I would first do benchmarks using MEAN, NAIVE, random walk with drift, and SNAIVE and pick the one with best accuracy against a 1996-1999 training set and a 2000 test set.

With that as the bar, TSLM, TSLM with trend and ARIMA should be tested and the winner checked against the benchmark. How satisfied you are with the result from an accuracy standpoint and domain considerations should guide you decision to consider other models, such as SEATS.

Come back with new questions. Probably best in a new thread. DM me if I appear to have missed seeing them.

ksingh19 · February 8, 2021, 6:53pm

Thank you @technocrat!

system · February 15, 2021, 6:53pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.