Confusing about using cross validation tsibble

Hi, I'm trying to fit my model using cross validation time series with tsibble, someone could give me some advice how to do it ?
I would like to fit the best model using the lowest rise

Here my reprex sample:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tsibble)
#> Warning: package 'tsibble' was built under R version 3.6.2
library(fable)
#> Warning: package 'fable' was built under R version 3.6.2
#> Carregando pacotes exigidos: fabletools

iniciativa <- tibble(
    data_planejada = seq(as.Date("2020-01-01"), length = 200, by = "1 day"),
    n = sample(seq(100), size = 200, replace = TRUE)
) %>%
    as_tsibble(index = data_planejada)

train <- iniciativa %>%
    filter_index("2020-01-01" ~ "2020-05-29")
test <- iniciativa %>% 
    filter_index("2020-05-30" ~ .)


var_tbl <- train %>% 
    model(
        var1 = VAR(n ~ trend() + AR(1)),
        var2  = VAR(n ~ trend() + AR(2)),
        var3 = VAR(n ~ trend() + AR(3)), 
        var4 = VAR(n ~ trend() + AR(1) + fourier(K = 1)),
        var5  = VAR(n ~ trend() + AR(2) + fourier(K = 2)),
        var6 = VAR(n ~ trend() + AR(3) + fourier(K = 3)), 
        var7 = VAR(n ~ fourier(K = 1)), 
        var8 = VAR(n ~ fourier(K = 2)),
        var9 = VAR(n ~ fourier(K = 3)), 
        var10 = VAR(n), 
        var11 = VAR(n ~ trend() + season(period = "week") + AR(3))
    )

var_fc <- var_tbl %>%  
    forecast(h = "20 weeks")




accuracy(var_fc, test, list(rmse = RMSE, mae = MAE, mape = MAPE, mase = MASE, crps = CRPS, winkler = winkler_score))
#> Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
#> 90 observations are missing between 2020-07-19 and 2020-10-16
#> # A tibble: 11 x 8
#>    .model .type  rmse   mae  mape  mase  crps winkler
#>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1 var1   Test   28.7  23.7  256.   NaN  16.6    113.
#>  2 var10  Test   28.9  23.9  242.   NaN  16.7    118.
#>  3 var11  Test   30.0  24.6  260.   NaN  17.4    113.
#>  4 var2   Test   28.3  23.4  253.   NaN  16.4    113.
#>  5 var3   Test   28.3  23.3  250.   NaN  16.4    113.
#>  6 var4   Test   29.7  24.5  255.   NaN  17.2    113.
#>  7 var5   Test   29.7  24.8  263.   NaN  17.2    113.
#>  8 var6   Test   30.0  24.6  260.   NaN  17.4    113.
#>  9 var7   Test   30.4  25.3  242.   NaN  17.7    116.
#> 10 var8   Test   30.7  25.8  255.   NaN  17.8    114.
#> 11 var9   Test   30.9  25.4  254.   NaN  17.9    121.

Created on 2020-12-22 by the reprex package (v0.3.0)

Test data is too short for the forecast horizon. Increase one or decrease the other.

Sampling defeats the purpose of time series analysis.

1 Like

Rather than use a training and test set, you are better off using time-series cross-validation as discussed here: https://otexts.com/fpp3/tscv.html

Here is an example using your data and models.

library(dplyr)
library(tsibble)
library(fable)

iniciativa <- tibble(
  data_planejada = seq(as.Date("2020-01-01"), length = 200, by = "1 day"),
  n = sample(seq(100), size = 200, replace = TRUE)
) %>%
  as_tsibble(index = data_planejada)

iniciativa_tscv <- iniciativa %>%
  stretch_tsibble(.init=20, .step=1)

var_tbl <- iniciativa_tscv %>% 
  model(
    var1 = VAR(n ~ trend() + AR(1)),
    var2  = VAR(n ~ trend() + AR(2)),
    var3 = VAR(n ~ trend() + AR(3)), 
    var4 = VAR(n ~ trend() + AR(1) + fourier(K = 1)),
    var5  = VAR(n ~ trend() + AR(2) + fourier(K = 2)),
    var6 = VAR(n ~ trend() + AR(3) + fourier(K = 3)), 
    var7 = VAR(n ~ fourier(K = 1)), 
    var8 = VAR(n ~ fourier(K = 2)),
    var9 = VAR(n ~ fourier(K = 3)), 
    var10 = VAR(n), 
    var11 = VAR(n ~ trend() + season(period = "week") + AR(3))
  )
fc_tbl <- var_tbl %>%
  forecast(h=5) %>%
  group_by(.id) %>%
  mutate(h = row_number()) %>%
  ungroup()

fc_tbl %>% accuracy(
  iniciativa, 
  by=c("h",".model"),
  list(rmse = RMSE, mae = MAE, mape = MAPE, mase = MASE, crps = CRPS, winkler = winkler_score)
)
#> Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
#> 5 observations are missing between 2020-07-19 and 2020-07-23
#> # A tibble: 55 x 9
#>        h .model .type  rmse   mae  mape  mase  crps winkler
#>    <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1     1 var1   Test   29.5  25.1  215. 0.730  17.2    116.
#>  2     2 var1   Test   29.0  24.9  214. 0.723  16.9    112.
#>  3     3 var1   Test   29.4  25.3  215. 0.735  17.2    113.
#>  4     4 var1   Test   29.6  25.5  217. 0.742  17.3    113.
#>  5     5 var1   Test   29.8  25.9  216. 0.751  17.5    114.
#>  6     6 var2   Test   30.1  25.4  220. 0.738  17.6    121.
#>  7     7 var2   Test   30.1  25.3  220. 0.735  17.5    122.
#>  8     8 var2   Test   29.9  25.5  219. 0.740  17.5    114.
#>  9     9 var2   Test   30.3  25.9  223. 0.753  17.7    113.
#> 10    10 var2   Test   31.5  26.9  226. 0.782  18.3    123.
#> # … with 45 more rows

Created on 2020-12-23 by the reprex package (v0.3.0)

Note: There is only one response variable, so these are not VAR models but univariate AR models or univariate regressions with AR errors.

1 Like

Thanks helping me with cross validation, usually in my business it is hard to get data.
There is a rule of how many observations I should have for reasonable forecasts ?

See https://otexts.com/fpp3/long-short-ts.html

1 Like

Really good, I improved a lot the accuracy of my models, sorry for my lack of knowledge, still I have some question if it is possible.
Once I check the best model with the lowest rmse, how to pick it, to print my forecast, with training /test set I knew it, with cross validation approach I have doubts.

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tsibble)
#> Warning: package 'tsibble' was built under R version 3.6.2
library(fable)
#> Warning: package 'fable' was built under R version 3.6.2
#> Carregando pacotes exigidos: fabletools

iniciativa <- tibble(
    data_planejada = seq(as.Date("2020-01-01"), length = 200, by = "1 day"),
    n = sample(seq(100), size = 200, replace = TRUE)
) %>%
    as_tsibble(index = data_planejada)

iniciativa_cv <- iniciativa %>%
    stretch_tsibble(.init=20, .step=1)


tslm_tbl <- iniciativa_cv %>% 
    model(
        tslm1 = TSLM(n ~ trend() + fourier(K = 1)), 
        tslm2 = TSLM(n ~ trend() + fourier(K = 2)), 
        tslm3 = TSLM(n ~ trend() + fourier(K = 3)),
        tslm4 = TSLM(n ~ trend() + season(period = "week")) 
    )

fc_tbl <- tslm_tbl %>%
    forecast(h=5) %>%
    group_by(.id) %>%
    mutate(h = row_number()) %>%
    ungroup()

fc_tbl %>% accuracy(
    iniciativa, 
    by=c("h",".model"),
    list(rmse = RMSE, mae = MAE, mape = MAPE, mase = MASE, crps = CRPS, winkler = winkler_score)
)
#> Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
#> 5 observations are missing between 2020-07-19 and 2020-07-23
#> # A tibble: 20 x 9
#>        h .model .type  rmse   mae  mape  mase  crps winkler
#>    <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1     1 tslm1  Test   31.6  26.5  199. 0.761  18.2    130.
#>  2     2 tslm1  Test   31.9  26.6  199. 0.763  18.4    131.
#>  3     3 tslm1  Test   31.3  26.2  194. 0.752  18.1    127.
#>  4     4 tslm1  Test   30.7  25.8  194. 0.740  17.8    127.
#>  5     5 tslm1  Test   30.7  25.8  194. 0.740  17.7    127.
#>  6     6 tslm2  Test   32.0  26.6  192. 0.764  18.4    133.
#>  7     7 tslm2  Test   32.3  26.9  194. 0.771  18.6    133.
#>  8     8 tslm2  Test   31.7  26.5  191. 0.759  18.3    128.
#>  9     9 tslm2  Test   31.1  26.0  189. 0.745  17.9    128.
#> 10    10 tslm2  Test   30.9  25.8  189. 0.739  17.8    128.
#> 11    11 tslm3  Test   32.6  27.0  193. 0.773  18.7    140.
#> 12    12 tslm3  Test   32.8  27.1  194. 0.777  18.8    138.
#> 13    13 tslm3  Test   31.9  26.7  189. 0.764  18.4    131.
#> 14    14 tslm3  Test   31.4  26.3  188. 0.753  18.1    131.
#> 15    15 tslm3  Test   31.3  26.1  188. 0.747  18.0    131.
#> 16    16 tslm4  Test   32.6  27.0  193. 0.773  18.7    140.
#> 17    17 tslm4  Test   32.8  27.1  194. 0.777  18.8    138.
#> 18    18 tslm4  Test   31.9  26.7  189. 0.764  18.4    131.
#> 19    19 tslm4  Test   31.4  26.3  188. 0.753  18.1    131.
#> 20    20 tslm4  Test   31.3  26.1  188. 0.747  18.0    131.

Created on 2020-12-24 by the reprex package (v0.3.0)

Thanks

There's no difference. Pick the best model, apply it to the whole data set, and produce the forecasts.

fc_tbl %>% accuracy(
    iniciativa, 
    by = ".model",
    list(rmse = RMSE, mae = MAE, mape = MAPE, mase = MASE, crps = CRPS, winkler = winkler_score)
  ) %>% 
  arrange(rmse)
#> # A tibble: 4 x 8
#>   .model .type  rmse   mae  mape  mase  crps winkler
#>   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#> 1 tslm1  Test   28.9  25.0  225. 0.741  16.8    117.
#> 2 tslm2  Test   29.6  25.5  234. 0.756  17.2    120.
#> 3 tslm3  Test   30.7  26.0  233. 0.771  17.8    127.
#> 4 tslm4  Test   30.7  26.0  233. 0.771  17.8    127.

# Best model (lowest rmse) is tslm1 (it might differ due to random generation of data)

# Re-fit to whole data set and produce forecasts

fc <- iniciativa %>% 
  model(tslm1 = TSLM(n ~ trend() + fourier(K = 1)))  %>%
  forecast(h=20)

fc %>% autoplot(iniciativa)

fc
#> # A fable: 20 x 4 [1D]
#> # Key:     .model [1]
#>    .model data_planejada          n .mean
#>    <chr>  <date>             <dist> <dbl>
#>  1 tslm1  2020-07-19     N(51, 822)  50.8
#>  2 tslm1  2020-07-20     N(49, 822)  48.9
#>  3 tslm1  2020-07-21     N(47, 822)  46.7
#>  4 tslm1  2020-07-22     N(46, 823)  45.8
#>  5 tslm1  2020-07-23     N(47, 823)  46.9
#>  6 tslm1  2020-07-24     N(49, 823)  49.2
#>  7 tslm1  2020-07-25     N(51, 823)  50.9
#>  8 tslm1  2020-07-26     N(51, 824)  50.7
#>  9 tslm1  2020-07-27     N(49, 824)  48.8
#> 10 tslm1  2020-07-28     N(47, 824)  46.6
#> 11 tslm1  2020-07-29     N(46, 824)  45.7
#> 12 tslm1  2020-07-30     N(47, 824)  46.8
#> 13 tslm1  2020-07-31     N(49, 825)  49.0
#> 14 tslm1  2020-08-01     N(51, 825)  50.7
#> 15 tslm1  2020-08-02     N(51, 825)  50.6
#> 16 tslm1  2020-08-03     N(49, 826)  48.7
#> 17 tslm1  2020-08-04     N(46, 826)  46.4
#> 18 tslm1  2020-08-05     N(46, 826)  45.5
#> 19 tslm1  2020-08-06     N(47, 826)  46.6
#> 20 tslm1  2020-08-07     N(49, 827)  48.9

Created on 2020-12-25 by the reprex package (v0.3.0)

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.