Hi, I'm trying to fit my models using time series cross-validation with tsibble. Could someone give me some advice on how to do it?
I would like to pick the best model, i.e. the one with the lowest RMSE.
Here is my reprex:
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tsibble)
#> Warning: package 'tsibble' was built under R version 3.6.2
library(fable)
#> Warning: package 'fable' was built under R version 3.6.2
#> Loading required package: fabletools
iniciativa <- tibble(
  data_planejada = seq(as.Date("2020-01-01"), length.out = 200, by = "1 day"),
  n = sample(seq(100), size = 200, replace = TRUE)
) %>%
  as_tsibble(index = data_planejada)
train <- iniciativa %>%
  filter_index("2020-01-01" ~ "2020-05-29")
test <- iniciativa %>%
  filter_index("2020-05-30" ~ .)
var_tbl <- train %>%
  model(
    var1 = VAR(n ~ trend() + AR(1)),
    var2 = VAR(n ~ trend() + AR(2)),
    var3 = VAR(n ~ trend() + AR(3)),
    var4 = VAR(n ~ trend() + AR(1) + fourier(K = 1)),
    var5 = VAR(n ~ trend() + AR(2) + fourier(K = 2)),
    var6 = VAR(n ~ trend() + AR(3) + fourier(K = 3)),
    var7 = VAR(n ~ fourier(K = 1)),
    var8 = VAR(n ~ fourier(K = 2)),
    var9 = VAR(n ~ fourier(K = 3)),
    var10 = VAR(n),
    var11 = VAR(n ~ trend() + season(period = "week") + AR(3))
  )
var_fc <- var_tbl %>%
  forecast(h = "20 weeks")
accuracy(
  var_fc,
  test,
  list(rmse = RMSE, mae = MAE, mape = MAPE, mase = MASE, crps = CRPS, winkler = winkler_score)
)
#> Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
#> 90 observations are missing between 2020-07-19 and 2020-10-16
#> # A tibble: 11 x 8
#>    .model .type  rmse   mae  mape  mase  crps winkler
#>    <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1 var1   Test   28.7  23.7  256.   NaN  16.6    113.
#>  2 var10  Test   28.9  23.9  242.   NaN  16.7    118.
#>  3 var11  Test   30.0  24.6  260.   NaN  17.4    113.
#>  4 var2   Test   28.3  23.4  253.   NaN  16.4    113.
#>  5 var3   Test   28.3  23.3  250.   NaN  16.4    113.
#>  6 var4   Test   29.7  24.5  255.   NaN  17.2    113.
#>  7 var5   Test   29.7  24.8  263.   NaN  17.2    113.
#>  8 var6   Test   30.0  24.6  260.   NaN  17.4    113.
#>  9 var7   Test   30.4  25.3  242.   NaN  17.7    116.
#> 10 var8   Test   30.7  25.8  255.   NaN  17.8    114.
#> 11 var9   Test   30.9  25.4  254.   NaN  17.9    121.
Thanks for helping me with cross-validation. In my business it is usually hard to get data.
Is there a rule of thumb for how many observations I need for reasonable forecasts?
This is really good; the accuracy of my models improved a lot. Sorry for my lack of knowledge, but I still have a question, if that's OK.
Once I have identified the best model, the one with the lowest RMSE, how do I select it and print its forecast? With a training/test split I knew how to do it, but with the cross-validation approach I am not sure.
library(dplyr)
library(tsibble)
library(fable)
iniciativa <- tibble(
  data_planejada = seq(as.Date("2020-01-01"), length.out = 200, by = "1 day"),
  n = sample(seq(100), size = 200, replace = TRUE)
) %>%
  as_tsibble(index = data_planejada)
iniciativa_cv <- iniciativa %>%
  stretch_tsibble(.init = 20, .step = 1)
tslm_tbl <- iniciativa_cv %>%
  model(
    tslm1 = TSLM(n ~ trend() + fourier(K = 1)),
    tslm2 = TSLM(n ~ trend() + fourier(K = 2)),
    tslm3 = TSLM(n ~ trend() + fourier(K = 3)),
    tslm4 = TSLM(n ~ trend() + season(period = "week"))
  )
fc_tbl <- tslm_tbl %>%
  forecast(h = 5) %>%
  group_by(.id) %>%
  mutate(h = row_number()) %>%
  ungroup()
fc_tbl %>%
  accuracy(
    iniciativa,
    by = c("h", ".model"),
    list(rmse = RMSE, mae = MAE, mape = MAPE, mase = MASE, crps = CRPS, winkler = winkler_score)
  )
#> Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing.
#> 5 observations are missing between 2020-07-19 and 2020-07-23
#> # A tibble: 20 x 9
#>        h .model .type  rmse   mae  mape  mase  crps winkler
#>    <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1     1 tslm1  Test   31.6  26.5  199. 0.761  18.2    130.
#>  2     2 tslm1  Test   31.9  26.6  199. 0.763  18.4    131.
#>  3     3 tslm1  Test   31.3  26.2  194. 0.752  18.1    127.
#>  4     4 tslm1  Test   30.7  25.8  194. 0.740  17.8    127.
#>  5     5 tslm1  Test   30.7  25.8  194. 0.740  17.7    127.
#>  6     6 tslm2  Test   32.0  26.6  192. 0.764  18.4    133.
#>  7     7 tslm2  Test   32.3  26.9  194. 0.771  18.6    133.
#>  8     8 tslm2  Test   31.7  26.5  191. 0.759  18.3    128.
#>  9     9 tslm2  Test   31.1  26.0  189. 0.745  17.9    128.
#> 10    10 tslm2  Test   30.9  25.8  189. 0.739  17.8    128.
#> 11    11 tslm3  Test   32.6  27.0  193. 0.773  18.7    140.
#> 12    12 tslm3  Test   32.8  27.1  194. 0.777  18.8    138.
#> 13    13 tslm3  Test   31.9  26.7  189. 0.764  18.4    131.
#> 14    14 tslm3  Test   31.4  26.3  188. 0.753  18.1    131.
#> 15    15 tslm3  Test   31.3  26.1  188. 0.747  18.0    131.
#> 16    16 tslm4  Test   32.6  27.0  193. 0.773  18.7    140.
#> 17    17 tslm4  Test   32.8  27.1  194. 0.777  18.8    138.
#> 18    18 tslm4  Test   31.9  26.7  189. 0.764  18.4    131.
#> 19    19 tslm4  Test   31.4  26.3  188. 0.753  18.1    131.
#> 20    20 tslm4  Test   31.3  26.1  188. 0.747  18.0    131.
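For the "how do I pick it" part, here is a minimal sketch of one possible workflow, not necessarily the recommended one. It uses cross-validation only to rank the candidates, then refits the candidates on the full series and keeps the forecast of the winner. The names `cv_acc`, `best_name`, `final_fc` and the `set.seed()` call are my own assumptions for illustration. Only two of the four models are kept so it runs quickly.

```r
library(dplyr)
library(tsibble)
library(fable)

set.seed(123)  # assumption: fixed seed so the sketch is repeatable
iniciativa <- tibble(
  data_planejada = seq(as.Date("2020-01-01"), length.out = 200, by = "1 day"),
  n = sample(seq(100), size = 200, replace = TRUE)
) %>%
  as_tsibble(index = data_planejada)

# Step 1: cross-validated accuracy, one row per model
# (default measures, so the column is named RMSE in upper case)
cv_acc <- iniciativa %>%
  stretch_tsibble(.init = 20, .step = 1) %>%
  model(
    tslm1 = TSLM(n ~ trend() + fourier(K = 1)),
    tslm4 = TSLM(n ~ trend() + season(period = "week"))
  ) %>%
  forecast(h = 5) %>%
  accuracy(iniciativa)

# Step 2: name of the model with the lowest cross-validated RMSE
best_name <- cv_acc %>%
  arrange(RMSE) %>%
  slice(1) %>%
  pull(.model)

# Step 3: refit the same candidates on the full data and keep
# only the forecast produced by the winning model
final_fc <- iniciativa %>%
  model(
    tslm1 = TSLM(n ~ trend() + fourier(K = 1)),
    tslm4 = TSLM(n ~ trend() + season(period = "week"))
  ) %>%
  forecast(h = 5) %>%
  filter(.model == best_name)

print(best_name)
print(final_fc)
```

A side note on the table above: grouping by `.id` alone numbers the rows across all four models within each fold, which is why `h` runs from 1 to 20. Grouping by both `.id` and `.model` before `mutate(h = row_number())` would restart `h` at 1 for each model.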