Iterating through all combinations of set of possible predictors when creating forecast models using fable

banderson · January 2, 2023, 4:34pm

I'm looking for a programmatic way, given a set of possible predictors, to fit all possible models based on different combinations of those predictors. I’ve tried 2 approaches and neither have worked.

First, I put all the possible formulas into a vector.

# Get names of predictors
    Cols <- names(alldata)
    Cols <- Cols[!Cols %in% "Depvar"]
    n <- length(Cols)
    
# Construct all possible combinations
    id <- unlist(
      lapply(1:n,
             function(i) combn(1:n,i,simplify = FALSE))
      , recursive = FALSE)
    
# Paste them into formulas
    frmlas <- sapply(id, function(i)
      paste("Depvar ~", paste(Cols[i], collapse = "+")))

Then, I’ve tried both looping through and using lapply. This approach works using the basic lm function in R but I’m unable to figure out how to get this to work with fable and the ARIMA method specifically.

# 1st attemp iterating using lapply
    fitted_models = lapply(frmlas, function(frml) model(data_trian, ARIMA(frml)))

# 2nd attempt iterating using loop
    for (i in 1:length(frmlas)) {
        fit[[i]] <- data_trian %>% 
        model(ARIMA(frmlas[i]))
      }

^{Referred here by Forecasting: Principles and Practice, by Rob J Hyndman and George Athanasopoulos}

mitchelloharawild · January 3, 2023, 3:22am

I've put together an example using your code to implement this.

Load packages

library(fable)
#> Loading required package: fabletools
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tsibble)
#> 
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, union

Create some sample data

alldata <- as_tsibble(USAccDeaths) %>% 
  rename(Depvar = value) %>% 
  mutate(!!!lapply(setNames(letters[1:4], letters[1:4]), function(cn) expr(rnorm(72))))

# Note, head() is suitable here since there is just 1 key, otherwise you need group_by() %>% slice()
data_trian <- head(alldata, -12)

Adapting your code with minimal changes

# Get names of predictors
Cols <- names(alldata)
Cols <- Cols[!(Cols %in% c("index", "Depvar"))] # Better not to consider the index as a regressor - you can use trend() if you like instead.
n <- length(Cols)

# Construct all possible combinations
id <- unlist(
  lapply(1:n,
         function(i) combn(1:n,i,simplify = FALSE))
  , recursive = FALSE)

# Paste them into formulas
frmlas <- sapply(id, function(i)
  paste("Depvar ~", paste(Cols[i], collapse = "+")))

# Convert to formulas
frmlas <- lapply(frmlas, as.formula)

# 1st attemp iterating using lapply
fitted_models = lapply(frmlas, function(frml) model(data_trian, ARIMA(!!frml)))

# 2nd attempt iterating using loop
fit <- list()
for (i in 1:length(frmlas)) {
  fit[[i]] <- data_trian %>% 
    model(ARIMA(!!frmlas[[i]]))
}

There are two missing elements in your code.

Your formulas are just character text, they need to be parsed into formulas with something like as.formula()
Your attempted use of the model formulas in lappy() is looking for a variable called frml, and in the for loop it is trying to calculate a variable frmlas[i]. Instead you want to use !! to say ‘use the value of this object’ rather than the name/expression. You can think of !! as replacing whatever is on the right with it’s value, so ARIMA(!!frml) becomes ARIMA(Depvar ~ a).

How I would do it

# Get names of predictors
Cols <- measured_vars(alldata)
Cols <- setdiff(Cols, "Depvar")

# Get all combinations of regressors, each regressor can be included or removed (represented with NULL)
comb_xreg <- expand.grid(lapply(Cols, list, NULL))

# Simplify into a list of included regressors
comb_xreg <- split(comb_xreg, seq_len(nrow(comb_xreg))) %>%
  # Simplify into vectors, removing NULL
  lapply(unlist, use.names = FALSE) %>% 
  # Convert to list of symbols/names (this is used to identify a variable `a` instead of the text "a")
  lapply(syms)

# Create formulas from regressors
rhs <- comb_xreg %>% 
  lapply(Reduce, f = function(x, y) call("+", x, y))
formulas <- lapply(rhs, function(x) {
  # If there are no regressors, just return the response variable without a formula
  if(is.null(x)) sym("Depvar") else rlang::new_formula(sym("Depvar"), x)
})

# Create model definitions
mdl_defs <- lapply(formulas, function(frml) ARIMA(!!frml))

# Estimate the models
fit <- data_trian %>% 
  model(!!!mdl_defs)

# The names from these models are the list names produced by `split()`, you can change these to be whatever you like.
fit
#> # A mable: 1 x 16
#>                                      `1`                                    `2`
#>                                  <model>                                <model>
#> 1 <LM w/ ARIMA(1,0,1)(0,1,1)[12] errors> <LM w/ ARIMA(0,1,1)(0,1,1)[12] errors>
#> # … with 14 more variables: `3` <model>, `4` <model>, `5` <model>, `6` <model>,
#> #   `7` <model>, `8` <model>, `9` <model>, `10` <model>, `11` <model>,
#> #   `12` <model>, `13` <model>, `14` <model>, `15` <model>, `16` <model>

My approach is more robust as it programmatically produces the model formula, rather than hoping that the R parser converts your string correctly in as.formula(). For example, if your data contained a variable named “a+b”, then as.formula() would incorrectly produce Depvar ~ a + b instead of Depvar ~ `a+b`. My approach should work fine for all sorts of syntactically invalid names.

It also keeps all of the models in a single mable, allowing you to immediately compare them with functions like accuracy()

accuracy(fit)
#> # A tibble: 16 × 10
#>    .model .type       ME  RMSE   MAE    MPE  MAPE  MASE RMSSE    ACF1
#>    <chr>  <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
#>  1 1      Training -19.9  275.  188. -0.254  2.22 0.391 0.457 -0.0324
#>  2 2      Training  56.3  297.  210.  0.656  2.46 0.435 0.494 -0.0225
#>  3 3      Training  49.9  279.  190.  0.576  2.22 0.394 0.464 -0.0147
#>  4 4      Training  55.9  298.  211.  0.646  2.47 0.437 0.496 -0.0269
#>  5 5      Training -20.9  274.  185. -0.263  2.18 0.385 0.455 -0.0353
#>  6 6      Training  58.1  297.  206.  0.682  2.42 0.428 0.494 -0.0271
#>  7 7      Training  50.6  278.  187.  0.588  2.19 0.389 0.462 -0.0154
#>  8 8      Training  57.8  299.  207.  0.674  2.43 0.431 0.496 -0.0317
#>  9 9      Training -19.8  275.  188. -0.254  2.22 0.391 0.457 -0.0324
#> 10 10     Training  55.8  297.  210.  0.650  2.46 0.436 0.494 -0.0216
#> 11 11     Training  49.8  279.  190.  0.575  2.22 0.394 0.464 -0.0145
#> 12 12     Training  55.5  299.  211.  0.642  2.47 0.438 0.496 -0.0263
#> 13 13     Training -20.7  274.  185. -0.261  2.18 0.385 0.455 -0.0352
#> 14 14     Training  57.6  297.  206.  0.676  2.42 0.428 0.494 -0.0268
#> 15 15     Training  50.7  278.  187.  0.588  2.19 0.389 0.462 -0.0154
#> 16 16     Training  57.4  299.  208.  0.670  2.43 0.431 0.497 -0.0316

^{Created on 2023-01-03 with reprex v2.0.2}

banderson · January 3, 2023, 5:59pm

Thank you for the clear and comprehensive explanation. You solution is a nice improvement over my initial approach and appreciate you pointing out the pitfalls. This has certainly taught me some useful functions, as well!

banderson · January 9, 2023, 3:12pm

@mitchelloharawild , I have a follow-up question about wanting to apply a log transformation (or any other transformation) to all variables in the model. If I wanted to log transform all variables in the model, what would be the best way to do that using your approach to create the formulas and model definitions? Using lapply and paste0 seems to work for the regressors:

comb_xreg_log <- lapply(comb_xreg, function(x) {
          paste0("log(",x,")")
        }) %>% 
          lapply(unlist, use.names = FALSE) %>% 
          lapply(syms)

However, paste0 does not seem to work for the depvar, as between the step of creating the formulas and the step of creating the model definitions it inserts an additional ~

 # Create formulas from regressors
rhs_log <- comb_xreg_log %>% 
    lapply(Reduce, f = function(x, y) call("+", x, y))

formulas_log <- lapply(rhs_log, function(x) {
    # If there are no regressors, just return the response variable without a formula
    if(is.null(x)) sym(paste0("log(","Depvar",")")) else rlang::new_formula(sym(paste0("log(","Depvar",")")), x)
  })
            
 # Create model definitions
 mdl_defs_log <- lapply(formulas_log, function(frml) ARIMA(!!frml))

Thank you again for your solution.

mitchelloharawild · January 11, 2023, 4:06am

If you want to use transformations then you're now creating expressions rather than symbols/names.

sym(paste0("log(","Depvar",")")) will give you a symbol for the variable name `log(Depvar)`.

Instead, you want to create an expression - there are a few ways to do this.
For your response variable, it is simple as it doesn't change: expr(log(Depvar)) will give you log(`Depvar`)

For your regressors, you can either...

Parse the string into an expression with rlang::parse_expr():

comb_xreg_log <- lapply(comb_xreg, function(x) {
          paste0("log(",x,")")
        }) %>% 
          lapply(unlist, use.names = FALSE) %>% 
          lapply(parse_exprs)

This isn't great, as it involves parsing a variable name - this won't work if the variable names aren't syntactically valid.

Construct the expression using the variable name symbols already created

# Suppose you have a variable name `price`
varname <- sym("price")

# You want to create an expression for the log of `price`
expr(log(!!varname))

# This will create an expression using expr(), that replaces `!!varname` with its value `price`.

# To apply this to all of your regressors, you could use something like:
comb_xreg_log <- lapply(comb_xreg, function(comb) {
  lapply(comb, function(regressor) expr(log(!!regressor)))
})

# The nested lapply is a bit tricky since we have lists of lists of regressors.
# `comb` is describing the list of regressors for the model, and `regressor` is for each individual regressor that we are going to `log()`.

Since we are explicitly saying what is a variable using sym(), and then log transforming it via expr() there is no ambiguity in what is a variable and what is a function. So this code will work with syntactically invalid variable names.

banderson · January 13, 2023, 7:12pm

That works nicely. Thank you for teaching me another couple tricks of the trade and issues to be aware of!

system · January 20, 2023, 7:13pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.