I want to model sequential outcomes in tidymodels, but there is surprisingly little documentation on this. By sequential outcomes, I mean problems where an outcome may depend on a previous outcome: causal graphs, time series, networks, and so on.
Let's consider the mtcars dataset. This dataset lists fuel consumption and 10 aspects of automobile design and performance for 32 automobiles from the 1973–74 model years. In particular, I'm interested in the relationship between 3 variables: miles per gallon (mpg), horsepower (hp) and engine displacement (disp).
The relationship I assume between these variables is illustrated in the graph above: disp → hp → mpg. Here, engine displacement describes the design of an engine: the total volume "swept" by the motion of the pistons. An engine's displacement determines its horsepower, the "power" of the engine. Finally, high horsepower can be costly, as it drags down the miles per gallon you can expect from the vehicle.
With that out of the way, let's see what the mtcars dataset can tell us about this relationship:
```r
library(ggplot2)
library(tidyr)

data(mtcars)

# Modelling a linear relationship between hp and disp
ggplot(mtcars, aes(disp, hp)) +
  geom_point() +
  stat_smooth(method = "lm", size = 1) +
  ggtitle("Relationship between Engine Displacement and Horsepower")
```
```r
# Modelling a quadratic relationship between mpg and hp
ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ I(x^2), size = 1) +
  ggtitle("Relationship between Horsepower and Miles Per Gallon")
```
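As a quick sanity check of what the plots suggest (this is my addition, using plain `stats::lm()` rather than parsnip), the fitted models can quantify each relationship:

```r
data(mtcars)

# disp -> hp: the linear trend from the first plot
linear_fit <- lm(hp ~ disp, data = mtcars)

# hp -> mpg: the quadratic trend from the second plot
quadratic_fit <- lm(mpg ~ I(hp^2), data = mtcars)

# R-squared gives a rough sense of how much variance each model explains
summary(linear_fit)$r.squared
summary(quadratic_fit)$r.squared
```

The disp–hp fit explains a substantial share of the variance, and the quadratic hp term carries a negative coefficient, consistent with the downward-curving mpg plot.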
Visually, there is a linearly increasing correlation between hp and disp, and an inverse quadratic correlation between mpg and hp. To represent these separate correlations, I create two parsnip models and fit each.
```r
library(tidyverse)
library(tidymodels)

# Relationship between disp and hp
lm_model_1 <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_fit_1 <- lm_model_1 %>%
  fit(hp ~ I(disp^2), data = mtcars)

# Relationship between hp and mpg
lm_model_2 <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_fit_2 <- lm_model_2 %>%
  fit(mpg ~ hp, data = mtcars)
```
lm_fit_1 covers the link between disp and hp, and lm_fit_2 covers the link between hp and mpg.
A weakness of the tidyverse, or maybe just of the way I work with it, is that the more chained relationships a problem has, the more models I have to create and fit separately. workflows can make this easier, but it is geared much more towards preprocessing.
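One way to cut down the duplication is to keep the chain as an ordered, named list of formulas and fit them in a loop. A minimal sketch with base `lm()` (the names `chain` and `fits` are mine; the same idea works with parsnip fits inside `purrr::map()`):

```r
data(mtcars)

# The chain disp -> hp -> mpg as an ordered, named list of formulas.
# Order matters: each outcome may depend on outcomes fitted before it.
chain <- list(
  hp  = hp ~ I(disp^2),
  mpg = mpg ~ hp
)

# One fitted model per link in the chain
fits <- lapply(chain, function(f) lm(f, data = mtcars))
names(fits)
```

Adding a new link to the chain then means adding one formula to the list rather than another block of model-spec boilerplate.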
Now let's say I have a new dataset, not_mtcars, that only contains information on engine displacement, and I want to predict both horsepower and miles per gallon. I see that I have two options.

To get around the dependence here, I first predict hp and add it as a column to the not_mtcars tibble. Then I predict mpg on the updated tibble:
```r
not_mtcars <- tibble(disp = seq(100, 501, 50))

hp_prediction <- predict(lm_fit_1, new_data = not_mtcars)$.pred
not_mtcars <- not_mtcars %>% add_column(hp = hp_prediction)

mpg_prediction <- predict(lm_fit_2, new_data = not_mtcars)$.pred
not_mtcars <- not_mtcars %>% add_column(mpg = mpg_prediction)

head(not_mtcars)
#> # A tibble: 6 × 3
#>    disp    hp   mpg
#>   <dbl> <dbl> <dbl>
#> 1   100  101.  23.2
#> 2   150  111.  22.5
#> 3   200  125.  21.6
#> 4   250  142.  20.4
#> 5   300  164.  18.9
#> 6   350  189.  17.2
```
Filling out the table like this, sudoku-style, works. It's fairly easy when there are only a couple of outcomes to consider, but it doesn't scale well to anything more complex (time series, for example).
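The sudoku-filling step can at least be generalised: walk the chain in dependency order and append each prediction as a new column before predicting the next outcome. A sketch with base `lm()`/`predict()` (the `models` list and loop are my own scaffolding; parsnip's `predict()` slots in the same way via `new_data =`):

```r
data(mtcars)

# One base-R model per link in the chain, mirroring lm_fit_1 / lm_fit_2
models <- list(
  hp  = lm(hp ~ I(disp^2), data = mtcars),
  mpg = lm(mpg ~ hp, data = mtcars)
)

new_data <- data.frame(disp = seq(100, 501, 50))

# Predict outcomes in dependency order; each prediction becomes a
# predictor column for the models further down the chain.
for (outcome in names(models)) {
  new_data[[outcome]] <- predict(models[[outcome]], newdata = new_data)
}

head(new_data)
```

Because the loop only relies on the list's ordering, a longer chain (or a topologically sorted causal graph) needs no extra hand-written prediction steps.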
Another way to handle this is the step_impute_* family of functions in recipes. While I think that is a step in the right direction, keeping track of the sequence of imputations and of which predictors to impute_with can become overwhelming very quickly.
```r
# The recipe is left untrained here; fit() preps it inside the workflow
# (add_recipe() expects an untrained recipe)
rec <- recipe(mpg ~ hp + disp, data = mtcars) %>%
  step_impute_linear(hp, impute_with = imp_vars(disp))

sequential_wflow <- workflow() %>%
  add_model(lm_model_2) %>%
  add_recipe(rec)

sequential_wflow_fit <- fit(sequential_wflow, data = mtcars)
print(sequential_wflow_fit)
#> ══ Workflow [trained] ══════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────
#> 1 Recipe Step
#> • step_impute_linear()
#>
#> ── Model ───────────────────────────────────────────────────────────
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept)           hp         disp
#>    30.73590     -0.02484     -0.03035
```
Notice also how the linear model generated this way will consider disp (normally one step behind in the chain) as well as hp when fitting the final mpg model.
Are these just the "correct" ways to model sequential outcomes in tidymodels, or am I missing something?