I essentially just want to create a tibble of the original/derived variable pairings from a prepped recipe.
For recipe steps that create one or more new features (I'm thinking specifically of step_dummy() in my case), what is the best way to identify the original variables from which the derived variables were created?
    library(tidymodels)
    library(tidyverse)

    rec <- recipe(
      Sepal.Length ~ Species + Sepal.Width,
      data = iris
    ) %>%
      step_normalize(all_numeric_predictors()) %>%
      step_dummy(all_nominal_predictors())

    prepped <- prep(rec)
The last_term_info list item of the prepped recipe seems to be pretty close. One way of doing this is to iterate up the data frame: for each row where source == "derived", continue upwards until reaching a row where source is "original".
    prepped$last_term_info
    # A tibble: 5 x 6
    # Groups:   variable
      variable           type    role   source   number skip
      <chr>              <chr>   <list> <chr>     <dbl> <lgl>
    1 Sepal.Length       numeric <chr > original      2 FALSE
    2 Sepal.Width        numeric <chr > original      2 FALSE
    3 Species            nominal <chr > original      1 FALSE
    4 Species_versicolor numeric <chr > derived       2 FALSE
    5 Species_virginica  numeric <chr > derived       2 FALSE
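For reference, the row-order idea described above can be sketched roughly as follows. This is only a minimal illustration, not a recommended solution: term_info is a hand-typed stand-in for the variable and source columns of the printout (an assumption made so the snippet runs on its own), and the logic assumes that every derived row sits below the original variable it came from.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Hand-typed stand-in for two columns of prepped$last_term_info,
# copied from the printout so the sketch is self-contained.
term_info <- tibble(
  variable = c("Sepal.Length", "Sepal.Width", "Species",
               "Species_versicolor", "Species_virginica"),
  source   = c("original", "original", "original", "derived", "derived")
)

pairings <- term_info %>%
  # No-op on this stand-in; the real last_term_info is grouped by variable,
  # and fill() would otherwise only fill within each group.
  ungroup() %>%
  # Label each "original" row with its own name; derived rows stay NA
  mutate(original_variable = if_else(source == "original",
                                     variable, NA_character_)) %>%
  # Carry the closest preceding original name down onto the derived rows
  fill(original_variable, .direction = "down") %>%
  filter(source == "derived") %>%
  select(derived_variable = variable, original_variable)

pairings
```

On this stand-in, both dummy columns are paired with Species, but only because Species happens to be the last original row before them.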
I'm wary of the idea above because relying on the row order feels kind of hacky, and it also would not work at all in the case of something like
I could also see doing string manipulation, removing the suffix after Species, but I feel like there are a lot of ways that could go wrong if there are other similarly named variables in the recipe. Can anyone think of a better way of doing this?
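To make the naming concern concrete, here is a small base R sketch of prefix matching. The column Species_group and the derived name Species_group_A are hypothetical, invented purely for illustration; they are not in the recipe above. The sketch also assumes derived names have the form original name, underscore, level.

```r
# Hypothetical names: Species_group is NOT part of the recipe in this post;
# it is added only to show how prefix matching becomes ambiguous.
originals <- c("Species", "Species_group")
derived   <- c("Species_versicolor", "Species_group_A")

# For each derived name, collect every original name that,
# followed by "_", is a prefix of it.
candidates <- lapply(derived, function(d) {
  originals[startsWith(d, paste0(originals, "_"))]
})
names(candidates) <- derived

candidates
```

Here Species_versicolor matches only Species, but Species_group_A matches both Species and Species_group, so the string-based approach cannot tell which one it came from without extra rules.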
I am imagining the output looking something like this:
    # A tibble: 2 x 2
      derived_variable   original_variable
      <chr>              <chr>
    1 Species_versicolor Species
    2 Species_virginica  Species