Identify original/derived variable relationship in tidymodels recipe

tl;dr
I essentially just want to create a tibble of the original/derived variable pairings from a prepped recipe.


For recipe steps that create one or more new features (I'm thinking specifically of step_dummy() in my case), what is the best way to identify the original variables from which the derived variables were created?

For example:

library(tidymodels)
library(tidyverse)

rec <-
    recipe(
        Sepal.Length ~ Species + Sepal.Width,
        data = iris
    ) %>%
    step_normalize(all_numeric_predictors()) %>%
    step_dummy(all_nominal_predictors())

prepped <- prep(rec)

The last_term_info element of the prepped recipe seems to be pretty close. One way of doing this would be to start at each row where source == "derived" and walk up the data frame until reaching an "original" row.

prepped$last_term_info

# A tibble: 5 x 6
# Groups:   variable [5]
  variable           type    role      source   number skip 
  <chr>              <chr>   <list>    <chr>     <dbl> <lgl>
1 Sepal.Length       numeric <chr [1]> original      2 FALSE
2 Sepal.Width        numeric <chr [1]> original      2 FALSE
3 Species            nominal <chr [1]> original      1 FALSE
4 Species_versicolor numeric <chr [1]> derived       2 FALSE
5 Species_virginica  numeric <chr [1]> derived       2 FALSE

I'm worried about the idea above because it relies on the row order, which feels kind of hacky, and it would not work at all for something like step_pca().

I could also see doing string manipulation, removing the suffix after Species, but I feel like there are a lot of ways that could go wrong if there are other similarly named variables in the recipe. Can anyone think of a better way of doing this?

I am imagining the output looking something like this:

# A tibble: 2 x 2
  derived_variable   original_variable
  <chr>              <chr>            
1 Species_versicolor Species          
2 Species_virginica  Species 

Thanks!

In general, you may not be able to figure this out. For example, if you create dummy variables, then PCA components, then apply other steps, you probably could not trace them, since each PCA component is a combination of every column that went into it.

For step_dummy(), though, you can do it via the tidy() method (number = 2 selects the second step in the recipe):

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
data(attrition)

rec <-
  recipe(Attrition ~ ., data = attrition) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

prepped <- prep(rec)
tidy(prepped, number = 2)
#> # A tibble: 43 × 3
#>    terms          columns              id         
#>    <chr>          <chr>                <chr>      
#>  1 BusinessTravel Travel_Frequently    dummy_4GB6U
#>  2 BusinessTravel Travel_Rarely        dummy_4GB6U
#>  3 Department     Research_Development dummy_4GB6U
#>  4 Department     Sales                dummy_4GB6U
#>  5 Education      College              dummy_4GB6U
#>  6 Education      Bachelor             dummy_4GB6U
#>  7 Education      Master               dummy_4GB6U
#>  8 Education      Doctor               dummy_4GB6U
#>  9 EducationField Life_Sciences        dummy_4GB6U
#> 10 EducationField Marketing            dummy_4GB6U
#> # … with 33 more rows

Created on 2021-11-08 by the reprex package (v2.0.0)
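
As a minimal sketch, the tidy() output above could be reshaped into the two-column tibble the question asks for. This is not an official recipes helper. It assumes step_dummy() was used with its default naming (original name, an underscore, then the level passed through make.names()) and that the factors are unordered; ordered factors are encoded differently and get different column names, so the sketch also filters against the names actually present in the processed data. The object name dummy_map is made up for illustration.

```r
library(tidymodels)

rec <-
  recipe(Sepal.Length ~ Species + Sepal.Width, data = iris) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

prepped <- prep(rec)

# step_dummy() is the second step in this recipe, hence number = 2.
# Assumption: default dummy naming, i.e. "<variable>_<make.names(level)>".
dummy_map <-
  tidy(prepped, number = 2) %>%
  transmute(
    derived_variable  = paste(terms, make.names(columns), sep = "_"),
    original_variable = terms
  )

# Guard against the naming assumption being wrong (e.g. ordered factors):
# keep only names that really exist in the processed training data.
baked_names <- names(bake(prepped, new_data = NULL))
dummy_map <- filter(dummy_map, derived_variable %in% baked_names)

dummy_map
```

For the iris recipe from the question, this should give the two Species rows. Any rows dropped by the final filter are a sign that the naming assumption does not hold for that variable.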

Thanks, Max. I did not know that recipes had tidiers; this definitely helps.

It would be useful to be able to programmatically trace derived variables back to their originals, but I now see that it could get messy very quickly.
