Is this expected rlang::hash() behavior for a recipe() object inside test_that()? Same object, different hashes, but only inside test_that().

Hi Posit Community,

I'm using [testthat::test_that()] to test a package I'm working on, and I'm hashing a [recipes::recipe()] object with the [rlang::hash()] function. Below is the reprex. I tested it on a separate computer and could still reproduce the behavior. I walk through what the code does below the reprex.

library(devtools)
library(testthat)

test_that("rlang::hash()", {
  df <- as.data.frame(mtcars)
  # Generating the recipes
  r <- recipes::recipe(mpg ~ ., data = df)
  r2 <- r |> 
    recipes::step_mutate(new = cyl + 1)
  # First hash of r2 object
  print(rlang::hash(r2)) # "f99acb3996b1ba03e3a339a19815b4cc"
  # Insert a random object
  random_object <- mtcars
  # Second hash of the r2 object. A different hash is returned
  print(rlang::hash(r2)) # "8a36f237207e6f7886dec14b8ea4c702"
  # Even though adding the random object changes nothing about r2?
  expect_identical({r2}, {random_object <- mtcars; r2})
  # However the same hash is returned when the `random_object` is removed.
  rm(random_object)
  print(rlang::hash(r2)) # "f99acb3996b1ba03e3a339a19815b4cc"
  
  # However, if I use a `step` that does not transform the data, this behavior
  # is not reproduced. Maybe this is because of some lazy evaluation with 
  # recipes?
  r3 <- r |> 
    recipes::step_mutate()
  print(rlang::hash(r3)) # "54701ee9cebf6cad0666d2c8a048344c"
  random_object_2 <- mtcars
  print(rlang::hash(r3)) # "54701ee9cebf6cad0666d2c8a048344c"
})
  • First, every time the test_that() code block is run, different hashes are generated, so you probably won't get the same hashes as I did here. Is this expected behavior? I would expect that for this reprex, the hash wouldn't change across different test_that() runs (given equal R and package versions).
  • If I change the state of the environment by assigning a random_object inside the test_that() environment, the second rlang::hash() call on the r2 object returns a different hash than the first call.
  • If I remove the random_object, the same hash as the first call is returned.
  • This may also be related to {recipes}, since if I use a step that does not transform the data, the same hashes are returned even with the modification.
  • If the code is run in the global environment (outside of test_that()), this doesn't happen. So rlang::hash() does somehow take changes in the environment into account, but not in the global environment?

Thank you for following along so far. I look forward to any thoughts. Best,
EDIT: Added library calls to reprex

Here is some session info:

 version  R version 4.2.1 (2022-06-23)
 os       Ubuntu 22.04.1 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 rstudio  2022.07.2+576 Spotted Wakerobin (desktop)
 testthat 3.1.5
 recipes  1.0.2
 rlang    1.0.6

Seems like

r2$steps[[1]]$id

is a random id, so yeah, every time this runs, it'll be different.

As for the random objects, the recipe probably has functions, which have environments, and the hash is calculated based on these as well, but the global environment is omitted. (This is just guessing, I don't actually know how the hash is calculated.)
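The guess above can be checked without recipes at all. The following is a minimal sketch, assuming only that rlang is installed; the names `env`, `f`, `g`, `random_object` and `another_object` are made up for illustration. It hashes a plain function whose enclosing environment is a fresh non-global environment, then repeats the experiment with the global environment as the enclosure. R's serialization, which as far as I understand is what rlang::hash() works from, stores the global environment as a reference rather than by content, which would explain the difference.

```r
library(rlang)

# A closure whose enclosing environment is a fresh, non-global environment.
env <- new.env(parent = baseenv())
f <- function(x) x + 1
environment(f) <- env

h_before <- hash(f)

# Add an unrelated object to the enclosing environment and hash again.
assign("random_object", 1:3, envir = env)
h_with <- hash(f)

# Remove the object and hash once more.
rm("random_object", envir = env)
h_after <- hash(f)

# The hash changes while the extra object is present.
print(h_before == h_with)
# In the reprex above, the original hash came back after rm().
print(h_before == h_after)

# Same experiment with the global environment as the enclosure.
g <- f
environment(g) <- globalenv()
hg_before <- hash(g)
assign("another_object", 1:3, envir = globalenv())
hg_with <- hash(g)

# The global environment is serialized by reference, so the hash is unchanged.
print(identical(hg_before, hg_with))
```

If this matches what you see, then the quosures captured by step_mutate(new = cyl + 1) would be the likely carriers of the test_that() environment in the recipe.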

If you want to test that a recipe has a certain value, then I suggest you use snapshot testing (see the "Snapshot tests" article in the testthat documentation). With the transform argument of expect_snapshot() you'll be able to deal with the randomness.
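A rough sketch of what that could look like, assuming the random step id has the form `mutate_` followed by five alphanumeric characters (that pattern is an assumption about the default ids, and `scrub_id` is a hypothetical helper, not part of testthat or recipes). The transform function receives the snapshot lines as a character vector and returns them with the random part replaced by a fixed placeholder.

```r
library(testthat)
library(recipes)

# Hypothetical helper: replace the random suffix of a step id with a placeholder.
# Assumes ids look like "mutate_AbC12".
scrub_id <- function(lines) {
  gsub("mutate_[A-Za-z0-9]{5}", "mutate_<id>", lines)
}

test_that("step id is stable in the snapshot", {
  local_edition(3)
  r2 <- recipe(mpg ~ ., data = mtcars) |>
    step_mutate(new = cyl + 1)
  # The snapshot stores the scrubbed output, so the random id no longer matters.
  expect_snapshot(r2$steps[[1]]$id, transform = scrub_id)
})
```

The comparison against a stored snapshot only happens when the test runs from a file under tests/testthat/ (for example via devtools::test()). Run interactively, testthat just prints the snapshot and skips the comparison.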


Hi Gabor,

Thank you for your answer. I never learned so much about R until I started building my own package. You probably know this already, but I'm leaving a response here for me and others who might stumble into this issue.

I'm still not 100% sure, but I think you are correct. The r2$steps[[1]]$id is generated only once, but recipes do contain quosures, which keep track of the environment in which the object was originally created. I didn't know about snapshot tests and I can't wait to try them out.

Thank you so much!

You should set the ids for each step to make it reproducible or set the seed at prep() time.
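As a small illustration of the first suggestion, here is a sketch that passes an explicit `id` to step_mutate(). The id value "mutate_new" and the helper `make_recipe` are arbitrary names chosen for this example.

```r
library(recipes)

# Hypothetical helper that builds the same recipe with a fixed step id.
make_recipe <- function() {
  recipe(mpg ~ ., data = mtcars) |>
    step_mutate(new = cyl + 1, id = "mutate_new")
}

r_a <- make_recipe()
r_b <- make_recipe()

id_a <- r_a$steps[[1]]$id
id_b <- r_b$steps[[1]]$id

# Both recipes carry the id that was passed in, not a random one.
print(id_a)
print(identical(id_a, id_b))
```

This only removes the random id as a source of variation. The environments captured by the quosures can still affect the hash, which is what the butcher suggestion addresses.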

Also, you might think of running the butcher package on the recipe before hashing. That will get rid of the quosures (unless you need them later).


Thank you for introducing me to {butcher} Max. I never knew about the problems that {butcher} solves and now I'm so glad that the package exists.

On a more personal level, getting a message from you really made my day. I've read all your books from cover to cover. I'd like to say thank you for everything, from {caret}, APM, FES, to {tidymodels} and now TMWR. The R community is truly amazing.

Sincerely,

