How to ensure reproducibility of an rds file without version controlling the file and/or creating it each time?

JasonAizkalns · January 29, 2020, 7:00pm

Assume we are working on a script, data_cleaning.R, that ultimately creates a large .rds object. One idiom I have used (and this could be the wrong approach) to avoid re-creating that object each time we source the script is something like this:

data_cleaning.R

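library(readr)  # provides write_rds()

PATH_FOR_DATA <- "objects/big_df.rds"
# Stop early if the file is already on disk. Note file.exists(), not exists():
# exists() only checks for an R object with that name.
stopifnot(!file.exists(PATH_FOR_DATA))

# Do a bunch of stuff...

write_rds(df, PATH_FOR_DATA)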

This works great. But let's say we want to version control this script for collaboration. It works fine the first time -- anyone "new" will pull down the script and since they've never run it, they will create their own copy of the big_df.rds object locally.

But what happens when someone changes the data_cleaning.R script? How do you ensure collaborators always have the latest-and-greatest copy? Maybe something with a build number? Could still be problematic?
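A rough sketch of the "build number" idea, using a hash of the script in place of a hand-maintained number. It assumes base R only; the helper name `rebuild_if_stale`, the `.md5` sidecar file, and the `build` callback are all made up for illustration. The idea is to store the script's MD5 next to the .rds and rebuild whenever the two no longer match.

```r
# Hypothetical helper: rebuild the .rds only when the script's hash changes.
# `build` is a function that returns the object to save (an assumption of this sketch).
rebuild_if_stale <- function(script, rds_path, build) {
  stamp_path <- paste0(rds_path, ".md5")          # assumed sidecar file, kept out of git
  current <- unname(tools::md5sum(script))        # hash of the script as it is now
  stored <- if (file.exists(stamp_path)) readLines(stamp_path, n = 1) else ""

  if (file.exists(rds_path) && identical(current, stored)) {
    return(invisible(FALSE))                      # object is up to date, skip the work
  }

  dir.create(dirname(rds_path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(build(), rds_path)                      # recreate the object
  writeLines(current, stamp_path)                 # remember which script version built it
  invisible(TRUE)
}
```

Because the stamp lives next to the local .rds, a collaborator who pulls a changed data_cleaning.R would get a mismatch and rebuild on the next run. It could still be problematic: any edit to the script (even a comment) triggers a rebuild, and changes to the raw data or package versions go unnoticed.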

mfherman · January 29, 2020, 7:28pm

This could be a good use case for the drake package: you declare that big_df.rds depends on the code in data_cleaning.R, so if that code changes, the rds file is rebuilt the next time the plan is run.
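A minimal sketch of what that might look like, assuming drake is installed and that the cleaning logic is wrapped in a function rather than run as a top-level script. The names `clean_data`, `raw`, and `saved` are placeholders, and the tiny data frame stands in for the real input.

```r
library(drake)

# Hypothetical cleaning step; in practice this would hold the logic of data_cleaning.R.
clean_data <- function(raw) {
  raw[!is.na(raw$x), , drop = FALSE]
}

# drake does not create output directories, so make sure this one exists.
dir.create("objects", showWarnings = FALSE)

plan <- drake_plan(
  raw    = data.frame(x = c(1, NA, 3)),                     # stand-in for the real raw data
  big_df = clean_data(raw),                                 # depends on clean_data() and raw
  saved  = saveRDS(big_df, file_out("objects/big_df.rds"))  # file_out() marks the output file
)

make(plan)  # first run builds every target
make(plan)  # later runs skip targets whose code and inputs are unchanged
```

If `clean_data()` is edited, the next `make(plan)` should treat `big_df` and `saved` as outdated and rebuild them, so each collaborator's local big_df.rds follows the version-controlled code.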
