Save partial computations for an R team

recommendations

#1

Hello there,

In my company we are currently two data scientists that work with R, and we are looking at integrating R more and more into our development and production environment.
One of the problems I am facing is this. Let's say that I am exploring a hypothesis which requires long computations. What I would usually do if I were working on my own is save the partial results of these computations to a local .Rds file, so that I don't have to re-run them every time but can load them from disk.
However, this poses a problem when working in a team.

Let's say that I try out a hypothesis in an R notebook and save partial computations to disk as above. Then I commit my changes to the repository and create a pull request for my teammate to review my work. Now, he would have to run the whole time-consuming code to be able to inspect the results, because the file I saved is only available on my local machine. This often leads to us just reading each other's code without running it, because running it would take time, but this is not ideal. Have any of you encountered this problem, or do you know a solution/package/way to approach this problem of sharing partial results?

In other words: is there a good system to share (and possibly version) data and partial results in a team, and not just code?


#2

Have you considered using something like Docker at all? There's a very useful intro and roundup of resources for R users here:

You might also take a look at the drake package. I haven't used it much, but it has management for data versioning such that you don't unnecessarily re-run results:

A more extensive manual can be found here:
https://ropenscilabs.github.io/drake-manual/


#3

When I'm in your particular situation I use the following solution

  • Commit the cached .rds files to git, but save them in a directory called tmp_data/. The implication being that you should be able to delete everything in that directory and not worry.

    • I have a standardised way of naming the .rds files, e.g. notebook1-object1.rds. So this file contains object1 from notebook1.
  • Each notebook starts with a load_objects() command, something like

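    load_objects = function(use_cache = FALSE) {
      # Do nothing unless the cache is explicitly requested
      if (!use_cache) return(invisible(NULL))
      pattern = "^notebook1-(.*)\\.rds$"
      fnames = list.files(path = "tmp_data/", pattern = pattern, full.names = TRUE)
      var = purrr::map(fnames, readRDS)
      # Name each object after the part of the file name following "notebook1-"
      names(var) = stringr::str_match(basename(fnames), pattern)[, 2]
      list2env(var, envir = .GlobalEnv)
      invisible(NULL)
    }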
  • I also have a function for saving intermediate objects
    save_rds = function(obj, notebook) {
       obj_name = deparse(substitute(obj)) # Gets the object name
       fname = glue::glue("tmp_data/{notebook}-{obj_name}.rds")
       saveRDS(obj, fname)
    }

I find that this is a nice compromise between something really complicated and something efficient.
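
For anyone who prefers a single base-R entry point, the same idea can be sketched as a compute-or-load helper. The name `cached_rds()` and its arguments are hypothetical and not part of the functions above; the sketch only reuses the `tmp_data/` directory and the `notebook-object.rds` naming convention, and the usage example writes to a temporary directory so it can be tried safely.

```r
# Hypothetical helper: load <dir>/<notebook>-<name>.rds if it exists,
# otherwise evaluate `expr`, save the result there, and return it.
cached_rds = function(name, notebook, expr, dir = "tmp_data") {
  fname = file.path(dir, paste0(notebook, "-", name, ".rds"))
  if (file.exists(fname)) return(readRDS(fname))
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  value = expr  # `expr` is evaluated lazily, so slow code only runs on a cache miss
  saveRDS(value, fname)
  value
}

# Usage: the first call computes and saves, later calls read from disk.
# A temporary directory stands in for tmp_data/ here.
cache_dir = file.path(tempdir(), "tmp_data")
slow_mean = function() { Sys.sleep(1); mean(1:10) }
object1 = cached_rds("object1", "notebook1", slow_mean(), dir = cache_dir)
```

Because R evaluates function arguments lazily, the expensive expression passed as `expr` is never touched when the cached file is already there.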


#4

Another aspect to keep in mind is keeping the caches in sync. For example, both caches (yours and your colleague's) may reflect valid states (e.g. different workflows for different inputs), and you might want to be able to switch from one state to the other. One solution would be to have a common cache storage such as a shared network folder, Dropbox, or AWS S3.
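
A minimal base-R sketch of such a common cache, under stated assumptions: `R_SHARED_CACHE` is a hypothetical environment variable pointing at a folder everyone can reach (it falls back to a temporary directory here), and the helper names are made up for illustration. Each result file is keyed by a hash of its inputs, so two valid states can sit side by side instead of overwriting each other.

```r
# Assumption: R_SHARED_CACHE is a hypothetical environment variable holding the
# path of a folder all teammates can access (network share, synced folder, ...).
shared_cache_dir = function() {
  Sys.getenv("R_SHARED_CACHE", unset = file.path(tempdir(), "shared_cache"))
}

# Build a key from the inputs: deparse them to text, write the bytes to a
# temporary file, and take its MD5 checksum.
input_key = function(inputs) {
  tmp = tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(paste(deparse(inputs), collapse = "\n")), tmp)
  unname(tools::md5sum(tmp))
}

# Load the result for these inputs from the shared folder, or compute and store it
shared_cached = function(name, inputs, compute) {
  dir = shared_cache_dir()
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  fname = file.path(dir, paste0(name, "-", input_key(inputs), ".rds"))
  if (file.exists(fname)) return(readRDS(fname))
  value = compute(inputs)
  saveRDS(value, fname)
  value
}

# Two input sets give two cache files, so both states stay available
fit_a = shared_cached("fit", list(n = 100, seed = 1), function(p) p$n * 2)
fit_b = shared_cached("fit", list(n = 200, seed = 1), function(p) p$n * 2)
```

Hashing the deparsed inputs is only reasonable for small parameter lists; for large input data, a checksum of the input file itself would be a more natural key.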

Depending on your exact workflow, Drake, mentioned above, might be a good solution. I worked on something similar recently, rflow (https://github.com/numeract/rflow), with the goal of a better integration with the tidyverse workflow while being lighter than Drake. It got the job done for me (it is used in production) but there is room for much more improvement (it may be a good candidate for the tidyverse dev day).


#5

@mara
I do use Docker extensively, but I don't think it can be used to solve the problem I have (or at least I don't see how). Drake looks cool, but as far as I understand it only works locally.
@csgillespie Yes, git would be an option. The problem is that with your strategy the git history gets inflated, in particular if the files change often, since git has to keep track of the whole history of the binaries. There is a git "large file storage" option for storing binary files like these, which I'm looking into.
@MikeBadescu rflow looks like an interesting idea, but I'm not sure it would help with the caching problem, because the fundamental issue is that both my coworker and I generate partial files on our local machines which we would like to share (so having a common cache storage might indeed be a possible solution).
It seems to me it would be really cool if something like drake or rflow were augmented with tools to solve this sort of data-sharing issue.

R.


#6

Are you sure about that? There's a section on remote workers in the drake manual in the High-performance computing chapter:
https://ropenscilabs.github.io/drake-manual/hpc.html#remote-workers


#7

Drake uses storr, which can be set up (at least to a point) as a shared cache (Redis maybe?). Another solution would be to access the cache folder (.drake ?) and sync it with S3 using the aws.s3 package.
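
As a rough illustration of the S3 route, here is a sketch that pushes and pulls individual .rds objects with aws.s3 rather than syncing drake's whole cache folder. It assumes the aws.s3 package is installed and AWS credentials are already configured; the bucket name `my-team-cache` and the helper names `push_rds()` and `pull_rds()` are hypothetical.

```r
# Assumptions: aws.s3 is installed, AWS credentials are configured (for example
# through environment variables), and "my-team-cache" is a hypothetical bucket.
bucket = "my-team-cache"

# Upload one cached object to the bucket under the given key
push_rds = function(obj, key) {
  aws.s3::s3saveRDS(obj, object = key, bucket = bucket)
}

# Download the same object on another machine
pull_rds = function(key) {
  aws.s3::s3readRDS(object = key, bucket = bucket)
}

# Not run here, since it needs credentials and an existing bucket:
# push_rds(object1, "notebook1-object1.rds")
# object1 = pull_rds("notebook1-object1.rds")
```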

rflow is designed to have an extensible cache system (similar to memoise) and, eventually, it will support aws.s3 sync.