Packrat bundles with different R versions - suggestions and alternatives

Hi all,

I am looking into packrat as a way of keeping things from breaking when updating or installing packages for a particular project. A typical situation would be that a new package needs to be installed but requires some dependencies to be upgraded or installed, which then breaks existing code. This is particularly common with Bioconductor packages. It is also an issue for us because quite a few projects take years to complete, sometimes spanning several R / BioC versions.

What did I actually do? Bundle creation:

  • On my local computer (OSX, fresh install), I installed R 3.6 and several R and BioC packages
  • Started a new project with packrat::init()
  • Created a snapshot with packrat::snapshot()
  • And finally a bundle with packrat::bundle(), which I think includes all package sources as well.
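The bundle-creation steps above, as an R session sketch (the exact calls are as described in the bullets; install.packages is just the standard way to get packrat):

```r
# Sketch of the bundle-creation steps described above
install.packages("packrat")

# From inside the project directory:
packrat::init()      # initialize a private library for this project
packrat::snapshot()  # record the exact package versions in use
packrat::bundle()    # create a tarball, including package sources
```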

Project restoration:

  • On a server (R 3.5.1, CentOS Linux 7), I unbundled the bundle and got the following error:
> packrat::unbundle("test_packrat-2019-06-11.tar.gz", "./")
- Untarring 'test_packrat-2019-06-11.tar.gz' in directory '/lustre/projects/bioinfo/domingue/scripts'...
- Restoring project library...
Installing BH (1.69.0-1) ... 
        OK (built source)
Installing BiocGenerics (0.30.0) ... 
Error: Command failed (1)

Failed to run system command:

        '/sw/apps/r/3.5.1/lib64/R/bin/R' --vanilla CMD INSTALL '/tmp/RtmpvUnPCU/BiocGenerics' --library='/lustre/projects/bioinfo/domingue/scripts/test_packrat/packrat/lib/x86_64-pc-linux-gnu/3.5.1' --install-tests --no-docs --no-multiarch --no-demo 

The command failed with output:
ERROR: this R is version 3.5.1, package 'BiocGenerics' requires R >=  3.6.0
In addition: Warning message:
In restore(project = getwd(), restart = FALSE) :
  The most recent snapshot was generated using R version 3.6.0

So, as far as I can tell, I am running (somewhat unsurprisingly) into the issue that R / BioC packages will not build under a different R version.

The questions are now:

  1. How to overcome this issue short of creating a docker container for every project?
  2. Is there a better way of keeping R workflows from breaking when R or its packages are updated? I am thinking something like Python's virtualenvs (an interpreter plus packages for each project) would work, but I am not sure something like this exists in R.

Regarding 2, I am guessing the snapshots will tell me which versions the packages were at before updating, but I don't know whether it would be easy to restore them after an R update.

Cheers!

1 Like

You might be interested in the next generation of packrat, called renv.

The hope is that it will handle those cases better. If you test it in your situation, I think your experience will be valuable.

renv::migrate() was just added to help transform an existing packrat project into a renv project.
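As a sketch, the migration might look like this (renv was still in development at the time, so treat the exact API as an assumption):

```r
# Hypothetical sketch: converting a packrat project to renv
# (dev-version API; check the renv documentation for the current calls)
devtools::install_github("rstudio/renv")

# From inside the packrat project:
renv::migrate()   # convert the packrat infrastructure to renv
renv::snapshot()  # write/update renv.lock
```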

All this is still in development, but early feedback is always very useful!

1 Like

Thank you for the tip. I had heard of renv but somehow couldn't remember the name.

As for its suitability, it doesn't seem to be an improvement over packrat, at least for my use case. After installing it with devtools::install_github("rstudio/renv"), I ran renv::migrate() as suggested, copied the lock file to another computer, ran renv::restore(), and got the following error:

* Querying repositories for available source packages ... Done!
Retrieving 'https://bioconductor.org/packages/3.8/bioc/src/contrib/Archive/AnnotationDbi/AnnotationDbi_1.46.0.tar.gz' ...
curl: (22) The requested URL returned error: 404 Not Found
curl: (22) The requested URL returned error: 404 Not Found
Retrieving 'https://bioconductor.org/packages/3.8/data/annotation/src/contrib/Archive/AnnotationDbi/AnnotationDbi_1.46.0.tar.gz' ...
curl: (22) The requested URL returned error: 404 Not Found
curl: (22) The requested URL returned error: 404 Not Found
Retrieving 'https://bioconductor.org/packages/3.8/data/experiment/src/contrib/Archive/AnnotationDbi/AnnotationDbi_1.46.0.tar.gz' ...
curl: (22) The requested URL returned error: 404 Not Found
curl: (22) The requested URL returned error: 404 Not Found
Retrieving 'https://bioconductor.org/packages/3.8/workflows/src/contrib/Archive/AnnotationDbi/AnnotationDbi_1.46.0.tar.gz' ...
curl: (22) The requested URL returned error: 404 Not Found
curl: (22) The requested URL returned error: 404 Not Found
Retrieving 'https://cran.rstudio.com/src/contrib/Archive/AnnotationDbi/AnnotationDbi_1.46.0.tar.gz' ...
curl: (22) The requested URL returned error: 404 Not Found
curl: (22) The requested URL returned error: 404 Not Found
Retrieving 'https://cran.rstudio.com//src/contrib/Archive/AnnotationDbi/AnnotationDbi_1.46.0.tar.gz' ...
curl: (22) The requested URL returned error: 404 Not Found
curl: (22) The requested URL returned error: 404 Not Found
Error: failed to retrieve package 'AnnotationDbi' from CRAN
Traceback (most recent calls first):
  9: stop(sprintf(fmt, ...), call. = call.)
  8: stopf("failed to retrieve package '%s' from CRAN", record$Package)
  7: renv_retrieve_cran(record)
  6: renv_retrieve_bioconductor(record)
  5: renv_retrieve_impl(package)
  4: handler(package, renv_retrieve_impl(package))
  3: renv_retrieve(packages)
  2: renv_restore_run_actions(project, diff, current, lockfile)
  1: renv::restore()

Again the issue is mainly with Bioconductor packages. Reading the GitHub issues, it should have failed the installation gracefully and moved on to the next package, but that was not the case. I have no idea why it was looking for the package sources at those locations either.

Cheers.

There is probably still some improvement to be made in renv for it to be fully compatible with Bioconductor.

I think your feedback would be useful if you can test it further.

Time allowing, I am happy to help @cderv. Just let me know how.

Sorry for not responding to this earlier. If you can boil this down to a reproducible example, would you mind posting that as an issue on the renv package git repo? If you can include the lock file as well, I think that would be helpful. The package is definitely still early in its development, and it would be good to have these types of use cases documented!

1 Like

One last thing: I think you may get some value out of perusing the material over at environments.rstudio.com. There is a lot of thinking that can go into environment reproduction.

Unfortunately, when "downgrading" the version of R (moving from 3.6.0 to 3.5.1), there is not a whole lot you can do about ensuring a (new) package is compatible with an older version of R. In this case, we would recommend version controlling your "lockfile" so that you can go "back in time" to an older lockfile when things worked with the older version of R.
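As a sketch, going "back in time" to an older lockfile with git could look like this (the commit hash is a placeholder you would look up yourself):

```shell
# Find the commit where renv.lock last matched the older R version,
# restore that version of the file, then reinstall from it.
git log --oneline -- renv.lock        # locate the last-known-good commit
git checkout <old-sha> -- renv.lock   # bring that lockfile back (placeholder hash)
Rscript -e 'renv::restore()'          # reinstall the recorded package versions
```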

Depending on your context, you might also be interested in RStudio Package Manager, which allows "freezing" a repository at a moment in time, so that you can (1) get all of your packages from the same place and store the sources there and (2) ensure that the versions do not change. All of this without any client-side management a la packrat / renv.

https://www.rstudio.com/products/package-manager/

@adomingues You could use the conda package manager to create environments with Python, CRAN, and Bioconductor packages. Many CRAN packages are available from the conda-forge channel and likewise for Bioconductor packages from the bioconda channel.
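For example, a conda environment file pinning R and a couple of packages might look like this (the package names and version are illustrative; check that they exist on the channels before relying on them):

```yaml
# environment.yml -- illustrative only
name: proj1
channels:
  - conda-forge
  - bioconda
dependencies:
  - r-base=3.6
  - r-tidyverse                 # CRAN packages are prefixed "r-" on conda-forge
  - bioconductor-biocgenerics   # Bioconductor packages are prefixed "bioconductor-" on bioconda
```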

Thanks for the tip @jdblischak. I am now using conda to manage R versions, but I am still installing R packages from source because (and maybe @cderv can correct me) renv seems to keep better track of source packages than of binaries. Anyway, it solved a problem.

1 Like

Yes, renv keeps track of source packages.

I think that is mainly because binaries are only available for the latest version of a package. When you want to restore a project with an older version, you'll have to get the source package, as no binary will be available.
If one is available, I believe renv still installs the binary package from CRAN (TBC).

It's a tricky one to create a reprex for, @cole and @cderv. That said, I did some tests and renv is working fine for most of my test cases. For instance, I created two projects, installed an older version of a Bioconductor package, and then:

  • upgraded it in a project without affecting the other project or the system wide installation;
  • downgraded it without issues.

I still need to test how it works when I transfer the project to a different system.

So all in all it is working fine. Two things that are still a bit wonky on my workflow:

  1. It would be nice to get a warning about a discrepancy in R versions when a project is restored on a system that is using a different R version, or simply when the wrong conda env is active. For example, I have the server drive mounted on my local system, which means that sometimes I am working on the project folder locally and sometimes on the server, and the two have different R installations. If I got a warning like "this renv was created with R 3.6; the current session is 3.5 and some packages might not load", I would simply exit, activate R 3.6, and carry on. Currently I will keep working until something goes wrong.
  2. This is an issue with my workflow rather than with renv, but you may have some helpful ideas. We keep the scripts (code) in a folder separate from the data (that's not something I can change). Since renv uses the R scripts to create the lock file, but I would need it activated in the data folder, I don't know how to manage this. I could symlink the scripts into the data folder and run renv::init() afterwards. This would accomplish my objective of creating the renv project using the information contained in the project scripts while activating it inside the data folder. However, it seems a bit of a hack and, crucially, I would want to version control the lock file (only scripts is under version control). Any suggestions?

Since this is a bit long and detailed, shall I move it to github?
Thanks a lot for your help and input.

From the code, you should already get a warning if the version of R in the lockfile is not the same as the version in your session.

This should be triggered when renv loads, I think. Is that not the case?

I am not sure I understand why you would need to activate renv inside the data folder.
Why not activate it in the parent folder that contains both your scripts folder and your data folder?
Is the scripts folder shared between projects, and why do you need renv in the data folder?

From the code, you should already get a warning if the version of R in the lockfile is not the same as the version in your session. You are absolutely right, @cderv:

Failed to find installation of renv -- attempting to bootstrap...
* Downloading renv 0.5.0-18 ... Done!
* Installing renv 0.5.0-18 ... Done!
Successfully installed and loaded renv 0.5.0-18.
* Project '~/projects/bioinfo/data/test_renv' loaded. [renv 0.5.0-18]
Warning message:
Project requested R version '3.5.1' but '3.6.1' is currently being used

I somehow missed it. My bad.

I am not sure I understand why you would need to activate renv inside the data folder?

This is the simplified project structure we follow:


├── projects
│   ├── data
│   │   ├── proj1
│   │   ├── proj2
│   ├── scripts
│   │   ├── proj1
│   │   ├── proj2

scripts is where the code and docs are kept under version control, and data is where we do all the analysis. If I understood correctly, renv::init() would be called in scripts/proj1 and would therefore create the project files there (renv.lock, ...), which is fine for version control. However, the R session would be started in data/proj1, which is not a renv project.
Is this slightly clearer now? I think the difficulty is that we don't use a conventional project structure, which makes things a little more complicated.

Yes, I agree it is not conventional. You would have to tweak the standard way of using renv a bit, but I think it is possible - and maybe a wrapper could be useful to you.

Have a look at the renv documentation to understand how it works.

  • renv::init(bare = TRUE) would allow you to initialize a renv project in your data folder, wherever you want it. bare = TRUE means that package discovery won't be automatic.
  • You can then use renv::hydrate() to install the dependencies you want into this project. Look at its arguments for details. Either list them manually, or use renv::dependencies() to explicitly search for dependencies in a specific path. Here, I guess you need renv::dependencies(path = "../scripts/proj1") when you are in data/proj1. Look at the arguments to see how to change the defaults, too.
  • When you add a new library in scripts, this will be the trickiest part. Normally, you would use renv::install() to install the library and renv::snapshot() to capture the state. I think you would need to point the project argument at your data folder and snapshot using the simple type (rather than dependency discovery).
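Putting those steps together, a sketch of that workflow might look like this (paths follow the example structure above, and the argument names should be checked against the renv documentation, since the package is still in development):

```r
# Hypothetical sketch: renv project lives in data/, dependencies come from scripts/
setwd("data/proj1")
renv::init(bare = TRUE)                  # infrastructure only, no automatic discovery

# Scan the code folder for library()/require() calls, then install those packages
deps <- renv::dependencies(path = "../scripts/proj1")
renv::hydrate(packages = unique(deps$Package))

renv::snapshot()                         # capture the state in renv.lock
```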

As you can see, you definitely won't be able to use all of renv's defaults, but I think that with the various function arguments you can tweak things as needed.

I suggest you try it, and if you manage to work something out, I am really interested to know how it goes. I guess it will be interesting for others in your situation, and maybe there is some improvement to be made in renv for this particular workflow.

What do you think?

1 Like

You could also try calling renv::load("scripts/proj1") to tell renv to treat that as the project directory, even if you're currently working in a separate directory. (You may need to call renv::init(bare = TRUE) in that directory first to get the renv-related infrastructure written out.)

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.