What if different packages are used in different projects? I hate downloading packages for every packrat project again and again. Is there a way to combine this “internal CRAN” with packrat and host different versions of packages from different repos (CRAN, Bioconductor, Neuroconductor, etc.) in one local repo that gets synced with packrat, for example? If not, are there other solutions for this kind of problem?
We use MRAN (daily snapshots of CRAN provided by Microsoft as part of their “checkpoint” framework) to create shared libraries of R packages all users can access. We install a new snapshot about once a month (keeping the old one) and use Rprofile.site to make one snapshot (typically the second to last) the “default” in .libPaths(). We pull the updates of Bioconductor packages and also packages from GitHub at the same time. For our internal packages we always target the snapshot marked as “testing” (typically the most recent one), so once we have a new snapshot and the “testing” one becomes the “default”, the internal packages stop changing for that snapshot. So the library looks like this:
We have a few functions - kind of a DSL - for the users to work with the snapshots (list, switch…). We like this approach much more than either packrat or checkpoint. All the users are always on the same versions of all packages (sans the projects with explicitly requested snapshots - mostly archived ones), we have fixed development targets for the internal packages, and we don’t store hundreds of copies of the same package in the users’ or projects’ local libraries.
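To make the "default is the second-to-last snapshot" idea concrete, here is a minimal sketch of what such helpers might look like. The directory layout (one folder per snapshot date under a common root), the root path, and the function names `list_snapshots` and `default_snapshot` are all assumptions for illustration, not the actual functions described above.

```r
# Assumed layout (not from the thread): <root>/<YYYY-MM-DD>/ is one snapshot library.

# List snapshot directories under `root`, oldest first.
list_snapshots <- function(root) {
  dirs <- list.dirs(root, full.names = FALSE, recursive = FALSE)
  sort(dirs[grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dirs)])
}

# Pick the second-to-last snapshot as "default"; the newest one stays "testing".
# With a single snapshot, that snapshot is returned.
default_snapshot <- function(snapshots) {
  n <- length(snapshots)
  if (n == 0) stop("no snapshots found")
  snapshots[max(1, n - 1)]
}

# In Rprofile.site one might then write (hypothetical root path):
# root <- "/opt/R/snapshots"
# .libPaths(file.path(root, default_snapshot(list_snapshots(root))))
```

Because the folder names are ISO dates, plain string sorting is enough to order the snapshots chronologically.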
@alexv, thanks for the tips! We do something similar with MRAN snapshots, except we keep package versions pegged to specific Microsoft R Server releases. The downside to this is that R package versions can get pretty old (since MRS releases are relatively infrequent). I like the idea of having more rolling snapshots (with a “testing” version as well) and may look to implement something similar to that in the future.
Do you have any issues when breaking changes happen to packages when you update the default snapshot (e.g. the dplyr count function from 0.4.3 > 0.5 broke some users’ code)? One possibility would be to require users to specify a snapshot in their script, so that if you have legacy code that doesn’t get touched in a few years it can still access the same package versions as before. Is that something you considered?
@vergilcw, every time the testing snapshot becomes the default one we send an announcement to all the users where we list all the version changes. They are encouraged (and given directions on how) to check the release notes for any package they use in their work. In addition, they can use one of the functions I mentioned to freeze the snapshot in their script to be, say, “2017-09-01”. That way their code will run exactly the same at any point in the future. This is very similar to MS’s checkpoint package but the users don’t install any packages themselves.
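For readers wondering what freezing a snapshot in a script could look like, here is a rough sketch. The name `freeze_snapshot`, its arguments, and the one-folder-per-date layout are hypothetical; the real helper functions are not shown in this thread.

```r
# Hypothetical helper: pin the current session to one snapshot library.
# Assumes each snapshot lives in <root>/<YYYY-MM-DD>/ (an assumption).
freeze_snapshot <- function(date, root) {
  lib <- file.path(root, date)
  if (!dir.exists(lib)) stop("unknown snapshot: ", date)
  # .libPaths() always keeps the base library (and any site library),
  # so base R packages remain available after the switch.
  .libPaths(lib)
  invisible(lib)
}

# At the top of an analysis script (hypothetical root path):
# freeze_snapshot("2017-09-01", root = "/opt/R/snapshots")
```

Failing loudly on an unknown date is a deliberate choice here: a script that silently falls back to the current default would defeat the purpose of pinning.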
That approach works most of the time but sometimes it becomes too restrictive. For some experimental work, when the scope of a project goes beyond just R and requires carefully chosen versions of, say, certain R packages, some system libraries, other applications and/or languages with their own packages (python?, java?), we use the Nix package manager (https://nixos.org/nix) to create reproducible “sandbox” environments consisting of all the required versions of all the tools. Unfortunately such sandbox environments are not easy to access and manage from RStudio, so the users use other means (emacs/ess, etc.).
@sellorm wish I’d seen your solution before I persuaded our devs to get to grips with R and miniCRAN (well, maybe not, as this way I get more people at the company invested in R!).
The approach we’re currently leaning to is essentially the one that @scw describes.
We’re also intending to adopt a snapshot approach for our internal miniCRAN, simply because it makes it easy for us to recreate historic environments by creating a new R setup that points to the specific snapshot URLs (for MRAN and our miniCRAN). Having the ability to re-run analyses is important to us from an internal risk audit perspective.
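As an illustration of pointing an R setup at dated snapshot URLs, something along these lines could work. The MRAN URL follows the `https://mran.microsoft.com/snapshot/<date>` pattern that checkpoint used; the internal base URL and the function name `snapshot_repos` are made up for the example.

```r
# Build a `repos` vector for one snapshot date.
# `internal_base` is a hypothetical URL for a dated internal miniCRAN.
snapshot_repos <- function(date,
                           internal_base = "https://rpkgs.example.com/snapshots") {
  # Reject anything that does not parse as YYYY-MM-DD.
  stopifnot(!is.na(as.Date(date, format = "%Y-%m-%d")))
  c(
    CRAN     = paste0("https://mran.microsoft.com/snapshot/", date),
    internal = paste0(internal_base, "/", date)
  )
}

# Point this session at the 2017-09-01 snapshots.
options(repos = snapshot_repos("2017-09-01"))
```

Any later `install.packages()` call in that session would then resolve against the two dated repositories.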
@sellorm There’s also GRANBase which doubles as a build and test system for R packages. It also uses the ‘covr’ package to create test coverage reports. If used with Apache HTTP server, it behaves exactly like a CRAN repository.
We have an internal CRAN-alike that hosts packages we’ve developed for projects. Until recently, we did this using custom scripts to publish packages to a file server that was also hosted over HTTP - now we’ve started using the Sonatype Nexus artifact repository and its R repository plugin to host R packages. So far, so good.
We also use Atlassian Bamboo, which builds, tests, and publishes our packages. We essentially use Git Flow branching (google that if you don’t know what it is - apparently I can only put 2 links in a post on this forum), so we have commits to the develop branch auto-publish to a ‘dev’ repository (after appending the Bamboo build number to the package version), and commits to master auto-publish to our main repository.
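Appending a build number to the package version can be done with base R alone. The following is only a sketch of one way to do it; the function name `append_build_number` is hypothetical, and how the build number reaches the script (for instance, via an environment variable set by the CI server) is left as an assumption.

```r
# Append a CI build number as an extra version component in DESCRIPTION,
# e.g. 1.2.3 becomes 1.2.3.45 for build 45.
append_build_number <- function(desc_path, build) {
  desc <- read.dcf(desc_path)
  new_version <- paste(desc[1, "Version"], build, sep = ".")
  # Fail early if the result is not a valid R package version.
  package_version(new_version)
  desc[1, "Version"] <- new_version
  write.dcf(desc, desc_path)
  invisible(new_version)
}
```

Note that `write.dcf()` may re-wrap long fields such as Description, so a real pipeline might prefer to do this on a build copy of the package rather than in the repository checkout.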
We haven’t found it beneficial to proxy or mirror CRAN itself - we just hit existing public CRAN mirrors.
Oh! I also tried to use this at work and it seems promising. However, there are still some open issues and not everything works as intended. It does not seem ready to handle all the specifics of a CRAN-like repository. I am very interested in sharing experiences on that (proxying an R repo, an internally hosted repo, and a grouped repo).
Currently we are using a file server hosted over HTTP. So I am in kind of the same situation as you!
I maintain our internal miniCRAN repository. We have a central Sharepoint list (I KNOW) of packages that users request and add to in the run-up to the next (internal) R release. I save this as an Excel sheet which forms the “lookup” for miniCRAN. We use this to download packages into our miniCRAN, and as @scw does, we set up users to point to this location by default. We actually take things to the next level by installing ALL packages from this miniCRAN onto users’ laptops. It’s overkill, but it means that users have most everything they need just a library() call away.
We also developed some helper functions to check, install, update and addNew packages from our miniCRAN. These are wrappers around the default R functions, but help us check what’s on desktops, remediate, roll out new packages etc.
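A rough idea of what such check-and-remediate wrappers might look like, using only base R. The function names and the repository URL are hypothetical; the actual helpers are not shown in this thread.

```r
# Packages offered by the repository that are not installed locally.
missing_packages <- function(available, installed) {
  setdiff(available, installed)
}

# Packages whose repository version is newer than the installed one.
# Both arguments are named character vectors: c(pkg = "version").
outdated_packages <- function(repo_versions, installed_versions) {
  common <- intersect(names(repo_versions), names(installed_versions))
  if (length(common) == 0) return(character(0))
  newer <- package_version(repo_versions[common]) >
    package_version(installed_versions[common])
  common[newer]
}

# Install everything from the internal repository that is missing locally.
# `repo_url` would be the internal miniCRAN location (an assumption here).
sync_from_minicran <- function(repo_url) {
  avail <- rownames(available.packages(repos = repo_url))
  todo <- missing_packages(avail, rownames(installed.packages()))
  if (length(todo) > 0) install.packages(todo, repos = repo_url)
  invisible(todo)
}
```

Keeping the comparison logic in small pure functions makes it easy to check what would change on a desktop before anything is actually installed.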
I collated some recent discussion I had around this topic in a Storify: