Internal CRAN-like system - best practices


#1

Hi!

Was wondering if anyone is aware of any best practice information for creating a CRAN-like repo, inside an organisation’s network.

I’ve used a tool that I threw together (https://github.com/sellorm/alsoran), but not any other approaches so far.

What are other people doing?

Cheers,
Mark


Writeup: internal-package distribution
#2

Just gave your repo a quick read; it looks like a well-developed solution! The “what about drat and miniCRAN?” heading in the README doesn’t seem to have an answer in the following paragraph :slight_smile:


#3

Very interesting thread :+1:

What if different packages are used in different projects? I hate downloading packages again and again for every packrat project. Is there a way to combine this “internal CRAN” with packrat and host different versions of packages from different repos (CRAN, Bioconductor, Neuroconductor, etc.) in one local repo that gets synced with packrat, for example? If not, are there other solutions for this kind of problem?


#4

My colleagues and I have been testing a solution where the packages are installed to a shared location and the package version and R version are both specified in the package-loading function.

We wrapped these functions into an internal package called NAVpackrat.
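
For illustration, here is a minimal sketch of what such a version-pinned loading function could look like; the function name, directory layout, and arguments are hypothetical, not the actual NAVpackrat API:

# Hypothetical: load a specific package version from a shared library laid out as
# <root>/<R version>/<package>/<package version>/<package>/
load_shared <- function(pkg, pkg_version,
                        r_version = paste(R.version$major, R.version$minor, sep = "."),
                        root = "/shared/r-packages") {
  lib <- file.path(root, r_version, pkg, pkg_version)
  if (!dir.exists(lib)) {
    stop("No installation of ", pkg, " ", pkg_version, " for R ", r_version, " at ", lib)
  }
  library(pkg, lib.loc = lib, character.only = TRUE)
}

# Usage in a script: load_shared("dplyr", "0.7.4")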


#5

Thanks for the review Dean, I’ll have to have a look at that section and do something about that!

Thanks again,
Mark


#6

That’s a really interesting question!

I’m not aware of a solution at the moment that’s capable of doing that automatically. Doing it manually sounds like it might be a bit of a headache, but it might be the only way for now.

Unless anyone has other information?


#7

We use MRAN (daily snapshots of CRAN provided by Microsoft as part of their “checkpoint” framework) to create shared libraries of R packages that all users can access. We install a new snapshot about once a month (keeping the old one) and use Rprofile.site to make one snapshot (typically the second to last) the “default” in .libPaths(). We pull updates of Bioconductor packages, and of packages from GitHub, at the same time. For our internal packages we always target the snapshot marked as “testing” (typically the most recent one), so once we have a new snapshot and the “testing” one becomes the “default”, the internal packages stop changing for that snapshot. So the library looks like this:

r_library/
  2017-09-01/
    cran/
      dplyr
      ...
    bio/
      ...
    github/
      ...
    internal/
      ...
  2017-10-01/
    ...
  default -> 2017-09-01
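
To make the “default” snapshot the active library, an Rprofile.site along these lines could work; this is only a sketch assuming the layout above, with a placeholder root path rather than our exact code:

# Rprofile.site sketch: put the snapshot pointed to by the "default" symlink
# (and its cran/bio/github/internal sub-libraries) on the library search path.
local({
  snap <- Sys.getenv("R_SNAPSHOT", "default")   # allow a per-session override
  root <- file.path("/opt/r_library", snap)     # placeholder install location
  libs <- file.path(root, c("internal", "github", "bio", "cran"))
  .libPaths(c(libs[dir.exists(libs)], .libPaths()))
})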

We have a few functions - kind of a DSL - for the users to work with the snapshots (list, switch, …). We like this approach much more than either packrat or checkpoint. All the users are always on the same versions of all packages (except for projects with explicitly requested snapshots, mostly archived ones), we have fixed development targets for the internal packages, and we don’t store hundreds of copies of the same package in users’ or projects’ local libraries.
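
A rough sketch of what those helpers could look like (the names and root path here are made up, not our exact functions):

snapshot_root <- "/opt/r_library"   # placeholder

# List the available snapshots (directories named like YYYY-MM-DD)
snapshot_list <- function() {
  dirs <- list.dirs(snapshot_root, recursive = FALSE, full.names = FALSE)
  grep("^\\d{4}-\\d{2}-\\d{2}$", dirs, value = TRUE)
}

# Switch the current session (or freeze a script) to a specific snapshot
snapshot_use <- function(date) {
  libs <- file.path(snapshot_root, date, c("internal", "github", "bio", "cran"))
  .libPaths(c(libs[dir.exists(libs)], .libPaths()))
  invisible(date)
}

# e.g. at the top of a script: snapshot_use("2017-09-01")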


#8

@alexv, thanks for the tips! We do something similar with MRAN snapshots, except we keep package versions pegged to specific Microsoft R Server releases. The downside is that the R package versions can get pretty old (since MRS releases are relatively infrequent). I like the idea of having more rolling snapshots (with a “testing” version as well) and may look to implement something similar in the future.

Do you have any issues when breaking changes happen to packages when you update the default snapshot (e.g. the dplyr count function from 0.4.3 > 0.5 broke some users’ code)? One possibility would be to require users to specify a snapshot in their script, so that if you have legacy code that doesn’t get touched in a few years it can still access the same package versions as before. Is that something you considered?


#9

@vergilcw, every time the testing snapshot becomes the default one we send an announcement to all the users listing all the version changes. They are encouraged (and given directions on how) to check the release notes for any package they use in their work. In addition, they can use one of the functions I mentioned to freeze the snapshot in their script to, say, “2017-09-01”. That way their code will run exactly the same at any point in the future. This is very similar to MS’s checkpoint package, but the users don’t install any packages themselves.

That approach works most of the time but sometimes it becomes too restrictive. For some experimental work, when the scope of a project goes beyond just R and requires carefully chosen versions of, say, certain R packages, some system libraries, other applications, and/or languages with their own packages (Python? Java?), we use the Nix package manager (https://nixos.org/nix) to create reproducible “sandbox” environments consisting of all the required versions of all the tools. Unfortunately such sandbox environments are not easy to access and manage from RStudio, so the users use other means (Emacs/ESS, etc.).


#10

We have just started using an internal CRAN built with the miniCRAN package. It works well and allows us to deploy internal packages to the repository.

We push a custom Rprofile.site file to all our users, which changes their repos option to check the internal CRAN first and then a snapshot from MRAN. It looks like this:

options(repos = c(INTCRAN = "path/to/internal/CRAN",
                  CRAN = "https://mran.revolutionanalytics.com/snapshot/2017-11-05"))

This ensures all our users check the local CRAN first, then the snapshot from MRAN - the internal CRAN is built from the same snapshot.
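
For anyone setting this up from scratch, building the internal repository from that same snapshot might look roughly like this with miniCRAN (the package names and path are placeholders):

library(miniCRAN)

snapshot  <- "https://mran.revolutionanalytics.com/snapshot/2017-11-05"
repo_path <- "path/to/internal/CRAN"   # placeholder, as in the options() call above

# Resolve dependencies for the packages to mirror, then build the repo
pkgs <- pkgDep(c("dplyr", "data.table"), repos = snapshot, suggests = FALSE)
makeRepo(pkgs, path = repo_path, repos = snapshot, type = "source")

# Internal packages can be added later with addLocalPackage()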

Not without some teething issues getting this set up, but Google has all the answers:
https://cran.r-project.org/web/packages/miniCRAN/vignettes/miniCRAN-introduction.html

@sellorm - did you look at miniCRAN?


#11

I love miniCRAN, but it doesn’t fit the operating models of a large chunk of our clients. In these scenarios, I need tools that don’t require the user to understand anything about R :wink:

Of course in many, maybe most, cases the internal CRAN will be managed by the user community anyway, and then it’s ideal.


#12

@sellorm wish I’d seen your solution before I persuaded our devs to get to grips with R and miniCRAN (well, maybe not, as this way I get more people at the company invested in R!).

The approach we’re currently leaning to is essentially the one that @scw describes.

We’re also intending to adopt a snapshot approach for our internal miniCRAN, simply because it makes it easy for us to recreate historic environments by creating a new R setup that points to the specific snapshot URLs (for MRAN and our miniCRAN). Having the ability to re-run analyses is important to us from an internal risk-audit perspective.


#13

Nothing wrong with getting your devs into R! It’s a bit of a gateway drug I reckon :wink:

I’ve not had to go the snapshotting route recently, but I think it would be ideal for my current client. Their super-restrictive environment would suit a more measured and stable upgrade cycle.


#15

@sellorm There’s also GRANBase which doubles as a build and test system for R packages. It also uses the ‘covr’ package to create test coverage reports. If used with Apache HTTP server, it behaves exactly like a CRAN repository.


#16

We have an internal CRAN-alike that hosts packages we’ve developed for projects. Until recently, we did this using custom scripts to publish packages to a file server that was also hosted over HTTP; now we’ve started using the Sonatype Nexus artifact repository and its R repository plugin to host R packages. So far, so good.
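
For reference, the file-server approach needs little more than the standard repository layout plus a PACKAGES index, which base R can generate; a minimal sketch, with placeholder paths and package name:

# Publish a built source package to a CRAN-like directory exposed over HTTP
contrib <- "/srv/cran/src/contrib"   # placeholder repo location
dir.create(contrib, recursive = TRUE, showWarnings = FALSE)
file.copy("mypkg_1.0.0.tar.gz", contrib)          # built with R CMD build
tools::write_PACKAGES(contrib, type = "source")   # (re)generate the PACKAGES index

# Clients then install with:
# install.packages("mypkg", repos = "http://repo.example.com/cran")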

We also use Atlassian BitBucket, which builds, tests, and publishes our packages. We essentially use GitHub Flow branching (google that if you don’t know what it is - apparently I can only put 2 links in a post on this forum), so we have commits to the develop branch auto-publish to a ‘dev’ repository (after appending the Bamboo build number to the package version), and commits to master auto-publish to our main repository.

We haven’t found it beneficial to proxy or mirror CRAN itself - we just hit existing public CRAN mirrors.


#17

Oh! I also tried to use this at work and it seems promising. However, there are still some open issues and not everything works as intended. It doesn’t seem ready for all the specifics of a CRAN-like repository (proxying a public R repo, hosting an internal repo, and grouping repos). I am very interested in sharing experiences on that.
Currently we are using a file server hosted over HTTP, so I’m in much the same situation as you! :smile:


#18

Hi @cderv, what problems in particular have you had? I can keep an eye out for them in my environment.


#19

I maintain our internal miniCRAN repository. We have a central SharePoint list (I know!) of packages that users request and add to in the run-up to the next (internal) R release. I save this as an Excel sheet, which forms the “lookup” for miniCRAN. We use this to download packages into our miniCRAN, and, as @scw does, we set users up to point to this location by default. We actually take things to the next level by installing ALL packages from this miniCRAN onto users’ laptops. It’s overkill, but it means that users have almost everything they need just a library() call away.

We also developed some helper functions to check, install, update, and add new packages from our miniCRAN. These are wrappers around the default R functions, but they help us check what’s on desktops, remediate, roll out new packages, etc.
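
Those wrappers aren’t shown here, but the underlying miniCRAN calls are roughly these (the repo path and package names are placeholders):

library(miniCRAN)

repo_path <- "path/to/internal/miniCRAN"   # placeholder
cran      <- "https://cran.r-project.org"

# Add newly requested packages (and their dependencies) to the repo
addPackage(c("glue", "forcats"), path = repo_path, repos = cran, type = "source")

# Refresh existing packages to the latest versions available upstream
updatePackages(path = repo_path, repos = cran, type = "source", ask = FALSE)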

I collated some recent discussion I had around this topic in a Storify: