Internal CRAN-like system - best practices


#1

Hi!

Was wondering if anyone is aware of any best practice information for creating a CRAN-like repo, inside an organisation’s network.

I’ve used a tool that I threw together (https://github.com/sellorm/alsoran), but not any other approaches so far.

What are other people doing?

Cheers,
Mark


Writeup: internal-package distribution
#2

Just gave your repo a quick read; it looks like a well-developed solution! The “what about drat and miniCRAN?” heading in the README doesn’t seem to have an answer in the following paragraph :slight_smile:


#3

Very interesting thread :+1:

What if different packages are used in different projects? I hate downloading packages again and again for every packrat project. Is there a way to combine this “internal CRAN” with packrat and host different versions of packages from different repos (CRAN, Bioconductor, Neuroconductor, etc.) in one local repo that gets synced with packrat, for example? If not, are there other solutions for this kind of problem?


#4

My colleagues and I have been testing a solution where the packages are installed to a shared location and the package version and R version are both specified in the package-loading function.

We wrapped these functions into an internal package called NAVpackrat.
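
For illustration, here is a minimal sketch of what such a version-pinned loading function could look like; the function name, directory layout, and arguments are hypothetical, not the actual NAVpackrat API:

# Hypothetical: load a specific package version from a shared library laid out as
# <root>/<R version>/<package>/<package version>/<package>/
load_shared <- function(pkg, pkg_version,
                        r_version = paste(R.version$major, R.version$minor, sep = "."),
                        root = "/shared/r-packages") {
  lib <- file.path(root, r_version, pkg, pkg_version)
  if (!dir.exists(lib)) {
    stop("No installation of ", pkg, " ", pkg_version, " for R ", r_version, " at ", lib)
  }
  library(pkg, lib.loc = lib, character.only = TRUE)
}

# Usage in a script: load_shared("dplyr", "0.7.4")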


#5

Thanks for the review Dean, I’ll have to have a look at that section and do something about that!

Thanks again,
Mark


#6

That’s a really interesting question!

I’m not aware of a solution at the moment that’s capable of doing that automatically. Doing it manually sounds like it might be a bit of a headache, but it might be the only way for now.

Unless anyone has other information?


#7

We use MRAN (daily snapshots of CRAN provided by Microsoft as part of their “checkpoint” framework) to create shared libraries of R packages that all users can access. We install a new snapshot about once a month (keeping the old one) and use Rprofile.site to make one snapshot (typically the second to last) the “default” in .libPaths(). We pull updates of Bioconductor packages, and of packages from GitHub, at the same time. For our internal packages we always target the snapshot marked as “testing” (typically the most recent one), so once we have a new snapshot and the “testing” one becomes the “default”, the internal packages stop changing for that snapshot. So the library looks like this:

r_library/
  2017-09-01/
    cran/
      dplyr
      ...
    bio/
      ...
    github/
      ...
    internal/
      ...
  2017-10-01/
    ...
  default -> 2017-09-01
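
To make the “default” snapshot the active library, an Rprofile.site along these lines could work; this is only a sketch assuming the layout above, with a placeholder root path rather than our exact code:

# Rprofile.site sketch: put the snapshot pointed to by the "default" symlink
# (and its cran/bio/github/internal sub-libraries) on the library search path.
local({
  snap <- Sys.getenv("R_SNAPSHOT", "default")   # allow a per-session override
  root <- file.path("/opt/r_library", snap)     # placeholder install location
  libs <- file.path(root, c("internal", "github", "bio", "cran"))
  .libPaths(c(libs[dir.exists(libs)], .libPaths()))
})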

We have a few functions - kind of a DSL - for the users to work with the snapshots (list, switch, …). We like this approach much more than either packrat or checkpoint. All the users are always on the same versions of all packages (except for projects with explicitly requested snapshots, mostly archived ones), we have fixed development targets for the internal packages, and we don’t store hundreds of copies of the same package in users’ or projects’ local libraries.
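
A rough sketch of what those helpers could look like (the names and root path here are made up, not our exact functions):

snapshot_root <- "/opt/r_library"   # placeholder

# List the available snapshots (directories named like YYYY-MM-DD)
snapshot_list <- function() {
  dirs <- list.dirs(snapshot_root, recursive = FALSE, full.names = FALSE)
  grep("^\\d{4}-\\d{2}-\\d{2}$", dirs, value = TRUE)
}

# Switch the current session (or freeze a script) to a specific snapshot
snapshot_use <- function(date) {
  libs <- file.path(snapshot_root, date, c("internal", "github", "bio", "cran"))
  .libPaths(c(libs[dir.exists(libs)], .libPaths()))
  invisible(date)
}

# e.g. at the top of a script: snapshot_use("2017-09-01")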


#8

@alexv, thanks for the tips! We do something similar with MRAN snapshots, except we keep package versions pegged to specific Microsoft R Server releases. The downside is that the R package versions can get pretty old (since MRS releases are relatively infrequent). I like the idea of having more rolling snapshots (with a “testing” version as well) and may look to implement something similar in the future.

Do you have any issues when breaking changes happen to packages when you update the default snapshot (e.g. the dplyr count function from 0.4.3 > 0.5 broke some users’ code)? One possibility would be to require users to specify a snapshot in their script, so that if you have legacy code that doesn’t get touched in a few years it can still access the same package versions as before. Is that something you considered?


#9

@vergilcw, every time the testing snapshot becomes the default one we send an announcement to all the users listing all the version changes. They are encouraged (and given directions on how) to check the release notes for any package they use in their work. In addition, they can use one of the functions I mentioned to freeze the snapshot in their script to, say, “2017-09-01”. That way their code will run exactly the same at any point in the future. This is very similar to MS’s checkpoint package, but the users don’t install any packages themselves.

That approach works most of the time but sometimes it becomes too restrictive. For some experimental work, when the scope of a project goes beyond just R and requires carefully chosen versions of, say, certain R packages, some system libraries, other applications, and/or languages with their own packages (Python? Java?), we use the Nix package manager (https://nixos.org/nix) to create reproducible “sandbox” environments consisting of all the required versions of all the tools. Unfortunately such sandbox environments are not easy to access and manage from RStudio, so the users use other means (Emacs/ESS, etc.).


#10

We have just started using an internal CRAN built with the miniCRAN package. It works well and allows us to deploy internal packages to the repository.

We push a custom Rprofile.site file to all our users, which changes their repos option to check the internal CRAN first and then a snapshot from MRAN. It looks like this:

options(repos = c(INTCRAN = "path/to/internal/CRAN",
                  CRAN = "https://mran.revolutionanalytics.com/snapshot/2017-11-05"))

This ensures all our users check the local CRAN first, then the snapshot from MRAN - the internal CRAN is built from the same snapshot.
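
For anyone setting this up from scratch, building the internal repository from that same snapshot might look roughly like this with miniCRAN (the package names and path are placeholders):

library(miniCRAN)

snapshot  <- "https://mran.revolutionanalytics.com/snapshot/2017-11-05"
repo_path <- "path/to/internal/CRAN"   # placeholder, as in the options() call above

# Resolve dependencies for the packages to mirror, then build the repo
pkgs <- pkgDep(c("dplyr", "data.table"), repos = snapshot, suggests = FALSE)
makeRepo(pkgs, path = repo_path, repos = snapshot, type = "source")

# Internal packages can be added later with addLocalPackage()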

Not without some teething issues getting this set up, but Google has all the answers:
https://cran.r-project.org/web/packages/miniCRAN/vignettes/miniCRAN-introduction.html

@sellorm - did you look at miniCRAN?


#11

I love miniCRAN, but it doesn’t fit the operating models of a large chunk of our clients. In these scenarios, I need tools that don’t require the user to understand anything about R :wink:

Of course in many, maybe most, cases the internal CRAN will be managed by the user community anyway, and then it’s ideal.


#12

@sellorm wish I’d seen your solution before I persuaded our devs to get to grips with R and miniCRAN (well, maybe not, as this way I get more people at the company invested in R!).

The approach we’re currently leaning to is essentially the one that @scw describes.

We’re also intending to adopt a snapshot approach for our internal miniCRAN, simply because it makes it easy for us to recreate historic environments by creating a new R setup that points to the specific snapshot URLs (for MRAN and our miniCRAN). Having the ability to re-run analyses is important to us from an internal risk-audit perspective.


#13

Nothing wrong with getting your devs into R! It’s a bit of a gateway drug I reckon :wink:

I’ve not had to go the snapshotting route recently, but I think it would be ideal for my current client. Their super-restrictive environment would suit a more measured and stable upgrade cycle.


#15

@sellorm There’s also GRANBase which doubles as a build and test system for R packages. It also uses the ‘covr’ package to create test coverage reports. If used with Apache HTTP server, it behaves exactly like a CRAN repository.


#16

We have an internal CRAN-alike that hosts packages we’ve developed for projects. Until recently, we did this using custom scripts to publish packages to a file server that was also hosted over HTTP; now we’ve started using the Sonatype Nexus artifact repository and its R repository plugin to host R packages. So far, so good.
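
For reference, the file-server approach needs little more than the standard repository layout plus a PACKAGES index, which base R can generate; a minimal sketch, with placeholder paths and package name:

# Publish a built source package to a CRAN-like directory exposed over HTTP
contrib <- "/srv/cran/src/contrib"   # placeholder repo location
dir.create(contrib, recursive = TRUE, showWarnings = FALSE)
file.copy("mypkg_1.0.0.tar.gz", contrib)          # built with R CMD build
tools::write_PACKAGES(contrib, type = "source")   # (re)generate the PACKAGES index

# Clients then install with:
# install.packages("mypkg", repos = "http://repo.example.com/cran")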

We also use Atlassian BitBucket, which builds, tests, and publishes our packages. We essentially use GitHub Flow branching (google that if you don’t know what it is - apparently I can only put 2 links in a post on this forum), so we have commits to the develop branch auto-publish to a ‘dev’ repository (after appending the Bamboo build number to the package version), and commits to master auto-publish to our main repository.

We haven’t found it beneficial to proxy or mirror CRAN itself - we just hit existing public CRAN mirrors.


#17

Oh! I also tried to use this at work and it seems promising. However, there are still some open issues and not everything works as intended. It doesn’t seem ready for all the specifics of a CRAN-like repository (proxying a public R repo, hosting an internal repo, and grouping repos). I am very interested in sharing experiences on that.
Currently we are using a file server hosted over HTTP, so I’m in much the same situation as you! :smile:


#18

Hi @cderv, what problems in particular have you had? I can keep an eye out for them in my environment.


#19

I maintain our internal miniCRAN repository. We have a central SharePoint list (I know!) of packages that users request and add to in the run-up to the next (internal) R release. I save this as an Excel sheet, which forms the “lookup” for miniCRAN. We use this to download packages into our miniCRAN, and, as @scw does, we set users up to point to this location by default. We actually take things to the next level by installing ALL packages from this miniCRAN onto users’ laptops. It’s overkill, but it means that users have almost everything they need just a library() call away.

We also developed some helper functions to check, install, update, and add new packages from our miniCRAN. These are wrappers around the default R functions, but they help us check what’s on desktops, remediate, roll out new packages, etc.
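
Those wrappers aren’t shown here, but the underlying miniCRAN calls are roughly these (the repo path and package names are placeholders):

library(miniCRAN)

repo_path <- "path/to/internal/miniCRAN"   # placeholder
cran      <- "https://cran.r-project.org"

# Add newly requested packages (and their dependencies) to the repo
addPackage(c("glue", "forcats"), path = repo_path, repos = cran, type = "source")

# Refresh existing packages to the latest versions available upstream
updatePackages(path = repo_path, repos = cran, type = "source", ask = FALSE)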

I collated some recent discussion I had around this topic in a Storify: