miniCRAN on RStudio Server Pro, production environment

Hello, I wanted to know the proper way to make a few hundred packages available on RStudio Server Pro for a lot of users in a production environment, which in our case means the server has no internet access. Does setting up a miniCRAN make sense in my situation? Do I have to set up the miniCRAN repository on a separate server or shared folder? I can't use RStudio Package Manager because of budget restrictions. Thank you for your answers.

Do you actually need the users to be able to install packages into their personal libraries? I found it much easier and cleaner to create a shared library of installed packages for everybody to use. You can create the library on a computer where you do have internet access, then transfer it to production by rsync, scp, tape, etc. Once it is copied to your production server, just use .libPaths() in Rprofile.site to make the library visible to everybody by default. That solves two important issues: your server doesn't end up with dozens of copies of, say, dplyr that individual users installed, and everybody is guaranteed to be on the same versions of all packages, so you can push updates to everybody at once. You can even create date-based snapshots of the library so that users can switch to earlier versions of the packages if needed for reproducibility. That is what we have been doing for a few years already, and it has been very robust and convenient for users.
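
A minimal sketch of the Rprofile.site idea, under some assumptions: the root path /opt/R/shared-library, the per-snapshot subfolder layout, and the R_LIB_SNAPSHOT environment variable are all made up for illustration and are not something R or RStudio Server defines.

```r
# Sketch for Rprofile.site (typically $R_HOME/etc/Rprofile.site).
# ASSUMPTIONS: the root path, the snapshot subfolders and the
# R_LIB_SNAPSHOT variable are hypothetical names chosen for this example.
use_shared_library <- function(root = "/opt/R/shared-library",
                               snapshot = Sys.getenv("R_LIB_SNAPSHOT", "current")) {
  lib <- file.path(root, snapshot)
  if (dir.exists(lib)) {
    # Put the shared library first so its packages take precedence
    # over anything in a user's personal library.
    .libPaths(c(lib, .libPaths()))
  }
  invisible(.libPaths())
}

# Does nothing if the directory is absent.
use_shared_library()
```

With a layout like this, a user who needs an older snapshot could set R_LIB_SNAPSHOT (for example in their .Renviron) to a dated folder name. In a real Rprofile.site you may prefer to wrap the body in local() so that no helper object is left in users' workspaces.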

Thank you for your answer, but in that case I can't figure out what the difference is between a shared library and miniCRAN. In my situation, the only constraints are that it is an offline production environment and that some users will occasionally ask for a package to be added. If I choose the shared library solution, how can I deal with dependency resolution? Is there some metadata or a dependency tree to keep updated?
Apart from that, I'm looking for the most painless solution for future admins of this environment.

Thank you again.

If you don't have a shared library, but just an internal package repository, users can organize their libraries any way they wish. They don't need to update packages. They can use customized versions of packages. They might install packages from CRAN or GitHub instead of your repo (even if you make it difficult, because people are geniuses at doing dumb things).

If you have a lot of users on a dedicated R server, you'd probably benefit from a shared library. This way, whenever a user runs library(ggplot2), it'll load the package installed in the shared library. They'll be using the same version as everyone else. They won't hit that annoying Error in library(pkg) : there is no package called 'pkg' when running each other's programs. Or worse, not hitting an error but getting different results. For the R admin, there's also the peace of mind that comes with a standard environment. Fewer moving parts.

IMO, setting up a repo makes sense when it's not a one-stop-shop. I maintain one at work for our in-house packages. Users get everything else from CRAN or GitHub. This is a good solution for us, because we don't have an R admin to maintain a shared library or handle which packages to include.

To share our experience: for our internal offline clusters we implemented both solutions.

First we made an internal CRAN mirror for our offline servers. We have an architecture of two servers for the CRAN mirror: one sits in an online zone and syncs with CRAN, the other is synced internally from the first and is connected to our internal servers. We used rsync, but you can also do this with miniCRAN: build the miniCRAN repo from CRAN on the online server, then sync it to wherever you store your internal archive (a filesystem or a web server).
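
For the miniCRAN variant, something like the following could run on the online server. This is only a sketch: the function name build_offline_repo, the target path and the CRAN mirror URL are assumptions for illustration, and it needs the miniCRAN package plus internet access, so the actual call is left commented out.

```r
# Run on the internet-connected staging server; requires miniCRAN.
# ASSUMPTIONS: function name, mirror URL and target path are illustrative.
build_offline_repo <- function(pkgs, repo_dir,
                               cran = c(CRAN = "https://cloud.r-project.org")) {
  if (!requireNamespace("miniCRAN", quietly = TRUE)) {
    stop("Install the miniCRAN package on the staging server first.")
  }
  # Resolve the requested packages plus their recursive dependencies
  # (Suggests are skipped to keep the repository small).
  all_pkgs <- miniCRAN::pkgDep(pkgs, repos = cran, type = "source",
                               suggests = FALSE)
  dir.create(repo_dir, recursive = TRUE, showWarnings = FALSE)
  # Download the source tarballs into repo_dir/src/contrib and write
  # the PACKAGES index files that install.packages() reads.
  miniCRAN::makeRepo(all_pkgs, path = repo_dir, repos = cran, type = "source")
  invisible(all_pkgs)
}

# Example (not run here, needs internet access):
# build_offline_repo(c("dplyr", "ggplot2"), "/srv/miniCRAN")
```

The resulting folder can then be copied to the offline side with rsync, exactly as with a full mirror.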

Then, on our internal cluster, we maintain a shared library for our users so that they don't need to install the most widely used packages themselves. Installing packages on Linux can also take a very long time. The shared library is also our way of maintaining a ready-to-use environment for new users.

Solutions on this website offer good insights on all these topics.

Hope it helps

Thank you very much for your answers, they really help me. Just one last question about miniCRAN: if users don't intend to create and publish packages, and we plan to maintain a single version of each package, does miniCRAN bring anything more than a shared library to which I can simply rsync any package I want to add? I really want to understand which solution is the simplest in a 'simple' use case like mine, but with offline production and scaling constraints. Thank you again.

I think it's worth being careful with your terminology here. If you are using rsync to create a folder structure similar to CRAN, then you are creating a "repository" (not a "shared library"), i.e. a repository of source tarballs (or Windows / macOS binaries). A "repository" can have just one version of each package, but it can also have multiple, can serve different operating systems, etc. As such, it is a flexible and advisable solution for providing packages to your internal infrastructure.

That said, miniCRAN is just a tool for maintaining such a folder structure. You can also do so through rsync (although beware you will have to traverse the package dependency graph yourself - i.e. what packages do I need to create a functional version of dplyr?) or other tools like drat (a different R package aimed at a similar problem).
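
To illustrate what "traversing the dependency graph" means, here is a small sketch using tools::package_dependencies() from base R. The package index below is a hand-written toy with made-up dependency entries so that the example runs offline; against a real repository you would pass the result of available.packages() as db instead.

```r
# Toy package index with the same column names as available.packages().
# ASSUMPTION: the dependency entries are invented for illustration.
db <- rbind(
  c(Package = "dplyr",  Depends = NA, Imports = "rlang, tibble (>= 3.0.0)", LinkingTo = NA),
  c(Package = "tibble", Depends = NA, Imports = "rlang",                    LinkingTo = NA),
  c(Package = "rlang",  Depends = NA, Imports = NA,                         LinkingTo = NA)
)

# Recursively collect everything needed to install dplyr.
deps <- tools::package_dependencies(
  "dplyr", db = db,
  which = c("Depends", "Imports", "LinkingTo"),
  recursive = TRUE
)

# Full set of tarballs that must be present in the repository.
needed <- unique(c("dplyr", unlist(deps, use.names = FALSE)))
print(needed)
```

miniCRAN does this kind of resolution for you, which is the main convenience it adds over plain rsync of hand-picked tarballs.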

A "shared library" is a place where packages are installed. This only allows a single version of each package and is tied to a specific operating system (the operating system where the installed packages were built). Shared libraries are created by using install.packages() or R CMD INSTALL, etc. There are also some nuances of a library around build dependencies and linkages that do not exist in a repository (i.e. if Rcpp was used to build dplyr, then rebuilding Rcpp should ideally require rebuilding dplyr).
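
The two concepts can work together: an admin installs from the internal repository into the shared library. A rough sketch, assuming a Linux server with the repository on a local or mounted filesystem; the helper names and the example paths are hypothetical.

```r
# Build a file:// URL for a repository folder on a Unix-like system.
# ASSUMPTION: repo_dir holds a CRAN-like tree (src/contrib/PACKAGES etc.).
repo_url <- function(repo_dir) {
  paste0("file://", normalizePath(repo_dir, winslash = "/", mustWork = TRUE))
}

# Install packages (and their Depends/Imports/LinkingTo) from the
# internal repository into the shared library directory.
install_to_shared_library <- function(pkgs, repo_dir, lib_dir) {
  dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)
  utils::install.packages(pkgs, lib = lib_dir,
                          repos = repo_url(repo_dir), type = "source")
}

# Example (not run here, needs a populated repository):
# install_to_shared_library("dplyr", "/srv/miniCRAN",
#                           "/opt/R/shared-library/current")
```

Because install.packages() reads the repository's PACKAGES index, dependency resolution happens automatically at this step, as long as the dependencies are present in the repository.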

All of that said, you can use whatever tools you want to maintain a repository. There are tools for managing libraries as well (the ones I am most familiar with are packrat and renv, although I know I have seen others). I hope that helps!
