What are the main limits to R in a production environment?


#11

Hi @spiritus87, not sure if you have looked at packrat to help you isolate your projects: https://rstudio.github.io/packrat/


#12

As @edgararuiz mentioned, packrat is meant for controlling package dependencies on a per-project basis.

You're right that packaging your own code is the right way to go. Packages are the intended way of sharing code and data in a machine-agnostic way. Creating packages used to be confusing and frustrating (especially on Windows), but RStudio now makes it simple. I suggest using the miniCRAN package to create your own package repository for your internal packages (https://cran.r-project.org/package=miniCRAN).


#13

packrat is nice, but if you want a particular version of a package (or not to update some packages) you might encounter difficulties ...
In general, packrat and versioning are hard and not maintainable in the long term.


#14

For the 2nd and 3rd parts, I think you will find Rscript and littler useful. littler in particular is great. I found it only recently, but if you use the scripts from its examples directory you can save a lot of time and coding, and your R scripts will be on the same level as bash or any other language.


#15

What is the best practice to minimize the chance that an updated package will break the system?

You could use MRAN snapshots.
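A minimal sketch of what pinning a session to a snapshot can look like, assuming the MRAN URL scheme https://mran.microsoft.com/snapshot/<date> (check that the snapshot service is still reachable before relying on it). The helper name snapshot_repo and the date are illustrative only, not something from this thread:

```r
# Build the URL of a dated CRAN snapshot (MRAN URL scheme assumed).
snapshot_repo <- function(date) {
  date <- as.Date(date)  # fails early on malformed dates
  paste0("https://mran.microsoft.com/snapshot/", format(date, "%Y-%m-%d"))
}

# Point this session at the snapshot; the date is an arbitrary example.
options(repos = c(CRAN = snapshot_repo("2017-11-01")))

# install.packages("dplyr") would now resolve against that snapshot.
```

Every machine that sets the same date sees the same package versions, which is the whole point of the approach.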


#16

There is a bit of an issue with packages that integrate with external applications. For example, I was using the RSelenium package, and then an update from the Selenium folks broke my code. Such things can happen in any language if the underlying technology gets upgraded. On the other hand, I use the RWordPress package to update my blog, and it has worked seamlessly for the last year or so.

I believe it's the way one picks and chooses packages that makes R more production ready. There are many ways to do things the wrong way in R, and it's very easy to fall for them. :)


#17

I second MRAN snapshots. One thing I've noticed with them is that if you install from the snapshot inside your Docker container, Docker won't cache that step and will re-install everything every single time the image is built. Not the end of the world, but with lots and lots of C++ in dplyr, for example, compilation takes time.

In fact, I wanted to add that in my company dependency management has been the biggest sticking point. Every time we need to make sure that dependencies won't break and every time we must guarantee that package versions are fixed (this is done in Python trivially with requirements.txt file, so there is an expectation that it should be as easy with R too).
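One rough way to at least detect drift from a fixed set of versions, sketched in base R. The helper check_pins and the pinned versions below are hypothetical, and the function only checks what is installed; it does not install or downgrade anything:

```r
# Compare installed package versions against a pinned list.
# `pins` is a named character vector: names are packages, values are versions.
check_pins <- function(pins) {
  ok <- vapply(names(pins), function(pkg) {
    tryCatch(
      utils::packageVersion(pkg) == package_version(pins[[pkg]]),
      error = function(e) FALSE  # not installed, or malformed version string
    )
  }, logical(1))
  data.frame(package = names(pins), pinned = unname(pins), ok = unname(ok),
             stringsAsFactors = FALSE)
}

# Example using base packages, whose versions follow the running R version.
pins <- c(stats = as.character(getRversion()), stats4 = "0.0.1")
print(check_pins(pins))
```

A deployment script could stop when any row has ok set to FALSE, which gives a cheap guard even without a full lockfile tool.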

Other than that, most of the complaints from DevOps tend to be: "Well, it's not Python and I know Python, so use it instead". Those are not valid complaints, of course, but there is this prevailing attitude for sure.

Another frequent sticking point is the ability to switch environments easily when deploying to staging/production. Right now we do it with different yaml files for staging and production, but there is a package that lets you do it in one yaml file: all you need to do is set one environment variable and the correct parameters will be loaded (UPDATE: I've found it. It is called config, who would've thought).
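The idea of one set of settings with per-environment overrides can be illustrated without the package itself. The sketch below is a hand-rolled base-R imitation, not the config package's implementation; the settings and the helper get_setting are hypothetical. It reuses the R_CONFIG_ACTIVE environment variable name that the config package reads. With the package, the same layout would live in a config.yml and be read with config::get().

```r
# Per-environment settings; non-default sections override `default`.
settings <- list(
  default    = list(db_host = "localhost", debug = TRUE),
  staging    = list(db_host = "staging-db.internal"),
  production = list(db_host = "prod-db.internal", debug = FALSE)
)

# Look up one setting for the active environment.
get_setting <- function(name,
                        active = Sys.getenv("R_CONFIG_ACTIVE", "default")) {
  overrides <- settings[[active]]
  if (is.null(overrides)) stop("unknown environment: ", active)
  merged <- utils::modifyList(settings$default, overrides)
  merged[[name]]
}

# With R_CONFIG_ACTIVE=production set on the server, the same call
# returns the production value without any code change.
get_setting("db_host", active = "staging")
```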

In general, from my experience, the biggest complaints about R in production tend to come from a lack of education, since most of them can be solved. So it is up to us, R developers, to explain in detail why certain fears are at least exaggerated and can be addressed without too much pain.


#18

If I had one R Christmas wish (@Rstudio @RConsortium @...), it would be packrat running smooth out of the box on all environments and all repositories.


#19

@spiritus87 - really good reasoning and structuring of productionizing the project. What actually would be great is to have guidelines, best practices and tangible code examples for productionizing R scripts (models and others). Does anyone over here have good, practical references?


#20

packrat is nice, but if you want a particular version of a package (or not to update some packages) you might encounter difficulties …
In general, packrat and versioning are hard and not maintainable in the long term.

So I definitely agree that packrat has its difficulties, but I wanted to be clear (cf. @xhudik) that it can handle a particular version of a package, not updating packages, etc. so long as your computer can install packages from source. "difficulties" is the operative word in his comment - IMO they are not insurmountable difficulties, though. As mentioned, dependencies on third party tools outside of the R universe (cf. Selenium above, @s_maroo) would need to be handled separately.

Packrat also has an advantage over MRAN in that it can include versions of locally developed packages (not on CRAN) and git repos. In the past, I have used the drat package to build local CRAN-like repositories of locally developed packages. Also, after getting used to some of the nuances of packrat, I really like that it declares explicit version numbers (like requirements.txt) and does make my code stable / reproducible using CRAN's archived package sources.

@konradino A budding discussion on guidelines for using R in production is here.


#21

@cole - hmm, can you simply update 1 package without updating the rest? E.g.

Package: BH
Source: CRAN
Version: 1.62.0-1
Hash: 14dfb3e8ffe20996118306ff4de1fab2

simply change to

Package: BH
Source: CRAN
Version: 1.55.0-1
Hash: 14dfb3e8ffe20996118306ff4de1fab2

?
The simplest approach would be to rewrite packrat.lock (as I did in the lines above), but this doesn't work. And I'm not aware of any other way to specify a particular version of a package I want to install (upgrade or downgrade).

I completely agree that 3rd party dependencies need to be solved by users (this cannot be done by packrat).


#22

If you want to fix system dependencies to versions, I think (Docker) containers are the way to go. You can install specific versions with the system package manager as needed.

@mishabalyasin Regarding "re-install every single time": IMHO this can be largely mitigated by (a) using the Rocker project's images as base images (e.g. rocker/verse if you need dplyr; you also get all the MRAN advantages "for free" with version-tagged images) and (b) letting Docker Hub (or GitLab, or your own build server) build the images for you.

I must admit that these two pieces of advice don't go well together.


#23

@xhudik that is a fantastic question. Sure thing! This is the sort of thing that takes some getting used to and could perhaps be improved. Steps to reproduce:

install.packages('BH')
packrat::snapshot()

The resulting packrat.lock:

PackratFormat: 1.4
PackratVersion: 0.4.8.1
RVersion: 3.4.2
Repos: CRAN=https://cran.rstudio.com/

Package: BH
Source: CRAN
Version: 1.65.0-1
Hash: 95f62be4d6916aae14a310a8b56a6475

Package: packrat
Source: CRAN
Version: 0.4.8-1
Hash: 6ad605ba7b4b476d84be6632393f5765

Now, if I want to force a package version, then I can edit the lock file as you mention. I usually delete the Hash: entry entirely, since this maps back to the version that I currently have installed. It seems to work fine without doing so though. Specifically:

PackratFormat: 1.4
PackratVersion: 0.4.8.1
RVersion: 3.4.2
Repos: CRAN=https://cran.rstudio.com/

Package: BH
Source: CRAN
Version: 1.55.0-1
Hash: 95f62be4d6916aae14a310a8b56a6475

Package: packrat
Source: CRAN
Version: 0.4.8-1
Hash: 6ad605ba7b4b476d84be6632393f5765
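
The manual edit above (change Version, drop Hash) could also be scripted. A rough base-R sketch that treats the lockfile as plain text lines; set_lock_version here is a hypothetical helper, not a packrat function:

```r
# Set the Version of one package in packrat.lock-style text and drop its Hash.
# `lines` is the lockfile content as returned by readLines().
set_lock_version <- function(lines, package, version) {
  start <- which(lines == paste("Package:", package))
  if (length(start) != 1L) stop("package not found exactly once: ", package)
  # A record runs until the next blank line (or the end of the file).
  blanks <- which(trimws(lines) == "")
  end <- c(blanks[blanks > start], length(lines) + 1L)[1L] - 1L
  idx <- start:end
  lines[idx[grepl("^Version:", lines[idx])]] <- paste("Version:", version)
  drop <- idx[grepl("^Hash:", lines[idx])]
  if (length(drop) > 0L) lines <- lines[-drop]
  lines
}

# Example on an in-memory copy of the two records shown above.
lock <- c("Package: BH", "Source: CRAN", "Version: 1.65.0-1",
          "Hash: 95f62be4d6916aae14a310a8b56a6475", "",
          "Package: packrat", "Source: CRAN", "Version: 0.4.8-1",
          "Hash: 6ad605ba7b4b476d84be6632393f5765")
edited <- set_lock_version(lock, "BH", "1.55.0-1")
writeLines(edited)
```

In a real project the input would come from readLines("packrat/packrat.lock") and the result would be written back with writeLines() before restoring.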

Then, call packrat::restore() to restore your state to that represented by your lockfile. You will get a nice little confirmation warning (this is where being able to build from source is important):

Note that packrat's internals get in the way here: it takes another snapshot before restoring state, so I end up with 1.65.0-1 in my lockfile even though 1.55.0-1 is installed. This might be a bug / feature request (and might pair well with a set_lock_version function or something similar to make this process easier).

PackratFormat: 1.4
PackratVersion: 0.4.8.1
RVersion: 3.4.2
Repos: CRAN=https://cran.rstudio.com/

Package: BH
Source: CRAN
Version: 1.65.0-1
Hash: 95f62be4d6916aae14a310a8b56a6475

Package: packrat
Source: CRAN
Version: 0.4.8-1
Hash: 6ad605ba7b4b476d84be6632393f5765

The way to remedy that is with another packrat::snapshot(), but here we run into a note.

I typically appreciate the verbosity, but I politely tell packrat that I know what I'm doing with packrat::snapshot(ignore.stale=TRUE).

Now my lockfile is in the state that I expect, with a new Hash and packages in the state that I want:

PackratFormat: 1.4
PackratVersion: 0.4.8.1
RVersion: 3.4.2
Repos: CRAN=https://cran.rstudio.com/

Package: BH
Source: CRAN
Version: 1.55.0-1
Hash: d924d63d19a9615bdcb2548b534550f6

Package: packrat
Source: CRAN
Version: 0.4.8-1
Hash: 6ad605ba7b4b476d84be6632393f5765

Some noted points:

  • Per @Tazinho's Christmas wish: it definitely is not a "just works" or "running smooth out of the box" kind of solution, but it has all the power and flexibility I want (especially with regard to installing specific commits from a git repo, archived source versions of local packages, etc.)
  • The reason for the "ignore.stale" requirement is that packrat does not know whether I want to keep 1.55.0-1 installed or whether the 1.65.0-1 in my lockfile is what I really want. Because of the version conflict, it checks to be sure I know what I am doing before overwriting the lockfile version. The warning might prompt me to say "Oh, I forgot to packrat::restore()! Woops!"
  • One of the big pain points that happens with packrat is when the R session terminates in the middle of an install and packages are left in a weird state. I'm not sure if this is crazy or not, but my response has typically been to just rm -r the folder in question and trust packrat to rebuild my dependencies from scratch.

Hope that helps! Packrat has been a life-saver for me, and I do lots of the version-munging that you mention. It would certainly be possible to have packrat::snapshot() be an automated part of the development submission process and packrat::restore() be an automated part of the release process. I have been bitten by that in the past - forget to do one or another and then things break during the release: "???? I tested this! Oh! Snap. I forgot to restore my dependencies on the new system."


#24

@nuest As I mentioned in "Internal CRAN-like system - best practices", the inability of packrat (or checkpoint) to deal with system dependencies is the main reason we use the Nix package manager.

A recent example: a user requested sf - a package for spatial data which depends on a standard library from GDAL project. It turned out that our version of Linux did not have packages for new-ish versions of GDAL. So the choice was between building GDAL from sources (and then maintaining the install to keep it compatible with future upgrades of both R and R packages) and just running one Nix command:

nix-shell -p R rPackages.sf --run R

which takes care of building and caching the correct versions of all C++ libraries and the corresponding R packages. That command is guaranteed to keep working for all future upgrades of all the moving parts. If I need to add some Python packages to the environment (tensorflow?), it is a matter of adding them to the command line, and Nix will make sure that all versions of everything are compatible with each other. Another bonus: if I so desire, the "sandbox" will be invisible to anybody else (including my other projects, which may require different versions of the tools). The command line can be replaced by a short script in the Nix language, which I can then run to enter the sandbox.

Regarding Docker vs. Nix, I really like this post https://blog.wearewizards.io/why-docker-is-not-the-answer-to-reproducible-research-and-why-nix-may-be.

The reproducibility issue is solved automatically by the fact that the "recipes" for all Nix packages constitute a single entity, the Nix Packages collection (Nixpkgs), and it is trivial to pin a particular commit of Nixpkgs in either that command line or the corresponding Nix script.


#25

I think R being single-threaded is definitely an issue. It is surmountable, and it is not a problem for batch work in production, but it makes R less great as a long-running service.

I wish I could make sense of packrat but Rocker + MRAN has been a huge help for dependency lockdown.

But honestly, I think most of the problem is that R developers often do not come from a culture of software development and therefore have neither the tools nor the practices needed to do the work of moving R to production.


#26

packrat is generally really great, but unfortunately it has a lot of bugs and situations where it doesn't work, and as such it breaks many automatic deployment scripts (besides being extremely slow for bigger dependencies).
On the other hand, development is still active, the maintainer is very helpful, and many bugs are fixed in the development version. This does not help for production, though, as there has been no stable release since 2016.
So in the end I have quite mixed feelings. Given that it is some kind of "RStudio"-supported package, a clearer roadmap and/or strategy would be very helpful.


#27

Just to mention, here is also a nice writeup on checkpoint, packrat and docker.


The conclusion is that packrat and docker are the best option.
In my personal opinion there are still issues with docker, and on Windows especially it is not without headaches (I would not recommend it there). The comments on the blog post have also started an interesting discussion of further drawbacks.


split this topic #28

8 posts were split to a new topic: Questions about R in production


#30

This topic has been closed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.
