How to get started with R for production?


#1

Hi,

I am a young (<1 year of experience) French data scientist. From what I’ve seen and heard, the myth that R is not good for production seems to be commonly accepted in the French data science community.
But from what I’ve read, R can be, and is, used in production. See for example:
http://blog.sellorm.com/2016/11/26/talk-r-is-production-safe/
https://www.quora.com/Why-cant-R-be-used-to-write-production-grade-code-Why-is-Python-not-used-for-prototyping-also/answer/Giuseppe-Paleologo?srid=u89Dp

So I would like to know if there are some great resources (not necessarily free) for learning how to run R in production. I would like to start a personal project on the subject, so I am also open to ideas for ways to apply what I read.

Thank you,


#2

Do you have an idea what you’d like to use R for?
I’m not entirely sure what you mean.

Do you want to set up a server running R?
You can do that with no problems on DigitalOcean or Scaleway (only listing those I’ve used before, there are more).
If you need more control, there’s RStudio’s RStudio Server (Pro), and Shiny Server (Pro) for Shiny apps.

Do you want to learn about how to do specific things using R? Then maybe you would like to read some books like http://r4ds.had.co.nz.


#3

Oh sorry, I was not very clear because I have not yet encountered production problems.

To be more precise, I already have a good knowledge of R and have read R for Data Science.

Some things I would like to know more of:
When doing data analysis, I may not code the way production requires. For instance, I don’t test my code. I know testing is necessary for production, but I don’t know what to test or how to test it.
Also, I don’t know how to run code at large scale, or how to build an R application that automatically runs its computations when new data arrives.

Thank you for your suggestions, I will check out DigitalOcean and Scaleway.


#4

Some of the big production issues I have experienced with R:

  • Making results reproducible and isolating third-party dependencies (i.e. protecting stable jobs from breaking changes). I used packrat for this, although I know there are some other solutions (checkpoint, etc.).
  • Testing. testthat is my favorite testing framework for R.
  • Dev/QA/prod configuration without code changes, which is made easier by the config package.
  • Performance. This was often solved by using a database more efficiently and coordinating my efforts with the database team.
  • Version control, using GitHub and the RStudio IDE.
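To make the testing point a bit more concrete, here is a minimal testthat sketch. The function `rescale01` is a made-up example, not something from a real project; the idea is just that you test small functions on inputs where you already know the answer, including awkward cases such as missing values.

```r
library(testthat)

# Hypothetical function under test: rescales a numeric vector to [0, 1].
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Each test_that() block groups expectations about one behaviour.
test_that("rescale01 maps the range onto [0, 1]", {
  expect_equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))
  expect_equal(range(rescale01(c(-3, 2, 7))), c(0, 1))
})

test_that("rescale01 keeps missing values in place", {
  expect_equal(rescale01(c(1, NA, 3)), c(0, NA, 1))
})
```

In a package, blocks like these usually live under tests/testthat/ so they run on every check.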

All of those problems were largely addressed by establishing good programming practices using the (open source) tools I mentioned. Some of the other common issues (sharing results, collaboration, scheduling, authentication, security, scaling) are addressed directly by the types of software that RStudio develops. RStudio Connect might be another piece of software you look into as you think through this process. sparklyr is another tool that you might look at - a package for integrating R with Spark for distributed computing.
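As a rough illustration of the dev/QA/prod configuration point in the list above, here is a sketch using the config package. The setting name `db_host` and the host names are invented for the example, and the YAML is written to a temporary file only so the snippet is self-contained; in a real project it would normally be a config.yml checked in next to the code.

```r
# Write a small, hypothetical config file. Values under `production`
# override those under `default`; anything not overridden is inherited.
config_path <- tempfile(fileext = ".yml")
writeLines(c(
  "default:",
  "  db_host: \"localhost\"",
  "  batch_size: 100",
  "production:",
  "  db_host: \"db.internal.example\""
), config_path)

# Same code, different settings, chosen by the `config` argument.
# (When `config` is not given, the package reads the R_CONFIG_ACTIVE
# environment variable and falls back to "default".)
dev_host  <- config::get("db_host", config = "default", file = config_path)
prod_host <- config::get("db_host", config = "production", file = config_path)
prod_batch_size <- config::get("batch_size", config = "production", file = config_path)
```

Calling `config::get()` with the namespace prefix, rather than attaching the package, avoids masking `base::get()`.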


#5

Re: Isolating third-party dependencies, I’ve been doing some work with and am pretty impressed by https://github.com/robertzk/lockbox.

@cole What issues around testing have you had that weren’t adequately addressed by testthat?


#6

@RussellPierce None really come to mind. I was saying that “testing” can be a challenge for moving processes into production, and testthat solves those problems neatly. For reference, there is a good article on testing (with a focus on packages) here.

The assertthat package can be used for run-time assertions, but there are other options as well.
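For anyone who hasn’t seen it, a run-time assertion with assertthat might look something like the sketch below. `column_mean` is a hypothetical helper written for this example; the checks run every time the function is called, not only in a test suite.

```r
library(assertthat)

# Hypothetical helper: mean of one column, with run-time input checks.
column_mean <- function(df, column) {
  assert_that(is.data.frame(df))
  assert_that(is.string(column))         # a single character value
  assert_that(has_name(df, column))      # the column must exist
  assert_that(is.numeric(df[[column]]))  # and must be numeric
  mean(df[[column]], na.rm = TRUE)
}

# Valid input goes straight through to the computation.
mpg_mean <- column_mean(mtcars, "mpg")

# Invalid input stops with an error; try() keeps this demo script running.
bad_call <- try(column_mean(mtcars, "no_such_column"), silent = TRUE)
```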


#7

There is an interesting sister-discussion going on over here.


#8

Cole’s post already covers what I’d call “step 1” stuff. All of that is table stakes for R in production: these are well-solved problems, and together they cover most of what it means to treat your R code development as software development.

I think the next steps are to understand whether “in production” for you means “callable via an API”, “batch processing”, or “building a full application”. All of these are “in production” but require different solutions/packages/practices, and are mostly doable.
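To give a feel for the “batch processing” flavour, here is a minimal base-R sketch of a job that picks up new data files. Everything in it is an assumption made for illustration: the function name `process_new_files`, the inbox/outbox folder layout, and the “summary” it computes (column means).

```r
# Hypothetical batch step: summarise every CSV in `inbox` that does not
# yet have a result file of the same name in `outbox`.
process_new_files <- function(inbox, outbox) {
  dir.create(outbox, showWarnings = FALSE, recursive = TRUE)
  inputs <- list.files(inbox, pattern = "\\.csv$", full.names = TRUE)
  processed <- character(0)
  for (path in inputs) {
    target <- file.path(outbox, basename(path))
    if (file.exists(target)) next  # already handled in an earlier run
    data <- read.csv(path)
    # Keep only the numeric columns and compute their means.
    numeric_cols <- data[vapply(data, is.numeric, logical(1))]
    result <- data.frame(
      column = names(numeric_cols),
      mean = vapply(numeric_cols, mean, numeric(1), na.rm = TRUE),
      row.names = NULL
    )
    write.csv(result, target, row.names = FALSE)
    processed <- c(processed, path)
  }
  invisible(processed)  # paths handled during this run
}
```

A script that calls this function could then be run on a schedule with Rscript (for example from cron). Because files that already have a result are skipped, re-running the job is harmless.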


#9

I wasn’t aware of assertthat. I have used testthat for runtime assertions in addition to package development. Are there advantages of assertthat over testthat for runtime assertions?


#10

That’s an interesting thought! My first gut reaction is just that the packages were designed with different goals and so are better optimized for their strengths. testthat was designed for unit tests and so has a bunch of wrappers that do a handy job of monitoring which unit tests pass / fail, etc. (You should check out the new testthat - it produces really pretty output! Not sure if the new version has hit CRAN yet).

assertthat has the same creator/maintainer and is explicitly created with run-time assertions in mind (as a replacement for stopifnot).

On the practical front, though, I think the focus may be on the types of error messages issued. assertthat advertises that it:

makes it easy to declare the pre and post conditions that your code should satisfy, while also producing friendly error messages so that your users know what they’ve done wrong.

Usually, expect_equal (in testthat) expects to have the parent test_that script pick up any errors and relay the message that a unit test failed, so I think the quality of the error message may be lower when using testthat at runtime.

Additionally, there are probably some edge cases or internals where the two packages follow a different thought process. The assertthat help page seems pretty adamant:

Assertion functions should return a single TRUE or FALSE: any other result is an error, and assert_that will complain about it.

I expect that there is a reason for that specificity. I just don’t know it. I’m sure @hadley would have some more thoughts on the differences between assertthat and testthat.
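For what it’s worth, here is a small sketch of a custom assertion that sticks to the single TRUE/FALSE contract quoted above and attaches a friendlier message with `on_failure()`. The assertion `is_probability` and its message wording are made up for this example.

```r
library(assertthat)

# Hypothetical assertion function: always returns a single TRUE or FALSE.
is_probability <- function(x) {
  is.numeric(x) && length(x) == 1 && !is.na(x) && x >= 0 && x <= 1
}

# Custom failure message, built from the expression the caller passed in.
on_failure(is_probability) <- function(call, env) {
  paste0(deparse(call$x), " is not a single number between 0 and 1")
}

# see_if() returns TRUE/FALSE instead of raising an error; on failure the
# message is attached as the "msg" attribute.
ok  <- see_if(is_probability(0.3))
bad <- see_if(is_probability(1.5))
bad_message <- attr(bad, "msg")

# assert_that() raises an error carrying the same message; try() keeps
# this demo script running.
failed <- try(assert_that(is_probability(1.5)), silent = TRUE)
```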