Would love to hear others' experiences using R in production. What has been working well and what have been the problems? Would you recommend it? We are currently using mostly Spark and Python in production, but exploring supporting R in production to lower the barrier to put things in production.
We use R in production and I don't see any evident difference compared to other languages (Python, for instance). We use packrat for dependency management and testthat for testing. Continuous integration works pretty well, and if you want to calculate test coverage you can use a dedicated coverage package.
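As a minimal sketch of what a testthat test file might look like (the function `clean_data` here is a made-up example, not something from this thread):

```r
library(testthat)

# Hypothetical function under test: drop rows that contain any NA
clean_data <- function(df) {
  df[stats::complete.cases(df), , drop = FALSE]
}

test_that("clean_data drops rows containing NAs", {
  df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
  cleaned <- clean_data(df)
  expect_equal(nrow(cleaned), 1)  # only the first row is complete
  expect_equal(cleaned$x, 1)
})
```

Files like this live under `tests/testthat/` in a package and run on every CI build via `devtools::test()` or `R CMD check`.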
Some people believe that R is a strange sort of programming language, but that's not true at all; it depends on how you work with it.
As @amarchin has pointed out, the packrat package is very helpful and the way to go for getting R working in a production environment.
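The basic packrat workflow is short enough to sketch (the project path is a made-up example); these are one-off setup commands with filesystem side effects rather than a runnable snippet:

```r
# One-time setup in the project directory: creates packrat/ and a
# private, per-project package library
packrat::init("~/projects/my-analysis")  # path is illustrative

# After installing or updating packages, record the exact versions
# in packrat/packrat.lock
packrat::snapshot()

# On the production server, reinstall exactly those recorded versions
packrat::restore()
```

The lockfile is what you commit, so the server rebuilds the same library the analyst developed against.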
However, I have faced some problems previously when trying to configure R on an external server. On older versions of certain operating systems, such as CentOS, I found that:
- The latest available pre-compiled R version was 2.15. Since these systems also ship older C compilers, compiling and installing a newer R version yourself can be a nightmare.
- This also implies that installing packages that must be compiled depends on the available C compiler; if it's not up to date, many newer packages fail to install.
I guess this is a highly unlikely situation nowadays, but it's the one I faced some four years ago. Just be patient and everything will be working fine within a few days.
On the other hand, my work primarily concerns processing neuroimaging (MRI, PET, etc.) studies. That is, the entities I analyze are 3D arrays of size 256x256x128. Keeping many of these arrays around as intermediate results in our pipeline has forced us to provision large amounts of RAM and to use task schedulers to prioritize some jobs over others, avoiding memory conflicts.
In my experience, although R is seen as a strange programming language by some of my colleagues, mostly because it's not strongly typed, it allows rapid development of analysis solutions, and that is what makes all the work of setting up the production environment worthwhile.
Weak typing is typical of scripting languages, and it can certainly be scary if you come from C++ or Java and want to put something into production.
However, looking at what is happening with purrr, for example, I think there is now more attention paid to types. A lot of functions let people write code that handles types consistently, and this is awesome.
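A small illustration of what type-consistent functions look like in purrr: the typed `map_*` variants promise the type of their result, instead of letting it depend on the data the way base `sapply()` does.

```r
library(purrr)

# map_dbl() guarantees a double vector: each element of the result must
# be a length-one double, otherwise it errors instead of silently
# coercing to a list or another type
squares <- map_dbl(1:4, ~ .x^2)
squares
#> [1]  1  4  9 16
```

If the function returned, say, a character value for one element, `map_dbl()` would fail loudly rather than hand downstream code a surprise type.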
I saw a presentation by Robert Krzyzanowski, the Director of Data Engineering at the online consumer lender Avant, about how they deploy machine learning models with R.
The software framework is called Syberia.
Thank you for sharing this! Extremely helpful, and it gives a lot of interesting ideas.
That's great to hear. Without getting into the specifics, is the data you manage very large?
Not much for this project; the order of magnitude is GB (not TB or more), so we don't need SparkR or sparklyr in this specific case.
We're using R in production through Docker containers and packrat. Docker containers for R can be quite large compared to other, non-data-science apps, but it's the same issue with Python. I recommend trying it out first on a smaller project that you don't need permission for and seeing how that goes. Success is usually easier to sell than uncertainty.
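A sketch of what such a Dockerfile might look like, assuming a packrat-managed project (the base-image tag and file layout here are illustrative, not from this thread):

```dockerfile
# Versioned R base image from the rocker project (tag is an example)
FROM rocker/r-ver:3.4.3

WORKDIR /app

# Copy the lockfile first so the dependency layer is cached between
# builds and only rebuilt when packrat.lock changes
COPY packrat/packrat.lock packrat/packrat.lock
RUN R -e "install.packages('packrat'); packrat::restore(project = '/app')"

# Then copy the application code itself
COPY . /app

CMD ["Rscript", "main.R"]
```

Restoring dependencies before copying the source is what keeps rebuilds fast despite the large base image.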
Not much experience myself, but the Stitchfix data science team recently wrote a blog post about their experiences with this. There's a few things to watch out for, but lots of workarounds.
We're using Docker to create immutable artefacts, then using unit-test frameworks for Docker for CI/CD. Seems to be working well.
I've really appreciated all the work that has gone into the rocker Docker images as basic building blocks. They're solid and friendly R development environments to work in, both locally and in production.
If you use a lot of additional packages, working with up-to-date versions of R is critical.
Another gotcha to look out for: Rscript doesn't load methods by default, while an interactive R session does. This can lead to some unexpected errors.
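A cheap safeguard is to attach methods explicitly at the top of any script that will be run via Rscript, so the script behaves the same under cron as it did in the console:

```r
# Rscript historically did not attach the methods package by default,
# unlike an interactive session, so S4 helpers such as is() could fail
# when a script moved from the console to a scheduled job.
# Attaching it explicitly makes the script robust either way.
suppressPackageStartupMessages(library(methods))

# is() lives in methods; with the package attached this works under
# Rscript as well as interactively
stopifnot(is(1L, "integer"))
```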
We used R in production on web-scale data. The pattern I liked was to develop analysis scripts as packages, which were then called by the server on a cron job. So the server would install the package from GitHub and then call package::run_script(), a kind of top-level execution function for the analysis job. This handled most things, except when different analyses required different versions of the same package. We mostly avoided that problem by having all analysts use the same package versions, but that's not tenable for larger organizations.
The main benefits of this pattern were:
- It reduces devops work because they don't need to worry about installing R dependencies
- You get access to the R package tools for testing and documentation
- Analysts own the full analytical workflow which helps them learn about and avoid deployment problems
- It's a more lightweight solution than packrat
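A minimal sketch of that pattern (the package, function, and repository names here are made up for illustration): the package exports a single entry point that owns the whole job.

```r
# Exported entry point of a hypothetical analysis package: it runs the
# full job -- fit a model and persist the results for downstream use.
run_script <- function(output_dir = tempdir()) {
  # Stand-in analysis: a linear model on a built-in dataset
  fit <- stats::lm(mpg ~ wt, data = mtcars)
  result <- summary(fit)
  saveRDS(result, file.path(output_dir, "model_summary.rds"))
  invisible(result)
}
```

The server's cron entry would then install and invoke it, along the lines of (repository name hypothetical):

```
0 2 * * * Rscript -e 'devtools::install_github("org/analysispkg"); analysispkg::run_script()'
```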