Questions about R in production


#1

Split from What are the main limits to R in a production environment?


I guess no.

A few months ago, a friend asked me if I could build two dynamic dashboards for his e-commerce business, which operates in 5 countries and has something like 50k+ active users every second on each server. He thought that since I develop Shiny dashboards, this was something I could do. I knew that even with Shiny Pro, async and so on, Shiny cannot handle that many live customers at once.

So I redirected him to a team of JavaScript developers.

Somehow I feel R is meant to be run within the walls of an organization, where if something crashes you can say "wait, I will fix it". Not on a live server for thousands of clients.

And to top it off, RStudio is reinventing the wheel by creating replicas of old packages with a focus on usability and not on speed.

I mean, why do you need a tibble when you have data.table? Why do you need glue when you have paste, and map when you have lapply? And why on earth would you create an entirely different API system called plumber when you could simply have helped openCPU write cleaner syntax?

I would totally agree if these packages were substantially faster, but all this does is create redundant packages without improving speed, which is almost the primary concern in production.

So I started learning Go too, so that I can prototype something in R and put it into production with Go and JavaScript. I love R too. Last week I helped a colleague create a huge Excel sheet with a lot of numerical calculations: bins, cuts, group-bys, averages and so on. The numerical part took us 2 days in R. Had it been Go, it would have taken 2 weeks at least.

But somehow I use it like Excel and PowerPoint, not like C. I haven't seen or heard of any big system implemented in R, and nobody has ever stood up and said yes, it can be done without rewriting everything in Rcpp (I mean with base packages).

So I guess that's one of the main limits of R in a production environment.

Does anybody share the same opinion?


#2

I am not so sure about this (though I can say I haven't tried it yet). It seems to me that if you have an R API (e.g. using opencpu) you should be able to scale it horizontally as much as you need by throwing more hardware at it, though at that point one might question why not reimplement your model/computations in something more performant (say, C#, Java, Go or whatever). That being said, you could probably make similar observations for Python or any other interpreted language for that matter.

Well, these are quite different things that you are throwing in together. For instance, glue offers much more functionality than simple paste. Similarly, the purrr family of functions offers much more functionality than the basic lapply. I was also very skeptical of purrr at the beginning because it seemed to reinvent the wheel but boy, I was wrong. Purrr is an amazing package with a lot of added functionality.
I haven't used plumber myself but it seems to adhere to a different, more "low threshold" philosophy for turning R functions into callable APIs. That being said, I agree that opencpu is a great tool which would benefit from more RStudio support; I think if more people worked on improving it (other than Jeroen Ooms) we would have a serious R contender to the Python Flask framework.
But on glue and purrr you're dead wrong, I'm afraid.

I guess one thing which is not clear in your complaint is what you mean by "speed". Base R is not the fastest thing, granted, but dplyr and (even more) data.table are as fast as anything you can find in e.g. Python. It is true that R has some limitations (single-threadedness) that make it awkward to use for e.g. a web API, but I don't think you can tackle these limitations, or the slowness of base R, at the level of packages; they seem to be things that you can only really tackle at the lowest levels of R's implementation (I might be wrong on this though).

Sure, but what's the problem with that? That's not necessarily a problem with R. You would have to do that with python too, if you had a website with a lot of concurrent visitors (I admit python would handle more users/traffic than R, but up to a point).

Yeah but, R was developed with totally different concerns from C. With the same reasoning you could say: "I use C like I use C# and Java. Not like R and excel and I haven't seen any data analysis/logistic regression implemented in C other than with thousands of lines of code that it would take weeks to write only to discover that maybe Bayesian models are better, and let's restart this C circle of hell from scratch..."

The point is, use every tool for the job it was designed for. R wasn't designed for fast scalable web APIs, in the same way in which C wasn't designed for fast data and model exploration.
I do agree that R is behind Python when it comes to "clout as a production ready language" and not only as an analysis language, but I think that's just because most developers are more familiar with Python than R. In any case, RStudio is doing a lot to make R gain more "production clout"; just look at sparklyr for an example.



#3

These are the two biggest reasons R isn't used as an "enterprise" "production" language. Sure, you could rewrite your models in another language, but why would you? That's a lot of developer time. In most cases, it's possible to bolster computing resources. RAM's expensive these days, but developers can be costlier.

There's no rule you must use only a single language for a project. It doesn't always make things easy or performant. It's often best to have one language "in the driver's seat" invoking scripts or programs, giving input and reading output. R isn't going to be an enterprise-running language, but that doesn't mean it has no place in an enterprise.

As for R's poor performance, this is often the "fault" of the code writer. If you want faster calculations, use an array or sparseMatrix. Use basic indexing or assignment instead of filter and mutate.
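
As a rough illustration of the "basic indexing or assignment" point, here is a minimal base R sketch. The data frame `sales` and its columns are invented for the example, and whether this style is actually faster for a given workload is something to measure rather than assume.

```r
# Hypothetical data, invented for this illustration only.
sales <- data.frame(
  region = c("EU", "US", "EU", "ASIA"),
  units  = c(10, 5, 8, 12),
  price  = c(2.5, 3.0, 2.5, 1.5)
)

# Plain assignment to a new column (roughly the job of mutate()).
sales$revenue <- sales$units * sales$price

# Logical indexing to keep some rows (roughly the job of filter()).
eu_sales <- sales[sales$region == "EU", ]

# Update only the cells selected by a logical index.
sales$price[sales$region == "ASIA"] <- 2.0

print(eu_sales)
```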

Most R code is written to be easy to understand and modify. It presents the steps in an analysis like the methodology section of an article. Performance isn't always a concern. So there are a lot of R coders who don't know how to write efficient or fast code. But that doesn't mean it's impossible or even difficult. They just need to learn how it's done.

I will agree that some packages are just reinvented wheels. I think of purrr as a repackaging of base higher-order functions with a uniform style for names and arguments. I'm perfectly happy with vapply, mapply, lapply, Reduce, and Negate.
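
For readers less familiar with the base functions named above, a small sketch of each follows. The input list `x` and the helper `is_odd` are made up for the example.

```r
# Hypothetical input, invented for this illustration.
x <- list(a = 1:3, b = 4:6, c = 7:9)

# lapply: apply a function to each element; always returns a list.
squares <- lapply(x, function(v) v^2)

# vapply: like lapply, but the type and length of each result are declared.
totals <- vapply(x, sum, numeric(1))

# mapply: walk over several vectors in parallel.
sums <- mapply(function(u, v) u + v, 1:3, 4:6)

# Reduce: fold a collection into a single value.
overall <- Reduce(`+`, totals)

# Negate: build the logical opposite of a predicate.
is_odd <- function(n) n %% 2 == 1
evens <- Filter(Negate(is_odd), 1:10)
```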


#4

Thanks for taking your valuable time in writing such a detailed answer.

But I would like to add only one thing.

The question was whether we could use R in production. I answered no, I have never heard of such a thing, except maybe in a few unique scenarios.

I wasn't comparing it to Python at all. Not even JavaScript. But somehow these languages have been able to develop Flask, Django, Node etc. We are not production ready. That's all I said, and that's all I meant. I was not comparing R with them.

And by rewriting I meant cases like speedglm: when they rewrote the lm function they made it 5 times faster. It was worth learning and could be used in production, saving some RAM. When data.table rewrote the data frame, they made it very fast. So did ranger, matter, xts, openCPU etc.

But this entire tidyverse philosophy is training new R users that R is easy and fun, so that it can increase industrial usage. I am not saying it's wrong. I am saying it's not moving in the direction of production-ready R.

And I agree that packages like modeldb, dbplot, DBI, odbc etc. are moving in the right direction. What I am saying is that not all of them are.

If I could prove to someone that R uses 64 GB of server RAM where Java would use only 42 or 52, I could convince him to go into production with R. But we are talking about a much bigger difference than that, and servers do cost money. A programmer costs you once; a server costs you every month for as long as it is running.

I know this because my firm builds software for industry, and so far we have used R only 4 or 5 times, and even in those cases there were fewer than 50 users. I am an analyst, and I tried to convince them to use R more, but I couldn't even convince myself.

But not to worry. I am sure things will change and I will be the first one to admire that. I was just giving my point of view. Outside this community I fight for R, but inside it I want to tell the truth about how I actually feel. And I cited my own experiences in the previous post. So please don't get me wrong. We are in the same boat.



#5

I think this is an interesting point. It seems to me that even within R (i.e., without comparing R to other languages) there is a tension between performance and clarity of the code. I'm not saying this is an either-or, but often it seems like one. My guess is that most R coders emphasize clarity because when you develop an analysis or a model, well, you want to make sure all the complicated steps you apply are as transparent as possible. Which makes sense.
On this point, I agree with Hadley Wickham's maxim that you should first write code for clarity, and then refactor parts for performance when you see you need it. I think it's a good approach.

But what would be the base R equivalent of something like purrr::map_df?


#6

Actually, one of the things that I think would improve R's standing as a production-ready language quite a bit, and that is related to what you're saying, is a more "uniform" way to do things. Think of machine learning: in Python you have scikit-learn, which gives you a very good uniform interface to a lot of modelling techniques. In R you also have packages to do everything that scikit-learn does, but they are scattered, with duplicates, and of varying quality and performance.
I think on this point R would really benefit from more efforts towards standardization in a way that goes beyond CRAN's task views. The tidyverse is one such effort towards standardization of data wrangling tools, so in that way it does make R more production ready, I believe. What I would really love would be for the data.table and tidyverse people to merge their efforts into one single package that would combine both the usability of dplyr and the performance of data.table. Again, the issue here is one of standardization; if you ask a Python data scientist, the stack for data wrangling is very clear: numpy + pandas. There are no choices or debates to be had there. I think a similar standardization would be good for R, but I can't see it happening soon, I'm afraid.


#7

There is dtplyr, that tried to do just that, but considering that it hasn't been touched in a year, I would say there were some obstacles.


#8

From memory, dtplyr does not have the performance of data.table because there are fundamental differences to how they are designed.

I don't think there are any prospects of uniting dplyr and data.table. Maybe future versions of R can raise the performance level across the board.

Instead we should consider ourselves lucky that we have the choice of two great alternatives to base R (as well as some further options), rather than only the one or none at all.


#9

I strongly disagree with this. I think a single standard choice for such basic tasks would be much better, as in the python world; possibility of choice is not always good and not valuable by itself.
For instance, I have found myself in cases in which I started a project with dplyr only to find that for certain tasks I had to switch to data.table because dplyr wasn't performant enough. So I end up with a codebase which is a mix between dplyr and data.table, which works but is not ideal. I'd rather have one standard language and use that across the board.


#10

Ok, I do agree to an extent in that I have also juggled between the two and it's not an easy choice as I like both. However, you had the option to switch to data.table to improve the performance of your project, which is surely better than not having had that option at all. Presumably base R would not have had sufficient performance either.

There are also plenty of people who prefer the concise data.table syntax, which looks more like base R, than the verbose and expressive dplyr. If there were only such a thing as dplyr syntax with data.table performance, then those people would miss out.

Anyway, I'll leave it at that.


#11

Premature optimization is the root of all evil.
- Donald Knuth

I didn't know about map_df. There's no 1:1 equivalent in base R, but I've always done it like this:

dfs <- lapply(x, f)
mydata <- do.call(rbind, dfs)
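
To make the two lines above concrete, here is a self-contained sketch. The definitions of `x` and `f` are hypothetical stand-ins, chosen only so the snippet runs on its own.

```r
# Hypothetical inputs standing in for `x` and `f` above.
x <- c(1, 2, 3)
f <- function(i) data.frame(id = i, square = i^2)

dfs <- lapply(x, f)            # a list of one-row data frames
mydata <- do.call(rbind, dfs)  # stack them into a single data frame

print(mydata)
```

One difference worth keeping in mind: rbind on data frames expects every piece to have the same column names, so this pattern only works when `f` always returns the same columns.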

#12

I think having both data.table and the tidyverse is fine. Once you get comfortable with data.table syntax, it's an amazingly clear way to update a central table. The tidyverse's most valuable, though not most cited, package is tidyr. I don't know of a clearer way to write code overhauling entire data structures.

I vaguely remember @JohnMount at Win-Vector blogged about making a package with a more visual alternative for reshaping data, but I can't remember its name.

I'm not very familiar with scikit-learn, but the documentation for the different types of models shows they have different interfaces. Which makes sense.

  • I don't see the "scattered" state. CRAN will accept any package that meets their requirements, which reflect where every user's bar should be. Because of this, any production quality package should be acceptable on CRAN. Packages in development can go on GitHub for free (whether it's their primary source or just a mirror). If a package isn't on CRAN or GitHub, that's the choice of the developer.
  • CRAN enforces quality. By "quality," I mean they pass all CRAN checks (which are really strict). The statistical quality can vary, but CRAN requires all packages have a way for users to submit bugs. And packages are open source, so you (and everyone else using the package) can check their work.
  • Duplication can be fine, as long as it produces better alternatives in the long run. Still, it would be nice if people were quicker to submit patches than reinvent packages.

#13

The visual re-shaping is cdata, which has a writeup here. We are going to be teaching this more and more and think it will become important.

From the outside, dtplyr looks like it is abandoned. It also does a lot of copying (not just at the start/end of a pipeline), so it does not represent data.table-level performance at all. I'd say give data.table a try (it is well worth it). If you insist on a non-bracket piped operator notation (data.table is in fact already piped or method-chained), give rqdatatable a try.

BTW it is a huge mistake to only consider official tidyverse components. CRAN has a lot of high quality offerings that are not "tidyverse".


#14

I am an expert neither in scikit-learn nor in R's caret or mlr. However, when I compare what I've seen so far, I must say that from a machine learning perspective scikit-learn is the clearest, and I really wonder why there is still no port of it in R. After this presentation it also became obvious why nowadays far more than 80% of Kaggle scripts use Python (at least this is what I counted in a kernel competition on a basic classification task some time ago).


#15

A very interesting story: 2 years ago one of my colleagues had to choose whether to learn R or SAS. He started by calling base read.csv on a 900-something MB file on his 8 GB laptop. It took him a while to load and manipulate the data, and then he did the same in SAS installed on his computer. It was faster and better. He spent the entire day learning SAS by trial and error. In the evening he asked me, "Why do you, or anybody, use R? It's so slow." Then I showed him data.table and other packages.

He stayed with R not for clarity but for speed, and since then we have been discussing the data.table way of doing things. It's a lot like SQL, so it was very easy to learn. I've been using R for almost 5 years and I still don't know dplyr completely. The melt and dcast functions are also easier to work with than their tidyr counterparts. But data.table does not have support for nested list columns. In theory it does, because every column is a list (I have tried it), but it doesn't have functions to make use of them. That's the reason I wish they could work together.

Analysts need clarity and software needs speed.

That's the reason Scala is still in business.

R has bindings to all the languages but Go; I wish it had one. Go doesn't have data frames and analysis, and R doesn't have compiled binaries. I think both could benefit from one another.

I wish R could have a Go binding and an MVC web framework.


#16

There is this - https://github.com/rstats-go by Romain Francois, so there is some work being done in this direction.


#17

Thanks for sending the link. But when you visit it, it seems that it was never completed.

Is there any blog or documentation describing how to use R in production for the fastest and safest results?


#18

I'm not sure about the functions to support nested list columns. You can have a list column and operate on it; the syntax just often needs some tweaking, and sometimes you need some creativity. I think (1) using variables for column names and (2) list column operations are two awkward aspects of data.table; at the least there should be more examples and documentation.
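
A small sketch of both points, assuming the data.table package is installed. The table `DT`, its columns and the variable `col` are made up for the example, and this is only one of several ways to write it.

```r
library(data.table)  # assumes the data.table package is installed

# Hypothetical table with a list column `vals`.
DT <- data.table(id = 1:3, vals = list(c(1, 2), c(3, 4, 5), 6))

# List column operations: work on each element of the list.
DT[, n := lengths(vals)]
DT[, total := vapply(vals, sum, numeric(1))]

# Variable for a column name: wrap it in () on the left of :=
# and use get() to read the column on the right.
col <- "total"
DT[, (col) := get(col) * 2]

# Extract the column named in `col` as a plain vector.
doubled <- DT[[col]]

print(DT)
```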