Hello! I am new to the RStudio community (but not to RStudio, of course :)).
I am writing this post to get some perspectives from you people on the deployment of R machine learning/predictive models in a production environment. Up to now I have used R only to do exploratory data analysis, reporting, model selection and so forth, but all these activities are 'static' in the sense that they allow for no or limited automation in a production setting. If I wanted to deploy my R models in production, what would be my best option?
Here is a more concrete case I am working on. Let's say I want to forecast pageviews. I have a development cloud server where I run RStudio, periodically import batches of pageviews data from (say) Google Analytics, train/retrain/score/explore various models using the forecast package, and select the best-performing model - say it's model M. Good job.
Now I want to deploy this model M on a production server so that it can predict on new data that is thrown at it - e.g., forecast pageviews for the next few days, for different geographical locations, etc. The ideal solution would be something that is:
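To make the "train several models and pick the best one" step concrete, here is a minimal sketch of a holdout comparison. It is only an illustration under stated assumptions: the pageviews are simulated rather than pulled from Google Analytics, base R's `stats::arima` stands in for the forecast package so that the snippet has no dependencies, and the names `forecasters`, `rmse` and `best` are hypothetical.

```r
# Simulated daily pageviews (hypothetical data, NOT real Google Analytics output):
# linear trend + weekly pattern + noise.
set.seed(42)
n <- 140  # 20 weeks of daily observations
day <- seq_len(n)
pageviews <- 1000 + 2 * day + 100 * sin(2 * pi * day / 7) + rnorm(n, sd = 30)

# Hold out the last 14 days to compare the candidates.
h <- 14
train <- ts(pageviews[1:(n - h)], frequency = 7)
test <- pageviews[(n - h + 1):n]

# Each candidate takes a training series and a horizon and returns h forecasts.
forecasters <- list(
  # Baseline: repeat the last observed week.
  seasonal_naive = function(y, h) {
    last_week <- as.numeric(tail(y, 7))
    rep(last_week, length.out = h)
  },
  # Seasonal ARIMA from base R, used here as a stand-in for the forecast package.
  seasonal_arima = function(y, h) {
    fit <- arima(y, order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 7))
    as.numeric(predict(fit, n.ahead = h)$pred)
  }
)

# Score every candidate on the holdout and keep the one with the lowest RMSE.
rmse <- sapply(forecasters, function(f) sqrt(mean((f(train, h) - test)^2)))
best <- names(which.min(rmse))

print(round(rmse, 1))
cat("Selected model:", best, "\n")
```

Whichever candidate wins here would play the role of model M in the rest of the post.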
(2) scalable: it should be fast and robust enough to handle many predictions without breaking
(3) easy: it should be as straightforward as possible for someone who has limited support from developers/devOps
It seems to me that the scenario I am describing is extremely common, and that it is one of the most important challenges, if not the single most important one, facing R (and, to a much lesser extent, Python; see below). I have searched extensively online for the options at my disposal, and these seem to be:
(1) Get the developers to translate the model to Java/C/whatever. This is of course terrible, as it does not satisfy conditions (1) and (3) above. It is perhaps good for big corporations like Google, which can afford to have a whole team of engineers take the models from the data scientists and optimize everything in C++ and whatnot, but for most companies/scenarios this is impractical; moreover, this approach is feasible for Python (which is a language that engineers know) but much less so for R (which is not used in development).
(2) Make use of proprietary environments/platforms that make it easier to deploy R models, such as Microsoft ML Server/SQL Server ML Services or platforms like https://www.dominodatalab.com/ and similar. Some of these services require you to already have some type of infrastructure (e.g. SQL Server) where all your data is stored, which makes them inflexible if your model takes heterogeneous data from multiple sources. Platforms like Domino, it seems to me, make you pay for something that you can do yourself (see point 4 below), which might be good as they free you from the hassle, but then they do not constitute a different method of deploying R models.
(3) Deploy the model by serializing it to an .rds file, moving that file to the production server, predicting on new batch data that comes in (say) every day, and returning the data frame/JSON object containing the predictions for further processing. This approach works for quite a few use cases, but it has many drawbacks, in particular that you can only score batches of data, which makes it quite inflexible.
(4) Deploy the model as a micro web service/API on a cloud production server, which takes HTTP requests with input data and returns the predictions as, say, JSON. This, it seems to me, is perhaps the best and most flexible approach, since developers can request predictions without understanding R, and it can easily be adapted to any other predictive model by writing a small API for each model. The issues here come from scalability, of course, which is made worse by the fact that R is single-threaded. It seems that the following packages can expose an R model as a service:
(1) Plumber. Seems under active development, but I am not sure how stable it is. Scalability could be dealt with by running many R processes in different Docker containers using Kubernetes (see here, and the posts below).
(2) OpenCPU: seems pretty solid and tested, see here, here, and here. Single-threadedness is dealt with by starting a new process for every request, keeping RAM and CPU usage in check.
For both of the options above, my biggest worry is scalability. For instance, here it is said that OpenCPU worked quite well, but in the end they switched to Python because it's a "more proper programming language" (whatever that means), while here it is said that OpenCPU scales reasonably well but not for intensive websites.
(5) A final option would be to go for the above option (web service/API) but switch partially or wholly to Python in production. A partial switch would look like this: set up an API using Flask + gunicorn or Django and run the R models using rpy2. A full switch would take the same route but just run the Python equivalents of the R models. My main question here is whether Flask + gunicorn (+ rpy2 if needed) will scale better than OpenCPU/Plumber.
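For reference, options (3) and (4) above can be sketched together in a few lines of R. This is a rough illustration, not a production setup: the training data is simulated, `stats::arima` again stands in for a forecast-package model, and the file name `model_M.rds`, the `/forecast` route, and the functions `forecast_handler` and `run_api` are all hypothetical names. The only Plumber calls used are `pr()`, `pr_get()` and `pr_run()`, and they sit inside a function that is never called here, so the script runs without Plumber installed.

```r
# --- Development side: fit the model and serialize it (option 3) ---------
# Simulated stand-in for the real pageview history (hypothetical data).
set.seed(1)
day <- seq_len(140)
history <- ts(1000 + 100 * sin(2 * pi * day / 7) + rnorm(140, sd = 30),
              frequency = 7)
fit <- arima(history, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 7))

model_path <- file.path(tempdir(), "model_M.rds")  # hypothetical location
saveRDS(fit, model_path)

# --- Production side: load once at startup, then predict on demand -------
model <- readRDS(model_path)

# Plain R function that does the scoring. Query-string parameters arrive
# as strings, so `h` is coerced and validated before use.
forecast_handler <- function(h = 7) {
  h <- suppressWarnings(as.integer(h))
  if (is.na(h) || h < 1 || h > 90) {
    stop("h must be an integer between 1 and 90")
  }
  pred <- predict(model, n.ahead = h)
  list(
    horizon = h,
    forecast = as.numeric(pred$pred),
    std_error = as.numeric(pred$se)
  )
}

# Batch use (option 3): call the function directly from a scheduled script.
batch_result <- forecast_handler(7)
print(round(batch_result$forecast))

# Web-service use (option 4): expose the same function over HTTP.
# Requires the plumber package; not called in this sketch because it blocks.
run_api <- function(port = 8000) {
  api <- plumber::pr()
  api <- plumber::pr_get(api, "/forecast", forecast_handler)
  plumber::pr_run(api, port = port)
}
```

Calling `run_api()` on the production server would start a single R process answering GET requests such as `/forecast?h=14`. That process handles one request at a time, which is exactly where the scalability question comes in, whether the answer is several such processes behind a load balancer or the Python route in (5).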
I would like to get your perspective on this issue, which, as I said, seems quite topical to me for the future of the R language. What's the best way to go about this problem?