Hi guys,
@edgararuiz It can be either a script writing predictions to a backend database or an API. I've met people who said they switched to Python in production in both of these contexts, but then again the rest of their development stack was also Python, so that makes sense.
I have often found, however, that people who write R code are more accustomed to writing quick-and-dirty analysis scripts, R Markdown analyses or maybe Shiny apps than to writing code for a robust system to be deployed in production (@nick's anecdote fits this). I think this probably comes from the nature of the community: R is more prevalent among analysts and statisticians and, as Hadley mentions somewhere in his Advanced R, knowledge of software engineering best practices is "often patchy" (quoting from memory, but it's something along these lines). So it might be that the alleged unfitness of R in production (vs. Python, let's say) has little to do with the language itself and more to do with the expertise of the people using R. I think this is actually what underlies @sellorm's discussion in the slides he linked to above, although this might be changing, I don't know. I did wonder, though, whether there are additional features that could be added to the language to boost its attractiveness for production environments.
For my part, I haven't used R in production yet and certainly don't claim to be a fabulous software engineer, but I plan to set up a production system using R and these are the steps I'll follow:
1. Exploration and preliminary modelling using a static dataset in R notebooks. In this phase different possible models are explored and a shortlist is selected for final consideration.
2. Setting up an automated system (on a development server) for batch retraining of the shortlisted models and tracking their error over time (i.e., run a cron job every period to retrain and re-evaluate the models on new data and write their error measures to a file, so I can inspect them as time series). I am particularly concerned about this intermediate step because I want to make sure that my model stays accurate under changing conditions.
3. The best model selected in step 2 will then be deployed as a lightweight .rds file to score data and write to a backend DB. In the future I want to experiment with exposing the models as APIs too.
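To make the retraining and deployment steps above a bit more concrete, here is a rough base-R sketch of what I have in mind. Everything in it is a placeholder: `lm()` and the built-in `mtcars` data stand in for the real models and data, and `rmse()`, `retrain_and_log()`, the candidate formulas and the file paths are hypothetical names made up for illustration.

```r
# Hypothetical error measure; any other metric could be used instead.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Refit every candidate model on `train`, evaluate it on `test`, and append
# one row per model to a CSV log so the errors can be read as a time series.
retrain_and_log <- function(formulas, train, test, response, log_file,
                            run_time = Sys.time()) {
  fits <- lapply(formulas, function(f) lm(f, data = train))
  errors <- vapply(fits, function(fit) {
    rmse(test[[response]], predict(fit, newdata = test))
  }, numeric(1))
  log_rows <- data.frame(
    run_time = format(run_time, "%Y-%m-%d %H:%M:%S"),
    model = names(formulas),
    rmse = unname(errors),
    stringsAsFactors = FALSE
  )
  new_file <- !file.exists(log_file)
  # Write the header only when the log file is created.
  write.table(log_rows, log_file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
  list(fits = fits, errors = errors)
}

# Stand-in data and candidate models (assumptions, not the real system).
candidates <- list(small = mpg ~ wt, large = mpg ~ wt + hp)
train <- mtcars[1:24, ]
test <- mtcars[25:32, ]
log_file <- tempfile(fileext = ".csv")
model_file <- tempfile(fileext = ".rds")

# This part would live in the script that cron runs via Rscript.
run <- retrain_and_log(candidates, train, test, "mpg", log_file)

# Persist the model with the lowest error for the scoring job.
best <- names(which.min(run$errors))
saveRDS(run$fits[[best]], model_file)

# Scoring side: a separate script only needs the .rds file and new data.
model <- readRDS(model_file)
scores <- data.frame(
  id = rownames(test),
  prediction = unname(predict(model, newdata = test)),
  stringsAsFactors = FALSE
)
# Writing `scores` to the backend DB would happen here, for instance through
# a DBI connection; that part is left out because it depends on the database.
```

Keeping the retraining function free of hard-coded paths is deliberate: it makes it easy to move into a package and to test on small data before scheduling it.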
For both steps (2) and (3) I'm making heavy use of R packages. It seems to me that being able to write R packages should be a crucial skill for an R programmer, at least as important as (or, indeed, even more important than) using R Markdown and the like, as it makes code much more robust for future use.
I am still trying to figure out how to handle the CRAN packages that my production system will depend on, i.e. what the best practice is to minimize the chance that an updated package will break the system. I was thinking of Docker containers, but perhaps there's a simpler way to go? Any ideas?
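To illustrate the kind of breakage I'd like to guard against, here is a minimal base-R sketch that records the versions of the packages a job depends on and refuses to run if any of them has changed. It is only a detection check under my own assumptions, not a substitute for a container or a proper dependency-management tool; `snapshot_versions()`, `changed_packages()` and the CSV "lockfile" are hypothetical names, and `stats` and `utils` stand in for the real CRAN dependencies.

```r
# Record the installed version of each package the system depends on.
snapshot_versions <- function(pkgs) {
  data.frame(
    package = pkgs,
    version = vapply(pkgs, function(p) as.character(utils::packageVersion(p)),
                     character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Return the packages whose installed version differs from the recorded one.
changed_packages <- function(lockfile) {
  # Read versions as text so that e.g. "1.10" is not parsed as the number 1.1.
  recorded <- read.csv(lockfile, stringsAsFactors = FALSE,
                       colClasses = "character")
  current <- snapshot_versions(recorded$package)
  recorded$package[recorded$version != current$version]
}

# Done once, when the system is known to work.
deps <- c("stats", "utils")  # stand-ins for the real dependencies
lockfile <- tempfile(fileext = ".csv")
write.csv(snapshot_versions(deps), lockfile, row.names = FALSE)

# Done at the start of every scheduled run.
drift <- changed_packages(lockfile)
if (length(drift) > 0) {
  stop("Package versions changed: ", paste(drift, collapse = ", "))
}
```

This only tells me that something changed; it does not restore the old versions, which is where containers or a pinned package library would come in.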
Ciao,
Riccardo.