I work for a large enterprise. We started small and used R in much the same way you describe, but we quickly realized it was not a good fit for us: complex ETL processes with advanced analytics were heavy for R to handle reliably, and maintainability of the code was a real nightmare.
Without going into too much technical detail, we had speed issues, inefficient process execution, and complex pipelines that were hard to orchestrate with R.
As a next step we moved to a different language and framework (Spark with Scala), and we immediately noticed the difference: Spark does far less reading and writing to and from disk, keeps intermediate results in memory, runs tasks in parallel across threads, etc.
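To give a feel for the "parallel tasks on in-memory data" idea without pulling in Spark itself, here is a minimal plain-Scala sketch using `scala.concurrent.Future` from the standard library; the `transform` step and the sample data are purely illustrative, not from our actual pipeline.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ParallelSketch {
  // Hypothetical per-chunk transformation: double every value and sum the chunk.
  def transform(chunk: Seq[Int]): Int = chunk.map(_ * 2).sum

  def main(args: Array[String]): Unit = {
    // In-memory partitions of the data, processed concurrently on the
    // default thread pool instead of being written back to disk between steps.
    val chunks = Seq(Seq(1, 2), Seq(3, 4), Seq(5, 6))
    val futures = chunks.map(c => Future(transform(c)))

    // Gather the partial results and combine them.
    val total = Await.result(Future.sequence(futures), 10.seconds).sum
    println(total) // 6 + 14 + 22 = 42
  }
}
```

Spark takes this much further (resilient partitioning, scheduling across a cluster, caching), but the core win over our old R setup was the same: keep intermediate data in memory and fan work out across threads.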
In terms of Scala versus R, we noticed a pretty interesting improvement in the maintainability of our code. Java and Scala, being strongly typed and compiled, are great languages for large-scale projects.
It’s true that it will take you longer to code in them than in Python/R, but maintenance and onboarding new developers will be easier, at least in my experience.
Data is modeled with case classes, and you get proper function signatures, proper immutability, and proper separation of concerns.
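As a small sketch of what that buys you, here is a hypothetical case-class model with a typed, pure transformation; the record fields and the `totalDurationByUser` function are made up for illustration, not taken from our actual schema.

```scala
object CaseClassSketch {
  // Immutable, typed record: the compiler enforces the shape of the data.
  case class ClickEvent(userId: String, url: String, durationMs: Long)

  // The signature alone documents what goes in and what comes out.
  def totalDurationByUser(events: Seq[ClickEvent]): Map[String, Long] =
    events.groupBy(_.userId).map { case (user, evs) =>
      user -> evs.map(_.durationMs).sum
    }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      ClickEvent("a", "/home", 100L),
      ClickEvent("a", "/docs", 50L),
      ClickEvent("b", "/home", 30L)
    )
    // Records are never mutated; "changes" produce new values via copy().
    val longer = events.head.copy(durationMs = 200L)
    println(totalDurationByUser(events)) // a -> 150, b -> 30
    println(longer.durationMs)           // 200
  }
}
```

Misspell a field or pass the wrong type and the build fails, which is exactly the kind of error that used to surface at runtime, deep inside an R pipeline.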
But if you don’t have the time or desire to work with them all, this is what I would do:
- R: Good for research, plotting, and data analysis.
- Python: Good for small- or medium-scale projects to build models and analyze data, especially for fast-moving startups or small teams.
- Scala/Java: Good for robust programming with many developers and teams; they have fewer machine-learning utilities than Python and R, but make up for it with better code maintainability.
Finally, we use Airflow to orchestrate our pipelines, and all our computation and storage run on AWS (raw data in S3, data warehousing in Redshift).
This setup constitutes our data lake platform: we are able to ingest, process, and make available huge quantities of data from various sources with a very small team of developers, in a very cost-effective way.