My team has been tasked with developing a data warehouse from scratch for our institution. We have perhaps upwards of 100 on-premises data sources that will eventually be tamed and consolidated in a data warehouse. These are not high-frequency or long datasets, but rather wide and varied ones that are currently, in most instances, batch processed once a day for analysis purposes.
What we’ve been contemplating in the run-up to this task is: Is there any reason why we couldn’t handle the ETL in R?
Our team consists of R specialists, and our belief is that the local school-to-work pipeline will continue to produce capable R developers and analysts. So our thinking goes: with tools like httr we could extract data from the various data sources, transform it in the manner we are most capable with - in R - and load it into the data warehouse, perhaps orchestrating each process with the tools we use now, like drake. I figure RStudio Connect could handle the scheduled processes with ease, which would then leave “everything in one place” if that’s also where the analysis will be scheduled. Does that make sense?
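To make that concrete, here is roughly the kind of thing we have in mind for a single source - a small drake plan that extracts over HTTP, transforms in dplyr, and loads via DBI. The URL, DSN, table name, and the assumption of a tabular JSON API are all placeholders for illustration, not a finished design:

``` r
library(drake)
library(httr)
library(dplyr)
library(DBI)

# Extract: pull one source over HTTP (placeholder URL; assumes the API
# returns tabular JSON)
extract_source <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)
  as_tibble(content(resp, as = "parsed", simplifyVector = TRUE))
}

# Transform: the kind of cleaning we'd be doing in R anyway
transform_source <- function(raw) {
  raw %>%
    rename_with(tolower) %>%
    mutate(loaded_at = Sys.time())
}

# Load: write the cleaned table into the warehouse (placeholder DSN)
load_to_warehouse <- function(clean, table_name) {
  con <- dbConnect(odbc::odbc(), dsn = "warehouse")
  on.exit(dbDisconnect(con))
  dbWriteTable(con, table_name, clean, overwrite = TRUE)
}

plan <- drake_plan(
  raw    = extract_source("https://example.org/api/source1"),
  clean  = transform_source(raw),
  loaded = load_to_warehouse(clean, "stg_source1")
)

make(plan)  # a scheduled script on RStudio Connect would simply run this
```

Each of the ~100 sources would get its own small plan along these lines, with Connect handling the schedules.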
So we’re wondering if anyone out there has gone through a similar train of thought and either gone ahead with it or decided against going this route?
The question our CDO is putting to us, and that I would like to ask here, is: “Is there anyone who has experience building a data warehouse (or a virtual data warehouse) using R as the main ETL tool? What are the pros and cons? What pitfalls could lie ahead that I am not seeing - me being very biased towards using R? Has anyone, for example, used data lakes for storing raw data and then used R to transform and load that data into a structured (relational) database?”
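For context on that last question, this is the rough shape we’re imagining for the data-lake variant: raw files landing as Parquet in object storage, and a scheduled R job reshaping them into relational warehouse tables. The bucket path, column names, DSN, and table name below are invented purely for illustration:

``` r
library(arrow)
library(dplyr)
library(DBI)

# Raw zone: Parquet files in object storage (placeholder bucket/path)
raw <- open_dataset("s3://raw-lake/source1/")

# Transform with dplyr; arrow pushes the filter/select down before collect()
clean <- raw %>%
  filter(!is.na(id)) %>%            # column names here are invented
  select(id, value, updated_at) %>%
  collect()

# Load into the structured, relational layer (placeholder DSN and table)
con <- dbConnect(odbc::odbc(), dsn = "warehouse")
dbWriteTable(con, "dim_source1", clean, overwrite = TRUE)
dbDisconnect(con)
```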
Any and all thoughts highly appreciated!