Case studies of using R as ETL tool for data warehousing?

Hi everyone,

My team has been tasked with developing a data warehouse from scratch for our institution. We have perhaps upwards of 100 on-premise data sources that will eventually be tamed and consolidated in a data warehouse. This is not high frequency stuff and not long data, but rather wide and varied datasets that are currently, in most instances, being batch processed once a day for analysis purposes.

What we’ve been contemplating in the run-up to this task is: Is there any reason why we couldn’t handle the ETL in R?

Our team consists of R specialists, and our belief is that the local school-to-work pipeline will continue to to produce capable R developers and analysts. So our thinking goes that with tools like dbplyr , sparklyr, DBI and httr we could effectively extract data from the various data sources, transform it in the manner that we are most capable with - in R - and load it into the data warehouse, perhaps using the tools we use now, like drake for each process. I figure RStudio Connect could handle the scheduled processes with ease, which would then leave “everything in one place” if that’s also where the analysis will be scheduled. Does that make sense?

So we’re wondering if anyone out there has gone through a similiar train of thought and either gone ahead with it; or perhaps, decided against going this route?

The question our CDO is putting to us, and I would like to ask here is: “Is there anyone who has experience building a data warehouse (or a virtual data warehouse) using R as the main ETL tool? What are the pros-and-cons? What pitfalls could lie ahead that I am not seeing - me being very biased towards using R? Has anyone, for example, used data-lakes for storing raw data and then used R to transform and load data into a structured (relations) database structure?”

Any and all thoughts highly appreciated!

4 Likes

This is exactly what we do in my analytics team.

We have RStudio connect and aside from using it to host our applications, books, etc, we regularly use it to run automated ETL scripts.

Our data is all over the place - Salesforce, google sheets, SQL, Cloudera - I regularly write .rmd scripts that connect to multiple sources, use dplyr for wrangling and table creation, and then either write the resulting files locally for an application / Talend process to pick up, or simply use the pins package to make a pin for our application to pull from.

We just schedule them with the built in scheduler. Works great.

4 Likes

I am working for a large enterprise, we did start small and used R in a very similar way as you described however we quickly realized that R was not a good fit for us. Complex ETL process with advanced analytics was heavy for R to handle diligently. Maintainability of the code was a real nightmare.

Without going into too many technical details, we had speed issues, inefficient process execution, and complex pipeline to orchestrate with R.

As a next step we moved to a different language and framework (Spark with Scala) - we immediately noticed that Spark offers much less reading and writing to and from disk, multi-threaded tasks etc.

In terms of Scala versus R - we notice some pretty interesting improvement of the maintainability of our code. Java and Scala, with their mostly super-strongly typed and compiled features, are great languages for large-scale projects.

It’s true that it will take you a longer time to code in them than in Python/R, but the maintenance and onboarding of new data will be easier — at least in my experience.

Data is modeled with case classes. It has proper function signatures, proper immutability, and proper separation of concerns.

But if you don’t have the time or desire to work with them all, this is what I would do:

  • R : Good for research, plotting, and data analysis.
  • Python : Good for small- or medium-scale projects to build models and analyze data, especially for fast startups or small teams.
  • Scala/Java : Good for robust programming with many developers and teams; it has fewer machine learning utilities than Python and R, but it makes up for it with increased code maintenance.

Finally, we are using airflow to orchestrate our pipelines and all our computation and storage are made on AWS (raw data in S3, data warehousing in redshift).
This setup constitutes our data lake platform and we are able to ingest process and make available huge quantity of data from various sources with a very small team of developers in a very cost-effective way.

3 Likes

I think that @mic0331 gave a nice advice on why this can be a bad idea... So I just want to add my two cents on this discussion:

In the company I work the only ETL processes that we do in RStudio Connect are extractions from the data lake to avoid connecting directly with databases in R applications (for example, a Shiny app).

The ETL process needs to be wrapped in a Rmd, which is in my opinion far from ideal, and the scheduling is manual (although packages like connectapi can be used for doing things more programatically, but it's quite new...). Besides I don't think that you can do a proper load balance other than staggering RAM-intensive jobs, which is bad for us running applications in the same environment. For these reasons I am currently considering moving even these simple processes away from RSC...

In summary, as much as I like RStudio Connect for many things (and recommend it), I think it is not meant for large ETL processes and/or for handling a large amount of them.

2 Likes

@Hlynur I'm really thankful for this discussion! My response here is going to be obviously biased, but I hope it will also be rather informed. I'm like 92.5% percent confident in my response here, but i'd be happy for others to chime in as well.

At this stage in development we can think of RStudio Connect (RSC) as a really nice and fancy CRON scheduler (I'll definitely be adding @mic0331's vote for Airflow integration into RSC as a feature request!).

Your thoughts about using tools like sparklyr, dbi, and httr are 100% inline with what I would recommend. As @mic0331 points out, languages like Python and R and not going to scale in data analysis super effectively when you get to some rather massive data sizes. As such, by using dbplyr / dbi and its family, you will actually not be doing your work in R so to speak but rather in the database that houses the data. This will then actually adhere to @mic0331's recommendations—particularly if you're using sparklyr.

I have seen a lot of success in this approach particularly for institutions similar to yours. By being able to stay inside of R and use other tools such as spark (which is using a scala runtime) you will not have to learn new languages or frameworks, you will be able to use other languages via abstraction (R has a wonderful history as an interface language), and reduce the overall footprint of your RSC server by utilizing your databases and APIs (if using httr).

To @joseluizferreira's statement: you will not be able to load balance the etl via connect because the etl will actually not be occurring on RSC. This does not mean that you cannot configure a load balanced RSC cluster, however (see https://docs.rstudio.com/connect/admin/load-balancing/ for this. You'll need to change your DB from SQLite to Postgres DB). RSC will be sending the code over to the databases to do the heavy lifting. The data will only be returned as an R object if collect()ed. And if you want to be super extra conscientious about memory you do it all with dbplyr or spark_apply() functions.

TL;DR

  • Big ETL should be done by big data tools (e.g. databases & spark)
  • You can use big data tools via R with sparklyr, DBI, httr, dbplyr , etc. (see db.rstudio.com)
  • RSC can be load balanced.
  • RSC does not have DAG scheduling (though you can get really far with standard scheduling)
4 Likes

@josiah Nice to be wrong on that. :slight_smile: Thank you for the link!

1 Like

I didn't say it! You did! :wink: