What is RStudio's solution to workflow management systems?

I was reading this blog about workflow management systems (WMS).

The idea is a little foreign to me. I am having difficulty understanding the problem that Luigi and Airflow solve. My normal workflow involves using taskscheduleR to schedule tasks.

Also, do any analogs to these systems exist in R, RStudio Server, or RStudio Connect?

2 Likes

RStudio Connect has some workflow and scheduling features, but it's more of a publishing platform first.

I use Airflow for my scheduled tasks. When I started using Airflow, I really just wanted cron with a better UI plus some logging and email notifications. My Airflow jobs are all super simple: they just run a shell script that fires off an R job.

Airflow really shines when dealing with complex workflows with interdependent steps and heavy loads. Airflow stores its jobs as DAGs so it knows that certain steps can be run in parallel while other steps are dependent on things finishing before they start. And the centralized logging gives good info on what failed and where. This article is one person's experience in switching to Airflow, and it seems representative of many experiences I have heard. One of the neat aspects of Airflow is it can control a parallel queue of workers. So jobs that can be run in parallel get fired up on different workers and Airflow orchestrates the whole shebang.
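To make the "steps that can be run in parallel" idea concrete, here is a minimal sketch using only Python's standard library `graphlib`, not Airflow's own API. The step names (`extract`, `clean`, `model_a`, `model_b`, `report`) are hypothetical. It groups a small DAG into batches where every step in a batch has all of its dependencies finished, which is roughly the information a scheduler needs in order to hand steps to parallel workers.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key maps to the set of steps it depends on.
dependencies = {
    "clean": {"extract"},
    "model_a": {"clean"},
    "model_b": {"clean"},
    "report": {"model_a", "model_b"},
}


def parallel_batches(deps):
    """Group steps into batches whose dependencies are all already done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # also raises CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that could run right now
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unlocking dependents
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Here `model_a` and `model_b` end up in the same batch, so a scheduler could send them to two different workers, while `report` waits for both.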

My personal opinion is the web UI for Airflow is rough. Yet the underlying technology around workflow management is amazing. This is pretty typical FLOSS, I guess. Great engineering with bad UI seems quite common.

4 Likes

Workflow automation tools (or at least discussions, blog posts, and conference talks about them) are very popular now. I think Airflow represents a class of more general-purpose tools. See the Common Workflow Language for many more examples. There are a few tools more specific to R-based analytics:

  • drake R package: "a general-purpose workflow manager for data-driven tasks".
  • Unrelated Drake tool: "make for data".
  • MLflow: "An open source platform for the machine learning lifecycle" from Databricks. This one is probably the most famous, given that the project lead is also the lead of Apache Spark and there is a well-known company behind it. Originally Python-centric, MLflow has support for R, and the upcoming RStudio 1.2 has some means to integrate with it (https://github.com/rstudio/rstudio/pull/3301). I personally don't use it because MLflow projects require either Anaconda or (thanks to a very recent addition) Docker to be installed, and we use neither.

UPDATE: yet another workflow package https://github.com/Mu-Sigma/analysis-pipelines - composable, interoperable pipelines with R, Spark, and Python.

2 Likes

Airflow really shines when dealing with complex workflows with interdependent steps and heavy loads.

Could you describe what a complex workflow with interdependent steps and heavy loads looks like?

1 Like

We have seen many customers have success using R Markdown + RStudio Connect to automate simple workflows.

As an example, this R Markdown document pulls in some stock data, cleans it, and writes the results to a database: https://colorado.rstudio.com/rsc/content/1032/Portfolios_ETL.html

The document can be deployed to RStudio Connect and scheduled.

The benefits of this approach are:

  1. The deployment to Connect automatically handles creating an environment with the proper R packages and a matched version of R. This can be a pain in general-purpose workflow runners.

  2. The scheduling is easy, with a nice UI.

  3. Connect will email you if the task fails, and can optionally email you on success.

  4. Because we're using R Markdown, the ETL code is documented in-place, which is really handy. We can even create some quick graphs to visually check the results of our process over time. In Connect, you can scroll through the render histories.

The main limitation is that this scheduling does not account for DAGs. As an example, say you wanted to pull data, fit 10 different models, compare the model results, pull in some supplemental data, and then finally merge the results and supplemental data into a report. A DAG lets you represent each of those as a step, and also lets you arrange them in dependency order. The benefit is that tools that run DAGs are often lazy. For example, if one of the models fails and you then restart the process, a DAG tool will typically not re-run all the models, just the one that failed. Likewise, a DAG tool would usually be smart enough to know whether or not the supplemental data has changed. You could write similar functionality into an R Markdown document, but you'd be reinventing lots of wheels.
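The "lazy" restart behaviour described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not how Airflow, drake, or Connect implement it: `run_dag`, the step names, and the in-memory `cache` dictionary are all hypothetical, and a real tool would persist results and detect changed inputs. The second model is made to fail once on purpose so that the restart has something to skip.

```python
from graphlib import TopologicalSorter


def run_dag(steps, deps, cache):
    """Run steps in dependency order, skipping anything already in `cache`.

    Returns (executed, failed): the step names run successfully in this call
    and the step names that raised an exception.
    """
    executed, failed = [], []
    for name in TopologicalSorter(deps).static_order():
        if name in cache:
            continue  # lazy: a finished step is never re-run
        needed = deps.get(name, ())
        if any(dep not in cache for dep in needed):
            continue  # an upstream step failed, so this one cannot run yet
        try:
            cache[name] = steps[name]({dep: cache[dep] for dep in needed})
            executed.append(name)
        except Exception:
            failed.append(name)
    return executed, failed


attempts = {"model_2": 0}


def flaky_model(inputs):
    """Hypothetical model step that fails on its first attempt only."""
    attempts["model_2"] += 1
    if attempts["model_2"] == 1:
        raise RuntimeError("simulated failure")
    return sum(inputs["pull"]) * 2


steps = {
    "pull": lambda inputs: [1, 2, 3],
    "model_1": lambda inputs: sum(inputs["pull"]),
    "model_2": flaky_model,
    "compare": lambda inputs: max(inputs["model_1"], inputs["model_2"]),
}
deps = {
    "pull": set(),
    "model_1": {"pull"},
    "model_2": {"pull"},
    "compare": {"model_1", "model_2"},
}

cache = {}
first_run = run_dag(steps, deps, cache)   # model_2 fails, compare is skipped
second_run = run_dag(steps, deps, cache)  # only model_2 and compare run
print(first_run)
print(second_run)
```

On the restart, `pull` and `model_1` are not executed again; only the failed model and the step that depends on it run.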

Overall though, if you are getting by with taskscheduleR, you'd likely get a lot of mileage out of R Markdown and R Markdown + RStudio Connect.

5 Likes

Complexity and interdependency are defined the same way Justice Potter Stewart defined pornography: "I know it when I see it..."

In my experience "complex" means simply "something I want to do which I can't do with my current tools." For example we have a number of processes that if all run sequentially would take more than 24 hours to run. And we need them daily. But by using a workflow tool we can run some steps in parallel. That's not rocket surgery, but building that capability up from scratch with good logging would be a royal pain.

Similarly, "heavy loads" just means "■■■■ breaks when I try to run it all on one box". So we break the process up into steps that can be spread around. The job manager can watch and if any one chunk fails it can fire off a job that runs just the missing bits. There are tons of ways to handle each little snag. But since workflows tend to get more complex over time, it seems to make sense to use a workflow tool instead of bespoke solutions for every workflow hiccup. Especially when Airflow is Open Source!

5 Likes
