What are some pipeline tools that R users recommend?

I have been spinning my wheels researching pipeline tools. My colleague and I have looked at Airflow, Jenkins Pipeline, CircleCI and GitLab. We have looked at some in more depth than others and have not arrived at a 'winner' yet.

I am posting on this forum since most of my work is done in R and any pipelines we build will usually contain some R scripts. I have also had positive experiences posting here before :)

Random long-shot question: are the folks at RStudio working on a pipeline app? Just on the off chance, I thought I'd ask.

I have used the cronR package in the past, but in this case we need something more substantial. We need a web interface where we can check on pipelines and rerun them manually, either specific stages or the whole thing.

Our general preference is for open source, but we are open to paid solutions too.

Although our devops team is moving our infrastructure to IBM Cloud, we would prefer a tool that is cloud-provider agnostic, so we are averse to AWS Pipeline and the GCP and Azure equivalents.

Our devops team is spread thin, so we, the analysts, wish to 'own' the pipeline infrastructure for analytics and be less dependent on them down the line. While we are reasonably technically competent, we are not as advanced as engineers or devops, so we are looking for something that is easier to install and set up.

What are some pipeline tools that fellow R users are using and would any come recommended?

Since posting, someone also told me to add GitHub Actions to the list of contenders.

It depends on what you want to achieve. At a previous job I used Airflow; we ran SQL jobs, Python jobs and some R. If you only care about the order of jobs, you could look at drake, but I believe drake does not have scheduling. I've heard great things about Kubernetes but haven't used it yet. With Kubernetes, your jobs are Docker containers.
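
To make "the order of jobs" concrete, here is a minimal sketch in Python using only the standard library (`graphlib`, Python 3.9+). It is not how drake or Airflow work internally. The stage names and both helper functions are hypothetical, made up for illustration. It declares which stage depends on which, derives a valid run order, and works out which stages need to run again when you rerun from a given stage.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each name maps to the set of stages it depends on.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_order(dependencies):
    """Return stage names in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


def stages_from(dependencies, start):
    """Return `start` plus every stage downstream of it, in run order."""
    order = run_order(dependencies)
    affected = {start}
    for stage in order:
        # A stage is affected if any of its dependencies is affected.
        if dependencies[stage] & affected:
            affected.add(stage)
    return [stage for stage in order if stage in affected]


if __name__ == "__main__":
    print(run_order(DEPENDENCIES))
    print(stages_from(DEPENDENCIES, "load"))
```

A scheduler or web UI would sit on top of something like this. The sketch covers only the ordering part, which is roughly the piece drake is being suggested for here.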

Hi,
As mentioned, it depends on what you want. Looking at your list of pipeline tools (Airflow, Jenkins Pipeline, CircleCI and GitLab), I get the impression that you have listed (at least) two functional categories of pipeline: 1) tools to structure your data flow (e.g. Airflow) and 2) tools to deliver your software into production (e.g. Jenkins, CircleCI, GitLab and GitHub Actions). So, what pipeline functionality exactly are you looking for?

To structure your data flows, you could take a look at KNIME Analytics Platform, Databricks and Dataiku. The latter two are commercial tools but work fine. In all three tools you can embed or invoke R code.

Best Regards

Thanks for the suggestions and info. 'Depends' got mentioned in both replies so far. I have two primary use cases:

  • Pull data from an API, do some processing and then populate a Postgres database with it.
  • With the data in the Postgres database, I will create some dashboards that need to be automated. I will want to create new tables that feed these dashboards directly so that they are fast. So after the data are pulled on a given day, I will want pipelines that update the custom dashboard tables.
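
The two use cases above could be sketched roughly as two stages, shown here in Python so that the example is self-contained. Everything in it is an assumption for illustration: `fetch_records` returns made-up rows instead of calling a real API, an in-memory SQLite database stands in for Postgres, and the table and function names are hypothetical.

```python
import sqlite3


def fetch_records():
    """Stand-in for the API call; returns hard-coded, made-up rows."""
    return [
        {"day": "2020-01-01", "value": 10},
        {"day": "2020-01-01", "value": 5},
        {"day": "2020-01-02", "value": 7},
    ]


def load_raw(conn, records):
    """Stage 1: write the pulled records into a raw table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (day TEXT, value INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_events (day, value) VALUES (:day, :value)", records
    )
    conn.commit()


def refresh_dashboard_table(conn):
    """Stage 2: rebuild the small summary table the dashboard reads."""
    conn.execute("DROP TABLE IF EXISTS dashboard_daily")
    conn.execute(
        "CREATE TABLE dashboard_daily AS "
        "SELECT day, SUM(value) AS total FROM raw_events GROUP BY day"
    )
    conn.commit()


def run_pipeline(conn):
    """Run both stages in order and return the dashboard rows."""
    load_raw(conn, fetch_records())
    refresh_dashboard_table(conn)
    query = "SELECT day, total FROM dashboard_daily ORDER BY day"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # SQLite stands in for Postgres
    print(run_pipeline(connection))
```

Because stage 2 drops and rebuilds its table, it can be rerun on its own without duplicating rows. Stage 1 as written appends on every run, so rerunning it would need a deduplication or upsert step that this sketch leaves out.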

Pretty small stuff and pretty 'small data'. But it would be nice to have a user-friendly GUI for creating new pipelines and an easy-ish setup.

I have some new tools to research now, including KNIME, Databricks and Dataiku. Thanks for the pointers.

Have you looked at RStudio Connect?

To me it sounds like what you're looking for. It makes scheduling, packaging your code and deploying Shiny apps really easy.

I would be interested, but looking at their about page, I don't think they have pipeline functionality per se? Extracting data from APIs, processing it and sending it to a database? Jobs, schedules, stages, tasks etc.?

What is it that you're looking for in a more traditional pipeline tool?

I think RStudio Connect is great for things like pulling data, manipulating it and writing it to a database. It makes scheduling easy, and it is highly configurable for things like staging and load balancing.

But if you're looking for a traditional pipeline tool, then maybe Airflow could be of interest, or even an R package like drake. I also found this interesting topic where you compare Airflow to RStudio Connect: What are Rstudio's solution to workflow management systems
