Executing an R script from Airflow

Hi all,
Does anyone have experience executing an R script from Airflow?
Ideally it would be through an R operator.

Thanks,
Chuck

There is an ROperator PR that I've looked into before.

It gets new comments every few months, and those comments suggest it's complete but just needs more testing before being merged in.

At the very least you could probably use the BashOperator:

https://github.com/apache/incubator-airflow/blob/master/airflow/operators/bash_operator.py
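For instance, here's a minimal sketch of that approach, assuming an existing dag object and that Rscript is on the worker's PATH (the script path is just a placeholder):

from airflow.operators.bash_operator import BashOperator

# Run an R script non-interactively with Rscript.
run_r_script = BashOperator(
    task_id='run_r_script',
    bash_command='Rscript /path/to/my_script.R',
    dag=dag)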

3 Likes

@chuck_m I am using R and Airflow in production. I suggest using only the BashOperator and combining it with R Markdown's rendering power, which gives a better debugging experience.

@harryzhu is there an example you could point me towards? I'm assuming you'd be using Rscript via a batch script. If you do that, does the Airflow BashOperator capture the logs from the R session? In particular, what's the advantage of using R Markdown for debugging?

thanks!

@chuck_m

here is an example that combines rmarkdown, sparklyr, and Airflow:

from datetime import timedelta

from airflow.operators.bash_operator import BashOperator

# Render a parameterized .Rmd with Rscript; {job_name} and {exe_date}
# are filled in by .format() below.
rmd_exe_base = """/bin/Rscript -e 'rmarkdown::render("/data/share/airflow/dags/project/Rmd/{job_name}.Rmd",params=list(exe_date="{exe_date}",master="yarn-client"),run_pandoc=F)' """

your_ops = BashOperator(
    task_id='your_db.your_tbl',
    depends_on_past=False,
    retries=3,
    retry_delay=timedelta(0, 300),  # wait 5 minutes between retries
    bash_command=rmd_exe_base.format(exe_date=exe_date, job_name="your_job"),
    dag=dag)

and you can check out the rendered rmd_exe_base command in the Airflow UI in the task view.
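For example, with a made-up exe_date of "2018-01-01", the rendered bash_command would be:

/bin/Rscript -e 'rmarkdown::render("/data/share/airflow/dags/project/Rmd/your_job.Rmd",params=list(exe_date="2018-01-01",master="yarn-client"),run_pandoc=F)'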

And the advantage of R Markdown is that each chunk can log its progress automatically, and it organizes code and parameters very well.

3 Likes

Very cool, I'm going to give it a shot. Thanks!

@harryzhu I'm just getting my feet wet with Airflow and R. How do you deal with the working directory in your render example?

I'm running *.R files, and I handle this by creating a bash script that sets the working dir and then sources the R file. I'm not sold on that as a good workflow, because it feels like I'm hard-coding paths, which leaves me with the nagging concern that Jenny Bryan is going to come burn my computer :smile:

-J

1 Like

@jdlong

in the Linux shell command,

cd your_directory && Rscript your_rscript.R

might be what you want
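As an Airflow task, that pattern might look something like this sketch (the directory and script names are placeholders):

from airflow.operators.bash_operator import BashOperator

# cd into the project directory first so relative paths in the
# script resolve against it.
run_in_dir = BashOperator(
    task_id='run_r_in_its_directory',
    bash_command='cd /path/to/project && Rscript your_rscript.R',
    dag=dag)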

yeah that's how I'm doing it already. Was curious if you all had sorted out another way :wink:

The way I handle working directories is by using the rprojroot package:

setwd(rprojroot::find_rstudio_root_file())

1 Like

What I ended up doing was creating a wrapper script in my dags directory called run_r.sh (despite the .sh name, its shebang runs it with Rscript). I want all my R jobs to run in the directory in which they are located. So my run_r.sh looks like this:

#!/usr/local/bin/Rscript

# take the path to an R script as the only argument
args = commandArgs(trailingOnly=TRUE)

# switch to the script's own directory, then run it
setwd(dirname(args[1]))
source(args[1])
so it takes one argument, the path to the R script, changes the working dir to the directory containing that script, then sources the script.

My bash operator in my dag ends up looking like this:

run_this = BashOperator(
    task_id='my_r_thing',
    # note the trailing space after the script path (see below)
    bash_command='/Users/jal/airflow/dags/run_r.sh /Users/jal/Documents/my_r_thing.R ',
    dag=dag,
)

For each of my DAGs I use the same run_r.sh and just pass it a different R script. Don't forget the space after the script name.

works like a champ.

6 Likes


Hey, if you're interested in Airflow, you might want to pop into their GitHub issues section and say you're interested in an R operator. I've nudged them to reopen the issue.

3 Likes