What to tell colleagues who don't want to wait for knitr

My research group is transitioning from conda + Jupyter + IRkernel for data analysis in R to RStudio Server. I've noticed that many of my colleagues now spend a lot of time waiting for knitr to finish. Waiting for knitr seems to be the default (naive) mode that a new RStudio user falls into. There are tricks for speeding up knitr runs (e.g., caching, or using multiple sessions if you have Server Pro), but why is the default for RStudio "click Knit and wait"? Such an approach is not taken with continuous integration, for instance; a developer doesn't sit and wait for all CI tests to complete before moving on. The developer starts working on the next task right away while the CI tests run in the background/remotely. Why isn't the default for knitr to run the knitting process in the background? Even with background knitting, there's still the issue of long-running code that takes hours or days to complete. If a user just changes a small section of their Rmd file, they have to re-knit and either wait hours/days or use cache=TRUE for code chunks, which seems to have its own issues (below).

What should I tell my colleagues who are starting to complain about the long knitr wait times? The solutions that I've thought of while researching this problem are:

  • Just sit and wait (maybe minutes or hours, which is often the case for our large datasets & ML analyses)
  • Use cache=TRUE or the archivist package.
    • This can limit reproducibility, and it only helps with subsequent knitr runs, not the first render (AFAIK). If the user has already run a 30-minute job interactively (direct code chunk execution) and then knits, that 30-minute job will run all over again during the first knitr render, right?
    • If we use this cache approach, should we always use knitr to run each new code chunk instead of running chunks interactively, so that the chunk gets cached (e.g., to prevent running that 30-minute job twice)?
  • Knit in parallel via multiple R sessions for the same project (luckily we have Server Pro)
    • This still isn't optimal for situations where the knit job takes many hours or days to run. If I just want to make some minor changes to the text in my Rmd file, and I didn't have my long-running code chunks cached, then I would have to wait hours or days for the doc to re-knit, just to get minor text updates.
  • Manually cache the output of long-running jobs with saveRDS(), which allows caching during interactive runs of code chunks but feels even more sketchy than cache=TRUE (see the sketch after this list)
  • Use rmarkdown::render() either within a session or via Rscript
    • I believe that running rmarkdown::render() within a session can lead to problems because the rendering is not occurring in a 'fresh' environment
    • The command line Rscript approach seems like a viable option, but then how does one load the packrat environment before running rmarkdown::render() within the Rscript command?
      • This still doesn't help with the issue of: "I just want to make a small edit to the text, but I don't want to re-render the doc all over again"
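To make the caching bullets above concrete, here is a minimal sketch of the manual saveRDS()/readRDS() pattern (the cache path is made up, and a toy lm() call stands in for the multi-hour job). Unlike cache=TRUE, it also benefits interactive runs, because the file survives across sessions:

    # inside a chunk (or an interactive session); "cache/model_fit.rds" is a placeholder
    cache_file <- "cache/model_fit.rds"
    if (file.exists(cache_file)) {
      fit <- readRDS(cache_file)                    # reuse the previously saved result
    } else {
      fit <- { Sys.sleep(5); lm(mpg ~ ., mtcars) }  # stand-in for the long-running job
      dir.create("cache", showWarnings = FALSE)
      saveRDS(fit, cache_file)                      # cache it for later knits and sessions
    }

The knitr-native alternative is setting cache=TRUE in the chunk header, but as noted above that only helps from the second knit onward.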

What do most RStudio users do? Do most just sit and wait for their docs to knit (at least the first time, in order to cache), or am I missing something?


Do you know about the {drake} package? You might find it very useful for your long, computation-heavy tasks.

It analyzes your workflow, skips steps with up-to-date results, and orchestrates the rest with optional distributed computing. At the end, drake provides evidence that your results match the underlying code and data, which increases your ability to trust your research.

Building your analysis with drake will allow you to easily re-run only what has changed, and to profit from distributed computing (in the background, on clusters, or elsewhere) for the steps that can run in parallel.

It could be very well suited to long-running analyses. You could build a drake pipeline that ends with a report loading the previously computed results into a communication-ready document. Caching is built into drake by default and saves a lot of time. The distributed-computing feature really helps too, and you just have to build a pipeline.
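A minimal sketch of what such a pipeline could look like (fit_model() and the data file are made-up placeholders; the render step mirrors the pattern from the drake docs):

    library(drake)

    plan <- drake_plan(
      raw    = readRDS(file_in("data/raw.rds")),   # placeholder input file
      model  = fit_model(raw),                     # your long-running step
      report = rmarkdown::render(
        knitr_in("report.Rmd"),                    # the report loads cached targets
        output_file = file_out("report.html"),
        quiet = TRUE
      )
    )

    make(plan)   # first run computes everything
    make(plan)   # later runs skip targets that are still up to date

Inside report.Rmd you would call loadd(model) or readd(model) to pull the cached result instead of recomputing it, so a small text edit only re-renders the report target.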

Just an idea in case you don't know this.

Also, for background processes in the RStudio IDE, you have the Jobs pane. You can write an R script that renders your document in a clean session and launch that script through the Jobs pane.
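For example (render_report.R and report.Rmd are placeholder names):

    # render_report.R -- launch it as a Local Job (Jobs pane > Start Local Job),
    # or, I believe, programmatically with rstudioapi::jobRunScript("render_report.R").
    # The job runs in a fresh R session, so your console stays free.
    rmarkdown::render("report.Rmd", output_format = "html_document")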

Otherwise, the callr package will allow you to run an R command in the background too.
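A quick sketch (report.Rmd is again a placeholder): callr::r_bg() starts a separate R process in the background, so your current session stays interactive.

    library(callr)

    job <- r_bg(function() {
      rmarkdown::render("report.Rmd")   # runs in a fresh background R process
    })

    job$is_alive()     # check on it whenever you like
    # job$get_result() # fetch the return value once it has finished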

I am not sure I follow completely: in RStudio, when you click the Knit button, an R Markdown pane opens to follow the rendering process, which runs outside your console session. Your R console is free, so you can continue to work while the rendering happens, can't you?

It really seems to me that you need to execute your rendering in another environment than the RStudio Server IDE. You could deploy and run it elsewhere (an HPC cluster, a cluster of servers, ...); the future package and friends could help. You could also use a product like RStudio Connect to deploy your source code and let it run on another server. Using a CI system can help you manage all this too.
In the end, the drake pipeline and workflow approach really seems aimed at your use case. You should look into it; the documentation is excellent!

Hope it helps.


Thanks @cderv for the detailed response! I was vaguely familiar with drake, but I haven't really looked into it. I heavily use snakemake, but that's for the initial processing of (meta-)genomics data (e.g., QC, mapping, assembly), not the analyses that come afterward in R/python. The idea of having a project completely reproducible via pipelining software is very appealing, but it often fails to actually work for complicated research projects (e.g., some steps require manual work that the user couldn't/wouldn't automate, and thus the pipeline would have to be broken into multiple separate sub-pipelines, with manual work in the middle of each sub-pipeline). Still, I'll check out drake and see if it's feasible for large-scale research projects.

I'm just getting started with knitting (my naive question should have given that away), so I wasn't really sure whether other processes could still run during knitting. I see that this is possible, but it seems one can only knit one Rmd document at a time using the Knit button. I guess my colleagues were complaining about only being able to knit one doc at a time with that button. I'll check out callr (thanks!). With Rscript rendering, is there an easier way to pick up the packrat environment other than adding the R executable to one's PATH and doing the same for the library path(s)?

Thanks again for your advice!

From what I know, the creator of drake developed this tool to deal with the challenges of large research projects. See this blog post: rOpenSci | The prequel to the drake R package

And I guess he would be more than happy to discuss your use case with you and see whether or not drake is made for it.

Sorry, but I am not sure I see the question here or what you mean. For newer projects, I would use the new renv package (the successor to packrat) (Project Environments • renv).
By default, any R session opened in the project folder will use the project library. Or you can activate it manually.
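A sketch of the usual renv workflow (nothing project-specific here):

    # one-time setup inside the project
    renv::init()        # creates the project library and the .Rprofile hook
    renv::snapshot()    # records package versions in renv.lock

    # on another machine / for a collaborator
    renv::restore()     # reinstalls the recorded versions into the project library

    # if a session was started without the project's .Rprofile
    renv::activate()    # manually switch to the project library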

renv/packrat gets initialized when the project .Rprofile is loaded. For this mechanism to work, you have to launch R/Rscript from the correct working directory -- just cd /home/user/myrmdproject before running Rscript. Also, as cderv suggested, save yourself a lot of trouble and migrate from packrat to renv if you can.
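Concretely (render_report.R and analysis.Rmd are placeholder names), something like this should pick up the project environment automatically:

    # render_report.R -- launch it from the project root, e.g.
    #   cd /home/user/myrmdproject && Rscript render_report.R
    # The .Rprofile in that directory activates renv/packrat before this line runs.
    rmarkdown::render("analysis.Rmd", output_file = "analysis.html")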

