My research group is transitioning from conda + jupyter + irkernel to RStudio Server for data analysis with R. I've noticed that many of my colleagues now spend a lot of time waiting for knitr to finish. It seems that waiting for knitr is the default (naive) mode that a new RStudio user falls into. There are tricks for speeding up knitr runs (e.g., caching, or using multiple sessions with Server Pro), but why is the default for RStudio "click knitr and wait"? Continuous integration doesn't work that way, for instance: a developer doesn't sit and wait for all CI tests to complete before moving on, but starts working on the next task right away while the tests run in the background or remotely. Why isn't the default for knitr to run the knitting process in the background?

Even with background knitting, there's still the issue of long-running code that takes hours or days to complete. If a user changes just a small section of their Rmd file, they have to re-knit and either wait hours or days, or use `cache=TRUE` for code chunks, which seems to have its own issues (below).
What should I tell my colleagues who are starting to complain about the long knitr wait times? The solutions that I've thought of while researching this problem are:
- Just sit and wait (maybe minutes or hours, which is often the case for our large datasets & ML analyses)
- Use `cache=TRUE` or the archivist package.
  - This can limit reproducibility, and it only helps with subsequent knitr runs, not the first render (AFAIK). If the user has run a 30-min job interactively (direct code chunk execution) and then knits, that 30-min job will run all over again during that first render, right?
  - If we use this cache approach, should we always use knitr to run each new code chunk, instead of running chunks interactively, so that the chunk gets cached (e.g., to prevent running that 30-min job twice)?
- Knit in parallel via multiple R sessions for the same project (luckily we have Server Pro)
  - This still isn't optimal for situations where the knit job takes many hours or days to run. If I just want to make some minor changes to the text in my Rmd file, and I didn't have my long-running code chunks cached, then I would have to wait hours or days for that doc to re-knit, just to get minor updates to the text.
- Manually cache output from long-running jobs with `saveRDS()`, which would allow for caching during interactive runs of code chunks, but it is even more sketchy than `cache=TRUE`
- Use `rmarkdown::render()` either within a session or via `Rscript`
  - I believe that running `rmarkdown::render()` within a session can lead to problems because the rendering is not occurring in a 'fresh' environment
  - The command-line `Rscript` approach seems like a viable option, but then how does one load the packrat env prior to running `rmarkdown::render()` within the `Rscript` command?
  - This still doesn't help with the issue of: "I just want to make a small edit to the text, but I don't want to re-render the doc all over again"
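
For concreteness, here is a rough sketch of the last two options in the list: manual caching with `saveRDS()`, and rendering from a fresh `Rscript` process that runs in the background. The helper names `cache_rds` and `render_cmd_args` and the file names `fit.rds` and `analysis.Rmd` are made up for illustration. The sketch does nothing explicit about packrat, so whether the project library is active in the `Rscript` process is still the open question from the list.

```r
# Hypothetical helper: return the cached result if the .rds file exists,
# otherwise evaluate `expr`, save it to `path`, and return it.
cache_rds <- function(path, expr) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- expr  # `expr` is evaluated lazily, so only on a cache miss
  saveRDS(result, path)
  result
}

# The slow step runs once, whether it is first hit interactively or in a knit.
fit <- cache_rds("fit.rds", {
  Sys.sleep(1)  # stand-in for the 30-min job
  lm(mpg ~ wt, data = mtcars)
})

# Hypothetical helper: build the arguments for an Rscript call that renders
# an Rmd file in a fresh R process.
render_cmd_args <- function(rmd) {
  c("-e", shQuote(sprintf("rmarkdown::render(%s)", deparse(rmd))))
}

# Launch the render without blocking the current session.
rmd <- "analysis.Rmd"  # hypothetical file name
if (file.exists(rmd)) {
  system2("Rscript", render_cmd_args(rmd), wait = FALSE)
}
```

Nothing invalidates `fit.rds` when the code that produced it changes, which is why this feels sketchier to me than `cache=TRUE`.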
What do most RStudio users do? Do most just sit and wait for their docs to knit (at least for the first knit, in order to populate the cache), or am I missing something?