How about preprocessing the .Rmd file with something like:

lines <- readLines("Test1.Rmd")

# keep only the first 15 lines (equivalent to lines[1:15]) and write them out
writeLines(purrr::map_chr(1:15, ~ lines[.]), "test1s.Rmd")

and then knit the result?

2 Likes

Depending on your use case ("...repeat similar (even if never identical) analyses for different data sets..."), this might also be a good fit for parameterized reports: https://rmarkdown.rstudio.com/developer_parameterized_reports.html

Note the input dataset section under Parameter User Interfaces
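For reference, a minimal parameterized report might look something like this (the file name, parameter name, and default value below are just placeholders):

````markdown
---
title: "Analysis report"
output: html_document
params:
  data_file: "default.csv"
---

```{r load-data}
df <- read.csv(params$data_file)
summary(df)
```
````

You would then knit it against a specific data set with something like `rmarkdown::render("report.Rmd", params = list(data_file = "sensor_batch_03.csv"))`.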

1 Like

I like this a lot! Let me try.

Thanks for the suggestion: my report is indeed parametric, even if not to such an advanced level as shown in your link (which I will definitely study over the next few days).

Having said that, I don't think that's a viable solution for me. There are substantial differences among the data sets:

  • the column names are different
  • the structure of the missingness is different (sometimes very little data is missing at all; other times entire columns are missing, indicating a failed sensor)
  • even the physical meaning of the variables can differ
  • in some cases I need to look at all the variables, in others I don't.

Thus, some level of manual editing of the report is, in my opinion, unavoidable. It's true that in most cases I need to perform an EDA and a survival analysis, but I don't think I can easily parametrize that: depending on the specific data set, I may be content with a Weibull model, or I may need a Cox proportional hazards model, or something even more complicated.

I would at the very least have to invest a lot of time (which I don't have right now) in studying tidyeval, and in writing scripts that are very flexible in terms of the number of variables involved, column names, and the preprocessing and modeling steps to apply... I don't believe in "automatic Data Science": I think some manual intervention is needed. Or maybe it is possible, but that would require building a Data Science platform: it's not something I can do on my own with an R Markdown report.

But this is just my personal opinion, and I'm sure that for more standardized tasks (like performing the same kind of analysis on similar data sets collected weekly) people can be far more productive using parametrized reports.

1 Like

By the way, it seems to work (I need to do some more tests), but I'm not sure what it does, exactly. Can you explain? Also, is the writeLines really necessary, or could I knit from a character vector instead of from a file?

This is another interesting option, I thought R notebooks were not really different from R HTML reports, but I may well be wrong! I'll need to study this option too.

You evaluate them piece by piece. If you're knitting to PDF, the whole document has to be created. That said, there are caching options, which you can explore in knitr chunk options:

Aside: I'm going to move this to #R-Markdown, since it's not really IDE-related.

2 Likes

I'm not that familiar with the details of knitr, but as far as I can tell the knit function takes a file path as input, not a connection as some functions do. Maybe there is a way to turn an in-memory string into a file, but I don't know.

However, because of the way files tend to be handled internally, you might find that writing to a file and then reading it is about as fast as reading the in-memory string directly (if there were a way to do that), so I suggest giving it a try and seeing if it meets your performance requirements.
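For what it's worth, knitr::knit() does have a text argument, so you can knit directly from an in-memory character vector without touching disk at all (the document contents below are just a toy example):

````r
library(knitr)

# A tiny R Markdown document held entirely in memory
rmd_lines <- c(
  "Some *markdown* text.",
  "",
  "```{r}",
  "1 + 1",
  "```"
)

# knit(text = ...) returns the knitted markdown as a character string
out <- knit(text = rmd_lines, quiet = TRUE)
cat(out)
````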

BTW some languages (when the OS supports it) have the concept of a temporary file, that is, a file that is thrown away when the app ends. These kinds of files are handled differently than regular files and often live entirely in memory. Unfortunately R doesn't seem to support this concept... all it offers is a function to create a temporary file name.
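That said, tempfile() plus an explicit unlink() gets you most of the way there in base R; a minimal sketch:

```r
# Generate a throwaway path under the session's temporary directory
tmp <- tempfile(fileext = ".Rmd")

# Write the truncated document there...
writeLines(c("# Title", "Some text"), tmp)

# ...knit/process it here, then delete it explicitly
# (everything under tempdir() is also removed when the R session ends)
content <- readLines(tmp)
unlink(tmp)
```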

2 Likes

@danr yep, Python supports that through the tempfile module!

Expanding on what @mara said about chunk options: setting eval=FALSE, echo=FALSE on each chunk you don't want included in the knitted document will do the trick.
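Concretely, a chunk like this (the label is hypothetical) is neither executed nor shown in the output:

````markdown
```{r slow-model, eval=FALSE, echo=FALSE}
# This chunk is skipped entirely when knitting
Sys.sleep(60)
```
````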

3 Likes

I will test this too! Thanks

The knitr option discussed in this SO thread may be useful in your situation:
https://stackoverflow.com/questions/33705662/how-to-request-an-early-exit-when-knitting-an-rmd-document

6 Likes

@DavoWW this is perfect! Let me recap the solution here, so that people don't have to read through the SO thread. It's very easy to stop knitting a document at any line of your .Rmd document: just add the line

`r knitr::knit_exit()`

anywhere in the source document (I put it in an inline expression because it's more compact, but it would still work if you put it in a code chunk). Fantastic!

11 Likes

@danr Actually, R does provide a "disposable temporary file" facility. Use the with_file function of the withr package.
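If I read the docs correctly, withr::with_file() guarantees the file is deleted once the code block finishes (whether it succeeds or errors) and returns the value of the block; a sketch:

```r
library(withr)

result <- with_file("scratch.txt", {
  writeLines("hello", "scratch.txt")
  readLines("scratch.txt")
})

# "scratch.txt" has been cleaned up by this point
```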

1 Like

@pteetor this is brilliant! I didn't know about that. I have to tweet this gem out :grin:

I run into this issue a lot. Often my analyses take some time to run. When I've finished the analysis and made the plots, the knitting part often takes a lot of unnecessary time, because it needs to re-run everything. What makes it worse is that after 30 min of knitting, some error comes up. Although there are cache options, it's just not convenient to go through chunk by chunk to decide what to cache.

What I would like is to create the HTML from the notebook without re-running anything. The .nb.html files sometimes do that, but there are cases when they stop updating due to some knitr errors.

This is why I like Jupyter better since the .ipynb file can just be converted to html without re-running stuff.

You can achieve what you want in many ways:

  • use R notebooks, instead of R Markdown documents, as suggested above by @mara.

  • use @DavoWW's knitr::knit_exit() trick: knit_exit() doesn't simply abort knitting. It just stops it at the point where you placed it. So you get an HTML file as output, but of course without the parts you didn't knit (which is probably not good for your use case).

  • concerning caching options, invert your point of view :grinning: instead of going chunk by chunk, just set the default (e.g., cache all chunks) with opts_chunk$set(cache = TRUE) at the beginning of your doc, and then deactivate caching only in the chunk you're currently editing, for example with cache=FALSE.

  • move the analysis-heavy part of the code out of the R Markdown document completely, into an R script (it also makes debugging easier). The script must save the results of the analysis to some file, e.g., a .csv, .rds, .rda or feather file. Then in the R Markdown you can add a chunk with an if statement, which checks whether the file exists (in which case it loads it) and otherwise sources the analysis script. This way, knitting takes far less time.

  • use drake! It requires changing your mindset quite a bit, so it definitely isn't an easy step. The reward is that it greatly accelerates the process of developing and reproducing time-consuming analyses.
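The load-or-compute pattern can be as simple as the following sketch (the file and object names are just placeholders):

```r
results_file <- "analysis_results.rds"

if (file.exists(results_file)) {
  # Cheap path: reuse the saved results when knitting
  results <- readRDS(results_file)
} else {
  # Expensive path: run the heavy analysis once and cache it,
  # e.g. source("heavy_analysis.R"); simulated here with a placeholder
  results <- list(fit = "placeholder for a fitted model")
  saveRDS(results, results_file)
}
```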

6 Likes

Thanks for the pointers! I've been using R notebooks. I like the notebook environment for exploratory analysis, although my focus is not really on knitting/generating html reports.

move the analysis-heavy part of the code outside the R Markdown document completely to an R script (it also makes debugging easier).

Indeed, I recently found that many of my Rmd and notebook files should probably just be R scripts, because they take time to run and I never look at their html or nb.html files. The reason I started them as Rmd/notebook files is that I like the separation of code chunks and the inline output printing. They're more like an enhanced console with inline output and code storage...

My workflow is mostly: exploratory analysis -> find the best approach to analyze the data -> write some functions to reproducibly produce results -> document the function calls. The notebook environment is great for exploring different approaches and keeping track of what I tried in the past few hours. But after I find a solution, I usually output the results as csv, pdf, etc. A perfectly knitted HTML file is usually not part of the motivation. About one out of 5-10 notebooks is knitted, and the rest are just Rmd files storing the code I tried. Maybe this reflects that I need to learn better data analysis habits (recently I started thinking about how to optimize my workflow).

drake looks great! I always wanted my workflow to be more like a pipeline, where changing analysis parameters or inputs automatically starts new analysis. Right now it seems to be beyond my scope a little bit, but I hope to use it in the future.

1 Like

Adding eval=FALSE to the chunk options will skip that code when knitting,

like:

{r eval=FALSE}
cat("GOOD LUCK")

1 Like