Best Practices for Reproducible Research - Should We Show Full Mapping of Raw Data to That Used in Research?

I'd like to follow up on this a little bit. I've had a look around at different resources and they all introduce the concept of reproducible research which makes sense. However, I'm having a hard time figuring out exactly which method is the "best" for defining data within the markdown document.

In my example I have a script that wrangles a file imported from .xlsx and produces my final dataset. Should I copy this entire script into the markdown file and set include = FALSE? Should I save the result with save() to an .RData file and use load()? Are there any other options that are "better"?

Thank you!


Split from Error in UseMethod("select_") : when trying to Knit Rmarkdown - #2 by jrlewi

Hi @bragks - this is a really good follow-up question and I'd be glad to share some of my thoughts on this. The key question I try to answer when creating reproducible code is:

'If I provide the code to someone in the future (this person could be my future self or someone else entirely), will they be able to reproduce the results on a potentially different machine?'

A different question is - 'what should I provide in my markdown document?'

They are both important questions, but I think the second has more to do with who is the ultimate consumer of your markdown document. Often - there are steps that can take a long time and you don't want to redo them each time you edit and recompile. In this case - doing all this in a separate source document is fine. I tend to save R objects I need with saveRDS() and load them with readRDS(); though I am not dogmatic about what is the 'right' way. If the consumer of your markdown document would benefit from seeing these preprocessing steps in a report, the source document could be another markdown file. After all, sometimes it is the preprocessing that needs to be made transparent in the documentation.
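The saveRDS()/readRDS() pattern described above might look something like this. The data and file names here are stand-ins (a real project would import the .xlsx with a package like readxl and write to a path such as "data/clean.rds"):

```r
## preprocessing script (in practice this lives in its own .R file)
raw   <- data.frame(id = 1:3, value = c(2.5, NA, 4.0))  # stand-in for imported data
clean <- raw[!is.na(raw$value), ]                       # example wrangling step

path <- file.path(tempdir(), "clean.rds")               # in a project: "data/clean.rds"
saveRDS(clean, path)                                    # last line of the script

## in the Rmd setup chunk, just load the finished object:
clean <- readRDS(path)
```

The upside of this split is that the Rmd only pays the cost of reading one small file, no matter how slow the preprocessing is.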

Another option (that may be considered 'better') is to put it all in a single markdown document and use the cache = TRUE chunk option for the chunks that take a long time. The results of those chunks are then cached and reused when recompiling. There are some caveats to this; a quick overview of caching can be found here: RPubs - Caching Code Chunks.
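A sketch of what that chunk looks like (the chunk label and the expensive_step() function are hypothetical placeholders):

```r
# In the Rmd, the chunk header would read:  ```{r wrangle, cache=TRUE}
expensive_step <- function() {    # stand-in for a slow import/wrangle step
  data.frame(x = 1:5)
}
clean <- expensive_step()
# knitr writes the chunk's results to a cache directory and reuses them on
# the next knit, re-evaluating only when the chunk's code changes.
```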

Lastly - I suggest this article on project-oriented workflow which discusses good practices to follow when sharing your code (and note, at the very least you are always sharing your code with your future self...)


In my example I have a script that wrangles a file imported from .xlsx and produces my final dataset. Should I copy this entire script into the markdown file and set include = FALSE? Should I save the result with save() to an .RData file and use load()? Are there any other options that are "better"?

One technique I'd promote is to create a data-only package with a data-raw directory containing this script, ending with devtools::use_data(<data>). Then you can just use library(<myprojectsdata>) in your knitr document. Advantages of this include lazy (i.e. faster) loading of the data, avoiding extraneous objects, better versioning and sharing, as well as forcing you to document your dataset.
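A minimal sketch of that layout, with all names hypothetical. Note that the use_data() call needs to be run inside a package directory, so it is shown commented out here (in current tooling the same function lives in usethis as usethis::use_data()):

```r
## data-raw/measurements.R inside the package source
measurements <- data.frame(id = 1:2, value = c(1.1, 2.2))  # stand-in for the wrangled .xlsx import

## Final line of the real script: writes data/measurements.rda
# devtools::use_data(measurements, overwrite = TRUE)

## Any Rmd can then simply do:
# library(myprojectsdata)   # hypothetical data-only package
# measurements              # lazy-loaded on first use
```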


I'm not (at all) familiar with making packages, but this does seem like a nice approach. You're saying all changes I make to the script and the resulting objects would follow the package with this method?

Still, from a beginner's perspective, I'm having some trouble seeing why I can't just "link" my workflow in a script to the Rmd. It just seems a bit counterproductive, but I'm assuming this is intentional for reasons I have yet to understand.

E.g.
load data in rmd -> make some change to dataframe in source -> save changes to dataframe -> reload data in rmd -> do random stuff in rmd

vs.
load data in rmd -> make some changes to dataframe in source -> do random stuff in rmd

When you say “make some changes to dataframe in source”, do you mean changing the script that imports/preprocesses your data? And when you say “save changes to data frame”, do you mean running that script to create a new data frame object reflecting the changes? (FWIW, that’s a step you’re never going to be able to skip)

Aside: the data frame isn’t “real”...

Something it might be a good idea to wrap your head around (discussed in the Project-Oriented Workflow article that @jrlewi linked) is the idea that “source is real” — meaning you should think of the objects that your code creates as ephemeral, and only the instructions for creating them (=the source code) as the durable, real artifacts of your work.

Thinking about “saving changes” to a dataframe object somewhat runs against this principle. If source is real, then you can and should (early and often :grin: ) clear out all the objects in your workspace, secure in the knowledge that it is trivial for your scripts to recreate them. You aren’t creating a precious data frame — you’re creating a precious set of instructions, which can generate any number of disposable data frames that will be exactly identical every time (unless you change the instructions!).
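To make the idea concrete, here is a tiny demonstration of "source is real", with make_clean() standing in for a hypothetical import/wrangle pipeline:

```r
make_clean <- function() {
  # hypothetical pipeline: in real life this would import and wrangle the .xlsx
  data.frame(id = 1:3, value = c(1.5, 2.5, 3.5))
}

clean <- make_clean()
rm(clean)              # wipe the object without fear...
clean <- make_clean()  # ...the instructions recreate it, identical every time
```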

Needless to say, this is really different from how most people are used to thinking about computer software, so it takes some getting used to!

The simple answer for why it can't work like the second example is that, as you know, RMarkdown/knitr just doesn't work this way. Knitting is done in a new, independent session, so Rmd files have to be self-contained. This is good for reproducibility because rendering doesn't depend on somebody taking the right series of steps "by hand" to set up objects in the environment ahead of time.

However, “self-contained” doesn’t mean that you have to copy your data import script into every Rmd you make — in fact, this is a bad idea because multiple copies inevitably lead to diverging changes. There are a lot of other options (mostly already mentioned in this thread):

  1. You can use source() in the setup code chunk to run the pre-processing script.
  2. You can make the last line of the preprocessing script a call to saveRDS(), to save the data frame object the script created to an RDS file. Then you can have your Rmd setup chunk just load the data frame object from that file (with readRDS()). If you make changes to the preprocessing script, you will have to remember to run it again so that the RDS file gets updated. People usually do this as a convenience in cases where running all the import/preprocessing steps is slow.
  3. The preprocessing steps can live as a separate Rmd document that is included as a child document in your analysis Rmd.
  4. You can make a data package (sounds daunting, but not that hard!) and call it in your Rmd. Probably best for when your preprocessing script has stabilized.

There are some other variations on these themes, too. I don’t think there’s a single best practice for every situation.
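Option 1 can be sketched as follows. To keep this snippet self-contained it writes a one-line script to a temp file; in a real project the script would already exist next to the Rmd:

```r
## the pipeline lives in its own script, e.g. preprocess.R (hypothetical name)
script <- file.path(tempdir(), "preprocess.R")
writeLines('clean <- data.frame(id = 1:4)', script)

## in the Rmd this sits in the setup chunk:  ```{r setup, include=FALSE}
source(script)   # every knit re-runs the pipeline, so the Rmd stays self-contained
```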


Hi @bragks,

I have tried to outline the workflow that I employ for analysis here

Hope it helps :slightly_smiling_face:


I think the drake package can help here. (Full disclosure: I am the creator and maintainer.) drake does not create its own execution environment/session, but it does ease much of the friction you all are rightfully bringing up. Some relevant features:

  • Automatic dependency watching throughout the whole pipeline, including those large input datasets you may not want to preprocess in a knitr report. This is similar to knitr's cache = TRUE feature, but more developed.
  • Automatic saving and loading of targets and easy user-side access to the cache. No need to micromanage all those data files.
  • Report-building steps as targets with dependencies. In other words, the heavy computation happens outside knitr, and it is still reproducible.
  • Parallel computing and scale.
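A minimal sketch of the workflow those features describe, assuming the drake package is installed (the data here is a toy stand-in for a slow import):

```r
library(drake)

plan <- drake_plan(
  raw   = data.frame(id = 1:3, value = c(1, NA, 3)),  # stand-in for a slow import
  clean = raw[!is.na(raw$value), ]                    # drake detects it depends on `raw`
)
make(plan)     # builds only outdated targets; results go into a cache
loadd(clean)   # pull a finished target from the cache into the session
```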

Resources:


Wouldn't it make sense to provide the data reading/wrangling script separately as an R script file and provide a Makefile showing the dependencies for the final product?
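For readers unfamiliar with make, a minimal sketch of that arrangement might look like this (all file names hypothetical; recipe lines must be tab-indented):

```make
# Makefile wiring the pieces together
all: report.html

clean.rds: preprocess.R measurements.xlsx
	Rscript preprocess.R                      # writes clean.rds

report.html: report.Rmd clean.rds
	Rscript -e 'rmarkdown::render("report.Rmd")'
```

With this, running `make` rebuilds only the targets whose inputs have changed.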

This package looks great! I'm a bit strapped for time at the moment (and worried that I'll mess something up by introducing something new), but I'll definitely give this a go for the next project I'm working on!


@jimbotyson you could do it that way, but make + R can get cumbersome.

  • Each Makefile rule creates its own R session from scratch, and all those sessions can add up to a lot of time wasted on overhead.
  • make watches file timestamps, so it will rerun an R script even if you add something as trivial as comments or indentation.
  • There is a lot of bookkeeping. You still have to worry about saving and loading output/intermediate data files, what format to use, and where to put them.

Glad to hear it, @bragks. I would be happy to help you get started when the time comes.

For those following this thread, I just had the pleasure of attending a talk by @wlandau. If you're in need of strict reproducibility with potentially heavy steps in your workflow, then I definitely recommend taking a serious look at drake. The fact that it is on rOpenSci is a testament to the work that was put into the package!

...and nice to meet you @wlandau and thanks again for the awesome hex-sticker :+1: :slightly_smiling_face:


Thanks, Leon! I am glad I could meet you in person last week, and I enjoyed your deep learning talk.

The slides from my drake talk are here.
