Workflow: How to write a journal article with R Markdown when code/figures are spread across multiple directories?

achamess · April 7, 2018, 12:30am

Hi all,
I'm trying to use R Markdown effectively for both my analysis and my academic writing workflow.

I work in experimental neuroscience. I use R Notebooks as my lab notebooks. Each experiment I run has a corresponding folder in which I have:

experiment_A/

R Notebook file for the experiment
data/
output/

This works well, and everything is self-contained in the top level folder. All the code and output are in one place and if anyone wants to reproduce my work for a given experiment they can.

The messy part comes when I'm compiling a bunch of individual experiments together to write a manuscript.

I know that the ideal is to make everything reproducible and self-contained in a single directory structure. The manuscript is related to the individual experimental results that comprise it, but it's a separate entity.

So my question is, how should I structure this? If I'm writing a manuscript in R Markdown, should I just link to the output figures in their experiment directories? That would work for the purposes of making the manuscript, but all the code that generated those figures would be separate from the manuscript unless I copied and pasted it. Similarly, if anyone wanted to reproduce my manuscript output, it wouldn't all be self-contained. They would need each of the individual experimental Rmd files that made up the components. Does that make sense?
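A minimal sketch of the "just link to the output figures" option, written as the body of a chunk in the manuscript Rmd. The layout (a manuscript/ folder sitting next to experiment_A/) and the file name figure_1.png are assumptions made up for illustration, not actual files from the project.

```r
# Assumed layout (hypothetical): manuscript/ and experiment_A/ are siblings,
# and this code runs from inside manuscript/.
fig_path <- file.path("..", "experiment_A", "output", "figure_1.png")

# Embed the figure only if the experiment notebook has already produced it.
if (file.exists(fig_path)) {
  knitr::include_graphics(fig_path)
} else {
  message("Figure not found; re-run the experiment_A notebook first: ", fig_path)
}
```

This keeps a single copy of each figure, but it also shows the drawback described above: the manuscript only builds if the experiment folders are present at the expected relative location.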

Is there a good way to do this? I suppose I could copy each and every experimental directory to a final manuscript/ directory upon submitting and ending the project.

When people describe R Markdown workflows, they usually present them as if the final product (the manuscript) is the first time you're doing the analysis, writing code, and getting outputs.

But in experimental science at least, you're analyzing and visualizing results incrementally across the timeline of a whole project (months to years) and not waiting until you write your manuscript to analyze your data for the first time, all in one master Rmd file.

So how do you incorporate all that piecemeal analysis into a final manuscript without a lot of copy-pasting?

mara · April 7, 2018, 2:19pm

You might find this thread relevant.

I think perhaps that figure is a bit misleading (if interpreted literally) in terms of actual workflows. The scripts and chunks probably aren't being written or executed for the first time in article format. There's no need to put them in different directories if you use workflow tips, such as those described in Jenny's article and the latter thread below.

achamess · April 7, 2018, 4:15pm

Thanks for this! And to @jennybryan for the insights from her articles and threads.

This one is extremely useful: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510#sec009

I suppose one solution would be to change how I structure my work. I could create a new top-level directory in my lab notebook that includes both the manuscript files and the experiment folders. That will require changing how I organize my work, which isn't trivial but is doable.
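As a rough sketch, that top-level layout could be created like this. The names project_root and experiment_B are placeholders invented for the example, and everything is built under a temporary folder so no real files are touched.

```r
# Build the proposed layout in a temporary folder (hypothetical names).
root <- file.path(tempdir(), "project_root")
dirs <- c(
  file.path(root, "experiment_A", "data"),
  file.path(root, "experiment_A", "output"),
  file.path(root, "experiment_B", "data"),
  file.path(root, "experiment_B", "output"),
  file.path(root, "manuscript")
)
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Show what was created, relative to the root.
list.dirs(root, full.names = FALSE, recursive = TRUE)
```

Each experiment folder stays self-contained, and the manuscript folder can reach any of them with a relative path such as ../experiment_A/output/.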

Integrating code chunks from individual Rmd files into the master manuscript file may not be necessary if one can at least track back to the original within the directory hierarchy.

prosoitos · April 17, 2018, 7:24am

I would create one "super root" directory (called for instance "thesis").

In it, I would create a directory for each experiment. Those are the regular project roots, where your .Rproj will live if you use RStudio projects. That way, you can work on multiple sub-projects (= experiments) at the same time if you want (by having multiple RStudio instances open).

To avoid copying and pasting, if you ever need to use files from one experiment in another, I would create links (super easy to do in Linux, not sure how to do them on other operating systems).
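One way to create such a link from within R, rather than from the shell, is base R's file.symlink(). The sketch below is only an illustration: the folder thesis_links_demo and the file names are invented, and on Windows the call may fail without the right permissions, which is why the return value is checked.

```r
# Set up two hypothetical experiment folders under a temporary root.
root <- file.path(tempdir(), "thesis_links_demo")
dir.create(file.path(root, "experiment_A", "output"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "experiment_B", "data"),
           recursive = TRUE, showWarnings = FALSE)

# A result file produced by experiment A (made-up content).
target <- file.path(root, "experiment_A", "output", "summary.csv")
write.csv(data.frame(x = 1:3), target, row.names = FALSE)

# Link it into experiment B instead of copying it.
link <- file.path(root, "experiment_B", "data", "summary_from_A.csv")
unlink(link)  # remove a stale link from an earlier run, if any
linked <- file.symlink(from = target, to = link)

# Reading through the link returns experiment A's data.
if (linked) read.csv(link)
```

The example uses an absolute target path for simplicity. If the whole directory tree is going to be moved or shared, a target given relative to the link's location is more likely to survive the move.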

For portability, since you want something self-contained, I would consider the whole super root "thesis" as the unit. Your individual experiments cannot live independently and if you were to share your project with someone, you would have to give them the entire "thesis" repository.

prosoitos · April 17, 2018, 7:28am

Just read your last reply (had only read your question): you came up with more or less the same idea.

For your "master Rmd manuscript", if you wanted to be fancy, you could use a Makefile that automatically builds a document from your various Rmd files. If you want to keep it simpler, however, you could probably source your various files into your master file (not sure if it is an option with Rmd, but it is very easy in R).
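On the Rmd question: source() expects a plain R script, but knitr::purl() can extract the code chunks of an Rmd file into a script that can then be sourced. The sketch below assumes the knitr package is installed; the notebook name and its one-line chunk are made up so that the example runs on its own.

```r
# Write a tiny stand-in for an experiment notebook (hypothetical content).
fence <- strrep("`", 3)  # three backticks, built here to keep the example tidy
exp_rmd <- file.path(tempdir(), "experiment_A.Rmd")
writeLines(c(
  "Experiment A notes.",
  "",
  paste0(fence, "{r}"),
  "mean_response <- mean(c(1.2, 1.5, 1.8))",
  fence
), exp_rmd)

# Extract the chunks into a plain R script, then source that script.
exp_r <- knitr::purl(exp_rmd,
                     output = file.path(tempdir(), "experiment_A.R"),
                     quiet = TRUE)
source(exp_r)

# Objects defined in the notebook are now available to the master file.
mean_response
```

An alternative that avoids the intermediate script is knitr's child chunk option, where a chunk in the master Rmd is given child = "path/to/experiment_A.Rmd" and the child document is knitted in place.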

prosoitos · April 17, 2018, 7:31am

I don't think that creating that "top-level directory" (or what I called the "super root") should change your workflow if you keep each experiment as an RStudio project. You just have to create the necessary links between files and make sure that the whole thing always remains together.

If you version control your work, the links are going to be a pain though... You probably want to version control the whole thing (so have your .git or .svn in your "top-level directory").