Best practice for well-documented, reproducible analysis


#1

Hi,
Until now, my analyses have been messy, big .R files. I would run some code and then copy and paste tables and plots. I want to learn how to do this better.

Today I'm interested in your workflow and how you make your analysis well documented and reproducible. Do you have a default “empty” structure for your folders and inside your .R or .Rmd files? My idea: I would create a new RStudio project, use one .Rmd file for the whole report, but do the analysis in separate .R files, which are then called from chunks in the .Rmd file via source(), i.e.

  • one .R file for loading libraries (perhaps with needs()-package?)
  • one .R file for loading data and data-cleaning
  • one .R file for Plot1
  • one .R file for Table 2 and Plot2
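A minimal sketch of that layout, assuming hypothetical file names in an `R/` subfolder (none of these names come from an existing project). In the .Rmd, the call at the bottom would sit in a setup chunk; a small helper keeps the order explicit and fails early if a file is missing.

```r
# Hypothetical script names; adjust to your own project.
scripts <- c(
  "R/01_libraries.R",
  "R/02_load_and_clean.R",
  "R/03_plot1.R",
  "R/04_table2_plot2.R"
)

# Source each script in order into one shared environment, stopping with
# a clear message if a file is missing.
run_scripts <- function(files, envir = new.env()) {
  for (f in files) {
    if (!file.exists(f)) stop("Script not found: ", f)
    source(f, local = envir)
  }
  invisible(envir)
}

# In the .Rmd setup chunk (not run here):
# results <- run_scripts(scripts)
```

Objects created by the scripts then live in `results`, so a later chunk can print, say, `results$plot1` if the plot script assigns to that (assumed) name.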

How do you make sure your code still runs when R packages get updated? Do you use git for version control? How? When does it make sense to write a package?


#2

In a current research project I do the following and would also be curious about any potential improvements.

  1. Back up the raw data on another medium.
  2. Set up an RStudio project.
  3. Put it under private version control on GitHub (only the scripts, not the data).
  4. Use a subfolder structure: R for scripts, and data with subfolders input/output. In input I have subfolders for all input and raw data, depending on the structure of the data.
  5. My first R script is usually called 00_main.
    From there I source the other files.
  6. Usually the first line is packrat::init().
  7. The first sourced script is usually 01_install_and_load_libraries. The install calls are commented out afterwards, so that only the library calls run when I source this script. Environment settings are also done here. The second script usually contains helper functions that I need throughout the analysis.
  8. In the following scripts I load and clean the raw data. They are numbered like 04_preprocess_01_df_a, 04_preprocess_02_df_b, … The preprocessing takes very long, so I save the clean data under data/output/… In these scripts I usually implement tests, change data types, and introduce naming conventions. The sourcing of the preprocessing scripts is commented out in the main file afterwards. (In other kinds of projects, this step would also include things like importing data from a database.)
  9. In the next scripts I load the preprocessed/clean data. They are named like 05_load_01_df_a, 05_load_02_df_b, … Usually some logic happens in the loading script. When possible, I try to load the data in such a way that I have one general setup for all upcoming analysis steps.

The scripts above are always run. When the raw data changes, the preprocessing must also be repeated.
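One possible alternative to commenting the preprocessing calls in and out by hand is a small caching helper. This is only a sketch, not the workflow described above: `load_or_preprocess()` and its arguments are hypothetical names, and it assumes the preprocessing can be wrapped in a function that returns the clean data.

```r
# Return the cached clean data if it exists; otherwise run the (slow)
# preprocessing function, save its result, and return it.
# Set force = TRUE after the raw data has changed.
load_or_preprocess <- function(cache_file, preprocess, force = FALSE) {
  if (!force && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  clean <- preprocess()
  dir.create(dirname(cache_file), recursive = TRUE, showWarnings = FALSE)
  saveRDS(clean, cache_file)
  clean
}

# Usage in 00_main (not run here; preprocess_df_a is an assumed function):
# df_a <- load_or_preprocess("data/output/df_a.rds", preprocess_df_a)
```

The trade-off is that nothing detects a changed raw file automatically; you still have to pass `force = TRUE` or delete the cached file yourself.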

The next part is the analysis. Here I usually have some hypotheses and questions. Sometimes further processing or data enrichment is necessary to answer specific questions. I try to split these analyses into their own substructures and keep them independent. This means I run only one experiment at a time, and I normally don’t run two analyses without restarting R in between.

They are named, for example, 06_analysis_01_enrich_df_a, 06_analysis_model_01, … For the next analysis I start with 07_, etc. The output of each analysis is written to subfolders organized like data/output/results/06_analysis/… In the first script of each analysis I document which questions I want to answer, how the analysis is organised, critical steps, etc. In the subscripts, and also as a comment after the source command in main, I note when and where I write data. One very important step is that variables introduced in a subscript are usually removed at the end of that subscript, or at least in the last subscript of an analysis.

To keep my code readable I usually follow the tidyverse conventions and use the strcode package. I also try to use tidyverse packages in high-level code when the speed is acceptable, since they are very expressive.

I am especially interested in Docker for recreating the whole environment. What annoys me most is getting error messages from packrat.
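A sketch of one way to get the same cleanup without explicit `rm()` calls: source each subscript into a throwaway environment. `run_isolated()` is a hypothetical helper, and it assumes the subscript writes its results to disk rather than relying on objects left behind in the workspace.

```r
# Run one analysis script in a temporary environment whose parent is the
# global environment. The script can read shared objects (clean data,
# helper functions), but variables it creates are discarded afterwards.
run_isolated <- function(file) {
  env <- new.env(parent = globalenv())
  source(file, local = env)
  invisible(NULL)
}

# Usage in main (not run here):
# run_isolated("R/06_analysis_01_enrich_df_a.R")
```

This does not catch everything: a script that uses `<<-` or `assign(..., envir = globalenv())` can still leave objects behind, and loaded packages and options stay changed, so restarting R between analyses remains the stricter guarantee.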


#3

I did a little roundup of some R workflow/analysis writeups on my blog a few weeks ago.

Brian Fannin’s represtools package (a ~portmanteau of reproducible research tools) might be of particular interest. Also, the recent PeerJ preprint, Packaging data analytical work reproducibly using R (and friends) by Ben Marwick, Carl Boettiger, and Lincoln Mullen (which I don’t think was in that roundup post), is well worth reading!


#4

My first step is to create a data package:

http://r-pkgs.had.co.nz/data.html

I put my raw data and processing scripts in a “data-raw” directory, and I treat the raw data files like master negatives: I try never to alter them, instead making all changes via R code.
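A minimal sketch of what such a data-raw script might look like. The file name, column names, and the `build_demographics()` helper are all hypothetical, and it uses base R's `save()` so it runs anywhere; inside a real package, `usethis::use_data()` is the usual helper for the last step.

```r
# data-raw/demographics.R (hypothetical file and column names)
# Reads the untouched raw file, cleans it in code, and writes the result
# to data/, where R looks for a package's datasets.
build_demographics <- function(raw_file, pkg_dir = ".") {
  raw <- read.csv(raw_file, stringsAsFactors = FALSE)

  # All changes happen here, in code; the raw file is only ever read.
  demographics <- raw
  names(demographics) <- tolower(names(demographics))
  demographics$id <- as.integer(demographics$id)

  out_dir <- file.path(pkg_dir, "data")
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  save(demographics, file = file.path(out_dir, "demographics.rda"))
  invisible(demographics)
}

# Usage (not run here):
# build_demographics("data-raw/demographics.csv")
```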

I also try to “normalize” my data: instead of creating one gigantic data frame, I divide the data into separate data frames, such as demographics, biomarkers, and survey items, each with the participant ID number, which I then join together as needed.
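For illustration, a small sketch of joining such tables as needed, using base R's `merge()`. The tables and column names are made up for the example.

```r
# Hypothetical tables, each keyed by participant id.
demographics <- data.frame(id = c(1, 2, 3), age = c(34, 51, 28))
biomarkers   <- data.frame(id = c(1, 2, 3), crp = c(0.8, 2.1, 1.4))
survey       <- data.frame(id = c(1, 3), q1 = c(4, 5))

# Join only the tables a given analysis needs.
# all.x = TRUE keeps every participant from the left table (a left join),
# so participant 2, who has no survey row, gets NA for q1.
analysis_df <- merge(demographics, survey, by = "id", all.x = TRUE)
analysis_df <- merge(analysis_df, biomarkers, by = "id")
```

With dplyr, `left_join()` and `inner_join()` would play the same roles.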

Rationale for creating a data package

  1. I find that I often want to reanalyze my data a few years later, or hand it off to a grad student for further analysis, but that this is really hard if the data processing is mixed up with the original analysis. Having the data isolated in its own package makes it trivial to start a new analysis with a simple library(my_data), or to share the data with colleagues.

  2. Putting the data in a package forces me to think about finding the sweet spot in the data processing pipeline where the data will be maximally useful for the current and future analyses: not so little that I find myself making the same changes over and over, but not so much that I never use the processed versions of the data again.

  3. The R package structure makes it easy to document the data, and to access that documentation.

  4. Having the data in a package makes it trivial to submit the data to a data archive, as many journals now require.

There are some downsides:

  1. It’s an extra step that takes a little time.

  2. If I find an error in the data I have to fix it and then rebuild the package. If I forget the rebuild step, my analysis will still be using the old version of the data in my package library rather than the corrected version.

  3. In the early phases of the analysis, especially, I find myself moving code from the analysis to the data package, or from the data package to the analysis, as I try to find the optimal division between data processing and data analysis.

For me, though, the benefits of creating a data package outweigh the costs.


#6

For me, organisation hinges on whether it is a single-document report (rmarkdown) or a multi-chapter analysis (bookdown).

In the first case, I have a project folder, a data folder (potentially with subfolders), and an R Markdown document. The second often has extra images etc., so it needs a lot more organisation.


#7

I use rrtools for making my work well-documented and reproducible, and describe it a little more on another thread here: Data Science Project Template for R

To answer your specific questions:

How do you make sure your code still runs when R packages get updated?

rrtools uses Docker and MRAN to create an isolated computational environment with specific package versions. These versions do not change during the life of the project, even when I update the packages in my desktop installation of R. We tried packrat but found it unreliable.
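As a much lighter-weight complement (this is not what rrtools does, and it pins nothing), a project can at least record which package versions it was last run with and report when they drift. A sketch in base R, with hypothetical helper names:

```r
# Record the version of R and of each listed package in a small table.
# packageVersion() raises an error if a package is not installed.
record_versions <- function(pkgs) {
  data.frame(
    package = c("R", pkgs),
    version = c(
      as.character(getRversion()),
      vapply(pkgs, function(p) as.character(utils::packageVersion(p)),
             character(1), USE.NAMES = FALSE)
    ),
    stringsAsFactors = FALSE
  )
}

# Compare the current versions with a previously saved table and return
# only the rows that differ.
changed_versions <- function(saved) {
  current <- record_versions(setdiff(saved$package, "R"))
  both <- merge(saved, current, by = "package",
                suffixes = c("_saved", "_current"))
  both[both$version_saved != both$version_current, , drop = FALSE]
}

# Usage (not run here):
# write.csv(record_versions(c("stats", "utils")), "versions.csv", row.names = FALSE)
# changed_versions(read.csv("versions.csv", stringsAsFactors = FALSE))
```

This only tells you that something changed; actually restoring the old versions still needs a tool that isolates the environment, such as the Docker setup described here.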

Do you use git for version control?

rrtools uses git and GitHub with public or private repos. This makes collaboration very efficient, and is a nice backup.

When does it make sense to write a package?

Our view is that an R package is a suitable file structure to base a research compendium on, because it is familiar to most R users and there is already so much tooling available to automate package creation, testing, and use. So it’s quite natural to bundle the code and other files associated with a journal article or report in an R package.