Best practices for well-documented, reproducible analysis

My first step is to create a data package:

http://r-pkgs.had.co.nz/data.html

I put my raw data and processing scripts in a "data-raw" directory, and I treat the raw data files like master negatives: I try never to alter them, and instead make all changes via R code.
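As a rough illustration of that workflow, here is a minimal sketch of what a script in "data-raw" might look like. The file name, the column names, and the -99 missing-value code are all made up for the example, and a temporary directory stands in for the package root so the snippet can run anywhere.

```r
# Sketch of a data-raw/ processing script. File names, column names, and the
# -99 missing-value code are hypothetical; tempdir() stands in for the package root.
pkg_root <- file.path(tempdir(), "mydata")
dir.create(file.path(pkg_root, "data-raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_root, "data"), showWarnings = FALSE)

# Stand-in for the untouched raw file (in practice this file already exists).
raw_path <- file.path(pkg_root, "data-raw", "demographics.csv")
writeLines(c("id,age,sex", "1,34,female", "2,-99,MALE", "3,51,Female"), raw_path)

# Read the raw file; never write back to it.
demographics <- read.csv(raw_path, stringsAsFactors = FALSE)

# All corrections happen in code, so they are recorded and repeatable.
demographics$age[demographics$age == -99] <- NA  # -99 assumed to mean "missing"
demographics$sex <- tolower(demographics$sex)    # harmonise inconsistent case

# Save the processed object in the package's data/ directory.
save(demographics, file = file.path(pkg_root, "data", "demographics.rda"))
```

Inside a real package project, usethis::use_data(demographics, overwrite = TRUE) is a common way to do the final save step.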

I also try to "normalize" my data: instead of creating one gigantic data frame, I divide the data into separate data frames, such as demographics, biomarkers, and survey items, each with the participant ID number, which I then join together as needed.
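For example, joining two of those tables on the participant ID might look like the sketch below. The data frames and their columns are invented for illustration, and base R's merge() is just one option (dplyr::left_join() would do the same job).

```r
# Hypothetical tables, each keyed by participant ID.
demographics <- data.frame(id = c(1, 2, 3), age = c(34, 47, 51))
biomarkers   <- data.frame(id = c(1, 2, 3), crp = c(1.2, 0.8, 3.4))
survey_items <- data.frame(id = c(1, 3), q1 = c(4, 2))

# Join only the tables a given analysis needs.
# all.x = TRUE keeps every participant from the left table (a left join),
# so participant 2, who has no survey responses, gets NA for q1.
analysis_data <- merge(demographics, survey_items, by = "id", all.x = TRUE)
```

Keeping the tables separate means an analysis that only needs biomarkers never has to carry the survey columns around.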

Rationale for creating a data package

  1. I find that I often want to reanalyze my data a few years later, or hand it off to a grad student for further analysis, but that this is really hard if the data processing is mixed up with the original analysis. Having the data isolated in its own package makes it trivial to start a new analysis with a simple library(mydata), or to share the data with colleagues.

  2. Putting the data in a package forces me to think about finding the sweet spot in the data processing pipeline where the data will be maximally useful for the current and future analyses: not so little that I find myself making the same changes over and over, but not so much that I never use the processed versions of the data again.

  3. The R package structure makes it easy to document the data, and to access that documentation.

  4. Having the data in a package makes it trivial to submit the data to a data archive, as many journals now require.
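To illustrate the documentation point (item 3), here is a sketch of how a data set could be documented with roxygen2 comments, typically in a file such as R/data.R. The data set name and its variables are hypothetical.

```r
#' Participant demographics
#'
#' One row per participant. (Hypothetical example data set.)
#'
#' @format A data frame with one row per participant and 3 variables:
#' \describe{
#'   \item{id}{Participant ID number}
#'   \item{age}{Age in years}
#'   \item{sex}{Self-reported sex}
#' }
#' @source Describe where the raw data came from.
"demographics"
```

Once the package is documented and installed, ?demographics should bring up this help page like any other R documentation.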

There are some downsides:

  1. It's an extra step that takes a little time.

  2. If I find an error in the data I have to fix it and then rebuild the package. If I forget the rebuild step, my analysis will still be using the old version of the data in my package library rather than the corrected version.

  3. In the early phases of the analysis, especially, I find myself moving code from the analysis to the data package, or from the data package to the analysis, as I try to find the optimal division between data processing and data analysis.
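One possible guard against the forgotten-rebuild problem (downside 2) is to bump the data package's version number whenever the data are corrected, and have each analysis script check for a minimum version. This is only a sketch of that idea, not part of the workflow described above; require_version() is a hypothetical helper, and it is demonstrated on the base "stats" package because the data package here is imaginary.

```r
# Hypothetical guard for the top of an analysis script: stop early if the
# installed package is older than the version the analysis expects.
require_version <- function(pkg, minimum) {
  installed <- utils::packageVersion(pkg)
  if (installed < minimum) {
    stop(sprintf(
      "%s %s is installed, but >= %s is required; rebuild and reinstall it.",
      pkg, as.character(installed), minimum
    ))
  }
  invisible(installed)
}

# Demonstrated with "stats", which ships with R; in practice the first
# argument would be the name of the data package.
require_version("stats", "3.0.0")
```

This only helps if the version in DESCRIPTION is actually incremented with each data fix, which is one more small step to remember.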

For me, though, the benefits of creating a data package outweigh the costs.
