Best practices for well-documented, reproducible analysis

In a current research project I follow the steps below, and I would be curious about any potential improvements.

  1. back up the raw data on another medium
  2. set up an RStudio project
  3. put it under private version control via GitHub (only the scripts, not the data)
  4. use a subfolder structure: R for scripts, and data with subfolders input/output for the data. In input I have subfolders for all input and raw data, depending on the structure of the data.
  5. my first R script is usually called 00_main; from there I source the other files.
  6. usually its first line is packrat::init()
  7. the first sourced script is usually 01_install_and_load_libraries. The install calls are commented out afterwards, so that sourcing this script only makes library() calls. Environment settings are also done here. The second script usually contains helper functions that I need throughout the whole analysis.
  8. in the following scripts I load and clean the raw data. They are numbered like 04_preprocess_01_df_a, 04_preprocess_02_df_b, ... The preprocessing takes very long, so I save the cleaned data under data/output/.... In these scripts I usually implement tests, change data types, and introduce naming conventions. The sourcing of the preprocessing is afterwards commented out in the main file. (In other kinds of projects, this step would also include things like importing data from a database.)
  9. in the next scripts I load the preprocessed/clean data. They are named like 05_load_01_df_a, 05_load_02_df_b, ... The loading scripts usually contain the logic for shaping the data; when possible I try to load the data in a way that gives a common setup for all upcoming analysis steps.
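A minimal sketch of what the 00_main script described above could look like (the file names follow the numbering convention from the post; paths and comments are illustrative, not the actual project):

```r
# 00_main.R -- entry point; everything else is sourced from here
packrat::init()  # project-local package library (run once per project)

source("R/01_install_and_load_libraries.R")  # library() calls; install lines commented out
source("R/02_helper_functions.R")            # helpers used throughout the analysis

# Preprocessing is slow, so these lines are commented out once
# data/output/ is populated; rerun them only when the raw data changes.
# source("R/04_preprocess_01_df_a.R")  # writes cleaned df_a to data/output/
# source("R/04_preprocess_02_df_b.R")  # writes cleaned df_b to data/output/

source("R/05_load_01_df_a.R")  # reads the cleaned df_a
source("R/05_load_02_df_b.R")  # reads the cleaned df_b
```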

The scripts above are always run. When the raw data changes, the preprocessing must also be repeated.

The next part is the analysis. Here I usually have some hypotheses and questions. Sometimes it is also necessary to do further processing or data enrichment to answer specific questions. I try to split these analyses into their own substructures and keep them independent. This means I only run one experiment at a time, and I normally don't run two analyses without restarting R in between.
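One way to enforce the "fresh R session per analysis" rule without manually restarting is to run each analysis in a subprocess. This sketch uses the callr package, which is my assumption here, not part of the original workflow:

```r
# Run one analysis script in a fresh R session so that no state from a
# previous analysis can leak in (callr spawns a clean subprocess).
library(callr)

run_analysis <- function(script) {
  callr::r(
    function(path) source(path, local = new.env()),
    args = list(path = script)
  )
}

# Hypothetical usage, matching the naming scheme below:
# run_analysis("R/06_analysis_01_enrich_df_a.R")
```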

They are named, for example, 06_analysis_01_enrich_df_a, 06_analysis_model_01, ... For the next analysis I start with 07_, etc. The output of these analyses is written to subfolders organized like data/output/results/06_analysis/... In the first script of each analysis I document which questions I want to answer, how the analysis is organized, critical steps, etc. In the subscripts, and also as a comment behind the source command in main, I note when and where I write data. One very important step is that variables introduced in subscripts are usually removed at the end of a subscript, or at least in the last subscript of an analysis.
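The variable-cleanup step at the end of an analysis can be sketched like this (the object names are illustrative only):

```r
# Snapshot the workspace before the analysis subscripts run.
vars_before <- ls()

# ... subscripts create temporary objects here, e.g. a hypothetical enrichment:
df_a_enriched <- transform(df_a, ratio = x / y)

# Last subscript of the analysis: remove everything introduced above,
# including the snapshot itself, leaving only the pre-analysis workspace.
rm(list = setdiff(ls(), vars_before))
```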

For a better overview of my code I usually follow the tidyverse conventions and use the strcode package. I also try to use tidyverse packages in high-level code when the speed is acceptable, since they are very expressive.
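For readers unfamiliar with the convention: RStudio treats comment lines ending in four or more `-`, `=`, or `#` characters as foldable code sections, and strcode generates more elaborate separators in the same spirit. A plain-RStudio version looks like:

```r
# Load cleaned data -------------------------------------------------------
df_a <- readRDS("data/output/df_a.rds")

# Enrich df_a --------------------------------------------------------------
# ... analysis steps, foldable as a section in the RStudio editor ...
```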

I am especially interested in Docker for recreating the whole environment. What annoys me most is getting error messages from packrat.
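As a hedged starting point for the Docker idea: a Dockerfile based on the rocker images that restores the packrat library inside the image. The image tag, paths, and file layout are assumptions to adapt, not a drop-in solution:

```dockerfile
# Pin the R version via the rocker image tag (adjust to your project's version)
FROM rocker/verse:4.3.2

WORKDIR /home/rstudio/project

# Copy the packrat lockfile first, so this expensive layer is cached
# and only rebuilt when the locked package versions change.
COPY packrat/packrat.lock packrat/packrat.opts packrat/
RUN R -e "install.packages('packrat'); packrat::restore()"

# Copy the scripts; raw data stays outside the image and is mounted at run time.
COPY R/ R/

CMD ["Rscript", "R/00_main.R"]
```

Running the container with the data mounted (e.g. `docker run -v $(pwd)/data:/home/rstudio/project/data ...`) keeps the image itself data-free, matching the "scripts under version control, data not" split above.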
