Data Science Project Template for R

I use rrtools, inspired by devtools, as a package-based project template. The package structure is already familiar to many R users, so it's easy to browse and find items even if you're not the author. rrtools generates a simple file structure like this, which is easy to modify and extend:

analysis/
|
├── paper/
│   ├── paper.Rmd       # this is the main document to edit
│   └── references.bib  # this contains the reference list information
|
├── figures/            # location of the figures produced by the Rmd
|
├── data/
│   ├── raw_data/       # data obtained from elsewhere
│   └── derived_data/   # data generated during the analysis
|
└── templates/
    ├── journal-of-archaeological-science.csl
    |                   # this sets the style of citations & reference list
    ├── template.docx   # used to style the output of the paper.Rmd
    └── template.Rmd
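For reference, here's a sketch of how rrtools scaffolds that layout. Function names are taken from the rrtools README; check your installed version, as defaults and arguments may differ:

```r
# Install rrtools from GitHub (it's not on CRAN)
# install.packages("devtools")
# devtools::install_github("benmarwick/rrtools")

rrtools::use_compendium("pkgname")  # create the package skeleton
rrtools::use_analysis()             # add the analysis/ tree shown above
rrtools::use_dockerfile()           # optional: Dockerfile for an isolated environment
rrtools::use_travis()               # optional: continuous integration on Travis
```

Each helper is additive, so you can take only the pieces you want.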

Our motivation is to prompt the user into good practices, rather than prescribe elaborate or idiosyncratic file structures that impose a cognitive load on the user. The rrtools readme has a short list of papers that describe the best practices that it draws on.

The Rmd file is where most of the code and text lives, but we also put code in script files that are used by the Rmd. I have one package-project per report or journal article, and my students have one for their dissertation or project report.
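A hypothetical example of that split: longer functions live in a script, and a chunk near the top of paper.Rmd sources them. The script path, file names, and `clean_data()` function here are illustrative, not part of the rrtools template:

```r
# In a setup chunk of analysis/paper/paper.Rmd (paths are hypothetical):
source(here::here("analysis", "scripts", "clean_data.R"))  # defines clean_data()

raw <- readr::read_csv(
  here::here("analysis", "data", "raw_data", "site_data.csv")  # hypothetical file
)
clean <- clean_data(raw)  # now available to the rest of the document
```

Using here::here() keeps the paths working whether the Rmd is knitted from the project root or from analysis/paper/.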

We made this template after studying examples on GitHub of researchers using the R package structure to organise and share their code, data and reports. We published the results of our survey in Packaging data analytical work reproducibly using R (and friends) (that @mara noted above, thanks!).

rrtools uses MRAN and Docker to isolate the computational environment and pin package versions (we tried packrat but found it unreliable). We use Travis CI or Circle CI for continuous integration. Combined, these give us a reproducibility check each time we push to GitHub.
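The MRAN side of this can be sketched in plain R: pointing the CRAN repo at a dated snapshot means every install resolves to the package versions available on that date. The snapshot date below is illustrative, and note that MRAN's availability is Microsoft's to change:

```r
# Pin installs to a dated CRAN snapshot via MRAN (date is illustrative);
# rrtools' generated Dockerfile uses the same mechanism via its base image
options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2019-03-01"))

install.packages("dplyr")  # installs the version that was current on that date
```

Putting this in the Docker image, rather than in interactive sessions, is what makes the CI run a meaningful reproducibility check.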

This package-as-research-compendium approach works best when the data are small-to-medium sized and the analysis can be done on a laptop. It could probably be adapted for bigger, longer-running projects, but I don't have much experience with that.
