Data Science Project Template for R

denis · November 28, 2017, 2:32pm

I often struggle when organizing a project (file structure, RStudio's Projects...) and haven't yet settled on an ideal template.

I recently came across this project template for Python. I was wondering if there is such a thing for R and whether we, as a community, should strive to come up with a set of best practices and conventions.

The Python template seems to be overkill for most of the projects I work on, but it would be nice to have something similar for R. Is there something like this already available?

mara · November 28, 2017, 3:41pm

There are actually quite a few project templates out there. I rounded up some of them in a post, which came up in this thread:

I'm not sure that I had the STAT545 page in there:

There are also more that are targeted toward specific types of projects, e.g.
Machine Learning Project Template in R.

DaveRGP · November 28, 2017, 3:50pm

We recently looked at this problem in our useR group. One of our members has written this package:

As we noted in our presentation event, it's opinionated (it even has a "best practice" button!), but that's not always a bad thing. TBF, in templating that's almost the point.

mara · November 28, 2017, 4:07pm

eek, how is StefLocke's not in my list?! A gross oversight!

DaveRGP · November 28, 2017, 4:08pm

It's newish, not sure that's an excuse that would wash with her though ;p

mara · November 28, 2017, 4:09pm

No, I love that package and totally know about it! ** insert several lock and key puns here ** (look, I saved you the time of having to read my annoying dad jokes…even though I'm not a dad)

pavopax · December 6, 2017, 7:02am

I also have a lightweight project template:

pgensler · January 13, 2018, 8:56pm

It would be nice to be able to create a project using an existing template as a starting point, like from Cookiecutter, similar to how Visual Studio lets you do so. Visual Studio's integration with Cookiecutter is so good that I'm surprised it hasn't been brought up in light of these recent conversations (here and here).

Maybe this is just me, but wouldn't it make sense to build out a feature in RStudio where you can build a project from scaffolding or from a template like cookiecutter? The RStudio extensions mechanism doesn't really seem like the right solution for the issue at hand:
https://rstudio.github.io/rstudio-extensions/rstudio_project_templates.html

mara · January 13, 2018, 9:51pm

That's not true. If anything, there are multiple recommended project structures. Some packages that can be useful here (not an exhaustive list at all, just ones I happen to have handy):

pRojects by @stephlocke https://itsalocke.com/projects/
- includes several different project templates, to boot:

https://twitter.com/dataandme/statuses/938106243977236485

template by Francisco Rodriguez-Sanchez (GitHub: Pakillo/template), a template for data analysis projects structured as R packages (or not)
RStudio Project Templates extension
ProjectTemplate by John Myles White

jennybryan · January 14, 2018, 5:25am

Adding a bit more ...

The usethis package has some basic support for non-package projects in the CRAN version, such as create_project().
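For anyone who hasn't tried it, a minimal sketch of that call might look like the following. The project name `my-analysis` and the temporary location are placeholders chosen for illustration, and the exact files created may differ between usethis versions.

```r
# Minimal sketch: create a non-package project with usethis.
# "my-analysis" and the tempdir() location are illustrative placeholders.
project_dir <- file.path(tempdir(), "my-analysis")

if (requireNamespace("usethis", quietly = TRUE)) {
  # open = FALSE keeps the current session where it is instead of
  # switching to the newly created project.
  usethis::create_project(project_dir, open = FALSE)
  # Show whatever usethis put in the new project directory.
  print(list.files(project_dir, all.files = TRUE, no.. = TRUE))
} else {
  message("usethis is not installed; try install.packages(\"usethis\")")
}
```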

Context: usethis is one of the packages created in the Great Devtools Diaspora. It is the primary home for workflow functions, for example, the functions to create a package or project or to add specific files or features during development, like adding test infrastructure or using Git/GitHub.

A new release is imminent (check out the dev version on GitHub) and non-package support has increased moderately. But it's still limited to fairly mechanical operations, initiated by the developer.

The release after that is likely to include opinionated support for a specific analytical workflow and directory structure. More in the tradition of, say, ProjectTemplate (but quite different).

benmarwick · January 15, 2018, 8:11am

I use rrtools, inspired by devtools, as a package-based project template. The package structure is already familiar to many R users, so it's easy to browse and find items if you're not the author. rrtools generates a simple file structure like this, which is easy to modify and extend:

analysis/
│
├── paper/
│   ├── paper.Rmd       # this is the main document to edit
│   └── references.bib  # this contains the reference list information
│
├── figures/            # location of the figures produced by the Rmd
│
├── data/
│   ├── raw_data/       # data obtained from elsewhere
│   └── derived_data/   # data generated during the analysis
│
└── templates/
    ├── journal-of-archaeological-science.csl
    │                   # this sets the style of citations & reference list
    ├── template.docx   # used to style the output of the paper.Rmd
    └── template.Rmd
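
To get a feel for the layout without installing anything, the directories in the tree above can be recreated with base R alone. This is only a rough sketch, not how rrtools builds them: the helper name `create_analysis_skeleton()` and the `compendium-demo` location are made up for illustration, and the files are created empty.

```r
# Hypothetical helper (not part of rrtools): recreate the analysis/
# layout shown above using only base R.
create_analysis_skeleton <- function(root) {
  analysis <- file.path(root, "analysis")
  dirs <- file.path(analysis, c(
    "paper",
    "figures",
    file.path("data", "raw_data"),
    file.path("data", "derived_data"),
    "templates"
  ))
  # recursive = TRUE also creates analysis/ and analysis/data/ as needed.
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

  # Empty placeholder files; existing files are left untouched.
  files <- file.path(analysis, "paper", c("paper.Rmd", "references.bib"))
  file.create(files[!file.exists(files)])

  invisible(analysis)
}

# Try it in a throwaway location.
demo_root <- file.path(tempdir(), "compendium-demo")
create_analysis_skeleton(demo_root)
print(list.files(demo_root, recursive = TRUE, include.dirs = TRUE))
```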

Our motivation is to prompt the user into good practices, rather than prescribe elaborate or idiosyncratic file structures that impose a cognitive load on the user. The rrtools readme has a short list of papers that describe the best practices that it draws on.

The Rmd file is where most of the code and text lives, but we also put code in script files that are used by the Rmd. I have one package-project per report or journal article, and my students have one for their dissertation or project report.

We made this template after studying examples on GitHub of researchers using the R package structure to organise and share their code, data and reports. We published the results of our survey in Packaging data analytical work reproducibly using R (and friends) (that @mara noted above, thanks!).

rrtools uses MRAN and Docker to give some isolation of the computational environment and package versions (we tried packrat but found it unreliable). We use Travis or CircleCI for continuous integration. With the combination of these, we can get a reproducibility check each time we push to GitHub.

This package-as-research-compendium approach is optimal when the data are small-to-medium sized and the analysis can be done on a laptop. It probably could be adapted for bigger and long running projects, but I don't have much experience with that.

mara · January 15, 2018, 10:24am

I knew there was a relevant thread somewhere! Good call, Ben. I knew I was forgetting lots of good ones, rrtools included, obviously!

@denis, see what I meant about options, now? Personally, my project structures will vary slightly depending on collaborators, etc. Using the here package is nice, since it means you can move your Rmd around without wreaking havoc, if you so choose, and you don't need to worry about hierarchies as much.
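
To illustrate why root-relative paths make files easy to move around, here is a base-R sketch of the general idea: walk up from a starting directory until a marker file is found, then build paths from that root. This is not the here package's implementation; `find_project_root()`, `project_path()`, and the marker list are assumptions made for the example.

```r
# Hypothetical helpers illustrating root-relative paths.
# NOT the implementation used by the here package.
find_project_root <- function(start = getwd(),
                              markers = c(".here", "DESCRIPTION", ".git")) {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    has_marker <- any(file.exists(file.path(current, markers)))
    has_rproj <- length(list.files(current, pattern = "\\.Rproj$")) > 0
    if (has_marker || has_rproj) return(current)
    parent <- dirname(current)
    # dirname() of the filesystem root is the root itself: stop there.
    if (identical(parent, current)) {
      stop("No project root found above ", start)
    }
    current <- parent
  }
}

# Build a path relative to the project root, wherever we start from.
project_path <- function(..., start = getwd()) {
  file.path(find_project_root(start), ...)
}

# Demo in a throwaway project: the same call resolves to the same file
# whether it is made from the root or from a nested report directory.
demo_root <- file.path(tempdir(), "root-demo")
nested <- file.path(demo_root, "analysis", "paper")
dir.create(nested, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, ".here"))

from_root <- project_path("data", "raw.csv", start = demo_root)
from_nested <- project_path("data", "raw.csv", start = nested)
print(identical(from_root, from_nested))
```

With the real package, the equivalent call is along the lines of `here::here("data", "raw.csv")`.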

pgensler · January 17, 2018, 2:53am

Yes, Docker is incredibly useful for reproducibility. I wrote a blog post on it:
https://medium.com/@peterjgensler/creating-sandbox-environments-for-r-with-docker-def54e3491a3
Is your Dockerfile made to be a template for someone's project? Mine is definitely nowhere near as small as what you have.

Personally, I'm a huge fan of Cookiecutter's template, but I can imagine how useful here::here would be:

Are you using CI for deploying the container, or simply for building your scripts for the analysis?
Can I ask why you are using CircleCI for CI? I've found it so hard to work with CircleCI and Docker that I'm not sure it's optimal for beginners to use.

benmarwick · January 17, 2018, 6:11am

Thanks for your comment. To address your questions directly:

Is your Dockerfile made to be a template for someone’s project?

Yes, rrtools generates a custom dockerfile from a template. There are two details that keep it small.

First is using rocker/verse as the base image, so we get all the tidyverse packages without waiting.
Second is that we organise the compendium as an R package, and any other packages used in the project are listed as Imports: for that pkg. So when our compendium pkg is installed in the container, all the non-tidyverse dependencies are installed also. We don't need to list them in the dockerfile because they're already in the DESCRIPTION file for the pkg.
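
As a rough illustration of the second point, the dependency list really does live in one place and can be read back with base R. The DESCRIPTION contents below (package name `mycompendium` and the three imported packages) are invented for the example, and `read_imports()` is a hypothetical helper, not something rrtools provides.

```r
# Write a small, made-up DESCRIPTION file to a temporary location.
desc_file <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: mycompendium",
  "Title: Research Compendium for a Hypothetical Paper",
  "Version: 0.0.1",
  "Imports:",
  "    sf,",
  "    here (>= 0.1),",
  "    broom"
), desc_file)

# Hypothetical helper: return the package names listed under Imports.
read_imports <- function(path) {
  raw <- read.dcf(path, fields = "Imports")[1, 1]
  if (is.na(raw)) return(character(0))
  pkgs <- strsplit(raw, ",")[[1]]
  # Drop version constraints such as "(>= 0.1)" and surrounding whitespace.
  pkgs <- trimws(gsub("\\s*\\([^)]*\\)", "", pkgs))
  pkgs[nzchar(pkgs)]
}

imports <- read_imports(desc_file)
print(imports)
```

Any tool that installs the compendium package with its dependencies resolves this same field, which is why the packages don't need repeating in the Dockerfile.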

There is occasionally some minor manual handling required for the dockerfile, for example when an R pkg requires a non-R library and we need to add an apt-get install xxx or two. We experimented with containerit::dockerfile() to auto-generate dockerfiles based on the project, but they came out very verbose.

Are you using CI for deploying the container, or simply for building your scripts for the analysis?

In rrtools we are using CI for testing that we can successfully knit the Rmd to a PDF or Word doc. The manuscript or report is typically the primary output for us, so we use the CI to ensure that we haven't introduced any changes that prevent the manuscript from being rendered. Currently it's not set up to collect anything from the CI (except the red or green badge!), so we're not deploying in any sense.
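
The kind of check described here can be sketched in a few lines of R. This is not the actual rrtools CI configuration: `check_render()` is a hypothetical helper, and the path in the commented-out usage line is only an example.

```r
# Hypothetical helper: TRUE if the document renders, FALSE otherwise.
# render_fun defaults to rmarkdown::render, but is only looked up when
# the default is actually used, so a stand-in can be passed for testing.
check_render <- function(rmd, render_fun = rmarkdown::render) {
  tryCatch(
    {
      render_fun(rmd)
      TRUE
    },
    error = function(e) {
      message("Rendering failed: ", conditionMessage(e))
      FALSE
    }
  )
}

# In a CI script, a non-zero exit status is what marks the build as failed:
# if (!check_render("analysis/paper/paper.Rmd")) quit(status = 1)

# Quick demonstration with stand-in render functions.
ok <- check_render("paper.Rmd", render_fun = function(f) invisible(f))
broken <- check_render("paper.Rmd", render_fun = function(f) stop("missing figure"))
print(c(ok = ok, broken = broken))
```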

Can I ask why you are using CircleCI for CI?

We use Travis by default, but only CircleCI allows for free CI on private GitHub repos. Sometimes we want to keep our projects private until publication, so if we want CI during that time (and don't want to pay), then we can use CircleCI. We haven't found it much different from Travis, but we're using both only in very simple ways, so we might be missing some of the complexity that you deal with.

And your question in an earlier post (that I just now noticed):

wouldn’t it make sense to build out a feature in RStudio where you can build a project from scaffolding or from a template like cookiecutter?

There is an experimental, work-in-progress fork of rrtools by Matt Harris that takes advantage of the RStudio Project Template system:

jalsalam · January 17, 2018, 2:52pm

I'm very excited. Are there write-ups about this workflow? I can just try to be patient I guess.

denis · January 17, 2018, 4:05pm

Thank you so much for the input, everyone. What a great community we have, right? Maybe we should try to create a template of the R community, so other languages can at least try to emulate it.

jennybryan · January 17, 2018, 9:59pm

There is a design document but very much in flux. I'd describe current status as "opportunistic dogfooding" and probably not terribly useful/interesting to others.

Rainer · January 22, 2018, 11:35am

Hm. I haven't looked at them in detail, but I am (ab)using the package structure in R: I put R functions in an R directory, reports and analysis into inst/Reports (using R notebooks and knitting), data into Data, etc. Using the DESCRIPTION file for dependencies etc., I can simply load everything using devtools::load_all(). Works perfectly for me. But maybe I am missing something here?

rdataforge · January 26, 2018, 11:18am

My two cents: a Shiny add-in to initialize an RStudio project. The best part is that you can choose folders and names, and even create your own structure.