How does the evolution from EDA -> analysis package work?

Are there good resources out there about how a workflow and analysis evolve as they go from exploratory work into a more polished product such as an "analysis package"?

It feels to me like there are significant differences in the workflow of an analysis as it goes from a single script -> multiple scripts -> scripts + functions -> scripts + functions + tests + documentation, etc.

I have a decent understanding of how the beginning stages work, and from reading the 'R Packages' book, I am getting a better understanding of how a package works. I have seen some scattered wisdom and comments out there about how to fit a large data analysis into a package structure. Where I am having trouble is the morphing process in between. My exploratory data analysis has gradually grown in size, so I want to add package features such as function documentation and tests, to make sure I am not making unintentional changes to the analysis as I refactor the functions that perform parts of it.

Just to give one example, package dependencies:

  1. During the multi-script phase, I tend to have a script called setup.R which I can call at the beginning of any given sub-analysis to make sure I have the same set of packages loaded everywhere. I also use this to source function files.

  2. [ ... mysterious middle ... ]

  3. For the fully-formed analysis package, the setup.R script doesn't work. Instead, each function should either declare its dependencies with roxygen @importFrom tags or use fully qualified package::fun calls inside the function body.
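
To make the two ends of that list concrete, here is a minimal sketch. Everything in it is illustrative: the placeholder package list, the assumed R/ directory, and the functions summarise_column() and summarise_column_qualified() are hypothetical and not taken from any real analysis.

```r
## Stage 1: a setup.R-style script (hypothetical contents).
## Base packages are used as placeholders so the sketch runs anywhere.
pkgs <- c("stats", "utils")
for (pkg in pkgs) {
  # character.only = TRUE lets library() take the name from a variable
  library(pkg, character.only = TRUE)
}

## Source every function file in an assumed R/ directory.
## list.files() returns an empty vector if the directory does not exist,
## so the loop is simply skipped in that case.
for (f in list.files("R", pattern = "\\.[Rr]$", full.names = TRUE)) {
  source(f)
}

## Stage 3: the dependency is declared per function, package style.
## Option A: a roxygen tag, which roxygen2 turns into a NAMESPACE entry.

#' Summarise a numeric vector (hypothetical analysis function)
#'
#' @param x A numeric vector.
#' @return A named numeric vector with elements `mean` and `sd`.
#' @importFrom stats sd
#' @export
summarise_column <- function(x) {
  c(mean = mean(x), sd = sd(x))
}

## Option B: no import tag, fully qualified call instead.
summarise_column_qualified <- function(x) {
  c(mean = mean(x), sd = stats::sd(x))
}
```

Run as a plain script, both functions return the same values; the roxygen comments are ordinary comments until roxygen2 processes them inside a package.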

But in the meantime, I just want to run some unit tests on my analysis functions. I don't want to make the dependencies perfectly minimal the way you would for a package you expect others to use. For now I have added @import tags for all the packages I used to load in setup.R, but R/devtools is constantly yelling at me, and tests are failing because of package dependency issues that aren't really related to my analysis.
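
A rough sketch of that stopgap, with placeholder names throughout: the file name, the package name myanalysis, and the packages dplyr and ggplot2 are stand-ins rather than the ones actually used in the analysis.

```r
# R/myanalysis-package.R  (hypothetical file name)
# Package-level roxygen block: one @import tag per package that setup.R
# used to attach. dplyr and ggplot2 are placeholders.
#
# Each package named here also has to be listed in the Imports field of
# DESCRIPTION, e.g.
#   Imports:
#       dplyr,
#       ggplot2

#' @import dplyr
#' @import ggplot2
"_PACKAGE"
```

Running devtools::document() regenerates NAMESPACE from these tags. Importing whole namespaces this way can produce "replacing previous import" warnings when two packages export the same name, which may or may not be part of the noise described above.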

Anyway, point is: are there good resources out there about how workflow changes as the analysis grows and gains package features?

Here's one post by @rmflight that might be of interest: "Creating an analysis as a package and vignette". There have also been a few threads here that could help fill in the mysterious middle.

  • @ehagen's response on Best Practice for good documented reproducible analysis

There are probably a few more, but those might get you rolling!

Thank you! I clearly need to work harder on searches for existing discussions on Rstudiocomm before starting a new thread.

No problem. None of them are exactly what you've described, but hopefully there will be helpful info in there.

To follow up on that, you might want to check out some of the responses to this tweet. There was also some discussion on the rstats subreddit, and @noamross has previously compiled some general resources on project organization that may be helpful as well.
