How does the evolution from EDA -> analysis package work?


#1

Are there good resources out there about how workflow and analysis evolve as they go from exploratory into a more polished product such as an “analysis package”?

It feels to me like there are significant differences in the workflow of an analysis from when it is a single script -> multiple scripts -> scripts + functions -> scripts + functions + tests + documentation, etc…

I have a decent understanding of how the beginning stages work, and from reading the ‘R Packages’ book, I am gaining an understanding of how a package works. I have seen some scattered wisdom and comments out there about how to fit a large data analysis into a package structure. Where I am having trouble is the morphing process itself: an exploratory data analysis gradually grows in size, and I want to add package features such as function documentation and tests, so that I can be sure I am not making unintentional changes to the analysis as I refactor the functions that perform parts of it.

Just to give one example, package dependencies:

  1. During the multi-script phase, I tend to have a script called setup.R which I can call at the beginning of any given sub-analysis to make sure I have the same set of packages loaded everywhere. I also use this to source function files.

  2. [ … mysterious middle … ]

  3. For the fully-formed analysis package, the setup.R script doesn’t work. Instead, each function should either have roxygen-declared @importFrom tags or use fully qualified package::fun calls inside the function body.
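
To make the two endpoints above concrete, here is a minimal sketch of what stage 3 can look like for a single function. The function name `summarise_column` and the commented-out `setup.R` lines are hypothetical and only there for illustration; the roxygen tags and the `package::fun` syntax are the standard ones. It uses only packages that ship with R, so it runs as-is.

```r
# Stage 1 (script phase): setup.R attaches everything up front.
# These lines are hypothetical and commented out so the sketch stays runnable.
# library(dplyr)
# source("R/functions.R")

# Stage 3 (package phase): each function declares what it uses.

#' Summarise one numeric column (hypothetical analysis helper)
#'
#' @param x A numeric vector.
#' @return A named numeric vector with the median and the MAD of `x`.
#' @importFrom stats median
#' @export
summarise_column <- function(x) {
  c(
    median = median(x, na.rm = TRUE),     # covered by the @importFrom tag
    mad    = stats::mad(x, na.rm = TRUE)  # fully qualified, no tag needed
  )
}

summarise_column(c(1, 2, 3, 4, 100))
```

Either style works on its own; mixing them in one function, as here, is only to show both side by side.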

But in the meantime, I just want to run some unit tests on my analysis functions. I don’t want to make the dependencies perfectly minimal the way you would for a package you expect others to use. For now I have added @import tags for all the packages I used to load in setup.R, but R/devtools is constantly yelling at me, and tests are failing because of package dependency issues that aren’t really related to my analysis.
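
For reference, the kind of test I have in mind is roughly the following: pin the output of an analysis function on a small known input, so that a refactor that changes the result gets caught. This is only a sketch. `standardise` is a hypothetical stand-in for one of my analysis functions, and the testthat call is guarded so the snippet also runs on a machine where testthat is not installed.

```r
# Hypothetical analysis function: centre and scale a numeric vector.
standardise <- function(x) {
  (x - mean(x)) / stats::sd(x)
}

# Regression-style test: the expected values are pinned by hand, so any
# refactor that changes the result makes the test fail.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("standardise() output is unchanged by refactoring", {
    testthat::expect_equal(standardise(c(1, 2, 3)), c(-1, 0, 1))
  })
}
```

In a package layout this would normally live in a file under tests/testthat/, without the `requireNamespace()` guard.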

Anyway, point is: are there good resources out there about how workflow changes as the analysis grows and gains package features?


#2

Here’s one post by @rmflight that might be of interest: “Creating an analysis as a package and vignette”. There have also been a few threads here that could help fill in the mysterious middle.

  • @ehagen’s response on Best Practice for good documented reproducible analysis

There are probably a few more, but those might get you rolling!


#3

Thank you! I clearly need to work harder on searches for existing discussions on Rstudiocomm before starting a new thread.


#4

No problem. None of them are exactly what you’ve described, but hopefully there will be helpful info in there.


#5

To follow up on that, you might want to check out some of the responses to this tweet. There was also some discussion on the rstats subreddit, and @noamross has previously compiled some general resources on project organization that may be helpful as well.