Project-oriented workflow; setwd(), rm(list = ls()) and computer fires

I wrote a blog post elaborating on recent provocative slides that discourage the use of setwd() and rm(list = ls()) at the top of R scripts.

The Twitter reaction was a bit shocking in volume and it's hard to discuss things there. So I've made this thread in case the conversation continues! Note: we've had a semi-related thread already: First line of every R script?.

21 Likes

I followed a little bit of that Twitter discussion and was somewhat surprised by the pushback the idea received (though, as you point out, the wording may have had something to do with that).

I just want to reinforce one bit:

In my personal experience, not getting into the habit of saving intermediate steps to a file was the biggest impediment to completely internalizing the idea that "source is real" ((c) @jennybryan?). It makes you very dependent on your workspace, as a section of code that takes even two minutes to run seems highly wasteful to re-run when you are in the middle of an analysis. And then, inevitably, while testing syntax you make an irreversible change to an object that takes some time to generate, so you try to include the steps that produced the change in your source file, but you won't run it because that would mean wasting multiple minutes of your life.

The point being, definitely save your intermediate steps! If the object itself is a reasonably small and simple data frame (no nested columns/strange attributes/etc), it can even make sense to save it out to a CSV instead of an RDS. Being able to "show your work" with intermediate files can help when sending the analysis to a non-R-using client.
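As a minimal sketch of what that can look like (object and file names here are purely illustrative):

# an intermediate result that took a while to compute (illustrative example)
clean_summary <- aggregate(len ~ dose + supp, data = ToothGrowth, FUN = mean)

# save it so later sessions pick up from here instead of re-running everything
saveRDS(clean_summary, "clean_summary.rds")
clean_summary <- readRDS("clean_summary.rds")

# a small, flat data frame can also go out as a csv for non-R-using collaborators
write.csv(clean_summary, "clean_summary.csv", row.names = FALSE)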

3 Likes

I don't use setwd() or rm(list = ls()); instead I prefer to make sure RStudio never saves my workspace on exit, restart R habitually, and use ProjectTemplate for caching, loading, and munging. I like the workflow that ProjectTemplate sets up for you. Here's how I typically use it.

Starting clean, I throw all my data into subfolders within the data folder. I set recursive data loading to false in the global.dcf config file, and create .R scripts within data/ that load the data files. I do this rather than rely on automatic loading by file type because files are rarely as clean as I need them to be. Using scripts, I can use data.table::fread or readxl easily, supply column types up front, filter out unneeded columns, and work through whatever other ugly data steps are required.
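A data-loading script under data/ might look roughly like this (file and column names are made up):

# data/orders.R -- run by load.project(); creates the 'orders' object
library(data.table)

orders <- fread(
  "data/raw/orders_2018.csv",
  select     = c("order_id", "customer_id", "amount", "order_date"),
  colClasses = c(order_id = "character", customer_id = "character")
)

# drop rows that are never needed downstream
orders <- orders[!is.na(amount)]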

Once I have the data loaded, I'll restart R, load the project, and use the caching function to have ProjectTemplate automatically cache all the files it loads. There are some quirks, such as having to use dots instead of underscores in object names, but that's tolerable, if slightly annoying.

From there, I set the data loading config to FALSE, so I don't accidentally start loading data anymore, and then work on munging. This is where I start to add more cache('...') calls, usually after any major munging step. I might use rm() as well to remove the original datasets from the global environment if they're no longer needed. I tend not to overwrite existing objects (I find I regret it every time I do).
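A rough sketch of what one of those munge scripts ends up looking like (object names are made up, and note the dots-instead-of-underscores quirk mentioned above):

# munge/01-prepare-orders.R -- assumes 'orders' was loaded/cached by load.project()
library(dplyr)

orders.monthly <- orders %>%
  mutate(order_date = as.Date(order_date)) %>%
  group_by(customer_id, order_month = format(order_date, "%Y-%m")) %>%
  summarise(total_amount = sum(amount))

# cache the munged object so later sessions can load it directly
cache('orders.monthly')

# the raw object is no longer needed in the global environment
rm(orders)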

Once munging is complete, I set munging to FALSE in the global config as well, restart R, and start the analysis. Every new analysis uses load.project() to load cached entries, though I can override that by setting cache loading to FALSE and loading only particular objects if needed. Ideally, all datasets end up in such a state that analyzing them becomes trivial compared to the real hard work: importing and munging.
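For reference, those toggles all live in config/global.dcf; by the analysis stage mine ends up looking something like this (excerpt only, and field names may differ a little between ProjectTemplate versions):

data_loading: FALSE
recursive_loading: FALSE
cache_loading: TRUE
munging: FALSE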

I'm going to transfer some Q&A from Twitter to this thread.

is there any advantage to using here::here() over just regular relative paths (eg. "../figs/blah.png")? I'm guessing Win/*nix portability, maybe?
from @rensa_co

Yes I think so. I allude very cryptically to this in the blog post. Using the .. strategy assumes that working directory will always be constant, relative to the project, at run time. But here are two common scenarios where it's a real struggle to make that true.

Rmarkdown in a subdirectory. I like to have subdirectories and I like to use the "Knit" button in RStudio. But I like to leave R's working directory set to top-level of Project during development. By using here::here() to build paths, my code works during interactive development and the whole document renders with the "Knit" button.
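A tiny made-up example of the difference:

library(ggplot2)
library(here)

p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# fragile: "../figs/blah.png" only resolves if the working directory happens
# to be the Rmd's own subdirectory
# ggsave("../figs/blah.png", p)

# robust: the path is built from the project root, so it works both during
# interactive development and when knitting from the subdirectory
ggsave(here("figs", "blah.png"), p)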

Tests. I use testthat for unit tests in packages. I like to leave R's working directory set to top-level of Project during package and test development. But various ways of running the tests have the working directory set elsewhere, i.e. lower in the package. By building paths to reference files and objects with here::here() or testthat::test_path(), interactive development and automated testing are no longer in tension re: working directory.
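A hypothetical test showing the same idea with testthat::test_path() (file names are made up):

# tests/testthat/test-reference.R
library(testthat)

test_that("reference object can be read", {
  # test_path() resolves relative to tests/testthat/, whether the test runs
  # interactively from the project root or via devtools::test() / R CMD check
  ref <- readRDS(test_path("reference", "expected_output.rds"))
  expect_true(is.data.frame(ref))
})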

4 Likes

My team is often in a situation where two or more projects rely on similar data. For context, we happen to extract subsets from internal databases. I'm curious whether anyone has opinions on the sanest way to handle this, especially where the data for each project should live.

Do you favour subdirectories for each "sub project"? Does that get confusing?

Or absolute paths to a dedicated central folder for the data? That seems to be out of here::here's reach. Would this be a reason to use setwd() if you have some project-specific prep to do? Or is a "there" package in the making to extend here::here's reach?

Or finally is it best to separate folders entirely for each project each with its own copy of the data extract? That might help keep data stable for each project, but I wonder if it threatens having a single source of truth in projects dependent on it.

1 Like

I would suggest having the first of your scripts for each project copy the data from the central location, if the data is of a size such that it's feasible to do so. That way, your data can't change out from under you, but it's explicit where it came from and can be updated if desired.

For that script, an absolute path seems reasonable, as anyone outside of your organization won't be able to access it regardless, and anyone in the organization should have access to the given path. It would be similar to having a script that pulls data from a database prior to processing -- your reference to the database is generally going to be "absolute".
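A minimal sketch of such a script, assuming a made-up network path:

# 00-get-data.R -- copy the shared extract into the project (paths are made up)
central_extract <- "//fileserver/analytics/extracts/customers_2018-06.csv"
local_copy      <- here::here("data", "customers_extract.csv")

# keep a project-local copy so the data can't change out from under the analysis
if (!file.exists(local_copy)) {
  file.copy(central_extract, local_copy)
}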

At a high level, I like the project organization advice given in Good enough practices in scientific computing (full disclosure: I am a co-author, but didn't write this bit):

As a rule of thumb, divide work into projects based on the overlap in data and code files. If 2 research efforts share no data or code, they will probably be easiest to manage independently. If they share more than half of their data and code, they are probably best managed together, while if you are building tools that are used in several projects, the common code should probably be in a project of its own.

You'll have to mentally adjust all of that for your case, where shared data is the "tool" that is used in several projects.

For your specific situation, and with R in mind, you could put shared data extracts into a data package, so each project can just use library() instead of copying and loading delimited files or the like. Many companies, such as Airbnb, have also written internal packages to make it easier to use such internal data sources consistently. If you had that, each of your individual projects could contain the logic to do its own data extraction.
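As a rough sketch, assuming a hypothetical internal package called ourdata:

# inside the ourdata package, e.g. data-raw/customers.R (path is made up):
customers <- read.csv("//fileserver/analytics/extracts/customers_2018-06.csv")
usethis::use_data(customers, overwrite = TRUE)

# in any project that needs the extract:
library(ourdata)
data("customers")
summary(customers)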

3 Likes

thank you so much for taking the time to write this up! I didn't know about here::here() and find it absolutely delightful to use.

the part that resonated with me the most was how incorporating these steps into your workflow makes your code both portable and shareable!

3 Likes

I'll look into the here package, but in the scripts I publish on GitHub I usually write the following:

# SET THE WORKING DIRECTORY APPROPRIATELY
setwd('~/set/approp/wd/')

This tells them to change the working directory to the one where they cloned or downloaded my project.
It assumes people understand how to set a working directory, which I think they do.

The main point of the post, though, is exactly that this is an unsustainable practice. It assumes that every recipient will hand-edit every script to reflect their local path.

If someone clones a Git repo, the standard convention is that everything is written relative to that project/repo. The here package will recognize the top-level directory of a Git repo and supports building all paths relative to that.
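So a script in the repo can build every path from wherever it was cloned, with no editing required, e.g. (file names made up):

# no setwd() needed: here() locates the top of the cloned repo/project
library(here)

dat     <- read.csv(here("data", "input.csv"))
results <- here("output", "results.csv")
write.csv(dat, results, row.names = FALSE)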

8 Likes

Any suggestions for situations when another user doesn't have a package installed that is called by library()? They can certainly install it easily enough, but this forces the new user to act before they can run the script. Is packrat a solution to this?

Depending on the context in which the script is being shared, packrat may be an option. In general, though, just including the library() calls at the top of the script should be enough -- I may not want to run your script if it requires 15 packages I don't have.

If the analysis crosses several .R / .Rmd files, then a single install_required_packages.R script could be helpful. Of course, the real answer at that point is to just make it a package.

Thanks for the suggestions. But let's pretend you need to run the code with 15 packages you don't have :slight_smile: - the goal, after all, is portability, right?

Sure, a package will definitely be portable, but it can also make things less editable when sharing with someone who is less familiar with package building. Also, simple projects usually don't warrant a package (an opinion of mine that could be debunked, I suppose). I think an install_required_packages.R is a good idea in those cases - perhaps checking for installed packages first and only installing if not found. Something like:

# packages required by the analysis
packages_needed <- c('tidyr', 'dplyr', 'ggplot2')

# install only the packages that aren't already present
installed <- rownames(installed.packages())
for (p in packages_needed) {
  if (!p %in% installed) {
    install.packages(p)
  }
}

But installing packages within a 'resident R script' seems to be a pet peeve of @jennybryan so I still wonder if this is really best practice?

1 Like

I think, if you feel there's a need to assist with package installation, then it's a great idea to make it a stand-alone script that is clearly labelled.

My main pet peeve is people mixing package installation into data analysis scripts.

This is one area where shipping a data analysis as a package has a distinct advantage, because DESCRIPTION now captures the dependencies and installation of the package will ensure all the necessary packages are present.
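For example, the relevant bit of a DESCRIPTION might read as follows (package names are just placeholders); installing the package then pulls in everything listed under Imports:

Package: myanalysis
Title: A Hypothetical Analysis Packaged as an R Package
Version: 0.1.0
Imports:
    dplyr,
    ggplot2,
    readr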

4 Likes

@jennybryan I want to say thanks for writing up your thoughts on this subject. I think a lot of the pushback on Twitter was because giving a talk/presentation in person isn't the same as sharing an image online.

I definitely get where you're coming from, which is that using simplistic shortcuts to "make the thing work" can get us into real trouble when our code needs to be useful for other people. I quickly got out of the habit of using setwd() at the beginning of scripts because I had to move a bunch of files to a server environment and then nothing worked.

5 Likes

Thank you for the arson threat. This finally made me remove the only setwd() I have in my code, which had always annoyed me. :smile:

4 Likes

One unexpected benefit to using here::here() was that switching between interactive and knitr modes became seamless. For example, all of my paths were originally relative to my doc/doc.Rmd file, hence loading data files was load("../data/data.RData") when knitting, but I'd have to manually set my working directory to doc/ when working interactively to test new code, debug old code, etc. Using here::here("data/data.RData") works both ways and I don't have to think about it. Maybe that makes sense...

[I also removed the `rm(list = ls())` at the top of my .Rmd file while I was at it.]

2 Likes

One unexpected benefit to using here::here() was that switching between interactive and knitr modes became seamless.

Yes that is exactly one of the aggravations @krlmlr set out to solve :grin: :tada: Such sweet relief!

2 Likes

I have had to force students to use pacman::p_load because it just works. 90% of markdown not knitting is improper use of library() or install.packages(). Understand that I am talking about R newbies who do not always listen carefully to advice :slight_smile:
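For anyone unfamiliar: p_load() installs whatever is missing and then loads everything, so one line near the top of the Rmd covers both steps (the package list is just an example):

# install pacman itself once if needed, then let it handle the rest
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, readr)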

Just noticed this in my weekly "Summary" email. Just wanted to say (as I had posted on Twitter) that I totally agree. I never use those constructs in my own work and do not encourage them.

The issue I have is that the notebook interface can be confusing for newbies. Perhaps I should not use it for teaching. The results of executing a notebook can be influenced by the session history, so I find students do not realize their markdown is not standalone, i.e. it depends on peculiarities of their environment. 90% of this is package management: they load packages but do not add the command to their markdown. I have had to force them to use code that minimizes the chance that their markdown will not knit when I get it. In that specific context, clearing the environment explicitly has worked 100% of the time.

An alternative is forcing them to turn in packages so they can utilize check, Travis, etc. However, everything in teaching is a tradeoff. You have to decide what to teach and what not to teach, and so far that has not made sense.