Project-oriented workflow; setwd(), rm(list = ls()) and computer fires

Yeah, I see this too. But in my course, submitting the work = pushing the .Rmd and the rendered .md to GitHub. It only takes one (temporary) mark of 0 :slightly_smiling_face: to really make the point that the .Rmd must actually load any packages that it uses.

I freely acknowledge that every teaching environment is different. I'm dealing with graduate students, over the course of 13 weeks. This means I am able to take a hard line that they must write code that is self-contained, because it's how they need to work long-term.

I know that @mine and Colin Rundel use wercker with their undergrads, so the students can catch these obvious problems, such as not including library() calls, prior to HW submission. But that's feasible because they (the instructors) control the students' computing environment.

First of all, please don't set my PC on fire. :blush:

I am trying not to use rm(list = ls()). I am using it inside a project when I need to source some R scripts sequentially. Some scripts generate large objects at the end, and I need to remove them before running the next script.

Manually, I could remove the objects and restart the R session before running each R script, as many times as needed. What is the option if I want to source N files from inside a specific R script?

Thanks!

If you are running a series of scripts and want them to work independently, then running them from something like a shell script using Rscript seems like the best route. That way, you get a truly clean R environment each time (as mentioned earlier, removing objects won't unload packages and such). There are a bunch of resources on good ways to handle it, but GNU make gets a lot of well-deserved press:
https://wlandau.github.io/2016/06/14/workflow/

That will also help with only re-running those scripts that are actually required, based on the changes made.

Thanks for your help, nick!

Maybe I'll create a new R script that runs the other R scripts in fresh R sessions via the system() function.
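
Something along these lines, maybe (just a rough sketch, with made-up file names):

# Run each script in its own fresh R process via Rscript, so no objects
# or loaded packages carry over from one script to the next.
scripts <- c("01-import.R", "02-clean.R", "03-model.R")

for (s in scripts) {
  status <- system2("Rscript", args = shQuote(s))
  if (status != 0) stop("Script failed: ", s)
}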

Update: If you also have R installed from Anaconda (multiple R installations), things can get a little more complicated. Besides that, version control of packages (checkpoint/packrat) is another component to look at.
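
For example, checkpoint can pin the whole package library to a CRAN snapshot date (the date below is arbitrary, just to show the idea):

# Use packages exactly as they were on CRAN on a given date
library(checkpoint)
checkpoint("2017-12-01")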

I see. Currently changing all projects on github to include here :slight_smile:

Well, to be clear, that's not strictly necessary! I'm trying to publicize and encourage this convention: it's always implied that the working directory is set to the project folder. Agreement on this convention is all we need for simple projects where all files live together in one big happy directory.

But it is absolutely true that many real-life projects need more structure, i.e. they require subdirectories within the project. That produces more moving parts re: the location of a script or .Rmd vis-a-vis the working directory. There is no longer a global convention that will make everything "just work" for everyone, all the time. In this very common scenario, the here package really shines for building paths that "just work".
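
For example (a small sketch with made-up paths), a script living in a subdirectory can read a file from data/ without worrying about the working directory:

# here() builds the path from the project root, regardless of where
# the script or .Rmd that calls it happens to live within the project.
library(here)

here("data", "raw", "survey.csv")
# e.g. "/home/jenny/my-project/data/raw/survey.csv"

dat <- read.csv(here("data", "raw", "survey.csv"))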

As @nick said, it sounds like you might have outgrown using one R script to run other R scripts. If you're cleaning out the workspace, that means the scripts aren't transmitting info from one to the next via objects in the workspace (which is good!). So the scripts are sharing an R process purely out of convenience. So, yes, you might want to automate this pipeline using make.

We have a unit on this in STAT 545: http://stat545.com/automation00_index.html

Couldn't agree more. I switched from R to Python for most of my work, where Python's os module helped a lot with issues like this. But after reading up on the here package, I'm restructuring a lot of the scripts I have that many people fork and try to replicate.

Example here, with the script featuring here() instead of setwd().

In the context of pipeline creation, the "drake" package seems a very promising alternative/complement to make:

https://wlandau-lilly.github.io/drake
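
For example (a rough sketch; summarize_data() and save_report() are made-up helper functions), a drake plan declares targets and their commands, and make() rebuilds only what is out of date:

library(drake)

plan <- drake_plan(
  raw     = read.csv(file_in("data/raw.csv")),
  summary = summarize_data(raw),
  report  = save_report(summary, file_out("output/report.csv"))
)

make(plan)  # only outdated targets get rebuilt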

Just published an article on datascience.com where I integrated here() into my project!

It has definitely helped with creating reproducible projects and with reducing headaches around version control. Thanks again @jennybryan!

I took another look at here just now after reading your post, @raviolli77. Once I read the section at the bottom of this README of @jennybryan's called "The Fine Print", it all clicked for me. I think last time there was a bit of, "Do I need to be using an R Project to make this work?" so seeing the heuristics here uses to determine the parent directory helped it all land for me :slight_smile: Thanks, both of you!
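
For anyone else who had the same question: here::dr_here() reports where here() starts and which criterion (an .Rproj file, a .git directory, a .here file, ...) it matched, which makes those heuristics easy to check in your own project:

# dr_here() explains which project-root criterion was matched
library(here)
dr_here()
here()  # the root that all here("...") paths are built from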

Awesome, glad to have been able to help!

I'm looking for suggestions about improving our workflow. We have a central repository for data, so the raw data will never be where the R project is saved. We tend to do read_csv("really long path/filename.csv") or something like that. Because the data we are working with are on Linux shares, these commands work from any of our computers (Windows, Mac, and Linux alike). What are other people doing if they are in the same situation?

Hi @StatSteph,

I hope people with direct experience weigh in. In the meantime, I'll relay some approaches I've come across:

Have a convention whereby you create a symbolic link in each project to the data, in a standard location within the project. This then becomes part of initial project setup on a computer, and the analysis code remains portable. I think @HenrikBengtsson does this?

Create an internal package to facilitate data access, e.g. one that builds data paths based on the OS or on a config/startup file in the project. Once you have such a package, you might even find it a convenient way to share other common functions, e.g. for project setup or ggplot2 themes.
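
For the second approach, a stripped-down sketch of such a helper (the function name and share locations are invented) might look like:

# Hypothetical helper from an internal package: build paths into the
# central data share, picking the right mount point for each OS.
data_path <- function(...) {
  root <- switch(
    Sys.info()[["sysname"]],
    Windows = "//fileserver/research-data",  # UNC path on Windows
    Darwin  = "/Volumes/research-data",      # macOS mount point
    "/mnt/research-data"                     # Linux and anything else
  )
  file.path(root, ...)
}

# usage: read_csv(data_path("project-x", "raw", "file.csv"))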

Yup, I've been working with file links so that it appears as if there's a local directory but it is actually living somewhere else. On Linux, these are created as:

$ ln -s /path/to/target_folder .

This will create a "virtual" folder target_folder/ in the current directory. One can do something similar on Windows, using a completely different command call. But for simplicity, I use R.utils (I'm the author) for these tasks, which is cross-platform:

> R.utils::createLink(target = "/path/to/target_folder")
> dir("target_folder")  # will list the files in /path/to/target_folder

Now, on Windows, that dir() may not work (depending on file system format and type of link). The fully cross-platform way is to use:

> path <- R.utils::filePath("target_folder", expandLinks = "any")
> dir(path)

This will even follow good old Windows Shortcut links.

It's been a while since I've actively used it on Windows, but it has worked flawlessly for a good decade now, and CRAN checks validate the above on a regular basis on all OSes.

Our internal packages tend to include read functions with paths to the data hard-coded. This generally works fine, except when the IT department decides to move things around without telling anyone!

Yes, I've been there. They changed the names of all our servers... they told us, but provided no mapping between old and new. It was trial and error, sitting in a Linux terminal using autocomplete until we found the correct one. At least since you had a package, you only had to change the package code to fix everything that depends on it. We had to change ALL our path references. Most of the work done here is in SAS, and some poor person was changing libnames in hundreds of programs.

I have a package that provides an archive_path() function pointing to our data archive, and I use it religiously when loading data from there. If the path changes, I just have to modify this one function. The plus is also that it's just a few lines of code to make it work on Windows as well as Linux machines.
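
In case it's useful, a stripped-down version of that idea (the mount points below are invented placeholders) is just:

# Minimal archive_path() sketch: one place to change if the archive ever moves
archive_path <- function(...) {
  root <- if (.Platform$OS.type == "windows") {
    "//archive-server/data"
  } else {
    "/mnt/archive/data"
  }
  file.path(root, ...)
}

archive_path("2017", "survey", "responses.csv")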