Separating code and data - using setwd() or not

Hello. I am cautiously asking about this because I know my view is a bit controversial, as shown by the discussion here on Twitter.

In my view, a core element of the Twitter thread above is where Hadley Wickham states "Surely you should always keep code and data together". That is logical if your "project" is a one-off research study where all your code and data logically fit together.

But what about a case where, for example, you use a stand-alone script that reads data from a folder on a weekly basis, and the folder it reads from varies from week to week? The script may also source some utility functions held somewhere else on a shared drive, or in a folder that is not a sub-folder of the RStudio project.

In such a use-case, is it really that bad to have a line at the top of the script where the person running the script can explicitly set the location of the source data and/or any common scripts that need to be loaded before the main script is run?

In such a case, copying the shared utility scripts each time into a Project folder would seem to violate the DRY principle of coding.

This use-case is not really a standard "research-project" situation but more like an operational workflow.

Any views on this appreciated!

This is a use case for {here}. Implementation would depend on how much you want to change existing workflows. Probably the easiest approach is a git-based project store providing a folder hierarchy in which to home-base user sessions. I like to organize projects along these lines:

drwxr-xr-x    - ro 30 Jun  2022 data
drwxr-xr-x    - ro 30 Jun  2022 docs
.rw-r--r--  205 ro 11 Oct  2022 fcast.Rproj
drwxr-xr-x    - ro 30 Jun  2022 R
.rw-r--r--   49 ro 30 Jun  2022 README.md

This allows a script in the R folder (or anywhere else in the directory tree) to work as follows:

d <- read.csv(here::here("data", "latest.csv"))

Many thanks for your response. I am not fully familiar with {here}, but the example you show seems to assume that the data is still in a sub-folder of the project or working folder. Will it cover a case where:

Script is located at 'c:\folder_a' (some tweaks needed on a case-by-case basis)
Data is located (in week 1) in 'f:\unit_6'
Data is located (in week 2) in 'f:\some_other_unit'

and then the shared utility functions are in 'f:/shared_scripts'

I realise that with some fundamental abstraction and re-design this workflow could be standardised. But that seems like overkill for a standard weekly report generation tool?

Thanks again.

Ugh, network drives, an unfortunate substitute for an internal HTTP server. I've been away from Windows since the 90s, so I can't suggest anything other than:

  1. a package for scripts, with the tweakables made optional arguments (says the man who doesn't have to do it himself)
  2. a paste0() call to construct the full path to the target folder from the ISO week (see the sketch below)
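
A rough sketch of option 2, assuming a weekly folder naming scheme like f:/unit_<week> (the real convention will differ):

iso_week <- format(Sys.Date(), "%V")              # ISO 8601 week number, see ?strptime
data_dir <- paste0("f:/unit_", iso_week)          # e.g. "f:/unit_23" (hypothetical)
d <- read.csv(file.path(data_dir, "latest.csv"))  # hypothetical file name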

Condolences,

Thanks for the reply and the condolences. Yes, your suggestion (2) is exactly the approach I am taking: a path is built using paste0() with several variables, each defined at the top of the script. Whoever runs the script simply has to edit one (or at most two) of these path variables, and the rest runs smoothly.
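
Roughly along these lines, with the variable and file names invented for illustration:

# user-editable settings, all at the top of the script
unit_folder <- "unit_6"                          # edit each week
data_drive  <- "f:/"
data_path   <- paste0(data_drive, unit_folder)   # e.g. "f:/unit_6"
d <- read.csv(file.path(data_path, "latest.csv"))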

Fully agree that a package taking path arguments in its functions is the way to do this 'properly', but there is a subtle balance to strike between abstraction and operational realities. Small tweaks need to be made to the script on a case-by-case basis (e.g. changing the title of a report graph, or the Y-axis limit of a plot), which would make a package a complex and potentially over-engineered solution.

Thanks again!

I'd say there are two separate problems: whether to use setwd() and whether to use hardcoded paths in the script.

The problem with setwd() is that it introduces state that you can't see while reading the script: the same command will or won't work depending on that state. I feel strongly against it for any non-interactive use (and I'd say {here} and RStudio Projects also make it unnecessary during interactive use).
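
For instance (paths invented), whether the read below succeeds, and which file it reads, depends entirely on whatever setwd() call happened to run earlier:

setwd("f:/unit_6")             # hidden state: changes the working directory
d <- read.csv("latest.csv")    # resolved against that invisible state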

Now for hardcoding: if the script is simple enough and rarely modified, that's probably not a problem. But as soon as things get complex, it seems very important that the "data" is clearly separated from the "code".

My intuition is that putting all the configuration options at the beginning of the file, clearly separated from the code, is good enough for many situations. But if you use the script often enough, it would make sense to use a package like {getopt} to pass these options as command-line arguments, or to put these configuration options in a separate config file and use source() to load it at the beginning of the script (there are even packages for more readable and advanced configuration files).
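
A minimal sketch of the config-file variant, where config.R, helpers.R, and the variable names are all hypothetical (config.R would define only plain variables, such as data_dir <- "f:/unit_6" and util_dir <- "f:/shared_scripts"):

source("config.R")                          # the "data": paths, titles, limits
source(file.path(util_dir, "helpers.R"))    # shared utility functions
d <- read.csv(file.path(data_dir, "latest.csv"))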

Again, you get the advantage of changing the "data" part without ever opening the "code" file, and if you do things right, you can save the config file or calling options as metadata along with the output (to avoid ending up with three outputs that look different but with no record of how they were generated).
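
For example (out_dir and the file names are just placeholders), saving that metadata can be as simple as copying the config file next to the results:

out_dir <- file.path("output", format(Sys.Date(), "%Y-%m-%d"))
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
file.copy("config.R", file.path(out_dir, "config_used.R"), overwrite = TRUE)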

I agree that in the situation you describe, RStudio projects or {here} are not ideal solutions. Though if you wanted to use them without copy-pasting the Project or the data every time, you could also consider symbolic links to the data (but as that would leave no trace of what was called, I don't think it's a good approach either).

Many thanks for your thoughtful remarks, Alexis. The statement you made (quoted above) suggests that I am not alone in thinking that data and code should not always be kept together, and that there may indeed be situations where it is best to separate them cleanly.

I was not aware of the package {getopt}. Will look into it.
Thanks again.

That's just one of several similar packages for turning an R script into a callable command-line tool. Actually, you don't even need a package: you can just use commandArgs() and call the script with Rscript.exe.
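
A minimal sketch of the base-R route (the argument order is an assumption):

# called as:  Rscript.exe report.R f:/unit_6 f:/shared_scripts
args <- commandArgs(trailingOnly = TRUE)    # drop R's own arguments
data_dir <- args[1]                         # weekly data folder
util_dir <- args[2]                         # shared utility scripts
d <- read.csv(file.path(data_dir, "latest.csv"))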

Right, but note that we have two definitions of "data". Here I'm saying the path itself is data, not code, and thus should not be buried within the code but rather stored in a variable defined in a clearly separated place (config file, argument, ...). I'm not making any statement about the actual data files.

For the data files, in a standard Project-based analysis I would try my best to keep them close to the code (in the same Project directory), but in a clearly separated subdirectory. In the example you describe (data in a central repository), I agree it wouldn't be practical to keep them close to the code.

I use:

InputFile <- file.choose()

Thanks! I did not know about that. I still prefer a variable at the top of the script, since the user is running it in any event.
