Workflow/directory structure for loosely connected projects

kbzsl · January 25, 2020, 11:43am

I have multiple projects, and a couple of them share some of their data. This "shared data" can be a large amount of raw data collected from various sources and/or some intermediate versions from the data munging steps. Having different projects means that the code for some of the data collection and/or munging steps is repeated across projects (with some minor differences).

I am planning to restructure all these projects, but I won't create a single big Rproj for all of them, because their scope and/or target audience can be very different. On the other hand, I would like to reduce the data collection and processing steps (time- and resource-intensive tasks) to a limited number of projects. This is my dilemma: how do I set up the folder structure (and thus the workflow) to be able to share (= reuse) some data while still relying on RStudio projects and the here package?

Searching through forums, it looks like nested projects are not a good idea. What would you recommend for restructuring my projects? Thank you.

kbzsl · January 25, 2020, 1:15pm

I think I found a compromise in the meantime. If I define a new folder on the same level as the other R project folders, I can reference it from all the projects in the following way:

fs::dir_ls(here::here("../shared_data"))

The solution would be a directory structure like this:

 ... +- R_projects
	+- project_1
	+- project_2
	+- ...
	+- shared_data

What's your opinion? Does this approach contradict the recommendations (and/or common sense)? Do you have a better idea or approach? Thank you.
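
The sibling-folder layout above can be wrapped in a small helper so that each project refers to the shared folder in one place. This is only a sketch: `shared_path()` is a hypothetical name, and the `root` argument is an assumption added here so the function can also be tried outside an RStudio project. By default it falls back to `here::here()`. The demo builds a throwaway copy of the layout in a temporary directory.

```r
# Hypothetical helper: build a path inside the sibling "shared_data" folder.
# `root` defaults to the project root reported by here::here(); the default
# is only evaluated when `root` is not supplied.
shared_path <- function(..., root = here::here()) {
  file.path(dirname(normalizePath(root, mustWork = TRUE)), "shared_data", ...)
}

# Demo in a temporary directory that mimics the layout above.
base <- file.path(tempdir(), "R_projects")
dir.create(file.path(base, "project_1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(base, "shared_data"), showWarnings = FALSE)
writeLines("id,value\n1,10", file.path(base, "shared_data", "raw.csv"))

# From "project_1", resolve a file that lives in the shared folder.
raw_file <- shared_path("raw.csv", root = file.path(base, "project_1"))
raw <- read.csv(raw_file)
```

Inside a real project the call would simply be `shared_path("raw.csv")`, assuming the here package is installed and the project root is detected as expected.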

jcolomb · January 28, 2020, 8:11am

Hello.
This is a data management issue. Because data and code have different lifetimes and life cycles, I would encourage you to separate the data from the code (i.e. have one R project for the code and a different one for the data). The path to the data (and other variables) would then be given at the start of the code (maybe in an external file that you can source()). One can then keep a very small piece of code with the data, which just gives the path to the data with here() and calls the analysis code. Finally, one can turn the code into a package and keep only a very limited amount of code with the data.
This would make reusing the code for new data easier, as well as publishing the data. Which solution is best depends on the size and time span of the project(s).

In some cases, one can even download or access the data online or on an intranet, so that the code accesses data which is ready to be published...
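
One way to read the suggestion above is to keep the data location in a small settings file that the analysis code sources before doing anything else. The file name `config.R`, the variable `data_dir`, and the function `run_analysis()` are hypothetical names chosen for this sketch, not something from the thread. The demo writes everything to a temporary directory so it can be run as-is.

```r
# Stand-ins for a separate data project and code project (temporary folders).
data_project <- file.path(tempdir(), "data_project")
code_project <- file.path(tempdir(), "code_project")
dir.create(data_project, showWarnings = FALSE)
dir.create(code_project, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(data_project, "measurements.csv"),
          row.names = FALSE)

# config.R: the only place where the data location is written down.
config_file <- file.path(code_project, "config.R")
writeLines(sprintf("data_dir <- %s", deparse(data_project)), config_file)

# Analysis code: reads whatever folder it is given and knows nothing
# about the overall directory layout.
run_analysis <- function(data_dir) {
  d <- read.csv(file.path(data_dir, "measurements.csv"))
  sum(d$x)
}

# Start of a script: load the settings, then call the analysis.
settings <- new.env()
source(config_file, local = settings)
result <- run_analysis(settings$data_dir)
```

Pointing the same analysis at new data then only means editing `config.R`. If the analysis code later becomes a package, `run_analysis()` would be the part that moves into it.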

biopaw · January 31, 2020, 9:56pm

I do agree with the type of suggestion jcolomb is proposing. I have a lot of projects and a lot of common or reference data, plus common code that is either in an .R script that I source or in one of two R packages I maintain for myself with code that I like to reuse.

Under Projects I have a group of projects that I maintain through GitHub (which is most of the time); a group of local projects that also include 'data projects', which hold the common data and the wee bit of code to corral the data from source and make it usable; and one folder 'R', for a utils.R where I save functions that I write. If I use them often enough, they end up in my personal utils library, which is itself a GitHub project.

I do it like this:

/Projects
--> GitHub
--> -->
--> -->
--> --> ...
--> LocalProjects
--> --> /P1_
--> --> /P2_
--> --> /P3_
--> --> ...
--> --> /D1_
--> --> /D2_
--> --> ...
--> --> R

Then basically each project (P's, D's, and GitHub projects) is structured like this:

/P1_MyExampleProject
--> data
--> figures
--> R
--> reports
--> presentations
--> save
--> scriptsDraft
--> scriptsFinal

and each script, in its setup chunk, sets the working directory to the root of the project (P1_MyExampleProject).

Peter
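
The per-project layout above could be created with a few lines of base R. This is a sketch only: `create_project_skeleton()` is a hypothetical helper, the folder names are copied from the listing above, and the demo uses a temporary directory instead of a real Projects folder.

```r
# Folder names taken from the per-project layout above.
project_dirs <- c("data", "figures", "R", "reports", "presentations",
                  "save", "scriptsDraft", "scriptsFinal")

# Hypothetical helper: create the standard sub-folders under `root`.
create_project_skeleton <- function(root, dirs = project_dirs) {
  for (d in file.path(root, dirs)) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(root)
}

# Demo in a temporary directory.
project_root <- file.path(tempdir(), "P1_MyExampleProject")
create_project_skeleton(project_root)
created <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
```

Running the helper on an existing project is harmless, since `dir.create()` leaves folders that already exist untouched.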

leungi · February 2, 2020, 12:36am

Have you considered using pins?

biopaw · February 2, 2020, 3:48am

This is the first time I have seen it. I will consider it.

P

kbzsl · February 10, 2020, 7:38am

The pins package is an interesting alternative.
Is there any possibility to specify the file format for a local board? According to pin_info, the csv and rds formats are used.
I recently switched from storing the temporary files in rds format to parquet, because sometimes I have to share them.

kbzsl · February 11, 2020, 8:48am

I found the articles about Extending Boards/Pins.

leungi · February 15, 2020, 5:45pm

Is the parquet format needed because you're sharing with Python users?

If so, you may want to read this.
