Data/Folder Organization for a Project

I'm in the process of creating a project in R. I've done projects in the past (and I've read some of the best practices for project/folder organization, i.e. using here and usethis), but I always find myself trying to remember how I did something or where data came from. It's time to improve on this.

I have a pretty basic question. A lot of the data I use gets scraped via API. I'd like to store the raw, untouched data in addition to the modified data derived from it.

I want to do the following:

  1. Save Raw Data via Scrape
  2. Save "Tidy Dataset"
  3. Document both the Raw Data Scrape and the Tidy Data

After reading Hadley's book on packages, I think I understand the process for saving the modified data. I was planning on documenting the data in an .R script in the data-raw folder that I created.
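
For reference, here's roughly the pattern I took away from the book, as a minimal sketch - the dataset name, columns, and tidying steps are just placeholders:

```r
# data-raw/player_stats.R  (dataset and column names are placeholders)
library(dplyr)

# Raw scrape, stored untouched alongside this script
raw <- read.csv("data-raw/player_stats_raw.csv")

# Tidy it up
player_stats <- raw %>%
  rename(player = Player, points = PTS) %>%
  filter(!is.na(points))

# Saves data/player_stats.rda for use in the package
usethis::use_data(player_stats, overwrite = TRUE)
```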

A few questions:

  1. Should I save the raw scraped data in data-raw or inst/extdata?
  2. Do I document both the data scrape and the modified data in the same .R files within data-raw?
  3. Can I create one master .R file that documents all of my datasets (I have over 10 scrapes that I do), or do I need one file per dataset?

I believe it really is up to you; there are few truly hard rules (apart from documenting your process, which should be non-negotiable).

If your data comes from ephemeral sources - such as scraped APIs - it definitely makes sense to store the raw data somewhere, in addition to a working copy of the tidied data.

It is common to see the structure of your working "tidy" dataset evolve as your project proceeds. It is usually easy to recreate this working copy from the primary data, but it might not be practical to do the scraping again (and even if it were possible, you would not be guaranteed the original results).

My personal workflow is to:

  • keep the primary data in /data-raw and treat it as write-once / read-only afterwards
  • keep a working copy of the data in /data and treat it as ephemeral; it need not be version tracked, and it can be recreated easily
  • maintain separate scripts for producing the raw data (often with scheduled execution) and for creating the working dataset. I keep these in a /code folder; others use /R, and both are IMHO OK.

I am sure that others will have different workflows - say, storing the data in a database instead of a file-system folder - but the pattern is a general one.
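
To make the pattern concrete, here is a minimal sketch of the two scripts - the API endpoint and field names are invented:

```r
# code/01_scrape.R -- writes once into /data-raw, never edited afterwards
resp <- httr::GET("https://api.example.com/v1/results")  # invented endpoint
writeLines(
  httr::content(resp, as = "text", encoding = "UTF-8"),
  file.path("data-raw", paste0("results_", Sys.Date(), ".json"))
)

# code/02_tidy.R -- rebuilds the ephemeral working copy in /data
files <- list.files("data-raw", pattern = "^results_.*\\.json$", full.names = TRUE)
raw <- jsonlite::fromJSON(tail(sort(files), 1))        # most recent scrape
tidy <- raw[!is.na(raw$score), c("player", "score")]   # invented fields
saveRDS(tidy, "data/results.rds")
```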


Files in inst/extdata will be included in the package when it is installed. That folder is for package data in file formats not allowed in data/. For example, it can hold an HTML template to use with a custom R Markdown format.
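
At run time, installed files like that are located with system.file(); for example (the package and file names here are placeholders):

```r
# Path to inst/extdata/template.html after the package is installed
path <- system.file("extdata", "template.html", package = "mypackage")
template <- readLines(path)
```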

With packages, I've taken up the practice of dividing data files into four groups:

  1. data/: Prepared datasets for the package.
  2. inst/[something]: Non-standard data files. Basically inst/extdata, but I prefer using descriptive directory names like css or stata.
  3. data-raw/: Plain-text files of the data, used to create the files in #1. These can be edited in any way a developer chooses, even by hand. They are version-controlled, because changes here mean the package's contents are different.
  4. external/[something]: "Unreliable" data files and scripts for processing them. If I scraped data from an API or used a SAS program to reverse-engineer a file format, it goes here. These are tools used to build the files in data-raw/, and they change mostly because other people make changes to what they offer. Most of the time, the data files are not version controlled, and the scripts describe how to get them.

You can look at my naaccr package for an example. I use Excel, PDF, and other messy files from sources that can easily change in the future, but I like to keep curated files in data-raw/. There are actually way more files on my computer in the external/ directory.
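
As a toy version of one of those external/ scripts - the URL, sheet layout, and column names are all invented:

```r
# external/fetch_codes.R -- turn a messy upstream file into a curated
# data-raw/ text file (source URL and columns are invented)
download.file("https://example.org/codes.xlsx",
              "external/codes.xlsx", mode = "wb")
messy <- readxl::read_excel("external/codes.xlsx", skip = 3)  # skip decorative rows
curated <- messy[!is.na(messy$code), c("code", "label")]
write.csv(curated, "data-raw/codes.csv", row.names = FALSE)
```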

It works for me, especially because I think curated, language-agnostic data files are among the most valuable things to come out of any open-source project.


That was helpful. Thanks!

In my case, I organize my projects by year and basically run the same analysis each year. Each year has its own project.

Because of that, I have data that I created in last year's project that gets used in this year's.

A few questions:

  1. How do I document the data tidying that took place in those files carried over from previous years? Do I simply add comments above where the .rds is read in, stating the source of the data and where the documentation resides?
  2. Would I keep this data in data/?
  3. Does all data from third parties go into external/ in your case? And do the text files in data-raw/ originate from a specific place?

  1. For documenting past data, the best solution is whatever you and any partners will actually use. It could be comments in the script that reads in the old data (sketched after this list), or a separate file with notes written in markdown, named previous-data-doc.md. It's just for humans, so put it where they'll find it and in a format they'll read. I've learned the hard way: finding documentation needs to be easier than writing an email. Otherwise, you'll just get emails.

  2. If you're making a package, don't put anything but data files in data/. Otherwise, it's your call. I'd put it in data-raw/, because I consider the scripts and files used to create the final data as documentation.

  3. Pretty much. Like I said, I wanted every data file in data-raw/ to be plain-text, tabular, and to include details that make the files easy to work with. Of course, data from other people rarely comes this way. In my case, they're PDFs, colorfully formatted Excel workbooks, massive XML collections, and some SAS programs. The scripts I used to transform them into the data-raw/ text files include comments about where to find the data.
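
The sketch mentioned in #1 - the script that pulls in last year's data just carries its provenance in comments (the paths and file names here are hypothetical):

```r
# Carried over from the 2022 project (hypothetical paths).
# Created by 2022-project/code/02_tidy.R; see that project's
# previous-data-doc.md for how the raw scrape was tidied.
results_2022 <- readRDS("data/results_2022.rds")
```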

