File structure for data pipeline

I have structured many of my analytics projects as R packages for reproducibility, but some of my projects run on a weekly basis and create new data each time. Are there any best practices for the file structure when building data pipelines in R?

Here is a simplified file structure that I have come up with.

.
├── R # folder containing R functions
│   └── utils.R
├── data # folder for .Rda files that I don't actually use
├── data-raw # scripts for processing data 
│   ├── etl.R
│   └── create_report.R
└── inst
    └── processed # cleaned data and generated report
        ├── todays_data.csv
        └── todays_report.html

And my questions are:

  1. Where should ETL scripts live in an R package? Right now I put them in /data-raw.
  2. How should I orchestrate the execution of multiple scripts that live in the R package? For example, I have /data-raw/etl.R and /data-raw/create_report.R. Right now I either call source() on them or open each one and run everything.
  3. Where should I export the cleaned data that is generated each week? Currently it is exported to inst/processed/.

Thank you!

Fortunately, life in R hasn't settled down into warring camps over the one-true-way to handle this type of task. There are some frameworks and a lot of different opinions as to their relative merits. I can tell you what I do and why.

  1. Everything lives in the file system under a directory named projects. I begin by creating a GitHub repo with a README file and cloning it into the projects directory. I then create an RStudio project from the cloned directory.

  2. All R code lives under an R subfolder.

  3. The data subfolder contains incoming data without any preprocessing. If it was hard to come by, it has read-only permissions.

  4. If I have shell scripts, or compiled programs, they go in code.

  5. All processed data goes into obj (for object), usually in Rds form.

  6. All documents except README go into doc. Readme is used as a notepad, lab log, whatever.

Everything gets pushed to GitHub at the end of each session.
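A minimal sketch of that layout in base R, for anyone who wants to script it. The helper names `make_skeleton()` and `protect_raw()` are hypothetical, the file name `README.md` is an assumption, and the demo builds the tree in a temporary directory rather than a real clone. `Sys.chmod()` has only limited effect on Windows.

```r
# Hypothetical helper: create the subfolders described in the list above.
make_skeleton <- function(root) {
  subdirs <- c("R", "data", "code", "obj", "doc")
  for (s in subdirs) {
    dir.create(file.path(root, s), recursive = TRUE, showWarnings = FALSE)
  }
  # README.md is an assumed name; a cloned repo would normally already have one
  readme <- file.path(root, "README.md")
  if (!file.exists(readme)) file.create(readme)
  invisible(file.path(root, subdirs))
}

# Hypothetical helper: make a hard-to-get raw data file read-only (mode 0444).
protect_raw <- function(path) {
  Sys.chmod(path, mode = "0444", use_umask = FALSE)
}

root <- file.path(tempdir(), "demo-project")  # throwaway location for the demo
make_skeleton(root)
list.dirs(root, full.names = FALSE, recursive = FALSE)
```

In a real project, `root` would be the cloned repository, and `protect_raw()` would be called once on each file placed in data.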

For coding, I keep two scripts: libr.R, which loads all of the libraries I anticipate using, and func.R, which holds my functions so I don't have to hunt for them. Other scripts begin with

source(here::here("R/libr.R"))
source(here::here("R/func.R"))
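For the weekly case in the question, one possible way to chain scripts and write date-stamped output into obj is sketched below. This is an illustration under stated assumptions, not an established convention: `run_pipeline()` and `save_obj()` are hypothetical helpers, the file-name pattern is made up, and `root` is passed explicitly so the example runs without the here package (in a real project it could be `here::here()`).

```r
# Hypothetical runner: source scripts from R/ in order; an error in any
# script stops the run, because source() propagates errors.
run_pipeline <- function(root, scripts) {
  for (s in scripts) {
    path <- file.path(root, "R", s)
    message("Running ", path)
    # Each script gets a fresh environment, so scripts share results
    # through files in obj/ rather than through global variables
    source(path, local = new.env())
  }
  invisible(TRUE)
}

# Hypothetical helper: save an object under obj/ as <name>_<YYYY-MM-DD>.Rds
save_obj <- function(x, root, name, date = Sys.Date()) {
  file <- sprintf("%s_%s.Rds", name, format(date, "%Y-%m-%d"))
  path <- file.path(root, "obj", file)
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(x, path)
  invisible(path)
}

# Demo in a temporary directory with a one-line stand-in for etl.R
root <- file.path(tempdir(), "pipeline-demo")
dir.create(file.path(root, "R"), recursive = TRUE, showWarnings = FALSE)
writeLines("d <- data.frame(x = 1:3)", file.path(root, "R", "etl.R"))
run_pipeline(root, "etl.R")
p <- save_obj(data.frame(x = 1:3), root, "weekly", as.Date("2024-01-01"))
```

Stamping the date into the file name keeps each week's output side by side instead of overwriting the previous run.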

This approach occupies the middle ground between a one-off and the scale of effort that calls for a package or requires close collaboration. I have a set of naming conventions that I use with it to facilitate code reuse.

This is all working for me for now. At some point it will stop working, and I'll adopt or invent something else. The main thing is to cut down on what has to be thought over again and again.

The naming conventions I mentioned:

| symbol |           default            |
|:------:|:----------------------------:|
|  137   |       a seed argument        |
|   a    |      a temporary object      |
|   b    |      a temporary object      |
|   c    | prohibited symbol – not used |
|   d    |    a dataframe or tibble     |
|  dte   |         a dataframe          |
|   e    |           anything           |
|   f    |          a function          |
|   g    |      an inner function       |
|   h    |      an inner function       |
|   hs   |            a hash            |
|   i    |         an iterator          |
|   j    |         an iterator          |
|   k    |         an iterator          |
|   l    |            a list            |
|   m    |           a matrix           |
|   n    |           a count            |
|   o    |       an output object       |
|   q    |           a queue            |
|   r    |            a rep             |
|   s    |           a sample           |
|   sq   |          a sequence          |
|   ss   |           a subset           |
|   t    | prohibited symbol – not used |
|  tsib  |          a tsibble           |
|  tib   |           a tibble           |
|   u    | an unknown value placeholder |
|   v    |           a vector           |
|   w    |         an argument          |
|   x    |         an argument          |
|   y    |         an argument          |
|   z    |         an argument          |
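A short illustration of how a few of these symbols might look in practice. This is only a sketch; the data and the function body are invented for the example.

```r
set.seed(137)                                 # 137: the default seed
n <- 10                                       # n: a count
v <- rnorm(n)                                 # v: a vector
d <- data.frame(id = seq_len(n), value = v)   # d: a data frame
f <- function(x) {                            # f: a function; x: an argument
  g <- function(y) y^2                        # g: an inner function; y: an argument
  sum(g(x))
}
o <- f(d$value)                               # o: an output object
```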
