Data cleaning patterns: chain scripts together with `source()` or read and write csvs

I like splitting up my data cleaning pipeline into multiple files prefixed by numbers, like the tidyverse style guide recommends. It makes it easier to work on each step individually. I see two main methods for chaining these scripts together. Which do you think is better?

Sometimes I chain together files by source()ing the previous file in the chain.

Pros

  • Running the latest script runs all the scripts
  • Only pay the cost of parsing a csv once
  • Preserve all the data types (e.g. factor levels)
  • Flexible when dealing with more than one table

Cons

  • Environment gets out of control (tons of intermediate objects)
  • Can only inspect data in R (some colleagues prefer to open stuff in Excel); this con also applies to using RDS files as intermediates
  • Have to run all the code to get back to where you were
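
For concreteness, here is a minimal sketch of what I mean by the source() chain. The file names (01_import.R, 02_clean.R), the columns, and the use of a temp dir and a separate environment are all made up for illustration; in a real project the scripts would just sit in the project folder.

```r
# Hypothetical two-step pipeline, written to a temp dir so the sketch runs anywhere.
pipeline_dir <- tempfile("pipeline_")
dir.create(pipeline_dir)
step1 <- file.path(pipeline_dir, "01_import.R")
step2 <- file.path(pipeline_dir, "02_clean.R")

writeLines(
  "raw <- data.frame(id = 1:3, grp = c('a', 'b', 'a'))",
  step1
)

writeLines(c(
  # Each script starts by sourcing the one before it.
  sprintf("source(%s, local = TRUE)", deparse(step1)),
  "clean <- raw",
  "clean$grp <- factor(clean$grp, levels = c('a', 'b', 'c'))"
), step2)

# Running the latest script runs the whole chain.
env <- new.env()
source(step2, local = env)

levels(env$clean$grp)  # factor levels survive because nothing leaves R
ls(env)                # objects from every step pile up: "clean" "raw"
```

With only two steps the leftover objects are harmless, but every extra script adds its own intermediates to the same environment.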

Other times I formulate each script to write a single csv out as an intermediate product.

Pros

  • Keeps environment manageable: easy to see which objects you are working with
  • Easy to pick up right after the most recent step and pay the cost of expensive computations only once

Cons

  • Can cause reproducibility issues if files are out of sync
  • Pay the cost of parsing a potentially large csv every time
  • Confusing if dealing with more than one csv in each step
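
A rough sketch of the intermediate-file approach, again with made-up file and column names. It also shows the data type issue I mentioned: a factor level that is declared but unused does not come back from a csv, whereas an RDS file keeps it.

```r
# Hypothetical intermediate files; a temp dir keeps the sketch self-contained.
out_dir <- tempfile("intermediate_")
dir.create(out_dir)
csv_path <- file.path(out_dir, "01_clean.csv")
rds_path <- file.path(out_dir, "01_clean.rds")

# --- 01_clean.R: do the work, then write a single intermediate product ---
clean <- data.frame(
  id  = 1:3,
  grp = factor(c("a", "b", "a"), levels = c("a", "b", "c"))
)
write.csv(clean, csv_path, row.names = FALSE)
saveRDS(clean, rds_path)

# --- 02_next_step.R: start from the file instead of re-running step 1 ---
from_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
from_rds <- readRDS(rds_path)

levels(clean$grp)     # "a" "b" "c"
levels(from_csv$grp)  # "a" "b": the unused level "c" is lost in the csv
levels(from_rds$grp)  # "a" "b" "c"
```

The csv can be opened in Excel by anyone, while the RDS file keeps the types but can only be read from R.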

These are usually one-off pipelines each with their own custom solutions, so creating a truly reusable pipeline for data cleaning isn't really an option here.

Right now, I lean toward the source() method. What are your thoughts?

The pattern I recommend is the following:

Write multiple scripts that contain only function definitions. Do this even if the functions are specialized (have column names hard-coded in and such). Then source all of these scripts and call these large functions one at a time in a single "over script."

The principle is: each script should either define functions or perform steps. No script should be a mixture. Scripts that define functions are safe to source (they have no side effects other than loading in functions). Scripts that do work live only at the top level, so they are easy to find and manage.
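
A minimal sketch of that split, assuming one function file and one over script. The function names (drop_missing_ids, add_total), the columns, and the temp file are hypothetical and only there so the example runs on its own.

```r
# Hypothetical function-only script; normally this would be a file in the project.
fun_file <- tempfile("clean_functions_", fileext = ".R")

writeLines(c(
  "drop_missing_ids <- function(df) df[!is.na(df$id), , drop = FALSE]",
  "add_total <- function(df) {",
  "  df$total <- df$x + df$y  # column names hard-coded on purpose",
  "  df",
  "}"
), fun_file)

# --- the over script: source definitions, then call them one at a time ---
source(fun_file, local = TRUE)  # only defines functions, does no work

raw <- data.frame(id = c(1, NA, 3), x = c(10, 20, 30), y = c(1, 2, 3))
no_missing <- drop_missing_ids(raw)
with_total <- add_total(no_missing)

with_total$total  # 11 33
```

Because sourcing the function file does no work, it can be re-sourced at any point without touching the data, and every step that changes the data is visible in one place.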
