I like splitting up my data cleaning pipeline into multiple files prefixed by numbers, like the tidyverse style guide recommends. It makes it easier to work on each step individually. I see two main methods for chaining these scripts together. Which do you think is better?
Sometimes I chain together files by source()ing the previous file in the chain.
Pros
- Running the latest script runs all the scripts
- Only pay the cost of parsing a csv once
- Preserve all the data types (e.g. factor levels)
- Flexible when dealing with more than one table
Cons
- Environment gets out of control (tons of intermediate objects)
- Can only inspect data in R (some colleagues prefer to open things in Excel); this con also applies to using RDS files as intermediates
- Have to run all the code to get back to where you were
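For concreteness, here is a minimal sketch of the source() chain. The file names (01_load.R, 02_clean.R), the toy data, and the pipeline_env object are all made up for illustration. It also shows one possible way to soften the environment con, assuming you are fine with evaluating the chain in a dedicated environment via the local argument of source():

```r
# Hypothetical two-step pipeline written to a temp directory so the sketch runs anywhere.
pipeline_dir <- file.path(tempdir(), "pipeline")
dir.create(pipeline_dir, showWarnings = FALSE)

# Step 1: stand-in for parsing the raw csv.
writeLines(
  "raw <- data.frame(id = 1:3, grp = c('a', 'b', 'a'))",
  file.path(pipeline_dir, "01_load.R")
)

# Step 2: sources the previous step, then cleans.
# local = TRUE keeps the objects in whatever environment step 2 itself runs in.
writeLines(c(
  "source('01_load.R', local = TRUE)",
  "clean <- transform(raw, grp = factor(grp, levels = c('a', 'b', 'c')))"
), file.path(pipeline_dir, "02_clean.R"))

# Run the latest script; it pulls in everything before it.
# chdir = TRUE makes the relative path inside 02_clean.R resolve correctly.
pipeline_env <- new.env()
source(file.path(pipeline_dir, "02_clean.R"), local = pipeline_env, chdir = TRUE)

ls(pipeline_env)                 # intermediates live here, not in the global env
levels(pipeline_env$clean$grp)   # factor levels survive, including the unused one
```

This does not fix the "have to run all the code" con; every run still re-executes each earlier step.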
Other times I formulate each script to write a single csv out as an intermediate product.
Pros
- Keeps environment manageable: easy to see which objects you are working with
- Easy to pick up right after the most recent step without paying the cost of expensive computations again
Cons
- Can cause reproducibility issues if files are out of sync
- Pay the cost of parsing a potentially large csv every time
- Confusing if dealing with more than one csv in each step
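A rough sketch of the csv-intermediate version, again with made-up names (01_clean.csv, grp_levels) and toy data. Since a csv does not store factor levels, this assumes the next script re-declares the column types when it reads the file back in:

```r
# Hypothetical path for the intermediate product of step 1.
step1_path <- file.path(tempdir(), "01_clean.csv")

# End of step 1: write the cleaned table out.
clean <- data.frame(
  id  = 1:3,
  grp = factor(c("a", "b", "a"), levels = c("a", "b", "c"))
)
write.csv(clean, step1_path, row.names = FALSE)

# Start of step 2: read it back and restore the types by hand.
grp_levels <- c("a", "b", "c")  # has to be kept in sync with step 1
step1 <- read.csv(step1_path, colClasses = c(id = "integer", grp = "character"))
step1$grp <- factor(step1$grp, levels = grp_levels)

str(step1)
```

The csv stays readable in Excel, but the level definitions now live in two places, which is one way the out-of-sync problem can creep in.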
These are usually one-off pipelines each with their own custom solutions, so creating a truly reusable pipeline for data cleaning isn't really an option here.
Right now, I lean toward the source() method. What are your thoughts?