When you say "make some changes to dataframe in source", do you mean changing the script that imports/preprocesses your data? And when you say "save changes to data frame", do you mean running that script to create a new data frame object reflecting the changes? (FWIW, that's a step you're never going to be able to skip)
Aside: the data frame isn't "real"...
Something it might be a good idea to wrap your head around (discussed in the Project-Oriented Workflow article that @jrlewi linked) is the idea that "source is real": you should think of the objects that your code creates as ephemeral, and only the instructions for creating them (= the source code) as the durable, real artifacts of your work.
Thinking about "saving changes" to a data frame object somewhat runs against this principle. If source is real, then you can and should (early and often) clear out all the objects in your workspace, secure in the knowledge that it is trivial for your scripts to recreate them. You aren't creating a precious data frame; you're creating a precious set of instructions, which can generate any number of disposable data frames that will be exactly identical every time (unless you change the instructions!).
Needless to say, this is really different from how most people are used to thinking about computer software, so it takes some getting used to!
The simple answer for why it can't work like the second example is that (as you know!) RMarkdown/knitr just doesn't work this way. Knitting happens in a new, independent session, so Rmd files have to be self-contained. This is good for reproducibility because it means rendering doesn't depend on somebody taking the right series of steps "by hand" to set up objects in the environment ahead of time.
However, "self-contained" doesn't mean that you have to copy your data import script into every Rmd you make. In fact, that's a bad idea, because multiple copies inevitably lead to diverging changes. There are a lot of other options (mostly already mentioned in this thread):
- You can use source() in the setup code chunk to run the preprocessing script.
- You can make the last line of the preprocessing script a call to saveRDS(), to save the data frame object the script created to an RDS file. Then your Rmd setup chunk just loads the data frame from that file (with readRDS()). If you change the preprocessing script, you'll have to remember to re-run it so that the RDS file gets updated. People usually do this as a convenience in cases where running all the import/preprocessing steps is slow.
- The preprocessing steps can live as a separate Rmd document that is included as a child document in your analysis Rmd.
- You can make a data package (sounds daunting, but not that hard!) and load it in your Rmd. Probably best for when your preprocessing script has stabilized.
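To make the source() option concrete, a setup chunk might look like this (the file name preprocess.R is a placeholder for whatever your script is actually called):

````markdown
```{r setup, include=FALSE}
# Run the whole preprocessing script in this document's session,
# so every object it creates is available to later chunks
source("preprocess.R")
```
````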
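The saveRDS()/readRDS() round trip, sketched with placeholder file and object names:

```r
## At the end of preprocess.R ----------------------------------
# (stand-in for your real import/preprocessing steps)
mydata <- data.frame(id = 1:3, value = c(10, 20, 30))

# Persist the finished data frame to disk
saveRDS(mydata, "mydata.rds")

## In the Rmd setup chunk --------------------------------------
# Reload the saved object instead of re-running the slow steps
mydata <- readRDS("mydata.rds")
```

Remember the caveat above: readRDS() gives you whatever was last saved, so the RDS file silently goes stale if you edit preprocess.R and forget to re-run it.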
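And the child-document option is just a chunk option in the parent Rmd (preprocessing.Rmd is a placeholder name):

````markdown
```{r child = "preprocessing.Rmd"}
```
````

knitr runs the child's chunks in the parent document's session, so the objects the preprocessing chunks create are available in the chunks that follow.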
There are some other variations on these themes, too. I don't think there's a single best practice for every situation.