What to keep in source control


I use Team Foundation Server for source control at work. When I set it up, I didn’t take much care beyond pointing it at a folder and telling it to monitor anything that changed there. Close to three years later, I am getting the error “The database is full. Contact your Team Foundation Server administrator,” and can no longer check anything in.

That’s obviously a big problem for me in the short term, but for this thread, I’d like to focus on the longer-term question. When developing packages, or even analyses, what files need to be kept in source control? For instance, is there any value in tracking .Rd files if I’m using roxygen? Do I need to save the .pdf or .tex files produced when generating documents from markdown?

Image files, such as .png, are an interesting dilemma. Often they get inserted into a document and I don’t really need full source control of their changes. Other times, they are static (not generated by the code), so I do need them. Is it better to capture them all, or to exclude them by default and make exceptions for those that I need?


You may want to keep in mind two ideas:

  1. The repository should be self-contained. In other words, it should allow the experiments it contains to be reproduced.
  2. Do not track files that are generated from other files (e.g. PDFs generated from
    LaTeX, or .tex files generated from Markdown).

For the first one: if your repository has a document that contains images, it should contain every source item (image, source code) needed to render (build) the document. Besides keeping track of changes, think of the repository as a backup. If you lose the images, you cannot reproduce the same output as before. If you update the images and do not track them, you cannot reproduce an earlier version of the document, unless you include a mechanism to regenerate the images.

For the second, do not add files that are automatically generated by tools, or that you can generate by running your own script that lives in the repository. If generating those files involves several steps, add a script to the repository that automates them.
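One way to combine the two ideas is to exclude generated file types by default and list explicit exceptions for static files that no script can regenerate. The sketch below only illustrates that policy; the patterns, the `figures/static/` folder, and the `should_track` helper are all hypothetical, and in practice the same rules would go into your source control system's ignore file rather than into Python.

```python
from fnmatch import fnmatch

# Hypothetical policy: file types that are normally generated by a build step.
IGNORE_PATTERNS = ["*.pdf", "*.tex", "*.png"]

# Hypothetical exceptions: static files that cannot be regenerated from code.
KEEP_EXCEPTIONS = ["figures/static/*.png"]


def should_track(path: str) -> bool:
    """Return True if the file belongs in source control under this policy."""
    # Exceptions win over the default ignore rules.
    if any(fnmatch(path, pattern) for pattern in KEEP_EXCEPTIONS):
        return True
    # Everything else is tracked unless it matches an ignore pattern.
    return not any(fnmatch(path, pattern) for pattern in IGNORE_PATTERNS)


if __name__ == "__main__":
    for path in ["report.Rmd", "report.pdf",
                 "figures/plot1.png", "figures/static/logo.png"]:
        print(path, should_track(path))
```

Hand-written .tex files would need their own exception under this policy, since the `*.tex` rule assumes they are generated from Markdown.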

In summary, the question you have to ask yourself is: if you retrieve a pristine copy from the repository, can you obtain the output you are expecting?
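That question can be turned into a rough check. The sketch below assumes you can write down, for each output, the inputs needed to build it, and that you have the set of paths currently tracked in the repository. The `missing_inputs` function and all file names are made up for illustration.

```python
from __future__ import annotations


def missing_inputs(build_inputs: dict[str, list[str]],
                   tracked: set[str]) -> dict[str, list[str]]:
    """For each output, list the required inputs that are not tracked."""
    missing = {}
    for output, inputs in build_inputs.items():
        absent = [src for src in inputs if src not in tracked]
        if absent:
            missing[output] = absent
    return missing


if __name__ == "__main__":
    # Hypothetical repository contents and build dependencies.
    tracked = {"report.Rmd", "R/plots.R", "data/raw.csv"}
    build_inputs = {
        "report.pdf": ["report.Rmd", "R/plots.R", "figures/static/logo.png"],
    }
    # The logo is needed to build the report but is not tracked.
    print(missing_inputs(build_inputs, tracked))
```

An empty result means a pristine checkout has everything listed as an input. It does not prove the build succeeds, only that no declared input is missing.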


I put my views on this in the “Which files to commit” section of this article:

Short version: I think a lot of traditions from software development do not serve us well as data analysts who use source control, specifically the taboo against committing downstream products. We have a lot of downstream products that are immediately consumable and useful to a wide audience, and it doesn’t make sense to force people to regenerate them. Also, diffs in derived products can help you catch errors and unexpected consequences of new data, package updates, etc. GitHub has lovely diffs for PNGs, for example, which is great for seeing what changed about a figure.

As for packages specifically: yes, it’s typical to track .Rd files created via roxygen2, although it makes some people feel queasy.