Method to track changes in output files without adding them to git

I usually use git to track changes to my projects, but I tend to .gitignore output figures and data files, since they can make the repo very large.
However, it would still be useful to know whether the output figures (or files) have changed, and I was wondering about the best way to do this. I was thinking of generating and saving a hash for each output, something like the code below.
Does this seem like a good approach?

library(digest) # hash functions
library(purrr)
library(ggplot2)

# make a thing
set.seed(1)
thing <- data.frame(x = runif(10), y= runif(10)) %>%
  ggplot() +
  geom_point(aes(x = x, y = y))

# hash the thing
hash <- purrr::map_chr(thing, digest, algo="xxhash32")

# check if the thing has changed from last time
thingfile <- "thing.png"
hashfile <- "thing_hash.rds"
if (!file.exists(hashfile)){
  ggsave(thingfile, thing)
  saveRDS(hash, hashfile)
} else {
  existing <- readRDS(hashfile)
  if (isTRUE(all.equal(hash, existing))){
    print(paste(thingfile, "unchanged"))
  } else {
    print(paste(thingfile, "changed!"))
    ggsave(thingfile, thing)
    saveRDS(hash, hashfile)
  }
}

Here is a modified ggsave that uses this approach.

library(ragg)
library(purrr)
library(digest)
library(ggplot2)

myggsave <- function(fname, height = 210, width = 297, units = "mm", forceplot = TRUE, ...){
  hash <- map_chr(last_plot(), digest, algo="xxhash32")
  hashfile <- paste0(fname, ".hash.rds")
  if (!file.exists(hashfile)){
    print(paste("Creating", fname))
    ggsave(fname, device = agg_png, height = height, width = width, units = units, ...)
    saveRDS(hash, hashfile)
  } else {
    existing <- readRDS(hashfile)
    if (isTRUE(all.equal(hash, existing))){
      print(paste("No change to", fname))
    } else {
      if (forceplot){
        print(paste("Overwriting", fname))
        ggsave(fname, device = agg_png, height = height, width = width, units = units, ...)
        saveRDS(hash, hashfile)
      } else {
        print(paste("Keeping old version of", fname))
      }
    }
  }
}

Unfortunately, this GitHub issue suggests that this approach won't work very well.

A ggplot object holds a reference to an R environment, because some of the objects it uses are only resolved at print time. The reference is stored in the plot_env element, but it captures the whole environment, not just the parts the plot needs. So a change to the environment can change the hash even when the plot itself is unchanged.
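To see the effect in isolation, here is a minimal sketch. It assumes the plot is built inside a function, so that plot_env is that function's local environment; `make_plot()` and its `unrelated` argument are hypothetical names used only for illustration.

```r
library(digest)
library(ggplot2)

# Hypothetical helper: builds the same plot every time.
# `unrelated` is never used by the plot, but it lives in the
# function's environment, which ggplot() records as plot_env.
make_plot <- function(unrelated) {
  d <- data.frame(x = 1:3, y = 1:3)
  ggplot(d) + geom_point(aes(x = x, y = y))
}

p1 <- make_plot(unrelated = "a")
p2 <- make_plot(unrelated = "b")

# Compare the hashes of the data and of the captured environment
same_data <- digest(p1$data, algo = "xxhash32") ==
  digest(p2$data, algo = "xxhash32")
same_env <- digest(p1$plot_env, algo = "xxhash32") ==
  digest(p2$plot_env, algo = "xxhash32")

print(c(same_data = same_data, same_env = same_env))
```

The two plots would draw identically, yet only the data hashes should agree; the environment hashes differ because of a variable the plot never touches.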

You can blank out the plot_env entry by adding

hash["plot_env"] <- ""  # hash is a named character vector, so index with [ ] rather than $

but this could mask changes in the environment that do affect the plot.
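For data files, where there is no plot object to hash, a similar check could be run on the written file itself. The following is only a sketch in base R: `file_changed()` and the `.md5` sidecar file are hypothetical names, not part of any package, and the check runs after the file has been written, so it reports a change rather than preventing an overwrite.

```r
# Hypothetical helper: report whether a file's contents differ from the
# checksum recorded on the previous call, and update the record if so.
file_changed <- function(path, hashfile = paste0(path, ".md5")) {
  new_hash <- unname(tools::md5sum(path))  # checksum of the file on disk
  old_hash <- if (file.exists(hashfile)) {
    readLines(hashfile, n = 1)
  } else {
    NA_character_  # nothing recorded yet
  }
  changed <- !identical(new_hash, old_hash)
  if (changed) writeLines(new_hash, hashfile)
  changed
}

# Usage with a throwaway data file
out <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), out, row.names = FALSE)
first <- file_changed(out)   # no checksum recorded yet
second <- file_changed(out)  # same contents as last time

write.csv(data.frame(x = 4:6), out, row.names = FALSE)
third <- file_changed(out)   # contents have been rewritten

print(c(first = first, second = second, third = third))
```

The sidecar file is a single line of text, so it is small enough to commit to git even when the output itself is ignored.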
