A better workflow for managing heaps of plots?

ggplot2

#1

When I do analyses, it’s common for me to run off a butt-load of plots. Even with faceting and grouping and all those fun tricks, I might have anywhere from dozens to hundreds of plots spat out of an analysis.

A typical problem for me is turning the crank on my analysis (which is usually on a remote server), writing out all of these exploratory plots to disk, then transferring the whole directory of plots to my own drive using scp or rsync to browse.

At some point, I start picking out plots for display (e.g. in PowerPoint), and changing them for public consumption is a little frustrating. It seems wasteful to go back into my analysis code, inject a bunch of extra ggplot code to polish them up, and then run the whole thing again just because I wanted a couple of plots to have light-on-dark or a transparent background or a better colour scheme. On the other hand, I’d rather avoid modifying the PDFs or SVGs in Illustrator, because if the analysis changes, I have to redo that work (and PDF output, while layered, isn’t layered semantically, so it’s a pain I’d like to minimise).

How can I do this better? It strikes me that it might be a good idea to start saving my exploratory plot objects, so that if I ever need to go back and pretty one up, I can just retrieve the object and add new ggplot2 elements to change it. I’ve done this before with GLMs, exporting out a named list with metadata encoded into the element names and saving the structure to disk with saveRDS.
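For illustration, here is a minimal sketch of that idea: a named list of ggplot objects saved with saveRDS, reloaded later, and restyled without re-running the analysis. The `mtcars` data, the element names and the file paths are placeholders, not anything from my real analysis.

```r
library(ggplot2)

# Hypothetical exploratory plots; the metadata is encoded in the element names
plots <- list(
  "mpg_vs_wt__cyl4" = ggplot(subset(mtcars, cyl == 4), aes(wt, mpg)) + geom_point(),
  "mpg_vs_wt__cyl6" = ggplot(subset(mtcars, cyl == 6), aes(wt, mpg)) + geom_point()
)

# Save the plot objects themselves, not just the rendered files
plot_file <- file.path(tempdir(), "exploratory_plots.rds")
saveRDS(plots, plot_file)

# Later, possibly in another session: reload and polish only the plot you need
plots <- readRDS(plot_file)
polished <- plots[["mpg_vs_wt__cyl4"]] +
  theme_minimal() +
  theme(plot.background = element_rect(fill = "transparent", colour = NA))

# Write out just the polished plot
ggsave(file.path(tempdir(), "cyl4_polished.pdf"), polished, width = 6, height = 4)
```

One caveat: a saved ggplot object carries its data with it, so these files can get large, and an object saved under one ggplot2 version may not behave identically under a much newer one.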

So this is starting to sound like a package idea. A basic version of this could write plots to a data frame list column instead of out to disk directly; a more complex version could write them to an external database. Either way, such a database could:

  • Return plot objects based on metadata, so that you can write them out or view them (or modify them first);
  • Return selections of plot objects, in case you need to make changes to a bunch of plots;
  • Potentially do something akin to version control for plots, if that would be useful.
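A rough sketch of the basic (data frame) version, to make the idea concrete. The `make_plot` helper, the `plot_store` name and the metadata columns are all hypothetical, and `mtcars` stands in for real analysis output.

```r
library(ggplot2)
library(tibble)

# Hypothetical plotting helper: one scatter plot per cylinder count
make_plot <- function(cyl_value) {
  ggplot(subset(mtcars, cyl == cyl_value), aes(wt, mpg)) +
    geom_point() +
    ggtitle(paste("cyl =", cyl_value))
}

# One row per plot: metadata in ordinary columns, plot objects in a list column
plot_store <- tibble(
  analysis = "weight_vs_mpg",
  cyl      = c(4, 6, 8),
  plot     = lapply(cyl, make_plot)
)

# Return a selection of plots based on metadata...
selected <- plot_store[plot_store$cyl >= 6, ]

# ...and apply the same change to every plot in the selection
selected$plot <- lapply(selected$plot, function(p) p + theme_dark())
```

Something like version control could presumably be approximated by adding a timestamp or version column and appending rows instead of overwriting them, though I haven't thought that part through.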

Am I overthinking/overengineering this? Does anyone else have this problem dealing with too many plots? Could I just eliminate this problem by having a better workflow in other ways?


#2

Depending on the scale at which you want to break down your plots and how variable they are (i.e. if you can describe them succinctly in a few functions, but they are massively parallel in that you want tonnes of iterations with slightly different parameters), I would look at TrelliscopeJS for plotting what they call “small multiples”. There is a great introduction webinar here.

The gist of it is that you write your plot (in base or ggplot) as a function, then apply it over the dataset to create a database of plots. You can then select, filter and order the plots via a Shiny interface on metrics called “cognostics” to locate the important plots and isolate them.
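To show the general pattern, here is a sketch in plain dplyr, tidyr and purrr. It is not the TrelliscopeJS API, which has its own functions and an interactive viewer; the `plot_subset` function, the grouping variables and the summary metrics (stand-ins for cognostics) are assumptions picked for the example.

```r
library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)

# The plot is written once, as a function of one subset of the data
plot_subset <- function(d) {
  ggplot(d, aes(wt, mpg)) + geom_point()
}

# Apply it over every group to build a small "database" of plots,
# with per-group summary metrics stored alongside each plot
plot_db <- mtcars %>%
  group_by(cyl, am) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    plot     = map(data, plot_subset),
    n_obs    = map_int(data, nrow),
    mean_mpg = map_dbl(data, ~ mean(.x$mpg))
  )

# Filter and order on the metrics to locate the plots of interest
top <- plot_db %>%
  filter(n_obs >= 3) %>%
  arrange(desc(mean_mpg))

# The plot object for the highest-ranked group, ready to view, save or modify
best_plot <- top$plot[[1]]
```

Because each plot sits in a list column next to its metrics, ordinary `filter()` and `arrange()` calls do the job that the interactive interface does in the browser.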


#3

That sounds like exactly what I need, @DaveRGP. Right now I’m creating these plots by iterating, but that’s mostly because it’s old, crappy code—there’s no reason it couldn’t be wrapped in a function instead (and I daresay it’d be easier to modify them later if they were wrapped in functions).

EDIT: the more I read about this package, the more it is blowing my mind :open_mouth: I particularly like using group_by %>% nest, as it seems like the sort of workflow I had in mind in OP, and it means the plot objects are recorded there in a list column, ready to be viewed, saved or modified.


#4

Happy to help :slight_smile:


#5

Thank you so much - never heard about TrelliscopeJS before: looked at it and was amazed!