How to abstract this `fill` behavior with ggplot? New Geom? New Stat?

ggplot2

#1

I recently encountered a bar chart that compared counts of an ordinal variable (age range) across two categorical variables (binary gender). The fill behavior/color was interesting: the length of each bar corresponded to the counts of the larger level, but the fill showed the the counts of the smaller level + the remainder of the larger level. I was able to recreate the chart with a fair amount of reshaping, but I'd like to abstract this into a simpler process. Is it easiest to create a new Geom? a new Stat? A separate shaping function?

An example of the code and final plot is below:

library(tidyverse)
library(zeallot)
library(glue)

df <- tibble(
  x1 = c("A", "A", "B", "B", "C", "C"),
  x2 = c("M", "F", "M", "F", "M", "F"),
  x3 = c(10, 15, 20, 30, 40, 30)
) %>% 
  group_by(x2, x1) %>% 
  summarize(estimate = sum(x3)) %>% 
  ungroup() %>% 
  spread(x2, estimate) %>% 
  mutate(larger = if_else(`F` > `M`, "Larger F", "Larger M")) %>% 
  gather(x2, estimate, `F`:`M`, -x1, -larger)

c(fdf, mdf) %<-% split(df, f = df$x2)

names(mdf) <- glue("m_{names(mdf)}")
names(fdf) <- glue("f_{names(fdf)}")

df2 <- bind_cols(mdf, fdf) %>%  
  rowwise() %>% 
  mutate(larger = if_else(m_estimate > f_estimate, "Male", "Female"),
         total_est = m_estimate + f_estimate,
         min_pop_est = min(m_estimate, f_estimate)) %>% 
  mutate(remainder_est = max(m_estimate, f_estimate) - min_pop_est) 

df3 <- df2 %>% 
  select(group = f_x1, larger, min_pop_est, remainder_est) %>% 
  gather(fill_col, value, min_pop_est:remainder_est, -group, -larger) %>% 
  unite("fill_col", c("larger", "fill_col"), sep = "_") %>% 
  mutate(fill_col = recode(
    fill_col,
    "Female_min_pop_est" = "min_pop_est",
    "Male_min_pop_est" = "min_pop_est"))

ggplot(df3, aes(x = group, y = value, fill = fill_col)) +
  geom_col() +
  coord_flip()

27%20PM


#2

Okay - not sure if it is better than what you did, but an 'almost got it but it's getting late' approach is below which defines a new stat. The stat seems to be computing the values correctly (I plotted the points too). However, the stacking is off in group C because that is the only group where the male group provides the remainder...not exactly great behavior. Also, this solution doesn't explicitly label which group caused the remainder as your example does.

So - I guess this is a partial solution and hope it can help you figure out the rest.

  library(tidyverse)
#> Warning: package 'stringr' was built under R version 3.4.4
df <- tibble(
  x1 = c("A", "A", "B", "B", "C", "C"),
  x2 = c("M", "F", "M", "F", "M", "F"),
  x3 = c(10, 15, 20, 30, 40, 30)
) %>% 
  group_by(x2, x1) %>% 
  summarize(estimate = sum(x3)) %>% 
  ungroup() %>% 
  spread(x2, estimate) %>% 
  mutate(larger = if_else(`F` > `M`, "Larger F", "Larger M")) %>% 
  gather(x2, estimate, `F`:`M`, -x1, -larger)

Stat_Rem <- ggproto("Stat_Rem", Stat,
                     compute_group = function(data, scales) {
                       oo  <- order(data$y)
                       ll <- length(data$y)
                       ind <- which(oo == ll)
                       ss <- sum(data$y[-ind])
                       data$y[ind] <- data$y[ind] - ss
                       data
                     },
                     required_aes = c("x", "y")
)


stat_rem <- function(mapping = NULL, data = NULL, geom = "col",
                       position = 'identity', na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...){
  layer(
    stat = Stat_Rem, data = data, mapping = mapping, geom = geom, 
    position = position, show.legend = show.legend, inherit.aes = inherit.aes, params = list(na.rm = na.rm, ...)
  )
}


ggplot(df, aes(x = x1, y = estimate, fill = x2)) +
  stat_rem() + geom_point() + 
  coord_flip()

Created on 2018-08-02 by the reprex
package
(v0.2.0).


#3

This is a great start, thanks John. This is tricky!


#4

@jrlewi I forgot to ask: can I not just stuff all those above tidyverse pipes within compute_group()? Like this isn't working right now, but is there any theoretical reason why it wouldn't work if I continued working on it?

Stat_Rem <- ggproto("Stat_Rem", Stat,
                    compute_group = function(data, scales) {
                      if (length(unique(data$fill)) > 2) {
                        stop("Don't know how to create compare_cols with more than two fill levels")
                      }
                      
                      fv <- unique(data$fill)
                      fvl <- list()
                      for (i in seq_along(fv)) {
                        fvl[[i]] <- rlang::sym(fv[i])
                      }
                      
                        data <- dplyr::group_by(data, fill, x)
                        data <- dplyr::summarize(data, y = sum(y))
                        data <- dplyr::ungroup(data)
                        data <- tidyr::spread(data, fill, y)
                        data <- dplyr::mutate(data, larger = dplyr::if_else(!!fvl[[1]] < !!fvl[[2]], "larger1", "larger2"))
                        data <- tidyr::gather(data, y, !!fvl[[1]]:!!fvl[[2]], -x, -larger)
                        
                        s <- split(data, f = data$fill)
                        df1 <- s[[1]]
                        df2 <- s[[2]]
                        
                        names(df1) <- glue::glue("{names(df1)}1")
                        names(df2) <- glue::glue("{names(df2)}2")
                        
                        data <- dplyr::bind_cols(df1, df2)
                        data <- dplyr::rowwise(data)
                        data <- dplyr::mutate(data, larger = dplyr::if_else(estimate1 > estimate2), "1", "2",
                                              total = estimate1 + estimate2,
                                              smaller = min(estimate1, estimate2))
                        data <- dplyr::mutate(data, remainder = max(estimate1, estimate2) - smaller)
                        data <- dplyr::select(data, g = x1, larger, smaller, remainder)
                        data <- dplyr::gather(data, fill, y, smaller:remainder, -g, -larger)
                        data <- tidyr::unite("fill", c("larger", "fill"), sep = "_")
                        data
                    },
                    required_aes = c("x", "y", "fill")
)

#5

Good question and I am not sure. Admittedly I don't know a ton about extending ggplot to custom geoms/stats; I took your question as a learning opportunity. So maybe someone more knowledgable in this area can chime in.