How to abstract this `fill` behavior with ggplot? New Geom? New Stat?

daranzolin · August 2, 2018, 11:30pm

I recently encountered a bar chart that compared counts of an ordinal variable (age range) across two categorical variables (binary gender). The fill behavior/color was interesting: the length of each bar corresponded to the counts of the larger level, but the fill showed the the counts of the smaller level + the remainder of the larger level. I was able to recreate the chart with a fair amount of reshaping, but I'd like to abstract this into a simpler process. Is it easiest to create a new Geom? a new Stat? A separate shaping function?

An example of the code and final plot is below:

library(tidyverse)
library(zeallot)
library(glue)

df <- tibble(
  x1 = c("A", "A", "B", "B", "C", "C"),
  x2 = c("M", "F", "M", "F", "M", "F"),
  x3 = c(10, 15, 20, 30, 40, 30)
) %>% 
  group_by(x2, x1) %>% 
  summarize(estimate = sum(x3)) %>% 
  ungroup() %>% 
  spread(x2, estimate) %>% 
  mutate(larger = if_else(`F` > `M`, "Larger F", "Larger M")) %>% 
  gather(x2, estimate, `F`:`M`, -x1, -larger)

c(fdf, mdf) %<-% split(df, f = df$x2)

names(mdf) <- glue("m_{names(mdf)}")
names(fdf) <- glue("f_{names(fdf)}")

df2 <- bind_cols(mdf, fdf) %>%  
  rowwise() %>% 
  mutate(larger = if_else(m_estimate > f_estimate, "Male", "Female"),
         total_est = m_estimate + f_estimate,
         min_pop_est = min(m_estimate, f_estimate)) %>% 
  mutate(remainder_est = max(m_estimate, f_estimate) - min_pop_est) 

df3 <- df2 %>% 
  select(group = f_x1, larger, min_pop_est, remainder_est) %>% 
  gather(fill_col, value, min_pop_est:remainder_est, -group, -larger) %>% 
  unite("fill_col", c("larger", "fill_col"), sep = "_") %>% 
  mutate(fill_col = recode(
    fill_col,
    "Female_min_pop_est" = "min_pop_est",
    "Male_min_pop_est" = "min_pop_est"))

ggplot(df3, aes(x = group, y = value, fill = fill_col)) +
  geom_col() +
  coord_flip()

27%20PM

jrlewi · August 3, 2018, 2:58am

Okay - not sure if it is better than what you did, but an 'almost got it but it's getting late' approach is below which defines a new stat. The stat seems to be computing the values correctly (I plotted the points too). However, the stacking is off in group C because that is the only group where the male group provides the remainder...not exactly great behavior. Also, this solution doesn't explicitly label which group caused the remainder as your example does.

So - I guess this is a partial solution and hope it can help you figure out the rest.

  library(tidyverse)
#> Warning: package 'stringr' was built under R version 3.4.4
df <- tibble(
  x1 = c("A", "A", "B", "B", "C", "C"),
  x2 = c("M", "F", "M", "F", "M", "F"),
  x3 = c(10, 15, 20, 30, 40, 30)
) %>% 
  group_by(x2, x1) %>% 
  summarize(estimate = sum(x3)) %>% 
  ungroup() %>% 
  spread(x2, estimate) %>% 
  mutate(larger = if_else(`F` > `M`, "Larger F", "Larger M")) %>% 
  gather(x2, estimate, `F`:`M`, -x1, -larger)

Stat_Rem <- ggproto("Stat_Rem", Stat,
                     compute_group = function(data, scales) {
                       oo  <- order(data$y)
                       ll <- length(data$y)
                       ind <- which(oo == ll)
                       ss <- sum(data$y[-ind])
                       data$y[ind] <- data$y[ind] - ss
                       data
                     },
                     required_aes = c("x", "y")
)


stat_rem <- function(mapping = NULL, data = NULL, geom = "col",
                       position = 'identity', na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, ...){
  layer(
    stat = Stat_Rem, data = data, mapping = mapping, geom = geom, 
    position = position, show.legend = show.legend, inherit.aes = inherit.aes, params = list(na.rm = na.rm, ...)
  )
}


ggplot(df, aes(x = x1, y = estimate, fill = x2)) +
  stat_rem() + geom_point() + 
  coord_flip()

Created on 2018-08-02 by the reprex
package (v0.2.0).

daranzolin · August 4, 2018, 5:46am

This is a great start, thanks John. This is tricky!

daranzolin · August 4, 2018, 6:16am

@jrlewi I forgot to ask: can I not just stuff all those above tidyverse pipes within compute_group()? Like this isn't working right now, but is there any theoretical reason why it wouldn't work if I continued working on it?

Stat_Rem <- ggproto("Stat_Rem", Stat,
                    compute_group = function(data, scales) {
                      if (length(unique(data$fill)) > 2) {
                        stop("Don't know how to create compare_cols with more than two fill levels")
                      }
                      
                      fv <- unique(data$fill)
                      fvl <- list()
                      for (i in seq_along(fv)) {
                        fvl[[i]] <- rlang::sym(fv[i])
                      }
                      
                        data <- dplyr::group_by(data, fill, x)
                        data <- dplyr::summarize(data, y = sum(y))
                        data <- dplyr::ungroup(data)
                        data <- tidyr::spread(data, fill, y)
                        data <- dplyr::mutate(data, larger = dplyr::if_else(!!fvl[[1]] < !!fvl[[2]], "larger1", "larger2"))
                        data <- tidyr::gather(data, y, !!fvl[[1]]:!!fvl[[2]], -x, -larger)
                        
                        s <- split(data, f = data$fill)
                        df1 <- s[[1]]
                        df2 <- s[[2]]
                        
                        names(df1) <- glue::glue("{names(df1)}1")
                        names(df2) <- glue::glue("{names(df2)}2")
                        
                        data <- dplyr::bind_cols(df1, df2)
                        data <- dplyr::rowwise(data)
                        data <- dplyr::mutate(data, larger = dplyr::if_else(estimate1 > estimate2), "1", "2",
                                              total = estimate1 + estimate2,
                                              smaller = min(estimate1, estimate2))
                        data <- dplyr::mutate(data, remainder = max(estimate1, estimate2) - smaller)
                        data <- dplyr::select(data, g = x1, larger, smaller, remainder)
                        data <- dplyr::gather(data, fill, y, smaller:remainder, -g, -larger)
                        data <- tidyr::unite("fill", c("larger", "fill"), sep = "_")
                        data
                    },
                    required_aes = c("x", "y", "fill")
)

jrlewi · August 4, 2018, 1:42pm

Good question and I am not sure. Admittedly I don't know a ton about extending ggplot to custom geoms/stats; I took your question as a learning opportunity. So maybe someone more knowledgable in this area can chime in.

yonicd · August 27, 2018, 10:26pm

10 minutes worth of hacky geom, but a start in the right direction that should get you far enough to go at it alone. i can try a bit more when i get a chance.

library(gapminder)
library(ggplot2)
library(rlang)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.1
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

x <- gapminder %>%
  dplyr::mutate(pop=as.numeric(pop))%>%
  dplyr::group_by(continent, year) %>%
  dplyr::summarise(populations = sum(pop)) %>%
  tidyr::spread(continent, populations) %>%
  dplyr::mutate(
    year = factor(year)
  )

geom_cbar <- function(mapping = NULL, data = NULL, stat = "identity", position = "stack", 
                         ..., width = NULL, binwidth = NULL, na.rm = FALSE, show.legend = NA, 
                         inherit.aes = TRUE) {

  # create  the difference
  
  diff_data <- data%>%
    dplyr::mutate(
      diff = !!mapping$g1 - !!mapping$g2,
      grey  = (!!mapping$g1 + !!mapping$g2) - diff,
      area = ifelse((diff) > 0,rlang::quo_name(mapping$g1),rlang::quo_name(mapping$g2)),
      diff = abs(diff)
    )
  
  # create the comparison
  
  comp_data <- diff_data%>%
    dplyr::select(!!mapping$x,area,diff,grey)%>%
    tidyr::gather(type,value,-c(!!mapping$x,area))%>%
    dplyr::mutate(
      comparison = sprintf('%s_%s',type,area),
      comparison = ifelse(grepl('^grey',comparison),NA,comparison),
      comparison  = factor(comparison, labels = c(rlang::quo_name(mapping$g1),
                                                  rlang::quo_name(mapping$g2))
                           )
    )
  
  mapping$g1 <- NULL
  mapping$g2 <- NULL
  
  layer(data = comp_data, mapping = mapping, stat = stat, geom = GeomBar, 
        position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
        params = list(width = width, na.rm = na.rm, ...)
        )
}


p <- ggplot() +
  geom_cbar(aes(x=year,y=value,fill=comparison,g1 = Americas, g2 = Europe),data=x)

p


p + scale_fill_discrete(breaks = c('Americas','Europe'))


ggplot() +
  geom_cbar(aes(x=year,y=value,fill=comparison,g1 = Americas, g2 = Africa),data=x)

Created on 2018-08-27 by the reprex
package (v0.2.0).

daranzolin · August 27, 2018, 11:33pm

Nice! Thanks for your help with this!

lol I was secretly hoping it was impossible in the ggplot framework and everyone would have to use my package.