Using stat_ instead of dplyr to summarize groups in a ggplot?

ggplot2

#1

Here there,

I would like to create a usual ggplot2 with 2 variables x, y and a grouping factor z.
On top of the plot I would like a mean and an interval for each grouping level (so for both x and y).
I can do that using dplyr and multiple call to ggplot functions (see below) but I find the code rather ugly and I wonder if one cannot use cleverly one of the stat_ functions (or something else) to get the same thing in 3 lines of code without having to do the dplyr steps by hand.

library(ggplot2)
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarize(mean_Petal.Length = mean(Petal.Length),
            mean_Sepal.Length = mean(Sepal.Length),
            se_Petal.Length = sd(Petal.Length)/sqrt(length(Petal.Length)),
            se_Sepal.Length = sd(Sepal.Length)/sqrt(length(Sepal.Length))) %>%
  ggplot(aes(y = mean_Petal.Length, x = mean_Sepal.Length, colour = Species)) +
    geom_point() +
    geom_errorbarh(aes(xmax = mean_Sepal.Length + 2*se_Sepal.Length,
                       xmin = mean_Sepal.Length - 2*se_Sepal.Length),
                   height = 0.1) +
    geom_errorbar(aes(ymax = mean_Petal.Length + 2*se_Petal.Length,
                      ymin = mean_Petal.Length - 2*se_Petal.Length),
                  width = 0.05) +
    geom_point(data = iris, aes(y = Petal.Length, x = Sepal.Length, colour = Species),
               shape = 1) +
  theme_classic() +
  labs(y = "Petal Length", x = "Sepal Length")

Many thanks,

Alex

PS: hopefully the solution would also ensure that vertical and horizontal flat arrow heads have same length without having to make the code even uglier.


#2

To generate the mean points and error bars, we need to be able to use the mean x and y locations of each Species as the x and y aesthetics, respectively. To do that, we need to add those values to the data frame (which I've done below with left_join). Then we can use stat_summary and stat_summaryh (from the ggstance package). In the example below, I've also used summarise_at to shorten the code that generates the means.

library(tidyverse)
library(ggstance)

# Join means by Species to the original data frame and pipe to ggplot
left_join(iris, 
          iris %>%
            group_by(Species) %>%
            summarise_at(vars(matches("Length")), funs(mean=mean))
          ) %>% 
  ggplot(aes(colour=Species)) +
    # Plot raw data
    geom_point(aes(x=Sepal.Length, y=Petal.Length), shape=1) +
    # Plot mean by Species
    geom_point(aes(x=Sepal.Length_mean, y=Petal.Length_mean), shape=18, size=3) +
    # Add vertical errorbars 
    stat_summary(fun.data=mean_se, fun.args=list(mult=1.96), 
                 aes(x=Sepal.Length_mean, y=Petal.Length),
                 geom="errorbar", width=0.05) +
    # Add horizontal errorbars
    stat_summaryh(fun.data=mean_se_h, fun.args=list(mult=1.96), 
                  aes(x=Sepal.Length, y=Petal.Length_mean),
                  geom="errorbarh", width=0.1) +
    theme_bw()

Rplot05

Also, for future reference, your original code can be shortened by calculating the means and standard errors with summarise_at:

iris %>%
    group_by(Species) %>%
    summarise_at(vars(matches("Length")), funs(mean, se=sd(.)/sqrt(n())))

Here's a function that attempts to automate the creation of the plot above. It may look a bit confusing if you haven't studied non-standard evaluation in the tidyverse (I still find it more than a bit confusing myself):

library(rlang)

fnc = function(data, group, x, y, adj=1) {

  group=enquo(group)
  x = enquo(x)
  y = enquo(y)
  
  # Set size of whisker end caps
  wv = data %>% pull(!!y) %>% range %>% diff/100*adj
  wh = data %>% pull(!!x) %>% range %>% diff/100*adj
  
  # If grouping variable is numeric, turn it into a factor
  if(data %>% pull(!!group) %>% is.numeric) {
    data = data %>% 
      mutate(!!quo_name(group) := factor(!!group))
  } 

  # Generate column names for the x and y means that we'll calculate below
  xmean = paste0(quo_text(x), "_mean")
  ymean = paste0(quo_text(y), "_mean")
  
  left_join(data, 
            data %>%
              group_by(!!group) %>%
              summarise_at(vars(!!x, !!y), funs(mean=mean))
           ) %>% 
    ggplot(aes(colour=!!group)) +
      geom_point(aes(x=!!x, y=!!y), shape=1) +
      geom_point(aes(x=!!sym(xmean), y=!!sym(ymean)), shape=18, size=3) +
      stat_summary(fun.data=mean_se, fun.args=list(mult=1.96),
                   aes(x=!!sym(xmean), y=!!y),
                   geom="errorbar", width=wh) +
      stat_summaryh(fun.data=mean_se_h, fun.args=list(mult=1.96),
                    aes(x=!!x, y=!!sym(ymean)),
                    geom="errorbarh", width=wv) +
      theme_bw()
}

Now let's try out the function:

egg::ggarrange(
  fnc(iris, Species, Petal.Length, Sepal.Length),
  fnc(starwars %>% filter(mass < 500, grepl("Human|Droid|Wookiee|Yoda", species)), 
      species, height, mass),
  fnc(mtcars, cyl, mpg, hp),
  fnc(mtcars, cyl, mpg, hp, adj=3),
  ncol=2)