Using stat_ instead of dplyr to summarize groups in a ggplot?



Here there,

I would like to create a usual ggplot2 with 2 variables x, y and a grouping factor z.
On top of the plot I would like a mean and an interval for each grouping level (so for both x and y).
I can do that using dplyr and multiple call to ggplot functions (see below) but I find the code rather ugly and I wonder if one cannot use cleverly one of the stat_ functions (or something else) to get the same thing in 3 lines of code without having to do the dplyr steps by hand.


iris %>%
  group_by(Species) %>%
  summarize(mean_Petal.Length = mean(Petal.Length),
            mean_Sepal.Length = mean(Sepal.Length),
            se_Petal.Length = sd(Petal.Length)/sqrt(length(Petal.Length)),
            se_Sepal.Length = sd(Sepal.Length)/sqrt(length(Sepal.Length))) %>%
  ggplot(aes(y = mean_Petal.Length, x = mean_Sepal.Length, colour = Species)) +
    geom_point() +
    geom_errorbarh(aes(xmax = mean_Sepal.Length + 2*se_Sepal.Length,
                       xmin = mean_Sepal.Length - 2*se_Sepal.Length),
                   height = 0.1) +
    geom_errorbar(aes(ymax = mean_Petal.Length + 2*se_Petal.Length,
                      ymin = mean_Petal.Length - 2*se_Petal.Length),
                  width = 0.05) +
    geom_point(data = iris, aes(y = Petal.Length, x = Sepal.Length, colour = Species),
               shape = 1) +
  theme_classic() +
  labs(y = "Petal Length", x = "Sepal Length")

Many thanks,


PS: hopefully the solution would also ensure that vertical and horizontal flat arrow heads have same length without having to make the code even uglier.


To generate the mean points and error bars, we need to be able to use the mean x and y locations of each Species as the x and y aesthetics, respectively. To do that, we need to add those values to the data frame (which I've done below with left_join). Then we can use stat_summary and stat_summaryh (from the ggstance package). In the example below, I've also used summarise_at to shorten the code that generates the means.


# Join means by Species to the original data frame and pipe to ggplot
          iris %>%
            group_by(Species) %>%
            summarise_at(vars(matches("Length")), funs(mean=mean))
          ) %>% 
  ggplot(aes(colour=Species)) +
    # Plot raw data
    geom_point(aes(x=Sepal.Length, y=Petal.Length), shape=1) +
    # Plot mean by Species
    geom_point(aes(x=Sepal.Length_mean, y=Petal.Length_mean), shape=18, size=3) +
    # Add vertical errorbars 
    stat_summary(, fun.args=list(mult=1.96), 
                 aes(x=Sepal.Length_mean, y=Petal.Length),
                 geom="errorbar", width=0.05) +
    # Add horizontal errorbars
    stat_summaryh(, fun.args=list(mult=1.96), 
                  aes(x=Sepal.Length, y=Petal.Length_mean),
                  geom="errorbarh", width=0.1) +


Also, for future reference, your original code can be shortened by calculating the means and standard errors with summarise_at:

iris %>%
    group_by(Species) %>%
    summarise_at(vars(matches("Length")), funs(mean, se=sd(.)/sqrt(n())))

Here's a function that attempts to automate the creation of the plot above. It may look a bit confusing if you haven't studied non-standard evaluation in the tidyverse (I still find it more than a bit confusing myself):


fnc = function(data, group, x, y, adj=1) {

  x = enquo(x)
  y = enquo(y)
  # Set size of whisker end caps
  wv = data %>% pull(!!y) %>% range %>% diff/100*adj
  wh = data %>% pull(!!x) %>% range %>% diff/100*adj
  # If grouping variable is numeric, turn it into a factor
  if(data %>% pull(!!group) %>% is.numeric) {
    data = data %>% 
      mutate(!!quo_name(group) := factor(!!group))

  # Generate column names for the x and y means that we'll calculate below
  xmean = paste0(quo_text(x), "_mean")
  ymean = paste0(quo_text(y), "_mean")
            data %>%
              group_by(!!group) %>%
              summarise_at(vars(!!x, !!y), funs(mean=mean))
           ) %>% 
    ggplot(aes(colour=!!group)) +
      geom_point(aes(x=!!x, y=!!y), shape=1) +
      geom_point(aes(x=!!sym(xmean), y=!!sym(ymean)), shape=18, size=3) +
      stat_summary(, fun.args=list(mult=1.96),
                   aes(x=!!sym(xmean), y=!!y),
                   geom="errorbar", width=wh) +
      stat_summaryh(, fun.args=list(mult=1.96),
                    aes(x=!!x, y=!!sym(ymean)),
                    geom="errorbarh", width=wv) +

Now let's try out the function:

  fnc(iris, Species, Petal.Length, Sepal.Length),
  fnc(starwars %>% filter(mass < 500, grepl("Human|Droid|Wookiee|Yoda", species)), 
      species, height, mass),
  fnc(mtcars, cyl, mpg, hp),
  fnc(mtcars, cyl, mpg, hp, adj=3),