unable to plot mode in ggplot

rahul67 · August 10, 2019, 8:53pm

Hello all,
I am unable to plot mode in ggplot.

dataset=mtcars, m <- c(mtcars$mpg), mean <- mean(m), median <- median(m), mode <- names(table(m))[table(m)==max(table(m))]

ggplot(data = mtcars, aes(mpg))+geom_histogram(color='white',fill='grey81', binwidth = 4)+
  labs(x='mpg', y='time', title='Histogram of mpg')+
  theme_bw()+
  geom_vline(aes(xintercept=mean, color="mean"), show.legend = TRUE, size=2)+
  geom_vline(aes(xintercept=median, color="median"), show.legend = TRUE, linetype="dashed")+
  geom_point(aes(xintercept=mode, color="mode"), show.legend = TRUE, linetype="dashed")+
  scale_color_manual(name="central tendency", values = c(mean="red", median="blue"))

andresrcs · August 10, 2019, 10:11pm

Another option is to use stat_bin() to get the calculated count and only plot the maximum, see this example.

library(ggplot2)
m <- c(mtcars$mpg)
mean <- mean(m)
median <- median(m)

ggplot(data = mtcars, aes(mpg)) + 
    geom_histogram(color='white',fill='grey81', binwidth = 4)+
    labs(x='mpg', y='time', title='Histogram of mpg')+
    theme_bw()+
    geom_vline(aes(xintercept=mean, color="mean"), show.legend = TRUE, size=2)+
    geom_vline(aes(xintercept=median, color="median"), show.legend = TRUE, linetype="dashed")+
    stat_bin(geom = "vline",
             aes(xintercept = stat(ifelse(count == max(count), x, NA)), color = "mode"),
             show.legend = TRUE,
             linetype="dashed") +
    scale_color_manual(name="central tendency", values = c(mean="red", median="blue", mode="green"))

^{Created on 2019-08-10 by the reprex package (v0.3.0.9000)}

pieterjanvc · August 10, 2019, 10:25pm

Hi,

Just to add... the mode is a tricky parameter, as there are several conditions:

Only one value has the highest frequency: 1 mode
Multiple values have the same, highest, frequency: multiple modes
all values occur only once: no mode

You can implement that like this:

library(ggplot2)
library(dplyr)
dataset=mtcars

getMode = function(x){
  
  test = table(x) 
  
  if(length(test) > 1){
    test = table(x) %>% sort(decreasing = T) %>% as.data.frame(stringsAsFactors = F) %>%
      mutate(x = as.numeric(x)) %>% filter(Freq == max(Freq))
    test$x
  } else {
    NULL
  }
}

m <- c(mtcars$mpg)
mean <- mean(m)
median <- median(m) 
mode <- getMode(m)

ggplot(data = mtcars, aes(mpg))+geom_histogram(color='white',fill='grey81', binwidth = 4)+
  labs(x='mpg', y='time', title='Histogram of mpg')+
  theme_bw()+
  geom_vline(aes(xintercept=mean, color="mean"), show.legend = TRUE, size=2)+
  geom_vline(aes(xintercept=median, color="median"), show.legend = TRUE, linetype="dashed", size=2)+
  geom_vline(data = as.data.frame(mode), aes(xintercept=mode, color = "mode"), show.legend = TRUE, linetype="solid")+
  scale_color_manual(name="central tendency", values = c(mean="red", median="blue", mode = "darkgreen"))

Note that the data you're looking at has multiple modes... (19.2 is both a mode and median so overlap)

getMode(m)
[1] 10.4 15.2 19.2 21.0 21.4 22.8 30.4

Grtz,
PJ

joels · August 10, 2019, 10:58pm

This can also be done with a single call to stat_summary, though it takes a bit of extra work to set it up to use colour and linetype aesthetics. To use stat_summary, I've created a simple function to return the mean, median, and mode. I've also created a helper function to calculate the mode. The function uses a kernel density estimate to estimate the mode and it returns only one mode. With some extra work the various functions and the plot code could be adjusted to account for multiple modes.

library(ggplot2)
library(ggstance)

Mode = function(d, adj=0.75) {
  d = density(d, adjust=adj) 
  d = data.frame(x=d$x, y=d$y)
  d$x[which.max(d$y)]
}

fnc = function(x, ...) {
  c(mean=mean(x), median=median(x), mode=Mode(x, ...))
}

labs = c("mean", "median", "mode")

ggplot(data=mtcars, aes(mpg)) +
  geom_histogram(color='white',fill='grey81', binwidth = 4) +
  stat_summaryh(fun.x=fnc, geom="vline",
                aes(xintercept=..x.., y=0, 
                    colour=factor(..x.., levels=unique(x)), 
                    linetype=factor(..x.., levels=unique(x)))) +
  theme_bw() +
  guides(colour=guide_legend(override.aes=list(size=3))) +
  theme(legend.key=element_rect(fill=NA, colour="white", size=3)) +
  scale_colour_discrete(name="", labels=labs) +
  scale_linetype_manual(name="", breaks=labs,
                        values=c("solid", "dashed", "dashed"))

Rplot02

jcblum · August 11, 2019, 2:48am

A couple more notes on the trickiness of mode as a summary statistic:

There's at least one R package devoted to implementations of mode estimation:

And there are more mode calculation methods than you could possibly want here:

Yarnabrina · August 11, 2019, 2:48am

This is not an answer, but my viewpoints on some of the things I noticed in this thread. Please note that I mean no disrespect to anyone.

This is a wrong R code. If you want to execute multiple lines in a single line, use ; instead of ,.

However, as jcblum pointed out in a DM, though it is possible to use of ;'s to put multiple statements in one line, this is not encouraged. Check out the links below:

2 Syntax | The tidyverse style guide
🖊 [archived] R Coding Style Guide – Iegor Rudnytskyi, PhD – Data Science Educator & Evangelist [3rd point from the very end]

For a continuous variable, mode is defined as the value which corresponds to the maximum probability density. mpg is not a discrete variable, and the concept of frequency doesn't make much sense here.

This will not be able to handle multiple modes. which.max will always return index corresponding to the first maximum, and hence we'll always get a single answer. However, I cannot suggest a better alternative, as even if I try d$x[d$y == max(d$y)], there is no guarantee that we'll get multiple answers.

Illustration

set.seed(seed = 37206)

ModeJoels = function(d, adj=0.75) {
  d = density(d, adjust=adj) 
  d = data.frame(x=d$x, y=d$y)
  d$x[which.max(d$y)]
} # can't find multiple modes

ModeYarnabrina <- function(dataset, adjustChoice = 0.75)
{
  with(data = density(x = dataset, adjust = adjustChoice), expr = {x[y == max(y)]})
} # no guarantee to find multiple modes (and indeed probability of finding multiple modes is very small)

p <- sample(x = 0:1, size = 1000, replace = TRUE)
q <- rnorm(n = 1000, mean = 5)
r <- rnorm(n = 1000, mean = -5)

x <- ifelse(test = p, yes = q, no = r)

ModeJoels(d = x)
#> [1] -5.102445

ModeYarnabrina(dataset = x)
#> [1] -5.102445

hist(x = x)
abline(v = -5.102445, col = "red")

One more edit (and hopefully the final one)

First of all, I apologise to criticise joels's post. He already mentioned about finding only one mode, and I missed that altogether. Sorry!

Second, I was searching online to find multiple modes, and the most confusing part of that is mode is not clearly defined for distributions with multiple modes. Here's a small excerpt from the Wikipedia page:

When the probability density function of a continuous distribution has multiple local maxima it is common to refer to all of the local maxima as modes of the distribution. Such a continuous distribution is called multimodal (as opposed to unimodal). A mode of a continuous probability distribution is often considered to be any value x at which its probability density function has a locally maximum value, so any peak is a mode.

If we assume this definition, then perhaps one can use quantmod::findPeaks or ggpmisc::stat_peaks functions on the estimated density. The main idea of these functions is to consider the derivative and find points where sign of the derivative changes to negative from positive. This is not foolproof, but it usually works. I've provided links to the source codes and a related thread from Cross Validated below:

joels · August 11, 2019, 4:47pm

I mentioned in my answer that the function will return only one mode, although for a kernel density estimate there's little chance of there being more than one mode (if mode is understood to be the globally most likely value(s), though there may be many local maxima). With continuous data, the location of the mode(s) will depend on the details of the mode calculation method (e.g., the bandwidth and kernel in the case of a kernel density estimate).

system · September 1, 2019, 4:47pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.