Quantile Box Plot (which is not an outlier box plot)

Hi all,

Is there an R package that produces Quantile Box plots?

To be specific: the common box-and-whiskers plot is used to show the IQR and outliers that violate the 1.5×IQR rule. Those are easy (and there are tons of packages that have them).

I'm looking for (what I call) a Quantile Box plot. It shows the typical 1st, 2nd(median) and 3rd quantiles, as well as the min and max of the data. There are also tick marks at the 2.5th, 10th, 90th and 97.5th quartiles.

JMP has this plot, and I honestly think that's the only place I've seen it. Striking a chord with anyone?

G'day Lee.
Don't know either if that exists, but I have modified boxplots in the past for my own use.

If what you are looking for is the rightmost plot of the graph below, it is quite simple to draw:

ejemplo

That's the code:

## Data
a <- rexp(1000)

## Derived data for the making the boxplot
b <- quantile(a, c(.025,.1,.25,.5,.75,.9,.975)) # quantiles
mima <- range(a)                        # max and min
cent <- 2                               # hand tunned for this example

## jpeg('ejemplo.jpg')
boxplot(a, xlim = c(-.5,2.5))
points(0 +runif(length(a), -.2,.2), a, pch = 19, col = adjustcolor('dodgerblue', .5))
rect(cent - .2, b[3], cent + .2, b[5], border = 'firebrick')
points(cent + c(-.2,.2), c(b[4], b[4]), type = 'l', col = 'firebrick', lwd = 3)
arrows(cent, b[5], cent, mima[2], col = 'firebrick', angle = 90)
arrows(cent, b[3], cent, mima[1], col = 'firebrick', angle = 90)
points(rep(cent, 4), b[c(1,2,6,7)], pch = '-', cex = 1.5, col = 'firebrick')
axis(side = 1, at = c(0,1,2), labels = c('Data', 'R boxplot', 'new boxplot'))
##dev.off()

This is a rudimentary example, for the shown plot, but I can make it general, for just plotting the rightmost one and including more boxplots in the same plot , changing background colour etc inside its own function on the evening. I am at lunch time but soon I have to go back to work :slight_smile:

cheers

Sorry for re-posting.

What about that:

NewBoxPlot <- function(x,y){
    y <- as.character(y)
    nbox <- length(unique(y))
    f <- unique(y)
    lims <- c(0 - .5, nbox -.5)
    cents <- 0:(nbox-1)
    labs.y <- quantile(x, c(0,.025,.1,.25,.5,.75,.9,.975,1)) # quantiles
    plot(0, xaxt = 'n', yaxt = 'n', bty = 'n', pch = '', ylab = '', xlab = '',xlim = lims, ylim = range(x))
    for (i in 1:nbox){
        xi <- x[y==f[i]]
        cent = cents[i]
        b <- quantile(xi, c(.025,.1,.25,.5,.75,.9,.975)) # quantiles
        mima <- range(xi)                        # max and min
        rect(cent - .2, b[3], cent + .2, b[5], border = 'firebrick')
        points(cent + c(-.2,.2), c(b[4], b[4]), type = 'l', col = 'firebrick', lwd = 3)
        arrows(cent, b[5], cent, mima[2], col = 'firebrick', angle = 90)
        arrows(cent, b[3], cent, mima[1], col = 'firebrick', angle = 90)
        points(rep(cent, 4), b[c(1,2,6,7)], pch = '-', cex = 1.5, col = 'firebrick')
    }
    axis(side = 1, at = cents, labels = f)
    axis(side = 2, at = labs.y, labels = round(labs.y,1))
}

set.seed(666)
Data <- data.frame(x = rnorm(100),
                   y = sample(c('a','b','c','d','e'), 100, replace = TRUE)
                   )


NewBoxPlot(Data$x,Data$y)

extended

Note that for the Y axis, I have used the same quantiles but for all Data$x, but that's intended.
This is a very rudimentary function, easy to be modiffied etc if it is what you wanted :slight_smile:

Your code is amazing and I thank you kindly. You seem to know R quite well, so I'll assume that if this were possible already, you'd know.

Full Marks!

1 Like

I am glad you liked it. It is very 'basic' (from r-'base') code that is handy to know, but some people make fun at me as my code tend to be a blast from the past, over-looping etc (not to mention using emacs 99% of the time :zipper_mouth_face:)
I am curious in which area or research ofr analysis this plots are more used, I never heard of them (but I do not like much boxplots either). I found vague information in a simple search on google...

I have make some cosmetic changes to allow more easy control of the output (function code below)

Data:

set.seed(666)
Data <- data.frame(x = c(rnorm(300, mean=4, sd =1),rexp(300), rbeta(300, shape1 = .5, shape2 = .5) -2.5, c(rnorm(150,0,1.2),rnorm(150,5,1)), runif(300, -4,8)),
                   y = rep(c('a','b','c','d', 'e'), each = 300)

From the simplest version:

NewBoxPlot(Data$x,Data$y)

simple

to a more fancy one:

NewBoxPlot(Data$x,Data$y, fill = TRUE, points = TRUE, width = .7, 
    bg = TRUE, colbg = 'rosybrown1',colbox = 'firebrick4', 
    colpoints = 'firebrick1', qdata = TRUE, qdatacol = 'palevioletred1')

complex

I must say that in one of my screens the background colour is not appreciated (in the auxiliary, maybe the cable is partially broken?), but it is on the other, where I use to see plots, and it is the one from laptop.

the function is:

NewBoxPlot <- function(x,y, points = FALSE, bg = FALSE, colbox = 'black', colpoints = colbox, fill = FALSE,  colbg = 'gray', width = .5, qdata = FALSE, qdatacol = 'white'){
    if (width > 1)
        width = 1
    width <- width / 2
    y <- as.character(y)
    nbox <- length(unique(y))
    f <- unique(y)
    lims <- c(0 - .5, nbox -.5)
    cents <- 0:(nbox-1)
    BG <- function(){
            lbg <- par('usr')
            rect(lbg[1],lbg[3],lbg[2], lbg[4], col = adjustcolor(colbg,.3))
            if (qdata)
                abline(h = quantile(x, c(.25,.5,.75)), col = qdatacol)
    }
    labs.y <- quantile(x, c(0,.025,.1,.25,.5,.75,.9,.975,1)) # quantiles
    plot(0, xaxt = 'n', yaxt = 'n', pch = '', ylab = '', xlab = '', xlim = lims, ylim = range(x), panel.first = if (bg) {BG()})
    for (i in 1:nbox){
        xi <- x[y==f[i]]
        cent = cents[i]
        b <- quantile(xi, c(.025,.1,.25,.5,.75,.9,.975)) # quantiles
        mima <- range(xi)                        # max and min
        if (points)
            points(cent + runif(length(xi), -width/2, width/2), xi, pch = 19, col = adjustcolor(colpoints, .3))
        rect(cent - width, b[3], cent + width, b[5], border = colbox, col = if (fill) {adjustcolor(colbox, .3)})
        points(cent + c(-width,width), c(b[4], b[4]), type = 'l', col = colbox, lwd = 3)
        arrows(cent, b[5], cent, mima[2], col = colbox, angle = 90, lty = 2, length = .3*width)
        arrows(cent, b[3], cent, mima[1], col = colbox, angle = 90, lty = 2, length = .3*width)
        points(rep(cent, 4), b[c(1,2,6,7)], pch = '-', cex = 1.5, col = colbox)
    }
    axis(side = 1, at = cents, labels = f)
    axis(side = 2, at = labs.y, labels = round(labs.y,1))
}

and I am happy to finish a roxy skeleton for having a help file if you think it can have some use :slight_smile:
cheers
Fer

Wow -- stop! I had written some code similar to your very first one (except mine only drew one plot, and I was trying to use segment instead of arrows, which didn't occur to me. I may use it in a book I'm working on, but certainly not your expanded ones that show so much effort.

I've never really known why the plot exists. It shows up in SAS/JMP, so I figured there must be some reason behind it. Initially I thought it was to determine distribution shape, but (a)a regular box plot does that and (b)I already know how to draw a histogram.

Aside from the strange quantile box plot, JMP puts something called the shortest half on its Outlier (==typical) box plot. I think it has something to do with the shorth statistic, but the documentation only defines what it is (the shortest interval that contains 50% of the data), and gives a vague reference to a book (Rousseeuw, P. J., and Leroy, A. M. (1987). Robust Regression and Outlier Detection. New York: John Wiley & Sons) whose 2nd edition I own, and which doesn't help. I'll attach a picture; the shortest half is in red. And vertically is JMP's default way of showing histograms and box plots. Another mystery.

I envy your skill!
JMPplots

1 Like

A quick way to get close to what you're describing would be to change the standard boxplot so that it plots percentiles for the whiskers rather than 1.5*IQR (I find the 1.5*IQR to be difficult to interpret and generally use the 5th and 95th percentiles for the whiskers in my own plots). Also, what you've called the "shortest half" sounds like what I've seen called the highest-density interval (HDI)--the shortest interval that contains some percentage of the data. The HDInterval package has the hdi function for calculating the HDI.

Below is a ggplot version of the above. We create three helper functions to calculate, respectively, the boxplot percentiles (1st, 25th, 50th, 75th, and 99th by default), the HDI (50% by default), and two additional percentiles (5th and 95th by default) and then use stat_summary to draw the geoms.

library(ggplot2)
theme_set(theme_classic() + 
           theme(panel.background=element_rect(colour = "grey40", fill = NA)))

# Function to calculate boxplot percentiles
bp.pctiles = function (x, probs = c(0.01, 0.25, 0.5, 0.75, 0.99)) {
  r <- quantile(x, probs = probs, na.rm = TRUE)
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

# Function to calculate HDI
hdi.stat = function(x, credMass=0.5) {
  r = unname(HDInterval::hdi(x, credMass=credMass))
  r = c(y=r[1], yend=r[2])
  r
}

# Function to calculate two more percentiles
add.percentiles = function(x) {
  r = quantile(x, probs=c(0.05, 0.95))
  names(r) = c("lwr", "upr")
  r
}
ggplot(iris, aes(x=Species, xend=Species,  y=Petal.Width)) +
  # Add boxplots
  stat_summary(fun.data=bp.pctiles, geom="boxplot", aes(width=0.4)) +
  # Add two more percentiles as dashes across the boxplot whiskers
  stat_summary(fun.y=add.percentiles, geom="point", pch="_", colour="grey40", size=6) +
  # Add HDI
  stat_summary(fun.data=hdi.stat, geom="segment", 
               position=position_nudge(x=-0.3), colour="red",
               arrow=arrow(angle=90, ends="both", length=unit(0.1, "cm")))

Rplot01

1 Like

Once again I am skewered by the rapier-like knowledge of the RStudio Community.

I have been looking at the lshorth package and was getting nowhere in figuring out how to get this done.

1 Like

Sorry for some kind of late response.
Box-plots, in any variant, as adding the lines of the high-dens interval that @joels explained us (I never saw them before now), give a summary of data, often rough. For me are blast from the past (and in a time-scale, I owned my first computer when I was around 25). From a past time when one had to draw its own plots, and using statistics that could be computed with a hand calculator, and when also ink and paper was expensive, they were the best deal. They give a brief fast-visual idea of data, but often misleading (look in my last plot, the last two distributions, I make them intentionally to look similar in a box-plot fashion, but quite different in a density-violin fashion)

I have been quite fan of using the so-called violin plots (as Spaniard 'garlic-eater' I am, I would rather call them guitar plot, but...). Much more informative in those days when make a plot is a matter of code, not ink, time or space. I normally code my own violins-guitars, but they are also implemented in several packages, including ggplot2.

Sorry for answering-commenting unsolicited opinions :zipper_mouth_face:

Best

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.