Sample in ggplot2 aesthetic

I am writing a tutorial for my students on using ggplot2 for normality testing. I've been looking for an explanation of what sample = hwy accomplishes in the aesthetic mapping shown below, but haven't found one I can point my students to! Can anyone explain this, or point me to something already out there that does? Thanks!

library(ggplot2)
ggplot(data = mpg, mapping = aes(sample = hwy)) + 
  stat_qq(distribution = qnorm)

Not a full-blown answer, but the ggproto object for stat_qq (defined in stat_qq.r in the tidyverse source code on GitHub) might be of help.

code excerpt also here
StatQq <- ggproto("StatQq", Stat,
  default_aes = aes(y = ..sample.., x = ..theoretical..),

  required_aes = c("sample"),

  compute_group = function(data, scales, quantiles = NULL,
                           distribution = stats::qnorm, dparams = list(),
                           na.rm = FALSE) {

    sample <- sort(data$sample)
    n <- length(sample)

    # Compute theoretical quantiles
    if (is.null(quantiles)) {
      quantiles <- stats::ppoints(n)
    } else {
      stopifnot(length(quantiles) == n)
    }

    theoretical <- do.call(distribution, c(list(p = quote(quantiles)), dparams))

    data.frame(sample, theoretical)
  }
)

It's basically retrieving the sample data of the specified variable (hwy in this case), sorting it, and computing the matching theoretical quantiles (one per observation, using the sample's length) so the two can be plotted against each other. (Which, I know, isn't a great explanation)! :speak_no_evil:
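If it helps, here is a rough base R sketch of what compute_group in the excerpt above does with its defaults (distribution = qnorm, no dparams). The variable names simply mirror the excerpt, and ggplot2 is loaded only to get the mpg data; treat it as an illustration, not the package's actual code path.

```r
library(ggplot2)  # loaded only for the mpg data set

# Sort the observed values: these become the "sample" quantiles (y axis)
sample <- sort(mpg$hwy)
n <- length(sample)

# One probability point per observation, then the matching normal quantiles
# (these become the "theoretical" quantiles on the x axis)
theoretical <- qnorm(ppoints(n))

qq <- data.frame(sample, theoretical)
head(qq)

# Cross-check against base R: qqnorm() uses the same theoretical quantiles
base_qq <- qqnorm(mpg$hwy, plot.it = FALSE)
all.equal(sort(base_qq$x), theoretical)
```

Plotting qq$theoretical against qq$sample should give the same points as the stat_qq() call in the question.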


Isn't hwy just the sample (variable) that you are checking for normality (i.e. the y variable)?


That's a very good question. I've been using ggplot2 for years, but I wasn't aware of this aesthetic. I think the documentation is a bit lacking on this point, but in this example it is rather simple: sample defines the variable whose values are shown in the quantile-quantile plot. I'm afraid the most informative doc page is the one on geom_qq and stat_qq.


Roughly speaking, ggplot2 has two types of geoms/stats:

  1. Visualize the data directly (e.g. geom_point() and geom_line())
  2. Visualize the result of some statistical calculation over the data (e.g. geom_density() and geom_histogram())

Most geoms of the latter type have a corresponding stat_*(). stat_qq() belongs to this second type: it is the stat for quantile-quantile plots.

As sample is a rare aesthetic, you may feel this is very special, but it is not. stat_qq() calculates x and y aesthetics from sample values just as, for example, stat_density() calculates y aesthetics from x values.
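To see this for yourself, here is a small sketch that pulls the computed values out of a built plot with layer_data(). The object names p and computed are just mine, and I'm assuming the default normal distribution.

```r
library(ggplot2)

p <- ggplot(mpg, aes(sample = hwy)) + stat_qq()

# layer_data() returns the data after the stat has run,
# including the x and y values that end up being drawn
computed <- layer_data(p)
head(computed[, c("x", "y")])

# y is the sorted sample, x the theoretical normal quantiles
all.equal(computed$y, sort(mpg$hwy))
all.equal(computed$x, qnorm(ppoints(nrow(mpg))))
```

So sample is only the input; the x and y positions are produced by the stat, in the same way stat_density() produces y from x.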

As for the internals of ggplot2, the vignette Extending ggplot2 may help, though it may be a roundabout way to get there... I'd also like to know of some good documentation!


Possibly your question is more about what a q-q plot is than how ggplot produces it? What you are producing is, after all, just a standard q-q plot. In which case, you might find this link useful: http://data.library.virginia.edu/understanding-q-q-plots/

It is often useful to add the line we would expect to see if the observed data were drawn from (say) a normal distribution. You can do that as follows:

library(ggplot2)
ggplot(mpg, aes(sample = hwy)) + geom_qq() + geom_qq_line()
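
If geom_qq_line() isn't available in your installed version, a rough equivalent is to compute the line yourself and add it with geom_abline(). This sketch follows the same convention as base R's qqline(), drawing the line through the first and third quartiles; the names y_q, x_q, slope and intercept are just illustrative, and I'm assuming a normal reference distribution.

```r
library(ggplot2)

# First and third quartiles of the sample and of the standard normal
y_q <- quantile(mpg$hwy, c(0.25, 0.75), names = FALSE)
x_q <- qnorm(c(0.25, 0.75))

# Line through those two points
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

p <- ggplot(mpg, aes(sample = hwy)) +
  geom_qq() +
  geom_abline(slope = slope, intercept = intercept)
p
```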

@mara, this is perfect. Thanks. Illustrates the source of my confusion (the variable you are testing is the y variable, so why call it the sample variable?). I wasn't thinking of the sample vs theoretical dichotomy that the ggproto object uses, but it makes perfect sense. Thanks!


@DavidB, thanks for the tip about geom_qq_line(). I know how to do this in base R and poked around this summer for how to add the line in ggplot2. It looks like this is a new function that is only available in the development version of ggplot2 on GitHub? I'm excited for the CRAN version to be updated with this. I'm trying to keep things simple for my students and only have them use the CRAN versions of packages...

Also great to see other sociologists on here!


@chris.prener Yes, geom_qq_line is in version 2.2.1.9000, which is currently available on GitHub.

Glad to see it's not all data scientists on here! :wink: