# Sample in ggplot2 aesthetic

#1

I am writing a tutorial for my students on using ggplot2 for normality testing. I've been looking for an explanation of what `sample = hwy` accomplishes in the aesthetic mapping shown below, but haven't found one I can point my students to! Can anyone explain this, or point me to something already out there that does? Thanks!

```r
library(ggplot2)
ggplot(data = mpg, mapping = aes(sample = hwy)) +
  stat_qq(distribution = qnorm)
```

#2

Not a full-blown answer, but the `ggproto` object for `stat_qq` here (that link's just pointing to the line in the tidyverse source code of `stat_qq.r` on GitHub) might be of help.

code excerpt also here
```r
StatQq <- ggproto("StatQq", Stat,
  default_aes = aes(y = ..sample.., x = ..theoretical..),

  required_aes = c("sample"),

  compute_group = function(data, scales, quantiles = NULL,
                           distribution = stats::qnorm, dparams = list(),
                           na.rm = FALSE) {

    sample <- sort(data$sample)
    n <- length(sample)

    # Compute theoretical quantiles
    if (is.null(quantiles)) {
      quantiles <- stats::ppoints(n)
    } else {
      stopifnot(length(quantiles) == n)
    }

    theoretical <- do.call(distribution, c(list(p = quote(quantiles)), dparams))

    data.frame(sample, theoretical)
  }
)
```

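It's basically taking the values of the variable you map to `sample` (`hwy` in this case), sorting them, and computing the same number of theoretical quantiles from the chosen distribution (`ppoints(n)` gives the probabilities, and `qnorm` turns them into quantiles). Those two columns then become the `y` and `x` of the plot. (Which, I know, isn't a great explanation!)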
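For what it's worth, the same calculation can be sketched by hand in a few lines. This is only an illustration of the steps in the excerpt above, not the ggplot2 code itself; the names `sample_sorted`, `probs`, and `qq` are made up for the example, and it assumes the default normal distribution with no extra `dparams`.

```r
library(ggplot2)  # only needed here for the mpg data set

# Observed values, smallest to largest
sample_sorted <- sort(mpg$hwy)
n <- length(sample_sorted)

# n evenly spaced probabilities strictly between 0 and 1
probs <- stats::ppoints(n)

# Standard-normal quantiles at those probabilities
theoretical <- stats::qnorm(probs)

# One row per observation: this is what ends up as y (sample) and x (theoretical)
qq <- data.frame(sample = sample_sorted, theoretical = theoretical)
head(qq)
```

Plotting `qq$theoretical` against `qq$sample` should give the same picture as the `stat_qq()` call in the original question.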

#3

Isn't `hwy` just the sample (variable) that you are checking for normality (i.e. the y variable)?

#4

That's a very good question. I've been using `ggplot2` for years, but I wasn't aware of this aesthetic. I think the documentation is a bit lacking on this point, but in this example it's rather simple: `sample` defines the variable whose values are shown in the quantile-quantile plot. I'm afraid the most informative doc page is the one on `geom_qq` and `stat_qq` here

#5

Roughly speaking, ggplot2 has two types of geoms/stats:

1. Visualize the data directly (e.g. `geom_point()` and `geom_line()`)
2. Visualize the result of some statistical calculation over the data (e.g. `geom_density()` and `geom_histogram()`)

Most geoms of the latter type have a corresponding `stat_*()`. `stat_qq()` is one of these: it is the stat for quantile-quantile plots.

As `sample` is a rare aesthetic, you may feel this is very special, but it is not. `stat_qq()` calculates `x` and `y` aesthetics from `sample` values just as, for example, `stat_density()` calculates `y` aesthetics from `x` values.
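One way to see the parallel is to look at the data each stat hands to its geom. The sketch below uses `layer_data()` for that; the object names (`p_density`, `dens_data`, `p_qq`, `qq_data`) are just for illustration, and the exact set of extra columns returned may differ between ggplot2 versions.

```r
library(ggplot2)

# stat_density(): x goes in, a computed y (the density estimate) comes out
p_density <- ggplot(mpg, aes(x = hwy)) + stat_density()
dens_data <- layer_data(p_density)
head(dens_data[, c("x", "y")])

# stat_qq(): sample goes in, computed x (theoretical) and y (sample) come out
p_qq <- ggplot(mpg, aes(sample = hwy)) + stat_qq()
qq_data <- layer_data(p_qq)
head(qq_data[, c("x", "y")])
```

In both cases the aesthetic you supply is an input to a calculation, and the plotted `x` and `y` are its output.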

For the internals of ggplot2, the vignette Extending ggplot2 may help, though that may be a roundabout route... I'd also like to know of some good documentation!

#6

Possibly your question is more about what a q-q plot is than how ggplot produces it? What you are producing is, after all, just a standard q-q plot. In which case, you might find this link useful: http://data.library.virginia.edu/understanding-q-q-plots/

It is often useful to add the line we would expect to see if the observed data were drawn from (say) a normal distribution. You can do that as follows:

```r
library(ggplot2)
ggplot(mpg, aes(sample = hwy)) + geom_qq() + geom_qq_line()
```
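If `geom_qq_line()` is not available in your installed version, a similar line can be sketched by hand with `geom_abline()`. This assumes the reference line is drawn through the first and third quartiles of the sample and of the standard normal, which is what base R's `qqline()` does by default; the names `probs`, `y_q`, `x_q`, `slope`, and `intercept` are mine.

```r
library(ggplot2)

# Quartiles used to anchor the line (an assumption, matching qqline()'s default)
probs <- c(0.25, 0.75)
y_q <- quantile(mpg$hwy, probs, names = FALSE)  # sample quartiles
x_q <- qnorm(probs)                             # standard-normal quartiles

# Straight line through the two (theoretical, sample) quartile points
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

ggplot(mpg, aes(sample = hwy)) +
  stat_qq() +
  geom_abline(slope = slope, intercept = intercept)
```

Points that fall close to this line are consistent with a normal distribution; systematic curvature away from it suggests skew or heavy tails.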

#7

@mara, this is perfect. Thanks. Illustrates the source of my confusion (the variable you are testing is the y variable, so why call it the sample variable?). I wasn't thinking of the sample vs theoretical dichotomy that the `ggproto` object uses, but it makes perfect sense. Thanks!

#8

@DavidB, thanks for the tip about `geom_qq_line()`. I know how to do this in base R and poked around this summer for how to add the line in `ggplot2`. It looks like this is a new function that is only available in the development version of `ggplot2` on GitHub? I'm excited for the CRAN version to be updated with this. I'm trying to keep things simple for my students and only have them use the CRAN versions of packages...

Also great to see other sociologists on here!

#9

@chris.prener Yes, `geom_qq_line` is in version 2.2.1.9000, which is currently available on GitHub.

Glad to see it's not all data scientists on here!