Can anyone explain what exactly `stat_...()` functions do in `ggplot2`?

I've been struggling for quite a while to understand what exactly stat_...() functions do, and I still don't have a clear answer. Let me start with some example code.

Code 1:

library(ggplot2)
library(tibble)
sample_data <- tibble(x = rnorm(1000), y = rnorm(1000))
ggplot(sample_data, aes(x = x, y = y)) + geom_point()

Code 2:

ggplot(sample_data, aes(x=x, y=y)) + geom_point(stat="density_2d")

Code 3:

ggplot(sample_data, aes(x=x, y=y)) + geom_point(stat="density")
#Error: geom_point requires the following missing aesthetics: y

I know that there is no point in using stat="density_2d" or stat="density" with geom_point(). I'm just trying to understand what exactly stat_density_2d() and stat_density() do behind the scenes. Can anyone explain why the results of Code 1 and Code 2 differ, and why Code 3 throws an error? And what do stat_...() functions (stat_bin(), stat_contour(), stat_boxplot(), etc.) do in general?

The 2d density plot is like a topographic map. The smallest circle in the center shows the region of highest point density. You can kind of see the same thing in the point plot since there is not a lot of overplotting. Here is a plot for two variables sampled from uniform random distributions that only slightly overlap.

sample_data <- data.frame(x = runif(10000,1,5), y = runif(10000,-5,1))
ggplot(sample_data, aes(x=x, y=y)) + geom_point()

image

ggplot(sample_data, aes(x=x, y=y)) + geom_point(stat="density_2d")

image

Each region with no additional elevation lines inside it is a "hill". There are small peaks and bigger peaks. This shows how to deal with overplotting while still getting a sense of where the density is in the joint distribution.


stat functions perform statistical transformations. So instead of plotting the raw data, you are plotting a transformation of the data. This can take many forms, depending on the stat used.

All that the stat does, is take the raw data and compute a new dataframe with transformed (and possibly new) columns.

In this case, the density_2d stat creates a new dataframe with new x and y values, as well as some new variables, such as level. The new x and y coordinates are computed from the old ones, but there is no one-to-one link between them (points aren't individually translated). The level variable indicates which contour line a coordinate belongs to. Normally, paths are drawn between all coordinates belonging to the same line. But since you are using geom_point, the coordinates are just plotted as points. Combining stats and geoms in different ways gives a lot of power to ggplot users.
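One way to see this for yourself is to pull out the dataframe that the geom actually receives. The sketch below uses layer_data() from ggplot2 for that. The seed and the object names p and computed are my own choices for illustration, and the exact set of columns may vary with your ggplot2 version.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the example is reproducible
sample_data <- data.frame(x = rnorm(1000), y = rnorm(1000))

p <- ggplot(sample_data, aes(x = x, y = y)) + geom_point(stat = "density_2d")

# layer_data() returns the data for one layer after the stat has run,
# i.e. what geom_point() is asked to draw
computed <- layer_data(p, 1)

nrow(sample_data)  # number of raw observations
nrow(computed)     # number of contour vertices, unrelated to the raw row count
head(computed[, c("x", "y", "level", "group")])
```

Each row of computed is a vertex on a contour line, not one of the original observations, which is why no raw point "becomes" a particular point in the Code 2 plot.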



My understanding is that density plots are not a rotation of the actual data into another dimensional space but rather a discovery of lines that connect areas of "equal" point density per some configurable area. With a geom that draws lines, such as geom_density_2d(), the linetype aesthetic lets you get solid or dashed lines instead of points in the display.

From the geom_density_2d() documentation: Perform a 2D kernel density estimation using MASS::kde2d() and display the results with contours. This can be useful for dealing with overplotting. This is a 2D version of geom_density().
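To make those two steps concrete, here is a rough sketch that does them by hand: a kernel density estimate on a grid with MASS::kde2d(), then contour tracing with grDevices::contourLines(). This is only an approximation of the idea, not ggplot2's actual code. The grid size n = 100, the default contour levels, and the names dens, iso_lines and contour_df are my assumptions, and ggplot2 may pick different bandwidths, levels, or a different contouring routine.

```r
library(MASS)  # provides kde2d()

set.seed(1)  # arbitrary seed, only so the example is reproducible
sample_data <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Step 1: estimate the joint density on a regular 100 x 100 grid
dens <- kde2d(sample_data$x, sample_data$y, n = 100)

# Step 2: trace lines of equal density through that grid
iso_lines <- contourLines(dens$x, dens$y, dens$z)

# Step 3: stack the vertices of every line into one dataframe,
# keeping the density level and an id for each separate line
contour_df <- do.call(rbind, lapply(seq_along(iso_lines), function(i) {
  data.frame(x = iso_lines[[i]]$x,
             y = iso_lines[[i]]$y,
             level = iso_lines[[i]]$level,
             piece = i)
}))

head(contour_df)
```

Plotting contour_df with geom_point() should give a picture much like Code 2, while geom_path(aes(group = piece)) would connect the vertices into the usual contour lines.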


Thank you for the reply. Yes I know what the contours shown in the plot mean. What I wonder is how this is done. What confuses me is the fact that the contours are comprised of actual data points arranged side by side, rather than drawn as lines. It seems that the x and y coordinates of each point in Code 1 plot are converted to x and y coordinates of each point in Code 2 plot. By which rule does stat_density_2d map each point in Code 1 plot to each point in Code 2 plot? I mean, which points in Code 1 plot comprise the outermost contour in Code 2 plot, and which ones the innermost contour?

Thank you so much @Axeman! Your answer helped me clarify things a lot!