Unexpected behaviour of geom_smooth

I often use geom_smooth() to add smooths to plots of my data. I just discovered that when the y-axis is transformed with a scale (e.g. scale_y_log10()), geom_smooth() unexpectedly fits its smooth to the transformed data. The result differs from a smooth fitted to the raw data, and is biased if the raw-scale smooth is what you wanted. As far as I can see, the documentation gives no warning that this happens.

library(ggplot2)
library(dplyr)
library(mgcv)

set.seed(1)

df <- data.frame(x = seq(0, 1, 0.01)) %>%
  mutate(y = exp(runif(n()) + 6 * (x * (1 - x) * (0.5 - x) + 0.1 * x))) # data

mod <- gam(y ~ s(x, bs = "cs"), data = df, method = "REML") # smooth
df$pred <- predict(mod)

mod2 <- gam(y ~ s(x, bs = "cs"), data = df %>% mutate(y = log10(y)), method = "REML") # smooth in log10 space
df$pred2 <- 10 ^ predict(mod2)

df %>%
  ggplot() +
  labs(colour = "Smooth") +
  geom_point(aes(x = x, y = y)) +
  geom_smooth(aes(x = x, y = y, colour = "geom_smooth"), method = "gam", size = 4) +
  geom_line(aes(x = x, y = pred, colour = "normal space"), size = 1) + # does not match geom_smooth
  geom_line(aes(x = x, y = pred2, colour = "log space"), size = 1) + # matches geom_smooth
  scale_y_log10()
#> `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'


Created on 2021-12-20 by the reprex package (v2.0.1)

Interestingly, if I use coord_trans(y = "log") instead of scale_y_log10() it works as expected!
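
For comparison, here is a minimal sketch of the coord_trans() variant. It rebuilds the same simulated data in base R so the block stands alone, and the final check (my own addition, not part of the original reprex) compares the values geom_smooth() computed against a GAM fitted directly to the raw y. I am assuming geom_smooth(method = "gam") fits with REML, as in the reprex above.

```r
library(ggplot2)

# Same simulated data as in the reprex, built without dplyr
set.seed(1)
x <- seq(0, 1, 0.01)
df <- data.frame(
  x = x,
  y = exp(runif(length(x)) + 6 * (x * (1 - x) * (0.5 - x) + 0.1 * x))
)

# coord_trans() only transforms the axis when drawing,
# so the smooth is fitted to the raw y values
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs")) +
  coord_trans(y = "log")
p

# Compare geom_smooth's fitted values with a GAM fitted to the raw data
sm  <- layer_data(p, 2)  # layer 2 is geom_smooth
fit <- mgcv::gam(y ~ s(x, bs = "cs"), data = df, method = "REML")
max_diff <- max(abs(sm$y - predict(fit, newdata = data.frame(x = sm$x))))
max_diff  # should be close to zero if the smooth was fitted on the raw scale
```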

Edit: What I learned from this is that slapping a smooth on data is risky business. You'd better know what you are doing when the client starts interpreting your smooth as a legitimate model!

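Edit2: Also, the default smoothing parameters are not necessarily very good. With method = "gam" the default formula is y ~ s(x, bs = "cs"), which leaves the basis dimension at mgcv's default (k = 10) and does not provide enough degrees of freedom for many data sets.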
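
One way to give the smooth more flexibility is to pass your own formula with a larger k. This is only a sketch: k = 30 is an arbitrary value I picked for illustration, and whether it is adequate depends on the data. mgcv::k.check() on a matching model is one way to look at that.

```r
library(ggplot2)

# Same simulated data as in the reprex, built without dplyr
set.seed(1)
x <- seq(0, 1, 0.01)
df <- data.frame(
  x = x,
  y = exp(runif(length(x)) + 6 * (x * (1 - x) * (0.5 - x) + 0.1 * x))
)

# Raise the basis dimension from the default to k = 30 (illustrative value)
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 30))
p

# Fit the same model directly and check whether k looks large enough
fit <- mgcv::gam(y ~ s(x, bs = "cs", k = 30), data = df, method = "REML")
kc <- mgcv::k.check(fit)
kc
```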

A running median might be a useful alternative.
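
A rough sketch of that idea using stats::runmed(). The window width (k = 11) and endrule = "keep" are my own choices for illustration. With an odd window the median is always one of the observed values, so taking it before or after a log transform gives the same line, which sidesteps the scale-versus-coord question entirely.

```r
library(ggplot2)

# Same simulated data as in the reprex, built without dplyr
set.seed(1)
x <- seq(0, 1, 0.01)
df <- data.frame(
  x = x,
  y = exp(runif(length(x)) + 6 * (x * (1 - x) * (0.5 - x) + 0.1 * x))
)

# Running median over 11 points (k must be odd);
# endrule = "keep" leaves the first and last few values unsmoothed
df$med <- runmed(df$y, k = 11, endrule = "keep")

# The running median commutes with the log transform
med_raw <- runmed(df$y, k = 11, endrule = "keep")
med_log <- runmed(log10(df$y), k = 11, endrule = "keep")
all.equal(log10(med_raw), med_log)

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(y = med), colour = "red") +
  scale_y_log10()
p
```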

scale_y_log10() transforms the data before any statistical summaries (such as geom_smooth()) are computed, while coord_trans(y = "log") applies the transformation after the summaries have been computed. This is discussed in the help for coord_trans(), but I can't find anything about it in the help for scale_y_log10(). There are a number of Stack Overflow questions about this, which suggests it trips up some people.
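
The ordering is easy to see with a simpler statistic than a smooth. In this small sketch (the toy data and the use of stat_summary() are mine, purely for illustration), the mean computed under scale_y_log10() is the mean of log10(y), i.e. the geometric mean once back-transformed, whereas under coord_trans() it is the ordinary arithmetic mean of y.

```r
library(ggplot2)

# Toy data spanning two orders of magnitude
d <- data.frame(g = "a", y = c(1, 10, 100))

p <- ggplot(d, aes(x = g, y = y)) +
  stat_summary(fun = mean, geom = "point")

# Scale: y is transformed first, so the stat sees log10(y)
y_scale <- layer_data(p + scale_y_log10())$y

# Coord: the stat sees raw y; the transformation happens only when drawing
y_coord <- layer_data(p + coord_trans(y = "log10"))$y

# Back-transformed scale result next to the geometric mean of y
c(scale = 10^y_scale, geometric_mean = exp(mean(log(d$y))))

# Coord result next to the arithmetic mean of y
c(coord = y_coord, arithmetic_mean = mean(d$y))
```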