Unexpected behaviour of geom_smooth

I often use geom_smooth() to add smooths to plots of my data. I just discovered that when the y-axis is transformed through a scale (e.g. scale_y_log10()), geom_smooth() unexpectedly fits its smooth to the transformed data. This gives a biased smooth compared to smoothing the raw data. As far as I can see, there is no warning in the documentation that this is happening.

library(ggplot2)
library(dplyr)
library(mgcv)

set.seed(1)

df <- data.frame(x = seq(0, 1, 0.01)) %>%
  mutate(y = exp(runif(n()) + 6 * (x * (1 - x) * (0.5 - x) + 0.1 * x))) # data

mod <- gam(y ~ s(x, bs = "cs"), data = df, method = "REML") # smooth
df$pred <- predict(mod)

mod2 <- gam(y ~ s(x, bs = "cs"), data = df %>% mutate(y = log10(y)), method = "REML") # smooth in log10 space
df$pred2 <- 10 ^ predict(mod2)

df %>%
  ggplot() +
  labs(colour = "Smooth") +
  geom_point(aes(x = x, y = y)) +
  geom_smooth(aes(x = x, y = y, colour = "geom_smooth"), method = "gam", size = 4) +
  geom_line(aes(x = x, y = pred, colour = "normal space"), size = 1) + # does not match geom_smooth
  geom_line(aes(x = x, y = pred2, colour = "log space"), size = 1) + # matches geom_smooth
  scale_y_log10()
#> `geom_smooth()` using formula 'y ~ s(x, bs = "cs")'


Created on 2021-12-20 by the reprex package (v2.0.1)

Interestingly, if I use coord_trans(y = "log") instead of scale_y_log10() it works as expected!

Edit: What I learned from this is that slapping a smooth on data is risky business. You'd better know what you are doing when the client starts interpreting your smooth as a legitimate model!

Edit2: Also, the default smoothing parameters are not necessarily very good. The default "gam" smooth (y ~ s(x, bs = "cs")) does not provide enough degrees of freedom for many data sets.
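For anyone who hits the degrees-of-freedom limit, here is a rough sketch of raising the basis dimension `k` in the smooth term. The wiggly data set `d` and the value `k = 40` are made up for illustration and are not from my example above. As far as I know, `s()` falls back to a basis dimension of 10 for a one-dimensional smooth when `k` is not given, which caps how wiggly the fit can be.

```r
library(ggplot2)
library(mgcv)

set.seed(1)

# Hypothetical wiggly data: about six full cycles over [0, 1]
d <- data.frame(x = seq(0, 1, length.out = 400))
d$y <- sin(40 * d$x) + rnorm(nrow(d), sd = 0.3)

# Default basis dimension versus a larger one (k = 40 is an arbitrary choice)
fit_default <- gam(y ~ s(x, bs = "cs"), data = d, method = "REML")
fit_k40     <- gam(y ~ s(x, bs = "cs", k = 40), data = d, method = "REML")

# Total effective degrees of freedom used by each fit
edf_default <- sum(fit_default$edf)
edf_k40     <- sum(fit_k40$edf)

# Basis dimension diagnostic from mgcv; a low p-value hints that k is too small
k.check(fit_default)

# The same formula can be handed to geom_smooth()
p <- ggplot(d, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs", k = 40))
```

Comparing `edf_default` with `edf_k40` shows how much of the extra flexibility the fit actually uses. A larger `k` only raises the ceiling; the penalty still decides how wiggly the smooth ends up.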

A running median might be a useful alternative.
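A minimal sketch of the running median idea, using `runmed()` from base R on the same simulated data as above. The window width `k = 11` and `endrule = "keep"` are my own choices for illustration. With an odd window the median is always one of the observed values, so in the interior the result should not depend on whether it is computed before or after a log transform. I picked `endrule = "keep"` because the default end rule extrapolates linearly at the ends and would not share that property.

```r
set.seed(1)

# Same simulated data as in the reprex, built without dplyr
df <- data.frame(x = seq(0, 1, 0.01))
df$y <- exp(runif(nrow(df)) + 6 * (df$x * (1 - df$x) * (0.5 - df$x) + 0.1 * df$x))

# Running median on the raw scale (window of 11 points, raw values kept at the ends)
med_raw <- as.numeric(runmed(df$y, k = 11, endrule = "keep"))

# Running median computed in log10 space, then back-transformed
med_log <- 10 ^ as.numeric(runmed(log10(df$y), k = 11, endrule = "keep"))

# The two versions agree up to floating point error
same_smooth <- isTRUE(all.equal(med_raw, med_log))
```

The result can be drawn with geom_line(aes(x = x, y = med_raw)) after adding it to the data frame. It is less smooth than a gam, but it is the same line under scale_y_log10() and coord_trans().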


scale_y_log10() transforms the data before any statistical summaries (such as geom_smooth()) are computed, while coord_trans(y = "log") applies the transformation after the statistical summaries have been computed. This is discussed in the help for coord_trans(), but I can't find anything about it in the help for scale_y_log10(). There are a bunch of Stack Overflow questions related to this (here, for example), suggesting it trips up some people.
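One way to check the ordering for yourself is to pull the computed smooth out of each plot with layer_data() and compare it with a gam fitted by hand. This is only a sketch: it reuses the simulated data from the original post, passes the formula and method = "REML" explicitly so the hand-fitted models use the same settings, and the object names are made up.

```r
library(ggplot2)
library(mgcv)

set.seed(1)

# Same simulated data as in the original post, built without dplyr
df <- data.frame(x = seq(0, 1, 0.01))
df$y <- exp(runif(nrow(df)) + 6 * (df$x * (1 - df$x) * (0.5 - df$x) + 0.1 * df$x))

base <- ggplot(df, aes(x, y)) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"),
              method.args = list(method = "REML"))

# Computed smooth for each variant (no drawing involved)
sm_scale <- layer_data(base + scale_y_log10(), 1)
sm_coord <- layer_data(base + coord_trans(y = "log10"), 1)

# Hand-fitted references: one in log10 space, one on the raw scale
fit_log <- gam(log10(y) ~ s(x, bs = "cs"), data = df, method = "REML")
fit_raw <- gam(y ~ s(x, bs = "cs"), data = df, method = "REML")

pred_log <- as.numeric(predict(fit_log, newdata = data.frame(x = sm_scale$x)))
pred_raw <- as.numeric(predict(fit_raw, newdata = data.frame(x = sm_coord$x)))

# Largest absolute difference between ggplot2's smooth and each reference
diff_scale <- max(abs(sm_scale$y - pred_log))  # scale_y_log10() vs log10-space fit
diff_coord <- max(abs(sm_coord$y - pred_raw))  # coord_trans() vs raw-scale fit
```

If the ordering is as described, both differences should be essentially zero: the smooth under scale_y_log10() lives in log10 space, while the smooth under coord_trans() is fitted to the raw y values and only bent by the coordinate system when drawn.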
