facet_wrap with rolling mean grabs data from previous plot

I'm using facet_wrap to show multiple geom_col charts, onto which I superimpose a geom_line of rolling averages computed with rollmean. The problem is that rollmean is clearly pulling in values from the preceding facet when it calculates the mean. In other words, facet 1 looks great, but facet 2 has a line that averages in values from facet 1's data.

Is this avoidable?

Hi!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

Actual data with simplified plot:

library(tidyverse)  # read_csv, filter, ggplot
library(zoo)        # rollmean

covid_reg <- read_csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")

df <- filter(covid_reg, denominazione_regione %in% c("Veneto", "Lombardia","Emilia-Romagna","Piemonte"))
ggplot(df, aes(x = data, y = nuovi_positivi)) +
    geom_col(width = .75, aes(color = "daily")) +
    geom_line(aes(y = rollmean(x = nuovi_positivi, k = 7, align = c("right"), fill = NA) ) ) +
    facet_wrap (~ denominazione_regione, ncol=4)

I don't see any strange artifacts from the output of that.

Hey! I suggest using group_by to calculate the rolling mean before plotting, so that each region gets its own rolling average.

library(tidyverse)
library(zoo)

covid_reg <- read_csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")

df <- filter(covid_reg, denominazione_regione %in% c("Veneto", "Lombardia","Emilia-Romagna","Piemonte")) %>% 
  group_by(denominazione_regione) %>% 
  mutate(roll_mean = rollmean(x = nuovi_positivi, k = 7, align = c("right"), fill = NA))

ggplot(df, aes(x = data, y = nuovi_positivi)) +
  geom_col(width = .75, aes(color = "daily")) +
  geom_line(aes(y = roll_mean) ) +
  facet_wrap (~ denominazione_regione, ncol=4)

You're right. This doesn't show the artifact I have in the more complex code I'm actually using. It shows another one: the line should be the 7-day average of the columns, but it clearly isn't.

Here's slightly modified code with better labels. I provided two options for defining df; the second is just the last of the four regions shown in the first. Note how, with just one region, the plot seems to be correct, that is, the average line really is the average.

covid_reg <- read_csv("https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv")

startDate = "2020-01-01"

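# Option 1: four regions (the average line is wrong)
df <- filter(covid_reg, denominazione_regione %in% c("Veneto", "Lombardia","Emilia-Romagna","Piemonte") & data >= startDate)
# Option 2: one region (the average line looks right); this overwrites option 1, so run only one of the two
df <- filter(covid_reg, denominazione_regione %in% c("Veneto") & data >= startDate)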

ggplot(df, aes(x = data, y = nuovi_positivi)) +
    geom_col(width = .75, aes(color = "daily")) +
    geom_line(aes(color = "average", y = rollmean(x = nuovi_positivi, k = 7, align = c("right"), fill = NA) ) ) +
    facet_wrap (~ denominazione_regione, ncol=4)

Yep, this seems to work. Thanks!

(I'm still curious about what's going on with facet_wrap.)

I'm not sure about the original data you posted, but when rollmean is called inside aes() for geom_line, it is applied to the entire column "nuovi_positivi" in the order the rows appear in the data (there is no grouping). To the best of my knowledge, faceting and grouping in ggplot only subset the data for plotting; they do not group the calculations done inside aes().
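
To make the difference concrete, here is a minimal sketch on made-up data (the `toy` tibble, its `region` and `value` columns, and the window size `k = 2` are all assumptions for illustration, not the COVID data). It compares rollmean applied to the whole column with rollmean applied per group:

```r
library(dplyr)
library(zoo)

# Hypothetical toy data: two regions, three observations each
toy <- tibble(
  region = rep(c("A", "B"), each = 3),
  value  = c(1, 2, 3, 100, 200, 300)
)

# Ungrouped: the window slides over the whole column,
# so the first value for region B averages in the last value of region A
ungrouped <- toy %>%
  mutate(roll = rollmean(value, k = 2, align = "right", fill = NA))

# Grouped: the window restarts within each region
grouped <- toy %>%
  group_by(region) %>%
  mutate(roll = rollmean(value, k = 2, align = "right", fill = NA)) %>%
  ungroup()

ungrouped$roll  # NA 1.5 2.5 51.5 150.0 250.0
grouped$roll    # NA 1.5 2.5   NA 150.0 250.0
```

The 51.5 in the ungrouped result is the mean of 3 (region A) and 100 (region B), which is the kind of cross-region leak described above. If the rows are interleaved by date rather than sorted by region, as they may be in the source CSV, every window in the ungrouped version would mix regions.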

Yeah, that makes sense based on how the lines are coming out.
