Comparing value-per-date data across different years

Leon · July 12, 2018, 7:10am

I'm interested in comparing values from specific dates across different years. The following works, but it is a bit "hacky", so I'm curious whether there is a more straightforward and elegant way.

Briefly: for each date, I extract the year and use it to colour by. I then create a "dummy" date in which every year is set to 2020, and on the x-axis I show only the month and day from the "dummy" date.

# Load libraries
library('tidyverse')
library('lubridate')
library('scales')

# Create example data
set.seed(566684)
n = 100
d = tibble(day    = sample(1:31, n, replace = TRUE),
           month  = sample(1:12, n, replace = TRUE),
           year   = sample(2016:2018,n, replace = TRUE) %>% factor,
           date   = paste(day, month, year, sep = '-') %>% dmy,
           value  = rnorm(n),
           date_x = date %>% as.character %>%
                      str_replace("^\\d{4}","2020") %>% as_date)

# Create plot
d %>%
  ggplot(aes(x = date_x, y = value, colour = year)) +
  geom_point() +
  geom_line() +
  theme_bw() +
  scale_x_date(labels = date_format("%b"))

mara · July 12, 2018, 1:20pm

My go-to function for that would've been lubridate's yday(), but (as you can see in the two plots) it has disadvantages in terms of easy axis labelling (though I guess you could do your as_date() transformation with a dummy date on the other side of that pipeline).

# Load libraries
library(tidyverse)
library(lubridate)
library(scales)

# Create example data
set.seed(566684)
n = 100
d = tibble(day    = sample(1:31, n, replace = TRUE),
           month  = sample(1:12, n, replace = TRUE),
           year   = sample(2016:2018,n, replace = TRUE) %>% factor,
           date   = paste(day, month, year, sep = '-') %>% dmy,
           value  = rnorm(n),
           date_x = date %>% as.character %>%
             str_replace("^\\d{4}","2020") %>% as_date)
#> Warning: 3 failed to parse.

# Create plot
d %>%
  ggplot(aes(x = date_x, y = value, colour = year)) +
  geom_point() +
  geom_line() +
  theme_bw() +
  scale_x_date(labels = date_format("%b"))
#> Warning: Removed 3 rows containing missing values (geom_point).
#> Warning: Removed 3 rows containing missing values (geom_path).


# with year day
d <- d %>%
  mutate(year_day = yday(date))

d %>%
  ggplot(aes(x = year_day, y = value, colour = year)) +
  geom_point() +
  geom_line() +
  theme_bw()
#> Warning: Removed 3 rows containing missing values (geom_point).

#> Warning: Removed 3 rows containing missing values (geom_path).

Created on 2018-07-12 by the reprex package (v0.2.0).
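One possible way around the axis-labelling drawback of yday() is to put the breaks at the day-of-year of each month start and label them with month abbreviations. This is only a sketch: the reference year 2020 and the small `d_demo` data frame are assumptions made up for illustration, and in non-leap years the labels from March onwards sit one day off.

```r
library(ggplot2)
library(lubridate)

# Day-of-year of the 1st of each month in a reference leap year (assumed: 2020)
month_breaks <- yday(make_date(2020, 1:12, 1))

# Hypothetical toy data, only to make the sketch self-contained
d_demo <- data.frame(
  date  = ymd(c("2016-01-15", "2016-06-30", "2017-03-01",
                "2017-11-20", "2018-02-10", "2018-09-05")),
  value = c(0.2, -1.1, 0.7, 1.5, -0.3, 0.9)
)
d_demo$year     <- factor(year(d_demo$date))
d_demo$year_day <- yday(d_demo$date)

# Same plot as above, but with month names on the day-of-year axis
p <- ggplot(d_demo, aes(x = year_day, y = value, colour = year)) +
  geom_point() +
  geom_line() +
  theme_bw() +
  scale_x_continuous(breaks = month_breaks, labels = month.abb)
```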

Leon · July 12, 2018, 1:44pm

Thanks! Something like that was what I was looking for. However:

yday(ymd("2016-02-29"))
[1] 60
yday(ymd("2018-02-29"))
[1] NA
yday(ymd("2018-03-01"))
[1] 60

So leap year quirks, when comparing across years...

jonspring · July 12, 2018, 5:20pm

I'm not sure I understand. Are there circumstances where your data includes entries for a date that doesn't occur on the calendar (like 2018-02-29)?

Your original approach, of plotting dates as if they were in 2020 (a leap year), seems to me the only kind of approach that will let you directly compare July 12 in one year to July 12 in another.
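For what it's worth, the same dummy-year idea can probably be written without the string replacement. A minimal sketch, assuming lubridate's make_date() and a made-up `dates` vector; because 2020 is a leap year, February 29 keeps its own slot and March 1 lands on the same day for every source year:

```r
library(lubridate)

# Hypothetical example dates from different years
dates <- ymd(c("2016-02-29", "2017-03-01", "2018-03-01", "2018-07-12"))

# Rebuild each date in the dummy year 2020, keeping month and day
dummy <- make_date(2020, month(dates), day(dates))

# Feb 29 and Mar 1 stay distinct, and Mar 1 is day 61 for every source year
yday(dummy)
```

In the original tibble() call this would replace the as.character / str_replace / as_date chain with `date_x = make_date(2020, month(date), day(date))`.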

In my own experience looking at daily time series with a strong weekly pattern, it has been more important to preserve day-of-week alignment than date name alignment. For instance, when looking at Thursday 2018-07-12, I'm usually more interested in how it compares to Thursday 2017-07-13 than I am in comparing it to Wednesday 2017-07-12.

To do that, I've employed a function like below, which finds the closest shift (+/- 3 days) that brings the date into alignment with a given year:

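date_align <- function(date, align_yr = 2018) {
  # Weekday of 1 January in the date's own year and in the target year
  date_wday_align = floor_date(date, "1 year") %>% wday()
  trgt_wday_align = ymd(paste0(align_yr, "0101")) %>% wday()
  
  # Smallest shift (-3 to +3 days) that brings the weekdays into alignment
  adjustment = ((date_wday_align - trgt_wday_align + 3) %% 7) - 3
  aligned_date = ymd(paste(align_yr, month(date), day(date))) + adjustment
  aligned_date
}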

While this changes most dates' appearance (in many cases into different months, sometimes even into different years), it more often results in more apples-to-apples comparisons on a weekly scale.

align_yr = 2018  # target year; assumed here to match the date_align() call below

d = tibble(day    = sample(1:7, n, replace = TRUE),
           month  = sample(c(1,1), n, replace = TRUE),  # always January
           year   = sample(2010:2020,n, replace = TRUE) %>% factor,
           date   = paste(day, month, year, sep = '-') %>% dmy,
           value  = rnorm(n),
           date_x2 = dmy(paste(day, month, align_yr)),
           date_align = date_align(date, 2018),
           
           # Diagnostics
           name_adj = yday(date_align) - yday(date),
           wday_error = wday(date_align) - wday(date)
           )
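As a quick check of the weekday alignment described above, here is a minimal sketch that runs the Thursday example through the function. The function is restated so the block runs on its own; the argument name `align_yr` is an assumption about the intended signature.

```r
library(lubridate)

# Restated from the post; the argument name `align_yr` is assumed
date_align <- function(date, align_yr = 2018) {
  date_wday <- wday(floor_date(date, "year"))
  trgt_wday <- wday(ymd(paste0(align_yr, "-01-01")))
  adjustment <- ((date_wday - trgt_wday + 3) %% 7) - 3
  ymd(paste(align_yr, month(date), day(date), sep = "-")) + adjustment
}

x <- ymd("2017-07-13")           # a Thursday
aligned <- date_align(x, 2018)   # shifted by -1 day to 2018-07-12

# Both dates fall on the same weekday
wday(x, label = TRUE)
wday(aligned, label = TRUE)
```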

Leon · July 14, 2018, 6:14am

My point was that with @mara's approach, yday() gives ambiguous results for day 60, as in the example I gave: it is February 29 in a leap year but March 1 in other years.

In this particular situation, I am interested in consistently comparing dates across years, so yes, this is also the conclusion I have arrived at.

Thanks to you both for the input.