Working with timezones in lubridate

rensa · January 17, 2018, 2:23am

I got some help on SO help this morning working with timezones in lubridate. My problem was that the lubridate functions appear to only accept a timezone of length 1, and I had a situation where I wanted to convert a bunch of strings in one data frame column to a timezone specified in another column.

The answerer suggested wrapping the mutate call in rowwise and ungroup, which works great! But it's not intuitive, and I'm wondering how performant it would be on large datasets. It would be great if the tz arguments in ymd_hms, with_tz, force_tz and others used tidyeval as well so that a column could be specified for them.

I also noticed while working on this problem that when tibbles with date-time columns are printed, they print using one of the time zones in the column—I think maybe the latest time zone in the column or the tibble? This was a little unintuitive to me: I expected that they would print using either UTC or my local time zone. Does anyone know if there's any way to configure this behaviour?

rensa · January 17, 2018, 2:50am

Looks like someone else has had this problem, and the discussion seemed to suggest that either:

a) The core functions that lubridate is abstracting can't handle a vector of timezones, or
b) The timezone attribute is associated with the entire POSIXct vector, not the individual elements.

Or both. The second explanation could explain why I'm seeing a tibble print out with a single time zone: when I ungroup, vectors with different timezones are getting converted to a common one. Is that a fair guess?

If it is, I guess there's nothing that can be done Time zones suuuuuuuuuck.

rensa · January 17, 2018, 3:30am

I also really like this solution of using a list column! Although list columns can be a bit more complicated to deal with, I think it's important to prioritise predictable output, and using a list column means that you can have elements with different timezones.

Tazinho · January 17, 2018, 7:57am

Since it is not vectorized, how about a small workaround via setting mytz as a column and then using if_else() or case_when()?

testdf %>% mutate(mydt_new = case_when(
  mytz == 'Australia/Sydney' ~ ymd_hms(mydt, tz = 'Australia/Sydney'), 
  mytz == 'Australia/Adelaide' ~ ymd_hms(mydt, tz = 'Australia/Adelaide'), 
  mytz == 'Australia/Perth' ~ ymd_hms(mydt, tz = 'Australia/Perth'))
)
# A tibble: 3 x 3
  mytz               mydt                mydt_new           
  <chr>              <chr>               <dttm>             
1 Australia/Sydney   2018-01-17T09:15:00 2018-01-17 09:15:00
2 Australia/Adelaide 2018-01-17T09:16:00 2018-01-17 09:46:00
3 Australia/Perth    2018-01-17T09:18:00 2018-01-17 12:18:00

Just a thought...of course with many different timezones this won't help much...

mishabalyasin · January 17, 2018, 9:36am

That's a good workaround. Another one that I used is to group_by timezone and then convert each group using only one timezone. I've figured that at most there are about 600 timezones (you can see them all with OlsonNames()), so while that is not ideal, it still won't be a giant bottleneck.
But I agree, timezones suck

tbradley · January 17, 2018, 1:48pm

If there are a lot of timezones you could try something like this to try and automate it:

testdf %>%
  group_by(mytz) %>%
  nest() %>%
  mutate(new_df = map2(data, mytz, ~ .x %>% mutate(mydt_new = ymd_hms(mydt, tz = .y)))) %>%
  unnest(new_df, .drop = TRUE)

mishabalyasin · January 17, 2018, 2:56pm

There is no need to nest/unnest, I think. It can be costly, so group_by/ungroup is likely to be faster and simpler (but that's personal preference, of course).
This should work:

testdf %>%
    dplyr::group_by(tz) %>%
    dplyr::mutate(converted_column = ymd_hms(mydt, tz = tz[[1]])) %>%
    dplyr::ungroup()

tbradley · January 17, 2018, 3:20pm

It looks like you are correct. I ran some benchmarking comparing both methods and the accepted answer on SO and yours is the fastest and they all give the same answer on my machine.

microbenchmark::microbenchmark(
  nest_version = testdf %>%
    group_by(mytz) %>%
    nest() %>%
    mutate(new_df = map2(data, mytz, ~ .x %>% mutate(mydt_new = ymd_hms(mydt, tz = .y)))) %>%
    unnest(new_df, .drop = TRUE),
  group_by_version = testdf %>%
    dplyr::group_by(mytz) %>%
    dplyr::mutate(converted_column = ymd_hms(mydt, tz = mytz[[1]])) %>%
    dplyr::ungroup(),
  so_answer = testdf %>%
    mutate(mydt_new = map2(mydt, mytz, ~ymd_hms(.x, tz = .y))) %>% 
    unnest(mydt_new)
)

Unit: milliseconds
             expr      min       lq     mean   median       uq       max neval
     nest_version 34.92987 37.22360 39.49682 38.59006 39.65608 104.83612   100
 group_by_version 12.82341 13.48555 14.13943 13.88366 14.35380  18.32356   100
        so_answer 21.58804 22.60258 23.90003 23.36432 24.27091  47.48220   100