Tidyverse/{dplyr} equivalent of pandas.DataFrame.resample()

baggiponte · April 20, 2021, 1:45pm

Hello there! I have a tibble dataframe with historical data about bike sharing trips and, of course, there are some POSIXct columns with the format DD/MM/YYYY hh:mm. I was wondering if the tidyverse offered a function analogue to Pandas' pandas.DataFrame.resample().

In this instance, Pandas' resample() will allow to group_by() my data by any time unit: one hour, one day or even three minutes. Is there a function to do that in R?

The alternative would be to use {lubridate} to extract time features up to the unit I need and then perform a groupby, but I guess this can be optimised.

Thank you!

davis · April 20, 2021, 8:24pm

With lubridate, you generally use floor_date() for "grouping" times together by a particular unit. Then you can just group on that floored column, rather than extracting out all the individual components.

library(dplyr)
library(lubridate)

set.seed(123)

date_times <- as.POSIXct("2019-01-01 01:01:01", "America/New_York")
date_times <- date_times + sort(sample(1e7, 100))

df <- tibble(date = date_times, x = rnorm(length(date_times)))
df
#> # A tibble: 100 x 2
#>    date                      x
#>    <dttm>                <dbl>
#>  1 2019-01-01 11:26:45  0.0653
#>  2 2019-01-03 11:02:50 -0.0341
#>  3 2019-01-03 12:50:31  2.13  
#>  4 2019-01-05 16:46:14 -0.741 
#>  5 2019-01-05 16:55:19 -1.10  
#>  6 2019-01-05 21:36:20  0.0378
#>  7 2019-01-09 02:52:22  0.310 
#>  8 2019-01-12 12:06:25  0.437 
#>  9 2019-01-13 02:06:29 -0.458 
#> 10 2019-01-16 16:21:24 -1.06  
#> # … with 90 more rows

df %>%
  mutate(
    day = floor_date(date, "day"),
    hour = floor_date(date, "2 hour"),
    minute = floor_date(date, "3 minute")
  )
#> # A tibble: 100 x 5
#>    date                      x day                 hour               
#>    <dttm>                <dbl> <dttm>              <dttm>             
#>  1 2019-01-01 11:26:45  0.0653 2019-01-01 00:00:00 2019-01-01 10:00:00
#>  2 2019-01-03 11:02:50 -0.0341 2019-01-03 00:00:00 2019-01-03 10:00:00
#>  3 2019-01-03 12:50:31  2.13   2019-01-03 00:00:00 2019-01-03 12:00:00
#>  4 2019-01-05 16:46:14 -0.741  2019-01-05 00:00:00 2019-01-05 16:00:00
#>  5 2019-01-05 16:55:19 -1.10   2019-01-05 00:00:00 2019-01-05 16:00:00
#>  6 2019-01-05 21:36:20  0.0378 2019-01-05 00:00:00 2019-01-05 20:00:00
#>  7 2019-01-09 02:52:22  0.310  2019-01-09 00:00:00 2019-01-09 02:00:00
#>  8 2019-01-12 12:06:25  0.437  2019-01-12 00:00:00 2019-01-12 12:00:00
#>  9 2019-01-13 02:06:29 -0.458  2019-01-13 00:00:00 2019-01-13 02:00:00
#> 10 2019-01-16 16:21:24 -1.06   2019-01-16 00:00:00 2019-01-16 16:00:00
#> # … with 90 more rows, and 1 more variable: minute <dttm>

df %>%
  mutate(day = floor_date(date, "day")) %>%
  group_by(day) %>%
  summarise(daily_mean = mean(x))
#> # A tibble: 65 x 2
#>    day                 daily_mean
#>    <dttm>                   <dbl>
#>  1 2019-01-01 00:00:00     0.0653
#>  2 2019-01-03 00:00:00     1.05  
#>  3 2019-01-05 00:00:00    -0.600 
#>  4 2019-01-09 00:00:00     0.310 
#>  5 2019-01-12 00:00:00     0.437 
#>  6 2019-01-13 00:00:00    -0.458 
#>  7 2019-01-16 00:00:00    -1.06  
#>  8 2019-01-17 00:00:00     0.457 
#>  9 2019-01-18 00:00:00    -0.551 
#> 10 2019-01-19 00:00:00     0.456 
#> # … with 55 more rows

^{Created on 2021-04-20 by the reprex package (v1.0.0)}

You might also look into the clock package, which has more advanced grouping utilities (date_group(), date_floor(), date_shift())

baggiponte · April 21, 2021, 3:31pm

Hi Davis, thank you for this amazing answer! R and the Tidyverseare outstanding. I should get started with clock sooner or later!

system · April 28, 2021, 3:32pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.