Hello there! I have a tibble dataframe
with historical data about bike sharing trips and, of course, there are some POSIXct
columns with the format DD/MM/YYYY hh:mm
. I was wondering if the tidyverse
offered a function analogue to Pandas' pandas.DataFrame.resample() .
In this instance, Pandas' resample()
will allow to group_by()
my data by any time unit: one hour, one day or even three minutes. Is there a function to do that in R?
The alternative would be to use {lubridate}
to extract time features up to the unit I need and then perform a groupby, but I guess this can be optimised.
Thank you!
davis
April 20, 2021, 8:24pm
2
With lubridate, you generally use floor_date()
for "grouping" times together by a particular unit. Then you can just group on that floored column, rather than extracting out all the individual components.
library(dplyr)
library(lubridate)
set.seed(123)
date_times <- as.POSIXct("2019-01-01 01:01:01", "America/New_York")
date_times <- date_times + sort(sample(1e7, 100))
df <- tibble(date = date_times, x = rnorm(length(date_times)))
df
#> # A tibble: 100 x 2
#> date x
#> <dttm> <dbl>
#> 1 2019-01-01 11:26:45 0.0653
#> 2 2019-01-03 11:02:50 -0.0341
#> 3 2019-01-03 12:50:31 2.13
#> 4 2019-01-05 16:46:14 -0.741
#> 5 2019-01-05 16:55:19 -1.10
#> 6 2019-01-05 21:36:20 0.0378
#> 7 2019-01-09 02:52:22 0.310
#> 8 2019-01-12 12:06:25 0.437
#> 9 2019-01-13 02:06:29 -0.458
#> 10 2019-01-16 16:21:24 -1.06
#> # … with 90 more rows
df %>%
mutate(
day = floor_date(date, "day"),
hour = floor_date(date, "2 hour"),
minute = floor_date(date, "3 minute")
)
#> # A tibble: 100 x 5
#> date x day hour
#> <dttm> <dbl> <dttm> <dttm>
#> 1 2019-01-01 11:26:45 0.0653 2019-01-01 00:00:00 2019-01-01 10:00:00
#> 2 2019-01-03 11:02:50 -0.0341 2019-01-03 00:00:00 2019-01-03 10:00:00
#> 3 2019-01-03 12:50:31 2.13 2019-01-03 00:00:00 2019-01-03 12:00:00
#> 4 2019-01-05 16:46:14 -0.741 2019-01-05 00:00:00 2019-01-05 16:00:00
#> 5 2019-01-05 16:55:19 -1.10 2019-01-05 00:00:00 2019-01-05 16:00:00
#> 6 2019-01-05 21:36:20 0.0378 2019-01-05 00:00:00 2019-01-05 20:00:00
#> 7 2019-01-09 02:52:22 0.310 2019-01-09 00:00:00 2019-01-09 02:00:00
#> 8 2019-01-12 12:06:25 0.437 2019-01-12 00:00:00 2019-01-12 12:00:00
#> 9 2019-01-13 02:06:29 -0.458 2019-01-13 00:00:00 2019-01-13 02:00:00
#> 10 2019-01-16 16:21:24 -1.06 2019-01-16 00:00:00 2019-01-16 16:00:00
#> # … with 90 more rows, and 1 more variable: minute <dttm>
df %>%
mutate(day = floor_date(date, "day")) %>%
group_by(day) %>%
summarise(daily_mean = mean(x))
#> # A tibble: 65 x 2
#> day daily_mean
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 0.0653
#> 2 2019-01-03 00:00:00 1.05
#> 3 2019-01-05 00:00:00 -0.600
#> 4 2019-01-09 00:00:00 0.310
#> 5 2019-01-12 00:00:00 0.437
#> 6 2019-01-13 00:00:00 -0.458
#> 7 2019-01-16 00:00:00 -1.06
#> 8 2019-01-17 00:00:00 0.457
#> 9 2019-01-18 00:00:00 -0.551
#> 10 2019-01-19 00:00:00 0.456
#> # … with 55 more rows
Created on 2021-04-20 by the reprex package (v1.0.0)
You might also look into the clock package, which has more advanced grouping utilities (date_group()
, date_floor()
, date_shift()
)
Hi Davis, thank you for this amazing answer! R
and the Tidyverse
are outstanding. I should get started with clock
sooner or later!
system
Closed
April 28, 2021, 3:32pm
4
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed. If you have a query related to it or one of the replies, start a new topic and refer back with a link.