Accessing related values in summarise() call

cmichaud · October 19, 2020, 5:08pm

I am attempting to generate a summary table from a large dataframe. My dataframe has dates associated with each record. I want to include max and min values of certain numeric fields and the date associated with the min and max value.

Reprex

library(dplyr)
# ----- Dates -----
start_date <- as.Date('2020-01-01')  
end_date <- as.Date('2020-10-01')   

set.seed(19)
dates <- as.Date(sample( as.numeric(start_date): as.numeric(end_date), 10, 
                replace = F), 
        origin = '1970-01-01')

# ----- Flow and temp -----
set.seed(19)
flow = sample(10:100, 10)
set.seed(19)
temp = sample(5:27, 10)

# ----- River -----
set.seed(19)
river = sample(c("CO", "GR"), 10, replace = TRUE)

# ----- Create tibble -----
water_data <- tibble(river, dates, flow, temp)

Summary table

summary_tbl <- water_data %>% 
    group_by(river) %>% 
    summarise(max_flow = max(flow, na.rm = TRUE),
              max_temp = max(temp, na.rm = TRUE),
              .groups = "drop")

Now I want to create 2 additional fields within the summarise() call: date_max_flow and date_max_temp each capturing the date associated with that max value. While I could determine the min and max values and then subset the original dataframe by them, my dataframe is big this would probably get confusing. It seems like there ought to be a way to do this within the call to summarise.

Thanks

joels · October 19, 2020, 5:25pm

You can use indexing to get the dates of the maxima. For example:

water_data %>% 
  group_by(river) %>% 
  summarise(max_flow = max(flow, na.rm = TRUE),
            max_temp = max(temp, na.rm = TRUE),
            max_date_flow = dates[which.max(flow)],
            max_date_temp = dates[which.max(temp)],
            .groups = "drop")

  river max_flow max_temp max_date_flow max_date_temp
  <chr>    <int>    <int> <date>        <date>       
1 CO          91       22 2020-02-14    2020-01-15   
2 GR          76       26 2020-05-04    2020-07-03

Note that which.max will return only the index of the first maximum, so if there can be more than one maximum, you might want to do something like this:

water_data2 = water_data
water_data2$flow[1] = 91

water_data2 %>% 
  group_by(river) %>% 
  summarise(max_flow = max(flow, na.rm = TRUE),
            max_temp = max(temp, na.rm = TRUE),
            max_date_flow = dates[which(flow==max(flow))],
            max_date_temp = dates[which(temp==max(temp))],
            .groups = "drop")

  river max_flow max_temp max_date_flow max_date_temp
  <chr>    <dbl>    <int> <date>        <date>       
1 CO          91       22 2020-06-30    2020-01-15   
2 CO          91       22 2020-02-14    2020-01-15   
3 GR          76       26 2020-05-04    2020-07-03

Either way, it might get confusing to have information from different rows of the original data in the same row of the summary data. Here's another option where we convert the data to long format, which allows us to use a single filter statement to get all of the maxima for each outcome (flow and temp in this case). This approach returns a tidy summary table (the flow and temp maxima are in separate rows), but we can always pivot the output to wide format as needed:

library(tidyverse)

summary_tbl = water_data2 %>% 
  pivot_longer(c(flow,temp), names_to="var") %>% 
  group_by(river, var) %>% 
  filter(value==max(value)) %>% 
  ungroup %>% 
  arrange(river, var)

  river dates      var   value
  <chr> <date>     <chr> <dbl>
1 CO    2020-06-30 flow     91
2 CO    2020-02-14 flow     91
3 CO    2020-01-15 temp     22
4 GR    2020-05-04 flow     76
5 GR    2020-07-03 temp     26

summary_tbl %>% 
  pivot_wider(names_from=var, values_from=c(value, dates)) %>% 
  unnest(-river)

  river value_flow value_temp dates_flow dates_temp
  <chr>      <dbl>      <dbl> <date>     <date>    
1 CO            91         22 2020-06-30 2020-01-15
2 CO            91         22 2020-02-14 2020-01-15
3 GR            76         26 2020-05-04 2020-07-03

system · October 27, 2020, 7:01pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.