Trouble Extracting Max Average of a tibble/tsibble

Cinji18 · January 3, 2023, 1:37am

I have data that looks like this:

> tourismQ5
# A tibble: 24,320 × 5
   Quarter    Region   State           Purpose  Trips
   <chr>      <chr>    <chr>           <chr>    <dbl>
 1 1998-01-01 Adelaide South Australia Business  135.
 2 1998-04-01 Adelaide South Australia Business  110.
 3 1998-07-01 Adelaide South Australia Business  166.
 4 1998-10-01 Adelaide South Australia Business  127.
 5 1999-01-01 Adelaide South Australia Business  137.
 6 1999-04-01 Adelaide South Australia Business  200.
 7 1999-07-01 Adelaide South Australia Business  169.
 8 1999-10-01 Adelaide South Australia Business  134.
 9 2000-01-01 Adelaide South Australia Business  154.
10 2000-04-01 Adelaide South Australia Business  169.
# … with 24,310 more rows

My objective is to find the Region-Purpose combination that has the maximum average of Trips.

I currently have the following:

tourismQ5 %>%
  select(Region, Purpose, Trips) %>%
  group_by(Region, Purpose) %>%
  summarise(Trips = max(mean(Trips))) %>%
  filter(max(Trips) == Trips)

and got this:

A tibble: 76 × 3
# Groups:   Region [76]
   Region                     Purpose   Trips
   <chr>                      <chr>     <dbl>
 1 Adelaide                   Visiting 205.  
 2 Adelaide Hills             Visiting  14.2 
 3 Alice Springs              Holiday   31.9 
 4 Australia's Coral Coast    Holiday  113.  
 5 Australia's Golden Outback Business  71.3 
 6 Australia's North West     Business  90.2 
 7 Australia's South West     Holiday  309.  
 8 Ballarat                   Visiting  60.5 
 9 Barkly                     Holiday    8.94
10 Barossa                    Holiday   25.6 
# … with 66 more rows

However, I was expecting a one-liner:

Sydney                          Visiting 747.

I have tried multiple methods to get it to summarise/group by both Region and Purpose but it either throws an error or continue to summarise by Region only. What am I missing?

Thank you.

scottyd22 · January 3, 2023, 1:57am

You need to ungroup() before the final filter.

tourismQ5 %>%
  select(Region, Purpose, Trips) %>%
  group_by(Region, Purpose) %>%
  summarise(Trips = max(mean(Trips))) %>%
  ungroup() %>%
  filter(max(Trips) == Trips)

robjhyndman · January 3, 2023, 2:12am

There's also no need for the max within summarise() as it is taking the maximum of a single value.

Cinji18 · January 3, 2023, 6:06am

Thank you for the quick response.

Your solution definitely works. I was not aware of the ungroup() function. I searched for it online and did find resources for it, but they don't seem to explain it well. With group_by(), I consider it the same as GROUP BY in SQL, but in SQL, there's no "ungroup". So I'm having a difficult time wrapping my head around ungroup(). Is the reason it exists because multiple calculations can be done in groups? Also, I don't understand why summarise groups the output only by Region when the group_by specifically states Region and Purpose.

Any clarification would be greatly appreciated.

Thank you.

mikecrobp · January 3, 2023, 9:18am

Does this help? Grouped data • dplyr

I went through the same surprise when I found that group_by of more than one column and then summarise left some residual groups. I have read that this may now be considered a design mistake and the .groups = "drop" is a sort of fix to that. That was introduced at dplyr 1.0.0

The dropping of the last grouping column explains why you see grouping based on just Region.

system · January 24, 2023, 9:19am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.