I have a common idiom I use regularly in SQL (Redshift) and I'm trying to port the same concept over to dplyr to use on the same DB via a dbplyr sql backend. It involves using row_number and partition by grouped with fewer groups than the data I'm sorting. That's hard to explain, but look at the dummy example below. You can see I group by 2 variables but partition by only 1:
SELECT
groupA,
groupB,
Row_number() OVER ( PARTITION BY groupA, ORDER BY Sum(my_amount) ) AS rank,
sum( my_amount ) as my_amount
FROM
my_table
GROUP BY
groupA,
groupB
to do this in dplyr I do it in two, somewhat slower, steps:
my_table %>%
group_by(groupA) %>%
summarize( my_amount = sum(my_amount)) %>%
mutate(rank = row_number()) %>%
select(groupA, rank) ->
sort_order
sort_order %>%
inner_join(my_table) %>%
inner_join(weight_table, by = rank) %>% ## this is why I need to group and partition by different resolution
group_by(groupA, groupB) %>%
mutate( my_amount = my_amount * weight ) -> ## weight comes from the weight_table joined above
output
I'm doing this all over on the database and the above dplyr pipe sequence is really slow because of the joining the sort_order table back to my_table. Compared to the SQL version with PARTITION BY, the dplyr version is painfully slow.
Is there a more efficient way you all know of to emulate the PARTITION BY with dplyr and an SQL backend? Searching for this has been a little hard because of the naming clash with dplyr's partition function.
aaaaand I kinda sorta think this might be the solution:
I was going to add a link to a related Stack Overflow question and so I googled for [dplyr "partition by'] which gave me the SO question.. but also a hit in the dplyr documentation. Turns out I can use the vars_group parameter like so:
translate_sql(rank(), vars_group = "ID")
#> <SQL> rank() OVER (PARTITION BY "ID")
That is happening because mutate() does not have an argument called vars_group, only translate_sql() does. This means that vars_group is interpreted simply as a new variable in the table. Currently the partition cannot be overridden in-line for ranking functions, only the ORDER BY:
Oh.. yes. that's obvious now that you say it. It's such a good thing that I force myself to learn new things often. Keeps me humble and empathetic with new users
Now as for the grouping, I don't think I can just group_by(groupA) because I want the resulting dataset to have both groupA and groupB dimensions AND be sorted by the sum of the field my_amount in only one group. I need both groups later. And rejoining back to the original dataset is sloooooow. I think I made my example above just a wee bit too simple to fully illustrate my use case. Here's a slightly more elaborate (two extra data rows) reprex of what I'm trying to do in SQL:
CREATE TEMP TABLE my_table(groupA char , groupB char, my_amount real) ;
INSERT INTO my_Table(groupA, groupB, my_amount)
VALUES
('A','C',4),
('A','D',2),
('A','D',1),
('B','C',3),
('B','D',4),
('B','D',1);
SELECT
groupA,
groupB,
Row_number() OVER ( PARTITION BY groupA ORDER BY sum(my_amount) ) AS rank,
sum( my_amount ) as my_amount
FROM
my_table
GROUP BY
groupA,
groupB;
which results in:
groupa
groupb
rank
my_amount
B
C
1
3
B
D
2
5
A
D
1
3
A
C
2
4
So you can see it maintains both groupa and groupb dimensions, sums up within the groups, but the rank is based only on grouping by groupa first and then taking the row_number within the group.
Thanks for giving this a think. I really appreciate it.
Thank you, I'll be here all week. Don't forget to tip your sys admins.
I have to admit that I'm not very skilled with windowing functions. Until I ran into the use case I currently have, I had never used them in SQL at all. I wrapped my head around the examples in the help file for window_order and window_frame. But I don't quite grok how to use them to get the effect of a group by n fields and a partition by with n-1 fields. I think that's the crux of my conundrum.
Hi, I'm having some difficulty picturing what is the result that you're trying to achieve. Can you write an example of a table that would have the expected result? No code, just the results you'd like to see
But we started with 6 rows... the sum was grouped by groupa and groupb but the partition/rank was just by groupa.
Thanks for struggling along with me. This has been sort of hard to explain. I've found I can do the group by 2 variables first. Then do the rank grouping by one variable. Then join the rank-one-group back to the summed-two-group data. That accomplishes the same thing but it's more complicated code that's harder to read linearly.
Ok, if we were to do that on local data, would have this been the way to do it? I get the same results as you, but want to make sure that it will apply to your actual problem.
Only I have a conceptual problem.... I don't grok why that works. With the data grouped by groupA & groupB it seems like the row_number() would operate on each group, and result in 1,1,1,1, since there's only one record in each group.
Very interesting thread!! As I have thoughts on the last part I am jumping into it.
I think it because when you group by several variables and apply some functions that works with group, one group layer goes away at each operation. Here you have 2 groups at the beginning, then you apply summarise, that leaves only one, arrange does not change that and last mutate works on a one-grouped data.frame, grouped by groupA
# two groups before summarise
my_table %>%
group_by(groupA, groupB) %>%
groups()
#> [[1]]
#> groupA
#>
#> [[2]]
#> groupB
# one groups before mutate: only groupA
my_table %>%
group_by(groupA, groupB) %>%
summarise(my_total = sum(my_amount)) %>%
arrange(my_total) %>%
groups()
#> Warning: le package 'bindrcpp' a été compilé avec la version R 3.4.4
#> [[1]]
#> groupA
I've not been working with dplyr for that long, so I didn't realize that the last group_by grouping just falls off after summarize! I just reproduced this a few different times and thought it was a bug, so did some googling and discovered:
hadley commented on Jan 1, 2015:
This is by design. After you've summarised, the last group will only have one row per group, so it's not useful to group on it.
Well now, I sure learned something today! Thank you very much @cderv and @edgararuiz!
I will be working this into my presentation to the NY R group next week!
When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset:
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
#> # A tibble: 365 x 4
#> # Groups: year, month [?]
#> year month day flights
#> <int> <int> <int> <int>
#> 1 2013 1 1 842
#> 2 2013 1 2 943
#> 3 2013 1 3 914
#> 4 2013 1 4 915
#> # ... with 361 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
#> # A tibble: 12 x 3
#> # Groups: year [?]
#> year month flights
#> <int> <int> <int>
#> 1 2013 1 27004
#> 2 2013 2 24951
#> 3 2013 3 28834
#> 4 2013 4 28330
#> # ... with 8 more rows
(per_year <- summarise(per_month, flights = sum(flights)))
#> # A tibble: 1 x 2
#> year flights
#> <int> <int>
#> 1 2013 336776
There's also a mention of how summarize peels off grouping layers in the group_by docs, but you have to read through the examples to catch it. I wish it were called out in the Details section! (I admit this is a bit of a pet peeve of mine: important information exclusively given in examples and not in the "main text" of docs ).
Sounds like a great topic to dig into in a useR group meeting!
Thanks for those pointers. Now that I know it, I see it everywhere To me this feels like a big enough deal I'd like dplyr to chattily warn me about it, the way that it does when I use explicit joining. I think it should return red text after the summarize that says something like, 'summary complete, group_by reduced to 'fieldA', 'fieldB', etc.