Dplyr version of SQL `PARTITION BY` with sort order for use with SQL backend

dplyr

#1

I have a common idiom I use regularly in SQL (Redshift) and I'm trying to port the same concept over to dplyr to use on the same DB via a dbplyr sql backend. It involves using row_number and partition by grouped with fewer groups than the data I'm sorting. That's hard to explain, but look at the dummy example below. You can see I group by 2 variables but partition by only 1:

SELECT 
          groupA, 
          groupB, 
          Row_number() OVER ( PARTITION BY groupA, ORDER BY Sum(my_amount) ) AS rank,
          sum( my_amount ) as my_amount
FROM 
          my_table
GROUP BY
          groupA, 
          groupB

to do this in dplyr I do it in two, somewhat slower, steps:

my_table %>%
  group_by(groupA) %>%
  summarize( my_amount = sum(my_amount)) %>%
  mutate(rank = row_number()) %>%
  select(groupA, rank) ->
sort_order

sort_order %>%
  inner_join(my_table) %>%
  inner_join(weight_table, by = rank) %>% ## this is why I need to group and partition by different resolution
  group_by(groupA, groupB) %>%
  mutate( my_amount = my_amount * weight ) -> ## weight comes from the weight_table joined above
output

I'm doing this all over on the database and the above dplyr pipe sequence is really slow because of the joining the sort_order table back to my_table. Compared to the SQL version with PARTITION BY, the dplyr version is painfully slow.

Is there a more efficient way you all know of to emulate the PARTITION BY with dplyr and an SQL backend? Searching for this has been a little hard because of the naming clash with dplyr's partition function.


#2

aaaaand I kinda sorta think this might be the solution:

I was going to add a link to a related Stack Overflow question and so I googled for [dplyr "partition by'] which gave me the SO question.. but also a hit in the dplyr documentation. Turns out I can use the vars_group parameter like so:

translate_sql(rank(), vars_group = "ID")
#> <SQL> rank() OVER (PARTITION BY "ID")

http://dbplyr.tidyverse.org/articles/sql-translation.html

But I don't see how to control the sort order the way I do in the SQL with ORDER BY Sum(my_amount) ... can I use rank(my_amount)?

Here's the SO question if anyone would like to give a tidy answer to go along with all the data.table answers there:


#4

well I just discovered tbl_lazy ... so I got that going for me, which is nice.

And now I have no idea what dplyr is doing when group_by and vars_group are combined. Take a gander at this reprex:

my_table <- data.frame(groupA = c('a','a','b','b'),
                      groupB = c('c','d','c','d') , 
                      my_amount = 1:4
                      )

my_table %>%
  tbl_lazy( src =simulate_odbc() ) %>%
  group_by(groupA, groupB) %>%
  mutate( rank = rank(), vars_group = 'groupA' ) %>% 
  show_query

which blows out this SQL:

SELECT `groupA`, `groupB`, `my_amount`, 
rank() OVER (PARTITION BY `groupA`, `groupB`) AS `rank`, 
'groupA' AS `vars_group`
FROM `df`

umm... I was expecting the vars_group to end up following the PARTITION BY like this:

> translate_sql(rank(), vars_group = "ID")
<SQL> rank() OVER (PARTITION BY "ID")

so what am I missing here?


#5

Hi JD,

That is happening because mutate() does not have an argument called vars_group, only translate_sql() does. This means that vars_group is interpreted simply as a new variable in the table. Currently the partition cannot be overridden in-line for ranking functions, only the ORDER BY:

my_table %>%
  tbl_lazy( src =simulate_odbc() ) %>%
  group_by(groupA) %>%
  mutate(rank = rank(order = groupB)) %>% 
  show_query()

To achieve what you wanted to see in your original question, just group by groupA:

my_table %>%
  tbl_lazy( src =simulate_odbc() ) %>%
  group_by(groupA) %>%
  mutate(rank = rank(order)) %>% 
  show_query()

#6

You might also want to check out window_order() and window_frame(). Maybe we should add window_partition()?


#7

Oh.. yes. that's obvious now that you say it. It's such a good thing that I force myself to learn new things often. Keeps me humble and empathetic with new users :slight_smile:

Now as for the grouping, I don't think I can just group_by(groupA) because I want the resulting dataset to have both groupA and groupB dimensions AND be sorted by the sum of the field my_amount in only one group. I need both groups later. And rejoining back to the original dataset is sloooooow. I think I made my example above just a wee bit too simple to fully illustrate my use case. Here's a slightly more elaborate (two extra data rows) reprex of what I'm trying to do in SQL:

CREATE TEMP TABLE my_table(groupA char , groupB char, my_amount real) ;
INSERT INTO my_Table(groupA, groupB, my_amount)
VALUES     
     ('A','C',4), 
     ('A','D',2),
     ('A','D',1),
     ('B','C',3),
     ('B','D',4),
     ('B','D',1);
SELECT 
          groupA, 
          groupB, 
          Row_number() OVER ( PARTITION BY groupA ORDER BY sum(my_amount) ) AS rank,
          sum( my_amount ) as my_amount
FROM 
          my_table
GROUP BY
          groupA, 
          groupB;

which results in:

groupa groupb rank my_amount
B C 1 3
B D 2 5
A D 1 3
A C 2 4

So you can see it maintains both groupa and groupb dimensions, sums up within the groups, but the rank is based only on grouping by groupa first and then taking the row_number within the group.

Thanks for giving this a think. I really appreciate it.


#8

I tried window_frame() but all I felt was pane.

rim shot :drum:

Thank you, I'll be here all week. Don't forget to tip your sys admins.

I have to admit that I'm not very skilled with windowing functions. Until I ran into the use case I currently have, I had never used them in SQL at all. I wrapped my head around the examples in the help file for window_order and window_frame. But I don't quite grok how to use them to get the effect of a group by n fields and a partition by with n-1 fields. I think that's the crux of my conundrum.


#9

Hi, I'm having some difficulty picturing what is the result that you're trying to achieve. Can you write an example of a table that would have the expected result? No code, just the results you'd like to see


#10

yep, it's that final table above:

groupa groupb rank my_amount
B C 1 3
B D 2 5
A D 1 3
A C 2 4

But we started with 6 rows... the sum was grouped by groupa and groupb but the partition/rank was just by groupa.

Thanks for struggling along with me. This has been sort of hard to explain. I've found I can do the group by 2 variables first. Then do the rank grouping by one variable. Then join the rank-one-group back to the summed-two-group data. That accomplishes the same thing but it's more complicated code that's harder to read linearly.

-J


#11

Ok, if we were to do that on local data, would have this been the way to do it? I get the same results as you, but want to make sure that it will apply to your actual problem.

library(tibble)
library(dplyr)

my_table <- tribble(
  ~groupA, ~groupB, ~my_amount,
  "A","C",4, 
  "A","D",2,
  "A","D",1,
  "B","C",3,
  "B","D",4,
  "B","D",1
)

my_table %>%
  group_by(groupA, groupB) %>%
  summarise(my_total = sum(my_amount)) %>%
  arrange(my_total) %>%
  mutate(rank = row_number())

BTW - The results will not come back in the exact order that you have in your example, but the ranking values are correct.


#12

Woot! That's exactly what I'm after!

Only I have a conceptual problem.... I don't grok why that works. With the data grouped by groupA & groupB it seems like the row_number() would operate on each group, and result in 1,1,1,1, since there's only one record in each group.


#13

Very interesting thread!! As I have thoughts on the last part I am jumping into it.

I think it because when you group by several variables and apply some functions that works with group, one group layer goes away at each operation. Here you have 2 groups at the beginning, then you apply summarise, that leaves only one, arrange does not change that and last mutate works on a one-grouped data.frame, grouped by groupA

# two groups before summarise
my_table %>%
  group_by(groupA, groupB) %>%
  groups()
#> [[1]]
#> groupA
#> 
#> [[2]]
#> groupB

# one groups before mutate: only groupA
my_table %>%
  group_by(groupA, groupB) %>%
  summarise(my_total = sum(my_amount)) %>%
  arrange(my_total) %>%
  groups()
#> Warning: le package 'bindrcpp' a été compilé avec la version R 3.4.4
#> [[1]]
#> groupA

Created on 2018-05-04 by the reprex package (v0.2.0).


#14

I've not been working with dplyr for that long, so I didn't realize that the last group_by grouping just falls off after summarize! I just reproduced this a few different times and thought it was a bug, so did some googling and discovered:

hadley commented on Jan 1, 2015:

This is by design. After you've summarised, the last group will only have one row per group, so it's not useful to group on it.

Well now, I sure learned something today! Thank you very much @cderv and @edgararuiz!

I will be working this into my presentation to the NY R group next week!


#15

This is mentioned in the dplyr intro vignette, but it's a bit buried:
https://dplyr.tidyverse.org/articles/dplyr.html#grouped-operations

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset:

daily <- group_by(flights, year, month, day)
(per_day   <- summarise(daily, flights = n()))
#> # A tibble: 365 x 4
#> # Groups:   year, month [?]
#>    year month   day flights
#>   <int> <int> <int>   <int>
#> 1  2013     1     1     842
#> 2  2013     1     2     943
#> 3  2013     1     3     914
#> 4  2013     1     4     915
#> # ... with 361 more rows
(per_month <- summarise(per_day, flights = sum(flights)))
#> # A tibble: 12 x 3
#> # Groups:   year [?]
#>    year month flights
#>   <int> <int>   <int>
#> 1  2013     1   27004
#> 2  2013     2   24951
#> 3  2013     3   28834
#> 4  2013     4   28330
#> # ... with 8 more rows
(per_year  <- summarise(per_month, flights = sum(flights)))
#> # A tibble: 1 x 2
#>    year flights
#>   <int>   <int>
#> 1  2013  336776

There's also a mention of how summarize peels off grouping layers in the group_by docs, but you have to read through the examples to catch it. I wish it were called out in the Details section! (I admit this is a bit of a pet peeve of mine: important information exclusively given in examples and not in the "main text" of docs :confounded:).

Sounds like a great topic to dig into in a useR group meeting!


#16

Thanks for those pointers. Now that I know it, I see it everywhere :slight_smile: To me this feels like a big enough deal I'd like dplyr to chattily warn me about it, the way that it does when I use explicit joining. I think it should return red text after the summarize that says something like, 'summary complete, group_by reduced to 'fieldA', 'fieldB', etc.

I may bring that up on the dplyr github.


#17

Just for reference, I think the issue you opened is this one


#18

thank you. Yes for reference I should have linked back to it!