Dplyr::arrange .by_group=TRUE fails with SQL backend

dplyr
dbplyr

#1

I've either got a misunderstanding or a bug... I think it's a bug.

It seems that dplyr::arrange() with the .by_group=TRUE parameter set produces SQL with an error. Here's how to reprex it:

On the DB (Redshift in my case) set up a dummy table:

drop TABLE sandbox.testorder; 

CREATE TABLE sandbox.testorder (
    grp varchar(255),
    n  DOUBLE PRECISION
);

INSERT INTO sandbox.testorder (grp , n) VALUES ('a',3.3);
INSERT INTO sandbox.testorder (grp , n) VALUES ('a',1.1);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',2.2);
INSERT INTO sandbox.testorder (grp , n) VALUES ('b',4.4);

Then from R (presuming a connection to the DB called con and already loaded dbplyr

testorder  <- tbl(con, "testorder")

testorder %>%
  group_by( grp ) %>%
  arrange( n, .by_group=TRUE) ->
out_test

show_query(out_test)

which generates the following SQL:

SELECT *
FROM "testorder"
ORDER BY "n", TRUE

which fails if I try to collect(out_test) with the following error:

Error in postgresqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  non-integer constant in ORDER BY

the rub seems to be the ,TRUE there at the end. If I remove it, I get the followable runnable SQL:

SELECT *
FROM "testorder"
ORDER BY "n"

My guess is that the routine that generates the SQL has a glitch. Looks like it's just passing , TRUE instead of adding in the group by variables.

I have not dug into the dbplyr or dplyr code to try and find the magic incantation. Not being familiar with the code base I didn't really feel up to digging into that beast tonight.

If this should go somewhere else, like the dbplyr github issues log, let me know. I wasn't 100% sure where the right place to post is.


#2

Because dbplyr and dplyr are so intricately intertwined, issues for either and/or both go in the dplyr repo issues (dbplyr doesn't have an option to file issues, so that makes the decision pretty easy).

In general, if you're pretty sure it's a bug (and, as a general note to all, it's always totally fine to ask here first), then filing in the GitHub repos is the way to go, since we can track things there, and there's all sorts of helpful GitHub link-i-ness for closing issues when the code's been fixed etc.

Looks like you've got the reprex and description ready and raring to go, but I'm linking to the Filing issues guide from the tidyverse page for posterity. :+1:


#3

Hi, I'm just trying to understand why this is not sufficient:

testorder %>%
  arrange(grp, n)

#4

Edgar, in practice that is sufficient. However in reading the documents, I interpreted them as telling me that in order to sort by both the group bys and also the values, I needed the .by_group=TRUE.

From the docs:

.by_group If TRUE, will sort first by grouping variable. Applies to grouped data frames only.

source: https://dplyr.tidyverse.org/reference/arrange.html

I have not tested the sort behavior as it relates to the group_by columns.


#5

thanks. I popped this into the issue tracker. Apologies for the cross posting.


#6

No problem, can you just link to the issue here as well? Thanks.


#7

raised as an issue on Github:


#8

Cool, that's good to know. I was wondering about how we could interpret .by_group for a possible translation. Id really be nice to find a reprex that could not be created w/o that argument. Thanks JD!