slice_head on grouped dataframe reordering by grouping variable

dmolitor · November 18, 2020, 8:56pm

In the following example, I am passing in a grouped dataframe to slice_head.

suppressWarnings(library(dplyr))
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(
  group = rep(c("b", "c", "a"), c(1, 2, 4)),
  x = runif(7)
)

df
#> # A tibble: 7 x 2
#>   group      x
#>   <chr>  <dbl>
#> 1 b     0.648 
#> 2 c     0.247 
#> 3 c     0.192 
#> 4 a     0.0977
#> 5 a     0.0479
#> 6 a     0.216 
#> 7 a     0.501

df %>%
  group_by(group) %>%
  slice_head(n = 1)
#> # A tibble: 3 x 2
#> # Groups:   group [3]
#>   group      x
#>   <chr>  <dbl>
#> 1 a     0.0977
#> 2 b     0.648 
#> 3 c     0.247

In my case (and this seems like the intuitive result) I would like to see the grouping order retained, but the output is sorted alphanumerically by the grouping variable after chopping the head off of each group. From the documentation, it appears that this is exactly what the .preserve argument in slice is meant to address; however, when I use slice and set .preserve = TRUE, I get the same ordering.

df %>%
  group_by(group) %>%
  slice(1, .preserve = TRUE)
#> # A tibble: 3 x 2
#> # Groups:   group [3]
#>   group     x
#>   <chr> <dbl>
#> 1 a     0.958
#> 2 b     0.680
#> 3 c     0.927

The following is the source code for slice_head:

function (.data, ..., n, prop) 
{
    ellipsis::check_dots_empty()
    size <- check_slice_size(n, prop)
    idx <- switch(size$type, n = function(n) seq2(1, min(size$n, 
        n)), prop = function(n) seq2(1, min(size$prop * n, n)))
    slice(.data, idx(dplyr::n()))
}

It appears that slice_head is calling slice which, by default, has .preserve = FALSE. At the very least, it seems slice_head should let the user set the value of .preserve. That still leaves me wondering why slice with .preserve = TRUE didn't seem to work, as illustrated above. Am I doing something stupid here? I know there are easy workarounds, but this is just bugging me.

nirgrahamuk · November 18, 2020, 9:08pm

What's your dplyr packageVersion?

dmolitor · November 18, 2020, 9:10pm

Sorry, should have included my session info. Here it is:

devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2020-11-18                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  backports     1.1.10  2020-09-15 [1] CRAN (R 4.0.2)
#>  callr         3.4.4   2020-09-07 [1] CRAN (R 4.0.2)
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.2)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
#>  devtools      2.3.1   2020-07-21 [1] CRAN (R 4.0.2)
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.2)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.2)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.2)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.2)
#>  knitr         1.29    2020-06-23 [1] CRAN (R 4.0.2)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.2)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.2)
#>  pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
#>  processx      3.4.4   2020-09-03 [1] CRAN (R 4.0.2)
#>  ps            1.3.4   2020-08-11 [1] CRAN (R 4.0.2)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.2)
#>  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
#>  rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.3     2020-06-18 [1] CRAN (R 4.0.2)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.2)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 4.0.2)
#>  usethis       1.6.1   2020-04-29 [1] CRAN (R 4.0.2)
#>  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
#>  xfun          0.16    2020-07-24 [1] CRAN (R 4.0.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)

And dplyr version 1.0.2

AlexisW · November 19, 2020, 1:24am

I'm not sure it's exactly this that it addresses. I see an effect of .preserve when slicing the second row:

df %>%
  group_by(group) %>%
  slice(2, .preserve = TRUE)
# A tibble: 2 x 2
# Groups:   group [3]
#  group     x
#  <chr> <dbl>
#1 a     0.478
#2 c     0.537

df %>%
  group_by(group) %>%
  slice(2, .preserve = FALSE)
# A tibble: 2 x 2
# Groups:   group [2]
#  group     x
#  <chr> <dbl>
#1 a     0.478
#2 c     0.537

Here, b is lost since it only has a single row, and you see that with .preserve the levels are kept.

I may be wrong, but I don't think I ever saw the documentation mention an order for groups. My reading would be that a grouping has levels but no guaranteed order, so any operation can order the result any way it wants. That's why in your example slice_head() reorders alphabetically without you asking, and one day it might decide to order differently without warning. If you want an ordering, you have to use arrange().
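
To make the arrange() suggestion concrete, here is a minimal sketch rather than an official recipe. It records each row's original position before grouping and sorts on it afterwards. The helper column name row_id is made up for this example, and the x values are copied from the first post so the result is reproducible.

```r
library(dplyr)

df <- tibble(
  group = rep(c("b", "c", "a"), c(1, 2, 4)),
  x = c(0.648, 0.247, 0.192, 0.0977, 0.0479, 0.216, 0.501)
)

res <- df %>%
  mutate(row_id = row_number()) %>%  # hypothetical helper: original row position
  group_by(group) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  arrange(row_id) %>%                # put the rows back in input order
  select(-row_id)

res$group
```

With this, the groups come out in the order b, c, a instead of a, b, c.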

dmolitor · November 19, 2020, 1:50am

Aha, I see. Yes, looking back it says it preserves the "grouping structure" but I think you're right that that's not necessarily referring to the group order.

So, just to make sure I'm understanding correctly, you're suggesting that slice_head reorders the groups based on some underlying heuristic applied to the grouping variable and the user (me) can't force it to maintain the original group order? If this is the case, that seems extremely problematic. For example, in this case the data may be sorted in exactly the way it needs to be, and slice_head arranging it alphabetically by the grouping variable completely messes the proper ordering up.

AlexisW · November 19, 2020, 4:16am

Interesting point. I see why that could be problematic, but I honestly don't know.

Diving a bit more, I now realize there is an ordering of the group, which I think is implicitly given upon creation:

df <- tibble(
  group = rep(c("b", "c", "a"), c(1, 2, 4)),
  x = runif(7)
)

df1 <- df %>%
  group_by(group)

group_data(df1)
#> # A tibble: 3 x 2
#>   group       .rows
#> * <chr> <list<int>>
#> 1 a             [4]
#> 2 b             [1]
#> 3 c             [2]
slice_head(df1)
#> # A tibble: 3 x 2
#> # Groups:   group [3]
#>   group     x
#>   <chr> <dbl>
#> 1 a     0.977
#> 2 b     0.493
#> 3 c     0.804

df2 <- df %>%
  mutate(group = forcats::fct_inorder(group)) %>%
  group_by(group)

group_data(df2)
#> # A tibble: 3 x 2
#>   group       .rows
#> * <fct> <list<int>>
#> 1 b             [1]
#> 2 c             [2]
#> 3 a             [4]
slice_head(df2)
#> # A tibble: 3 x 2
#> # Groups:   group [3]
#>   group     x
#>   <fct> <dbl>
#> 1 b     0.493
#> 2 c     0.804
#> 3 a     0.977
Created on 2020-11-18 by the reprex package (v0.3.0)

And if you look in slice(), it ends up calling dplyr:::slice_rows(), which uses mask$get_rows(), which I think uses the same information (and thus the same order) as group_data(). So I don't think the original row order is preserved by slice().

As for .preserve, the only reference I could find was at the very end of the slicing process, and I'm quite sure it's only about dropping unused levels:

if (!preserve && isTRUE(attr(groups, ".drop"))) {
  groups <- group_data_trim(groups)
}
#(in dplyr:::dplyr_row_slice.grouped_df)

PS: there was one more place to check, if I understand correctly, it's actually on purpose:

dmolitor · November 19, 2020, 1:30pm

Nice one, I think you got to the root of it. So slice_head calls slice which then calls dplyr_row_slice and this is the code for dplyr_row_slice.grouped_df:

function (data, i, ..., preserve = FALSE) 
{
    out <- vec_slice(as.data.frame(data), i)
    groups <- group_data(data)
    new_id <- vec_slice(group_indices(data), i)
    new_grps <- vec_group_loc(new_id)
    rows <- rep(list_of(integer()), length.out = nrow(groups))
    rows[new_grps$key] <- new_grps$loc
    groups$.rows <- rows
    if (!preserve && isTRUE(attr(groups, ".drop"))) {
        groups <- group_data_trim(groups)
    }
    new_grouped_df(out, groups)
}

It's setting the groups using group_data which, as you showed above, orders them alphanumerically unless forced otherwise by making the grouping variable a factor with its levels in a particular order. And nice find on that GitHub workflow! It does seem like the reordering might be an intentional side effect. Personally, though, I really dislike that, and I think the user should at least have the option to maintain the input group ordering.
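
For reference, the factor approach also works without forcats. This is only a sketch, assuming that the first-appearance order is the one you want: factor(group, levels = unique(group)) sets the levels in that order using base R. The x values are again copied from the first post.

```r
library(dplyr)

df <- tibble(
  group = rep(c("b", "c", "a"), c(1, 2, 4)),
  x = c(0.648, 0.247, 0.192, 0.0977, 0.0479, 0.216, 0.501)
)

by_first_seen <- df %>%
  # levels follow first appearance in the data: b, c, a
  mutate(group = factor(group, levels = unique(group))) %>%
  group_by(group)

keys  <- group_keys(by_first_seen)         # one row per group, in level order
heads <- slice_head(by_first_seen, n = 1)  # first row of each group

as.character(heads$group)
```

Note that group becomes a factor in the result, so it may need converting back with as.character() afterwards.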

AlexisW · November 19, 2020, 1:55pm

Not exactly: at this point the order is already defined in the i parameter, which is equal to loc, defined in dplyr:::slice.data.frame as:

loc <- dplyr:::slice_rows(.data, ...)

Where ... contains 1 if you're calling slice_head() (it's the number of rows to extract).

I guess you can open a feature request for an option .reorder = TRUE. Just to be clear: I think you want to maintain the row order, not the group order (which is already maintained if the grouping variable is a factor whose levels are in the desired order). For example:

tibble(
  group = c("b", "a", "b", "a"),
  vals = rnorm(length(group))
) %>%
  group_by(group) %>%
  slice(1,2)
#  group    vals
#  <chr>   <dbl>
#1 a      1.67  
#2 a     -0.220 
#3 b     -0.0320
#4 b      0.754

What you want is not that b be before a, but that they still alternate.
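
One way to get that today, offered as a sketch and not as the intended dplyr idiom, is to filter on the within-group row number instead of slicing. filter() on a grouped data frame keeps the surviving rows in their input order, so the groups stay interleaved. The data below is made up for illustration, with fixed values in place of rnorm().

```r
library(dplyr)

alt <- tibble(
  group = c("b", "a", "b", "a", "b"),
  vals = 1:5
)

kept <- alt %>%
  group_by(group) %>%
  filter(row_number() <= 2) %>%  # first two rows of each group, like slice(1, 2)
  ungroup()

kept$group
```

Here the third b row is dropped and the remaining rows still alternate b, a, b, a.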
