Why does group_indices use alphabetical ordering?


#1

I’d like to number each group in a data frame so that the groups are ordered according to the order they appear in the data frame. This is the code that I have so far:

library(tibble)
library(dplyr)

df <- tibble(
  category = c("a", "b", "c", "c"),
  value = c(7, 1, 4, 2)
)

df <- df %>%
  group_by(category) %>%
  mutate(mean_value = mean(value)) %>%
  arrange(mean_value, category) %>%
  ungroup()

df %>% mutate(id = group_indices(., category))
#> # A tibble: 4 x 4
#>   category value mean_value    id
#>   <chr>    <dbl>      <dbl> <int>
#> 1 b         1.00       1.00     2
#> 2 c         4.00       3.00     3
#> 3 c         2.00       3.00     3
#> 4 a         7.00       7.00     1

I’d like the id variable to be ordered like this:

#> # A tibble: 4 x 4
#>   category value mean_value    id
#>   <chr>    <dbl>      <dbl> <int>
#> 1 b         1.00       1.00     1
#> 2 c         4.00       3.00     2
#> 3 c         2.00       3.00     2
#> 4 a         7.00       7.00     3

I sorted the data frame by the criterion I care about (mean_value), and now I’d like to number the groups in the order in which each category first appears.

Why does the group_indices function order alphabetically by default? Is there a simple way for me to achieve my goal?


#2

Hi @kylevoyto,

FYI, there’s a related issue open in the dplyr repo:


#3

I don’t know if it can be considered simple, but I would write my own function for that:

respect_sort <- function(df, category = "category", id = "id") {
  # Start with an empty integer id column
  df[[id]] <- NA_integer_
  # Unique categories, in order of first appearance
  lvls <- unique(df[[category]])
  # Give each category its position in lvls as the id
  for (i in seq_along(lvls)) {
    df[[id]][df[[category]] == lvls[i]] <- i
  }
  df
}

> df %>% respect_sort()
# A tibble: 4 x 4
  category value mean_value    id
  <chr>    <dbl>      <dbl> <int>
1 b         1.00       1.00     1
2 c         4.00       3.00     2
3 c         2.00       3.00     2
4 a         7.00       7.00     3

It’s a little hacky, but it does what you want.


#4

You can wrap group_indices in another function.

grpid = function(x) match(x, unique(x))
df %>% mutate(id = group_indices(., category) %>% grpid)

# A tibble: 4 x 4
  category value mean_value    id
     <chr> <dbl>      <dbl> <int>
1        b     1          1     1
2        c     4          3     2
3        c     2          3     2
4        a     7          7     3
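
As a side note, grpid only depends on the order of first appearance, so it should give the same ids when applied to category directly, without going through group_indices. A minimal base R sketch, assuming the data is already sorted as in the question (df_sorted is a hand-typed copy used only for illustration):

```r
# Hand-typed copy of the sorted data from the question (illustrative only).
df_sorted <- data.frame(
  category = c("b", "c", "c", "a"),
  value = c(1, 4, 2, 7),
  stringsAsFactors = FALSE
)

# grpid as defined above: position of each value among the unique values,
# which unique() keeps in order of first appearance.
grpid <- function(x) match(x, unique(x))

df_sorted$id <- grpid(df_sorted$category)
df_sorted$id
#> [1] 1 2 2 3
```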

For what it’s worth, the result you want is provided by default with data.table:

library(data.table)
DT = data.table(df)

DT[, id := .GRP, by=.(category)][]

   category value mean_value id
1:        b     1          1  1
2:        c     4          3  2
3:        c     2          3  2
4:        a     7          7  3

#5

From the issue that @mara linked, it looks like this works as expected for factors, which are numbered in the order of their levels. So you can do this:

library(tibble)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  category = c("a", "b", "c", "c"),
  value = c(7, 1, 4, 2)
)

df <- df %>%
  group_by(category) %>%
  mutate(mean_value = mean(value)) %>%
  arrange(mean_value, category) %>%
  ungroup()

df %>%
  mutate(id = group_indices(., factor(category, levels = unique(category))))
#> # A tibble: 4 x 4
#>   category value mean_value    id
#>   <chr>    <dbl>      <dbl> <int>
#> 1 b         1.00       1.00     1
#> 2 c         4.00       3.00     2
#> 3 c         2.00       3.00     2
#> 4 a         7.00       7.00     3

Created on 2018-02-21 by the reprex package (v0.2.0).
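
To illustrate why the factor trick works, here is a small base R sketch on a hand-typed copy of the sorted category column (an assumption for self-containment, not the original df). The integer codes of a factor follow its level order: the default levels are sorted, which matches the alphabetical ids from the question, while levels taken from unique() follow the order of first appearance.

```r
# Hand-typed: the order of category after arrange() in the question.
category <- c("b", "c", "c", "a")

# Default factor levels are sorted ("a", "b", "c"),
# which reproduces the alphabetical ids.
alphabetical_id <- as.integer(factor(category))
alphabetical_id
#> [1] 2 3 3 1

# Levels in order of first appearance ("b", "c", "a") give the desired ids.
id <- as.integer(factor(category, levels = unique(category)))
id
#> [1] 1 2 2 3
```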