How to sort descending within filtered dplyr group_by groups?

DavoWW · December 31, 2018, 6:42am

I need help with the following problem. I want to be able to create the column called "desired" in this demo data frame using only the information in columns "bb", "cc", and "dd". The idea is pretty simple: to reverse the order of one variable within group_by groups having filtered or conditioned on one of the columns. My closest attempt so far is shown below but the 'sort' function ignores the groups and sorts across the whole variable.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>% 
  group_by(bb) %>% 
  mutate(ee=ifelse(cc==2, sort(.$dd, decreasing = TRUE), dd))
#> # A tibble: 24 x 6
#> # Groups:   bb [6]
#>    aa    bb    cc       dd desired    ee
#>    <fct> <fct> <fct> <int>   <int> <int>
#>  1 1     1     1        11      11    11
#>  2 2     1     1        12      12    12
#>  3 3     1     1        13      13    13
#>  4 4     1     1        14      14    14
#>  5 5     2     2        21      24    64
#>  6 6     2     2        22      23    63
#>  7 7     2     2        23      22    62
#>  8 8     2     2        24      21    61
#>  9 9     3     1        31      31    31
#> 10 10    3     1        32      32    32
#> # ... with 14 more rows

^{Created on 2018-12-31 by the reprex package (v0.2.1)}

Thanks for any help.

bragks · December 31, 2018, 10:10am

Just a suggestion, don't know if this will work, but you could have a look at arrange_if() and possibly write some predicate function that evaluates to TRUE if cc == 2.

You probably wouldn't want to use ifelse() for this. Notice the error if you run the "safe" version, if_else() from dplyr.

library(dplyr, warn.conflicts = FALSE)

my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>% 
  group_by(bb) %>% 
  mutate(ee=if_else(cc==2, sort(.$dd, decreasing = TRUE), dd))
#> Error: `true` must be length 4 (length of `condition`) or one, not 24

^{Created on 2018-12-31 by the reprex package (v0.2.1)}

cderv · December 31, 2018, 1:00pm

by calling .$dd you referring to the entire dd column in the tibble. To sort only on the subset of table by groups, you just need to use NSE and use the name of the variable, by riding of the .$

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>% 
  group_by(bb) %>% 
  mutate(ee=ifelse(cc==2, sort(dd, decreasing = TRUE), dd))
#> # A tibble: 24 x 6
#> # Groups:   bb [6]
#>    aa    bb    cc       dd desired    ee
#>    <fct> <fct> <fct> <int>   <int> <int>
#>  1 1     1     1        11      11    11
#>  2 2     1     1        12      12    12
#>  3 3     1     1        13      13    13
#>  4 4     1     1        14      14    14
#>  5 5     2     2        21      24    24
#>  6 6     2     2        22      23    23
#>  7 7     2     2        23      22    22
#>  8 8     2     2        24      21    21
#>  9 9     3     1        31      31    31
#> 10 10    3     1        32      32    32
#> # ... with 14 more rows

^{Created on 2018-12-31 by the reprex package (v0.2.1)}

DavoWW · December 31, 2018, 10:51pm

Thanks for the suggestions. if_else certainly gives a more informative error message.
It seems that arrange_if is not available inside a mutate() function.

I thought this might work, but no:

my.df %>%
  group_by(bb) %>%
  mutate(ee = if_else(cc==2, arrange(desc(dd)), dd))

Error in mutate_impl(.data, dots) : 
  Evaluation error: no applicable method for 'arrange_' applied to an object of class "c('integer', 'numeric')".

I get the same error if "dd" is a factor.

cderv · January 1, 2019, 12:07pm

arrange takes a data.frame / tibble as first argument. You don't have that here, you can't use it directly on a vector. This is a function not for vector like sort. This why you get the error.

bragks · January 2, 2019, 7:14am

Would arrange_if() work if you wrote your own function for .predicate?

Simplified:

my.df %>% 
  group_by(bb) %>% 
  arrange_if(cc == 2, ...) # replace cc == 2 with some function

cderv · January 2, 2019, 10:09pm

arrange_if allows to use a predicate to select column to arrange with. It cannot be used to select a group or arrange only part of the column. You need to approach the problem differently. Maybe considered your one table has several tables with some in the wrong order and that you need to arrange using a custom predicate function that uses cc.

Also, I understand you do not want to sort the whole table rows but only change the dd column to put some values in another order. It is not an arrange task, it is a mutating operation. arrange will change also the rows order according to your predicate, and the aa column will also be sorting according to a decreasing dd if cc==2 - not what you want from you desired output.

jdlong · January 6, 2019, 11:27am

The approach shown by @cderv is very good and illustrates doing this operation in one succinct statement. I tend to be a bit more verbose, so I would have solved this in two steps:

Create a sorted column
Use case_when to selectively use the sorted or the original values.

Example:


library(dplyr)

my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>%
  group_by(bb) %>%
  mutate(dd_sorted = sort(dd, decreasing = TRUE)) %>%
  mutate(dd_new = case_when(cc == 2 ~ dd_sorted,
                            TRUE ~ dd))
#> # A tibble: 24 x 7
#> # Groups:   bb [6]
#>    aa    bb    cc       dd desired dd_sorted dd_new
#>    <fct> <fct> <fct> <int>   <int>     <int>  <int>
#>  1 1     1     1        11      11        14     11
#>  2 2     1     1        12      12        13     12
#>  3 3     1     1        13      13        12     13
#>  4 4     1     1        14      14        11     14
#>  5 5     2     2        21      24        24     24
#>  6 6     2     2        22      23        23     23
#>  7 7     2     2        23      22        22     22
#>  8 8     2     2        24      21        21     21
#>  9 9     3     1        31      31        34     31
#> 10 10    3     1        32      32        33     32
#> # … with 14 more rows

^{Created on 2019-01-06 by the reprex package (v0.2.1)}

I could then drop any columns I didn't want to keep. To me my version seems more understandable to others (including future me). But keep in mind that I have a terrible amount of first person bias here

system · January 13, 2019, 11:27am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.