How to sort descending within filtered dplyr group_by groups?

dplyr

#1

I need help with the following problem. I want to be able to create the column called "desired" in this demo data frame using only the information in columns "bb", "cc", and "dd". The idea is pretty simple: to reverse the order of one variable within group_by groups having filtered or conditioned on one of the columns. My closest attempt so far is shown below but the 'sort' function ignores the groups and sorts across the whole variable.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>% 
  group_by(bb) %>% 
  mutate(ee=ifelse(cc==2, sort(.$dd, decreasing = TRUE), dd))
#> # A tibble: 24 x 6
#> # Groups:   bb [6]
#>    aa    bb    cc       dd desired    ee
#>    <fct> <fct> <fct> <int>   <int> <int>
#>  1 1     1     1        11      11    11
#>  2 2     1     1        12      12    12
#>  3 3     1     1        13      13    13
#>  4 4     1     1        14      14    14
#>  5 5     2     2        21      24    64
#>  6 6     2     2        22      23    63
#>  7 7     2     2        23      22    62
#>  8 8     2     2        24      21    61
#>  9 9     3     1        31      31    31
#> 10 10    3     1        32      32    32
#> # ... with 14 more rows

Created on 2018-12-31 by the reprex package (v0.2.1)

Thanks for any help.


#2

Just a suggestion, don't know if this will work, but you could have a look at arrange_if() and possibly write some predicate function that evaluates to TRUE if cc == 2.

You probably wouldn't want to use ifelse() for this. Notice the error if you run the "safe" version, if_else() from dplyr.

library(dplyr, warn.conflicts = FALSE)

my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>% 
  group_by(bb) %>% 
  mutate(ee=if_else(cc==2, sort(.$dd, decreasing = TRUE), dd))
#> Error: `true` must be length 4 (length of `condition`) or one, not 24

Created on 2018-12-31 by the reprex package (v0.2.1)


#3

by calling .$dd you referring to the entire dd column in the tibble. To sort only on the subset of table by groups, you just need to use NSE and use the name of the variable, by riding of the .$

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>% 
  group_by(bb) %>% 
  mutate(ee=ifelse(cc==2, sort(dd, decreasing = TRUE), dd))
#> # A tibble: 24 x 6
#> # Groups:   bb [6]
#>    aa    bb    cc       dd desired    ee
#>    <fct> <fct> <fct> <int>   <int> <int>
#>  1 1     1     1        11      11    11
#>  2 2     1     1        12      12    12
#>  3 3     1     1        13      13    13
#>  4 4     1     1        14      14    14
#>  5 5     2     2        21      24    24
#>  6 6     2     2        22      23    23
#>  7 7     2     2        23      22    22
#>  8 8     2     2        24      21    21
#>  9 9     3     1        31      31    31
#> 10 10    3     1        32      32    32
#> # ... with 14 more rows

Created on 2018-12-31 by the reprex package (v0.2.1)


#4

Thanks for the suggestions. if_else certainly gives a more informative error message.
It seems that arrange_if is not available inside a mutate() function.

I thought this might work, but no:

my.df %>%
  group_by(bb) %>%
  mutate(ee = if_else(cc==2, arrange(desc(dd)), dd))

Error in mutate_impl(.data, dots) : 
  Evaluation error: no applicable method for 'arrange_' applied to an object of class "c('integer', 'numeric')".

I get the same error if "dd" is a factor.


#5

arrange takes a data.frame / tibble as first argument. You don't have that here, you can't use it directly on a vector. This is a function not for vector like sort. This why you get the error.


"Missing" workflow in tidyverse?
#6

Would arrange_if() work if you wrote your own function for .predicate?

Simplified:

my.df %>% 
  group_by(bb) %>% 
  arrange_if(cc == 2, ...) # replace cc == 2 with some function

#7

arrange_if allows to use a predicate to select column to arrange with. It cannot be used to select a group or arrange only part of the column. You need to approach the problem differently. Maybe considered your one table has several tables with some in the wrong order and that you need to arrange using a custom predicate function that uses cc.

Also, I understand you do not want to sort the whole table rows but only change the dd column to put some values in another order. It is not an arrange task, it is a mutating operation. arrange will change also the rows order according to your predicate, and the aa column will also be sorting according to a decreasing dd if cc==2 - not what you want from you desired output.


#8

The approach shown by @cderv is very good and illustrates doing this operation in one succinct statement. I tend to be a bit more verbose, so I would have solved this in two steps:

  1. Create a sorted column
  2. Use case_when to selectively use the sorted or the original values.

Example:


library(dplyr)

my.df <- data.frame(aa=factor(c(1:24)), 
                    bb=factor(rep(c(1:6), each=4)),
                    cc=factor(rep(c(1,2), each=4, times=3)),
                    dd=c(11:14, 21:24, 31:34, 41:44, 51:54, 61:64),
                    desired=c(11:14, 24:21, 31:34, 44:41, 51:54, 64:61))
my.df %>%
  group_by(bb) %>%
  mutate(dd_sorted = sort(dd, decreasing = TRUE)) %>%
  mutate(dd_new = case_when(cc == 2 ~ dd_sorted,
                            TRUE ~ dd))
#> # A tibble: 24 x 7
#> # Groups:   bb [6]
#>    aa    bb    cc       dd desired dd_sorted dd_new
#>    <fct> <fct> <fct> <int>   <int>     <int>  <int>
#>  1 1     1     1        11      11        14     11
#>  2 2     1     1        12      12        13     12
#>  3 3     1     1        13      13        12     13
#>  4 4     1     1        14      14        11     14
#>  5 5     2     2        21      24        24     24
#>  6 6     2     2        22      23        23     23
#>  7 7     2     2        23      22        22     22
#>  8 8     2     2        24      21        21     21
#>  9 9     3     1        31      31        34     31
#> 10 10    3     1        32      32        33     32
#> # … with 14 more rows

Created on 2019-01-06 by the reprex package (v0.2.1)

I could then drop any columns I didn't want to keep. To me my version seems more understandable to others (including future me). But keep in mind that I have a terrible amount of first person bias here :slight_smile:


#9

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.