Disclaimer: I know the tidy solution to the issue described below, namely the dplyr::min_rank() function.
As my potential helpers may already know, the base R functions order() and sort() differ in that the former outputs a vector of indices and the latter outputs the sorted version of the vector you pass to it. It turns out that the order() function does not work as intended in the tidy framework. Let's consider the following tibble:
library(tibble)  # tibble()
library(dplyr)   # mutate(), %>%
set.seed(123)
dat <- tibble(
  unit = LETTERS[1:5],
  a = rnorm(5),
  b = rnorm(5)
)
dat
# A tibble: 5 x 3
  unit        a      b
  <chr>   <dbl>  <dbl>
1 A     -0.560   1.72
2 B     -0.230   0.461
3 C      1.56   -1.27
4 D      0.0705 -0.687
5 E      0.129  -0.446
Now, I would like to create two new columns, rank_a and rank_b, which, as the names imply, contain the rank (or order) of each value in its corresponding column.
dat <- dat %>%
  mutate(
    rank_a = order(a),
    rank_b = order(b)
  )
dat
# A tibble: 5 x 5
  unit        a      b rank_a rank_b
  <chr>   <dbl>  <dbl>  <int>  <int>
1 A     -0.560   1.72       1      3
2 B     -0.230   0.461      2      4
3 C      1.56   -1.27       4      5
4 D      0.0705 -0.687      5      2
5 E      0.129  -0.446      3      1
A close look at the tibble above reveals that the order() function did not work as intended. For example, the table states that the value 0.129 in the a column (i.e. unit E) is the 3rd lowest value in the column. This is not true! The 3rd lowest value is actually 0.0705 (i.e. unit D)! Interestingly enough, the function works as expected outside the tidy framework.
dat$a[order(dat$a)]
[1] -0.56047565 -0.23017749 0.07050839 0.12928774 1.55870831
The rank_b column suffers from the same issue.
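For comparison, here is a minimal base R sketch of what order() and rank() each return for the same values. The vector a is retyped from the printed tibble (so it is rounded), and the names ord, rnk, sorted and rnk_from_order are illustrative only. It is meant as a sanity check on the two functions, not as a definitive explanation of the mutate() result.

```r
# Values of column `a` as printed in the tibble above (rounded).
a <- c(-0.560, -0.230, 1.56, 0.0705, 0.129)

# order(): element k is the POSITION of the k-th smallest value.
ord <- order(a)

# rank(): element i is the RANK of the value sitting at position i.
rnk <- rank(a)

# Indexing by order() yields the sorted vector, as in dat$a[order(dat$a)].
sorted <- a[ord]

# With no ties, applying order() twice gives the same numbers as rank().
rnk_from_order <- order(ord)

print(ord)  # 1 2 4 5 3
print(rnk)  # 1 2 5 3 4
```

Note that ord matches the rank_a column above, while rnk assigns 3 to unit D (0.0705) and 4 to unit E (0.129). Since there are no ties here, rank() with ties.method = "min", which is what dplyr::min_rank() is built on, should give the same numbers.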