Understanding arrange()

dplyr

#1

I am learning from R for Data Science. In section 5.3 of the book, I can't work out how arrange() from the tidyverse/dplyr package behaves. The book says "If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns". I really don't get what that sentence means, but here is the code.

> library(tidyverse)
> library(nycflights13)
> f = nycflights13::flights

I see that arrange() sorts the rows in ascending order of whatever column you give it.

> arrange(f, dep_delay)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    12     7     2040           2123       -43       40           2352
 2  2013     2     3     2022           2055       -33     2240           2338
 3  2013    11    10     1408           1440       -32     1549           1559
 4  2013     1    11     1900           1930       -30     2233           2243
 5  2013     1    29     1703           1730       -27     1947           1957
 6  2013     8     9      729            755       -26     1002            955
 7  2013    10    23     1907           1932       -25     2143           2143
 8  2013     3    30     2030           2055       -25     2213           2250
 9  2013     3     2     1431           1455       -24     1601           1631
10  2013     5     5      934            958       -24     1225           1309

But when I pass it two columns, it behaves differently. You can see that the 3rd argument, arr_time, is not really in ascending order.

> arrange(f, dep_delay, arr_time)

# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    12     7     2040           2123       -43       40           2352
 2  2013     2     3     2022           2055       -33     2240           2338
 3  2013    11    10     1408           1440       -32     1549           1559
 4  2013     1    11     1900           1930       -30     2233           2243
 5  2013     1    29     1703           1730       -27     1947           1957
 6  2013     8     9      729            755       -26     1002            955
 7  2013    10    23     1907           1932       -25     2143           2143
 8  2013     3    30     2030           2055       -25     2213           2250
 9  2013     5    14      914            938       -24     1143           1204
10  2013     5     5      934            958       -24     1225           1309

The same is true if I pass more arguments. Now the 4th argument is day, but that is not in ascending order either.

> arrange(f, month,dep_delay,day)
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1    11     1900           1930       -30     2233           2243
 2  2013     1    29     1703           1730       -27     1947           1957
 3  2013     1    12     1354           1416       -22     1606           1650
 4  2013     1    21     2137           2159       -22     2232           2316
 5  2013     1    20      704            725       -21     1025           1035
 6  2013     1    12     2050           2110       -20     2310           2355
 7  2013     1    12     2134           2154       -20        4             50
 8  2013     1    14     2050           2110       -20     2329           2355
 9  2013     1     4     2140           2159       -19     2241           2316
10  2013     1    11     1947           2005       -18     2209           2230

#2

Each additional argument is the tie breaker for the ones before it. So, for example, if two flights have the same month and dep_delay, they will then be ordered by day in ascending order.
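
Here is a minimal sketch of that on a tiny made-up table (the `toy` tibble and its values are invented for illustration, not taken from nycflights13). Three of the rows tie on both month and dep_delay, so day decides their order:

```r
library(dplyr)

# Made-up data: rows 1, 3 and 4 tie on month and dep_delay
toy <- tibble(
  month     = c(1, 1, 1, 1),
  dep_delay = c(-18, -20, -18, -18),
  day       = c(27, 12, 19, 23)
)

# Sort by month, then dep_delay within month, then day within ties
sorted <- arrange(toy, month, dep_delay, day)
sorted
```

The row with dep_delay -20 comes first, and the three tied rows with dep_delay -18 follow in order of day (19, 23, 27).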


#3

Are you familiar with spreadsheet sorting interfaces like this one?
https://goo.gl/images/99k1Qw

arrange is doing the same thing (except that it can sort by more than 3 variables!) — it is sorting the entire data frame (= arranging the rows) by the variables you choose. It’s not sorting the columns independently of each other.

People often describe this as “sorting first by Col1, then by Col2”, or as “sorting by Col2 within Col1”, but another way to think about it is in terms of breaking ties. If Col1 has repeated values, then all the rows with a given value in Col1 are “tied” for first place. How do you decide which row to put first? You look at the values in Col2 — and so on.
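
If you know base R, the same "sort by Col1, then break ties with Col2" idea is what order() does when you give it several vectors. This is only a rough sketch on a made-up table (`toy` is hypothetical), comparing the two approaches:

```r
library(dplyr)

# Made-up data, not from nycflights13
toy <- tibble(
  month     = c(2, 1, 1, 1),
  dep_delay = c(-5, -18, -20, -18),
  day       = c(3, 27, 12, 19)
)

# dplyr: later columns break ties in earlier ones
by_arrange <- arrange(toy, month, dep_delay, day)

# base R: order() uses later vectors to break ties in earlier ones
by_order <- toy[order(toy$month, toy$dep_delay, toy$day), ]

# Both approaches put the rows in the same order here
same_rows <- identical(by_arrange$day, by_order$day)
same_rows
```

Either way, whole rows move together; no column is sorted on its own.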

It might be easier to see what’s going on if we reorder the columns so that the arranged-by variables come first, and look at more rows of data:

> arrange(f, month,dep_delay,day) %>% 
    select(month, dep_delay, day, everything()) %>% 
    head(20)

# A tibble: 20 x 19
   month dep_delay   day  year dep_time sched_dep_time arr_time sched_arr_time
   <int>     <dbl> <int> <int>    <int>          <int>    <int>          <int>
 1     1       -30    11  2013     1900           1930     2233           2243
 2     1       -27    29  2013     1703           1730     1947           1957
 3     1       -22    12  2013     1354           1416     1606           1650
 4     1       -22    21  2013     2137           2159     2232           2316
 5     1       -21    20  2013      704            725     1025           1035
 6     1       -20    12  2013     2050           2110     2310           2355
 7     1       -20    12  2013     2134           2154        4             50
 8     1       -20    14  2013     2050           2110     2329           2355
 9     1       -19     4  2013     2140           2159     2241           2316
10     1       -18    11  2013     1947           2005     2209           2230
11     1       -18    19  2013     1912           1930     2026           2050
12     1       -18    23  2013     1142           1200     1239           1304
13     1       -18    27  2013      617            635      852            934
14     1       -17     4  2013     1243           1300     1432           1450
15     1       -17     7  2013     2013           2030     2150           2206
16     1       -17     9  2013     1143           1200     1242           1304
17     1       -17    10  2013      810            827      955           1031
18     1       -17    14  2013     1558           1615     1826           1831
19     1       -17    15  2013      543            600      710            715
20     1       -17    25  2013     1143           1200     1242           1304
# ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Can you see how within repeated values of month, the rows are arranged in order of dep_delay, and then within repeated values of dep_delay the rows are arranged in order of day?
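
If you would rather check this than eyeball it, here is a rough sketch on a small made-up table (`toy` and its values are invented; the `.groups` argument assumes dplyr 1.0 or later). It shows that day is not sorted when you look at the column on its own, but it is sorted within each group of rows that tie on month and dep_delay:

```r
library(dplyr)

# Made-up rows, just to keep the example small
toy <- tibble(
  month     = c(1, 1, 1, 1, 1, 1),
  dep_delay = c(-17, -18, -20, -18, -17, -18),
  day       = c(4, 27, 12, 19, 25, 23)
)

sorted <- arrange(toy, month, dep_delay, day)

# Looked at on its own, the day column is NOT in ascending order
day_unsorted_overall <- is.unsorted(sorted$day)

# Within each (month, dep_delay) group of tied rows, day IS ascending
per_group <- sorted %>%
  group_by(month, dep_delay) %>%
  summarise(day_in_order = !is.unsorted(day), .groups = "drop")

all_groups_ok <- all(per_group$day_in_order)
```

Here `day_unsorted_overall` and `all_groups_ok` both come out TRUE, which is the behaviour described above.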


#4

Thanks jcblum, I understood most of it, though not all. It was helpful that you moved the arranged-by variables to the front. Let's look at 3 rows from the table you have:

month  dep_delay  day
    1        -18   19
    1        -18   23
    1        -18   27

I think I get it. Only the subset of the current column, that corresponds to the tied values of the preceding column, is actually arranged in ascending order. Just like you said, it is not an independent arrange()-ing like I imagined. Thank you so much :slight_smile: