dplyr arrange() for natural sorting of several columns

Hi all,

I am trying to sort a dataframe by several columns.
Two conditions need to be met:

  1. One or more columns may have the structure of a character string containing a mix of letters and numbers, in which case we want to first sort alphabetically and then numerically within the same group.
  2. The sorting by any additional column should follow the sorting style as is normally done by the arrange() function in the dplyr package: sorting within the group of the previous column.

Problem: arrange() only seems to sort some of the content containing letters and numbers in the right way. See the code below:

#Create data frame with three columns
treatment = c("a 75 mg", 'p 0 mg', 'b 1 mg/kg', 'b 100mg/kg', 
'a 300 mg', 'b 0 mg/kg',  'a 1000 mg', 'a 300 mg')
study = c('01', '01', '02', '02', '04', '01', '03', '01')
patients = c(1, 10, 100, 3, 14, 5, 10, 3)
  
myData = data.frame(treatment, study, patients, stringsAsFactors = F)

> myData
   treatment study patients
1    a 75 mg    01        1
2     p 0 mg    01       10
3  b 1 mg/kg    02      100
4 b 100mg/kg    02        3
5   a 300 mg    04       14
6  b 0 mg/kg    01        5
7  a 1000 mg    03       10
8   a 300 mg    01        3

# 1. Sort by the column that contains letters and numbers only

# dplyr arrange()
b = arrange(myData, treatment)

# Treatment group 'b' is sorted correctly in ascending order, but treatment  
# group 'a' seems to be sorted in descending order.

> b
   treatment study patients
1  a 1000 mg    03       10
2   a 300 mg    04       14
3   a 300 mg    01        3
4    a 75 mg    01        1
5  b 0 mg/kg    01        5
6  b 1 mg/kg    02      100
7 b 100mg/kg    02        3
8     p 0 mg    01       10

# Sort using mixedorder() (gtools) package gives the correct output
a = myData[mixedorder(myData$treatment),]

> a
   treatment study patients
1    a 75 mg    01        1
5   a 300 mg    04       14
8   a 300 mg    01        3
7  a 1000 mg    03       10
6  b 0 mg/kg    01        5
3  b 1 mg/kg    02      100
4 b 100mg/kg    02        3
2     p 0 mg    01       10

# 2. Sort by several columns with arrange()
# Treatment group 'a' is not sorted correctly at all
# Treatment group 'b' is sorted correctly for columns and study (ascending), 
# but not for patients

c = arrange(myData, treatment, study, patients)

> c
   treatment study patients
1  a 1000 mg    03       10
2   a 300 mg    01        3
3   a 300 mg    04       14
4    a 75 mg    01        1
5  b 0 mg/kg    01        5
6  b 1 mg/kg    02      100
7 b 100mg/kg    02        3
8     p 0 mg    01       10

Is there a straighforward, simple way to sort the way I need to,
or am I simply missing a step somewhere?

This is because it's sorting alphabetically, and "alphabetically" speaking 1 comes before 3 (that's why people often prefix, say, numbered file names with zeroes: 01, 02, 03...—like you have for the study numbers).

It sounds like gtools' mixedorder() is working for you, so I can't think of a good reason not to use it.

Another approach with tidyr and dplyr would be to separate out the amounts from the groups and units so that you can properly sort them numerically. Note that there's the issue of precedence. So, for example, study 04 is coming before study 03 in group a because 300 is less than 10000.

library(tidyverse)
treatment = c("a 75 mg", 'p 0 mg', 'b 1 mg/kg', 'b 100 mg/kg', 
              'a 300 mg', 'b 0 mg/kg',  'a 1000 mg', 'a 300 mg')
study = c('01', '01', '02', '02', '04', '01', '03', '01')
patients = c(1, 10, 100, 3, 14, 5, 10, 3)

myData = data.frame(treatment, study, patients, stringsAsFactors = F)

myDat2 <- myData %>%
  separate(treatment, into = c("group", "amt", "unit"), sep = " ")

myDat2 %>%
  mutate(amt = as.numeric(amt)) %>%
  group_by(group) %>%
  arrange(group, amt, study, patients, .by_group = TRUE)
#> # A tibble: 8 x 5
#> # Groups:   group [3]
#>   group   amt unit  study patients
#>   <chr> <dbl> <chr> <chr>    <dbl>
#> 1 a        75 mg    01           1
#> 2 a       300 mg    01           3
#> 3 a       300 mg    04          14
#> 4 a      1000 mg    03          10
#> 5 b         0 mg/kg 01           5
#> 6 b         1 mg/kg 02         100
#> 7 b       100 mg/kg 02           3
#> 8 p         0 mg    01          10

Created on 2020-01-21 by the reprex package (v0.3.0.9001)

2 Likes

Thank you for your help Mara.
I understand now how it sorts.

I cannot use gtools mixedorder() because it does not work for more than one column.

So then I'm thinking I have two options:

  1. I write up my own function with mixedorder()
  2. I use arrange() and separate column content accordingly

I will try the column separation first and will see how that works out. The code needs to be generalised and incorporated into a larger piece of software so it might take me a while, but I will get back to you on this if I get stuck. Thank you for now!

Mara, I managed to generalise and integrate the code and am currently testing it.
Do you have any solution for what I should do in case the amount for the treatment is given as a range, for example "a 30-50 mg"? This is lost in the conversion to numeric.

You could separate into min_amt and max_amt, and have it be NA in cases where it's not a range.

You might also take a look at some other implementations of mixedSort() and/or mixedOrder() or natural sorting in general (I haven't used them personally) to see if any deal with multiple columns https://rdrr.io/github/jmw86069/jamba/man/mixedOrder.html

3 Likes

Thanks a lot for your help Mara. The solutions worked well for me and I managed to solve my problem.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.