How do I sort text values alphabetically across columns in tidyverse?

Hi all,

I am trying to refurbish some old code (written before the tidyverse existed).

The goal is to create pairwise combinations of several yeast strains, starting from a character vector with the 8 strain names. I have used expand.grid() to generate a data frame with all 64 (8 by 8) possible pairwise combinations of strains, one per row.

Ideally, I want to flag whether a pair may be redundant (e.g. A-B and B-A), so for each row I sort the strain names lexically and check which rows are duplicated.

Since I am using apply() with MARGIN = 1, the resulting matrix is transposed compared to the original one. That is why I have to transpose it back before looking for duplicated pairs.
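To make the transposition concrete, here is a small base R sketch. The three-strain vector strains_demo and the other object names are made up for illustration and are not part of the original code.

```r
# Hypothetical three-strain subset, only to keep the example small
strains_demo <- c("AMH", "BAN", "BED")
pairs_demo <- expand.grid(s1 = strains_demo, s2 = strains_demo,
                          stringsAsFactors = FALSE)

# apply() over rows returns one *column* per input row
sorted_cols <- apply(pairs_demo, 1, sort)
dim(pairs_demo)   # rows x columns of the input
dim(sorted_cols)  # dimensions are swapped here

# t() restores one pair per row before calling duplicated()
sorted_rows <- t(sorted_cols)
n_redundant <- sum(duplicated(sorted_rows))
```

With 3 strains there are 9 ordered pairs and 6 distinct unordered ones, so n_redundant should come out as 3.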

I was wondering if it could be done more simply using tidyverse syntax.
So far I could not find a better way to do it than the code below.

library(tidyverse)
strains <- c("AMH", "BAN", "BED", "BPL", "BTT", "CMP", "CPI", "CQC")
# make all pairs of strains
p_strains <- expand.grid(s1 = strains, s2 = strains) %>% as_tibble()

# find redundant pairs (e.g. A-B and B-A)
is_pair_dup <- apply(p_strains, 1, sort) %>% # sort strains alphabetically across columns
  t() %>% duplicated() # find pairs duplicated across row

# annotate unique pair of strains and "self pair" (i.e. pair composed of the same strain)
p_strains <- p_strains %>%
  mutate(is_identical = s1 == s2, is_duplicated = is_pair_dup)

# Tidyverse version with chaining
tidy_strains <- expand.grid(s1 = strains, s2 = strains) %>%
  as_tibble() %>%
  mutate(
    is_identical  = s1 == s2,
    # QUESTION: can I do the following in a more straightforward way (e.g. with c_across())?
    is_duplicated = apply(., 1, sort) %>% t() %>% duplicated()
  )

# checking that both methods return identical tibble
identical(p_strains, tidy_strains)
#> [1] TRUE

Created on 2022-10-03 with reprex v2.0.2

Well here is a first version, which isn't really easier to read than your base R version.

[...]
tidy_strains2 <- expand_grid(s1 = strains, s2 = strains) |>
  rowwise() |>
  mutate(pair = list(sort(c(s1, s2)))) |>
  ungroup() |>
  mutate(is_identical = s1 == s2,
         is_duplicated = duplicated(pair)) |>
  select(-pair)

all.equal(p_strains, tidy_strains2, check.attributes = FALSE)
#> [1] "Component \"s1\": 'current' is not a factor"                     
#> [2] "Component \"s2\": 'current' is not a factor"                     
#> [3] "Component \"is_duplicated\": target is array, current is logical"
class(p_strains$is_duplicated)
#> [1] "array"

Notice the 3 differences:

  • s1 and s2 are factors when using expand.grid(), but character vectors when using expand_grid() (technically the second is "more tidyverse", but you might prefer the first).
  • with the apply() and t() approach, duplicated() is called on a matrix, so is_duplicated ends up being a one-dimensional array rather than a plain logical vector, though it's not obvious when just looking at it. The rowwise() approach keeps it a simple vector.
  • there are also many differences in the attributes, which are consequences of those two points.
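If you want to see the first two differences for yourself, something like the sketch below should do it. The object names (strains_demo, dup_array, dup_vec) are invented for the example, and as.vector() is just one possible way to drop the dim attribute, not something the original code does.

```r
library(tidyr)

# Hypothetical three-strain subset, only to keep the example small
strains_demo <- c("AMH", "BAN", "BED")

# expand.grid() converts strings to factors by default ...
base_grid <- expand.grid(s1 = strains_demo, s2 = strains_demo)
class(base_grid$s1)
# ... while tidyr::expand_grid() keeps them as character
tidy_grid <- expand_grid(s1 = strains_demo, s2 = strains_demo)
class(tidy_grid$s1)

# duplicated() on a matrix keeps a dim attribute (a 1-d array)
sorted_rows <- t(apply(base_grid, 1, sort))
dup_array <- duplicated(sorted_rows)
class(dup_array)

# as.vector() strips the dim attribute, leaving a plain logical vector
dup_vec <- as.vector(dup_array)
class(dup_vec)
```

If you stay with the apply() version, wrapping the result in as.vector() inside mutate() would probably be enough to get an ordinary logical column.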

To make it more readable, I would maybe put the sort into a function (and eliminate the rowwise grouping):

assemble_pairs <- function(vec1, vec2){
  map2(vec1, vec2,
       ~ sort(c(.x, .y)))
}

tidy_strains3 <- expand_grid(s1 = strains, s2 = strains) |>
  mutate(pair = assemble_pairs(s1, s2),
         is_identical = s1 == s2,
         is_duplicated = duplicated(pair)) |>
  select(-pair)

all.equal(tidy_strains2, tidy_strains3)
#> [1] TRUE

And in that case it might be even clearer to put the whole duplication-detection code into its own function:

find_duplicate_pairs <- function(vec1, vec2){
  map2(vec1, vec2,
       ~ sort(c(.x, .y))) |>
    duplicated()
}

tidy_strains4 <- expand_grid(s1 = strains, s2 = strains) |>
  mutate(is_identical = s1 == s2,
         is_duplicated = find_duplicate_pairs(s1, s2))

all.equal(tidy_strains2, tidy_strains4)
#> [1] TRUE
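Two more variants, as rough sketches rather than recommendations. The first uses c_across(), which is what the question asks about. The second swaps in a different technique: it orders each pair element-wise with pmin() and pmax() and so avoids any row-wise work. Both build a pasted key with a space as separator, which assumes that strain names never contain spaces. The names tidy_c_across and tidy_pminmax are made up for this example.

```r
library(dplyr)
library(tidyr)

strains <- c("AMH", "BAN", "BED", "BPL", "BTT", "CMP", "CPI", "CQC")

# Variant 1: rowwise() + c_across(), building a sorted key per row
tidy_c_across <- expand_grid(s1 = strains, s2 = strains) |>
  rowwise() |>
  mutate(pair = paste(sort(c_across(c(s1, s2))), collapse = " ")) |>
  ungroup() |>
  mutate(is_identical = s1 == s2,
         is_duplicated = duplicated(pair)) |>
  select(-pair)

# Variant 2: vectorised, the smaller name first and the larger name second
# (assumes no strain name contains a space)
tidy_pminmax <- expand_grid(s1 = strains, s2 = strains) |>
  mutate(is_identical = s1 == s2,
         is_duplicated = duplicated(paste(pmin(s1, s2), pmax(s1, s2))))

all.equal(tidy_c_across, tidy_pminmax)
```

For 8 strains I would expect 28 rows flagged as duplicated (64 ordered pairs minus 36 distinct unordered ones) in both variants.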

Thanks @AlexisW. I find your solution pretty neat.
The "unnecessary" transpose operations were bothering me.
Also, I did not know about column-array but now I'll be more careful when comparing objects.
