Compare and analyze data using multiple columns

Diegocastano · June 29, 2021, 4:06am

Dear community,

At the moment I need a method to analyze data that I have and it would be of great help if you could collaborate with me. The data is shown as in the following example:

> glimpse(test)
Rows: 196
Columns: 8
$ Start.H <int> 1, 41, 81, 121, 121, 161, 401, 441, 721, 921, 1081, 1201, 1201, 1241, 1521, 1681, 1721, 2041, 2481, 2561, 2681…
$ End.H   <int> 160, 160, 200, 240, 240, 280, 520, 680, 1040, 1120, 1280, 1320, 1320, 1360, 1800, 1800, 1840, 2440, 2680, 2680…
$ Start.I <int> 1, 41, 81, 121, 481, 681, 681, 841, 881, 1041, 1121, 1201, 1521, 1521, 1561, 1641, 1681, 1721, 1921, 2441, 248…
$ End.I   <int> 120, 160, 200, 240, 600, 800, 800, 1000, 1040, 1200, 1240, 1360, 1640, 1640, 1680, 1760, 1840, 1840, 2400, 260…
$ Start.B <int> 1, 41, 41, 81, 121, 121, 161, 401, 721, 921, 1241, 1521, 2041, 2681, 2761, 2801, 2801, 2881, 2921, 2961, 2961,…
$ End.B   <int> 120, 160, 160, 200, 240, 240, 280, 520, 1040, 1120, 1360, 1720, 2640, 2880, 2880, 2920, 2920, 3040, 3040, 3080…
$ Start.C <int> 1, 41, 81, 121, 121, 161, 401, 721, 921, 1121, 1201, 1241, 1521, 1721, 2041, 2481, 2681, 2761, 2881, 3681, 400…
$ End.C   <int> 120, 160, 200, 240, 240, 280, 520, 1040, 1120, 1320, 1320, 1360, 1720, 1840, 2440, 2640, 2880, 2880, 3080, 396…

The Start and End are positions in the genome of the same organism that underwent different replicates (H, I, B and C). This is the suffix for each Start and End, as you can see in the dataframe.
My purpose is that if all the positions of both Start and End with all the replicas (H, I, B and C) are the same, the word "Conserved" is added in a new column called "Type". Now, the different combinations would be called "Shared" and if the position of Start and End is different from all the others it would be called "Unique".

For this, I wrote the following script:

test <- data %>% 
  mutate(Type = if_else(Start.H & End.H == Start.I & End.I == Start.B & End.B == Start.C & End.C) ~ "Conserved"
         Start.H & End.H == Start.I & End.I != Start.B & End.B != Start.C & End.C) ~ "Shared"
         Start.H & End.H == Start.I & End.I == Start.B & End.B != Start.C & End.C) ~ "Shared"
         Start.H & End.H == Start.I & End.I == Start.B & End.B != Start.C & End.C) ~ "Shared"
         Start.H & End.H != Start.I & End.I == Start.B & End.B != Start.C & End.C) ~ "Shared"
         Start.H & End.H != Start.I & End.I != Start.B & End.B == Start.C & End.C) ~ "Shared"
         Start.H & End.H != Start.I & End.I == Start.B & End.B == Start.C & End.C) ~ "Shared"
         Start.H & End.H != Start.I & End.I != Start.B & End.B != Start.C & End.C) ~ "Unique"

Unfortunately, it doesn't work for the goal I need. For this reason, if you have a more feasible way to perform this operation, it would be greatly appreciated.

Thanks.

lucasgraybuck · June 29, 2021, 5:07am

Here's an approach using dplyr::case_when() along with a couple of helper functions - these may not be the fastest option, and there are surely other methods that would work as well.

library(dplyr)

data <- data.frame(
  Start.H = c(1, 41, 81, 121, 121, 161, 401, 441, 721, 921, 1081, 1201, 1201, 1241, 1521, 1681, 1721, 2041, 2481),
  End.H = c(160, 160, 200, 240, 240, 280, 520, 680, 1040, 1120, 1280, 1320, 1320, 1360, 1800, 1800, 1840, 2440, 2680),
  Start.I = c(1, 41, 81, 121, 481, 681, 681, 841, 881, 1041, 1121, 1201, 1521, 1521, 1561, 1641, 1681, 1721, 1921),
  End.I = c(120, 160, 200, 240, 600, 800, 800, 1000, 1040, 1200, 1240, 1360, 1640, 1640, 1680, 1760, 1840, 1840, 2400),
  Start.B = c(1, 41, 41, 81, 121, 121, 161, 401, 721, 921, 1241, 1521, 2041, 2681, 2761, 2801, 2801, 2881, 2921),
  End.B = c(120, 160, 160, 200, 240, 240, 280, 520, 1040, 1120, 1360, 1720, 2640, 2880, 2880, 2920, 2920, 3040, 3040),
  Start.C = c(1, 41, 81, 121, 121, 161, 401, 721, 921, 1121, 1201, 1241, 1521, 1721, 2041, 2481, 2681, 2761, 2881),
  End.C = c(120, 160, 200, 240, 240, 280, 520, 1040, 1120, 1320, 1320, 1360, 1720, 1840, 2440, 2640, 2880, 2880, 3080)
)

all_eq <- function(x) {
  all(x == x[1])
}

all_neq <- function(x) {
  combos <- combn(x,2)
  !any(apply(combos, 2, all_eq))
}

test <- data %>%
  rowwise() %>%
  mutate(Type = case_when(
    all_eq(c(Start.H, Start.I, Start.B, Start.C)) & all_eq(c(End.H, End.I, End.B, End.C)) ~ "Conserved",
    all_neq(c(Start.H, Start.I, Start.B, Start.C)) & all_neq(c(End.H, End.I, End.B, End.C)) ~ "Unique",
    TRUE ~ "Shared"
  )) %>%
  ungroup()

Note for the case_when, we don't need to specify every possible combination - we can take care of the Conserved and Unique cases, and allow anything that doesn't match these conditions to be Shared.

Another note is that I don't think your logical statements are doing quite what you'd like them to do.

Let's look at this Conserved case:

Start.H & End.H == Start.I & End.I == Start.B & End.B == Start.C & End.C

It looks like you're using & here where you'd really like to use c() to combine your Start and End values.

c(Start.H, End.H) == c(Start.I, End.I) == c(Start.B, End.B) == c(Start.C, End.C)

This gets us closer, but as far as I know, we can't actually chain the logical == comparisons this way. To combine logical comparisons, we'll now need to add & between them, and expand out to include all of the possible pairs.

And, with two values in each vector, we may also need to use all to get a single value

all(c(Start.H, End.H) == c(Start.I, End.I)) & all(c(Start.H, End.H) == c(Start.B, End.B)) & all(c(Start.H, End.H) == c(Start.C, End.C)) & all(c(Start.I, End.I) == c(Start.B, End.B)) & all(c(Start.I, End.I) == c(Start.C, End.C)) & all(c(Start.B, End.B) == c(Start.C, End.C))

As you can see, we now have a lot of comparisons! But, we could just compare each set to the first set:

all(c(Start.H, End.H) == c(Start.I, End.I)) & all(c(Start.H, End.H)  == c(Start.B, End.B)) & all(c(Start.H, End.H)  == c(Start.C, End.C))

Or do a nested comparison, like so:

all(((c(Start.H, End.H) == c(Start.I, End.I)) == c(Start.B, End.B)) == c(Start.C, End.C))

These are probably less practical

Diegocastano · June 29, 2021, 1:57pm

It's great what you did!
If I wish in this case to compare strings of the variable of the different organisms that are found in each row I assign it as (H, I, B and C) "Conserved", if within these values there is an NA, which indicates that it is not present in that organism I assign it as (H, NA, B, C) "Shared" and finally if it is in a single organism as (H, NA, NA, NA, NA) "Unique".

I did this

test <- data %>%
  rowwise() %>%
  mutate(Type = case_when(
    all_equal(c(Host.H = Human, Host.I = Intermediate, Host.B = Bat, Host.C = Consensus), na.rm = T) ~ "Conserved")),
    all_neq(c(Host.H = Human, Host.I = Intermediate, Host.B = Bat, Host.C = Consensus), na.rm = T) ~ "Unique",
    TRUE ~ "Shared")) %>%
  ungroup()

The logic is correct, but it doesn't work. Could you support me?

system · July 20, 2021, 1:57pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.