Hierarchical filtering of a table using (ideally base) R

tacrolimus · September 7, 2020, 2:36pm

I have a table which looks like this but with many more entries:

ID     Gene     Tier     Consequence   
1314   ABC      TIER1    missense  
1314   PKD1     TIER1    frameshift 
1314   PKD1     TIER1    stop_gain 
6245   BJD      TIER1    splice_site_variant 
7631   PKD2     TIER1    missense
7631   PKD2     TIER1    non_coding
5336   PKD1     TIER3    missense
1399   PKD1     TIER2    non_coding

I would like to select one row pwer ID with the preference that Tier1 >tier2 >tier3 and then stop_gain>framshift>splice_site_variant>missense>non_coding_mutation. In reality there are roughly 10 types of "Consequence" in a hierarchical order.

I have subset the table already on tier1:

tier1 <- df[which(df$tier == 'TIER1),]

but now wanted to to subset on hierarchy of consequence to I get one line per ID with the highest consequence being selected.

Desired outcome:

ID Gene Tier Consequence
1314 PKD1 TIER1 stop_gain
6245 BJD TIER1 splice_site_variant
7631 PKD2 TIER1 missense
5336 PKD1 TIER3 missense
1399 PKD1 TIER2 non_coding

I thought about turning the consequences into numbers and then using that but was wondering if there was a way of doing this with the text. I work in a HPC in an airlock environment so solutions using base R would be preferable.

Many thanks for your time

nirgrahamuk · September 7, 2020, 2:53pm

can you explain why you want to use base R, and not dplyr/tidyverse which has been designed with these sorts of tasks in mind ?

tacrolimus · September 7, 2020, 2:59pm

Dear nigrahamuk,

As mentioned at the bottom of the message I work within a fairly restrictive environment. Whilst I can use the packages mentioned if there are functions within them that aren't available then it can be a real faff to get them into the HPC environment (we work with sensitive information).

If you have a dply/tidyverse solution I am more than happy to see if it works. Base R is just a safer bet.

Thanks for taking the time to read the message.

All the best

nirgrahamuk · September 7, 2020, 3:07pm

#example data
df <- structure(list(ID = c(
  1314L, 1314L, 1314L, 6245L, 7631L, 7631L,
  5336L, 1399L
), Gene = c(
  "ABC", "PKD1", "PKD1", "BJD", "PKD2",
  "PKD2", "PKD1", "PKD1"
), Tier = c(
  "TIER1", "TIER1", "TIER1",
  "TIER1", "TIER1", "TIER1", "TIER3", "TIER2"
), Consequence = structure(c(
  4L,
  5L, 1L, 2L, 4L, 3L, 4L, 3L
), .Label = c(
  "stop_gain", "splice_site_variant",
  "non_coding", "missense", "frameshift"
), class = "factor")), row.names = c(
  NA,
  -8L
), class = c("tbl_df", "tbl", "data.frame"))

dplyr style solution

df2 <- filter(df,
       Tier=="TIER1") %>% group_by(ID) %>% 
  arrange(ID,Consequence) %>%
  filter(row_number()==1)

tacrolimus · September 7, 2020, 3:21pm

Does this take the first row of the (row_number ()==1) ? How would I "tell" R the order of importance for the different consequence types?

Many thanks

nirgrahamuk · September 7, 2020, 3:27pm

Yes, my solution relies on the ordering of Consequence; as you can see its not a natural ordering by alphabet that you might expect. I handled that in the example by the factor encoding of the variable. Are you familiar with factors in R ?

tacrolimus · September 7, 2020, 3:28pm

Sorry, if I also just wanted to filter on consequence and not tier how would I modify this code?

Many thanks

nirgrahamuk · September 7, 2020, 3:31pm

omit the initial

filter(df,
       Tier=="TIER1") %>%

elmstedt · September 7, 2020, 11:19pm

I'll chime in with a base R example, where I select row for each id based on the order of gene, then tier, then consequence.

First, some reproducible dummy data,

set.seed(123)
id <- sample(1000:9999, 4)
gene <- replicate(4, paste(sample(LETTERS, 4, TRUE), collapse = ""))
tier <- paste("TIER", 1:4, sep = "")
consequence <- c("splice_site_varient", "frameshift", "stop_gain")
df <- expand.grid(id = id,
            gene = gene,
            tier = tier,
            consequence = consequence,
            stringsAsFactors = FALSE)
# take a random subset so we don't get the same result for each id.
df <- df[sample(nrow(df), 40), ]
head(df)

#>       id gene  tier         consequence
#> 26  3510 ZESY TIER2 splice_site_varient
#> 7   9717 TNVY TIER1 splice_site_varient
#> 170 3510 ZESY TIER3           stop_gain
#> 137 3462 ZESY TIER1           stop_gain
#> 164 3985 RVKE TIER3           stop_gain
#> 78  3510 YICH TIER1          frameshift

The general strategy will be to turn gene, tier, consequence into ordered factors which we can sort(), order(), or rank().

gene
#> [1] "RVKE" "TNVY" "ZESY" "YICH"
df[["gene"]] <- factor(df[["gene"]], levels = gene, ordered = TRUE)
head(df[["gene"]])

#> [1] ZESY TNVY ZESY ZESY RVKE YICH
#> Levels: RVKE < TNVY < ZESY < YICH

tier
#> [1] "TIER1" "TIER2" "TIER3" "TIER4"
df[["tier"]] <- factor(df[["tier"]], levels = tier, ordered = TRUE)
head(df[["tier"]])

#> [1] TIER2 TIER1 TIER3 TIER1 TIER3 TIER1
#> Levels: TIER1 < TIER2 < TIER3 < TIER4

consequence
#> [1] "splice_site_varient" "frameshift"          "stop_gain"
df[["consequence"]] <- factor(df[["consequence"]], levels = consequence, ordered = TRUE)
head(df[["consequence"]])

#> [1] splice_site_varient splice_site_varient stop_gain          
#> [4] stop_gain           stop_gain           frameshift         
#> Levels: splice_site_varient < frameshift < stop_gain

Take note of the direction of the order.

idx <- with(df,
            tapply(seq_along(id),
                   id,
                   function(x) {
                     x[order(gene[x], tier[x], consequence[x])[[1]]]
                   }))
df[idx, ]

#>       id gene  tier         consequence
#> 81  3462 RVKE TIER2          frameshift
#> 34  3510 RVKE TIER3 splice_site_varient
#> 164 3985 RVKE TIER3           stop_gain
#> 7   9717 TNVY TIER1 splice_site_varient

^{Created on 2020-09-07 by the reprex package (v0.3.0)}

technocrat · September 8, 2020, 12:05am

Illustrates well the use of {base} for something that can be done more fluently with tools such as {dplyr}

tacrolimus · September 8, 2020, 8:29am

Many thanks for taking the time to answer. This worked perfectly. Taught me a lot about ordering too!

tacrolimus · September 8, 2020, 8:41am

Basic question but if I started with the table as per the example. Would I have to tell R the columns should be treated as ordered factors e.g. if I wanted to input tier would the code be:

df$tier <- ordered(df$tier, 
                          levels = rev(c("TIER1", "TIER2", 
                                         "TIER3")))

And then proceed with the remaining code as above.

elmstedt · September 8, 2020, 9:02am

That's the idea. Just be careful with the ordering, whatever level you put first will be the lowest.

system · September 15, 2020, 9:02am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.