Hierarchical filtering of a table using (ideally base) R

I have a table which looks like this but with many more entries:

ID     Gene     Tier     Consequence   
1314   ABC      TIER1    missense  
1314   PKD1     TIER1    frameshift 
1314   PKD1     TIER1    stop_gain 
6245   BJD      TIER1    splice_site_variant 
7631   PKD2     TIER1    missense
7631   PKD2     TIER1    non_coding
5336   PKD1     TIER3    missense
1399   PKD1     TIER2    non_coding

I would like to select one row pwer ID with the preference that Tier1 >tier2 >tier3 and then stop_gain>framshift>splice_site_variant>missense>non_coding_mutation. In reality there are roughly 10 types of "Consequence" in a hierarchical order.

I have subset the table already on tier1:

tier1 <- df[which(df$tier == 'TIER1),]

but now wanted to to subset on hierarchy of consequence to I get one line per ID with the highest consequence being selected.

Desired outcome:

ID Gene Tier Consequence
1314 PKD1 TIER1 stop_gain
6245 BJD TIER1 splice_site_variant
7631 PKD2 TIER1 missense
5336 PKD1 TIER3 missense
1399 PKD1 TIER2 non_coding

I thought about turning the consequences into numbers and then using that but was wondering if there was a way of doing this with the text. I work in a HPC in an airlock environment so solutions using base R would be preferable.

Many thanks for your time

can you explain why you want to use base R, and not dplyr/tidyverse which has been designed with these sorts of tasks in mind ?

1 Like

Dear nigrahamuk,

As mentioned at the bottom of the message I work within a fairly restrictive environment. Whilst I can use the packages mentioned if there are functions within them that aren't available then it can be a real faff to get them into the HPC environment (we work with sensitive information).

If you have a dply/tidyverse solution I am more than happy to see if it works. Base R is just a safer bet.

Thanks for taking the time to read the message.

All the best

#example data
df <- structure(list(ID = c(
  1314L, 1314L, 1314L, 6245L, 7631L, 7631L,
  5336L, 1399L
), Gene = c(
  "ABC", "PKD1", "PKD1", "BJD", "PKD2",
  "PKD2", "PKD1", "PKD1"
), Tier = c(
  "TIER1", "TIER1", "TIER1",
  "TIER1", "TIER1", "TIER1", "TIER3", "TIER2"
), Consequence = structure(c(
  5L, 1L, 2L, 4L, 3L, 4L, 3L
), .Label = c(
  "stop_gain", "splice_site_variant",
  "non_coding", "missense", "frameshift"
), class = "factor")), row.names = c(
), class = c("tbl_df", "tbl", "data.frame"))

dplyr style solution

df2 <- filter(df,
       Tier=="TIER1") %>% group_by(ID) %>% 
  arrange(ID,Consequence) %>%

Does this take the first row of the (row_number ()==1) ? How would I "tell" R the order of importance for the different consequence types?

Many thanks

Yes, my solution relies on the ordering of Consequence; as you can see its not a natural ordering by alphabet that you might expect. I handled that in the example by the factor encoding of the variable. Are you familiar with factors in R ?

1 Like

Sorry, if I also just wanted to filter on consequence and not tier how would I modify this code?

Many thanks

omit the initial

       Tier=="TIER1") %>%

I'll chime in with a base R example, where I select row for each id based on the order of gene, then tier, then consequence.

First, some reproducible dummy data,

id <- sample(1000:9999, 4)
gene <- replicate(4, paste(sample(LETTERS, 4, TRUE), collapse = ""))
tier <- paste("TIER", 1:4, sep = "")
consequence <- c("splice_site_varient", "frameshift", "stop_gain")
df <- expand.grid(id = id,
            gene = gene,
            tier = tier,
            consequence = consequence,
            stringsAsFactors = FALSE)
# take a random subset so we don't get the same result for each id.
df <- df[sample(nrow(df), 40), ]
#>       id gene  tier         consequence
#> 26  3510 ZESY TIER2 splice_site_varient
#> 7   9717 TNVY TIER1 splice_site_varient
#> 170 3510 ZESY TIER3           stop_gain
#> 137 3462 ZESY TIER1           stop_gain
#> 164 3985 RVKE TIER3           stop_gain
#> 78  3510 YICH TIER1          frameshift

The general strategy will be to turn gene, tier, consequence into ordered factors which we can sort(), order(), or rank().

#> [1] "RVKE" "TNVY" "ZESY" "YICH"
df[["gene"]] <- factor(df[["gene"]], levels = gene, ordered = TRUE)
#> Levels: RVKE < TNVY < ZESY < YICH
#> [1] "TIER1" "TIER2" "TIER3" "TIER4"
df[["tier"]] <- factor(df[["tier"]], levels = tier, ordered = TRUE)
#> Levels: TIER1 < TIER2 < TIER3 < TIER4
#> [1] "splice_site_varient" "frameshift"          "stop_gain"
df[["consequence"]] <- factor(df[["consequence"]], levels = consequence, ordered = TRUE)
#> [1] splice_site_varient splice_site_varient stop_gain          
#> [4] stop_gain           stop_gain           frameshift         
#> Levels: splice_site_varient < frameshift < stop_gain

Take note of the direction of the order.

idx <- with(df,
                   function(x) {
                     x[order(gene[x], tier[x], consequence[x])[[1]]]
df[idx, ]
#>       id gene  tier         consequence
#> 81  3462 RVKE TIER2          frameshift
#> 34  3510 RVKE TIER3 splice_site_varient
#> 164 3985 RVKE TIER3           stop_gain
#> 7   9717 TNVY TIER1 splice_site_varient

Created on 2020-09-07 by the reprex package (v0.3.0)


Illustrates well the use of {base} for something that can be done more fluently with tools such as {dplyr}


Many thanks for taking the time to answer. This worked perfectly. Taught me a lot about ordering too!

1 Like

Basic question but if I started with the table as per the example. Would I have to tell R the columns should be treated as ordered factors e.g. if I wanted to input tier would the code be:

df$tier <- ordered(df$tier, 
                          levels = rev(c("TIER1", "TIER2", 

And then proceed with the remaining code as above.

That's the idea. Just be careful with the ordering, whatever level you put first will be the lowest.


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.