I'll chime in with a base R example, where I select one row for each id based on the order of gene, then tier, then consequence.
First, some reproducible dummy data:
set.seed(123)
id <- sample(1000:9999, 4)
gene <- replicate(4, paste(sample(LETTERS, 4, TRUE), collapse = ""))
tier <- paste("TIER", 1:4, sep = "")
consequence <- c("splice_site_varient", "frameshift", "stop_gain")
df <- expand.grid(id = id,
                  gene = gene,
                  tier = tier,
                  consequence = consequence,
                  stringsAsFactors = FALSE)
# take a random subset so we don't get the same result for each id.
df <- df[sample(nrow(df), 40), ]
head(df)
#> id gene tier consequence
#> 26 3510 ZESY TIER2 splice_site_varient
#> 7 9717 TNVY TIER1 splice_site_varient
#> 170 3510 ZESY TIER3 stop_gain
#> 137 3462 ZESY TIER1 stop_gain
#> 164 3985 RVKE TIER3 stop_gain
#> 78 3510 YICH TIER1 frameshift
The general strategy is to turn gene, tier, and consequence into ordered factors, which we can then sort(), order(), or rank().
gene
#> [1] "RVKE" "TNVY" "ZESY" "YICH"
df[["gene"]] <- factor(df[["gene"]], levels = gene, ordered = TRUE)
head(df[["gene"]])
#> [1] ZESY TNVY ZESY ZESY RVKE YICH
#> Levels: RVKE < TNVY < ZESY < YICH
tier
#> [1] "TIER1" "TIER2" "TIER3" "TIER4"
df[["tier"]] <- factor(df[["tier"]], levels = tier, ordered = TRUE)
head(df[["tier"]])
#> [1] TIER2 TIER1 TIER3 TIER1 TIER3 TIER1
#> Levels: TIER1 < TIER2 < TIER3 < TIER4
consequence
#> [1] "splice_site_varient" "frameshift" "stop_gain"
df[["consequence"]] <- factor(df[["consequence"]], levels = consequence, ordered = TRUE)
head(df[["consequence"]])
#> [1] splice_site_varient splice_site_varient stop_gain
#> [4] stop_gain stop_gain frameshift
#> Levels: splice_site_varient < frameshift < stop_gain
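To make the effect of the conversion concrete, here is a minimal sketch on a toy vector. The names `lv`, `x`, and `f` and the values in them are made up for illustration and are not part of the data above. It shows that an ordered factor sorts by its level order rather than alphabetically.

```r
# Hypothetical toy values, only for illustration.
lv <- c("stop_gain", "frameshift", "missense")  # desired priority, first = best
x  <- c("missense", "stop_gain", "frameshift", "stop_gain")

# As plain character, sort() is alphabetical.
sort(x)

# As an ordered factor, sorting follows the level order instead.
f <- factor(x, levels = lv, ordered = TRUE)
sort(f)

# order() returns the positions in x, lowest level first.
order(f)
#> [1] 2 4 3 1
```

So the first element of order() is always the position of the highest-priority value, which is what the selection step relies on.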
Take note of the direction of the order: the first level is the lowest, so it sorts first and is the one selected for each id below.
idx <- with(df,
            tapply(seq_along(id),
                   id,
                   function(x) {
                     # x holds the row positions for one id; keep the one that sorts first
                     x[order(gene[x], tier[x], consequence[x])[[1]]]
                   }))
df[idx, ]
#> id gene tier consequence
#> 81 3462 RVKE TIER2 frameshift
#> 34 3510 RVKE TIER3 splice_site_varient
#> 164 3985 RVKE TIER3 stop_gain
#> 7 9717 TNVY TIER1 splice_site_varient
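As an aside, the same selection can probably be done without tapply() by sorting the whole data frame once and keeping the first row per id with duplicated(). This is only a sketch of that alternative, not the approach used above. The data frame `toy` and its values are hypothetical, with column names chosen to mirror the example.

```r
# Hypothetical toy data; column names mirror the example above.
toy <- data.frame(
  id   = c(1, 1, 2, 2, 2),
  gene = factor(c("B", "A", "B", "B", "A"),
                levels = c("A", "B"), ordered = TRUE),
  tier = factor(c("TIER1", "TIER2", "TIER2", "TIER1", "TIER3"),
                levels = c("TIER1", "TIER2", "TIER3"), ordered = TRUE)
)

# Sort by id, then by the ordered factors in priority order.
sorted <- toy[with(toy, order(id, gene, tier)), ]

# The first row of each id is now its highest-priority row.
best <- sorted[!duplicated(sorted[["id"]]), ]
best
```

Here gene outranks tier, so id 2 keeps the gene "A" row even though its tier is TIER3, which matches the gene, then tier precedence described at the top.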
Created on 2020-09-07 by the reprex package (v0.3.0)