How to Label only Top 10 Rows in Dataset for Plot

ABC86 · May 20, 2022, 4:42pm

Hello!

I was wondering if anyone knew how I could restrict my volcano plot labeling to only the top 10 differentially expressed genes (aka, the ones with the smallest of p-values)?

#edgeR analysis

library(limma)
library(edgeR)
library(readxl)
library(dplyr)
library(stats4)
library(BiocGenerics)
library(AnnotationDbi)
library(BiocManager)
library(org.Hs.eg.db)
library(GO.db)
library(ggplot2)
library(ggrepel)


#call control, call another treatment
group <- factor(c(rep('HD1', 15), rep('HD2', 10)))
#Here it will not work unless you make C variable first and then do function
#to make DGE
C <- read_excel('R/HD1vsHD2.xlsx')

C <- C%>%
  {
    DGEList(counts = C[,c(3:27)], 
            genes = C[, c('Geneid', 'Length')], 
            group = group)
  }

design <- model.matrix(~group)
counts_keep <- filterByExpr(C,  min.count = 20)
counts_filter <- C[counts_keep, ]
counts_norm <- calcNormFactors(counts_filter)

plotMDS(counts_norm)

counts_filter <- estimateDisp(counts_norm, design)
fit <- glmQLFit(counts_filter,design)

#test variable creates an object that has the p value and log 2 fold change
test <- glmQLFTest(fit,coef= 2)


geneid <- test[["genes"]][["Geneid"]]
logFC <- test[["table"]][["logFC"]]
pvalue <- test[["table"]][["PValue"]]

df <- data.frame(geneid, logFC, pvalue)


####### volcano plot
options(ggrepel.max.overlaps = Inf)

#determine if genes are differentially expressed
df$diffexpressed <- "No"

#if log2Foldchange > 0.6 and pvalue < 0.05, set as "Up" 
df$diffexpressed[df$logFC > 0.6 & df$pvalue < 0.05] <- "Up"

#if log2Foldchange < -0.6 and pvalue < 0.05, set as "Down"
df$diffexpressed[df$logFC < -0.6 & df$pvalue < 0.05] <- "Down"

#reorganize the df dataset for pvalue (small to big)
df %>% arrange(pvalue)

#top 10 genes and their pvalues 
head(arrange(df,pvalue),10)$pvalue #for top 10 smallest pvalues

ggplot(df, aes(x = logFC, y = -log10(pvalue), col = diffexpressed, label = NA)) +
  geom_point() + 
  coord_cartesian(ylim = c(0,28)) +
  theme_minimal() +
  scale_color_manual(values = c("purple", "grey", "orange")) +
  geom_vline(xintercept = c(-0.6, 0.6), col ="red") +
  geom_hline(yintercept = -log10(0.05), col ="red") +
  theme(text = element_text(size=10)) +
  labs(title = "???")

If I label all of my genes using label = geneid , then the volcano plot becomes illegible as all of the gene names take up the screen. So at the moment, I have label = NA in my ggplot so that no points are labeled:

ggplot(df, aes(x = logFC, y = -log10(pvalue), col = diffexpressed, label = NA)) + ...

Here is an image of the arranged df datat set where the pvalues are in ascending order, as well as the volcano plot produced!

Any advice on how to label only the top 10 rows of that list present in the Console?

Thank you!

Matthias · May 23, 2022, 2:25pm

You can add a column, let's call it "label" that contains a label only for the hits in your list:

# Identify the top 10 hits
Top_Hits = head(arrange(df,pvalue),10) 

# Add column label, containing the gene name for the top hits or nothing for all others
df$label = if_else(df$geneid %in% Top_Hits$geneid,  
                   df$geneid, NULL)

then you can use geom_text_repel (from ggrepel) to show the labels

   # use geom_text_repel to show labels (while preventing overlap)
  ggrepel::geom_text_repel(aes(label = label),
                           size = 3, show.legend = FALSE)

system · June 13, 2022, 2:25pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.