Remove list of genes from total

Dear all,

I hope someone can help me with this, as I just can't figure it out. I have my DESeq2 object cts, which includes all my genes. Now I only want to look at 100 genes, which are in the list "risk_genes".

I tried with cts_new <- cts[cts %in% risk_genes,] but then I am left with 0 genes.

What am I doing wrong?

Thank you so much for any help!!

Bine

Do you want to filter the genes based on some condition or are you only after a subset of (the top, bottom, random?) 100 observations?

Please provide a reprex (small reproducible example) that we can work with; that way we are better equipped to help you out.

Thank you for your reply.

Let me try to make an example:

Complete cts:

                   Sample 1   Sample 2   Sample 3 .....
Gene 1                   45         89       ....
Gene 2                   43         45       ....
Gene 3                   45        234       ....
Gene 4                   46         45       ....
Gene 5                  ...
......

Now I have a list of genes with

Gene 2
Gene 3

I am only interested in gene 2 and 3 for my analysis.
So from the complete dataset above I want to extract only the rows for gene 2 and gene 3.

In the end I would like to see this:

                  Sample 1   Sample 2   Sample 3 .....
Gene 2                 43      45        ....
Gene 3                 45      234       ....

Hope this makes it a bit more clear!

Thank you!!

You can use the slice function from the {dplyr} package to achieve this. I attach a minimal example on mock data below.

library(tidyverse)

genes <- tibble(
  gene = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"),
  sample_1 = c(runif(1:5)),
  sample_2 = c(runif(1:5)),
  sample_3 = c(runif(1:5))
)

genes %>% 
  slice(2:3)
#> # A tibble: 2 × 4
#>   gene   sample_1 sample_2 sample_3
#>   <chr>     <dbl>    <dbl>    <dbl>
#> 1 Gene 2   0.0727   0.692    0.0661
#> 2 Gene 3   0.323    0.0999   0.846

Thank you but I have a list of 920 genes which I want to keep. Also I have 400 samples.
I cannot type every gene individually as you did with c("Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"), or write sample_1 = c(runif(1:5)), sample_2 = c(runif(1:5)), sample_3 = c(runif(1:5)) for 400 samples.

Would it be possible to use a list, as I tried in my code above?

Thank you!

Again, is there some condition you want to select rows in the gene column based on? Otherwise, if you just want specific rows that you know, just expand the slice function to the rows you want to include. It is difficult to give more specific advice without knowing what you are trying to achieve, and how.

The condition is that these genes belong to the list "risk genes".

So I have my complete dataset and my list "risk genes".
All genes (e.g. gene 2 and 3 in above example from "risk genes") should be extracted from the complete dataset.

Minimal example on how to achieve this below (risk_genes represents your own, larger collection of risk genes):

library(tidyverse)

genes <- tibble(
  gene = c("Gene 1", "Gene 2", "Gene 3", "Gene 4", "Gene 5"),
  sample_1 = c(runif(1:5)),
  sample_2 = c(runif(1:5)),
  sample_3 = c(runif(1:5))
)

risk_genes <- c("Gene 2", "Gene 3")

genes %>% 
  filter(gene %in% risk_genes)
#> # A tibble: 2 × 4
#>   gene   sample_1 sample_2 sample_3
#>   <chr>     <dbl>    <dbl>    <dbl>
#> 1 Gene 2    0.350    0.801    0.532
#> 2 Gene 3    0.124    0.309    0.191

Thank you very much.
I'm not sure why it is not working: I am again left with 0 genes in cts_new if I use

cts_new<- cts %>% dplyr::filter(gene %in% risk_genes_1)

cts is my complete dataset.

Without seeing an example of how your data looks (both cts and risk_genes_1), there's really no solid way of helping you out here.
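
In the meantime, here is a rough sketch of the checks that often explain an empty result. The mock cts and risk_genes_1 below are made up for illustration (a data frame with the genes as row names, and a risk list that was read in as a one-column data frame with a stray trailing space); your real objects may well look different.

```r
# Hypothetical stand-ins for the real objects, for illustration only.
cts <- data.frame(sample1 = c(45, 43, 45),
                  sample2 = c(89, 45, 234),
                  row.names = c("Gene 1", "Gene 2", "Gene 3"))
risk_genes_1 <- data.frame(gene = c("Gene 2", "Gene 3 "))  # note the trailing space

# 1. What kind of object is cts, and where do the gene IDs live?
class(cts)
head(rownames(cts))
"gene" %in% colnames(cts)  # FALSE here: the IDs are row names, not a column

# 2. Is the risk list a plain character vector, or something else?
str(risk_genes_1)

# 3. Pull out a character vector, strip stray whitespace, then count matches.
risk_vec <- trimws(as.character(risk_genes_1[[1]]))
n_matched <- sum(rownames(cts) %in% risk_vec)
n_matched
```

If the match count is still 0 after steps like these, comparing a few IDs from each side by eye (capitalisation, version suffixes, symbols vs. Ensembl IDs) would be my next guess.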

Based on your example, it appears that Gene is not a column of your data.frame, but rather the rownames. You can deal with this by either assigning a new column as the rownames, or filtering on the rownames.

  1. cts$gene <- rownames(cts); cts[cts$gene %in% risk_genes, ]

  2. or this: cts[rownames(cts) %in% risk_genes, ]

Reprex:

cts <- data.frame(sample1 = round(rnorm(100, 10, 2), 0),
                  sample2 = round(rnorm(100, 15, 3), 0),
                  sample3 = round(rnorm(100, 8, 3), 0))
head(cts)
#>   sample1 sample2 sample3
#> 1      11      10      10
#> 2       9      19       8
#> 3       6       9      10
#> 4      10      11      10
#> 5       9      11       8
#> 6       7      15       8

rownames(cts) <- paste0("gene", 1:100)
head(cts)
#>       sample1 sample2 sample3
#> gene1      11      10      10
#> gene2       9      19       8
#> gene3       6       9      10
#> gene4      10      11      10
#> gene5       9      11       8
#> gene6       7      15       8

genelist <- c("gene1", "gene50", "gene22")

cts[rownames(cts) %in% genelist, ]
#>        sample1 sample2 sample3
#> gene1       11      10      10
#> gene22      12      16       8
#> gene50      10      13      10
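
For what it's worth, here is a small sketch of why the original attempt, cts[cts %in% risk_genes, ], came back empty. It assumes cts behaves like a data frame with the genes as row names; the three-gene table is mock data.

```r
# Mock data: genes are row names, not a column.
cts <- data.frame(sample1 = c(11, 9, 6),
                  sample2 = c(10, 19, 9),
                  row.names = c("gene1", "gene2", "gene3"))
risk_genes <- c("gene2", "gene3")

# %in% on the data frame itself compares whole columns against the gene names,
# so it yields FALSE for each column and never looks at the row names.
wrong_index <- cts %in% risk_genes
cts_wrong <- cts[wrong_index, ]
nrow(cts_wrong)  # 0 rows survive

# Matching on the row names is what selects the genes.
right_index <- rownames(cts) %in% risk_genes
cts_right <- cts[right_index, ]
rownames(cts_right)
```

The left-hand side of %in% has to be the vector of gene IDs, which here means rownames(cts) rather than cts.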

Preamble: since DESeq2 is a Bioconductor package, you will likely have better luck getting help over at the Bioconductor support site.

It looks like you are trying to subset cts down to a small set of genes, and I'm guessing that your cts object is a DESeqDataSet, which isn't like a data.frame or tibble at all, so functions that work on those (like filter, etc.) will not work here.

A DESeqDataSet is a SummarizedExperiment, and you can learn the basics of what this data structure is, and how to manipulate it from this vignette.

@aka.dr.house's suggestion should work, though, because you can index into the genes (rows) and samples (columns) of your DESeqDataSet by slicing it as if it were a 2d matrix.
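
As a minimal sketch of that matrix-style slicing, the snippet below uses a plain count matrix as a stand-in, so it runs without Bioconductor installed. The object names and the deliberately missing "gene99" are made up for illustration.

```r
# Mock count matrix standing in for the real data: 6 genes x 3 samples.
set.seed(1)
counts <- matrix(rpois(18, lambda = 20), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("sample", 1:3)))

# Hypothetical risk list; "gene99" is deliberately absent from the data.
risk_genes <- c("gene5", "gene2", "gene99")

# Logical indexing keeps the row order of the data and ignores missing genes.
sub_logical <- counts[rownames(counts) %in% risk_genes, , drop = FALSE]

# intersect() keeps the order of the risk list instead; indexing by the raw
# list would fail here because "gene99" is not a row name.
keep <- intersect(risk_genes, rownames(counts))
sub_named <- counts[keep, , drop = FALSE]

rownames(sub_logical)
rownames(sub_named)

# Genes in the list that are not in the data at all.
missing_genes <- setdiff(risk_genes, rownames(counts))
missing_genes

# With DESeq2 loaded, the same row indexing should apply to the object itself,
# e.g. (dds being your DESeqDataSet):
#   dds_sub <- dds[rownames(dds) %in% risk_genes, ]
```

With a list of 920 genes, checking the missing ones first tells you whether the subset is smaller than expected because of ID mismatches rather than because of the indexing.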
