Replacing row names of data frame

Dear RStudio Community,

In its simplest form, my question is this. I have a data.frame (res) with a series of row names. I want to replace each row name with a different string, taken from the 2nd column of another data frame (G_list). The first column of G_list contains strings that match rownames(res). However, many entries in the 2nd column of G_list are empty, so the replacement string is sometimes blank.

For example:

> rownames(res)[1]
[1] "ENSG00000223972"

I want to replace this with the string in the column hgnc_symbol (namely "DDX11L1"), because G_list[209,] gives:

> G_list[209,]
                ensembl_gene_id hgnc_symbol ensembl_gene_id_version
ENSG00000223972 ENSG00000223972 DDX11L1     ENSG00000223972.8

So I am matching based on the row names of G_list and res, and replacing with a string in G_list$hgnc_symbol.

However, for some row names in res, I have the following:

> rownames(res)[100]
[1] "ENSG00000117600"

And I want to find the string to replace this with in G_list, which occurs at row 389:

> G_list[389,]
                ensembl_gene_id hgnc_symbol ensembl_gene_id_version
ENSG00000117600 ENSG00000117600                   ENSG00000117600.8

So I need to replace the row name with an empty row name. (Or I can leave the row name as "ENSG00000117600" if the replacement is "" [blank].)

How do I do this?

Thanks for your help!
Lottie

Could you supply us with a small subset of your data as it looks now and how you would like it to look when you are done?

Dear Leon,

Certainly!

Here it is:

> res
                  baseMean log2FoldChange     lfcSE        stat      pvalue      padj
                 <numeric>      <numeric> <numeric>   <numeric>   <numeric> <numeric>
ENSG00000223972   0.592269      2.4573801  6.196180  0.39659597 0.691665426        NA
ENSG00000227232 419.788972      0.3387703  0.382205  0.88635750 0.375424916 0.7252578
ENSG00000238009   0.550554     -0.0363070  5.256634 -0.00690689 0.994489145        NA
ENSG00000237683  17.226259      3.8523382  1.141383  3.37514947 0.000737756 0.0565244
ENSG00000268903   1.548329      2.0099032  2.157024  0.93179442 0.351442781        NA
ENSG00000239906   0.475776     -0.0362637  6.765833 -0.00535983 0.995723495        NA

and:

> G_list
                ensembl_gene_id hgnc_symbol
ENSG00000223972 ENSG00000223972 SCYL3
ENSG00000227232 ENSG00000227232 
ENSG00000238009 ENSG00000238009 FGR
ENSG00000237683 ENSG00000237683 CFH
ENSG00000268903 ENSG00000268903 
ENSG00000239906 ENSG00000239906 NIPAL3

The desired output is res with its row names replaced by the symbols given in G_list$hgnc_symbol, keeping the original row name wherever the symbol is blank:

> res_new
                  baseMean log2FoldChange     lfcSE        stat      pvalue      padj
                 <numeric>      <numeric> <numeric>   <numeric>   <numeric> <numeric>
SCYL3             0.592269      2.4573801  6.196180  0.39659597 0.691665426        NA
ENSG00000227232 419.788972      0.3387703  0.382205  0.88635750 0.375424916 0.7252578
FGR               0.550554     -0.0363070  5.256634 -0.00690689 0.994489145        NA
CFH              17.226259      3.8523382  1.141383  3.37514947 0.000737756 0.0565244
ENSG00000268903   1.548329      2.0099032  2.157024  0.93179442 0.351442781        NA
NIPAL3            0.475776     -0.0362637  6.765833 -0.00535983 0.995723495        NA

I see, then take a look at the following and check if that solves your challenge:

# Load libraries ----------------------------------------------------------
library("tidyverse")


# Define example data -----------------------------------------------------
res <- tribble(
  ~ensembl_gene_id, ~baseMean, ~log2FoldChange, ~lfcSE, ~stat, ~pvalue, ~padj,
  "ENSG00000223972", 0.592269, 2.4573801, 6.196180, 0.39659597, 0.691665426, NA,
  "ENSG00000227232", 419.788972, 0.3387703, 0.382205, 0.88635750, 0.375424916, 0.7252578,
  "ENSG00000238009", 0.550554, -0.0363070, 5.256634, -0.00690689, 0.994489145, NA,
  "ENSG00000237683", 17.226259, 3.8523382, 1.141383, 3.37514947, 0.000737756, 0.0565244,
  "ENSG00000268903", 1.548329, 2.0099032, 2.157024, 0.93179442, 0.351442781, NA,
  "ENSG00000239906", 0.475776, -0.0362637, 6.765833, -0.00535983, 0.995723495, NA
)
G_list <- tribble(
  ~ensembl_gene_id, ~hgnc_symbol,
  "ENSG00000223972", "SCYL3",
  "ENSG00000227232", NA,
  "ENSG00000238009", "FGR",
  "ENSG00000237683", "CFH",
  "ENSG00000268903", NA, 
  "ENSG00000239906", "NIPAL3"
)


# Wrangle data ------------------------------------------------------------

# Tibble solution
res_joined <- res %>% 
  full_join(G_list,
            by = "ensembl_gene_id") %>% 
  mutate(gene_id = case_when(is.na(hgnc_symbol) ~ ensembl_gene_id,
                             !is.na(hgnc_symbol) ~ hgnc_symbol))

# Data frame solution
res_joined_as_df <- res_joined %>%
  select(-ensembl_gene_id, -hgnc_symbol) %>% 
  column_to_rownames("gene_id")

Hope it helps :)

Thanks Leon. That certainly works with the example data, or at least with how you have coded the example data. However, I don't have NA; I have a blank string. How would I alter your code to account for that?

Then simply change your logical test in the call to case_when() to:

hgnc_symbol == ""
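
For reference, here is a minimal sketch of how the whole pipeline might look once the blank-string test is in place. It is an illustration under a few assumptions, not the exact code from the earlier post: the example data are cut down to three rows and one numeric column, the test catches both NA and "" so that either coding of a missing symbol works, and left_join() is swapped in for full_join() so that rows present only in G_list are not added to res.

```r
library(dplyr)
library(tibble)

# Cut-down example data; blanks are coded as "" rather than NA
res <- tribble(
  ~ensembl_gene_id, ~baseMean,
  "ENSG00000223972", 0.592269,
  "ENSG00000227232", 419.788972,
  "ENSG00000238009", 0.550554
)
G_list <- tribble(
  ~ensembl_gene_id, ~hgnc_symbol,
  "ENSG00000223972", "SCYL3",
  "ENSG00000227232", "",
  "ENSG00000238009", "FGR"
)

res_joined_as_df <- res %>%
  # Keep only the genes that are present in res
  left_join(G_list, by = "ensembl_gene_id") %>%
  # Fall back to the Ensembl ID when the symbol is NA or blank
  mutate(gene_id = case_when(
    is.na(hgnc_symbol) | hgnc_symbol == "" ~ ensembl_gene_id,
    TRUE ~ hgnc_symbol
  )) %>%
  select(-ensembl_gene_id, -hgnc_symbol) %>%
  as.data.frame() %>%
  column_to_rownames("gene_id")
```

Note that column_to_rownames() needs the values in gene_id to be unique, so this sketch assumes no two Ensembl IDs map to the same symbol.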

Hi Leon,

Could you not do something like:

blank <- G_list$hgnc_symbol == ""
G_list$hgnc_symbol[blank] <- G_list$ensembl_gene_id[blank]

To replace the blanks with whatever is in the ensembl_gene_id column of G_list?

Your help with this would be really appreciated!
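
A base R version of that idea might look like the sketch below. It is only an illustration under stated assumptions: res is treated as a plain data.frame with Ensembl IDs as row names (cut down to one column here), G_list is deliberately given in a different row order, and make.unique() is added as a guard in case two IDs share a symbol, since row names must be unique. The helper names idx, symbols, missing and new_names are made up for this example.

```r
# Cut-down example data; G_list is in a different row order than res
res <- data.frame(
  baseMean = c(0.592269, 419.788972, 0.550554),
  row.names = c("ENSG00000223972", "ENSG00000227232", "ENSG00000238009")
)
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000238009", "ENSG00000223972", "ENSG00000227232"),
  hgnc_symbol = c("FGR", "SCYL3", ""),
  stringsAsFactors = FALSE
)

# Position of each row name of res within G_list (NA if there is no match)
idx <- match(rownames(res), G_list$ensembl_gene_id)
symbols <- G_list$hgnc_symbol[idx]

# Keep the Ensembl ID wherever the symbol is unmatched or blank
missing <- is.na(symbols) | symbols == ""
new_names <- ifelse(missing, rownames(res), symbols)

# make.unique() appends .1, .2, ... to any repeated names
rownames(res) <- make.unique(new_names)
```

Because match() looks each row name up by value, this does not depend on res and G_list being in the same order or having the same number of rows.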