How to count total number of rows containing a multi string pattern in R?

mtoufiq · July 6, 2021, 7:57pm

Hi,

I have a data frame (25,000 rows × 4 columns) with strings/gene symbols as row names and samples as columns. Roughly 75% of the strings/gene symbols appear to be unique (for instance, RFC2, HSPA6, PAX8), but some rows contain multiple strings separated by a forward slash or a run of dots (for instance, DDR1.....MIR4640, MIR5193.....UBA7, LINC00152.....LOC101930489). Is it possible to count the total number of rows containing such a multi-string pattern in R?

Example of input dataset (see below)

dput(data.matrix_v2)
structure(list(GSM647547 = c(0.776, 1.916, 1.004, 1.2, 1.008, 
0.805, 0.851, 1.082, 2.02, 1.03, 1.024, 1.043, 0.941, 1.215, 
1.109, 1.138, 1.007, 1.244, 1.254, 0.995), GSM647552 = c(1.004, 
1.741, 0.968, 1.276, 1.126, 1.772, 1.318, 1.067, 0.341, 0.88, 
1.288, 0.958, 1.354, 1.939, 1.65, 1.738, 1.058, 0.827, 0.925, 
1.122), GSM647553 = c(0.96, 1.4, 0.437, 1.19, 1.092, 0.872, 0.821, 
1.042, 0.426, 0.949, 1.08, 0.92, 1.107, 1.543, 1.18, 1.053, 0.971, 
0.663, 1.091, 1.146), GSM647565 = c(1.358, 1.207, 1.254, 1.068, 
1.043, 0.757, 0.999, 1.254, 1.055, 1.025, 1.036, 1.383, 1.035, 
1.174, 1.271, 0.958, 1.158, 1.571, 1.509, 1.026)), class = "data.frame", row.names = c("DDR1.....MIR4640", 
"RFC2", "HSPA6", "PAX8", "GUCA1A", "MIR5193.....UBA7", "THRA", 
"PTPN21", "CCL5", "CYP2E1", "EPHB3", "ESRRA", "CYP2A6", "SCARB1", 
"TTLL12", "LINC00152.....LOC101930489", "WFDC2", "MAPK1", "MAPK1.1", 
"ADAM32"))

Expected Output

Print total number of rows containing a multi string pattern

Thank you,

Toufiq

pieterjanvc · July 6, 2021, 8:29pm

Hi,

Here is a way of doing that with RegEx

library(tidyverse)
# Data
myData = structure(
  list(
    GSM647547 = c(0.776, 1.916, 1.004, 1.2, 1.008, 
                  0.805, 0.851, 1.082, 2.02, 1.03, 1.024, 1.043, 0.941, 1.215, 
                  1.109, 1.138, 1.007, 1.244, 1.254, 0.995), 
    GSM647552 = c(1.004, 
                  1.741, 0.968, 1.276, 1.126, 1.772, 1.318, 1.067, 0.341, 0.88, 
                  1.288, 0.958, 1.354, 1.939, 1.65, 1.738, 1.058, 0.827, 0.925, 
                  1.122), 
    GSM647553 = c(0.96, 1.4, 0.437, 1.19, 1.092, 0.872, 0.821, 
                  1.042, 0.426, 0.949, 1.08, 0.92, 1.107, 1.543, 1.18, 1.053, 0.971, 
                  0.663, 1.091, 1.146), 
    GSM647565 = c(1.358, 1.207, 1.254, 1.068, 
                  1.043, 0.757, 0.999, 1.254, 1.055, 1.025, 1.036, 1.383, 1.035, 
                  1.174, 1.271, 0.958, 1.158, 1.571, 1.509, 1.026)), 
  class = "data.frame", 
  row.names = c("DDR1.....MIR4640", 
                "RFC2", "HSPA6", "PAX8", "GUCA1A", "MIR5193.....UBA7", "THRA", 
                "PTPN21", "CCL5", "CYP2E1", "EPHB3", "ESRRA", "CYP2A6", "SCARB1", 
                "TTLL12", "LINC00152.....LOC101930489", "WFDC2", "MAPK1", "MAPK1.1", 
                "ADAM32"))

# Total number of multi strings
myFilter = str_detect(rownames(myData), "\\.{2,}|\\/")
sum(myFilter)
#> [1] 3

myData  = myData %>% filter(myFilter)
myData
#>                            GSM647547 GSM647552 GSM647553 GSM647565
#> DDR1.....MIR4640               0.776     1.004     0.960     1.358
#> MIR5193.....UBA7               0.805     1.772     0.872     0.757
#> LINC00152.....LOC101930489     1.138     1.738     1.053     0.958

Created on 2021-07-06 by the reprex package (v2.0.0)

Hope this helps,
PJ

The RegEx pattern looks like this: \.{2,}|\/ which means: match either two or more consecutive dots, or a forward slash, anywhere in the text. I opted for two or more dots because one row name contains a single dot (MAPK1.1), and I don't think that represents two genes.
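
For anyone who prefers to avoid loading the tidyverse, the same count can probably be done in base R with grepl(). This is only a minimal sketch on a small vector of names: most are taken from the example above, but "GENE1/GENE2" is a made-up name added purely to exercise the forward-slash branch, and the variable names (symbols, is_multi, n_multi) are my own.

```r
# Base R sketch of the same idea, using grepl() instead of str_detect().
# "GENE1/GENE2" is a hypothetical name, not from the original data;
# it is only here to show that the forward-slash branch matches too.
symbols <- c("DDR1.....MIR4640", "RFC2", "MAPK1", "MAPK1.1",
             "LINC00152.....LOC101930489", "GENE1/GENE2")

# TRUE where the name has two or more consecutive dots, or a forward slash
is_multi <- grepl("\\.{2,}|/", symbols)

# Count of multi-string names
n_multi <- sum(is_multi)
print(n_multi)

# The names that matched; "MAPK1.1" (a single dot) is not among them
print(symbols[is_multi])
```

On a real data frame, rownames(myData) would take the place of symbols, and myData[is_multi, , drop = FALSE] should return the matching rows with their row names intact.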

mtoufiq · July 6, 2021, 8:37pm

@pieterjanvc, thank you very much. Very helpful.
