# How to count the total number of rows containing a multi-string pattern in R?

Hi,

I have a data frame (`25000 rows * 4 columns`) with gene symbols as row names and samples as columns. It seems that roughly 75% of the row names are single gene symbols (for instance, `RFC2`, `HSPA6`, `PAX8`); however, some rows contain multiple symbols separated by a forward slash or by a run of dots (for instance, `DDR1.....MIR4640`, `MIR5193.....UBA7`, `LINC00152.....LOC101930489`). Is it possible to count the total number of rows containing such a multi-string pattern in R?

Example input dataset:

``````
dput(data.matrix_v2)
structure(list(GSM647547 = c(0.776, 1.916, 1.004, 1.2, 1.008,
0.805, 0.851, 1.082, 2.02, 1.03, 1.024, 1.043, 0.941, 1.215,
1.109, 1.138, 1.007, 1.244, 1.254, 0.995), GSM647552 = c(1.004,
1.741, 0.968, 1.276, 1.126, 1.772, 1.318, 1.067, 0.341, 0.88,
1.288, 0.958, 1.354, 1.939, 1.65, 1.738, 1.058, 0.827, 0.925,
1.122), GSM647553 = c(0.96, 1.4, 0.437, 1.19, 1.092, 0.872, 0.821,
1.042, 0.426, 0.949, 1.08, 0.92, 1.107, 1.543, 1.18, 1.053, 0.971,
0.663, 1.091, 1.146), GSM647565 = c(1.358, 1.207, 1.254, 1.068,
1.043, 0.757, 0.999, 1.254, 1.055, 1.025, 1.036, 1.383, 1.035,
1.174, 1.271, 0.958, 1.158, 1.571, 1.509, 1.026)), class = "data.frame", row.names = c("DDR1.....MIR4640",
"RFC2", "HSPA6", "PAX8", "GUCA1A", "MIR5193.....UBA7", "THRA",
"PTPN21", "CCL5", "CYP2E1", "EPHB3", "ESRRA", "CYP2A6", "SCARB1",
"TTLL12", "LINC00152.....LOC101930489", "WFDC2", "MAPK1", "MAPK1.1",
``````

Expected Output

Print total number of rows containing a multi string pattern

Thank you,

Toufiq

Hi,

Here is a way of doing that with RegEx

``````
library(tidyverse)
# Data
myData = structure(
list(
GSM647547 = c(0.776, 1.916, 1.004, 1.2, 1.008,
0.805, 0.851, 1.082, 2.02, 1.03, 1.024, 1.043, 0.941, 1.215,
1.109, 1.138, 1.007, 1.244, 1.254, 0.995),
GSM647552 = c(1.004,
1.741, 0.968, 1.276, 1.126, 1.772, 1.318, 1.067, 0.341, 0.88,
1.288, 0.958, 1.354, 1.939, 1.65, 1.738, 1.058, 0.827, 0.925,
1.122),
GSM647553 = c(0.96, 1.4, 0.437, 1.19, 1.092, 0.872, 0.821,
1.042, 0.426, 0.949, 1.08, 0.92, 1.107, 1.543, 1.18, 1.053, 0.971,
0.663, 1.091, 1.146),
GSM647565 = c(1.358, 1.207, 1.254, 1.068,
1.043, 0.757, 0.999, 1.254, 1.055, 1.025, 1.036, 1.383, 1.035,
1.174, 1.271, 0.958, 1.158, 1.571, 1.509, 1.026)),
class = "data.frame",
row.names = c("DDR1.....MIR4640",
"RFC2", "HSPA6", "PAX8", "GUCA1A", "MIR5193.....UBA7", "THRA",
"PTPN21", "CCL5", "CYP2E1", "EPHB3", "ESRRA", "CYP2A6", "SCARB1",
"TTLL12", "LINC00152.....LOC101930489", "WFDC2", "MAPK1", "MAPK1.1",

# Total number of multi strings
myFilter = str_detect(rownames(myData), "\\.{2,}|\\/")
sum(myFilter)
#> [1] 3

myData  = myData %>% filter(myFilter)
myData
#>                            GSM647547 GSM647552 GSM647553 GSM647565
#> DDR1.....MIR4640               0.776     1.004     0.960     1.358
#> MIR5193.....UBA7               0.805     1.772     0.872     0.757
#> LINC00152.....LOC101930489     1.138     1.738     1.053     0.958
``````

Created on 2021-07-06 by the reprex package (v2.0.0)

Hope this helps,
PJ

The RegEx pattern is `\.{2,}|\/` (written as `"\\.{2,}|\\/"` inside an R string, where each backslash has to be doubled). It matches any row name that contains either two or more consecutive dots or a forward slash. I opted for two or more dots because one row name contains a single dot (`MAPK1.1`), and I don't think that represents two genes.
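For anyone who prefers to avoid the tidyverse dependency, the same count can probably be obtained with base R's `grepl()`. The sketch below is only an illustration: it works on a plain character vector holding the row names shown in the question rather than on the full data frame, and the entry `GENEA/GENEB` is a made-up name added purely to exercise the forward-slash branch of the pattern, since none of the example rows contain a slash.

```r
# Row names copied from the example in the question
genes <- c("DDR1.....MIR4640", "RFC2", "HSPA6", "PAX8", "GUCA1A",
           "MIR5193.....UBA7", "THRA", "PTPN21", "CCL5", "CYP2E1",
           "EPHB3", "ESRRA", "CYP2A6", "SCARB1", "TTLL12",
           "LINC00152.....LOC101930489", "WFDC2", "MAPK1", "MAPK1.1")

# Hypothetical entry (not in the original data) to test the "/" alternative
genes <- c(genes, "GENEA/GENEB")

# Two or more consecutive dots, OR a forward slash.
# The slash needs no escaping here; the dot does.
pattern <- "\\.{2,}|/"

# Logical vector: TRUE where the name looks like a multi-gene entry
is_multi <- grepl(pattern, genes)

# Total number of multi-string entries
n_multi <- sum(is_multi)
n_multi
#> [1] 4

# The matching names; "MAPK1.1" is not among them because it has a single dot
genes[is_multi]
```

On a real data frame the same idea would be `sum(grepl(pattern, rownames(myData)))`, and `myData[grepl(pattern, rownames(myData)), ]` would keep only the matching rows.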


@pieterjanvc, thank you very much. Very helpful.
