Using the tidyverse to count the number of times a string appears in a table.

I'm attempting to count the number of times a particular string appears in a table. I have between 20 and 70 files (tab-delimited tables) in a folder. Each file contains a data frame with over 1,000 strings, one string per row. I'm trying to count the number of times the string in each row appears in the other rows of the same file. The code below reports the number of times a manually designated string appears in a file, either as an exact repeat or as a substring of a longer string. Unfortunately, because I have to enter the target string manually, it amounts to doing a manual search in an Excel file. I need the script to check each string against all the rows on its own and report only the strings that appear at least once in another row. The code below is what I have; it iterates through each row of a file before moving on to the next file.

Example of the input file: each input file is basically a single-column data frame, and each row holds a single character string, e.g. attcctc. The column can have over a thousand rows. Each file is tab-delimited, its name ends with .txt, and there could be anywhere between 10 and 90 different files in a folder. Below is a small example of what a file called sample01.txt would contain. The numerals are just the row index.

sample01.txt
1. TTCAGGTTACGT
2. TCGGGATTACACCC
3. ACGGGATAACACCTCG
4. GCCAT
5. GGGTTACS

etc.

    library(stringr)
    library(tidyverse)

    files <- list.files(path = "/mnt/data/TCR_PROCESSING-ISEQ100/Data/Processed_data/samples_by_name",
                        pattern = ".txt")

    for (i in files) {
      print(i)
      data <- read.table(file = paste0("/mnt/data/TCR_PROCESSING-ISEQ100/Data/Processed_data/samples_by_name/", i),
                         sep = "\t", header = TRUE)
      # Collapse the single-column data frame into a character vector
      clones <- deframe(data)
      # Count hits of the (manually entered) target string in every row
      number.seq <- str_count(clones, "TGTGC")
      # Subtract 1 so a string found exactly once reports 0 repeats
      repeats <- sum(number.seq) - 1
      print(repeats)
    }

The code above, which checks for a single string ("TGTGC" in the code), would yield something like this. It displays the count of the times that string appears in each file after checking every row of the file. Note that it counts the string even when it is part of a longer string, e.g. "ACTGTGCAA".

 sample01.txt
 0
 sample02.txt
 0
 sample03.txt
 1

The above result tells me that the string "TGTGC" was found in all three files, but since I only care about strings that appear more than once, the useful finding is that it is repeated only in "sample03.txt". What I really want is a routine where the code checks each string of each file for repeats in all of the files, one by one. As it stands, I have to enter the string in the code manually, which works fine if I'm only looking for one string, but now I have to check each and every string in each and every file.

Example desired output:

"TTCAGGTTACGT"
sample01.txt
1
sample02.txt
3
sample05.txt
1

"TCGGGATTACACCC"
sample01.txt
1
sample03.txt
2
sample04.txt
1
...... etc.

I really don't want the report to include entries where the check yields a 0 or a -1, just those where the count is > 0. I hope this helps clarify.

I hope someone can give me some advice on how to accomplish this. I recently started programming in R and I'm not familiar with all its functions or libraries.

Hi,

Could you elaborate a bit on what you'd like to search for, and where? This is a bit confusing.

Try to create a reprex where you generate two or three data frames at the start that represent the files to be searched (or part of each file), plus one that plays the role of the Excel file. Then show us what you'd like to count in each file and what the output should look like.

For example, if a string appears multiple times in one row in a file, but also in different rows in the same file, how should this be reported...

A reprex consists of the minimal code and data needed to recreate the issue you're having. You can find instructions on how to build and share one here:

Good luck
PJ

In particular, I can't follow where nt.data came from. Again, it would be great if you could include an example data file.

A few unrelated tips: list.files has a parameter full.names which you can set to TRUE to return full paths - then you would not need to paste the file name together again in your loop. Also, when constructing file paths, the file.path function is preferable to paste0.
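
For example, a quick sketch combining both tips (using the folder from your script):

files <- list.files(path = file.path("/mnt/data/TCR_PROCESSING-ISEQ100/Data/Processed_data",
                                     "samples_by_name"),
                    pattern = "\\.txt$", full.names = TRUE)

# With full.names = TRUE each element is already a complete path,
# so read.table(files[1], sep = "\t") works without rebuilding it via paste0()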

I apologize for not being clear; it is a bit confusing. I edited the statement for clarification. I'm starting out with a number of files in a folder. Each file is made up of a single data frame containing only one column, and each row of that column holds exactly one string. However, there could be several thousand rows (strings) in each data frame (file). The code I included works well, but I can only check for one string at a time, which I have to enter manually. I need the code to loop (or whatever is best): pick one string, check it against all the rows in all the files, report the number of times it is repeated in each file, then move on to the next string, and so on until it has checked all the strings in all the files. The final report should display the string checked and the file names where the count is at least 1 (meaning the string appears more than once, hence the -1 in the count line).

I like that advice. Thanks.

Thanks pieterjanvc. I added a little more information regarding the input file structure and the desired output. I hope this helps you build a sample file or two. You can just use random characters, but because I'm dealing with DNA nucleotides, the strings are of random lengths and made up of only 4 letters.

Sorry for the nt.data. I corrected the name to just "data".

Hi,

Here is an example of an implementation

library(stringr)
library(tidyverse)

options(stringsAsFactors = F) # so we don't have to repeat it multiple times

#Simulate some files
set.seed(1) #for reproducibility

file1 = data.frame(x = sapply(1:7, function(x) {
  paste(sample(c("C", "T", 'A', "G"), 600, replace = T), 
        collapse = "")}
))

file2 = data.frame(x = sapply(1:10, function(x) {
  paste(sample(c("C", "T", 'A', "G"), 200, replace = T), 
        collapse = "")}
))

write.table(file1, "file1.txt", row.names  = F, col.names = F)
write.table(file2, "file2.txt", row.names  = F, col.names = F)

#Read in all files and reduce to one long string per file (space between strings)
allFiles = list.files(path = ".", full.names = T, pattern = ".txt")
allFiles = lapply(lapply(lapply(allFiles, read.table), deframe), paste, collapse = " ")

#List of strings to search
myStrings = c("CTAG", "CCCA", "TATA", "CTGCCCA")

#Create table with rows as files, columns as string counts
finalData = map_dfc(myStrings, function(myString){
  tibble(!!sym(myString) := sapply(allFiles, str_count, pattern = myString) - 1)
}) %>% mutate(file = list.files(pattern = ".txt"))

Which will generate this output:

finalData
# A tibble: 2 x 5
   CTAG  CCCA  TATA CTGCCCA file     
  <dbl> <dbl> <dbl>   <dbl> <chr>    
1    17    16    11       0 file1.txt
2     8     5    10      -1 file2.txt

I must say I used some more advanced coding to avoid loops and make everything run as fast as possible.

  • The first part just generates some dummy files to work with, you don't have to run this step of course
  • We then read in all files (you need to replace the path name with your folder)
  • Next, the files are collapsed into one long string per file (each string separated by a space so the search won't match across string boundaries)
  • Finally we run every search string over all the files (now each one big string) and make the counts. Note that since you want no repetitions to be value 0, if a string is not found at all the value will be -1

You'll probably be able to understand most of the code, but the !!sym(myString) := in the last part might be confusing, as it is some tidyverse magic to set column names on the fly. If you don't like or understand it, you can rewrite the function as follows:

finalData = map_dfc(myStrings, function(myString){
  data.frame(string = sapply(allFiles, str_count, pattern = myString) - 1)
}) %>% mutate(file = list.files(pattern = ".txt"))
colnames(finalData) = c("CTAG", "CCCA", "TATA", "CTGCCCA")

Hope this helps,
PJ


Since I'm still low on the learning curve of R, anything that is newer is welcome.

One last thing I forgot to mention, since it was asked: if a target string is detected twice in a single row, it should be counted twice. That is, I'm counting the total number of hits for the target string.
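
As a quick sanity check, str_count() already behaves this way: it counts every non-overlapping hit within a string, so a row containing the target twice contributes 2.

library(stringr)

# two separate hits in one row are both counted
str_count("AATGTGCTTTGTGCAA", "TGTGC")
#> [1] 2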

I'll give it a try in a few minutes and report back. Thanks so much.

PJ, I'm not sure, but it seems you have limited the search to four strings. Is this correct? If so, I really just showed four strings as an example. The problem is that I have thousands of strings, each occupying a row in the data frame of a single file. Plus, there could be up to 96 files, each with a couple of thousand rows or so.

What I need is to have the script pull the first string, check it against all the rows (strings) in its own file and all the strings in the other files as well. Then the script takes the second string (2nd row) and does the same. This should be repeated for every row in every file, so if I have only 3 files, each with 2,000 strings, each string of each file will be checked against 6,000 rows. The number of rows in each file can be different: some files may have 2,200 rows (strings), others 3,000, etc.

I understand not wanting to use loops. It's something from my Python and OOP practice. I understand R is much faster for data analysis in part because it avoids loops. Unfortunately, it has a steep learning curve and I'm not familiar with many of its basic functions or special library functions. Any suggestions on what techniques and learning sources I can use to learn R a bit faster?

Default go to: https://r4ds.had.co.nz/

Great! Thanks, Leon. I'll look it up. I gravitated towards data analysis and now I find myself having to learn R for speed and efficiency.

Be a bit careful with statements about the efficiency (or inefficiency) of loops in R, because they can be ambiguous or misleading.

Looping across rows is mostly bad and unnecessary in R (there are exceptions). This is a mistake many people make when coming from other languages: writing a row loop when the R function is already vectorised.

Looping across columns is generally ok, but there are usually neater ways of doing so (e.g. the apply family of functions in base, purrr in the tidyverse, etc.). These do not make the code faster or more efficient, though; see the sketch below.
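
For instance, a minimal sketch of the same count written both ways (str_count is just an illustration here):

library(stringr)

seqs <- c("TTGTGCA", "ACGT", "TGTGCTGTGC")

# row-by-row loop: works, but unnecessary
counts <- integer(length(seqs))
for (i in seq_along(seqs)) counts[i] <- str_count(seqs[i], "TGTGC")

# the same result in one call, because str_count() is vectorised
counts <- str_count(seqs, "TGTGC")
counts
#> [1] 1 0 2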

If you are interested in performance then you should also check out data.table.
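
To give a taste, a minimal data.table sketch of a grouped count (the tiny table here is made up purely for illustration):

library(data.table)
library(stringr)

dt <- data.table(file = c("a.txt", "a.txt", "b.txt"),
                 seq  = c("TGTGCA", "ATGTGC", "ACGT"))

# total hits of one pattern per file, computed by group
dt[, .(hits = sum(str_count(seq, "TGTGC"))), by = file]
#>     file hits
#> 1: a.txt    2
#> 2: b.txt    0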

Thanks. I wasn't aware of these details.

Hi,

@Pioneer82 : You can use as many strings as you like, just update the variable myStrings. In that case, though, it might be better to have the strings as rows and the files as columns, because it's more natural to have many rows than many columns (and it should be easier to work with too); see the sketch below. You can also use as many files as you like, see the first allFiles assignment where all files in a folder are read (make sure the folder only contains .txt files of interest). Let me know how that goes...
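
For example, a sketch that reshapes the finalData object from my earlier post into the long format described in the question (one row per string/file pair, keeping only counts > 0):

finalData %>%
  gather(string, count, -file) %>%  # one row per string/file pair
  filter(count > 0) %>%             # drop the 0 and -1 entries
  arrange(string, file)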

@martin.R: I'm sorry for my misinterpretation of the efficiency of apply statements. I hope I did not confuse anybody; I will read some more about it myself, as it's clear I did not get the details right. Luckily it doesn't change anything in the workflow of the current code.

PJ

Hi @Pioneer82,

Take a look at this:

# Create example data using base R
# ------------------------------------------------------------------------------

# Set some params
set.seed(610868)
n_files = 100
n_seqs = 1000
l_seqs = 5
dna = c('A', 'C', 'T', 'G')
out_dir = '~/tmp'

# Create files
for( i in 1:n_files ){
  seqs_for_file = rep(NA, n_seqs)
  for( j in 1:n_seqs ){
    seqs_for_file[j] = paste0(sample(x = dna, size = l_seqs, replace = TRUE),
                              collapse = '')
  }
  write.table(x = seqs_for_file, file = paste0(out_dir, '/seqs_', i, '.txt'),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}

# Analyse files using Tidyverse
# ------------------------------------------------------------------------------

# Load libraries
library('tidyverse')

# Get input file names
target_files = list.files(path = out_dir, pattern = "\\.txt$")

# Read files into list of tibbles
d = map(str_c(out_dir, '/', target_files), read_table, col_names = FALSE)

# Set names in list to target files
names(d) = target_files

# To each tibble, add a variable with the target file name
d = lapply(names(d), function(n_i){ d[[n_i]] = mutate(d[[n_i]], f = n_i) })

# Combine to one tibble and set variable names
d = d %>% bind_rows %>% rename(input_file = f, seq = X1)

# Create count table
seq_count_table = d %>%
  count(input_file, seq) %>%
  spread(seq, n) %>%
  mutate_if(is.numeric, replace_na, 0)

Yielding:

> seq_count_table
# A tibble: 100 x 1,025
   input_file AAAAA AAAAC AAAAG AAAAT AAACA AAACC AAACG AAACT AAAGA AAAGC AAAGG AAAGT AAATA AAATC AAATG AAATT AACAA
   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 seqs_1.txt     6     1     1     0     1     1     2     0     2     1     3     1     1     2     0     2     2
 2 seqs_10.t…     0     0     0     0     1     1     3     2     0     2     0     1     0     2     0     0     1
 3 seqs_100.…     0     0     0     0     0     0     1     0     3     1     0     1     2     0     2     1     1
 4 seqs_11.t…     2     1     0     2     0     1     0     1     1     1     0     2     1     1     1     0     1
 5 seqs_12.t…     2     1     2     0     2     1     1     0     0     1     2     0     0     1     2     0     2
 6 seqs_13.t…     1     0     1     1     0     1     0     2     1     1     0     0     1     0     2     0     1
 7 seqs_14.t…     0     1     2     1     0     1     0     1     0     1     1     1     0     4     1     0     1
 8 seqs_15.t…     0     2     1     0     0     3     0     0     3     3     0     0     0     0     0     1     1
 9 seqs_16.t…     1     1     0     0     2     0     0     1     0     1     0     0     1     0     1     3     0
10 seqs_17.t…     0     2     0     1     1     1     2     0     0     1     1     3     1     2     2     2     1
# … with 90 more rows, and 1,007 more variables: AACAC <dbl>, AACAG <dbl>, AACAT <dbl>, AACCA <dbl>, AACCC <dbl>,
#   AACCG <dbl>, AACCT <dbl>, AACGA <dbl>, AACGC <dbl>, AACGG <dbl>, AACGT <dbl>, AACTA <dbl>, AACTC <dbl>,
#   AACTG <dbl>, AACTT <dbl>, AAGAA <dbl>, AAGAC <dbl>, AAGAG <dbl>, AAGAT <dbl>, AAGCA <dbl>, AAGCC <dbl>,
#   AAGCG <dbl>, AAGCT <dbl>, AAGGA <dbl>, AAGGC <dbl>, AAGGG <dbl>, AAGGT <dbl>, AAGTA <dbl>, AAGTC <dbl>,
#   AAGTG <dbl>, AAGTT <dbl>, AATAA <dbl>, AATAC <dbl>, AATAG <dbl>, AATAT <dbl>, AATCA <dbl>, AATCC <dbl>,
#   AATCG <dbl>, AATCT <dbl>, AATGA <dbl>, AATGC <dbl>, AATGG <dbl>, AATGT <dbl>, AATTA <dbl>, AATTC <dbl>,
#   AATTG <dbl>, AATTT <dbl>, ACAAA <dbl>, ACAAC <dbl>, ACAAG <dbl>, ACAAT <dbl>, ACACA <dbl>, ACACC <dbl>,
#   ACACG <dbl>, ACACT <dbl>, ACAGA <dbl>, ACAGC <dbl>, ACAGG <dbl>, ACAGT <dbl>, ACATA <dbl>, ACATC <dbl>,
#   ACATG <dbl>, ACATT <dbl>, ACCAA <dbl>, ACCAC <dbl>, ACCAG <dbl>, ACCAT <dbl>, ACCCA <dbl>, ACCCC <dbl>,
#   ACCCG <dbl>, ACCCT <dbl>, ACCGA <dbl>, ACCGC <dbl>, ACCGG <dbl>, ACCGT <dbl>, ACCTA <dbl>, ACCTC <dbl>,
#   ACCTG <dbl>, ACCTT <dbl>, ACGAA <dbl>, ACGAC <dbl>, ACGAG <dbl>, ACGAT <dbl>, ACGCA <dbl>, ACGCC <dbl>,
#   ACGCG <dbl>, ACGCT <dbl>, ACGGA <dbl>, ACGGC <dbl>, ACGGG <dbl>, ACGGT <dbl>, ACGTA <dbl>, ACGTC <dbl>,
#   ACGTG <dbl>, ACGTT <dbl>, ACTAA <dbl>, ACTAC <dbl>, ACTAG <dbl>, ACTAT <dbl>, ACTCA <dbl>, …

Each row is a file and each column is a sequence, so now you can check sequence occurrence across files like so:

> seq_count_table %>% select(input_file, ATGCT) %>% filter(ATGCT > 0) %>% arrange(-ATGCT)
# A tibble: 67 x 2
   input_file  ATGCT
   <chr>       <dbl>
 1 seqs_37.txt     5
 2 seqs_23.txt     4
 3 seqs_1.txt      3
 4 seqs_29.txt     3
 5 seqs_38.txt     3
 6 seqs_46.txt     3
 7 seqs_59.txt     3
 8 seqs_71.txt     3
 9 seqs_85.txt     3
10 seqs_91.txt     3
# … with 57 more rows
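
Side note: in newer versions of tidyr, spread() is superseded by pivot_wider(). Assuming tidyr >= 1.1.0, the count-table step above could also be written as:

seq_count_table = d %>%
  count(input_file, seq) %>%
  pivot_wider(names_from = seq, values_from = n, values_fill = 0)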

Hope it helps 🙂

Hi @Pioneer82. This looks interesting. I suggest code like the following. The final result will be a named list (one element per repeated seq) of tibbles, each containing the file names and the number of repeats.

I generated the test files with this code:

for (i in 1:70) {
  replicate(1000, paste(sample(c("A", "T", "C", "G"), floor(runif(1, min = 5, max = 15)), replace = TRUE), collapse = ""), simplify = TRUE) %>%
    writeLines(paste0("sample", i, ".txt"))
}

In your case the code may run for a few minutes. I tested it with 10 files of 1,000 sequences each, and it ran in about 30 seconds.

library(tidyverse)

files <- list.files("./", pattern = ".txt")

df <- map_dfr(files, ~{
  data.frame(seq = readLines(.x), file = .x, stringsAsFactors = FALSE)
})

df %>%
  # add one count column per unique sequence: hits of that sequence in every row
  bind_cols(map_dfc(unique(.$seq), function(x, seq) {
    `colnames<-`(data.frame(x = str_count(seq, x)), x)
  }, seq = .$seq)) %>%
  select(-seq) %>%
  # total hits of each sequence per file
  group_by(file) %>%
  summarise_all(sum) %>%
  # long format: one row per file/sequence pair
  gather(seq, value, -file) %>%
  # subtract 1 so a sequence found exactly once yields 0
  mutate(value = value - 1) %>%
  # keep only sequences that are actually repeated
  filter(value > 0) %>%
  # split into one tibble per repeated sequence, named by the sequence
  group_by(seq) %>%
  nest() %>%
  {`names<-`(.$data, .$seq)}

Sorry it has taken me a while to reply, Leon. I'll test your code on my model and report my observations. Thanks.