I'm trying to count how many times a particular string appears in a table. I have between 20 and 70 files (tab-delimited tables) in a folder. Each file contains a data frame with over 1,000 strings, one string per row. For each file, I want to count the number of times the string in each row appears in the other rows of that file. The code below reports the number of times a manually designated string appears in the file, either as an exact repeat or as a component of a longer string. Unfortunately, because I have to enter the target string manually, it amounts to doing a search in an Excel file. I need the script to check each string against all the rows on its own and report only the strings that appear one or more times in another row. The code below is what I have. It iterates through each row of a file before it moves on to the next file.
Example of the input file: each input file is basically a single-column data frame. Each row in the column consists of a single character string, e.g., attcctc. The column could have over a thousand rows. Each file is tab-delimited and has a name ending in .txt, and there could be anywhere between 10 and 90 files in a folder. Below is a small example of what a file called sample01.txt would contain. The numerals are just the row index.
sample01.txt
1. TTCAGGTTACGT
2. TCGGGATTACACCC
3. ACGGGATAACACCTCG
4. GCCAT
5. GGGTTACS
etc.
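library(stringr)
library(tidyverse)

# Folder that holds the tab-delimited sample files
data_dir <- "/mnt/data/TCR_PROCESSING-ISEQ100/Data/Processed_data/samples_by_name"

# "\\.txt$" matches only names ending in .txt (a bare ".txt" treats "." as any character)
files <- list.files(path = data_dir, pattern = "\\.txt$")

for (i in files) {
  print(i)
  data <- read.table(file = file.path(data_dir, i), sep = "\t", header = TRUE)
  # Iterating over a data frame walks its columns, so with one column this runs once
  for (t in unique(data)) {
    clones <- deframe(data)                   # single column as a plain vector
    number.seq <- str_count(clones, "TGTGC")  # matches per row, substrings included
    repeats <- sum(number.seq) - 1            # subtract the string's own occurrence
    print(repeats)
  }
}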
The code above, which checks for a single string (in the code, "TGTGC"), would yield something like this. It displays the count of the times that string appears in each file after checking for its presence in each row of the file. Furthermore, the program finds the string even when it's part of a longer string, e.g., "ACTGTGCAA".
sample01.txt
0
sample02.txt
0
sample03.txt
1
The above result tells me that the string "TGTGC" was found in all three files, but since I only want the cases where it appears more than once, the only file with a repeat is "sample03.txt". What I really want is a routine that checks every string in every file for repeats, file by file. As it stands right now, I have to enter the string in the code manually. This works fine if I'm only looking for one string, but now I have to check each and every string in each and every file.
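To make the goal concrete, here is a minimal sketch of that check for the contents of a single file. It applies the same `sum(str_count(...)) - 1` idea to every distinct string instead of one hard-coded pattern. The helper name `count_repeats` and the small `clones` vector are made up for illustration, and the sketch assumes there are no empty or missing strings. `fixed()` is used so that each string is matched literally rather than as a regular expression.

```r
library(stringr)

# Hypothetical helper: for every distinct string in `clones`, count its
# occurrences across all rows (exact matches and substrings), minus one
# for the row the string itself came from.
count_repeats <- function(clones) {
  clones <- as.character(clones)
  targets <- unique(clones)
  vapply(
    targets,
    function(s) sum(str_count(clones, fixed(s))) - 1L,
    integer(1)
  )
}

# Made-up example data: "TGTGC" appears twice on its own and once inside
# a longer string.
clones <- c("TGTGC", "ACTGTGCAA", "GCCAT", "TGTGC")
repeats <- count_repeats(clones)

# Keep only the strings that show up somewhere else in the column
print(repeats[repeats > 0])
```

Each string is compared against every row, so the work grows with the square of the number of rows. For columns of a few thousand strings that should still be manageable.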
Example desired output:
"TTCAGGTTACGT"
sample01.txt
1
sample02.txt
3
sample05.txt
1
"TCGGGATTACACCC"
sample01.txt
1
sample03.txt
2
sample04.txt
1
...... etc.
I really don't want the report to show the cases where the check yields a "0" or a "-1", only those where it's >0. I hope this helps clarify.
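As a rough sketch of the report I have in mind, the example below runs the check on every file in a folder, drops the 0 results, and prints the rest grouped by string. The demo folder, the two tiny files, and the `clone` header are invented so that the example runs on its own. It assumes each file has a header row, as in my `read.table` call above, and contains no empty strings. `colClasses = "character"` is there so that a row such as "T" is not read as a logical value.

```r
library(stringr)

# Hypothetical demo folder with two small single-column files (header row "clone")
folder <- file.path(tempdir(), "samples_demo")
dir.create(folder, showWarnings = FALSE)
writeLines(c("clone", "TGTGC", "ACTGTGCAA", "GCCAT"), file.path(folder, "sample01.txt"))
writeLines(c("clone", "GCCAT", "GCCAT", "TTTT"), file.path(folder, "sample02.txt"))

files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# One row per (string, file) pair with the number of repeats in that file
results <- do.call(rbind, lapply(files, function(f) {
  clones <- read.table(f, sep = "\t", header = TRUE, colClasses = "character")[[1]]
  targets <- unique(clones)
  repeats <- vapply(
    targets,
    function(s) sum(str_count(clones, fixed(s))) - 1L,
    integer(1)
  )
  data.frame(
    string = targets,
    file = rep(basename(f), length(targets)),
    repeats = unname(repeats),
    stringsAsFactors = FALSE
  )
}))

# Drop everything that is not repeated
results <- results[results$repeats > 0, ]

# Print the report grouped by string: the string, then file name and count
for (s in unique(results$string)) {
  cat('"', s, '"\n', sep = "")
  hits <- results[results$string == s, ]
  cat(paste0(hits$file, "\n", hits$repeats, "\n"), sep = "")
}
```

Keeping the counts in a data frame before printing means the same table could also be written out with `write.table` instead of being printed.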
I hope someone can give me some advice on how to accomplish this. I recently started programming in R and I'm not familiar with all its functions or libraries.