Calculate the number of k-mers

Hello everyone

I have a genome sequence and need to count its k-mers.
For example, how many times does each sequence of 4 letters occur? I need a solution that does not use ready-made functions like kcount from ape library.

Thanks for your help

Hi @AsiyaV,

Welcome to the RStudio community! :wave:

Let's see if we can help you out here. I think the first thing we'd like to see is a reproducible example (or reprex) as described in this article.

Based on my very limited knowledge of genome sequences, I suspect you are working with a long character string of some sort, and that your question amounts to: "How do you identify certain sub-sequences within a character column of a dataframe?" If your goal is a De Bruijn graph, I'm not sure I can help you here. But if it's the simpler use case of picking out a specific set of letters, you could try something like this:

library(tidyverse)

# An example dataset
genome_sequence <- tribble(
  ~sequence,
  "AGTCGTAGATGCTT",
  "AGTCGTGCTGAGAT",
  "AGAGATCGTGCTGA"
)

# create a new dataframe containing the sequences that match "GAGA"
specific_sequence <- genome_sequence %>% 
  filter(grepl("GAGA", sequence)) # grepl() is useful for searching character variables in R

# Or write a function that takes any sub-sequence and returns the rows of a
# dataframe (genome_sequence by default) whose sequence contains it
sequence_checker <- function(sub_sequence, data = genome_sequence) {
  data %>% 
    # fixed = TRUE matches the sub-sequence literally instead of as a regex
    filter(grepl(sub_sequence, sequence, fixed = TRUE))
}

# And then call that function with your input sequence
GAGA_genomes <- sequence_checker("GAGA")
TT_genomes <- sequence_checker("TT")

# Check how many sequences matched that sub sequence
nrow(GAGA_genomes)
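
Note that grepl() only reports whether a sub-sequence is present in each row, not how many times it occurs. If what you need is the number of occurrences of every k-mer in a single sequence, here is a minimal base-R sketch. It assumes the genome is held as one character string, and it counts overlapping windows. The name count_kmers is made up for illustration and is not from any package.

```r
# Hypothetical helper (not from any package): count every overlapping
# k-mer in a single sequence string using only base R.
count_kmers <- function(sequence, k = 4) {
  n <- nchar(sequence)
  # A sequence shorter than k contains no k-mers at all
  if (n < k) {
    return(setNames(integer(0), character(0)))
  }
  # Start position of each window of width k
  starts <- seq_len(n - k + 1)
  # substring() is vectorised over its start and end positions
  kmers <- substring(sequence, starts, starts + k - 1)
  # Tabulate the windows and return a plain named integer vector
  tab <- table(kmers)
  setNames(as.vector(tab), names(tab))
}

# Count all 4-mers in a short example sequence
counts <- count_kmers("AGAGAGA", k = 4)

# Number of times one particular 4-mer occurs (overlaps included)
counts[["GAGA"]]

# All k-mers, most frequent first
sort(counts, decreasing = TRUE)
```

For a whole genome this builds one string per window, so memory use grows with the sequence length. That should be fine for a first attempt, but a very long sequence may need to be processed in chunks.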

It would be interesting to hear why you need to avoid ready-made functions. In general, programmers tend to optimise their time by incorporating the work of others rather than reinventing the wheel.
