How to count the occurrence of a character and then find its percentage

iMayank · June 9, 2018, 6:08am

Hi all,
As we know, a set of three nucleotides (a codon) codes for an amino acid. For example, only ATG codes for M (methionine), while ATC, ATA and ATT all code for I (isoleucine).
If synonymous codons are weighted equally, the fraction of ATG for coding M is always 1, and the fraction of ATC for coding I is always 0.33, as it is for ATA and ATT.
I want to write a function that counts the codons in a sequence and then calculates each codon's percentage frequency for forming its particular amino acid.

codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T", 
              ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K", 
              AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L", 
              CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P", 
              CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R", 
              CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V", 
              GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D", 
              GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G", 
              GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F", 
              TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop", 
              TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")


( fracs <- 1/table(unlist(codon)) )

codonfracs <- setNames(lapply(codon, function(x) unname(fracs[x])), names(codon))
str(head(codonfracs))
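
As a quick check of the 1 and 0.33 figures above, here is a minimal sketch on a reduced lookup holding only the isoleucine and methionine codons. The names `mini_codon`, `n_synonyms` and `mini_fracs` are made up for this illustration, and the result is the equal-weight share per codon, not an observed frequency.

```r
# Reduced lookup for illustration only: three isoleucine codons plus methionine
mini_codon <- c(ATA = "I", ATC = "I", ATT = "I", ATG = "M")

# Number of synonymous codons per amino acid
n_synonyms <- table(mini_codon)

# Equal-weight share of each codon within its amino acid: 1 / number of synonyms
mini_fracs <- 1 / as.vector(n_synonyms[mini_codon])
names(mini_fracs) <- names(mini_codon)

print(round(mini_fracs, 2))
```

ATA, ATC and ATT each come out at about 0.33 and ATG at 1, because isoleucine has three codons in this lookup and methionine has one.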

s <- 'AAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGC
TTAGAGACTCCGAAATCAACGACGACTTCCACCAGTGGGCCCAGTGACCGCCACACTGGA
CCCCATACCACTTCTTTTTGTTATTCTTAAATATGTT
'

strsplit3 <- function(s, k=3) {
  starts <- seq.int(1, nchar(s), by=k)
  stops <- c(starts[-1] - 1, nchar(s))
  mapply(substr, s, starts, stops, USE.NAMES=FALSE)
}
# strip the newlines embedded in s before splitting into triplets
strsplit3(gsub("\\s", "", s))

I have split my input sequence into frames of 3. Please guide me in finding the count of each codon in the sequence and its percentage of occurrence. The output I am looking for is a table with four columns: the codon, the amino acid it codes for, the codon count, and the percentage of occurrence for forming that amino acid.
Thank you.

Stephen · June 14, 2018, 8:35pm

percentage of occurrence for forming that amino acids.

It's not perfectly clear what that means, but I took a guess. Your best bet is probably to turn your lookup list into a data frame, then join it to a data frame of codons. Then the summarization will just take some thinking about group_by, summarize and mutate.

Starting from your code:

library(tidyverse)

### Make a tibble of individual characters, one per row (not used in the steps below)
ind_codons <- tibble(value = strsplit(s, "")[[1]])

### Grouping into 3s
# start with the sequence string
codons <- s %>%
    # remove newline characters (and any other whitespace)
    str_remove_all("\\s") %>% 
    # split into sets of 3
    strsplit3() %>% 
    # turn it into a one-column tibble with the column named codon
    tibble(codon = .)

### Now you have a tibble with one row per codon, in sequence order.
    
### Turn the amino acid map from a list into a tibble
# enframe() keeps the vector names (the codons) as a column
amino_acid_lookup <- unlist(codon) %>% 
    enframe(name = "codon", value = "amino_acid")

### Join the two (the inner join drops anything not in the lookup, e.g. a trailing partial codon)
combined <- codons %>% 
    inner_join(amino_acid_lookup, by = "codon")

### Output stats (all "percent" columns are proportions between 0 and 1; multiply by 100 if needed):
summary_stats <- 
    combined %>% 
    group_by(codon, amino_acid) %>% 
    # Number of occurrences of each codon
    summarize(codon_count = n()) %>% 
    ungroup() %>%
    # Codon frequency out of all codons
    mutate(codon_percent = codon_count/sum(codon_count)) %>% 
    # Number of occurrences of each amino acid, summed over its codons
    group_by(amino_acid) %>% 
    mutate(amino_acid_count = sum(codon_count)) %>% 
    # Share of each amino acid's occurrences that come from this codon (my reading of the question)
    mutate(amino_acid_percent_by_codon = codon_count/amino_acid_count) %>% 
    ungroup() %>% 
    # Amino acid frequency out of all amino acids (I don't think you need this value)
    mutate(amino_acid_percent_overall = amino_acid_count/sum(codon_count))
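
If you want to sanity-check the numbers without the tidyverse, here is a rough base R sketch of the same counting logic on a tiny made-up sequence. The names `toy_seq`, `toy_lookup` and `codon_usage` are hypothetical, the lookup only covers the codons in the toy sequence, and the last column reflects my guess at what "percentage of occurrence" means.

```r
# Toy inputs (assumptions for illustration, not the data from the question)
toy_seq <- "ATGATTATTATCATG"
toy_lookup <- c(ATG = "M", ATT = "I", ATC = "I", ATA = "I")

codon_usage <- function(dna, lookup) {
  # Strip whitespace, then cut into non-overlapping triplets
  dna <- gsub("\\s", "", dna)
  starts <- seq.int(1, nchar(dna), by = 3)
  triplets <- substring(dna, starts, starts + 2)
  # Keep only triplets present in the lookup (drops a trailing partial codon)
  triplets <- triplets[triplets %in% names(lookup)]

  # Count each observed codon and attach its amino acid
  counts <- table(triplets)
  out <- data.frame(
    codon = names(counts),
    amino_acid = unname(lookup[names(counts)]),
    codon_count = as.vector(counts),
    stringsAsFactors = FALSE
  )
  # Share of each amino acid's occurrences contributed by each codon, in percent
  aa_total <- ave(out$codon_count, out$amino_acid, FUN = sum)
  out$percent_within_amino_acid <- 100 * out$codon_count / aa_total
  out
}

result <- codon_usage(toy_seq, toy_lookup)
print(result)
```

For the toy sequence, ATG accounts for 100% of M, while ATT and ATC split I roughly 67% to 33%. Codons that never occur (ATA here) do not appear in the table at all.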

Let me know if you have questions about specifics of the above.