How to count the occurrence of a character and then find its percentage

#1

Hi all,
As we know, a set of three nucleotides (a codon) codes for an amino acid. For example, ATG alone codes for M (methionine), while ATC, ATA, and ATT all code for I (isoleucine).
So the percentage of ATG in a DNA sequence would always be 1 for coding M, and the percentage of ATC would always be 0.33 for coding I, and the same for ATA and ATT.
I want to write a function that counts the codons in a sequence and then calculates how frequently each codon is used, as a percentage, to form its particular amino acid.

``````
codon <- list(ATA = "I", ATC = "I", ATT = "I", ATG = "M", ACA = "T",
ACC = "T", ACG = "T", ACT = "T", AAC = "N", AAT = "N", AAA = "K",
AAG = "K", AGC = "S", AGT = "S", AGA = "R", AGG = "R", CTA = "L",
CTC = "L", CTG = "L", CTT = "L", CCA = "P", CCC = "P", CCG = "P",
CCT = "P", CAC = "H", CAT = "H", CAA = "Q", CAG = "Q", CGA = "R",
CGC = "R", CGG = "R", CGT = "R", GTA = "V", GTC = "V", GTG = "V",
GTT = "V", GCA = "A", GCC = "A", GCG = "A", GCT = "A", GAC = "D",
GAT = "D", GAA = "E", GAG = "E", GGA = "G", GGC = "G", GGG = "G",
GGT = "G", TCA = "S", TCC = "S", TCG = "S", TCT = "S", TTC = "F",
TTT = "F", TTA = "L", TTG = "L", TAC = "Y", TAT = "Y", TAA = "stop",
TAG = "stop", TGC = "C", TGT = "C", TGA = "stop", TGG = "W")

( fracs <- 1/table(unlist(codon)) )

codonfracs <- setNames(lapply(codon, function(x) unname(fracs[x])), names(codon))

s <- 'AAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGC
TTAGAGACTCCGAAATCAACGACGACTTCCACCAGTGGGCCCAGTGACCGCCACACTGGA
CCCCATACCACTTCTTTTTGTTATTCTTAAATATGTT
'
# Strip the line breaks so they do not end up inside the codons
s <- gsub("\\s", "", s)

strsplit3 <- function(s, k=3) {
starts <- seq.int(1, nchar(s), by=k)
stops <- c(starts[-1] - 1, nchar(s))
mapply(substr, s, starts, stops, USE.NAMES=FALSE)
}
strsplit3(s)
``````

I have split my sequence into frames of 3. Please guide me in finding the count of each codon in the sequence and its percentage of occurrence. The output I am looking for is a table with four columns: the codon, the amino acid it codes for, the count of the codon, and its percentage of occurrence for forming that amino acid.
Thank you.

#2

> percentage of occurrence for forming that amino acids.

It's not perfectly clear what that means, but I took a guess. Your best bet is probably to turn your lookup list into a data frame, then join it to a data frame of codons. Then the summarization will just take some thinking about `group_by`, `summarize` and `mutate`.

``````
library(tidyverse)

### Make a tibble of individual nucleotides, one per row (not used below)
ind_codons <- strsplit(s, "")[[1]] %>% as_tibble()

### Grouping into 3s
codons <- s %>%
# split into sets of 3
strsplit3 %>%
# turn it into a dataframe
as_tibble() %>%
# rename codon column
transmute(codon=value)

### Now you have a tibble with one row per codon, in a single `codon` column.

### Turn the amino acid map from a list into a tibble
# enframe() keeps the vector names as a column; as_tibble() would drop them
amino_acid_lookup <- unlist(codon) %>%
enframe(name = "codon", value = "amino_acid")

### Join the two
combined <- codons %>%
inner_join(amino_acid_lookup, by = "codon")

### Output stats (all values are fractions; multiply by 100 for percentages):
summary_stats <-
combined %>%
group_by(codon, amino_acid) %>%
# Number of occurrences of each codon
summarize(codon_count = n()) %>%
ungroup() %>%
# Codon frequency out of all codons
mutate(codon_percent = codon_count/sum(codon_count)) %>%
# Number of occurrences of each amino acid, summed over its codons
group_by(amino_acid) %>%
mutate(amino_acid_count = sum(codon_count)) %>%
# Codon frequency among the codons observed for the same amino acid
mutate(amino_acid_percent_by_codon = codon_count/amino_acid_count) %>%
ungroup() %>%
# Amino acid frequency out of all amino acids (I don't think you need this value)
mutate(amino_acid_percent_overall = amino_acid_count/sum(codon_count))
``````
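
If you want to sanity-check the idea without the tidyverse, here is a minimal base R sketch on a toy input. It assumes a cut-down lookup with only the four codons from the question's example (ATG for M; ATC, ATA, ATT for I) and a made-up sequence. The names `codon_map`, `toy_seq` and `fraction_within_aa` are illustrative and not part of the code above.

```r
# Toy lookup: only the codons from the question's example (an assumption,
# not the full genetic code).
codon_map <- c(ATG = "M", ATC = "I", ATA = "I", ATT = "I")

# Hypothetical toy sequence: ATG ATC ATA ATC ATG ATC
toy_seq <- "ATGATCATAATCATGATC"

# Split into non-overlapping triplets, ignoring any incomplete trailing piece.
starts <- seq(1, nchar(toy_seq) - 2, by = 3)
triplets <- substring(toy_seq, starts, starts + 2)

# Count each codon; the factor levels keep codons that never occur (count 0).
counts <- table(factor(triplets, levels = names(codon_map)))

result <- data.frame(
  codon = names(codon_map),
  amino_acid = unname(codon_map),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)

# Share of each codon among all observed codons for the same amino acid.
aa_total <- ave(result$count, result$amino_acid, FUN = sum)
result$fraction_within_aa <- ifelse(aa_total > 0, result$count / aa_total, NA_real_)

result
```

With this toy input, ATG gets a fraction of 1 (it is the only codon for M), while ATC, ATA and ATT get 0.75, 0.25 and 0 among the isoleucine codons. Whether this "share within the amino acid" is the percentage you are after is still a guess on my part.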

Let me know if you have questions about specifics of the above.