A more concise solution probably exists, but the code below captures my approach. The general idea is to:
- create a tibble of all of the headings in the .fasta file
- walk through each heading to pinpoint which string needs replacement
- join the crosswalk of old strings to new strings (this is likely the Excel list you mention)
- do a string replace
These steps leave you with a column of the newly formed headings. As a final step, set the headings of your .fasta file to the new headings. In the example, I assigned them to a copy, fasta_new, so that the old and new headings can be compared (notice that the new object contains new_string1 and new_string2).
library(tidyverse)
# this represents the .fasta file (only 2 columns for simplicity)
fasta = tibble(
  `tr|A0A383WB61|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1` = c(1, 2),
  `heading2=AJRST15|type-2 binding` = c(3, 4)
)
# headings to change
headings = tibble(old = names(fasta))
# crosswalk of old strings to new strings
crosswalk = tibble(
  old_string = c('A0A383WB61', 'AJRST15'),
  new_string = c('new_string1', 'new_string2')
)
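# function to flag the headings that contain a given old_string from the crosswalk
# if a heading contains the string, return the string, otherwise NA
string_check = function(i) {
  headings %>%
    # fixed() matches i literally instead of treating it as a regular expression
    mutate(old_string = ifelse(str_detect(old, fixed(i)), i, NA_character_))
}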
# walk through each of the old strings (i.e. crosswalk$old_string)
headings = lapply(crosswalk$old_string, string_check) %>%
  # stack into one large tibble
  bind_rows() %>%
  # remove NA results
  filter(!is.na(old_string)) %>%
  # join the crosswalk to bring in the replacement
  left_join(crosswalk) %>%
  # replace the first occurrence of old_string in old with new_string
  mutate(new = str_replace(old, fixed(old_string), new_string)) %>%
  # keep old and new (if you want to compare)
  select(old, new)
#> Joining, by = "old_string"
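# update the fasta file headings
# match by name, because the rows of headings follow the crosswalk order
# (not the column order) and unmatched headings were filtered out
fasta_new = fasta
idx = match(headings$old, names(fasta_new))
names(fasta_new)[idx] = headings$new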
names(fasta)
#> [1] "tr|A0A383WB61|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1"
#> [2] "heading2=AJRST15|type-2 binding"
names(fasta_new)
#> [1] "tr|new_string1|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein (Fragment) OS=Tetradesmus obliquus OX=3088 GN=BQ4739_LOCUS14496 PE=4 SV=1"
#> [2] "heading2=new_string2|type-2 binding"
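
On the "more concise solution" mentioned at the top: one possible sketch uses only base R and loops over the crosswalk, calling sub() on the whole vector of headings. The helper name rename_headers and the third example heading are my own additions for illustration, not part of the code above. Like str_replace, sub() changes only the first match in each heading, and fixed = TRUE treats the old string as literal text. Headings without a match are left untouched.

```r
# Hypothetical helper (an assumption for illustration): replace the first
# literal occurrence of each old string in every heading.
rename_headers = function(headers, old_strings, new_strings) {
  stopifnot(length(old_strings) == length(new_strings))
  for (k in seq_along(old_strings)) {
    # fixed = TRUE matches the text literally rather than as a regular expression
    headers = sub(old_strings[k], new_strings[k], headers, fixed = TRUE)
  }
  headers
}

old_headers = c(
  "tr|A0A383WB61|A0A383WB61_TETOB Chitin-binding type-2 domain-containing protein",
  "heading2=AJRST15|type-2 binding",
  "heading3 with no match"
)

new_headers = rename_headers(
  old_headers,
  old_strings = c("A0A383WB61", "AJRST15"),
  new_strings = c("new_string1", "new_string2")
)

print(new_headers)
```

With the objects from the example above, this could be applied as names(fasta_new) = rename_headers(names(fasta), crosswalk$old_string, crosswalk$new_string). Because the replacements run one after another, a new string that happens to contain a later old string would be replaced again, so it is worth checking the crosswalk for that kind of overlap first.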