Creating a new column in a dataset

Hi,
I have a large dataset of fungal species in which one column on each row holds the full taxonomy (kingdom, phylum, class, order, family, genus, species). I would like to create a new column that contains only the species name, without the rest of the taxonomy. How would I go about isolating this, given that every species name comes after s__ in the taxonomy column and the names vary in length? I have tried mutate() with str_extract, substr, and start. ITS_counts is the dataset, taxonomy is the column I'm trying to use, and s__ is the marker after which the species name appears on each row. The code I tried is...

mutate("species" = str_extract(ITS_counts$taxonomy, substr(start=".*s__", 1000, stop = NULL), group = NULL))

with errors...

Error in substr(start = ".*s__", 1000, stop = NULL) :
invalid substring arguments
In addition: Warning message:
In substr(start = ".*s__", 1000, stop = NULL) : NAs introduced by coercion

Thank you.

Please post the output of

dput(head(ITS_counts$taxonomy, 20))

That will allow us to work with your data.

the output is...

c("k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Capnodiales_fam_Incertae_sedis;g__Vermiconia;s__Vermiconia_calcicola",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_halotolerans",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__Mycosphaerella;s__Mycosphaerella_ulmi",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideales;f__Dothioraceae;g__Aureobasidium;s__Aureobasidium_pullulans",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Biatriospora;s__Biatriospora_mackinnonii",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Leptospora;s__Leptospora_rubella",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Monodictys;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Septoriella;s__Septoriella_hirta",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Zymoseptoria;s__Zymoseptoria_halophila",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetidae_ord_Incertae_sedis;f__Eremomycetaceae;g__Arthrographis;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Biatriosporaceae;g__Nigrograna;s__unidentified",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Corynesporascaceae;g__Corynespora;s__Corynespora_citricola",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Cucurbitariaceae;g__Pyrenochaetopsis;s__Pyrenochaetopsis_leptospora",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Dacampiaceae;g__Teichospora;s__Teichospora_rubriostiolata",
"k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Didymellaceae;g__Neoascochyta;s__Neoascochyta_graminicola"
)

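The regular expression in str_extract uses a lookbehind, (?<=;s__), so the match starts immediately after the text ";s__" without including it, and .+$ then captures everything from there to the end of the string. No substr() call is needed, because the pattern is passed straight to str_extract.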

library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF <- data.frame(taxonomy = c("k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Botryosphaeriales;f__Botryosphaeriaceae;g__Diplodia;s__Diplodia_subglobosa",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Capnodiales_fam_Incertae_sedis;g__Vermiconia;s__Vermiconia_calcicola",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_exasperatum",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Cladosporiaceae;g__Cladosporium;s__Cladosporium_halotolerans",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__Mycosphaerella;s__Mycosphaerella_ulmi",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__Mycosphaerellaceae;g__unidentified;s__unidentified",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Capnodiales;f__unidentified;g__unidentified;s__unidentified",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideales;f__Dothioraceae;g__Aureobasidium;s__Aureobasidium_pullulans",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Biatriospora;s__Biatriospora_mackinnonii",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Leptospora;s__Leptospora_rubella",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Monodictys;s__unidentified",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Septoriella;s__Septoriella_hirta",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetes_ord_Incertae_sedis;f__Dothideomycetes_fam_Incertae_sedis;g__Zymoseptoria;s__Zymoseptoria_halophila",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Dothideomycetidae_ord_Incertae_sedis;f__Eremomycetaceae;g__Arthrographis;s__unidentified",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Biatriosporaceae;g__Nigrograna;s__unidentified",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Corynesporascaceae;g__Corynespora;s__Corynespora_citricola",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Cucurbitariaceae;g__Pyrenochaetopsis;s__Pyrenochaetopsis_leptospora",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Dacampiaceae;g__Teichospora;s__Teichospora_rubriostiolata",
  "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Didymellaceae;g__Neoascochyta;s__Neoascochyta_graminicola"
))

DF <- DF |> mutate(Species = str_extract(taxonomy, "(?<=;s__).+$"))

DF$Species
#>  [1] "Diplodia_subglobosa"         "Vermiconia_calcicola"       
#>  [3] "Cladosporium_exasperatum"    "Cladosporium_halotolerans"  
#>  [5] "Mycosphaerella_ulmi"         "unidentified"               
#>  [7] "unidentified"                "unidentified"               
#>  [9] "Aureobasidium_pullulans"     "Biatriospora_mackinnonii"   
#> [11] "Leptospora_rubella"          "unidentified"               
#> [13] "Septoriella_hirta"           "Zymoseptoria_halophila"     
#> [15] "unidentified"                "unidentified"               
#> [17] "Corynespora_citricola"       "Pyrenochaetopsis_leptospora"
#> [19] "Teichospora_rubriostiolata"  "Neoascochyta_graminicola"

Created on 2023-03-31 with reprex v2.0.2

How would I do this for all 939 rows and make it a new column within ITS_counts?

The data frame DF is just a stand-in for your ITS_counts. mutate() works on the whole column at once, so the same call handles all 939 rows. Your code would be

ITS_counts <- ITS_counts |> mutate(Species = str_extract(taxonomy, "(?<=;s__).+$"))
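If it helps to double-check the new column, here is a minimal sketch of the same extraction in base R, followed by a count of rows that had no ";s__" field and rows labelled "unidentified". The three-row ITS_counts below is a made-up stand-in, not the real data, and the names m, n_missing and n_unidentified are only for illustration.

```r
# Hypothetical three-row stand-in for ITS_counts (the third row has no species field)
ITS_counts <- data.frame(
  taxonomy = c(
    "k__Fungi;p__Ascomycota;g__Diplodia;s__Diplodia_subglobosa",
    "k__Fungi;p__Ascomycota;g__unidentified;s__unidentified",
    "k__Fungi;p__Ascomycota"
  ),
  stringsAsFactors = FALSE
)

# regexpr() finds the text after ";s__"; perl = TRUE is needed for the lookbehind.
# It returns -1 for rows where the pattern does not match.
m <- regexpr("(?<=;s__).+$", ITS_counts$taxonomy, perl = TRUE)

# Start with NA everywhere, then fill in only the rows that matched
ITS_counts$Species <- NA_character_
ITS_counts$Species[m > 0] <- regmatches(ITS_counts$taxonomy, m)

# Rows with no ";s__" field, and rows whose species is "unidentified"
n_missing <- sum(is.na(ITS_counts$Species))
n_unidentified <- sum(ITS_counts$Species == "unidentified", na.rm = TRUE)
```

Like str_extract, this leaves NA in rows where ";s__" is absent, so n_missing gives a quick way to spot taxonomy strings that do not follow the expected layout.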

That has worked!! Thank you so much for your help!

Following on from this, I would like to create fasta files of certain species from the sequences in ITS_counts. I have been able to do this, but the program I use to align the sequences requires every species to have a unique name, so the multiple "unidentified" species cause a problem. How would I remove all unidentified species from this code...

write.fasta(as.list(seqs2), as.character(ITS_counts2$"Species"), file.out="seqs2fasta")

The output is a file called seqs2fasta containing the species and sequences, but it includes many unidentified species that I would like to leave out.

Thanks.

You can use the filter() function from dplyr to remove rows where the Species column is "unidentified".

ITS_filtered <- ITS_counts |> filter(Species != "unidentified")

Then use ITS_filtered to write your fasta file, taking both the names and the sequences from the filtered rows so they stay matched.
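One thing to watch: if the sequences live in a separate object (like seqs2 in your call), they need the same rows dropped as the names. Below is a rough sketch of that, assuming the sequences are stored one per row in the same order as ITS_counts. The data, seqs, keep, seqs_filtered and the output file name are all made up for illustration, and write.fasta is assumed to come from the seqinr package.

```r
# Hypothetical stand-ins: one sequence per row of ITS_counts, in the same order
ITS_counts <- data.frame(
  Species = c("Diplodia_subglobosa", "unidentified", "Vermiconia_calcicola", NA),
  stringsAsFactors = FALSE
)
seqs <- c("ACGT", "GGCC", "TTAA", "CCGG")

# TRUE for rows to keep: drop "unidentified" and rows with no species at all
keep <- !is.na(ITS_counts$Species) & ITS_counts$Species != "unidentified"

# Apply the same logical index to the data frame and to the sequences
ITS_filtered <- ITS_counts[keep, , drop = FALSE]
seqs_filtered <- seqs[keep]

# The aligner needs unique names, so stop early if any name is repeated
stopifnot(anyDuplicated(ITS_filtered$Species) == 0)

# Write the file only if seqinr (which provides write.fasta) is installed
if (requireNamespace("seqinr", quietly = TRUE)) {
  seqinr::write.fasta(as.list(seqs_filtered), as.character(ITS_filtered$Species),
                      file.out = file.path(tempdir(), "seqs_filtered.fasta"))
}
```

The uniqueness check matters because removing "unidentified" does not by itself guarantee unique names: the same identified species can still appear on more than one row.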

Thank you so much!!!!
