String modification

Hello,
I have a tibble in which an ID column (otu_id) is paired with two columns (genus and species) containing names of fungi.

tibble(otu_id = c("4875_0", "4875_4", "4875_3", "4875_32", "4875_1", "4875_5", "4875_9", "4875_8", "4875_7", "4875_12"),
       genus = c("Cladosporium_7681", "Vishniacozyma_813272", "Phoma_9358", "Fomitopsis_17612", "Coniochaeta_1209", "Resinicium_18453", "Alternaria_7106", "Heterobasidion_17745", "Botrytis_7435", "Sporobolomyces_10025"), 
       species = c("Cladosporium_cladosporioides_294915", "Vishniacozyma_victoriae_813285", "Phoma_herbarum_171008", "Fomitopsis_pinicola_101927", NA, "Resinicium_bicolor_338261", NA, "Heterobasidion_annosum_119859", "Botrytis_cinerea_217312", "Sporobolomyces_lactosus_357887"))

I would like to perform the following tasks:

  1. Remove the numeric code that follows each name.
    Example: Cladosporium_7681 must become Cladosporium.

  2. Prefix each name in both the "genus" and "species" columns with ";" followed by the number that comes after the underscore in otu_id.
    Example: Cladosporium_7681 (otu_id 4875_0) must become ";0Cladosporium",
    and Heterobasidion_annosum_119859 (otu_id 4875_8) must become ";8Heterobasidion_annosum".

  3. Replace NA in the species column with the name from the genus column.
    Example: row 5 must have ";1Coniochaeta" in the species column.

Thanks for the help.

library(tidyverse)


sample_df <- tibble(otu_id = c("4875_0", "4875_4", "4875_3", "4875_32", "4875_1", "4875_5", "4875_9", "4875_8", "4875_7", "4875_12"),
                    genus = c("Cladosporium_7681", "Vishniacozyma_813272", "Phoma_9358", "Fomitopsis_17612", "Coniochaeta_1209", "Resinicium_18453", "Alternaria_7106", "Heterobasidion_17745", "Botrytis_7435", "Sporobolomyces_10025"), 
                    species = c("Cladosporium_cladosporioides_294915", "Vishniacozyma_victoriae_813285", "Phoma_herbarum_171008", "Fomitopsis_pinicola_101927", NA, "Resinicium_bicolor_338261", NA, "Heterobasidion_annosum_119859", "Botrytis_cinerea_217312", "Sporobolomyces_lactosus_357887"))

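sample_df %>% 
    # Task 1: drop the trailing "_<digits>" code from both name columns
    mutate_at(vars(genus, species), ~ str_remove(., pattern = "_\\d+$")) %>%
    # Task 3: fill missing species with the (already cleaned) genus name
    mutate(species = if_else(is.na(species), genus, species)) %>% 
    # Task 2: prefix with ";" plus the digits after the underscore in otu_id
    mutate_at(vars(genus, species), ~ paste0(";", str_extract(otu_id, "(?<=_)\\d+$"), .))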
#> # A tibble: 10 x 3
#>    otu_id  genus             species                       
#>    <chr>   <chr>             <chr>                         
#>  1 4875_0  ;0Cladosporium    ;0Cladosporium_cladosporioides
#>  2 4875_4  ;4Vishniacozyma   ;4Vishniacozyma_victoriae     
#>  3 4875_3  ;3Phoma           ;3Phoma_herbarum              
#>  4 4875_32 ;32Fomitopsis     ;32Fomitopsis_pinicola        
#>  5 4875_1  ;1Coniochaeta     ;1Coniochaeta                 
#>  6 4875_5  ;5Resinicium      ;5Resinicium_bicolor          
#>  7 4875_9  ;9Alternaria      ;9Alternaria                  
#>  8 4875_8  ;8Heterobasidion  ;8Heterobasidion_annosum      
#>  9 4875_7  ;7Botrytis        ;7Botrytis_cinerea            
#> 10 4875_12 ;12Sporobolomyces ;12Sporobolomyces_lactosus

Created on 2020-11-20 by the reprex package (v0.3.0.9001)
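
In case the two regular expressions are not obvious, here is a small sketch that tries them on single strings, assuming only that stringr is installed. "_\\d+$" matches an underscore followed by digits at the very end of the string, so the underscore between genus and species is left alone. "(?<=_)\\d+$" uses a lookbehind to pick up the trailing digits of otu_id without the underscore itself. The variable names below are only for illustration.

```r
library(stringr)

# Only the trailing "_<digits>" is removed; the inner underscore stays.
cleaned <- str_remove("Cladosporium_cladosporioides_294915", "_\\d+$")
cleaned
#> [1] "Cladosporium_cladosporioides"

# The lookbehind (?<=_) requires a preceding "_" but does not return it.
otu_number <- str_extract("4875_32", "(?<=_)\\d+$")
otu_number
#> [1] "32"

# Missing values pass through unchanged, so the NA rows survive step 1.
na_result <- str_remove(NA_character_, "_\\d+$")
na_result
#> [1] NA
```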


The " " quotation marks at the start and end of each string are still missing. I would like them to be displayed.
Example:
;0Cladosporium_cladosporioides must be shown as ";0Cladosporium_cladosporioides"

And one more task:
  4. Add an underscore after the number.
    Example: ";0Cladosporium_cladosporioides" must become ";0_Cladosporium_cladosporioides"

I think you already have a good starting point to finish the job yourself, good luck!
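
For anyone landing here later, this is one possible way to finish it, offered as a sketch rather than the definitive answer. It assumes dplyr >= 1.0.0 (for across()) and stringr, and it uses a shortened three-row copy of the data. The underscore from task 4 is just one more piece handed to paste0(). As for the quotation marks: a tibble does not print quotes around character values, but the values are ordinary strings, so the quotes are only a display matter. Printing as a data frame with quote = TRUE is one way to show them.

```r
library(dplyr)
library(stringr)

# Shortened copy of the data from the question (three rows only).
sample_df <- tibble(
  otu_id  = c("4875_0", "4875_1", "4875_32"),
  genus   = c("Cladosporium_7681", "Coniochaeta_1209", "Fomitopsis_17612"),
  species = c("Cladosporium_cladosporioides_294915", NA, "Fomitopsis_pinicola_101927")
)

result <- sample_df %>%
  # Tasks 1 and 3: strip the trailing code, then fill missing species from genus
  mutate(across(c(genus, species), ~ str_remove(.x, "_\\d+$"))) %>%
  mutate(species = coalesce(species, genus)) %>%
  # Tasks 2 and 4: prefix ";" + otu number + "_" to each name
  mutate(across(
    c(genus, species),
    ~ paste0(";", str_extract(otu_id, "(?<=_)\\d+$"), "_", .x)
  ))

# Show the strings with surrounding double quotes when printing.
print(as.data.frame(result), quote = TRUE)
```

If the quotes need to be part of the values themselves (for example before writing to a file), wrapping each column in paste0('"', x, '"') would embed them, though write.csv() already quotes strings by default.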
