Complex uniting of 2 columns into one

Hi all, I have 2 datasets I need to combine in order to merge with another

One has 3 columns- ID, synonym ID and duration as shown below, you can see that
the species column has repeats to allow for synonym IDs used by some.

Species	Synonym.Symbol	Duration
AUREO		
BELLA		
BRACH5		
CASTI2		
CAAFA2		                                Perennial
CAAFA2	        CAAFC	
CAAFA2	        CAAFI	



Is there any way to retain the duration information, but create one column that has ALL the species potential names, without repeats? i.e. combine the species column and the synonym species column, but without a separator (i.e. make a new row) and retain the duration information, as in

Species	     Duration
AUREO		
BELLA		
BRACH5		
CASTI2		
CAAFA2		                                
CAAFC	        Perennial
CAAFAI	        Perennial	

Eventually, this will be used to merge the duration information into a large vegetation analysis database, where some researchers used different IDs for the same species . I did create a list of all the unique names into one column, but don't know how to match the duration values of 2 columns with this new list

names<-unique(c(as.character(longevity.hemi$Species), as.character(longevity.hemi$Synonym.Symbol))

Thanks!

It's a little hard to tell what you're expected output is because there are values that appear in your second table that don't in the first, but here's an attempt. It would be very helpful if you provided your data in an easy to copy and paste format. One way is to paste the output of dput(longevity.hemi) here.

library(tidyverse)

df <- tribble(
  ~Species, ~Synonym.Symbol, ~Duration,
  "AUREO",  NA,              NA,    
  "BELLA",  NA,              NA,
  "BRACH5", NA,              NA,
  "CASTI2", NA,              NA,
  "CAAFA2", NA,              "Perennial",
  "CAAFA2", "CAAFC",         NA,
  "CAAFA2", "CAAFI",         NA  
)

spe <- df %>% 
  group_by(Species) %>% 
  fill(Duration, .direction = "updown") %>% 
  ungroup()

syn <- spe %>% 
  filter(!is.na(Synonym.Symbol)) %>% 
  select(Species = Synonym.Symbol, Duration)

spe %>% 
  select(Species, Duration) %>% 
  distinct() %>% 
  bind_rows(syn)
#> # A tibble: 7 x 2
#>   Species Duration 
#>   <chr>   <chr>    
#> 1 AUREO   <NA>     
#> 2 BELLA   <NA>     
#> 3 BRACH5  <NA>     
#> 4 CASTI2  <NA>     
#> 5 CAAFA2  Perennial
#> 6 CAAFC   Perennial
#> 7 CAAFI   Perennial

Created on 2020-06-09 by the reprex package (v0.3.0)

Hey thanks for your response! sorry about not copy and pasting everything perfectly.

Your output is what I am looking for and works perfectly for the tibble you created, but not for my actual dataframe, for some reason.
The output of dput(longevity.hemi) is very long. Here is the head

  Species Synonym.Symbol Duration   
  <fct>   <fct>          <fct>      
1 AUREO   ""             ""         
2 BELLA   ""             ""         
3 BRACH5  ""             ""         
4 CASTI2  ""             ""         
5 CAAFA2  ""             "Perennial"
6 CAAFA2  "CAAFC"  

But when I run your code, at the end I get this tibble, which has the duplicates you don't have in your code and doesn't fill in the duration status

   Species Duration   
   <chr>   <fct>      
 1 AUREO   ""         
 2 BELLA   ""         
 3 BRACH5  ""         
 4 CASTI2  ""         
 5 CAAFA2  "Perennial"
 6 CAAFA2  ""         
 7 CAAFL2  "Perennial"
 8 CAAFL2  ""         
 9 CAAFN2  "Perennial"
10 CAAFN2  ""   

I'm not sure why it won't work for essentially the same data.

Thank you!

Thanks for posting some of your data -- again, it is much easier to help if you paste in a copy and paste friendly format. Please look at the link I shared in my last post for help with this.

It looks like your data doesn't contain NAs, but rather empty strings "" so for the code to work, you would need to convert these "" to NA first.

library(tidyverse)

df <- tribble(
  ~Species, ~Synonym.Symbol, ~Duration,
  "AUREO",  "",              "",    
  "BELLA",  "",              "",
  "BRACH5", "",              "",
  "CASTI2", "",              "",
  "CAAFA2", "",              "Perennial",
  "CAAFA2", "CAAFC",         "",
  "CAAFA2", "CAAFI",         ""  
)

spe <- df %>% 
  mutate_at(vars(Synonym.Symbol, Duration), ~ if_else(.x == "", NA_character_, .x)) %>% 
  group_by(Species) %>% 
  fill(Duration, .direction = "updown") %>% 
  ungroup()

syn <- spe %>% 
  filter(!is.na(Synonym.Symbol)) %>% 
  select(Species = Synonym.Symbol, Duration)

spe %>% 
  select(Species, Duration) %>% 
  distinct() %>% 
  bind_rows(syn)
#> # A tibble: 7 x 2
#>   Species Duration 
#>   <chr>   <chr>    
#> 1 AUREO   <NA>     
#> 2 BELLA   <NA>     
#> 3 BRACH5  <NA>     
#> 4 CASTI2  <NA>     
#> 5 CAAFA2  Perennial
#> 6 CAAFC   Perennial
#> 7 CAAFI   Perennial

Created on 2020-06-09 by the reprex package (v0.3.0)

The challenge you have using dput is related not only to dataframe size in terms of pure rows, but potentially the large number of factor values for each factor variable.
Recommend that to provide example data you identify a reasonable number of rows to share. Select them with either head() or slice() functions.
Use forcats packages fct_drop in combination with mutateif(is.factor, fct_drop)
Then this final cropped frame can be passed to dput

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.