Merging two csv files

AndrewR · December 15, 2019, 5:47pm

I have just started with R and need to merge two csv files that share the column name SpeName. Below is the how the two files are formatted

CSV1
Kingdom,Phylum,Class,Order,Family,Genus,Species,,SciName
PLANTAE,TRACHEOPHYTA,MAGNOLIOPSIDA,FABALES,FABACEAE,Acacia,auriculiformis,19891902,Acacia auriculiformis
PLANTAE,TRACHEOPHYTA,MAGNOLIOPSIDA,FABALES,FABACEAE,Acacia,crassicarpa,38366,Acacia crassicarpa
PLANTAE,TRACHEOPHYTA,MAGNOLIOPSIDA,FABALES,FABACEAE,Acacia,decurrens,60757212,Acacia decurrens
PLANTAE,TRACHEOPHYTA,MAGNOLIOPSIDA,FABALES,FABACEAE,Acacia,koa,19891713,Acacia koa
PLANTAE,TRACHEOPHYTA,MAGNOLIOPSIDA,FABALES,FABACEAE,Acacia,mangium,18435820,Acacia mangium

CSV2
SciName,FAMILY,CONTINENT,PILOT NAME
Abarema jupunba Britton & Killip,Leguminosae (Fabaceae),AM,Huruasa
Acacia spp.,Leguminosae (Mimosaceae),AS,Acacia
Acacia auriculiformis A. Cunn.,Leguminosae (Mimosaceae),AS,Acacia
Acacia mangium Willd.,Leguminosae (Mimosaceae),AS,Acacia
Acanthopanax ricinifolius Seem. (cf. Kalopanax Araliaceae AS Senseptemlobus),Araliaceae,AS,Sen
Acrocarpus fraxinifolius Arn.,Leguminosae (Caesalpiniaceae),AS,Kuranjan
Actinodaphne spp.,Lauraceae,AS,Medang

What I'd like, is to create a data frame with a unique row for each unique value of SciName so that the values from each csv are combined. My problem is that the names in SciName do not 100% match and yet they are the same for example, above you have Acacia mangium in csv1 and Acacia mangium Wild. in csv2. I would like to create a row that keeps the longer version of the name i.e. Acacia mangium under SciName and combines the data in the rest of the columns for both names. This should only happen though where there are direct matches with the first two names in both csvs. Where there are no matches, null values would be given to the cells in the addition columns. Using the example above the table would look like:

Kingdom,Phylum,Class,Order,Family,Genus,Species, ,SciName,Family2,CONTINENT,PILOT NAME
,,,,,,,,Abarema jupunba Britton & Killip,Leguminosae (Fabaceae),AM,Huruasa
PLANTAE,TRACHEOPHYTA,MAGNOLIOPSIDA,FABALES,FABACEAE,Acacia,auriculiformis,19891902,Acacia auriculiformis A. Cunn,Leguminosae (Mimosaceae),AS,Acacia

Hope this is clear, and if R is not the best place to do this, it would also be great to get some pointers to some other solutions that can deal with relatively large datasets.

Thanks for any help

FJCC · December 15, 2019, 7:03pm

I am not entirely sure how you want to join the two data set. Is this right?

CSV1 <- read.csv("/home/fjcc/R/Play/csv1.csv", stringsAsFactors = FALSE)
CSV2 <- read.csv("/home/fjcc/R/Play/csv2.csv", stringsAsFactors = FALSE)
library(stringr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
CSV1 <- CSV1 %>%  mutate(JoinName = str_extract(SciName, "^[^\\s]+\\s[^\\s]+"))
CSV2 <- CSV2 %>%  mutate(JoinName = str_extract(SciName, "^[^\\s]+\\s[^\\s]+"))
CSVall <- full_join(CSV1, CSV2, by ="JoinName")

^{Created on 2019-12-15 by the reprex package (v0.2.1)}

AndrewR · December 15, 2019, 8:21pm

Thanks for this. This is perfect gives me exactly what I need. Thanks for your help with this. Also for introducing me to some new functions

Just wondering, what does the "^[1]+\s[^\s]+" part mean. Just in case I want be using this in the future, Tried searching the help in RStudio but having problems understanding this part.

^\s ↩︎

Sfdude · December 16, 2019, 3:08am

Andrew:

the Regular Expression (regexp):
^[1]+\s[^\s]+

means (in summary, reading left to right...):

^ = start at the beginning of the SciName string
[^\s]+ = match the 1st word (ie: Acacia)
\s = match a single space separator btw both words
[^\s]+ = match the 2nd word (ie: mangium)

Regexp expressions
can be very powerful
but require some initial patience to learn...
Obviously,
FJCC is an expert at this.

Hope this helps you...
SFdude
San Francisco

PS: I don't know why my double backslashes
appear as a single backslash in this
Rstudio commenting system...

^\s ↩︎

FJCC · December 16, 2019, 3:38am

It is kind of Sfdude to say I am a regular expression expert but looking back at what I wrote there, I could have written a much simpler expression.

CSV1 <- read.csv("/home/fjcc/R/Play/csv1.csv", stringsAsFactors = FALSE)
CSV2 <- read.csv("/home/fjcc/R/Play/csv2.csv", stringsAsFactors = FALSE)
library(stringr)
library(dplyr)
CSV1 <- CSV1 %>%  mutate(JoinName = str_extract(SciName, "^\\w+\\s\\w+"))
CSV2 <- CSV2 %>%  mutate(JoinName = str_extract(SciName, "^\\w+\\s\\w+"))
CSVall <- full_join(CSV1, CSV2, by ="JoinName")

^ means "the beginning of the string"
\w means "any Word character"
+ means "one or more of the preceding thing"
\s means "a white space character" think of that as spaces and tabs, though that is not exact

So, the whole revised expression means "from the start of the string, one or more Word characters, a space, one or more Word characters".
Within R you frequently have to double the back slashes but that is not usual in regular expressions.

AndrewR · December 16, 2019, 9:22am

Thanks for the help, all really useful.

system · January 6, 2020, 9:37am

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.