String match in R


#1

Hi, I have two CSV files, A and B, with almost 20k rows each.
A.csv
Data =
C.gov
HELLO SERVICES
ABC .COM
ABC CHARITY trust

B.csv
Data =
Uk.Police
Abc Charity
C.gov
Hello Services

A has organisation names and B has charitable organisations. I need to use a for loop to tag the organisations in A that have a match in B.csv. Please help.


#2

The way you've presented it right now is difficult to understand. Nevertheless, take a look at the dplyr::left_join function. It might do what you want.
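
For exact matches that differ only in letter case, the left-join idea can be sketched in base R with match(), which does the same lookup a left join performs. This is only an illustration on the sample names from the first post; the column names `org`, `charity` and `is_charity` and the helper `normalise` are made up for the example. With dplyr, the equivalent would be a left_join on the normalised key.

```r
a <- data.frame(org = c("C.gov", "HELLO SERVICES", "ABC .COM", "ABC CHARITY trust"),
                stringsAsFactors = FALSE)
b <- data.frame(charity = c("Uk.Police", "Abc Charity", "C.gov", "Hello Services"),
                stringsAsFactors = FALSE)

# Hypothetical helper: ignore case and surrounding whitespace when comparing.
normalise <- function(x) toupper(trimws(x))

# match() gives, for each row of a, the row of b with the same key (NA if none).
idx <- match(normalise(a$org), normalise(b$charity))
a$charity <- b$charity[idx]
a$is_charity <- !is.na(idx)
print(a)
```

Only whole-name matches are tagged this way, so "ABC CHARITY trust" is not linked to "Abc Charity"; that case needs partial or fuzzy matching.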


#3

Thanks, and sorry for not being clear. Does left join work if I have words which may be similar in the other table? I need pattern matching so that I can express matches as a percentage: if an Org Name (row 1) has 3 words and all three words match an entry from Table 2 (the entire list), then it should give 100%, else "No match found".

Table 1
Organisation Name
GALWAY UNIVERSITY FOUNDATION
GREYSTONES PRESBYTERIAN CHURCH
IRISH FOOTBALL ASSOCIATION LTD
IRISH YOUTH FOUNDATION

Table 2
Charity name
HOMEBOUND CRAFTSMEN TRUST
PAINTERS' COMPANY CHARITY
ROHBF
THE ROYAL OPERA HOUSE BENEVOLENT FUND
HERGA
HERGA WORLD DISTRESS FUND
THE WILLIAM GOLDSTEIN LAY STAFF BENEVOLENT FUND (ROYAL HOSPITAL OF ST BARTHOLOMEW)
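
One possible reading of the percentage rule above: split each organisation name into words and score every charity name by the share of those words it contains, keeping the best score. The sketch below is only an assumption about the intended rule. `word_overlap` and `best_word_match` are hypothetical helper names, and the data are the sample names from the first post, because the two tables above share no words.

```r
# Share of the words in `name` that also occur in `candidate` (case-insensitive).
word_overlap <- function(name, candidate) {
  words_name <- strsplit(toupper(trimws(name)), "[[:space:]]+")[[1]]
  words_cand <- strsplit(toupper(trimws(candidate)), "[[:space:]]+")[[1]]
  mean(words_name %in% words_cand)
}

# Score every candidate for each name and keep the highest-scoring one.
best_word_match <- function(names, candidates) {
  rows <- lapply(names, function(nm) {
    scores <- vapply(candidates, function(cand) word_overlap(nm, cand),
                     numeric(1), USE.NAMES = FALSE)
    best <- which.max(scores)
    data.frame(
      name = nm,
      best_match = if (scores[best] > 0) candidates[best] else "No match found",
      pct = round(100 * scores[best], 1),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

orgs <- c("C.gov", "HELLO SERVICES", "ABC .COM", "ABC CHARITY trust")
charities <- c("Uk.Police", "Abc Charity", "C.gov", "Hello Services")
result <- best_word_match(orgs, charities)
print(result)
```

Here "ABC CHARITY trust" shares two of its three words with "Abc Charity", so it scores below 100%. Whether a partial score like that should count as a match is a threshold you would have to choose.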


#4

I'm still not sure what you mean. Using the examples you have, what would the matches be? I'm especially interested in examples that show what you were describing with the "if-else" clause.

Also, you are saying that a name in Table 1 should match the list in Table 2. Do you mean that Table 2 stores lists where each word is separate? I guess you have strings in both cases, so I'm not sure where lists come into play.

What you describe sounds like a fuzzy join, so there is a package called fuzzyjoin that might help you. I haven't used it myself, but it should be fairly straightforward.
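
The fuzzy-join idea can also be sketched in base R with utils::adist(), which computes edit distances between two character vectors. This is not the fuzzyjoin API, just an illustration on the sample names from the first post; the cutoff `max_dist` is an arbitrary, hypothetical value that would need tuning on real data.

```r
orgs <- c("C.gov", "HELLO SERVICES", "ABC .COM", "ABC CHARITY trust")
charities <- c("Uk.Police", "Abc Charity", "C.gov", "Hello Services")

# Edit-distance matrix: one row per organisation, one column per charity.
d <- adist(orgs, charities, ignore.case = TRUE)

# Column index of the closest charity for each organisation.
nearest <- apply(d, 1, which.min)

fuzzy <- data.frame(
  org = orgs,
  closest = charities[nearest],
  distance = d[cbind(seq_along(orgs), nearest)],
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: treat anything further away than this as "no match".
max_dist <- 6
fuzzy$is_match <- fuzzy$distance <= max_dist
print(fuzzy)
```

Raw edit distance favours short names, so dividing by the name length before applying the cutoff may give more sensible results on long organisation names.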


#5

Thanks, and sorry, I am new so I may not be wording this well. One question: the code below gives me the output [1] 0 0 1 0 2. How can I reshape it so that I can see the results in two columns, like this?

Beverly 0
Gloucester 0
Manchester-by-the-Sea Manchester
Nahant 0
Salem Salem

survey <- c("Salem", "salem, ma", "Manchester", "Manchester-By-The-Sea")
master <- c("Beverly", "Gloucester", "Manchester-by-the-Sea", "Nahant", "Salem")

# Count, for each pattern, how many elements of x it approximately matches.
n.match <- function(pattern, x, ...) {
    matches <- numeric(length(pattern))
    for (i in seq_along(pattern)) {
        # agrep() returns the indices of the approximate matches in x
        idx <- agrep(pattern[i], x, ignore.case = TRUE, max.distance = 2)
        matches[i] <- length(idx)
    }
    matches
}
n.match(master, survey)

#6

Could you please turn this into a self-contained reprex (short for minimal reproducible example)? It will help us help you if we can be sure we're all working with/looking at the same stuff.

Right now the best way to install reprex is:

# install.packages("devtools")
devtools::install_github("tidyverse/reprex")

If you've never heard of a reprex before, you might want to start by reading the tidyverse.org help page. The reprex dos and don'ts are also useful.

If you run into problems with access to your clipboard, you can specify an outfile for the reprex, and then copy and paste the contents into the forum.

reprex::reprex(input = "fruits_stringdist.R", outfile = "fruits_stringdist.md")

For pointers specific to the community site, check out the reprex FAQ, linked to below.


#7

Hi, what I meant was that I have two columns, and I want each row of the first column to be partially matched against the entire second column. Please can you help?


#9

You nearly had it. Just add:

df <- data.frame(master = master, nmatch = n.match(master, survey))

rgds,
Peter
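
The data frame above shows match counts. If the second column should show the matched survey entries instead, as in the desired output in #5, one option is agrep() with value = TRUE. This is a sketch, not part of the answer above; the column name `matched` and the "; " separator are arbitrary choices.

```r
survey <- c("Salem", "salem, ma", "Manchester", "Manchester-By-The-Sea")
master <- c("Beverly", "Gloucester", "Manchester-by-the-Sea", "Nahant", "Salem")

# For each master name, collect the survey entries that agrep() accepts.
hits <- lapply(master, function(p) {
  agrep(p, survey, ignore.case = TRUE, max.distance = 2, value = TRUE)
})

df <- data.frame(
  master = master,
  nmatch = lengths(hits),
  # Join multiple hits into one string; use NA when nothing matched.
  matched = vapply(hits, function(h) {
    if (length(h) > 0) paste(h, collapse = "; ") else NA_character_
  }, character(1)),
  stringsAsFactors = FALSE
)
print(df)
```

Note that agrep() looks for the master name inside each survey entry, so "Manchester-by-the-Sea" pairs with "Manchester-By-The-Sea" but not with the shorter "Manchester". Getting that pairing would mean matching in the other direction as well.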