Fuzzy matching every element in a list against values from a data frame column

The problem: a doctor who receives payments from different companies must report where the money comes from. I have a list of officially recorded companies and a list of the company names that the doctor reported. The goal of the code is to check whether the database's record is accurate (the names are in no particular order in either the column or the list). However, the doctor wrote only part of each company name, so I have to use the agrepl function to obtain a list of Boolean values.

The actual list and column are much larger; I have constructed a simplified example in the code below. I have tried varying the max.distance parameter. I found that when max.distance is 5 or larger, I get 4 TRUE; otherwise I get 4 FALSE. I am not sure whether my code has a logic problem or whether I have not set max.distance properly. Any suggestions would be appreciated.

df <- data.frame(CompanyInDataBase = c('Pfizer Inc', 'Shire North America Group Inc', 'Roche Inc', 'Bayor Inc'), 
                 stringsAsFactors = FALSE)
report = c('Shire', 'Pfizer', 'Genetech')

for(i in 1:length(report)){
  match <- agrepl(report[i], df$CompanyInDataBase, max.distance = 0.1)
}

I expect the output to be a list of correct Boolean values, the same length as CompanyInDataBase.

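Hi @ILoveYukee. In your loop, match is overwritten on every iteration, so you only ever see the result for the last element of report. You can use purrr's map function instead, which returns a list with one result per reported name. As for the fuzzy matching itself, it is hard to define a good threshold. You can set more specific rules by passing a list, e.g. list(insertions = 1, deletions = 1), to the max.distance argument, which allows at most one character insertion and one character deletion.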

library(tidyverse)

df <- data.frame(CompanyInDataBase = c('Pfizer Inc', 'Shire North America Group Inc', 'Roche Inc', 'Bayor Inc'), 
                 stringsAsFactors = FALSE)
report = c('Shire', 'Pfizer', 'Genetech')

match <- map(report, ~{
  agrepl(.x, df$CompanyInDataBase, max.distance = 0.1)
})

match
#> [[1]]
#> [1] FALSE  TRUE FALSE FALSE
#> 
#> [[2]]
#> [1]  TRUE FALSE FALSE FALSE
#> 
#> [[3]]
#> [1] FALSE FALSE FALSE FALSE

Created on 2019-09-10 by the reprex package (v0.3.0)
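
If it helps, here is a rough base R sketch of the list form of max.distance mentioned above, which also collapses the per-name results into a single Boolean column that lines up with CompanyInDataBase. The names rules, per_report, Reported and unmatched are made up for illustration, and the rule values are only an assumption (the bounds not listed keep agrepl's defaults), so you will probably need to tune them on your real data.

```r
# Same toy data as in the question.
df <- data.frame(
  CompanyInDataBase = c('Pfizer Inc', 'Shire North America Group Inc',
                        'Roche Inc', 'Bayor Inc'),
  stringsAsFactors = FALSE
)
report <- c('Shire', 'Pfizer', 'Genetech')

# Assumed rule set (illustrative only): at most one inserted and one
# deleted character per match.
rules <- list(insertions = 1, deletions = 1)

# One logical vector per reported name, each as long as the column.
# Called per_report rather than match to avoid masking base::match().
per_report <- lapply(report, function(r) {
  agrepl(r, df$CompanyInDataBase, max.distance = rules)
})
names(per_report) <- report

# Element-wise OR across the reported names: TRUE where a database
# company was matched by at least one reported name.
df$Reported <- Reduce(`|`, per_report)

# Reported names that matched nothing in the database.
unmatched <- report[!vapply(per_report, any, logical(1))]

print(df)
print(unmatched)
```

Keeping per_report around as well as the collapsed column means you can still see which reported name hit which database entry, and unmatched gives you the names that may need a manual check.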