How to deal with misspellings when matching text strings between columns

I currently have code that scores whether subject responses in column 1 match the correct responses in column 2. This works fine for subject responses that are spelled correctly or that include at least one correct word (e.g., subject response: Sphinx; correct response: Great Sphinx).

However, I am having trouble scoring subject responses that are misspelled (e.g., Pizza instead of Pisa) or that use a different name for the same correct answer (e.g., O2 Arena instead of Millennium Dome). Is there a way to modify the code below, or add new code, so that it also gives a score of 0 or 1 for responses in column 1 that are misspelled, or that share no words with column 2 but are still correct?

Any suggestions would be greatly appreciated.

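library(stringr)  # provides str_to_title()

data_recall <- data.frame(
  correct_responses = c("Gondola On Canal", "Sphinx", "Don't Know", "Millennium Dome", "Pisa"),
  subject_responses = c("Grand Canal Venice", "Great Sphinx", "Mountain Everest", "02 Arena", "Leaning tower of Pizza")
)

# Title-case the subject responses so the word comparison is not thrown off by capitalisation
data_recall[[2]] <- str_to_title(data_recall[[2]])

# Score '1' if any word of the correct response also appears in the subject response, else '0'
data_recall$recall_MATCH <- apply(data_recall, 1, function(x)
  ifelse(any(unlist(strsplit(as.character(x[1]), "\\s+")) %in%
               unlist(strsplit(as.character(x[2]), "\\s+"))), '1', '0'))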

Hi @aabz,
Welcome to the RStudio Community Forum.

Google is your friend; try searching for "fuzzy text matching in R".
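As a rough starting point, here is a minimal sketch of word-level fuzzy matching using base R's `adist()`, which computes Levenshtein edit distances. The helper name `fuzzy_word_match` and the `max_rel_dist` threshold of 0.45 are assumptions made for illustration, not an established method; the threshold would need tuning against real responses.

```r
# Hypothetical helper: returns 1 if any word in `subject` is "close enough"
# to any word in `correct`, else 0. Closeness is the edit distance divided
# by the length of the longer word, so short words are not matched too easily.
fuzzy_word_match <- function(correct, subject, max_rel_dist = 0.45) {
  correct_words <- tolower(unlist(strsplit(correct, "\\s+")))
  subject_words <- tolower(unlist(strsplit(subject, "\\s+")))
  # Matrix of Levenshtein distances between every pair of words
  distances <- adist(correct_words, subject_words)
  # Matrix of the longer word length for every pair
  longest <- outer(nchar(correct_words), nchar(subject_words), pmax)
  as.integer(any(distances / longest <= max_rel_dist))
}

correct_responses <- c("Gondola On Canal", "Sphinx", "Don't Know",
                       "Millennium Dome", "Pisa")
subject_responses <- c("Grand Canal Venice", "Great Sphinx", "Mountain Everest",
                       "02 Arena", "Leaning tower of Pizza")

# Score each row; USE.NAMES = FALSE keeps the result an unnamed vector
fuzzy_scores <- mapply(fuzzy_word_match, correct_responses, subject_responses,
                       USE.NAMES = FALSE)
```

With this threshold, "Pizza" is accepted for "Pisa" (edit distance 2 over a longest length of 5), while "02 Arena" still scores 0 against "Millennium Dome", because no amount of spelling tolerance links two unrelated names.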

If you expect wildly different answers that mean the same location (e.g., "O2 Arena" versus "Millennium Dome"), then you might have to build a database of synonyms.
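A synonym database can be as simple as a named character vector. The sketch below is only an illustration: the `synonyms` table, its entries, and the `canonicalise` helper are hypothetical, and a real table would have to be built by hand from the answers you are willing to accept.

```r
# Hypothetical synonym table: names are accepted alternatives (lower case),
# values are the canonical answers they stand for.
synonyms <- c("o2 arena" = "millennium dome",
              "the o2"   = "millennium dome")

# Hypothetical helper: map a response to its canonical form if it is a
# known synonym; otherwise return the lower-cased, trimmed response.
canonicalise <- function(response) {
  key <- tolower(trimws(response))
  ifelse(key %in% names(synonyms), unname(synonyms[key]), key)
}

# Score 1 when both responses reduce to the same canonical answer
synonym_match <- function(correct, subject) {
  as.integer(canonicalise(correct) == canonicalise(subject))
}

synonym_scores <- synonym_match(c("Millennium Dome", "Pisa"),
                                c("O2 Arena", "Mountain Everest"))
```

Note that the lookup is exact after lower-casing, so a typo such as "02 Arena" (with a zero) would need its own entry in the table or a separate spelling-tolerant step.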

Thank you for the recommendation! I will look into fuzzy matching.
