Comparing 2 variables

Hi there, I want to find the best match between the variables "o" and "p".

The logic should allow some fuzzy matching, so that strings that are not typed properly can still be matched, e.g. "benit" = "benito".

In addition, the logic should give priority to the observations in variable "p" that are a full match (all or most of the strings in variable "o" are found among the strings in variable "p") and that have the fewest excess strings. For example:

o <- c("rosa benito 1 2 3")
p <- c("rosa benit 1 2 3  5 8 4 6", "ros benito 1 2 3 4", "rosa benito 1 2 3 4 5 6 7")


In this case the best match would be "ros benito 1 2 3 4".

Can someone help me here? I'll post a better example below. Many thanks.

o <- data.frame(id_o = c(1, 2), description_o = c("adam carla bryan 19 18 17", "rosa benito 1 2 3"))

p<- data.frame(id_p = c(1,2, 3, 4, 5, 6, 7, 8, 9, 10), description_p = c("adam bryan carla 18 17 19",
                                                        "adam carla bryan 19 18 17 16",
                                                        "adam carla bryan 19 18 17 16 15 14 13",
                                                        "adam carla bryan 19 18 17 16 15 14 13",
                                                        "adam carla  19 18 17 16 15",
                                                        "adam car bry  19 18 17 16 15",
                                                        "rosa benito 1 3",
                                                        "rosa benito 2 3",
                                                        "rosa benit 1 2 3",
                                                        "rosaaa benito 1 2 3"))

# Desired output
q <- data.frame(id_o = c(1, 2), description_o = c("adam carla bryan 19 18 17", "rosa benito 1 2 3"),
                id_p = c(1, 9),
                description_p = c("adam bryan carla 18 17 19",
                                  "rosa benit 1 2 3"))

Check out the adist function. I think it is exactly what you want.
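As a rough sketch of how adist could be used here (assuming the descriptions are compared as whole strings; the names d and best are only for illustration): adist returns a matrix of generalized Levenshtein edit distances, and which.min then picks the closest "p" for each "o".

```r
o <- c("adam carla bryan 19 18 17", "rosa benito 1 2 3")
p <- c("adam bryan carla 18 17 19",
       "adam carla bryan 19 18 17 16",
       "adam carla bryan 19 18 17 16 15 14 13",
       "adam carla bryan 19 18 17 16 15 14 13",
       "adam carla  19 18 17 16 15",
       "adam car bry  19 18 17 16 15",
       "rosa benito 1 3",
       "rosa benito 2 3",
       "rosa benit 1 2 3",
       "rosaaa benito 1 2 3")

# Edit-distance matrix: one row per element of o, one column per element of p
d <- utils::adist(o, p)

# Index of the closest p for each o (the first one wins on ties)
best <- apply(d, 1, which.min)

data.frame(description_o = o,
           id_p = best,
           description_p = p[best],
           distance = d[cbind(seq_along(o), best)])
```

Because this compares characters in order, it tolerates typos such as "benit" but penalises reordered words: for the first description, "adam bryan carla 18 17 19" ends up further away than "adam carla bryan 19 18 17 16".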

I had a hard time understanding your question, however this might do the trick.

o <- c("rosa benito 1 2 3","rosa benito 1 2 3 4 5")
p <- c("rosa benit 1 2 3  5 8 4 6", "ros benito 1 2 3 4", "rosa benito 1 2 3 4 5 6 7")

# Scores how well y fits into x: counts how many whitespace-separated
# tokens of y occur as (fixed) substrings of x. This follows the request to
# give priority to the observations in "p" that are a full match (all or
# most of the strings in "o" are found in "p").
matchScore <- function(x, y) {
  tokens <- unlist(strsplit(y, "\\s+"))
  sum(vapply(tokens, function(s) grepl(s, x, fixed = TRUE), logical(1)))
}

# pair every o with every possible p
op <- expand.grid(o = o, p = p, stringsAsFactors = FALSE)
op$id_o <- rep(seq_along(o), times = length(p))
op$id_p <- rep(seq_along(p), each = length(o))

# apply the matchScore function to every (o, p) pair
op$score <- mapply(matchScore, op$o, op$p, USE.NAMES = FALSE)

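# For every value in o, keep the p options with the highest score and, among
# those, take the shortest one. The filtered rows are sorted on their own
# lengths; sorting them with indices computed on the unfiltered group would
# misalign the rows. Ties on score and length keep the first p in input order.
bestMatches <- do.call(rbind, lapply(split(op, op$o), function(x) {
  best <- x[x$score == max(x$score), ]
  head(best[order(nchar(best$p)), ], 1)
}))
row.names(bestMatches) <- NULL
View(bestMatches)


# result
#                       o                         p id_o id_p score
# 1     rosa benito 1 2 3        ros benito 1 2 3 4    1    2     5
# 2 rosa benito 1 2 3 4 5 rosa benit 1 2 3  5 8 4 6    2    1     7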
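The substring test above does not catch typos such as "rosaaa", and it does not penalise extra tokens directly. One possible variation, only a sketch under my own assumptions, scores token by token with adist. The helpers tokenize, score_pair and best_match are hypothetical names, and the tolerance of one edit per four characters (so short tokens like "1" or "19" must match exactly) is an arbitrary choice. A candidate token may also be reused for several target tokens, which is a simplification.

```r
o <- data.frame(id_o = c(1, 2),
                description_o = c("adam carla bryan 19 18 17", "rosa benito 1 2 3"),
                stringsAsFactors = FALSE)
p <- data.frame(id_p = 1:10,
                description_p = c("adam bryan carla 18 17 19",
                                  "adam carla bryan 19 18 17 16",
                                  "adam carla bryan 19 18 17 16 15 14 13",
                                  "adam carla bryan 19 18 17 16 15 14 13",
                                  "adam carla  19 18 17 16 15",
                                  "adam car bry  19 18 17 16 15",
                                  "rosa benito 1 3",
                                  "rosa benito 2 3",
                                  "rosa benit 1 2 3",
                                  "rosaaa benito 1 2 3"),
                stringsAsFactors = FALSE)

# Split a description into whitespace-separated tokens
tokenize <- function(s) strsplit(trimws(s), "\\s+")[[1]]

# Hypothetical scoring helper: for each token of `target`, find the closest
# token of `candidate` by edit distance and count it as matched when the
# distance is within the (assumed) tolerance.
score_pair <- function(target, candidate) {
  t_tok <- tokenize(target)
  c_tok <- tokenize(candidate)
  d <- utils::adist(t_tok, c_tok)        # token-by-token edit distances
  nearest <- apply(d, 1, min)            # closest candidate token per target token
  allowed <- floor(nchar(t_tok) / 4)     # assumption: 1 edit per 4 characters
  matched <- sum(nearest <= allowed)
  c(matched  = as.numeric(matched),
    excess   = as.numeric(length(c_tok) - matched),
    distance = as.numeric(sum(nearest)))
}

# Pick the candidate with the most matched tokens, then the fewest excess
# tokens, then the smallest total edit distance.
best_match <- function(target, candidates) {
  s <- t(vapply(candidates,
                function(cand) score_pair(target, cand),
                c(matched = 0, excess = 0, distance = 0),
                USE.NAMES = FALSE))
  order(-s[, "matched"], s[, "excess"], s[, "distance"])[1]
}

idx <- vapply(o$description_o, best_match, integer(1),
              candidates = p$description_p, USE.NAMES = FALSE)

result <- data.frame(id_o = o$id_o,
                     description_o = o$description_o,
                     id_p = p$id_p[idx],
                     description_p = p$description_p[idx])
result
```

On the example data from the question this selects id_p 1 and 9, which is the pairing given as the desired output there. The tolerance rule would probably need tuning on real data.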
