Trying to do Sorensen-Dice matching (intersect() and setdiff())

pathos · February 4, 2022, 10:40am

Let's say I have the following sample data:

  postcode postcode_city    
  <chr>    <chr>            
1 3069 XJ  3069 XJ Rotterdam
2 3076 BJ  3076 BJ Rotterdam
3 3037 EA  3037 EA Rotterdam
4 3043 KC  3043 KC Rotterdam
5 3031 AM  3031 AM Rotterdam
6 3039 ZK  3039 ZK Rotterdam

I found a package that doesn't install into the current version of R, so I looked at the source code here: OmicsMarkeR source: R/stability.R

With a small deletion, essentially, this is the code:

sorensen <- function(x,y){
    index <- 
        2*(length(intersect(x,y)))/(2*(length(intersect(x,y)))+
                                        length(setdiff(x,y))+
                                        length(setdiff(y,x)))
    return(index)
}

### the goal:
sorensen(df$postcode, df$postcode_city)
# [1] 0

### since above isn't working, attempting individual parts
intersect(df$postcode[1], df$postcode_city[1])
# character(0)
setdiff(df$postcode[1], df$postcode_city[1])
# [1] "3069 XJ"
setdiff(df$postcode_city[1], df$postcode[1]) # just reversed x:y to y:x
# [1] "3069 XJ Rotterdam"

So setdiff seems to be off, and intersect doesn't seem to work at all.

pieterjanvc · February 4, 2022, 1:29pm

Hi,

You cannot compare strings like that using intersect or setdiff.
"3069 XJ" is not the same as "3069 XJ Rotterdam", so there will be no intersection and everything will be different. It's not clear what your goal is here, as the SD coefficient is based on similarities between lists, but I don't see which lists you are trying to compare.

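To make this concrete, here is a minimal sketch (the variable names are only illustrative) showing that intersect() treats each string as one indivisible element, and that splitting the strings into single characters first gives the set operations something to overlap on:

```r
a <- "3069 XJ"
b <- "3069 XJ Rotterdam"

# Whole strings are compared as single elements, so nothing matches
whole_overlap <- intersect(a, b)
whole_overlap
#> character(0)

# Split each string into individual characters (strsplit returns a list)
a_chars <- strsplit(a, "")[[1]]
b_chars <- strsplit(b, "")[[1]]

# Now the shared characters show up in the intersection
char_overlap <- intersect(a_chars, b_chars)
char_overlap
#> [1] "3" "0" "6" "9" " " "X" "J"
```
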
You can look at which postcodes have the same city for example, or the number of unique postcodes etc, but for that you'd first need to transform your data. For example:

library(tidyverse)

myData = data.frame(
  
  stringsAsFactors = FALSE,
  postcode = c("3069 XJ","3076 BJ","3037 EA",
               "3043 KC","3031 AM","3039 ZK"),
  postcode_city = c("3069 XJ Rotterdam","3076 BJ Rotterdam","3037 EA Rotterdam",
                    "3043 KC Rotterdam","3031 AM Rotterdam","3039 ZK Rotterdam")
)
myData
#>   postcode     postcode_city
#> 1  3069 XJ 3069 XJ Rotterdam
#> 2  3076 BJ 3076 BJ Rotterdam
#> 3  3037 EA 3037 EA Rotterdam
#> 4  3043 KC 3043 KC Rotterdam
#> 5  3031 AM 3031 AM Rotterdam
#> 6  3039 ZK 3039 ZK Rotterdam


myData %>% separate(postcode, c("postcode", "abbr")) %>% 
  mutate(city = str_remove(postcode_city, "^\\d+\\s\\w+\\s")) %>% 
  select(-postcode_city)
#>   postcode abbr      city
#> 1     3069   XJ Rotterdam
#> 2     3076   BJ Rotterdam
#> 3     3037   EA Rotterdam
#> 4     3043   KC Rotterdam
#> 5     3031   AM Rotterdam
#> 6     3039   ZK Rotterdam

Created on 2022-02-04 by the reprex package (v2.0.1)

Now you can do more analyses based on any of the 3 variables. Please explain a bit more about what you would like to do if needed.

Hope this helps,
PJ

pathos · February 4, 2022, 1:53pm

What I want to do is string fuzzy matching, using Dice's coefficient.

postcode and postcode_city are the lists I would like to compare.

Essentially, with the currently shown sample lists, there should be 100% similarity or close (I'm assuming).

pieterjanvc · February 4, 2022, 2:49pm

Hi,

When you look at the dataset I created, you can extract exactly that information, because the last column (city) is identical for all rows, so no special logic is needed. Of course, if this is just a dummy example and you want to compare string similarity (e.g. when there might be typos or other string variations), there are other methods that compare strings on a character basis.

Does this make sense? Please provide more examples if this is not what you are looking for.
PJ

pathos · February 7, 2022, 6:06am

Yes they are all identical -- this is just an example. As I mentioned, it should result in 100% similarity.

And you are right that I would like to do string comparisons, but I would specifically like to try the Sorensen-Dice method.
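For reference, one common way to apply Dice's coefficient to two strings is to compare their sets of adjacent character pairs (bigrams). The sketch below assumes that variant; the helper names `bigrams` and `dice_bigram` are made up for illustration and do not come from OmicsMarkeR or any other package:

```r
# Hypothetical helper: the unique adjacent character pairs of one string
bigrams <- function(s) {
  n <- nchar(s)
  if (is.na(s) || n < 2) return(character(0))
  unique(substring(s, 1:(n - 1), 2:n))
}

# Hypothetical helper: Dice coefficient on the two bigram sets,
# 2 * |A and B| / (|A| + |B|)
dice_bigram <- function(x, y) {
  bg_x <- bigrams(x)
  bg_y <- bigrams(y)
  # No bigrams on either side: the score is undefined
  if (length(bg_x) + length(bg_y) == 0) return(NA_real_)
  2 * length(intersect(bg_x, bg_y)) / (length(bg_x) + length(bg_y))
}

dice_bigram("night", "nacht")
#> [1] 0.25

# 6 shared bigrams, 6 + 16 bigrams in total, i.e. 12 / 22
dice_bigram("3069 XJ", "3069 XJ Rotterdam")
```

Under this variant the sample rows would score about 0.55 rather than 100%, because the city name contributes bigrams that the bare postcode does not have.
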

nirgrahamuk · February 7, 2022, 10:27am


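library(tidyverse)

# Example data: each car name paired with the name from the previous row
# (val2 is NA in the first row)
(example_df <- enframe(rownames(mtcars)) %>% mutate(val2 = lag(value)))

# Sorensen-Dice on the sets of characters in each pair of strings,
# applied row by row
sorensen <- function(fullx, fully) {
  purrr::map2_dbl(fullx, fully, ~ {
    # split each string into single characters
    x <- strsplit(.x, split = "") %>% unlist()
    y <- strsplit(.y, split = "") %>% unlist()
    common <- length(intersect(x, y))
    2 * common / (2 * common + length(setdiff(x, y)) + length(setdiff(y, x)))
  })
}

### the goal:
sorensen(example_df$value, example_df$val2)

# or
example_df %>% mutate(myscore = sorensen(value, val2))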
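As a rough sketch of how the same character-set idea plays out on the postcode data from the first post, here is a base-R rewrite. The name `sorensen_chars` and the two-row data frame are only illustrative, and the denominator uses the equivalent form |A| + |B|:

```r
# Hypothetical base-R variant: Dice coefficient on the sets of
# unique characters of each pair of strings, row by row
sorensen_chars <- function(fullx, fully) {
  mapply(function(a, b) {
    x <- unique(strsplit(a, "")[[1]])
    y <- unique(strsplit(b, "")[[1]])
    2 * length(intersect(x, y)) / (length(x) + length(y))
  }, fullx, fully, USE.NAMES = FALSE)
}

# First two rows of the sample data from the original question
df <- data.frame(
  postcode = c("3069 XJ", "3076 BJ"),
  postcode_city = c("3069 XJ Rotterdam", "3076 BJ Rotterdam")
)

# Each postcode has 7 distinct characters, all of them shared;
# "Rotterdam" adds 8 more on the other side, so each score is 14 / 22
scores <- sorensen_chars(df$postcode, df$postcode_city)
scores
```

So even on this sample the score comes out around 0.64, not 100%, since the extra city characters count against the match.
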
