Detect typos in dataframe column (levensthein)

laplanca · February 18, 2022, 4:59pm

Hello,

In a large data file, I want to identify duplicates based on several criteria. I use a small program for this, but it seems imperfect and does not detect input errors, for example. I would indeed keep finding a way to use something like the levensthein distance to detect errors like JEANINE/JANNINE. Only I can't find how to use this on the same column of sorting tables so that all the lines are compared with each other (first name for example). I thought about going through using python or sql in R but I'm not quite sure how to set this up. If you see how to do it or another method, I'm all ears! Thanks in advance

nirgrahamuk · February 18, 2022, 5:24pm

library(stringdist)
library(randomNames)

set.seed(42)
(somenames <- randomNames(100,which.names = "first"))
(targetnames <- sample(somenames,10))

stringdistmatrix(a = targetnames,b=somenames,useNames = TRUE,method = "lv")

taking 50 names and finding their distance from 10 name target subset
imagine the 50 are your possible names and the 10 are your dictionary of names.

laplanca · February 21, 2022, 7:53am

It seems like i can't use those packages and library and I don't know why... Otherwise it looks like a good solution!

nirgrahamuk · February 21, 2022, 8:08am

Have you installed a package before?

laplanca · February 21, 2022, 8:29am

Yes of course, I juste have to download a more recent version of R, now it works, but the first name proposed don't look like mine. I tried to put in a variable prenomdiff all my distinct names thanks to a sql procedure and then use your method then but it don't works properly :

prenomdiff<-sqldf('select distinct(I_PRENOM) from data')
(somenames <- prenomdiff)
(targetnames <- prenomdiff)

test<-stringdistmatrix(a = targetnames,b=somenames,useNames = TRUE,method = "lv")

error message :
In stringdistmatrix(a = targetnames, b = somenames, useNames = TRUE,  : 
You are passing one or more arguments of type 'list' to
'stringdistmatrix'. These arguments will be converted with 'as.character'
which is likeley not to give what you want (did you mean to use 'seq_distmatrix'?).
This warning can be avoided by explicitly converting the argument(s).

laplanca · February 21, 2022, 8:40am

I test on a sample to show you what I want to got at the end. Moreover I would like to print only when the lv value is inferior or equal at 3, or color the values i want to put in evidence. I don't know if it's possible.

nirgrahamuk · February 21, 2022, 9:39am

sqldf returns data.frames, whereas the stringdismatrix function wants to consume character vectors.
you can use R's $ syntax to selected a character vector column from a data.frame, for example

prenomdiff$I_PRENOM

Its possible to print and control the style of the matrix result.
For example :

library(stringdist)
library(randomNames)
library(tidyverse)

set.seed(42)
#altered to guarantee uniqueness
(somenames <- unique(randomNames(100,which.names = "first")))
(targetnames <- sample(somenames,10,replace = FALSE))

(amat <- stringdistmatrix(a = targetnames,
                          b = somenames,
                          useNames = TRUE, 
                          method = "lv"))
library(gt)
as.data.frame(amat) %>%
  gt() %>%
  data_color(
    columns = colnames(amat),
    colors = scales::col_numeric(
      palette = "Blues",
      domain = c(0, 3),
      na.color = "#FFFFFF",
    )
  )

nirgrahamuk · February 21, 2022, 2:08pm

somenames <- c("MOHAMED","ELHOUSSINE","LUIS","JOEL","MOHAMMED","YVES","VICTOR","FADHEL","MAXIME","ALAIN")
targetnames <- c("MOHAMED","ELHOUSSINE","LUIS","JOEL","MOHAMMED","YVES","VICTOR","FADHEL","MAXIME","ALAIN","MXYOME")
test<-stringdistmatrix(a = targetnames,b=somenames,useNames = TRUE,method = "lv")

calc_result <- as.data.frame(test) %>% 
  rownames_to_column() %>% 
  pivot_longer(cols=-rowname) %>% 
  filter(value<3)

nirgrahamuk · February 21, 2022, 3:03pm

I just tested stringdistmatrix on my laptop with inputs of 2000 by 2000, it resulted in a 32mb matrix and took less than 3 seconds.

when you say you cant apply the stringdistmatrix function on this size of data, what do you mean exactly ?

P.s. I'm surprised you aren't comparing a larger 'possibles' list to a shorter 'true' list. I have to wonder what it benefits you to compare a possible list with itself? but hope you have your own ideas about your purpose that it makes sense to you.

laplanca · February 21, 2022, 3:27pm

well, sorry I didn't look well, I do have an error but it's not link to the size, i tried with a df of more then 130000 and it's work.
With my df with 1900 rows I have this error that I don't understand :

Error:
! Column 1614 must be named.
Caused by error in `stop_vctrs()`:
! Names can't be empty.
x Empty name found at location 1614.
Run `rlang::last_error()` to see where the error occurred.
>

To answer to your PS, I am not sure either if what i'am doing is good, but my point is to find duplicates rows in a database so I think it's good like this.

My other problem now is to try to put an another variable from the initial data to compare if the names are close because of a mistake or by hasard (for example if date of birth are the same)

nirgrahamuk · February 21, 2022, 3:34pm

I don't know how to interpret your error... You didnt share any code in relation to it that might help me.
as we discussed previously stringdistmatrix does not take a dataframe as input.
Perhaps the process you are using to go from your data.frame to the stringdismatrix inputs is problematic, but I can't see it.

FAQ: How to do a minimal reproducible example ( reprex ) for beginners Guides & FAQs

A minimal reproducible example consists of the following items: A minimal dataset, necessary to reproduce the issue The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages. Let's quickly go over each one of these with examples: Minimal Dataset (Sample Data) You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame head(iris) #> Sepal.Length Sepal.Width Petal.Length Petal.Width Species #> 1 5.1 3.5 1.4 0.…

laplanca · February 24, 2022, 1:44pm

Well, I don't have this error now. The cose I use is this one :

somenames <- np2[sample(1:nrow(np2),10000, ),] 
targetnames <-somenames
test<-stringdistmatrix(a = targetnames,b=somenames,useNames = TRUE,method = "lv")

calc_result <- 
        as.data.frame(test) %>% 
        rownames_to_column() %>% 
        pivot_longer(cols=-rowname) %>% 
        filter(value<=2 & value > 0)

np2 represent a combination of name, first name and date of birth to try to identifie the person who are the nearest. The problem is that I have 30000 entries in my base and if I put more than 10000 in my sample, I have an error because it outstand the memory capacity of R. So this program is very good but don't take in charge enough lines in my case. I don't think there is any solution to resolve this.

nirgrahamuk · February 24, 2022, 1:48pm

you could try reducing the targetnames to only those beginning with a particular letter of the alphabet.
this should result in an object roughly 26th the size of the total otherwise.
you can iterate the analysis over each targetname set beginning with a new alphabetic character
at each stage keep what is interesting and discard what is not. recombine at the end.

laplanca · February 24, 2022, 2:15pm

I see what you mean, but I have no idea of how put this in place
Also, I have of course each output two time, in each way :
(jean dupont / jean dupond and jean dupond/ jean dupont)
I'm sure I can do something to correct that but I don't find out for now.

nirgrahamuk · February 24, 2022, 5:01pm

library(stringdist)

somenames <- c("MOHAMED","ELHOUSSINE","LUIS","JOEL","MOHAMMED","YVES","VICTOR","FADHEL","MAXIME","ALAIN")
targetnames <- c("MOHAMED","ELHOUSSINE","LUIS","JOEL","MOHAMMED","YVES","VICTOR","FADHEL","MAXIME","ALAIN","MXYOME")

do_it <- function(somenames,targetnames,letter){
subnames <- targetnames[substr(targetnames,1,1)==letter]
  as.data.frame(stringdistmatrix(a = subnames,b=somenames,useNames = TRUE,method = "lv")) %>% 
  rownames_to_column() %>% 
  pivot_longer(cols=-rowname) %>% 
  filter(value<=2 & value > 0)
}

do_it(somenames,
      targetnames,
      "A")

do_it(somenames,
      targetnames,
      "M")

library(purrr)

all_results <- map_dfr(LETTERS,
    ~do_it(somenames,
           targetnames,
           .x))

laplanca · February 25, 2022, 8:32am

This is very great! It give exactly what I need! Also, do you know how to deal with my problem of repetition in each way
(002 jean dupont/ 043 jean dupond and 043 jean dupond/ 002 jean dupont) (tell me if my explication is not clear)?
It can be good because it will delete the half of the observations

laplanca · February 28, 2022, 8:44am

I think I found the solution, thank for all, you helped me so much! I used this :

result<-unique(t(apply(all_results, 1, sort)))

system · March 7, 2022, 8:44am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.