Is there an easy way to consolidate inconsistent string data?

Hello,

I'm new to R and brand new to the Rstudio community. I'm working with a dataset for a request process and one of the variables is the name of the requestor. I want to run some basic descriptive stats on the dataset. Problem is, the names of the requestors have been input very inconsistently. I see variations similar to what's below but with many different names.

Requestor
Doe, John
JDoe
John Doe
Johndoe
JJones
Jane Jones
Jones, Jane

Is there a way to consolidate the different versions of these names into one unique name per person (John Doe and Jane Jones)? I have many different names in the dataset, but it seems like there are a limited number of different structures (Doe, John vs. John Doe).

It's doable, but not necessarily easily, depending on the degree of comfort with regular expressions and the conformity of the example to the data to be processed. In particular, names in ALLCAPS or alllower, such as JANETROPE and janetrope (is that Jane Trope or Janet Rope?). For that, you need some data source of surnames to split against.

Here's a starter to show how to identify the different patterns shown in the data. Play around with that and come back with questions on how to take the results and transform into some standard form, such as lname,fname.

library(stringr)
dat <- data.frame(
  Requestor =
    c("Doe, John", "JDoe", "John Doe", "Johndoe", "JJones", "Jane Jones", "Jones")
)

pattern1 <- "^[A-Z][a-z]+[,][ ]*[A-Z][a-z]+$" # Doe, John
pattern2 <- "^[A-Z][A-Z][a-z]+$" #JDoe JJones
pattern3 <- "^[A-Z][a-z]+[ ]*[A-Z][a-z]+$" #John Doe Jane Jones
pattern4 <- "^[A-Z][a-z]+$" 
pattern5 <- "^[A-Z]+[a-z]$"
str_detect(dat$Requestor,pattern1)
#> [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
str_detect(dat$Requestor,pattern2)
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
str_detect(dat$Requestor,pattern3)
#> [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
str_detect(dat$Requestor,pattern4)
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
str_detect(dat$Requestor,pattern5)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1 Like

Thank you! I'm just starting to get into regular expressions so this will be a good exercise. I'll follow up when I make some progress.

1 Like

Hey there,

I've made some headway on this. Here's what I've done so far. Sorry if it's hard to follow, reprex's are next on my list of things to learn.

  • I changed all characters to lower case using a combination of mutate() and str_to_lower() with the intent of removing any complication caused by differences in case.
  • I learned some about the format of the data by using str_detect() and regular expressions like you had listed in your previous comment.
  • I noticed that I had a pretty limited number of last names so I exported each unique value in my requestor column and created a wordbank of surnames based on what I saw.
  • I returned to R and used str_extract() to create a column that extracted the appropriate surname from the original column.

I'm now almost where I need to be. I want to find a way to pull a first initial and put it with the last name I have extracted. For example, I want to return "jdoe" instead of just "doe". Is there a way to extract all of the characters that do NOT match a pattern in stringr? I think I can figure out the rest of the problem but I'm stuck on this specific piece.

Something like this?

library(stringr)
raw <- "jdoe"
surnames <- "doe|smith|jones"
initials <- str_split(raw,surnames)[[1]][1]
lname <- str_extract(raw,surnames)
initials
#> [1] "j"
lname
#> [1] "doe"
fname <- c(initials,lname)
fname
#> [1] "j"   "doe"
1 Like

That worked! Thanks for helping out!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.