Using column values (strings) in df1 to find (grep?) matches in df2, then pull different df2 variable into new column in df1

Ross_Miller · May 4, 2021, 7:11pm

So I am sure the title is a bit confusing, but what I am trying to do is this: I have two datasets - I have a column in df1 in which the values are all full names (~80,000). I have a column in df2 (length ~20,000) that includes V1 - a column of names, but originally there were several names in each "cell" with many names per row separated by forward slashes, and v2, a column with an arbitrary index reference (character). What I want to do search all of the df2 column for full matches of df1 names, then create a new df1 column extracting V2 from df2 from any rows where a df1 name is matched. The names in df2 are very messy and may include extra words, so it can't be an exact match, rather it needs to be a string match within the value. I originally tried to GREP the matches out of the original column, but then to simplify I separated all the names in df2 and applied pivot_longer so that now there is ony "one" name per row (now ~1.9M rows and the name bariable is still very messy and often includes various titles and other words). Now I am wondering how to execute the above: 1. Is there a way to use a list/vector of the df1 names within a grep/str_subset call to the entire v1 column and return the matched string and v2 into new variables; 2. Can/should I instead create a for-loop to itereate each name in a list of df1 names into a GREP call for every row and return the matched string and v2 into new variables; 3. Am I considering the probelm the wrong way and there is an entirely different solution? Thanks in advance, I have been a passive reader of this forum for a while but this is my first post for help.

technocrat · May 4, 2021, 9:25pm

This will not cover the case but may help in reducing the scope of the problem.

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})
the_names <- data.frame(df1 = c("Oliver Wendall Holmes","Richard Feynman"),
                        df2 = c("Justice Holmes", "Nobel Prize Winner Feynman"))

pattern <- "\\b \\b\\w+$"
the_names <- the_names %>% mutate(df1_surnames = str_extract(df1, pattern)) %>%
                           mutate(df2_surnames = str_extract(df2,pattern))

the_names
#>                     df1                        df2 df1_surnames df2_surnames
#> 1 Oliver Wendall Holmes             Justice Holmes       Holmes       Holmes
#> 2       Richard Feynman Nobel Prize Winner Feynman      Feynman      Feynman

Ross_Miller · May 5, 2021, 5:02pm

technocrat:

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})
the_names <- data.frame(df1 = c("Oliver Wendall Holmes","Richard Feynman"),
                        df2 = c("Justice Holmes", "Nobel Prize Winner Feynman"))

pattern <- "\\b \\b\\w+$"
the_names <- the_names %>% mutate(df1_surnames = str_extract(df1, pattern)) %>%
                           mutate(df2_surnames = str_extract(df2,pattern))

the_names
#>                     df1                        df2 df1_surnames df2_surnames
#> 1 Oliver Wendall Holmes             Justice Holmes       Holmes       Holmes
#> 2       Richard Feynman Nobel Prize Winner Feynman      Feynman      Feynman

Thank you, this is very helpful conceptually! I had tried some regex-based name cleaning to create a merge key and this code helped to simplify. I am now further down the clean "crosswalk" path, but it still seems to me that there should be a function that would execute an iterative search for matches to each value using a list of strings. While the total number and types of words in the messy column are varied (some are not proper names at all) all the proper names appear in the variable in the same format: firstname[space] lastname, so if I could match the string within the variable value and print it to a new variable the process would be much more efficient.

technocrat · May 5, 2021, 5:16pm

That is in the messy column, so that both columns contain the same string—firstname[space] lastname? If so

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

the_names <- data.frame(df1 = c("Oliver Wendell Holmes","Richard Feynman"),
                        df2 = c("Justice Oliver Holmes", "Richard Feynman, Nobel Prize Winner"))

the_names <- the_names %>% mutate(matched = ifelse(str_detect(df2,df1),TRUE,FALSE))

the_names
#>                     df1                                 df2 matched
#> 1 Oliver Wendell Holmes               Justice Oliver Holmes   FALSE
#> 2       Richard Feynman Richard Feynman, Nobel Prize Winner    TRUE

system · May 26, 2021, 5:16pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.