Joining datasets with inconsistent keys

bayesian · July 7, 2021, 5:20pm

I am trying to attach demographics data (full_name and personality) to a dataset of texts written by individuals (reprex provided below). There are several issues:

In the dataset, whenever there are two people with the same name in the same year, both names (in column name) include differentiating last name initials (e.g., Adam A. and Adam B.)
However, in demographics, when such duplicates exist, the first name does not have their surname added as an initial (e.g., Adam and Adam B.) - To standardize, I created an initials column that added surname initials to everyone, with a view to aiding the join later.
It is also possible for the same person to appear in different years, but if there is no first name duplicate in that year, there will be no initials (e.g., Laura C. in year 3 vs just Laura in year 6)
Possible also to have a different person with same initials (Adam Apple vs Adam Another -- both Adam in name and Adam A. in initials. The only way we tell them apart is by year)

This is what I am trying to achieve by joining demographics to dataset:

Note: typo for name in second last row -- should be Claudia J. instead of just Claudia

I thought about doing multiple left joins, first for the non-initial names and then by = c("name" = "name")) and then for those with same name in same year and thus have initials by = c("name" = "initials")) but I got very funky results and different nrows.

Reprex data:

dataset <- tibble::tribble(
  ~year,        ~name,                            ~text,
     3L,    "Adam A.",                          "fubar",
     3L,    "Adam B.",                       "asdsdasd",
     3L,   "Laura B.",                           "blah",
     3L,   "Laura C.",                 "brown hairball",
     3L,   "Laura C.",                 "black hairball",
     3L,       "John",                "quick brown fox",
     3L,       "Zeke",                  "over lazy dog",
     6L,       "Adam", "different person same initials",
     6L,       "Jack",                 "birds are cool",
     6L,      "Laura",                   "appear again",
     6L,    "Claudia J.",                            "foo",
     6L, "Claudia M.",                            "bar"
  )


# initials field is created  
demographics <- tibble::tribble(
  ~year,       ~full_name,        ~name,    ~initials, ~personality,
     3L,     "John Green",       "John",    "John G.",       "INTP",
     3L,     "Adam Apple",       "Adam",    "Adam A.",       "INTJ",
     3L,    "Adam Banana",    "Adam B.",    "Adam B.",       "ESFJ",
     3L,    "Laura Bosch",      "Laura",   "Laura B.",       "ESFJ",
     3L,    "Laura Caley",   "Laura C.",   "Laura C.",       "ISFP",
     3L,      "Zeke Wong",       "Zeke",    "Zeke W.",       "ENFP",
     6L,   "Adam Another",       "Adam",    "Adam A.",       "ESTP",
     6L,   "Jack Sparrow",       "Jack",    "Jack S.",       "INTJ",
     6L,    "Laura Caley",      "Laura",   "Laura C.",       "ISFP",
     6L,      "Abi-Maria",  "Abi-Maria",  "Abi-Maria",       "ENFJ",
     6L, "Douglas Orange",    "Douglas", "Douglas O.",       "ISFJ",
     6L,   "Claudia Jane",    "Claudia", "Claudia J.",       "ISFP",
     6L,  "Claudia Miley", "Claudia M.", "Claudia M.",       "INFP"
  )

Any help is greatly appreciated!

cactusoxbird · July 7, 2021, 5:53pm

So I think that the code below will generally address the issue you're having. However, I wasn't sure about the two year 6 entries for Claudias in your dataset: Shouldn't both Claudias have a last initial included in the name column based on issue number 1 that you provide? If that's the case, then I think the method below should be useful!

I used an inner_join to replicate the table you provide, but I'm not sure if this is the type of join you're targeting.

library(tidyverse)

dataset <- tibble::tribble(
  ~year,        ~name,                            ~text,
  3L,    "Adam A.",                          "fubar",
  3L,    "Adam B.",                       "asdsdasd",
  3L,   "Laura B.",                           "blah",
  3L,   "Laura C.",                 "brown hairball",
  3L,   "Laura C.",                 "black hairball",
  3L,       "John",                "quick brown fox",
  3L,       "Zeke",                  "over lazy dog",
  6L,       "Adam", "different person same initials",
  6L,       "Jack",                 "birds are cool",
  6L,      "Laura",                   "appear again",
  6L,    "Claudia",                            "foo",
  6L, "Claudia M.",                            "bar"
)


# initials field is created  
demographics <- tibble::tribble(
  ~year,       ~full_name,        ~name,    ~initials, ~personality,
  3L,     "John Green",       "John",    "John G.",       "INTP",
  3L,     "Adam Apple",       "Adam",    "Adam A.",       "INTJ",
  3L,    "Adam Banana",    "Adam B.",    "Adam B.",       "ESFJ",
  3L,    "Laura Bosch",      "Laura",   "Laura B.",       "ESFJ",
  3L,    "Laura Caley",   "Laura C.",   "Laura C.",       "ISFP",
  3L,      "Zeke Wong",       "Zeke",    "Zeke W.",       "ENFP",
  6L,   "Adam Another",       "Adam",    "Adam A.",       "ESTP",
  6L,   "Jack Sparrow",       "Jack",    "Jack S.",       "INTJ",
  6L,    "Laura Caley",      "Laura",   "Laura C.",       "ISFP",
  6L,      "Abi-Maria",  "Abi-Maria",  "Abi-Maria",       "ENFJ",
  6L, "Douglas Orange",    "Douglas", "Douglas O.",       "ISFJ",
  6L,   "Claudia Jane",    "Claudia", "Claudia J.",       "ISFP",
  6L,  "Claudia Miley", "Claudia M.", "Claudia M.",       "INFP"
)

update_demographics <- demographics %>%
  # Split demo name into first & last cols
  separate(col = name, into = c("first", "last_init"), sep = " ") %>%
  # Are there first name duplicates in the year?
  add_count(year, first) %>%
  mutate(
    # If there's duplicates...
    join_id = if_else(
      condition = n != 1,
      # Then use name with last initial
      true = initials,
      # Otherwise just first name will do
      false = first)) %>%
  select(-c(last_init, first, n))
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 10 rows [1, 2, 4,
#> 6, 7, 8, 9, 10, 11, 12].


inner_join(
  x = dataset,
  y = update_demographics,
  by = c("year", "name" = "join_id")
)
#> # A tibble: 11 x 6
#>     year name       text                      full_name    initials  personality
#>    <int> <chr>      <chr>                     <chr>        <chr>     <chr>      
#>  1     3 Adam A.    fubar                     Adam Apple   Adam A.   INTJ       
#>  2     3 Adam B.    asdsdasd                  Adam Banana  Adam B.   ESFJ       
#>  3     3 Laura B.   blah                      Laura Bosch  Laura B.  ESFJ       
#>  4     3 Laura C.   brown hairball            Laura Caley  Laura C.  ISFP       
#>  5     3 Laura C.   black hairball            Laura Caley  Laura C.  ISFP       
#>  6     3 John       quick brown fox           John Green   John G.   INTP       
#>  7     3 Zeke       over lazy dog             Zeke Wong    Zeke W.   ENFP       
#>  8     6 Adam       different person same in~ Adam Another Adam A.   ESTP       
#>  9     6 Jack       birds are cool            Jack Sparrow Jack S.   INTJ       
#> 10     6 Laura      appear again              Laura Caley  Laura C.  ISFP       
#> 11     6 Claudia M. bar                       Claudia Mil~ Claudia ~ INFP

^{Created on 2021-07-07 by the reprex package (v2.0.0)}

bayesian · July 7, 2021, 6:05pm

You're right! My bad. As for the join, I was actually looking at a left join because I want to keep all the rows in dataset. In my actual data, I will have some NAs for demographics because there will be individuals without such info in the database.

I am away from my computer now so I am unable to test it out, but it looks promising so I will try it out and get back to you!

bayesian · July 8, 2021, 3:48pm

Had to tweak it to account for more diverse names (e.g., longer than 2 words) but in general it worked great. The add_count() function really made things easier. Thanks!

cactusoxbird · July 8, 2021, 5:51pm

Awesome, glad to hear it!

system · July 15, 2021, 5:51pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.