Help joining two datasets together

Hi all,
I just finished a round of data collection for a new study and I need to join two sets of data from two surveys together to link participant responses, but I'm having trouble. The first survey contained participant demographics and some other info, while the second survey had info from the study that may have taken place days later (depending on condition). De-identified toy versions are below:

part_1=tibble::tribble(
         ~end_date_1,                       ~consent_name,    ~consent_email, ~diff_check, ~age, ~race, ~sex, ~ethnicity, ~time,
  "11/12/2020 12:06",                    "Addam Marbrand",       "Lannister",          2L,  18L,    6L,   2L,         1L,    2L,
   "11/14/2020 9:36",           "Aegon Frey (Jinglebell)",            "None",          2L,  18L,    2L,   2L,         2L,    1L,
  "11/14/2020 15:10",                   "Aegon Targaryen", "House Targaryen",          2L,  18L,    6L,   1L,         2L,    2L,
  "11/14/2020 18:56",                     "Adrack Humble",   "House Greyjoy",          2L,  21L,    1L,   1L,         1L,    2L,
  "11/14/2020 19:29",                    "Aemon Costayne",       "Lannister",          2L,  19L,    6L,   2L,         1L,    2L,
  "11/14/2020 20:24",                   "Aemon Estermont",       "Baratheon",          1L,  19L,    1L,   1L,         2L,    1L,
  "11/15/2020 10:27", "Aemon Targaryen (son of Maekar I)",   "Night's Watch",          1L,  21L,    1L,   2L,         1L,    2L,
  "11/15/2020 12:00",                        "Aenys Frey",            "None",          2L,  18L,    2L,   2L,         2L,    1L,
  "11/15/2020 12:41",                     "Aeron Greyjoy",   "House Greyjoy",          2L,  23L,    1L,   2L,         2L,    2L,
  "11/15/2020 17:06",                            "Aethan",   "Night's Watch",          1L,  18L,    6L,   1L,         1L,    1L,
  "11/15/2020 18:43",                             "Aggar",   "House Greyjoy",          2L,  21L,    6L,   1L,         1L,    2L,
  "11/15/2020 18:58",                              "Aggo", "House Targaryen",          2L,  19L,    1L,   2L,         1L,    2L,
  "11/15/2020 19:28",                     "Alan of Rosby",   "Night's Watch",          1L,  20L,    1L,   2L,         2L,    2L,
  "11/15/2020 19:40",                           "Alayaya",            "None",          1L,  18L,    1L,   2L,         1L,    2L,
  "11/15/2020 21:21",                       "Albar Royce",           "Arryn",          1L,  18L,    6L,   2L,         1L,    1L,
   "11/16/2020 3:36",                            "Albett",   "Night's Watch",          2L,  19L,    6L,   2L,         1L,    2L,
  "11/16/2020 10:51",                          "Alebelly",     "House Stark",          1L,  18L,    1L,   2L,         2L,    2L,
  "11/16/2020 11:12",                  "Alerie Hightower",    "House Tyrell",          2L,  18L,    1L,   2L,         2L,    1L,
  "11/16/2020 12:49",                "Alesander Staedmon",       "Baratheon",          2L,  18L,    1L,   2L,         2L,    2L,
  "11/16/2020 12:22",                   "Alester Florent",       "Baratheon",          1L,  20L,    1L,   2L,         2L,    2L,
  "11/16/2020 13:32",                   "Alia of Braavos",            "None",          3L,  18L,    1L,   2L,         2L,    1L,
  "11/16/2020 15:13",                       "Alla Tyrell",    "House Tyrell",          2L,  18L,    1L,   2L,         1L,    2L
  )

part_2=tibble::tribble(
          ~start_date,    ~end_date, ~duration_part2, ~variable_1, ~variable_2, ~variable_3, ~variable_4, ~variable_5, ~variable_6, ~variable_7, ~variable_8, ~variable_9, ~variable_10, ~variable_11, ~variable_12, ~variable_13, ~variable_14, ~variable_15, ~variable_16, ~variable_17, ~variable_18, ~variable_19, ~variable_20, ~variable_21, ~variable_22, ~variable_23, ~variable_24, ~variable_25, ~variable_26, ~variable_27, ~variable_28, ~variable_29, ~variable_30, ~variable_31,     ~name_validation, ~email_validation, ~time,
         "11/12/2020", "11/12/2020",             73L,          0L,          2L,         50L,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,         "Aenys Frey",            "None",    NA,
         "11/14/2020", "11/14/2020",             97L,          0L,          2L,         40L,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,      "Aeron Greyjoy",   "House Greyjoy",    NA,
         "11/14/2020", "11/14/2020",            108L,          1L,          1L,          NA,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,             "Aethan",   "Night's Watch",    NA,
         "11/14/2020", "11/14/2020",             61L,          0L,          2L,         40L,          1L,          1L,          2L,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,              "Aggar",   "House Greyjoy",    NA,
         "11/15/2020", "11/15/2020",             39L,          0L,          1L,          NA,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,               "Aggo", "House Targaryen",    NA,
         "11/15/2020", "11/15/2020",             22L,          0L,          1L,          NA,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,      "Alan of Rosby",   "Night's Watch",    NA,
         "11/15/2020", "11/15/2020",             66L,          1L,          1L,          NA,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,            "Alayaya",            "None",    NA,
         "11/15/2020", "11/15/2020",             61L,          1L,          1L,          NA,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,        "Albar Royce",           "Arryn",    NA,
         "11/15/2020", "11/15/2020",            796L,          0L,          1L,          NA,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,             "Albett",   "Night's Watch",    NA,
         "11/15/2020", "11/15/2020",            141L,          1L,          1L,          NA,          1L,          0L,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           "Alebelly",     "House Stark",    NA,
         "11/15/2020", "11/15/2020",             65L,          1L,          2L,         60L,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,   "Alerie Hightower",    "House Tyrell",    NA,
         "11/15/2020", "11/15/2020",             58L,          1L,          1L,          NA,          0L,          NA,          NA,          NA,          NA,          NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA,           NA, "Alesander Staedmon",       "Baratheon",    NA
         )

You can see that part 1 has more responses than part 2; not everyone who signed up came back to finish the study. So I need a way to merge these datasets so that I have every column in both, but responses are matched by name and email, and any name/email combo that does not appear in the second is dropped.

Anyone have any suggestions?

it seems to me you are asking for this :

dplyr::full_join(part_1,part_2,
          by=c("consent_name"="name_validation",
               "consent_email"="email_validation"))

using dplyr::left_join(), the by argument takes a character vector of matching columns to combine on.

I think that worked! I have to do some more verifying though. Another related question: You might have noticed that the "time" variable from survey 1 for some reason didn't transfer over in Qualtrics to survey 2, so everyone's "time" variable is blank in the second data set. Is there any way to match up those columns and have it pull the data from part 1 and put it in part 2's, matching by person?

I was tempted to try something like part_2$time[]=part_1$time[] , but I'm afraid that will just paste the column there and not match things up correctly by person

You might have noticed that when you create the combined dataframe, you get two column time.x which is all NA and time.y which is the time column from part_1. You could drop time.x and rename time.y to time.

1 Like

part_1 has more information than part_2. To merge these datasets so that you keep every column in both, you can use left_join and put part_1 as your left table to merge.
You can then drop time coming from part_2 table and rename time coming from part_1 to time, so:

combined <- left_join(part_1, part_2, by = c("consent_email" = "email_validation", "consent_name" = "name_validation")) %>%
rename(time="time.x") %>% select(-time.y)

This almost works, but it seems that some people get stuff pulled over from part_1 while others do not (pic attached). If I flip the join around to join part_2 by part_1, then the reverse happens: everyone's responses from part 2 are there, but some people are missing part_1 stuff. Any idea why?

Screenshot 2020-12-08 092539

Might have finally figured it out. I think inner join is what I am looking for. test=part_1 %>% inner_join(part_2, by=c("consent_name"="name_validation","consent_email"="email_validation"))%>% select(-(time.y)) seems to join both datasets perfectly.

The only thing I have left to figure out is why the numbers don't add up. Survey 1 has 142 responses, and Survey 2 has 117. My final combined survey should therefore have 117, but for some reason only has 95.

EDIT: potentially an issue with participants misspelling their name or email, or a difference in strings if the join is case sensitive? Not sure.

EDIT 2: converting all names and email addresses to lower case largely fixed this; I can now search through and delete duplicate strings

Have you check whether people with missing information are the ones that did not start the course?
It it likely to be so; there is not start or end date for those people, but because you used left_join the information about their name is being kept. If you want to not keep that information you can use inner_join

It is likely. You can verify that hypothesis with

table(part_1$consent_name %in% part_2$name_validation)

Checking through the data and duplicate responses might be the issue. I used the line part_1%>% group_by(consent_name,consent_email) %>% slice(1) to remove duplicate entries, but that only worked for some participants it seems, because that line of code is case-sensitive.

I'm seeing some people who typed their names with capital letters in one entry, and then with lower case letters in another. Or the same problem in the email column.

Is there a way to remove duplicate entries with stringr that can ignore case?

[quote="Longshot408, post:14, topic:90263"]
part_1%>% group_by(consent_name,consent_email) %>% slice(1)
[/quote]t

You can change their names to all lower case in both tables using

part_1<-part_1%>% mutate(consent_name=str_to_lower(consent_name) )
part_2<-part_2%>% mutate(name_validation=str_to_lower(name_validation))

Check out for typos tho!

1 Like

That works great! Thank you. This brought it up from 95 responses when I used inner_join to 106. I'm sure the remaining 10 are due to typos and such; I can see one person already who gave a different email address for each response.

I'll figure out how to clean the rest of the data from here. Much appreciated

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.