How to merge data frames

I tried merging 2 dataframes. But few observations are getting duplicated.
library(tidyverse)
data1<-tibble::tribble(
~duration, ~enumerator, ~en_name, ~survey_date, ~child_name, ~child_age2, ~l1c1_identify_pic1, ~l1c1_identify_pic2, ~l1c1_identify_pic3, ~l1c1_identify_pic4, ~l1c1_identify_pic5, ~l1c1_identify_pic6, ~l1c1_classify_bird, ~l1c1_classify_animal,
2885L, "PEN008", "Seemakausar Nadaf", "24-09-2021 00:00", "Sharat Mangoji", 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2438L, "PEN006", "Jayashree H Malipatil", "24-09-2021 00:00", "Devaraj C Tallur", 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1686L, "PEN003", "Jyoti T Patil", "24-09-2021 00:00", "Roshan begum", 6L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L,
1409L, "PEN008", "Seemakausar Nadaf", "24-09-2021 00:00", "Samrth", 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1825L, "PEN004", "Viranna Potadar", "24-09-2021 00:00", "Sanjay irappa dusad", 6L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L,
2505L, "PEN002", "Hemalata B. Bhajantri", "24-09-2021 00:00", "Mahatmappa", 6L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
1872L, "PEN009", "Savita Vandal", "24-09-2021 00:00", "Soubhagya", 6L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L
)

data2<-tibble::tribble(
~child_name, ~l1en1_recog_num1, ~l1en1_recog_num2, ~l1en1_recog_num3, ~l1en1_recog_num4, ~l1en1_recog_num5,
"Sharat Mangoji", 1L, 1L, 1L, 1L, 1L,
"Devaraj C Tallur", 1L, 0L, 0L, 0L, 1L,
"Roshan begum", 1L, 1L, 1L, 1L, 1L,
"Samrth", 1L, 0L, 0L, 1L, 0L,
"Sanjay irappa dusad", 0L, 0L, 0L, 0L, 0L,
"Mahatmappa", 1L, 0L, 0L, 1L, 0L
)

combine1<-left_join(data1,data2,by="child_name")
combine1
#> # A tibble: 7 x 19
#> duration enumerator en_name survey_date child_name child_age2 l1c1_identify_p~
#>
#> 1 2885 PEN008 Seemak~ 24-09-2021~ Sharat Ma~ 6 1
#> 2 2438 PEN006 Jayash~ 24-09-2021~ Devaraj C~ 6 1
#> 3 1686 PEN003 Jyoti ~ 24-09-2021~ Roshan be~ 6 1
#> 4 1409 PEN008 Seemak~ 24-09-2021~ Samrth 6 1
#> 5 1825 PEN004 Virann~ 24-09-2021~ Sanjay ir~ 6 1
#> 6 2505 PEN002 Hemala~ 24-09-2021~ Mahatmappa 6 1
#> 7 1872 PEN009 Savita~ 24-09-2021~ Soubhagya 6 1
#> # ... with 12 more variables: l1c1_identify_pic2 ,
#> # l1c1_identify_pic3 , l1c1_identify_pic4 ,
#> # l1c1_identify_pic5 , l1c1_identify_pic6 ,
#> # l1c1_classify_bird , l1c1_classify_animal ,
#> # l1en1_recog_num1 , l1en1_recog_num2 , l1en1_recog_num3 ,
#> # l1en1_recog_num4 , l1en1_recog_num5
Created on 2021-11-08 by the reprex package (v2.0.1)

is there a duplication within the example you provided to us ?

No. My dataset has 406 observations with 600 columns. I pasted only a part of it

The actual situation is like this. We have conducted a survey with 4 sections. I calculated the no response ratio for each section and then created data frames for each section. After that, I merged first and second. Then I merged third and fourth. I am giving reprex below.

library(tidyverse)

data1<-tibble::tribble(
  ~duration, ~enumerator,                ~en_name,       ~survey_date,           ~child_name, ~child_age2, ~l1c1_identify_pic1, ~l1c1_identify_pic2, ~l1c1_identify_pic3, ~l1c1_identify_pic4, ~l1c1_identify_pic5, ~l1c1_identify_pic6, ~l1c1_classify_bird, ~l1c1_classify_animal, ~l1c1_identification, ~l1c1_classification, ~l1c1_total, ~l1c2_identify_col1, ~NDenom.x.x, ~No.response.ratio.x.x, ~l1el1_wr_rt, ~l1el1_total, ~l1el2_fac1, ~l1el2_fac2, ~l1el2_fac3, ~l1el2_total, ~l1el3_1sen, ~l1el3_1sen_comp, ~l1el3_2sen, ~l1el3_2sen_comp, ~NDenom.y.x, ~No.response.ratio.y.x, ~l1en1_recog_num1, ~l1en1_recog_num2, ~l1en1_recog_num3, ~l1en1_recog_num4, ~l1en1_recog_num5, ~l1en1_total, ~l1en2_recog_num1, ~l1en2_recog_num2, ~NDenom.x.y, ~No.response.ratio.x.y, ~sel_conversation_l1, ~sel_focus_l1, ~sec_total_l1, ~sel_emo1_l1, ~sel_emo2_l1, ~sel_emo3_l1, ~sel_emo4_l1, ~sel_emo5_l1, ~ser_total_l1, ~N98.y.y, ~NDenom.y.y, ~No.response.ratio.y.y,
      2885L,    "PEN008",     "Seemakausar Nadaf", "24-09-2021 00:00",      "Sharat Mangoji",          6L,                  1L,                  1L,                  1L,                  1L,                  1L,                  1L,                  1L,                    1L,                  12L,                   2L,         14L,                  1L,         37L,            2.702702703,          98L,          98L,          0L,          1L,          1L,           2L,          1L,               1L,          1L,               1L,         35L,                     20,                1L,                1L,                1L,                1L,                1L,           5L,                1L,                1L,         24L,                      0,                   2L,            3L,            5L,           1L,           1L,           1L,           1L,           1L,            5L,       0L,         10L,                     0L,
      2438L,    "PEN006", "Jayashree H Malipatil", "24-09-2021 00:00",    "Devaraj C Tallur",          6L,                  1L,                  1L,                  1L,                  1L,                  1L,                  1L,                  1L,                    1L,                  12L,                   2L,         14L,                 98L,         37L,            13.51351351,          98L,          98L,          1L,          1L,          1L,           3L,          1L,               1L,          1L,               1L,         35L,            25.71428571,                1L,                0L,                0L,                0L,                1L,           2L,               98L,               98L,         24L,            20.83333333,                  -1L,           -1L,           -2L,           1L,           1L,           1L,           1L,           1L,            5L,       0L,         10L,                     0L,
      1686L,    "PEN003",         "Jyoti T Patil", "24-09-2021 00:00",        "Roshan begum",          6L,                  1L,                  1L,                  0L,                  1L,                  1L,                  1L,                  1L,                    0L,                  10L,                   1L,         11L,                  0L,         37L,            2.702702703,          98L,          98L,          1L,          1L,          1L,           3L,          1L,               1L,          1L,               1L,         35L,            5.714285714,                1L,                1L,                1L,                1L,                1L,           5L,                1L,                1L,         24L,            16.66666667,                   3L,            3L,            6L,           1L,           1L,           1L,           1L,           1L,            5L,       0L,         10L,                     0L,
      1409L,    "PEN008",     "Seemakausar Nadaf", "24-09-2021 00:00",              "Samrth",          6L,                  1L,                  1L,                  1L,                  1L,                  1L,                  1L,                  1L,                    1L,                  12L,                   2L,         14L,                  0L,         37L,                      0,           0L,           0L,          0L,          1L,          1L,           2L,          1L,               1L,          1L,               1L,         34L,            23.52941176,                1L,                0L,                0L,                1L,                0L,           2L,               98L,               98L,         24L,            41.66666667,                   1L,            1L,            2L,           1L,           1L,           1L,           1L,           1L,            5L,       0L,         10L,                     0L,
      1825L,    "PEN004",       "Viranna Potadar", "24-09-2021 00:00", "Sanjay irappa dusad",          6L,                  1L,                  1L,                  1L,                  1L,                  0L,                  1L,                  0L,                    0L,                  10L,                   0L,         10L,                  0L,         37L,                      0,           0L,           0L,          0L,          0L,          0L,           0L,          0L,               NA,          NA,               NA,         30L,                      0,                0L,                0L,                0L,                0L,                0L,           0L,                0L,                0L,         24L,                      0,                   1L,            1L,            2L,           1L,           0L,           0L,           0L,           0L,            1L,       0L,         10L,                     0L,
      2505L,    "PEN002", "Hemalata B. Bhajantri", "24-09-2021 00:00",          "Mahatmappa",          6L,                  1L,                  1L,                  1L,                  1L,                  1L,                  1L,                  0L,                    0L,                  12L,                   0L,         12L,                  0L,         37L,                      0,           1L,           1L,          1L,          1L,          1L,           3L,          1L,               1L,          1L,               1L,         35L,                      0,                1L,                0L,                0L,                1L,                0L,           2L,                0L,                0L,         24L,                      0,                   2L,            2L,            4L,           1L,           1L,           1L,           1L,           1L,            5L,       0L,         10L,                     0L,
      1872L,    "PEN009",         "Savita Vandal", "24-09-2021 00:00",           "Soubhagya",          6L,                  1L,                  1L,                  0L,                  1L,                  1L,                  1L,                  1L,                    1L,                  10L,                   2L,         12L,                  0L,         37L,                      0,           0L,           0L,          0L,          0L,          0L,           0L,          0L,               NA,          NA,               NA,         30L,                      0,                1L,                0L,                0L,                1L,                0L,           2L,                0L,                0L,         24L,                      0,                   1L,            1L,            2L,           1L,           1L,           1L,           1L,           1L,            5L,       0L,         10L,                     0L
  )

data1_cog<-data1 %>% 
  select(duration:No.response.ratio.x.x)
data1_lan<-data1 %>% 
  select(child_name,l1el1_wr_rt:No.response.ratio.y.x)
data1_num<-data1 %>% 
  select(child_name,l1en1_recog_num1:No.response.ratio.x.y)
data1_sel<-data1 %>% 
  select(child_name,sel_conversation_l1:No.response.ratio.y.y)

combine1<-left_join(data1_cog,data1_lan,by="child_name")
combine2<-left_join(data1_num,data1_sel,by="child_name")
combine_master<-left_join(combine1,combine2,by="child_name")

combine_master
#> # A tibble: 7 x 54
#>   duration enumerator en_name survey_date child_name child_age2 l1c1_identify_p~
#>      <int> <chr>      <chr>   <chr>       <chr>           <int>            <int>
#> 1     2885 PEN008     Seemak~ 24-09-2021~ Sharat Ma~          6                1
#> 2     2438 PEN006     Jayash~ 24-09-2021~ Devaraj C~          6                1
#> 3     1686 PEN003     Jyoti ~ 24-09-2021~ Roshan be~          6                1
#> 4     1409 PEN008     Seemak~ 24-09-2021~ Samrth              6                1
#> 5     1825 PEN004     Virann~ 24-09-2021~ Sanjay ir~          6                1
#> 6     2505 PEN002     Hemala~ 24-09-2021~ Mahatmappa          6                1
#> 7     1872 PEN009     Savita~ 24-09-2021~ Soubhagya           6                1
#> # ... with 47 more variables: l1c1_identify_pic2 <int>,
#> #   l1c1_identify_pic3 <int>, l1c1_identify_pic4 <int>,
#> #   l1c1_identify_pic5 <int>, l1c1_identify_pic6 <int>,
#> #   l1c1_classify_bird <int>, l1c1_classify_animal <int>,
#> #   l1c1_identification <int>, l1c1_classification <int>, l1c1_total <int>,
#> #   l1c2_identify_col1 <int>, NDenom.x.x <int>, No.response.ratio.x.x <dbl>,
#> #   l1el1_wr_rt <int>, l1el1_total <int>, l1el2_fac1 <int>, ...

Created on 2021-11-08 by the reprex package (v2.0.1)

Its unlikely we can give you specific help then.
However, duplicates will likely exist in your output, if they are there in your input. Do you have reason to expect no duplicates? Are they perfect duplicates, you could simply dplyr::distinct() them, if they are imperfect duplicates, you might need to consider a scheme to select 'the best' duplicate

In my original dataset, there are no duplicates.

are the child names unique ?

yes child names are unique. I just checked it.

I could be mistaken but it seems to me that for mathematical/logical reasons ,left joins on unique keys will not create duplicates.

Let me suggest the following. For the sake of providing your example. identify one row which is a duplicate in your output. find where it is in your original dataset, make that included in the example you provide to us. dplyr has helpful functions such as slice() that make it very easy to return particular rows based on position, or filter() to return rows based on contents.

I am sorry. Actually there are few children with the same name. How can I get around this problem?

I have resolved it. I did left_join by Student_ID which is unique. Thanks for your help.

2 Likes

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frameA has variables that data frameB does not, then either:

  1. [Delete] the extra variables in data frameA or
  2. Create the additional variables in data frameB

before joining them with rbind( ) .
Best regards from AOS TV APK Thank you.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.