Merging two datasets with different subjects and 100 additional variables in the second dataset

Hi,

(Advanced Beginner here)

I am doing a field experiment in economics. So our rows are made of subjects observations for each variable of interest.

We performed the experiment in one school = my first dataset
Then we performed the experiment in a second school, but this time we added three control tasks, increasing the number of columns in the dataset by 100 = my second dataset.

I want to perform the data cleaning + analysis on a dataset with the two schools, so I want to merge both datasets.

For the additional variables created in the second dataset, I want them to appear as "NA" in the rows of the first dataset.

So I am looking for a result like this

-Dataset 1:
ID Var1 Var2

  1. a. b.
    2 a. b.

-Dataset 2:
ID Var 1 Var 2 Var 3

  1. c. d. e.
    2 c. d. e.

-Dataset I want:
ID Var 1 Var 2. Var 3

  1. a. b. NA
    2 a. b. NA
  2. c. d. e.
  3. c. d. e.

rbind, cbind and merge at beginner level with these functions are not helping me.

Thanks for your help,
Benjamin

Hi @BenjaminPrisse,
The dplyr::bind_rows() function is what you need:

Dataset_1 <- read.table(text="
ID Var001 Var002
1 a b
2 a b
", header=TRUE)

Dataset_2 <- read.table(text="
ID Var001 Var002 Var003
1 c d e
2 c d e
", header=TRUE)

# Dataset I want:
# ID Var1 Var2 Var3
# 1 a b NA
# 2 a b NA
# 1 c d e
# 2 c d e

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# From help: When row-binding, columns are matched by name, and any 
# missing columns will be filled with NA.

combined.df <- bind_rows(Dataset_1, Dataset_2)
combined.df
#>   ID Var001 Var002 Var003
#> 1  1      a      b   <NA>
#> 2  2      a      b   <NA>
#> 3  1      c      d      e
#> 4  2      c      d      e

Created on 2020-05-28 by the reprex package (v0.3.0)
HTH

1 Like

Another problem popup.

The 100 new variables in Dataset 2 are not necessarily at the end of the it.

They are introduced at diverse point of Dataset 1.

Example :

Dataset 1 :
Var1 Var 2 Var3

Dataset 2 :
Var1 NewVar TimerNewVar Var2 Var3

I would like to keep this order STRICTLY and bind_row is not exactly doing this. In fact, he slightly change the order of column (without any pattern I can discern).

Also, when I do bind_row, instead of having a dataset of 225 + 100 = 325 variables the number of variables inside the largest dataset), I end up with a dataset of 394 dataset without having any clue why.

I have .xlsx for Dataset 1 and .csv for Dataset 2.
Moreover, the guy who entered datas was not me, so I don't know any potential difference for entering the two datasets.

If it can help you answering me.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

Thanks !

I found this cheat code :

data1[c(226:325)] <- NA
names(data1) <- names(data2)
data <- rbind(data1,data2)

So this "manually" method can also make the job, but was not what I was looking for.

You are right, same variables were named differently for the two datasets.

Thanks for your help.

you may need to provide a reprex to get better help.
As it seems to me that bind_rows would create the additional variable names after the original names as you requested. Perhaps though your observation that you get more than 225+100 new variables reveals inconsistencies in the 'common' column names that you believed were consistent between the two sets.
Inspection of the names should reveal this. use janitor::clean_names() if you have non syntactic R names (Which may be likely if you are coming from excel).
you can also use setdiff() on the names from the 1st dataset against the 2nd to see what variables on dataset 1 are not in common to dataset2 (due to unexpected differences).

setdiff(names(Dataset_1),names(Dataset_2))
1 Like