Compare two data sets

Jason.C · September 29, 2018, 7:26pm

Greetings, I’ve been given an RData file that contains two datasets. The first one, "dat", has 121 variables and the second, "my_data", has 123 variables. How can I find out which variables differ between these two datasets?

Cheers,
Jason

hlendway · September 30, 2018, 3:46am

I think this might help. I created two sample data sets, data1 and data2; data2 has columns "e" and "f" that data1 lacks. This code puts the columns that exist in data2 but not in data1 into the tibble missing_columns.

library(dplyr)  # provides tibble(), %>% and select()

data1 <- tibble("a"=runif(5),"b"=runif(5),"c"=runif(5),"d"=runif(5))
data2 <- tibble("a"=runif(5),"b"=runif(5),"c"=runif(5),"d"=runif(5),"e"=runif(5),"f"=runif(5))
# Keep only the columns of data2 whose names do not appear in data1
missing_columns <- data2 %>%
  select(which(!(colnames(data2) %in% colnames(data1))))
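
Since either data set could hold columns the other lacks, it may be worth running the same check in both directions. Here is a minimal base R sketch of that idea; `cols_only_in` is a hypothetical helper name made up for illustration, and the sample data frames are invented.

```r
# Hypothetical helper: names of columns present in `x` but absent from `y`
cols_only_in <- function(x, y) {
  names(x)[!(names(x) %in% names(y))]
}

# Invented sample data: data1 has an extra "z", data2 has extra "e" and "f"
data1 <- data.frame(a = 1:3, b = 4:6, z = 7:9)
data2 <- data.frame(a = 1:3, b = 4:6, e = 7:9, f = 10:12)

only_in_data2 <- cols_only_in(data2, data1)  # columns data1 is missing
only_in_data1 <- cols_only_in(data1, data2)  # columns data2 is missing
```

Checking only one direction would miss "z" here, which is why both calls are shown.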

jcblum · October 1, 2018, 8:09am

You can also use set operations (base docs; dplyr docs):

data_short <-
  data.frame(
    "a" = runif(5),
    "b" = runif(5),
    "c" = runif(5),
    "d" = runif(5)
  )
data_long <-
  data.frame(
    "a" = runif(5),
    "b" = runif(5),
    "c" = runif(5),
    "d" = runif(5),
    "e" = runif(5),
    "f" = runif(5)
  )

# See which column names are in both
intersect(names(data_short), names(data_long))
#> [1] "a" "b" "c" "d"

# See which columns from the longer data frame aren't in the shorter one
setdiff(names(data_long), names(data_short))
#> [1] "e" "f"

# Select only the columns from the longer data frame that are in both
data_long[intersect(names(data_long), names(data_short))]
#>           a         b         c         d
#> 1 0.1540338 0.7234066 0.5916640 0.7219967
#> 2 0.5342947 0.7615840 0.4242600 0.5354202
#> 3 0.7874348 0.5796321 0.4673035 0.2321965
#> 4 0.7270508 0.5580596 0.8692353 0.8039400
#> 5 0.4221739 0.5960718 0.8802889 0.3052120

# Select only the extra columns from the longer data frame
data_long[setdiff(names(data_long), names(data_short))]
#>           e          f
#> 1 0.8467308 0.81951804
#> 2 0.7265840 0.45185584
#> 3 0.2096210 0.69614875
#> 4 0.9464625 0.90677953
#> 5 0.4143425 0.06293806

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Also works with dplyr
data_long %>% select(intersect(names(data_long), names(data_short)))
#>           a         b         c         d
#> 1 0.1540338 0.7234066 0.5916640 0.7219967
#> 2 0.5342947 0.7615840 0.4242600 0.5354202
#> 3 0.7874348 0.5796321 0.4673035 0.2321965
#> 4 0.7270508 0.5580596 0.8692353 0.8039400
#> 5 0.4221739 0.5960718 0.8802889 0.3052120

Created on 2018-10-01 by the reprex package (v0.2.1)
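
Applied to the original question, the same set operations could be run on the two objects after loading the RData file. The sketch below is only an illustration under assumptions: the column names and values are invented stand-ins, and a temporary file takes the place of the real RData file, whose path is not given in the thread.

```r
# Invented stand-ins for the two data frames named in the question
dat <- data.frame(id = 1:3, age = c(30, 41, 25), score = c(1.5, 2.0, 3.2))
my_data <- data.frame(id = 1:3, age = c(30, 41, 25),
                      height = c(170, 165, 180), weight = c(70, 60, 82))

# Round-trip through a temporary .RData file to mimic the original setup
rdata_path <- tempfile(fileext = ".RData")
save(dat, my_data, file = rdata_path)
rm(dat, my_data)
loaded <- load(rdata_path)  # load() returns the names of the restored objects

# Column names unique to each side
only_in_dat <- setdiff(names(dat), names(my_data))
only_in_my_data <- setdiff(names(my_data), names(dat))
```

Running `setdiff()` in both directions matters here because a 121 versus 123 column count does not guarantee that the smaller data set is a strict subset of the larger one.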

Jason.C · October 1, 2018, 3:34pm

Thanks for the response. This helped with my immediate need. Thank you.

Jason.C · October 1, 2018, 3:36pm

Thanks a lot! I spent a lot of time looking for something 'diff' related, and you’ve shown me what I was missing. Also, your example will help me with additional issues in the future.

Cheers,
Jason

abdoulaye · January 24, 2019, 8:33pm

Hello, thanks. Does intersect belong to the dplyr package?
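
Judging by the masking message in the reprex output above, `intersect` and `setdiff` exist in base R, and dplyr supplies its own versions that mask them once the package is attached. A small sketch that calls the base versions explicitly, so it should run without dplyr installed (the vectors are invented sample column names):

```r
# Column names as plain character vectors
short_names <- c("a", "b", "c", "d")
long_names <- c("a", "b", "c", "d", "e", "f")

# The base:: prefix picks the base R versions even if dplyr is attached
shared <- base::intersect(short_names, long_names)
extra <- base::setdiff(long_names, short_names)
```

For character vectors of column names either version gives the same answer; the dplyr versions additionally accept whole data frames and compare them row by row.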
