Identifying which observations are lost by merging two data frames

chiarad · June 1, 2023, 1:41pm

Hey, I'm working on a research project as part of my master's. I have multiple data frames that I combined into one, but during that process I lost a few observations (the count dropped from 1312 to 1289). I would love to know exactly which observations were lost, so I can assess why they are missing. The observations are multiple measurements for each of the subjects. I believe the lost ones are samples that were registered but ended up being destroyed or similar; they still show up in the first data frame because a sample id is assigned at registration.
I tried using the following packages:

library(arsenal)
n.diff.obs(comparedf())

library(diffdf)
diffdf()

But neither worked the way I want. Using arsenal, I only got the number of differing observations, which of course I already know. diffdf seems to compare the rows, but they do not match up between my data frames, so I get a lot of differences.
I want the sample id of each observation that is missing, so basically which ids are absent from the merged data frame. Is there a way to do this in R?

nirgrahamuk · June 1, 2023, 1:58pm

The structure of your data is unclear...
Do you have unique key(s) on which you join?
Unique keys will certainly help you identify what's in and what's out, because you can check for the presence or absence of each key in the merged result.

(a1 <- structure(list(id = c(1, 1, 2, 2), 
                     nr = c(1, 2, 1, 2), 
                     l = c("a", "b", "c", "d"), 
                     key = structure(c(1L, 3L, 2L, 4L), 
                        levels = c( "1.1", "2.1", "1.2", "2.2"), 
                        class = "factor")), 
                row.names = c(NA, -4L),
                class = "data.frame"))

(b2 <- structure(list(id = 2:3,
                      nr = c(1, 1),
                      L = c("B", "C"),
                      key = structure(1:2, 
                                      levels = c("2.1", "3.1"), class = "factor")),
                 row.names = c(NA, -2L), 
                 class = "data.frame"))

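# merge() joins on all shared column names by default (here: id, nr, key)
# and keeps only rows that match in both data frames
(jn_3 <- merge(a1,
               b2))

# keys lost from a1 (present in a1, absent from the merged result)
setdiff(a1$key, jn_3$key)

# keys lost from b2 (present in b2, absent from the merged result)
setdiff(b2$key, jn_3$key)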
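
To go from the lost keys to the full lost rows, the same presence/absence check can be used as a row filter. This is only a minimal sketch: the data frames `registered` and `measured` and the column `sample_id` are made up for illustration and will not match the real data.

```r
# Hypothetical data: `registered` lists every sample id that was assigned,
# `measured` only contains samples that were actually processed.
registered <- data.frame(sample_id = c("S1", "S2", "S3", "S4"),
                         subject   = c(1, 1, 2, 2))
measured   <- data.frame(sample_id = c("S1", "S2", "S4"),
                         value     = c(0.5, 1.2, 0.9))

# Inner join: rows without a partner in the other data frame are dropped
merged <- merge(registered, measured, by = "sample_id")

# Ids present before the merge but absent afterwards
lost_ids <- setdiff(registered$sample_id, merged$sample_id)

# Full rows for those ids, to inspect why they dropped out
lost_rows <- registered[!registered$sample_id %in% merged$sample_id, ]

print(lost_ids)
print(lost_rows)
```

If dplyr is available, `dplyr::anti_join(registered, measured, by = "sample_id")` should give the same rows in one step.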

technocrat · June 2, 2023, 2:04am

Without a reprex (see the FAQ), it's not possible to give an answer without making too many assumptions.

Do the data frames contain identical fields?
Is there a unique key field for observations?
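
Both questions can be checked directly in base R before merging. The sketch below uses invented data frames `df1` and `df2` with an assumed key column `sample_id`; substitute the real names.

```r
# Hypothetical data frames; using `sample_id` as the key is an assumption
df1 <- data.frame(sample_id = c("S1", "S2", "S2", "S3"),
                  visit     = c(1, 1, 2, 1))
df2 <- data.frame(sample_id = c("S1", "S2", "S4"),
                  result    = c(10, 20, 30))

# Columns the two data frames have in common; merge() joins on these by default
shared_cols <- intersect(names(df1), names(df2))

# Is the key unique within each data frame?
dup_in_df1 <- anyDuplicated(df1$sample_id) > 0
dup_in_df2 <- anyDuplicated(df2$sample_id) > 0

# Which key values are repeated in df1?
dup_ids <- unique(df1$sample_id[duplicated(df1$sample_id)])

print(shared_cols)
print(c(df1 = dup_in_df1, df2 = dup_in_df2))
print(dup_ids)
```

With several measurements per subject, a single id column is often not unique on its own, so the key may need to combine two or more columns (for example id plus measurement number).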
