Compare info in two data.frames

I need to find information by comparing columns from two different data.frames.
Let's imagine these two data.frames:

data.frame(stringsAsFactors=FALSE,
Trajectory = c("%%EEE_!AAAA", "!AAAB_NICAR", "GIKIN_NICAR", "!AAAA_!AAAB"),
Trajectory = c("%%EEE_!AAAA", "!AAAB_NICAR", "GIKIN_NICAR", "!AAAA_!AAAB")
)
data.frame(stringsAsFactors=FALSE,
Segment = c("!AAAB_NICAR", "!AAAC_$$FF", "GIKIN_NICAR", "!AAAA_!AAAA"),
Segment = c("!AAAB_NICAR", "!AAAC_$$FF", "GIKIN_NICAR", "!AAAA_!AAAA")
)
In this example, when looking for the Trajectory names in the Segment names, some of them should not be found: %%EEE_!AAAA and !AAAA_!AAAB.

What I want the code to do is search for the Trajectory names in the Segment names and create a new data.frame with the names that were not found and the ones that were found (called Mached). In the example above, the new data.frame would be:

data.frame(stringsAsFactors=FALSE,
Non_Found = c("%%EEE_!AAAA", "!AAAA_!AAAB"),
Mached = c("!AAAB_NICAR", "GIKIN_NICAR")
)

I simplified your data set to one column in each data frame because the two columns were identical, including having the same name. Identical column names will cause problems when trying to subset the data frame.

The results of the search for non-found and matched data can be put in a data frame only if there are the same number of elements in the two results. Will that always be true?

DF1 <- data.frame(stringsAsFactors=FALSE,
           Trajectory = c("%%EEE_!AAAA", "!AAAB_NICAR", "GIKIN_NICAR", "!AAAA_!AAAB")
)
DF2 <- data.frame(stringsAsFactors=FALSE,
           Segment = c("!AAAB_NICAR", "!AAAC_$$FF", "GIKIN_NICAR", "!AAAA_!AAAA")
)
Found <- DF1[ DF1$Trajectory %in% DF2$Segment,]
Found
#> [1] "!AAAB_NICAR" "GIKIN_NICAR"

NotFound <- DF1[!DF1$Trajectory %in% DF2$Segment,]
NotFound
#> [1] "%%EEE_!AAAA" "!AAAA_!AAAB"

Created on 2019-10-15 by the reprex package (v0.3.0.9000)
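
If the two results do not have the same length, one possible workaround is to pad the shorter one with NA before building the data frame. This is only a sketch: the extra name "XXXXX_YYYYY" is made up so that the lengths differ, and the column names Non_Found and Mached are taken from the question. Indexing a vector past its end returns NA, which is what does the padding here.

```r
# Example data from above, plus one hypothetical extra name ("XXXXX_YYYYY")
# so that the two results have different lengths
DF1 <- data.frame(stringsAsFactors = FALSE,
           Trajectory = c("%%EEE_!AAAA", "!AAAB_NICAR", "GIKIN_NICAR",
                          "!AAAA_!AAAB", "XXXXX_YYYYY")
)
DF2 <- data.frame(stringsAsFactors = FALSE,
           Segment = c("!AAAB_NICAR", "!AAAC_$$FF", "GIKIN_NICAR", "!AAAA_!AAAA")
)

# TRUE for every Trajectory name that also appears in Segment
is_found <- DF1$Trajectory %in% DF2$Segment
Found    <- DF1$Trajectory[is_found]
NotFound <- DF1$Trajectory[!is_found]

# Indexing beyond the end of a vector yields NA, so the shorter
# result is padded up to the length of the longer one
n <- max(length(Found), length(NotFound))
Result <- data.frame(stringsAsFactors = FALSE,
           Non_Found = NotFound[seq_len(n)],
           Mached    = Found[seq_len(n)]
)
Result
```

Keeping the two results in a list, for example list(Non_Found = NotFound, Mached = Found), would avoid the padding altogether if a data frame is not strictly required.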

Not at all. Actually, the "Trajectory" data.frame has around 2600000 items and "Segments" has no more than 56000. What I need is to check whether all the trajectory names (most of them repeated) are in the segments data.frame (matched) and, if not, to identify them (non-found). Rather than a one-to-one comparison, it is a check of whether an item is in another data.frame or not.
I hope I have explained myself. Thanks for your help 🙂

If you have many repeated values, I suggest using the unique() function to increase the efficiency of the search.

UniqueTraj <- unique(DF1$Trajectory)
UniqueSeg <- unique(DF2$Segment)

NotFound <- UniqueTraj[!UniqueTraj %in% UniqueSeg]
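
As a rough sketch of how this might look end to end, the same idea also gives the matched names, and table() can show how many rows of the full data each non-found name accounts for. The repeated rows in DF1 below are invented to imitate the duplicates you describe, and NotFoundCounts is just an illustrative name.

```r
# Example data with some repeated Trajectory names (repeats are made up)
DF1 <- data.frame(stringsAsFactors = FALSE,
           Trajectory = c("%%EEE_!AAAA", "!AAAB_NICAR", "GIKIN_NICAR",
                          "!AAAA_!AAAB", "!AAAB_NICAR", "%%EEE_!AAAA")
)
DF2 <- data.frame(stringsAsFactors = FALSE,
           Segment = c("!AAAB_NICAR", "!AAAC_$$FF", "GIKIN_NICAR", "!AAAA_!AAAA")
)

# Search only the distinct names
UniqueTraj <- unique(DF1$Trajectory)
UniqueSeg  <- unique(DF2$Segment)

Found    <- UniqueTraj[UniqueTraj %in% UniqueSeg]
NotFound <- UniqueTraj[!UniqueTraj %in% UniqueSeg]

# Number of rows in the full data for each name that was not found
NotFoundCounts <- table(DF1$Trajectory[DF1$Trajectory %in% NotFound])
NotFoundCounts
```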
