Merging two data frames with keeping some records

Maybe this question is basic but I have issues with merging two data frames together.
No issues with normal merging when I keep only matching records:

Service.car.data <- merge(car.service.data.17.short, final.vital.data, by="DF.URN")

but when I try to keep all records from final.vital.data and I use below:

Service.car.data.all <- merge(car.service.data.17.short, final.vital.data, by="DF.URN", all.y = TRUE)

I have number of records matching final.vital.data (fine) but when I look closer I can see that values not matched are NAs and DF.URN not matched are blank (I should still see them)
When I then try to recode any of numeric variables coming from car.service.data.17.short like below, I have weird results (have only 1s, rather than 1s and 0s):

Service.car.data.all$MatchedResponseURNs <- ifelse(Service.car.data.all$Rec >0, 1, 0)

Is it because I need to remove NAs before recoding? When I merge two databases like in option 1 (only matching records), I don't have this recoding issue.

The df are too large to prepare reprex but pleas let me know if it is required to solve this problem.

Please prepare a reprex with a small subset of your data (or any sample data) that reproduces your issue.

1 Like

I cannot replicate the issue above with the small example files below but the other problem is still there. I need 1s and 0s as values of the MatchedResponseURNs, instead I can see 1s and NA. I think the problem is with having "NAs" rather than blank fields in the Service.car.data.all

car.service.data.17.short <- data.frame(stringsAsFactors=FALSE,
                                        DF.URN = c("aaa", "bbb", "ccc", "ggg", "hhh", "iii"),
                                        InterviewDate = c("2019-07-09 07:21:00", "2019-07-09 08:31:00",
                                                          "2019-07-09 08:44:00",
                                                          "2019-07-09 08:38:00",
                                                          "2019-07-09 09:32:00",
                                                          "2019-07-09 09:43:00"),
                                        Rec = c(10, 9, 9, 8, 8, 7))
                                        
final.vital.data <- data.frame(stringsAsFactors=FALSE,
                               DF.URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "zzz", "yyy"),
                               RoDate = c("2019-10-02", "2019-10-12", "2019-10-02", "2019-10-02",
                                          "2019-10-02",
                                          "2019-10-02",
                                          "2019-10-02",
                                          "2019-11-02")
)


Service.car.data <- merge(car.service.data.17.short, final.vital.data, by="DF.URN")
Service.car.data.all <- merge(car.service.data.17.short, final.vital.data, by="DF.URN", all.y = TRUE)

Service.car.data.all$MatchedResponseURNs <- ifelse(Service.car.data.all$Rec >0, 1, 0)

Can you help?

You're right about the NA fields causing the problem with the ifelse() function. Since this occurs in the wild, as well as toy datasets, you'll want to filter out records with 'NA`

NA stands for Not Available, and in practice equals to "nothing", so you can't compare a number to "nothing", because there is nothing to compare with, maybe you'll want to replace NA by 0 or "something" else.

Service.car.data.all$Rec[is.na(Service.car.data.all$Rec)] <- 0

Thank you for your ideas. I simply want to flag records merged and unmerged to be able to calculate proportions, use filtering etc...

Also, I would like to mention that merging files in SPSS or by other software solutions keep unmatched fields blanks rather than NAs.

In R NA = "blank", are you saying that SPSS or other software allows you to do this logical comparison with out throwing an error or warning? That doesn't sounds right.

blank > 0

Merging process is similar but the final result does not contain NAs. When you see it in SPSS or export it to Excel or csv, the fields are simply blank, not containing any value not NAs...

I think the point that andresrcs was making is that, in R, NA is blank. It's a special value that indicates missing-ness. SPSS will (I'm 99.9% sure, but not having used SPSS I can't be 100%) do something similar internally, even if it shows you blank values rather than NA.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.