Comparing [somewhat] large data sets

Hi all,

I have two dfs, each with >100K rows, that I am looking to combine into one. Before merging them I wanted to look for potential redundancies between the two dfs, so I restructured each df so that their columns match, then did an inner join on ~6 columns to check for redundant entries based on those criteria. The result was a new df with >23K rows.

I'm relieved that worked, but is there a way to efficiently process that 23K+ row df? Manually comparing it to each individual df feels cumbersome and could leave me prone to errors of omission/repetition in the final combined df. Does anyone have tips or resources on how to compare large datasets like this? I know this is relatively small by some standards, but it's the largest dataset I've worked with so far, and I'd love to learn from people with more experience.

Thanks in advance for your time!

I don't totally understand the question, but you can use the all.equal() function to compare two objects. If you just want to compare the values you might need to remove some attributes first.
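
For example, a minimal sketch (the data frames and the "note" attribute here are just made up for illustration; check.attributes = FALSE tells all.equal() to ignore attribute differences and compare values only):

df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
df2 <- df1
attr(df2, "note") <- "an extra attribute"

all.equal(df1, df2)                            # reports the attribute mismatch
all.equal(df1, df2, check.attributes = FALSE)  # TRUE: the values match

# wrap in isTRUE() if you want to branch on the result, since
# all.equal() returns a character description when objects differ
isTRUE(all.equal(df1, df2, check.attributes = FALSE))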

Duplicated rows of a df can be removed with distinct()
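
e.g., assuming your key columns are called date and location (.keep_all = TRUE keeps the remaining columns of the first row in each group):

library(dplyr)

# one row per date/location combination, other columns retained
combined %>% distinct(date, location, .keep_all = TRUE)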

Thank you for the suggestions. I'll look into all.equal() and distinct().

To clarify, I originally had two large but distinct dfs with similar, but not identical, data. I essentially wanted to check for duplicate entries based on date/time and location, since the same entry was unlikely to have been recorded identically in both dfs given their original structural differences.

I wasn't sure if there was a way to cross-reference the result of the inner join (the duplicates) against the original dfs that is less error-prone than doing it manually. It seems like there really isn't?

Thank you for your time and suggestions! I really appreciate it.

Well, you can just bind the date/time and location columns from both dfs and look for duplicates?

Maybe you can provide a reprex of some sample data and show what you're trying to do.

library(dplyr)

# stack the key columns from both dfs, then flag any date/location
# combination that appears more than once
bind_rows(
  df1 %>% select(date, location),
  df2 %>% select(date, location)
) %>%
  group_by(date, location) %>%
  mutate(duplicated = n() > 1) %>%
  ungroup()
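
From there, one way to build the combined df without cross-checking by hand (just a sketch, assuming date + location uniquely identify an entry) is to drop the df2 rows whose key already appears in df1 and bind the rest:

library(dplyr)

# rows of df2 whose date/location combination is absent from df1
df2_new <- anti_join(df2, df1, by = c("date", "location"))

# df1 plus only the genuinely new rows of df2
combined <- bind_rows(df1, df2_new)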
