It's quite large column-wise, but the best way to think about it is company information in table A and the same in table B: state, zip, address, phone, etc.
library(data.table)

ZipCode <- c('07434','12343','19425','07436','10010','56789','07434','07434','07434','07434','07434','07434')
ZipCode2 <- c('19425','10010','10010','07456','19090','1','14567','19293','34523','12563','32452','07434','07434')
DT1 <- data.table(ZipCode = ZipCode)
DT2 <- data.table(ZipCode = ZipCode2)
A join on Zip returns one row per matching pair: 07434 appears seven times in ZipCode and twice in ZipCode2, so we would fully expect 7 x 2 = 14 rows of Zip 07434 alone in the new dataset.
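A minimal sketch of that join with data.table (assuming the tables are built as above, with DT2 coming from ZipCode2): `allow.cartesian = TRUE` is required because duplicate zips on both sides multiply the output.

```r
library(data.table)

ZipCode  <- c('07434','12343','19425','07436','10010','56789',
              '07434','07434','07434','07434','07434','07434')
ZipCode2 <- c('19425','10010','10010','07456','19090','1','14567',
              '19293','34523','12563','32452','07434','07434')
DT1 <- data.table(ZipCode = ZipCode)
DT2 <- data.table(ZipCode = ZipCode2)

# Inner join on ZipCode; allow.cartesian permits the many-to-many blow-up
matches <- merge(DT1, DT2, by = "ZipCode", allow.cartesian = TRUE)

# 07434 occurs 7 times in DT1 and 2 times in DT2 -> 7 * 2 = 14 rows for it
nrow(matches[ZipCode == '07434'])  # 14
```

This Cartesian blow-up per zip is exactly the mechanism that produces tens of millions of rows for a dense zip code.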
df1 never gets reduced to a single row; it is an accumulation of every company in frame 1 that has a matching zip code in frame 2. Duplicates do exist, and the result gets pretty large, which is why we had the idea of looping one row at a time, applying the function to the rows it returns, and then getting that result OUT of memory.
Nothing for DF1 (since we create it on the fly); DF2 has about 13 million rows.
This wouldn't be able to run in one shot; we have to split it up, loop, etc. The full join would produce about 900 million rows.
Think of it like this: one table holds sales leads, and the other holds data from a provider. Since the provider just gives you full dumps, we need to parse out any matches, which we do with Jaro-Winkler. But we have to build the candidate set first, and the easiest way to do that is to join on Zip. In cases where the Zip didn't match, the business name matched only about 0.01% of the time, and we're comfortable giving up that 0.01%.
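The zip join as a blocking step followed by Jaro-Winkler scoring can be sketched like this. The `stringdist` package, the toy company names, and the 0.85 threshold are assumptions for illustration; the post doesn't name a library or a cutoff.

```r
library(data.table)
library(stringdist)  # assumed Jaro-Winkler implementation; not named in the post

leads    <- data.table(ZipCode = c('07434','07434'),
                       Name    = c('Acme Widgets', 'Beta LLC'))
provider <- data.table(ZipCode = c('07434','07434'),
                       Name    = c('ACME Widget Co', 'Gamma Inc'))

# Blocking step: join on Zip first, then score only the candidate pairs
cand <- merge(leads, provider, by = "ZipCode",
              allow.cartesian = TRUE, suffixes = c(".lead", ".prov"))

# Jaro-Winkler similarity; method = "jw" with p = 0.1 is the Winkler variant
cand[, sim := 1 - stringdist(tolower(Name.lead), tolower(Name.prov),
                             method = "jw", p = 0.1)]

# Keep likely matches above a threshold (0.85 is our choice, not the OP's)
matches <- cand[sim > 0.85]
```

Blocking on Zip is what keeps the pair count tractable: only same-zip pairs ever get scored.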
I started playing with breaking it up state by state and then applying the function to each state's dataset, and it seems to be working quite well; however, it still chokes on our larger state populations. Unfortunately, I don't think there's a way to get around the loop.
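The state-by-state approach can be sketched as below. The toy tables, the `State` column, and `process_chunk` (a stand-in for the Jaro-Winkler scoring) are hypothetical names, not from the post.

```r
library(data.table)

# Toy stand-ins; the real tables hold full company records (State is assumed)
DT1 <- data.table(State   = c('NJ','NJ','PA'),
                  ZipCode = c('07434','07434','19425'),
                  Name    = c('A','B','C'))
DT2 <- data.table(State   = c('NJ','PA','PA'),
                  ZipCode = c('07434','19425','19090'),
                  Name    = c('X','Y','Z'))

# Hypothetical per-chunk work, e.g. the Jaro-Winkler scoring step
process_chunk <- function(chunk) chunk

results <- list()
for (st in unique(DT1$State)) {
  a <- DT1[State == st]
  b <- DT2[State == st]
  joined <- merge(a, b, by = c("State", "ZipCode"),
                  allow.cartesian = TRUE, suffixes = c(".lead", ".prov"))
  results[[st]] <- process_chunk(joined)
  rm(a, b, joined); gc()   # drop the chunk before starting the next state
}
out <- rbindlist(results)
```

Each iteration only holds one state's join in memory, which is why it works until a single state (e.g. NY) is itself too big.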
At this point we know the bottleneck is that when you join on a massive ZipCode (NYC, as an example) you're going to have tens of millions of results. This process is going to take a long time; we are used to that. The problem is that R can't seem to handle the memory the way SAS did, hence why we were trying to loop through and create 50k-row datasets, 100k-row datasets, etc., and then wipe them from memory before going to the next row.
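One way to get the SAS-like bounded memory profile in R is to stream each chunk straight to disk instead of accumulating it, so only one zip's join is ever live in RAM. The output path is an assumption, and per-zip chunking is just the finest-grained version of the state split:

```r
library(data.table)

DT1 <- data.table(ZipCode = c('07434','12343','19425','07436','10010','56789',
                              '07434','07434','07434','07434','07434','07434'))
DT2 <- data.table(ZipCode = c('19425','10010','10010','07456','19090','1','14567',
                              '19293','34523','12563','32452','07434','07434'))

outfile <- tempfile(fileext = ".csv")  # assumed destination; use a real path
first <- TRUE
for (z in unique(DT1$ZipCode)) {
  chunk <- merge(DT1[ZipCode == z], DT2[ZipCode == z],
                 by = "ZipCode", allow.cartesian = TRUE)
  if (nrow(chunk) == 0) next
  fwrite(chunk, outfile, append = !first)  # append to disk, don't hold in RAM
  first <- FALSE
  rm(chunk); gc()                          # wipe the chunk before the next zip
}
```

For a truly huge zip, the same idea nests one level deeper: split that zip's rows into 50k/100k-row batches and write each batch before building the next.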