For loop(or if someone has a better idea) on large dataset

kwgood1980 · July 2, 2018, 10:39pm

Hi All. I have an interesting and frankly frustrating problem I'm trying to work through and I'm not sure what I'm doing wrong. Long story short, process is being moved from SaS to R so we have to rewrite some legacy code.

We take two dataframes, combine them by Zip code(duplicates will exist) and then do a stringdist comparison between various columns. Here's the general issue:

If we don't use an iterative approach, we run out of ram. Quickly. This is on a 128gig of RAM EC2 instance mind you..

If we use an iterative approach, it takes FOREVER. We're running Microsoft R Open so my assumption is that this is by default running in a parallel fashion. We have already isolated this to the loop. Almost the entire process after the loop takes reasonable amount of time. If we take the stringdist functionality out of the loop, it will run, however on larger datasets we get back to running out of RAM.

I'm fairly new to R so I'm sure that I'm just going something "wrong" in trying to convert it from the legacy SaS code.

Here's the code that we have isolated as the bottleneck. Essentially the idea is to grab a row, merge it from df1 to df2 by zips, then apply the stringdist logic. If it's a match write to .csv, remove DF and repeat, etc. No match, move to next row.

for(row in 1:length(df1$Zip)) {

  df1 <- inner_join(df1, df2, by = c('Zip')) 
  df1[] <- lapply(df1, as.character)

  df1$MatchBizName <- pmax(1-stringdist(df1$BusinessName, df1$BusinessName, method="jw", p=0.1),
                           1-stringdist(df1$DBAName, df1$DBAName, method="jw", p=0.1),
                           1-stringdist(df1$BusinessName, df1$DBABusinessName, method="jw", p=0.1),
                           1-stringdist(df1$DBAName, df1$BusinessName, method="jw", p=0.1))
  df1$MatchPhone <- ifelse(1-stringdist(df1$Phone, df1$Phone, method="jw", p=0.1)>=1,1,0)
  df1$MatchFirstName <- 1-stringdist(df1$FirstName, df1$FirstName, method="jw", p=0.1)
  df1$MatchLastName <- 1-stringdist(df1$LastName, df1$LastName, method="jw", p=0.1)
  df1$MatchStreetNum <- 1-stringdist(df1$StreetNum, df1$StreetNum, method="jw", p=0.1)
  df1$MatchStreetName <- 1-stringdist(df1$StreetName, df1$StreetName, method="jw", p=0.1)
  df1$MatchCity <- 1-stringdist(df1$City, df1$City, method="jw", p=0.1)

  df1matches <- subset(df1,(MatchBizName >= 0.9 & MatchCity == 1 & MatchStreetName >= 0.7 & MatchStreetNum >= 0.7) |
                         (MatchPhone == 1 & MatchFirstName == 1 & MatchLastName == 1 & MatchStreetNum == 1 & MatchStreetName == 1 & MatchCity == 1) |
                         (MatchBizName >= 0.9 & MatchStreetNum == 1 & MatchStreetName >= 0.9 & MatchCity == 1) |
                         (MatchStreetNum == 1 & MatchPhone == 1 & MatchStreetName == 1 & MatchCity == 1) |
                         (MatchLastName >= 0.9 & MatchPhone == 1 & MatchBizName >= 0.9 & MatchCity == 1) |
                         (MatchPhone == 1 & MatchCity == 1 & MatchStreetNum == 1 & MatchStreetName >= 0.9 & MatchBizName >= 0.6) )

  rm(df1)
  rm(df1matches)

  }

mara · July 3, 2018, 9:24am

Sorry, not an actual solution, just one minor point: I'm not terribly familiar with Microsoft's R distro, but I don't know that this assumption is true if you're not explicitly calling the necessary functions (e.g. using rxExec in the example at the link below).

nwerth · July 3, 2018, 2:32pm

It looks like you copy-pasted the SAS code and then edited it to fit R's syntax and use R functions. That's not how to do this kind of thing when changing languages. You need to consider the entire problem and rewrite the solution.

Some points specific to your code:

Your loop changes the value of row for each iteration, but row is never used in the code. I assume you don't know that most functions in R are already vectorized. Read the Vector Arithmetic section of "An Introduction to R" to see what that means.

df1 <- inner_join(df1, df2, by = c('Zip'))

This will explode the size of df1, because it adds ncol(df2) - 1 columns in each iteration. Your memory problem is most likely this line.

df1$MatchPhone <- ifelse(1-stringdist(df1$Phone, df1$Phone, method="jw", p=0.1)>=1,1,0)

In SAS, people implement Boolean columns as numeric variables, and just assign it 0 or 1 (hopefully). R has a formal Boolean class: logical. It's the result of logical comparison statements, so you can rewrite this code as:

df1$MatchPhone <- 1 - stringdist(df1$Phone, df1$Phone, method="jw", p=0.1) >= 1

The last section:

rm(df1)
rm(df1matches)

If you need to use rm in a script, it's often a sign you're going about things the wrong way. Because these objects are reassigned each iteration, deleting them is pointless. Also, how does this loop not hit an error in the second iteration if you delete df1 but the first line of the loop uses df1?

kwgood1980 · July 3, 2018, 3:28pm

Well, yeah. The desired outcome is exactly the same, however since we can't use SaS functions & syntax, clearly it needs to be fit to R's.

I'll have to spend some time reading up on this. I knew of the vectorization at a high-level, but many of the granular details elude me(I've been using R for about a week because a fancy guy in an office decided it would be no problem at all to re-write 10 years of legacy code in a couple off weeks).

Our goal was to look row by row and merge off THAT specific row's contents perform the merge, analyze, loop to row 2 in df1 etc, etc. This ties into your last point about the object removal. It makes sense and almost seems like we are just wasting cpu time by removing things that will just be overwritten.

nwerth · July 3, 2018, 3:48pm

If you remove the for statement and the braces before and after the code, it should work as you want. I don't have access to your data, so this isn't 100% sure, but it looks like all your data-manipulation code is already vectorized.

You're right, and changing it probably won't give any measurable time savings. It's just one of those "standard idioms" every language has.

I've been using R for about a week because a fancy guy in an office decided it would be no problem at all to re-write 10 years of legacy code in a couple off weeks

This is a pretty common path to R, and the one I took (read: was forced down). As somebody who's been through this, the best way is to start from a clean slate with the code. If you have time (in the upcoming cruel couple weeks), browse through R for Data Science. dplyr and tidyr are the current gold-standards for transforming and working with data in scripts.

Remember that the biggest selling point of R for the busy analyst is how easy it is for the community to share solutions on CRAN. Before writing code to do a special task, check if anyone else has already solved the problem. The CRAN Task Views page showcases some tried-and-true packages for different tasks.

I will say, nowadays I dread the times I have to maintain old SAS code. R has spoiled me.

jonspring · July 3, 2018, 5:15pm

To get a better idea of likely bottlenecks, I'd be curious to understand what your data looks like. There might be an approach to split up the problem differently to make it fit better in memory and quicker. As @nwerth mentioned, a lot of R functions are vectorized, so they're much faster if they can be organized column by column rather than row by row.

Where is the code that limits df1 to a single row? As is, it looks like it joins all of df1 and all of df2, and performs the calculations you specified for each row, and then starts over and does that same calc for the whole table, over and over.

What do you get if you run these before your code for d1 and d2?

object.size(df1)
str(df1)

And what do you see when you examine a new object after joining them?

df3 <- inner_join(df1, df2, by = c('zip'))
object.size(df3)
str(df3)

I'm not sure the join is working quite how you expect. Given your description, I would have expected you'd want separate versions of each of the columns based on the source table. So if you run df3 <- inner_join(df1, df2, by = c('Zip')) then I'd expect that that would give a bigger table, almost twice as wide, with fields like BusinessName.x and BusinessName.y. Those are what you'd want to compare in the tests below. Are you sure there's a df1$BusinessName field in the joined table? Even if there is, it should be compared to a different column, lest you get a perfect match by definition.

As a global note, I would second the aspect of @nwerth's suggestion about approaching this a bit differently. For instance, you might try solving the problem using smaller tables so that you can troubleshoot quickly, and once it's working correctly you can expand your solution to the entire data set. I might try using a "toy table" that is already limited to a single zip code.

library(tidyverse)
df1_sm <- df1 %>% filter(Zip == 53203) # Or any single zip in your data
df2_sm <- df2 %>% filter(Zip == 53203)

Then when you join it should still be a manageable size, which might help make it easier to troubleshoot and verify.

jonspring · July 3, 2018, 5:29pm

You might also get to a much more direct solution using the fuzzyjoin package. It allows you to specify the matching conditions up front, rather than joining each row to every other row with the same zip code and then filtering out the poor matches.

kwgood1980 · July 5, 2018, 1:03pm

It's quite large column-wise but I would say the best way to think about it would be company information in table A, and the same in B. So, state, zip, address, phone, etc.

ZipCode <- c('07434','12343','19425','07436','10010','56789','07434','07434','07434','07434','07434','07434')
ZipCode2 <- c('19425','10010','10010','07456','19090','1','14567','19293','34523','12563','32452','07434','07434')
DT1 <- data.table(ZipCode)
DT2 <- data.table(ZipCode)

We would fully expect to get back a new dataset with 12 rows of Zip 07434.

df1 never gets reduced to a single row, df1 is a culmination of every company in frame1, that has a matching zip code in frame 2. Duplicates do exist, and it gets pretty large which is why we had the idea of looping one row at a time, applying the function to the rows returns, and then getting it OUT of memory.

Nothing for DF1(since we create it on the fly), DF2 has about 13million results.

This wouldn't be able to run. We have to split it up or loop, etc. This would result in about 900 million rows.

Think of it like this. In the one table are leads for sales, but in the other table is data from a provider. Since the provider just gives you full dumps, we need to parse out any matches. We do this by using Jaro-Winkler. But, we have to initially get the first dataset and the easiest way to accomplish this is by joining on Zip. If the Zip doesn't match, we found only about .01% of cases where the business name was the same. We're comfortable with not having .01%.

I started playing with breaking it up state by state and then applying the function to that dataset and it seems to be working quite well, however it still chokes on our larger state populations. Unfortunately I don't think there's a way to get around the loop.

library(tidyverse)
df1_sm <- df1 %>% filter(Zip == 53203) # Or any single zip in your data
df2_sm <- df2 %>% filter(Zip == 53203)
Then when you join it should still be a manageable size, which might help make it easier to troubleshoot and verify.

At this point we know the the bottleneck is that when you join on a massive ZipCode(NYC as an example) you're going to have tens of millions of results. This process is going to take a long time, we are used to that. The problem is that R can't seem to handle the memory issue like SAS was hence why we were trying to loop through and create 50k datasets, 100kdatasets, etc and then wipe them from memory before going to the next row

kwgood1980 · July 5, 2018, 1:12pm

It does work in the regard that it's giving results we expect however we cannot join the entire datasets in one shot. It's simply too big. If I parse it down, we get results that we would I added a small data example responding to jon below but joining in one shot would result in about 900 million rows. Our idea(how it worked in SAS) was to:

Join row 1 in DF1 to DF2, get 50,000 results, apply function, move to next row, get 250,000 results, apply function, move to next row, get 1million results, apply function move to next row, etc. We sadly can't just apply the function to the 900million rows we would get in the join.

martin.R · July 5, 2018, 1:36pm

Have you tried implementing this using data.table? (I notice you did use this in your zip code example in the previous post). The modify-in-place methods it uses, as well as its general efficiency, will help to save memory. data.table is also fantastic with joins, particularly on indexed tables.

You refer to 900 million records being too large and you say you have 128GB RAM. I have used over 600 million records with 32GB RAM successfully previously, so 900 million does not seem excessive. The number and content of your columns may make this comparison moot though.

nwerth · July 5, 2018, 2:53pm

The way SAS gets around RAM limits is by doing a lot of work on disk (usually in a database). R can work with databases, too. Here's an example:

library(dplyr)
library(dbplyr)
library(RSQLite)

set.seed(9)
zip_codes <- factor(90200:90299)
n_rows <- 1e4
df1 <- data_frame(
  ZIP = sample(zip_codes, n_rows, replace = TRUE),
  x1  = rnorm(n_rows),
  y1  = runif(n_rows)
)
df2 <- data_frame(
  ZIP = sample(zip_codes, n_rows, replace = TRUE),
  x2  = rnorm(n_rows),
  y2  = runif(n_rows)
)

db <- dbConnect(SQLite(), tempfile())
df1_remote <- copy_to(db, df1)
df2_remote <- copy_to(db, df2)

df1_remote and df2_remote are tbl_dbi objects, which means R will try to do certain tasks in the database. We can join them and save the result as a temporary table in the database.

df3_remote <- df1_remote %>%
  inner_join(df2_remote, by = "ZIP") %>%
  compute()
df3_remote
# # Source:   table<lonwxtdaqe> [?? x 5]
# # Database: sqlite 3.22.0 [...]
#    ZIP       x1    y1    x2      y2
#    <chr>  <dbl> <dbl> <dbl>   <dbl>
#  1 90222 0.0848 0.703 -2.15 0.0547 
#  2 90222 0.0848 0.703 -1.82 0.795  
#  3 90222 0.0848 0.703 -1.72 0.940  
#  4 90222 0.0848 0.703 -1.55 0.311  
#  5 90222 0.0848 0.703 -1.41 0.0668 
#  6 90222 0.0848 0.703 -1.36 0.592  
#  7 90222 0.0848 0.703 -1.35 0.00236
#  8 90222 0.0848 0.703 -1.29 0.700  
#  9 90222 0.0848 0.703 -1.29 0.209  
# 10 90222 0.0848 0.703 -1.11 0.384  
# # ... with more rows

Now we'll rely on the RSQLite package (which itself relies on the DBI package). A tbl_dbi object stores the SQL needed to query the whole table, which we'll extract and request from the database. The database will prepare a result, and we can retrieve only a certain number of rows each time. Those rows will be filtered to what we want, and we'll repeat until done. At the end, all the individual results will be combined.

query_result <- df3_remote %>%
  remote_query() %>%
  dbSendQuery(conn = db)

results <- list()
ii <- 0
while (!dbHasCompleted(query_result)) {
  ii <- ii + 1
  results[[ii]] <- query_result %>%
    fetch(n = 100000) %>%
    mutate(
      match_x = near(x1, x2, tol = 0.1),
      match_y = near(y1, y2, tol = 0.1)
    ) %>%
    filter(match_x | match_y)
}

fuzzy_joined <- bind_rows(results)
str(fuzzy_joined)
# 'data.frame':	236191 obs. of  7 variables:
#  $ ZIP    : chr  "90222" "90222" "90222" "90222" ...
#  $ x1     : num  0.0848 0.0848 0.0848 0.0848 0.0848 ...
#  $ y1     : num  0.703 0.703 0.703 0.703 0.703 ...
#  $ x2     : num  -1.8174 -1.2931 -0.6456 -0.0715 0.0443 ...
#  $ y2     : num  0.795 0.7 0.719 0.763 0.578 ...
#  $ match_x: logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
#  $ match_y: logi  TRUE TRUE TRUE TRUE FALSE FALSE ...

kwgood1980 · July 5, 2018, 3:17pm

I need to stop doing that. I'm so used to "data-frames" that I instinctively type df1, df2, etc. I am using data.tables.

kwgood1980 · July 5, 2018, 3:51pm

This seems to be working. The function contains the above jaro-winkler matching. RAM usage is fairly stable since we are continually removing DF1. The .csv is growing in size, however I have no idea how long this is going to run for. Is this doing what I'm assuming by grabbing a row in the NJ data.table, looking at the ZIP, and merging it to all the rows witch matching zips from DT2, and then applying my function, etc?

NJ <- dt1[StateP == "NJ"]
setkey(NJ, ZipP)
res<- NJ[dt2, on = 'ZipP', allow.cartesian=TRUE]
for(row in NJ$ZipP) {
NJ[,function(.SD)]}
rm(NJ)

I'm assuming this may not be ideal since I have to constantly grow the res data.table to whatever the loop is. I would imagine there is a penalty for copying and growing over and over again?

kwgood1980 · July 5, 2018, 5:43pm

This didn't work. Let it run for an hour and it couldn't finish. This would be a nightmare for larger states.

I'm wondering if potentially breaking it up by ZipCode(more zips but much smaller datasets) would be the way to go.... Our largest Zip count is about 2600. This is barely more than our smallest state. My thought is, if I break the data-sets up by zip, then the merges would be much smaller as would be the stringdist function. I would have to find a way how to dynamically grab the Zips I need without manual intervention.

This might be the best way to go..

martin.R · July 5, 2018, 9:28pm

I don't know what the correct solution would be, but you are correct that there is a severe penalty for doing this:.

You need to work out a method which does not loop over the rows.

Also, this snippet generates a left join between dt2 and NJ:

res<- NJ[dt2, on = 'ZipP', allow.cartesian=TRUE]

If you want an inner join instead, then use:
res<- NJ[dt2, on = 'ZipP', nomatch = 0, allow.cartesian=TRUE]

If dt2 is also keyed, then you can omit the on = 'ZipP' part.

kwgood1980 · July 5, 2018, 9:40pm

What if I create an empty vector/data.table/whatever before the loop so it's just appending as opposed to growing, shrinking, etc? I can make it say, 1.5x larger than the biggest possible amount of rows? I honestly don't know of a way to get out of the loop. Let me see if I can create some fake data to give everyone an idea of what columns I have to work with.

Good intel about the merge. There is a ton of bad information on SO and other sites in regards to how data.tables merge process works. This has been throwing me off for a bit.

martin.R · July 5, 2018, 10:05pm

As data.table has been around for a long time, then so have SO Q&A's even when the package has moved on. There are also multiple allowable syntaxes, which does not always help.

My introduction to data.table was when trying to perform a join between two df's with very little RAM in my early R days. I found a post (can no longer find it unfortunately) with the many options in base, sqldf, (early) dplyr, data.table and other packages. I thought the X[Y] syntax was magic and it was also blazingly fast which impressed me as a n00b.

If you still get stuck you might try posting on SO about your problem. The authors do answer a lot of questions there and there are a handful of others whose knowledge is authoritative about the package.

kwgood1980 · July 5, 2018, 10:06pm

Ok, NOW we have progress... I ran the below code:

sqlzip <- sqlQuery(con "select * from ziptable")
sqlzip <- sqlzip[ZipP = '10001']

starttime <- Sys.time()
for (row in sqlzip$Zipp) {
data <- DT1[ZipP %in% row]
DT2 <- data[customers, nomatch = 0, allow.cartesian=TRUE]
data[,function(.SD)]
endtime <- Sys.time
totaltime <- (endtime - starttime)
print(totaltime)

1.14 minutes

I ran this for only the largest zipcode in our result set. When I run it for individual zips that have say, 200 rows, it's running in under a second...

It seems like something is happening under the hood which is killing the entire process from completing.

kwgood1980 · July 5, 2018, 10:49pm

Martin, I feel like I've seen your name before.. Datacamp maybe?

But, the issue seems to have been the join... I can see the KB size on the .csv being written to flying up.. I'm betting it's going to be done within the next few hours based on previous csv sizes from SAS. Almost feels like a bug with data.table? I don't know.

jonspring · July 5, 2018, 11:16pm

It sounds like you're getting closer. Interesting problem, so I'd love to hear what ends up working best.

I'm curious how fuzzyjoin compares in speed for your scenario. I could imagine an approach where you run two fuzzyjoins and combine the results, since it seems each of your tests for a match require either:

MatchBizName >= 0.9
MatchPhone == 1

I'd guess this would already be pretty manageable, size-wise, and then you could filter for distinct matches and for additional aligned fields from your tests, e.g. to match City, StreetNum, and StreetName... But is it slow?

Here's some fake data:

library(tidyverse)

# Let's say these are your sales leads
tableA <- tribble(
  ~zipcode, ~state, ~address,  ~bizname,
  94103,  "CA",    "11 Main",  "Shoe Biz",
  94103,  "CA", "55 2nd St.",  "Show Biz",
  94103,  "CA", "56 2nd St.",  "FakeCo Widgets",
  10010,  "NY",    "11 Main",  "Famous Original Rays",
  10010,  "NY",    "63 22nd",  "Famous Original Roys"
) %>%
  rowid_to_column(var = "business_ID")

# And these are provided by a vendor
tableB <- tribble(
  ~zipcode, ~state, ~address,  ~bizname, ~leadQuality,
  94103,  "CA",    "11 Mainz", "Shoe's Biz", 8,
  94103,  "CA", "55A 2nd St.", "Shower Biz", 5, 
  94103,  "CA", "56 2nd St.",  "FakeCo Widgets Inc.", 3,
  10010,  "NY",   "13 Main",   "Famous Original Rays", 9,
  10010,  "NH", "63 22nd St.", "Famous Original Roys", 1
)

And here's a fuzzyjoin between the two tables:

library(fuzzyjoin)
combo_fuzzy <-
  tableA %>%
  stringdist_left_join(tableB, by = c("bizname", "bizname"), 
                       max_dist = 0.1, method = "jw") %>%
  # This first step might still make a lot of matches for common
  # company names, so let's filter...
  filter(zipcode.x == zipcode.y)

The join could get big for chains or common names, so you might need to run in a loop by zipcode or city or state for it to work efficiently.