Data.Table method of doing the following:

datatable

#1

I have some code below that I feel could be sped up by approaching it differently.

  df1leadupdate <- subset(df1, (glm==1))
  df1leadcheck <- subset(df1, (glm==0))
  df1insert <- subset(df1leadcheck, !(ID %in% df1leadupdate$ID))
  df1insert <- unique(df1insert, by = "ID")

Basically, I have a table that I split based on the model's findings, with the rule that if an ID is in one table, it can't be in the other. At the end, I just grab the unique rows. Is there a 'data.table' way of doing the same thing?


#2

If I understand correctly, this should work:

df1[
  ,
  if (all(glm == glm[1])) {  # All values of glm must be the same
    .SD[1]                   # Take only the first row
  },
  by = list(ID)
]
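
One caveat, in case it helps: as written, the grouped version above also keeps IDs whose rows are all matches. If only the IDs with no match at all are wanted, a variant along these lines might be closer. This is just a sketch on made-up data, and it assumes glm only ever holds 0 or 1:

```r
library(data.table)

# Hypothetical toy data; `glm` is assumed to hold only 0/1 values.
df1 <- data.table(
  ID  = c(1, 1, 2, 2, 3),
  glm = c(1, 0, 0, 0, 1),
  x   = c("a", "b", "c", "d", "e")
)

# Keep one row per ID, but only for IDs whose rows are all non-matches.
df1insert <- df1[, if (all(glm == 0)) .SD[1], by = ID]
```

Here only ID 2 survives, because IDs 1 and 3 each have at least one row with glm equal to 1.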

#3

I think this is on the right track, I just did a horrible job of explaining it.

Here's some more of the code to break it out a bit further:

  df1$glm <- predict(fit.glm, df1)  # Apply the GLM here
  df1leadupdate <- df1[glm == '1']  # Subset any "matches"
  df1leadcheck <- df1[glm == '0']   # But I also want the non-matches
  # An ID can't be in both: e.g. if ID 123 is a MATCH it can't be a non-match, and vice versa
  df1insert <- subset(df1leadcheck, !(ID %in% df1leadupdate$ID))
  df1insert <- unique(df1insert, by = "ID")  # Make sure that inserts will be unique

  if (nrow(df1leadupdate) != 0) {
    fwrite(df1insert, file = paste0("D:/nomatches/", row, ".csv"))
  } else {
    fwrite(df1leadupdate, file = paste0("D:/matches/", row, ".csv"))
  }
  rm(df1)
  rm(df1leadcheck)
  rm(df1leadupdate)
}  # Closes the enclosing loop (not shown)

So I'm taking my big DT, subsetting it into matches and non-matches, ensuring that anything going into the insert table can't also be in the update table, and finally ensuring the insert records are unique.
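
For what it's worth, here is a rough sketch of how that description might look in data.table syntax, using a single filter or an anti-join on ID. The toy table and the column x are made up for illustration, and I'm assuming glm holds plain 0/1 values:

```r
library(data.table)

# Hypothetical toy data standing in for the real table.
df1 <- data.table(
  ID  = c(1, 1, 2, 2, 3, 4, 4),
  glm = c(1, 0, 0, 0, 1, 0, 1),
  x   = c("a", "b", "c", "d", "e", "f", "g")
)

# Rows flagged as matches.
df1leadupdate <- df1[glm == 1]

# Option A: one filter, then the first row per ID.
df1insert <- unique(df1[glm == 0 & !(ID %in% df1leadupdate$ID)], by = "ID")

# Option B: anti-join on ID (X[!Y, on = ...] keeps rows of X with no match in Y).
df1insert_aj <- unique(df1[glm == 0][!df1leadupdate, on = "ID"], by = "ID")
```

Both options should give the same rows here (just ID 2). Whether either one is actually faster than the subset() version would need timing on the real data.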


#4

Can you please turn this question into a reproducible example? That’s the best way to make sure everybody is on the same page about what the code is doing, and therefore helps you and your helpers get to an answer more quickly.


#5

It's 3000 lines of code that analyzes over 100 columns. Creating a reproducible example is honestly almost impossible.

My question about that specific block of code should be adequate for those who know data.table better than I do. The current code works flawlessly; I just want to see the data.table way of writing that specific passage, to find out whether there's any performance boost.


#6

Right. In this case, a good reprex would abstract out the specific thing you want help with, and provide an example other people can run and check their data.table adaptation against. That's a lot more effective than trying to use words to explain what the code is supposed to do, and what success looks like. Supplying a reprex also tends to get you better answers — there are a number of people who aren't even going to bother with your question without a solid reprex to make the problem enticing (reasonable, given that people are volunteering their time to help).
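
To give an idea of scale, a reprex for this question could be as small as the sketch below. The table is made up (the IDs and values are placeholders, not real data); the four lines in the middle are the code from the first post, unchanged:

```r
library(data.table)

# Made-up stand-in for the real table: a few IDs, some repeated, with 0/1 predictions.
df1 <- data.table(
  ID  = c(101, 101, 102, 102, 103, 104),
  glm = c(1, 0, 0, 0, 1, 0)
)

# The code from the question, unchanged.
df1leadupdate <- subset(df1, (glm==1))
df1leadcheck <- subset(df1, (glm==0))
df1insert <- subset(df1leadcheck, !(ID %in% df1leadupdate$ID))
df1insert <- unique(df1insert, by = "ID")

print(df1insert)
```

Anyone proposing a data.table rewrite can then run it on the same df1 and check that it returns the same rows (IDs 102 and 104 in this toy case).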