High performance and regex


#1

I have to extract parts of some strings and modify them using regex (think URL validation/modification, for instance) in a data frame of about 30 million rows. What general advice should one follow? I'm aware this isn't specifically an R question, but I suppose there may be R-specific dimensions to it.


#2
  • Use the stringi package, especially its ability to vectorize most function arguments
  • Use the data.table package for fast subsetting, group operations, and assignment (a sketch combining both points follows this list)
  • Before doing anything hacky and hard to understand, ask yourself if you really need the extra efficiency
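
A minimal sketch of the first two points together, assuming a data.table `dt` with a character column `url` (the names and patterns here are hypothetical, just to show the shape of it):

```r
library(data.table)
library(stringi)

# toy stand-in for the 30M-row table
dt <- data.table(url = c("http://example.com/a?x=1",
                         "https://example.org/b"))

# stringi functions are vectorized over the whole column, and data.table's
# `:=` assigns by reference, so no copy of the big table is made
dt[, scheme := stri_extract_first_regex(url, "^[a-z]+(?=://)")]
dt[, url := stri_replace_first_regex(url, "^http://", "https://")]
```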

#3

It may be worth splitting the data and processing it in parallel. That should be straightforward using parallel::parSapply (for example). Only if stringi isn't already fast enough :smile:.
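Something along these lines (a sketch, assuming a character vector `urls` and a hypothetical http-to-https rewrite):

```r
library(parallel)

urls <- sprintf("http://example.com/page/%d", 1:1e5)  # hypothetical input

cl <- makeCluster(detectCores() - 1L)
# split the vector into one chunk per worker
chunks <- split(urls, cut(seq_along(urls), length(cl), labels = FALSE))

res <- parSapply(cl, chunks, function(x)
  stringi::stri_replace_first_regex(x, "^http://", "https://")
)
stopCluster(cl)

urls <- unlist(res, use.names = FALSE)
```

Keep in mind that serializing the chunks out to the workers and collecting the results has its own cost, so this only pays off once the per-element regex work is heavy enough to dominate that overhead.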


#4

There can be a large performance difference between a simple, static pattern and a complex regex that involves lookaround.
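A rough way to see this for yourself (random test data assumed; exact timings will vary by machine): a fixed-string search skips the regex engine entirely and tends to be fastest, a plain pattern is slower, and lookaround adds further matching work on top:

```r
library(stringi)

x <- stri_rand_strings(1e5, 50)  # assumed test data

system.time(stri_detect_fixed(x, "abc"))           # literal scan, no regex engine
system.time(stri_detect_regex(x, "abc"))           # simple static pattern
system.time(stri_detect_regex(x, "(?<=a)bc(?!d)")) # with lookaround
```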