High performance and regex

I have to extract parts of some strings and modify them using regex (think URL validation/modification, for instance) in a data frame of about 30 million rows. What general advice should one follow? I am aware this is not specifically an R question, but I suppose there may be R-specific dimensions to it.

  • Use the stringi package, especially its ability to vectorize most function arguments (see the sketch after this list)
  • Use the data.table package for fast subsetting, group operations, and assignment by reference
  • Before doing anything hacky and hard to understand, ask yourself whether you really need the extra efficiency
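A minimal sketch of combining the first two points, assuming a data.table `dt` with a character column `url` (hypothetical names): stringi functions are vectorized, so a single call covers all 30 million rows, and data.table's `:=` assigns the result by reference without copying the table.

```r
library(data.table)
library(stringi)

dt <- data.table(url = c("http://example.com/a", "https://example.org/b?q=1"))

# Extract the host with one vectorized regex call; column 2 of the
# result matrix holds the first capture group
dt[, host := stri_match_first_regex(url, "^https?://([^/]+)")[, 2]]

# Modify in place by reference: rewrite http:// links to https://
dt[, url := stri_replace_first_regex(url, "^http://", "https://")]
```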

It may be worth splitting the data and processing it in parallel. That should be straightforward using parallel::parSapply (for example), as sketched below. But only if stringi isn't already fast enough :smile:.
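A minimal sketch of that idea, assuming `urls` is the character vector to transform (hypothetical name); `clusterSplit()` divides it into one chunk per worker. Note that shipping the chunks to worker processes has its own cost, so it's worth benchmarking against the plain single-threaded stringi call first.

```r
library(parallel)

urls <- rep(c("http://example.com/a", "https://example.org/b"), 1e6)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(stringi))  # load stringi on every worker

# One chunk per worker; each worker processes its chunk vectorized
chunks <- clusterSplit(cl, urls)
out <- unlist(
  parSapply(cl, chunks, function(x)
    stringi::stri_replace_first_regex(x, "^http://", "https://"),
    simplify = FALSE),
  use.names = FALSE
)
stopCluster(cl)
```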

There can be a big difference between a simple, static pattern and a complex regex pattern that involves lookaround. If the pattern is literal, bypassing the regex engine entirely (e.g. with stringi's *_fixed functions) is usually the fastest option.
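A hedged illustration of that gap, on made-up data: the two calls below produce the same result, but the first is a plain substring replacement while the second makes the regex engine evaluate a lookbehind at each candidate position. Actual timings will vary with the data and the pattern.

```r
library(stringi)

x <- rep("id=123&token=abc", 1e6)

# Static pattern: literal substring replacement, no regex engine
system.time(stri_replace_all_fixed(x, "&token=", "&key="))

# Equivalent rewrite via a lookbehind -- typically noticeably slower
system.time(stri_replace_all_regex(x, "(?<=&)token=", "key="))
```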