High performance and regex


I have to extract parts of some strings and modify them using regex (think URL validation/modification, for instance) in a data frame of about 30 million rows. What general advice should one follow? I am aware this is not specifically an R question, but I suppose there may be R-specific considerations.

  • Use the stringi package, especially its ability to vectorize most function arguments
  • Use the data.table package for fast subsetting, group operations, and assignment
  • Before doing anything hacky and hard to understand, ask yourself if you really need the extra efficiency
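
To make the first two points concrete, here is a minimal sketch, assuming both packages are installed. The table `dt`, its `url` column, and the patterns are made up for illustration; the patterns are nowhere near a complete URL validator.

```r
library(data.table)
library(stringi)

# Hypothetical toy data; the real table would have ~30 million rows.
dt <- data.table(url = c("http://example.com/a",
                         "https://example.org/b?x=1",
                         "not a url"))

# Flag rows that look like http(s) URLs (illustrative pattern only).
dt[, is_url := stri_detect_regex(url, "^https?://[^\\s/]+")]

# Extract the host via a capture group; rows that do not match get NA.
dt[, host := stri_match_first_regex(url, "^https?://([^/\\s]+)")[, 2]]

# Rewrite the scheme only on the flagged rows, assigning by reference.
dt[is_url == TRUE, url := stri_replace_first_regex(url, "^http://", "https://")]

# Arguments are vectorized: each string is checked against its own pattern.
vec_check <- stri_detect_regex(c("abc", "abc"), c("^a", "^b"))
```

Each call works on the whole column at once, and `:=` updates the table in place rather than copying it, which matters at this row count.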


It may be worth splitting the data and processing it in parallel. That should be straightforward using parallel::parSapply (for example). But only if stringi isn’t already fast enough :smile:.
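
A rough sketch of the split-and-parallelize idea, assuming stringi is installed on the workers. It uses `parallel::parLapply` over explicit chunks rather than `parSapply`, so that each worker still gets one vectorized call. The data, the replacement pattern, and the worker count of 2 are all assumptions.

```r
library(parallel)

# Hypothetical data standing in for the real column.
urls <- rep(c("http://example.com/a", "https://example.org/b"), times = 1000)

cl <- makeCluster(2)  # worker count is an assumption; tune to the machine

# One contiguous chunk of indices per worker.
idx <- splitIndices(length(urls), length(cl))
chunks <- lapply(idx, function(i) urls[i])

# Each worker runs a single vectorized stringi call on its chunk.
res <- parLapply(cl, chunks, function(x) {
  stringi::stri_replace_first_regex(x, "^http://", "https://")
})
stopCluster(cl)

# Chunks are contiguous and returned in order, so this restores the original order.
out <- unlist(res, use.names = FALSE)
```

Shipping the chunks to the workers has its own cost, so it is worth timing this against the plain single-process call before committing to it.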


There can be a large performance difference between simple, static patterns and complex regex patterns that involve lookaround.
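
One way to check this on your own data is to time a fixed-string call against a regex that does the same job. The sketch below uses made-up data and a deliberately contrived lookahead; actual timings depend on the data, the pattern, and the machine, so none are quoted here.

```r
library(stringi)

# Hypothetical data standing in for the real column.
x <- rep(c("http://example.com/path", "ftp://example.com/file"), times = 50000)

# Static pattern: a literal prefix comparison, no regex engine involved.
t_fixed <- system.time(a <- stri_startswith_fixed(x, "http://"))

# The same check written with a lookahead, purely for illustration.
t_regex <- system.time(b <- stri_detect_regex(x, "^(?=http://)"))

# Both approaches should agree on every element.
same <- identical(a, b)

# Elapsed seconds for each approach.
elapsed <- c(fixed = t_fixed[["elapsed"]], regex = t_regex[["elapsed"]])
```

If the fixed version is fast enough and covers the cases you need, there is little reason to reach for the more complex pattern.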