High performance and regex


#1

I have to extract parts of some strings and modify them using regex (think URL validation/modification, for instance) in a data frame of about 30 million rows. What general advice should one follow? I'm aware this isn't specifically an R question, but I suppose there may be R-specific dimensions to it.


#2
  • Use the stringi package, especially its ability to vectorize most function arguments
  • Use the data.table package for fast subsetting, group operations, and assignment (a sketch combining both points follows this list)
  • Before doing anything hacky and hard to understand, ask yourself if you really need the extra efficiency
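
A minimal sketch of the first two points together, assuming a data.table `dt` with a character column `url` (the names and patterns here are hypothetical, just to show the shape of it):

```r
library(data.table)
library(stringi)

# toy stand-in for the 30M-row table
dt <- data.table(url = c("http://example.com/a?x=1",
                         "https://example.org/b"))

# stringi functions are vectorized over the whole column, and data.table's
# `:=` assigns by reference, so no copy of the big table is made
dt[, scheme := stri_extract_first_regex(url, "^[a-z]+(?=://)")]
dt[, url := stri_replace_first_regex(url, "^http://", "https://")]
```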

#3

It may be worth splitting the data and processing it in parallel. That should be straightforward using parallel::parSapply (for example). Only if stringi isn't already fast enough :smile:.
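Something along these lines (a sketch, assuming a character vector `urls` and a hypothetical http-to-https rewrite):

```r
library(parallel)

urls <- sprintf("http://example.com/page/%d", 1:1e5)  # hypothetical input

cl <- makeCluster(detectCores() - 1L)
# split the vector into one chunk per worker
chunks <- split(urls, cut(seq_along(urls), length(cl), labels = FALSE))

res <- parSapply(cl, chunks, function(x)
  stringi::stri_replace_first_regex(x, "^http://", "https://")
)
stopCluster(cl)

urls <- unlist(res, use.names = FALSE)
```

Keep in mind that serializing the chunks out to the workers and collecting the results has its own cost, so this only pays off once the per-element regex work is heavy enough to dominate that overhead.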


#4

There can be a large performance difference between a simple, static pattern and a complex regex that involves lookaround.
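A rough way to see this for yourself (random test data assumed; exact timings will vary by machine): a fixed-string search skips the regex engine entirely and tends to be fastest, a plain pattern is slower, and lookaround adds further matching work on top:

```r
library(stringi)

x <- stri_rand_strings(1e5, 50)  # assumed test data

system.time(stri_detect_fixed(x, "abc"))           # literal scan, no regex engine
system.time(stri_detect_regex(x, "abc"))           # simple static pattern
system.time(stri_detect_regex(x, "(?<=a)bc(?!d)")) # with lookaround
```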