splitstackshape::cSplit relies on base strsplit under the hood. strsplit can also use Perl-like (PCRE) regular expressions via the perl = TRUE parameter, but cSplit doesn't expose this option to you, so you're stuck with Extended Regular Expression syntax.
A valid R extended regex string that matches what you want it to match is:
"\\n[[:digit:]]+\\. "
However, strsplit (and therefore cSplit) does not include the delimiter in the split output. So even with the right syntax, you're going to lose the step numbers in every split line after the first one:
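# Two example cells, each holding several newline-separated steps
txt <- c(
"1. line one
2. line two
3. line three",
"24. line one
25. 1) first point
2) second point"
)
# The delimiter (newline plus step number) is dropped from the output
strsplit(txt, "\\n[[:digit:]]+\\. ")
#> [[1]]
#> [1] "1. line one" "line two"    "line three"
#>
#> [[2]]
#> [1] "24. line one"                    "1) first point\n2) second point"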
In R's Perl-like regex, you could solve this problem with a lookahead, but R's extended regex doesn't support lookaheads. I'm afraid I can't think of a way to keep the delimiters using only R extended regex — maybe another helper can?
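To make the engine difference concrete, here is a small sketch that tries the same lookahead pattern with both engines. It assumes the default engine rejects the lookahead with an error (which is what I'd expect from R's extended regex), so the first call is wrapped in tryCatch; the variable names are just for illustration.

```r
x <- "1. line one\n2. line two"

# Default engine (extended regex): the lookahead should be rejected,
# so capture the error message instead of stopping
ere <- tryCatch(
  strsplit(x, "\\n(?=\\d+\\. )"),
  error = function(e) conditionMessage(e)
)

# PCRE engine: the lookahead is accepted and only the newline is consumed,
# so the step number stays attached to the second piece
pcre <- strsplit(x, "\\n(?=\\d+\\. )", perl = TRUE)
```

Here ere should end up holding an error message rather than a split result, while pcre holds the two steps with their numbers intact.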
Otherwise, I can see a couple of options:
- Use more string processing to restore the missing step numbers after the splitting is done
- Fork cSplit and add a perl = TRUE parameter to its call to strsplit, then use a Perl-like regex. For example, the following R Perl-like regex gives the splits you want:
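# perl = TRUE enables the lookahead, so the step number is not consumed
strsplit(txt, "\\n(?=\\d+\\. )(?m)", perl = TRUE)
#> [[1]]
#> [1] "1. line one"   "2. line two"   "3. line three"
#>
#> [[2]]
#> [1] "24. line one"
#> [2] "25. 1) first point\n2) second point"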
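If forking feels like too much, a rough stand-in is a small base R helper that calls strsplit with perl = TRUE and stacks the pieces into long format. This is only a sketch: split_rows_perl, the id column, and the steps column are hypothetical names I made up, and it is not how cSplit is implemented internally.

```r
txt <- c(
  "1. line one\n2. line two\n3. line three",
  "24. line one\n25. 1) first point\n2) second point"
)
df <- data.frame(id = c("a", "b"), steps = txt, stringsAsFactors = FALSE)

# Hypothetical helper (not part of splitstackshape): split one column with a
# PCRE pattern and return one row per piece, repeating the other columns
split_rows_perl <- function(df, col, pattern) {
  pieces <- strsplit(as.character(df[[col]]), pattern, perl = TRUE)
  n <- lengths(pieces)                        # number of pieces per input row
  out <- df[rep(seq_len(nrow(df)), n), , drop = FALSE]
  out[[col]] <- unlist(pieces)
  rownames(out) <- NULL
  out
}

long <- split_rows_perl(df, "steps", "\\n(?=\\d+\\. )")
```

One caveat: a cell holding an empty string produces zero pieces, so that row would disappear from the result.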
If you did fork cSplit to support the perl = TRUE option, you might consider opening an issue (or even a pull request!) on splitstackshape's GitHub to see if the maintainer wants to add this functionality into the package.
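For the first option (restoring the step numbers after the split), one possible approach sticks to extended regex: pull out the matched delimiters with regmatches and gregexpr, then paste them back onto the pieces. It's a minimal sketch using plain strsplit rather than cSplit, and it assumes no cell ends with a delimiter (otherwise the pieces and delimiters wouldn't line up).

```r
txt <- c(
  "1. line one\n2. line two\n3. line three",
  "24. line one\n25. 1) first point\n2) second point"
)
pattern <- "\\n[[:digit:]]+\\. "

# Split as before; the delimiters are dropped from the pieces
pieces <- strsplit(txt, pattern)

# Separately collect the delimiters that were matched in each string
delims <- regmatches(txt, gregexpr(pattern, txt))

# Re-attach each delimiter (minus its leading newline) to the piece after it;
# the first piece never lost its number, so it gets an empty prefix
restored <- Map(function(p, d) {
  paste0(c("", sub("^\\n", "", d)), p)
}, pieces, delims)
```

The same idea should carry over to cSplit output, but you would need to keep track of which rows came from which original cell.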
\. were over-escaped. I find RegExr to be a great tool for debugging (and learning about) regular expressions, especially because of its "explain" feature. Here's a test of the regex you tried: https://regexr.com/3oat4