Regular expression with csplit function for splitting cells in excel

Sherif · April 21, 2018, 2:47pm

Hello, i have a question, currently i am trying to split columns in an excel file, in a way that i can split each line if that line begins with a digit (and the plus for double digit numbers) and the number is followed by full stop "."that is defined by a regular expression that works with java script, but i can't make it work with R. i have also tried using the simple "\r" but it is not consistent with all of my data.

Also here is a sample of my excel file

library(readxl)
library(splitstackshape)
library(xlsx)
df = read_excel("master.xlsx")
df2 <- cSplit(df, c("Steps", "Result"), sep = "/(?=^\\d+\\. )/m", direction = "long")
write.xlsx(df2, “splitted.xlsx")

And here the regular expression that works with java script
text.split(/(?=^\d+\. )/m)

jcblum · April 21, 2018, 7:57pm

According to the cSplit docs, you need to set fixed = FALSE for sep to be treated as a regular expression.

Does it work with that parameter setting?

Sherif · April 22, 2018, 9:19am

Thank you for your reply, Yes i have tried to add that parameter, but it would then throw an error that this is an invalid regular expression.

> df2 <- cSplit(df, c("Steps", "Result"), sep = "/(?=^\\d+\\. )/m", direction = "long", fixed = FALSE)
Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : 
  invalid regular expression '/(?=^\d+\. )/m', reason 'Invalid regexp'

Leon · April 22, 2018, 9:27am

Could you supply a couple of examples of input and expected output?

Sherif · April 22, 2018, 9:56am

Yes here is the sample input and here is the sample output, thanks.

jcblum · April 22, 2018, 1:56pm

splitstackshape::cSplit relies on base strsplit under the hood, and strsplit uses Extended Regular Expressions by default. This type of regex has slightly different syntax and fewer features than the Javascript implementation. strsplit can also use Perl-like (PCRE) regular expressions via the perl = TRUE parameter, but cSplit isn't exposing this option to you, so you're stuck with Extended Regular Expression syntax.

A valid R extended regex string that matches what you want it to match is:

"\\n[[:digit:]]+\\. "

BUT: strsplit (and therefore cSplit) does not include the delimiter in the split output. So even with the right syntax, you're going to lose the step numbers in every split line after the first one:

txt <- c(
"1. line one
2. line two
3. line three", 
"24. line one
25. 1) first point
2) second point"
)

strsplit(txt, "\\n[[:digit:]]+\\. ")
#> [[1]]
#> [1] "1. line one" "line two"    "line three" 
#> 
#> [[2]]
#> [1] "24. line one"                    "1) first point\n2) second point"

In R's Perl-like regex, you could solve this problem with a lookahead, but R's extended regex doesn't support lookaheads. I'm afraid I can't think of a way to keep the delimiters using only R extended regex — maybe another helper can?

Otherwise, I can see a couple of options:

Use more string processing to restore the missing step numbers after the splitting is done ()
Fork cSplit and add a perl = TRUE parameter to its call to strsplit, then use a Perl-like regex. For example, the following R Perl-like regex gives the splits you want:

strsplit(txt, "\\n(?=\\d+\\. )(?m)", perl = TRUE)
#> [[1]]
#> [1] "1. line one"   "2. line two"   "3. line three"
#> 
#> [[2]]
#> [1] "24. line one"                       
#> [2] "25. 1) first point\n2) second point"

If you did fork cSplit to support the perl = TRUE option, you might consider opening an issue (or even a pull request!) on splitstackshape's GitHub to see if the maintainer wants to add this functionality into the package.

About your regex in Javascript...

Btw, I'm not sure that the regex you tried would have given you the desired output in Javascript, either. The lookahead was around the whole expression, so nothing is matched, and the \d and \. were over-escaped. I find RegExr to be a great tool for debugging (and learning about) regular expressions, especially because of its "explain" feature. Here's a test of the regex you tried: https://regexr.com/3oat4

Leon · April 22, 2018, 2:24pm

This will take care of the splitting, I will leave the intra-cell line breaks up to you

# Clear workspace
rm(list=ls())

# Load libraries
library('tidyverse')
library('readxl')
library('openxlsx')

# Set functions
do_split = function(x, pattern = "\\d+\\.\\s{1}"){
  if( is_tibble(x) ){ x = pull(x) }
  num_bullets = x %>% str_extract_all("\\d+\\. ") %>% unlist
  x %>% str_split(pattern) %>% unlist %>% .[.!=""] %>% str_c(num_bullets,.) %>%
    list %>% return
}

# Read data
d = read_excel(path = '~/Desktop/master.xlsx')

# Wrangle data
o = d %>%
  mutate(Result = Result %>% do_split, Steps  = Steps %>% do_split) %>%
  unnest(Result, Steps)

# Write data
write.xlsx(x = o, file =  '~/Desktop/out.xlsx')

jcblum · April 22, 2018, 3:03pm

for a solution that frees the OP from the limitations of the package they were using! It had not occurred to me that a clever application of list columns could replicate the really valuable part of what that package was doing. Really nice.

(But, um, you might want to keep a fire extinguisher handy if you see Jenny Bryan coming...)

Leon · April 22, 2018, 4:45pm

(Know about the workspace controversy - Not entirely convinced )

Sherif · April 23, 2018, 8:47am

thank you so much for clarifying that and for putting an effort to help and finding alternative solutions, i appreciate it!

Sherif · April 23, 2018, 8:48am

Thank you, you are definitely a life saver

Leon · April 23, 2018, 10:09am

No prob, glad to help! Make sure you understand what I did, so that you might safe yourself in the future

Sherif · April 24, 2018, 9:23am

I am actually not sure if i entirely get the function, as i wanted to see if i can change it a little, because while i was trying with an input data that looks like this, the behaviour is that it would output a file like this, instead of the desired output which is this. I'm not sure why this behaviour happens.