Regular expression with csplit function for splitting cells in excel

readxl

#1

Hello, i have a question, currently i am trying to split columns in an excel file, in a way that i can split each line if that line begins with a digit (and the plus for double digit numbers) and the number is followed by full stop "."that is defined by a regular expression that works with java script, but i can't make it work with R. i have also tried using the simple "\r" but it is not consistent with all of my data.

Also here is a sample of my excel file

library(readxl)
library(splitstackshape)
library(xlsx)
df = read_excel("master.xlsx")
df2 <- cSplit(df, c("Steps", "Result"), sep = "/(?=^\\d+\\. )/m", direction = "long")
write.xlsx(df2, “splitted.xlsx")

And here the regular expression that works with java script
text.split(/(?=^\d+\. )/m)


#2

According to the cSplit docs, you need to set fixed = FALSE for sep to be treated as a regular expression.

Does it work with that parameter setting?


#3

Thank you for your reply, Yes i have tried to add that parameter, but it would then throw an error that this is an invalid regular expression.

> df2 <- cSplit(df, c("Steps", "Result"), sep = "/(?=^\\d+\\. )/m", direction = "long", fixed = FALSE)
Error in strsplit(indt[[splitCols[x]]], split = sep[x], fixed = fixed) : 
  invalid regular expression '/(?=^\d+\. )/m', reason 'Invalid regexp'

#4

Could you supply a couple of examples of input and expected output?


#5

Yes here is the sample input and here is the sample output, thanks.


#6

splitstackshape::cSplit relies on base strsplit under the hood, and strsplit uses Extended Regular Expressions by default. This type of regex has slightly different syntax and fewer features than the Javascript implementation. strsplit can also use Perl-like (PCRE) regular expressions via the perl = TRUE parameter, but cSplit isn't exposing this option to you, so you're stuck with Extended Regular Expression syntax.

A valid R extended regex string that matches what you want it to match is:

"\\n[[:digit:]]+\\. "

:exclamation:BUT: strsplit (and therefore cSplit) does not include the delimiter in the split output. So even with the right syntax, you're going to lose the step numbers in every split line after the first one:

txt <- c(
"1. line one
2. line two
3. line three", 
"24. line one
25. 1) first point
2) second point"
)

strsplit(txt, "\\n[[:digit:]]+\\. ")
#> [[1]]
#> [1] "1. line one" "line two"    "line three" 
#> 
#> [[2]]
#> [1] "24. line one"                    "1) first point\n2) second point"

In R's Perl-like regex, you could solve this problem with a lookahead, but R's extended regex doesn't support lookaheads. I'm afraid I can't think of a way to keep the delimiters using only R extended regex — maybe another helper can?

Otherwise, I can see a couple of options:

  • Use more string processing to restore the missing step numbers after the splitting is done (:tired_face:)
  • Fork cSplit and add a perl = TRUE parameter to its call to strsplit, then use a Perl-like regex. For example, the following R Perl-like regex gives the splits you want:
strsplit(txt, "\\n(?=\\d+\\. )(?m)", perl = TRUE)
#> [[1]]
#> [1] "1. line one"   "2. line two"   "3. line three"
#> 
#> [[2]]
#> [1] "24. line one"                       
#> [2] "25. 1) first point\n2) second point"

If you did fork cSplit to support the perl = TRUE option, you might consider opening an issue (or even a pull request!) on splitstackshape's GitHub to see if the maintainer wants to add this functionality into the package.

About your regex in Javascript...

Btw, I'm not sure that the regex you tried would have given you the desired output in Javascript, either. The lookahead was around the whole expression, so nothing is matched, and the \d and \. were over-escaped. I find RegExr to be a great tool for debugging (and learning about) regular expressions, especially because of its "explain" feature. Here's a test of the regex you tried: https://regexr.com/3oat4


#7

This will take care of the splitting, I will leave the intra-cell line breaks up to you :slight_smile:

# Clear workspace
rm(list=ls())

# Load libraries
library('tidyverse')
library('readxl')
library('openxlsx')

# Set functions
do_split = function(x, pattern = "\\d+\\.\\s{1}"){
  if( is_tibble(x) ){ x = pull(x) }
  num_bullets = x %>% str_extract_all("\\d+\\. ") %>% unlist
  x %>% str_split(pattern) %>% unlist %>% .[.!=""] %>% str_c(num_bullets,.) %>%
    list %>% return
}

# Read data
d = read_excel(path = '~/Desktop/master.xlsx')

# Wrangle data
o = d %>%
  mutate(Result = Result %>% do_split, Steps  = Steps %>% do_split) %>%
  unnest(Result, Steps)

# Write data
write.xlsx(x = o, file =  '~/Desktop/out.xlsx')

#8

:clap: :100: for a solution that frees the OP from the limitations of the package they were using! It had not occurred to me that a clever application of list columns could replicate the really valuable part of what that package was doing. Really nice.

(But, um, you might want to keep a fire extinguisher handy if you see Jenny Bryan coming...)


#9

:+1: :fire_engine: (Know about the workspace controversy - Not entirely convinced :wink:)


#10

thank you so much for clarifying that and for putting an effort to help and finding alternative solutions, i appreciate it!:slight_smile:


#11

Thank you, you are definitely a life saver :smiley:


#12

No prob, glad to help! Make sure you understand what I did, so that you might safe yourself in the future :sunglasses:


#13

I am actually not sure if i entirely get the function, as i wanted to see if i can change it a little, because while i was trying with an input data that looks like this, the behaviour is that it would output a file like this, instead of the desired output which is this. I'm not sure why this behaviour happens.