strsplit - using regular expressions to count consonant clusters

lindsayg · September 16, 2021, 2:58am

Hi! I am pretty new to R and regular expressions, and I am looking for suggestions on how to use the strsplit() function with a regular expression that suits my needs. I am writing a program to count consonant clusters (two or more consonants in a row) in a word, but I am having trouble writing a regex that describes this properly.

Here is the code I am working with:

klattese <- "strieeteff"
split <- strsplit(klattese, "[iIEe@aWY^cOoUuRx|X\\-\\']+")

This works for a string like klattese above, producing the expected output "str" "t" "ff", but it fails on more complicated strings.

klattese <- "^-b@ˈn-d^nd"
split <- strsplit(klattese, "[iIEe@aWY^cOoUuRx|X\\-\\']+")

This produces "" "-b" "ˈn-d" "nd", but my expected output is "b" "n" "d" "nd".

Are there any suggestions for a different regex I could use to get the expected results? I think it may have something to do with the special characters ' and -, but I am not certain. I have also tried the regex "[iIEe@aWY^cOoUuRx|X\-\']+" with a single backslash for each escape, but still no luck.

lindsayg · September 16, 2021, 6:12pm

The solution ended up being a slightly more organized regular expression, split into two alternatives:
"([iIEe@aWY^cOoUuRx|X\\ˈ]+|-+)+"

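For later readers, here is a minimal sketch of how a pattern like this can be used to actually count the clusters. Two things differ from the original pattern. The stress mark in the data is ˈ (U+02C8), not the ASCII apostrophe, so `\\'` never matched it. The hyphen in `\\-\\'` also appears to have been read as a range between two backslashes rather than as a literal hyphen; that reading is an assumption, though it is consistent with the output in the question. The sketch below is a variant of the accepted pattern, not the exact one: it writes the stress mark as the escape `\u02c8` and puts the hyphen last inside the brackets, where it is literal. The helper names `consonant_runs` and `count_clusters` are made up for illustration.

```r
# Hypothetical helper (not from the thread): split a Klattese string on runs
# of non-consonant symbols and return the consonant runs that remain.
consonant_runs <- function(word) {
  # Separators: the vowel symbols from the question, the ASCII apostrophe,
  # the stress mark U+02C8, and the hyphen. The hyphen comes last in the
  # bracket expression, so it is literal and needs no backslash.
  sep <- "[iIEe@aWY^cOoUuRx|X'\u02c8-]+"
  parts <- strsplit(word, sep)[[1]]
  # A separator at the start of the string leaves "" as the first piece.
  parts[nzchar(parts)]
}

# A cluster is a run of two or more consonants, as defined in the question.
count_clusters <- function(word) {
  sum(nchar(consonant_runs(word)) >= 2)
}

runs1 <- consonant_runs("strieeteff")        # "str" "t" "ff"
runs2 <- consonant_runs("^-b@\u02c8n-d^nd")  # "b" "n" "d" "nd"
n1 <- count_clusters("strieeteff")           # 2
n2 <- count_clusters("^-b@\u02c8n-d^nd")     # 1
```

Dropping the empty strings matters because strsplit() returns "" as the first piece whenever the string starts with a separator, as in the second example.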