regular expression with Chinese characters

Hi experts,

I am trying to find sentences in which two specified words occur with fewer than 4 Chinese characters between them.

library(stringr)

input <- data.frame(
  stringsAsFactors = FALSE,
  x = c('我有一張股票', '我有玩股票', '我買股票', '我看了股票')
)
p <- '(有|玩|買|賣|看)(w{0,4})股票'
str_subset(input$x, regex(p))

I expected all four items to be returned, but only two were. Could anyone help me correct the code so that all items are returned? Also, when using str_subset, the length of input$x seems to need to be a multiple of the length of p. For example, an error occurs (longer object length is not a multiple of shorter object length) if input$x has length 5 and p has length 2. How can I avoid this error? Thanks.

Best,
Veda

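I think your regular expression is wrong. As written, w{0,4} matches zero to four literal "w" characters rather than arbitrary characters, so only the strings where the verb comes directly before 股票 are matched. Try something like the following. Square brackets [] mean "one of the characters inside the brackets", a dot . matches any character, and round brackets () group things together.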

str_extract_all(input$x, '[有玩買賣看].{0,4}股票')
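
If you want the whole sentences back rather than the matched fragments, the same pattern can be passed to str_subset instead (str_extract_all returns a list of the matched substrings). Below is a minimal sketch, assuming the stringr package is installed; the `patterns` vector and the other variable names are made up for illustration. It uses {0,3} because the question asks for fewer than 4 characters in between; change it to {0,4} if exactly 4 should also count. On the length problem: str_subset is vectorised over both string and pattern, so a pattern vector of length 2 gets paired with the strings element by element instead of each pattern being tried on every string. Two possible ways around that are to collapse the patterns into a single alternation, or to loop over the patterns.

```r
library(stringr)

input <- data.frame(
  stringsAsFactors = FALSE,
  x = c('我有一張股票', '我有玩股票', '我買股票', '我看了股票')
)

# One pattern: a verb, then 0 to 3 characters of any kind, then 股票.
# {0,3} encodes "fewer than 4 characters apart".
p <- '[有玩買賣看].{0,3}股票'
whole_sentences <- str_subset(input$x, p)

# Several patterns (hypothetical example vector, not from the question).
patterns <- c('[有玩].{0,3}股票', '[買賣看].{0,3}股票')

# Option A: collapse into one alternation so the pattern has length 1.
combined <- str_c('(?:', patterns, ')', collapse = '|')
any_match <- str_subset(input$x, combined)

# Option B: test each pattern separately against the full vector;
# the result is a list with one character vector per pattern.
per_pattern <- lapply(patterns, function(pt) str_subset(input$x, pt))
```

Option A answers "which sentences match at least one pattern", while Option B keeps the results separate for each pattern, so pick whichever fits what you need downstream.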
