problem with gsub function

I am doing text mining in Arabic. I wrote this code to check the number of characters in each word and, if it is greater than 5, apply gsub to that word.
It gives me an error, and I am sure the gsub call is the cause, but I do not know how to deal with it.

Here is my code:

try$text<-sapply(strsplit(try$text, ''), function(i){i[nchar(i) > 5] <- gsub('(?<=\\p{L})\\x{064A}\\x{0646}$', '', i[nchar(i) > 5], perl = TRUE);paste(i,collapse = ' ')})

Here is the error:

Error in gsub("(?<=\\p{L})\\x{064A}\\x{0646}$", "", i[nchar(i) > 5], perl = TRUE) : 
  invalid regular expression '(?<=\p{L})\x{064A}\x{0646}$'
In addition: Warning message:
In gsub("(?<=\\p{L})\\x{064A}\\x{0646}$", "", i[nchar(i) > 5], perl = TRUE) :

Thank you.

Here's a partial approach, if I understood your question correctly (it's been more than 50 years since I last studied Arabic).

> library(stringr)
> library(stringi)
> stri_locale_set("ar")
You are now working with stringi_1.2.4 (ar.UTF-8; ICU4C 61.1 [bundle]; Unicode 10.0)
> text <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627\u05d0'
> words <- str_split(text, boundary("word"))[[1]]
> word1 <- words[1]
> word1
[1] "اهلا"
> nchar(word1)
[1] 4
> word2 <- words[2]
> word2
[1] "وسهلاא"
> nchar(word2)
[1] 6
> trimmed <- str_remove(word2, regex("\u0627\u05d0", dotall = TRUE))
> trimmed
[1] "وسهل"
> nchar(trimmed)
[1] 4

It does the job, but I want to use it inside sapply so I can go through the whole data frame.
So I just need a function equivalent to gsub that I can use in its place.
I tried this code, but it did not work: it split the words into characters and did not remove anything.

sapply(strsplit(text, ''), function(i){str_remove(i[nchar(i) > 5] , regex("\u0627\u05d0", dotall = TRUE))
+   ;paste(i,collapse = ' ')})
[1] "ا ه ل ا   و س ه ل ا א"


I'm guessing you need the paste at the beginning of the function.

sapply(strsplit(text, ' '), function(i){paste(str_remove(i[nchar(i) > 5] , regex("\u0627\u05d0", dotall = TRUE)), collapse = ' ')})

It also looks like you've got a typo in the strsplit call; you probably want a space between the quotes.
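
Note that the one-liner above returns only the words longer than five characters, so the shorter words are dropped from the result. If they should be kept unchanged, which is what the original gsub attempt seemed to aim for, a minimal sketch could look like this. It assumes the sample `text` string and the stringr call from the earlier reply; the names `words`, `long` and `result` are only illustrative.

```r
library(stringr)

# Sample string from the earlier reply: two words separated by a space
text <- "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627\u05d0"

result <- sapply(strsplit(text, " "), function(words) {
  # TRUE for the words that have more than five characters
  long <- nchar(words) > 5
  # strip the target sequence from the long words only; short words stay as they are
  words[long] <- str_remove(words[long], "\u0627\u05d0")
  # glue the words back into a single string
  paste(words, collapse = " ")
})
```

Assigning back into `words[long]` is what keeps the short words in place; the paste call then sees the full word vector rather than just the filtered subset.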



As written, it appears that you are splitting text into characters, not words.

If your data frame has lines of text in columns, you can use dplyr::mutate to apply the operation to a whole column at once. For illustration, assume you are analyzing the lead articles of a newspaper by date.

# structure of data frame, df
published     lead
<date>          <chr>

df_trim <- df %>% mutate(lead = your_function(lead))

The function would implement the sample code for a block of text: split it into words, pick out the words with nchar(word) > 5, and then apply str_remove to strip any targeted characters from them.
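
As a rough sketch of what such a function might look like: the helper name `trim_long_words`, its arguments, and the two-row data frame below are all made up for illustration, and the pattern is the one from the earlier reply. It assumes the `lead` column is a character vector with no missing values and that words are separated by single spaces.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: for each string in x, remove `pattern` from the words
# longer than `min_chars` characters and leave the other words unchanged
trim_long_words <- function(x, pattern, min_chars = 5) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    long <- nchar(words) > min_chars
    words[long] <- str_remove(words[long], pattern)
    paste(words, collapse = " ")
  }, character(1))
}

# Made-up data frame with the column layout described above
df <- data.frame(
  published = as.Date(c("2018-10-01", "2018-10-02")),
  lead = c("\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627\u05d0",
           "\u0627\u0647\u0644\u0627"),
  stringsAsFactors = FALSE
)

# mutate passes the whole column to the helper, which returns one string per row
df_trim <- df %>% mutate(lead = trim_long_words(lead, "\u0627\u05d0"))
```

Using vapply with `character(1)` rather than sapply guarantees one string per row, so the result always fits back into the column.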

