problem with gsub function

fatima_mb · December 30, 2018, 9:28am

I am doing text mining on Arabic text. I wrote this code to check the number of characters in each word and, if it is greater than 5, apply the gsub function.
It gives me an error, and I am sure it comes from the gsub function, but I do not know how to deal with it.

here is my code

try$text <- sapply(strsplit(try$text, ''), function(i){i[nchar(i) > 5] <- gsub('(?<=\\p{L})\\x{064A}\\x{0646}$', '', i[nchar(i) > 5], perl = TRUE); paste(i, collapse = ' ')})

here is the error

**Error in gsub("(?<=\\p{L})\\x{064A}\\x{0646}$", "", i[nchar(i) > 5], perl = TRUE) : 
  invalid regular expression '(?<=\p{L})\x{064A}\x{0646}$'
In addition: Warning message:
In gsub("(?<=\\p{L})\\x{064A}\\x{0646}$", "", i[nchar(i) > 5], perl = TRUE) :
**

thank you

technocrat · December 30, 2018, 5:38pm

Here's a partial approach, if I understood your question correctly (it's been more than 50 years since I last studied Arabic):

> library(stringr)
> library(stringi)
> stri_locale_set("ar")
You are now working with stringi_1.2.4 (ar.UTF-8; ICU4C 61.1 [bundle]; Unicode 10.0)
> text <- '\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627\u05d0'
> words <- str_split(text, boundary("word"))[[1]]
> word1 <- words[1]
> word1
[1] "اهلا"
> nchar(word1)
[1] 4
> word2 <- words[2]
> word2
[1] "وسهلاא"
> nchar(word2)
[1] 6
> trimmed <- str_remove(word2, regex("\u0627\u05d0", dotall = TRUE))
> trimmed
[1] "وسهل"
> nchar(trimmed)
[1] 4

fatima_mb · December 31, 2018, 1:02pm

It does the job, but I want to use it inside sapply so I can go through the whole data frame.
So I just need a function equivalent to gsub that I can use in its place.
I tried this code but it did not work: it splits the words into characters and doesn't remove anything.

sapply(strsplit(text, ''), function(i){str_remove(i[nchar(i) > 5] , regex("\u0627\u05d0", dotall = TRUE))
+   ;paste(i,collapse = ' ')})
[1] "ا ه ل ا   و س ه ل ا א"

pete · December 31, 2018, 7:17pm

Hi,

I'm guessing you need the paste at the beginning of the function.

sapply(strsplit(text, ' '), function(i){paste(str_remove(i[nchar(i) > 5] , regex("\u0627\u05d0", dotall = TRUE)), collapse = ' ')})

It also looks like you've got a typo in the strsplit; you probably want a space between the quotes.
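
For what it's worth, here is a minimal sketch of the same idea that writes the cleaned words back before pasting, so that words of 5 characters or fewer stay in the output. It reuses the sample `text` from earlier in the thread. Base R `sub(..., fixed = TRUE)` is only a stand-in for `str_remove`, and the names `words`, `long`, and `result` are illustrative, not from the posts above.

```r
# Sample string from earlier in the thread (two words separated by a space).
text <- "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627\u05d0"

result <- sapply(strsplit(text, " ", fixed = TRUE), function(words) {
  # TRUE for words with more than 5 characters
  long <- nchar(words) > 5
  # Assign the cleaned words back into the vector;
  # sub(fixed = TRUE) removes the first literal match, like str_remove().
  words[long] <- sub("\u0627\u05d0", "", words[long], fixed = TRUE)
  # Shorter words are untouched and still part of the result.
  paste(words, collapse = " ")
})
```

The key difference from calling `str_remove` on its own line is the assignment `words[long] <- ...`; without it the cleaned values are thrown away and the original words get pasted.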

--pete

technocrat · January 1, 2019, 4:10pm

As written, it appears that you are splitting text into characters, not words.

If your data frame has lines of text in columns, you can use dplyr::mutate to vectorize the operation. For illustration, assume you are analyzing the lead articles of a newspaper by date.

library(dplyr)
library(magrittr)
library(stringi)
library(stringr)
stri_locale_set("ar")
# structure of data frame, df:
#   published   lead
#   <date>      <chr>

df_trim <- df %>% mutate(lead = your_function(lead))

The function would implement the sample code for a block of text: split it into words, pick out the words with nchar(word) > 5, and then apply str_remove to those words to strip any targeted characters.
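
A minimal sketch of what `your_function` could look like, under a few assumptions that are mine and not from the posts above: the name `strip_long_suffix` and the `min_chars` argument are hypothetical, words are assumed to be separated by single spaces, and the suffix is the pattern from the original question written with `\u` escapes in place of `\\x{...}`. Only words longer than `min_chars` are modified; the rest are kept as they are.

```r
# Hypothetical helper: strip a trailing U+064A U+0646 from long words only.
strip_long_suffix <- function(x, min_chars = 5) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    # TRUE for words with more than min_chars characters
    long <- nchar(words) > min_chars
    # Remove the suffix only when it follows a letter and ends the word.
    words[long] <- gsub("(?<=\\p{L})\u064A\u0646$", "", words[long], perl = TRUE)
    # Put the words back together into one string per input element.
    paste(words, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Made-up sample input: one 6-character word ending in the suffix, one 3-character word.
sample_lead <- c("\u0645\u0633\u0644\u0645\u064A\u0646 \u062F\u064A\u0646")
cleaned <- strip_long_suffix(sample_lead)
```

With dplyr this would slot in as `mutate(lead = strip_long_suffix(lead))`. Whether this form of the regex avoids the "invalid regular expression" error reported at the top of the thread depends on the R build and locale, which the thread does not show.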
