I am doing text mining on Arabic text. I wrote this code to check the number of characters and, if it is greater than 5, apply the gsub function,
but it gives me an error. I am sure the error comes from the gsub function, but I do not know how to deal with it.
It does the job, but I want to use it inside sapply so that I can go through the whole data frame.
So I just need a function equivalent to gsub that I can use as an alternative.
I tried this code, but it did not work: it splits the words into characters and does not remove anything.
sapply(strsplit(text, ''), function(i){str_remove(i[nchar(i) > 5] , regex("\u0627\u05d0", dotall = TRUE))
+ ;paste(i,collapse = ' ')})
[1] "ا ه ل ا و س ه ل ا א"
As written, you are splitting the text into characters, not words: strsplit(text, '') uses an empty separator, so every character becomes its own element.
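To make the difference concrete, here is a minimal sketch in base R. The sample string is an assumption chosen for illustration, not your actual data; it is written with \u escapes so that it does not depend on the file encoding.

```r
# Hypothetical sample text ("اهلا وسهلا"), used only for illustration.
text <- "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627"

# An empty separator yields one element per character, including the space.
chars <- strsplit(text, "")[[1]]

# A space separator yields one element per word.
words <- strsplit(text, " ", fixed = TRUE)[[1]]

length(chars)  # one entry per character
length(words)  # one entry per word
```

Once the text is split on spaces, nchar(words) > 5 tests the length of each word rather than of each single character.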
If your data frame has lines of text in columns, you can use dplyr::mutate to apply a vectorized function to a whole column at once. For illustration, assume you are analyzing the lead articles of a newspaper by date.
library(dplyr)
library(magrittr)
library(stringi)
library(stringr)
stri_locale_set("ar")
# structure of data frame, df
published lead
<date> <chr>
df_trim <- df %>% mutate(lead = your_function(lead))
The function would implement the sample code for a block of text: split it into words, select the words with nchar(word) > 5, apply str_remove to strip the targeted characters from those words, and paste the words back together.
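One possible shape for your_function is sketched below. The name remove_from_long_words, the min_chars argument, and the example pattern (a leading "ال") are assumptions made for illustration; substitute the characters you actually want to remove. Note that str_remove drops only the first match in each word; str_remove_all would drop every match.

```r
library(stringr)

# Hypothetical helper: name, arguments and threshold are illustrative.
# For each element of `text`, remove `pattern` from words longer than
# `min_chars` characters and leave shorter words untouched.
remove_from_long_words <- function(text, pattern, min_chars = 5) {
  vapply(strsplit(text, " ", fixed = TRUE), function(words) {
    long <- nchar(words) > min_chars                    # which words qualify
    words[long] <- str_remove(words[long], pattern)     # first match only
    paste(words, collapse = " ")                        # rebuild the text
  }, character(1))
}

# Example input ("الكتاب جديد"); the pattern is an assumed example that
# strips a leading "ال" from words of more than 5 characters.
sample_text <- "\u0627\u0644\u0643\u062a\u0627\u0628 \u062c\u062f\u064a\u062f"
result <- remove_from_long_words(sample_text, "^\u0627\u0644")
```

Because the helper takes a character vector and returns one of the same length, it can be dropped into the mutate call directly, for example mutate(lead = remove_from_long_words(lead, "^\u0627\u0644")), with no need for sapply over the data frame.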