tm package: removing unwanted characters works in R but not in knitr

I wish to remove some special characters from a text corpus. I used the tm package in R. It worked successfully in R, but knitr refused to execute. When I commented out the removal of special characters, knitr worked fine. However, I can't skip the removal in my project. How can I eliminate unwanted characters and still get knitr to work?

library(tm)
tfilePath <- "C:/Users/mlrob/Documents/en_US/en_US.twitter.txt"
twitter <- readLines(tfilePath,  skipNul = TRUE)
twittersample <- twitter[rbinom(length(twitter)*.01,length(twitter),.01)]
tdocs <- Corpus(VectorSource(twittersample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
tdocs <- tm_map(tdocs, toSpace, "â") 
tdocs <- tm_map(tdocs, toSpace, "/") 
tdocs <- tm_map(tdocs, toSpace, "@")
tdocs <- tm_map(tdocs, toSpace, "<")
tdocs <- tm_map(tdocs, toSpace, "~")
tdocs <- tm_map(tdocs, toSpace, "#")
tdocs <- tm_map(tdocs, toSpace, "Ÿ")
tdocs <- tm_map(tdocs, toSpace, "ð")
tdocs <- tm_map(tdocs, toSpace, "®")
tdocs <- tm_map(tdocs, toSpace, "\")
                tdocs <- tm_map(tdocs, toSpace, "€")
                tdocs <- tm_map(tdocs, toSpace, "™")
                
Thanks for any advice!

I think the issue is in this line:

tdocs <- tm_map(tdocs, toSpace, "\")

Because \ is the usual escape character, R is trying to escape the following " and doesn't see the end of the character string.

If you substitute tdocs <- tm_map(tdocs, toSpace, "\\\\") (four backslashes instead of one), I think it should work.
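
For example, a quick check in the console (a minimal sketch, with a made-up string, just to show the escaping):

x <- "path\\to\\file"   # the string itself contains single backslashes
cat(x)                  # prints: path\to\file
gsub("\\\\", " ", x)    # the regex needs \\\\ to match one literal backslash
#> [1] "path to file"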

For more details, you can see here: https://stackoverflow.com/questions/14879204/how-to-escape-a-backslash-in-r


Thank you for your reply. I deleted the line with the "\" removal. Then knitr continued to work, but it is not taking out all the special characters that I want it to, and it will not allow any processing to get a term-document matrix. Here is part of that report:

tdocs <- tm_map(tdocs, toSpace, "®")
## Warning in tm_map.SimpleCorpus(tdocs, toSpace, "®"): transformation drops
## documents
tdocs <- tm_map(tdocs, toSpace, "???")
## Warning in tm_map.SimpleCorpus(tdocs, toSpace, "???"): transformation drops
## documents
 tdocs <- tm_map(tdocs, toSpace, "T")
## Warning in tm_map.SimpleCorpus(tdocs, toSpace, "T"): transformation drops

I didn't put in the "???" or the "T". Somehow, knitr is confused.

Maybe it has to do with your locale. Try using hex codes instead, e.g. the hex code for â is "\u00e2".
Also, ? is a regex metacharacter; you have to escape all the metacharacters ., +, *, ?, ^, $, (, ), [, ], {, }, | with \\.
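
To illustrate both points (a minimal sketch, with a made-up string):

x <- "weird â text? here"
gsub("\u00e2", " ", x)   # "â" written as a Unicode escape
#> [1] "weird   text? here"
gsub("\\?", " ", x)      # "?" must be escaped because it is a regex metacharacter
#> [1] "weird â text  here"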

I dropped that line. The code ran, but with other problems. I've noticed that knitr will not allow the tm package to drop special characters that are not on the keyboard (non-ASCII). The tm package works fine in R, but not in knitr. I heard one suggestion, to convert non-ASCII characters to ASCII characters. I heard another suggestion, to delete all non-ASCII characters (I tried to with tm; this worked in R, but not in knitr). About converting the non-ASCII to ASCII characters: is there a function for that? Something a beginner R user like me could understand?
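
One base R function for this kind of conversion is iconv(); a minimal sketch, with a made-up string:

x <- "café â naïve"
iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")  # transliterate to ASCII look-alikes (result is platform-dependent)
iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # or simply drop every non-ASCII character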

Is this all you are using tm for? There are other options for replacing special characters - stringr would provide a good tidyverse solution:

library(stringr)

test <- c("This; sentence # is littered â with ® problems.Ÿ")

test_edited <- str_replace_all(string = test, pattern = "[;#â®Ÿ]", replacement = "")

test_edited
#> [1] "This sentence  is littered  with  problems."

test_edited <- str_squish(test_edited)

test_edited
#> [1] "This sentence is littered with problems."

Created on 2019-03-22 by the reprex package (v0.2.1)


I'm using tm to clean up a large corpus, then to get a document frequency matrix to find the most commonly used words, bigrams, and trigrams. If I used stringr, would it take out all the unwanted characters in a corpus, or just in a sentence? This project is for a data science class, where we learned tm. No one ever mentioned the tidyverse. I will have to read more about it.

Thanks for your reply.


I’m a big fan of tidyverse packages. It looks like you’re using twitter data? My approach would be to clean the individual tweets beforehand. You can use stringr functions with dplyr::mutate() calls. The R for Data Science book has chapters on both data wrangling with dplyr and stringr.
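
A rough sketch of that approach (the tweet text, the column name, and the characters stripped here are placeholders):

library(dplyr)
library(stringr)

tweets <- tibble(text = c("Bad â chars ® here!", "a clean tweet"))
tweets <- tweets %>%
  mutate(text = str_replace_all(text, "[â®]", ""),
         text = str_squish(text))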

Thanks for your suggestion! I'm trying to install the stringr package. I've got R version 3.5.2. I typed in the console

install.packages("stringr")

but it wouldn't install, with these messages:

Installing package into ‘C:/Users/mlrob/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/stringr_1.4.0.zip'
Warning in install.packages :
InternetOpenUrl failed: 'The operation timed out'
Error in download.file(url, destfile, method, mode = "wb", ...) :
cannot open URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/stringr_1.4.0.zip'
Warning in install.packages :
download of package ‘stringr’ failed

How can I install stringr? I also tried install.packages("tidyverse") and a bunch of packages installed, but no stringr.

This sounds like a temporary issue. Did you try again later?
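
If it keeps timing out, a couple of common workarounds (an assumption on my part, not a diagnosis of this particular report) are raising the download timeout and pointing at a different CRAN mirror:

options(timeout = 300)  # allow slower downloads (the default is 60 seconds)
install.packages("stringr", repos = "https://cloud.r-project.org")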

Thank you. I tried library(stringr) and it worked a few minutes later. I tried removing the unwanted characters using stringr, and IT WORKED! Thank you so much!! I will try knitting this in R Markdown.

For some strange reason, the special characters that I deleted appear again in my analysis in the bi-grams and trigrams. Here is my code in R Markdown, which knitted just fine.

Load the required packages.

library(tm)
library(ggplot2)
library(quanteda)
library(stringr)

Load the text files from the Working Directory.

tfilePath <- "C:/Users/mlrob/Documents/en_US/en_US.twitter.txt"

twitter <- readLines(tfilePath,  skipNul = TRUE)

The Twitter file, at 2360148 elements (read from the environment window), was too large for R to analyze. I analyzed a random sample of 1 percent of the file.

twittersample <- twitter[rbinom(length(twitter)*.01,length(twitter),.01)]

Special characters in the Twitter text (& … ™ ð Ÿ ¥) were removed.

twitter_edited <- str_replace_all(string = twittersample, pattern = "[&…™ðŸ¥]", replacement = "")

Convert to a corpus, and use the tm package to clean the texts

tdocs <- Corpus(VectorSource(twitter_edited))

# Convert to lower case
tdocs <- tm_map(tdocs, content_transformer(tolower))
# Remove numbers
tdocs <- tm_map(tdocs, removeNumbers)
# Remove common stopwords
tdocs <- tm_map(tdocs, removeWords, stopwords("english"))
# Remove punctuation
tdocs <- tm_map(tdocs, removePunctuation)
# Eliminate extra white space
tdocs <- tm_map(tdocs, stripWhitespace)

A list of the 25 most commonly used words in Twitter:

dtm <- TermDocumentMatrix(tdocs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 25)
d25 <- d[1:25, ]
print(d25)
# Frequency word plots are as follows
barplot(d25$freq, las = 2, names.arg = d25$word, col = "lightblue",
        main = "Most frequent words in Twitter",
        ylab = "Word frequencies")

# Transform to a quanteda corpus
qtdocs <- corpus(tdocs)
summary(qtdocs, 5)

## Twitter bi-grams
toks <- tokens(qtdocs)
toks_bigram <- tokens_ngrams(toks, n = 2)
# Get document-feature matrix of bigrams
dfm_bigrams <- dfm(toks_bigram)
bi_dat <- textstat_frequency(dfm_bigrams)
print(bi_dat[1:20])
# Plot the 20 most frequent bigrams for Twitter
ggplot(bi_dat[1:20], aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  labs(title = "Twitter bi-grams")

#head(toks_bigram[[1]], 50)
head(toks_bigram, 25)

## Twitter tri-grams
#toks <- tokens(qtdocs)
toks_trigram <- tokens_ngrams(toks, n = 3)
# Get document-feature matrix of trigrams
dfm_trigrams <- dfm(toks_trigram)
tri_dat <- textstat_frequency(dfm_trigrams)
print(tri_dat[1:20])
# Plot the 20 most frequent trigrams for Twitter
ggplot(tri_dat[1:20], aes(x = reorder(feature, frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  labs(title = "Twitter tri-grams")
