Problem with non-ASCII characters in DocumentTermMatrix


#1

I am working on a sentiment analysis project for my PhD research and keep getting the following error:

Error in .tolower(txt) : invalid input 'ââ€' in 'utf8towcs'

This happens after I have cleaned the text in my corpus, when I try to create a DocumentTermMatrix. From some initial research, it appears to be caused by non-ASCII characters in the Twitter text, such as emojis. Can someone please tell me how to solve this problem? Thanks.

Here is the R code that I was using:

setwd('C:/rscripts/tweet_sentiment')

dataset = read.csv('hillary_tweets.csv')

library(readr)
library(tm)
library(ggplot2)
library(wordcloud)
library(plyr)
library(lubridate)

require(SnowballC)

text <- as.character(dataset$text)
sample <- sample(text, (length(text)))
corpus <- Corpus(VectorSource(list(sample)))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
dtm_up <- DocumentTermMatrix(VCorpus(VectorSource(corpus[[1]]$content)))

#2

Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

In this case, I'd include a snippet of your dataset object that contains non-ASCII characters, so that others can replicate your error. That way you can skip setwd('C:/rscripts/tweet_sentiment') and dataset = read.csv('hillary_tweets.csv').
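As a rough sketch of what such a snippet could look like (the tweets below are made up for illustration and are not from your data), you can build a tiny data frame inline and use dput() to get pasteable code:

```r
# Hypothetical stand-in for the real tweets: three short strings, two of
# which contain non-ASCII characters written as \u escapes.
dataset <- data.frame(
  text = c("Great rally today \u2764",
           "She said \u201chello\u201d to the crowd",
           "Plain ASCII tweet"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object, so it can be pasted
# into a post instead of attaching the CSV file.
dput(dataset)

# iconv() returns NA for strings it cannot convert when sub is left at
# its default, which gives a simple flag for rows with non-ASCII text.
has_non_ascii <- is.na(iconv(dataset$text, from = "UTF-8", to = "ASCII"))
```

Posting only the rows where has_non_ascii is TRUE keeps the example small while still triggering the problem.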


I'm having a hard time replicating your error, but as a quick suggestion, you might check out the R package rtweet. It has a plain_tweets function that takes your tweets and returns a value "reformatted with ascii encoding and normal ampersands and without URL links, line breaks, fancy spaces/tabs, fancy apostrophes."

There are also tools for handling non-ASCII characters in R rather than removing them; Stack Overflow has some good discussions on this. A reprex would help along these lines too.
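As a minimal sketch of the kind of base R tools I mean, assuming the text is supposed to be UTF-8 (the example strings are invented, and this is not necessarily what will fix your particular file):

```r
# Made-up examples: a truncated multi-byte sequence (invalid UTF-8)
# and a valid string that contains an accented character.
bad  <- "hillary \xe2\x80 rally"
good <- "caf\u00e9 debate tonight"

# validUTF8() reports which strings would be unsafe to pass to tolower().
is_valid <- validUTF8(c(bad, good))

# Option 1: keep UTF-8, but drop only the bytes that are not valid.
fixed <- iconv(bad, from = "UTF-8", to = "UTF-8", sub = "")
lowered <- tolower(fixed)

# Option 2: reduce everything to ASCII, dropping characters that have
# no ASCII equivalent (this loses accents and emojis).
ascii_only <- iconv(good, from = "UTF-8", to = "ASCII", sub = "")
```

Option 1 keeps legitimate accented characters and emojis intact, while option 2 is the blunt approach and is closer to what plain_tweets describes.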


#3

Thank you for your reply. I was able to build the Document Term Matrix successfully. What I need to do now is feed the DTM into an XGBoost machine learning classification model, and I am having some trouble getting the DTM to work with the classifier. I will post another issue detailing this.

Jonathan Adkins


#4

Awesome!
Would it be easy and useful to others to share your solution?


#5

I will go ahead and post what I was able to come up with to solve my issue. This is only half of what I need to accomplish. I have recently posted another thread to ask for help on my other issue, which is adding my document term matrix to an XGBoost classifier. Here is the code that I used for the importing and cleaning of my Twitter dataset:

setwd('C:/rscripts/random_forest')

dataset = read.csv('tweets_all.csv', stringsAsFactors = FALSE)

library(tm)

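# Convert the tweet text to UTF-8 before building the corpus
corpus <- iconv(dataset$text, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])
# Wrap base functions in content_transformer() so the corpus structure is preserved
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
# Punctuation is already gone here, so URLs look like "httptcoabc" and match [[:alnum:]]*
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
cleanset <- tm_map(cleanset, content_transformer(removeURL))
cleanset <- tm_map(cleanset, stripWhitespace)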
cleanset <- tm_map(cleanset, removeWords, c('Â\u009dhillary','„Â','‚Â','just','are','all','they'))

tdm <- TermDocumentMatrix(cleanset)
tdm
tdm <- as.matrix(tdm)

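One note on ordering: the cleaning code above strips URLs after punctuation has been removed, which works because the URL has already collapsed into a single alphanumeric token. An alternative, sketched here in base R with a made-up tweet and a hypothetical helper named remove_url, is to remove the URL first while it is still intact:

```r
# Hypothetical helper: drop everything from "http" up to the next whitespace.
remove_url <- function(x) gsub("http[^[:space:]]*", "", x)

# Made-up example tweet.
tweet <- "Watch the debate http://t.co/abc123 tonight!"

# Remove the URL first, then punctuation, then squeeze repeated spaces.
no_url  <- remove_url(tweet)
cleaned <- gsub("[[:punct:]]", "", no_url)
cleaned <- gsub("[[:space:]]+", " ", cleaned)
```

With tm, the same helper could be applied through content_transformer() before the removePunctuation step.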