Error in FUN(content(x), ...) : invalid multibyte string 1777

I recently started reading about sentiment analysis in R and tried to implement it on sample data consisting of 4 columns: classifier, date, text, and type. During data cleaning, when using the tm_map function to convert all the text to lowercase, I encountered the error "Error in FUN(content(x), ...) : invalid multibyte string 1777", for which I couldn't find a solution. If anyone has run into the same issue and knows a workaround, please let me know.

Thanks in advance.

Hi,

In order for us to help you with your question, please provide a minimal reproducible example (reprex): a minimal (dummy) dataset and the code that recreates the issue. Once we have that, we can go from there. For help on creating a reprex, see this guide:

Good luck!
PJ

Hello PJ,
Thanks for your reply. Please find the code below for your review:

tweetData = read.csv("tweets.csv",stringsAsFactors = FALSE)
train = tweetData[tweetData$type=="train",-c(4)]
test = tweetData[tweetData$type=="test",-c(4)]
head(train,n=5)

classifier                         date
1          1 Mon Apr 06 22:45:40 PDT 2009
2          1 Mon Apr 06 23:01:15 PDT 2009
3          1 Mon Apr 06 23:21:30 PDT 2009
4          1 Tue Apr 07 01:03:56 PDT 2009
5          1 Tue Apr 07 03:16:35 PDT 2009
                                                                                                                                        text
1 Bad news was Dad has cancer and is dying   Good news new business started and  I am now a life coach practising holistic weight management
2                                                                                            im lonely  keep me company! 22 female, new york
3                                                                                      Sad about Kutner being killed off my fav show House! 
4                                                                     is going to priceline (city) tomorrow, but lost her 'must haves' list 
5                                                                Difficulties with GTalk  Closing the Division for the day. Later, everyone

library(tm)
tweets.corpus = Corpus(VectorSource(train$text))
summary(tweets.corpus)
inspect(tweets.corpus[1:5])

#Data Cleaning
tweets.corpus = tm_map(tweets.corpus,tolower)
tweets.corpus = tm_map(tweets.corpus,stripWhitespace)
tweets.corpus = tm_map(tweets.corpus,removePunctuation)
tweets.corpus = tm_map(tweets.corpus,removeNumbers)
my_stopwords = c(stopwords("english"),'available')
tweets.corpus = tm_map(tweets.corpus,removeWords,my_stopwords)

When doing the data cleaning, I get the following error: Error in FUN(content(x), ...) : invalid multibyte string 1777

Please let me know if you know of a solution to this.

Hi,

It seems your problem is not in your code, but in your input. "invalid multibyte string" likely refers to characters that are not valid in the character encoding R assumes for the text.

Find out what encoding the file has (this is often an issue when files were generated on, for example, a Mac and then used on Windows, or vice versa) and then specify that in R like so:

data = read.csv("data.csv", encoding="UTF-8")
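
Before changing how the file is read, it can help to find out which rows actually contain bad bytes. Below is a minimal sketch using only base R. The sample vector and the guess that the file is Latin-1 are assumptions for illustration, not facts about the tweets file. Note that in read.csv the encoding argument only marks strings as being in a given encoding, whereas fileEncoding re-encodes the file while reading.

```r
# Hypothetical sample data: the second string holds a lone Latin-1 byte (0xE9),
# which is not a valid UTF-8 sequence. In the thread this would be train$text.
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x21)))
texts <- c("plain ascii tweet", bad, "another tweet")

# validUTF8() (base R >= 3.3.0) is FALSE for strings that are not valid UTF-8,
# so this gives the positions of the offending rows.
bad_rows <- which(!validUTF8(texts))
print(bad_rows)

# Assumption: the source really is Latin-1. Re-encoding to UTF-8 then repairs it.
fixed <- iconv(texts, from = "latin1", to = "UTF-8")
print(all(validUTF8(fixed)))

# The equivalent at read time (not run here, the file is not available):
# tweetData <- read.csv("tweets.csv", fileEncoding = "latin1",
#                       stringsAsFactors = FALSE)
```

If the rows flagged by validUTF8() look fine after re-encoding from Latin-1, that encoding was probably the right guess; if they turn into gibberish, another source encoding is worth trying.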

Another option is to remove all special characters by using something like iconv(x, "UTF-8", "ASCII", sub = "")
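
Removing the offending bytes can be sketched with base R's iconv(), whose sub argument replaces any bytes that cannot be converted. This is one possible approach under the assumption that losing non-ASCII characters (accents, emoji) is acceptable for the analysis; the sample strings are made up for illustration.

```r
# Hypothetical input: "caf", a stray 0xE9 byte that is invalid in UTF-8, then "!".
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x21)))
texts <- c("Hello World", bad)

# sub = "" drops every byte iconv() cannot convert instead of returning NA,
# leaving plain ASCII text behind.
clean <- iconv(texts, from = "UTF-8", to = "ASCII", sub = "")
print(clean)

# With only ASCII left, tolower() has no multibyte sequences to reject.
lowered <- tolower(clean)
print(lowered)
```

In the pipeline from the question, the same call would be applied to train$text before building the corpus with Corpus(VectorSource(...)).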

Hope this helps,
PJ
