RStudio error when trying to generate plots

Error in gsub("</", "\u003c/", payload, fixed = TRUE) :
input string 1 is invalid UTF-8

library(wordcloud2)
w <- data.frame(names(w), w)
colnames(w) <- c('word', 'freq')
wordcloud2(w,
size = 0.7,
shape = 'triangle',
rotateRatio = 0.5,
minSize = 1)

I used my .csv file and I am able to get the bar plot and wordcloud to work, but not wordcloud2.

Hi, welcome!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide to see how to create one:

imm <- read.csv(file.choose(), header = TRUE)
str(imm)

# Build corpus

library(tm)
corpus <- iconv(imm$text, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])

# Clean text

corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])

corpus <- tm_map(corpus, removePunctuation)
inspect(corpus[1:5])

corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1:5])

cleanset <- tm_map(corpus, removeWords, stopwords('english'))
inspect(cleanset[1:5])

cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])

# Term document matrix

tdm <- TermDocumentMatrix(cleanset)
tdm
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]

library(wordcloud2)
w <- data.frame(names(w), w)
colnames(w) <- c('word', 'freq')
wordcloud2(w,
size = 0.7,
shape = 'triangle',
rotateRatio = 0.5,
minSize = 1)

That is not a reprex, since you are not providing sample data. The error seems to be related to non-UTF-8 characters in your data, so it is important to have a sample that reproduces the issue.

That is the code I am using. What do you mean by sample data?

A small subset of the data you are using with that code that allows us to reproduce your issue.
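One way to put together such a subset is sketched below. It assumes the data frame is called `imm` and has a `text` column, as in the code posted in this thread; the three rows here are made-up stand-ins for real tweets. `validUTF8()` is used to pick out the rows most likely to trigger the error, and `dput()` prints R code that recreates the sample so it can be pasted straight into a post.

```r
# Hypothetical stand-in for the real tweet data; `imm` and its `text`
# column mirror the names used in this thread.
imm <- data.frame(
  text = c("Friday night pizza", "caf\xe9 latte", "plain ascii tweet"),
  stringsAsFactors = FALSE
)

# validUTF8() (base R >= 3.3.0) flags strings whose bytes are not valid UTF-8.
bad_rows <- which(!validUTF8(imm$text))

# Put the suspect rows first, add ordinary rows after them, and keep
# at most five rows in total.
sample_rows <- head(union(bad_rows, seq_len(nrow(imm))), 5)
imm_sample <- imm[sample_rows, , drop = FALSE]

# Print R code that recreates the small data frame.
dput(imm_sample)
```

With the real data, replacing the made-up `imm` with the result of `read.csv()` should be enough; the rest of the sketch does not depend on the number of rows.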

Ahhhh I see. Basically I am generating tweets and saving them as a .csv file.

What could be causing it, though? Maybe I should try a new .csv file.

I have already tried 3 .csv files, but no luck.

We would need one of those csv files to test your code and take a look into the issue.

Okay, can I send a link?

Here is the link: https://www.simfileshare.net/download/1605492/

I am a gamer so the simfileshare is where I uploaded it.

Any luck with testing the file?

You do realize this is a community forum where we volunteer our time, right? Please do not be pushy about getting an answer quickly; that is considered rude here. I'll take a look into this when I have some free time in front of my computer, or maybe someone else will do it before that.

I apologize. Thank you.

You are not showing where w comes from, and I have no way of knowing. Can you please turn this into a proper reproducible example, as explained in the guide I gave you earlier?

library(tm)
library(wordcloud2)

imm <- read.csv("https://cdn.simfileshare.net/download/1605492/?dl", header = TRUE)

# Build corpus
corpus <- iconv(imm$text, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                                               
#> [2] <NA>                                                                                                               
#> [3] Great evening with Richmond Hill Youth cooking! Friday night pizza <U+0001F60A><U+0001F355> https://t.co/P1wxnEe37y
#> [4] <NA>                                                                                                               
#> [5] <NA>

# Clean text
corpus <- tm_map(corpus, tolower)
#> Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops documents
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                                               
#> [2] <NA>                                                                                                               
#> [3] great evening with richmond hill youth cooking! friday night pizza <u+0001f60a><u+0001f355> https://t.co/p1wxnee37y
#> [4] <NA>                                                                                                               
#> [5] <NA>

corpus <- tm_map(corpus, removePunctuation)
#> Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
#> documents
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                                   
#> [2] <NA>                                                                                                   
#> [3] great evening with richmond hill youth cooking friday night pizza u0001f60au0001f355 httpstcop1wxnee37y
#> [4] <NA>                                                                                                   
#> [5] <NA>

corpus <- tm_map(corpus, removeNumbers)
#> Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
#> documents
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                   
#> [2] <NA>                                                                                   
#> [3] great evening with richmond hill youth cooking friday night pizza ufauf httpstcopwxneey
#> [4] <NA>                                                                                   
#> [5] <NA>

cleanset <- tm_map(corpus, removeWords, stopwords('english'))
#> Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
#> transformation drops documents
inspect(cleanset[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                               
#> [2] <NA>                                                                               
#> [3] great evening  richmond hill youth cooking friday night pizza ufauf httpstcopwxneey
#> [4] <NA>                                                                               
#> [5] <NA>

cleanset <- tm_map(cleanset, stripWhitespace)
#> Warning in tm_map.SimpleCorpus(cleanset, stripWhitespace): transformation drops
#> documents
inspect(cleanset[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                              
#> [2] <NA>                                                                              
#> [3] great evening richmond hill youth cooking friday night pizza ufauf httpstcopwxneey
#> [4] <NA>                                                                              
#> [5] <NA>

# Term document matrix
tdm <- TermDocumentMatrix(cleanset)
tdm
#> <<TermDocumentMatrix (terms: 1186, documents: 1000)>>
#> Non-/sparse entries: 2956/1183044
#> Sparsity           : 100%
#> Maximal term length: 24
#> Weighting          : term frequency (tf)
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]
#>                  Docs
#> Terms             1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#>   cooking         0 0 1 0 0 1 0 1 1  1  1  0  0  0  0  1  0  1  1  1
#>   evening         0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   friday          0 0 1 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0
#>   great           0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   hill            0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   httpstcopwxneey 0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   night           0 0 1 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0
#>   pizza           0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   richmond        0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   ufauf           0 0 1 0 0 0 0 1 0  0  0  0  0  0  0  0  0  0  0  0


w <- data.frame(names(w), w)
#> Error in data.frame(names(w), w): object 'w' not found
colnames(w) <- c('word', 'freq')
#> Error in colnames(w) <- c("word", "freq"): object 'w' not found
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)
#> Error in "table" %in% class(data): object 'w' not found

Created on 2020-02-01 by the reprex package (v0.3.0)

If I assume w is derived from tdm, I can't reproduce your issue; I don't get any error message:

w <- data.frame(rownames(tdm), rowSums(tdm))
colnames(w) <- c('word', 'freq')
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)

I changed my code to that last part and I still get an error. It's so odd.

This command is system dependent. I have tested this code on a Linux system; that is the only difference I can think of.
Which operating system are you on?

Windows 10 is the OS I am using.

OK, I have tested it on Windows 10. It works if you use "UTF-8" (in upper case):

library(tm)
library(wordcloud2)

imm <- read.csv("https://cdn.simfileshare.net/download/1605492/?dl", header = TRUE)

corpus <- iconv(imm$text, to = "UTF-8")
corpus <- Corpus(VectorSource(corpus))

# Clean text
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
cleanset <- tm_map(cleanset, stripWhitespace)

# Term document matrix
tdm <- TermDocumentMatrix(cleanset)
tdm <- as.matrix(tdm)

w <- data.frame(rownames(tdm), rowSums(tdm))
colnames(w) <- c('word', 'freq')

wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)
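The thread does not establish why the spelling of the encoding name matters on Windows. A quick way to look into it on your own machine, offered only as a diagnostic sketch, is to convert the same string with both spellings and compare how R marks the results. The string `x` below is a made-up Latin-1 example, not taken from the tweet data, and the output may differ between platforms and R versions.

```r
# A Latin-1 encoded string used purely for illustration.
x <- "caf\xe9"
Encoding(x) <- "latin1"

# Convert with both spellings of the target encoding name.
upper <- iconv(x, from = "latin1", to = "UTF-8")
lower <- iconv(x, from = "latin1", to = "utf-8")

# Compare how R marks each result and whether the bytes are valid UTF-8.
print(Encoding(c(upper, lower)))
print(validUTF8(c(upper, lower)))
```

If the two results are marked differently, that would be consistent with the behaviour reported above, but it is an assumption to check rather than a confirmed cause.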

As a side note, notice that you still have a lot of text cleaning to do; several HTML tags are left in the text.
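A minimal sketch of that extra cleaning, using only base R, might look like the following. The helper name `clean_tweets` is hypothetical, and the patterns are assumptions based on the sample tweet shown in the reprex output (a `https://t.co/...` link and `<U+0001F60A>`-style emoji escapes); real data may need more rules. It would be applied to `imm$text` before building the corpus.

```r
# Hypothetical helper: strips URLs, angle-bracket tokens and non-ASCII bytes.
clean_tweets <- function(x) {
  # Drop URLs such as https://t.co/...
  x <- gsub("https?://\\S+", "", x, perl = TRUE)
  # Drop angle-bracket tokens: HTML-like tags and <U+0001F60A>-style escapes.
  x <- gsub("<[^>]+>", "", x, perl = TRUE)
  # Drop any remaining non-ASCII characters (assumes the input is UTF-8).
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  # Collapse repeated whitespace and trim the ends.
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

tweets <- "Friday night pizza <U+0001F60A><U+0001F355> https://t.co/P1wxnEe37y"
print(clean_tweets(tweets))
```

Removing the URLs before `removePunctuation` runs avoids terms like `httpstcopwxneey` ending up in the term document matrix.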

OMG, that worked. I can't believe this, all this headache over some capital letters. I admire you, thank you!

This is why I dislike coding!