R Studio Error when trying to generate plots

MissAuburn · February 1, 2020, 2:11pm

Error in gsub("</", "\u003c/", payload, fixed = TRUE) :
input string 1 is invalid UTF-8

library(wordcloud2)
w <- data.frame(names(w), w)
colnames(w) <- c('word', 'freq')
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)

I used my .csv file, and I am able to get the bar plot and wordcloud to work, but not wordcloud2.

andresrcs · February 1, 2020, 2:40pm

Hi, welcome!

To help us help you, could you please prepare a reproducible example (reprex) illustrating your issue? Please have a look at this guide, to see how to create one:

FAQ: How to do a minimal reproducible example (reprex) for beginners

A minimal reproducible example consists of the following items:

A minimal dataset, necessary to reproduce the issue

The minimal runnable code necessary to reproduce the issue, which can be run on the given dataset, and including the necessary information on the used packages.

Let's quickly go over each one of these with examples:

Minimal Dataset (Sample Data)

You need to provide a data frame that is small enough to be (reasonably) pasted on a post, but big enough to reproduce your issue. Let's say, as an example, that you are working with the iris data frame

head(iris)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.…

MissAuburn · February 1, 2020, 2:46pm

imm <- read.csv(file.choose(), header = T)
str(imm)

# Build corpus

library(tm)
corpus <- iconv(imm$text, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])

# Clean text

corpus <- tm_map(corpus, tolower)
inspect(corpus[1:5])

corpus <- tm_map(corpus, removePunctuation)
inspect(corpus[1:5])

corpus <- tm_map(corpus, removeNumbers)
inspect(corpus[1:5])

cleanset <- tm_map(corpus, removeWords, stopwords('english'))
inspect(cleanset[1:5])

cleanset <- tm_map(cleanset, stripWhitespace)
inspect(cleanset[1:5])

# Term document matrix

tdm <- TermDocumentMatrix(cleanset)
tdm
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]

library(wordcloud2)
w <- data.frame(names(w), w)
colnames(w) <- c('word', 'freq')
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)

andresrcs · February 1, 2020, 2:51pm

That is not a reprex, since you are not providing sample data. The error seems to be related to non-UTF-8 characters in your data, so it is important to have a sample that reproduces the issue.
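One way to narrow down which rows contain non-UTF-8 text is base R's validUTF8(). The following is only a minimal sketch: the tweets vector is hypothetical stand-in data (its second element carries a raw latin1 byte), not the contents of the actual .csv file.

```r
# Hypothetical sample: the second element holds the single byte 0xE9
# (latin1 "é"), which is not valid when interpreted as UTF-8.
tweets <- c("pizza night", "caf\xe9 time", "plain text")

# validUTF8() returns FALSE for every element that is not valid UTF-8.
bad <- !validUTF8(tweets)

# Positions of the offending elements.
which(bad)
```

Elements flagged this way are natural candidates for the small sample that a reprex needs.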

MissAuburn · February 1, 2020, 2:56pm

That is the code I am using. What do you mean by sample data?

andresrcs · February 1, 2020, 3:37pm

A small subset of the data you are using with that code that allows us to reproduce your issue.
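As an illustration of what such a subset could look like, the sketch below uses dput() to print a few rows as R code that can be pasted straight into a post. The imm data frame here is a hypothetical stand-in with made-up text, not the real tweet data.

```r
# Hypothetical stand-in for the data frame read from the .csv file.
imm <- data.frame(
  text = c("first tweet", "second tweet", "third tweet"),
  stringsAsFactors = FALSE
)

# Keep only a few rows, enough to reproduce the problem.
sample_rows <- head(imm, 2)

# dput() prints R code that recreates the object when pasted elsewhere.
dput(sample_rows)
```

Anyone reading the post can then assign the printed structure(...) expression to a variable and run the rest of the code against it.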

MissAuburn · February 1, 2020, 3:42pm

Ahhhh I see. Basically, I am generating tweets and saving them as a .csv file.

What could be causing it, though? Maybe I should try a new .csv file.

MissAuburn · February 1, 2020, 3:43pm

I have already tried 3 .csv files, but no luck.

andresrcs · February 1, 2020, 3:45pm

We would need one of those csv files to test your code and take a look into the issue.

MissAuburn · February 1, 2020, 3:59pm

Okay, can I send a link?

MissAuburn · February 1, 2020, 4:00pm

Here is the link: https://www.simfileshare.net/download/1605492/

I am a gamer, so simfileshare is where I uploaded it.

MissAuburn · February 1, 2020, 4:50pm

Any luck with testing the file?

andresrcs · February 1, 2020, 4:53pm

You do realize this is a community forum where we volunteer our time, right? Please do not be pushy about getting an answer quickly; that is considered rude here. I'll take a look at this when I have some free time in front of my computer, or maybe someone else will do it before that.

MissAuburn · February 1, 2020, 4:54pm

I apologize. Thank you.

andresrcs · February 1, 2020, 6:28pm

You are not showing where w comes from, and I have no way of knowing. Can you please turn this into a proper reproducible example, as explained in the guide I gave you earlier?

library(tm)
library(wordcloud2)

imm <- read.csv("https://cdn.simfileshare.net/download/1605492/?dl", header = T)

# Build corpus
corpus <- iconv(imm$text, to = "utf-8")
corpus <- Corpus(VectorSource(corpus))
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                                               
#> [2] <NA>                                                                                                               
#> [3] Great evening with Richmond Hill Youth cooking! Friday night pizza <U+0001F60A><U+0001F355> https://t.co/P1wxnEe37y
#> [4] <NA>                                                                                                               
#> [5] <NA>

# Clean text
corpus <- tm_map(corpus, tolower)
#> Warning in tm_map.SimpleCorpus(corpus, tolower): transformation drops documents
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                                               
#> [2] <NA>                                                                                                               
#> [3] great evening with richmond hill youth cooking! friday night pizza <u+0001f60a><u+0001f355> https://t.co/p1wxnee37y
#> [4] <NA>                                                                                                               
#> [5] <NA>

corpus <- tm_map(corpus, removePunctuation)
#> Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
#> documents
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                                   
#> [2] <NA>                                                                                                   
#> [3] great evening with richmond hill youth cooking friday night pizza u0001f60au0001f355 httpstcop1wxnee37y
#> [4] <NA>                                                                                                   
#> [5] <NA>

corpus <- tm_map(corpus, removeNumbers)
#> Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
#> documents
inspect(corpus[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                                   
#> [2] <NA>                                                                                   
#> [3] great evening with richmond hill youth cooking friday night pizza ufauf httpstcopwxneey
#> [4] <NA>                                                                                   
#> [5] <NA>

cleanset <- tm_map(corpus, removeWords, stopwords('english'))
#> Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
#> transformation drops documents
inspect(cleanset[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                               
#> [2] <NA>                                                                               
#> [3] great evening  richmond hill youth cooking friday night pizza ufauf httpstcopwxneey
#> [4] <NA>                                                                               
#> [5] <NA>

cleanset <- tm_map(cleanset, stripWhitespace)
#> Warning in tm_map.SimpleCorpus(cleanset, stripWhitespace): transformation drops
#> documents
inspect(cleanset[1:5])
#> <<SimpleCorpus>>
#> Metadata:  corpus specific: 1, document level (indexed): 0
#> Content:  documents: 5
#> 
#> [1] <NA>                                                                              
#> [2] <NA>                                                                              
#> [3] great evening richmond hill youth cooking friday night pizza ufauf httpstcopwxneey
#> [4] <NA>                                                                              
#> [5] <NA>

# Term document matrix
tdm <- TermDocumentMatrix(cleanset)
tdm
#> <<TermDocumentMatrix (terms: 1186, documents: 1000)>>
#> Non-/sparse entries: 2956/1183044
#> Sparsity           : 100%
#> Maximal term length: 24
#> Weighting          : term frequency (tf)
tdm <- as.matrix(tdm)
tdm[1:10, 1:20]
#>                  Docs
#> Terms             1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#>   cooking         0 0 1 0 0 1 0 1 1  1  1  0  0  0  0  1  0  1  1  1
#>   evening         0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   friday          0 0 1 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0
#>   great           0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   hill            0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   httpstcopwxneey 0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   night           0 0 1 0 0 0 0 0 0  1  0  0  0  0  0  0  0  0  0  0
#>   pizza           0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   richmond        0 0 1 0 0 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   ufauf           0 0 1 0 0 0 0 1 0  0  0  0  0  0  0  0  0  0  0  0


w <- data.frame(names(w), w)
#> Error in data.frame(names(w), w): object 'w' not found
colnames(w) <- c('word', 'freq')
#> Error in colnames(w) <- c("word", "freq"): object 'w' not found
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)
#> Error in "table" %in% class(data): object 'w' not found

Created on 2020-02-01 by the reprex package (v0.3.0)

If I assume w is derived from tdm, I can't reproduce your issue; I don't get any error message.

w <- data.frame(rownames(tdm), rowSums(tdm))
colnames(w) <- c('word', 'freq')
wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)
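To make the assumption about w concrete, here is a small sketch of how row sums of a term-document matrix become a word/frequency table. The two-term matrix is made up for illustration and only uses base R; the sorting step is an optional extra that is not part of the code above.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents.
tdm <- matrix(c(1, 1, 1,
                0, 1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(Terms = c("cooking", "pizza"),
                              Docs = c("1", "2", "3")))

# Total count of each term across all documents.
w <- data.frame(word = rownames(tdm),
                freq = rowSums(tdm),
                stringsAsFactors = FALSE)

# Optional: put the most frequent words first.
w <- w[order(-w$freq), ]
w
```

The resulting two columns, word and freq, match the column names set by the colnames() call in the thread.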

MissAuburn · February 1, 2020, 10:15pm

I changed my code to that last part and I still get an error. It's so odd.

andresrcs · February 1, 2020, 10:43pm

The iconv() command is system dependent. I have tested this code on a Linux system, and that is the only difference I can think of.
Which operating system are you on?
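One way to compare systems is to look at the encoding mark that iconv() leaves on its result. This is only a sketch under assumptions: x is a hypothetical latin1 string, and what the lowercase spelling returns may differ from one platform to another, so no output is claimed for it here.

```r
# Hypothetical input: "café" with "é" stored as the single latin1 byte 0xE9.
x <- "caf\xe9"
Encoding(x) <- "latin1"

# Convert using both spellings of the target encoding name.
upper <- iconv(x, from = "latin1", to = "UTF-8")
# Guard the lowercase call in case a platform rejects that spelling.
lower <- tryCatch(iconv(x, from = "latin1", to = "utf-8"),
                  error = function(e) NA_character_)

# Encoding() reports the mark R attached to each result.
Encoding(upper)
Encoding(lower)

# l10n_info() describes the current locale, which typically differs
# between Linux and Windows.
l10n_info()
```

Running this on both machines shows whether the two spellings are treated the same way there.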

MissAuburn · February 1, 2020, 10:45pm

Windows 10 is the OS I am using.

andresrcs · February 2, 2020, 12:10am

OK, I have tested it on Windows 10, and it works if you use "UTF-8" (in upper case).

library(tm)
library(wordcloud2)

imm <- read.csv("https://cdn.simfileshare.net/download/1605492/?dl", header = T)

corpus <- iconv(imm$text, to = "UTF-8")
corpus <- Corpus(VectorSource(corpus))

# Clean text
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
cleanset <- tm_map(corpus, removeWords, stopwords('english'))
cleanset <- tm_map(cleanset, stripWhitespace)

# Term document matrix
tdm <- TermDocumentMatrix(cleanset)
tdm <- as.matrix(tdm)

w <- data.frame(rownames(tdm), rowSums(tdm))
colnames(w) <- c('word', 'freq')

wordcloud2(w,
           size = 0.7,
           shape = 'triangle',
           rotateRatio = 0.5,
           minSize = 1)

As a side note, notice that you still have a lot of text cleaning to do; there are several HTML tags left in the text.
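A possible starting point for that extra cleaning is sketched below, using base R regular expressions on the sample tweet from the output earlier in the thread. The patterns are assumptions about what counts as leftover markup and would need adjusting for the real data; they would be applied to the text before building the corpus.

```r
# Sample tweet taken from the inspect() output earlier in the thread.
x <- "Friday night pizza <U+0001F60A><U+0001F355> https://t.co/P1wxnEe37y"

# Drop anything that looks like a tag, i.e. text between < and >.
x <- gsub("<[^>]+>", "", x)

# Drop URLs (http or https followed by non-space characters).
x <- gsub("https?://[^[:space:]]+", "", x)

# Collapse repeated whitespace and trim both ends.
x <- trimws(gsub("[[:space:]]+", " ", x))
x
```

Doing this before removePunctuation avoids leftover terms such as ufauf and httpstcopwxneey in the term-document matrix.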

MissAuburn · February 2, 2020, 2:54pm

OMG That worked. I can't believe this, all this headache over some capital letters. I admire you, thank you!

This is why I dislike coding!