Topic Modelling Preprocessing | Get rid of all special characters & symbols

Hi all,

Today I've got a question about preprocessing my dataset so that I can do some topic modelling afterwards. Let's start with my current code:

library(plyr)
library(readr)
library(tidyverse)
library(tm)
library(wordcloud)
library(topicmodels)


# Create csv-Database
workingDir <- "/Users/flocke/Dropbox/Meins/FOM/Thesis/AmazonReviews/51558011"
myfiles = list.files(path=workingDir, 
                     pattern="\\.txt$",   # list.files() expects a regex, not a glob
                     full.names=TRUE)

database = ldply(myfiles, 
                 read_csv, 
                 col_names = FALSE)

colnames(database) <- c("doc_id", 
                        "Date(YYYYMMDD)",
                        "ASIN",
                        "Stars",
                        "Helpful",
                        "OverallRating",
                        "Date(Text)",
                        "UserID",
                        "ReviewTitle",
                        "text")   # Review Text

# Create Corpus
amz_corpus <- Corpus(DataframeSource(database))

#Remove BS
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
amz_corpus <- tm_map(amz_corpus, toSpace, "\xc")
amz_corpus <- tm_map(amz_corpus, toSpace, "\\\\")
inspect(amz_corpus[62069])

amz_corpus <- tm_map(amz_corpus, toSpace, "[[:punct:]]")
amz_corpus <- tm_map(amz_corpus, toSpace, "@")
amz_corpus <- tm_map(amz_corpus, toSpace, "/")
amz_corpus <- tm_map(amz_corpus, toSpace, "&")
amz_corpus <- tm_map(amz_corpus, toSpace, "#")

#Cleaning up the text
amz_corpus <- tm_map(amz_corpus, stripWhitespace)
amz_corpus <- tm_map(amz_corpus, content_transformer(tolower))
amz_corpus <- tm_map(amz_corpus, removePunctuation)
amz_corpus <- tm_map(amz_corpus, removeNumbers)

#Remove Stopwords   <- Point of failure
amz_corpus <- tm_map(amz_corpus, removeWords, stopwords('english'))

#Text stemming
amz_corpus <- tm_map(amz_corpus, stemDocument)

#Build term-document-matrix
dtm <- TermDocumentMatrix(amz_corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 100)

I get a bunch of errors like this:

Error in gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), :
input string 61984 is invalid UTF-8

Inspecting the element mentioned above gives me the following output:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 8
Content:  documents: 1

                                                                                                                                                                                                                                                                                                                                                                                                  987 
\xc3  am not a h\xc3 p hop fun .\xc3  usally l\xc3 sten to jazz or jazz rock or reggea. but th\xc3 s album made me love the h\xc3 p hop.the lyr\xc3 cs w\xc3 ch are very overwhelm\xc3 ng and \xc3 mpress\xc3 ve.\xc3 t makes me happy \xc3 t puts me \xc3 n a good mood.\xc3   l\xc3 sten to \xc3 t every day and \xc3  am sure \xc3  w\xc3 ll never get bored of \xc3 t.thanks  lauren thanks a lot 

I've done some investigating and found that this is probably related to all kinds of weird special characters, which I should get rid of first.

So my question is: how can I get rid of them? :confused:

Many thanks in advance!

The first step is to recognize these as Unicode characters that are not mapped to UTF-8. It's worth reading up on the gory details, because this won't be the last time you encounter Unicode-related issues, and this is a good example.

The help(Encoding) page shows how this works in practice with a small example:

x <- "fa\xE7ile"         # \xE7 is the latin1 byte for "ç"
Encoding(x)              # "unknown": R can't tell which encoding the bytes use
Encoding(x) <- "latin1"  # declare the encoding explicitly
x                        # now prints correctly

which yields façile. Emojis, Chinese, Japanese, Russian, and a whole host of other characters present similar issues.
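To see how many of your reviews are affected, and to apply the same idea to your whole data frame, something along these lines should work. This is only a sketch: it assumes the reviews are still in database$text and that the source files really are latin1, which you would need to verify.

# Which reviews contain byte sequences that are not valid UTF-8?
bad <- which(!validUTF8(database$text))
length(bad)               # how many reviews are affected
head(database$text[bad])  # eyeball a few of them

# If the files are in fact latin1, converting to UTF-8 up front
# (before building the corpus) avoids the invalid-UTF-8 error later on.
database$text <- iconv(database$text, from = "latin1", to = "UTF-8")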

What I'd try first is textclean, which has a replace_non_ascii() function that should get rid of most of these. Whether that creates a problem for your text analysis, I can't say, but you can take the tokenized output (which should be a vector), check the positions of any remaining non-ASCII characters, and write a short gsub() call to replace them with "". You may also want to use purrr's map() function if your tokenized materials are in separate files.
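As a rough sketch of that approach (cleaning the text column before the corpus is built, rather than inside tm_map()), with the caveat that I haven't run this against your files:

library(textclean)

# Replace or drop non-ASCII characters in the raw review text first,
# then rebuild the corpus from the cleaned data frame.
database$text <- replace_non_ascii(database$text, replacement = " ")

# Catch anything that slips through: swap any remaining non-ASCII bytes
# for a space (useBytes = TRUE sidesteps the "invalid UTF-8" error).
database$text <- gsub("[^\x01-\x7F]", " ", database$text, useBytes = TRUE)

amz_corpus <- Corpus(DataframeSource(database))

If replace_non_ascii() itself chokes on the invalid byte sequences, run the gsub() line (or the iconv() call above) first, so that the strings are valid UTF-8 before textclean sees them.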
