Hi folks,
I'm trying to extract German words from Facebook comments. Unfortunately, the umlauts (ä, ö, ü) get lost: with iconv(enc2utf8(x), sub="byte") they disappear entirely and leave gaps, and with iconv(x, "UTF-8", "ASCII//TRANSLIT") they turn into a, o, u. Either way the words come out wrong, which causes further problems when removing stop words.
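To make the difference between the two iconv calls concrete, here is a minimal sketch in base R. The sample string is made up for illustration and is not taken from the real comments. The result of the //TRANSLIT conversion depends on the platform's iconv implementation, so no particular output is claimed for it.

```r
# Made-up sample string with umlauts; \u escapes keep the example
# independent of the encoding of the script file itself.
x <- "M\u00e4dchen h\u00f6ren \u00fcber"

# Transliteration to ASCII: the result is platform dependent
# (for example "a", "ae" or "?" in place of the umlaut).
translit <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")

# Byte substitution: every byte that cannot be represented in ASCII
# is replaced by its hex code in angle brackets.
bytes <- iconv(x, from = "UTF-8", to = "ASCII", sub = "byte")

# Staying in UTF-8 leaves the umlauts untouched.
kept <- enc2utf8(x)

print(translit)
print(bytes)
print(kept)
```

Only the last variant keeps the original characters. The first two both convert to ASCII, which has no umlauts at all.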
It might depend on the encoding option, which is set to UTF-8. My locale is set to:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8.
I've read in various posts that people use German_Germany.1252 as their locale, but I cannot set it as the default because R rejects the call.
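For anyone who wants to check the same settings, this is a small sketch of how the locale and the encoding mark of a string can be inspected in base R. The test string is an assumption for illustration. As far as I can tell, German_Germany.1252 is a Windows-style locale name, which might explain why it is rejected on a system that reports de_DE.UTF-8.

```r
# Character-type locale of the current R session
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# TRUE if the session runs in a UTF-8 locale
print(l10n_info()[["UTF-8"]])

# Illustrative string with umlauts, written with \u escapes
x <- "\u00e4\u00f6\u00fc"

# Encoding mark that R stores with the string
enc <- Encoding(x)
print(enc)

# TRUE if the bytes form valid UTF-8
valid <- validUTF8(x)
print(valid)
```

If the strings are already marked and valid as UTF-8 at this point, the locale itself is probably not what removes the umlauts.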
I'm using the packages RCurl, tm, rjson and stringi. Here's my code so far:
library("RCurl")
library("tm")
library("rjson")
library("stringi")
url <- "https://graph.facebook.com/v3.2/795931377273410/comments?limit=999&access_token=EAAHmLBC6OnsBAEbHfGfHi3iEBFmKZCEQUY0Pf6d3y5A7VbxsZBl4nk61UuZCLXwB14tS9uwmIwQZBEh6cG7KDHoePxJ9SHPDZCBLzrXKPjSbZB1t5TZCWqTZARcXCBkjZAZBePMooCa459M1uN8BrK26ttottyRd8QZBG5cE9ZCk1bEDhke8OrfnBLudg6ZCHzAmv4WoZD"
d <- getURL(url)
j <- fromJSON(d)
comments <- sapply(j$data, function(j) {list(comment = j$message)})
Cleanedcomments <- sapply(comments, function(x) iconv(enc2utf8(x), sub="byte"))
my_corpus <- Corpus(VectorSource(Cleanedcomments))
# Strip every non-ASCII character, then remove the given pattern
my_function <- content_transformer(function(x, pattern) gsub(pattern, "", gsub("[^\x01-\x7F]", "", x)))
my_corpus <- tm_map(my_corpus, my_function, "/")
my_corpus <- tm_map(my_corpus, my_function, "@")
my_corpus <- tm_map(my_corpus, my_function, "\\|")
my_corpus <- tm_map(my_corpus, content_transformer(stri_trans_tolower))
my_corpus <- tm_map(my_corpus, removeNumbers)
my_corpus <- tm_map(my_corpus, removeWords, c(stopwords("german")))
my_corpus <- tm_map(my_corpus, removePunctuation)
my_corpus <- tm_map(my_corpus, stripWhitespace)
my_tdm <- TermDocumentMatrix(my_corpus)
m <- as.matrix(my_tdm)
View(m)
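One thing worth checking in the code above is the pattern "[^\x01-\x7F]". It matches every character outside the ASCII range, and that includes ä, ö, ü and ß. The following sketch uses only base R and a made-up sample string (an assumption, not real comment data). It compares that pattern with a narrower one that replaces only the separator characters.

```r
# Made-up sample comment containing umlauts and a separator
x <- "Sch\u00f6ne Gr\u00fc\u00dfe/K\u00f6ln"

# Removing everything outside the ASCII range also removes the umlauts
stripped <- gsub("[^\x01-\x7F]", "", x)

# Replacing only the separators /, @ and | leaves the umlauts in place
kept <- gsub("[/@|]", " ", x)

print(stripped)
print(kept)
```

If the gaps come from this step, restricting the pattern to the characters that should actually go might be enough. This is a guess based on the snippet, not a confirmed diagnosis.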
If anyone has got an idea how to deal with the issue, I'd be glad to hear it.
Kind Regards
Tobias