I'm currently struggling to prepare a set of TXT (comma-separated) files for topic modelling (LDA) on the corresponding data set.
Basically, I have some TXT files with about 10 columns each, where, let's say, column 4 contains the written text.
My idea was to create a data frame from all my TXT files (well over 100), then create a subset with just the text column and convert it into a Corpus, followed by some preprocessing steps etc. At the moment I get an error message at this point.
The code looks like this:
    library(tm)
    library(wordcloud)

    workingDir <- "/my/dir/"
    fileList <- list.files(path=workingDir, pattern=".txt")
    fileList <- paste(workingDir, "//", fileList, sep="")

    # create the corpus
    dataList <- lapply(fileList, FUN=readLines)
    dataList <- lapply(dataList, FUN=paste, collapse=" ")

    #Create Corpus
    amz_corpus <- Corpus(DataframeSource(dataList))

    #Cleaning up the text
The error message is this one:
Error in DataframeSource(dataList) : all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
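Going by the condition in that message, DataframeSource seems to expect a data frame with columns named doc_id and text, whereas dataList is a plain list of strings. Below is a minimal sketch of the plan described above (read every file, keep only the text column, hand a data frame to tm). It rests on assumptions that are not confirmed here: the files have no header row, the text is in column 4, and `build_text_frame` plus the demo directory and its sample rows are hypothetical names and data made up for illustration.

```r
# Hypothetical helper: read every comma-separated .txt file in `dir` and
# return a data frame with the `doc_id` and `text` columns named in the error.
# Assumptions: no header row, and the text sits in column `text_col`.
build_text_frame <- function(dir, text_col = 4) {
  files <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
  frames <- lapply(files, function(f) {
    rows <- read.csv(f, header = FALSE, stringsAsFactors = FALSE)
    data.frame(
      # One document per row; the id combines file name and row number.
      doc_id = paste0(basename(f), "_", seq_len(nrow(rows))),
      text = as.character(rows[[text_col]]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, frames)
}

# Made-up sample files in a temporary directory, just to exercise the helper.
demo_dir <- file.path(tempdir(), "lda_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("1,a,b,first review text,x",
             "2,a,b,second review text,y"),
           file.path(demo_dir, "one.txt"))
writeLines("3,a,b,third review text,z", file.path(demo_dir, "two.txt"))

text_df <- build_text_frame(demo_dir)

# Only build the corpus if the tm package is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  amz_corpus <- tm::Corpus(tm::DataframeSource(text_df))
}
```

With the real data, the call would presumably be `build_text_frame("/my/dir/")`. Treating each row as its own document is a design choice for this sketch; collapsing each file into a single document, as the original code does with paste, would work too as long as the result is a data frame with those two column names.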