Update: I gave up on the original job and set it to work on the reduced dataset of approximately 400,000 headlines. After 13 hours and 230 GB of swap usage, it finally finished. Meanwhile, I have had success with batch processing, which takes only about 11 minutes! I really wonder why that is.
In conclusion, batch processing seems to be very much worth it.
I'm not happy with my batch-processing code, however, as it works by writing intermediate .csv files. Can anyone point me towards a more elegant solution?
library(SentimentAnalysis)

dir.create("temp", showWarnings = FALSE) # create a temp folder

size = nrow(recentnews)
batch = 5000 # number of posts per batch
for (i in 1:ceiling(size / batch)) {
  y1 = (i - 1) * batch + 1 # first row of this batch (R indexing starts at 1, not 0)
  y2 = min(i * batch, size) # last row, clamped so the final batch stays in range
  filename = paste("temp/abcnews", i, ".csv", sep = "")
  st <- analyzeSentiment(recentnews$text[y1:y2], rules = list("SentimentLM" = list(ruleSentiment, loadDictionaryLM())))
  write.csv(st, filename)
  print(y1)
  print(y2)
}
library(dplyr)
library(readr)
library(gtools)
#merge all files
df <- mixedsort(list.files(path="temp", full.names = TRUE)) %>% # list files in natural numeric order (the extra sort() was redundant)
lapply(read.csv) %>%
bind_rows()
write.csv(df, "merge.csv") # save the merged file to default folder
unlink("temp", recursive = TRUE) # delete temp folder
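For what it's worth, the temp files can be avoided entirely by keeping each batch's result in memory and binding them at the end. This is only a sketch, assuming the same recentnews data frame and the SentimentAnalysis calls used above:

library(dplyr)
library(SentimentAnalysis)

batch = 5000
# split the row indices 1..nrow into consecutive chunks of (at most) batch rows
chunks <- split(seq_len(nrow(recentnews)), ceiling(seq_len(nrow(recentnews)) / batch))

# analyze each chunk and collect the per-batch data frames in a list
results <- lapply(chunks, function(idx) {
  analyzeSentiment(recentnews$text[idx], rules = list("SentimentLM" = list(ruleSentiment, loadDictionaryLM())))
})

df <- bind_rows(results) # stack all batches into one data frame
write.csv(df, "merge.csv")

Same batching, but no temp folder, no filename bookkeeping, and no mixedsort step, since the list order already matches the row order.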