How realistic is it to do sentiment analysis on extremely large datasets ("A Million News Headlines")

Hello,

I'm trying to do sentiment analysis on this dataset, containing 1.1 million newspaper headlines: https://www.kaggle.com/therohk/million-headlines

My machine has now been at it for 24 hours. The first attempt ended after a few minutes with RStudio aborting with a fatal error. I figured I had run out of RAM (the machine has 16 GB), so I repurposed an 80 GB SSD as a swap disk. After some time, RStudio gave an error about there not being enough space for a 617.5 GB vector. I am now using a 2 TB HDD as swap, but I am afraid the job has stalled: it has been sitting at 1.2 TB of swap for around ten hours with barely any CPU activity, though I can hear the HDD working.

I know it is theoretically impossible to know how long it takes to compute something without actually computing it, but has anyone else tried something similar?
I have found out that I actually only need sentiment analysis on 434,000 headlines, but I am afraid of aborting the current job in case it is almost finished.

Also, there is no way to speed things up, right? It seemed to use only one CPU core. There is no GPU acceleration either, right?

Code:

library(SentimentAnalysis)

news <- read.csv("abcnews.csv", stringsAsFactors = FALSE)

# pass the text column, not the whole data frame
sentiment <- analyzeSentiment(news$headline_text, rules = list("SentimentLM" = list(ruleSentiment, loadDictionaryLM())))
news$sentiment <- sentiment$SentimentLM
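Since you can't know the total runtime in advance, one practical trick is to time a small sample and extrapolate linearly. A minimal sketch, using a toy `score_stub()` function as a stand-in for the real `analyzeSentiment()` call (swap in your own call and data):

```r
# Toy scorer standing in for analyzeSentiment(): fraction of words
# in each headline that match a tiny positive-word list.
score_stub <- function(texts) {
  pos <- c("good", "great", "win")
  vapply(strsplit(tolower(texts), "\\s+"),
         function(w) mean(w %in% pos), numeric(1))
}

# Synthetic headlines for illustration; use your own data frame here.
headlines <- rep("markets post great win amid slow growth", 10000)
sample_size <- 1000

# Time a small sample, then scale up to the full dataset.
elapsed <- unname(system.time(score_stub(headlines[1:sample_size]))["elapsed"])
est_total_sec <- elapsed * length(headlines) / sample_size
```

The extrapolation is only a rough lower bound: if the real workload hits swap, per-item cost grows far beyond linear, which is exactly what you observed.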

You can speed things up by not using an HDD as swap; that makes things painfully slow. It would be better to process your data in batches that fit into RAM, and you could also parallelize the batch processing to leverage all cores.
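The batching plus parallelism suggested above can be sketched with the base `parallel` package. This is a minimal illustration with a stand-in `score_stub()` function and synthetic data; substitute your real `analyzeSentiment()` call. A PSOCK cluster via `makeCluster()` is used because it works on all platforms (`mclapply()` is Unix-only):

```r
library(parallel)

score_stub <- function(texts) nchar(texts)  # stand-in for the real scorer

headlines <- rep("example headline", 20000)  # synthetic data for illustration
batch_size <- 5000
starts <- seq(1, length(headlines), by = batch_size)

cl <- makeCluster(2)  # or detectCores(); portable across OSes
clusterExport(cl, varlist = c("headlines", "batch_size", "score_stub"))
results <- parLapply(cl, starts, function(s) {
  e <- min(s + batch_size - 1, length(headlines))
  score_stub(headlines[s:e])   # each batch fits comfortably in RAM
})
stopCluster(cl)

scores <- unlist(results)  # recombine batches in original order
```

Because each worker only ever holds one batch, peak memory stays bounded regardless of dataset size, which is what keeps the job out of swap.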


Update: I gave up on the original job and set it to work on the reduced dataset of approximately 400,000 headlines. After 13 hours and 230 GB of swap usage, it finally finished. Meanwhile, I have had success with batch processing, which takes only approximately 11 minutes! I really wonder why that is.

In conclusion, batch processing seems to be very much worth it.

I'm not happy with my batch processing code, however, as it works by creating .csv files. Can anyone point me towards a more elegant solution?

library(SentimentAnalysis)

dir.create("temp", showWarnings = FALSE)  # create a temp folder

size <- nrow(recentnews)
batch <- 5000  # headlines per batch

for (start in seq(1, size, by = batch)) {  # R indexing is 1-based
  end <- min(start + batch - 1, size)      # clamp the final batch before slicing
  filename <- paste0("temp/abcnews", (start - 1) %/% batch, ".csv")
  st <- analyzeSentiment(recentnews$text[start:end],
                         rules = list("SentimentLM" = list(ruleSentiment, loadDictionaryLM())))
  write.csv(st, filename)
  cat("processed", start, "to", end, "\n")
}

library(dplyr)
library(gtools)

# merge all batch files, in natural order (abcnews2 before abcnews10)
df <- mixedsort(list.files(path = "temp", full.names = TRUE)) %>%
  lapply(read.csv) %>%
  bind_rows()
write.csv(df, "merge.csv", row.names = FALSE)  # save the merged file to the default folder
unlink("temp", recursive = TRUE)               # delete the temp folder
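One way to avoid the temp-folder round trip entirely is to score each batch in memory and bind the results at the end, with no intermediate .csv files. A sketch under the same assumptions as before: `score_stub()` stands in for the real `analyzeSentiment()` call, and the headlines are synthetic:

```r
# Stand-in scorer returning a data frame, like analyzeSentiment() does.
score_stub <- function(texts) data.frame(SentimentLM = nchar(texts) / 100)

headlines <- rep("example headline", 12000)  # synthetic data for illustration
batch_size <- 5000
starts <- seq(1, length(headlines), by = batch_size)

# Score each batch; keep the per-batch data frames in a list.
per_batch <- lapply(starts, function(s) {
  e <- min(s + batch_size - 1, length(headlines))
  score_stub(headlines[s:e])
})

# One bind at the end; no temp files, no re-parsing of .csv output.
df <- do.call(rbind, per_batch)
```

Each per-batch result is small, so holding the list of results in RAM is cheap even when scoring a single giant call would not be; only the intermediate term-document structures need to stay batch-sized.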