Scraping GoodReads with RSelenium

ledgreve · January 7, 2020, 9:03am

Hello,

I came across BuissonFlorent's scripts for scraping Goodreads and text mining (https://github.com/BuissonFlorent/GoodReads_TextMining/blob/master/GR_Webscraping.R). I wanted to run his Goodreads scraping script, so I downloaded the ZIP file and opened the script in RStudio. When I ran the script, I received the following error message: write.csv(global.df, output.filename) Error in is.data.frame(x) : object 'global.df' not found. I looked at the script, but I do not understand why this error occurs.
Could someone help me solve this problem? It's the script named "GR_Webscraping.R". Thank you in advance!

pieterjanvc · January 7, 2020, 1:02pm

Hi,

Welcome to the RStudio community!

I have taken a look at the script, and it seems to use some outdated code. I have not used this before, so I'm not an expert, but here is a version I was able to get to work on my PC:

library(data.table)   # Required for rbindlist
library(dplyr)        # Required to use the pipes %>% and some table manipulation commands
library(magrittr)     # Required to use the pipes %>%
library(rvest)        # Required for read_html
library(RSelenium)    # Required for webscraping with javascript

url <- "https://www.goodreads.com/book/show/18619684-the-time-traveler-s-wife#other_reviews"
book.title <- "The time traveler's wife"
output.filename <- "GR_TimeTravelersWife.csv"

rD <- rsDriver(chromever = "79.0.3945.36")
remDr <- rD[["client"]]
remDr$navigate(url)

global.df <- data.frame(book = character(),
                        reviewer = character(),
                        rating = character(),
                        review = character(), 
                        stringsAsFactors = FALSE)

# Main loop going through the website pages
for(t in 1:98){
  
  #Extracting the reviews from the page
  reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
  reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
  reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
  reviews.text <- unlist(reviews.list)
  
  # Cleaning the reviews with Regex
  reviews.text2 <- gsub("[^A-Za-z\\-]|\\.+"," ",reviews.text) # Replacing every character that is not a letter or a dash (periods, digits and punctuation included) with a space
  reviews.clean <- gsub("\n|[ \t]+"," ",reviews.text2)  # Removing the end of line characters and extra spaces
  
  n <- floor(length(reviews)/2)
  reviews.df <- data.frame(book = character(n), 
                           reviewer = character(n), 
                           rating = character(n), 
                           review = character(n), 
                           stringsAsFactors = FALSE)
  
  # Populating a data frame with the relevant fields
  for(j in seq_len(n)){  # seq_len() skips the loop when n is 0, whereas 1:n would run over c(1, 0)
    reviews.df$book[j] <- book.title
    
    #Isolating the name of the author of the review
    auth.rat.sep <- regexpr(" rated it | marked it | added it ", reviews.clean[2*j-1])
    reviews.df$reviewer[j] <- substr(reviews.clean[2*j-1], 5, auth.rat.sep-1)
    
    #Isolating the rating
    rat.end <- regexpr("· | Shelves| Recommend| review of another edition", reviews.clean[2*j-1])
    if (rat.end==-1){rat.end=nchar(reviews.clean[2*j-1])}
    reviews.df$rating[j] <- substr(reviews.clean[2*j-1], auth.rat.sep+10, rat.end-1)
    
    #Removing the beginning of each review that was repeated on the html file
    short.str <- substr(reviews.clean[2*j], 1, 50)
    rev.start <- unlist(gregexpr(short.str, reviews.clean[2*j]))[2]
    if (is.na(rev.start)){rev.start <- 1}
    rev.end <- regexpr("\\.+more|Blog", reviews.clean[2*j])
    if (rev.end==-1){rev.end <- nchar(reviews.clean[2*j])}
    reviews.df$review[j] <- substr(reviews.clean[2*j], rev.start, rev.end-1)
  }
  
  global.lst <- list(global.df, reviews.df)
  global.df <- rbindlist(global.lst)
  
  NextPageButton <- remDr$findElement("css selector", ".next_page")
  print(NextPageButton)
  print(t)
  NextPageButton$clickElement()
  Sys.sleep(3)
}   
#end of the main loop

rD[["server"]]$stop()

write.csv(global.df, output.filename)

NOTES:

You now have to use rsDriver instead of startServer, remoteDriver, etc.
Make sure to choose the correct version for your browser. I have Chrome, and it defaults to version 80, while at this moment my (stable?) version is only 79.0.3945.36. Check the version of the browser you use and change chromever if needed. If you don't, the browser window won't stay open (you'll see it flash and immediately disappear, and you get an error).
Make sure to stop the server with rD[["server"]]$stop() once you are finished or when the code crashes; otherwise you get an error saying there is already an instance running.
I don't know why the script has a loop that runs to t = 98. It seems there are only 10 pages, but when I test it, the number of reviews keeps growing after t > 10, so there must be a different counting system. Make sure you understand where this comes from if you need it for other purposes, because it makes the loop much longer.
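
The advice about always stopping the server can be made automatic. Below is a minimal sketch, assuming a small helper called run_with_cleanup (a hypothetical name, not part of RSelenium) that relies on base R's tryCatch(finally = ...). The demonstration uses stand-in functions rather than a real browser session.

```r
# Hypothetical helper (not part of RSelenium): run `work`, then always run
# `cleanup`, whether `work` finished normally or stopped with an error.
run_with_cleanup <- function(work, cleanup) {
  tryCatch(work(), finally = cleanup())
}

# Intended use with the script above (commented out because it needs a browser):
# rD <- rsDriver(chromever = "79.0.3945.36")
# run_with_cleanup(
#   work    = function() { remDr <- rD[["client"]]; remDr$navigate(url) },
#   cleanup = function() rD[["server"]]$stop()
# )

# Demonstration without Selenium: the cleanup still runs although work fails.
stopped <- FALSE
result <- try(
  run_with_cleanup(
    work    = function() stop("scraping failed"),
    cleanup = function() stopped <<- TRUE
  ),
  silent = TRUE
)
print(stopped)
```

With this pattern the "already an instance running" error should be less likely after a crash, because the server is stopped before the error reaches the console.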

Hope this helps,
PJ

ledgreve · January 8, 2020, 8:18am

(UPDATE BELOW)

Dear @pieterjanvc ,

Thank you for your kind welcome and for helping me, I really appreciate it! And thank you for explaining the changes you made; this turned it into a useful learning experience for me and gave me more insight into the code. I would never have found out on my own that the code was outdated.
I tried to run the code in RStudio again (after checking my browser version and changing the one in the code to match it), and I received the following error message:
rD <- rsDriver(chromever = "79.0.3945.117")
checking Selenium Server versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
checking chromedriver versions:
BEGIN: PREDOWNLOAD
BEGIN: DOWNLOAD
BEGIN: POSTDOWNLOAD
Error in chrome_ver(chromecheck[["platform"]], chromever) :
  version requested doesnt match versions available = 78.0.3904.105,79.0.3945.36,80.0.3987.16
Do you know how I might solve this problem as well?
Thank you

UPDATE!! --> It worked! I was able to scrape the reviews and download them by changing chromever to the version you mentioned!
I do have two new questions (if you don't mind me asking): do you think it would be possible to scrape the date as well, and to keep the numbers and other "special" characters in the review (even though it is stored in a data frame)? I need to keep the original reviews and especially the full, correct username.
I was also wondering if you could tell me how I can check how many pages there are (to change t = 98 accordingly)? Thank you!!!!
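
For anyone hitting the same chrome_ver error: the chromever argument has to be one of the versions listed in the error message, not the exact version of the installed browser. Here is a minimal sketch of picking a suitable one, assuming that a driver with the same major version as the browser is acceptable (which is what worked in this thread). The helper name pick_chromever is hypothetical.

```r
# Hypothetical helper: among the chromedriver versions reported as available,
# pick the newest one that shares the browser's major version number.
pick_chromever <- function(installed, available) {
  major <- function(v) sub("\\..*$", "", v)  # text before the first dot
  candidates <- available[major(available) == major(installed)]
  if (length(candidates) == 0) {
    stop("no available chromedriver matches Chrome major version ",
         major(installed))
  }
  as.character(max(numeric_version(candidates)))
}

# Versions copied from the error message above
available <- c("78.0.3904.105", "79.0.3945.36", "80.0.3987.16")
chromever <- pick_chromever("79.0.3945.117", available)
print(chromever)  # "79.0.3945.36"
```

The result can then be passed on as rsDriver(chromever = chromever).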

ledgreve · January 9, 2020, 8:58am

Hello @pieterjanvc ,
Sorry for bothering you, and thank you once again for your help. I tested the script on the book The Finkler Question (https://www.goodreads.com/book/show/8664368-the-finkler-question), which, at the moment I am typing this, has 2062 reviews. The CSV file I got contained 2941 strings, which seemed kind of OK, but when I filtered them I saw that a lot of them were duplicates. After removing these, I only had 90 unique values left, so only 90 out of 2062 reviews. Do you know what might cause this problem and how I might solve it? I really do need to scrape all of the reviews for my research.
Kind regards and I wish you a pleasant day!
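
A quick way to quantify this kind of duplication before looking for the cause is sketched below. It uses a toy data frame with made-up values in place of the scraped CSV; for real data, the result of read.csv(output.filename) would take its place.

```r
# Toy data standing in for the scraped CSV (hypothetical values)
scraped <- data.frame(
  reviewer = c("Ann", "Bob", "Ann", "Cy", "Bob"),
  review   = c("Loved it", "Too long", "Loved it", "Fine", "Too long"),
  stringsAsFactors = FALSE
)

n_total  <- nrow(scraped)                  # all rows, duplicates included
n_unique <- nrow(unique(scraped))          # distinct rows only
dupes    <- scraped[duplicated(scraped), ] # rows repeating an earlier row

print(c(total = n_total, unique = n_unique, duplicated = nrow(dupes)))
```

A large share of exact duplicates would be consistent with the loop reading the same page more than once, although that is only one possible explanation.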

pieterjanvc · January 9, 2020, 12:51pm

Hi,

I spent too much time on this, but it was so much fun
Here is my new code:

library(data.table)   # Required for rbindlist
library(dplyr)        # Required to use the pipes %>% and some table manipulation commands
library(magrittr)     # Required to use the pipes %>%
library(rvest)        # Required for read_html
library(RSelenium)    # Required for webscraping with javascript
library(lubridate)
library(stringr)
library(purrr)

options(stringsAsFactors = FALSE) #needed to prevent errors when merging data frames

#Paste the GoodReads Url
url <- "https://www.goodreads.com/book/show/7504988-deloume-road"

#Enter the number of review pages (manually check for now)
nPages = 4 

#Set your browser settings
rD <- rsDriver(chromever = "79.0.3945.36")
remDr <- rD[["client"]]
remDr$setTimeout(type = "implicit", 2000)
remDr$navigate(url)

bookTitle = unlist(remDr$getTitle())
finalData = data.frame()

# Main loop going through the website pages
for(pageNumber in 1:nPages){
 
 #Expand all reviews
 expandMore <- remDr$findElements("link text", "...more")
 sapply(expandMore, function(x) x$clickElement())

 #Extracting the reviews from the page
 reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
 reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
 reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()} )
 reviews.text <- unlist(reviews.list)
 
 #Some reviews have only rating and no text, so we process them separately
 onlyRating = str_detect(reviews.text, "^\\n\\n") #vectorised, and returns an empty vector when no reviews were found
 
 #Full reviews
 if(sum(!onlyRating) > 0){
   
   filterData = reviews.text[!onlyRating]
   fullReviews = purrr::map_df(seq(1, length(filterData), by=2), function(i){
     review = unlist(strsplit(filterData[i], "\n"))
     
     data.frame(
       date = mdy(review[2]), #date
       username = str_trim(review[5]), #user
       rating = str_trim(review[9]), #overall
       comment = str_trim(review[12]) #comment
     )
   })
   
   #Add review text to full reviews
   fullReviews$review = unlist(purrr::map(seq(2, length(filterData), by=2), function(i){
     str_trim(str_remove(filterData[i], "\\s*\\n\\s*\\(less\\)"))
   }))
   
 } else {
   fullReviews = data.frame()
 }

 
 #partial reviews (only rating)
 if(sum(onlyRating) > 0){
   
   filterData = reviews.text[onlyRating]
   partialReviews = purrr::map_df(1:length(filterData), function(i){
     review = unlist(strsplit(filterData[i], "\n"))
     
     data.frame(
       date = mdy(review[9]), #date
       username = str_trim(review[4]), #user
       rating = str_trim(review[8]), #overall
       comment = "",
       review = ""
     )
   })
   
 } else {
   partialReviews = data.frame()
 }
 
 finalData = rbind(finalData, fullReviews, partialReviews)
 
 #Go to the next page, unless this was the last one
 if(pageNumber < nPages){
   NextPageButton <- remDr$findElement("css selector", ".next_page")
   NextPageButton$clickElement()
 }
 
 message(paste("PAGE", pageNumber, "of", nPages, "Processed"))
 Sys.sleep(2)
}   
#end of the main loop

#Replace missing ratings by 'not rated'
finalData$rating = ifelse(finalData$rating == "", "not rated", finalData$rating)

#Stop server
rD[["server"]]$stop()

#Write results
write.csv(finalData, paste0(bookTitle, ".csv"), row.names = FALSE)

Changes

IMPORTANT: you need to check manually how many pages of reviews there are and set that number (nPages) before you start
The title now gets collected automatically (including the author)
The reviews were missing text because many of them were truncated by a ...more link that had to be clicked. The code now looks for all of those and clicks them, expanding all the text
The text now contains all characters (special ones get escaped)
I added the date and comment to the table, so you can see if people reviewed another edition
I optimised the data table generation
Missing ratings are now noted as "not rated"
In some cases people only rated a book and did not review it; this is now handled by detecting those entries and setting the review text to an empty string
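
One thing to watch with the automatically collected title: it is used directly as the file name in write.csv, and a title may contain characters that are not allowed in file names on some systems (for example a colon or a slash on Windows). A minimal sketch of a workaround, assuming a hypothetical helper safe_filename and a made-up example title:

```r
# Hypothetical helper: replace characters that are problematic in file names
# with an underscore, tidy up whitespace, and append the extension.
safe_filename <- function(title, ext = ".csv") {
  cleaned <- gsub('[<>:"/\\\\|?*]', "_", title)
  cleaned <- trimws(gsub("\\s+", " ", cleaned))
  paste0(cleaned, ext)
}

# Made-up title in the "title by author" style
example <- safe_filename("Some Book: A Novel / by Some Author")
print(example)  # "Some Book_ A Novel _ by Some Author.csv"
```

In the script this would replace paste0(bookTitle, ".csv") with safe_filename(bookTitle).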

Try and let me know how it works!
PJ

ledgreve · January 10, 2020, 11:04am

@pieterjanvc
Hello again! Thank you so much for your help, it is incredibly kind of you! The script works wonderfully and it’s a lot faster than before as well! And thank you for making it possible for me to scrape the dates.

At first, when I tried to run your code in RStudio, I had some trouble with the "mdy" function and the "str_trim" function, but that was solved once I installed the "stringr" package (and added ‘library(“stringr”)’ to the script) and the "lubridate" package.

However, I encountered a problem with two books of which I tried to scrape the reviews. For Deloume Road (https://www.goodreads.com/book/show/7504988-deloume-road?ac=1&from_search=true&qid=wEtGAysS9h&rank=1) and Tales from the Mall (https://www.goodreads.com/book/show/13637188-tales-from-the-mall) I got the same error – I include one of them here:

message(paste("PAGE", pageNumber, "of", nPages, "Processed"))
+   Sys.sleep(2)
+ }   
Error in `$<-.data.frame`(`*tmp*`, "review", value = c("Tales from the Mall is a mad mix of fascinating facts, statistics, historical background, fictionalised accounts based on real interviews and actual short stories - all revolving around shopping malls. Just like shopping malls, the book sometimes confused me, overstimulated me and satiated any sense of voyeurism I may harbour (shopping centres are fab for people watching.... and so is this book!) - and certainly never bored me. Some of the short stories (whether or not they were based on fact or\n  Tales from the Mall is a mad mix of fascinating facts, statistics, historical background, fictionalised accounts based on real interviews and actual short stories - all revolving around shopping malls. Just like shopping malls, the book sometimes confused me, overstimulated me and satiated any sense of voyeurism I may harbour (shopping centres are fab for people watching.... and so is this book!) - and certainly never bored me. Some of the short stories (whether or not they were based on fact or fiction) were exceptionally well written, and I was very disappointed when they ended. I've never read anything else by Ewan Morrison, but based on the short stories, I'd be keen to read a novel written by him. The well researched historical background, rich with stats and figures was interesting, but the most fascinating factual chapters were all about the psychological manipulation that is applied in the design, lay-out and even staffing of the malls and the shops within. I don't tend to frequent shopping centres very often, but the next time I do, it will be with a much more critical and aware mind. Thanks, Mr Morrison, for the great stories and many eye-openers on such an interesting aspect of our culture, society, and even geography.",  : 
  replacement has 25 rows, data has 26
> #end of the main loop

I would like to ask one last thing, though I would of course understand if it isn’t possible or if you’re too busy. Do you think it might be possible to choose which reviews are scraped? At the moment, the Goodreads page opens and (seemingly at random) shows either only the English reviews or the reviews for “all languages”. It would be very practical if I could set in the script which ones I want, so that it remains consistent which reviews are scraped.

Once again, thank you for your help and guidance, it is very much appreciated!

pieterjanvc · January 10, 2020, 1:14pm

Hi,

First of all, I'm sorry I forgot the packages. I had been working on it for too long and forgot to add them in the end. Thanks for noticing.

The problem with the book you mentioned is that those pages have people on them that only rated and didn't leave a review. This messed up my scraping analysis, but I've taken care of it and now they are added to the list with empty reviews.

I did not have time to look into the language thing, but I think this will get you started.
I updated the code in the previous post.
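
To illustrate why rating-only entries caused the "replacement has 25 rows, data has 26" error: the script treats the scraped text as alternating header/body pairs, and a single rating-only entry shifts that pairing. The sketch below uses heavily shortened, made-up strings and base R's startsWith in place of the str_detect call from the script.

```r
# Toy stand-in for reviews.text (hypothetical strings, heavily shortened).
# Full reviews take two entries (header, body); a rating-only review takes
# one entry that starts with two newlines.
reviews.text <- c(
  "header of review A", "body of review A",
  "\n\nrating-only entry B",
  "header of review C", "body of review C"
)

# Naive pairing over everything: 3 "headers" but only 2 "bodies"
naive_headers <- reviews.text[seq(1, length(reviews.text), by = 2)]
naive_bodies  <- reviews.text[seq(2, length(reviews.text), by = 2)]
print(c(headers = length(naive_headers), bodies = length(naive_bodies)))

# Filtering out the rating-only entries first restores the pairing
onlyRating <- startsWith(reviews.text, "\n\n")
full    <- reviews.text[!onlyRating]
headers <- full[seq(1, length(full), by = 2)]
bodies  <- full[seq(2, length(full), by = 2)]
print(data.frame(header = headers, body = bodies))
```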

Good luck
PJ

ledgreve · January 14, 2020, 3:41pm

Dear @pieterjanvc,

Thank you for your time and help! I tried the new modified script and was able to scrape Deloume Road and Tales From the Mall without problems. However, when I re-scraped the other books whose reviews I was collecting, I got the same error for a few of them (in this example for Mantel's Wolf Hall):

+ message(paste("PAGE", pageNumber, "of", nPages, "Processed"))
+ Sys.sleep(2)
+ }
PAGE 1 of 10 Processed
PAGE 2 of 10 Processed
Selenium message:stale element reference: element is not attached to the page document
(Session info: chrome=79.0.3945.117)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/stale_element_reference.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'LW0xxxxx', ip: 'xxx.xxx.xxx.xxx', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_231'
Driver info: driver.version: unknown
Error:  Summary: StaleElementReference
   Detail: An element command failed because the referenced element is no longer attached to the DOM.
   class: org.openqa.selenium.StaleElementReferenceException
 Further Details: run errorDetails method
> #end of the main loop
>
> #Replace missing ratings by 'not rated'
> finalData$rating = ifelse(finalData$rating == "", "not rated", finalData$rating)

I am not sure what this means. At first I thought it might be a consequence of the fact that my actual Chrome version is more recent than the one in the script (79.0.3945.117 instead of 79.0.3945.36), but that does not explain why the problem only shows up with some of the books. Furthermore, if this were the problem, it would have been a problem with the previous script as well, wouldn't it?
Kind regards, and once again thank you for your assistance and patience!

pieterjanvc · January 14, 2020, 4:16pm

Hi,

I don't have any issues with that book.

I think that error occurs when the internet connection is slow and the page has not finished loading before the scraping starts. In that case, try increasing the sleep after loading a new page:

#Increase and try again
message(paste("PAGE", pageNumber, "of", nPages, "Processed"))
Sys.sleep(4)
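
An alternative to a longer fixed sleep is to retry a step when it fails. This is a different technique from the one suggested above, shown only as a minimal sketch: the helper name retry is hypothetical, and the demonstration uses a stand-in function that fails twice instead of a real Selenium call.

```r
# Hypothetical helper: call `action()` and, if it throws an error, wait and
# try again, up to `attempts` times in total. The last error is re-thrown.
retry <- function(action, attempts = 3, wait = 2) {
  for (i in seq_len(attempts)) {
    result <- tryCatch(action(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < attempts) Sys.sleep(wait)
  }
  stop(result)
}

# Stand-in for a flaky page interaction: fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("stale element reference")
  "ok"
}

value <- retry(flaky, attempts = 5, wait = 0)
print(value)  # "ok"
print(calls)  # 3
```

In the scraping loop, the same wrapper could be put around the step that fails, for example retry(function() remDr$findElement("css selector", ".next_page")$clickElement()), assuming the failure surfaces as an ordinary R error as in the log above.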

Does this work?
PJ

ledgreve · January 15, 2020, 7:29am

Hello @pieterjanvc,

Thank you so much! I ran the script for Wolf Hall again today and everything worked just fine, so there must have been some internet problems yesterday. I am sorry to have bothered you! And thank you once again for your help!

Kind regards!
