I am running a script in RStudio (a wonderful RStudio community member helped me with it) to scrape Goodreads reviews. Recently, I started getting an error message for some of the pages I'm trying to scrape. I don't wish to keep disturbing the person who helped me, so I've been trying (and failing) to solve it myself for the past few days. I tried again earlier but still can't get it right, so I thought I might ask here.
This is the error I get:
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 31, 30
The numbers at the end may change, e.g. 30, 33. What seems strange to me is that the error is not constant: it only occurs for some of the pages I'm trying to scrape, although the script itself remains the same. Example: scraping the reviews of The Handmaid's Tale (https://www.goodreads.com/book/show/38447.The_Handmaid_s_Tale?ac=1&from_search=true&qid=ZGrzc7AfLN&rank=1) causes an error (32, 30), but scraping the reviews of Typhoon Kingdom (https://www.goodreads.com/book/show/52391186-typhoon-kingdom) causes no problems.
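To make the error itself concrete, here is a minimal sketch that should trigger the same kind of message without any scraping. The data is made up for illustration (31 IDs against 30 review rows); the real fullReviews and partialReviews objects in the script may look different.

```r
# Stand-in data (hypothetical): 31 IDs but only 30 review rows
reviewId <- as.character(seq_len(31))
reviews <- data.frame(review = paste("review", seq_len(30)),
                      stringsAsFactors = FALSE)

# cbind() with a data frame argument ends up calling data.frame(),
# which refuses to combine columns of different lengths
result <- tryCatch(
  cbind(reviewId, reviews),
  error = function(e) conditionMessage(e)
)
print(result)
```

So the two numbers in the message appear to be the length of reviewId and the number of review rows on the page being processed.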
I've removed some parts of the code to find out where the problem comes from and it seems to me that it must be caused by either this piece of code that extracts the review-IDs:
#Get the review IDs from all the links
reviewId = reviews.html %>% str_extract("/review/show/\\d+")
reviewId = reviewId[!is.na(reviewId)] %>% str_extract("\\d+")
or by this line:
finalData = rbind(finalData, cbind(reviewId, rbind(fullReviews, partialReviews)))
When I remove the first piece of code and change that line back to
finalData = rbind(finalData, fullReviews, partialReviews)
(the review IDs weren't extracted originally), the script runs without any errors. However, I really need to extract these review IDs to properly anonymise my data, so simply leaving them out is not really an option.
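To narrow it down, a check like the one below could be run just before the cbind call. This is only a sketch: diagnose_ids is a made-up helper, the data is a stand-in, and it assumes reviewId is a character vector and the reviews are a data frame. It does not fix anything; it only reports whether the counts differ and whether any ID was picked up more than once.

```r
# Hypothetical helper (not part of the original script): compare the number
# of extracted IDs with the number of review rows and list repeated IDs.
diagnose_ids <- function(ids, reviews) {
  list(
    n_ids = length(ids),
    n_reviews = nrow(reviews),
    duplicated_ids = unique(ids[duplicated(ids)])
  )
}

# Stand-in data: one ID appears twice, so there are 4 IDs for 3 reviews
ids <- c("101", "102", "102", "103")
reviews <- data.frame(review = c("a", "b", "c"), stringsAsFactors = FALSE)

info <- diagnose_ids(ids, reviews)
str(info)
```

If duplicated_ids is not empty on the failing pages, the extra rows might come from links that point to the same review more than once, but that is a guess I have not been able to confirm.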
I've tried replacing that part of the code with the following, as it should also be able to scrape the review ID (but please correct me if I'm wrong):
#Get the review IDs from all the links
reviewId = reviews.html %>% str_extract("review_\\d+")
reviewId = reviewId[!is.na(reviewId)] %>% str_extract("\\d+")
This did not solve the problem and caused the same error, though with two differences:
1. The error has completely different numbers:
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 0, 30
2. The error now occurs for every single URL instead of only some, so I actually managed to make it worse.
I've googled the error message and apparently the problem could be caused if there aren't as many rows as columns in a dataframe. Some say using rbind.fill and cbind.fill could work as a solution, but apparently you can't install rowr in R 4.0.1 and only using rbind.fill didn't solve the problem.
There are a lot of online questions about this error message and just as many different solutions, but so far I haven't found one that works for this script.
Does anyone know how this problem might be solved? Concrete steps would be much appreciated. Thank you!