Scraping Amazon user information

rafigabasov · July 19, 2020, 11:19pm

I am trying to build a function to scrape certain information about Amazon users, such as reviewer ranking, number of helpful votes, etc. However, when I apply the function, it returns an empty table without any information.

The code is below:

#### Loading packages ####
library(tidyverse)
library(rvest)

#### Function to scrape user information ####
scrape_user <- function(user_id) {
  url_user <- paste0("https://www.amazon.com/gp/profile/amzn1.account.", user_id)
  doc <- read_html(url_user) # Parse the profile page and assign it to `doc`
  # Reviewer Ranking
  reviewer_ranking <- doc %>%
    html_nodes("[class='a-size-base']") %>%
    html_text()
  # Helpful Votes
  helpful_votes <- doc %>%
    html_nodes("[class='a-size-large a-color-base']") %>%
    html_text()
  # Number of Reviews
  # NOTE: this is the same selector as for helpful votes, so both columns
  # receive identical values
  n_reviews <- doc %>%
    html_nodes("[class='a-size-large a-color-base']") %>%
    html_text()
  # Return a tibble (the last expression is the function's return value)
  tibble(user_id = user_id,
         reviewer_ranking,
         helpful_votes,
         n_reviews)
}

#### List of IDs of users ####
id_list <- c("AE2RRRB42BQPO7HTSHCHKTBW442Q", "AH3CPFJRT5PTJEZKE2WZK5GLBQYQ", "AFOACDZPXXUUXUXCG4IGOAXJDS2A")

#### Scraping the information ####
users_info <- data.frame(matrix(ncol = 4, nrow = 0, dimnames = list(NULL, c("user_id", "reviewer_ranking", "helpful_votes", "n_reviews"))))
for (j in id_list) {
  message("Getting information for user with ID ", j)
  Sys.sleep(5)
  users_info <- rbind(users_info, scrape_user(j))
}

I think the problem could be that a user profile page on Amazon takes some time to load fully, and until then an almost empty page is shown. Is it possible to make rvest wait until the page has fully loaded before scraping the data? Or do you think the problem is something else?
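One way to see how an "empty table" can arise without contacting Amazon at all is sketched below. The HTML string is a made-up stand-in (an assumption, not Amazon's real markup) for a page whose statistics have not been filled in yet. When a selector matches nothing, `html_text()` returns a zero-length vector, and `tibble()` recycles the length-1 `user_id` down to zero rows.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for a page whose content has not been rendered yet:
# the container exists, but the statistics are absent.
static_html <- '<html><body><div id="profile"></div></body></html>'
doc <- read_html(static_html)

# The selector matches no nodes, so html_text() returns character(0).
helpful_votes <- doc %>%
  html_nodes("[class='a-size-large a-color-base']") %>%
  html_text()
length(helpful_votes)

# The length-1 user_id is recycled to length 0, so the tibble has no rows.
# "EXAMPLE_ID" is a placeholder, not a real account ID.
result <- tibble(user_id = "EXAMPLE_ID", helpful_votes)
nrow(result)
```

Checking `length()` on each scraped vector inside `scrape_user()` would show whether the selectors match anything in the HTML that `read_html()` actually receives.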

zac-garland · July 20, 2020, 4:08am

After taking a look, it seems that you'd need to use a webdriver like RSelenium to access the content. In other words, Amazon doesn't allow non-interactive use of their site.

However, RSelenium comes with its own challenges, and there is often little information available, as the preferred way to use Selenium is via Docker.
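A minimal sketch of that route, under stated assumptions: a Selenium server is already running in Docker and reachable on `localhost:4445` (the image tag and port mapping in the comment follow the pattern in RSelenium's Docker vignette and may need adjusting), and a fixed pause is long enough for the page to render. The function name `fetch_rendered` and its `wait_seconds` argument are hypothetical.

```r
# Assumes a Selenium server was started beforehand, for example with:
#   docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0
# Image tag, host, and port are assumptions; adjust them to your setup.
fetch_rendered <- function(url, wait_seconds = 10) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = 4445L,
    browserName = "firefox"
  )
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)  # always close the browser session

  remDr$navigate(url)
  Sys.sleep(wait_seconds)  # crude fixed wait for client-side rendering

  # getPageSource() returns a list; its first element is the HTML string.
  rvest::read_html(remDr$getPageSource()[[1]])
}
```

The returned document can then be passed to the same `html_nodes()` and `html_text()` calls used with a document from `read_html(url)`.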

rafigabasov · July 20, 2020, 1:39pm

I would like to mention that I can scrape customer reviews on Amazon product pages quite easily with similar code. Product pages load fully from the first second, but the user profile pages take some time and show half-empty pages until then.
So, isn't there a way to wait for the full load while using the rvest package? And if not, how can I achieve it with RSelenium? Any ideas?
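On the "wait for the full load" part: `read_html()` makes a single request and does not run JavaScript, so pausing before or after it does not change the HTML it receives. With a real browser session, a common pattern is to poll until the wanted content appears. The sketch below shows only the polling logic; `wait_for` and `fake_probe` are hypothetical names, and the probe is a stand-in that becomes non-empty on its third call.

```r
# Hypothetical helper: call `probe()` repeatedly until it returns a non-empty
# result, or stop with an error once `timeout` seconds have elapsed.
wait_for <- function(probe, timeout = 15, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- probe()
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) stop("Timed out waiting for content")
    Sys.sleep(interval)
  }
}

# Stand-in probe: empty on the first two calls, populated on the third.
calls <- 0
fake_probe <- function() {
  calls <<- calls + 1
  if (calls < 3) character(0) else "1,234"
}

votes <- wait_for(fake_probe, timeout = 5, interval = 0.1)
```

With RSelenium, the probe could wrap something like `remDr$findElements(using = "css selector", value = "...")` on an open session object (here called `remDr` for illustration), which returns an empty list while nothing matches.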

nirgrahamuk · July 20, 2020, 1:41pm

Just speculating, but maybe a web-crawling approach is worth a try.
