Scraping Amazon user information

I am trying to build a function to scrape certain information about Amazon users, such as reviewer ranking and number of helpful votes. However, when I apply the function, it returns an empty table without any information.

The code is below:

#### Loading packages ####
library(tidyverse)
library(rvest)
#### Function to scrape user information ####
scrape_user <- function(user_id){
  url_user <- paste0("https://www.amazon.com/gp/profile/amzn1.account.",user_id)
  doc <- read_html(url_user) # Assign results to `doc`
  # Reviewer Ranking
  doc %>%
    html_nodes("[class='a-size-base']") %>%
    html_text() -> reviewer_ranking
  # Helpful Votes
  doc %>% 
    html_nodes("[class='a-size-large a-color-base']") %>% 
    html_text() -> helpful_votes
  # Number of Reviews
  doc %>% 
    html_nodes("[class='a-size-large a-color-base']") %>% 
    html_text() -> n_reviews
  # Return a tibble
  tibble(user_id = user_id,
         reviewer_ranking,
         helpful_votes,
         n_reviews) %>% return()
}
#### List of IDs of users ####
id_list <- c("AE2RRRB42BQPO7HTSHCHKTBW442Q", "AH3CPFJRT5PTJEZKE2WZK5GLBQYQ", "AFOACDZPXXUUXUXCG4IGOAXJDS2A")
#### Scraping the information ####
users_info <- data.frame(matrix(ncol = 4, nrow = 0, dimnames = list(NULL, c("user_id", "reviewer_ranking", "helpful_votes", "n_reviews"))))
for (j in id_list) {
  message("Getting information for user with ID ", j)
  Sys.sleep(5)
  users_info <- rbind(users_info, scrape_user(j))
}

I think the problem could be that an Amazon user profile takes some time to load fully, and until then an almost empty page is shown. Is it possible to make rvest wait until the page has fully loaded before scraping the data? Or do you think the problem is something else?

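After taking a look, it seems you'd need a webdriver such as RSelenium to access the content. rvest's read_html() fetches the page's HTML but does not run its JavaScript, so it cannot wait for content that is filled in after the initial load. In other words, Amazon doesn't allow non-interactive use of this part of their site.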
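
This would also explain why the function returns an empty table rather than an error. Presumably the selectors match nothing in the HTML that rvest receives, so html_text() returns zero-length vectors, and tibble() recycles the length-one user_id down to zero rows. A minimal sketch of that recycling behaviour (the `no_match` name is only for illustration):

```r
library(tibble)

# What html_text() gives back when html_nodes() finds no matching nodes
no_match <- character(0)

# A length-1 column is recycled to the length of the other columns,
# which here is zero, so the result has the right columns but no rows
res <- tibble(user_id = "AE2RRRB42BQPO7HTSHCHKTBW442Q",
              reviewer_ranking = no_match)
nrow(res)
```

Checking length() on each scraped vector before building the tibble is a quick way to confirm whether this is what happens in your case.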

However, RSelenium comes with its own challenges, and documentation is often sparse, since the preferred way to run Selenium is via Docker.
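
For what it's worth, here is a rough sketch of how the RSelenium route could look. It is untested against Amazon and makes several assumptions: the function names `parse_profile()` and `scrape_user_selenium()` are made up for illustration, the CSS selectors are copied from the question and not checked against the real markup, and since the question uses the same selector for helpful votes and number of reviews, taking the first and second match is only a guess. The fixed `wait` is a crude stand-in for properly waiting on an element.

```r
library(rvest)
library(tibble)

# Parse the HTML of an already rendered profile page (passed as a string).
# Selectors come from the question and are NOT verified against Amazon.
parse_profile <- function(page_source, user_id) {
  doc <- read_html(page_source)
  # Return the n-th match of a selector, or NA if there is no such match,
  # so every user yields exactly one row
  nth_text <- function(css, n = 1) {
    txt <- html_text(html_nodes(doc, css), trim = TRUE)
    if (length(txt) < n) NA_character_ else txt[n]
  }
  tibble(
    user_id          = user_id,
    reviewer_ranking = nth_text("[class='a-size-base']"),
    helpful_votes    = nth_text("[class='a-size-large a-color-base']", 1),  # guess
    n_reviews        = nth_text("[class='a-size-large a-color-base']", 2)   # guess
  )
}

# Load the profile in a real browser, wait, then hand the rendered HTML to rvest
scrape_user_selenium <- function(remDr, user_id, wait = 5) {
  remDr$navigate(paste0("https://www.amazon.com/gp/profile/amzn1.account.", user_id))
  Sys.sleep(wait)  # fixed pause while the page's scripts fill in the content
  parse_profile(remDr$getPageSource()[[1]], user_id)
}

# Usage (needs RSelenium plus a local browser and driver; not run here):
# driver <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
# remDr  <- driver$client
# users_info <- dplyr::bind_rows(lapply(id_list, function(id) scrape_user_selenium(remDr, id)))
# remDr$close()
# driver$server$stop()
```

Keeping the parsing separate from the browser step means you can test the selectors on a saved copy of the page source without starting Selenium each time.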

I would like to mention that I can scrape customer reviews on Amazon product pages quite easily with similar code. Product pages load fully from the first second, but the user profile pages take some time and show half-empty pages until then.
So, isn't there a way to wait for the full load while using the rvest package? And if not, how can I achieve it with RSelenium? Any ideas?

Just speculating, but maybe a web-crawling approach is worth a try.
