web scraping help

Hi, I am having issues scraping data from the Amazon website. Please see the code below:

pacman::p_load(
  #data wrangling
  tidyverse, stringr,
  #web scraping
  rvest
)
url <- ("https://www.amazon.com/Books/b/ref=s9_acss_bw_cg_bsmpill_1e1_w?ref=bsm_nav_pill_nyt/ie=UTF8&node=549028&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-1&pf_rd_r=HC2BQ4Y2FJ040GDE8AHS&pf_rd_t=101&pf_rd_p=ef8cebb8-ad4b-453c-8030-a931d3822444&pf_rd_i=16857165011")
# Fetch and parse the page once, then reuse it for every selector
page <- read_html(url)
# Titles
best_seller <- page %>%
  html_elements("#productTitle") %>%
  html_text2() %>%
  str_squish()
# Ratings
Rating <- page %>%
  html_elements(".a-star-4-5") %>%
  html_text2()
# Authors
links <- page %>%
  html_elements(".contributorNameID") %>%
  html_text2()

This is probably an Amazon self-protective measure to guard against bots. Their own API has a rate throttle of 10/second, so they aren’t eager to be scraped. One way for them to spot a bot is to look at the browser header of the HTTP request. I ran into this with the sec.gov site.

Test by fetching a page from some other site. If that works, then Amazon's defense mechanisms are likely the problem, and evading those strays too close to the red line where acceptable use becomes hacking.

The code works well with every other website I have tried, but it's not working with Amazon. I have also tried the httr package, but that isn't working either.

That makes me think that something within the constraints of their API will be needed.

Can you walk me through what might be needed?

Only as far as here, which requires a seller account, which I no longer have.

Try this version, which I found on Reddit, written by user [marguslt]:

library(dplyr)
library(rvest)
library(stringr)

nyt_bestsellers <- "https://www.amazon.com/Books/b/node=549028"
az_sess <- session(nyt_bestsellers)
tibble(
  titles <- az_sess %>% 
    html_elements("a.acs-product-block__product-title span.a-truncate-full") %>% 
    html_text(),
  authors <- az_sess %>% 
    html_elements("span.acs-product-block__contributor span.a-truncate-full") %>% 
    html_text() %>% 
    str_squish(),
  ratings <- az_sess %>% 
    html_elements("div.acs-product-block__review i.a-icon-star-medium") %>% 
    html_attr("class") %>% 
    str_extract("\\d(-\\d)?$")
)
#> # A tibble: 140 × 3
#>    `titles <- ...`                                               autho…¹ ratin…²
#>    <chr>                                                         <chr>   <chr>  
#>  1 Lessons in Chemistry: A Novel                                 Bonnie… 4-5    
#>  2 The House in the Pines: A Novel                               Ana Re… 3-5    
#>  3 Without a Trace: A Novel                                      Daniel… 4-5    
#>  4 The Boys from Biloxi: A Legal Thriller                        John G… 4-5    
#>  5 Demon Copperhead: A Novel                                     Barbar… 4-5    
#>  6 Fairy Tale                                                    Stephe… 4-5    
#>  7 Tomorrow, and Tomorrow, and Tomorrow: A novel                 Gabrie… 4-5    
#>  8 Mad Honey: A Novel                                            Jodi P… 4-5    
#>  9 The Midnight Library: A Novel                                 Matt H… 4-5    
#> 10 Babel: Or the Necessity of Violence: An Arcane History of th… R. F K… 4-5    
#> # … with 130 more rows, and abbreviated variable names ¹​`authors <- ...`,
#> #   ²​`ratings <- ...`
Created on 2023-01-18 by the reprex package (v2.0.1)
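
A side note on the column names in the output above: `<-` inside tibble() does not name the column, so the header falls back to a label built from the expression itself (`titles <- ...`). Using `=` gives plain names. Below is a minimal offline sketch of that, assuming a hand-written HTML fragment (built with rvest's minimal_html()) that only imitates the class names targeted by the selectors above. The book and author values are invented, and the numeric rating column is an extra added for illustration.

```r
library(dplyr)
library(rvest)
library(stringr)

# Hypothetical stand-in for the live page: two hand-written product blocks.
page <- minimal_html('
  <div class="acs-product-block">
    <a class="acs-product-block__product-title"><span class="a-truncate-full">Book One</span></a>
    <span class="acs-product-block__contributor"><span class="a-truncate-full"> Author  A </span></span>
    <div class="acs-product-block__review"><i class="a-icon a-icon-star-medium a-star-medium-4-5"></i></div>
  </div>
  <div class="acs-product-block">
    <a class="acs-product-block__product-title"><span class="a-truncate-full">Book Two</span></a>
    <span class="acs-product-block__contributor"><span class="a-truncate-full">Author B</span></span>
    <div class="acs-product-block__review"><i class="a-icon a-icon-star-medium a-star-medium-4"></i></div>
  </div>
')

books <- tibble(
  # `=` names the column; `<-` would leave it labelled by the expression
  titles = page %>%
    html_elements("a.acs-product-block__product-title span.a-truncate-full") %>%
    html_text(),
  authors = page %>%
    html_elements("span.acs-product-block__contributor span.a-truncate-full") %>%
    html_text() %>%
    str_squish(),
  # Keep the trailing digits of the class attribute, e.g. "4-5" or "4"
  ratings = page %>%
    html_elements("div.acs-product-block__review i.a-icon-star-medium") %>%
    html_attr("class") %>%
    str_extract("\\d(-\\d)?$"),
  # Illustrative extra: turn "4-5" into the number 4.5
  rating = as.numeric(str_replace(ratings, "-", "."))
)

names(books)
```

With the columns named this way, they can be referred to as books$titles, books$authors and so on without backticks.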

Thanks so much for this. However, after running it on my PC I got the output below. Could you please tell me what I am not doing right?

[image: screenshot of the output]

I can reproduce your result, @Techzill. @M_AcostaCH, do you have an authentication token being passed somewhere? My az_sess auth_token is

  .. ..$ auth_token: NULL

Or maybe the cookie expiration is set to 1970?

@technocrat, do I need to have a token? Can you please share your code and the token? Or is the token unique?

I’m speculating that the reason you and I get an empty tibble is that neither of us has a token and @M_AcostaCH does. But I could be wrong about that.
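
One way to narrow this down is to count how many nodes each selector matches, since an empty tibble only says that every selector matched nothing in whatever page came back. A rough sketch, assuming two hand-written pages made with rvest's minimal_html(): one imitating a product block and one imitating a bot-check page. The text of that second page is invented, and count_matches() is a hypothetical helper, not part of rvest.

```r
library(rvest)

# Hypothetical helper: number of nodes each CSS selector matches on a page.
count_matches <- function(page, selectors) {
  vapply(
    selectors,
    function(css) length(html_elements(page, css)),
    integer(1)
  )
}

# The three selectors used in the code earlier in the thread
selectors <- c(
  titles  = "a.acs-product-block__product-title span.a-truncate-full",
  authors = "span.acs-product-block__contributor span.a-truncate-full",
  ratings = "div.acs-product-block__review i.a-icon-star-medium"
)

# Invented page that contains one product block
product_page <- minimal_html('
  <a class="acs-product-block__product-title"><span class="a-truncate-full">Book One</span></a>
  <span class="acs-product-block__contributor"><span class="a-truncate-full">Author A</span></span>
  <div class="acs-product-block__review"><i class="a-icon a-icon-star-medium a-star-medium-4-5"></i></div>
')

# Invented page with no product blocks, standing in for a bot check
blocked_page <- minimal_html("<p>Please confirm you are not a robot.</p>")

count_matches(product_page, selectors)
count_matches(blocked_page, selectors)
```

The same helper should accept az_sess in place of the hand-written pages, since html_elements() is already being called on the session in the code above. If every count is zero, that would suggest a different page is being served rather than a typo in one selector, though it would not say why.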

Let's wait for his response, then.

This was my first time scraping Amazon. I found examples of this and tried the code.

When I ran the script in the morning it worked well, but when I ran it at night it didn't show the data. :thinking:

I'm going to look into this problem and let you know.
