web scraping help

Techzill · January 16, 2023, 10:36pm

Hi, I am having issues scraping data from the Amazon website. Please see the code below:

pacman::p_load(
  #data wrangling
  tidyverse, stringr,
  #web scraping
  rvest
)
url <- ("https://www.amazon.com/Books/b/ref=s9_acss_bw_cg_bsmpill_1e1_w?ref=bsm_nav_pill_nyt/ie=UTF8&node=549028&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-1&pf_rd_r=HC2BQ4Y2FJ040GDE8AHS&pf_rd_t=101&pf_rd_p=ef8cebb8-ad4b-453c-8030-a931d3822444&pf_rd_i=16857165011")
# Download and parse the page once, then reuse it for every selector
page <- read_html(url)
# Titles
best_seller <- page %>% 
  html_elements("#productTitle") %>% 
  html_text2() %>% 
  str_squish()
# Ratings
Rating <- page %>% 
  html_elements(".a-star-4-5") %>% 
  html_text2()
# Authors
links <- page %>% 
  html_elements(".contributorNameID") %>% 
  html_text2()

technocrat · January 16, 2023, 11:09pm

Probably an Amazon self-protective measure to guard against bots. Their own API has a rate throttle of 10/second, so they aren’t eager to be scraped. One way for them to tell is to look at the headers of the HTTP request, such as the User-Agent. I ran into this with the sec.gov site.

Try fetching a page from some other site as a test. If that works, then Amazon's defense mechanisms are likely the problem, and evading those strays too close to the red line where acceptable use becomes hacking.
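
A minimal sketch of that kind of sanity check, assuming rvest is installed. The helper `count_matches()` and the sample HTML are hypothetical; a local temporary file stands in for "some other site" so the snippet runs offline. Passing a URL instead of a file path should work the same way.

```r
library(rvest)

# Hypothetical helper: parse a page (URL or local file path) once and
# count how many nodes each CSS selector matches.
count_matches <- function(src, selectors) {
  page <- read_html(src)
  vapply(selectors, function(css) length(html_elements(page, css)), integer(1))
}

# Stand-in for a page from another site, written to a temporary file.
sample_file <- tempfile(fileext = ".html")
writeLines(
  paste0(
    '<html><body><h1 class="title">A</h1>',
    '<p class="author">B</p><p class="author">C</p></body></html>'
  ),
  sample_file
)

# One count per selector; a selector that matches nothing gives 0.
counts <- count_matches(sample_file, c(".title", ".author", ".rating"))
print(counts)
```

If every selector returns zero on the Amazon page but the same approach finds nodes elsewhere, the scraping code itself is probably not at fault.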

Techzill · January 17, 2023, 7:26pm

The code works well with every other website I have tried, but it's not working with Amazon. I have also tried the httr package, but that is not working either.

technocrat · January 17, 2023, 8:27pm

That makes me think that something within the constraints of their API will be needed.

Techzill · January 17, 2023, 8:31pm

Can you walk me through what might be needed?

technocrat · January 17, 2023, 9:21pm

Only as far as here, which requires a seller account, which I no longer have.

M_AcostaCH · January 18, 2023, 6:36pm

Try this approach, which I found on Reddit; user [marguslt] wrote it:

https://www.reddit.com/r/RStudio/comments/10ezo2r/web_scraping_amazon/

library(dplyr)
library(rvest)
library(stringr)

nyt_bestsellers <- "https://www.amazon.com/Books/b/node=549028"
az_sess <- session(nyt_bestsellers)
tibble(
  titles <- az_sess %>% 
    html_elements("a.acs-product-block__product-title span.a-truncate-full") %>% 
    html_text(),
  authors <- az_sess %>% 
    html_elements("span.acs-product-block__contributor span.a-truncate-full") %>% 
    html_text() %>% 
    str_squish(),
  ratings <- az_sess %>% 
    html_elements("div.acs-product-block__review i.a-icon-star-medium") %>% 
    html_attr("class") %>% 
    str_extract("\\d(-\\d)?$")
)
#> # A tibble: 140 × 3
#>    `titles <- ...`                                               autho…¹ ratin…²
#>    <chr>                                                         <chr>   <chr>  
#>  1 Lessons in Chemistry: A Novel                                 Bonnie… 4-5    
#>  2 The House in the Pines: A Novel                               Ana Re… 3-5    
#>  3 Without a Trace: A Novel                                      Daniel… 4-5    
#>  4 The Boys from Biloxi: A Legal Thriller                        John G… 4-5    
#>  5 Demon Copperhead: A Novel                                     Barbar… 4-5    
#>  6 Fairy Tale                                                    Stephe… 4-5    
#>  7 Tomorrow, and Tomorrow, and Tomorrow: A novel                 Gabrie… 4-5    
#>  8 Mad Honey: A Novel                                            Jodi P… 4-5    
#>  9 The Midnight Library: A Novel                                 Matt H… 4-5    
#> 10 Babel: Or the Necessity of Violence: An Arcane History of th… R. F K… 4-5    
#> # … with 130 more rows, and abbreviated variable names ¹`authors <- ...`,
#> #   ²`ratings <- ...`
Created on 2023-01-18 by the reprex package (v2.0.1)
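
The column headers in the output above read `titles <- ...` because `<-` inside `tibble()` performs an assignment and passes the result as an unnamed argument, so the column is labelled with the expression itself. A small offline sketch of the difference, using made-up values in place of the scraped vectors (the class strings are illustrative, not taken from the live page). It also shows one way to turn the extracted rating strings into numbers, assuming "4-5" stands for 4.5 stars.

```r
library(tibble)
library(stringr)

# Made-up stand-ins for the scraped vectors (not real scraped data).
title_vec <- c("Book A", "Book B")
class_vec <- c("a-icon a-icon-star-medium a-star-medium-4-5",
               "a-icon a-icon-star-medium a-star-medium-4")

# Same regex as above: a trailing digit, optionally followed by "-digit".
rating_chr <- str_extract(class_vec, "\\d(-\\d)?$")

# Assumption: "4-5" means 4.5, so swap the hyphen for a decimal point.
rating_num <- as.numeric(str_replace(rating_chr, "-", "."))

# Using `=` gives the columns proper names.
books <- tibble(
  title  = title_vec,
  rating = rating_num
)
print(books)

# Using `<-` instead leaves the column named after the whole expression.
print(names(tibble(x <- 1:2)))
```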

Techzill · January 18, 2023, 9:46pm

Thanks so much for this. However, after running it on my PC I got an empty tibble. Can you tell me what I am not doing right?

technocrat · January 18, 2023, 10:06pm

I can reproduce your result, @Techzill. @M_AcostaCH, do you have an authentication token being passed somewhere? My az_sess auth_token is

  .. ..$ auth_token: NULL

Or maybe the cookie expiration is set to 1970?

Techzill · January 18, 2023, 10:18pm

@technocrat, do I need to have a token? Can you please share your code and the token? Or is the token unique to each user?

technocrat · January 18, 2023, 10:24pm

I’m speculating that the reason you and I get an empty tibble is that neither of us has a token and @M_AcostaCH does. But I could be wrong about that.
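
One thing worth checking before blaming a missing token: the `auth_token` field shown by `str()` appears to belong to the httr request config that an rvest (1.0.x) session carries, and in httr that slot stays NULL unless an OAuth token is supplied. A minimal offline sketch, assuming httr is installed and that the session config is built with `httr::config()`:

```r
library(httr)

# Build the same kind of config object that str() displays inside a session.
cfg <- config(autoreferer = 1L)

# The config has an auth_token slot alongside method, url, headers, etc.
print(names(cfg))

# No token is present unless one is passed via config(token = ...).
print(is.null(cfg$auth_token))
```

If that holds, a working session and a failing one would both show the NULL, so it would not explain the empty tibble. Comparing the HTTP status of the two responses is likely more informative, for example with `httr::status_code(az_sess$response)`, assuming the session exposes its last response as `$response`.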

Techzill · January 18, 2023, 10:29pm

Let's wait for his response, then.

M_AcostaCH · January 19, 2023, 4:25am

This was my first time scraping Amazon. I found examples of how to do it and tried the code.

When I ran the script in the morning it worked, but when I ran it at night it returned no data.

I'm going to look into this problem and report back.
