Scraping Multiple URLs and Pages

Hi,

I have a URL in which I want to vary two different parts. I can already scrape multiple pages thanks to the purrr package.

I want to scrape wines from the first three pages for both "Washington" and "Basic".

URL=https://www.winemag.com/?s=washington&drink_type=wine&page=2&search_type=all

As you can see below, I can scrape multiple pages, but I also want to scrape for both Washington and Basic using "%s".

library(purrr)
library(rvest)

url_base <- "http://www.winemag.com/?s=washington&drink_type=wine&page=%d"

map_df(1:3, function(i) {

  # print a dot per page as a simple progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
             excerpt=html_text(html_nodes(pg, "div.excerpt")),
             rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation=html_text(html_nodes(pg, "span.appellation")),
             price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors=FALSE)
})

Thank you!

Hi,

How about this:

allLinks = unlist(lapply(c("washington", "basic"), function(s){
  sprintf("https://www.winemag.com/?s=%s&drink_type=wine&page=%i&search_type=all", 
             s, 1:3)
}))

allLinks
[1] "https://www.winemag.com/?s=washington&drink_type=wine&page=1&search_type=all"
[2] "https://www.winemag.com/?s=washington&drink_type=wine&page=2&search_type=all"
[3] "https://www.winemag.com/?s=washington&drink_type=wine&page=3&search_type=all"
[4] "https://www.winemag.com/?s=basic&drink_type=wine&page=1&search_type=all"     
[5] "https://www.winemag.com/?s=basic&drink_type=wine&page=2&search_type=all"     
[6] "https://www.winemag.com/?s=basic&drink_type=wine&page=3&search_type=all"
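
A side note on why this works: sprintf() is vectorised, so passing 1:3 for the page placeholder gives three URLs per search term. If it helps to keep the term and page number next to each URL (for example, to label the scraped rows later), a minimal sketch could look like the following. The helper name build_links and the column names are made up for illustration; only base R is used.

```r
# Hypothetical helper (not from the thread): one URL per term/page pair,
# with the term and page kept as columns next to the URL.
build_links <- function(terms, pages) {
  # expand.grid varies its first argument fastest, so pages cycle within each term
  grid <- expand.grid(page = pages, term = terms, stringsAsFactors = FALSE)
  grid$url <- sprintf(
    "https://www.winemag.com/?s=%s&drink_type=wine&page=%d&search_type=all",
    grid$term, grid$page
  )
  grid[, c("term", "page", "url")]
}

links <- build_links(c("washington", "basic"), 1:3)
print(links$url)
```

The url column should hold the same six links as allLinks above, in the same order.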

result = map_df(allLinks, function(link){
  pg <- read_html(link)
  #rest of code goes here ...
})
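
One thing to watch for when mapping over several links: a single failed request stops the whole run, and the combined result no longer says which link each row came from. Below is a rough sketch of one way to handle both, assuming the per-page scraping code is wrapped in a function that takes a link and returns a data frame. The names scrape_all, scrape_one, fake_scraper and source_url are hypothetical, and the stand-in scraper exists only so the example runs offline; a real one would call read_html(link) and build the data frame as in the question.

```r
# Hypothetical wrapper (not from the thread): apply a per-page scraper to each
# link, skip links that fail, and record which link each row came from.
scrape_all <- function(links, scrape_one) {
  pages <- lapply(links, function(link) {
    out <- tryCatch(scrape_one(link), error = function(e) NULL)
    if (is.null(out) || nrow(out) == 0) return(NULL)
    out$source_url <- link
    out
  })
  pages <- Filter(Negate(is.null), pages)  # drop failed or empty pages
  do.call(rbind, pages)
}

# Stand-in scraper so the sketch runs without network access.
fake_scraper <- function(link) {
  if (grepl("page=2", link, fixed = TRUE)) stop("simulated failure")
  data.frame(wine = paste("wine from", link), stringsAsFactors = FALSE)
}

demo_links <- sprintf(
  "https://www.winemag.com/?s=washington&drink_type=wine&page=%d&search_type=all",
  1:3
)
result <- scrape_all(demo_links, fake_scraper)
print(result$source_url)
```

Here page 2 fails on purpose, so result keeps only the rows from pages 1 and 3. This uses base R rather than map_df to stay dependency-free; the same idea should carry over to the purrr version.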

PJ

Thanks PJ!!! :smiley:
