Accelerate Web Scraping Process

Hi,

Is there any way to accelerate the rvest scraping process?

I know the run time depends on the number of pages being scraped, but I wonder whether there is a way to speed it up anyway.

library(rvest)
library(purrr)

# url_base is a sprintf() URL template with a page-number placeholder, defined elsewhere
map_df(1:3, function(i) {

  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine = html_text(html_nodes(pg, ".review-listing .title")),
             excerpt = html_text(html_nodes(pg, "div.excerpt")),
             rating = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation = html_text(html_nodes(pg, "span.appellation")),
             price = gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors = FALSE)
})

Thank you...

Before we start, from an ethical standpoint: when you're scraping a site, unless it's run by a big player (Google, Microsoft, Yahoo, etc.), be polite and don't hammer it with requests. It could be someone's home server you end up crashing, or someone's hobby site you end up costing hundreds of dollars in billing from their provider.

That said, there are some things you might try.

First, check the site to see if they offer an API for the data you want. They probably won't, but a surprising number of sites do. If they do, learn the API; it's better for both you and them.
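As a rough sketch of what that usually looks like, here is a request to a made-up JSON endpoint with httr and jsonlite (the URL and query parameters are purely illustrative; a real API will document its own):

library(httr)
library(jsonlite)

# Hypothetical endpoint and parameters -- check the real API's documentation
resp <- GET("https://api.example.com/v1/reviews",
            query = list(page = 1, per_page = 100))
stop_for_status(resp)

# Parse the JSON body into a data frame
reviews <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))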

Second, if there's no API, check whether the page you want to scrape has an option for the number of records or items to show per page. If there is, select the largest value you can. Look to see whether it changes anything in the URL; sometimes you can edit this manually to view even more records at once (see the sketch just below). Or, if you can choose how many items to view but nothing changes in the URL, you might consider switching from rvest to RSelenium for your scraping and doing some custom JavaScript injection to change the number of items viewed.
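For example, if the site exposes a page-size parameter in its query string, you can ask for more rows per request and make fewer requests overall. The parameter names page and per_page here are assumptions; use whatever your target site actually puts in the URL:

library(rvest)

# Hypothetical URL template -- parameter names depend on the site
url_base <- "https://www.example.com/reviews?page=%d&per_page=100"

pg <- read_html(sprintf(url_base, 1))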

Next, you might consider doing the work in batches. Do multiple (possibly all) of your read_html() calls and save a list of page objects, then parse all of the pages later.
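A minimal sketch of that batching pattern, assuming the url_base template and CSS selectors from the question:

library(rvest)
library(purrr)

# Fetch every page first and keep the parsed HTML documents in a list
pages <- map(1:3, function(i) read_html(sprintf(url_base, i)))

# Parse the saved pages afterwards, without touching the network again
wines <- map_df(pages, function(pg) {
  data.frame(wine   = html_text(html_nodes(pg, ".review-listing .title")),
             rating = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             stringsAsFactors = FALSE)
})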

Aside from that, R is primarily single-threaded, and I've personally not found a great way to speed up the web scraping itself. But if I were going to scrape something that would take multiple days to complete, I would likely split the work and run the code on multiple Amazon EC2 instances. I'm not a parallelization expert, though, so there might be a better or easier way to do this.

You can use the furrr package to seamlessly parallelize map_ commands.
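A minimal sketch of that, again assuming the url_base template and selectors from the original code (and keeping the politeness caveat above in mind, since parallel requests hit the server harder):

library(rvest)
library(future)
library(furrr)

# Run the scraping function across several background R sessions
plan(multisession, workers = 4)

wines <- future_map_dfr(1:3, function(i) {
  pg <- read_html(sprintf(url_base, i))
  data.frame(wine   = html_text(html_nodes(pg, ".review-listing .title")),
             rating = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             stringsAsFactors = FALSE)
})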

