rvest: Trying to get data from a website that redirects

I'm using rvest to get some data from a site where the data is displayed 50 records at a time, using a page index in the URL like so:

https://www.site.domain/results?param=2&page=1

When I then ask for

&page=2

The site automatically redirects me back to

&page=1

This means I retrieve the first 50 records all over again. Does anyone have experience with how to access the data on pages 2, 3, ...?

Is it possible you need to escape the ampersand? Just guessing.

It's possible the website in question is checking the referrer and using JavaScript to disallow deep linking to content.

If that's the case you may need to employ more sophisticated web scraping techniques using something like the RSelenium package.
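
Before reaching for a browser, one cheap way to test the referrer theory is to request page 2 while presenting page 1 as the referrer, and then look at where you actually land. This is only a sketch: `page_url()` and `fetch_with_referer()` are hypothetical helpers, the URL is the placeholder from the question, and it assumes the httr and rvest packages are installed.

```r
# Hypothetical helper: build the URL for a given page index.
page_url <- function(page, base = "https://www.site.domain/results", param = 2) {
  sprintf("%s?param=%d&page=%d", base, param, page)
}

# Request one page while presenting the previous page as the referrer.
# httr reuses one handle per host, so cookies set by earlier requests
# in the same R session are sent along automatically.
fetch_with_referer <- function(page) {
  referer <- page_url(max(page - 1, 1))
  resp <- httr::GET(page_url(page), httr::add_headers(Referer = referer))
  # Page index in the URL we ended up on, after any HTTP redirects.
  landed <- httr::parse_url(resp$url)$query$page
  list(
    final_url  = resp$url,
    redirected = !identical(landed, as.character(page)),
    html       = rvest::read_html(
      httr::content(resp, as = "text", encoding = "UTF-8")
    )
  )
}

# Usage (needs network access, so left commented out):
# fetch_with_referer(1)          # visit page 1 first to pick up cookies
# res <- fetch_with_referer(2)
# res$redirected
```

If `redirected` is still `TRUE`, or the URL says page 2 but the content is page 1, the redirect is probably being done in JavaScript, which httr does not run.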

Even if I simply copy/paste the URL, incrementing &page=1 to &page=2, the site automatically redirects to &page=1. Only if I click the button to page 2 will the URL say &page=2 and show results accordingly.

Then the site is using JavaScript to (at the very least) enforce a navigation path through the data.

It's a common anti-webscraping technique.

If you want to automate scraping the data you'll need to do it in a way which mimics an actual user moving through the pages.

The way to do this in R is with RSelenium. It's a fairly involved process to get it up and running, and though I've done it a bunch and led a few seminars on it, I unfortunately am not in a position to walk you through it myself.

The best I can do is point you to a few online resources and if you have smaller questions along the way someone (maybe even I) will probably be able to point you in the right direction.

https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html


https://callumgwtaylor.github.io/blog/2018/02/01/using-rselenium-and-docker-to-webscrape-in-r-using-the-who-snake-database/
http://joshuamccrain.com/tutorials/web_scraping_R_selenium.html
https://rpubs.com/johndharrison/RSelenium-Docker
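
Once a browser session is up, clicking through the pages might look roughly like the sketch below. `scrape_pages()` is a hypothetical helper, and the `"a.next"` selector, the fixed pause, and the idea that the records sit in the first HTML table on the page are all assumptions; inspect the real page to find out what applies.

```r
# Collect the first table from each of `n_pages` result pages by clicking
# the site's own "next" control, the way a user would.
# `remDr` is an RSelenium remote driver already showing page 1.
scrape_pages <- function(remDr, n_pages, next_css = "a.next", wait = 2) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    # Hand the rendered page source over to rvest for parsing.
    html <- rvest::read_html(remDr$getPageSource()[[1]])
    pages[[i]] <- rvest::html_table(rvest::html_element(html, "table"))
    if (i < n_pages) {
      # Click the "next" control instead of editing the URL.
      remDr$findElement(using = "css selector", value = next_css)$clickElement()
      Sys.sleep(wait)  # crude pause so the next page can render
    }
  }
  pages
}

# Usage (needs a working Selenium setup, so left commented out):
# driver  <- RSelenium::rsDriver(browser = "firefox")
# remDr   <- driver$client
# remDr$navigate("https://www.site.domain/results?param=2&page=1")
# results <- scrape_pages(remDr, n_pages = 3)
# remDr$close()
# driver$server$stop()
```

A fixed `Sys.sleep()` is the simplest thing that can work; polling until the table content changes would be more robust.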

Scraping an adversarial website can be very fun if you like this sort of thing. Know that at times it is more art than science, as you will need to probe quite a bit to see how they try to thwart you and develop a response to it.

Things to watch out for:

  • Popups which render the rest of the page inaccessible until you dismiss them.
  • Assigning elements randomly generated names, IDs, or putting dummy elements into the page structure so you can't reliably access elements with a standardized xpath or css selector.
  • Pages which load a static page which then loads a completely dynamically generated page to display.
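
As a rough illustration of the first two points: when ids are randomized, matching on something stable such as the visible text tends to hold up better than matching on id or position. The toy HTML, the "Next" and "Close" labels, and the `dismiss_popup()` helper below are all made up for the example.

```r
# Toy page: the ids are random, but the visible link text is stable.
html <- xml2::read_html('
  <div id="x9f3a"><a id="k2j1" href="/results?param=2&amp;page=2">Next</a></div>
  <div id="q77zb"><a id="p0d4" href="/help">Help</a></div>
')

# Match on the visible text rather than on id or position.
next_link <- rvest::html_element(
  html, xpath = "//a[normalize-space(.) = 'Next']"
)
next_href <- rvest::html_attr(next_link, "href")

# The same idea for popups in a live RSelenium session: look for a close
# button by its text and click it only if one is present.
dismiss_popup <- function(remDr,
                          xpath = "//button[normalize-space(.) = 'Close']") {
  hits <- remDr$findElements(using = "xpath", value = xpath)
  found <- length(hits) > 0
  if (found) hits[[1]]$clickElement()
  invisible(found)
}
```

`findElements()` (plural) returns an empty list when nothing matches, so the helper does nothing on pages without a popup instead of throwing an error.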

You may need to spend a lot of time inspecting pages. I recommend you browse using Chrome and get used to inspecting elements (Ctrl + Shift + I) and maybe brush up on some JavaScript yourself so you can inject code in the Developer Console (Ctrl + Shift + J).
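
Anything that works in the Developer Console can usually be sent from R as well, via the remote driver's `executeScript()` method. A minimal sketch, assuming `remDr` is an open RSelenium session; `run_js()` is a hypothetical wrapper and the two snippets are only examples.

```r
# Run a JavaScript snippet in the current page and return the first
# element of whatever the driver sends back.
run_js <- function(remDr, script) {
  remDr$executeScript(script)[[1]]
}

# Usage (needs a live browser session, so left commented out):
# run_js(remDr, "return document.readyState;")
# run_js(remDr, "return document.querySelectorAll('table tr').length;")
```

The script needs an explicit `return` for a value to come back to R.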

You'll also have a MUCH, MUCH, MUCH easier time if you can watch your "headless" browser at work, so make sure you read about using VNC with RSelenium.

Good luck!


Excellent. Thank you for spending the time writing this answer, much appreciated! :+1: :slightly_smiling_face:
