Web Scraping multiple pages within the same URL where rvest doesn't work

web-scraping

#1

Hi,

I want to scrape https://understat.com/league/La_liga/2018, where the players table is split across multiple pages within the same URL. rvest fails on it, so I wanted to check whether anyone has ideas on how to go about scraping this via R.

I am not an expert in web technologies, so I'm not really sure how to classify this kind of page.

Thanks


#2

Am I correct in understanding that the pagination is giving you problems? The page in its default state only gives you the first page, and the buttons dynamically repopulate the table. I think you probably have two options:

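  1. rvest has some functions that allow you to "Navigate around a website as if you're in a browser," though I haven't used them before. If you have a look at the documentation for html_session() in the rvest reference (renamed to session() in rvest 1.0.0), you can see how they work. It's possible you can use this to click the pagination buttons, but I wouldn't count on it: the session follows links and submits forms without running any JavaScript, so buttons that repopulate the table dynamically may be out of its reach.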
  2. Inspecting the page source, it looks like the tables are each accompanied by a <script> tag that loads the entire table's data in as a JavaScript variable using JSON.parse(). For some reason, it looks like the argument of JSON.parse() is a literal string—I'm not sure if it's there on page load, or if some sort of request afterward populates it (and I don't know why you'd do that either). But either way, it looks like a neat, structured way to get to the data! If you can get to the <script> tag with rvest and it's populated, you can strip out the function call and parse that string yourself :slight_smile:
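
To make option 2 a bit more concrete, here is a rough sketch of the "strip out the function call" step in base R. The helper name extract_json_literal() and the sample script text are made up for illustration, and I'm assuming the call looks like JSON.parse('...') with single quotes; check the real page source before relying on the pattern.

```r
# Hypothetical helper: pull the string literal out of a JSON.parse('...') call.
# Assumes single quotes around the argument and no single quote inside it.
extract_json_literal <- function(script_text) {
  m <- regmatches(
    script_text,
    regexec("JSON\\.parse\\('(.*?)'\\)", script_text, perl = TRUE)
  )[[1]]
  # m[1] is the whole match, m[2] the captured argument
  if (length(m) < 2) {
    return(NA_character_)
  }
  m[2]
}

# Made-up script text standing in for the contents of one <script> tag
sample_script <- "var playersData = JSON.parse('[{\"player_name\":\"A\",\"goals\":\"3\"}]');"
literal <- extract_json_literal(sample_script)
print(literal)

# With rvest, the script text could come from something like this
# (not run here; needs network access and rvest >= 1.0.0):
# page    <- rvest::read_html("https://understat.com/league/La_liga/2018")
# scripts <- rvest::html_text(rvest::html_elements(page, "script"))
```

The helper returns NA when a script contains no JSON.parse() call, so you can run it over every script on the page and keep only the hits.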

I hope those ideas help!

EDIT: someone with a stronger webdev background might be able to explain this better, but as far as I can see, no other external script is inserting that player data. I'm not familiar with jTable, but I wonder if they have some page build process that spits out JSON and drops it into the <script> tag for the plugin to process.

But either way, if that is the case, then using rvest to grab the contents of the <script> tag and processing the string literal yourself is probably the best way forward :smiley: I recommend using jsonlite to do that: it can often boil JSON straight down to a data frame!
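
For the jsonlite part, something along these lines might work once you have the string literal in hand. This is only a sketch: the sample string and its column names (player_name, goals) are invented, and I'm assuming the literal is written with \xNN hex escapes, which jsonlite won't understand on its own. If the real string is plain JSON, you can skip decode_hex_escapes() (a hypothetical helper) and call fromJSON() directly.

```r
library(jsonlite)

# Hypothetical helper: replace each \xNN escape with the character it encodes.
# Only safe for single-byte (ASCII) codes; multi-byte text would need more care.
decode_hex_escapes <- function(x) {
  m <- gregexpr("\\\\x[0-9A-Fa-f]{2}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
    vapply(codes, function(code) {
      rawToChar(as.raw(strtoi(substring(code, 3), base = 16L)))
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

# Made-up literal, escaped the way I'm assuming the page escapes it
literal_escaped <- paste0(
  "\\x5B\\x7B\\x22player_name\\x22\\x3A\\x22A\\x22,\\x22goals\\x22\\x3A\\x223\\x22\\x7D,",
  "\\x7B\\x22player_name\\x22\\x3A\\x22B\\x22,\\x22goals\\x22\\x3A\\x221\\x22\\x7D\\x5D"
)

json_text <- decode_hex_escapes(literal_escaped)
players <- fromJSON(json_text)  # array of objects -> data frame

# Numbers stored as strings in this sample, so convert them by hand
players$goals <- as.integer(players$goals)
print(players)
```

fromJSON() simplifies an array of flat objects into a data frame by default, which is why no extra reshaping is needed here.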