Web Scraping From Unknown Number of Web Pages

Hello Everyone,

I am trying to finish a project that relies on web scraping for its primary data source. The final piece of the puzzle is how to scrape data from an unknown number of web pages.
I am currently using the httr and jsonlite packages to retrieve the data and create a data frame. If the number of pages were known, this would be an easy problem. However, the goal is to scrape data from a fixed start date to a dynamic end date (the system date). The data may fill 5 pages today but 7 pages a month from now, for example. Because this code must be automated, the number of pages in each session is unknown.
I would love some assistance with this! Thank you for your time. I have included my basic code with the API below.

note: I think that using a while loop may be successful, but this is beyond my current understanding of web scraping.

'''

 my_url <-  "https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22gameDate%22,%22direction%22:%22DESC%22%7D%5D&start=0&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=franchiseId%3D21%20and%20gameDate%3C=%222021-03-10%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-01-01%22%20and%20gameTypeId=2"
 my_raw_results <- httr::GET(my_url)
 nhl_content <- httr::content(my_raw_results, as = "text", encoding = "UTF-8")
 nhl_fromJSON <- jsonlite::fromJSON(nhl_content)

 completeNHL_df <- as.data.frame(nhl_fromJSON) 

'''

This code successfully scrapes the base URL (my_url). The pattern for creating the subsequent URLs is below:

'''

 baseURl_1 <-"https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22gameDate%22,%22direction%22:%22DESC%22%7D%5D&start="
 URL_page_No <- c(0,100,200,300)
 ## URL_page_No starts at 0 and increases by 100 each time the page number increases. (0=page1,100=page2,200=page3,300=page4 etc) <-I have used 0-300 as an example, in reality further values are needed (unknown number of webpages)
 baseURL_2 <- "&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=franchiseId%3D21%20and%20gameDate%3C=%22"
 URL_sys_date <- Sys.Date()
 baseURL_3 <- "%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-01-01%22%20and%20gameTypeId=2"

  ## paste0() is vectorised, so my_URL holds one URL per value in URL_page_No
  my_URL <- paste0(baseURl_1, URL_page_No, baseURL_2, URL_sys_date, baseURL_3)

'''
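
Because paste0() is vectorised, the pattern above produces one URL per offset. Here is a minimal sketch of that idea, wrapped in a hypothetical helper named build_page_urls(). The example.com address and its query parameters are placeholders for illustration, not the real endpoint. Offsets are formatted as integers because large doubles can otherwise print in scientific notation (for example 1e+05).

```r
# Hypothetical helper: build one URL per page offset.
# `base` is a placeholder address, not the real API endpoint.
build_page_urls <- function(offsets, end_date = Sys.Date(),
                            base = "https://example.com/summary") {
  paste0(
    base,
    "?start=", sprintf("%d", as.integer(offsets)),  # avoid "1e+05"-style offsets
    "&limit=100",
    "&end=", format(end_date, "%Y-%m-%d")
  )
}

urls <- build_page_urls(c(0, 100, 200), as.Date("2021-03-10"))
length(urls)  # one URL per offset
```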

Thank you for the help!

Hi,

It's too bad the API does not have official documentation, which makes everything more guesswork. I have come up with two possible solutions:

  1. Assume you don't know the total number of data points
library(httr)

baseURl_1 <-"https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22gameDate%22,%22direction%22:%22DESC%22%7D%5D&start="
baseURL_2 <- "&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=franchiseId%3D21%20and%20gameDate%3C=%22"
URL_sys_date <- Sys.Date()
baseURL_3 <- "%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-01-01%22%20and%20gameTypeId=2"

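URL_page_No <- 0
getNext <- TRUE
result <- list()
while (getNext) {
  # Request the next page of up to 100 records
  my_URL <- paste0(baseURl_1, URL_page_No, baseURL_2, URL_sys_date, baseURL_3)
  newData <- GET(my_URL)
  stop_for_status(newData)  # raise an error on a failed request instead of ending the loop quietly
  newData <- content(newData)$data

  if (length(newData) > 0) {
    result <- append(result, newData)
    URL_page_No <- URL_page_No + 100
  } else {
    # An empty page means we have read past the last record
    getNext <- FALSE
  }
}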
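
Since the original question asked for a data frame, here is a rough sketch of the same "keep going until a page is empty" idea that returns one. The helper fetch_all_pages() and the fake data are hypothetical and only there so the example runs offline. The cap of 1000 pages and the shortcut of stopping on a short page are my own assumptions (the latter assumes the server always fills a page when more rows exist).

```r
# Hypothetical helper: keep requesting pages until one comes back empty.
# `fetch_page` is any function that takes a start offset and returns a
# data frame of rows (zero rows when there is nothing left).
fetch_all_pages <- function(fetch_page, page_size = 100, max_pages = 1000) {
  pages <- list()
  start <- 0L
  for (i in seq_len(max_pages)) {       # hard cap guards against an endless loop
    page <- fetch_page(start)
    if (is.null(page) || NROW(page) == 0) break
    pages[[length(pages) + 1]] <- page
    if (NROW(page) < page_size) break   # assumed: a short page is the last one
    start <- start + as.integer(page_size)
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)                 # stack all pages into one data frame
}

# Offline stand-in for the API: 250 fake rows served 100 at a time.
fake_rows <- data.frame(id = seq_len(250))
fake_fetch <- function(start) {
  idx <- seq_len(nrow(fake_rows))
  fake_rows[idx > start & idx <= start + 100, , drop = FALSE]
}

all_rows <- fetch_all_pages(fake_fetch)
nrow(all_rows)
```

With the real API, fetch_page could be something like function(start) jsonlite::fromJSON(paste0(baseURl_1, start, baseURL_2, URL_sys_date, baseURL_3))$data, assuming every page comes back with the same columns so that rbind() can stack them.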
  2. Use the "total" value returned with a request to get the total number of data points in the set
library(httr)

baseURl_1 <-"https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22gameDate%22,%22direction%22:%22DESC%22%7D%5D&start="
baseURL_2 <- "&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=franchiseId%3D21%20and%20gameDate%3C=%22"
URL_sys_date <- Sys.Date()
baseURL_3 <- "%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-01-01%22%20and%20gameTypeId=2"
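URL_page_No <- 0

# One request up front, only to read the total number of records
my_URL <- paste0(baseURl_1, URL_page_No, baseURL_2, URL_sys_date, baseURL_3)
newData <- GET(my_URL)
stop_for_status(newData)

n <- content(newData)$total

# Start offsets 0, 100, 200, ...; using n - 1 avoids an empty last request when n is a multiple of 100
offsets <- seq(0, max(n - 1, 0), by = 100)

# lapply() always returns a list (sapply() may simplify to a matrix); flatten one level
result <- unlist(lapply(offsets, function(URL_page_No) {
  my_URL <- paste0(baseURl_1, URL_page_No, baseURL_2, URL_sys_date, baseURL_3)
  content(GET(my_URL))$data
}), recursive = FALSE)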
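
To make the offset arithmetic in the second approach easier to check, here is a small sketch with a hypothetical helper, page_offsets(). It assumes 100 records per page, as in the URLs above, and returns no offsets at all when the total is zero.

```r
# Hypothetical helper: start offsets needed to cover `total` records.
page_offsets <- function(total, page_size = 100) {
  if (total <= 0) return(integer(0))
  as.integer(seq(0, total - 1, by = page_size))
}

page_offsets(250)  # 0 100 200
page_offsets(200)  # 0 100 (no empty trailing request at 200)
```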

Hope this helps,
PJ


Thank you!! This was very helpful!
