Hello Everyone,
I am attempting to finish a project that relies on web scraping for its primary data source. The final piece of the puzzle is how to scrape data from an unknown number of webpages.
I am currently using the httr and jsonlite packages to request the data and build a data frame. If the number of pages were known, this would be an easy problem. However, the goal is to scrape data from a fixed start date up to a dynamic end date (the system date). The data may take up 5 webpages today but 7 webpages a month from now, for example. Because this code must be automated, the number of webpages in each session is unknown.
I would love some assistance with this! Thank you for your time. I have included my basic code with the API below.
Note: I think a while loop might work, but this is beyond my current understanding of web scraping.
```r
my_url <- "https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22gameDate%22,%22direction%22:%22DESC%22%7D%5D&start=0&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=franchiseId%3D21%20and%20gameDate%3C=%222021-03-10%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-01-01%22%20and%20gameTypeId=2"
my_raw_results <- httr::GET(my_url)
nhl_content <- httr::content(my_raw_results, as = "text", encoding = "UTF-8")
nhl_fromJSON <- jsonlite::fromJSON(nhl_content)
completeNHL_df <- as.data.frame(nhl_fromJSON)
```
This code successfully scrapes the base URL (my_url). The pattern for creating the secondary URLs is below:
```r
baseURL_1 <- "https://api.nhle.com/stats/rest/en/skater/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22gameDate%22,%22direction%22:%22DESC%22%7D%5D&start="
URL_page_No <- c(0, 100, 200, 300)
## URL_page_No starts at 0 and increases by 100 for each page (0 = page 1, 100 = page 2, 200 = page 3, 300 = page 4, etc.).
## 0-300 is only an example; in reality further values are needed (unknown number of webpages).
baseURL_2 <- "&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=franchiseId%3D21%20and%20gameDate%3C=%22"
URL_sys_date <- Sys.Date()
baseURL_3 <- "%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-01-01%22%20and%20gameTypeId=2"
my_URL <- paste0(baseURL_1, URL_page_No, baseURL_2, URL_sys_date, baseURL_3)
```
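To make the while-loop idea concrete, here is a rough sketch of what I imagine it could look like; it is untested against the live API, so treat it as a guess rather than a working solution. The names `scrape_all_pages`, `fetch_nhl_page`, and `max_pages` are made up for illustration. It assumes that the API returns its rows in a `data` element, that a page with fewer than 100 rows is the last one, and that every page has the same flat columns so `rbind` can stack them.

```r
# Sketch: keep requesting pages until one comes back short or empty.
# `fetch_page` is any function that takes a `start` offset and returns a data frame.
# `max_pages` is an arbitrary safety cap so the loop cannot run forever.
scrape_all_pages <- function(fetch_page, limit = 100, max_pages = 1000) {
  pages <- list()
  start <- 0
  requests <- 0
  while (requests < max_pages) {
    page <- fetch_page(start)
    requests <- requests + 1
    n <- NROW(page)  # 0 for NULL, an empty list, or an empty data frame
    if (n > 0) {
      pages[[length(pages) + 1]] <- page
    }
    # Assumption: a page with fewer than `limit` rows is the last page.
    if (n < limit) break
    start <- start + limit
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Hypothetical fetcher for the API above. Assumes the rows sit in a `data`
# element and that baseURL_1, baseURL_2 and baseURL_3 are defined as shown earlier.
fetch_nhl_page <- function(start) {
  # format() avoids scientific notation (e.g. "1e+05") in the URL for large offsets.
  offset <- format(start, scientific = FALSE)
  url <- paste0(baseURL_1, offset, baseURL_2, Sys.Date(), baseURL_3)
  response <- httr::GET(url)
  httr::stop_for_status(response)  # fail loudly on HTTP errors
  text <- httr::content(response, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(text)$data
}
```

If those assumptions hold, the call would be `completeNHL_df <- scrape_all_pages(fetch_nhl_page)`. Passing the fetcher in as an argument also means the loop can be tried out with a fake function before hitting the real API.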
Thank you for the help!