Web scraping using the search bar

This is the website I am using as an example.


This is what you see when you open the webpage. I want to be able to search for a school (in R) and pull up its 2020-21 game log so I can save the table for later use, using the highlighted search bar in the photo. Any help is appreciated, thank you!

The code below bypasses the search bar entirely: it goes directly to the 2021 game-log page of whatever team you assign to the variable team and returns the table. You'll probably want to do a bit of cleaning if you want the data to be more R-friendly, but I'm not sure what you're doing with the data, so I can't demo that for you. I'm also not sure whether this will work for you, since it doesn't actually use the search bar functionality of the web page. If it doesn't, we'll need to know more about your specific use case to help.


require(tidyverse)
require(rvest)

## Create variables that store the URL substrings that come before and after the team name
url_sub1 <- 'https://www.sports-reference.com/cbb/schools/'
url_sub2 <- '/2021-gamelogs.html'

# Assign the name of the university to be scraped
team <- 'kansas-state'

## Here we do the actual scraping
df1 <- paste0(url_sub1, team, url_sub2) %>% # build the URL to scrape by combining the team name with the URL substrings
  read_html() %>% # scrape the html
  html_node('#sgl-basic') %>% # pull the html element that corresponds to the table
  html_table() # auto-format the table
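
For what it's worth, here is a minimal sketch of what that cleaning might look like. It assumes (as is typical for Sports-Reference game logs) that the real column names come back in the first row of the scraped table and that header rows are repeated partway down the body; the exact layout can vary with your rvest version, so adjust to whatever df1 actually contains. The name df1_clean is just for illustration.

## Sketch only -- assumes the first scraped row holds the real column names
df1_clean <- df1 %>%
  set_names(make.unique(as.character(unlist(.[1, ])))) %>% # promote the first row to unique column names
  slice(-1) %>% # drop the promoted header row
  filter(.[[1]] != names(.)[1], .[[1]] != '') %>% # drop blank rows and header rows repeated in the body
  type.convert(as.is = TRUE) # re-guess the column types (the data.frame method needs R >= 4.0)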


This is actually exactly what I was looking for! I was originally thinking of a way to send a string to the search bar, but this is way better!

Update to this: how could I store the teams' tables as a vector?

You can't store a data frame in a vector, but you can store one in a list. The code below takes a vector of four team names, scrapes them all, and stores each team's table as its own element of the list named list_df.

require(tidyverse)
require(rvest)


fn_scrape <- function(.url1, .team, .url2){
  Sys.sleep(1.0) # wait one second between scrapes to be polite to the server
  paste0(.url1, .team, .url2) %>% # build the URL to scrape by combining the team name with the URL substrings
    read_html() %>% # scrape the html
    html_node('#sgl-basic') %>% # pull the html element that corresponds to the table
    html_table() # auto-format the table
}


## Create variables that store the URL substrings that come before and after the team name
url_sub1 <- 'https://www.sports-reference.com/cbb/schools/'
url_sub2 <- '/2021-gamelogs.html'

# Assign the names of the universities to be scraped
team <- c('kansas-state', 'kansas', 'texas-tech', 'texas')

## Here we do the actual scraping
list_df <- team %>%
  map(~fn_scrape(.url1 = url_sub1, .team = .x, .url2 = url_sub2)) # one list element per team
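
One optional follow-up, offered as a sketch rather than part of the answer above: naming the list elements after the teams makes individual tables easier to pull out, and if the tables all share the same columns you can stack them into one data frame (the column name 'school' below is just an example). Also, a misspelled team slug makes read_html() error and abort the whole map(); purrr's possibly() is one way to guard against that.

list_df <- set_names(list_df, team) # now e.g. list_df[['kansas']] works
df_all <- bind_rows(list_df, .id = 'school') # stack all the tables; the team name goes in a 'school' column

## Optional: a version of the scraper that returns NULL instead of erroring on a bad URL
fn_scrape_safe <- possibly(fn_scrape, otherwise = NULL)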

I am actually extremely interested in getting better at web scraping. I don't know if you'd be up for it, but would you be available to Zoom about it? Thank you so much!
