I keep banging my head against the wall with what I think should be a simple problem. Would appreciate any advice/direction.
I'm trying to scrape the results of complex web queries (encoded as URLs) from baseball-reference.com. I am a subscriber to the site. When I log into the site and submit the queries manually, they return the full data that I want as an HTML table. Non-subscribers get a subset of the return data that omits the Top 10 records in the data set.
In R, using rvest, I create a session, fill in the login form, and submit it. The submission returns a 200 status code, which I took to mean the login succeeded (though I realize it may only mean the HTTP request itself went through). Either way, the data returned by my query URL always looks as though I am not logged in: it's missing the first 10 rows.
Here's a rough excerpt of what I'm trying to do (clearly un-optimized as yet).
    POST_LOGIN_URL <- "https://www.baseball-reference.com/my/"
    REQUEST_URL <- *"URL to request report"*
    session <- html_session(POST_LOGIN_URL)
    form <- html_form(session)[]
    form <- set_values(form, user_ID = "*(my userID)*", password = "*(my password)*")
    form$url <- POST_LOGIN_URL
    session2 <- submit_form(session, form)
    reply <- session2 %>%
      jump_to(REQUEST_URL) %>%
      read_html() %>%
      html_table()
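In case it helps with diagnosis, here is a minimal sketch of the checks I could run to see whether the login actually stuck, rather than relying on the 200 alone. It is written against the rvest 1.0+ function names (`session()`, `html_form_set()`, `session_submit()`, `session_jump_to()`) instead of the older ones in my excerpt. The helper names `has_auth_cookie` and `check_login`, the `form_index` argument, and the cookie-name pattern are all hypothetical placeholders; I don't know what the site's cookies or form fields are really called, so those would need to be read off the printed form and cookie list.

```r
# Sketch only: the cookie-name pattern below is a guess, not taken from the site.
# Pure helper: does the cookie table contain anything that looks like a login cookie?
has_auth_cookie <- function(cookie_df, pattern = "sess|auth|login") {
  if (is.null(cookie_df) || nrow(cookie_df) == 0) return(FALSE)
  any(grepl(pattern, cookie_df$name, ignore.case = TRUE))
}

# Network part (needs rvest >= 1.0 and httr installed; not run on load).
# `fields` is a named list of form field values; `form_index` picks which
# form on the page to fill in. Both are assumptions to adjust after
# looking at the printed forms.
check_login <- function(login_url, request_url, fields, form_index = 1) {
  s <- rvest::session(login_url)
  forms <- rvest::html_form(s)
  print(forms)  # shows each form's action URL and field names

  # Fill in the chosen form without overriding its action URL
  form <- do.call(rvest::html_form_set, c(list(forms[[form_index]]), fields))
  s2 <- rvest::session_submit(s, form)

  # A 200 here only says the request completed; look at where we landed
  # and which cookies the response carries
  cookie_df <- httr::cookies(s2$response)

  # Request the report through the same session so its cookies are reused
  page <- rvest::session_jump_to(s2, request_url)

  list(
    status      = httr::status_code(s2$response),
    landed_on   = s2$url,
    cookies     = cookie_df$name,
    auth_cookie = has_auth_cookie(cookie_df),
    tables      = rvest::html_table(rvest::read_html(page))
  )
}
```

Calling something like `check_login(POST_LOGIN_URL, REQUEST_URL, list(user_ID = "...", password = "..."))` and comparing `landed_on` and `cookies` against what the browser shows after a manual login should at least tell me whether the problem is the login step or the later request.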
Again, any advice or pointers would be very much appreciated!