Help with logging into a web site with rvest

I keep banging my head against the wall with what I think should be a simple problem. Would appreciate any advice/direction.

I'm trying to scrape the results of complex web queries (encoded as URLs) from baseball-reference.com. I am a subscriber to the site. When I log into the site and submit the queries manually, they return the full data that I want as an HTML table. Non-subscribers get a subset of the return data that omits the Top 10 records in the data set.

In R, using rvest, I create a session, fill in the login form, and submit it (receiving a 200 status code, which suggests the submission succeeded), but the data returned by my query URL is always as though I am not logged in (in other words, it's missing the first 10 rows).

Here's a rough excerpt of what I'm trying to do (clearly un-optimized as yet).

POST_LOGIN_URL <- "https://www.baseball-reference.com/my/"
REQUEST_URL <- "URL to request report"  # placeholder for the actual query URL

session <- html_session(POST_LOGIN_URL)
form <- html_form(session)[[2]]
form <- set_values(form, user_ID = "(my userID)", password = "(my password)")
form$url <- POST_LOGIN_URL  # force the form to post back to the login page
session2 <- submit_form(session, form)
reply <- session2 %>% jump_to(REQUEST_URL) %>% read_html() %>% html_table()
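For what it's worth, a 200 status only means the POST round-trip succeeded, not that the login did; a common culprit is that the form's action URL or field names don't match what the site actually expects. Below is a minimal sketch of the same flow using the current rvest (>= 1.0) API names, with an explicit check that the login took effect. The form index `[[2]]`, the field names `user_ID`/`password`, and the logout-link selector are assumptions carried over from the post, not verified against the site:

```r
library(rvest)

POST_LOGIN_URL <- "https://www.baseball-reference.com/my/"

# Start a session and grab the login form. The index [[2]] is an
# assumption -- print html_form(sess) to find the right one.
sess <- session(POST_LOGIN_URL)
login_form <- html_form(sess)[[2]]

# Field names must match the form's actual <input name="..."> attributes;
# "user_ID" and "password" are taken from the original post.
filled <- html_form_set(login_form,
                        user_ID  = "(my userID)",
                        password = "(my password)")

# Submit within the same session so cookies carry forward.
sess2 <- session_submit(sess, filled)

# Verify the login by looking for a logged-in marker on the returned
# page (a logout link is a guess -- inspect the page to find one).
logged_in <- length(html_elements(read_html(sess2), "a[href*='logout']")) > 0
if (!logged_in) warning("Login may not have succeeded; check form index and field names.")

# Reuse the authenticated session for the report request.
reply <- sess2 |>
  session_jump_to("URL to request report") |>  # placeholder from the post
  read_html() |>
  html_table()
```

If the check fails even with correct credentials, the site may set its login cookie via JavaScript or redirect the POST elsewhere, in which case watching the browser's network tab for the real login endpoint is the next step.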

Again, any advice or pointers would be very much appreciated!

Thank you

-- Robert

I'm not sure this would be in compliance with their Terms of Service:

From Section 2

Except as specifically provided in this paragraph, you agree not to use or launch any automated system, including without limitation, robots, spiders, offline readers, or like devices, that accesses the Site in a manner which sends more request messages to the Site server in any given period of time than a typical human would normally produce in the same period by using a conventional on-line Web browser to read, view, and submit materials.

And later…

6. Site Content.

You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.

Thanks for your comment, Mara. My interest in this project, which is personal and non-commercial, began with the book Baseball Stats in R, whose companion blog included guides for this kind of scraping. The author indicated there that he'd had conversations with the site's owner, who cleared individual study such as this but prohibited mass data downloads.

All the same, you're right that I should clear that for myself with them directly before proceeding further.

-- Robert


Yeah, I think checking is good (especially since you don't want your IP address to get blocked, or something of the like). I also recommend the polite package for scraping in a way that complies with their robots.txt, etc.
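In case it helps, the basic polite workflow looks like the sketch below: bow() reads the site's robots.txt and sets up a rate-limited session, and scrape() only fetches pages that are permitted. The URL and delay here are just illustrative, and polite doesn't handle the authentication part:

```r
library(polite)
library(rvest)

# bow() introduces the scraper to the host: it fetches and parses
# robots.txt and records a crawl delay for all later requests.
session <- bow("https://www.baseball-reference.com/", delay = 5)

# scrape() fetches the bowed path only if robots.txt allows it,
# returning a parsed page you can pass to rvest functions.
page <- scrape(session)
tables <- html_table(page)
```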

Sorry I don't actually have answers re: the authentication part. 🙁


This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.