Webscraping with rvest and login

Hi, guys!

I have been trying to scrape a web page, but I'm lost.

First I need to log in, and then scrape the information.

How can I access the form and fill it in with rvest, httr or jsonlite?


Part of the HTML:

<form class="_ab3b" id="loginForm" method="post">
....
<input aria-label="Telefone, nome de usuário ou email" aria-required="true" autocapitalize="none" autocorrect="off" maxlength="75" name="username" type="text" class="_aa4b _add6 _ac4d" value="username">
....
<input aria-label="Senha" aria-required="true" autocapitalize="none" autocorrect="off" name="password" type="password" class="_aa4b _add6 _ac4d" value="password">

I have been trying something like:

url <- "https://www.EXAMPLE.com/accounts/login/"
session <- rvest::html_session(url)

form <-
  rvest::read_html(url) |>
  rvest::html_element("body") |>
  rvest::html_form("form")


filled_form <- rvest::set_values(form,
                          username = "notmyrealemail",
                          password = "notmyrealpassword")

rvest::submit_form(session, filled_form)

player_page <- rvest::jump_to(page,
           "https://www.EXAMPLE.com/profile/?__a=1&__d=11")

But I can't get past the form part.

Have you tried RSelenium?

Hi, thanks!

Yes, but I don't know how to download the JSON file after logging in.

I'm not sure what you mean by not being able to get past the form part. Here's a pattern I've used successfully several times with rvest. Naturally, you'll need to adapt it to your site's structure. Pay special attention to any unnamed fields in the form. This tripped me up for a long time :)

library(rvest)  # also provides the %>% pipe

mainPageURL <- "https://..."
mySession <- session(mainPageURL)

login <- mySession %>% 
  session_jump_to("users/login.cgi") %>% 
  html_element(".srbasic") %>%
  html_form() %>% 
  html_form_set(
    username = "username",
    password = "password"
  )

# Fix up an unnamed field in the form object. Unnamed fields are not allowed and will cause an error.
login$fields[[4]]$name <- "button"

# Set the action field to use the login.cgi program
login$action <- "https://xxx.com/users/login.cgi"

# Submit the form; the result is a new session object for the logged-in site
logged_in <- mySession %>% session_submit(login)
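
For the form in the question, the same pattern might look roughly like the sketch below. It parses an inline copy of the form so that it runs offline. The `action` attribute, the base URL and the helper `fetch_json_after_login()` are assumptions made for illustration, not something taken from the real site. The helper is only defined, not called, and it assumes the profile URL returns JSON once the session is logged in.

```r
library(rvest)

# Stand-in for the downloaded login page. Assumption: the real page serves
# the form in its static HTML; the action attribute here is hypothetical.
page <- read_html('
  <form id="loginForm" method="post" action="/accounts/login/">
    <input name="username" type="text" value="">
    <input name="password" type="password" value="">
  </form>
')

# html_form() wants the <form> node itself, so select it by id first
login <- page %>%
  html_element("#loginForm") %>%
  html_form(base_url = "https://www.EXAMPLE.com") %>%
  html_form_set(
    username = "notmyrealemail",
    password = "notmyrealpassword"
  )

# Hypothetical helper: submit the filled form on a live session, then request
# a URL that is assumed to return JSON and parse the response body.
fetch_json_after_login <- function(login_url, form, json_url) {
  logged_in <- session_submit(session(login_url), form)
  resp <- session_jump_to(logged_in, json_url)
  txt <- httr::content(resp$response, as = "text", encoding = "UTF-8")
  jsonlite::fromJSON(txt)
}
```

If `html_element("#loginForm")` comes back missing on the real page, the form is probably built by JavaScript after the page loads. rvest does not run JavaScript, so a browser-driven tool such as RSelenium would be the fallback in that case.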
