Issue using read_html with rvest

blayer3 · September 3, 2022, 11:35am

Long-time casual R enthusiast, first-time poster (please be gentle).

I'm having problems with a dataframe pulled from a website using the rvest read_html function.

I am trying to write a script which scrapes a table of tennis player info from tennisabstract.com. I was previously doing this without issue using the following:

df_tennis <-  read.delim("C:/Users/blair/OneDrive/Desktop/ATP MENS ELO 27.06.22")

I wanted to make things more efficient with read_html (rvest) so I wouldn't need to manually copy and paste the table from the website into a csv every time I run my script.

The following code is what I am using to scrape the table. I convert it to a dataframe to make it compatible with my existing code from the earlier script using read.delim(). I then use filter to pull row data for a specific player.

library(rvest)  # read_html(), html_element(), html_table()
library(dplyr)  # %>% and filter()

atp_elo <- read_html("http://tennisabstract.com/reports/atp_elo_ratings.html")
tennis <<- atp_elo %>%
  html_element("#reportable") %>%
  html_table()
# remove empty columns
df_tennis <<- as.data.frame(tennis[-c(5, 9, 13)])

player1_info <<- df_tennis %>%
  filter(Player == "Novak Djokovic")

but this returns a dataframe of 0 obs. of 13 variables. If I filter for a specific rank, then I get the information I want, but I need to be able to pull rows using a player's name. I was using the exact same method in my earlier code, so I suspect the dataframe produced using read_html is formatted differently in some way.
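One way to check a suspicion like this is to compare the underlying code points of a scraped value with the string typed into filter(). The sketch below uses base R only and a made-up stand-in for the scraped name; that the real page separates names with a non-breaking space is an assumption for illustration, not something confirmed here.

```r
# Hypothetical stand-in for a scraped name: "\u00a0" is a non-breaking space,
# which prints just like an ordinary space.
scraped <- "Novak\u00a0Djokovic"
typed   <- "Novak Djokovic"

# The two strings look the same when printed but do not compare as equal.
scraped == typed

# Comparing Unicode code points shows what each string really contains:
# 160 is a non-breaking space, 32 is an ordinary space.
utf8ToInt(scraped)
utf8ToInt(typed)

# Position(s) at which the two strings differ.
which(utf8ToInt(scraped) != utf8ToInt(typed))
```

Running utf8ToInt() on df_tennis$Player[1] in the same way would show whether the scraped column contains anything other than ordinary spaces.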

For your reference, the earlier version of my script that works:

df_tennis <<- read.delim("C:/Users/blair/OneDrive/Desktop/ATP MENS ELO 27.06.22")

player1_info <<- df_tennis %>%
  filter(Player == "Novak Djokovic")

Note that this returns a dataframe of 1 obs. of 16 variables (because I never had to remove the 3 empty columns when using read.delim). The overall length of the dataframes is also different because the above code uses an older version of tennisabstract data (I was having the same problem when this data was current).

I would appreciate any help fixing this issue and understanding why it occurred.

Cheers!

scottyd22 · September 3, 2022, 12:09pm

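Using str_squish() (from stringr) to clean up the whitespace in the Player column appears to give the expected result. It trims both ends of each string and collapses any run of whitespace into a single space.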

library(dplyr)    # mutate(), filter()
library(stringr)  # str_squish()

df_tennis <- as.data.frame(tennis[-c(5, 9, 13)]) %>%
  mutate(Player = str_squish(Player))

player1_info <- df_tennis %>%
  filter(Player == "Novak Djokovic")

player1_info
#>   Rank         Player Age    Elo HardRaw ClayRaw GrassRaw hElo   cElo   gElo
#> 1    1 Novak Djokovic  35 2187.8  2054.2  2023.2   1976.6 2121 2105.5 2082.2
#>     Peak Match Peak Age Peak Elo
#> 1 2016 Miami F     28.8   2469.9

Created on 2022-09-03 with reprex v2.0.2.9000
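A small self-contained sketch of why this probably helps. The player strings below are hypothetical stand-ins for scraped values, and the non-breaking space ("\u00a0") in them is an assumption about what the page contains. str_squish() treats such characters as whitespace, so after squishing the names should match plain typed strings.

```r
library(stringr)

# Hypothetical scraped values: a non-breaking space ("\u00a0") between the
# names, plus stray leading and trailing spaces on the second entry.
players <- c("Novak\u00a0Djokovic", "  Rafael\u00a0Nadal ")
typed   <- c("Novak Djokovic", "Rafael Nadal")

# str_squish() trims both ends and replaces each run of whitespace
# (non-breaking spaces included) with a single ordinary space.
cleaned <- str_squish(players)

# Compare the raw and cleaned values against the typed names.
players == typed
cleaned == typed
```

Base trimws() with its default settings only strips spaces, tabs, and line breaks from the ends of a string, so it would leave a non-breaking space in the middle of a name untouched.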
