How to scrape a table from a txt file containing HTML code?

Hi, I have the following problem. I downloaded the HTML body of a table and saved it as myfile.txt. See the header and the first two rows below:

<tr class="header">
<th class="display_name"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Beneficiary</font></font></th>
<th class="district_display"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Location of the beneficiary</font></font></th>
<th class="formatted_year"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Period of payments received</font></font></th>
<th class="sum"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">EAGF and EAFRD, EUR</font></font></th>
</tr>
<tr onclick="show_hide_tr('pd_1');" style="cursor: pointer; width: 100%;" class=" row1"><td class="display_name"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Dzidra Breidaga</font></font></td>
<td class="district_display"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Salacgriva county</font></font></td>
<td class="formatted_year"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">2017-2018</font></font></td>
<td class="sum"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">
4491.20
</font></font></td>
</tr>
<tr onclick="show_hide_tr('pd_2');" style="cursor: pointer; width: 100%;" class=" row1"><td class="display_name"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Jānis Mikijanskis</font></font></td>
<td class="district_display"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Ludzas nov.</font></font></td>
<td class="formatted_year"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">2017-2018</font></font></td>
<td class="sum"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">
2926.31
</font></font></td>
</tr>

I would like to read this file as a data frame in R. I tried this:

library(rvest)

adresa <- 'C:/Users/.../myfile.txt'
table <- html_nodes(adresa, "table")

But I got an error "Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character""

Desired output is:

Beneficiary	Location of the beneficiary	Period of payments received	EAGF and EAFRD, EUR
Dzidra Breidaga	Salacgriva county	2017-2018	4491.20
Jānis Mikijanskis	Ludzas nov.	2017-2018	2926.31

How can I fix it please? Thanks

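There are two problems. First, html_nodes() expects a parsed document rather than a character string, so you need to parse the text with xml2::read_html() (rvest re-exports it) before selecting anything. Second, you can't select a node that isn't present in the data: your file contains only <tr> rows, so you need to wrap them in <table></table> before there is a table to select. The reprex below shows both fixes.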

tb_text <- '<tr class="header">
<th class="display_name"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Beneficiary</font></font></th>
<th class="district_display"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Location of the beneficiary</font></font></th>
<th class="formatted_year"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Period of payments received</font></font></th>
<th class="sum"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">EAGF and EAFRD, EUR</font></font></th>
</tr>
<tr onclick="show_hide_tr(\'pd_1\');" style="cursor: pointer; width: 100%;" class=" row1"><td class="display_name"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Dzidra Breidaga</font></font></td>
<td class="district_display"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Salacgriva county</font></font></td>
<td class="formatted_year"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">2017-2018</font></font></td>
<td class="sum"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">
4491.20
</font></font></td>
</tr>
<tr onclick="show_hide_tr(\'pd_2\');" style="cursor: pointer; width: 100%;" class=" row1"><td class="display_name"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Janis Mikijanskis</font></font></td>
<td class="district_display"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">Ludzas nov.</font></font></td>
<td class="formatted_year"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">2017-2018</font></font></td>
<td class="sum"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">
2926.31
</font></font></td>
</tr>'

library(tidyverse)
library(rvest)

# you can't select a node that isn't present
read_html(tb_text) %>% 
  html_nodes("table")
#> {xml_nodeset (0)}

# you can select it when present
read_html(paste("<table>",tb_text,"</table>")) %>% 
  html_nodes("table") %>% 
  html_table()
#> [[1]]
#>         Beneficiary Location of the beneficiary Period of payments received
#> 1   Dzidra Breidaga           Salacgriva county                   2017-2018
#> 2 Janis Mikijanskis                 Ludzas nov.                   2017-2018
#>   EAGF and EAFRD, EUR
#> 1             4491.20
#> 2             2926.31

Created on 2020-11-03 by the reprex package (v0.3.0)
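
The reprex above works on a string pasted into the script. To apply the same idea to the saved file, something like the following might work. It is only a sketch: the temporary file and its shortened two-column content are stand-ins I made up so the example runs on its own, and `path` should point at the real myfile.txt instead.

```r
library(rvest)

# Stand-in for the downloaded file (assumption): a temporary file holding
# bare <tr> rows. In practice, set `path` to the location of myfile.txt.
path <- tempfile(fileext = ".txt")
writeLines(c(
  '<tr class="header"><th>Beneficiary</th><th>EAGF and EAFRD, EUR</th></tr>',
  '<tr class=" row1"><td>Dzidra Breidaga</td><td>',
  '4491.20',
  '</td></tr>'
), path)

# Read the whole file into a single string
rows <- paste(readLines(path, encoding = "UTF-8", warn = FALSE), collapse = "\n")

# Wrap the rows in <table> tags so there is a table node to select
page <- read_html(paste0("<table>", rows, "</table>"))

# html_node() returns the first match, so html_table() gives one data frame
df <- html_table(html_node(page, "table"))
df
```

Using html_node() instead of html_nodes() avoids the `[[1]]` step, since there is only one table to extract.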
