html_nodes() returns an xml_nodeset object, which is normally passed on to html_table() to convert the tables into data frames.
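For reference, here is a minimal sketch of that usual pipeline. It assumes the rvest package is installed and uses a small made-up inline page (not the Wikipedia one) so that it runs without network access; the table contents are purely illustrative.

```r
library(rvest)

# Hypothetical inline page with two tables, standing in for a real URL.
page <- read_html('<html><body>
  <table><tr><th>country</th><th>population</th></tr>
         <tr><td>A</td><td>10</td></tr></table>
  <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
</body></html>')

# One entry per <table> element matched by the CSS selector.
nodes <- html_nodes(page, "table")
class(nodes)
length(nodes)

# html_table() on a node set gives a list with one data frame per table.
tables <- html_table(nodes)
first <- tables[[1]]
first
```

The position of a table in `tables` follows the order of the matching `<table>` elements in the document, including any layout or navigation tables, which is why the index can differ from what you count visually on the page.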
As you know, sometimes the result contains spurious data, and you need to understand where it comes from.
So the question is: how do you explore the contents of an xml_nodeset?
In the following web-scraping example, the author wants to extract the 3rd table, but in the xml_nodeset result it turns out to be the 6th, which was found by trial and error.
    library(rvest)
    population_html <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_in_1900")
    population_nodes <- html_nodes(population_html, "table")
    population_nodes
    View(population_nodes)
    str(population_nodes)
View(), print(), and str() don't really show much about the information contained in each node.
The problem gets bigger if I try to find a way to filter the tables below the main-content div, as the str() output is too large and View() shows a tree with no information.
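One possible way to get a readable overview, sketched here on a made-up inline page rather than the real one, is to build a small summary data frame with one row per node. This assumes rvest and xml2 are installed; the `main-content` id, the class names, and the `overview` variable are assumptions for illustration only and may not match the actual Wikipedia markup.

```r
library(rvest)
library(xml2)

# Hypothetical page: one table outside and one inside a div with id "main-content".
page <- read_html('<html><body>
  <table class="nav"><tr><td>menu</td></tr></table>
  <div id="main-content">
    <table class="wikitable"><tr><th>country</th></tr><tr><td>A</td></tr></table>
  </div>
</body></html>')

nodes <- html_nodes(page, "table")

# One row per node: position, class attribute, XPath location, number of <tr> rows.
overview <- data.frame(
  index = seq_along(nodes),
  class = html_attr(nodes, "class"),
  path  = xml_path(nodes),
  rows  = vapply(nodes, function(n) length(html_nodes(n, "tr")), integer(1)),
  stringsAsFactors = FALSE
)
print(overview)

# Restrict the search to tables nested under the div (the id is an assumption).
main_tables <- html_nodes(page, "div#main-content table")
html_attr(main_tables, "class")
```

The `path` column shows where each table sits in the document tree, which makes it easier to see why the table you want ends up at an unexpected index, and the descendant selector avoids indexing by position altogether.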