List nth-child() recursively

Hello everyone,

I am web scraping a site whose data is embedded in many tables of the same class, up to 200 of those per HTML page. If I use:

*Data_Number <- Data_Content %>%*
*html_node(".TableDataArchive") %>%*

**it will extract only one set of values.**

**To extract all of the data from each page I have to create 200 of these commands:**

*Data_Number_6 <- Data_Content %>%*
*html_node(".TableDataArchive:nth-child(6)") %>%*

*Data_Number_7 <- Data_Content %>%*
*html_node(".TableDataArchive:nth-child(7)") %>%*

*Data_Number_n+1 <- Data_Content %>%*
*html_node(".TableDataArchive:nth-child(n+1)") %>%*

Is there a smart way to use for () to pass the value into nth-child(n+1)? I tried several approaches, but none of them worked. Or maybe someone can point me to documentation or examples I can use?

The errors looked like this:

> Error in if (a == 1 && b_min_1 <= 0) { : 
>   missing value where TRUE/FALSE needed
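
For reference, here is a minimal sketch of the loop being asked about. It assumes the position number only needs to be pasted into the selector string; the inline HTML, `positions`, `selectors`, and `cell_text` are hypothetical stand-ins, not the real site or the original code.

```r
library(rvest)

# Hypothetical stand-in for the real page: three sibling tables of the same class.
page <- read_html('
  <div>
    <table class="TableDataArchive"><tr><td>a</td></tr></table>
    <table class="TableDataArchive"><tr><td>b</td></tr></table>
    <table class="TableDataArchive"><tr><td>c</td></tr></table>
  </div>')

# Build one selector per position, with the number inserted as text.
positions <- 1:3
selectors <- sprintf(".TableDataArchive:nth-child(%d)", positions)

# Run each selector and keep the results in one list instead of 200 variables.
cell_text <- lapply(selectors, function(sel) {
  page %>% html_node(sel) %>% html_text(trim = TRUE)
})
```

The key point is that the selector is an ordinary string, so it has to be built with `sprintf()` or `paste0()` before it reaches `html_node()`. A position with no matching table gives an empty result, which is worth checking for when looping over pages of different lengths.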

I recently scraped a page that had 26 tables, but all on one page. Adding an "s" to html_node will get them all at once:

html_nodes(".TableDataArchive") %>% html_table()

And in my case I just did:

html_nodes("table") %>% html_table()

Not sure how much that will help in your case, but do try and let us know what you get back.
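
To illustrate the suggestion, here is a small self-contained sketch. The inline HTML and the `id`/`value` columns are made up for the example; only `html_nodes()` and `html_table()` come from the reply above.

```r
library(rvest)

# Hypothetical page with two tables that share a class and a column layout.
page <- read_html('
  <div>
    <table class="TableDataArchive">
      <tr><th>id</th><th>value</th></tr>
      <tr><td>1</td><td>x</td></tr>
    </table>
    <table class="TableDataArchive">
      <tr><th>id</th><th>value</th></tr>
      <tr><td>2</td><td>y</td></tr>
    </table>
  </div>')

# html_nodes() returns every match; html_table() then yields one table per node.
tables <- page %>%
  html_nodes(".TableDataArchive") %>%
  html_table()

# If all tables have the same columns, they can be stacked into one data frame.
combined <- do.call(rbind, tables)
```

`tables` is a list, so individual tables are reached with `tables[[1]]`, `tables[[2]]`, and so on. The `rbind` step only works when the tables really do share the same columns.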

Hi Jeremy,

Thank you for your prompt reply.

I think I tried that, but it was 2 AM, so I do not recall whether I actually did.

Anyway, I have retested the web scraping using html_nodes, and it does export the child tables, although not correctly. For instance, lots of special characters such as \t and \n are present in the output, and the date is unreadable.
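
One possible way to deal with the stray \t and \n characters, sketched in base R. The `raw_cells` values are invented sample strings, not output from the actual site, and this assumes the whitespace is just formatting noise rather than meaningful content.

```r
# Hypothetical cell values as they might come back from a messy table.
raw_cells <- c("\n\t\t 12.5 \n", "\tNorth\n\t\tRegion ", "2019-03-01\n")

# Collapse every run of whitespace (tabs, newlines, spaces) into one space,
# then strip what is left at the start and end of each string.
clean_cells <- trimws(gsub("[[:space:]]+", " ", raw_cells))
```

The same two calls can be applied to each character column of a scraped table before converting numbers or dates.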

On manual inspection of the pages, I noticed inconsistencies in the HTML formatting. For instance, there are tables within tables.
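
Nested tables may be part of the problem, because a selector like "table" matches the outer wrapper as well as the table inside it. A rough sketch of one way around that, assuming the data lives in the innermost tables; the HTML and the `id` values here are hypothetical.

```r
library(rvest)

# Hypothetical layout: a wrapper table with a data table inside one of its cells.
page <- read_html('
  <table id="outer"><tr><td>
    <table id="inner"><tr><td>42</td></tr></table>
  </td></tr></table>')

# A plain CSS selector matches both the wrapper and the nested table.
all_tables <- html_nodes(page, "table")

# This XPath keeps only tables that do not contain another table.
leaf_tables <- html_nodes(page, xpath = "//table[not(.//table)]")
leaf_ids <- html_attr(leaf_tables, "id")
```

Whether the innermost tables are the right ones to keep depends on how the real pages are laid out, so it is worth checking a few of them by hand first.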

Considering that there are about 170 web pages, I believe I am going to use an external tool to extract those tables and then try to import the data into RStudio. I may also make the code and workaround public.

Thank you
