Help with rvest and scraping

Hi everyone!

I'm currently writing some R code to extract information about the funding that various projects on a website have received.
I am using the rvest package.

Here is a sample of what the HTML on the website looks like:

<title>Project 2030 is launched</title>
<div data-name="category">Domestic news</div> <!--/category--> 
<div data-name="funding">25000000</div><!--/funding-->

In R, I've successfully extracted the title with:

> library(rvest)
> a_webpage <- read_html("https://www.example.com")
> a_webpage %>%
+ html_node("title") %>%
+ html_text()
[1] "Project 2030 is launched"

My question is: how can I do the same for the "funding" part? More specifically, how can I extract the number 25000000? Using html_node("div#funding") or other variations does not seem to work.

Thanks!

By the way, here is a link to the website:
https://www.tuborgfondet.dk/projekt/mind-your-own-business-groenland
... the title is on line 62 and the funding amount is on line 199 of: view-source:https://www.tuborgfondet.dk/projekt/mind-your-own-business-groenland

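library(rvest)

# Sample markup from the question, with a second funding div added
# to show that html_nodes() returns every match.
# (Named html_snippet so it does not shadow rvest's html_text().)
html_snippet <- '<title>Project 2030 is launched</title>
   <div data-name="category">Domestic news</div> <!--/category-->
   <div data-name="funding">25000000</div><!--/funding-->
   <div data-name="funding">999</div><!--/funding-->'

b_webpage <- read_html(html_snippet)

# Title of the sample document
b_webpage %>%
  html_node("title") %>%
  html_text()

# Select the divs by their data-name attribute
b_webpage %>%
  html_nodes("div[data-name='funding']") %>%
  html_text()

# The same selectors against the live page
a_webpage <- read_html("https://www.tuborgfondet.dk/projekt/mind-your-own-business-groenland")
a_webpage %>%
  html_node("title") %>%
  html_text()

a_webpage %>%
  html_nodes("div[data-name='funding']") %>%
  html_text()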
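
For what it's worth, the reason "div#funding" finds nothing is that "#" in a CSS selector matches the id attribute, and the divs in the question only carry data-name. A minimal sketch of the difference, using a made-up two-line snippet in place of the real page (the names snippet, by_id and by_attr are just for illustration):

```r
library(rvest)

# Hypothetical snippet mirroring the markup from the question
snippet <- read_html('<div data-name="category">Domestic news</div>
<div data-name="funding">25000000</div>')

# "#funding" looks for id="funding", which this markup does not have
by_id <- html_nodes(snippet, "div#funding")
length(by_id)
# 0

# An attribute selector matches on data-name instead
by_attr <- html_nodes(snippet, "div[data-name='funding']")
html_text(by_attr)
# "25000000"
```

If the divs did have id="funding", the "#" form would work as well.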

I'd recommend using XPath to identify the specific nodes you want. Discovering node identifiers with SelectorGadget or writing your own CSS selectors often works well, but it can fail when the page's elements aren't labelled carefully. Here's an example that grabs the funding field:

a_webpage %>%
  html_nodes(xpath = "//div[@data-name='funding']") %>%
  html_text()
[1] "11209000"
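
Note that html_text() returns a character string, so if you want to do arithmetic on the amount you'll probably need to convert it. A rough sketch, assuming the div contains only digits as in the output above (the inline HTML stands in for the live page, and funding_text and funding are illustrative names):

```r
library(rvest)

# Stand-in for the live page: a single funding div with the value shown above
page <- read_html('<div data-name="funding">11209000</div>')

# Extract the text, trimming any surrounding whitespace
funding_text <- page %>%
  html_nodes(xpath = "//div[@data-name='funding']") %>%
  html_text(trim = TRUE)

# Convert the character value to a number
funding <- as.numeric(funding_text)
```

If the real value ever includes thousands separators or a currency label, as.numeric() would give NA, so you'd have to strip those characters first.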

Thanks a lot! This truly helped.
