rvest/xml2: replace nodes before scraping

I am trying to scrape a web page which contains the following structure:

<p>
  <a href="https://somewhere1.com">sometext1</a>
  <br> 
    somemoretext1
</p>
<p>
  <a href="https://somewhere2.com">sometext2</a>
  <br> 
    somemoretext2
  <br>
  <br>
  <a href="https://somewhere3.com">sometext3</a>
  <br> 
    somemoretext3
</p>

I would like to split the second <p> node by replacing the two adjacent <br> tags with </p><p> (or something similar) before selecting all <p> nodes with html_nodes("p") for further processing. Every <p> node should then contain exactly one link plus its "somemoretext", just like the first <p> node.

In the end I want to scrape all link URLs, all "sometext"s, and all "somemoretext"s.
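
If the end goal is only these three pieces of data per link, one possible approach skips splitting the <p> nodes altogether: select each <a> and take its first non-blank following sibling text node. This is only a sketch under that assumption; the `html` string just reproduces the sample above, and the names `links`, `more` and `result` are illustrative.

```r
library(xml2)

# Sample markup copied from the question; a real page would be read with read_html(url)
html <- '<p>
  <a href="https://somewhere1.com">sometext1</a>
  <br>
    somemoretext1
</p>
<p>
  <a href="https://somewhere2.com">sometext2</a>
  <br>
    somemoretext2
  <br>
  <br>
  <a href="https://somewhere3.com">sometext3</a>
  <br>
    somemoretext3
</p>'

doc <- read_html(html)

# Every link that is a direct child of a <p>
links <- xml_find_all(doc, "//p/a")

# For each link, the nearest following sibling text node that is not only whitespace
more <- xml_find_first(links, "following-sibling::text()[normalize-space()][1]")

result <- data.frame(
  url  = xml_attr(links, "href"),
  text = xml_text(links, trim = TRUE),
  more = xml_text(more, trim = TRUE),
  stringsAsFactors = FALSE
)
print(result)
```

One caveat with this sketch: if a link has no text of its own after it, the XPath picks up the next non-blank text in the same <p>, which may belong to a later link, so it relies on every link being followed by its "somemoretext".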

I assume that xml2::xml_replace() could be part of a solution, but I haven't yet figured out how to get it to work, even after reading the modification vignette.

(Note that the document contains many more <p> nodes, some of them with several runs of adjacent <br> tags, so a single <p> node may have to be split into more than two.)
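
A possible way to get the literal </p><p> replacement is to do it on the raw HTML string before parsing, since xml2::xml_replace() swaps one node for another and a closing/opening tag pair is not a node. The sketch below assumes the page source is available as a single string (here the sample from above; for a real page it would have to be downloaded as text first) and that every run of two or more adjacent <br> tags marks a split point. The names `split_html`, `paras` and `result` are illustrative.

```r
library(xml2)
library(rvest)

# Sample markup copied from the question, held as one string
html <- '<p>
  <a href="https://somewhere1.com">sometext1</a>
  <br>
    somemoretext1
</p>
<p>
  <a href="https://somewhere2.com">sometext2</a>
  <br>
    somemoretext2
  <br>
  <br>
  <a href="https://somewhere3.com">sometext3</a>
  <br>
    somemoretext3
</p>'

# Turn each run of two or more adjacent <br> tags (whitespace allowed between them)
# into a paragraph break; a single <br> is left untouched
split_html <- gsub("(<br\\s*/?>\\s*){2,}", "</p><p>", html, perl = TRUE)

doc   <- read_html(split_html)
paras <- html_nodes(doc, "p")

# One link per <p> is assumed after the split
links <- html_node(paras, "a")

result <- data.frame(
  url  = html_attr(links, "href"),
  text = html_text(links, trim = TRUE),
  # First non-blank text node directly inside each <p>
  more = xml_text(xml_find_first(paras, "./text()[normalize-space()]"), trim = TRUE),
  stringsAsFactors = FALSE
)
print(result)
```

Because the pattern matches runs of any length and gsub() replaces every occurrence, a <p> with several such runs is split into as many pieces as needed. A run of <br> tags at the very end of a <p> would leave an empty paragraph behind, which shows up as a row of NA values and can be filtered out afterwards.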
