I am trying to scrape a web page which contains the following structure:
<p>
<a href="https://somewhere1.com">sometext1</a>
<br>
somemoretext1
</p>
<p>
<a href="https://somewhere2.com">sometext2</a>
<br>
somemoretext2
<br>
<br>
<a href="https://somewhere3.com">sometext3</a>
<br>
somemoretext3
</p>
Basically I would like to split up the second <p> node by replacing the two adjacent <br> tags with </p><p> or similar before I select all <p> nodes for further processing (with html_nodes("p")). So every <p> node should contain only one link plus "somemoretext", just like the first <p> node.
In the end I want to scrape all link URLs, all "sometext"s, and all "somemoretext"s, keeping each triple together.
I assume that xml2::xml_replace() could be part of a solution, but I haven't figured out how to get it to work yet, even after reading the node modification vignette.
(Note that the document contains many more <p> nodes with sometimes multiple adjacent <br> tags, so I might have to split up one <p> node into more than two.)
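For reference, here is a minimal sketch of the result I am after. It does not use xml2::xml_replace(); instead it swaps in a plain string substitution on the raw HTML before parsing, which is an assumption about what is acceptable here and may be too blunt for messier pages. The names html, fixed, paras and result are only illustrative, and the sample markup is the one shown above.

```r
library(rvest)
library(xml2)

# Sample markup from the question, held as a single string (illustrative only).
html <- '<p>
<a href="https://somewhere1.com">sometext1</a>
<br>
somemoretext1
</p>
<p>
<a href="https://somewhere2.com">sometext2</a>
<br>
somemoretext2
<br>
<br>
<a href="https://somewhere3.com">sometext3</a>
<br>
somemoretext3
</p>'

# Replace every run of two or more <br> tags (optionally separated by
# whitespace) with a paragraph break. A single <br> is left untouched.
fixed <- gsub("(<br\\s*/?>\\s*){2,}", "</p><p>", html, perl = TRUE)

doc   <- read_html(fixed)
paras <- html_nodes(doc, "p")

# One row per <p>: the link URL, the link text, and the text after the <br>.
links  <- html_node(paras, "a")
result <- data.frame(
  url  = html_attr(links, "href"),
  text = html_text(links),
  more = trimws(xml_text(
    xml_find_first(paras, "./br/following-sibling::text()[1]")
  )),
  stringsAsFactors = FALSE
)
```

With the sample above this should give three rows, one per link. A <p> without a link or without a <br> would produce NA in the corresponding column, so such rows may need to be filtered out afterwards.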