I am new to R and learning to scrape web pages. I am trying to scrape all user reviews across three pages for a deprecated WordPress plugin. So I have the code below:
#specify the first page URL
fpURL <- 'https://wordpress.org/support/plugin/easyrecipe/reviews/'
#read the HTML contents in the first page URL
contentfpURL <- read_html(fpURL)
#identify the anchor tags in the first page URL
fpAnchors <- html_nodes(contentfpURL, css='a.bbp-topic-permalink')
#extract the HREF attribute value of each anchor tag
fpHREF <- html_attr(fpAnchors, 'href')
#create empty lists to store titles & contents found in the HREF attribute value of each anchor tag
titles = c()
contents = c()
#loop the following actions for each HREF found firstpage
for (u in fpHREF) {
#read the HTML content of the review page
fpURL = read_html(u)
#identify the title anchor and read the title text
fpreviewT = html_text(html_nodes(fpURL, css='h1.page-title'))
#identify the content anchor and read the content text
fpreviewC = html_text(html_nodes(fpURL, css='div.bbp-topic-content'))
#store the review titles and contents in the previous lists
titles = c(titles, fpreviewT)
contents = c(contents, fpreviewC)
}
#identify the anchor tag pointing to the next summary page
npAnchor <- html_text(html_nodes(contentfpURL, css='a.next page-numbers'))
#extract the HREF attribute value of the anchor tag pointing to the next summary page
npHREF <- html_attr(npAnchor, 'href')
#loop the following actions for every next summary page HREF attribute
for (u in npHREF) {
#specify the URL of the summary page
spURL <- read_html('npHREF')
#identify all the anchor tags on that summary page
spAnchors <- html_nodes(spURL, css='a.bbp-topic-permalink')
#extract the HREF attribute value of each anchor tag
spHREF <- html_attr(spAnchors, 'href')
#loop the following actions for each HREF found on that summarypage
for (u in fpHREF) {
#read the HTML contents of the review page
spURL = read_html(u)
#identify the title anchor and read the title text
spreviewT = html_text(html_nodes(spURL, css='h1.page-title'))
#identify the content anchor and read the content text
spreviewC = html_text(html_nodes(spURL, css='div.bbp-topic-content'))
#store the review titles and contents in the previous lists
titles = c(titles, spreviewT)
contents = c(contents, spreviewC)
}
}
However, my code does not work. I am not sure what I am doing wrong, maybe it's the multiple loops?
I will appreciate some help. Thank you.