How to scrape information from within multiple search results

thatgirlwiththeskirt · March 12, 2021, 6:00pm

TL;DR I can get the URL for each profile but I can't figure out how to scrape from each profile and put the information into a table

I am new to web scraping, and I am trying to scrape information from profiles on this website: https://www.theeroticreview.com/main.asp

This does not violate their Terms of Use, and the website doesn't have an API. I am able to extract the URL path for each profile from the search results and prepend the domain name to it. However, I can only do this for one page of results, and I am unable to then follow those URLs to scrape information from the actual profiles.

My code looks like this:

library(rvest)  # provides read_html(), html_nodes(), html_attr(), and %>%

#Scrape the profile URLs
profile_url_lst <- list()
for(page_num in 1:73){
  main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
  html_content <- read_html(main_url)
  profile_urls <- html_content %>% html_nodes("body")%>% html_children() %>% html_children() %>% .[2] %>% html_children() %>% 
    html_children() %>% .[3] %>% html_children() %>% .[4] %>% html_children() %>% html_children() %>% html_children() %>% 
    html_attr("href")
  
  profile_url_lst[[page_num]] <- profile_urls
  Sys.sleep(2)
}
#Bind into list and combine with domain name 
profiles <- cbind(profile_urls)
complete_urls <- paste0('https://www.theeroticreview.com', profile_urls)
complete <- cbind(complete_urls)
complete

#Scrape information from each profile
TED_lst <- list()
base_url <- "https://www.theeroticreview.com"
completed <- c(profile_urls)
for(i in completed) {
  urls <- paste(base_url, i, sep = "")
  pages <- read_html(urls)
  
  TED <- pages %>% html_nodes(".hidden-sm , .td-date .td-link , #collapse3 .col-sm-6+ .col-sm-6 .row:nth-child(5) .col-xs-6 , #collapse3 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(1) .col-xs-6 , .col-sm-6+ .col-sm-6 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(2) .col-xs-6 , #collapse1 .row:nth-child(8) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(5) .col-xs-6 , #collapse1 .col-sm-6+ .col-sm-6 .row:nth-child(1) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(2) .col-xs-6 , .float-heading-left p , h1") %>% html_text()
  TED_lst <- TEDs
}

When I run this code, I only get complete URLs for a single page of results. The profile-scraping code works when I remove the loop and give it one of these URLs directly, but running the loop above returns NULL and the URL for a single profile. How do I get it to scrape the information from every profile and then bind it into a table to be used in regression analysis?
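
Two things in the code above look like likely causes. After the first loop, `profile_urls` only holds the hrefs from the last page visited, while the full set is in `profile_url_lst`. In the second loop, `TED_lst <- TEDs` refers to an object that is never created, and even with the name fixed it would overwrite the result on every iteration instead of storing one entry per profile. Below is a minimal sketch of one way to restructure this, not a tested solution for this site. The helper names (`flatten_paths`, `scrape_profile`, `scrape_profiles`) and the column names are hypothetical, and only two of the selectors from the question are used; which field each selector actually corresponds to is an assumption.

```r
library(rvest)  # also re-exports %>% and read_html()

base_url <- "https://www.theeroticreview.com"

# Hypothetical helper: collapse the per-page list of hrefs into one vector.
flatten_paths <- function(url_lst) {
  unique(unlist(url_lst))
}

# Hypothetical helper: extract a few fields from one parsed profile page.
# Selectors are taken from the question; the column names are assumptions.
# html_node() yields NA for a field when nothing matches, so every profile
# still produces exactly one row.
scrape_profile <- function(page) {
  data.frame(
    name    = html_text(html_node(page, "h1"), trim = TRUE),
    heading = html_text(html_node(page, ".float-heading-left p"), trim = TRUE),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: fetch every profile and row-bind the results.
# `fetch` defaults to read_html() but can be swapped out for offline testing.
scrape_profiles <- function(paths, fetch = read_html, delay = 2) {
  rows <- vector("list", length(paths))
  for (i in seq_along(paths)) {
    full_url <- paste0(base_url, paths[i])
    row <- scrape_profile(fetch(full_url))
    row$url <- full_url
    rows[[i]] <- row          # store by index, do not overwrite
    Sys.sleep(delay)          # be polite to the server
  }
  do.call(rbind, rows)
}

# Usage (not run here), assuming profile_url_lst was filled by the first loop:
# all_paths   <- flatten_paths(profile_url_lst)
# profiles_df <- scrape_profiles(all_paths)
```

Returning a one-row data frame per profile keeps the columns aligned, so the combined result can go straight into a model. More columns can be added to `scrape_profile` one selector at a time, which also makes it easier to see which selector fails on a given page.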
