Web scraping unsuccessful

Hello,

I am new to web scraping with the "rvest" package.
The code below shows that I can scrape the title of the page successfully.
However, when I try to scrape the headings of the stories on the page,
the program returns an empty "xml_nodeset."

When I originally scraped this page, this code worked fine. Any ideas on how to make sure the scrape runs successfully as the stories change from day to day?

library(rvest)
reddit_political_webpage <- read_html("https://www.reddit.com/r/politics/")
reddit_political_webpage %>%
    html_node("title") %>%
    html_text()
#> [1] "Politics"


reddit_political_webpage %>%
    html_nodes("#t3_p164nk ._eYtD2XCVieq6emjKBH3m , #t3_p190o4 ._eYtD2XCVieq6emjKBH3m, #t3_p14fxj ._eYtD2XCVieq6emjKBH3m")
#> {xml_nodeset (0)}

Hi @jack3

As you already know, each subreddit is a potentially infinite webpage as it is updated with new topics every couple of hours. This means that you may only want to scrape the latest topics.

The code below worked for me, and the topics shown were the latest ones when I ran it (Aug 15, 2021 at 2:05 AM CST):

library(rvest)

html <- read_html("https://www.reddit.com/r/politics/")

html %>%
  html_elements(css = "h3._eYtD2XCVieq6emjKBH3m") %>%
  html_text()

[1] "Saturday Morning Political Cartoon Thread"
[2] "Mississippi 8th Grader Dies With COVID Hours After Reeves Downplays Child Cases"
[3] "Tennessee's Former Vax Chief Says Conservatives Avoiding Vaccine 'Out of Spite'"
[4] "Former FBI deputy director Andrew McCabe says Trump is 'threatening members of law enforcement' in targeting officer who killed Capitol rioter Ashli Babbitt"
[5] "Vast Stretches of America Are Shrinking. Almost All of Them Voted for Trump."
[6] "Senator Whitehouse Asks January 6 Commission to Study Role of Dark Money in Breach"
[7] "Republicans claim to fear left-wing authoritarianism — but there's no such thing | Yes, dictators sometimes cloak themselves in \"socialism.\" But tyranny, here and elsewhere, is always right-wing"
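
Since you asked about keeping the scrape working from day to day: a selector that matches nothing does not raise an error, it just returns an empty node set, so it can help to check for that explicitly. Below is a minimal sketch of that idea, not a definitive solution. The helper name `extract_titles`, the class names `story-title` and `renamed-class`, and the inline demo page are all made up for illustration; with the real page you would pass the result of `read_html()` and the actual selector. The class name on the live site looks machine-generated, so I would assume it can change over time.

```r
library(rvest)

# Hypothetical helper: return the text of every node matching `selector`,
# and warn (instead of silently returning nothing) when no node matches.
extract_titles <- function(page, selector) {
  nodes <- html_elements(page, selector)
  if (length(nodes) == 0) {
    warning("Selector '", selector, "' matched no nodes; the page markup may have changed.")
    return(character(0))
  }
  html_text(nodes)
}

# Offline stand-in for a downloaded page; "story-title" is an invented class name.
demo_page <- minimal_html('<h3 class="story-title">Example headline</h3>')

# A selector that still matches returns the title text.
titles <- extract_titles(demo_page, "h3.story-title")

# A selector that no longer matches returns character(0); the warning is
# suppressed here only to keep the demo quiet.
no_match <- suppressWarnings(extract_titles(demo_page, "h3.renamed-class"))
```

With a check like this, a stale selector shows up as a warning on the day it stops matching rather than as a quietly empty result.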

Thanks,

I did notice a difference in your code, though:
html_elements(css = "h3._eYtD2XCVieq6emjKBH3m")

Maybe this line allows stories to be extracted as they are updated, since my own selector was failing. Can you explain how you determined the "h3._eYtD2XCVieq6emjKBH3m" part of the code? Are you using a selector tool?

Web scraping requires a bit of knowledge of what makes up a webpage (i.e. HTML and CSS). "h3._eYtD2XCVieq6emjKBH3m" is known as a CSS selector, and it is what selects all topics on your subreddit of interest. I right-clicked on one topic, then clicked "Inspect" to open Chrome's Developer Tools, and looked for a selector that matches all of the titles. "h3" is an HTML element and "_eYtD2XCVieq6emjKBH3m" is a CSS class. Put together, they mean you are looking for all h3 elements (titles) of class _eYtD2XCVieq6emjKBH3m. Your original selector also included IDs such as "#t3_p164nk", each of which refers to one specific post, so it stops matching once those posts are no longer on the page.
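
To make the difference concrete, here is a small sketch on a made-up page rather than the live site. The ids `post_a`, `post_b`, `post_z` and the class `title-class` are invented for illustration, and `minimal_html()` is only used so the example runs without a network connection.

```r
library(rvest)

# Hypothetical stand-in for a subreddit page: two posts, each with its own id,
# whose titles share one class ("title-class" is an invented name).
page <- minimal_html('
  <div id="post_a"><h3 class="title-class">First story</h3></div>
  <div id="post_b"><h3 class="title-class">Second story</h3></div>
  <p class="title-class">Not a title</p>
')

# Element + class: matches every h3 with that class, whichever post it is in.
by_class <- page %>%
  html_elements("h3.title-class") %>%
  html_text()

# id + class: matches only inside the post that carries that id.
by_id <- page %>%
  html_elements("#post_a .title-class") %>%
  html_text()

# An id that is not on the page (e.g. a post that has dropped off) matches nothing.
stale <- page %>%
  html_elements("#post_z .title-class")
```

Here `by_class` holds both story titles, `by_id` holds only the first, and `stale` is an empty node set, which is the same kind of result you were getting from the id-based selector.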

It may seem confusing, but it is actually fairly simple. If you can spare a bit of time, I recommend watching a YouTube video on basic HTML and CSS. It will help you a lot if web scraping is something you do often.

Much appreciated, I'll find what I can.
