I am new to R. I am trying to build a dataset of a newspaper's articles so that I can perform tidytext analysis on it.
With some online help, I have gathered links to 400 of its editorials. Now I want to extract the text from the editorial pages, but I am getting the following error: Error in open.connection(x, "rb") : HTTP error 403.
I have tried to scrape the links to all editorials published online. There are 400 of them.
I want to map over all 400 pages, then get the title and text out of each page. I tried this approach on five links and it worked, but I am unable to map over all 400 pages.
If I can map over the 400 pages, I expect to reuse the following code, which helped me get some 20 editorial posts.
Use purrr::safely so that map keeps going even if you encounter some errors.
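A minimal sketch of the purrr::safely idea, assuming a made-up fetch_page() function and example.com URLs in place of the real scraping code. The stand-in function fails on one input so the example runs without network access.

```r
library(purrr)

# Hypothetical stand-in for a page-fetching function: it fails on some
# inputs, the way read_html() might fail with an HTTP 403.
fetch_page <- function(url) {
  if (grepl("blocked", url)) stop("HTTP error 403")
  paste("content of", url)
}

# Placeholder URLs, not the real editorial links
urls <- c(
  "https://example.com/a",
  "https://example.com/blocked",
  "https://example.com/c"
)

# safely() wraps the function so each call returns list(result, error)
# instead of stopping the whole map() on the first failure.
safe_fetch <- safely(fetch_page)
out <- map(urls, safe_fetch)

# Separate the successes from the failures
ok     <- map_lgl(out, ~ is.null(.x$error))
pages  <- map_chr(out[ok], "result")
failed <- urls[!ok]
```

The failed vector can then be retried later, which is handy when only some of the 400 requests are rejected.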
I would advise adding a crawl delay of a few seconds when scraping, either manually or using polite, a wrapper around httr that uses the site's robots.txt information.
Scraping will take more time, but I think it is good behavior, and I think the error will disappear.
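A rough sketch of what the polite approach could look like. The host, the article URLs, and the CSS selectors ("h2", "p") are placeholders and would need to be replaced with the real ones. The helper name scrape_article is also made up for illustration.

```r
library(polite)
library(rvest)
library(purrr)

# Hypothetical helper: fetch one article through a polite session and
# pull out its title and text. The selectors are assumptions.
scrape_article <- function(session, url) {
  # nod() points the session at a new URL; scrape() fetches it while
  # respecting robots.txt and the crawl delay set in bow().
  page <- scrape(nod(session, url))
  if (is.null(page)) {
    # Nothing came back (e.g. disallowed or failed request)
    return(data.frame(url = url, title = NA_character_, text = NA_character_))
  }
  data.frame(
    url   = url,
    title = page %>% html_element("h2") %>% html_text(trim = TRUE),
    text  = page %>% html_elements("p") %>% html_text(trim = TRUE) %>%
      paste(collapse = " ")
  )
}

# Needs network access, so only run it in an interactive session
if (interactive()) {
  # Introduce ourselves to the host once, with a 5-second delay
  session <- bow("https://www.example.com", delay = 5)
  links <- c("https://www.example.com/editorial-1",
             "https://www.example.com/editorial-2")
  editorials <- map_dfr(links, ~ scrape_article(session, .x))
}
```

With a 5-second delay, 400 pages take a little over half an hour, which is the trade-off for not being blocked.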
Thanks for the help. I am trying to apply text sentiment analysis to the editorials, which is why I want to download that many articles. No scraping for the sake of scraping.
The polite option worked well. With rvest, even when I use Sys.sleep(), I keep getting a 403 error after a while. Both approaches are slow, and polite is the slower of the two, but it is good to have.
One more thing I am trying to grapple with: how can I supply a two-digit number to the paste0() function? I have some websites where I need to fill in three missing parts, e.g. the date at the end of this URL: https://www.dawn.com/archive/2019-05-22
I would want to supply the year, month and day parts separately.
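One possible way to build such URLs in base R, sketched with an arbitrary three-day range. Either generate Date objects and format them, or zero-pad separate numbers with sprintf(). The variable names are only illustrative.

```r
# Build every day in a range as Date objects
dates <- seq(as.Date("2019-05-20"), as.Date("2019-05-22"), by = "day")

# format() keeps the leading zeros in month and day
urls <- paste0("https://www.dawn.com/archive/", format(dates, "%Y-%m-%d"))

# If year, month and day are supplied separately as numbers,
# "%02d" pads each one to two digits with a leading zero.
year  <- 2019
month <- 5
day   <- 22
url_one <- sprintf("https://www.dawn.com/archive/%d-%02d-%02d", year, month, day)
```

Plain paste0(year, "-", month, "-", day) would give 2019-5-22 here, because numbers lose their leading zeros when converted to text.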
Thanks a million, this is really helpful. The more I learn R, the more I love it and the R community, which is always there to help.
So I tried to use seq(as.Date())... to generate the dates and then paste them together with the paste0() function below. But I am getting the following error when running the code... (hopefully this will be the last question in the series.)
Error:
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "NULL"
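This error message says that the object handed to the node selector was NULL rather than a parsed page. One possible cause, which is only an assumption here, is a request that failed or was refused, so the fetch step returned NULL. A small sketch of guarding against that, using a made-up helper extract_titles() and an inline HTML string instead of a real page:

```r
library(rvest)
library(purrr)

# Hypothetical helper: return the headings of a parsed page, or NA when
# the page could not be fetched (page is NULL). The "h2" selector is a
# placeholder.
extract_titles <- function(page) {
  if (is.null(page)) return(NA_character_)
  page %>% html_elements("h2") %>% html_text(trim = TRUE)
}

# Two stand-in "pages": one parsed from an inline HTML string,
# one representing a failed fetch.
pages <- list(
  read_html("<html><body><h2>First editorial</h2></body></html>"),
  NULL
)

# The NULL entry yields NA instead of stopping with the xml_find_all error
titles <- map(pages, extract_titles)
```

Checking which entries are NA afterwards shows which dates or URLs did not return a page.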