I've written a few functions to scrape web data using the rvest workflow, and I'm trying to scrape 35,000 URLs. I'm filling a tibble this way:
scraped <- tibble(
  info1 = map(urls, possibly(scrape_function1, otherwise = NULL)),
  info2 = map(urls, possibly(scrape_function2, otherwise = NULL))
)
The code above works when I test it on 500 URLs. The issue is that it's very slow, and I can't leave it running unattended for that long: if anything went wrong on my computer at URL 30,000, all the previous data would be lost, even though everything up to that point had run successfully. This must be a fairly common problem, so I'm wondering how people save the work already done as they go.
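The only safeguard I've come up with so far is to write each batch of results to disk as it completes, so a crash would only lose the batch in progress. Something like this rough sketch, where the scrape_and_save helper, the url column, and the file naming are all my own invention, so I may well be missing a better pattern:

library(tibble)
library(purrr)

# hypothetical helper: scrape one batch of URLs and save the result to disk,
# so a crash mid-run only loses the batch currently in progress
scrape_and_save <- function(batch_urls, batch_id) {
  result <- tibble(
    url   = batch_urls,
    info1 = map(batch_urls, possibly(scrape_function1, otherwise = NULL)),
    info2 = map(batch_urls, possibly(scrape_function2, otherwise = NULL))
  )
  saveRDS(result, sprintf("scraped_batch_%03d.rds", batch_id))
  result
}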
The alternative I could think of is to run it over chunks of the URLs. I'm not against doing that, but I haven't found a clever way to do it. For example, I could create an empty tibble and fill it in chunks like this:
scraped$info1[1:500] <- map(urls[1:500], possibly(scrape_function1, otherwise = NULL))
scraped$info2[1:500] <- map(urls[1:500], possibly(scrape_function2, otherwise = NULL))
scraped$info1[501:1000] <- map(urls[501:1000], possibly(scrape_function1, otherwise = NULL))
scraped$info2[501:1000] <- map(urls[501:1000], possibly(scrape_function2, otherwise = NULL))
However, I would have to copy and paste that code 70 times, manually changing the indices each time, which is obviously very error-prone. The closest I've come to avoiding that is the loop sketched below.
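This is the best I've managed so far: loop over the chunks programmatically, reusing the scrape_and_save helper from above. Again just a sketch; the batch size and the file pattern are arbitrary choices on my part, and I don't know whether this is idiomatic:

batch_size <- 500
batches <- split(urls, ceiling(seq_along(urls) / batch_size))

for (i in seq_along(batches)) {
  file <- sprintf("scraped_batch_%03d.rds", i)
  if (file.exists(file)) next  # skip batches already saved, so an interrupted run can resume
  scrape_and_save(batches[[i]], i)
}

# stitch the saved batches back together at the end
scraped <- map_dfr(list.files(pattern = "^scraped_batch_\\d+\\.rds$"), readRDS)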
I'm quite new to programming, so if anyone has a better way to split the data, or an alternative workflow for scraping a large number of URLs, I would be grateful for the help!