Saving progress when mapping over a large number of variables

I've created a few functions to scrape web data using the rvest workflow, and I'm trying to scrape 35,000 URLs. I'm filling a tibble like this:

scraped <- tibble(
  info1 = map(urls, possibly(scrape_function1, otherwise = NULL)),
  info2 = map(urls, possibly(scrape_function2, otherwise = NULL))
)

The code above works when I test it on 500 URLs. The issue is that it's very slow, and I can't leave it running for that long. If anything went wrong on my computer at URL 30,000, all the previous data would be lost, even though the run had succeeded up to that point. This must be a fairly common issue, so I'm wondering how people save the processing they've already done.

The alternative I can think of is to run it over chunks of the URLs. I'm not against doing this, but I haven't been able to think of a clever way to do it. For example, I could create an empty tibble and fill it like this:

scraped$info1[1:500] <- map(urls[1:500], possibly(scrape_function1, otherwise = NULL))
scraped$info2[1:500] <- map(urls[1:500], possibly(scrape_function2, otherwise = NULL))

scraped$info1[501:1000] <- map(urls[501:1000], possibly(scrape_function1, otherwise = NULL))
scraped$info2[501:1000] <- map(urls[501:1000], possibly(scrape_function2, otherwise = NULL))

However, I would have to copy and paste that code 70 times, manually changing which part of the vector I'm subsetting each time, which is obviously very prone to error.
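The closest I've got to automating that is something like the loop below, which splits the URLs into chunks, saves each chunk to its own .rds file, and skips any chunk whose file already exists so a crashed run can be resumed. The chunk size and file names are just placeholders, and I'm not sure it's a sensible approach:

chunk_size <- 500
chunks <- split(urls, ceiling(seq_along(urls) / chunk_size))

for (i in seq_along(chunks)) {
  out_file <- sprintf("scraped_chunk_%03d.rds", i)
  if (file.exists(out_file)) next  # this chunk finished in an earlier run, so skip it

  chunk_result <- tibble(
    info1 = map(chunks[[i]], possibly(scrape_function1, otherwise = NULL)),
    info2 = map(chunks[[i]], possibly(scrape_function2, otherwise = NULL))
  )
  saveRDS(chunk_result, out_file)
}

# once every chunk file exists, stitch the pieces back together in order
scraped <- purrr::map_dfr(sprintf("scraped_chunk_%03d.rds", seq_along(chunks)), readRDS)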

I'm quite new to programming, so if anyone has a better way of splitting up the data, I'd be grateful for the help! Or perhaps an alternative workflow for scraping a large number of URLs.

This is precisely the use case for purrr::safely and friends. These are called adverbs, since they modify the behaviour of your function. Specifically, a function wrapped with safely will never throw an error; it always returns a list with result and error components, so it's a perfect candidate in your case.
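For example, a minimal sketch with a made-up function, just to show the shape of what safely returns:

library(purrr)

risky <- function(x) if (x > 2) stop("boom") else x * 10
safe_risky <- safely(risky, otherwise = NULL)

safe_risky(1)  # list(result = 10, error = NULL)
safe_risky(3)  # list(result = NULL, error = <simpleError: boom>)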

I've actually already wrapped my functions in purrr::possibly (part of the safely family), but that solves a different issue: it means that if the run does get to the end, I get the data for whichever outputs succeeded, even if that isn't all of them.

I'm not sure I explained my problem very well. What I mean is, I'm looping over so many inputs that it would take multiple days for the function to run. I can't leave it running that long, so I want to get outputs periodically. But I'm not sure how best to either a) split the data set or b) get the results from mapping as they come in.

OK, then you can try using the future package for some parallelization. It sounds like your problem is embarrassingly parallel, so if you have multiple cores (and you probably do), you can run your requests in parallel.
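For example, the furrr package gives you drop-in parallel versions of the purrr mappers. This is only a sketch: the worker count is a guess, and you should check that the site is happy being hit from several processes at once.

library(future)
library(furrr)
library(purrr)

plan(multisession, workers = 4)  # four parallel R sessions on the local machine

scraped <- tibble::tibble(
  info1 = future_map(urls, possibly(scrape_function1, otherwise = NULL)),
  info2 = future_map(urls, possibly(scrape_function2, otherwise = NULL))
)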

One more approach is to save results to disk or a database for persistent storage. SQLite is quite simple, so maybe you can try going down that road: run your program for as long as you want, have it save results as it goes, and when you need to interrupt it you'll be able to see where it stopped and continue from that point the next day.
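Something along these lines might work with DBI and RSQLite. It's a rough sketch: the table and column names are made up, and it assumes your scrape functions return something you can store in a table column, such as a character string.

library(DBI)
library(purrr)

con <- dbConnect(RSQLite::SQLite(), "scrape_progress.sqlite")

# work out which URLs were already scraped in a previous run
done <- if (dbExistsTable(con, "results")) {
  dbGetQuery(con, "SELECT url FROM results")$url
} else {
  character(0)
}

for (u in setdiff(urls, done)) {
  row <- data.frame(
    url   = u,
    info1 = possibly(scrape_function1, otherwise = NA_character_)(u),
    info2 = possibly(scrape_function2, otherwise = NA_character_)(u),
    stringsAsFactors = FALSE
  )
  dbWriteTable(con, "results", row, append = TRUE)  # one row written per URL, as you go
}

scraped <- dbReadTable(con, "results")
dbDisconnect(con)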

Some kind of workflow automation tool like drake should help. It creates checkpoints with intermediate results, can run jobs in parallel, and can pick up processing after a failure without rerunning completed jobs.
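A rough sketch of what that could look like with drake; the targets just mirror the code in the original post, so the exact structure is an assumption:

library(drake)
library(purrr)

plan <- drake_plan(
  info1   = map(urls, possibly(scrape_function1, otherwise = NULL)),
  info2   = map(urls, possibly(scrape_function2, otherwise = NULL)),
  scraped = tibble::tibble(info1 = info1, info2 = info2)
)

make(plan)      # each finished target is cached, so rerunning make(plan) skips completed work
readd(scraped)  # pull the final result back out of drake's cache

Note that a target is only cached once it finishes completely, so for per-URL checkpointing you'd still want to chunk the URLs within the plan (or look at drake's dynamic branching).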

