Help with download.file: how to skip over empty URLs and save using destfile

Hi all,

Thanks in advance for any feedback.

As part of my dissertation I'm trying to scrape data from the web (I've been working on this for months). I have a couple of issues:

-Each document I want to scrape has a document number. However, the numbers don't always increase consecutively. For example, one document number is 2022, but the next one is not necessarily 2023; it could be 2038, 2040, etc. I don't want to go through by hand to get each document number. I have tried wrapping download.file in purrr::safely(), but once the loop hits a document that does not exist, it stops.
-Second, I'm still fairly new to R and am having a hard time setting up destfile for multiple documents. When I index the path for where to store the downloaded data, the first document ends up stored in the named place and the next one comes back as NA.

Here's the code I've been working on:

base.url <- "https://www.europarl.europa.eu/doceo/document/"
document.name.1 <- "P-9-2022-00"
document.extension <- "_EN.docx"

#document.number <- 2321
document.numbers <- c(2330:2333)

for (i in 1:length(document.numbers)) {

  temp.doc.name <- paste0(base.url,
                          document.name.1,
                          document.numbers[i],
                          document.extension)
  print(temp.doc.name)

  safely <- purrr::safely(download.file(temp.doc.name,
                                        destfile = "/Users/...[i]"))

}

Ultimately, I need to scrape about 120,000 documents from the site. Where is the best place to store the data? I'm thinking I might run the code for each of the 15 years I'm interested in separately, in order to (hopefully) keep it manageable.

Again, thanks for any feedback!

Kari W

The destfile = "/Users/...[i]" part is incorrect: since [i] is inside the quotes, it is the literal string "[i]", not indexing. You probably want to generate the destfile at each loop iteration:

temp.destfile.name <- paste0("/Users/.../document_number_", i)

download.file(temp.doc.name, temp.destfile.name)

Another thing: if you're on Windows, you have to download the file in binary mode:

download.file(temp.doc.name, temp.destfile.name, mode = "wb")
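
Putting these pieces together, here's a rough sketch of what the loop could look like. The destination folder is a placeholder, and I'm assuming you want to keep the .docx extension on the saved files. Wrapping download.file itself in purrr::safely() (rather than wrapping the call, as in your original code) is what lets the loop continue when a document number doesn't exist:

safe_download <- purrr::safely(download.file)

for (i in seq_along(document.numbers)) {
  temp.doc.name <- paste0(base.url,
                          document.name.1,
                          document.numbers[i],
                          document.extension)
  # build the destination path from the document number
  # ("/path/to/folder/" is a placeholder for wherever you store the files)
  temp.destfile.name <- paste0("/path/to/folder/document_number_",
                               document.numbers[i], ".docx")
  # safely() returns list(result, error) instead of stopping the whole loop
  res <- safe_download(temp.doc.name, destfile = temp.destfile.name, mode = "wb")
  if (!is.null(res$error)) {
    message("Could not download ", temp.doc.name, ", skipping it.")
  }
}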

You could also switch to a package like {httr2}, which offers a more modern interface for this kind of download (though in your case the requests are simple enough that it may not be necessary).
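
For reference, a rough {httr2} equivalent of the download step, assuming the same temp.doc.name and temp.destfile.name as above, would be something like:

library(httr2)

# perform the request and stream the response body straight to the destination file
request(temp.doc.name) |>
  req_perform(path = temp.destfile.name)

Note that req_perform() also errors on a 404, so you would still need purrr::safely() (or httr2's req_error()) to skip missing document numbers.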

I downloaded one random document and it was about 10 kB. For 120,000 files of similar size, that works out to roughly 1.2 GB (10 kB × 120,000). I'd say that's small enough to just save in a directory along with your script. But if your goal is to run some text-extraction code afterwards, there might not be any point in keeping the raw files.

One possibility could be to download all the files from one year (storing them in tempdir()), read them with {officer} or another suitable package, and save the content as an rds or qs file along with your script. That way you have the full text available for future use, and I expect it'll take less space (you'll probably have to play with a few files first to see what information you actually want to keep and whether it really ends up smaller).
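
As a rough sketch of that idea (the output file name is made up, and keeping only the text column is an assumption; you'd adapt both to what you actually need):

library(officer)

# list the .docx files downloaded into tempdir()
docx_paths <- list.files(tempdir(), pattern = "\\.docx$", full.names = TRUE)

# read one document and return its text content as a data frame
extract_text <- function(path) {
  doc <- read_docx(path)
  content <- docx_summary(doc)  # one row per paragraph/table cell, with a 'text' column
  data.frame(file = basename(path), text = content$text)
}

all_text <- do.call(rbind, lapply(docx_paths, extract_text))

# save the extracted text next to your script for later use
saveRDS(all_text, "documents_2022_text.rds")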

Also, one small detail on the for loop:

for (i in 1:length(document.numbers)) {

In that case you don't really care about i, you only care about the document number, so you can save a few characters by iterating over the document numbers directly:

for (current_document_number in document.numbers) {
   temp.doc.name <- paste0(base.url,
      document.name.1,
      current_document_number,
      document.extension)
   print(temp.doc.name)
}

By the way, you also have a space at the end of the base.url that will create problems.
