Download "files" that are a list of links


#1

I have a long list of links. Each one is for a NetCDF file. If I put a link in my browser, a file automatically starts downloading, but my browser doesn’t go anywhere.

What are these: links or files? How do I read them in R?

When I try RCurl::getURL(), I get

Error in nc_open trying to open file <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/. .. etc

I have all the links in a file called “myfiles.dat”. I’m hoping to move ahead and learn purrr with this set.


#2

Looks like things have moved: the server answers with a 302 Found that points to urs.earthdata.nasa.gov.

Given the oauth in the link, even if you get there, you’ll probably need to log in or provide an access token.


#3

Can you provide a list of example links?


#4

Here is a full link:

http://goldsmr2.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FMERRA%2FMST1NXMLD.5.2.0%2F2004%2F01%2FMERRA300.prod.simul.tavg1_2d_mld_Nx.20040101.hdf&FORMAT=bmM0Lw&BBOX=45.687%2C-95.804%2C45.694%2C-95.794&LABEL=MERRA300.prod.simul.tavg1_2d_mld_Nx.20040101.SUB.nc4&SHORTNAME=MST1NXMLD&SERVICE=SUBSET_MERRA&VERSION=1.02&LAYERS=&VARIABLES=tsoil1

If I copy and paste it into my browser, a file still downloads.

@alistaire, you’re right. I do need to log in to get this information. Now I understand that RCurl won’t work without the login. They have some short instructions on how to download all the files using Unix. It looks like I will probably have to contact them to see if there is a way around it, since I am not a Unix user.
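
Going back to the link itself, here is a rough sketch of how the list could be prepared in R before any downloading happens. It assumes that “myfiles.dat” is a plain-text file with one link per line, and that the LABEL parameter in each link holds the name the saved file should get (that is a guess based on the example above, not something the data provider confirms). `get_query_param` and `read_link_table` are made-up helper names.

```r
# Sketch only: assumes "myfiles.dat" is a plain-text file with one URL per line
# and that the LABEL query parameter holds the intended output file name.

# Pull the value of one query parameter out of a URL (base R only).
get_query_param <- function(url, name) {
  query <- sub("^[^?]*\\?", "", url)               # drop everything up to "?"
  pairs <- strsplit(query, "&", fixed = TRUE)[[1]] # "key=value" pieces
  keys  <- sub("=.*$", "", pairs)
  vals  <- sub("^[^=]*=?", "", pairs)
  hit   <- which(keys == name)
  if (length(hit) == 0) return(NA_character_)
  utils::URLdecode(vals[hit[1]])                   # turn %2F back into "/", etc.
}

# Hypothetical helper: read the link list and pair each URL with a file name.
read_link_table <- function(path = "myfiles.dat") {
  urls <- trimws(readLines(path, warn = FALSE))
  urls <- urls[nzchar(urls)]                       # skip blank lines
  data.frame(
    url  = urls,
    dest = vapply(urls, get_query_param, character(1), name = "LABEL",
                  USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}
```

The resulting data frame has one row per link, which is a convenient shape for purrr later (for example, mapping over the `url` and `dest` columns in parallel).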


#5

If you use Windows, the Windows Subsystem for Linux will let you run anything you could need.

If you run macOS or Linux, you’re ready to go: macOS is built on Unix and Linux is Unix-like.

In all likelihood, you could do this all directly from R with httr, but it may still take some work.


#6

Thanks. The data are so close... yet so far away.


#7

NetCDFs are popular with climate scientists and almost nobody else :sweat_smile: If you need advice on getting started with whichever product this is (e.g. accessing it), I can ask around the office and see if someone’s used it before.

The ncdf4 R package works well with NetCDFs, as does purrr (in fact, I *cough* just wrote a blog post on using purrr with file formats like NetCDF :wink: ). Someone’s also working on a package called tidync to make dealing with NetCDFs easier still, but I’m not sure how far along it is.
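
To give a flavour of the ncdf4 plus purrr combination, here is a minimal sketch. It assumes the subset files are already on disk, that their names follow the pattern in the link earlier in the thread (an eight-digit date stamp between dots), and that each file holds a variable called tsoil1, as the VARIABLES part of that link suggests. `file_date`, `read_nc_var`, and the "downloads" folder are names made up for illustration.

```r
# Sketch only: assumes the subset files are already on disk and that each
# holds a variable called "tsoil1" (the VARIABLES value in the example link).

# Parse the date stamp out of a name like "...Nx.20040101.SUB.nc4".
# Returns NA when the name carries no eight-digit stamp between dots.
file_date <- function(path) {
  stamp <- sub("^.*\\.([0-9]{8})\\..*$", "\\1", basename(path))
  as.Date(stamp, format = "%Y%m%d")
}

# Open one file, read one variable, and always close the handle.
read_nc_var <- function(path, varname = "tsoil1") {
  nc <- ncdf4::nc_open(path)
  on.exit(ncdf4::nc_close(nc))
  ncdf4::ncvar_get(nc, varname)
}

# Hypothetical usage; "downloads" is an assumed folder name.
# files  <- list.files("downloads", pattern = "\\.nc4$", full.names = TRUE)
# values <- purrr::map(files, read_nc_var)
# names(values) <- as.character(file_date(files))
```

Keeping the open/read/close steps in one small function is what makes the purrr call a one-liner: `purrr::map` just applies it to every path and hands back a list.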


#8

This blog post looks incredibly helpful! Thank you!

As for accessing the data, Scott Chamberlain has been working on it, and it sounds like there might be an interface for this dataset: https://github.com/ropensci/dappr


#9

I see this when I try to access that file:

r <- httr::GET("http://goldsmr2.gesdisc.eosdis.nasa.gov/daac-bin/OTF/HTTP_services.cgi?FILENAME=%2Fdata%2FMERRA%2FMST1NXMLD.5.2.0%2F2004%2F01%2FMERRA300.prod.simul.tavg1_2d_mld_Nx.20040101.hdf&FORMAT=bmM0Lw&BBOX=45.687%2C-95.804%2C45.694%2C-95.794&LABEL=MERRA300.prod.simul.tavg1_2d_mld_Nx.20040101.SUB.nc4&SHORTNAME=MST1NXMLD&SERVICE=SUBSET_MERRA&VERSION=1.02&LAYERS=&VARIABLES=tsoil1")
r
#> Response [https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid&app_type=401&client_id=e2WVk8Pw6weeLUKZYOxvTQ&response_type=code&redirect_uri=http%3A%2F%2Fgoldsmr2.gesdisc.eosdis.nasa.gov%2Fdata-redirect&state=aHR0cHM6Ly9nb2xkc21yMi5nZXNkaXNjLmVvc2Rpcy5uYXNhLmdvdi9kYWFjLWJpbi9PVEYvSFRUUF9zZXJ2aWNlcy5jZ2k%2FRklMRU5BTUU9JTJGZGF0YSUyRk1FUlJBJTJGTVNUMU5YTUxELjUuMi4wJTJGMjAwNCUyRjAxJTJGTUVSUkEzMDAucHJvZC5zaW11bC50YXZnMV8yZF9tbGRfTnguMjAwNDAxMDEuaGRmJkZPUk1BVD1ibU0wTHcmQkJPWD00NS42ODclMkMtOTUuODA0JTJDNDUuNjk0JTJDLTk1Ljc5NCZMQUJFTD1NRVJSQTMwMC5wcm9kLnNpbXVsLnRhdmcxXzJkX21sZF9OeC4yMDA0MDEwMS5TVUIubmM0JlNIT1JUTkFNRT1NU1QxTlhNTEQmU0VSVklDRT1TVUJTRVRfTUVSUkEmVkVSU0lPTj0xLjAyJkxBWUVSUz0mVkFSSUFCTEVTPXRzb2lsMQ]
#>   Date: 2018-01-18 23:36
#>   Status: 401
#>   Content-Type: text/html; charset=utf-8
#>   Size: 27 B
#> HTTP Basic: Access denied.

This suggests that you’ve logged into the site in your browser and it’s probably using cookies to remember you.

It might be possible to automate the log-in and download process with rvest, but if you haven’t done any web scraping before it’s going to be quite a lot of work (and I don’t think there’s a good single resource where you can learn the basics).
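
One possible shortcut, sketched here without having tested it against this server: libcurl can read a username and password from a .netrc file and keep cookies between requests, and httr can pass those options through. The sketch assumes you have an Earthdata Login account with an entry for urs.earthdata.nasa.gov in ~/.netrc; the cookie file path and the names `download_with_login` and `looks_like_data` are made up for illustration.

```r
# Sketch only, not verified against this server. Assumes an Earthdata Login
# account whose credentials sit in a ~/.netrc entry for urs.earthdata.nasa.gov.

download_with_login <- function(url, dest,
                                netrc_file  = path.expand("~/.netrc"),
                                cookie_file = path.expand("~/.urs_cookies")) {
  httr::GET(
    url,
    httr::config(
      followlocation = 1,            # follow the 302 redirect to the login host
      netrc          = 1,            # let libcurl read credentials from .netrc
      netrc_file     = netrc_file,
      cookiefile     = cookie_file,  # read cookies saved by earlier requests
      cookiejar      = cookie_file   # save cookies for later requests
    ),
    httr::write_disk(dest, overwrite = TRUE)
  )
}

# Hypothetical check: TRUE when the response looks like a data file rather
# than an error status or an HTML login page.
looks_like_data <- function(status, content_type) {
  is_html <- length(content_type) == 1 &&
    grepl("text/html", content_type, fixed = TRUE)
  status == 200 && !is_html
}

# Hypothetical usage:
# r <- download_with_login(url, "out.nc4")
# looks_like_data(httr::status_code(r), httr::headers(r)[["content-type"]])
```

Checking the status and content type after each download matters here, because a failed login still writes something to disk (an HTML page), and that is what produces the confusing nc_open error from the first post.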


#10

ncdf4 will now read this thredds/dap source directly, so I'd try using raster::raster on the link, e.g. https://rpubs.com/cyclemumner/380576

This topic is fraught, though: lots of options, lots of piecemeal history, and lots of confusion.