How to read Netcdf files

I have a problem to open netcdf files. This is a netcdf file with daily precipitation data in January over the globe. I tried to read precipitation, latitude and longitude data, but could not get it. For example, when reading precipitation data, there are two dimensions: "land" and "tstep". The "land" is latitude-longitude combination every half-degree spacing, while the "tstep" is day ID (which is from 1 to 31).
I don't know if the land is row-wise or column-wise from latitude and longitude alone? I can not attach the netcdf file, but I copy the relevant description here. Thanks for your help.

5 dimensions:
x Size:720
y Size:360
land Size:67420
compress: y x
note1: half-degree spacing
note2: land mask in file: WFD-land-lat-long-z.nc
tstep Size:31 *** is unlimited ***
E40yr Size:1

For example, if I want to extract data for a specific area: latitude (25 to 45 degree north), longitude (34 to 52 degree west), which rows of data should I consider? Currently, when using ncvar_get() function to get the precipitation data, it has two dimensions, "land" -67420 rows and "tstep" -31 columns.
When I try to get latitude and longitude separately, they each have two dimensions, "x" or longitude -720 rows and "y" or latitude -360 columns. Sincerely thanks for your help.

Hi @Ellenz! Sounds like you have a pretty interesting NetCDF on your hands :laughing:

I think we're gonna need to take a closer look at some of the values in these variables. With a half-degree grid, 360 values in y and 720 in x sounds like your grid.

I should say here that most NetCDF visualisation software (NCview, Panoply, etc.) is built to take in CF-compliant NetCDF files. Typically I would expect to see latitude and longitude dimensions whose values take on the late/lon values for the grid. Mashing two dimensions into one definitely isn't CF-compliant!

That said, 360x720 is 259 200, not 67 420 (although your output does mention a mask, do maybe the mask is stored in the file and land is the precipitation in grid cells over land only). I think we really need to see some of the values of the land dimension to figure out what's happening here.

Can you please open the land variable with nc_open() and then run str() on the resultant object? And maybe run off a histogram too, if it's a vector. Let's have a look at it :slightly_smiling_face:

2 Likes

I found out that there is another netcdf file contains "land". I opened this latter file and used str() on the resultant object, and got the following explanation. What does it mean?

 $ filename   : chr "wfd-land-lat-lon-z.nc"
 $ writable   : logi FALSE
 $ id         : int 327680
 $ safemode   : logi FALSE
 $ format     : chr "NC_FORMAT_NETCDF4_CLASSIC"
 $ is_GMT     : logi FALSE
 $ groups     :List of 1
  ..$ :List of 7
  .. ..$ id   : int 327680
  .. ..$ name : chr ""
  .. ..$ ndims: int 1
  .. ..$ nvars: int 6
  .. ..$ natts: int 7
  .. ..$ dimid: int [1(1d)] 0
  .. ..$ fqgn : chr ""
  .. ..- attr(*, "class")= chr "ncgroup4"
 $ fqgn2Rindex:List of 1
  ..$ : int 1
 $ ndims      : num 1
 $ natts      : num 7
 $ dim        :List of 1
  ..$ land:List of 10
  .. ..$ name         : chr "land"
  .. ..$ len          : int 67420
  .. ..$ unlim        : logi FALSE
  .. ..$ group_index  : int 1
  .. ..$ group_id     : int 327680
  .. ..$ id           : int 0
  .. ..$ dimvarid     :List of 5
  .. .. ..$ id         : int 2
  .. .. ..$ group_index: int 1
  .. .. ..$ group_id   : int 327680
  .. .. ..$ list_index : num -1
  .. .. ..$ isdimvar   : logi TRUE
  .. .. ..- attr(*, "class")= chr "ncid4"
  .. ..$ units        : chr ""
  .. ..$ vals         : int [1:67420(1d)] 26641 27361 30241 30961 31681 32401 33121 33841 34561 35281 ...
  .. ..$ create_dimvar: logi TRUE
  .. ..- attr(*, "class")= chr "ncdim4"
 $ unlimdimid : num -1
 $ nvars      : num 5
 $ var        :List of 5
  ..$ Longitude:List of 22
  .. ..$ id                :List of 5
  .. .. ..$ id         : num 0
  .. .. ..$ group_index: num -1
  .. .. ..$ group_id   : int 327680
  .. .. ..$ list_index : num 1
  .. .. ..$ isdimvar   : logi FALSE
  .. .. ..- attr(*, "class")= chr "ncid4"
  .. ..$ name              : chr "Longitude"
  .. ..$ ndims             : int 1
  .. ..$ natts             : int 2
  .. ..$ size              : int 67420
  .. ..$ dimids            : int 0
  .. ..$ prec              : chr "float"
  .. ..$ units             : chr "degrees_east"
  .. ..$ longname          : chr "Longitude"
  .. ..$ group_index       : int 1
  .. ..$ chunksizes        : logi NA
  .. ..$ storage           : num 1
  .. ..$ shuffle           : logi FALSE
  .. ..$ compression       : logi NA
  .. ..$ dims              : list()
  .. ..$ dim               :List of 1
  .. .. ..$ :List of 10
  .. .. .. ..$ name         : chr "land"
  .. .. .. ..$ len          : int 67420
  .. .. .. ..$ unlim        : logi FALSE
  .. .. .. ..$ group_index  : int 1
  .. .. .. ..$ group_id     : int 327680
  .. .. .. ..$ id           : int 0
  .. .. .. ..$ dimvarid     :List of 5
  .. .. .. .. ..$ id         : int 2
  .. .. .. .. ..$ group_index: int 1
  .. .. .. .. ..$ group_id   : int 327680
  .. .. .. .. ..$ list_index : num -1
  .. .. .. .. ..$ isdimvar   : logi TRUE
  .. .. .. .. ..- attr(*, "class")= chr "ncid4"
  .. .. .. ..$ units        : chr ""
  .. .. .. ..$ vals         : int [1:67420(1d)] 26641 27361 30241 30961 31681 32401 33121 33841 34561 35281 ...
  .. .. .. ..$ create_dimvar: logi TRUE
  .. .. .. ..- attr(*, "class")= chr "ncdim4"
  .. ..$ varsize           : int 67420
  .. ..$ unlim             : logi FALSE
  .. ..$ make_missing_value: logi FALSE

Also, in my initial question, when I try to get "latitude" and "longitude" values from the initial netcdf file, they both have two dimensions (720, 360). Does this mean that latitude and longitude have the same contents, which is longitude and latitude values of the whole file? Thanks again.

I found this tutorial very helpful when working with NetCDF data in R
http://geog.uoregon.edu/bartlein/courses/geog607/Rmd/netCDF_01.htm

No worries!

Getting an entire grid for lat and lon, can be a little confusing, because most of us are familiar with grids that run along latitude and longitude values. However, that's not necessarily the case: if your data uses a different map projection, you might have, say, a y index where not only the latitude varies as you travel along it, but also the longitude.

If you have a look at these values, you might find that all of the longitude values are the same in one dimension of the grid, and all of the latitude values are the same in the other dimension. If you're using ncdf4, you unfortunately have to subset grids using the grid indices, not actual lat/lon values. But if you're using tidync, that package'll actually let you make queries directly on the lat/lof on a NetCDF like:

tidync('myfile.nc') %>%
  hyper_filter(lat = lat < -30, lon = lon > 20)

Rather than using str() on the NetCDF object, are you able to extract the land variable with ncvar_get() and use str() on that? NetCDF files are handled kind of like databases (you open a connection to it first, then extract parts of the actual contents).

I'm gonna be honest with you: it's likely you're going to have go to back to the maintainer of this dataset, or the maker of the software that produced this file, and ask them how the file is structured. One of the advantages of NetCDFs is that they can store metadata like the units of each variable or dimension (as you can see with some of the other variables there in the str() output), or notes about how the data should be interpreted, and a lot of that is missing here. This is very far from what R folks would call 'tidy' data, and if you can just talk to the data source, it'll probably save us all a lot of time :confused:

Now I try to open the netcdf file with the mask values, using the below code.

info1= nc_open('wfd-land-lat-lon-z.nc')
info1.land= ncvar_get(info1,'land')
str(info1.land)

Then I got these values:
int [1:67420(1d)] 26641 27361 30241 30961 31681 32401 33121 33841 34561 35281 ...

Is the file from a public dataset that you can link to? I'd prefer to get it from a verifable source, if that's okay, since NetCDFs are binary. Plus, if we go back and forth on the thread, it means anyone else having a similar problem can see what we did about it :slight_smile:

I did not get it from a public source, but from a lab group. It works in the thread, so how should I do next? Thanks.

Have you had any luck speaking to the people from that lab? Is there any documentation accompanying the data you can perhaps consult?

Did you check out the guide RuReady droped?
It has an example with array dimensions 720 (longitudes) x 360 (latitudes) x 12 (months), similar to yours.

In that guide, the section "Get a single time slice of the data, create an R data frame, and write a .csv file" gets you closer to your question "if I want to extract data for a specific area: latitude (25 to 45 degree north), longitude (34 to 52 degree west),". Create a data.frame of observations and filter it down to the lat/long range you're interested in.


Looks like that guide does refer to the deprecated ncdf package, but for the NetCDF package, there's a handy guide here: https://journal.r-project.org/archive/2013-2/michna-woods.pdf

ncvar_get "Reads data from an existing netCDF file.... This routine reads data values from a variable in an existing netCDF file....." (read more in the docs). In your str call, you can see this in dim >> land >> vals.

Without a reproducible example or example data, it's hard to get you much closer other than to offer guides that document how to open and work with NetCDF datatypes.

2 Likes

Looks like that guide does refer to the deprecated ncdf package,

It covers both ncdf and ncdf4 packages. You should be able to find them using Ctrl + F

I found the values of the file: wfd-land-lat-lon-z.nc'. Now I have another question, for example, if I want to get the data for a specific region from the dataset below, assuming it is in a netcdf file. But the ID values are not consecutive, how to write numbers in count? If the 'prec' variable has two dimensions: ID and prec. Thanks.

pr.month1 = ncvar_get(pr.open,'prec',start=c(10,1),count=c(-1,-1))

ID lon lat prec
49186 -67.25 -55.75 8.2
49901 -69.75 -55.25 7.9
49902 -69.25 -55.25 8.4
49903 -68.75 -55.25 7.8
49904 -68.25 -55.25 8.9
49905 -67.75 -55.25 9.1
49906 -67.25 -55.25 9.0
50617 -71.75 -54.75 8.8
50619 -70.75 -54.75 8.7
50620 -70.25 -54.75 7.9
50621 -69.75 -54.75 7.8
50622 -69.25 -54.75 4.8
50623 -68.75 -54.75 7.9
50624 -68.25 -54.75 8.1
50625 -67.75 -54.75 7.2
50626 -67.25 -54.75 7.5
50627 -66.75 -54.75 7.3
50628 -66.25 -54.75 7.3
50629 -65.75 -54.75 8.0
50630 -65.25 -54.75 8.5

If you're running the entire variables (globally) out to a data frame, @Ellenz, then you can use dplyr::filter() to subset by region :slight_smile: For example, to get latitudes between 0 and 30 N and longitudes between 120 and 150 E:

library(tidyverse)
pr.month1 = pr.month1 %>% filter(lon > 120, lon < 150, lat > 0, lat < 30)

If you want to get a regional subset out without loading the whole contents of the NetCDF variable first, I'd recommend using tidync instead :slight_smile:

Thanks, I can load "tidync" package this time. It seems that it does not need to be closed after reading it. I will try it.
Also, I have found the below method to read and subset the original netcdf file.

ncFile <- nc_open( MyNetCDF )
LonIdx <- which( ncFile$dim$lon$vals > 10 & ncFile$dim$lon$vals < 30)
LatIdx <- which( ncFile$dim$lat$vals > 34.5 & ncFile$dim$lat$vals < 44.5)
MyVariable <- ncvar_get( ncFile, varName)[ LonIdx, LatIdx]
nc_close(ncFile)

If my variable has two dimensions: land and time. Land has been explained before, while time is the number of days in a month. So for January, is it correct to write the code as this? I want to include all days in the month, but it seems that it has 30 rather than 31 in the 2nd dimension of the resulting dataset.
MyVariable <- ncvar_get( ncFile, varName)[ LandIdx, -1]

1 Like