Methods of accessing data and metadata

STAC (SpatioTemporal Asset Catalog)

“Enabling online search and discovery of geospatial assets”

STAC is a specification that standardizes the language used to describe geospatial datasets, with the aim of increasing accessibility and interoperability across datasets. It is intended for use with data stored in the cloud.

Making data easily searchable

At its core, STAC provides a JSON file wrapper around any geospatial data (i.e. any data relating to the Earth). The goal of this wrapper file is to hold all the information that a user may want to search on when looking for a dataset. In this way, STAC seeks to make all Earth-related, cloud-optimized datasets easily searchable, but it does not provide the search tools themselves. STAC integrates well with a tool like Intake, which can search for and load the desired datasets.
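As a rough illustration of the wrapper idea, the sketch below builds a minimal STAC Item as a Python dictionary and checks it against a simple location and date query. The top-level field names follow the STAC Item specification, but the id, bounding box, date, and asset URL are invented for this example, and the matches function is a hypothetical helper rather than part of STAC or any search tool.

```python
from datetime import datetime, timezone

# Hypothetical STAC Item. The field names follow the STAC Item specification;
# the id, coordinates, date, and URL are invented for illustration only.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-sst-2020-01",
    "bbox": [110.0, -45.0, 155.0, -10.0],  # [west, south, east, north]
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [110.0, -45.0], [155.0, -45.0], [155.0, -10.0],
            [110.0, -10.0], [110.0, -45.0],
        ]],
    },
    "properties": {"datetime": "2020-01-15T00:00:00Z"},
    "links": [],
    "assets": {
        "data": {
            "href": "https://example.com/data/sst-2020-01.nc",
            "type": "application/x-netcdf",
        }
    },
}


def matches(item, lon, lat, start, end):
    """Return True if (lon, lat) is inside the item's bbox and the item's
    datetime falls within [start, end]. Hypothetical helper, not a STAC API."""
    west, south, east, north = item["bbox"]
    # Replace the trailing "Z" so older Python versions can parse the timestamp.
    stamp = datetime.fromisoformat(
        item["properties"]["datetime"].replace("Z", "+00:00")
    )
    in_box = west <= lon <= east and south <= lat <= north
    return in_box and start <= stamp <= end


start = datetime(2020, 1, 1, tzinfo=timezone.utc)
end = datetime(2020, 2, 1, tzinfo=timezone.utc)

# A point inside the bounding box, queried for January 2020.
hit = matches(item, 147.3, -42.9, start, end)
print(hit)
```

The point is only that the metadata needed for a search (where, when, and where the file lives) sits in the JSON wrapper, so a search tool never has to open the data file itself.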

Intake

“Taking the pain out of data access and distribution”

Intake is a Python package used to find, investigate, load, and disseminate data.

Data loading

Intake can load many different data formats (e.g. tabular data, multi-dimensional data, etc.) into a Python notebook or script using familiar containers (e.g. Pandas DataFrames, Xarray DataArrays, etc.). Intake can work on local, remote, or cloud computing infrastructure, and is relatively fast due to its ability to integrate with distributed computing tools (e.g. Dask).

There are a number of plugins that currently exist for different types of data. Several that may be of particular interest to the climate science community are:

Intake data catalogues at NCI

There are currently two Intake catalogues listing climate data hosted at NCI:

Other features

Intake (and many of its plugins) also provides:

  • a cataloguing system for listing data sources, metadata, and parameters

  • convenience functions that can be used to, among other things, distribute data catalogues

  • a GUI for investigating data sources and creating plots

Working with authorised catalogues

There is no direct way for Intake to open and load authorised catalogues. Including the username and password in the URL is enough to open the catalogue, but further data processing using Dask generates an error message relating to the “nonnumeric port” in the URL.

Nikhil Garg (Data61, CSIRO) advises creating a .netrc file and a .dodsrc file in the home directory to resolve this issue. These files are used by the netCDF4 library; when Intake uses Xarray and Dask to access the files, netCDF4 is utilised in the background.

The .netrc file contains details about the machine, username, and password. It is a single-line file that can be written in the following format: machine machine_name login user_ID password user_password

The .dodsrc file points to the location of the .netrc file, and can also be written as a single line: HTTP.NETRC=YourHomeDirectory/.netrc

Note: when working on NCI OOD, these files need to be created in the OOD home directory.

Hint: According to Nikhil, this method works for any tool (e.g. R, cdo, nco).
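The two files can also be written from Python, as in the minimal sketch below. The function name write_credential_files is hypothetical, and the arguments are the placeholders from the text above, not real credentials. Restricting the .netrc permissions to the owner is an assumption added here, since the file holds a plain-text password; it is not part of the original advice.

```python
import os
from pathlib import Path


def write_credential_files(home, machine, login, password):
    """Write single-line .netrc and .dodsrc files into the directory `home`.

    Hypothetical helper: existing files with the same names are overwritten.
    """
    home = Path(home)
    netrc_path = home / ".netrc"
    dodsrc_path = home / ".dodsrc"

    # .netrc: machine, username, and password on a single line.
    netrc_path.write_text(
        f"machine {machine} login {login} password {password}\n"
    )
    # Assumption: make the file readable and writable by the owner only.
    os.chmod(netrc_path, 0o600)

    # .dodsrc: points to the location of the .netrc file.
    dodsrc_path.write_text(f"HTTP.NETRC={netrc_path}\n")
    return netrc_path, dodsrc_path


# Example call (commented out so that running this file does not overwrite
# any existing credentials in your home directory):
# write_credential_files(Path.home(), "machine_name", "user_ID", "user_password")
```

On NCI OOD, the home argument would be the OOD home directory rather than the usual one.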

Intake and distributed client

Intake does not currently work with a Dask distributed client. To work around this issue, either avoid using a distributed client or open the data catalogue link directly with Xarray, e.g. ds = xarray.open_dataset("catalogue_link", chunks={"lat": 100, "lon": 100, "time": 500})