Data handling in Python
The data formats most commonly used in climate science are covered in more detail in the Large-scale climate data section of this book.
Both xarray and Iris can access most of them, provided that the dependencies for the formats are also installed.
A full list and guide of the formats accessible via xarray is available in the package documentation; the open_dataset() function can be called with different engines depending on the format.
netCDF4 is the main library used to read netCDF data; another option is h5netcdf, which is based on h5py and can be faster depending on the file structure.
To open GRIB files in xarray, either cfgrib or PyNIO needs to be installed. Xarray also supports additional backends, some developed by third parties, that further extend the list of accessible data formats.
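For example, a minimal sketch of selecting a backend explicitly (the file names here are hypothetical):

```python
import xarray as xr

# Default netCDF backend (requires the netCDF4 library)
ds = xr.open_dataset("tas_day.nc", engine="netcdf4")

# Alternative HDF5-based backend, sometimes faster
ds = xr.open_dataset("tas_day.nc", engine="h5netcdf")

# GRIB files, via the cfgrib backend
ds = xr.open_dataset("forecast.grib", engine="cfgrib")
```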
Iris is preferable if you need to access GRIB files, and it can also read PP files, a binary data format used by the UM model. Iris's main dependencies are netCDF4 and scipy.
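A minimal sketch of loading PP or GRIB data with Iris (the file name and variable are hypothetical):

```python
import iris

# Load all cubes from a UM PP file
cubes = iris.load("umfile.pp")

# Load a single cube matching a standard-name constraint
tas = iris.load_cube("umfile.pp", "air_temperature")
```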
Other libraries are useful to manage or access dataset collections, both local and remote. Siphon and pydap help access remote files on THREDDS and/or OPeNDAP services. Intake can be used to build a catalogue to help locate and query local datasets.
- netCDF4
netCDF4 is the Unidata library to handle netCDF files.
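A minimal sketch of direct use (file and variable names are hypothetical):

```python
from netCDF4 import Dataset

# Open a netCDF file read-only and extract a variable
ds = Dataset("tas_day.nc", "r")
tas = ds.variables["tas"][:]      # reads the data as a masked array
print(ds.variables["tas"].units)  # variable attributes are also accessible
ds.close()
```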
- h5netcdf
h5netcdf is an interface for the netCDF4 file format that reads and writes local or remote HDF5 files directly via h5py or h5pyd, without needing netCDF4. It can be faster than the latter depending on the actual file structure; see the engine example above for how to select it in xarray.
- Zarr
Zarr is a relatively new format to store chunked, compressed, N-dimensional arrays, optimised for cloud data access.
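A Zarr store can be opened directly with xarray; the store path below is hypothetical:

```python
import xarray as xr

# Open a Zarr store lazily; remote stores (e.g. s3:// or gs:// URLs)
# also work via fsspec if the relevant filesystem library is installed
ds = xr.open_zarr("tas.zarr")
```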
- cfgrib
cfgrib is a Python interface to ecCodes, a set of tools for decoding and encoding GRIB1 and GRIB2 files (ecCodes replaced the older GRIB-API). cfgrib makes it possible to open a GRIB file with xarray and Iris.
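GRIB files often mix several vertical level types, which cfgrib can filter at open time. A sketch with a hypothetical file:

```python
import xarray as xr

# Keep only the surface-level messages of the GRIB file
ds = xr.open_dataset(
    "forecast.grib",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "surface"}},
)
```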
- pygrib
pygrib is another high-level interface to ecCodes to read and write GRIB files.
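A minimal sketch (the file name and message name are hypothetical):

```python
import pygrib

grbs = pygrib.open("forecast.grib")
# Select messages by name and read the first match
grb = grbs.select(name="2 metre temperature")[0]
data, lats, lons = grb.data()
grbs.close()
```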
- cfunits
cfunits is an interface to the UDUNITS-2 library with CF extensions; it can store, combine and compare physical units and convert numeric values between units.
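For example, a small sketch of unit conversion and comparison:

```python
from cfunits import Units

# Convert a value from degrees Celsius to Kelvin
Units.conform(25.0, Units("degC"), Units("K"))  # -> 298.15

# Check whether two units are interchangeable
Units("hPa").equivalent(Units("Pa"))  # -> True
```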
- cftime
cftime is used to decode time units and variable values in a CF-compliant netCDF file.
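For example, decoding raw time values given CF units and a calendar:

```python
import cftime

# Convert numeric offsets into calendar-aware datetime objects
dates = cftime.num2date(
    [0, 31, 59], units="days since 2000-01-01", calendar="noleap"
)
```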
- intake
intake supports building catalogues of datasets that are easy to navigate and query, and that can be augmented with metadata. We cover intake and its extensions in more depth here.
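A minimal sketch, assuming a hypothetical catalogue file whose source is defined with an intake-xarray style driver:

```python
import intake

# Open a catalogue file and lazily load one of its sources
cat = intake.open_catalog("catalogue.yaml")
ds = cat["cmip6_tas"].to_dask()  # returns an xarray/dask object
```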
- siphon
siphon provides utilities to navigate and download data from remote data services, in particular THREDDS servers.
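For example, listing the datasets in a THREDDS catalogue (the URL is illustrative):

```python
from siphon.catalog import TDSCatalog

cat = TDSCatalog("https://thredds.example.org/thredds/catalog/catalog.xml")
print(list(cat.datasets))  # datasets available in the catalogue
```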
- CleF
CleF (Climate Finder) is a Python-based command line tool to discover ESGF datasets at NCI; its functions can also be imported and called in an interactive session or script.
- IOOS Compliance Checker
The IOOS Compliance Checker checks netCDF file metadata against the CF and ACDD conventions.
HDF data access
Satellite data is more commonly available as HDF. There are several packages to handle HDF data; which one is best really depends on the specific HDF format, as HDF files vary widely in characteristics and complexity. Some HDF files can also be read as rasters using libraries like rioxarray.
- hdf5
An interface specific to the HDF5 data format.
- h5py
h5py is an interface to the HDF5 data format.
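A minimal sketch (the file and dataset names are hypothetical):

```python
import h5py

with h5py.File("satellite.h5", "r") as f:
    print(list(f.keys()))    # groups/datasets at the file root
    data = f["radiance"][:]  # read a dataset into a numpy array
```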
- pyhdf
pyhdf (previously known as python-hdf4) is an interface to the HDF4 data format, including the HDF4-EOS format.
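For example, reading an HDF4 scientific dataset (file and dataset names are hypothetical):

```python
from pyhdf.SD import SD, SDC

# Open an HDF4 file read-only
hdf = SD("modis_granule.hdf", SDC.READ)
print(hdf.datasets())       # list the available datasets
sst = hdf.select("sst")[:]  # read one dataset as a numpy array
```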
- pytables
pytables can manipulate data tables and array objects in a hierarchical structure, based on the HDF5 library.
Other raster data access
While less commonly used in core climate science research, geographical data formats such as raster data are not uncommon in meteorology, and GIS data is becoming more common in climate adaptation and impact studies. An example is the Cloud Optimised GeoTIFF (COG) format, which is increasingly popular for serving satellite data from cloud servers. As with HDF, there is a variety of libraries whose usefulness will depend on the exact nature of your raster or GIS data.
- xgrads
xgrads parses the CTL descriptor of GrADS binary data to load the data as an xarray dataset.
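A minimal sketch, assuming the open_CtlDataset function and a hypothetical descriptor file:

```python
from xgrads import open_CtlDataset

# Parse the GrADS CTL descriptor and load the binary data as xarray
ds = open_CtlDataset("model_output.ctl")
```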
- rasterio
rasterio is a raster data library based on the GDAL data model.
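For example, reading one band of a hypothetical GeoTIFF:

```python
import rasterio

with rasterio.open("dem.tif") as src:
    band1 = src.read(1)              # first band as a numpy array
    print(src.crs, src.transform)    # georeferencing information
```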
- rioxarray
rioxarray extends xarray to work seamlessly with rasterio.
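A small sketch with a hypothetical GeoTIFF:

```python
import rioxarray

# Open a GeoTIFF (or COG) as an xarray object with CRS-aware accessors
da = rioxarray.open_rasterio("dem.tif")
da_wgs84 = da.rio.reproject("EPSG:4326")
```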
- rasterstats
rasterstats summarises geospatial raster datasets based on vector geometries; it includes functions for zonal statistics and interpolated point queries.
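For example, zonal statistics over hypothetical input files:

```python
from rasterstats import zonal_stats

# Mean and max of the raster within each polygon
stats = zonal_stats("catchments.shp", "rainfall.tif", stats=["mean", "max"])
```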
- shapely
shapely is a module to manipulate and analyse geometric objects in the Cartesian plane.
- geopandas
geopandas combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple shapely geometries.
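A minimal sketch with a hypothetical shapefile:

```python
import geopandas as gpd

gdf = gpd.read_file("regions.shp")  # read vector data into a GeoDataFrame
print(gdf.crs)                      # coordinate reference system
gdf["area"] = gdf.geometry.area     # shapely operations, vectorised
```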
- fiona
Fiona reads and writes geographic data files; it contains extension modules for GDAL.
- gdal
The GDAL Python bindings provide a wrapper around the C Geospatial Data Abstraction Library. GDAL can manipulate many geospatial data formats, including netCDF, HDF, GRIB and GIS formats as well as several image formats, so it is useful for converting between formats.
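For example, a format-conversion sketch (the paths are illustrative; multi-variable netCDF files may need GDAL's subdataset syntax):

```python
from osgeo import gdal

# Convert a netCDF file to GeoTIFF
gdal.Translate("tas.tif", "tas_day.nc")
```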
Grid handling
- xESMF
xESMF is a universal regridder for geospatial data. It is part of the Pangeo ecosystem.
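A minimal regridding sketch (the input file is hypothetical):

```python
import xarray as xr
import xesmf as xe

ds_in = xr.open_dataset("tas_day.nc")        # source grid
ds_out = xe.util.grid_global(1.0, 1.0)       # 1x1 degree target grid
regridder = xe.Regridder(ds_in, ds_out, "bilinear")
tas_1deg = regridder(ds_in["tas"])
```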
- xgcm
xgcm extends the xarray data model to finite volume grid cells (common in General Circulation Models) and provides interpolation and difference operations for such grids.
- gridded
gridded provides a single API for accessing and working with gridded ocean model results on multiple grid types.
- pyproj
pyproj is an interface to PROJ, a cartographic projections and coordinate transformations library.
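For example, transforming coordinates between reference systems (the points are illustrative):

```python
from pyproj import Transformer

# Lon/lat (WGS84) to Australian Albers (EPSG:3577)
t = Transformer.from_crs("EPSG:4326", "EPSG:3577", always_xy=True)
x, y = t.transform(149.13, -35.28)
```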
General purpose packages
The following packages are not specific to climate or science, but they are really useful for handling generic tasks. os and sys perform operating system functions and, together with glob, are useful to handle directories and files. datetime and calendar help manage time-related information. The csv package is useful to handle tabular ASCII data, while pyyaml and json are often used for configuration files as well as other kinds of metadata.
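For example, a small sketch combining some of these modules (the paths are hypothetical):

```python
import glob
import os
from datetime import datetime

# Find all netCDF files in a directory and report their modification time
for path in sorted(glob.glob("data/*.nc")):
    mtime = datetime.fromtimestamp(os.path.getmtime(path))
    print(path, mtime.isoformat())
```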
Finally, to pass input parameters to a Python script you can use sys.argv for a basic approach; argparse and click provide more features. Both packages provide automatically generated help messages, and you can define input types, defaults and valid values. The click package can also be used to create command-line based programs.
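A minimal argparse sketch (the argument names are illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Process a climate data file")
parser.add_argument("input", help="input netCDF file")
parser.add_argument("--variable", default="tas", help="variable to read")
parser.add_argument("--year", type=int, choices=range(1950, 2101))
args = parser.parse_args()
print(args.input, args.variable, args.year)
```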
Some of these modules are distributed with the main Python library (indicated with *); you still need to import them in a script, but there is no need to install them.
- os
os offers an operating system interface. (*)
- sys
sys is the interface to system-specific parameters and functions; for example, sys.argv gives access to the input parameters passed to a Python script. (*)
- glob
glob finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. (*)
- time
time provides time-related functions. (*)
- datetime
datetime is used to manipulate dates and times. (*)
- dateutil
dateutil is an extension to datetime.
- calendar
calendar provides calendar-related functions. (*)
- csv
csv is used to read and write CSV files. (*)
- json
json handles JSON files, which are useful to store table information and to pass schemas, vocabularies and other dictionary-style information to programs. (*)
- pyyaml
pyyaml loads and parses YAML files, which are often used for program and model configurations.
- argparse
argparse is useful to handle inputs and to write user-friendly command-line interfaces. (*)
- click
click allows you to create command line interfaces; it is more powerful than argparse.
- sqlite3
sqlite3 is an interface to SQLite databases. It is easier to use than other libraries but fairly basic. (*)
- SQLAlchemy
SQLAlchemy is an interface to SQL-based databases, including PostgreSQL, MySQL and SQLite. It is very powerful but more complex to use.
- requests
requests is an HTTP interface which is useful for downloading data from websites.
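For example, downloading a file over HTTP (the URL is illustrative):

```python
import requests

r = requests.get("https://example.org/data/tas_day.nc")
r.raise_for_status()  # fail loudly on HTTP errors
with open("tas_day.nc", "wb") as f:
    f.write(r.content)
```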
- ftplib
ftplib is an interface to the FTP protocol; it is useful to download data from an FTP server. (*)
- BeautifulSoup
BeautifulSoup is a very useful library to parse XML/HTML content. It can be handy when trying to download data from a website.
- tkinter
The tkinter package (“Tk interface”) is the standard Python interface to the Tcl/Tk GUI toolkit. (*)
Visualization
Matplotlib is a popular and versatile plotting library; it integrates seamlessly with other packages and the JupyterLab environment. As it is so widely used, there are also many third-party libraries that extend its capabilities.
Cartopy is used to visualise data on accurate maps; it replaced basemap, which is now considered obsolete.
Other packages to consider are seaborn, holoviews and plotly.
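For example, a minimal sketch of plotting a field on a map with matplotlib and cartopy (the file and variable names are hypothetical):

```python
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import xarray as xr

ds = xr.open_dataset("tas_day.nc")
ax = plt.axes(projection=ccrs.PlateCarree())
ds["tas"].isel(time=0).plot(ax=ax, transform=ccrs.PlateCarree())
ax.coastlines()
plt.show()
```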
- matplotlib
matplotlib is a comprehensive library for creating static, animated, and interactive visualizations.
- cartopy
cartopy helps handle and visualise cartographic data.
- cmocean
cmocean provides colourmaps specifically created for common oceanographic variables, to use with matplotlib.
- seaborn
seaborn is based on matplotlib and is used to make statistical graphics.
- plotly
plotly for Python makes quality interactive graphs; plotly is also the basis of dash, a framework to quickly create web data applications.
- bokeh
bokeh is a library for creating interactive visualizations for web browsers and dashboards.
- holoviews
holoviews helps visualise data for exploration rather than to produce final graphs.
- hvPlot
hvPlot provides a high-level plotting API built on HoloViews, offering a general and consistent interface for plotting data in a wide variety of formats.
- GeoViews
GeoViews makes it easy to explore and visualize geographical, meteorological, and oceanographic datasets. It is built on holoviews.
- xhistogram
xhistogram makes it easier to calculate flexible, complex histograms with multi-dimensional data. It integrates with xarray and dask.
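A minimal sketch (the file, variable and dimension names are hypothetical):

```python
import numpy as np
import xarray as xr
from xhistogram.xarray import histogram

tas = xr.open_dataset("tas_day.nc")["tas"]
bins = np.arange(240, 320, 2.0)
# Histogram over the spatial dimensions, keeping the time dimension;
# the computation is dask-backed if tas is chunked
h = histogram(tas, bins=[bins], dim=["lat", "lon"])
```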
Interfaces to other software
- CDO-python
CDO for Python is a wrapper around the CDO binary: it parses method arguments and options, builds a command line and executes it.
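A minimal sketch (the file names are illustrative; the returnXDataset option is available in recent versions of the bindings):

```python
from cdo import Cdo

cdo = Cdo()
# Compute annual means by calling the CDO binary under the hood
cdo.yearmean(input="tas_day.nc", output="tas_year.nc")
# Operators can also return the result directly as an xarray dataset
ds = cdo.yearmean(input="tas_day.nc", returnXDataset=True)
```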
- PyNCML
PyNCML is a simple Python library to apply NcML logic to netCDF files (last updated in 2017, so potentially obsolete).
- PyNIO
PyNIO is a Python interface to NCL; note that NCL is currently in maintenance mode (last updated in 2019).
- PyNGL
PyNGL is also an interface to NCL, but for visualization.
Working in parallel
- Dask
Dask is a library for working transparently with larger-than-memory arrays and for parallel data analysis. Xarray can use Dask arrays as a backend when opening a netCDF file with the chunks argument, and Dask has its own pandas-like DataFrame implementation. Dask splits an array up into chunks; when doing operations on a Dask array, rather than evaluating the operation immediately, Dask creates a task graph of the operations needed to produce the output array chunks from the input array chunks. The task graph is only evaluated when results are needed (e.g. by saving to a file or creating a plot), and different chunks can be evaluated in parallel. Dask can usually work with your existing code with only small modifications, and it will try to work out the best scaling based on the memory and CPUs it detects on the system. You can tune Dask performance by adjusting the thread/process mixture to deal with GIL-holding computations (which are rare in NumPy/Pandas/Scikit-Learn workflows). It is best to start a [dask.distributed.Client](https://docs.dask.org/en/latest/how-to/deploy-dask/single-distributed.html) to allow Dask to process data in parallel with multiple processes.
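A minimal sketch of the chunked, lazy workflow described above (the file name is hypothetical):

```python
import xarray as xr
from dask.distributed import Client

client = Client()  # local cluster using the available cores

# Opening with chunks gives dask-backed, lazily evaluated arrays
ds = xr.open_dataset("tas_day.nc", chunks={"time": 365})
clim = ds["tas"].groupby("time.month").mean("time")  # builds a task graph
clim = clim.compute()  # evaluation happens here, in parallel
```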
- multiprocessing
The built-in Python multiprocessing library has low-level tools for parallel computing. You can create a ‘pool’ of processes; given a function and a list of arguments, the pool can run that function on each argument in parallel.
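For example, a minimal pool sketch (the worker function is a placeholder):

```python
from multiprocessing import Pool

def process(year):
    # Placeholder for per-year work, e.g. reading and reducing one file
    return year * 2

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(process, range(2000, 2010))
```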
- mpi4py
mpi4py is an implementation of the MPI library for Python, with which you can write parallel programs in the same way as in Fortran, sending data between processes via messages.
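A minimal message-passing sketch, intended to be launched with e.g. mpirun -np 2:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"years": list(range(2000, 2010))}
    comm.send(data, dest=1, tag=0)   # message passing, as in Fortran MPI
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print("rank 1 received", data)
```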
- xarray-beam
xarray-beam is a library for writing Apache Beam pipelines consisting of xarray Dataset objects. This is a new module, developed as part of the Pangeo stack. Its main aim is to provide an alternative to Dask for climate data analysis cases where Dask is not suitable or efficient, facilitating data transformations and analysis on large-scale multi-dimensional labelled arrays.