Data handling in Python

The data formats most commonly used in climate science are covered in more detail in the Large-scale climate data section of this book. Both xarray and Iris can access most of them, provided the dependencies for those formats are also installed. A full list of formats accessible via xarray, with guidance, is available in the package documentation. The open_dataset() function can be called with different engines: netCDF4 is the main library for reading netCDF data, while h5netcdf, which is based on h5py, is an alternative that can be faster depending on the file structure. To open GRIB files in xarray, either cfgrib or PyNIO needs to be installed. Xarray also has additional features, some developed by third parties, that extend the list of accessible data formats further. Iris is preferable if you need to access GRIB files; it can also read PP files, a binary data format used by the UM model. Iris's main dependencies are netCDF4 and scipy. Other libraries are useful for managing or accessing dataset collections, both local and remote: Siphon and Pydap help access remote files on THREDDS and/or OPeNDAP services, while Intake can be used to build a catalogue to help locate and query local datasets.
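
As a minimal sketch of switching engines in xarray (the file names here are hypothetical placeholders for your own data):

```python
import xarray as xr

# Default engine for netCDF files is netCDF4
ds = xr.open_dataset("data.nc", engine="netcdf4")

# h5netcdf can be faster, depending on the file structure
ds = xr.open_dataset("data.nc", engine="h5netcdf")

# GRIB files require cfgrib (or PyNIO) to be installed
ds = xr.open_dataset("data.grib", engine="cfgrib")
```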

netCDF4

netCDF4 is the Unidata library for handling netCDF files.

h5netcdf

h5netcdf is an interface to the netCDF4 file format that reads and writes local or remote HDF5 files directly via h5py or h5pyd, without needing the netCDF4 library. It can be faster than netCDF4 depending on the actual file structure.

Zarr

Zarr is a relatively new format for storing chunked, compressed, N-dimensional arrays, optimised for cloud data access.
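
A short sketch of round-tripping a dataset through a Zarr store with xarray (the file and store names are hypothetical):

```python
import xarray as xr

# Write an existing dataset to a Zarr store
# (a directory of chunked, compressed arrays)
ds = xr.open_dataset("data.nc")
ds.to_zarr("data.zarr")

# Read it back lazily; chunks are only fetched when needed
ds = xr.open_zarr("data.zarr")
```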

cfgrib

cfgrib is a Python interface to ecCodes, a set of tools for decoding and encoding GRIB1 and GRIB2 files; ecCodes replaced the older GRIB API. cfgrib allows GRIB files to be opened with xarray and Iris.

Pygrib

pygrib is another high-level interface to ecCodes for reading and writing GRIB files.

cfunits

cfunits is an interface to the UDUNITS-2 library with CF extensions; it stores, combines and compares physical units and converts numeric values to different units.
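
A brief sketch of unit conversion with cfunits, assuming the library is installed:

```python
from cfunits import Units

# Convert numeric values between compatible units
metres = Units("m")
kilometres = Units("km")

# 2500 m expressed in km
distance_km = Units.conform(2500.0, metres, kilometres)
```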

cftime

cftime is used to decode time units and time variable values in a CF-compliant netCDF file.
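
A small sketch of decoding and encoding CF time values with cftime, assuming the library is installed:

```python
import cftime

# Decode numeric time values given CF units and a calendar
dates = cftime.num2date([0, 31], units="days since 2000-01-01",
                        calendar="noleap")

# Encode datetime objects back to numbers with the same units
nums = cftime.date2num(dates, units="days since 2000-01-01",
                       calendar="noleap")
```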

intake

intake supports building catalogues of datasets that are easy to navigate and query and that can be augmented with metadata. We cover intake and its extensions in more depth here.

siphon

siphon provides utilities to navigate and download data from remote data services, in particular THREDDS servers.
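
A sketch of browsing a THREDDS catalogue with siphon (the catalogue URL below is illustrative, not a real server):

```python
from siphon.catalog import TDSCatalog

# Point at a THREDDS catalogue XML endpoint
cat = TDSCatalog("https://thredds.example.org/thredds/catalog.xml")

# List the datasets and sub-catalogues it exposes
print(list(cat.datasets))
print(list(cat.catalog_refs))
```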

CleF

CleF (Climate Finder) is a Python-based command-line tool to discover ESGF datasets at NCI; its functions can also be imported and called in an interactive session or script.

IOOS Compliance Checker

The IOOS Compliance Checker checks netCDF file metadata against the CF and ACDD conventions.

HDF data access

Satellite data is more commonly available as HDF. There are several packages to handle HDF data; which is best really depends on the specific HDF format, as HDF files vary widely in characteristics and complexity. Some HDF files can also be read as raster data using libraries like rioxarray.

hdf5

HDF5 is the hierarchical data format and underlying library on which the Python interfaces below are built.

h5py

h5py is an interface to the HDF5 data format

pyhdf

pyhdf (previously known as python-hdf4) is an interface to the HDF4 data format, including HDF-EOS files.

pytables

pytables manipulates data tables and array objects in a hierarchical structure, based on the HDF5 library.

Other raster data access

While less commonly used in core climate science research, raster data formats are not uncommon in meteorology, and GIS data is becoming more common in climate adaptation and impact studies. An example is the Cloud Optimised GeoTIFF (COG) format, which is increasingly popular for serving satellite data from cloud servers. As with HDF, there is a variety of libraries whose usefulness will depend on the exact nature of your raster or GIS data.

xgrads

xgrads parses the CTL descriptor of GrADS binary data and loads the data as an xarray dataset.

rasterio

rasterio is a raster data library based on the GDAL data model.

rioxarray

rioxarray extends xarray to work seamlessly with rasterio.

rasterstats

rasterstats summarises geospatial raster datasets based on vector geometries. It includes functions for zonal statistics and interpolated point queries.

shapely

shapely is a module to manipulate and analyse geometric objects in the Cartesian plane.

geopandas

geopandas combines the capabilities of pandas and shapely, providing geospatial operations in pandas and a high-level interface to multiple shapely geometries.
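
A minimal sketch of typical geopandas usage (the file name is a hypothetical placeholder):

```python
import geopandas as gpd

# Read a vector file (shapefile, GeoJSON, ...) into a GeoDataFrame
gdf = gpd.read_file("regions.shp")

# pandas-style operations combined with shapely geometry methods
print(gdf.head())
gdf["area"] = gdf.geometry.area
```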

fiona

Fiona reads and writes geographic data files; it contains extension modules for GDAL.

gdal

The GDAL Python bindings provide a wrapper around the C Geospatial Data Abstraction Library. GDAL can manipulate many geospatial formats, including netCDF, HDF, GRIB and GIS data formats as well as several image formats, so it can be useful to convert between formats.

Grid handling

xESMF

xESMF is a universal regridder for geospatial data. It is part of the Pangeo ecosystem.
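
As a sketch of typical xESMF usage (the input file and target resolution are hypothetical), a regridder is built once from the source and target grids and then applied to a dataset:

```python
import xarray as xr
import xesmf as xe

# Source dataset; must contain lat/lon coordinates
ds_in = xr.open_dataset("input.nc")

# Target grid: a global 1x1 degree grid from xESMF's utilities
ds_out = xe.util.grid_global(1.0, 1.0)

# Build the regridder once, then apply it to the whole dataset
regridder = xe.Regridder(ds_in, ds_out, "bilinear")
ds_regridded = regridder(ds_in)
```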

xgcm

xgcm extends the xarray data model to finite volume grid cells (common in General Circulation Models) and provides interpolation and difference operations for such grids.

gridded

gridded provides a single API for accessing and working with gridded ocean model results on multiple grid types.

pyproj

pyproj is an interface to PROJ, a cartographic projections and coordinate transformations library.
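
A short sketch of a coordinate transformation with pyproj, assuming the library is installed (the sample coordinates are illustrative):

```python
from pyproj import Transformer

# Transform lon/lat (EPSG:4326) to Web Mercator (EPSG:3857);
# always_xy=True forces (lon, lat) axis order
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857",
                                   always_xy=True)
x, y = transformer.transform(151.2, -33.9)  # roughly Sydney
```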

General purpose packages

The following packages are not specific to climate science, but they are really useful for generic tasks. os and sys perform operating-system functions and, together with glob, are useful for handling directories and files. datetime and calendar help manage time-related information. The csv package is useful for handling tabular ASCII data, while pyyaml and json are often used for configuration files as well as other kinds of metadata. Finally, to pass input parameters to a Python script you can use sys.argv for a basic approach; argparse and click provide more features. Both packages provide automatically generated help messages, and you can define input types, defaults and valid values. click can also be used to create command-line programs. Some of these modules are distributed with the main Python library (indicated with *); you still need to import them in a script, but there is no need to install them.


os

os offers an operating-system interface. (*)

sys

sys is the interface to system-specific parameters and functions; for example, sys.argv gives access to the input parameters passed to a Python script. (*)

glob

glob finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. (*)
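
A small sketch of matching climate file names with glob (the file names follow a common variable_year pattern and are created here just for illustration):

```python
import glob
import os
import tempfile

# Create some empty files to match against
tmpdir = tempfile.mkdtemp()
for name in ("tas_2000.nc", "tas_2001.nc", "pr_2000.nc"):
    open(os.path.join(tmpdir, name), "w").close()

# Shell-style wildcard: all 'tas' files, in sorted order
tas_files = sorted(glob.glob(os.path.join(tmpdir, "tas_*.nc")))
```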

time

time provides time related functions. (*)

datetime

datetime is used to manipulate date and time. (*)
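
A brief sketch of common datetime operations: arithmetic with timedelta, and formatting/parsing with strftime/strptime:

```python
from datetime import datetime, timedelta

start = datetime(2000, 1, 1)
end = start + timedelta(days=90)       # date arithmetic

# Format as a string, then parse it back
stamp = end.strftime("%Y-%m-%d")       # '2000-03-31'
parsed = datetime.strptime(stamp, "%Y-%m-%d")
```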

dateutil

dateutil is an extension to datetime.

calendar

calendar provides calendar related functions. (*)

csv

csv is used to read and write csv files. (*)

json

json handles JSON files, which are useful for storing table information and passing schemas, vocabularies, and other dictionary-style information to programs. (*)
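
A minimal sketch of serialising a dictionary-style configuration to JSON and back (the configuration keys here are made up for illustration):

```python
import json

config = {"model": "ACCESS-ESM1-5",
          "experiment": "historical",
          "members": [1, 2, 3]}

# Serialise to a JSON string and restore it
text = json.dumps(config, indent=2)
restored = json.loads(text)
```

json.dump() and json.load() work the same way but read from and write to file objects.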

pyyaml

pyyaml loads and parses YAML files, which are often used for program and model configurations.

argparse

argparse is useful for handling inputs and for writing user-friendly command-line interfaces. (*)
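
A small sketch of argparse showing input types, defaults and valid values (the option names are made up; parse_args is given an explicit list here instead of reading sys.argv):

```python
import argparse

parser = argparse.ArgumentParser(description="Example CLI")
parser.add_argument("-v", "--variable", default="tas",
                    help="variable name")
parser.add_argument("-y", "--year", type=int,
                    choices=range(1900, 2101),
                    help="year to process")

# In a real script you would call parser.parse_args() with no arguments
args = parser.parse_args(["-v", "pr", "-y", "2000"])
```

Running the script with -h prints an automatically generated help message.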

click

click allows you to create command-line interfaces; it is more powerful than argparse.

sqlite3

sqlite3 is an interface to SQLite databases. It is easier to use than other database libraries but fairly basic. (*)
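
A minimal sketch of sqlite3 usage with an in-memory database (the table and values are invented for illustration; pass a file path to connect() for a persistent database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table and insert rows with parameter substitution
cur.execute("CREATE TABLE runs (model TEXT, year INTEGER)")
cur.executemany("INSERT INTO runs VALUES (?, ?)",
                [("ACCESS", 2000), ("ACCESS", 2001)])
conn.commit()

# Query it back
rows = cur.execute("SELECT COUNT(*) FROM runs").fetchall()
```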

SQLalchemy

SQLAlchemy is an interface to SQL-based databases, including PostgreSQL, MySQL and SQLite. It is very powerful but complex to use.

requests

requests is an HTTP interface which is useful to download data from a website.

ftplib

ftplib is an interface to the FTP protocol; it is useful for downloading data from an FTP server. (*)

BeautifulSoup

BeautifulSoup is a very useful library to parse xml/html content. It can be handy when trying to download data from a website.

tkinter

The tkinter package (“Tk interface”) is the standard Python interface to the Tcl/Tk GUI toolkit (*)

Visualization

Matplotlib is a popular and useful plotting library that integrates seamlessly with other packages and the JupyterLab environment. As it is so widely used, there are also many third-party libraries that extend its capabilities. Cartopy is used to visualise data on accurate maps; it replaced basemap, which is now considered obsolete.
Other packages to consider are seaborn, holoviews and plotly.

matplotlib

matplotlib is a comprehensive library for creating static, animated, and interactive visualizations

cartopy

cartopy helps handling and visualising cartographic data

cmocean

cmocean provides colourmaps specifically created for common oceanographic variables to use with matplotlib

seaborn

seaborn is based on matplotlib and is used to make statistical graphics

plotly

plotly allows you to make quality interactive graphs; plotly is also the basis of Dash, a framework for quickly creating web data applications.

bokeh

bokeh is a library for creating interactive visualizations for web browsers and dashboards.

holoviews

holoviews helps with visualising data for exploration rather than for creating finished graphs.

hvPlot

hvPlot is a high-level plotting API built on HoloViews that offers a general and consistent way to plot data in a wide variety of formats.

GeoViews

GeoViews makes it easy to explore and visualise geographical, meteorological, and oceanographic datasets. It is built on HoloViews.

xhistogram

xhistogram makes it easier to calculate flexible, complex histograms with multi-dimensional data. It integrates with xarray and dask.


Interfaces to other software

CDO-python

CDO for Python is a wrapper around the CDO binary. It parses method arguments and options, builds a command line and executes it.

PyNCML

PyNCML is a simple Python library to apply NcML logic to netCDF files. Note that it was last updated in 2017, so it is potentially obsolete.

PyNIO

PyNIO is a Python interface to NCL; note that NCL is currently in maintenance mode (last updated in 2019).

PyNGL

PyNGL is also an interface to NCL, but for visualisation.


Working in parallel


Dask

Dask is a library for working transparently with larger-than-memory arrays and for parallel data analysis. Xarray can use Dask arrays as a backend when opening a netCDF file with the chunks argument, and Dask has its own pandas-like DataFrame implementation. Dask splits an array up into chunks. When doing operations on a Dask array, rather than evaluating the operation immediately, Dask creates a task graph of the operations needed to produce the output array chunks from the input array chunks. The task graph is only evaluated when results are needed (e.g. by saving to a file or creating a plot), and different chunks can be evaluated in parallel. Dask can usually work with your existing code with only small modifications, and it will try to work out the best scaling based on the memory and CPUs it detects on the system. You can tune Dask's performance by adjusting the thread/process mixture to deal with GIL-holding computations (which are rare in NumPy/Pandas/Scikit-Learn workflows). It is best to start a [dask.distributed.Client](https://docs.dask.org/en/latest/how-to/deploy-dask/single-distributed.html) to allow Dask to process data in parallel with multiple processes.
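
A small sketch of lazy evaluation with a Dask array, assuming dask is installed (array size and chunk shape are arbitrary):

```python
import dask.array as da

# A 10000x10000 array split into 1000x1000 chunks; nothing is computed yet
x = da.ones((10000, 10000), chunks=(1000, 1000))

# Operations build a task graph lazily
result = (x + x.T).mean()

# Evaluation only happens on compute(); chunks are processed in parallel
print(result.compute())  # 2.0
```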

multiprocessing

The built-in Python multiprocessing library has low-level tools for parallel computing. You can create a ‘pool’ of processes, then given a function and a list of arguments it can run that function on each argument in parallel.
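
A minimal sketch of a process pool mapping a function over a list of arguments (the function and values are invented for illustration):

```python
from multiprocessing import Pool

def square(x):
    # The worker function must be defined at module level
    # so child processes can import it
    return x * x

def parallel_squares(values, processes=4):
    # Run 'square' on each value in parallel across the pool
    with Pool(processes) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(parallel_squares(range(5)))  # [0, 1, 4, 9, 16]
```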

mpi4py

mpi4py is an implementation of the MPI library for Python, with which you can create parallel methods in the same way as in Fortran, sending data between processes via messages.

xarray-beam

xarray-beam is a library for writing Apache Beam pipelines consisting of xarray Dataset objects. It is a new module developed as part of the Pangeo stack. The main aim of xarray-beam is to provide an alternative to Dask for climate data analysis cases where Dask is not suitable or efficient; it aims to facilitate data transformations and analysis on large-scale multi-dimensional labelled arrays.