Python

Python is a free, open-source language that is a standard tool in many organisations and industries. It is easy to learn and read, hence its popularity. It also interfaces with many other programs and tools. Compared to many other languages, Python is slow and has high memory usage, which can become a challenge when working with big datasets.

Integrated Development Environments

An integrated development environment (IDE) is a tool that helps you manage your workspace when working on software code. At its most basic, an IDE is an editor that understands and can highlight the syntax of a programming language. IDEs can also integrate with testing packages, version control and other developer tools. Some can be set up to work remotely (JupyterLab, VSCode).

jupyter

Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modelling, data visualization, machine learning, and much more. While Jupyter is a Python package, it supports more than 40 languages including R, Julia and, of course, Python.

jupyterlab

JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. JupyterLab is flexible: configure and arrange the user interface to support a wide range of workflows in data science, scientific computing, and machine learning. JupyterLab is extensible and modular: write plugins that add new components and integrate with existing ones.

spyder

Spyder is a graphical IDE that provides an interface similar to MATLAB.

PyCharm

PyCharm is a Python IDE. Not all versions are free, but a free licence is available for individual academic use for the PyCharm Community edition and the Educational edition. The Educational edition includes Python training modules.

VSCode

VSCode is a source code editor available for Windows, macOS and Linux. You can edit code locally, or use plugins to connect remotely to servers over SSH. It also integrates with Anaconda, letting you run Python programs in different environments. VSCode is designed to be lightweight and adaptable, so it has only basic functionality out of the box and you need to install extensions to add more. Particularly useful extensions for Python are Python, Pylance and Jupyter.

Package and environment management

A package manager is a collection of tools that automates the configuration, installation, upgrade and removal of software packages and handles their dependencies. Some package managers are also environment managers: they allow users to create separate environments and handle potential conflicts between packages belonging to the same environment. An environment manager also keeps track of all the packages and versions installed, so that it is easier to reproduce the same environment consistently.
Here we cover some of the package and environment managers most used for Python. For a full list, check the Python documentation. Some managers are Python specific, such as venv, virtualenv and pipenv. The conda managers can also be used for R, Julia and many other analysis software packages.
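As a rough illustration of the bookkeeping described above, the sketch below lists the packages and versions visible to the running interpreter using only the standard library. The function name `installed_packages` is made up for this example, and the output only mimics the `name==version` style of a requirements file; real managers such as conda or pipenv record considerably more (channels, hashes, build strings).

```python
from importlib import metadata


def installed_packages():
    """Return a sorted list of (name, version) pairs for the current environment."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no recorded name
            packages.add((name, dist.version))
    return sorted(packages, key=lambda item: item[0].lower())


if __name__ == "__main__":
    for name, version in installed_packages():
        # Same "name==version" form used in pip requirements files
        print(f"{name}=={version}")
```

Running this inside two different environments should give two different lists, which is the information an environment manager needs in order to rebuild an environment elsewhere.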

conda

Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.

Anaconda

Anaconda contains pretty much all the Python libraries you would want to get started. It is great for newcomers but takes up a lot of space, so it is not recommended on shared systems with quotas but works well on local laptops. It includes Spyder, a MATLAB-like programming environment (IDE).

miniconda

Miniconda is a lightweight version of Anaconda which by default only includes core libraries, making it good for building specific environments for data analysis. It underpins the conda modules in the hh5 project at NCI.

mamba

A reimplementation of the conda package manager in C++, mamba is a fast, robust, and cross-platform package manager.

virtualenv

Virtualenv allows you to create isolated Python environments. Since Python 3.3, a basic version (the venv module) has been integrated into the standard library.
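The standard-library venv module mentioned above can also be driven from Python itself. The following is a minimal sketch, assuming a throwaway temporary directory and a hypothetical environment name `analysis-env`; in everyday use most people would run `python -m venv` from a terminal instead.

```python
import tempfile
import venv
from pathlib import Path


def create_environment(env_dir):
    """Create an isolated environment in env_dir and return the path of its config file."""
    # with_pip=False keeps this sketch fast and offline; set it to True
    # to bootstrap pip into the new environment.
    builder = venv.EnvBuilder(with_pip=False)
    builder.create(env_dir)
    return Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_environment(Path(tmp) / "analysis-env")
    env_created = cfg.exists()
    # pyvenv.cfg records which base interpreter the environment was built from
    print(cfg.read_text())
```

The environment is deleted together with the temporary directory when the `with` block ends; pass a permanent path to keep it.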

pipenv

Pipenv creates and manages a separate virtualenv for each project. The project-specific requirements are listed in the Pipfile, and a locked version (Pipfile.lock) is created automatically once the packages are installed. Pipenv works well on Windows, which can sometimes be problematic for other tools.

Warning

It is good practice, where possible, to use existing/provided analysis environments in order to avoid generating large numbers of duplicate files. Before installing conda, for example, it’s a good idea to check whether a shared conda installation and environment that serves your needs already exists. Some examples of managed analysis environments include:

Community

There are a few community-based projects that aim to provide stacks of Python packages selected for analysis in climate or related fields. They often also provide examples of how to use these packages in the form of notebooks and/or tutorials.

Pangeo

Pangeo is a community of people working collaboratively to develop software and infrastructure to enable Big Data geoscience research. A Pangeo environment is made up of many different open-source software packages for ocean, atmosphere, land and climate science.

PyAOS

PyAOS is a community project that offers a stack of python libraries used by the Atmosphere and Ocean Science communities.

ProjectPythia

Project Pythia aims to provide a public, web-accessible training resource that will help educate current and aspiring earth scientists to more effectively use both the Scientific Python Ecosystem and Cloud Computing to make sense of huge volumes of numerical scientific data.

EarthPy

EarthPy is a collection of IPython notebooks with examples of Earth Science related Python code: tutorials, descriptions of the modules, small scripts, or just tricks. They welcome contributions.

Analysis

The three main Python packages used in climate science are numpy, pandas and xarray. Many other analysis packages are based on them; xarray is itself based on pandas, which is based on numpy.

When to use numpy vs pandas vs xarray?
As most big climate data is multidimensional and stored as netCDF files, xarray is usually the best tool on which to base your analysis. Still, numpy and pandas can be faster than xarray for certain operations. For example, pandas is faster for groupby operations and for querying data in general.
As numpy is the base of the other two packages, you are often dealing with numpy arrays and operations when using them, even when your analysis is not fully based on numpy.
Xarray provides several ways to convert your arrays to and from pandas dataframes, and the array values are numpy arrays. So these three packages can be, and often are, used interchangeably in the same analysis code.
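The conversions mentioned above can be sketched as follows. The `temperature` array, its coordinates and its values are invented for illustration; only the conversion calls (`.values`, `.to_series()`, `.to_dataframe()` and `xr.DataArray.from_series`) are the point of the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2D field: 3 time steps x 2 latitudes
temperature = xr.DataArray(
    np.arange(6.0).reshape(3, 2),
    dims=("time", "lat"),
    coords={
        "time": pd.date_range("2000-01-01", periods=3),
        "lat": [-10.0, 10.0],
    },
    name="temperature",
)

# The values underlying a DataArray are a numpy array
values = temperature.values

# xarray -> pandas: a Series or DataFrame indexed by a (time, lat) MultiIndex
series = temperature.to_series()
frame = temperature.to_dataframe()

# pandas -> xarray: the MultiIndex levels become dimensions again
roundtrip = xr.DataArray.from_series(series)
```

Moving to pandas flattens the two dimensions into one row per (time, lat) pair, which is convenient for tabular operations but loses the grid layout until the data is converted back.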
The table below provides a schematic of the main differences. More on the relationship between xarray and pandas is available from the xarray FAQ page.

| | Numpy | Pandas | Xarray |
| --- | --- | --- | --- |
| Best use | numerical computations | tabular data analysis | multidim labeled data analysis |
| Data structure | homogeneous array | Series (columns), Dataframes (table) | Labelled data arrays and datasets |
| Data input/output | Read from csv, txt and simple binary files. Needs other libraries to input/output formats like netcdf, hdf5 and Zarr. Can output binary, csv, txt files | Read/write many formats, including hdf5; for netcdf you need other libraries | Best tool for netcdf, including multiple files at once; includes support for OPeNDAP, compression and chunks; can easily convert arrays to pandas and numpy (http://xarray.pydata.org/en/stable/user-guide/io.html) |
| Vectorised operations | Yes | Yes | Yes |
| Dimensions | multi-dimensional | 2D with multi-index support | multi-dimensional |
| Time handling | No | Yes | Yes |
| Speed | Faster < 50K elements, fast indexing | Faster > 500K rows | |
| Based on | C uses multiple functionalities | R provides similar functions | pandas |
| Memory use | more efficient | uses more memory | uses more memory |
| Plotting | No | Yes | Yes |
| Attributes | No | Yes | Yes |
| Labels selection | No | Yes | Yes |
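To make a few rows of the comparison concrete (vectorised operations, label selection and time handling), here is a small sketch using numpy and pandas only. The temperature values and dates are made up for the example.

```python
import numpy as np
import pandas as pd

# numpy: homogeneous array, positional indexing only
daily = np.array([20.0, 22.0, 19.0, 25.0])
daily_kelvin = daily + 273.15   # vectorised: no explicit loop
third_value = daily[2]          # the caller must know that position 2 is 3 January

# pandas: the same values with a time index acting as labels
series = pd.Series(daily, index=pd.date_range("2000-01-01", periods=4, freq="D"))
by_label = series.loc[pd.Timestamp("2000-01-03")]   # label-based selection
two_day_mean = series.resample("2D").mean()         # time-aware aggregation
```

Both selections return the same number, but the pandas version states which date is wanted rather than relying on the position in the array, and the resampling step has no direct equivalent on a bare numpy array.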

As xarray gains popularity, there are more and more xarray-based packages. A list of the major ones is provided in the package documentation. Some extend the formats supported by xarray, others are used for specific analyses.

Iris

Iris is an alternative general analysis package to xarray. It is developed by the UK Met Office as part of their SciTools stack, as are cartopy and cf-units. Iris is specifically designed for weather, climate and ocean data, so it has a lot of relevant functions and examples. Iris requires CF compliance for netCDF files, as it uses these conventions as a data model. Iris can also handle both grib (1 and 2) formats and pp binary files. The latter are specific to the UK Met Office and are what the UM atmospheric model uses as an output format.

SciPy is a collection of mathematical algorithms and convenience functions built on NumPy. SciPy is still a core dependency of Iris, but it is not used as often now for analysis on its own.

Machine learning packages

PyTorch and TensorFlow are very similar in terms of features, but PyTorch is used more in research environments since it has better memory optimisation and allows finer-grained control of the model structure.

aesara

Aesara, previously known as Theano, is used to define, evaluate and optimize mathematical expressions involving multi-dimensional arrays in an efficient manner. It optimizes the utilization of CPU and GPU and is often used in large-scale computationally intensive scientific projects, but it is simple and approachable enough to be used for smaller projects too.

Keras

Keras is a high-level neural networks API for TensorFlow 2. It provides essential abstractions and building blocks for developing and shipping machine learning solutions with high iteration velocity.

PyTorch

PyTorch is an ML framework based on the C library Torch. Its basic data structure is a tensor. It allows you to write highly customized neural network components.

scikit-learn

Scikit-learn is built on top of NumPy and SciPy. It supports most of the classical supervised and unsupervised learning algorithms, and can also be used for data mining and data analysis.
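As a minimal sketch of what supervised and unsupervised learning look like in scikit-learn, the example below fits a linear regression and a k-means clustering to tiny synthetic datasets. The data are invented for illustration; the example assumes scikit-learn and numpy are installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: recover y = 2x + 1 from noise-free synthetic data
x = np.arange(10.0).reshape(-1, 1)   # scikit-learn expects 2D feature arrays
y = 2.0 * x.ravel() + 1.0
regression = LinearRegression().fit(x, y)

# Unsupervised: split two well-separated groups of points
points = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(regression.coef_, regression.intercept_)
print(clusters.labels_)
```

Both estimators follow the same pattern of constructing an object and calling `fit`, which is the convention shared across the library; the inputs are plain numpy arrays.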

TensorFlow

TensorFlow is a framework developed by Google for building and training ML models. Its basic data structure is a tensor. TensorFlow can efficiently execute low-level tensor operations on CPU, GPU and TPU.

As machine learning is very popular, there are plenty of resources available online. The Real Python website, for example, has several machine learning related tutorials.

Other Packages

COMMENT: this section still needs an introduction and a better title. For now it collects any package that is used for analysis but is not “exclusive” to climate.

Also, unless we have something specific to say about these packages, we could just refer to several existing lists, provided we don’t need to create references to them in other parts of the book.

eof

eof is used for EOF (empirical orthogonal function) analysis. NB: eof also has an interface for Iris.

gsw-Python

gsw-Python is an implementation of the Thermodynamic Equation of Seawater 2010 (TEOS-10). It is based primarily on numpy ufunc wrappers of the GSW-C implementation. It aims to replace python-gsw, which is purely Python based.

metpy

MetPy is a collection of tools in Python for reading, visualizing, and performing calculations with weather data.

Py-ART

Py-ART, the Python ARM Radar Toolkit, is a collection of weather and radar utilities.

windspharm

windspharm is a package for performing computations on global wind fields in spherical geometry.

xrft

xrft is used for taking the discrete Fourier transform (DFT) of xarray and dask arrays.