Analysis ready data¶

Local data collections and tools¶

data projects on gadi and related tool

Pre-defined conda environments¶

It is good practice, where possible, to use existing/provided analysis environments in order to avoid generating large numbers of duplicate files. Before installing conda, for example, it’s a good idea to check whether a shared conda installation and environment that serves your needs doesn’t already exist. Some examples of managed analysis environments include:

Cosima cookbook¶

…

STAC collection example?¶

COMMENT we’re mentioning STAC in previous pages but it’s not clear to me where it is used in climate science.

NCI and acs-replica intake catalogues¶

…

Remote resources¶

Thredds catalogues with opendap access¶

Community projects¶

There are a few community based projects that aim to provide stacks of python packages selected for climate or related fields analysis. They often also provides examples of how to use these packages in the forms of notebooks and/or tutorials.

Pangeo ¶: Pangeo is a community of people working collaboratively to develop software and infrastructure to enable Big Data geoscience research. A Pangeo environment is made of up of many different open-source software packages for ocean, atmosphere, land and climate science.
PyAOS ¶: PyAOS is a community project that offers a stack of python libraries used by the Atmosphere and Ocean Science communities.
ProjectPythia ¶: Project Pythia aims to provide a public, web-accessible training resource that will help educate current, and aspiring, earth scientists to more effectively use both the Scientific Python Ecosystem and Cloud Computing to make sense of huge volumes of numerical scientific data
EarthPy ¶: EarthPy is a collection of IPython notebooks with examples of Earth Science related Python code: tutorials, descriptions of the modules, small scripts, or just tricks. They welcome contributions.

Pangeo Forge: an open source framework for extraction, transformation, and loading of scientific data¶

Pangeo Forge is a combination of two things, with the ultimate goal of uploading datasets into the cloud in an analysis-ready, cloud-optimized (ARCO) format:

Pangeo Forge Recipes - an open source Python package, which allows you to create and run extraction, transformation, and loading pipelines (“recipes”) and run them from your own computer
Pangeo Forge Cloud - a cloud-based automation framework which runs these recipes in the cloud from code stored in GitHub

Pangeo Forge is inspired directly by Conda Forge, a community-led collection of recipes for building conda packages (see the Python Tools page for more info on conda). Pangeo Forge seeks to play the same role for datasets.

When to use Pangeo Forge¶

Pangeo Forge is useful if you have access to some data, and would like to work with the data on the cloud. It is optimized for multidimensional array data (e.g. NetCDF, GRIB, Zarr) that can be opened with Xarray. To upload a dataset to the cloud via Pangeo Forge, a user should submit a Pull Request to the Pangeo Forge staged-recipes GitHub repository, following the introductory guide in their documentation.

Working with Big/Challenging Data Collections

Analysis ready data

Contents