Harvest your climate metadata

Overview

Metadata Crawler is a tool for harvesting, normalising, and indexing metadata from climate and earth‑system datasets stored on POSIX file systems, S3/MinIO object stores, or OpenStack Swift. The software is highly configurable: dataset definitions, directory and filename patterns, and metadata extraction are controlled via TOML configuration files. You can use the asynchronous and synchronous Python APIs directly or drive everything through a command‑line interface (CLI).

Installation & Quick Start

Install via pip or conda-forge:

python -m pip install metadata-crawler
conda install -c conda-forge metadata-crawler

After installation, use the CLI immediately (see TL;DR below) or import the modules in your own code.

Too long; didn’t read (TL;DR)

  • Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake

  • Two-stage pipeline: crawl → catalogue then catalogue → index

  • Schema driven: strong types (e.g. string, datetime[2], float[4], string[])

  • DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance

  • Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars

  • Special rules: conditionals and method/function calls (e.g. CMIP6 realm, time aggregation)

  • Index backends: JSONLines, DuckDB, Apache Solr, MongoDB

  • Dataset versions: versions are stored separately, so the data can cover all dataset versions as well as the latest versions only
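To make the path-spec and dataset-version ideas from the list above more concrete, here is a minimal sketch in plain Python. It is not the crawler's implementation: the function names `parse_cmip6_path` and `latest_only`, the `/data` root, and the example paths are assumptions made for illustration. Only the order of the directory levels follows the published CMIP6 data reference syntax.

```python
from pathlib import PurePosixPath

# CMIP6 DRS directory levels, in order, below the dataset root.
CMIP6_DIR_FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
]


def parse_cmip6_path(path: str, root: str) -> dict[str, str]:
    """Map the directory parts of a CMIP6-style path to facet names."""
    rel = PurePosixPath(path).relative_to(root)
    dir_parts = rel.parts[:-1]  # everything except the filename
    if len(dir_parts) != len(CMIP6_DIR_FACETS):
        raise ValueError(f"unexpected directory depth: {path}")
    facets = dict(zip(CMIP6_DIR_FACETS, dir_parts))
    facets["file"] = path
    return facets


def _dataset_key(record: dict[str, str]) -> tuple[str, ...]:
    """Identify a dataset by every directory facet except its version."""
    return tuple(record[name] for name in CMIP6_DIR_FACETS if name != "version")


def latest_only(records: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep all files that belong to the newest version of each dataset.

    Assumes version strings such as 'v20190710' that sort chronologically
    when compared as plain strings.
    """
    newest: dict[tuple[str, ...], str] = {}
    for record in records:
        key = _dataset_key(record)
        newest[key] = max(newest.get(key, ""), record["version"])
    return [r for r in records if r["version"] == newest[_dataset_key(r)]]


if __name__ == "__main__":
    # Hypothetical example path; the root "/data" is an assumption.
    example = (
        "/data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r1i1p1f1/"
        "Amon/tas/gn/v20190710/tas_Amon_example.nc"
    )
    print(parse_cmip6_path(example, "/data"))
```

The full record list corresponds to "all versions", and the output of `latest_only` to "latest versions only". How the tool stores the two views internally is not shown here.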

The CLI uses a custom framework that is inspired by Typer but is not Typer. The main commands are grouped under four verbs: config, crawl, index and delete.

See also mdc --help.

Check the configuration

mdc config --config drs_config.toml --json | jq .drs_settings

Without the --json flag, the merged TOML configuration (predefined settings plus user-defined settings) is printed. You can redirect it into a file, adjust it, and reuse it later.

Tip

Combine the --json flag with jq, a command-line JSON processor, to query individual keys of the configuration and inspect their values.
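If jq is not available, the same kind of key lookup can be done in Python on the JSON output. The sketch below is only an illustration. The helper `lookup` is hypothetical, and the sample document uses placeholder keys (`example_section`, `example_key`), because apart from `drs_settings` the real key names are not given here.

```python
import json
from typing import Any


def lookup(config: dict[str, Any], dotted_key: str) -> Any:
    """Walk a nested mapping along a dotted key such as 'a.b.c'."""
    node: Any = config
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if a level is missing
    return node


# Placeholder document standing in for the output of
# `mdc config --config drs_config.toml --json`; the inner keys are made up.
sample = json.loads(
    '{"drs_settings": {"example_section": {"example_key": 1}}}'
)

print(lookup(sample, "drs_settings"))
```

In practice you would load the real output instead of `sample`, for example by saving it to a file first and reading that file with `json.load`.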

Harvest metadata into a catalogue

mdc crawl cat.yaml -c drs_config.toml --dataset cmip6-fs --dataset obs-fs \
          --threads 4 --batch-size 100

This reads dataset definitions from drs_config.toml and writes the harvested metadata into a temporary catalogue file. You can select one or more datasets by name via --dataset, or pass explicit paths via --data-object. Catalogue formats include gzipped JSONLines and DuckDB.
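Because gzipped JSONLines is a plain format (one JSON document per line, compressed with gzip), such a file can be inspected with the Python standard library alone. The following is a minimal sketch under assumptions: the helper names and the record fields are hypothetical, and the exact file layout that the crawler produces is not described here.

```python
import gzip
import json
from collections.abc import Iterator
from pathlib import Path


def write_jsonlines_gz(path: Path, records: list[dict]) -> None:
    """Write one JSON document per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as stream:
        for record in records:
            stream.write(json.dumps(record) + "\n")


def read_jsonlines_gz(path: Path) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzipped JSONLines file."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            if line.strip():
                yield json.loads(line)
```

Reading lazily with a generator keeps memory use flat even for large catalogues, since only one line is decoded at a time.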

Index catalogue entries

mdc <backend> index cat-1.yaml cat2.yaml

This reads entries from a catalogue and inserts/updates them in the chosen index backend. Supported backends include Solr and MongoDB (see API Reference).
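The insert/update behaviour can be pictured as an "upsert" keyed on a unique field. The sketch below uses an in-memory dictionary in place of a real backend. The function `upsert` and the choice of `file` as the unique key are assumptions for illustration, not the tool's actual logic.

```python
def upsert(
    index: dict[str, dict], entries: list[dict], key: str = "file"
) -> tuple[int, int]:
    """Insert new entries and overwrite existing ones with the same key.

    Returns the number of inserted and updated entries.
    """
    inserted = updated = 0
    for entry in entries:
        if entry[key] in index:
            updated += 1
        else:
            inserted += 1
        index[entry[key]] = entry  # new value replaces any old one
    return inserted, updated
```

Indexing the same catalogue twice in this model leaves the index unchanged on the second run, which is the property that makes re-indexing safe.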

Delete entries from an index

mdc <backend> delete --facets file "/path/to/*.nc"

This deletes all entries that match the given facet/value pairs. Wildcards in the value are supported (e.g., "file *.nc"); quote the pattern so that the shell passes it on unexpanded.
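The effect of a wildcard facet query can be approximated with shell-style matching from the Python standard library. This is only a sketch: the helpers `matches` and `delete_matching` are hypothetical, and the real index backends may interpret wildcards differently from `fnmatch`.

```python
from fnmatch import fnmatchcase


def matches(entry: dict[str, str], facets: dict[str, str]) -> bool:
    """True if every facet pattern matches the entry's value for that facet."""
    return all(
        facet in entry and fnmatchcase(entry[facet], pattern)
        for facet, pattern in facets.items()
    )


def delete_matching(
    index: list[dict[str, str]], facets: dict[str, str]
) -> list[dict[str, str]]:
    """Return the index without the entries that match all facet patterns."""
    return [entry for entry in index if not matches(entry, facets)]


if __name__ == "__main__":
    # Hypothetical index entries.
    index = [{"file": "/path/to/a.nc"}, {"file": "/path/to/b.zarr"}]
    print(delete_matching(index, {"file": "/path/to/*.nc"}))
```

Note that in `fnmatch` the `*` wildcard also matches path separators, so a pattern such as `/path/*.nc` would cover files in subdirectories as well.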

For detailed options and examples, see the usage chapter and API Reference.

See also

Freva

The freva evaluation system.

Freva admin docs

Installation and configuration of the freva services.
