Harvest your climate metadata#
Overview#
Metadata Crawler is a tool for harvesting, normalising, and indexing metadata from climate and earth‑system datasets stored on POSIX file systems, S3/MinIO object stores, or OpenStack Swift. The software is highly configurable: dataset definitions, directory and filename patterns, and metadata extraction are controlled via TOML configuration files. You can use the asynchronous and synchronous Python APIs directly or drive everything through a command‑line interface (CLI).
Installation & Quick Start#
Install via pip or conda-forge:
python -m pip install metadata-crawler
conda install -c conda-forge metadata-crawler
After installation, use the CLI immediately (see TL;DR below) or import the modules in your own code.
Too long; didn’t read (TL;DR)#
Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake
Two-stage pipeline: crawl → catalogue then catalogue → index
Schema driven: strong types (e.g. string, datetime[2], float[4], string[])
DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance
Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars
Special rules: conditionals and method/function calls (e.g. CMIP6 realm, time aggregation)
Index backends: JSONLines, DuckDB, Apache Solr, MongoDB
Dataset versions: versions are stored separately, with one set of entries containing all dataset versions and another containing only the latest versions
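To make the idea of a path spec more concrete, the following minimal sketch maps directory parts onto facet names. It is not Metadata Crawler's implementation or API: the function `parse_path`, the facet order in `DIR_PARTS`, and the example path are assumptions chosen for illustration, loosely following the CMIP6 directory layout.

```python
from pathlib import PurePosixPath

# Hypothetical facet order for the directory parts below a dataset root,
# loosely modelled on the CMIP6 directory structure (an assumption).
DIR_PARTS = [
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "table_id", "variable_id",
    "grid_label", "version",
]


def parse_path(path: str, root: str) -> dict:
    """Map the directory parts of ``path`` (relative to ``root``) to facets."""
    rel = PurePosixPath(path).relative_to(root)
    parts = rel.parts[:-1]  # drop the file name, keep the directories
    if len(parts) != len(DIR_PARTS):
        raise ValueError(
            f"expected {len(DIR_PARTS)} directory parts, got {len(parts)}"
        )
    facets = dict(zip(DIR_PARTS, parts))
    facets["file"] = str(path)
    return facets


# Illustrative path only; it does not need to exist on disk.
example = (
    "/data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r1i1p1f1/"
    "Amon/tas/gn/v20190710/"
    "tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc"
)
facets = parse_path(example, root="/data")
print(facets["source_id"], facets["variable_id"], facets["version"])
```

In the tool itself, this mapping is declared in the TOML configuration rather than written by hand; the sketch only shows the kind of result such a declaration produces.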
The CLI is built on a custom framework that is inspired by Typer but does not use it. The main commands are grouped under four verbs:
config, crawl, index and delete.
See also mdc --help.
Check the configuration#
mdc config --config drs_config.toml --json | jq .drs_settings
Without the --json flag, the merged TOML config (predefined config + user-defined
config) is displayed. It can be piped into a file, adjusted, and reused
later.
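As a rough illustration of what merging a predefined and a user-defined configuration can mean, the sketch below deep-merges two dictionaries as they might look after parsing TOML. The merge rule (user values override predefined ones, nested tables are merged recursively) is an assumption, not a description of the tool's actual behaviour, and every key except `drs_settings` is hypothetical.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which ``override`` is merged into ``base``.

    Nested dicts are merged recursively; any other value in ``override``
    replaces the corresponding value in ``base``.
    """
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical predefined (packaged) settings.
predefined = {"drs_settings": {"dialect": {"cmip6": {"sources": ["path"]}}}}

# Hypothetical user settings: an extra dialect and a dataset definition.
user = {
    "drs_settings": {"dialect": {"my-dialect": {"inherits_from": "cmip6"}}},
    "cmip6-fs": {"root_path": "/data/cmip6"},
}

merged = deep_merge(predefined, user)
print(sorted(merged["drs_settings"]["dialect"]))
```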
Tip
Use the --json flag together with the jq command-line JSON parser to inspect
the configuration with key/value queries.
Harvest metadata into a catalogue#
mdc crawl cat.yaml -c drs_config.toml --dataset cmip6-fs --dataset obs-fs \
--threads 4 --batch-size 100
This reads dataset definitions from drs_config.toml and writes harvested
metadata into a temporary catalogue file. You can specify one or
more dataset names via --dataset or explicit paths via --data-object.
Catalogue formats include JSONLines (gzipped) and DuckDB.
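For readers unfamiliar with gzipped JSONLines, the following sketch writes and reads such a file using only the Python standard library. The record fields and the file name are hypothetical; the actual catalogue schema is defined by the tool and its configuration.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Hypothetical records; the real catalogue schema is defined by the tool.
records = [
    {"file": "/data/tas_v1.nc", "variable": "tas", "version": "v1"},
    {"file": "/data/tas_v2.nc", "variable": "tas", "version": "v2"},
]

catalogue = Path(tempfile.mkdtemp()) / "catalogue.jsonl.gz"

# Write one JSON document per line, compressed with gzip.
with gzip.open(catalogue, "wt", encoding="utf-8") as stream:
    for record in records:
        stream.write(json.dumps(record) + "\n")

# Read the records back line by line instead of loading the file at once.
with gzip.open(catalogue, "rt", encoding="utf-8") as stream:
    loaded = [json.loads(line) for line in stream if line.strip()]

print(len(loaded), loaded[-1]["version"])
```

Because every line is an independent JSON document, such a file can be appended to and processed in batches, which suits a crawl that emits entries incrementally.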
Index catalogue entries#
mdc <backend> index cat-1.yaml cat2.yaml
This reads entries from a catalogue and inserts/updates them in the chosen index backend. Supported backends include Solr and MongoDB (see API Reference).
Delete entries from an index#
mdc <backend> delete --facets file "/path/to/*.nc"
Deletes entries matching facet/value pairs.
Wildcards in the value are supported (e.g., "file *.nc").
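The effect of a facet/value pair with a wildcard can be pictured with shell-style pattern matching. The sketch below uses Python's `fnmatch` module on a few hypothetical index entries; the index backends may implement wildcard matching differently, so treat this as an illustration of the idea rather than of the tool's exact semantics.

```python
from fnmatch import fnmatchcase

# Hypothetical index entries, each a mapping of facet names to values.
entries = [
    {"file": "/path/to/a.nc", "variable": "tas"},
    {"file": "/path/to/b.zarr", "variable": "tas"},
    {"file": "/other/c.nc", "variable": "pr"},
]


def select_for_deletion(entries: list, facet: str, pattern: str) -> list:
    """Return the entries whose ``facet`` value matches ``pattern``.

    ``fnmatchcase`` is case-sensitive, and its ``*`` also matches path
    separators, so a pattern can span several directory levels.
    """
    return [
        entry
        for entry in entries
        if fnmatchcase(str(entry.get(facet, "")), pattern)
    ]


matched = select_for_deletion(entries, "file", "/path/to/*.nc")
print([entry["file"] for entry in matched])
```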
For detailed options and examples, see the usage chapter and API Reference.
See also
- Freva
The freva evaluation system.
- Freva admin docs
Installation and configuration of the freva services.