Command‑line interface#
The software installs a console entry point, available as
metadata-crawler or mdc, that exposes the following high‑level subcommands:
add – Collect metadata into a temporary catalog.
config – Display general configuration.
solr – Index metadata into, or delete it from, Apache Solr.
mongo – Index metadata into, or delete it from, MongoDB.
walk-intake – Convenience module to traverse and check intake catalogues.
Use --help on any command to see available options. Below are
some examples.
Basic crawling#
To harvest a directory of files into a JSON lines catalog (multiple config files are supported since v2511.0.0):
mdc add \
/tmp/cat.yml \
-c /path/to/drs_config-1.toml \
-c /path/to/drs_config-2.toml \
--catalogue-backend jsonlines \
--threads 4 \
--batch-size 100 \
--data-object /path/to/data
Alternatively, you can provide one or more dataset names defined in your DRS configuration instead of explicit file paths (glob patterns for config files are also supported since v2511.0.0):
metadata-crawler add \
/tmp/catalog.yaml \
-c /path/to/drs_*.toml \
--data-set cmip6-fs obs-fs
Changed in version 2511.0.0: The metadata-crawler add subcommand supports multiple config files
and glob patterns for config files.
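If the set of configuration files should be spelled out explicitly rather than matched by a pattern, the repeated -c flags can be assembled in the shell. The following is a minimal sketch and not part of metadata-crawler: the function name add_with_configs, the MDC variable (an assumed override that lets the binary be replaced by echo for a dry run) and the drs_*.toml naming are illustrative assumptions. Only flags shown in this section are used.

```shell
# Hypothetical helper, not shipped with metadata-crawler.
# Usage: add_with_configs CATALOG CONFIG_DIR [extra "mdc add" options...]
add_with_configs() {
    catalog=$1
    config_dir=$2
    shift 2
    extra=$#                       # number of pass-through options
    for cfg in "$config_dir"/drs_*.toml; do
        [ -e "$cfg" ] || continue  # the pattern matched nothing
        set -- "$@" -c "$cfg"      # one -c flag per config file
    done
    if [ "$#" -eq "$extra" ]; then
        echo "no drs_*.toml files found in $config_dir" >&2
        return 1
    fi
    # Rotate the pass-through options so they follow the -c flags.
    while [ "$extra" -gt 0 ]; do
        first=$1
        shift
        set -- "$@" "$first"
        extra=$((extra - 1))
    done
    # MDC is an assumed override; MDC=echo prints the command instead of running it.
    "${MDC:-mdc}" add "$catalog" "$@"
}
```

Calling the function with MDC=echo, for example MDC=echo add_with_configs /tmp/catalog.yml /path/to/configs --data-set cmip6-fs, prints the assembled command line so it can be checked before a real crawl.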
Indexing#
Once a catalog has been generated you can index it into a backend.
Apache Solr and MongoDB backends are supported out of the box. The
following example indexes the catalog into a Solr server at localhost:8983:
metadata-crawler solr index \
/tmp/catalog.yml \
--server localhost:8983
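Crawling and indexing can be chained so that the index is only touched when the crawl succeeded. The sketch below is one hedged way to do this in a shell function. The name crawl_and_index_solr and the MDC variable (an assumed override; MDC=echo gives a dry run) are hypothetical, while the flags are the ones shown in this document.

```shell
# Hypothetical wrapper, not shipped with metadata-crawler.
# Usage: crawl_and_index_solr CATALOG CONFIG SERVER DATASET...
crawl_and_index_solr() {
    if [ "$#" -lt 4 ]; then
        echo "usage: crawl_and_index_solr CATALOG CONFIG SERVER DATASET..." >&2
        return 2
    fi
    catalog=$1
    config=$2
    server=$3
    shift 3                   # "$@" now holds the dataset names
    mdc_bin=${MDC:-mdc}       # assumed override; MDC=echo prints the commands
    # "&&" ensures indexing is skipped if the crawl exits with an error.
    "$mdc_bin" add "$catalog" -c "$config" --data-set "$@" &&
        "$mdc_bin" solr index "$catalog" --server "$server"
}
```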
For MongoDB, supply the database URL and name:
metadata-crawler mongo index \
/tmp/catalog.yml /tmp/catalog-2.yml \
--url mongodb://localhost:27017 \
--database metadata
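Because mongo index accepts several catalog files at once, it can help to verify that each file exists before any of them is indexed. The following is a small sketch under stated assumptions: index_into_mongo and the MDC variable (an assumed override; MDC=echo gives a dry run) are hypothetical names, and the existence check is an addition of this example, not behaviour of metadata-crawler itself.

```shell
# Hypothetical wrapper, not shipped with metadata-crawler.
# Usage: index_into_mongo URL DATABASE CATALOG...
index_into_mongo() {
    if [ "$#" -lt 3 ]; then
        echo "usage: index_into_mongo URL DATABASE CATALOG..." >&2
        return 2
    fi
    url=$1
    database=$2
    shift 2                   # "$@" now holds the catalog files
    for catalog in "$@"; do
        if [ ! -f "$catalog" ]; then
            echo "missing catalog: $catalog" >&2
            return 1          # stop before anything is indexed
        fi
    done
    # MDC is an assumed override; MDC=echo prints the command instead of running it.
    "${MDC:-mdc}" mongo index "$@" --url "$url" --database "$database"
}
```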
Deleting#
The delete command removes documents from the index using one or
more facet filters. Facet values may contain shell wildcards
(* and ?), which are translated to MongoDB regular expressions
(Apache Solr deletion uses filters internally). For example:
metadata-crawler mongo delete \
--url mongodb://localhost:27017 \
--database metadata \
-f project CMIP6 -f file "*.nc"
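Note that the pattern is quoted ("*.nc") so that the shell passes it on unchanged instead of expanding it against files in the current directory. To give a feel for what a wildcard-to-regex translation looks like, the sketch below applies the conventional mapping (* becomes .*, ? becomes ., other regex metacharacters are escaped, and the result is anchored). It is purely illustrative: wildcard_to_regex is a hypothetical helper, and the translation metadata-crawler performs internally may differ, for example in anchoring.

```shell
# Illustrative only: conventional mapping from shell wildcards to an
# anchored regular expression. Not metadata-crawler's actual code.
wildcard_to_regex() {
    escaped=$(printf '%s' "$1" | sed \
        -e 's/[][\.^$+(){}|]/\\&/g' \
        -e 's/\*/.*/g' \
        -e 's/?/./g')
    # First expression: escape regex metacharacters other than * and ?.
    # Second and third: translate the two wildcards.
    printf '^%s$\n' "$escaped"
}
```

For instance, wildcard_to_regex '*.nc' prints ^.*\.nc$, and a value without wildcards such as CMIP6 becomes ^CMIP6$.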
See metadata-crawler --help for a complete list of options.