Using the Python API

The Python API exposes high‑level functions to perform crawling and indexing tasks. These functions accept the same parameters as the CLI but give you full control over the event loop and thread pool.

Two styles of APIs are provided:

  • Synchronous wrappers that block until completion.

  • Asynchronous coroutines that can be integrated into your own asyncio event loop and combined with other tasks.
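One common way to relate the two styles is for a blocking wrapper to drive a coroutine with asyncio.run. The sketch below uses stand-in functions: crawl_async and crawl are hypothetical names that are not part of metadata-crawler, and the sketch does not claim to describe how the library is implemented internally.

```python
import asyncio
from typing import List


async def crawl_async(paths: List[str]) -> List[str]:
    """Hypothetical coroutine standing in for an asynchronous API call."""
    await asyncio.sleep(0)  # yield control to the event loop once
    return [f"crawled:{path}" for path in paths]


def crawl(paths: List[str]) -> List[str]:
    """Hypothetical blocking wrapper: runs the coroutine to completion."""
    return asyncio.run(crawl_async(paths))


if __name__ == "__main__":
    print(crawl(["/path/to/data"]))
```

A wrapper of this kind cannot be called from code that is already running inside an event loop, which is the situation where awaiting the coroutine directly is the better fit.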

Synchronous usage

The synchronous API functions return when the operation is finished and raise exceptions on error. A typical workflow consists of

  1. Crawling: collect metadata from one or more files or datasets into a temporary catalog (e.g. JSON lines).

  2. Indexing: read entries from the catalog and write them to the configured index backend (e.g. Apache Solr or MongoDB).

  3. Deleting: remove previously indexed entries matching a set of search facets (optional).

Below is a minimal example that crawls data from a local directory, stores it in a metadata store, and indexes it to Apache Solr:

from metadata_crawler import add, index, delete

# 1) collect metadata into a catalog
add(
    "/path/to/drs_config.toml",
    "/path/to/second/drs_config.toml",
    store="/tmp/catalog.jsonl",
    data_object=["/path/to/data"],
    backend="jsonlines",
    threads=8,
    batch_size=50,
)

# 2) index the catalog into an Apache Solr core named 'latest'
index(
    "solr",
    "/tmp/catalog-1.yml",
    "/tmp/catalog-2.yml",
    batch_size=50,
)

# 3) optionally delete entries from the index
delete(
    "mongo",
    url="mongodb://mongo:secret@localhost:27017",
    database="metadata",
    latest_version="latest",
    facets=[("project", "CMIP6"), ("institute", "MPI-M")],
)
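Because the synchronous functions raise exceptions on error, a script that chains the three steps may want to stop at the first failure and report which step broke. The following is a minimal sketch of that idea; run_steps is a hypothetical helper, and the callables passed to it are placeholders for calls such as add(...), index(...) and delete(...).

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that raises."""
    completed: List[str] = []
    for name, func in steps:
        try:
            func()
        except Exception as error:
            # Chain the original exception so its traceback is preserved.
            raise RuntimeError(
                f"step {name!r} failed after completing {completed}"
            ) from error
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholders only; real code would wrap the crawler calls here.
    print(run_steps([("crawl", lambda: None), ("index", lambda: None)]))
```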

Changed in version 2511.0.0: The catalogue argument store of add() has been rearranged and is now a keyword argument: add("data.yaml", "drs-config.toml") becomes add("drs-config.toml", store="data.yaml"). If the store keyword is omitted, a catalogue path passed positionally is interpreted as a config file.
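When updating older scripts, the argument reordering can be expressed mechanically. The helper below is a hypothetical illustration (to_keyword_style is not part of metadata-crawler): it takes arguments in the old positional order and returns the positional config files plus a store keyword, assuming the first old-style argument was the catalogue.

```python
from typing import Any, Dict, Tuple


def to_keyword_style(*old_args: str) -> Tuple[Tuple[str, ...], Dict[str, Any]]:
    """Convert old (catalogue, config, ...) arguments to (config, ..., store=catalogue)."""
    if not old_args:
        raise ValueError("expected at least the catalogue path")
    catalogue, *config_files = old_args
    return tuple(config_files), {"store": catalogue}


if __name__ == "__main__":
    args, kwargs = to_keyword_style("data.yaml", "drs-config.toml")
    print(args, kwargs)
```

The returned pair could then be unpacked into a call of the form add(*args, **kwargs).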

Added in version 2605.0.0: Instead of writing to file-based intake catalogues, metadata can be crawled directly into a MongoDB or PostgreSQL database. Database backends store catalogue metadata internally, so no YAML catalogue file is needed. The backend is detected automatically from the URL scheme.

MongoDB as data store:

add(
    "/path/to/drs_config.toml",
    "/path/to/second/drs_config.toml",
    store="username:password@server/databasename",
    data_object=["/path/to/data"],
    backend="mongodb",
    threads=8,
    batch_size=50,
)

PostgreSQL as data store:

add(
    "/path/to/drs_config.toml",
    "/path/to/second/drs_config.toml",
    store="username:password@server/databasename",
    data_object=["/path/to/data"],
    backend="postgresql",
    threads=8,
    batch_size=50,
)
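To illustrate what detecting a backend from a URL scheme can look like, here is a small sketch. The function guess_backend and its mapping rules are assumptions made for illustration; the library's actual detection logic is not shown in this document and may differ. Note that the store strings in the examples above carry no scheme and pass backend explicitly.

```python
def guess_backend(store: str) -> str:
    """Guess a backend name from the scheme of a store URI (illustrative rules only)."""
    scheme, separator, _ = store.partition("://")
    if not separator:
        # No scheme at all: assume a file-based catalogue path.
        return "intake"
    # Drop driver suffixes such as "+srv" so "mongodb+srv" maps to "mongodb".
    base = scheme.lower().split("+", 1)[0]
    if base == "mongodb":
        return "mongodb"
    if base in {"postgresql", "postgres"}:
        return "postgresql"
    return "intake"


if __name__ == "__main__":
    for uri in ("mongodb://localhost:27017/metadata", "/tmp/catalog.yaml"):
        print(uri, "->", guess_backend(uri))
```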

Asynchronous usage

For applications that already run an event loop, metadata‑crawler provides async counterparts to the functions above. They are named async_add, async_index and async_delete. These coroutines can be awaited directly or scheduled concurrently with other tasks:

import asyncio
from metadata_crawler import async_add, async_index, async_delete


async def main():
    # crawl metadata from one or more data objects or datasets
    await async_add(
        "/path/to/",
        store="/tmp/catalog.yaml",
        data_set=["cmip6-fs", "obs-fs"],
        threads=8,
        batch_size=50,
    )

    # index into a MongoDB backend named 'latest'
    await async_index(
        "mongo",
        "/tmp/catalog-1.yml",
        "/tmp/catalog-2.yml",
        config_file="/path/to/drs_config.toml",
        url="mongodb://localhost:27017",
        database="metadata",
        threads=8,
        batch_size=50,
    )

    # delete entries matching a wildcard pattern (glob translated to regex)
    await async_delete(
        "solr",
        server="localhost:8983",
        latest_version="latest",
        facets=[("file", "*.nc"), ("project", "OBS")],
    )


asyncio.run(main())
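The example above awaits each coroutine in turn. To run crawler coroutines concurrently with other work, they can be passed to asyncio.gather. The sketch below demonstrates the scheduling pattern only: fake_crawl is a hypothetical stand-in for a coroutine such as async_add, so the snippet runs without a configured crawler.

```python
import asyncio
from typing import List


async def fake_crawl(name: str) -> str:
    """Hypothetical stand-in for a crawler coroutine."""
    await asyncio.sleep(0.01)
    return f"{name}: done"


async def heartbeat(ticks: int) -> int:
    """An unrelated task that keeps running while the crawls are in flight."""
    for _ in range(ticks):
        await asyncio.sleep(0.001)
    return ticks


async def run_all() -> List[object]:
    # gather() runs the awaitables concurrently and returns their results
    # in the order the awaitables were passed in.
    return await asyncio.gather(
        fake_crawl("cmip6-fs"),
        fake_crawl("obs-fs"),
        heartbeat(3),
    )


results = asyncio.run(run_all())
```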

Changed in version 2511.0.0: The catalogue argument store of async_add() has been rearranged and is now a keyword argument: async_add("data.yaml", "drs-config.toml") becomes async_add("drs-config.toml", store="data.yaml"). If the store keyword is omitted, a catalogue path passed positionally is interpreted as a config file.

Added in version 2605.0.0: Instead of writing to file-based intake catalogues, metadata can be crawled directly into a MongoDB or PostgreSQL database. Database backends store catalogue metadata internally, so no YAML catalogue file is needed. The backend is detected automatically from the URL scheme.

Library Reference

High-level functions of the Metadata Crawler API.

metadata_crawler.index(index_system: str, *metadata_stores: Path | str | List[str] | List[Path], batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, backend: Literal['mongodb', 'postgresql', 'intake'] | None = None, **kwargs: Any) → None

Index metadata in the indexing system.

Parameters:
  • index_system – The index store where the metadata is indexed.

  • metadata_stores – URI of the metadata store(s).

  • batch_size – If the index system supports batch-sizes, the size of the batches.

  • verbosity – Set the verbosity level.

  • log_suffix – Add a suffix to the log file output.

  • backend (str) –

    Backend to be used for the metadata store. If None is given (the default), the backend is guessed from the storage URI.

    Changed in version 2605.0.0: Added "mongodb" and "postgresql" backends.

  • **kwargs – Additional keyword arguments passed on to the index system, such as server in the example below.
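The batch_size parameter controls how many entries are handled per request when the index system supports batching. As a rough sketch of what batching means, the generator below splits any iterable into lists of at most batch_size items. The name batched is a hypothetical helper for illustration, not part of the library.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(entries: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists holding at most batch_size entries each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(entries)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for batch in batched(range(5), 2):
        print(batch)
```

Smaller batches mean more round trips to the index system, while larger batches hold more entries in memory at once.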

Examples

index(
    "solr",
    "/tmp/catalog-1.yml",
    "/tmp/catalog-2.yml",
    batch_size=50,
    server="localhost:8983",
)

metadata_crawler.add(*config_files: Path | str | Dict[str, Any] | TOMLDocument, store: str | Path | None = None, data_object: List[str] | str | None = None, data_set: List[str] | str | None = None, catalogue_backend: Literal['mongodb', 'postgresql', 'intake'] | None = None, backend: Literal['mongodb', 'postgresql', 'intake'] | None = None, data_store_prefix: str | None = None, collection: str | None = None, table: str | None = None, batch_size: int = 25000, comp_level: int = 4, storage_options: Dict[str, Any] | None = None, shadow: List[str] | str | None = None, latest_version: str = 'latest', all_versions: str = 'files', n_procs: int | None = None, no_sweep: bool = False, sweep_grace_period: int = 5, verbosity: int = 0, log_suffix: str | None = None, password: bool = False, fail_under: int = -1, **kwargs: Any) → None

Harvest metadata from storage systems and add them to an intake catalogue.

Changed in version 2511.0.0: The catalogue argument has been rearranged and is now a keyword argument: add("data.yaml", "drs-config.toml") becomes add("drs-config.toml", store="data.yaml"). If the store keyword is omitted, a catalogue path passed positionally is interpreted as a config file.

Parameters:
  • config_files – Path to the drs-config file / loaded configuration.

  • store – Path to the intake catalogue where the collected metadata will be stored.

  • data_object – Instead of defining datasets that are to be crawled, you can crawl data based on their directories. The directories must be root dirs given in the drs-config file. By default all root dirs are crawled.

  • data_set – Datasets that should be crawled. The datasets need to be defined in the drs-config file. By default all datasets are crawled. Names can contain wildcards such as xces-*.

  • data_dir – Instead of defining datasets that are to be crawled, you can crawl data based on their directories. The directories must be root dirs given in the drs-config file. By default all root dirs are crawled.

  • data_store_prefix – Name or path of the metadata store. For the jsonlines backend this is a filesystem path prefix for the .json.gz files (resolved relative to yaml_path unless absolute). For database backends it serves as the default collection or table name. Defaults to "metadata".

  • collection – Alias for data_store_prefix — preferred when using the mongodb backend. Maps directly to the MongoDB collection name.

  • table – Alias for data_store_prefix — preferred when using the sql backend. Maps directly to the SQL table name.

  • backend

    Backend to be used for the metadata store. If None is given (the default), the backend is guessed from the storage URI.

    Changed in version 2605.0.0: Added "mongodb" and "postgresql" backends.

  • catalogue_backend – Alias for backend.

  • no_sweep –

    Skip removal of stale records after crawling. By default, database backends (MongoDB, PostgreSQL) remove entries older than the grace period (set via sweep_grace_period). Use this flag for partial or incremental crawls where not all data sources are being re-discovered.

    Added in version 2605.0.0.

  • sweep_grace_period –

    Number of days to keep records before they become eligible for sweeping. Records older than this grace period are removed after a crawl. Overrides the MDC_GRACE_DAYS environment variable. Defaults to 5 days.

    Added in version 2605.0.0.

  • batch_size – Batch size used to collect the metadata. This can affect performance.

  • comp_level – Compression level used to write the metadata to csv.gz.

  • storage_options – Additional storage options for adding metadata to the metadata store.

  • shadow – ‘Shadow’ these storage options. This is useful to hide secrets in public data catalogues.

  • latest_version – Name of the core holding ‘latest’ metadata.

  • all_versions – Name of the core holding ‘all’ metadata versions.

  • password – Display a password prompt and set the password before beginning.

  • n_procs – Number of parallel processes used for collecting.

  • verbosity – Set the verbosity of the system.

  • log_suffix – Add a suffix to the log file output.

  • fail_under – Fail if fewer than X of the discovered files could be indexed.

  • **kwargs – Additional keyword arguments.
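The sweep behaviour described for no_sweep and sweep_grace_period amounts to comparing a per-record timestamp against a cutoff. The sketch below shows that comparison in isolation. It assumes each record carries a "last seen" timestamp, which is a hypothetical representation chosen for illustration; how the database backends actually track record age is not described in this document.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_records(
    last_seen: Dict[str, datetime], grace_days: int, now: datetime
) -> List[str]:
    """Return the keys whose timestamp is older than the grace period."""
    cutoff = now - timedelta(days=grace_days)
    return sorted(key for key, seen in last_seen.items() if seen < cutoff)


if __name__ == "__main__":
    now = datetime(2026, 5, 20, tzinfo=timezone.utc)
    records = {
        "a.nc": now - timedelta(days=1),  # seen recently, kept
        "b.nc": now - timedelta(days=6),  # older than 5 days, eligible
    }
    print(stale_records(records, grace_days=5, now=now))
```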

Examples

add(
    "~/data/drs-config.toml",
    store="my-data.yaml",
    data_set=["cmip6", "cordex"],
)

metadata_crawler.delete(index_system: str, batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, **kwargs: Any) → None

Delete metadata from the indexing system.

Parameters:
  • index_system – The index server where the metadata is indexed.

  • batch_size – If the index system supports batch-sizes, the size of the batches.

  • verbosity – Set the verbosity of the system.

  • log_suffix – Add a suffix to the log file output.

  • **kwargs – Keyword arguments used to delete data from the index.

Examples

delete(
    "solr",
    server="localhost:8983",
    facets=[("project", "CMIP6"), ("institute", "MPI-M")],
)

metadata_crawler.glance_metadata(store: Path | str, backend: Literal['mongodb', 'postgresql', 'intake'] | None = None, **storage_options: Any) → Dict[str, Any]

Inspect the metadata for a given table.

metadata_crawler.get_config(*, preserve_comments: Literal[True] = True) → ConfigMerger[TOMLDocument]
metadata_crawler.get_config(*, preserve_comments: Literal[False]) → ConfigMerger[Dict[str, Any]]
metadata_crawler.get_config(*, preserve_comments: bool) → ConfigMerger[Any]

Get a drs config file merged with the default config.

The method is helpful to inspect all possible configurations and their default values.

Parameters:
  • config – Path to a user-defined config file that is going to be merged with the default config.

  • preserve_comments – Preserve the comments in a config file.
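Merging a user config with defaults is commonly done as a recursive dictionary merge, where user values win and nested tables are combined key by key. The sketch below shows that general technique on plain dictionaries. deep_merge is a hypothetical helper, and the merge rules of the library's ConfigMerger are not documented here, so this should be read as an illustration of the concept only.

```python
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which user values override defaults, recursing into nested dicts."""
    merged = dict(defaults)  # shallow copy, the inputs are not modified
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    defaults = {"drs": {"root": "/data", "depth": 3}, "verbose": False}
    user = {"drs": {"depth": 5}}
    print(deep_merge(defaults, user))
```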

async metadata_crawler.async_index(index_system: str, *metadata_stores: str | Path | Sequence[str | Path], batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, backend: Literal['mongodb', 'postgresql', 'intake'] | None = None, **kwargs: Any) → None

Index metadata in the indexing system.

Parameters:
  • index_system – The index server where the metadata is indexed.

  • metadata_stores – URI of the metadata store(s).

  • batch_size – If the index system supports batch-sizes, the size of the batches.

  • verbosity – Set the verbosity of the system.

  • log_suffix – Add a suffix to the log file output.

  • backend

    Backend to be used for the metadata store. If None is given (the default), the backend is guessed from the storage URI.

    Changed in version 2605.0.0: Added "mongodb" and "postgresql" backends.

  • **kwargs – Additional keyword arguments passed on to the index system, such as server in the example below.

Example

await async_index(
    "solr",
    "/tmp/catalog.yaml",
    server="localhost:8983",
    batch_size=1000,
)

async metadata_crawler.async_delete(index_system: str, batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, **kwargs: Any) → None

Delete metadata from the indexing system.

Parameters:
  • index_system – The index server where the metadata is indexed.

  • batch_size – If the index system supports batch-sizes, the size of the batches.

  • verbosity – Set the verbosity of the system.

  • log_suffix – Add a suffix to the log file output.

  • **kwargs – Keyword arguments used to delete data from the index.
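Facet values passed for deletion may contain glob patterns such as "*.nc", which the asynchronous usage example describes as being translated to a regular expression. The standard library offers fnmatch.translate for this kind of conversion. The sketch below uses it to show the idea; whether the library relies on the same function is an assumption, and glob_to_regex is a hypothetical helper name.

```python
import fnmatch
import re


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a shell-style glob into an anchored regular expression."""
    return re.compile(fnmatch.translate(pattern))


matcher = glob_to_regex("*.nc")

if __name__ == "__main__":
    for name in ("tas_day.nc", "tas_day.txt"):
        print(name, bool(matcher.match(name)))
```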

Examples

await async_delete(
    "solr",
    server="localhost:8983",
    latest_version="latest",
    facets=[("file", "*.nc"), ("project", "OBS")],
)

async metadata_crawler.async_add(*config_files: Path | str | Dict[str, Any] | TOMLDocument, store: str | Path | Dict[str, Any] | TOMLDocument | None = None, data_object: List[str] | str | None = None, data_set: List[str] | str | None = None, data_store_prefix: str | None = None, collection: str | None = None, table: str | None = None, batch_size: int = 25000, catalogue_backend: Literal['mongodb', 'postgresql', 'intake'] | None = None, backend: Literal['mongodb', 'postgresql', 'intake'] | None = None, comp_level: int = 4, storage_options: Dict[str, Any] | None = None, shadow: List[str] | str | None = None, latest_version: str = 'latest', all_versions: str = 'files', password: bool = False, n_procs: int | None = None, no_sweep: bool = False, sweep_grace_period: int = 5, verbosity: int = 0, log_suffix: str | None = None, fail_under: int = -1, **kwargs: Any) → None

Harvest metadata from storage systems and add them to an intake catalogue.

Changed in version 2511.0.0: The catalogue argument has been rearranged and is now a keyword argument: async_add("data.yaml", "drs-config.toml") becomes async_add("drs-config.toml", store="data.yaml"). If the store keyword is omitted, a catalogue path passed positionally is interpreted as a config file.

Parameters:
  • config_files – Path to the drs-config file / loaded configuration.

  • store – Path to the intake catalogue.

  • data_objects – Instead of defining datasets that are to be crawled, you can crawl data based on their directories. The directories must be root dirs given in the drs-config file. By default all root dirs are crawled.

  • data_object – Objects (directories or catalogue files) that are processed.

  • data_set – Dataset(s) that should be crawled. The datasets need to be defined in the drs-config file. By default all datasets are crawled. Names can contain wildcards such as xces-*.

  • data_store_prefix – Name or path of the metadata store. For the jsonlines backend this is a filesystem path prefix for the .json.gz files (resolved relative to yaml_path unless absolute). For database backends it serves as the default collection or table name. Defaults to "metadata".

  • collection – Alias for data_store_prefix — preferred when using the mongodb backend. Maps directly to the MongoDB collection name.

  • table – Alias for data_store_prefix — preferred when using the sql backend. Maps directly to the SQL table name.

  • backend

    Backend to be used for the metadata store. If None is given (the default), the backend is guessed from the storage URI.

    Changed in version 2605.0.0: Added "mongodb" and "postgresql" backends.

  • catalogue_backend – Alias for backend.

  • no_sweep –

    Skip removal of stale records after crawling. By default, database backends (MongoDB, PostgreSQL) remove entries older than the grace period (set via sweep_grace_period). Use this flag for partial or incremental crawls where not all data sources are being re-discovered.

    Added in version 2605.0.0.

  • sweep_grace_period –

    Number of days to keep records before they become eligible for sweeping. Records older than this grace period are removed after a crawl. Overrides the MDC_GRACE_DAYS environment variable. Defaults to 5 days.

    Added in version 2605.0.0.

  • batch_size – Batch size used to collect the metadata. This can affect performance.

  • comp_level – Compression level used to write the metadata to csv.gz.

  • storage_options – Additional storage options for adding metadata to the metadata store.

  • shadow – ‘Shadow’ these storage options. This is useful to hide secrets in public data catalogues.

  • latest_version – Name of the core holding ‘latest’ metadata.

  • all_versions – Name of the core holding ‘all’ metadata versions.

  • password – Display a password prompt before beginning.

  • n_procs – Number of parallel processes used for collecting.

  • verbosity – Set the verbosity of the system.

  • log_suffix – Add a suffix to the log file output.

  • fail_under – Fail if fewer than X of the discovered files could be indexed.

  • **kwargs – Additional keyword arguments.
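The data_set parameter accepts names with wildcards such as xces-*. As an illustration of how such patterns can be expanded against the dataset names defined in a config file, the sketch below uses fnmatch.filter from the standard library. select_datasets is a hypothetical helper, and the library's own matching rules may differ in detail.

```python
import fnmatch
from typing import Iterable, List


def select_datasets(defined: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Return the defined dataset names matching any pattern, without duplicates."""
    names = list(defined)
    selected: List[str] = []
    for pattern in patterns:
        for name in fnmatch.filter(names, pattern):
            if name not in selected:
                selected.append(name)
    return selected


if __name__ == "__main__":
    defined = ["xces-a", "xces-b", "cmip6", "cordex"]
    print(select_datasets(defined, ["xces-*", "cmip6"]))
```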

Examples

await async_add(
    "~/data/drs-config.toml",
    store="my-data.yaml",
    data_set=["cmip6", "cordex"],
)