Using the Python API#
The Python API exposes high‑level functions to perform crawling and indexing tasks. These functions accept the same parameters as the CLI but give you full control over the event loop and thread pool.
Two styles of APIs are provided:
Synchronous wrappers that block until completion.
Asynchronous coroutines that can be integrated into your own asyncio event loop and combined with other tasks.
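The relationship between the two styles can be pictured as a blocking wrapper around a coroutine. The sketch below only illustrates that general pattern. `fake_async_add` and `fake_add` are hypothetical stand-ins, and nothing here is taken from the metadata-crawler source.

```python
import asyncio
from typing import Any, Dict


async def fake_async_add(store: str, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for an async API function (not metadata_crawler.async_add)."""
    await asyncio.sleep(0)  # yield control once, as real I/O would
    return {"store": store, **kwargs}


def fake_add(store: str, **kwargs: Any) -> Dict[str, Any]:
    """Blocking wrapper: start an event loop, run the coroutine, return its result."""
    return asyncio.run(fake_async_add(store, **kwargs))


result = fake_add("/tmp/catalog.jsonl", threads=2)
```

`asyncio.run` raises a `RuntimeError` when called from a thread that already runs an event loop, so a wrapper of this kind suits plain scripts, while code that already lives inside a loop would await the coroutine directly.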
Synchronous usage#
The synchronous API functions return when the operation is finished and raise exceptions on error. A typical workflow consists of three steps:
Crawling: collect metadata from one or more files or datasets into a temporary catalog (e.g. JSON lines or DuckDB).
Indexing: read entries from the catalog and write them to the configured index backend (e.g. Apache Solr or MongoDB).
Deleting: remove previously indexed entries matching a set of search facets (optional).
Below is a minimal example that crawls data from a local directory, stores it in a JSON lines catalog, and indexes it to Apache Solr:
from metadata_crawler import add, index, delete

# 1) collect metadata into a catalog
add(
    store="/tmp/catalog.jsonl",
    config_file="/path/to/drs_config.toml",
    data_object=["/path/to/data"],
    catalogue_backend="jsonlines",  # or 'duckdb'
    threads=8,
    batch_size=50,
)

# 2) index the catalog into Apache Solr
index(
    "solr",
    "/tmp/catalog-1.yml",
    "/tmp/catalog-2.yml",
    batch_size=50,
)

# 3) optionally delete entries from the index
delete(
    "mongo",
    url="mongodb://mongo:secret@localhost:27017",
    database="metadata",
    latest_version="latest",
    facets=[("project", "CMIP6"), ("institute", "MPI-M")],
)
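Because the synchronous functions raise on error, a script may want to know which step of the workflow failed. The following is a minimal sketch of one way to do that. `run_steps` and `StepFailed` are hypothetical helpers, not part of metadata-crawler, and the steps here are stand-in callables so the example runs without the library.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


class StepFailed(RuntimeError):
    """Raised when one workflow step fails; the original error is chained."""


def run_steps(steps: Sequence[Step]) -> List[str]:
    """Run the steps in order and return the names of those that completed."""
    done: List[str] = []
    for name, func in steps:
        try:
            func()
        except Exception as error:
            raise StepFailed(f"step {name!r} failed after {done}") from error
        done.append(name)
    return done


# Stand-in steps for illustration. In real use each callable could be, for
# example, functools.partial(add, store=..., config_file=...).
calls: List[str] = []
completed = run_steps(
    [
        ("crawl", lambda: calls.append("crawl")),
        ("index", lambda: calls.append("index")),
    ]
)
```

Chaining with `raise ... from error` keeps the original traceback available through `__cause__`, so the underlying exception is not lost.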
Asynchronous usage#
For applications that already run an event loop, metadata‑crawler
provides async counterparts to the functions above. They are named
async_add, async_index and async_delete. These
coroutines can be awaited directly or scheduled concurrently with
other tasks:
import asyncio
from metadata_crawler import async_add, async_index, async_delete
async def main():
    # crawl metadata from one or more data objects or datasets
    await async_add(
        store="/tmp/catalog.yaml",
        config_file="/path/to/",
        data_set=["cmip6-fs", "obs-fs"],
        catalogue_backend="duckdb",
        threads=8,
        batch_size=50,
    )
    # index into a MongoDB backend named 'latest'
    await async_index(
        "mongo",
        "/tmp/catalog-1.yml",
        "/tmp/catalog-2.yml",
        config_file="/path/to/drs_config.toml",
        url="mongodb://localhost:27017",
        database="metadata",
        threads=8,
        batch_size=50,
    )
    # delete entries matching a wildcard pattern (glob translated to regex)
    await async_delete(
        "solr",
        server="localhost:8983",
        latest_version="latest",
        facets=[("file", "*.nc"), ("project", "OBS")],
    )

asyncio.run(main())
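Scheduling a coroutine concurrently with other tasks can be done with `asyncio.gather`. The sketch below uses hypothetical stand-ins (`fake_crawl`, `heartbeat`) in place of the real coroutines, so it only demonstrates the asyncio mechanics. Whether several metadata-crawler coroutines can safely work on the same catalogue at once is not stated here.

```python
import asyncio
from typing import List, Tuple


async def fake_crawl(name: str, delay: float) -> str:
    """Hypothetical stand-in for a long-running coroutine such as async_add."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def heartbeat(ticks: int, interval: float, log: List[str]) -> None:
    """An unrelated task that keeps running while the crawls are in flight."""
    for i in range(ticks):
        log.append(f"tick {i}")
        await asyncio.sleep(interval)


async def run_all() -> Tuple[List[str], List[str]]:
    log: List[str] = []
    results = await asyncio.gather(
        fake_crawl("cmip6-fs", 0.02),
        fake_crawl("obs-fs", 0.01),
        heartbeat(3, 0.005, log),
    )
    # gather returns results in argument order, regardless of finish order
    return list(results[:2]), log


crawl_results, ticks = asyncio.run(run_all())
```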
Library Reference#
Metadata Crawler API high level functions.
- metadata_crawler.index(index_system: str, *catalogue_files: Path | str | List[str] | List[Path], batch_size: int = 2500, verbosity: int = 0, **kwargs: Any) -> None
Index metadata in the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
catalogue_files – Path to the file(s) where the metadata was stored.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity level.
**kwargs – Keyword arguments used to add data to the index.
Examples
index(
    "solr",
    "/tmp/catalog-1.yml",
    "/tmp/catalog-2.yml",
    batch_size=50,
    server="localhost:8983",
)
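To picture what `batch_size` controls, the sketch below splits a stream of records into fixed-size chunks. It is a generic illustration of batching, not the library's own code. The `batched` helper and the sample records are assumptions made for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical metadata records; a real catalogue entry carries more fields.
records = [{"file": f"/data/file_{i}.nc"} for i in range(7)]
sizes = [len(batch) for batch in batched(records, batch_size=3)]
```

Larger batches mean fewer round trips to the backend at the cost of more memory per request, which is why the parameter can affect performance.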
- metadata_crawler.add(store: Path | str | None = None, config_file: str | Path | Dict[str, Any] | TOMLDocument | None = None, data_object: List[str] | None = None, data_set: List[str] | None = None, data_store_prefix: str = 'metadata', catalogue_backend: str = 'duckdb', batch_size: int = 2500, comp_level: int = 4, storage_options: Dict[str, Any] | None = None, latest_version: str = 'latest', all_versions: str = 'files', threads: int | None = None, verbosity: int = 0, password: bool = False) -> None
Harvest metadata from storage systems and add them to an intake catalogue.
- Parameters:
store – Path to the intake catalogue.
config_file – Path to the drs-config file / loaded configuration.
data_object – Instead of defining datasets to be crawled, you can crawl data based on their directories. The directories must be root dirs given in the drs-config file. By default all root dirs are crawled.
data_set – Datasets that should be crawled. The datasets need to be defined in the drs-config file. By default all datasets are crawled.
data_store_prefix – Absolute path or relative path to intake catalogue source
batch_size – Batch size that is used to collect the metadata. This can affect performance.
comp_level – Compression level used to write the meta data to csv.gz
storage_options – Set additional storage options for adding metadata to the metadata store
catalogue_backend – Intake catalogue backend
latest_version – Name of the core holding ‘latest’ metadata.
all_versions – Name of the core holding ‘all’ metadata versions.
password – Display a password prompt and set password before beginning.
threads – Set the number of threads for collecting.
verbosity – Set the verbosity of the system.
Examples
add(
    "my-data.yaml",
    "~/data/drs-config.toml",
    data_set=["cmip6", "cordex"],
)
- metadata_crawler.delete(index_system: str, batch_size: int = 2500, verbosity: int = 0, **kwargs: Any) -> None
Delete metadata from the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity of the system.
**kwargs – Keyword arguments used to delete data from the index.
Examples
delete(
    "solr",
    server="localhost:8983",
    facets=[("project", "CMIP6"), ("institute", "MPI-M")],
)
- metadata_crawler.get_config(config: Path | str | None = None) -> ConfigMerger
Get a drs config file merged with the default config.
The method is helpful to inspect all possible configurations and their default values.
- Parameters:
config – Path to a user defined config file that is going to be merged with the default config.
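The idea of merging a user config with defaults can be sketched as a recursive dictionary merge. This is only an illustration of the concept. How `ConfigMerger` actually combines the two files is not described here, and `merge_config` as well as all keys below are hypothetical.

```python
from typing import Any, Dict


def merge_config(default: Dict[str, Any], user: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which user values override defaults, table by table."""
    merged = dict(default)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides hold a table: merge them instead of replacing
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical keys, chosen only for illustration.
defaults = {"drs_settings": {"root_path": "/data", "suffixes": [".nc"]}}
user = {"drs_settings": {"root_path": "/work/data"}, "extra": {"enabled": True}}
merged = merge_config(defaults, user)
```

Settings the user does not mention keep their default values, which is what makes the merged result useful for inspecting every available option.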
- async metadata_crawler.async_index(index_system: str, *catalogue_files: Path | str | List[str] | List[Path], batch_size: int = 2500, verbosity: int = 0, **kwargs: Any) -> None
Index metadata in the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
catalogue_files – Path to the file(s) where the metadata was stored.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity of the system.
**kwargs – Keyword arguments used to add data to the index.
Example
await async_index(
    "solr",
    "/tmp/catalog.yaml",
    server="localhost:8983",
    batch_size=1000,
)
- async metadata_crawler.async_delete(index_system: str, batch_size: int = 2500, verbosity: int = 0, **kwargs: Any) -> None
Delete metadata from the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity of the system.
**kwargs – Keyword arguments used to delete data from the index.
Examples
await async_delete(
    "solr",
    server="localhost:8983",
    latest_version="latest",
    facets=[("file", "*.nc"), ("project", "OBS")],
)
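Facet values such as `"*.nc"` are glob patterns that get translated to regular expressions. As a rough illustration of what such a translation does, the sketch below uses the standard library's `fnmatch.translate`. This is an assumption made for the example. The exact translation used by metadata-crawler or the index backend may differ, and the file names are made up.

```python
import fnmatch
import re

# Turn the glob into a compiled regular expression.
pattern = re.compile(fnmatch.translate("*.nc"))

# Hypothetical file names to test the pattern against.
names = ["tas_day_2000.nc", "tas_day_2000.nc.bak", "readme.txt"]
matches = [name for name in names if pattern.match(name)]
```

The translated expression is anchored at the end of the string, so only names that end in `.nc` are kept.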
- async metadata_crawler.async_add(store: str | Path | Dict[str, Any] | TOMLDocument | None = None, config_file: str | Path | Dict[str, Any] | TOMLDocument | None = None, data_object: List[str] | str | None = None, data_set: List[str] | str | None = None, data_store_prefix: str = 'metadata', batch_size: int = 2500, comp_level: int = 4, storage_options: Dict[str, Any] | None = None, catalogue_backend: str = 'duckdb', latest_version: str = 'latest', all_versions: str = 'files', password: bool = False, threads: int | None = None, verbosity: int = 0) -> None
Harvest metadata from storage systems and add them to an intake catalogue.
- Parameters:
store – Path to the intake catalogue.
config_file – Path to the drs-config file / loaded configuration.
data_object – Objects (directories or catalogue files) that are processed. Instead of defining datasets to be crawled, you can crawl data based on their directories. The directories must be root dirs given in the drs-config file. By default all root dirs are crawled.
data_set – Dataset(s) that should be crawled. The datasets need to be defined in the drs-config file. By default all datasets are crawled.
data_store_prefix (str) – Absolute path or relative path to intake catalogue source
batch_size – Batch size that is used to collect the meta data. This can affect performance.
comp_level – Compression level used to write the meta data to csv.gz
storage_options – Set additional storage options for adding metadata to the metadata store
catalogue_backend – Intake catalogue backend
latest_version – Name of the core holding ‘latest’ metadata.
all_versions – Name of the core holding ‘all’ metadata versions.
password – Display a password prompt before beginning
threads – Set the number of threads for collecting.
verbosity – Set the verbosity of the system.
Examples
await async_add(
    store="my-data.yaml",
    config_file="~/data/drs-config.toml",
    data_set=["cmip6", "cordex"],
)