Using the Python API#
The Python API exposes high‑level functions to perform crawling and indexing tasks. These functions accept the same parameters as the CLI but give you full control over the event loop and thread pool.
Two styles of APIs are provided:
Synchronous wrappers that block until completion.
Asynchronous coroutines that can be integrated into your own asyncio event loop and combined with other tasks.
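The relationship between the two styles can be pictured with a small stand-alone sketch. It does not import metadata_crawler and makes no claim about how the library implements its wrappers; async_task and task are hypothetical names chosen for illustration. The blocking variant simply drives the coroutine with asyncio.run.

```python
import asyncio


async def async_task(name: str) -> str:
    """Hypothetical coroutine standing in for an asynchronous API call."""
    await asyncio.sleep(0)  # hand control back to the event loop once
    return f"{name}: done"


def task(name: str) -> str:
    """Blocking counterpart: start an event loop, run the coroutine, return its result."""
    return asyncio.run(async_task(name))


result = task("crawl")
print(result)
```

asyncio.run raises a RuntimeError when it is called from a thread that already runs an event loop, so code living inside a loop would await the coroutine form instead.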
Synchronous usage#
The synchronous API functions return when the operation is finished and raise exceptions on error. A typical workflow consists of three steps, the last of which is optional:
Crawling: collect metadata from one or more files or datasets into a temporary catalog (e.g. JSON lines).
Indexing: read entries from the catalog and write them to the configured index backend (e.g. Apache Solr or MongoDB).
Deleting: remove previously indexed entries matching a set of search facets (optional).
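To make the idea of a JSON lines catalog concrete, the following sketch writes one JSON object per line and reads the entries back. It only illustrates the file format: the layout that add actually produces is not described here, and the field names, write_jsonl and read_jsonl are assumptions made up for this example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metadata entries; the real catalogue fields may differ.
records = [
    {"file": "/path/to/data/a.nc", "project": "CMIP6"},
    {"file": "/path/to/data/b.nc", "project": "OBS"},
]


def write_jsonl(path, entries):
    """Write each entry as a single JSON document on its own line."""
    with open(path, "w", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(entry) + "\n")


def read_jsonl(path):
    """Read the entries back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    catalog = Path(tmp) / "catalog.jsonl"
    write_jsonl(catalog, records)
    loaded = read_jsonl(catalog)

print(loaded)
```

Because every line is an independent document, such a file can be appended to while crawling and processed entry by entry while indexing.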
Below is a minimal example that crawls data from a local directory, stores it in a JSON lines catalog, and indexes it to Apache Solr:
from metadata_crawler import add, index, delete

# 1) collect metadata into a catalog
add(
    store="/tmp/catalog.jsonl",
    config_file="/path/to/drs_config.toml",
    data_object=["/path/to/data"],
    catalogue_backend="jsonlines",
    threads=8,
    batch_size=50,
)

# 2) index the catalog into an Apache Solr core named 'latest'
index(
    "solr",
    "/tmp/catalog-1.yml",
    "/tmp/catalog-2.yml",
    batch_size=50,
)

# 3) optionally delete entries from the index
delete(
    "mongo",
    url="mongodb://mongo:secret@localhost:27017",
    database="metadata",
    latest_version="latest",
    facets=[("project", "CMIP6"), ("institute", "MPI-M")],
)
Asynchronous usage#
For applications that already run an event loop, metadata‑crawler
provides async counterparts to the functions above. They are named
async_add, async_index and async_delete. These
coroutines can be awaited directly or scheduled concurrently with
other tasks:
import asyncio

from metadata_crawler import async_add, async_index, async_delete


async def main():
    # crawl metadata from one or more data objects or datasets
    await async_add(
        store="/tmp/catalog.yaml",
        config_file="/path/to/",
        data_set=["cmip6-fs", "obs-fs"],
        threads=8,
        batch_size=50,
    )

    # index into a MongoDB backend named 'latest'
    await async_index(
        "mongo",
        "/tmp/catalog-1.yml",
        "/tmp/catalog-2.yml",
        config_file="/path/to/drs_config.toml",
        url="mongodb://localhost:27017",
        database="metadata",
        threads=8,
        batch_size=50,
    )

    # delete entries matching a wildcard pattern (glob translated to regex)
    await async_delete(
        "solr",
        server="localhost:8983",
        latest_version="latest",
        facets=[("file", "*.nc"), ("project", "OBS")],
    )


asyncio.run(main())
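Scheduling a coroutine concurrently with other tasks might look like the sketch below. So that it runs without a configured backend, it uses hypothetical stand-ins (fake_crawl and heartbeat) rather than the metadata_crawler coroutines; only the asyncio.gather pattern is the point here.

```python
import asyncio


async def fake_crawl(name: str, delay: float) -> str:
    """Hypothetical stand-in for an awaitable such as async_add."""
    await asyncio.sleep(delay)
    return f"crawled {name}"


async def heartbeat(ticks: int) -> int:
    """Unrelated application task that runs alongside the crawl."""
    for _ in range(ticks):
        await asyncio.sleep(0.01)
    return ticks


async def run_all():
    # gather() runs the awaitables concurrently and returns
    # their results in the order the awaitables were passed in.
    return await asyncio.gather(
        fake_crawl("cmip6-fs", 0.02),
        fake_crawl("obs-fs", 0.01),
        heartbeat(3),
    )


results = asyncio.run(run_all())
print(results)
```

The results come back in argument order even though the second stand-in finishes before the first.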
Library Reference#
Metadata Crawler API high-level functions.
- metadata_crawler.index(index_system: str, *catalogue_files: Path | str | List[str] | List[Path], batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, **kwargs: Any) → None
Index metadata in the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
catalogue_files – Path to the file(s) where the metadata was stored.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity level.
log_suffix – Add a suffix to the log file output.
**kwargs – Additional keyword arguments passed on to the index system, such as its connection settings (e.g. server for Solr).
Examples
index(
    "solr",
    "/tmp/catalog-1.yml",
    "/tmp/catalog-2.yml",
    batch_size=50,
    server="localhost:8983",
)
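What a batch_size setting generally means can be sketched as follows: entries are grouped into chunks of at most that many items before being sent on. This is a generic illustration, not the library's internal code, and the helper name batched is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


batches = list(batched(range(7), 3))
print(batches)
```

Larger batches mean fewer round trips to the backend at the cost of more memory per request, which is why the setting can affect performance.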
- metadata_crawler.add(store: Path | str | None = None, config_file: str | Path | Dict[str, Any] | TOMLDocument | None = None, data_object: List[str] | str | None = None, data_set: List[str] | str | None = None, data_store_prefix: str = 'metadata', catalogue_backend: Literal['jsonlines'] = 'jsonlines', batch_size: int = 25000, comp_level: int = 4, storage_options: Dict[str, Any] | None = None, shadow: List[str] | str | None = None, latest_version: str = 'latest', all_versions: str = 'files', n_procs: int | None = None, verbosity: int = 0, log_suffix: str | None = None, password: bool = False, fail_under: int = -1, **kwargs: Any) → None
Harvest metadata from storage systems and add them to an intake catalogue.
- Parameters:
store – Path to the intake catalogue.
config_file – Path to the drs-config file / loaded configuration.
data_object – Instead of defining datasets that are to be crawled, you can crawl data based on their directories. The directories must be root dirs given in the drs-config file. By default all root dirs are crawled.
data_set – Datasets that should be crawled. The datasets need to be defined in the drs-config file. By default all datasets are crawled. Names can contain wildcards such as xces-*.
data_store_prefix – Absolute path or relative path to intake catalogue source.
batch_size – Batch size that is used to collect the metadata. This can affect performance.
comp_level – Compression level used to write the meta data to csv.gz
storage_options – Set additional storage options for adding metadata to the metadata store
shadow – ‘Shadow’ these storage options. This is useful to hide secrets in public data catalogues.
catalogue_backend – Intake catalogue backend
latest_version – Name of the core holding ‘latest’ metadata.
all_versions – Name of the core holding ‘all’ metadata versions.
password – Display a password prompt and set password before beginning.
n_procs – Set the number of parallel processes for collecting.
verbosity – Set the verbosity of the system.
log_suffix – Add a suffix to the log file output.
fail_under – Fail if less than X of the discovered files could be indexed.
**kwargs – Additional keyword arguments.
Examples
add(
    "my-data.yaml",
    "~/data/drs-config.toml",
    data_set=["cmip6", "cordex"],
)
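How a wildcard such as xces-* could select dataset names is sketched below with shell-style matching from the standard library. The exact matching rules used by the crawler are not spelled out here, so treat this as an assumption; select_datasets and the dataset names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_datasets(available, patterns):
    """Return the names that match at least one shell-style pattern, in original order."""
    selected = []
    for name in available:
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            selected.append(name)
    return selected


# Hypothetical dataset names as they might appear in a drs-config file.
available = ["xces-era5", "xces-cmip6", "obs-fs"]
chosen = select_datasets(available, ["xces-*"])
print(chosen)
```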
- metadata_crawler.delete(index_system: str, batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, **kwargs: Any) → None
Delete metadata from the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity of the system.
log_suffix – Add a suffix to the log file output.
**kwargs – Keyword arguments used to delete data from the index.
Examples
delete(
    "solr",
    server="localhost:8983",
    facets=[("project", "CMIP6"), ("institute", "MPI-M")],
)
- async metadata_crawler.async_index(index_system: str, *catalogue_files: Path | str | List[str] | List[Path], batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, **kwargs: Any) → None
Index metadata in the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
catalogue_files – Path to the file(s) where the metadata was stored.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity of the system.
log_suffix – Add a suffix to the log file output.
**kwargs – Additional keyword arguments passed on to the index system, such as its connection settings (e.g. server for Solr).
Examples
await async_index(
    "solr",
    "/tmp/catalog.yaml",
    server="localhost:8983",
    batch_size=1000,
)
- async metadata_crawler.async_delete(index_system: str, batch_size: int = 2500, verbosity: int = 0, log_suffix: str | None = None, **kwargs: Any) → None
Delete metadata from the indexing system.
- Parameters:
index_system – The index server where the metadata is indexed.
batch_size – If the index system supports batch-sizes, the size of the batches.
verbosity – Set the verbosity of the system.
log_suffix – Add a suffix to the log file output.
**kwargs – Keyword arguments used to delete data from the index.
Examples
await async_delete(
    "solr",
    server="localhost:8983",
    latest_version="latest",
    facets=[("file", "*.nc"), ("project", "OBS")],
)
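The facets in this example contain a glob pattern. One way to picture a glob being translated to a regular expression is the standard library's fnmatch.translate, shown below on in-memory entries. This is only a sketch: how the index backends evaluate facets is not documented here, the assumption that all facets must match at once is ours, and glob_to_regex, matches_facets and the sample entries are hypothetical.

```python
import fnmatch
import re


def glob_to_regex(pattern):
    """Compile a shell-style glob into an equivalent regular expression."""
    return re.compile(fnmatch.translate(pattern))


def matches_facets(entry, facets):
    """True if every (key, glob) pair matches the entry's value for that key."""
    return all(
        key in entry and glob_to_regex(glob).match(str(entry[key])) is not None
        for key, glob in facets
    )


# Hypothetical index entries.
entries = [
    {"file": "/obs/t.nc", "project": "OBS"},
    {"file": "/obs/t.grb", "project": "OBS"},
    {"file": "/cmip/t.nc", "project": "CMIP6"},
]
facets = [("file", "*.nc"), ("project", "OBS")]

matched = [entry["file"] for entry in entries if matches_facets(entry, facets)]
print(matched)
```

The pattern produced by fnmatch.translate is anchored at the end of the string, so match only succeeds when the whole value fits the glob.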
- async metadata_crawler.async_add(store: str | Path | Dict[str, Any] | TOMLDocument | None = None, config_file: str | Path | Dict[str, Any] | TOMLDocument | None = None, data_object: List[str] | str | None = None, data_set: List[str] | str | None = None, data_store_prefix: str = 'metadata', batch_size: int = 25000, comp_level: int = 4, storage_options: Dict[str, Any] | None = None, shadow: List[str] | str | None = None, catalogue_backend: Literal['jsonlines'] = 'jsonlines', latest_version: str = 'latest', all_versions: str = 'files', password: bool = False, n_procs: int | None = None, verbosity: int = 0, log_suffix: str | None = None, fail_under: int = -1, **kwargs: Any) → None
Harvest metadata from storage systems and add them to an intake catalogue.
- Parameters:
store – Path to the intake catalogue.
config_file – Path to the drs-config file / loaded configuration.
data_object – Objects (directories or catalogue files) that are processed. Instead of defining datasets that are to be crawled, you can crawl data based on their directories. The directories must be root dirs given in the drs-config file. By default all root dirs are crawled.
data_set – Dataset(s) that should be crawled. The datasets need to be defined in the drs-config file. By default all datasets are crawled. Names can contain wildcards such as xces-*.
data_store_prefix (str) – Absolute path or relative path to intake catalogue source.
batch_size – Batch size that is used to collect the meta data. This can affect performance.
comp_level – Compression level used to write the meta data to csv.gz
storage_options – Set additional storage options for adding metadata to the metadata store
shadow – ‘Shadow’ these storage options. This is useful to hide secrets in public data catalogues.
catalogue_backend – Intake catalogue backend
latest_version – Name of the core holding ‘latest’ metadata.
all_versions – Name of the core holding ‘all’ metadata versions.
password – Display a password prompt before beginning
n_procs – Set the number of parallel processes for collecting.
verbosity – Set the verbosity of the system.
log_suffix – Add a suffix to the log file output.
fail_under – Fail if less than X of the discovered files could be indexed.
**kwargs – Additional keyword arguments.
Examples
await async_add(
    store="my-data.yaml",
    config_file="~/data/drs-config.toml",
    data_set=["cmip6", "cordex"],
)
- metadata_crawler.get_config(config: Path | str | None = None) → ConfigMerger
Get a drs config file merged with the default config.
The method is helpful to inspect all possible configurations and their default values.
- Parameters:
config – Path to a user defined config file that is going to be merged with the default config.
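The idea of merging a user config with defaults can be illustrated with plain dictionaries. The sketch below is not the ConfigMerger class, whose behaviour is not described here; the recursive strategy, the function name merge_config and all keys are assumptions chosen for illustration.

```python
from copy import deepcopy


def merge_config(defaults, user):
    """Return a new dict in which user values override defaults, recursing into nested tables."""
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical default and user settings.
defaults = {"general": {"suffixes": [".nc"], "strict": False}, "name": "default"}
user = {"general": {"strict": True}, "extra": 1}

merged = merge_config(defaults, user)
print(merged)
```

With this strategy, keys the user does not set keep their default values, which matches the stated purpose of inspecting all possible settings together with their defaults.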