Using the Python API --------------------- .. _python_lib: The Python API exposes high‑level functions to perform crawling and indexing tasks. These functions accept the same parameters as the CLI but give you full control over the event loop and thread pool. Two styles of APIs are provided: * **Synchronous** wrappers that block until completion. * **Asynchronous** coroutines that can be integrated into your own asyncio event loop and combined with other tasks. Synchronous usage ^^^^^^^^^^^^^^^^^^ The synchronous API functions return when the operation is finished and raise exceptions on error. A typical workflow consists of 1. **Crawling**: collect metadata from one or more files or datasets into a temporary catalog (e.g. JSON lines or DuckDB). 2. **Indexing**: read entries from the catalog and write them to the configured index backend (e.g. Apache Solr or MongoDB). 3. **Deleting**: remove previously indexed entries matching a set of search facets (optional). Below is a minimal example that crawls data from a local directory, stores it in a JSON lines catalog, and indexes it to Apache Solr: .. code-block:: python from metadata_crawler import add, index, delete # 1) collect metadata into a catalog add( store="/tmp/catalog.jsonl", config_fle="/path/to/drs_config.toml", data_object=["/path/to/data"], catalogue_backend="jsonlines", # or 'duckdb' threads=8, batch_size=50, ) # 2) index the catalog into a DuckDB index named 'latest' index( "solr", "/tmp/catalog-1.yml", "/tmp/catalog-2.yml", batch_size=50, ) # 3) optionally delete entries from the index delete( "mongo", url="mongodb://mongo:secret@localhost:27017", database="metadata", latest_version="latest", facets=[("project", "CMIP6"), ("institute", "MPI-M")], ) Asynchronous usage ^^^^^^^^^^^^^^^^^^^ For applications that already run an event loop, metadata‑crawler provides async counterparts to the functions above. They are named ``async_add``, ``async_index`` and ``async_delete``. These coroutines can be awaited directly or scheduled concurrently with other tasks: .. code-block:: python import asyncio from metadata_crawler import async_add, async_index, async_delete async def main(): # crawl metadata from one or more data objects or datasets await async_add( store="/tmp/catalog.yaml", config_file="/path/to/", data_set=["cmip6-fs", "obs-fs"], catalogue_backend="duckdb", threads=8, batch_size=50, ) # index into a MongoDB backend named 'latest' await async_index( "mongo" "/tmp/catalog-1.yml", "/tmp/catalog-2.yml", config_file="/path/to/drs_config.toml", url="mongodb://localhost:27017", database="metadata", threads=8, batch_size=50, ) # delete entries matching a wildcard pattern (glob translated to regex) await async_delete( "solr", server="localhost:8983", latest_version="latest", facets=[("file", "*.nc"), ("project", "OBS")], ) asyncio.run(main()) Library Reference ----------------- .. automodule:: metadata_crawler :exclude-members: DataCollector :member-order: bysource