.. _add_backends: Custom index backends --------------------- An *index backend* stores the final, translated metadata records. Built‑in backends include a **DuckDB** database (either on disk, in memory or on S3) and **MongoDB** via Motor. You can implement additional index backends to suit your needs. Base classes and helpers ^^^^^^^^^^^^^^^^^^^^^^^^ The ``metadata_stores.py`` module defines two key abstractions: * ``IndexStore`` – An abstract base class representing an index backend. Concrete implementations must implement methods to ``add`` batches of records, ``read`` chunks from an index, and ``delete`` based on facet filters. A convenience ``close`` method cleans up resources. * ``StorageIndex`` – A simple data class grouping together the index name and any configuration needed by the backend. SolrIndex ^^^^^^^^^ ``SolrIndex`` indexes metadata into a Apache Solr. When initialised you specify the solr server and the core names to create (``latest``, ``files``, etc.). The schema is derived from the configuration. The store supports two modes: MongoIndexStore ^^^^^^^^^^^^^^^ ``MongoIndexStore`` stores records in MongoDB collections. Each index name corresponds to a collection. Records are upserted based on the ``file`` facet: if a document with the same ``file`` exists it will be replaced; otherwise it is inserted. Deletion uses ``$regex`` queries for glob patterns and ``$eq`` for exact values. Provide the MongoDB connection URL and database name via the ``url`` and ``database`` parameters. You may specify additional options (e.g. TLS settings) in ``storage_options``. Recipe: Implementing a custom index ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To add a new index backend: 1. **Subclass** ``IndexStore`` and implement the abstract methods ``index`` to add and ``delete`` records. 2. **Register** your implementation under the entry point ``metadata_crawler.index_backends`` so it can be discovered via the ``index_backend`` CLI option. 3. The ``schema`` argument passed to your constructor contains ``SchemaField`` objects that describe the canonical facets (see :doc:`../chapter2-config/index`). Use this information to construct tables or documents with appropriate types. Example skeleton ^^^^^^^^^^^^^^^^ .. code-block:: python import os from typing import Any, Dict, Iterator, List, Optional, Tuple from metadata_crawler.metadata_stores import IndexStore class MySQLIndex(IndexStore): def __post_init__(self): """Any additional attributes can be set in this method.""" self.password = os.getenv("MYSQL_PASSWD") or "" async def index( self, server: Optional[str] = None, user: Optional[str] = None, pw: bool = True ) -> None: """insert or upsert records.""" if pw and not self.password: self.password = getpass("Give DB password: ") with self.db_connection(server, user, self.password) as con: for table in self.index_names: async for chnunk in self.get_metadata(index): con.add(chunk) async def delete( self, facets: Optional[List[Tuple[str, str]]] = None, server: Optional[str] = None, user: Optional[str] = None, pw: bool = True, ) -> None: """remove matching records.""" if pw and not self.password: self.password = getpass("Give DB password: ") with self.db_connection(server, user, self.password) as con: for table in self.index_names: con.delete(**dir(facets)) .. admonition:: pyproject.toml .. code-block:: toml # register in pyproject.toml [project.entry-points."metadata_crawler.index_backends"] mysql = "my_package.my_index:MySQLIndex" Extending the CLI ^^^^^^^^^^^^^^^^^^ The CLI entry point ``metadata-crawler`` registers its commands in ``cli.py``. You can extend the CLI by defining new commands or options and registering them. This registration is inspired by the `Typer `_ library. CLI API ******** ``cli.py`` defines decorators ``@cli_function`` and the ``cli_parameter`` method to annotate functions with help messages and parameter metadata. The actual CLI commands are defined in your :ref:`add_backends` via the ``@cli_function`` decorator. To add a new command: 1. **Decorate** the ``index`` and ``delete`` functions in our :ref:`add_backends` Use the ``@cli_function`` decorator to register it. 2. **Annotate** the function parameters with ``Annotated`` and ``cli_parameter`` to supply CLI options (see ``SolrIndex`` for examples). 3. **Registering** Once decorated the registering will happen automatically. Example: adding a cli for the ``MySQL`` Index ********************************************** The MySQL index backend from above can be turned to a CLI as follows: .. code-block:: python from typing import Optional from typing_extensions import Annotated from .metadata_stores import IndexStore @cli_function(help="Index data in MySQL") def index( self, server: Annotated[str, cli_parameter("--server", help="Server name")], user: Annotate[Optional[str], cli_parameter("--user", help="User name")] = None, db: Annotated[str, cli_parameter("--database", help="Database name")] = "foo", pw: Annotate[ bool, cli_parmeter("--password", "-p", action="store_true", help="Ask for password"), ] = False, ) -> None: """Your index implementation here.""" .. note:: The arguments and keyword arguments of th e``cli_parameter`` method follow the logic of `argparse.ArgumentParser.add_argument `_. When you run ``metadata-crawler mysql --server localhost -p`` the function executes your custom logic. .. automodule:: metadata_crawler.api.cli :exclude-members: Parameter **API Reference:** .. autoclass:: metadata_crawler.api.index.BaseIndex