Adding storage backends#

Storage backends provide the abstraction for traversing files and directories, reading datasets and extracting metadata. The crawler ships with built‑in backends for Posix, S3/MinIO, Swift and Intake. You can add your own backend by subclassing PathTemplate defined in storage_backend.py.

Base classes#

The core classes used for storage backends are:

  • TemplateMixin – Provides a storage_template method to render strings using Jinja2. Useful when the backend requires templated URLs or credentials.

  • PathMixin – Provides convenience methods suffix and name to extract parts of a path using anyio.Path.

  • PathTemplate – Abstract base class combining TemplateMixin, LookupMixin and PathMixin. Concrete backends must implement the asynchronous methods is_dir, is_file, iterdir and rglob, and the synchronous methods path and uri. Optional overrides include open_dataset (open an xarray dataset given a URI) and read_attr (read a metadata attribute).

Recipe: Implement a backend#

To implement a new backend:

  1. Subclass PathTemplate and set the class variable _fs_type to a short name identifying your backend.

  2. Implement the abstract methods:
    • is_dir(path) should return True if the given URI is a directory/prefix.

    • is_file(path) should return True if the given URI is a file/object containing data.

    • iterdir(path) should asynchronously iterate over immediate children (directories and files) of the given path.

    • rglob(path, glob_pattern="*") should asynchronously yield Metadata objects for all files matching the glob pattern.

    • path(path) should return a URI with scheme/authority as required by your backend.

    • uri(path) should return the raw URI (including bucket or container names as appropriate).

  3. Register your backend by adding it to the entry point group metadata_crawler.storage_backends in your setup.cfg or pyproject.toml. This allows the fs_type string in the configuration to resolve to your backend class.

    pyproject.toml

    [project.entry-points."metadata_crawler.storage_backends"]
    foo = "my_package.foo_backend:FooBackend"
    

Example skeleton#

Here is a minimal example of a custom storage backend for a hypothetical foo protocol:

from fnmatch import fnmatch

from metadata_crawler.storage_backend import PathTemplate, Metadata


class FooBackend(PathTemplate):
    _fs_type = "foo"

    async def is_dir(self, path: str) -> bool:
        # implement logic to check for a directory
        ...

    async def is_file(self, path: str) -> bool:
        # implement logic to check for a file
        ...

    async def iterdir(self, path: str):
        # yield child paths; the real implementation must be an async generator
        ...

    async def rglob(self, path: str, glob_pattern: str = "*"):
        # recursively yield Metadata objects
        # iterdir is an async iterator, so it is consumed with "async for"
        async for child in self.iterdir(path):
            if await self.is_dir(child):
                async for item in self.rglob(child, glob_pattern):
                    yield item
            elif await self.is_file(child) and fnmatch(child, glob_pattern):
                yield Metadata(path=child)

    def path(self, path: str) -> str:
        return f"foo://{path}"

    def uri(self, path: str) -> str:
        return self.path(path)

Then register the backend in your packaging config (pyproject.toml):

[project.entry-points."metadata_crawler.storage_backends"]
foo = "my_package.foo_backend:FooBackend"

Once registered, you can set fs_type = "foo" in a dataset definition and optionally provide storage_options that will be passed into your backend’s constructor.
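
The recursive pattern used in rglob can be tried out without any real storage system. The sketch below does not import metadata_crawler; it mimics the method names of the skeleton on a hypothetical in-memory tree (the MemoryBackend class, the TREE and FILES tables and the collect helper are assumptions made for illustration), and it yields plain path strings where a real backend would yield Metadata objects.

```python
import asyncio
from fnmatch import fnmatch
from typing import AsyncIterator, Dict, List

# Hypothetical directory tree: each directory maps to its immediate children.
TREE: Dict[str, List[str]] = {
    "/data": ["/data/model", "/data/readme.txt"],
    "/data/model": ["/data/model/tas.nc", "/data/model/pr.nc"],
}
FILES = {"/data/readme.txt", "/data/model/tas.nc", "/data/model/pr.nc"}


class MemoryBackend:
    """Stand-in that mirrors the async method names of a storage backend."""

    async def is_dir(self, path: str) -> bool:
        return path in TREE

    async def is_file(self, path: str) -> bool:
        return path in FILES

    async def iterdir(self, path: str) -> AsyncIterator[str]:
        # An async generator: each immediate child is yielded in turn.
        for child in TREE.get(path, []):
            yield child

    async def rglob(self, path: str, glob_pattern: str = "*") -> AsyncIterator[str]:
        # Descend into directories; yield files whose path matches the pattern.
        async for child in self.iterdir(path):
            if await self.is_dir(child):
                async for item in self.rglob(child, glob_pattern):
                    yield item
            elif await self.is_file(child) and fnmatch(child, glob_pattern):
                yield child


async def collect(pattern: str) -> List[str]:
    """Gather all matches below /data into a list."""
    backend = MemoryBackend()
    return [item async for item in backend.rglob("/data", pattern)]


if __name__ == "__main__":
    print(asyncio.run(collect("*.nc")))
```

Because iterdir contains a yield, calling it returns an async iterator directly, which is why it is consumed with async for rather than awaited.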

API Reference:#

class metadata_crawler.api.storage_backend.PathTemplate(suffixes: List[str] | None = None, **storage_options: Any)[source]#

Bases: ABC, PathMixin, TemplateMixin, LookupMixin

Base class for interacting with different storage systems.

This class defines fundamental methods that should be implemented to retrieve information across different storage systems.

Parameters:
  • suffixes (List[str], default: [".nc", ".girb", ".zarr", ".tar", ".hdf5"]) – A list of available file suffixes.

  • storage_options (Any) – Information needed to interact with the storage system.

_user#

Value of the DRS_STORAGE_USER env variable (defaults to current user)

Type:

str

_pw#

Password passed via the DRS_STORAGE_PASSWD env variable

Type:

str

suffixes#

A list of available file suffixes.

Type:

List[str]

storage_options#

A dict with information needed to interact with the storage system.

Type:

Dict[str, Any]

CMOR_STATIC: Mapping[Tuple[str, ...], Any] = {}#
__init__(suffixes: List[str] | None = None, **storage_options: Any) None[source]#
async close() None[source]#

Close any open sessions.

env_map: Dict[str, str] | None = None#
fs_type(path: str | Path | Path) str[source]#

Define the file system type.

get_fs_and_path(uri: str) Tuple[AbstractFileSystem, str]#

Return (fs, path) suitable for xarray.

Parameters:

uri – Path to the object store / file name

Returns:

The AbstractFileSystem class and the corresponding path to the data store.

Return type:

fsspec.AbstractFileSystem, str

abstractmethod async is_dir(path: str | Path | Path) bool[source]#

Check if a given path is a directory object on the storage system.

Parameters:

path (str, anyio.Path, pathlib.Path) – Path of the object store

Returns:

True if path is dir object, False if otherwise or doesn’t exist

Return type:

bool

abstractmethod async is_file(path: str | Path | Path) bool[source]#

Check if a given path is a file object on the storage system.

Parameters:

path – Path of the object store

Returns:

True if path is file object, False if otherwise or doesn’t exist

Return type:

bool

abstractmethod async iterdir(path: str | Path | Path) AsyncIterator[str][source]#

Get all sub directories from a given path.

Parameters:

path – Path of the object store

Yields:

str – 1st level sub directory

lookup(path: str, attribute: str, *tree: str, **read_kws: Any) Any#

Get metadata from a lookup table.

This function reads metadata from a pre-defined cache table. If the metadata is not present in the cache table, it reads the object store and adds the metadata to the cache table.

Parameters:
  • path – Path to the object store / file name

  • attribute – The attribute that is retrieved from the data. Variable attributes can be addressed with a dot (.). For example, tas.long_name would get the attribute long_name from the variable tas.

  • *tree – A tuple representing nested attributes. Attributes are nested for more efficient lookup. (‘atmos’, ‘1hr’, ‘tas’) will translate into a tree of [‘atmos’][‘1hr’][‘tas’]

  • **read_kws – Keyword arguments passed to open the datasets.
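
How the *tree arguments and a dotted attribute might be interpreted can be illustrated with plain dictionaries. This is a sketch under assumptions, not the library's implementation: the helpers walk_tree and split_attribute and the sample table are hypothetical.

```python
from functools import reduce
from typing import Any, Mapping, Optional, Tuple


def walk_tree(table: Mapping[str, Any], *tree: str) -> Any:
    """Follow nested keys: ("atmos", "1hr", "tas") -> table["atmos"]["1hr"]["tas"]."""
    return reduce(lambda node, key: node[key], tree, table)


def split_attribute(attribute: str) -> Tuple[Optional[str], str]:
    """Split "tas.long_name" into ("tas", "long_name").

    An attribute without a dot is treated as a global attribute.
    """
    variable, dot, name = attribute.partition(".")
    return (variable, name) if dot else (None, attribute)


if __name__ == "__main__":
    table = {"atmos": {"1hr": {"tas": {"units": "K"}}}}
    print(walk_tree(table, "atmos", "1hr", "tas"))
    print(split_attribute("tas.long_name"))
```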

open_dataset(path: str, **read_kws: Any) Dataset | File[source]#

Open a dataset with xarray.

Parameters:
  • path – Path to the object store / file name

  • **read_kws – Keyword arguments passed to open the datasets.

Returns:

The xarray dataset.

Return type:

xarray.Dataset

abstractmethod path(path: str | Path | Path) str[source]#

Get the full path (including any schemes/netlocs).

Parameters:

path – Path of the object store

Returns:

URI of the object store

Return type:

str

prep_template_env() None#

Prepare the jinja2 env.

read_attr(attribute: str, path: str | Path, **read_kws: Any) Any[source]#

Get a metadata attribute from a datastore object.

Parameters:
  • attribute – The attribute that is queried. It can take the form <attribute> or <variable>.<attribute>.

  • path – Path to the object store / file path

  • **read_kws – Keyword arguments for opening the datasets.

Returns:

Metadata from the data.

Return type:

str

render_templates(data: Any, context: Mapping[str, Any], *, max_passes: int = 2) Any#

Recursively render Jinja2 templates found in strings within data.

This function traverses common container types (dict, list, tuple, set), dataclasses, namedtuples, and pathlib.Path objects. Every string encountered is treated as a Jinja2 template and rendered with the provided context. Rendering can be repeated up to max_passes times to resolve templates that produce further templates on the first pass.

Parameters:
  • data – Arbitrary Python data structure. Supported containers are dict (keys and values), list, tuple (including namedtuples), set, dataclasses (fields), and pathlib.Path. Scalars (e.g., int, float, bool, None) are returned unchanged. Strings are rendered as Jinja2 templates.

  • context – Mapping of template variables available to Jinja2 during rendering.

  • max_passes – Maximum number of rendering passes to perform on each string, by default 2. Increase this if templates generate further templates that need resolution.

Returns:

A structure of the same shape with all strings rendered. Container and object types are preserved where feasible (e.g., tuple stays a tuple, namedtuple stays a namedtuple, dataclass remains the same dataclass type).

Return type:

Any

Raises:

jinja2.TemplateError – For Jinja2 template errors encountered during rendering.

Notes

  • Dictionary keys are also rendered if they are strings (or nested containers with strings). If rendering causes key collisions, the last rendered key wins.

  • For dataclasses, all fields are rendered and a new instance is returned using dataclasses.replace. Frozen dataclasses are supported.

  • Namedtuples are detected via the _fields attribute and reconstructed with the same type.

Examples

data = {
    "greeting": "Hello, {{ name }}!",
    "items": ["{{ count }} item(s)", 42],
    "path": {"root": "/home/{{ user }}", "cfg": "{{ root }}/cfg"},
}
ctx = {"name": "Ada", "count": 3, "user": "ada", "root": "/opt/app"}
TemplateMixin().render_templates(data, ctx)
# {'greeting': 'Hello, Ada!',
#   'items': ['3 item(s)', 42],
#    'path': {'root': '/home/ada', 'cfg': '/opt/app/cfg'}}
abstractmethod async rglob(path: str | Path | Path, glob_pattern: str = '*') AsyncIterator[MetadataType][source]#

Search recursively for paths matching a given glob pattern.

Parameters:
  • path – Path of the object store

  • glob_pattern (str) – Pattern that the target files must match

Yields:

MetadataType – Path of the object store that matches the glob pattern.

set_static_from_nested() None#

Flatten the CMOR lookup table.

async suffix(path: str | Path | Path) str#

Get the suffix of a given input path.

Parameters:

path (str, anyio.Path, pathlib.Path) – Path of the object store

Returns:

The file type extension of the path.

Return type:

str

abstractmethod uri(path: str | Path | Path) str[source]#

Get the uri of the object store.

Parameters:

path – Path of the object store

Returns:

URI of the object store

Return type:

str

class metadata_crawler.api.mixin.PathMixin[source]#

Bases: object

Class that defines typical Path operations.

get_fs_and_path(uri: str) Tuple[AbstractFileSystem, str][source]#

Return (fs, path) suitable for xarray.

Parameters:

uri – Path to the object store / file name

Returns:

The AbstractFileSystem class and the corresponding path to the data store.

Return type:

fsspec.AbstractFileSystem, str

async suffix(path: str | Path | Path) str[source]#

Get the suffix of a given input path.

Parameters:

path (str, anyio.Path, pathlib.Path) – Path of the object store

Returns:

The file type extension of the path.

Return type:

str

class metadata_crawler.api.mixin.TemplateMixin[source]#

Bases: object

Apply the Jinja2 templating engine.

env_map: Dict[str, str] | None = None#
prep_template_env() None[source]#

Prepare the jinja2 env.

render_templates(data: Any, context: Mapping[str, Any], *, max_passes: int = 2) Any[source]#

Recursively render Jinja2 templates found in strings within data.

This function traverses common container types (dict, list, tuple, set), dataclasses, namedtuples, and pathlib.Path objects. Every string encountered is treated as a Jinja2 template and rendered with the provided context. Rendering can be repeated up to max_passes times to resolve templates that produce further templates on the first pass.

Parameters:
  • data – Arbitrary Python data structure. Supported containers are dict (keys and values), list, tuple (including namedtuples), set, dataclasses (fields), and pathlib.Path. Scalars (e.g., int, float, bool, None) are returned unchanged. Strings are rendered as Jinja2 templates.

  • context – Mapping of template variables available to Jinja2 during rendering.

  • max_passes – Maximum number of rendering passes to perform on each string, by default 2. Increase this if templates generate further templates that need resolution.

Returns:

A structure of the same shape with all strings rendered. Container and object types are preserved where feasible (e.g., tuple stays a tuple, namedtuple stays a namedtuple, dataclass remains the same dataclass type).

Return type:

Any

Raises:

jinja2.TemplateError – For Jinja2 template errors encountered during rendering.

Notes

  • Dictionary keys are also rendered if they are strings (or nested containers with strings). If rendering causes key collisions, the last rendered key wins.

  • For dataclasses, all fields are rendered and a new instance is returned using dataclasses.replace. Frozen dataclasses are supported.

  • Namedtuples are detected via the _fields attribute and reconstructed with the same type.

Examples

data = {
    "greeting": "Hello, {{ name }}!",
    "items": ["{{ count }} item(s)", 42],
    "path": {"root": "/home/{{ user }}", "cfg": "{{ root }}/cfg"},
}
ctx = {"name": "Ada", "count": 3, "user": "ada", "root": "/opt/app"}
TemplateMixin().render_templates(data, ctx)
# {'greeting': 'Hello, Ada!',
#   'items': ['3 item(s)', 42],
#    'path': {'root': '/home/ada', 'cfg': '/opt/app/cfg'}}
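
The traversal described in the notes can be sketched with the standard library alone. To stay free of third-party imports, the sketch swaps Jinja2 for string.Template, so placeholders are written as $name instead of {{ name }}. The functions render_string and render_any are hypothetical and only mirror the documented behaviour (repeated passes, rendered dictionary keys, namedtuple detection via _fields, dataclasses.replace); they are not the library's code.

```python
import dataclasses
from string import Template
from typing import Any, Mapping


def render_string(text: str, context: Mapping[str, Any], max_passes: int = 2) -> str:
    """Substitute $name placeholders, repeating until stable or max_passes."""
    for _ in range(max_passes):
        rendered = Template(text).safe_substitute(context)
        if rendered == text:
            break  # nothing left to resolve
        text = rendered
    return text


def render_any(data: Any, context: Mapping[str, Any], max_passes: int = 2) -> Any:
    """Walk containers recursively and render every string found."""
    if isinstance(data, str):
        return render_string(data, context, max_passes)
    if isinstance(data, dict):
        # Keys are rendered too; on a collision the last rendered key wins.
        return {
            render_any(key, context, max_passes): render_any(value, context, max_passes)
            for key, value in data.items()
        }
    if isinstance(data, tuple) and hasattr(data, "_fields"):
        # Namedtuple: rebuild with the same type from positional values.
        return type(data)(*(render_any(v, context, max_passes) for v in data))
    if isinstance(data, (list, tuple, set)):
        return type(data)(render_any(v, context, max_passes) for v in data)
    if dataclasses.is_dataclass(data) and not isinstance(data, type):
        changes = {
            field.name: render_any(getattr(data, field.name), context, max_passes)
            for field in dataclasses.fields(data)
        }
        return dataclasses.replace(data, **changes)
    return data  # scalars such as int, float, bool and None pass through
```

The second pass matters when a context value is itself a template: with the context {"alias": "$name", "name": "Ada"}, the string "$alias" becomes "$name" after one pass and "Ada" after two.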
class metadata_crawler.api.mixin.LookupMixin[source]#

Bases: object

Provide a mixin with a process-safe lookup().

The mixin provides:
  • process-wide static table (CMOR) via CMOR_STATIC

  • per-instance disk cache for file-derived attrs

  • in-flight de-duplication for concurrent misses

Subclass must implement:

def read_attr(self, attribute: str, path: str, **read_kws: Any) -> Any
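
In-flight de-duplication means that several concurrent misses for the same key trigger a single read, with the other callers waiting for its result. The sketch below illustrates the idea with threads and an in-memory dict; it is an assumption about the general technique, not the mixin's actual code, and the names DedupCache and demo are hypothetical. The disk cache and any cross-process coordination are left out.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Hashable, List, Tuple


class DedupCache:
    """Cache in which concurrent misses for the same key share one read."""

    def __init__(self, reader: Callable[[Hashable], Any]) -> None:
        self._reader = reader
        self._lock = threading.Lock()
        self._cache: Dict[Hashable, Any] = {}
        self._inflight: Dict[Hashable, Future] = {}

    def lookup(self, key: Hashable) -> Any:
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            future = self._inflight.get(key)
            owner = future is None
            if owner:
                # First caller to miss: register a future others can wait on.
                future = Future()
                self._inflight[key] = future
        if not owner:
            return future.result()  # block until the owning caller finishes
        try:
            value = self._reader(key)
        except BaseException as error:
            with self._lock:
                self._inflight.pop(key, None)
            future.set_exception(error)  # waiters see the same error
            raise
        with self._lock:
            self._cache[key] = value
            self._inflight.pop(key, None)
        future.set_result(value)
        return value


def demo(workers: int = 8) -> Tuple[List[str], int]:
    """Run concurrent lookups of one key and count the underlying reads."""
    calls: List[Hashable] = []

    def slow_reader(key: Hashable) -> str:
        calls.append(key)
        time.sleep(0.05)  # simulate an expensive read of the data store
        return f"value-for-{key}"

    cache = DedupCache(slow_reader)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(cache.lookup, ["tas"] * workers))
    return results, len(calls)
```

Calling demo() issues eight lookups for the same key, yet the slow reader runs only once, because every later caller either finds the in-flight future or the cached value.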

CMOR_STATIC: Mapping[Tuple[str, ...], Any] = {}#
lookup(path: str, attribute: str, *tree: str, **read_kws: Any) Any[source]#

Get metadata from a lookup table.

This function reads metadata from a pre-defined cache table. If the metadata is not present in the cache table, it reads the object store and adds the metadata to the cache table.

Parameters:
  • path – Path to the object store / file name

  • attribute – The attribute that is retrieved from the data. Variable attributes can be addressed with a dot (.). For example, tas.long_name would get the attribute long_name from the variable tas.

  • *tree – A tuple representing nested attributes. Attributes are nested for more efficient lookup. (‘atmos’, ‘1hr’, ‘tas’) will translate into a tree of [‘atmos’][‘1hr’][‘tas’]

  • **read_kws – Keyword arguments passed to open the datasets.

read_attr(attribute: str, path: str, **read_kws: Any) Any[source]#

Get a metadata attribute from a datastore object.

set_static_from_nested() None[source]#

Flatten the CMOR lookup table.
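
Flattening a nested lookup table into tuple keys, the shape suggested by the CMOR_STATIC annotation Mapping[Tuple[str, ...], Any], could look like the sketch below. The function flatten_nested and the sample table are hypothetical; the library's own flattening rules (for example, which level counts as a leaf) are not documented here and may differ.

```python
from collections.abc import Mapping
from typing import Any, Dict, Tuple


def flatten_nested(
    nested: Mapping, prefix: Tuple[str, ...] = ()
) -> Dict[Tuple[str, ...], Any]:
    """Turn {"a": {"b": 1}} into {("a", "b"): 1}."""
    flat: Dict[Tuple[str, ...], Any] = {}
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, Mapping) and value:
            # Descend into non-empty mappings; everything else is a leaf.
            flat.update(flatten_nested(value, path))
        else:
            flat[path] = value
    return flat


if __name__ == "__main__":
    table = {"atmos": {"1hr": {"tas": "air temperature"}}}
    print(flatten_nested(table))
```

With tuple keys, a lookup for ("atmos", "1hr", "tas") is a single dictionary access instead of three nested ones.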