Adding storage backends#
Storage backends provide the abstraction for traversing files and
directories, reading datasets and extracting metadata. The crawler
ships with built‑in backends for Posix, S3/MinIO, Swift and Intake.
You can add your own backend by subclassing
PathTemplate defined in storage_backend.py.
Base classes#
The core classes used for storage backends are:
- TemplateMixin – Provides a storage_template method to render strings using Jinja2. Useful when the backend requires templated URLs or credentials.
- PathMixin – Provides convenience methods suffix and name to extract parts of a path using anyio.Path.
- PathTemplate – Abstract base class combining TemplateMixin, LookupMixin and PathMixin. Concrete backends must implement asynchronous methods is_dir, is_file, iterdir, rglob and synchronous methods path and uri. Optional overrides include open_dataset (open an xarray dataset given a URI) and read_attr (read a metadata attribute).
Recipe: Implement a backend#
To implement a new backend:
- Subclass PathTemplate and set the class variable _fs_type to a short name identifying your backend.
- Implement the abstract methods:
  - is_dir(path) should return True if the given URI is a directory/prefix.
  - is_file(path) should return True if the given URI is a file/object containing data.
  - iterdir(path) should asynchronously iterate over immediate children (directories and files) of the given path.
  - rglob(path, glob_pattern="*") should asynchronously yield Metadata objects for all files matching the glob pattern.
  - path(path) should return a URI with scheme/authority as required by your backend.
  - uri(path) should return the raw URI (including bucket or container names as appropriate).
- Register your backend by adding it to the entry point group metadata_crawler.storage_backends in your setup.cfg or pyproject.toml. This allows the fs_type string in the configuration to resolve to your backend class.
pyproject.toml
[project.entry-points."metadata_crawler.storage_backends"]
foo = "my_package.foo_backend:FooBackend"
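How an fs_type string could be resolved to a backend class through this entry point group can be sketched with the standard library alone. This only illustrates the entry point mechanism and is not the crawler's actual loading code; the helper name find_backend is hypothetical.

```python
from importlib.metadata import entry_points
from typing import Any, Optional

GROUP = "metadata_crawler.storage_backends"


def find_backend(fs_type: str, group: str = GROUP) -> Optional[Any]:
    """Return the object registered under ``fs_type``, or None if absent.

    Hypothetical helper: it only shows how entry points are looked up.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose a selectable collection.
        candidates = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        candidates = eps.get(group, [])
    for ep in candidates:
        if ep.name == fs_type:
            # load() imports "module:attribute" and returns the attribute.
            return ep.load()
    return None


# With the pyproject.toml entry above installed, find_backend("foo")
# would import my_package.foo_backend and return FooBackend.
print(find_backend("no-such-backend"))
```

An unknown name yields None here; a real loader would more likely raise an error that lists the names it does know.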
Example skeleton#
Here is a minimal example of a custom storage backend for a
hypothetical foo protocol:
from fnmatch import fnmatch

from metadata_crawler.api.storage_backend import Metadata, PathTemplate


class FooBackend(PathTemplate):
    """Storage backend for the hypothetical foo protocol."""

    _fs_type = "foo"

    async def is_dir(self, path: str) -> bool:
        # implement logic to check for a directory
        ...

    async def is_file(self, path: str) -> bool:
        # implement logic to check for a file
        ...

    async def iterdir(self, path: str):
        # yield the immediate children of path (write this as an async generator)
        ...

    async def rglob(self, path: str, glob_pattern: str = "*"):
        # recursively yield Metadata objects
        # iterdir is an async iterator, so consume it with "async for"
        async for child in self.iterdir(path):
            if await self.is_dir(child):
                async for item in self.rglob(child, glob_pattern):
                    yield item
            elif await self.is_file(child) and fnmatch(child, glob_pattern):
                yield Metadata(path=child)

    def path(self, path: str) -> str:
        return f"foo://{path}"

    def uri(self, path: str) -> str:
        return self.path(path)
Then register the backend in your packaging configuration (pyproject.toml):
[project.entry-points."metadata_crawler.storage_backends"]
foo = "my_package.foo_backend:FooBackend"
Once registered, you can set fs_type = "foo" in a dataset
definition and optionally provide storage_options that will be
passed into your backend’s constructor.
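The recursive traversal used by rglob can be tried without the crawler installed. The sketch below replaces the storage system with a hypothetical in-memory dictionary (TREE) and yields plain path strings instead of Metadata objects. It shows only the async generator pattern, in which iterdir is consumed with async for rather than await.

```python
import asyncio
from fnmatch import fnmatch
from typing import AsyncIterator, Dict, List

# Hypothetical in-memory "storage": directory path -> list of child paths.
TREE: Dict[str, List[str]] = {
    "/data": ["/data/model", "/data/readme.txt"],
    "/data/model": ["/data/model/tas.nc", "/data/model/pr.nc"],
}


async def is_dir(path: str) -> bool:
    return path in TREE


async def is_file(path: str) -> bool:
    listed = any(path in children for children in TREE.values())
    return listed and path not in TREE


async def iterdir(path: str) -> AsyncIterator[str]:
    # An async generator: callers iterate it with "async for".
    for child in TREE.get(path, []):
        yield child


async def rglob(path: str, glob_pattern: str = "*") -> AsyncIterator[str]:
    async for child in iterdir(path):
        if await is_dir(child):
            # Descend into sub directories and pass their matches through.
            async for item in rglob(child, glob_pattern):
                yield item
        elif await is_file(child) and fnmatch(child, glob_pattern):
            yield child


async def collect(path: str, glob_pattern: str) -> List[str]:
    return [item async for item in rglob(path, glob_pattern)]


matches = asyncio.run(collect("/data", "*.nc"))
print(matches)  # ['/data/model/tas.nc', '/data/model/pr.nc']
```

Note that fnmatch lets * match path separators too, so "*.nc" matches files at any depth. A backend that needs stricter glob semantics would have to match on the file name only.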
API Reference:#
- class metadata_crawler.api.storage_backend.PathTemplate(suffixes: List[str] | None = None, **storage_options: Any)[source]#
Bases: ABC, PathMixin, TemplateMixin, LookupMixin
Base class for interacting with different storage systems.
This class defines fundamental methods that should be implemented to retrieve information across different storage systems.
- Parameters:
suffixes (List[str], default: [".nc", ".girb", ".zarr", ".tar", ".hdf5"]) – A list of available file suffixes.
storage_options (Any) – Information needed to interact with the storage system.
- _user#
Value of the DRS_STORAGE_USER env variable (defaults to current user)
- Type:
str
- _pw#
A password passed by the DRS_STORAGE_PASSWD env variable
- Type:
str
- suffixes#
A list of available file suffixes.
- Type:
List[str]
- storage_options#
A dict with information needed to interact with the storage system.
- Type:
Dict[str, Any]
- CMOR_STATIC: Mapping[Tuple[str, ...], Any] = {}#
- env_map: Dict[str, str] | None = None#
- get_fs_and_path(uri: str) Tuple[AbstractFileSystem, str]#
Return (fs, path) suitable for xarray.
- Parameters:
uri – Path to the object store / file name
- Returns:
The AbstractFileSystem class and the corresponding path to the data store.
- Return type:
fsspec.AbstractFileSystem, str
- abstractmethod async is_dir(path: str | Path | Path) bool[source]#
Check if a given path is a directory object on the storage system.
- Parameters:
path (str, anyio.Path, pathlib.Path) – Path of the object store
- Returns:
True if path is dir object, False if otherwise or doesn’t exist
- Return type:
bool
- abstractmethod async is_file(path: str | Path | Path) bool[source]#
Check if a given path is a file object on the storage system.
- Parameters:
path – Path of the object store
- Returns:
True if path is file object, False if otherwise or doesn’t exist
- Return type:
bool
- abstractmethod async iterdir(path: str | Path | Path) AsyncIterator[str][source]#
Get all sub directories from a given path.
- Parameters:
path – Path of the object store
- Yields:
str – 1st level sub directory
- lookup(path: str, attribute: str, *tree: str, **read_kws: Any) Any#
Get metadata from a lookup table.
This function reads metadata from a pre-defined cache table. If the metadata is not present in the cache table, it reads the object store and adds the metadata to the cache table.
- Parameters:
path – Path to the object store / file name
attribute – The attribute that is retrieved from the data. Variable attributes can be defined by a dot (.). For example: tas.long_name would get attribute long_name from variable tas.
*tree – A tuple representing nested attributes. Attributes are nested for more efficient lookup. (‘atmos’, ‘1hr’, ‘tas’) will translate into a tree of [‘atmos’][‘1hr’][‘tas’]
**read_kws – Keyword arguments passed to open the datasets.
- open_dataset(path: str, **read_kws: Any) Dataset | File[source]#
Open a dataset with xarray.
- Parameters:
path – Path to the object store / file name
**read_kws – Keyword arguments passed to open the datasets.
- Returns:
The xarray dataset.
- Return type:
xarray.Dataset
- abstractmethod path(path: str | Path | Path) str[source]#
Get the full path (including any schemas/netlocs).
- Parameters:
path – Path of the object store
- Returns:
URI of the object store
- Return type:
str
- prep_template_env() None#
Prepare the jinja2 env.
- read_attr(attribute: str, path: str | Path, **read_kws: Any) Any[source]#
Get a metadata attribute from a datastore object.
- Parameters:
attribute – The attribute that is queried, in the form <attribute> or <variable>.<attribute>
path – Path to the object store / file path
read_kws – Keyword arguments for opening the datasets.
- Returns:
Metadata from the data.
- Return type:
str
- render_templates(data: Any, context: Mapping[str, Any], *, max_passes: int = 2) Any#
Recursively render Jinja2 templates found in strings within data.
This function traverses common container types (dict, list, tuple, set), dataclasses, namedtuples, and pathlib.Path objects. Every string encountered is treated as a Jinja2 template and rendered with the provided context. Rendering can be repeated up to max_passes times to resolve templates that produce further templates on the first pass.
- Parameters:
data – Arbitrary Python data structure. Supported containers are dict (keys and values), list, tuple (including namedtuples), set, dataclasses (fields), and pathlib.Path. Scalars (e.g., int, float, bool, None) are returned unchanged. Strings are rendered as Jinja2 templates.
context – Mapping of template variables available to Jinja2 during rendering.
max_passes – Maximum number of rendering passes to perform on each string, by default 2. Increase this if templates generate further templates that need resolution.
- Returns:
A structure of the same shape with all strings rendered. Container and object types are preserved where feasible (e.g., tuple stays a tuple, namedtuple stays a namedtuple, dataclass remains the same dataclass type).
- Return type:
Any
- Raises:
jinja2.TemplateError – For other Jinja2 template errors encountered during rendering.
Notes
Dictionary keys are also rendered if they are strings (or nested containers with strings). If rendering causes key collisions, the last rendered key wins.
For dataclasses, all fields are rendered and a new instance is returned using dataclasses.replace. Frozen dataclasses are supported.
Namedtuples are detected via the _fields attribute and reconstructed with the same type.
Examples
data = {
    "greeting": "Hello, {{ name }}!",
    "items": ["{{ count }} item(s)", 42],
    "path": {"root": "/home/{{ user }}", "cfg": "{{ root }}/cfg"},
}
ctx = {"name": "Ada", "count": 3, "user": "ada", "root": "/opt/app"}
TemplateMixin().render_templates(data, ctx)
# {'greeting': 'Hello, Ada!',
#  'items': ['3 item(s)', 42],
#  'path': {'root': '/home/ada', 'cfg': '/opt/app/cfg'}}
- abstractmethod async rglob(path: str | Path | Path, glob_pattern: str = '*') AsyncIterator[MetadataType][source]#
Search recursively for paths matching a given glob pattern.
- Parameters:
path – Path of the object store
glob_pattern (str) – Pattern that the target files must match
- Yields:
MetadataType – Path of the object store that matches the glob pattern.
- set_static_from_nested() None#
Flatten the CMOR lookup table.
- async suffix(path: str | Path | Path) str#
Get the suffix of a given input path.
- Parameters:
path (str, anyio.Path, pathlib.Path) – Path of the object store
- Returns:
The file type extension of the path.
- Return type:
str
- class metadata_crawler.api.mixin.PathMixin[source]#
Bases: object
Class that defines typical Path operations.
- class metadata_crawler.api.mixin.TemplateMixin[source]#
Bases: object
Apply the Jinja2 templating engine.
- env_map: Dict[str, str] | None = None#
- render_templates(data: Any, context: Mapping[str, Any], *, max_passes: int = 2) Any[source]#
Recursively render Jinja2 templates found in strings within data.
This function traverses common container types (dict, list, tuple, set), dataclasses, namedtuples, and pathlib.Path objects. Every string encountered is treated as a Jinja2 template and rendered with the provided context. Rendering can be repeated up to max_passes times to resolve templates that produce further templates on the first pass.
- Parameters:
data – Arbitrary Python data structure. Supported containers are dict (keys and values), list, tuple (including namedtuples), set, dataclasses (fields), and pathlib.Path. Scalars (e.g., int, float, bool, None) are returned unchanged. Strings are rendered as Jinja2 templates.
context – Mapping of template variables available to Jinja2 during rendering.
max_passes – Maximum number of rendering passes to perform on each string, by default 2. Increase this if templates generate further templates that need resolution.
- Returns:
A structure of the same shape with all strings rendered. Container and object types are preserved where feasible (e.g., tuple stays a tuple, namedtuple stays a namedtuple, dataclass remains the same dataclass type).
- Return type:
Any
- Raises:
jinja2.TemplateError – For other Jinja2 template errors encountered during rendering.
Notes
Dictionary keys are also rendered if they are strings (or nested containers with strings). If rendering causes key collisions, the last rendered key wins.
For dataclasses, all fields are rendered and a new instance is returned using dataclasses.replace. Frozen dataclasses are supported.
Namedtuples are detected via the _fields attribute and reconstructed with the same type.
Examples
data = {
    "greeting": "Hello, {{ name }}!",
    "items": ["{{ count }} item(s)", 42],
    "path": {"root": "/home/{{ user }}", "cfg": "{{ root }}/cfg"},
}
ctx = {"name": "Ada", "count": 3, "user": "ada", "root": "/opt/app"}
TemplateMixin().render_templates(data, ctx)
# {'greeting': 'Hello, Ada!',
#  'items': ['3 item(s)', 42],
#  'path': {'root': '/home/ada', 'cfg': '/opt/app/cfg'}}
- class metadata_crawler.api.mixin.LookupMixin[source]#
Bases: object
Provide a mixin with a process-safe lookup().
- The mixin does:
process-wide static table (CMOR) via CMOR_STATIC
per-instance disk cache for file-derived attrs
in-flight de-duplication for concurrent misses
- Subclass must implement:
def read_attr(self, attribute: str, path: str, **read_kws: Any) -> Any
- CMOR_STATIC: Mapping[Tuple[str, ...], Any] = {}#
- lookup(path: str, attribute: str, *tree: str, **read_kws: Any) Any[source]#
Get metadata from a lookup table.
This function reads metadata from a pre-defined cache table. If the metadata is not present in the cache table, it reads the object store and adds the metadata to the cache table.
- Parameters:
path – Path to the object store / file name
attribute – The attribute that is retrieved from the data. Variable attributes can be defined by a dot (.). For example: tas.long_name would get attribute long_name from variable tas.
*tree – A tuple representing nested attributes. Attributes are nested for more efficient lookup. (‘atmos’, ‘1hr’, ‘tas’) will translate into a tree of [‘atmos’][‘1hr’][‘tas’]
**read_kws – Keyword arguments passed to open the datasets.
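The idea of a flattened, tuple-keyed table consulted before a per-file read can be pictured with a small standalone sketch. Everything in it is an assumption made for illustration and not the library's implementation: the flatten and lookup helpers, the key layout (tree plus attribute), the in-memory cache dictionary and fake_reader, which stands in for read_attr.

```python
from typing import Any, Callable, Dict, List, Mapping, Tuple

Key = Tuple[str, ...]


def flatten(nested: Mapping[str, Any], prefix: Key = ()) -> Dict[Key, Any]:
    """Flatten a nested mapping into a dict keyed by tuples of path segments."""
    flat: Dict[Key, Any] = {}
    for name, value in nested.items():
        if isinstance(value, Mapping):
            flat.update(flatten(value, prefix + (name,)))
        else:
            flat[prefix + (name,)] = value
    return flat


def lookup(
    static: Mapping[Key, Any],
    cache: Dict[Tuple[str, str], Any],
    reader: Callable[[str, str], Any],
    path: str,
    attribute: str,
    *tree: str,
) -> Any:
    """Check the static table first, then the cache, then read the file."""
    key = tree + (attribute,)
    if key in static:
        return static[key]
    if (path, attribute) not in cache:
        # Only a cache miss triggers the (expensive) read.
        cache[(path, attribute)] = reader(attribute, path)
    return cache[(path, attribute)]


nested = {"atmos": {"1hr": {"tas": {"long_name": "Near-Surface Air Temperature"}}}}
static = flatten(nested)
calls: List[Tuple[str, str]] = []


def fake_reader(attribute: str, path: str) -> str:
    """Hypothetical stand-in for read_attr that records each call."""
    calls.append((attribute, path))
    return f"{attribute} read from {path}"


cache: Dict[Tuple[str, str], Any] = {}
hit = lookup(static, cache, fake_reader, "a.nc", "long_name", "atmos", "1hr", "tas")
miss = lookup(static, cache, fake_reader, "a.nc", "units", "atmos", "1hr", "tas")
again = lookup(static, cache, fake_reader, "a.nc", "units", "atmos", "1hr", "tas")
print(hit, "|", miss, "|", len(calls))
```

The second request for units is served from the cache, so fake_reader runs only once. The sketch leaves out the disk cache and the de-duplication of concurrent misses listed above.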