Datasets#
A dataset entry identifies a logical collection of files or objects
to crawl. In a drs_config.toml file each dataset appears under
its own top‑level table. At minimum you must specify a
root_path (the top directory or bucket prefix) and a
drs_format (the dialect name). Additional keys control which
storage backend is used and allow per‑dataset defaults.
Example#
The following snippet defines three datasets. cmip6-fs is a
CMIP6 dataset stored on a POSIX filesystem; obs-s3 lives in a
MinIO/S3 bucket; and ngm-swift is hosted on a Swift object
store. Defaults override or add facet values when records are
constructed.
.. admonition:: TOML CONFIG
# CMIP6 data on a local filesystem [cmip6-fs] root_path = "/data/model/global/cmip6" drs_format = "cmip6" # optional: defaults [cmip6-fs.defaults] project = "CMIP6" # NextGems stored on s3 [ngm-s3] root_path = "/freva/nexgems" inherits_from = "obs-s3" # (anything else like for obs-s3) glob_pattern = "*.zarr" # (get only zarr stores) # Observational data on S3/MinIO [obs-s3] root_path = "s3://freva/observations" drs_format = "obs" fs_type = "s3" [obs-s3.storage_options] endpoint_url = "https://play.min.io" aws_access_key_id = "<ACCESS_KEY>" aws_secret_access_key = "<SECRET_KEY>" region = "us-east-1" [obs-s3.defaults] project = "observations" # NextGEMS data on Swift [ngm-swift] root_path = "nextgems/era5" drs_format = "nextgems" fs_type = "swift" [ngm-swift.storage_options] os_auth_url = "https://swift.example.com/auth/v3" os_storage_url = "https://swift.example.com/v1" os_username = "{{ env['USER'] | default('myuser') }}" os_password = "{{ env['PASS'] }}" os_project_name = "<PROJECT>" os_user_domain_name = "Default" os_project_domain_name = "Default" [ngm-swift.defaults] project = "nextgems"
Keys#
root_path – Path or prefix to search for files. For S3 and Swift backends it is the prefix inside the bucket/container. For POSIX it is an absolute directory. When using the intake backend this can be a intake or intake-esm catalog file.
drs_format – Name of the DRS dialect to use. Must match one of the dialects defined under
[drs_settings.dialect].fs_type – Optional override of the storage backend for this dataset. Supported values include
posix(default),s3,swiftandintake. Backends determine how the crawler traverses paths and reads data.storage_options – A table of backend specific options. For S3 this may include
endpoint_url,aws_access_key_id,aws_secret_access_key,regionandendpoint_url. For Swift use keys beginning withos_(see the Swift example). Additional fields can be added for custom backends.defaults – Facet values that should be filled in when the corresponding key is absent in the parsed metadata. These defaults override the dialect defaults if both are specified.
inherits_from – Onther dataset name to take defaults from.
glob_pattern – Apply this glob pattern for file object discovery.
Storage options#
The fs_type key determines which storage backend the crawler uses
to traverse files. Each backend accepts a set of backend specific parameters
under storage_options. This section summarises supported backends and
their options.
POSIX (default)#
If fs_type is omitted or set to posix, the crawler walks a
local filesystem. No special options are required.
S3/MinIO#
Set fs_type = "s3" to crawl an object store exposing the S3
API. Required options include:
endpoint_url– The full URL of the S3 endpoint (e.g.https://s3.amazonaws.comor a MinIO server).key,secret– Credentials.region– Region name (may be optional for MinIO).root_path(s3://my-bucket/path) or supply it viastorage_options.bucket.
Swift#
Set fs_type = "swift" to crawl an OpenStack Swift object store.
Swift uses Keystone v3 credentials. Required options include:
os_auth_url– Keystone authentication URL.os_storage_url– Storage URL for object endpoints; typically ends in/v1.os_username,os_password,os_project_id– Identity credentials.container– The container name may be specified either in thestorage_optionsor as part of theroot_path(first path component).
Intake#
When fs_type = "intake" the crawler reads from a intake-esm or other
Intake catalog rather than walking a directory. The root_path
points to the CSV file and storage_options are not required.
Custom backends#
The API can be extended with new storage backends (see
Adding storage backends). Provide the backend
class via an entry point or plugin and specify its name in
fs_type. Your backend may define arbitrary storage_options.