.. _specs: Path and data specs ------------------- Metadata can be extracted from three primary sources: the file system path and the dataset contents and the filesystem itself. Dialects declare how to interpret each via ``specs_dir``/``specs_file`` and ``data_specs``. Path specs (specs_dir and specs_file) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * ``specs_dir`` lists facet names corresponding to directory components. The crawler takes the relative path (between ``root_path`` and the file name), splits it on ``/``, and assigns each segment in order. If the number of segments differs from the length of ``specs_dir``, the extra segments are ignored. * ``specs_file`` lists facet names corresponding to parts of the file name. The file name (without extension) is split on the ``file_sep`` character (``_`` by default). Each part is mapped to a facet. When fewer parts are present than entries in ``specs_file`` the remainder is ignored; conversely extra parts are discarded. This mechanism makes it easy to support DRS naming conventions like: .. code-block:: console mip_era/activity_id/institution_id/source_id/experiment_id/member_id\ /table_id/variable_id/grid_label/version/\ variable_id_table_id_source_id_experiment_id_member_id_grid_label_time.nc Data specs (data_specs) ^^^^^^^^^^^^^^^^^^^^^^^ ``data_specs`` defines how to read attributes and variables from the dataset file itself (e.g. netCDF, Zarr). A ``data_specs`` table contains three subsections: * ``globals`` – A mapping from facet names to dataset global attribute names. For example ``project = "mip_era"`` means the global attribute ``mip_era`` populates the ``project`` facet. * ``var_attrs`` – Rules to extract attributes from specific variables. Each entry describes the target facet, the variable name (can be a placeholder like ``__variable__`` meaning all variables, or a format string like ``{{variable}}`` that refers to the parsed ``variable`` facet), the attribute name .. admonition:: TOML CONFIG .. code-block:: toml [drs_settings.dialect.cmip6.data_specs.var_attrs] # attach standard_name attribute of each variable to the facet 'variable' variable = { var="__variables__", attr="standard_name"} * ``stats`` – Extract numeric statistics from variables or coordinate arrays. Each entry specifies a ``stat`` type (``min``, ``max``, ``minmax``, ``range``, or ``bbox``), the target facet, and variables or coordinate names. ``range`` returns start and end values (useful for time coordinates); ``bbox`` computes the bounding box from latitude and longitude variables. .. admonition:: TOML CONFIG .. code-block:: toml [drs_settings.dialect.cmip6.data_specs.stats.time] stat = "range" coord = "time" default = ["1970-01-01", "1970-01-01"] When ``sources`` includes ``data`` the crawler opens the dataset with ``xarray`` and applies these rules after path parsing. A cache mechanism ensures that repeated attribute lookups across many files are efficient.