Path and data specs#
Metadata can be extracted from three primary sources: the file system
path and the dataset contents and the filesystem itself. Dialects declare
how to interpret each via path_specs and data_specs.
Path specs (dir_parts and file_parts)#
dir_partslists facet names corresponding to directory components. The crawler takes the relative path (betweenfile_partsand the file name), splits it on/, and assigns each segment in order. If the number of segments differs from the length ofdir_parts, the extra segments are ignored.file_partslists facet names corresponding to parts of the file name. The file name (without extension) is split on thefile_sepcharacter (_by default). Each part is mapped to a facet. When fewer parts are present than entries infile_partsthe remainder is ignored; conversely extra parts are discarded.
This mechanism makes it easy to support DRS naming conventions like:
mip_era/activity_id/institution_id/source_id/experiment_id/member_id\
/table_id/variable_id/grid_label/version/\
variable_id_table_id_source_id_experiment_id_member_id_grid_label_time.nc
Data specs (data_specs)#
data_specs defines how to read attributes and variables from the
dataset file itself (e.g. netCDF, Zarr). A data_specs table
contains three subsections:
globals- A mapping from facet names to dataset global attribute names. For exampleproject = "mip_era"means the global attributemip_erapopulates theprojectfacet.var_attrs- Rules to extract attributes from specific variables. Each entry describes the target facet, the variable name (can be a placeholder like__variable__meaning all variables, or a format string like{{variable}}that refers to the parsedvariablefacet), the attribute name
TOML CONFIG
[drs_settings.dialect.cmip6.data_specs.var_attrs]
# attach standard_name attribute of each variable to the facet 'variable'
variable = { var="__variables__", attr="standard_name"}
stats- Extract numeric statistics from variables or coordinate arrays. Each entry specifies astattype (min,max,minmax,range,bboxortimedelta), the target facet, and variables or coordinate names.rangereturns start and end values (useful for time coordinates);bboxcomputes the bounding box from latitude and longitude variables.
TOML CONFIG
[drs_settings.dialect.cmip6.data_specs.stats.time]
stat = "range"
coord = "time"
default = ["1970-01-01", "1970-01-01"]
Infer the time frequency according to CMOR specifications:
TOML CONFIG
[drs_settings.dialect.cmip6.data_specs.stats.time_frequency]
stat = "timedelta"
var = "time"
default = "fx"
When sources includes data the crawler opens the dataset with
xarray and applies these rules after path parsing. A cache
mechanism ensures that repeated attribute lookups across many files
are efficient.
read_kws- Keyword arguments that are passed toxarrayto open datasets. Such asengine=netcdf.