Add the possibility to retrieve descriptions from datasets by pfebrer · Pull Request #1052 · metatensor/metatrain

pfebrer · 2026-02-25T00:22:27Z

Concept/problem to solve

The idea here is that the properties of the systems, targets and extra_data (e.g. units, target type, etc.) are almost always inherent to the data. It would therefore be nice if these properties could be associated with the datasets, so that users don't have to pass them every time in their input yaml files. This is particularly important for the upcoming electronic structure targets, where the target descriptions become huge.

Implementation

After several implementations that I deemed too complex, I converged on the current state of things, which I think is reasonable.

The idea is:

We associate "dataset descriptions" to the datasets.
These descriptions follow the same structure as the existing dataset hypers.
We take whatever information we find in these descriptions as the defaults for the fields inside systems, targets and extra_data of the input yaml.

This is done in the expand_dataset_config function of utils/omegaconf.py. The other two files have minimal changes that are not important to understand what is going on.
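The defaulting rule can be pictured as a recursive merge in which whatever the user writes wins over what the description provides. The following is only a minimal sketch on plain dictionaries: `merge_defaults` is a hypothetical helper written for illustration, not the code of `expand_dataset_config`, which may differ in detail.

```python
from copy import deepcopy


def merge_defaults(defaults: dict, user: dict) -> dict:
    """Return a new dict where `user` values override `defaults`, recursively."""
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: descend and merge key by key.
            merged[key] = merge_defaults(merged[key], value)
        else:
            # Scalars, lists, or mismatched types: the user's value wins.
            merged[key] = deepcopy(value)
    return merged


# Defaults as they might come from a dataset description (illustrative values).
description_defaults = {"length_unit": "angstrom"}
# What the user wrote in the input yaml.
user_systems = {"read_from": "dataset.zip"}

systems = merge_defaults(description_defaults, user_systems)
print(systems)
```

Fields the user leaves out fall back to the description, and fields the user sets explicitly are never overwritten.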

For disk datasets, this information is stored simply as an extra yaml file inside the zip, named mtt_dataset_description.yaml. Adding the file to the zip can be done from the file manager of most operating systems, so it needs no code.
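For anyone who prefers to script this step, the file can also be appended with Python's standard `zipfile` module. This is a sketch under assumptions: `add_description` and `read_description` are hypothetical helpers, and the archive created here is a throwaway stand-in for a real disk dataset.

```python
import tempfile
import zipfile
from pathlib import Path

DESCRIPTION_NAME = "mtt_dataset_description.yaml"

DESCRIPTION_TEXT = """\
systems:
  length_unit: angstrom
"""


def add_description(zip_path: Path, text: str) -> None:
    """Append the description file to an existing zip archive."""
    with zipfile.ZipFile(zip_path, mode="a") as archive:
        if DESCRIPTION_NAME in archive.namelist():
            raise FileExistsError(f"{DESCRIPTION_NAME} already in {zip_path}")
        archive.writestr(DESCRIPTION_NAME, text)


def read_description(zip_path: Path) -> str:
    """Return the raw yaml text of the description stored in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(DESCRIPTION_NAME).decode("utf-8")


# Demonstration on a throwaway archive standing in for a real disk dataset.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset.zip"
    with zipfile.ZipFile(dataset, mode="w") as archive:
        archive.writestr("placeholder.txt", "stand-in for the dataset contents")
    add_description(dataset, DESCRIPTION_TEXT)
    roundtrip = read_description(dataset)
    print(roundtrip)
```

Opening the archive in append mode leaves the existing members untouched, so the dataset itself does not need to be rewritten.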

For xyz datasets, I think this concept could also be useful, but for now I haven't implemented anything. I think it might be better to not do it for now and see how things go for disk datasets. Just for the record, an idea that I had was that one could do:

read_from: {data: dataset.xyz, description: description.yaml}

What's missing

Only tests are missing. I will wait to get some feedback before writing them.

What are the effects of this PR

A real-life input file that I have looks something like this:

training_set:
  systems:
    read_from: QCML_train_512_projs.zip
    length_unit: angstrom
  targets:
    mtt::electron_density_basis_projs:
      quantity: ""
      unit: ""
      per_atom: true
      type:
        spherical:
          irreps:
            - {o3_lambda: 0, o3_sigma: 1}
            - {o3_lambda: 1, o3_sigma: 1}
            - {o3_lambda: 2, o3_sigma: 1}
            # ... many more irreps
  extra_data:
    mtt::electron_density_basis_projs_mask:
      per_atom: true
      type:
        spherical:
          irreps:
            - {o3_lambda: 0, o3_sigma: 1}
            - {o3_lambda: 1, o3_sigma: 1}
            - {o3_lambda: 2, o3_sigma: 1}
            # ... many more irreps

With this PR it is reduced to:

training_set:
  systems: QCML_train_512_projs.zip
  targets:
    mtt::electron_density_basis_projs: {}
  extra_data:
    mtt::electron_density_basis_projs_mask: {}

by adding the following mtt_dataset_description.yaml to QCML_train_512_projs.zip:

systems:
  length_unit: angstrom

variables:
  mtt::electron_density_basis_projs:
    quantity: ""
    unit: ""
    per_atom: true
    type:
      spherical:
        irreps:
          - {o3_lambda: 0, o3_sigma: 1}
          - {o3_lambda: 1, o3_sigma: 1}
          - {o3_lambda: 2, o3_sigma: 1}
          # ... many more irreps

  mtt::electron_density_basis_projs_mask:
    per_atom: true
    type:
      spherical:
        irreps:
          - {o3_lambda: 0, o3_sigma: 1}
          - {o3_lambda: 1, o3_sigma: 1}
          - {o3_lambda: 2, o3_sigma: 1}
          # ... many more irreps
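To make the relation between the short input and the description concrete, here is a rough sketch of how the two could be combined. It assumes the description has already been parsed into a dictionary; `expand` is a hypothetical function, not the PR's implementation; only the three irreps shown above are included; and a shallow dictionary union is used for brevity where a real implementation would likely merge nested fields.

```python
IRREPS = [
    {"o3_lambda": 0, "o3_sigma": 1},
    {"o3_lambda": 1, "o3_sigma": 1},
    {"o3_lambda": 2, "o3_sigma": 1},
]  # the remaining irreps are elided, as in the yaml above

# Parsed content of mtt_dataset_description.yaml.
description = {
    "systems": {"length_unit": "angstrom"},
    "variables": {
        "mtt::electron_density_basis_projs": {
            "quantity": "",
            "unit": "",
            "per_atom": True,
            "type": {"spherical": {"irreps": IRREPS}},
        },
        "mtt::electron_density_basis_projs_mask": {
            "per_atom": True,
            "type": {"spherical": {"irreps": IRREPS}},
        },
    },
}

# The reduced training_set section written by the user.
user = {
    "systems": "QCML_train_512_projs.zip",
    "targets": {"mtt::electron_density_basis_projs": {}},
    "extra_data": {"mtt::electron_density_basis_projs_mask": {}},
}


def expand(user: dict, description: dict) -> dict:
    """Fill in the user's sections with defaults found in the description."""
    systems = user["systems"]
    if isinstance(systems, str):
        # A bare string is shorthand for the file to read from.
        systems = {"read_from": systems}
    expanded = {"systems": {**description.get("systems", {}), **systems}}
    variables = description.get("variables", {})
    for section in ("targets", "extra_data"):
        # Only the names the user asked for are looked up in the description.
        expanded[section] = {
            name: {**variables.get(name, {}), **options}
            for name, options in user.get(section, {}).items()
        }
    return expanded


expanded = expand(user, description)
print(expanded["systems"])
```

Because the lookup is driven by the names in the user's input, variables that are described in the dataset but not requested simply do not appear in the expanded config.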

📚 Documentation preview 📚: https://metatrain--1052.org.readthedocs.build/en/1052/

PicoCentauri · 2026-03-19T12:31:26Z

Am I getting this right that you just move the dataset options to an external .yaml file?

Wouldn't it be nicer to allow something like

training_set: foo.yaml

and we just include this yaml file when building the full config?

Seems like people already thought about it: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another

pfebrer · 2026-03-19T12:43:17Z

Yes, sure, one can have the yaml file separately, but I don't see a reason why the dataset and the specification of the data should be shipped separately; it introduces an unnecessary source of mistakes.

Also, the idea here is not simply that you "import" some yaml options. The idea is that metatrain can infer the information about the targets from the yaml file in the dataset, but you are not forced to use all targets in the dataset.

As an example, if you have a disk dataset with 10 targets and you only want to fit one, the current PR would still work with:

training_set:
    systems: /path/to/dataset.zip
    targets:
        mtt::the_target: {}

whereas with

training_set: /path/to/dataset.yaml

you'd still have to create the yaml yourself with one target. Also, when doing this, the path to the disk dataset found inside the yaml would always have to be right. So in the end these dataset.yaml files would not be very reusable.

In summary, the idea of the PR is not to import yaml files but to allow the datasets to be self-descriptive, so that we minimize the burden on users and the opportunities for them to make mistakes.

pfebrer force-pushed the diskdataset_descriptor branch from 80aba6c to 8cfce5c Compare February 25, 2026 00:37

pfebrer mentioned this pull request Mar 1, 2026

Adding spherical atomic basis targets #1062

Closed

Add the possibility to retrieve descriptions from datasets

92901ba

pfebrer force-pushed the diskdataset_descriptor branch from bb15ea1 to 92901ba Compare March 9, 2026 11:32

pfebrer requested a review from PicoCentauri March 9, 2026 11:34

Lint for docs build

3a99d82

pfebrer wants to merge 2 commits into main from diskdataset_descriptor

2 participants