
Add the possibility to retrieve descriptions from datasets #1052

Open
pfebrer wants to merge 2 commits into main from diskdataset_descriptor


pfebrer commented Feb 25, 2026

Contributor

Concept/problem to solve

The properties of the systems, targets and extra_data (units, target type, etc.) are almost always inherent to the data. It would therefore be nice if these properties could be associated with the datasets, so that users don't have to pass them every time in their input yaml files. This is particularly important for the upcoming electronic structure targets, where the target descriptions become huge.

Implementation

After several implementations that I deemed too complex, I converged on the current one, which I think is reasonable.

The idea is:

  • We associate "dataset descriptions" to the datasets.
  • These descriptions follow the same structure as the existing dataset hypers.
  • We take whatever information we find in these descriptions as the defaults for the fields inside systems, targets and extra_data of the input yaml.

This is done in the expand_dataset_config function of utils/omegaconf.py. The other two files have minimal changes that are not needed to understand what is going on.
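To illustrate the "description as defaults" idea, here is a minimal sketch using plain dictionaries. It is not the code of expand_dataset_config; the helper name merge_defaults and the sample values are assumptions made for illustration only.

```python
from copy import deepcopy


def merge_defaults(defaults: dict, user: dict) -> dict:
    """Return a new dict where `user` values override `defaults`.

    Nested dicts are merged recursively; any other value is replaced.
    (Hypothetical helper, not part of metatrain.)
    """
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_defaults(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Fields that would come from the dataset description (sample values).
description_target = {
    "quantity": "",
    "unit": "",
    "per_atom": True,
    "type": {"spherical": {"irreps": [{"o3_lambda": 0, "o3_sigma": 1}]}},
}

# The user writes `{}` in the input yaml: everything comes from the description.
from_description = merge_defaults(description_target, {})

# The user sets one field explicitly: that field wins, the rest is inherited.
overridden = merge_defaults(description_target, {"per_atom": False})
```

The point of the design is that anything written in the input yaml still takes precedence, so the description only fills in what the user left out.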

For disk datasets, this information is stored simply as an extra yaml file inside the zip, named mtt_dataset_description.yaml. Adding a file to a zip archive can be done from the file manager of most operating systems, so it needs no code.
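For cases where a script is more convenient than a file manager, the same thing can be sketched with the Python standard library. The helper names add_description and read_description are hypothetical, and the sketch assumes the description sits at the root of the archive.

```python
import tempfile
import zipfile
from pathlib import Path

DESCRIPTION_NAME = "mtt_dataset_description.yaml"  # file name given in this PR


def add_description(zip_path, yaml_text):
    """Append the description file to an existing zip archive."""
    with zipfile.ZipFile(zip_path, mode="a") as archive:
        if DESCRIPTION_NAME in archive.namelist():
            raise FileExistsError(f"{DESCRIPTION_NAME} is already in {zip_path}")
        archive.writestr(DESCRIPTION_NAME, yaml_text)


def read_description(zip_path):
    """Return the description text, or None if the archive has none."""
    with zipfile.ZipFile(zip_path) as archive:
        if DESCRIPTION_NAME not in archive.namelist():
            return None
        return archive.read(DESCRIPTION_NAME).decode("utf-8")


# Demo on a throwaway archive (a stand-in for a real disk dataset).
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset.zip"
    with zipfile.ZipFile(dataset, mode="w") as archive:
        archive.writestr("placeholder.txt", "stand-in for the dataset contents")

    before = read_description(dataset)
    add_description(dataset, "systems:\n  length_unit: angstrom\n")
    after = read_description(dataset)
```

Opening the archive in append mode leaves the existing dataset contents untouched and only adds the new member.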

For xyz datasets, I think this concept could also be useful, but I haven't implemented anything yet. It might be better to hold off for now and see how things go for disk datasets. Just for the record, one idea I had is to allow:

read_from: {data: dataset.xyz, description: description.yaml}
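Since this form is only an idea and nothing is implemented, the following is purely a sketch of how both forms of read_from could be accepted side by side. The function name normalize_read_from and the error messages are assumptions.

```python
def normalize_read_from(read_from):
    """Return (data_path, description_path_or_None) for either form of `read_from`.

    Hypothetical helper: a plain string is the data path alone, while a
    mapping carries `data` and an optional `description`.
    """
    if isinstance(read_from, str):
        return read_from, None
    if isinstance(read_from, dict):
        if "data" not in read_from:
            raise ValueError("a read_from mapping needs a 'data' entry")
        return read_from["data"], read_from.get("description")
    raise TypeError(f"unsupported read_from type: {type(read_from).__name__}")


plain = normalize_read_from("dataset.xyz")
with_description = normalize_read_from(
    {"data": "dataset.xyz", "description": "description.yaml"}
)
```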

What's missing

Only tests are missing. I will wait to get some feedback before writing them.

What are the effects of this PR

A real-life input file that I have looks something like this:

training_set:
  systems:
    read_from: QCML_train_512_projs.zip
    length_unit: angstrom
  targets:
    mtt::electron_density_basis_projs:
      quantity: ""
      unit: ""
      per_atom: true
      type:
        spherical:
          irreps:
            - {o3_lambda: 0, o3_sigma: 1}
            - {o3_lambda: 1, o3_sigma: 1}
            - {o3_lambda: 2, o3_sigma: 1}
            # ...many more irreps
  extra_data:
    mtt::electron_density_basis_projs_mask:
      per_atom: true
      type:
        spherical:
          irreps:
            - {o3_lambda: 0, o3_sigma: 1}
            - {o3_lambda: 1, o3_sigma: 1}
            - {o3_lambda: 2, o3_sigma: 1}
            # ...many more irreps

With this PR, it is reduced to:

training_set:
  systems: QCML_train_512_projs.zip
  targets:
    mtt::electron_density_basis_projs: {}
  extra_data:
    mtt::electron_density_basis_projs_mask: {}

by adding the following mtt_dataset_description.yaml to QCML_train_512_projs.zip:

systems:
  length_unit: angstrom

variables:
  mtt::electron_density_basis_projs:
    quantity: ""
    unit: ""
    per_atom: true
    type:
      spherical:
        irreps:
          - {o3_lambda: 0, o3_sigma: 1}
          - {o3_lambda: 1, o3_sigma: 1}
          - {o3_lambda: 2, o3_sigma: 1}
          # ...many more irreps

  mtt::electron_density_basis_projs_mask:
    per_atom: true
    type:
      spherical:
        irreps:
          - {o3_lambda: 0, o3_sigma: 1}
          - {o3_lambda: 1, o3_sigma: 1}
          - {o3_lambda: 2, o3_sigma: 1}
          # ...many more irreps

📚 Documentation preview 📚: https://metatrain--1052.org.readthedocs.build/en/1052/

pfebrer force-pushed the diskdataset_descriptor branch from bb15ea1 to 92901ba on March 9, 2026 11:32
pfebrer requested a review from PicoCentauri on March 9, 2026 11:34

PicoCentauri commented Mar 19, 2026

Contributor

Am I getting this right that you just move the dataset options to an external .yaml file?

Wouldn't it be nicer to allow something like

training_set: foo.yaml

and we just include this yaml file when building the full config?

Seems like people already thought about it: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another


pfebrer commented Mar 19, 2026

Contributor Author

Yes, sure, one can have the yaml file separately, but I don't see a reason why the dataset and the specification of the data should be shipped separately. It introduces an unnecessary source of mistakes.

Also, the idea here is not simply that you "import" some yaml options. The idea is that metatrain can infer the information about the targets from the yaml file in the dataset, but you are not forced to use all the targets in the dataset.

As an example, if you have a disk dataset with 10 targets and you only want to fit one, the current PR would still work with:

training_set:
    systems: /path/to/dataset.zip
    targets:
        mtt::the_target: {}

while for doing

training_set: /path/to/dataset.yaml

you'd still have to create the yaml yourself with one target. Also, when doing this, the path to the disk dataset found inside the yaml always has to be right. So in the end these dataset.yaml files would not be very reusable.

In summary, the idea of the PR is not to import yaml files but to make the datasets self-descriptive, so that we minimize both the burden on users and their opportunities to make mistakes.
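The single-target case above can be sketched as follows. This is an illustration with plain dictionaries rather than the PR's implementation: the function name expand_targets and the target names are made up, and only the top-level `variables` key is taken from the description file shown in the PR.

```python
def expand_targets(requested: dict, description: dict) -> dict:
    """Fill in only the requested targets from the dataset description.

    Targets present in the description but not requested are ignored.
    Fields given by the user override the description (shallow overlay,
    kept simple for the sketch).
    """
    variables = description.get("variables", {})
    expanded = {}
    for name, user_fields in requested.items():
        defaults = variables.get(name, {})
        expanded[name] = {**defaults, **(user_fields or {})}
    return expanded


# A description covering 10 targets (hypothetical names and values).
description = {
    "variables": {
        f"mtt::target_{i}": {"unit": "", "per_atom": True} for i in range(10)
    }
}

# The input yaml asks for a single target with an empty mapping.
requested = {"mtt::target_3": {}}
expanded = expand_targets(requested, description)
```

Because the lookup is driven by the names in the input file, the same zip can be reused unchanged for any subset of its targets.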


2 participants