Add the possibility to retrieve descriptions from datasets #1052
pfebrer wants to merge 2 commits
Am I getting this right that you just move the dataset option to an external file? Wouldn't it be nicer to allow something like training_set: foo.yaml and we just include this yaml file when building the full config? Seems like people have already thought about it: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another
Yes, sure, one can have the yaml file separately, but I don't see a reason why the dataset and the specification of the data should be shipped separately; it introduces an unnecessary source of mistakes. Also, the idea here is not simply that you "import" some yaml options. The idea is that [...]

As an example, if you have a disk dataset with 10 targets and you only want to fit one, the current PR would still work with:

    training_set:
      systems: /path/to/dataset.zip
      targets:
        mtt::the_target: {}

while for doing

    training_set: /path/to/dataset.yaml

you'd still have to create the yaml yourself with one target. Also, when doing this, the path to the disk dataset found inside the yaml will always have to be right. So in the end these [...]

In summary, the idea of the PR is not to import yaml files but to allow the datasets to be self-descriptive, so that we minimize the burden on users and the possibilities for them to make mistakes.
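To make the single-target example above concrete, here is a minimal sketch of how a description shipped with the dataset could fill in the options of a target that the user lists with an empty `{}`. This is not the PR's `expand_dataset_config`: the helper name `fill_from_description`, the option names `unit` and `quantity`, and the rule that only user-listed targets are kept are all assumptions made for illustration.

```python
from copy import deepcopy


def fill_from_description(user_cfg: dict, description: dict) -> dict:
    """Fill a user dataset config with defaults taken from a dataset description.

    Hypothetical helper for illustration only; the real merging logic may differ.
    """
    merged = deepcopy(user_cfg)
    described = description.get("targets", {})

    if "targets" not in user_cfg:
        # Assumption: with no explicit selection, every described target is used.
        merged["targets"] = deepcopy(described)
        return merged

    filled = {}
    for name, user_opts in user_cfg["targets"].items():
        # Described values act as defaults; options given by the user win.
        # This is a shallow, per-target merge, not a recursive one.
        filled[name] = {**deepcopy(described.get(name, {})), **user_opts}
    merged["targets"] = filled
    return merged


# Illustrative description with two targets (the values are made up).
description = {
    "targets": {
        "mtt::the_target": {"unit": "eV", "quantity": "energy"},
        "mtt::another_target": {"unit": "eV/A"},
    }
}

# The user only names the target they want to fit.
user_cfg = {"systems": "/path/to/dataset.zip", "targets": {"mtt::the_target": {}}}

expanded = fill_from_description(user_cfg, description)
```

Under these assumptions, `expanded` keeps only `mtt::the_target`, with its options taken from the description, and `user_cfg` itself is left unmodified.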
Concept/problem to solve
The idea here is that the properties of the systems, targets and extra_data (e.g. units, target type, etc.) are almost always inherent to the data. Therefore it would be nice if all these properties could be associated with the datasets, so that users don't have to pass them every time in their input yaml files. This is particularly important for the upcoming electronic structure targets, where the target descriptions become huge.
Implementation
After several implementations that I deemed too complex, I converged to the current state of things which I think is reasonable.
The idea is:
[...] systems, targets and extra_data of the input yaml.
This is done in the expand_dataset_config function of utils/omegaconf.py. The other two files have minimal changes that are not important to understand what is going on.
For disk datasets, this information is stored simply as an extra yaml file inside the zip, with the name mtt_dataset_description.yaml. This can be done from the graphical interface of any good OS, so it needs no code.
For xyz datasets, I think this concept could also be useful, but for now I haven't implemented anything. I think it might be better to not do it for now and see how things go for disk datasets. Just for the record, an idea that I had was that one could do:
What's missing
Only tests are missing. I will wait to get some feedback before writing them.
What are the effects of this PR
A real-life input file that I have looks something like this:
with this PR it gets reduced to:
by adding the following mtt_dataset_description.yaml to QCML_train_512_projs.zip:
📚 Documentation preview 📚: https://metatrain--1052.org.readthedocs.build/en/1052/
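Although the description file can be dropped into the zip by hand, it could also be added from a script. The following is a minimal sketch using only Python's standard `zipfile` module. The helper names `add_description` and `read_description` are hypothetical, the yaml text in the demo is illustrative, and placing the file at the top level of the archive is an assumption.

```python
import tempfile
import zipfile
from pathlib import Path

# File name used by the PR for the description stored inside the zip.
DESCRIPTION_NAME = "mtt_dataset_description.yaml"


def add_description(zip_path, yaml_text: str) -> None:
    """Append the description file to an existing zip archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_path, mode="a") as archive:
        if DESCRIPTION_NAME in archive.namelist():
            raise ValueError(f"{DESCRIPTION_NAME} already present in {zip_path}")
        archive.writestr(DESCRIPTION_NAME, yaml_text)


def read_description(zip_path):
    """Return the description text, or None if the archive has none."""
    with zipfile.ZipFile(zip_path) as archive:
        if DESCRIPTION_NAME not in archive.namelist():
            return None
        return archive.read(DESCRIPTION_NAME).decode("utf-8")


# Demo on a throwaway archive standing in for a real disk dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "dataset.zip"
    with zipfile.ZipFile(demo_zip, mode="w") as archive:
        archive.writestr("placeholder.txt", "stand-in for the real dataset files")

    before = read_description(demo_zip)  # no description yet
    add_description(demo_zip, "targets:\n  mtt::the_target:\n    unit: eV\n")
    after = read_description(demo_zip)  # the yaml text that was just added
```

Opening the archive in append mode leaves the existing dataset files untouched, and the explicit check avoids writing a second entry with the same name.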