An n-gram-based decontamination tool for LLM post-training datasets.

## Installation

```bash
uv pip install git+https://github.com/OpenEuroLLM/decontamination.git
```
## Quickstart

```bash
es build    # Build an Elasticsearch apptainer image
es prepare  # Prepare the image mounts and single-node configuration
es run      # Run the Elasticsearch container
ds index    # Index datasets defined in `configs/datasets.yaml`
ds search   # Find overlap with benchmarks defined in `configs/benchmarks.yaml`
ds remove [--huggingface-id <huggingface-id>]  # Decontaminate and (optionally) push to HF
```

## Configuration

All commands can be configured using either:
- command arguments; check:

  ```bash
  es [subcommand] --help
  ds [subcommand] --help
  ```

- a YAML config file, passed with the `--config <config_file>` argument; check:

  - `configs/es/build.yaml`
  - `configs/es/prepare.yaml`
  - `configs/es/run.yaml`
  - `configs/ds/index.yaml`
  - `configs/ds/search.yaml`
  - `configs/ds/remove.yaml`
To enable autocompletion for the CLI, run the following commands:

```bash
es --install-completion [bash|zsh|fish]
ds --install-completion [bash|zsh|fish]
```

## Elasticsearch

```bash
es build    # Build an Elasticsearch apptainer image
es prepare  # Prepare the image mounts and single-node configuration
es run      # Run the Elasticsearch container
```

Elasticsearch is used as a search engine to efficiently index the training datasets and perform n-gram search for overlap with evaluation benchmarks.
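As an illustration of the underlying idea (a hedged sketch only — not the tool's actual implementation or its Elasticsearch queries), n-gram contamination search boils down to comparing sets of word n-grams; the function name and the choice of `n = 5` below are illustrative assumptions:

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (simple whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# A benchmark example is suspicious if any of its n-grams also occurs
# in a training document (here n=5, an illustrative choice).
train_doc = "paris is the capital of france and its largest city"
benchmark = "question what is the capital of france answer paris"
overlap = word_ngrams(train_doc, 5) & word_ngrams(benchmark, 5)
print(overlap)  # → {('is', 'the', 'capital', 'of', 'france')}
```

In practice Elasticsearch performs this matching at scale over the full index rather than pairwise in memory.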
## `ds index`

The datasets to be indexed are specified in the YAML config file (check `configs/datasets.yaml`):

```yaml
- ...
  path: <dataset_path>
  subset: <subset>
  split: <split>
  ...
```

## `ds search`

The benchmarks to be searched for contamination are specified in the YAML config file (check `configs/benchmarks.yaml`):
```yaml
- ...
  path: <benchmark_path>
  subset: <subset>
  split: <split>
  ...
```

The search results are saved to `outputs/results.csv`.
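One common way to turn raw n-gram matches into a contamination verdict is to threshold the fraction of a benchmark example's n-grams found in the training index. The sketch below illustrates that idea only; the names, the `n = 8` default, and the threshold are assumptions, not the tool's actual logic:

```python
# Hedged sketch of a contamination decision: flag a benchmark example when
# the fraction of its n-grams found in the training index exceeds a
# threshold. All names and values here are illustrative assumptions.

def word_ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(example_text, train_index, n=8):
    grams = word_ngrams(example_text, n)
    if not grams:
        return 0.0
    return len(grams & train_index) / len(grams)

train_index = word_ngrams("the quick brown fox jumps over the lazy dog near the river", 8)
frac = overlap_fraction("the quick brown fox jumps over the lazy dog", train_index, 8)
print(frac > 0.5)  # → True
```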
## `ds remove [--huggingface-id <huggingface-id>]`

The decontaminated dataset is saved to `outputs/<dataset_name>-decontaminated/` locally, and to `<huggingface-id>/<dataset_name>-decontaminated/` on Hugging Face (if `--huggingface-id` is provided).
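Conceptually, the removal step filters out every training example flagged by the search step. The sketch below shows that idea with assumed names (`decontaminate`, index-based flags) rather than the tool's actual API:

```python
# Illustrative sketch (assumed names, not the tool's API) of the removal
# step: drop every training example whose index was flagged as contaminated
# by the search step.

def decontaminate(dataset, contaminated_ids):
    """Keep only examples not flagged as contaminated."""
    return [ex for i, ex in enumerate(dataset) if i not in contaminated_ids]

dataset = [
    {"text": "a clean training example"},
    {"text": "an example copied from a benchmark"},
    {"text": "another clean example"},
]
clean = decontaminate(dataset, contaminated_ids={1})
print(len(clean))  # → 2
```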
Abbreviations:

- `es` → `elasticsearch`
- `ds` → `datasets`
The decontamination scripts in this project are adapted from [allenai/open-instruct](https://github.com/allenai/open-instruct), the Allen Institute for AI's post-training codebase. See their Contamination checks section for the original implementation.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.