An n-gram-based decontamination tool for LLM post-training datasets.

## Installation

```bash
uv pip install git+https://github.com/OpenEuroLLM/decontamination.git
```
## Quickstart

```bash
es build    # Build an Elasticsearch apptainer image
es prepare  # Prepare the image mounts and single-node configuration
es run      # Run the Elasticsearch container
ds index    # Index datasets defined in `configs/datasets.yaml`
ds search   # Find overlap with benchmarks defined in `configs/benchmarks.yaml`
ds remove [--huggingface-id <huggingface-id>]  # Decontaminate and (optionally) push to HF
```

## Configuration

All commands can be configured using either:
- command arguments; check:

  ```bash
  es [subcommand] --help
  ds [subcommand] --help
  ```

- a YAML config file, passed with the `--config <config_file>` argument; check:

  - `configs/es/build.yaml`
  - `configs/es/prepare.yaml`
  - `configs/es/run.yaml`
  - `configs/ds/index.yaml`
  - `configs/ds/search.yaml`
  - `configs/ds/remove.yaml`
To enable autocompletion for the CLI, run the following commands:

```bash
es --install-completion [bash|zsh|fish]
ds --install-completion [bash|zsh|fish]
```

## Elasticsearch

```bash
es build    # Build an Elasticsearch apptainer image
es prepare  # Prepare the image mounts and single-node configuration
es run      # Run the Elasticsearch container
```

Elasticsearch is used as a search engine to efficiently index the training datasets and perform n-gram search for overlap with evaluation benchmarks.
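As an illustration of the underlying idea (a hedged sketch only — not the tool's actual implementation or its Elasticsearch queries), n-gram contamination search boils down to comparing sets of word n-grams; the function name and the choice of `n = 5` below are illustrative assumptions:

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (simple whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# A benchmark example is suspicious if any of its n-grams also occurs
# in a training document (here n=5, an illustrative choice).
train_doc = "paris is the capital of france and its largest city"
benchmark = "question what is the capital of france answer paris"
overlap = word_ngrams(train_doc, 5) & word_ngrams(benchmark, 5)
print(overlap)  # → {('is', 'the', 'capital', 'of', 'france')}
```

In practice Elasticsearch performs this matching at scale over the full index rather than pairwise in memory.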
## `ds index`

The datasets to be indexed are specified in the YAML config file (check `configs/datasets.yaml`):

```yaml
- ...
  path: <dataset_path>
  subset: <subset>
  split: <split>
  ...
```

## `ds search`

The benchmarks to be searched for contamination are specified in the YAML config file (check `configs/benchmarks.yaml`):
```yaml
- ...
  path: <benchmark_path>
  subset: <subset>
  split: <split>
  ...
```

The search results are saved to `outputs/results.csv`.
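One common way to turn raw n-gram matches into a contamination verdict is to threshold the fraction of a benchmark example's n-grams found in the training index. The sketch below illustrates that idea only; the names, the `n = 8` default, and the threshold are assumptions, not the tool's actual logic:

```python
# Hedged sketch of a contamination decision: flag a benchmark example when
# the fraction of its n-grams found in the training index exceeds a
# threshold. All names and values here are illustrative assumptions.

def word_ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(example_text, train_index, n=8):
    grams = word_ngrams(example_text, n)
    if not grams:
        return 0.0
    return len(grams & train_index) / len(grams)

train_index = word_ngrams("the quick brown fox jumps over the lazy dog near the river", 8)
frac = overlap_fraction("the quick brown fox jumps over the lazy dog", train_index, 8)
print(frac > 0.5)  # → True
```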
## `ds remove [--huggingface-id <huggingface-id>]`

The decontaminated dataset is saved to `outputs/<dataset_name>-decontaminated/` locally, and to `<huggingface-id>/<dataset_name>-decontaminated/` on Hugging Face (if `--huggingface-id` is provided).
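Conceptually, the removal step filters out every training example flagged by the search step. The sketch below shows that idea with assumed names (`decontaminate`, index-based flags) rather than the tool's actual API:

```python
# Illustrative sketch (assumed names, not the tool's API) of the removal
# step: drop every training example whose index was flagged as contaminated
# by the search step.

def decontaminate(dataset, contaminated_ids):
    """Keep only examples not flagged as contaminated."""
    return [ex for i, ex in enumerate(dataset) if i not in contaminated_ids]

dataset = [
    {"text": "a clean training example"},
    {"text": "an example copied from a benchmark"},
    {"text": "another clean example"},
]
clean = decontaminate(dataset, contaminated_ids={1})
print(len(clean))  # → 2
```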
Abbreviations:

- `es` → `elasticsearch`
- `ds` → `datasets`
The decontamination scripts in this project are adapted from [allenai/open-instruct](https://github.com/allenai/open-instruct), the Allen Institute for AI's post-training codebase. See their Contamination checks section for the original implementation.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.