WORC Database: Downloaders and Experiments

This repository contains scripts to download six public datasets and reproduce radiomics experiments on these six datasets and two other public datasets. The six public datasets are described in the following paper:

Starmans, M. P. A. et al. (2021). The WORC* database: MRI and CT scans, segmentations, and clinical labels for 932 patients from six radiomics studies. Submitted, preprint available from https://doi.org/10.1101/2021.08.19.21262238

The data used for six of the datasets can be found at https://xnat.health-ri.nl/data/projects/worc.

The experiments are described in the following paper: Starmans, M. P. A. et al. (2021). Reproducible radiomics through automated machine learning validated on twelve clinical applications. Submitted, preprint available from https://arxiv.org/abs/2108.08618.

License

When using parts of this code or the above datasets, please cite the two above mentioned papers. Feel free to also cite the code from this repository:

Martijn P.A. Starmans. WORCDatabase. Zenodo (2021). Available from: https://github.com/MStarmans91/WORCDatabase. DOI: http://doi.org/10.5281/zenodo.5119040

For the DOI, visit .

Installation

The experiments only require the WORC Python package to be installed, which can be done using pip. In the original WORC paper, version 3.6.3 was used:

pip install "WORC==3.6.3"

The requirements for WORC itself can be found at https://github.com/MStarmans91/WORC.

Usage: download data

The six public datasets as published in the Data in Brief paper can be downloaded using the provided datadownloader.py script. These are downloaded from the XNAT repository at https://xnat.health-ri.nl/data/projects/worc. Simply import one of the following functions from the script and run it:

download_Lipo
download_Desmoid
download_Liver
download_GIST
download_CRLM
download_Melanoma

Additionally, we provide downloaders for two of the three previously publicly published datasets used in the WORC MEDIA paper:

download_HeadAndNeck
download_Glioma

Documentation for these functions can be found in the docstrings of the functions.

The features used in WORC are stored in CSV format both on the XNAT, and in this Github repository.

Usage: compare results to WORC algorithm

In the files in "crossvalidationsplits" folder, you will find the 100x random-split train-test cross-validation setup used in the experiments of the original WORC paper mentioned above. If you want to compare performance of another method with WORC, we very strongly recommend to use the cross-validation splits from these files so you have the exact same evaluation setup.

For each dataset, a file called crossvalidationsplits_[DATASETNAME].csv is provided. For each cross-validation iteration, a column for training and a column for testing is provided, including the anonymized identities of each patient included in either the train or test set (e.g., 0_train, 0_test). These files are also stored on the Health-RI XNAT, see https://xnat.health-ri.nl/data/projects/worc/resources.

Usage: radiomics features

For each image and lesion segmentation, a set of 546 radiomics features were extracted using WORC, which internally uses the PyRadiomics (71 features) and PREDICT (493 features) toolboxes. For each dataset, a file called WORCDatabase_[DATASETNAME]\_features\_[WORC\_Version].csv is provided in the features folder. For each dataset, each row represents a different image and lesion, with each column providing one of the 546 radiomics features. The final column contains the subject name and associated lesion (e.g. [DATASETNAME]_[SUBJECTID]_[LESIONID]). The first row contains the header with the feature name, see for details the WORC paper and documentation. The WORC software version is added for versioning to support future updates when needed. These files re also stored on the Health-RI XNAT, see https://xnat.health-ri.nl/data/projects/worc/resources.

Usage: experiment reproduction

To reproduce the eight default experiments presented in the WORC paper, i.e., those with the default WORC settings on the above mentioned eight datasets, simply import and run the run_experiment function from the default_experiments.py script. The dataset argument can be used to switch between the eight datasets. See the docstring of the function or the ==help option when running the function on the command line for further documentation.

For the six public datasets as published in the Data in Brief paper, the data is directly fed from XNAT into WORC. For the two previously publicly published datasets used in the WORC paper, some additional steps are performed, see below.

Glioma dataset

For the glioma dataset, radiomics features are supplied instead of raw imaging data. As these have been created using Python2 and a specific version of pandas, these files cannot directly be fed into WORC. Therefore, we have created two scripts to convert the feature files to a format that can be fed into WORC, which can be found in the helpers folder:

convert_features_hdf5_to_csv: this script should be run in a Python2 environment with the pandas 0.19.0 and tables package installed. The .hdf5 files containing the features are converted to .csv files.
convert_features_csv_to_hdf5.py: this script should be run in a Python3 environment with pandas installed (any version, as long as you are using the same one when running the WORC experiment). The .csv files which were created by the previous function are converted back to .hdf5 files, but now with the same pandas version as you are using for WORC. The features are also renamed to correspond with the formatting used in WORC

Documentation for these functions can be found in the respective files. Note that you first have to download the features from http://dx.doi.org/10.17632/rssf5nxxby.3.

Afterwards, we recommend to use the SimpleWORC module as explained in the WORC Tutorial (https://github.com/MStarmans91/WORCTutorial) to run the actual WORC experiment.

Head and neck dataset

For the head and neck dataset, the data can be found at https://xnat.health-ri.nl/data/projects/stwstrategyhn1. This data can be directly fed into WORC. However, in the experiment conducted in the paper, we predict the T-stage based on GTV-1 segmentations, which are both missing for some of the patients. Hence, before running the experiment, we first check for which patients these are available and only include those. This is automatically done in the run_experiment function when using the head and neck dataset.

AutoML comparison experiments

By default, WORC optimizes the radiomics workflow construction using a random search of 1000 iterations and creates an ensemble of the top 50 workflows. In the WORC paper, this was compared to various other setttings for the number of random search iterations, top N for the ensembling, other ensembling methods, and Bayesian optimization using SMAC. These can also be performed by using run_automl_experiment function from the automlcomparison.py script and manipulating the following arguments:

The use_smac argument can be set to True to use SMAC.
The smac_budget argument can be set to 'low', 'medium', or 'high' to change the time budget of SMAC.
The ensembling_method argument can be changed to use the other ensembling methods, see https://worc.readthedocs.io/en/latest/static/configuration.html#ensemble.
The ensembling_size argument can be changed to determine the ensemble size if the top_N method is used.
The RS_iterations argument can be changed to determine the number of random search iterations if SMAC is not used.
The radiomics_sota argument can be set to True to use the radiomics baseline (PyRadiomics + LASSO + Logistic Regression)

Regression testing: past performance

To support regression testing, the standardized performance evaluation files from WORC are provided in the pastperformances folder in files called WORCDatabase_performance_[DATASETNAME]_[WORC\_Version].sjon. Among others, these include several performance metrics in each of the above mentioned 100x random-split cross-validation iterations, and their mean and 95% confidence intervals as reported in the original radiomics study. More details are given in the WORC documentation. The WORC software version is added for versioning to support regression testing int the future.

Known Issues

See the WORC FAQ: https://worc.readthedocs.io/en/latest/static/faq.html

Acknowledgements

The authors thank Laurens Groenendijk for his assistance in processing the data and in the anonimization procedures. Martijn P. A. Starmans acknowledges funding from the research program STRaTeGy with project numbers 14929 and 14930, which is (partly) financed by the Netherlands Organization for Scientific Research (NWO). Part of this study was financed by the Stichting Coolsingel (reference number 567), a Dutch non-profit foundation. This study is supported by EuCanShare and EuCanImage (European Union’s Horizon 2020 research and innovation programme under grant agreement Nr. 825903 and Nr. 952103, respectively).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WORC Database: Downloaders and Experiments

License

Installation

Usage: download data

Usage: compare results to WORC algorithm

Usage: radiomics features

Usage: experiment reproduction

Glioma dataset

Head and neck dataset

AutoML comparison experiments

Regression testing: past performance

Known Issues

Acknowledgements

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
crossvalidationsplits		crossvalidationsplits
features		features
helpers		helpers
pastperformance		pastperformance
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
automlcomparison.py		automlcomparison.py
datadownloader.py		datadownloader.py
default_experiments.py		default_experiments.py

Folders and files

Latest commit

History

Repository files navigation

WORC Database: Downloaders and Experiments

License

Installation

Usage: download data

Usage: compare results to WORC algorithm

Usage: radiomics features

Usage: experiment reproduction

Glioma dataset

Head and neck dataset

AutoML comparison experiments

Regression testing: past performance

Known Issues

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages