fruffini/PEFT_Prognosis


Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging

Five-stage benchmark pipeline: dataset → models → fine-tuning → training regime → inference/evaluation

Computer Methods and Programs in Biomedicine · Paper · Python · Hydra · PyTorch · HuggingFace · License: CC BY-NC 4.0

The first systematic benchmark of fine-tuning strategies — Full Fine-Tuning (FFT), Linear Probing (LP), and Parameter-Efficient Fine-Tuning (PEFT) — applied to CNNs and Foundation Models for COVID-19 prognosis prediction from chest X-rays, under realistic clinical constraints of data scarcity and class imbalance.

Filippo Ruffini · Elena Mulero Ayllon · Linlin Shen · Paolo Soda · Valerio Guarrasi

Unit of Artificial Intelligence and Computer Systems, Università Campus Bio-Medico di Roma · College of Computer Science and Software Engineering, Shenzhen University · Department of Diagnostics and Intervention, Umeå University

📄 Paper · 🧩 Framework · ⚙️ Setup · 🚀 Usage guide · 📊 Results · 📚 Citation


Overview

This repository accompanies the paper "Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging" (published in Computer Methods and Programs in Biomedicine, doi: 10.1016/j.cmpb.2025.106...).

We introduce the first large-scale benchmark that systematically evaluates fine-tuning strategies for clinical prognosis prediction from chest X-rays. The benchmark is structured around three central research questions:

  1. What are the most effective fine-tuning strategies for adapting CNN and FM architectures to prognosis tasks?
  2. Which fine-tuning strategy offers the best efficiency–effectiveness trade-off when applied to FMs?
  3. Can PEFT strategies provide robust adaptation in Few-Shot Learning (FSL) scenarios under prognosis data constraints?

We compare 3 CNN architectures and 8 Foundation Models across 6 fine-tuning strategies (FFT, LP, LoRA, VeRA, BitFit, IA³) on 6 prognostic tasks derived from 4 publicly available COVID-19 CXR datasets, in both full-data and few-shot regimes. The total benchmark required approximately 3,200 GPU-hours on 10 NVIDIA Tesla A40 GPUs.

Key findings: CNNs with FFT remain the most reliable option for severely imbalanced or extremely small datasets. FMs combined with PEFT (especially LoRA and BitFit) are competitive on larger datasets and scale more efficiently. In few-shot settings, LP offers the most stable generalization. No single strategy is universally optimal — the choice depends on dataset size, class balance, and model scale.


Framework

Benchmark pipeline (Figure 1)

Figure 1 — Benchmark pipeline. The experimental framework is structured into five main stages: (1) Dataset Selection — four publicly available COVID-19 CXR datasets covering mortality, severity and ICU admission across varying imbalance levels and sample sizes; (2) Model Categorization — CNNs pretrained on ImageNet and FMs pretrained via self-supervised or contrastive learning on general or biomedical data; (3) Fine-Tuning Strategies — FFT (upper bound), LP (lower bound), and four PEFT methods: LoRA, VeRA, BitFit, and IA³; (4) Training Regimes — full-data and few-shot (k ∈ {2, 4, 8, 16, 32} samples per class); (5) Inference & Evaluation — MCC as primary metric (robust to class imbalance) and PR-AUC as complementary metric.
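The choice of MCC as the primary metric can be illustrated with a small sketch in plain Python. The helper names (`mcc`, `accuracy`) and the toy labels are illustrative assumptions and are not taken from the repository; the toy cohort mirrors an 85% / 15% class split like the one reported for the mortality task.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a marginal is empty and MCC is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy cohort (made up): 85 negatives followed by 15 positives.
y_true = [0] * 85 + [1] * 15

# A degenerate classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that finds 10 of the 15 positives at the cost of 5 false alarms.
informative = [0] * 80 + [1] * 5 + [1] * 10 + [0] * 5

for name, pred in [("always negative", always_negative), ("informative", informative)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.2f}  MCC={mcc(y_true, pred):.3f}")
# always negative: accuracy=0.85  MCC=0.000
# informative: accuracy=0.90  MCC=0.608
```

Accuracy separates the two classifiers by only five points, whereas MCC assigns the majority-class predictor a score of zero.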

Models

| Model | Architecture | Pretraining | Data | #Params (M) | Category |
| --- | --- | --- | --- | --- | --- |
| ResNet-18 | CNN | Supervised | ImageNet | 11.7 | CNN |
| ResNet-50 | CNN | Supervised | ImageNet | 23.5 | CNN |
| DenseNet-121 | CNN | Supervised | ImageNet | 7.9 | CNN |
| DINOv2-S | ViT-S/14 | Self-supervised | LVD-142M | 21 | FM |
| DINOv2-B | ViT-B/14 | Self-supervised | LVD-142M | 86 | FM |
| DINOv2-L | ViT-L/14 | Self-supervised | LVD-142M | 300 | FM |
| CLIP-Large | ViT-L/14 | Contrastive | LAION-400M | 300 | FM |
| MedCLIP (ResNet) | ResNet-50 | Contrastive | CheXpert+MIMIC | 23 | FM |
| MedCLIP (Swin) | Swin-T | Contrastive | CheXpert+MIMIC | 27 | FM |
| PubMedCLIP | ViT-B/32 | Contrastive | ROCO | 86 | FM |
| BioMedCLIP | ViT-B/16 | Contrastive | PMC-15M | 86 | FM |

PEFT compatibility matrix

LoRA VeRA IA³ BitFit LP FFT
ResNet-18/50
DenseNet-121
DINOv2-S/B/L
CLIP-Large
MedCLIP (ResNet)
MedCLIP (Swin)
PubMedCLIP
BioMedCLIP

Datasets

Four publicly available COVID-19 CXR datasets are used, each annotated with a different prognostic outcome. All are publicly downloadable from their respective sources (links below).

| Dataset | Task ID | Task | Samples | Patients | Centers | Class distribution | Validation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AIforCOVID (Soda et al., 2021) | AFC | Severity (Mild vs. Severe) | 1585 | 1585 | 6 | 53% / 47% | LOCO |
| AIforCOVID | AFC_m | Mortality (Alive vs. Deceased) | 1585 | 1585 | 6 | 85% / 15% | LOCO |
| COVID-19-AR (Desai et al., 2020) | CAR | ICU Admission (Yes vs. No) | 99 | 99 | 1 | 71% / 29% | 5-fold CV |
| CoCross (Kilintzis et al., 2022) | CC | ICU Outcome (Alive vs. Deceased) | 389 | 150 | 1 | 63% / 37% | 5-fold CV |
| Stony Brook COVID-19 (Saltz et al., 2021) | NY_small | Mortality (1 CXR/patient) | 1365 | 1365 | 1 | 87% / 13% | 5-fold CV |
| Stony Brook COVID-19 | NY_all | Mortality (all CXRs) | 13639 | 1365 | 1 | 64% / 36% | 5-fold CV |

These datasets collectively span the full spectrum of real-world prognostic challenges: from balanced to severely imbalanced classes, from small (N=99) to large (N=13,639) cohorts, and from single-center to multi-center settings.
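The two validation schemes in the table, Leave-One-Center-Out (LOCO) and 5-fold CV, can be sketched as follows. This is a minimal illustration, not the repository's splitting code: the record fields `patient` and `center` are hypothetical names, and keeping all images of a patient in the same fold is an assumption motivated by CoCross and NY_all containing several CXRs per patient.

```python
from collections import defaultdict

def loco_splits(records):
    """Yield (held_out_center, train_idx, test_idx), one split per center."""
    centers = sorted({r["center"] for r in records})
    for center in centers:
        test = [i for i, r in enumerate(records) if r["center"] == center]
        train = [i for i, r in enumerate(records) if r["center"] != center]
        yield center, train, test

def patient_kfold(records, k=5):
    """Yield (train_idx, test_idx); all images of a patient share one fold."""
    by_patient = defaultdict(list)
    for i, r in enumerate(records):
        by_patient[r["patient"]].append(i)
    patients = sorted(by_patient)  # deterministic order, no shuffling in this sketch
    for fold in range(k):
        held_out = set(patients[fold::k])
        test = [i for p in patients if p in held_out for i in by_patient[p]]
        train = [i for p in patients if p not in held_out for i in by_patient[p]]
        yield train, test

# Toy cohort (made up): 6 patients with 2 images each, spread over 3 centers.
records = [
    {"patient": f"P{i // 2}", "center": "ABC"[(i // 2) % 3]} for i in range(12)
]

for center, train, test in loco_splits(records):
    print(f"hold out center {center}: {len(train)} train / {len(test)} test")

for train, test in patient_kfold(records, k=3):
    print(f"fold: {len(train)} train / {len(test)} test")
```

In a real run the patient order would normally be shuffled with a fixed seed, and folds might additionally be stratified by class; both are omitted here to keep the sketch short.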


Repository layout

.
├── src/                                    # Core codebase
│   ├── eval/
│   │   └── classification/
│   │       ├── linear.py                   # Main training entry point (Hydra)
│   │       ├── features_extraction.py      # Feature extraction for LP
│   │       └── ml_training.py              # Classical ML baselines
│   ├── data/
│   │   ├── datasets/                       # Per-dataset torch.Dataset classes
│   │   │   ├── aiforcovid.py               # AIforCOVID (AFC, AFC_m)
│   │   │   ├── car.py                      # COVID-19-AR (CAR)
│   │   │   ├── cocross.py                  # CoCross (CC)
│   │   │   └── ny.py                       # Stony Brook NY (NY_small, NY_all)
│   │   ├── loaders.py                      # DataLoader factory
│   │   ├── samplers.py                     # Class-balanced sampler
│   │   └── augmentations.py                # CXR augmentation pipeline
│   ├── models/                             # CNN and FM wrappers + PEFT injection
│   ├── configs/PEFT_runs/
│   │   ├── config.yaml                     # Hydra root config
│   │   └── experiment/
│   │       ├── databases/                  # Per-dataset configs (AFC, CAR, CC, NY, …)
│   │       ├── models/                     # Per-model configs (resnet_18, vitb14, …)
│   │       ├── paths/                      # System path profiles (local.yaml)
│   │       ├── validation_strategy/        # hold_out / 5fold / loCo
│   │       └── linear_probing_*.yaml       # Experiment presets (PEFT method × regime)
│   ├── bash/
│   │   ├── run_all.sh                      # Full benchmark reproduction (all datasets × regimes)
│   │   ├── multiple_linear.sh              # Per-dataset batch launcher (MAX_JOBS=4 concurrent)
│   │   ├── linear.sh                       # Single-run worker (called by launch_bash.py)
│   │   ├── extractor.sh                    # Feature-extraction worker
│   │   ├── debug_linear.sh                 # Quick single-model debug run
│   │   ├── AGGREGATE_RESULTS.sh            # Aggregate all datasets after runs finish
│   │   └── launch_bash.py                  # Job dispatcher (local background processes)
│   ├── preprocessing/                      # Per-dataset preprocessing scripts
│   │   ├── AFC/
│   │   ├── CoCross/
│   │   ├── COVID-19-AR/
│   │   └── COVID-NY/
│   └── postprocess/
│       ├── aggregate_results/              # Fold aggregation scripts
│       └── interface/                      # Dash-based interactive results explorer
├── figures/
│   ├── final_method.pdf                    # Figure 1 — pipeline overview
│   ├── fine_tuning_comparison.pdf          # Figure 2 — fine-tuning comparison boxplot
│   ├── CNN_vs_FM/                          # Figure 3 — CNN vs FM per-dataset plots
│   ├── ALL/                                # Figure 4 — all-FM PEFT scatter plots
│   └── CE95/                               # Appendix — 95% CI plots per dataset
└── requirements.txt                        # Pinned Python dependencies

Setup

1. Clone the repository

git clone https://github.com/fruffini/PEFT_Prognosis.git
cd PEFT_Prognosis

2. Install Python and create the environment

Python 3.10 is required. The full set of pinned versions used in the paper is in requirements.txt.

# Create and activate the virtual environment
python3.10 -m venv PEFT_env
source PEFT_env/bin/activate      # Windows: PEFT_env\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
Key packages installed

| Package | Version | Role |
| --- | --- | --- |
| torch | 2.2.2 | Deep learning framework |
| torchvision | 0.17.2 | Vision datasets and transforms |
| peft | 0.11.1 | HuggingFace PEFT (LoRA, VeRA, BitFit, IA³) |
| transformers | 4.41.1 | CLIP, DINOv2, PubMedCLIP, BioMedCLIP |
| open-clip-torch | 2.24.0 | OpenCLIP / BioMedCLIP loading |
| timm | 0.9.16 | Model utilities |
| MedCLIP | 0.0.3 | MedCLIP vision encoders |
| hydra-core | 1.3.0 | Configuration management |
| wandb | 0.17.1 | Experiment tracking |
| scikit-learn | 1.4.2 | Classical ML baselines and metrics |

GPU note. The benchmark was run on NVIDIA A40 GPUs (48 GB). On 16–24 GB cards, reduce batch_size in the dataset config or enable mixed-precision training with torch.cuda.amp. CPU-only runs are possible but very slow.

3. Configure paths

Open src/configs/PEFT_runs/experiment/paths/system/local.yaml and set the two paths for your machine:

# src/configs/PEFT_runs/experiment/paths/system/local.yaml
data_base_path: /path/to/your/data/processed   # root of preprocessed datasets
output_path:    /path/to/your/results           # where runs and checkpoints are saved

All Hydra commands automatically pick up these paths via experiment/paths/system@_global_=local.

4. Download the datasets

All four datasets are publicly available and free to download. Create the directory tree below, then download each dataset into its raw folder.

data/
└── raw/
    ├── AIforCOVID/          ← download here
    ├── COVID-19-AR/         ← download here
    ├── CoCross/             ← download here
    └── StonyBrook-COVID19/  ← download here

AIforCOVID (AFC / AFC_m)

1 585 patients · 6 Italian hospitals · two tasks: severity (balanced) and mortality (imbalanced)

  1. Register and request access at the AIforCOVID portal or download directly from the Zenodo record.
  2. Place the downloaded archive under data/raw/AIforCOVID/.
  3. Preprocess:
    python src/preprocessing/AFC/preprocess_AFC.py

COVID-19-AR (CAR)

99 CXRs · rural US population · ICU admission · imbalanced (71% / 29%)

  1. Download from The Cancer Imaging Archive (TCIA) — no registration required.
  2. Place the DICOM/PNG files under data/raw/COVID-19-AR/.
  3. Preprocess:
    python src/preprocessing/COVID-19-AR/preprocess_CAR.py

CoCross (CC)

389 CXRs · longitudinal ICU monitoring · ICU outcome

  1. Download from the CoCross dataset page (supplementary data link in the paper) or directly from the Zenodo record.
  2. Place files under data/raw/CoCross/.
  3. Preprocess:
    python src/preprocessing/CoCross/preprocess_CC.py

Stony Brook COVID-19 (NY_small / NY_all)

13 639 CXRs · 1 365 patients · mortality · two sampling variants

  1. Download from TCIA collection TCIA.BBAG-2690 — free TCIA account required.
  2. Place the downloaded images under data/raw/StonyBrook-COVID19/.
  3. Preprocess (generates both the single-CXR-per-patient split NY, used for task NY_small, and the full longitudinal split NY_all):
    python src/preprocessing/COVID-NY/preprocess_NY.py

After running the four scripts above, the processed splits appear under data/processed/:

data/processed/
├── AFC/            # AIforCOVID — severity
├── AFC_death/      # AIforCOVID — mortality
├── CAR/            # COVID-19-AR
├── CoCross/        # CoCross
├── NY/             # Stony Brook — 1 CXR/patient
└── NY_all/         # Stony Brook — all CXRs

Each directory contains the image files and a CSV manifest with split assignments consumed by the dataset classes in src/data/datasets/.
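Before launching a run it can help to check how a manifest distributes samples across splits and classes. The sketch below uses only the standard library; the column names `image`, `label`, and `fold` are hypothetical, since the actual manifest schema is defined by the preprocessing scripts and may differ.

```python
import csv
import io
from collections import Counter

def summarize_manifest(handle):
    """Count samples per (fold, label) pair in a CSV manifest."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(row["fold"], row["label"])] += 1
    return counts

# Toy manifest (made up) standing in for a file under data/processed/<DATASET>/.
demo = io.StringIO(
    "image,label,fold\n"
    "img_001.png,0,0\n"
    "img_002.png,1,0\n"
    "img_003.png,0,1\n"
    "img_004.png,0,1\n"
    "img_005.png,1,1\n"
)

counts = summarize_manifest(demo)
for (fold, label), n in sorted(counts.items()):
    print(f"fold={fold} label={label}: {n}")
```

For a file on disk, pass `open(path, newline="")` instead of the in-memory buffer.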

5. Download pretrained model weights

All model weights are downloaded automatically on first use by the respective library. No manual download is needed except for models behind a Hugging Face access gate.

| Model | Auto-download | Source |
| --- | --- | --- |
| ResNet-18 / ResNet-50 / DenseNet-121 | torchvision | PyTorch Hub |
| DINOv2-S / B / L | torch.hub | facebookresearch/dinov2 |
| CLIP-Large | open_clip | mlfoundations/open_clip |
| MedCLIP (ResNet / Swin) | MedCLIP.from_pretrained() | RyanWangZf/MedCLIP |
| BioMedCLIP | open_clip (HF hub) | microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 |
| PubMedCLIP | transformers (HF hub) | flaviagiammarino/pubmed-clip-vit-base-patch32 |

For BioMedCLIP and PubMedCLIP (Hugging Face), log in once so the download can proceed:

pip install huggingface_hub
huggingface-cli login   # generate a token at https://huggingface.co/settings/tokens

Weights are cached in ~/.cache/huggingface/ and ~/.cache/torch/ after the first run.


Using the repository

1. Run a single experiment (Hydra CLI)

The main entry point is src/eval/classification/linear.py. All configuration is handled by Hydra — override any parameter on the command line.

source PEFT_env/bin/activate

# Linear Probing — ResNet-18 on AIforCOVID (AFC)
python src/eval/classification/linear.py \
    experiment/databases@db=AFC \
    experiment/models@_global_=resnet_18 \
    experiment=linear_probing_none_test_all \
    experiment/validation_strategy@_global_=loCo

# LoRA (rank=8) — DINOv2-B on Stony Brook NY (all CXRs)
python src/eval/classification/linear.py \
    experiment/databases@db=NY_all \
    experiment/models@_global_=vitb14_pretrain \
    experiment=linear_probing_LoRa_8_test_all \
    experiment/validation_strategy@_global_=5fold

# Full Fine-Tuning — BioMedCLIP on CoCross
python src/eval/classification/linear.py \
    experiment/databases@db=CoCross \
    experiment/models@_global_=biomedclip \
    experiment=full_finetuning \
    experiment/validation_strategy@_global_=5fold

# BitFit — Few-Shot (FSL) — CLIP-Large on AFC
python src/eval/classification/linear.py \
    experiment/databases@db=AFC \
    experiment/models@_global_=clip_large \
    experiment=linear_probing_fitbit_FSL_test_all \
    experiment/validation_strategy@_global_=loCo

Available experiment presets (src/configs/PEFT_runs/experiment/):

| Config file | Strategy | Regime |
| --- | --- | --- |
| linear_probing_none_test_all.yaml | LP | Full data |
| full_finetuning.yaml | FFT | Full data |
| linear_probing_LoRa_{4,8,16}_test_all.yaml | LoRA (r=4/8/16) | Full data |
| linear_probing_VeRA_{4,8,16}_test_all.yaml | VeRA (r=4/8/16) | Full data |
| linear_probing_fitbit_test_all.yaml | BitFit | Full data |
| linear_probing_IA3_test_all.yaml | IA³ | Full data |
| linear_probing_none_FSL_test_all.yaml | LP | Few-Shot |
| full_finetuning_FSL.yaml | FFT | Few-Shot |
| linear_probing_LoRa_FSL_{4,8,16}_test_all.yaml | LoRA FSL | Few-Shot |
| linear_probing_VeRA_FSL_{4,8,16}_test_all.yaml | VeRA FSL | Few-Shot |
| linear_probing_fitbit_FSL_test_all.yaml | BitFit FSL | Few-Shot |
| linear_probing_IA3_FSL_test_all.yaml | IA³ FSL | Few-Shot |

Available model configs (experiment/models/): resnet_18, resnet_50, dense121, vitb14_pretrain, vitl14_pretrain, vits14_pretrain, clip_large, medclip_resnet50, medclip_vision, pubmedclip, biomedclip

Available database configs (experiment/databases/): AFC, AFC_death, CAR, CoCross, NY_small, NY_all

Available validation strategies (experiment/validation_strategy/): loCo (Leave-One-Center-Out, for multi-centric AFC/AFC_m), 5fold (5-fold CV, for single-center datasets), hold_out

System path profiles (experiment/paths/system@_global_=): local (set paths in src/configs/PEFT_runs/experiment/paths/system/local.yaml)

Outputs (per-fold predictions, metrics, checkpoints) land under results/<DATASET>/<MODEL>/<STRATEGY>/.
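When scripting several runs from Python rather than from the shell, the Hydra overrides shown in the commands above can be assembled programmatically. The helper below is a hypothetical convenience and not part of the repository; it only reuses the entry point and the override keys that appear in this README.

```python
import shlex

ENTRY_POINT = "src/eval/classification/linear.py"

def build_command(db, model, preset, validation):
    """Return the argv list for one run, mirroring the CLI examples."""
    return [
        "python",
        ENTRY_POINT,
        f"experiment/databases@db={db}",
        f"experiment/models@_global_={model}",
        f"experiment={preset}",
        f"experiment/validation_strategy@_global_={validation}",
    ]

# One command per LoRA rank listed in the preset table (r = 4, 8, 16).
commands = [
    build_command("NY_all", "vitb14_pretrain", f"linear_probing_LoRa_{r}_test_all", "5fold")
    for r in (4, 8, 16)
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the virtual environment active.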

2. Run batch sweeps

The multiple_linear.sh launcher fans out all model × fine-tuning combinations for a given dataset and regime:

# Full-data benchmark on AIforCOVID (severity)
bash src/bash/multiple_linear.sh -d AFC -v L -e VANILLA

# Few-Shot benchmark on AIforCOVID
bash src/bash/multiple_linear.sh -d AFC -v L -e FSL

# Counterfactual FSL (CFSL)
bash src/bash/multiple_linear.sh -d AFC -v L -e CFSL

# Replace -d with any dataset ID: AFC | AFC_death | CAR | CoCross | NY | NY_all

The -v flag controls validation strategy (L = LOCO, 5 = 5-fold), and -e controls the regime (VANILLA = full data, FSL = few-shot, CFSL = counterfactual FSL).
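The few-shot regime trains on k samples per class. A minimal sketch of such a subsampling step is given below; the function name `sample_k_shot` and the seeding scheme are assumptions for illustration, and the repository's FSL presets may select samples differently. The counterfactual variant (CFSL) is not covered here.

```python
import random
from collections import defaultdict

def sample_k_shot(labels, k, seed=0):
    """Return sorted indices of k randomly chosen samples per class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    chosen = []
    for y in sorted(by_class):
        pool = by_class[y]
        if len(pool) < k:
            raise ValueError(f"class {y!r} has only {len(pool)} samples, need {k}")
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)

# Toy training pool (made up): 85 negatives and 15 positives.
labels = [0] * 85 + [1] * 15

for k in (2, 4, 8, 16, 32):
    try:
        subset = sample_k_shot(labels, k)
        print(f"k={k}: {len(subset)} training samples")
    except ValueError as err:
        print(f"k={k}: skipped ({err})")
```

On small or imbalanced training pools the minority class can hold fewer than k samples, which is why the sketch raises an error instead of silently returning a smaller subset.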

To run the complete benchmark in one shot (all datasets × all regimes), use the top-level script:

bash src/bash/run_all.sh

3. Aggregate results across folds

After all folds complete, aggregate per-fold results into summary tables:

python src/postprocess/aggregate_results/aggregate_results.py \
    experiment/databases@db=AFC \
    experiment/paths/system@_global_=local

Aggregated tables are written under results/aggregated_results/.
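Conceptually, aggregation reduces the per-fold scores of each configuration to a mean and a standard deviation. The sketch below shows that reduction with the standard library only; it is not the repository's aggregation script, the scores are made up, and the use of the sample standard deviation (n − 1 denominator) is an assumption.

```python
from statistics import mean, stdev

def aggregate_folds(fold_scores):
    """Map each (model, strategy) key to (mean, sample std) across folds."""
    summary = {}
    for key, scores in fold_scores.items():
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[key] = (mean(scores), spread)
    return summary

# Toy per-fold MCC values (made up) for two hypothetical configurations.
fold_scores = {
    ("model_a", "LP"): [0.40, 0.50, 0.45, 0.55, 0.60],
    ("model_a", "FFT"): [0.30, 0.70, 0.50, 0.20, 0.80],
}

summary = aggregate_folds(fold_scores)
for (model, strategy), (m, s) in summary.items():
    # Reported on a x100 scale, as in the results tables.
    print(f"{model} / {strategy}: {100 * m:.1f} ± {100 * s:.1f}")
```

Two configurations with the same mean can differ widely in spread, which is why the results are reported as mean ± std rather than as a single number.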

4. Explore results interactively

A Dash-based web interface is included for interactive exploration of aggregated results:

python src/postprocess/interface/index.py
# Open http://localhost:8050 in your browser

The interface includes per-dataset scatter plots of MCC vs. % parameters trained, CNN vs. FM comparisons, and model-level breakdowns.

5. Reproduce paper figures

The figures in figures/CNN_vs_FM/, figures/ALL/, and figures/CE95/ were generated from the aggregated results using the plotting scripts under src/postprocess/. After populating results/aggregated_results/, re-run the plotting pipeline that generated those figures from within the interface or the dedicated postprocess scripts.


Results

Fine-tuning comparison across datasets

Fine-tuning comparison boxplot (Figure 2)

Figure 2 — Distribution of MCC scores per fine-tuning method and dataset. Each box summarizes the mean performance of all models fine-tuned with a given technique on a specific dataset. FFT is consistently strong on small/imbalanced tasks; LP and BitFit offer the most stable lightweight alternatives; LoRA and VeRA show higher variance.

CNN vs. Foundation Model comparison

Figure 3 panels: (a) AFC, (b) AFC_m, (c) CAR, (d) CC, (e) NY_small, (f) NY_all.

Figure 3 — CNN vs. FM per dataset (MCC). Each subplot shows the mean test MCC for all fine-tuning methods applicable to both architecture families. Each model is represented by its own symbol and each method by a distinct color. FFT (★) is shown separately. Key observations: (a) on balanced AFC, PEFT and FFT are competitive across architectures; (b) on imbalanced AFC_m, PEFT degrades sharply for most models, with only DINOv2 variants retaining reasonable performance; (e–f) dataset size is a critical factor — PEFT improves markedly from NY_small to NY_all.

All-FM PEFT analysis

Figure 4 panels: AFC, NY_all.

Figure 4 — MCC vs. % parameters trained (FM only). X-axis: fraction of model parameters updated during fine-tuning. Y-axis: mean MCC on the test set. Each point is a (model, method) pair. The plot reveals that performance does not scale monotonically with parameter count: BitFit and LP (far left) often match or outperform mid-range PEFT configurations (LoRA, VeRA), while FFT (far right) dominates on small, imbalanced tasks.
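To give a sense of scale for the x-axis, the fraction of parameters trained by LoRA can be estimated by hand. For a weight matrix of shape d_out × d_in, LoRA adds two low-rank factors with r·(d_in + d_out) parameters in total. The sketch below applies this to a ViT-B backbone (12 layers, hidden size 768) using the 86 M parameter count from the models table. Placing adapters on the query and value projections only and ignoring the classification head are assumptions; the repository's configs may target different modules.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

def lora_added(hidden, layers, rank, matrices_per_layer=2):
    """Total adapter parameters when each layer adapts `matrices_per_layer` square projections."""
    return layers * matrices_per_layer * lora_params(hidden, hidden, rank)

HIDDEN, LAYERS = 768, 12   # ViT-B dimensions
TOTAL_PARAMS = 86e6        # backbone size from the models table

for rank in (4, 8, 16):
    added = lora_added(HIDDEN, LAYERS, rank)
    print(f"r={rank}: {added:,} trainable parameters ({100 * added / TOTAL_PARAMS:.2f}% of backbone)")
# r=4: 147,456 trainable parameters (0.17% of backbone)
# r=8: 294,912 trainable parameters (0.34% of backbone)
# r=16: 589,824 trainable parameters (0.69% of backbone)
```

Under these assumptions every LoRA rank in the benchmark stays below 1% of the backbone, while FFT sits at 100%.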

Headline numbers (MCC, mean ± std across folds)

Selected top results from the paper (full tables in the published article):

| Model | Task | Best strategy | MCC (×100) |
| --- | --- | --- | --- |
| DINOv2-S | CAR | LoRA (r=4) | 77.2 ± 16.4 |
| PubMedCLIP | NY_small | FFT | 65.0 ± 25.2 |
| DenseNet-121 | CAR | FFT | 70.4 ± 36.8 |
| DINOv2-S | AFC | FFT | 49.6 ± 13.8 |
| DenseNet-121 | AFC_m | FFT | 51.8 ± 31.1 |
| CLIP-Large | NY_all | LoRA (r=8) | 43.5 ± 10.2 |
| DINOv2-L | NY_all | BitFit | 45.1 ± 9.5 |
| BioMedCLIP | NY_small | FFT | 43.8 ± 7.8 |

Full results for all model–task–strategy combinations are reported in the published paper (MCC and PR-AUC tables, Wilcoxon signed-rank pairwise comparisons).

95% confidence intervals (Appendix)

Appendix figure panels: MCC 95% CI for AFC and NY_all.

Appendix figures — MCC with 95% confidence intervals per dataset. Detailed per-model, per-method MCC plots with 95% CI are available for all six tasks in figures/CE95/.


Key conclusions

  • CNNs remain reliable in extreme low-data regimes. ResNet and DenseNet models fine-tuned with FFT outperform FMs consistently on very small (N < 200) or severely imbalanced datasets, due to their compact architecture and strong inductive biases.
  • FMs with PEFT excel as data availability increases. LoRA and BitFit enable efficient adaptation of large pretrained models with minimal parameter updates; they are competitive or superior on larger datasets (NY_all, AFC).
  • PEFT is sensitive to class imbalance. Severe imbalance (e.g., AFC_m at 85–15%) degrades PEFT performance sharply, while more balanced data restores competitiveness.
  • LP is the most stable few-shot strategy. In FSL settings (k ≤ 32 shots per class), LP consistently outperforms other methods on average, offering a computationally inexpensive and robust solution.
  • No single fine-tuning strategy is universally optimal. Model architecture, dataset scale, and class balance jointly determine which approach works best — the benchmark provides an actionable decision map.

Compute

Experiments were conducted on a high-performance computing cluster equipped with 10 NVIDIA Tesla A40 GPUs (48 GB each) via the National Academic Infrastructure for Supercomputing in Sweden (NAISS). Total benchmark cost: approximately 3,200 GPU-hours (~6 GPU-hours per model–dataset combination on average).


Data availability

All four datasets used in this benchmark are publicly available and can be downloaded from their original sources:

| Dataset | Download |
| --- | --- |
| AIforCOVID | AIforCOVID portal |
| COVID-19-AR | The Cancer Imaging Archive / Radiology: AI |
| CoCross | CoCross dataset |
| Stony Brook COVID-19 | TCIA collection TCIA.BBAG-2690 |

No patient-level data is included in this repository.


Citation

If you use this code or build on this benchmark, please cite:

@article{ruffini2025benchmarking,
  title   = {Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning
             for Prognosis Prediction in Medical Imaging},
  author  = {Ruffini, Filippo and Mulero Ayllon, Elena and Shen, Linlin and
             Soda, Paolo and Guarrasi, Valerio},
  journal = {Computer Methods and Programs in Biomedicine},
  year    = {2025},
  doi     = {10.1016/j.cmpb.2025.106...},
  url     = {https://www.sciencedirect.com/science/article/pii/S016926072500611X}
}

Please also cite the foundation models you use.


License

All source code, configurations, documentation, and figures in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) — see creativecommons.org/licenses/by-nc/4.0 for the full text.

  • Allowed — academic research, teaching, non-profit clinical research, personal study, modification and redistribution with attribution.
  • Not allowed without a separate licence — incorporation into commercial products, paid clinical decision-support systems, or any other commercial exploitation.

For commercial licensing contact the corresponding authors.
