Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging
The first systematic benchmark of fine-tuning strategies — Full Fine-Tuning (FFT), Linear Probing (LP), and Parameter-Efficient Fine-Tuning (PEFT) — applied to CNNs and Foundation Models for COVID-19 prognosis prediction from chest X-rays, under realistic clinical constraints of data scarcity and class imbalance.
Filippo Ruffini · Elena Mulero Ayllon · Linlin Shen · Paolo Soda · Valerio Guarrasi
Unit of Artificial Intelligence and Computer Systems, Università Campus Bio-Medico di Roma · College of Computer Science and Software Engineering, Shenzhen University · Department of Diagnostics and Intervention, Umeå University
This repository accompanies the paper "Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging" (published in Computer Methods and Programs in Biomedicine, doi: 10.1016/j.cmpb.2025.106...).
We introduce the first large-scale benchmark that systematically evaluates fine-tuning strategies for clinical prognosis prediction from chest X-rays. The benchmark is structured around three central research questions:
- What are the most effective fine-tuning strategies for adapting CNN and FM architectures to prognosis tasks?
- Which fine-tuning strategy offers the best efficiency–effectiveness trade-off when applied to FMs?
- Can PEFT strategies provide robust adaptation in Few-Shot Learning (FSL) scenarios under prognosis data constraints?
We compare 3 CNN architectures and 8 Foundation Models across 6 fine-tuning strategies (FFT, LP, LoRA, VeRA, BitFit, IA³) on 6 prognostic tasks derived from 4 publicly available COVID-19 CXR datasets, in both full-data and few-shot regimes. The total benchmark required approximately 3,200 GPU-hours on 10 NVIDIA Tesla A40 GPUs.
Key findings: CNNs with FFT remain the most reliable option for severely imbalanced or extremely small datasets. FMs combined with PEFT (especially LoRA and BitFit) are competitive on larger datasets and scale more efficiently. In few-shot settings, LP offers the most stable generalization. No single strategy is universally optimal — the choice depends on dataset size, class balance, and model scale.
Figure 1 — Benchmark pipeline. The experimental framework is structured into five main stages: (1) Dataset Selection — four publicly available COVID-19 CXR datasets covering mortality, severity and ICU admission across varying imbalance levels and sample sizes; (2) Model Categorization — CNNs pretrained on ImageNet and FMs pretrained via self-supervised or contrastive learning on general or biomedical data; (3) Fine-Tuning Strategies — FFT (upper bound), LP (lower bound), and four PEFT methods: LoRA, VeRA, BitFit, and IA³; (4) Training Regimes — full-data and few-shot (k ∈ {2, 4, 8, 16, 32} samples per class); (5) Inference & Evaluation — MCC as primary metric (robust to class imbalance) and PR-AUC as complementary metric.
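To illustrate why MCC is preferred over accuracy under class imbalance, the following is a minimal standard-library sketch, not the repository's evaluation code. The toy labels mirror an 85% / 15% split and are invented for illustration; the helper names `confusion_counts`, `mcc`, and `accuracy` are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when a marginal count is empty."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy cohort with an 85% / 15% class split (illustrative, not benchmark data).
y_true = [0] * 85 + [1] * 15
always_majority = [0] * 100                         # never predicts the minority class
partial = [0] * 80 + [1] * 5 + [1] * 10 + [0] * 5   # tp=10, fn=5, fp=5, tn=80

print(accuracy(y_true, always_majority), mcc(y_true, always_majority))
print(accuracy(y_true, partial), round(mcc(y_true, partial), 3))
```

The majority-only predictor reaches 0.85 accuracy while its MCC is 0, which is the behaviour that motivates MCC as the primary metric here.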
| Model | Architecture | Pretraining | Data | #Params (M) | Category |
|---|---|---|---|---|---|
| ResNet-18 | CNN | Supervised | ImageNet | 11.7 | CNN |
| ResNet-50 | CNN | Supervised | ImageNet | 23.5 | CNN |
| DenseNet-121 | CNN | Supervised | ImageNet | 7.9 | CNN |
| DINOv2-S | ViT-S/14 | Self-supervised | LVD-142M | 21 | FM |
| DINOv2-B | ViT-B/14 | Self-supervised | LVD-142M | 86 | FM |
| DINOv2-L | ViT-L/14 | Self-supervised | LVD-142M | 300 | FM |
| CLIP-Large | ViT-L/14 | Contrastive | LAION-400M | 300 | FM |
| MedCLIP (ResNet) | ResNet-50 | Contrastive | CheXpert+MIMIC | 23 | FM |
| MedCLIP (Swin) | Swin-T | Contrastive | CheXpert+MIMIC | 27 | FM |
| PubMedCLIP | ViT-B/32 | Contrastive | ROCO | 86 | FM |
| BioMedCLIP | ViT-B/16 | Contrastive | PMC-15M | 86 | FM |
| Model | LoRA | VeRA | IA³ | BitFit | LP | FFT |
|---|---|---|---|---|---|---|
| ResNet-18/50 | ✓ | ✓ | ✓ | ✓ | ||
| DenseNet-121 | ✓ | ✓ | ✓ | ✓ | ||
| DINOv2-S/B/L | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| CLIP-Large | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| MedCLIP (ResNet) | ✓ | ✓ | ✓ | ✓ | ||
| MedCLIP (Swin) | ✓ | ✓ | ✓ | ✓ | ✓ | |
| PubMedCLIP | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| BioMedCLIP | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Four publicly available COVID-19 CXR datasets are used, yielding the six prognostic tasks listed below. Each dataset can be downloaded from its original source (links below).
| Dataset | Task ID | Task | Samples | Patients | Centers | Class distribution | Validation |
|---|---|---|---|---|---|---|---|
| AIforCOVID (Soda et al., 2021) | AFC | Severity (Mild vs. Severe) | 1585 | 1585 | 6 | 53% / 47% | LOCO |
| AIforCOVID | AFC_m | Mortality (Alive vs. Deceased) | 1585 | 1585 | 6 | 85% / 15% | LOCO |
| COVID-19-AR (Desai et al., 2020) | CAR | ICU Admission (Yes vs. No) | 99 | 99 | 1 | 71% / 29% | 5-fold CV |
| CoCross (Kilintzis et al., 2022) | CC | ICU Outcome (Alive vs. Deceased) | 389 | 150 | 1 | 63% / 37% | 5-fold CV |
| Stony Brook COVID-19 (Saltz et al., 2021) | NY_small | Mortality (1 CXR/patient) | 1365 | 1365 | 1 | 87% / 13% | 5-fold CV |
| Stony Brook COVID-19 | NY_all | Mortality (all CXRs) | 13639 | 1365 | 1 | 64% / 36% | 5-fold CV |
These datasets collectively span the full spectrum of real-world prognostic challenges: from balanced to severely imbalanced classes, from small (N=99) to large (N=13,639) cohorts, and from single-center to multi-center settings.
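The LOCO (Leave-One-Center-Out) validation used for the multi-center AIforCOVID tasks can be sketched as follows. This is an illustrative standard-library version, not the repository's splitter; the function name `leave_one_center_out` and the toy center labels are assumptions.

```python
from collections import defaultdict


def leave_one_center_out(centers):
    """Yield (held_out_center, train_idx, test_idx), one fold per distinct center."""
    by_center = defaultdict(list)
    for idx, center in enumerate(centers):
        by_center[center].append(idx)
    for held_out in sorted(by_center):
        test_idx = by_center[held_out]
        train_idx = sorted(
            i for c, idxs in by_center.items() if c != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Toy example: seven samples from three hypothetical centers.
centers = ["A", "B", "A", "C", "B", "C", "A"]
folds = list(leave_one_center_out(centers))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Each fold tests on a center the model never saw during training, so the score reflects cross-site generalization rather than within-site fit.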
.
├── src/                              # Core codebase
│   ├── eval/
│   │   └── classification/
│   │       ├── linear.py             # Main training entry point (Hydra)
│   │       ├── features_extraction.py # Feature extraction for LP
│   │       └── ml_training.py        # Classical ML baselines
│   ├── data/
│   │   ├── datasets/                 # Per-dataset torch.Dataset classes
│   │   │   ├── aiforcovid.py         # AIforCOVID (AFC, AFC_m)
│   │   │   ├── car.py                # COVID-19-AR (CAR)
│   │   │   ├── cocross.py            # CoCross (CC)
│   │   │   └── ny.py                 # Stony Brook NY (NY_small, NY_all)
│   │   ├── loaders.py                # DataLoader factory
│   │   ├── samplers.py               # Class-balanced sampler
│   │   └── augmentations.py          # CXR augmentation pipeline
│   ├── models/                       # CNN and FM wrappers + PEFT injection
│   ├── configs/PEFT_runs/
│   │   ├── config.yaml               # Hydra root config
│   │   └── experiment/
│   │       ├── databases/            # Per-dataset configs (AFC, CAR, CC, NY, …)
│   │       ├── models/               # Per-model configs (resnet_18, vitb14, …)
│   │       ├── paths/                # System path profiles (local.yaml)
│   │       ├── validation_strategy/  # hold_out / 5fold / loCo
│   │       └── linear_probing_*.yaml # Experiment presets (PEFT method × regime)
│   ├── bash/
│   │   ├── run_all.sh                # Full benchmark reproduction (all datasets × regimes)
│   │   ├── multiple_linear.sh        # Per-dataset batch launcher (MAX_JOBS=4 concurrent)
│   │   ├── linear.sh                 # Single-run worker (called by launch_bash.py)
│   │   ├── extractor.sh              # Feature-extraction worker
│   │   ├── debug_linear.sh           # Quick single-model debug run
│   │   ├── AGGREGATE_RESULTS.sh      # Aggregate all datasets after runs finish
│   │   └── launch_bash.py            # Job dispatcher (local background processes)
│   ├── preprocessing/                # Per-dataset preprocessing scripts
│   │   ├── AFC/
│   │   ├── CoCross/
│   │   ├── COVID-19-AR/
│   │   └── COVID-NY/
│   └── postprocess/
│       ├── aggregate_results/        # Fold aggregation scripts
│       └── interface/                # Dash-based interactive results explorer
├── figures/
│   ├── final_method.pdf              # Figure 1: pipeline overview
│   ├── fine_tuning_comparison.pdf    # Figure 2: fine-tuning comparison boxplot
│   ├── CNN_vs_FM/                    # Figure 3: CNN vs FM per-dataset plots
│   ├── ALL/                          # Figure 4: all-FM PEFT scatter plots
│   └── CE95/                         # Appendix: 95% CI plots per dataset
└── requirements.txt                  # Pinned Python dependencies
git clone https://github.com/fruffini/PEFT_Prognosis.git
cd PEFT_Prognosis
Python 3.10 is required. The full set of pinned versions used in the paper is in requirements.txt.
# Create and activate the virtual environment
python3.10 -m venv PEFT_env
source PEFT_env/bin/activate # Windows: PEFT_env\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
Key packages installed:
| Package | Version | Role |
|---|---|---|
| torch | 2.2.2 | Deep learning framework |
| torchvision | 0.17.2 | Vision datasets and transforms |
| peft | 0.11.1 | HuggingFace PEFT (LoRA, VeRA, BitFit, IA³) |
| transformers | 4.41.1 | CLIP, DINOv2, PubMedCLIP, BioMedCLIP |
| open-clip-torch | 2.24.0 | OpenCLIP / BioMedCLIP loading |
| timm | 0.9.16 | Model utilities |
| MedCLIP | 0.0.3 | MedCLIP vision encoders |
| hydra-core | 1.3.0 | Configuration management |
| wandb | 0.17.1 | Experiment tracking |
| scikit-learn | 1.4.2 | Classical ML baselines and metrics |
GPU note. The benchmark was run on NVIDIA A40 GPUs (48 GB). For 16–24 GB cards, set a smaller batch_size in the dataset config or use torch.cuda.amp. CPU-only runs are possible but very slow.
Open src/configs/PEFT_runs/experiment/paths/system/local.yaml and set the two paths for your machine:
# src/configs/PEFT_runs/experiment/paths/system/local.yaml
data_base_path: /path/to/your/data/processed # root of preprocessed datasets
output_path: /path/to/your/results # where runs and checkpoints are saved
All Hydra commands automatically pick up these paths via experiment/paths/system@_global_=local.
All four datasets are publicly available and free to download. Create the directory tree below, then download each dataset into its raw folder.
data/
└── raw/
├── AIforCOVID/ ← download here
├── COVID-19-AR/ ← download here
├── CoCross/ ← download here
└── StonyBrook-COVID19/ ← download here
1 585 patients · 6 Italian hospitals · two tasks: severity (balanced) and mortality (imbalanced)
- Register and request access at the AIforCOVID portal or download directly from the Zenodo record.
- Place the downloaded archive under data/raw/AIforCOVID/.
- Preprocess:
python src/preprocessing/AFC/preprocess_AFC.py
99 CXRs · rural US population · ICU admission · strongly imbalanced
- Download from The Cancer Imaging Archive (TCIA) — no registration required.
- Place the DICOM/PNG files under data/raw/COVID-19-AR/.
- Preprocess:
python src/preprocessing/COVID-19-AR/preprocess_CAR.py
389 CXRs · longitudinal ICU monitoring · ICU outcome
- Download from the CoCross dataset page (supplementary data link in the paper) or directly from the Zenodo record.
- Place files under data/raw/CoCross/.
- Preprocess:
python src/preprocessing/CoCross/preprocess_CC.py
13 639 CXRs · 1 365 patients · mortality · two sampling variants
- Download from TCIA collection TCIA.BBAG-2690 — free TCIA account required.
- Place the downloaded images under data/raw/StonyBrook-COVID19/.
- Preprocess (generates both the single-CXR-per-patient split NY and the full longitudinal split NY_all):
python src/preprocessing/COVID-NY/preprocess_NY.py
After running the four scripts above, the processed splits appear under data/processed/:
data/processed/
├── AFC/ # AIforCOVID — severity
├── AFC_death/ # AIforCOVID — mortality
├── CAR/ # COVID-19-AR
├── CoCross/ # CoCross
├── NY/ # Stony Brook — 1 CXR/patient
└── NY_all/ # Stony Brook — all CXRs
Each directory contains the image files and a CSV manifest with split assignments consumed by the dataset classes in src/data/datasets/.
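As a quick sanity check after preprocessing, a manifest of this kind can be inspected with the standard library. The sketch below is hypothetical: the column names `image`, `label`, and `split` are assumptions made for illustration, since the README does not specify the manifest schema, and the in-memory CSV stands in for a real file.

```python
import csv
import io
from collections import Counter


def load_manifest(handle):
    """Read a CSV manifest into a list of dict rows."""
    return list(csv.DictReader(handle))


def split_counts(rows, split_col="split"):
    """Count rows per split value (column name is an assumption)."""
    return Counter(row[split_col] for row in rows)


# In-memory stand-in for a manifest file; replace with open(path, newline="").
example = io.StringIO(
    "image,label,split\n"
    "p001.png,0,train\n"
    "p002.png,1,train\n"
    "p003.png,0,test\n"
)
rows = load_manifest(example)
counts = split_counts(rows)
print(dict(counts))
```

Check the actual header of the generated CSV files before adapting the column names.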
All model weights are downloaded automatically on first use by the respective library. No manual download is needed except for models behind a Hugging Face access gate.
| Model | Auto-download | Source |
|---|---|---|
| ResNet-18 / ResNet-50 / DenseNet-121 | ✓ torchvision | PyTorch Hub |
| DINOv2-S / B / L | ✓ torch.hub | facebookresearch/dinov2 |
| CLIP-Large | ✓ open_clip | mlfoundations/open_clip |
| MedCLIP (ResNet / Swin) | ✓ MedCLIP.from_pretrained() | RyanWangZf/MedCLIP |
| BioMedCLIP | ✓ open_clip (HF hub) | microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 |
| PubMedCLIP | ✓ transformers (HF hub) | flaviagiammarino/pubmed-clip-vit-base-patch32 |
For BioMedCLIP and PubMedCLIP (Hugging Face), log in once so the download can proceed:
pip install huggingface_hub
huggingface-cli login # generate a token at https://huggingface.co/settings/tokens
Weights are cached in ~/.cache/huggingface/ and ~/.cache/torch/ after the first run.
The main entry point is src/eval/classification/linear.py. All configuration is handled by Hydra — override any parameter on the command line.
source PEFT_env/bin/activate
# Linear Probing — ResNet-18 on AIforCOVID (AFC)
python src/eval/classification/linear.py \
experiment/databases@db=AFC \
experiment/models@_global_=resnet_18 \
experiment=linear_probing_none_test_all \
experiment/validation_strategy@_global_=loCo
# LoRA (rank=8) — DINOv2-B on Stony Brook NY (all CXRs)
python src/eval/classification/linear.py \
experiment/databases@db=NY_all \
experiment/models@_global_=vitb14_pretrain \
experiment=linear_probing_LoRa_8_test_all \
experiment/validation_strategy@_global_=5fold
# Full Fine-Tuning — BioMedCLIP on CoCross
python src/eval/classification/linear.py \
experiment/databases@db=CoCross \
experiment/models@_global_=biomedclip \
experiment=full_finetuning \
experiment/validation_strategy@_global_=5fold
# BitFit — Few-Shot (FSL) — CLIP-Large on AFC
python src/eval/classification/linear.py \
experiment/databases@db=AFC \
experiment/models@_global_=clip_large \
experiment=linear_probing_fitbit_FSL_test_all \
experiment/validation_strategy@_global_=loCo
Available experiment presets (src/configs/PEFT_runs/experiment/):
| Config file | Strategy | Regime |
|---|---|---|
| linear_probing_none_test_all.yaml | LP | Full data |
| full_finetuning.yaml | FFT | Full data |
| linear_probing_LoRa_{4,8,16}_test_all.yaml | LoRA (r=4/8/16) | Full data |
| linear_probing_VeRA_{4,8,16}_test_all.yaml | VeRA (r=4/8/16) | Full data |
| linear_probing_fitbit_test_all.yaml | BitFit | Full data |
| linear_probing_IA3_test_all.yaml | IA³ | Full data |
| linear_probing_none_FSL_test_all.yaml | LP | Few-Shot |
| full_finetuning_FSL.yaml | FFT | Few-Shot |
| linear_probing_LoRa_FSL_{4,8,16}_test_all.yaml | LoRA FSL | Few-Shot |
| linear_probing_VeRA_FSL_{4,8,16}_test_all.yaml | VeRA FSL | Few-Shot |
| linear_probing_fitbit_FSL_test_all.yaml | BitFit FSL | Few-Shot |
| linear_probing_IA3_FSL_test_all.yaml | IA³ FSL | Few-Shot |
Available model configs (experiment/models/): resnet_18, resnet_50, dense121, vitb14_pretrain, vitl14_pretrain, vits14_pretrain, clip_large, medclip_resnet50, medclip_vision, pubmedclip, biomedclip
Available database configs (experiment/databases/): AFC, AFC_death, CAR, CoCross, NY_small, NY_all
Available validation strategies (experiment/validation_strategy/): loCo (Leave-One-Center-Out, for multi-centric AFC/AFC_m), 5fold (5-fold CV, for single-center datasets), hold_out
System path profiles (experiment/paths/system@_global_=): local (set paths in src/configs/PEFT_runs/experiment/paths/system/local.yaml)
Outputs (per-fold predictions, metrics, checkpoints) land under results/<DATASET>/<MODEL>/<STRATEGY>/.
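The Hydra override syntax used in the commands above can also be assembled programmatically, for example when scripting a small sweep. The helper below is a hypothetical sketch (`build_command` is not part of the repository); it only reuses the four override keys and config names that appear in this README.

```python
def build_command(database, model, preset, validation):
    """Return the argument list for one run of the main entry point."""
    return [
        "python",
        "src/eval/classification/linear.py",
        f"experiment/databases@db={database}",
        f"experiment/models@_global_={model}",
        f"experiment={preset}",
        f"experiment/validation_strategy@_global_={validation}",
    ]


# Linear Probing with ResNet-18 on AFC under LOCO, as in the first example above.
cmd = build_command("AFC", "resnet_18", "linear_probing_none_test_all", "loCo")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root with the virtual environment active.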
The multiple_linear.sh launcher fans out all model × fine-tuning combinations for a given dataset and regime:
# Full-data benchmark on AIforCOVID (severity)
bash src/bash/multiple_linear.sh -d AFC -v L -e VANILLA
# Few-Shot benchmark on AIforCOVID
bash src/bash/multiple_linear.sh -d AFC -v L -e FSL
# Counterfactual FSL (CFSL)
bash src/bash/multiple_linear.sh -d AFC -v L -e CFSL
# Replace -d with any dataset ID: AFC | AFC_death | CAR | CoCross | NY | NY_all
The -v flag controls validation strategy (L = LOCO, 5 = 5-fold), and -e controls the regime (VANILLA = full data, FSL = few-shot, CFSL = counterfactual FSL).
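The fan-out pattern behind such a launcher, where every model × preset combination is run with a cap on concurrent jobs, might look roughly like the sketch below. This is not the logic of multiple_linear.sh or launch_bash.py, which this README does not show. The names `enumerate_jobs`, `run_all`, and `dry_run` are hypothetical, and the model and preset lists are small subsets taken from the tables in this document.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

MODELS = ["resnet_18", "vitb14_pretrain"]
PRESETS = ["linear_probing_none_test_all", "linear_probing_LoRa_8_test_all"]


def enumerate_jobs(models, presets):
    """Cartesian product of models and presets, one tuple per run."""
    return list(itertools.product(models, presets))


def run_all(jobs, worker, max_jobs=4):
    """Run worker(model, preset) for each job with at most max_jobs in flight."""
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # pool.map preserves the input order of the jobs in its results.
        return list(pool.map(lambda job: worker(*job), jobs))


def dry_run(model, preset):
    """Placeholder worker; a real one would launch the training command."""
    return f"{model}:{preset}"


jobs = enumerate_jobs(MODELS, PRESETS)
results = run_all(jobs, dry_run)
print(results)
```

Threads are sufficient here because a real worker would spend its time waiting on a child process rather than computing in Python.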
To run the complete benchmark in one shot (all datasets × all regimes), use the top-level script:
bash src/bash/run_all.sh
After all folds complete, aggregate per-fold results into summary tables:
python src/postprocess/aggregate_results/aggregate_results.py \
experiment/databases@db=AFC \
experiment/paths/system@_global_=local
Aggregated tables are written under results/aggregated_results/.
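Conceptually, fold aggregation reduces a list of per-fold scores to a single summary entry. The sketch below assumes the summary is the mean and sample standard deviation across folds, reported on a 0–100 scale. The README does not state either convention, so treat both as assumptions. The fold scores are invented, and `summarize` and `format_summary` are hypothetical helpers rather than functions from aggregate_results.py.

```python
import statistics


def summarize(fold_scores):
    """Return (mean, sample std) of per-fold scores."""
    mean = statistics.mean(fold_scores)
    std = statistics.stdev(fold_scores) if len(fold_scores) > 1 else 0.0
    return mean, std


def format_summary(fold_scores, scale=100.0):
    """Format as 'mean ± std' on a scaled axis (scale is an assumption)."""
    mean, std = summarize(fold_scores)
    return f"{mean * scale:.1f} ± {std * scale:.1f}"


# Illustrative MCC values for five folds (not benchmark results).
folds_mcc = [0.40, 0.50, 0.45, 0.55, 0.60]
print(format_summary(folds_mcc))
```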
A Dash-based web interface is included for interactive exploration of aggregated results:
python src/postprocess/interface/index.py
# Open http://localhost:8050 in your browser
The interface includes per-dataset scatter plots of MCC vs. % parameters trained, CNN vs. FM comparisons, and model-level breakdowns.
The figures in figures/CNN_vs_FM/, figures/ALL/, and figures/CE95/ were generated from the aggregated results using the plotting scripts under src/postprocess/. To regenerate them, first populate results/aggregated_results/, then re-run the plots from the interface or from the dedicated postprocess scripts.
Figure 2 — Distribution of MCC scores per fine-tuning method and dataset. Each box summarizes the mean performance of all models fine-tuned with a given technique on a specific dataset. FFT is consistently strong on small/imbalanced tasks; LP and BitFit offer the most stable lightweight alternatives; LoRA and VeRA show higher variance.
Figure 3 — CNN vs. FM per dataset (MCC). Each subplot shows the mean test MCC for all fine-tuning methods applicable to both architecture families. Each model is represented by its own symbol and each method by a distinct color. FFT (★) is shown separately. Key observations: (a) on balanced AFC, PEFT and FFT are competitive across architectures; (b) on imbalanced AFC_m, PEFT degrades sharply for most models, with only DINOv2 variants retaining reasonable performance; (e–f) dataset size is a critical factor — PEFT improves markedly from NY_small to NY_all.
Figure 4 — MCC vs. % parameters trained (FM only). X-axis: fraction of model parameters updated during fine-tuning. Y-axis: mean MCC on the test set. Each point is a (model, method) pair. The plot reveals that performance does not scale monotonically with parameter count: BitFit and LP (far left) often match or outperform mid-range PEFT configurations (LoRA, VeRA), while FFT (far right) dominates on small, imbalanced tasks.
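To give a feel for the x-axis of Figure 4, the arithmetic below shows how few parameters LoRA adds to a single linear layer. It is a rough sketch under stated assumptions: one 768 × 768 weight matrix (the hidden size of a ViT-B encoder) adapted with rank-r factors. Which layers the benchmark actually adapts is not specified in this README, and the function names are hypothetical.

```python
def lora_extra_params(d_out, d_in, rank):
    """LoRA trains B (d_out x rank) and A (rank x d_in) beside a frozen d_out x d_in weight."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters relative to the frozen weight they adapt."""
    return lora_extra_params(d_out, d_in, rank) / (d_out * d_in)


# Ranks used by the LoRA presets in this benchmark.
for r in (4, 8, 16):
    extra = lora_extra_params(768, 768, r)
    pct = 100 * trainable_fraction(768, 768, r)
    print(f"r={r}: {extra} trainable params ({pct:.2f}% of the layer)")
```

Even at r=16 the adapter stays in the low single-digit percent range for this layer, which is why LoRA points sit far to the left of FFT on the plot.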
Selected top results from the paper (full tables in the published article):
| Model | Task | Best strategy | MCC |
|---|---|---|---|
| DINOv2-S | CAR | LoRA (r=4) | 77.2 ± 16.4 |
| PubMedCLIP | NY_small | FFT | 65.0 ± 25.2 |
| DenseNet-121 | CAR | FFT | 70.4 ± 36.8 |
| DINOv2-S | AFC | FFT | 49.6 ± 13.8 |
| DenseNet-121 | AFC_m | FFT | 51.8 ± 31.1 |
| CLIP-Large | NY_all | LoRA (r=8) | 43.5 ± 10.2 |
| DINOv2-L | NY_all | BitFit | 45.1 ± 9.5 |
| BioMedCLIP | NY_small | FFT | 43.8 ± 7.8 |
Full results for all model–task–strategy combinations are reported in the published paper (MCC and PR-AUC tables, Wilcoxon signed-rank pairwise comparisons).
Appendix figures: MCC with 95% confidence intervals per dataset. Detailed per-model, per-method MCC plots with 95% CI are available for all six tasks in figures/CE95/.
- CNNs remain reliable in extreme low-data regimes. ResNet and DenseNet models fine-tuned with FFT outperform FMs consistently on very small (N < 200) or severely imbalanced datasets, due to their compact architecture and strong inductive biases.
- FMs with PEFT excel as data availability increases. LoRA and BitFit enable efficient adaptation of large pretrained models with minimal parameter updates; they are competitive or superior on larger datasets (NY_all, AFC).
- PEFT is sensitive to class imbalance. Severe imbalance (e.g., AFC_m at 85–15%) degrades PEFT performance sharply, while more balanced data restores competitiveness.
- LP is the most stable few-shot strategy. In FSL settings (k ≤ 32 shots per class), LP consistently outperforms other methods on average, offering a computationally inexpensive and robust solution.
- No single fine-tuning strategy is universally optimal. Model architecture, dataset scale, and class balance jointly determine which approach works best — the benchmark provides an actionable decision map.
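The few-shot regime referenced in the findings above draws k samples per class from the training pool. A minimal sketch of such a sampler follows. It is not the repository's implementation; the name `sample_k_per_class`, the seed handling, and the toy labels are all assumptions.

```python
import random
from collections import defaultdict


def sample_k_per_class(labels, k, seed=0):
    """Return sorted indices containing exactly k samples from every class."""
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < k:
            raise ValueError(f"class {label} has only {len(pool)} samples, need {k}")
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)


# Toy training pool with an 85 / 15 class split (illustrative only).
labels = [0] * 85 + [1] * 15
for k in (2, 4, 8):
    subset = sample_k_per_class(labels, k)
    print(k, len(subset), sum(labels[i] for i in subset))
```

With this toy pool, k = 16 or 32 would raise because the minority class has only 15 samples, which illustrates how small imbalanced cohorts limit the usable shot counts.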
Experiments were conducted on a high-performance computing cluster equipped with 10 NVIDIA Tesla A40 GPUs (48 GB each) via the National Academic Infrastructure for Supercomputing in Sweden (NAISS). Total benchmark cost: approximately 3,200 GPU-hours (~6 GPU-hours per model–dataset combination on average).
All four datasets used in this benchmark are publicly available and can be downloaded from their original sources:
| Dataset | Download |
|---|---|
| AIforCOVID | AIforCOVID portal |
| COVID-19-AR | The Cancer Imaging Archive / Radiology: AI |
| CoCross | CoCross dataset |
| Stony Brook COVID-19 | TCIA collection TCIA.BBAG-2690 |
No patient-level data is included in this repository.
If you use this code or build on this benchmark, please cite:
@article{ruffini2025benchmarking,
title = {Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning
for Prognosis Prediction in Medical Imaging},
author = {Ruffini, Filippo and Mulero Ayllon, Elena and Shen, Linlin and
Soda, Paolo and Guarrasi, Valerio},
journal = {Computer Methods and Programs in Biomedicine},
year = {2025},
doi = {10.1016/j.cmpb.2025.106...},
url = {https://www.sciencedirect.com/science/article/pii/S016926072500611X}
}
Please also cite the foundation models you use:
- DINOv2 — Oquab et al., github.com/facebookresearch/dinov2
- CLIP — Radford et al., openai.com/research/clip
- MedCLIP — Wang et al., github.com/RyanWangZf/MedCLIP
- BioMedCLIP — Zhang et al., HuggingFace
- PubMedCLIP — Eslami et al., HuggingFace
- HuggingFace PEFT — huggingface.co/docs/peft
All source code, configurations, documentation, and figures in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) — see creativecommons.org/licenses/by-nc/4.0 for the full text.
- ✅ Allowed — academic research, teaching, non-profit clinical research, personal study, modification and redistribution with attribution.
- ❌ Not allowed without a separate licence — incorporation into commercial products, paid clinical decision-support systems, or any other commercial exploitation.
For commercial licensing contact the corresponding authors.











