
feat(infrastructure): add VLM base classes and utilities #638

Open

davidberenstein1957 wants to merge 2 commits into feat/vlm-pr-1-vendor from feat/vlm-pr-2-infrastructure

Conversation

davidberenstein1957 (Member) commented Apr 25, 2026

Summary

Adds the VLM inference infrastructure used by all downstream VLM judge metrics (an illustrative sketch of the abstraction follows the list):

  • BaseVLM
  • LitellmVLM
  • TransformersVLM
  • StatefulVLMMeanScoresMetric
  • shared batch/device helpers
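
For orientation, here is a minimal sketch of how such an abstraction boundary typically looks. Class responsibilities are taken from the list above; method names and signatures are illustrative assumptions, not the actual pruna API:

from abc import ABC, abstractmethod

class BaseVLM(ABC):
    """Shared interface so judge metrics can swap API and local backends."""

    @abstractmethod
    def generate(self, images: list, prompt: str) -> str:
        """Return the model's text response for the given images and prompt."""

class LitellmVLM(BaseVLM):
    """API-backed inference (OpenAI, Anthropic, etc.) routed through litellm."""

    def generate(self, images: list, prompt: str) -> str:
        ...  # build a multimodal message and call litellm.completion

class TransformersVLM(BaseVLM):
    """Local inference backed by a Hugging Face transformers model."""

    def generate(self, images: list, prompt: str) -> str:
        ...  # run processor + model.generate on the configured device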

Stack Position

This PR is 2/10 in the VLM metrics stack (see Review Flow below).
Files

  • src/pruna/evaluation/metrics/vlm_base.py
  • src/pruna/evaluation/metrics/vlm_utils.py
  • tests/evaluation/test_vlm_base_infrastructure.py
  • src/pruna/evaluation/metrics/utils.py
  • src/pruna/evaluation/metrics/__init__.py
  • pyproject.toml

Alignment Notes

This PR is intentionally based on feat/vlm-pr-1-vendor so reviewers only see the infrastructure delta.

Test Plan

uv run pytest tests/evaluation/test_vlm_base_infrastructure.py -v

Review Focus

  • API/local VLM abstraction boundaries
  • Device handling and batching behavior
  • Stateful aggregation correctness

Review Flow (Order)

Review the stack in this exact order:

  1. feat(vendor): add LLM2Vec embedding model #637 vendor
  2. feat(infrastructure): add VLM base classes and utilities #638 infrastructure
  3. feat(text-metrics): split qa_accuracy #645 qa_accuracy
  4. feat(text-metrics): split oneig_alignment #646 oneig_alignment
  5. feat(text-metrics): split text_score pair #647 text_score pair
  6. feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
  7. feat(vision-metrics): split vqa #649 vqa
  8. feat(vision-metrics): split vie_score #650 vie_score
  9. feat(vision-metrics): split img_edit_score #651 img_edit_score
  10. feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (2/10)


cursor (bot) left a comment

Cursor Bugbot reviewed commit 21212de and found 3 potential issues.

top = getattr(tok, "top_logprobs", None) or []
for t in top:
    token_str = (getattr(t, "token", "") or "").lower()
    lp = float(getattr(t, "logprob", -1e9) or -1e9)

Logprob zero treated as missing due to falsy check

Medium Severity

The expression float(getattr(t, "logprob", -1e9) or -1e9) uses the or operator to provide a fallback, but 0.0 is falsy in Python. A logprob of 0.0 means P = exp(0) = 1.0 (100% probability), yet 0.0 or -1e9 evaluates to -1e9, turning that into P ≈ 0. This silently corrupts probability scoring whenever a token has logprob exactly zero.
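
A minimal fix, assuming the attribute names from the excerpt above, is to test explicitly for None so that a legitimate 0.0 logprob survives:

# Only a missing logprob should fall back; 0.0 is a valid value (P = exp(0) = 1).
raw_lp = getattr(t, "logprob", None)
lp = float(raw_lp) if raw_lp is not None else -1e9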


self.pooling_mode = pooling_mode
self.skip_instruction = skip_instruction
self.max_length = max_length
self.doc_max_length = 512

Constructor ignores doc_max_length parameter, hardcodes 512

Medium Severity

LLM2Vec.__init__ accepts a doc_max_length parameter (line 79) but line 88 assigns self.doc_max_length = 512 instead of self.doc_max_length = doc_max_length. The parameter value is silently discarded, so any doc_max_length loaded from llm2vec_config.json via from_pretrained or passed explicitly has no effect on document truncation behavior.
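
The fix is a one-line change. A sketch with a reduced, hypothetical signature (only the doc_max_length handling matters here; other parameter names are taken from the excerpt above):

def __init__(self, pooling_mode, skip_instruction, max_length, doc_max_length=512):
    self.pooling_mode = pooling_mode
    self.skip_instruction = skip_instruction
    self.max_length = max_length
    self.doc_max_length = doc_max_length  # pass the parameter through instead of hardcoding 512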


Comment thread on pyproject.toml:
"peft>=0.18.0,<0.19.0",
"trl<=0.21.0",
"termcolor==2.3.0",
"realesrgan",

Heavy realesrgan moved from optional to core dependencies

Medium Severity

realesrgan was previously under the optional upscale extra but is now a core dependency in dependencies. This forces all users to install a heavy GPU-oriented package (with native compilation requirements) even if they never use upscaling. The upscale optional extra was simultaneously removed.
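
A sketch of the presumably intended layout, moving realesrgan back under an optional extra. The extra name upscale comes from the comment above, and the surrounding pins from the excerpt; the actual file may differ:

[project]
dependencies = [
    "peft>=0.18.0,<0.19.0",
    "trl<=0.21.0",
    "termcolor==2.3.0",
]

[project.optional-dependencies]
upscale = ["realesrgan"]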


- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure

davidberenstein1957 force-pushed the feat/vlm-pr-2-infrastructure branch from 21212de to 7054e53 on May 8, 2026 at 09:01

Keep PR #638 focused on VLM infrastructure by removing exports for downstream
metric classes and restoring Rapidata export from the base branch.

Co-authored-by: Cursor <cursoragent@cursor.com>