
feat(infrastructure): add VLM base classes and utilities #638

Open

davidberenstein1957 wants to merge 2 commits into feat/vlm-pr-1-vendor from feat/vlm-pr-2-infrastructure

Conversation

davidberenstein1957 (Member) commented Apr 25, 2026

Summary

Adds the VLM inference infrastructure used by all downstream VLM judge metrics (an illustrative sketch of the abstraction follows the list):

  • BaseVLM
  • LitellmVLM
  • TransformersVLM
  • StatefulVLMMeanScoresMetric
  • shared batch/device helpers
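
For orientation, here is a minimal sketch of how such an abstraction boundary typically looks. Class responsibilities are taken from the list above; method names and signatures are illustrative assumptions, not the actual pruna API:

from abc import ABC, abstractmethod

class BaseVLM(ABC):
    """Shared interface so judge metrics can swap API and local backends."""

    @abstractmethod
    def generate(self, images: list, prompt: str) -> str:
        """Return the model's text response for the given images and prompt."""

class LitellmVLM(BaseVLM):
    """API-backed inference (OpenAI, Anthropic, etc.) routed through litellm."""

    def generate(self, images: list, prompt: str) -> str:
        ...  # build a multimodal message and call litellm.completion

class TransformersVLM(BaseVLM):
    """Local inference backed by a Hugging Face transformers model."""

    def generate(self, images: list, prompt: str) -> str:
        ...  # run processor + model.generate on the configured device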

Stack Position

This PR is 2/10 in the VLM metrics stack (see Review Flow below).
Files

  • src/pruna/evaluation/metrics/vlm_base.py
  • src/pruna/evaluation/metrics/vlm_utils.py
  • tests/evaluation/test_vlm_base_infrastructure.py
  • src/pruna/evaluation/metrics/utils.py
  • src/pruna/evaluation/metrics/__init__.py
  • pyproject.toml

Alignment Notes

This PR is intentionally based on feat/vlm-pr-1-vendor so reviewers only see the infrastructure delta.

Test Plan

uv run pytest tests/evaluation/test_vlm_base_infrastructure.py -v

Review Focus

  • API/local VLM abstraction boundaries
  • Device handling and batching behavior
  • Stateful aggregation correctness

Review Flow (Order)

Review the stack in this exact order:

  1. feat(vendor): add LLM2Vec embedding model #637 vendor
  2. feat(infrastructure): add VLM base classes and utilities #638 infrastructure
  3. feat(text-metrics): split qa_accuracy #645 qa_accuracy
  4. feat(text-metrics): split oneig_alignment #646 oneig_alignment
  5. feat(text-metrics): split text_score pair #647 text_score pair
  6. feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
  7. feat(vision-metrics): split vqa #649 vqa
  8. feat(vision-metrics): split vie_score #650 vie_score
  9. feat(vision-metrics): split img_edit_score #651 img_edit_score
  10. feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (2/10)


cursor (bot) left a comment

Cursor Bugbot reviewed commit 21212de and found 3 potential issues.

top = getattr(tok, "top_logprobs", None) or []
for t in top:
    token_str = (getattr(t, "token", "") or "").lower()
    lp = float(getattr(t, "logprob", -1e9) or -1e9)

Logprob zero treated as missing due to falsy check

Medium Severity

The expression float(getattr(t, "logprob", -1e9) or -1e9) uses the or operator to provide a fallback, but 0.0 is falsy in Python. A logprob of 0.0 means P = exp(0) = 1.0 (100% probability), yet 0.0 or -1e9 evaluates to -1e9, turning that into P ≈ 0. This silently corrupts probability scoring whenever a token has logprob exactly zero.
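
A minimal fix, assuming the attribute names from the excerpt above, is to test explicitly for None so that a legitimate 0.0 logprob survives:

# Only a missing logprob should fall back; 0.0 is a valid value (P = exp(0) = 1).
raw_lp = getattr(t, "logprob", None)
lp = float(raw_lp) if raw_lp is not None else -1e9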


self.pooling_mode = pooling_mode
self.skip_instruction = skip_instruction
self.max_length = max_length
self.doc_max_length = 512

Constructor ignores doc_max_length parameter, hardcodes 512

Medium Severity

LLM2Vec.__init__ accepts a doc_max_length parameter (line 79) but line 88 assigns self.doc_max_length = 512 instead of self.doc_max_length = doc_max_length. The parameter value is silently discarded, so any doc_max_length loaded from llm2vec_config.json via from_pretrained or passed explicitly has no effect on document truncation behavior.
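
The fix is a one-line change. A sketch with a reduced, hypothetical signature (only the doc_max_length handling matters here; other parameter names are taken from the excerpt above):

def __init__(self, pooling_mode, skip_instruction, max_length, doc_max_length=512):
    self.pooling_mode = pooling_mode
    self.skip_instruction = skip_instruction
    self.max_length = max_length
    self.doc_max_length = doc_max_length  # pass the parameter through instead of hardcoding 512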


Comment thread on pyproject.toml:
"peft>=0.18.0,<0.19.0",
"trl<=0.21.0",
"termcolor==2.3.0",
"realesrgan",

Heavy realesrgan moved from optional to core dependencies

Medium Severity

realesrgan was previously under the optional upscale extra but is now a core dependency in dependencies. This forces all users to install a heavy GPU-oriented package (with native compilation requirements) even if they never use upscaling. The upscale optional extra was simultaneously removed.
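
A sketch of the presumably intended layout, moving realesrgan back under an optional extra. The extra name upscale comes from the comment above, and the surrounding pins from the excerpt; the actual file may differ:

[project]
dependencies = [
    "peft>=0.18.0,<0.19.0",
    "trl<=0.21.0",
    "termcolor==2.3.0",
]

[project.optional-dependencies]
upscale = ["realesrgan"]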


- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure

davidberenstein1957 force-pushed the feat/vlm-pr-2-infrastructure branch from 21212de to 7054e53 on May 8, 2026 at 09:01

Keep PR #638 focused on VLM infrastructure by removing exports for downstream
metric classes and restoring Rapidata export from the base branch.

Co-authored-by: Cursor <cursoragent@cursor.com>