Skip to content

Add service-mode Retriever.answer support#2202

Open
charlesbluca wants to merge 24 commits into
NVIDIA:mainfrom
charlesbluca:codex/super49b-helm-nim
Open

Add service-mode Retriever.answer support#2202
charlesbluca wants to merge 24 commits into
NVIDIA:mainfrom
charlesbluca:codex/super49b-helm-nim

Conversation

@charlesbluca

@charlesbluca charlesbluca commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Description

Adds service-mode answer generation so hosted retriever deployments can expose the same core live-RAG workflow as local Retriever.answer(): VectorDB retrieval, LiteLLM answer generation, optional reference scoring, optional judge scoring, and response projection for chunks/metadata. The PR also keeps the Helm/operator path configurable so service-mode answer generation can use either an operator-managed answer LLM NIM or an external OpenAI-compatible endpoint.

Changes include:

  • Shared answer-generation contracts under nemo_retriever.models.llm: an internal AnswerRequest, Pydantic-backed AnswerResult, and shared result/scoring construction used by both local and service-mode answer generation.
  • Retriever.answer(...) remains kwargs-first for users, but now normalizes internally through the shared request/result path, returns a Pydantic AnswerResult, exposes chunk_count, and supports per-call reasoning_enabled overrides without forcing custom clients to accept the override unless it is set.
  • POST /v1/answer now uses the same shared AnswerResult construction as local mode, preserves VectorDB/LLM error handling, keeps include_chunks and include_metadata as HTTP-only response projection flags, and supports reference plus opt-in judge: true scoring.
  • Configurable service and Helm LLM wiring through serviceConfig.llm and optional nimOperator.answer_llm, including Super-49B defaults, alternate model coverage, secret-safe API key injection, and in-cluster api_base / model auto-wiring.
  • Unified LiteLLM prompt/reasoning controls and dependency coverage for service, SDK, and batch-evaluation answer generation.

Validation:

  • pre-commit run --show-diff-on-failure --color=always --all-files
  • uv run pytest tests/test_live_rag.py tests/test_service_answer_generation.py tests/test_llm_params.py
  • env -u NVIDIA_API_KEY -u NGC_API_KEY uv run pytest tests
  • PR checks green as of the latest push: Pre-commit Checks, Retriever Unit Tests, PR Validation Complete, Docker build/test, slim import contract, library-mode installs, and Greptile Review.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@charlesbluca charlesbluca changed the title [codex] Add Super-49B Helm answer generation Add Super-49B Helm answer generation Jun 2, 2026
@charlesbluca charlesbluca force-pushed the codex/super49b-helm-nim branch from 63857d9 to a643751 Compare June 2, 2026 18:34
@charlesbluca

This comment has been minimized.

1 similar comment
@charlesbluca

This comment has been minimized.

@greptile-apps

greptile-apps Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a configurable POST /v1/answer endpoint to service mode, wiring VectorDB retrieval to a LiteLLM-backed LLM, with optional Helm-managed NIM deployment (Super-49B default), API-key injection via Kubernetes Secrets, and per-request reasoning control.

  • New /v1/answer endpoint (ingest.py): lazy-cached LiteLLMClient, httpx-proxied VectorDB query, HTTP 502 on LLM failure, include_chunks/include_metadata response flags, and optional reference scoring/judging via shared build_answer_result.
  • Helm chart additions: generic answer_llm NIMCache/NIMService template with Super-49B defaults; auto-wires api_base and model into the retriever ConfigMap; API key sourced from a secretKeyRef (never written to the ConfigMap).
  • Shared reasoning control: reasoning_enabled=True (default) preserves provider behavior; False injects Nemotron-compatible /no_think prefix and chat_template_kwargs only when explicitly opted in, avoiding UnsupportedParamsError on non-Nemotron providers.

Confidence Score: 4/5

Safe to merge with awareness of the public API break to AnswerResult (flagged in a prior review pass) — the new chunk_count required field and the chunks/metadata optionality change affect library users who construct AnswerResult directly.

The AnswerResult public model gained a required chunk_count field with no default and flipped chunks/metadata from required to optional, which is a breaking change for downstream library callers. This was identified in a prior review pass and remains unresolved. Everything else in the PR — the new endpoint logic, Helm wiring, secret handling, reasoning control, and test coverage — is correct and well-structured.

nemo_retriever/src/nemo_retriever/models/llm/types.py — the AnswerResult shape change is the primary concern; also review the placement of build_answer_result/populate_answer_scores there vs. the evaluation package.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/service/routers/ingest.py New POST /v1/answer endpoint: guard checks, httpx VectorDB proxy, lazy-cached LiteLLMClient, HTTP 502 on gen.error, include_chunks/include_metadata stripping — all look correct; previous concerns (re-instantiation, silent 200 on error) are addressed.
nemo_retriever/src/nemo_retriever/models/llm/clients/litellm.py Adds reasoning control (_with_no_reasoning_controls, _NO_REASONING_EXTRA_PARAMS), _deep_merge_dicts for per-call extra params, and _format_rag_system_prompt; reasoning_enabled defaults to True, avoiding UnsupportedParamsError on non-Nemotron providers.
nemo_retriever/src/nemo_retriever/models/llm/types.py AnswerResult migrated to Pydantic BaseModel with new required chunk_count field (breaking change, flagged in prior review); new build_answer_result and populate_answer_scores functions placed here mix evaluation business logic into the types module.
nemo_retriever/src/nemo_retriever/service/config.py New LLMConfig with model_validator that rejects empty model when enabled=True; api_key left null (injected from env); all fields have sensible defaults.
nemo_retriever/helm/templates/nims/answer-llm.yaml New NIMCache/NIMService template for operator-managed answer LLM; pullSecrets validated with fail; NGC_API_KEY filtered from user env and re-injected from authSecret.
nemo_retriever/helm/templates/configmap.yaml Auto-wires llmAPIBase from NIM operator URL when serviceConfig.llm.apiBase is empty; api_key always rendered as null (real key comes from secret env injection); llmEnabled logic correctly handles both explicit and NIM-managed paths.
nemo_retriever/helm/templates/deployment.yaml Adds NEMO_RETRIEVER_LLM_API_KEY env injection via secretKeyRef; falls back to NIM authSecret (NGC_API_KEY) when nimOperator.answer_llm is enabled; applied correctly to both standalone and split-mode containers.
nemo_retriever/tests/test_service_answer_generation.py Good coverage of 404/502 paths, include_flags, reasoning override, reference scoring, and judge opt-in; two test lambdas missing *, reasoning_enabled=None signature, which would silently break if the service ever passes the kwarg unconditionally.
nemo_retriever/tests/test_llm_params.py New TestLiteLLMRAGPrompt and TestBackCompatCallSites additions cover reasoning enable/disable, custom prompts, and pipeline builder forwarding; all lambda signatures include *, reasoning_enabled=None.
nemo_retriever/helm/values.yaml Well-documented answer_llm NIM block and serviceConfig.llm section with clear defaults; apiKeySecret replaces inline api_key; reasoningEnabled defaults to true for external-provider safety.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Service as Retriever Service (POST /v1/answer)
    participant VDB as VectorDB Pod (/v1/query)
    participant LLM as LiteLLMClient (cached in app.state)

    Client->>Service: "POST /v1/answer {query, top_k, include_chunks, reasoning_enabled?}"
    Service->>Service: Check config.vectordb.enabled and config.llm.enabled - 404 if disabled
    Service->>Service: Build CoreAnswerRequest
    Service->>VDB: "POST /v1/query {query, top_k}"
    alt VectorDB error
        VDB-->>Service: non-200 response
        Service-->>Client: Proxy error (original status + content)
    else VectorDB OK
        VDB-->>Service: "{results: [{hits: [...]}]}"
        Service->>Service: Build RetrievalResult (chunks, metadata)
        Service->>Service: Lazy-init LiteLLMClient (cached on app.state)
        Service->>LLM: generate(query, chunks, reasoning_enabled?)
        alt "reasoning_enabled=False"
            LLM->>LLM: _with_no_reasoning_controls(messages) + _NO_REASONING_EXTRA_PARAMS
        end
        LLM-->>Service: GenerationResult
        alt gen.error set
            Service-->>Client: HTTP 502 LLM answer generation failed
        else generation OK
            Service->>Service: build_answer_result(query, retrieval, generation, reference?, judge?)
            Service->>Service: model_copy strip chunks/metadata per include flags
            Service-->>Client: AnswerResult JSON (200)
        end
    end
Loading

Reviews (23): Last reviewed commit: "Apply pre-commit formatting" | Re-trigger Greptile

Comment thread nemo_retriever/src/nemo_retriever/retriever.py Outdated
Comment thread nemo_retriever/pyproject.toml Outdated
@greptile-apps

This comment has been minimized.

Comment thread nemo_retriever/src/nemo_retriever/service/routers/ingest.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/service/routers/ingest.py Outdated
Comment thread nemo_retriever/pyproject.toml Outdated
Comment thread nemo_retriever/tests/test_service_answer_generation.py
@charlesbluca charlesbluca force-pushed the codex/super49b-helm-nim branch 4 times, most recently from 824dc2d to a456393 Compare June 2, 2026 20:51
@charlesbluca

This comment has been minimized.

@charlesbluca charlesbluca force-pushed the codex/super49b-helm-nim branch from a456393 to 85dcb49 Compare June 2, 2026 21:13
@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

Comment thread nemo_retriever/src/nemo_retriever/models/llm/clients/litellm.py
@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca

This comment has been minimized.

@charlesbluca charlesbluca changed the title Add Super-49B Helm answer generation Add configurable answer LLM generation Jun 3, 2026
@charlesbluca charlesbluca marked this pull request as ready for review June 4, 2026 13:51
@charlesbluca charlesbluca requested review from a team as code owners June 4, 2026 13:51
@charlesbluca charlesbluca requested a review from jperez999 June 4, 2026 13:51
@charlesbluca charlesbluca changed the title Add configurable answer LLM generation Add configurable answer LLM generation to service mode Jun 4, 2026
@charlesbluca

Copy link
Copy Markdown
Collaborator Author

Ignoring the VectorDB timeout nit; /v1/query shares this timeout quirk, so changes there are better suited as general tech debt versus in scope for this PR.

@jperez999 jperez999 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hold off on merging until @jdye64 takes a look. But from my side LGTM.

reasoning_enabled: bool | None = None


class AnswerResponse(BaseModel):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good job formalizing this

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be in service, or pulled to common. That would make sure local could use this as well.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got me down the rabbit hole of trying to better align the APIs between /v1/answer and Retriever.answer() in 3cca7c6; think with these changes, we should now cleanly be able to use the shared pydantic models in nemo_retriever.models.llm.types between local/service answer generation

Only remaining service-specific model is ServiceAnswerRequest, which lets users toggle whether they want some of the heavier pieces of the standard local answer response to get serialized over the wire (chunks, metadata)

@charlesbluca charlesbluca changed the title Add configurable answer LLM generation to service mode Add service-mode Retriever.answer support Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants