Add service-mode Retriever.answer support by charlesbluca · Pull Request #2202 · NVIDIA/NeMo-Retriever

charlesbluca · 2026-06-02T18:24:22Z

Description

Adds service-mode answer generation so hosted retriever deployments can expose the same core live-RAG workflow as local Retriever.answer(): VectorDB retrieval, LiteLLM answer generation, optional reference scoring, optional judge scoring, and response projection for chunks/metadata. The PR also keeps the Helm/operator path configurable so service-mode answer generation can use either an operator-managed answer LLM NIM or an external OpenAI-compatible endpoint.

Changes include:

Shared answer-generation contracts under nemo_retriever.models.llm: an internal AnswerRequest, Pydantic-backed AnswerResult, and shared result/scoring construction used by both local and service-mode answer generation.
Retriever.answer(...) remains kwargs-first for users, but now normalizes internally through the shared request/result path, returns a Pydantic AnswerResult, exposes chunk_count, and supports per-call reasoning_enabled overrides without forcing custom clients to accept the override unless it is set.
POST /v1/answer now uses the same shared AnswerResult construction as local mode, preserves VectorDB/LLM error handling, keeps include_chunks and include_metadata as HTTP-only response projection flags, and supports reference plus opt-in judge: true scoring.
Configurable service and Helm LLM wiring through serviceConfig.llm and optional nimOperator.answer_llm, including Super-49B defaults, alternate model coverage, secret-safe API key injection, and in-cluster api_base / model auto-wiring.
Unified LiteLLM prompt/reasoning controls and dependency coverage for service, SDK, and batch-evaluation answer generation.

Validation:

pre-commit run --show-diff-on-failure --color=always --all-files
uv run pytest tests/test_live_rag.py tests/test_service_answer_generation.py tests/test_llm_params.py
env -u NVIDIA_API_KEY -u NGC_API_KEY uv run pytest tests
PR checks green as of the latest push: Pre-commit Checks, Retriever Unit Tests, PR Validation Complete, Docker build/test, slim import contract, library-mode installs, and Greptile Review.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

greptile-apps · 2026-06-02T18:46:54Z

Greptile Summary

This PR adds a configurable POST /v1/answer endpoint to service mode, wiring VectorDB retrieval to a LiteLLM-backed LLM, with optional Helm-managed NIM deployment (Super-49B default), API-key injection via Kubernetes Secrets, and per-request reasoning control.

New /v1/answer endpoint (ingest.py): lazy-cached LiteLLMClient, httpx-proxied VectorDB query, HTTP 502 on LLM failure, include_chunks/include_metadata response flags, and optional reference scoring/judging via shared build_answer_result.
Helm chart additions: generic answer_llm NIMCache/NIMService template with Super-49B defaults; auto-wires api_base and model into the retriever ConfigMap; API key sourced from a secretKeyRef (never written to the ConfigMap).
Shared reasoning control: reasoning_enabled=True (default) preserves provider behavior; False injects Nemotron-compatible /no_think prefix and chat_template_kwargs only when explicitly opted in, avoiding UnsupportedParamsError on non-Nemotron providers.

Confidence Score: 4/5

Safe to merge with awareness of the public API break to AnswerResult (flagged in a prior review pass) — the new chunk_count required field and the chunks/metadata optionality change affect library users who construct AnswerResult directly.

The AnswerResult public model gained a required chunk_count field with no default and flipped chunks/metadata from required to optional, which is a breaking change for downstream library callers. This was identified in a prior review pass and remains unresolved. Everything else in the PR — the new endpoint logic, Helm wiring, secret handling, reasoning control, and test coverage — is correct and well-structured.

nemo_retriever/src/nemo_retriever/models/llm/types.py — the AnswerResult shape change is the primary concern; also review the placement of build_answer_result/populate_answer_scores there vs. the evaluation package.

Important Files Changed

Filename	Overview
nemo_retriever/src/nemo_retriever/service/routers/ingest.py	New POST /v1/answer endpoint: guard checks, httpx VectorDB proxy, lazy-cached LiteLLMClient, HTTP 502 on gen.error, include_chunks/include_metadata stripping — all look correct; previous concerns (re-instantiation, silent 200 on error) are addressed.
nemo_retriever/src/nemo_retriever/models/llm/clients/litellm.py	Adds reasoning control (_with_no_reasoning_controls, _NO_REASONING_EXTRA_PARAMS), _deep_merge_dicts for per-call extra params, and _format_rag_system_prompt; reasoning_enabled defaults to True, avoiding UnsupportedParamsError on non-Nemotron providers.
nemo_retriever/src/nemo_retriever/models/llm/types.py	AnswerResult migrated to Pydantic BaseModel with new required chunk_count field (breaking change, flagged in prior review); new build_answer_result and populate_answer_scores functions placed here mix evaluation business logic into the types module.
nemo_retriever/src/nemo_retriever/service/config.py	New LLMConfig with model_validator that rejects empty model when enabled=True; api_key left null (injected from env); all fields have sensible defaults.
nemo_retriever/helm/templates/nims/answer-llm.yaml	New NIMCache/NIMService template for operator-managed answer LLM; pullSecrets validated with fail; NGC_API_KEY filtered from user env and re-injected from authSecret.
nemo_retriever/helm/templates/configmap.yaml	Auto-wires llmAPIBase from NIM operator URL when serviceConfig.llm.apiBase is empty; api_key always rendered as null (real key comes from secret env injection); llmEnabled logic correctly handles both explicit and NIM-managed paths.
nemo_retriever/helm/templates/deployment.yaml	Adds NEMO_RETRIEVER_LLM_API_KEY env injection via secretKeyRef; falls back to NIM authSecret (NGC_API_KEY) when nimOperator.answer_llm is enabled; applied correctly to both standalone and split-mode containers.
nemo_retriever/tests/test_service_answer_generation.py	Good coverage of 404/502 paths, include_flags, reasoning override, reference scoring, and judge opt-in; two test lambdas missing *, reasoning_enabled=None signature, which would silently break if the service ever passes the kwarg unconditionally.
nemo_retriever/tests/test_llm_params.py	New TestLiteLLMRAGPrompt and TestBackCompatCallSites additions cover reasoning enable/disable, custom prompts, and pipeline builder forwarding; all lambda signatures include *, reasoning_enabled=None.
nemo_retriever/helm/values.yaml	Well-documented answer_llm NIM block and serviceConfig.llm section with clear defaults; apiKeySecret replaces inline api_key; reasoningEnabled defaults to true for external-provider safety.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Service as Retriever Service (POST /v1/answer)
    participant VDB as VectorDB Pod (/v1/query)
    participant LLM as LiteLLMClient (cached in app.state)

    Client->>Service: "POST /v1/answer {query, top_k, include_chunks, reasoning_enabled?}"
    Service->>Service: Check config.vectordb.enabled and config.llm.enabled - 404 if disabled
    Service->>Service: Build CoreAnswerRequest
    Service->>VDB: "POST /v1/query {query, top_k}"
    alt VectorDB error
        VDB-->>Service: non-200 response
        Service-->>Client: Proxy error (original status + content)
    else VectorDB OK
        VDB-->>Service: "{results: [{hits: [...]}]}"
        Service->>Service: Build RetrievalResult (chunks, metadata)
        Service->>Service: Lazy-init LiteLLMClient (cached on app.state)
        Service->>LLM: generate(query, chunks, reasoning_enabled?)
        alt "reasoning_enabled=False"
            LLM->>LLM: _with_no_reasoning_controls(messages) + _NO_REASONING_EXTRA_PARAMS
        end
        LLM-->>Service: GenerationResult
        alt gen.error set
            Service-->>Client: HTTP 502 LLM answer generation failed
        else generation OK
            Service->>Service: build_answer_result(query, retrieval, generation, reference?, judge?)
            Service->>Service: model_copy strip chunks/metadata per include flags
            Service-->>Client: AnswerResult JSON (200)
        end
    end

_{Reviews (23): Last reviewed commit: "Apply pre-commit formatting" | Re-trigger Greptile}

…-nim # Conflicts: # nemo_retriever/pyproject.toml # nemo_retriever/uv.lock

charlesbluca · 2026-06-05T15:41:21Z

Ignoring the VectorDB timeout nit; /v1/query shares this timeout quirk, so changes there are better suited as general tech debt versus in scope for this PR.

jperez999

I would hold off on merging until @jdye64 takes a look. But from my side LGTM.

jperez999 · 2026-06-10T23:57:33Z

+    reasoning_enabled: bool | None = None
+
+
+class AnswerResponse(BaseModel):


good job formalizing this

should this be in service, or pulled to common. That would make sure local could use this as well.

This got me down the rabbit hole of trying to better align the APIs between /v1/answer and Retriever.answer() in 3cca7c6; think with these changes, we should now cleanly be able to use the shared pydantic models in nemo_retriever.models.llm.types between local/service answer generation

Only remaining service-specific model is ServiceAnswerRequest, which lets users toggle whether they want some of the heavier pieces of the standard local answer response to get serialized over the wire (chunks, metadata)

…-nim

charlesbluca changed the title ~~[codex] Add Super-49B Helm answer generation~~ Add Super-49B Helm answer generation Jun 2, 2026

charlesbluca force-pushed the codex/super49b-helm-nim branch from 63857d9 to a643751 Compare June 2, 2026 18:34