Add service-mode Retriever.answer support#2202
Conversation
63857d9 to
a643751
Compare
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
Greptile SummaryThis PR adds a configurable
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/service/routers/ingest.py | New POST /v1/answer endpoint: guard checks, httpx VectorDB proxy, lazy-cached LiteLLMClient, HTTP 502 on gen.error, include_chunks/include_metadata stripping — all look correct; previous concerns (re-instantiation, silent 200 on error) are addressed. |
| nemo_retriever/src/nemo_retriever/models/llm/clients/litellm.py | Adds reasoning control (_with_no_reasoning_controls, _NO_REASONING_EXTRA_PARAMS), _deep_merge_dicts for per-call extra params, and _format_rag_system_prompt; reasoning_enabled defaults to True, avoiding UnsupportedParamsError on non-Nemotron providers. |
| nemo_retriever/src/nemo_retriever/models/llm/types.py | AnswerResult migrated to Pydantic BaseModel with new required chunk_count field (breaking change, flagged in prior review); new build_answer_result and populate_answer_scores functions placed here mix evaluation business logic into the types module. |
| nemo_retriever/src/nemo_retriever/service/config.py | New LLMConfig with model_validator that rejects empty model when enabled=True; api_key left null (injected from env); all fields have sensible defaults. |
| nemo_retriever/helm/templates/nims/answer-llm.yaml | New NIMCache/NIMService template for operator-managed answer LLM; pullSecrets validated with fail; NGC_API_KEY filtered from user env and re-injected from authSecret. |
| nemo_retriever/helm/templates/configmap.yaml | Auto-wires llmAPIBase from NIM operator URL when serviceConfig.llm.apiBase is empty; api_key always rendered as null (real key comes from secret env injection); llmEnabled logic correctly handles both explicit and NIM-managed paths. |
| nemo_retriever/helm/templates/deployment.yaml | Adds NEMO_RETRIEVER_LLM_API_KEY env injection via secretKeyRef; falls back to NIM authSecret (NGC_API_KEY) when nimOperator.answer_llm is enabled; applied correctly to both standalone and split-mode containers. |
| nemo_retriever/tests/test_service_answer_generation.py | Good coverage of 404/502 paths, include_flags, reasoning override, reference scoring, and judge opt-in; two test lambdas missing *, reasoning_enabled=None signature, which would silently break if the service ever passes the kwarg unconditionally. |
| nemo_retriever/tests/test_llm_params.py | New TestLiteLLMRAGPrompt and TestBackCompatCallSites additions cover reasoning enable/disable, custom prompts, and pipeline builder forwarding; all lambda signatures include *, reasoning_enabled=None. |
| nemo_retriever/helm/values.yaml | Well-documented answer_llm NIM block and serviceConfig.llm section with clear defaults; apiKeySecret replaces inline api_key; reasoningEnabled defaults to true for external-provider safety. |
Sequence Diagram
sequenceDiagram
participant Client
participant Service as Retriever Service (POST /v1/answer)
participant VDB as VectorDB Pod (/v1/query)
participant LLM as LiteLLMClient (cached in app.state)
Client->>Service: "POST /v1/answer {query, top_k, include_chunks, reasoning_enabled?}"
Service->>Service: Check config.vectordb.enabled and config.llm.enabled - 404 if disabled
Service->>Service: Build CoreAnswerRequest
Service->>VDB: "POST /v1/query {query, top_k}"
alt VectorDB error
VDB-->>Service: non-200 response
Service-->>Client: Proxy error (original status + content)
else VectorDB OK
VDB-->>Service: "{results: [{hits: [...]}]}"
Service->>Service: Build RetrievalResult (chunks, metadata)
Service->>Service: Lazy-init LiteLLMClient (cached on app.state)
Service->>LLM: generate(query, chunks, reasoning_enabled?)
alt "reasoning_enabled=False"
LLM->>LLM: _with_no_reasoning_controls(messages) + _NO_REASONING_EXTRA_PARAMS
end
LLM-->>Service: GenerationResult
alt gen.error set
Service-->>Client: HTTP 502 LLM answer generation failed
else generation OK
Service->>Service: build_answer_result(query, retrieval, generation, reference?, judge?)
Service->>Service: model_copy strip chunks/metadata per include flags
Service-->>Client: AnswerResult JSON (200)
end
end
Reviews (23): Last reviewed commit: "Apply pre-commit formatting" | Re-trigger Greptile
This comment has been minimized.
This comment has been minimized.
824dc2d to
a456393
Compare
This comment has been minimized.
This comment has been minimized.
a456393 to
85dcb49
Compare
This comment has been minimized.
This comment has been minimized.
…-nim # Conflicts: # nemo_retriever/pyproject.toml # nemo_retriever/uv.lock
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Ignoring the VectorDB timeout nit; |
| reasoning_enabled: bool | None = None | ||
|
|
||
|
|
||
| class AnswerResponse(BaseModel): |
There was a problem hiding this comment.
good job formalizing this
There was a problem hiding this comment.
should this be in service, or pulled to common. That would make sure local could use this as well.
There was a problem hiding this comment.
This got me down the rabbit hole of trying to better align the APIs between /v1/answer and Retriever.answer() in 3cca7c6; think with these changes, we should now cleanly be able to use the shared pydantic models in nemo_retriever.models.llm.types between local/service answer generation
Only remaining service-specific model is ServiceAnswerRequest, which lets users toggle whether they want some of the heavier pieces of the standard local answer response to get serialized over the wire (chunks, metadata)
Description
Adds service-mode answer generation so hosted retriever deployments can expose the same core live-RAG workflow as local
Retriever.answer(): VectorDB retrieval, LiteLLM answer generation, optional reference scoring, optional judge scoring, and response projection for chunks/metadata. The PR also keeps the Helm/operator path configurable so service-mode answer generation can use either an operator-managed answer LLM NIM or an external OpenAI-compatible endpoint.Changes include:
nemo_retriever.models.llm: an internalAnswerRequest, Pydantic-backedAnswerResult, and shared result/scoring construction used by both local and service-mode answer generation.Retriever.answer(...)remains kwargs-first for users, but now normalizes internally through the shared request/result path, returns a PydanticAnswerResult, exposeschunk_count, and supports per-callreasoning_enabledoverrides without forcing custom clients to accept the override unless it is set.POST /v1/answernow uses the same sharedAnswerResultconstruction as local mode, preserves VectorDB/LLM error handling, keepsinclude_chunksandinclude_metadataas HTTP-only response projection flags, and supportsreferenceplus opt-injudge: truescoring.serviceConfig.llmand optionalnimOperator.answer_llm, including Super-49B defaults, alternate model coverage, secret-safe API key injection, and in-clusterapi_base/modelauto-wiring.Validation:
pre-commit run --show-diff-on-failure --color=always --all-filesuv run pytest tests/test_live_rag.py tests/test_service_answer_generation.py tests/test_llm_params.pyenv -u NVIDIA_API_KEY -u NGC_API_KEY uv run pytest testsChecklist