Move service ingest construction into ingest core#2221
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
| ingestor = ingest_service.build_service_ingestor( | ||
| ingest_service.ServiceIngestRequest( | ||
| documents=file_patterns, | ||
| input_type=input_type, | ||
| extract_params=extract_params, | ||
| embed_params=embed_params, | ||
| text_chunk_params=text_chunk_params, | ||
| enable_text_chunk=enable_text_chunk, | ||
| dedup_params=DedupParams(iou_threshold=dedup_iou_threshold) if enable_dedup else None, | ||
| caption_params=( | ||
| CaptionParams( | ||
| context_text_max_chars=caption_context_text_max_chars, | ||
| temperature=caption_temperature, | ||
| top_p=caption_top_p, | ||
| max_tokens=caption_max_tokens, | ||
| ) | ||
| if enable_caption | ||
| else None | ||
| ), | ||
| store_params=StoreParams(storage_uri=store_images_uri) if store_images_uri is not None else None, | ||
| connection=ingest_service.ServiceIngestConnectionOptions( | ||
| service_url=service_url, | ||
| service_concurrency=service_concurrency, | ||
| service_api_token=service_api_token, | ||
| ), | ||
| ) | ||
| ) |
There was a problem hiding this comment.
ValueError from empty file set is unhandled in the legacy pipeline path
The old _build_service_ingestor raised typer.BadParameter when no files matched the glob patterns; the replacement build_service_ingestor raises ValueError instead. The run() function here has only a try/finally (no except), so any ValueError propagates unhandled to Typer and produces a raw Python traceback rather than a clean error message. This regression is reachable when _resolve_file_patterns constructs a valid glob like dir/**/*.pdf but the directory is empty — the upstream helper validates path existence, not that the glob actually matches files.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py
Line: 1326-1352
Comment:
**`ValueError` from empty file set is unhandled in the legacy pipeline path**
The old `_build_service_ingestor` raised `typer.BadParameter` when no files matched the glob patterns; the replacement `build_service_ingestor` raises `ValueError` instead. The `run()` function here has only a `try/finally` (no `except`), so any `ValueError` propagates unhandled to Typer and produces a raw Python traceback rather than a clean error message. This regression is reachable when `_resolve_file_patterns` constructs a valid glob like `dir/**/*.pdf` but the directory is empty — the upstream helper validates path existence, not that the glob actually matches files.
How can I resolve this? If you propose a fix, please make it concise.| def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult: | ||
| result = build_service_ingestor(request).ingest() | ||
| result_n_rows = _count_service_result_rows(result) | ||
| return ServiceIngestExecutionResult( | ||
| request=request, | ||
| result=result, | ||
| n_rows=result_n_rows, | ||
| result_n_rows=result_n_rows, | ||
| metadata={ | ||
| "service_url": request.connection.service_url, | ||
| "input_type": request.input_type, | ||
| }, | ||
| ) |
There was a problem hiding this comment.
No unit tests for the new
ingest/service.py module
nemo_retriever/ingest/service.py introduces substantial new business logic — profile-based extract defaults, document-type validation, caption/dedup/chunk param builders with their own error conditions, _split_config_for_auto_documents, and ServiceIngestExecutionResult — but no corresponding test_ingest_service.py exists. The new CLI-level tests exercise the wiring end-to-end, but functions like _build_service_caption_params (which raises on caption_context_text_max_chars < 0), _build_service_dedup_params (raises when iou_threshold is set without enabled), and _split_config_for_auto_documents (mixed media type branching) are untested in isolation. Per the test-mirrors-source-structure and test-coverage-new-code rules, a companion test module is required.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 259-271
Comment:
**No unit tests for the new `ingest/service.py` module**
`nemo_retriever/ingest/service.py` introduces substantial new business logic — profile-based extract defaults, document-type validation, caption/dedup/chunk param builders with their own error conditions, `_split_config_for_auto_documents`, and `ServiceIngestExecutionResult` — but no corresponding `test_ingest_service.py` exists. The new CLI-level tests exercise the wiring end-to-end, but functions like `_build_service_caption_params` (which raises on `caption_context_text_max_chars < 0`), `_build_service_dedup_params` (raises when `iou_threshold` is set without `enabled`), and `_split_config_for_auto_documents` (mixed media type branching) are untested in isolation. Per the `test-mirrors-source-structure` and `test-coverage-new-code` rules, a companion test module is required.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Yes, the comment is valid. The PR description confirms that the tests that do exist (test_root_cli_workflow.py) are CLI-level workflow tests — they exercise the wiring end-to-end, but don't cover the isolated error conditions in the private helpers inside ingest/service.py.
Specifically, functions like _build_service_dedup_params (which should raise when iou_threshold is set but enabled=False) and _build_service_caption_params (which should raise on negative context_text_max_chars) have branching validation logic that an end-to-end CLI test is unlikely to fully exercise. The _split_config_for_auto_documents mixed-media branching is similarly tricky to cover through the CLI surface alone.
A code search also confirms no test_ingest_service.py exists in the repo yet.
That said, if your team's convention is that CLI-level integration tests are sufficient for these helpers, I'm happy to defer — just let me know.
| @dataclass(frozen=True) | ||
| class ServiceIngestExecutionResult: | ||
| request: ServiceIngestRequest | ||
| result: object | ||
| n_rows: int | None | ||
| result_n_rows: int | None | ||
| metadata: dict[str, Any] | ||
|
|
||
| @property | ||
| def documents(self) -> list[str]: | ||
| return list(self.request.documents) | ||
|
|
||
| @property | ||
| def service_url(self) -> str: | ||
| return self.request.connection.service_url | ||
|
|
||
| def to_summary_dict(self) -> dict[str, Any]: | ||
| return { | ||
| "run_mode": "service", | ||
| "documents": self.documents, | ||
| "service_url": self.service_url, | ||
| "result": self.result, | ||
| "n_rows": self.n_rows, | ||
| "result_n_rows": self.result_n_rows, | ||
| } |
There was a problem hiding this comment.
n_rows and result_n_rows are always identical, creating a confusing public API
In execute_service_ingest_request, both fields are set to the same value (_count_service_result_rows(result)) and both are emitted in to_summary_dict. If the intent was to distinguish "rows reported by the service" from "rows counted from the ingest result object", that distinction is not implemented — callers see two keys with the same number and no documented difference. The redundant field should either be removed, or the two fields should be sourced from different places with their distinction documented.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 134-158
Comment:
**`n_rows` and `result_n_rows` are always identical, creating a confusing public API**
In `execute_service_ingest_request`, both fields are set to the same value (`_count_service_result_rows(result)`) and both are emitted in `to_summary_dict`. If the intent was to distinguish "rows reported by the service" from "rows counted from the ingest result object", that distinction is not implemented — callers see two keys with the same number and no documented difference. The redundant field should either be removed, or the two fields should be sourced from different places with their distinction documented.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Yes, confirmed bug. In execute_service_ingest_request, both n_rows and result_n_rows are set to the same _count_service_result_rows(result) value. The docstring on ServiceIngestExecutionResult describes an intended distinction — result_n_rows counts rows from the service ingest result, while n_rows is meant to mirror a local row-count contract for CLI summaries — but the implementation doesn't reflect that; both just hold the service result count.
The fix depends on the intent:
Option A: If n_rows was always meant to equal result_n_rows for service mode (no local DB to verify), remove the duplicate and alias one to the other:
@dataclass(frozen=True)
class ServiceIngestExecutionResult:
request: ServiceIngestRequest
result: object
result_n_rows: int | None
metadata: dict[str, Any]
@property
def n_rows(self) -> int | None:
return self.result_n_rowsOption B: If there's a future plan to verify row count from a separate source (e.g., querying the vector DB after ingest), keep both fields but document it clearly in execute_service_ingest_request:
result_n_rows = _count_service_result_rows(result)
return ServiceIngestExecutionResult(
...
n_rows=result_n_rows, # mirrors result_n_rows; no local DB verification in service mode
result_n_rows=result_n_rows,
...
)Given the docstring already describes Option B's intent but the code doesn't implement the distinction, Option A is the cleaner fix until local verification is actually wired up.
| run_mode: str = typer.Option( | ||
| "inprocess", | ||
| "--run-mode", | ||
| help="Execution mode for the SDK ingestor. Defaults to inprocess; use batch for Ray Data scale-out.", | ||
| help=( | ||
| "Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, " | ||
| "or service for a remote retriever service." | ||
| ), | ||
| ), |
There was a problem hiding this comment.
run_mode changed from IngestRunModeValue to str, so Typer no longer renders valid choices in --help output (was [inprocess|batch], now just TEXT) and no longer performs automatic validation before the function body runs. Adding click.Choice preserves the Typer/Click help-text and validation ergonomics without sacrificing the new service value.
| run_mode: str = typer.Option( | |
| "inprocess", | |
| "--run-mode", | |
| help="Execution mode for the SDK ingestor. Defaults to inprocess; use batch for Ray Data scale-out.", | |
| help=( | |
| "Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, " | |
| "or service for a remote retriever service." | |
| ), | |
| ), | |
| run_mode: str = typer.Option( | |
| "inprocess", | |
| "--run-mode", | |
| click_type=click.Choice(["inprocess", "batch", "service"]), | |
| help=( | |
| "Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, " | |
| "or service for a remote retriever service." | |
| ), | |
| ), |
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/adapters/cli/main.py
Line: 200-207
Comment:
`run_mode` changed from `IngestRunModeValue` to `str`, so Typer no longer renders valid choices in `--help` output (was `[inprocess|batch]`, now just `TEXT`) and no longer performs automatic validation before the function body runs. Adding `click.Choice` preserves the Typer/Click help-text and validation ergonomics without sacrificing the new `service` value.
```suggestion
run_mode: str = typer.Option(
"inprocess",
"--run-mode",
click_type=click.Choice(["inprocess", "batch", "service"]),
help=(
"Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "
"or service for a remote retriever service."
),
),
```
How can I resolve this? If you propose a fix, please make it concise.| def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult: | ||
| result = build_service_ingestor(request).ingest() |
There was a problem hiding this comment.
Public functions
execute_service_ingest_request, expand_service_file_patterns, and service_split_config_for_request lack docstrings. Per the docstrings-public-interface rule, all public functions must describe their behaviour, parameters, returns, and any exceptions raised.
| def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult: | |
| result = build_service_ingestor(request).ingest() | |
| def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult: | |
| """Execute a resolved service ingest request and return a structured result. | |
| Args: | |
| request: A fully-resolved ``ServiceIngestRequest`` produced by | |
| ``resolve_service_ingest_request`` or constructed directly. | |
| Returns: | |
| A ``ServiceIngestExecutionResult`` containing the raw ingest result, | |
| the row count (when the result exposes a ``dataframe`` attribute), and | |
| connection metadata. | |
| Raises: | |
| ValueError: If no files matched the input patterns in *request*. | |
| """ | |
| result = build_service_ingestor(request).ingest() |
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 259-260
Comment:
Public functions `execute_service_ingest_request`, `expand_service_file_patterns`, and `service_split_config_for_request` lack docstrings. Per the `docstrings-public-interface` rule, all public functions must describe their behaviour, parameters, returns, and any exceptions raised.
```suggestion
def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
"""Execute a resolved service ingest request and return a structured result.
Args:
request: A fully-resolved ``ServiceIngestRequest`` produced by
``resolve_service_ingest_request`` or constructed directly.
Returns:
A ``ServiceIngestExecutionResult`` containing the raw ingest result,
the row count (when the result exposes a ``dataframe`` attribute), and
connection metadata.
Raises:
ValueError: If no files matched the input patterns in *request*.
"""
result = build_service_ingestor(request).ingest()
```
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
05bc509 to
e574193
Compare
This comment was marked as outdated.
This comment was marked as outdated.
592b556 to
137c57a
Compare
137c57a to
2ea4c0f
Compare
| def expand_service_file_patterns(documents: Sequence[str]) -> list[str]: | ||
| """Expand recursive file patterns for service ingest construction.""" | ||
|
|
||
| resolved_files: list[str] = [] | ||
| for pattern in documents: | ||
| resolved_files.extend(sorted(_glob.glob(str(pattern), recursive=True))) | ||
| return resolved_files |
There was a problem hiding this comment.
Silent drop of text chunking when all inputs are glob patterns
_split_config_for_auto_documents filters out any path where _glob.has_magic() is True. If a user passes a wildcard glob (e.g., "docs/**/*.pdf") as a service request document, every path in documents will have magic, input_types will be empty, split_config will be {}, and the function returns None — silently disabling text chunking even when --text-chunk is specified. The dry-run output would show "split_config": null with no warning. A validation error or at minimum a logged warning when chunking is requested but no concrete paths are available to detect media types would make this easier to diagnose.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 282-288
Comment:
**Silent drop of text chunking when all inputs are glob patterns**
`_split_config_for_auto_documents` filters out any path where `_glob.has_magic()` is `True`. If a user passes a wildcard glob (e.g., `"docs/**/*.pdf"`) as a service request document, every path in `documents` will have magic, `input_types` will be empty, `split_config` will be `{}`, and the function returns `None` — silently disabling text chunking even when `--text-chunk` is specified. The dry-run output would show `"split_config": null` with no warning. A validation error or at minimum a logged warning when chunking is requested but no concrete paths are available to detect media types would make this easier to diagnose.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
This needs attention, I think.
0265eab to
4c3cc25
Compare
…ervice-mode-cleanup # Conflicts: # nemo_retriever/docs/cli/README.md # nemo_retriever/src/nemo_retriever/cli/ingest/__init__.py # nemo_retriever/src/nemo_retriever/cli/ingest/app.py # nemo_retriever/src/nemo_retriever/cli/ingest/graph.py # nemo_retriever/src/nemo_retriever/cli/ingest/options.py # nemo_retriever/src/nemo_retriever/cli/ingest/service.py # nemo_retriever/src/nemo_retriever/cli/ingest/shared.py # nemo_retriever/src/nemo_retriever/cli/ingest_workflow.py # nemo_retriever/src/nemo_retriever/cli/main.py # nemo_retriever/src/nemo_retriever/cli/pipeline/__main__.py # nemo_retriever/src/nemo_retriever/cli/shared.py # nemo_retriever/tests/test_pipeline_helpers.py # nemo_retriever/tests/test_root_cli_workflow.py
Summary
Redesign the new root
retriever ingestCLI around a default local path plus explicit execution-mode subcommands:This keeps root ingest as a thin CLI wrapper over the existing ingest ownership model:
retriever ingest ...andlocaluse the graph ingest path withrun_mode="inprocess":IngestPlanRequest->resolve_ingest_plan(...)->run_ingest_workflow(...)/GraphIngestorbatchuses the same graph ingest path withrun_mode="batch"serviceuses the service client path:ServiceIngestPlanRequest->resolve_service_ingest_request(...)->run_service_ingest_workflow(...)/ServiceIngestorThere are no current users of root
retriever ingest, so this intentionally does not preserve the draftretriever ingest --run-mode ...shape.What Changed
ingestcommand with a Typer sub-app.retriever ingest DOCUMENTS...routes to local/inprocess ingest without duplicating local option handling.adapters/cli/ingest/:app.pyregisters the ingest app and default-local routing.graph.pyowns bothlocalandbatch.service.pyowns service-mode request construction.options.pycentralizes repeated Typer option metadata only.shared.pyowns ingest CLI execution/error/summary behavior.adapters/cli/shared.pyfor root CLI quiet/error helpers so query does not depend on ingest internals.serviceno longer exposes LanceDB target flags, Ray tuning, local endpoint config, local embed backend, OCR language, or local media controls.local,batch, and bareingestexpose only graph-mode options.retriever pipeline run --run-mode ...unchanged.Non-Goals
This PR does not add or change:
retriever query --serviceGraphIngestorbehaviorService-mode query support is being handled separately in draft PR NVIDIA/NeMo-Retriever#2228.
Validation
Focused tests:
Results:
test_root_cli_workflow.py: 51 passedJP20 ingest smoke:
retriever ingest local ... --profile fast-textretriever ingest batch ... --profile fast-textretriever ingest service ... --profile fast-textServiceIngestorReview Notes
The main review point is the CLI ownership split:
retriever ingest ...is only a local/inprocess convenience wrapper over the graph command