Move service ingest construction into ingest core by jioffe502 · Pull Request #2221 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-06-09T16:58:52Z

Summary

Redesign the new root retriever ingest CLI around a default local path plus explicit execution-mode subcommands:

retriever ingest DOCUMENTS...          # local/inprocess default
retriever ingest local DOCUMENTS...
retriever ingest batch DOCUMENTS...
retriever ingest service DOCUMENTS...

This keeps root ingest as a thin CLI wrapper over the existing ingest ownership model:

bare retriever ingest ... and local use the graph ingest path with run_mode="inprocess": IngestPlanRequest -> resolve_ingest_plan(...) -> run_ingest_workflow(...) / GraphIngestor
batch uses the same graph ingest path with run_mode="batch"
service uses the service client path: ServiceIngestPlanRequest -> resolve_service_ingest_request(...) -> run_service_ingest_workflow(...) / ServiceIngestor

There are no current users of root retriever ingest, so this intentionally does not preserve the draft retriever ingest --run-mode ... shape.

What Changed

Replaced the monolithic root ingest command with a Typer sub-app.
Added a thin default wrapper so retriever ingest DOCUMENTS... routes to local/inprocess ingest without duplicating local option handling.
Added focused CLI modules under adapters/cli/ingest/:
- app.py registers the ingest app and default-local routing.
- graph.py owns both local and batch.
- service.py owns service-mode request construction.
- options.py centralizes repeated Typer option metadata only.
- shared.py owns ingest CLI execution/error/summary behavior.
Added adapters/cli/shared.py for root CLI quiet/error helpers so query does not depend on ingest internals.
Deleted the giant service-mode invalid-option denylist by making invalid flags unrepresentable:
- service no longer exposes LanceDB target flags, Ray tuning, local endpoint config, local embed backend, OCR language, or local media controls.
- local, batch, and bare ingest expose only graph-mode options.
Kept retriever pipeline run --run-mode ... unchanged.

Non-Goals

This PR does not add or change:

service query
root retriever query --service
eval, harness, BEIR/QA/audio recall logic
pipeline reporting
core GraphIngestor behavior
service VectorDB/query behavior

Service-mode query support is being handled separately in draft PR NVIDIA/NeMo-Retriever#2228.

Validation

Focused tests:

git diff --check
/localhome/local-jioffe/.local/venvs/pre-commit/bin/pre-commit run --files <touched files>
/localhome/local-jioffe/retriever-skills/nemo_retriever/.venv/bin/python -m pytest \
  nemo_retriever/tests/test_root_cli_workflow.py
/localhome/local-jioffe/retriever-skills/nemo_retriever/.venv/bin/python -m pytest \
  nemo_retriever/tests/test_query_workflow_options.py \
  nemo_retriever/tests/test_root_query_cli.py

Results:

test_root_cli_workflow.py: 51 passed
query boundary tests: 11 passed
pre-commit passed on touched files
working tree clean before push

JP20 ingest smoke:

retriever ingest local ... --profile fast-text
- 20 files ingested
- 1884 LanceDB rows
retriever ingest batch ... --profile fast-text
- 20 files ingested
- 1884 LanceDB rows
retriever ingest service ... --profile fast-text
- local service health check passed
- 20 files ingested through ServiceIngestor
- 1940 service result rows
- service stopped after validation

Review Notes

The main review point is the CLI ownership split:

graph-mode CLI owns graph request construction
service-mode CLI owns service request construction
shared CLI helpers are narrow and do not build ingest requests
bare retriever ingest ... is only a local/inprocess convenience wrapper over the graph command
invalid public flag combinations are prevented by command shape rather than runtime rejection

greptile-apps · 2026-06-09T17:05:12Z

+            ingestor = ingest_service.build_service_ingestor(
+                ingest_service.ServiceIngestRequest(
+                    documents=file_patterns,
+                    input_type=input_type,
+                    extract_params=extract_params,
+                    embed_params=embed_params,
+                    text_chunk_params=text_chunk_params,
+                    enable_text_chunk=enable_text_chunk,
+                    dedup_params=DedupParams(iou_threshold=dedup_iou_threshold) if enable_dedup else None,
+                    caption_params=(
+                        CaptionParams(
+                            context_text_max_chars=caption_context_text_max_chars,
+                            temperature=caption_temperature,
+                            top_p=caption_top_p,
+                            max_tokens=caption_max_tokens,
+                        )
+                        if enable_caption
+                        else None
+                    ),
+                    store_params=StoreParams(storage_uri=store_images_uri) if store_images_uri is not None else None,
+                    connection=ingest_service.ServiceIngestConnectionOptions(
+                        service_url=service_url,
+                        service_concurrency=service_concurrency,
+                        service_api_token=service_api_token,
+                    ),
+                )
            )


ValueError from empty file set is unhandled in the legacy pipeline path

The old _build_service_ingestor raised typer.BadParameter when no files matched the glob patterns; the replacement build_service_ingestor raises ValueError instead. The run() function here has only a try/finally (no except), so any ValueError propagates unhandled to Typer and produces a raw Python traceback rather than a clean error message. This regression is reachable when _resolve_file_patterns constructs a valid glob like dir/**/*.pdf but the directory is empty — the upstream helper validates path existence, not that the glob actually matches files.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py Line: 1326-1352 Comment: **`ValueError` from empty file set is unhandled in the legacy pipeline path** The old `_build_service_ingestor` raised `typer.BadParameter` when no files matched the glob patterns; the replacement `build_service_ingestor` raises `ValueError` instead. The `run()` function here has only a `try/finally` (no `except`), so any `ValueError` propagates unhandled to Typer and produces a raw Python traceback rather than a clean error message. This regression is reachable when `_resolve_file_patterns` constructs a valid glob like `dir/**/*.pdf` but the directory is empty — the upstream helper validates path existence, not that the glob actually matches files. How can I resolve this? If you propose a fix, please make it concise.

This seems valid?

greptile-apps · 2026-06-09T17:05:13Z

+def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
+    result = build_service_ingestor(request).ingest()
+    result_n_rows = _count_service_result_rows(result)
+    return ServiceIngestExecutionResult(
+        request=request,
+        result=result,
+        n_rows=result_n_rows,
+        result_n_rows=result_n_rows,
+        metadata={
+            "service_url": request.connection.service_url,
+            "input_type": request.input_type,
+        },
+    )


No unit tests for the new ingest/service.py module

nemo_retriever/ingest/service.py introduces substantial new business logic — profile-based extract defaults, document-type validation, caption/dedup/chunk param builders with their own error conditions, _split_config_for_auto_documents, and ServiceIngestExecutionResult — but no corresponding test_ingest_service.py exists. The new CLI-level tests exercise the wiring end-to-end, but functions like _build_service_caption_params (which raises on caption_context_text_max_chars < 0), _build_service_dedup_params (raises when iou_threshold is set without enabled), and _split_config_for_auto_documents (mixed media type branching) are untested in isolation. Per the test-mirrors-source-structure and test-coverage-new-code rules, a companion test module is required.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/ingest/service.py Line: 259-271 Comment: **No unit tests for the new `ingest/service.py` module** `nemo_retriever/ingest/service.py` introduces substantial new business logic — profile-based extract defaults, document-type validation, caption/dedup/chunk param builders with their own error conditions, `_split_config_for_auto_documents`, and `ServiceIngestExecutionResult` — but no corresponding `test_ingest_service.py` exists. The new CLI-level tests exercise the wiring end-to-end, but functions like `_build_service_caption_params` (which raises on `caption_context_text_max_chars < 0`), `_build_service_dedup_params` (raises when `iou_threshold` is set without `enabled`), and `_split_config_for_auto_documents` (mixed media type branching) are untested in isolation. Per the `test-mirrors-source-structure` and `test-coverage-new-code` rules, a companion test module is required. How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Yes, the comment is valid. The PR description confirms that the tests that do exist (test_root_cli_workflow.py) are CLI-level workflow tests — they exercise the wiring end-to-end, but don't cover the isolated error conditions in the private helpers inside ingest/service.py.

Specifically, functions like _build_service_dedup_params (which should raise when iou_threshold is set but enabled=False) and _build_service_caption_params (which should raise on negative context_text_max_chars) have branching validation logic that an end-to-end CLI test is unlikely to fully exercise. The _split_config_for_auto_documents mixed-media branching is similarly tricky to cover through the CLI surface alone.

A code search also confirms no test_ingest_service.py exists in the repo yet.

That said, if your team's convention is that CLI-level integration tests are sufficient for these helpers, I'm happy to defer — just let me know.

greptile-apps · 2026-06-09T17:05:14Z

+@dataclass(frozen=True)
+class ServiceIngestExecutionResult:
+    request: ServiceIngestRequest
+    result: object
+    n_rows: int | None
+    result_n_rows: int | None
+    metadata: dict[str, Any]
+
+    @property
+    def documents(self) -> list[str]:
+        return list(self.request.documents)
+
+    @property
+    def service_url(self) -> str:
+        return self.request.connection.service_url
+
+    def to_summary_dict(self) -> dict[str, Any]:
+        return {
+            "run_mode": "service",
+            "documents": self.documents,
+            "service_url": self.service_url,
+            "result": self.result,
+            "n_rows": self.n_rows,
+            "result_n_rows": self.result_n_rows,
+        }


n_rows and result_n_rows are always identical, creating a confusing public API

In execute_service_ingest_request, both fields are set to the same value (_count_service_result_rows(result)) and both are emitted in to_summary_dict. If the intent was to distinguish "rows reported by the service" from "rows counted from the ingest result object", that distinction is not implemented — callers see two keys with the same number and no documented difference. The redundant field should either be removed, or the two fields should be sourced from different places with their distinction documented.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/ingest/service.py Line: 134-158 Comment: **`n_rows` and `result_n_rows` are always identical, creating a confusing public API** In `execute_service_ingest_request`, both fields are set to the same value (`_count_service_result_rows(result)`) and both are emitted in `to_summary_dict`. If the intent was to distinguish "rows reported by the service" from "rows counted from the ingest result object", that distinction is not implemented — callers see two keys with the same number and no documented difference. The redundant field should either be removed, or the two fields should be sourced from different places with their distinction documented. How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

This seems like a bug

Yes, confirmed bug. In execute_service_ingest_request, both n_rows and result_n_rows are set to the same _count_service_result_rows(result) value. The docstring on ServiceIngestExecutionResult describes an intended distinction — result_n_rows counts rows from the service ingest result, while n_rows is meant to mirror a local row-count contract for CLI summaries — but the implementation doesn't reflect that; both just hold the service result count.

The fix depends on the intent:

Option A: If n_rows was always meant to equal result_n_rows for service mode (no local DB to verify), remove the duplicate and alias one to the other:

@dataclass(frozen=True) class ServiceIngestExecutionResult: request: ServiceIngestRequest result: object result_n_rows: int | None metadata: dict[str, Any] @property def n_rows(self) -> int | None: return self.result_n_rows

Option B: If there's a future plan to verify row count from a separate source (e.g., querying the vector DB after ingest), keep both fields but document it clearly in execute_service_ingest_request:

result_n_rows = _count_service_result_rows(result) return ServiceIngestExecutionResult( ... n_rows=result_n_rows, # mirrors result_n_rows; no local DB verification in service mode result_n_rows=result_n_rows, ... )

Given the docstring already describes Option B's intent but the code doesn't implement the distinction, Option A is the cleaner fix until local verification is actually wired up.

greptile-apps · 2026-06-09T17:05:15Z

+    run_mode: str = typer.Option(
        "inprocess",
        "--run-mode",
-        help="Execution mode for the SDK ingestor. Defaults to inprocess; use batch for Ray Data scale-out.",
+        help=(
+            "Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "
+            "or service for a remote retriever service."
+        ),
+    ),


run_mode changed from IngestRunModeValue to str, so Typer no longer renders valid choices in --help output (was [inprocess|batch], now just TEXT) and no longer performs automatic validation before the function body runs. Adding click.Choice preserves the Typer/Click help-text and validation ergonomics without sacrificing the new service value.

Suggested change

run_mode: str = typer.Option(

"inprocess",

"--run-mode",

help="Execution mode for the SDK ingestor. Defaults to inprocess; use batch for Ray Data scale-out.",

help=(

"Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "

"or service for a remote retriever service."

),

),

run_mode: str = typer.Option(

"inprocess",

"--run-mode",

click_type=click.Choice(["inprocess", "batch", "service"]),

help=(

"Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "

"or service for a remote retriever service."

),

),

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/adapters/cli/main.py Line: 200-207 Comment: `run_mode` changed from `IngestRunModeValue` to `str`, so Typer no longer renders valid choices in `--help` output (was `[inprocess|batch]`, now just `TEXT`) and no longer performs automatic validation before the function body runs. Adding `click.Choice` preserves the Typer/Click help-text and validation ergonomics without sacrificing the new `service` value. ```suggestion run_mode: str = typer.Option( "inprocess", "--run-mode", click_type=click.Choice(["inprocess", "batch", "service"]), help=( "Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, " "or service for a remote retriever service." ), ), ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-06-09T17:05:16Z

+def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
+    result = build_service_ingestor(request).ingest()


Public functions execute_service_ingest_request, expand_service_file_patterns, and service_split_config_for_request lack docstrings. Per the docstrings-public-interface rule, all public functions must describe their behaviour, parameters, returns, and any exceptions raised.

Suggested change

def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:

result = build_service_ingestor(request).ingest()

def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:

"""Execute a resolved service ingest request and return a structured result.

Args:

request: A fully-resolved ``ServiceIngestRequest`` produced by

``resolve_service_ingest_request`` or constructed directly.

Returns:

A ``ServiceIngestExecutionResult`` containing the raw ingest result,

the row count (when the result exposes a ``dataframe`` attribute), and

connection metadata.

Raises:

ValueError: If no files matched the input patterns in *request*.

"""

result = build_service_ingestor(request).ingest()

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/ingest/service.py Line: 259-260 Comment: Public functions `execute_service_ingest_request`, `expand_service_file_patterns`, and `service_split_config_for_request` lack docstrings. Per the `docstrings-public-interface` rule, all public functions must describe their behaviour, parameters, returns, and any exceptions raised. ```suggestion def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult: """Execute a resolved service ingest request and return a structured result. Args: request: A fully-resolved ``ServiceIngestRequest`` produced by ``resolve_service_ingest_request`` or constructed directly. Returns: A ``ServiceIngestExecutionResult`` containing the raw ingest result, the row count (when the result exposes a ``dataframe`` attribute), and connection metadata. Raises: ValueError: If no files matched the input patterns in *request*. """ result = build_service_ingestor(request).ingest() ``` How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-10T20:06:46Z

+def expand_service_file_patterns(documents: Sequence[str]) -> list[str]:
+    """Expand recursive file patterns for service ingest construction."""
+
+    resolved_files: list[str] = []
+    for pattern in documents:
+        resolved_files.extend(sorted(_glob.glob(str(pattern), recursive=True)))
+    return resolved_files


Silent drop of text chunking when all inputs are glob patterns

_split_config_for_auto_documents filters out any path where _glob.has_magic() is True. If a user passes a wildcard glob (e.g., "docs/**/*.pdf") as a service request document, every path in documents will have magic, input_types will be empty, split_config will be {}, and the function returns None — silently disabling text chunking even when --text-chunk is specified. The dry-run output would show "split_config": null with no warning. A validation error or at minimum a logged warning when chunking is requested but no concrete paths are available to detect media types would make this easier to diagnose.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/ingest/service.py Line: 282-288 Comment: **Silent drop of text chunking when all inputs are glob patterns** `_split_config_for_auto_documents` filters out any path where `_glob.has_magic()` is `True`. If a user passes a wildcard glob (e.g., `"docs/**/*.pdf"`) as a service request document, every path in `documents` will have magic, `input_types` will be empty, `split_config` will be `{}`, and the function returns `None` — silently disabling text chunking even when `--text-chunk` is specified. The dry-run output would show `"split_config": null` with no warning. A validation error or at minimum a logged warning when chunking is requested but no concrete paths are available to detect media types would make this easier to diagnose. How can I resolve this? If you propose a fix, please make it concise.

This needs attention, I think.

…ervice-mode-cleanup # Conflicts: # nemo_retriever/docs/cli/README.md # nemo_retriever/src/nemo_retriever/cli/ingest/__init__.py # nemo_retriever/src/nemo_retriever/cli/ingest/app.py # nemo_retriever/src/nemo_retriever/cli/ingest/graph.py # nemo_retriever/src/nemo_retriever/cli/ingest/options.py # nemo_retriever/src/nemo_retriever/cli/ingest/service.py # nemo_retriever/src/nemo_retriever/cli/ingest/shared.py # nemo_retriever/src/nemo_retriever/cli/ingest_workflow.py # nemo_retriever/src/nemo_retriever/cli/main.py # nemo_retriever/src/nemo_retriever/cli/pipeline/__main__.py # nemo_retriever/src/nemo_retriever/cli/shared.py # nemo_retriever/tests/test_pipeline_helpers.py # nemo_retriever/tests/test_root_cli_workflow.py

jioffe502 requested review from a team as code owners June 9, 2026 16:58

jioffe502 requested a review from edknv June 9, 2026 16:58

This comment was marked as outdated.

Sign in to view

greptile-apps Bot reviewed Jun 9, 2026

View reviewed changes

jioffe502 force-pushed the codex/root-ingest-service-mode branch from 05bc509 to e574193 Compare June 9, 2026 17:39

This comment was marked as outdated.

Sign in to view

jioffe502 force-pushed the codex/root-ingest-service-mode branch 3 times, most recently from 592b556 to 137c57a Compare June 9, 2026 20:52

jioffe502 mentioned this pull request Jun 9, 2026

Split root CLI command implementations out of adapters/cli/main.py #2224

Open

jioffe502 added 2 commits June 10, 2026 19:28

Move service ingest construction into ingest core

c00df36

Redesign root ingest CLI subcommands

2ea4c0f

jioffe502 force-pushed the codex/root-ingest-service-mode branch from 137c57a to 2ea4c0f Compare June 10, 2026 20:01

greptile-apps Bot reviewed Jun 10, 2026

View reviewed changes

jioffe502 and others added 3 commits June 10, 2026 21:07

Handle service chunking globs

746cbba

Default root ingest to local mode

dc29d6f

Merge branch 'main' into codex/root-ingest-service-mode

91969c3

jioffe502 force-pushed the codex/root-ingest-service-mode branch from 0265eab to 4c3cc25 Compare June 11, 2026 22:28

jioffe502 added 7 commits June 11, 2026 22:29

Privatize ingest CLI callbacks

4c3cc25

Document root ingest CLI redesign

9868c17

Polish root ingest docs for public review

3a2965d

Restore release notes

d80fa67

Remove PR references from ingest design docs

1b6aaee

Clean up root ingest graph CLI adapter

9c85d90

		def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
		result = build_service_ingestor(request).ingest()

-def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
-    result = build_service_ingestor(request).ingest()
+def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
+    """Execute a resolved service ingest request and return a structured result.
+    Args:
+        request: A fully-resolved ``ServiceIngestRequest`` produced by
+            ``resolve_service_ingest_request`` or constructed directly.
+    Returns:
+        A ``ServiceIngestExecutionResult`` containing the raw ingest result,
+        the row count (when the result exposes a ``dataframe`` attribute), and
+        connection metadata.
+    Raises:
+        ValueError: If no files matched the input patterns in *request*.
+    """
+    result = build_service_ingestor(request).ingest()

Conversation

jioffe502 commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

Non-Goals

Validation

Review Notes

Uh oh!

This comment was marked as outdated.

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

jperez999 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

jperez999 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

jperez999 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

greptile-apps Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

jperez999 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jioffe502 commented Jun 9, 2026 •

edited

Loading