Skip to content

Move service ingest construction into ingest core#2221

Open
jioffe502 wants to merge 12 commits into
NVIDIA:mainfrom
jioffe502:codex/root-ingest-service-mode
Open

Move service ingest construction into ingest core#2221
jioffe502 wants to merge 12 commits into
NVIDIA:mainfrom
jioffe502:codex/root-ingest-service-mode

Conversation

@jioffe502

@jioffe502 jioffe502 commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

Redesign the new root retriever ingest CLI around a default local path plus explicit execution-mode subcommands:

retriever ingest DOCUMENTS...          # local/inprocess default
retriever ingest local DOCUMENTS...
retriever ingest batch DOCUMENTS...
retriever ingest service DOCUMENTS...

This keeps root ingest as a thin CLI wrapper over the existing ingest ownership model:

  • bare retriever ingest ... and local use the graph ingest path with run_mode="inprocess": IngestPlanRequest -> resolve_ingest_plan(...) -> run_ingest_workflow(...) / GraphIngestor
  • batch uses the same graph ingest path with run_mode="batch"
  • service uses the service client path: ServiceIngestPlanRequest -> resolve_service_ingest_request(...) -> run_service_ingest_workflow(...) / ServiceIngestor

There are no current users of root retriever ingest, so this intentionally does not preserve the draft retriever ingest --run-mode ... shape.

What Changed

  • Replaced the monolithic root ingest command with a Typer sub-app.
  • Added a thin default wrapper so retriever ingest DOCUMENTS... routes to local/inprocess ingest without duplicating local option handling.
  • Added focused CLI modules under adapters/cli/ingest/:
    • app.py registers the ingest app and default-local routing.
    • graph.py owns both local and batch.
    • service.py owns service-mode request construction.
    • options.py centralizes repeated Typer option metadata only.
    • shared.py owns ingest CLI execution/error/summary behavior.
  • Added adapters/cli/shared.py for root CLI quiet/error helpers so query does not depend on ingest internals.
  • Deleted the giant service-mode invalid-option denylist by making invalid flags unrepresentable:
    • service no longer exposes LanceDB target flags, Ray tuning, local endpoint config, local embed backend, OCR language, or local media controls.
    • local, batch, and bare ingest expose only graph-mode options.
  • Kept retriever pipeline run --run-mode ... unchanged.

Non-Goals

This PR does not add or change:

  • service query
  • root retriever query --service
  • eval, harness, BEIR/QA/audio recall logic
  • pipeline reporting
  • core GraphIngestor behavior
  • service VectorDB/query behavior

Service-mode query support is being handled separately in draft PR NVIDIA/NeMo-Retriever#2228.

Validation

Focused tests:

git diff --check
/localhome/local-jioffe/.local/venvs/pre-commit/bin/pre-commit run --files <touched files>
/localhome/local-jioffe/retriever-skills/nemo_retriever/.venv/bin/python -m pytest \
  nemo_retriever/tests/test_root_cli_workflow.py
/localhome/local-jioffe/retriever-skills/nemo_retriever/.venv/bin/python -m pytest \
  nemo_retriever/tests/test_query_workflow_options.py \
  nemo_retriever/tests/test_root_query_cli.py

Results:

  • test_root_cli_workflow.py: 51 passed
  • query boundary tests: 11 passed
  • pre-commit passed on touched files
  • working tree clean before push

JP20 ingest smoke:

  • retriever ingest local ... --profile fast-text
    • 20 files ingested
    • 1884 LanceDB rows
  • retriever ingest batch ... --profile fast-text
    • 20 files ingested
    • 1884 LanceDB rows
  • retriever ingest service ... --profile fast-text
    • local service health check passed
    • 20 files ingested through ServiceIngestor
    • 1940 service result rows
    • service stopped after validation

Review Notes

The main review point is the CLI ownership split:

  • graph-mode CLI owns graph request construction
  • service-mode CLI owns service request construction
  • shared CLI helpers are narrow and do not build ingest requests
  • bare retriever ingest ... is only a local/inprocess convenience wrapper over the graph command
  • invalid public flag combinations are prevented by command shape rather than runtime rejection

@jioffe502 jioffe502 requested review from a team as code owners June 9, 2026 16:58
@jioffe502 jioffe502 requested a review from edknv June 9, 2026 16:58
@greptile-apps

This comment was marked as outdated.

Comment on lines 1326 to 1352
ingestor = ingest_service.build_service_ingestor(
ingest_service.ServiceIngestRequest(
documents=file_patterns,
input_type=input_type,
extract_params=extract_params,
embed_params=embed_params,
text_chunk_params=text_chunk_params,
enable_text_chunk=enable_text_chunk,
dedup_params=DedupParams(iou_threshold=dedup_iou_threshold) if enable_dedup else None,
caption_params=(
CaptionParams(
context_text_max_chars=caption_context_text_max_chars,
temperature=caption_temperature,
top_p=caption_top_p,
max_tokens=caption_max_tokens,
)
if enable_caption
else None
),
store_params=StoreParams(storage_uri=store_images_uri) if store_images_uri is not None else None,
connection=ingest_service.ServiceIngestConnectionOptions(
service_url=service_url,
service_concurrency=service_concurrency,
service_api_token=service_api_token,
),
)
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ValueError from empty file set is unhandled in the legacy pipeline path

The old _build_service_ingestor raised typer.BadParameter when no files matched the glob patterns; the replacement build_service_ingestor raises ValueError instead. The run() function here has only a try/finally (no except), so any ValueError propagates unhandled to Typer and produces a raw Python traceback rather than a clean error message. This regression is reachable when _resolve_file_patterns constructs a valid glob like dir/**/*.pdf but the directory is empty — the upstream helper validates path existence, not that the glob actually matches files.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/pipeline/__main__.py
Line: 1326-1352

Comment:
**`ValueError` from empty file set is unhandled in the legacy pipeline path**

The old `_build_service_ingestor` raised `typer.BadParameter` when no files matched the glob patterns; the replacement `build_service_ingestor` raises `ValueError` instead. The `run()` function here has only a `try/finally` (no `except`), so any `ValueError` propagates unhandled to Typer and produces a raw Python traceback rather than a clean error message. This regression is reachable when `_resolve_file_patterns` constructs a valid glob like `dir/**/*.pdf` but the directory is empty — the upstream helper validates path existence, not that the glob actually matches files.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems valid?

Comment on lines +259 to +271
def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
result = build_service_ingestor(request).ingest()
result_n_rows = _count_service_result_rows(result)
return ServiceIngestExecutionResult(
request=request,
result=result,
n_rows=result_n_rows,
result_n_rows=result_n_rows,
metadata={
"service_url": request.connection.service_url,
"input_type": request.input_type,
},
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 No unit tests for the new ingest/service.py module

nemo_retriever/ingest/service.py introduces substantial new business logic — profile-based extract defaults, document-type validation, caption/dedup/chunk param builders with their own error conditions, _split_config_for_auto_documents, and ServiceIngestExecutionResult — but no corresponding test_ingest_service.py exists. The new CLI-level tests exercise the wiring end-to-end, but functions like _build_service_caption_params (which raises on caption_context_text_max_chars < 0), _build_service_dedup_params (raises when iou_threshold is set without enabled), and _split_config_for_auto_documents (mixed media type branching) are untested in isolation. Per the test-mirrors-source-structure and test-coverage-new-code rules, a companion test module is required.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 259-271

Comment:
**No unit tests for the new `ingest/service.py` module**

`nemo_retriever/ingest/service.py` introduces substantial new business logic — profile-based extract defaults, document-type validation, caption/dedup/chunk param builders with their own error conditions, `_split_config_for_auto_documents`, and `ServiceIngestExecutionResult` — but no corresponding `test_ingest_service.py` exists. The new CLI-level tests exercise the wiring end-to-end, but functions like `_build_service_caption_params` (which raises on `caption_context_text_max_chars < 0`), `_build_service_dedup_params` (raises when `iou_threshold` is set without `enabled`), and `_split_config_for_auto_documents` (mixed media type branching) are untested in isolation. Per the `test-mirrors-source-structure` and `test-coverage-new-code` rules, a companion test module is required.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the comment is valid. The PR description confirms that the tests that do exist (test_root_cli_workflow.py) are CLI-level workflow tests — they exercise the wiring end-to-end, but don't cover the isolated error conditions in the private helpers inside ingest/service.py.

Specifically, functions like _build_service_dedup_params (which should raise when iou_threshold is set but enabled=False) and _build_service_caption_params (which should raise on negative context_text_max_chars) have branching validation logic that an end-to-end CLI test is unlikely to fully exercise. The _split_config_for_auto_documents mixed-media branching is similarly tricky to cover through the CLI surface alone.

A code search also confirms no test_ingest_service.py exists in the repo yet.

That said, if your team's convention is that CLI-level integration tests are sufficient for these helpers, I'm happy to defer — just let me know.

Comment on lines +134 to +158
@dataclass(frozen=True)
class ServiceIngestExecutionResult:
request: ServiceIngestRequest
result: object
n_rows: int | None
result_n_rows: int | None
metadata: dict[str, Any]

@property
def documents(self) -> list[str]:
return list(self.request.documents)

@property
def service_url(self) -> str:
return self.request.connection.service_url

def to_summary_dict(self) -> dict[str, Any]:
return {
"run_mode": "service",
"documents": self.documents,
"service_url": self.service_url,
"result": self.result,
"n_rows": self.n_rows,
"result_n_rows": self.result_n_rows,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 n_rows and result_n_rows are always identical, creating a confusing public API

In execute_service_ingest_request, both fields are set to the same value (_count_service_result_rows(result)) and both are emitted in to_summary_dict. If the intent was to distinguish "rows reported by the service" from "rows counted from the ingest result object", that distinction is not implemented — callers see two keys with the same number and no documented difference. The redundant field should either be removed, or the two fields should be sourced from different places with their distinction documented.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 134-158

Comment:
**`n_rows` and `result_n_rows` are always identical, creating a confusing public API**

In `execute_service_ingest_request`, both fields are set to the same value (`_count_service_result_rows(result)`) and both are emitted in `to_summary_dict`. If the intent was to distinguish "rows reported by the service" from "rows counted from the ingest result object", that distinction is not implemented — callers see two keys with the same number and no documented difference. The redundant field should either be removed, or the two fields should be sourced from different places with their distinction documented.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bug

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, confirmed bug. In execute_service_ingest_request, both n_rows and result_n_rows are set to the same _count_service_result_rows(result) value. The docstring on ServiceIngestExecutionResult describes an intended distinction — result_n_rows counts rows from the service ingest result, while n_rows is meant to mirror a local row-count contract for CLI summaries — but the implementation doesn't reflect that; both just hold the service result count.

The fix depends on the intent:

Option A: If n_rows was always meant to equal result_n_rows for service mode (no local DB to verify), remove the duplicate and alias one to the other:

@dataclass(frozen=True)
class ServiceIngestExecutionResult:
    request: ServiceIngestRequest
    result: object
    result_n_rows: int | None
    metadata: dict[str, Any]

    @property
    def n_rows(self) -> int | None:
        return self.result_n_rows

Option B: If there's a future plan to verify row count from a separate source (e.g., querying the vector DB after ingest), keep both fields but document it clearly in execute_service_ingest_request:

result_n_rows = _count_service_result_rows(result)
return ServiceIngestExecutionResult(
    ...
    n_rows=result_n_rows,        # mirrors result_n_rows; no local DB verification in service mode
    result_n_rows=result_n_rows,
    ...
)

Given the docstring already describes Option B's intent but the code doesn't implement the distinction, Option A is the cleaner fix until local verification is actually wired up.

Comment on lines +200 to +207
run_mode: str = typer.Option(
"inprocess",
"--run-mode",
help="Execution mode for the SDK ingestor. Defaults to inprocess; use batch for Ray Data scale-out.",
help=(
"Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "
"or service for a remote retriever service."
),
),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 run_mode changed from IngestRunModeValue to str, so Typer no longer renders valid choices in --help output (was [inprocess|batch], now just TEXT) and no longer performs automatic validation before the function body runs. Adding click.Choice preserves the Typer/Click help-text and validation ergonomics without sacrificing the new service value.

Suggested change
run_mode: str = typer.Option(
"inprocess",
"--run-mode",
help="Execution mode for the SDK ingestor. Defaults to inprocess; use batch for Ray Data scale-out.",
help=(
"Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "
"or service for a remote retriever service."
),
),
run_mode: str = typer.Option(
"inprocess",
"--run-mode",
click_type=click.Choice(["inprocess", "batch", "service"]),
help=(
"Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "
"or service for a remote retriever service."
),
),
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/adapters/cli/main.py
Line: 200-207

Comment:
`run_mode` changed from `IngestRunModeValue` to `str`, so Typer no longer renders valid choices in `--help` output (was `[inprocess|batch]`, now just `TEXT`) and no longer performs automatic validation before the function body runs. Adding `click.Choice` preserves the Typer/Click help-text and validation ergonomics without sacrificing the new `service` value.

```suggestion
    run_mode: str = typer.Option(
        "inprocess",
        "--run-mode",
        click_type=click.Choice(["inprocess", "batch", "service"]),
        help=(
            "Execution mode for ingest: inprocess (default), batch for Ray Data scale-out, "
            "or service for a remote retriever service."
        ),
    ),
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +259 to +260
def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
result = build_service_ingestor(request).ingest()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Public functions execute_service_ingest_request, expand_service_file_patterns, and service_split_config_for_request lack docstrings. Per the docstrings-public-interface rule, all public functions must describe their behaviour, parameters, returns, and any exceptions raised.

Suggested change
def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
result = build_service_ingestor(request).ingest()
def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
"""Execute a resolved service ingest request and return a structured result.
Args:
request: A fully-resolved ``ServiceIngestRequest`` produced by
``resolve_service_ingest_request`` or constructed directly.
Returns:
A ``ServiceIngestExecutionResult`` containing the raw ingest result,
the row count (when the result exposes a ``dataframe`` attribute), and
connection metadata.
Raises:
ValueError: If no files matched the input patterns in *request*.
"""
result = build_service_ingestor(request).ingest()
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 259-260

Comment:
Public functions `execute_service_ingest_request`, `expand_service_file_patterns`, and `service_split_config_for_request` lack docstrings. Per the `docstrings-public-interface` rule, all public functions must describe their behaviour, parameters, returns, and any exceptions raised.

```suggestion
def execute_service_ingest_request(request: ServiceIngestRequest) -> ServiceIngestExecutionResult:
    """Execute a resolved service ingest request and return a structured result.

    Args:
        request: A fully-resolved ``ServiceIngestRequest`` produced by
            ``resolve_service_ingest_request`` or constructed directly.

    Returns:
        A ``ServiceIngestExecutionResult`` containing the raw ingest result,
        the row count (when the result exposes a ``dataframe`` attribute), and
        connection metadata.

    Raises:
        ValueError: If no files matched the input patterns in *request*.
    """
    result = build_service_ingestor(request).ingest()
```

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@jioffe502 jioffe502 force-pushed the codex/root-ingest-service-mode branch from 05bc509 to e574193 Compare June 9, 2026 17:39
@jioffe502

This comment was marked as outdated.

@jioffe502 jioffe502 force-pushed the codex/root-ingest-service-mode branch 3 times, most recently from 592b556 to 137c57a Compare June 9, 2026 20:52
@jioffe502 jioffe502 force-pushed the codex/root-ingest-service-mode branch from 137c57a to 2ea4c0f Compare June 10, 2026 20:01
Comment on lines +282 to +288
def expand_service_file_patterns(documents: Sequence[str]) -> list[str]:
"""Expand recursive file patterns for service ingest construction."""

resolved_files: list[str] = []
for pattern in documents:
resolved_files.extend(sorted(_glob.glob(str(pattern), recursive=True)))
return resolved_files

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent drop of text chunking when all inputs are glob patterns

_split_config_for_auto_documents filters out any path where _glob.has_magic() is True. If a user passes a wildcard glob (e.g., "docs/**/*.pdf") as a service request document, every path in documents will have magic, input_types will be empty, split_config will be {}, and the function returns None — silently disabling text chunking even when --text-chunk is specified. The dry-run output would show "split_config": null with no warning. A validation error or at minimum a logged warning when chunking is requested but no concrete paths are available to detect media types would make this easier to diagnose.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/ingest/service.py
Line: 282-288

Comment:
**Silent drop of text chunking when all inputs are glob patterns**

`_split_config_for_auto_documents` filters out any path where `_glob.has_magic()` is `True`. If a user passes a wildcard glob (e.g., `"docs/**/*.pdf"`) as a service request document, every path in `documents` will have magic, `input_types` will be empty, `split_config` will be `{}`, and the function returns `None` — silently disabling text chunking even when `--text-chunk` is specified. The dry-run output would show `"split_config": null` with no warning. A validation error or at minimum a logged warning when chunking is requested but no concrete paths are available to detect media types would make this easier to diagnose.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs attention, I think.

@jioffe502 jioffe502 force-pushed the codex/root-ingest-service-mode branch from 0265eab to 4c3cc25 Compare June 11, 2026 22:28
…ervice-mode-cleanup

# Conflicts:
#	nemo_retriever/docs/cli/README.md
#	nemo_retriever/src/nemo_retriever/cli/ingest/__init__.py
#	nemo_retriever/src/nemo_retriever/cli/ingest/app.py
#	nemo_retriever/src/nemo_retriever/cli/ingest/graph.py
#	nemo_retriever/src/nemo_retriever/cli/ingest/options.py
#	nemo_retriever/src/nemo_retriever/cli/ingest/service.py
#	nemo_retriever/src/nemo_retriever/cli/ingest/shared.py
#	nemo_retriever/src/nemo_retriever/cli/ingest_workflow.py
#	nemo_retriever/src/nemo_retriever/cli/main.py
#	nemo_retriever/src/nemo_retriever/cli/pipeline/__main__.py
#	nemo_retriever/src/nemo_retriever/cli/shared.py
#	nemo_retriever/tests/test_pipeline_helpers.py
#	nemo_retriever/tests/test_root_cli_workflow.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants