Integrate clio-search agentic engine + IOWarp clio-core v2.0.0 into clio-agentic-search by SIslamMun · Pull Request #286 · iowarp/clio-kit

SIslamMun · 2026-06-04T14:35:10Z

Summary

Integrates the upgraded clio-search agentic retrieval engine into
clio-agentic-search, plus IOWarp clio-core v2.0.0 support and a round of
type/test hardening. The package gains the full clio-search feature set while
remaining backward compatible; all existing connectors, the API, async jobs and
telemetry are unchanged.

What's new

- Agentic retrieval: multi-hop observe→decide→act loop with LLM query
  rewriting (and a no-LLM fallback) and SI-unit variant inference
  (clio query --agentic --max-hops N --llm-rewrite)
- Four new connectors: HDF5 (hdf5_data), NetCDF (netcdf_data),
  IOWarp CTE, and NDP (National Data Platform, incl. structured CSV ingestion)
- Corpus-adaptive strategy: schema/metadata profiling drives per-query
  branch selection and content-quality filtering
- Structured ingestion: CSV/table detection and table-aware chunking
- Upgraded science engine: expanded SI canonicalization + formula
  normalization; science-aware operators (numeric range, unit match, formula)
- IOWarp clio-core v2.0.0: imports the renamed clio_cte_core_ext
  (legacy iowarp_core.wrp_cte_core_ext fallback) and performs the required
  client bring-up (chimaera_init + initialize_cte) before CTE access

Verification

- Migrated every clio-search source and test file (full file-set parity).
- 271 tests pass under both uv sync --dev (CI) and uv sync --all-extras --dev.
- uv run mypy src/ is clean (0 errors) in both environments.
- Ruff lint + format clean.
- CLI, API (/health, /version, /query, /metrics), and the quality gate
  (6/6 scientific checks) verified live.
- IOWarp connector verified end-to-end against a live clio-core v2.0.0 runtime
  (real blobs indexed and searched, not stubs); NDP verified with real CSV
  download/parse.
- Parent launcher (clio-kit search …) verified to forward the new flags;
  its help text refreshed.

Notes

- The SC26 paper benchmark scripts are intentionally not vendored (research
  artifacts, not part of the engine; would break the repo-wide ruff gate).
- New optional extras: hdf5, netcdf, llm, eval. The IOWarp connector
  needs the separately-installed iowarp-core wheel; without it connect()
  raises a clear message and CI is unaffected.

Commits

15 focused commits, built and tested incrementally (deps → engine core →
agentic → connectors → namespaces → CLI → docs → mypy → IOWarp v2 → audit fixes).

…apabilities

Adopt the clio-search dependency manifest ahead of the feature migration so the CI environment (uv sync --dev) carries the packages the upgraded engine needs:
- core: httpx, h5py, aiohttp, matplotlib
- extras: hdf5, netcdf (xarray/netCDF4), llm (anthropic/openai), eval
- mypy: ignore_missing_imports for h5py, xarray, netCDF4, anthropic, openai

Regenerate uv.lock accordingly. No source changes yet; existing test suite and lint remain green.

…e operators

Migrate the upgraded retrieval engine from clio-search. The measurement encode/decode format now carries a content-quality flag, which is a shared contract across indexing, storage, retrieval and connectors, so these move together as one coherent unit:
- indexing: content-quality scoring (quality.py), structured CSV/table chunking (csv_parser.py), expanded SI canonicalization + formula normalization (scientific.py)
- retrieval: corpus profiling, metadata/sample schema inference, branch strategy selection, and quality-aware scientific operators; coordinator now emits a branch_plan_selected trace stage
- storage: science measurement/formula tables and quality columns in duckdb_store; updated storage contracts
- connectors: filesystem and object_store updated for the new pipeline
- evals: quality_gate aligned to the new operators

Tests: 244 passing (was 138). Fixed a stale trace-sequence assertion in test_retrieval_flow that predated the branch_plan_selected stage (the same assertion is stale in clio-search). Lint/format applied to satisfy the clio-kit ruff gate.

…rewriting

Migrate the agentic retrieval layer from clio-search:
- query_rewriter.py: LLM-backed query expansion with SI-unit variant inference, plus a no-LLM FallbackQueryRewriter for environments without the optional llm extra
- agentic.py: AgenticRetriever wrapping the coordinator in an observe-decide-act loop with convergence/hop limits and multi-namespace fan-out

Removed an unused local in the multi-namespace loop (dead in clio-search; removal preserves runtime behavior) to satisfy the ruff F841 gate. Tests: 17 new (query rewriter + agentic retriever), full suite green.
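For readers unfamiliar with the pattern, the observe-decide-act loop can be sketched roughly as below. This is an illustration only, not the AgenticRetriever code: the names `agentic_search`, `search` and `rewrite` are hypothetical, and the convergence rule (stop when a hop adds no new hits, or the rewrite leaves the query unchanged) is an assumption.

```python
from typing import Callable, Iterable


def agentic_search(
    query: str,
    search: Callable[[str], Iterable[str]],
    rewrite: Callable[[str, list[str]], str],
    max_hops: int = 3,
) -> list[str]:
    """Hypothetical multi-hop loop: observe hits, decide to continue, act by rewriting."""
    seen: list[str] = []
    current = query
    for _ in range(max_hops):
        # Observe: run the current query and keep only hits not seen before.
        new_hits: list[str] = []
        for hit in search(current):
            if hit not in seen and hit not in new_hits:
                new_hits.append(hit)
        # Decide: a hop that contributes nothing new is treated as convergence.
        if not new_hits:
            break
        seen.extend(new_hits)
        # Act: rewrite the query using everything found so far.
        next_query = rewrite(current, seen)
        if next_query == current:
            break
        current = next_query
    return seen
```

The hop limit bounds the loop even if the rewriter keeps producing new queries, which is the role --max-hops plays on the command line.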

Migrate two new storage connectors from clio-search:
- hdf5: indexes HDF5 datasets/attributes via h5py (core dependency)
- netcdf: indexes NetCDF variables/coordinates via xarray (optional netcdf extra), guarded so the module imports cleanly when xarray is absent

Both reuse the shared indexing/retrieval pipeline. Tests added for each; NetCDF tests degrade gracefully without the optional extra, matching the CI environment (uv sync --dev).
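A guarded optional import of the kind described above usually looks something like the following. This is a generic sketch rather than the connector's code; `netcdf_available`, `open_dataset` and the error wording are hypothetical.

```python
from typing import Any

try:
    import xarray as xr
except ImportError:  # optional 'netcdf' extra not installed
    xr = None  # type: ignore[assignment]


def netcdf_available() -> bool:
    """Report whether the optional xarray dependency could be imported."""
    return xr is not None


def open_dataset(path: str) -> Any:
    """Fail at call time, not import time, when xarray is missing."""
    if xr is None:
        raise RuntimeError("NetCDF support needs the optional 'netcdf' extra (xarray).")
    return xr.open_dataset(path)
```

Because the failure is deferred to the first call, importing the module (and collecting its tests) works in an environment that only ran uv sync --dev.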

Migrate two federated-backend connectors from clio-search:
- iowarp: indexes blobs from an IOWarp content store
- ndp: indexes datasets from the National Data Platform, including structured CSV ingestion; MCP-backed discovery lives in ndp/mcp_client.py and imports the optional 'mcp' package lazily, so the connector module imports cleanly without it

Neither is wired into the default namespace registry, so the standard import path and test collection never require their optional dependencies. Removed an unused local in the NDP CSV chunker (dead in clio-search; removal preserves the indexed chunk content) to satisfy the ruff gate.

Wire the new science-format connectors into the namespace layer:
- namespace_config: default bundles for hdf5_data and netcdf_data
- namespace_registry: build_default_registry now provisions five namespaces (local_fs, object_s3, vector_qdrant, hdf5_data, netcdf_data)

Update the production-hardening registry test to assert the five-namespace default. Full suite green.

Expose the agentic retrieval loop through 'clio query':
- --agentic enables multi-hop retrieval
- --max-hops bounds the loop (default 3)
- --llm-rewrite opts into LLM query rewriting (falls back gracefully without the optional anthropic dependency)

Existing CLI tests remain green.
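To make the flag semantics concrete, here is a minimal argparse sketch of a `query` subcommand with these three options. Only the flag names and the default of 3 come from the commit message; the use of argparse, the `build_parser` helper and the positional `text` argument are assumptions and may not match the real CLI.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three agentic flags."""
    parser = argparse.ArgumentParser(prog="clio")
    sub = parser.add_subparsers(dest="command", required=True)
    query = sub.add_parser("query")
    query.add_argument("text")  # assumed positional query string
    query.add_argument("--agentic", action="store_true",
                       help="enable multi-hop retrieval")
    query.add_argument("--max-hops", type=int, default=3,
                       help="upper bound on hops in the agentic loop")
    query.add_argument("--llm-rewrite", action="store_true",
                       help="opt into LLM query rewriting")
    return parser
```

With both switches off by default, a plain query parses exactly as before and the multi-hop loop stays opt-in.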

Update the README capability list to cover the migrated features: agentic multi-hop retrieval with LLM query rewriting, corpus-adaptive strategy and content-quality filtering, structured CSV/table ingestion, and the HDF5, NetCDF, IOWarp and NDP connectors. Note the new clio query agentic flags. The SC26 paper benchmark drivers from clio-search (top-level benchmarks/) are intentionally not vendored into the monorepo package: they are research artifacts, are not part of the importable engine, and would otherwise be scanned by the repo-wide ruff gate.

Add a retrieval-pipeline overview, a connector table (namespace, default registry membership, required extra), an optional-extras table, and an agentic CLI example that makes clear the multi-hop loop is opt-in and a plain query is unchanged.

Clear the 28 advisory mypy errors carried over from clio-search, with type-only / behaviour-preserving changes:
- storage/duckdb_store: correct the measurement-row tuple annotation to include the quality field; add a _scalar_int helper so COUNT(*) queries no longer index a possibly-None fetchone() result
- storage/contracts: add the acceptable_quality parameter to the StorageAdapter protocol so it matches the DuckDB implementation
- connectors/iowarp: type the untyped iowarp_core client/tag handles as Any (fixes attr-defined), drop the now-needless type: ignore comments, and parameterise a bare list annotation
- connectors/hdf5, connectors/netcdf: drop type: ignore comments that are unused under the CI mypy config (h5py/xarray resolve to Any)
- connectors/ndp: annotate the mixed-value stats dict as dict[str, Any]; bind the parsed organizations list to a typed local

mypy src/ --ignore-missing-imports: 0 errors (was 28). Ruff, format and the full 271-test suite remain green; no runtime behaviour changed.
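The fetchone() issue is a common one: the call is typed as returning an optional row, so indexing it directly fails type checking. A helper in the spirit of _scalar_int might look like this. The sketch uses the standard-library sqlite3 module as a stand-in for DuckDB, and the name `scalar_int`, its signature and the table are hypothetical.

```python
import sqlite3
from typing import Any, Optional, Sequence


def scalar_int(row: Optional[Sequence[Any]]) -> int:
    """Return the first column of a fetchone() row as int, or 0 when there is no value."""
    if row is None or row[0] is None:
        return 0
    return int(row[0])


# Stand-in database; the real store is DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER)")
conn.executemany("INSERT INTO chunks VALUES (?)", [(1,), (2,), (3,)])
count = scalar_int(conn.execute("SELECT COUNT(*) FROM chunks").fetchone())
```

Centralising the None check keeps each COUNT(*) call site to a single expression and removes the need for per-call type: ignore comments.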

Add an Examples section with copy-pasteable, verified commands covering the new capabilities: scientific numeric-range queries (incl. cross-unit Pa/kPa matching), formula targeting, the agentic multi-hop loop, HDF5/NetCDF indexing via CLIO_HDF5_ROOT/CLIO_NETCDF_ROOT, and the HTTP API. Env-var names and the serve port (8000) verified against the CLI.
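Cross-unit matching works by comparing values only after conversion to a common base unit. The sketch below illustrates the idea for pressure; it is not the engine's canonicalization code, and the `TO_PASCAL` table and the helpers `to_pascal` and `in_range` are hypothetical.

```python
# Multipliers to the SI unit pascal; a deliberately tiny, illustrative table.
TO_PASCAL = {"Pa": 1.0, "kPa": 1e3, "MPa": 1e6}


def to_pascal(value: float, unit: str) -> float:
    """Convert a pressure reading to pascals."""
    return value * TO_PASCAL[unit]


def in_range(value: float, unit: str, low: float, high: float, range_unit: str) -> bool:
    """Check a reading against a range that may be expressed in a different unit."""
    canonical = to_pascal(value, unit)
    return to_pascal(low, range_unit) <= canonical <= to_pascal(high, range_unit)
```

Under this scheme a reading stored as 320000 Pa satisfies a 300 to 400 kPa query, while 101 kPa does not.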

…ext) clio-core v2.0.0 (pip install iowarp-core) rebrands the low-level CTE Python bindings: the module wrp_cte_core_ext (imported as iowarp_core.wrp_cte_core_ext) is now the top-level clio_cte_core_ext. The blob/tag API the connector relies on (get_cte_client, Tag, BlobQuery, GetContainedBlobs, GetBlobSize, GetBlob, PoolQuery.Dynamic) is unchanged across versions — verified against the v2.0.0 core_bindings.cc in iowarp/clio-core. Import the new module first and fall back to the legacy wheel, so the connector works with both clio-core v2.0.0+ and older iowarp_core builds. Update the module docstring and the not-installed error message accordingly. Verified end-to-end against fakes for both module names (index -> lexical + scientific search -> citation) and the no-bindings path (clear RuntimeError). Ruff, format, mypy clean; full suite 271 passed.
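The import-new-then-fall-back-to-legacy approach can be sketched as follows. The two module names come from the commit message; the helper `load_first_available`, the `CTE_MODULE_CANDIDATES` tuple and the error wording are hypothetical and not the connector's actual code.

```python
import importlib
from types import ModuleType
from typing import Sequence

# Preferred name first, legacy wheel second.
CTE_MODULE_CANDIDATES = ("clio_cte_core_ext", "iowarp_core.wrp_cte_core_ext")


def load_first_available(candidates: Sequence[str]) -> ModuleType:
    """Import and return the first candidate module that is installed."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise RuntimeError(
        "IOWarp CTE bindings not found; tried: " + ", ".join(candidates)
    )
```

Calling `load_first_available(CTE_MODULE_CANDIDATES)` inside connect(), rather than at module import, would keep the failure to a clear RuntimeError when neither wheel is installed.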

…rified)

Running against a live clio-core v2.0.0 runtime revealed that connect() called get_cte_client() without first initialising the client, which segfaults on the first Tag operation. clio-core requires the official bring-up sequence, chimaera_init(kClient) then initialize_cte(config, PoolQuery.Dynamic()), before any blob/tag access.

Add an idempotent, per-process _ensure_cte_client() that performs this sequence and is a no-op on the legacy module (which lacks these entry points). It honours CLIO_SERVER_CONF (runtime YAML, default ~/.clio/clio.yaml) and CHI_WITH_RUNTIME (start an in-process runtime for single-node, else attach to a clio_run daemon).

Verified end-to-end against a real, running IOWarp v2 runtime (no stubs): seeded blobs into a live CTE tag, then connect() -> index() (scanned=3 indexed=3) -> search_lexical (3 hits) -> search_scientific(300-400 kPa) correctly returned the 320 and 355 kPa blobs and excluded 101 kPa -> build_citation cte://turbine_study/blob_turbine.

Ruff, format, mypy clean; full suite 271 passed (with and without iowarp-core installed, so CI's uv sync --dev path is unaffected).
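An idempotent, once-per-process initialiser of this kind can be sketched as below. It does not call the clio-core bindings: the `bring_up` callable stands in for the chimaera_init + initialize_cte sequence, and `ensure_initialized`, `resolve_config_path` and the use of a lock are assumptions. Only the CLIO_SERVER_CONF variable and its default path come from the commit message.

```python
import os
import threading
from typing import Callable

_init_lock = threading.Lock()
_initialized = False


def resolve_config_path() -> str:
    """Honour CLIO_SERVER_CONF, falling back to the default runtime YAML."""
    return os.environ.get("CLIO_SERVER_CONF", os.path.expanduser("~/.clio/clio.yaml"))


def ensure_initialized(bring_up: Callable[[], None]) -> bool:
    """Run bring_up at most once per process; return True only when it ran."""
    global _initialized
    with _init_lock:
        if _initialized:
            return False
        bring_up()
        # Mark as done only after bring_up succeeds, so a failure can be retried.
        _initialized = True
        return True
```

Repeated connect() calls then share a single bring-up, and the lock guards against two threads racing through it.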

The hardcoded help for the search subcommand predated the engine upgrade. Update it to mention the agentic multi-hop flags (--agentic / --max-hops / --llm-rewrite), the formula operator, and the five default namespaces (local_fs, object_s3, vector_qdrant, hdf5_data, netcdf_data). Help text only — the launcher already forwards all args verbatim to the clio entry point, so the new features worked without code changes; this just makes the discovery output accurate.

An external audit run with 'uv sync --all-extras --dev' surfaced gaps that the CI config (uv sync --dev + mypy --ignore-missing-imports) hid:

mypy (uv run mypy src/ with the optional typed packages installed): 17 -> 0
- telemetry: annotate the prometheus output as bytes instead of a redundant cast (no-any-return-safe without the typed package, redundant-cast-free with it); drop the now-unused cast import
- query_rewriter: read the Anthropic response block's text defensively (the SDK types content as a block union; only text blocks carry .text)
- netcdf: restore the xarray optional-import 'type: ignore[assignment]' (it is needed when xarray is installed); coerce data_var names to str before sort/join
- pyproject: add clio_cte_core_ext and iowarp_core to ignore_missing_imports; set warn_unused_ignores = false so optional-backend ignores are clean whether or not the typed package is present

tests: 2 failures under --all-extras -> 0
- test_falls_back_to_hash_when_no_sentence_transformers and test_no_results_prints_diagnostic assumed sentence-transformers was absent. Force the hash-embedder path via monkeypatch so they are deterministic with or without the semantic extra.

Verified clean in BOTH environments: mypy 0 errors and 271 tests pass under uv sync --dev and under uv sync --all-extras --dev. Ruff/format unchanged.
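Reading text defensively from a union of block types can be sketched as below. The example deliberately avoids the Anthropic SDK and uses stand-in objects; `collect_text` is a hypothetical helper, not the query_rewriter code.

```python
from types import SimpleNamespace
from typing import Any, Iterable


def collect_text(blocks: Iterable[Any]) -> str:
    """Concatenate .text from blocks that have it; skip every other block type."""
    parts: list[str] = []
    for block in blocks:
        text = getattr(block, "text", None)
        if isinstance(text, str):
            parts.append(text)
    return "".join(parts)


# Stand-in response content: one text block and one block without .text.
example_blocks = [
    SimpleNamespace(type="text", text="pressure in kPa"),
    SimpleNamespace(type="tool_use", name="lookup"),
]
rewritten = collect_text(example_blocks)
```

Using getattr with a default satisfies the type checker for the whole union and avoids an AttributeError at runtime if a non-text block appears.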

SIslamMun added 15 commits June 4, 2026 05:38

SIslamMun wants to merge 15 commits into main from dev-shazzadul
