Integrate clio-search agentic engine + IOWarp clio-core v2.0.0 into clio-agentic-search#286

Open
SIslamMun wants to merge 15 commits into main from dev-shazzadul

Conversation

@SIslamMun

Contributor

Summary

Integrates the upgraded clio-search agentic retrieval engine into
clio-agentic-search, plus IOWarp clio-core v2.0.0 support and a round of
type/test hardening. The package gains the full clio-search feature set while
remaining backward compatible; all existing connectors, the API, async jobs and
telemetry are unchanged.

What's new

  • Agentic retrieval — multi-hop observe→decide→act loop with LLM query
    rewriting (and a no-LLM fallback) and SI-unit variant inference
    (clio query --agentic --max-hops N --llm-rewrite)
  • Four new connectors — HDF5 (hdf5_data), NetCDF (netcdf_data),
    IOWarp CTE, and NDP (National Data Platform, incl. structured CSV ingestion)
  • Corpus-adaptive strategy — schema/metadata profiling drives per-query
    branch selection and content-quality filtering
  • Structured ingestion — CSV/table detection and table-aware chunking
  • Upgraded science engine — expanded SI canonicalization + formula
    normalization; science-aware operators (numeric range, unit match, formula)
  • IOWarp clio-core v2.0.0 — imports the renamed clio_cte_core_ext
    (legacy iowarp_core.wrp_cte_core_ext fallback) and performs the required
    client bring-up (chimaera_init + initialize_cte) before CTE access

Verification

  • Migrated every clio-search source and test file (full file-set parity).
  • 271 tests pass under both uv sync --dev (CI) and uv sync --all-extras --dev.
  • uv run mypy src/ is clean (0 errors) in both environments.
  • Ruff lint + format clean.
  • CLI, API (/health, /version, /query, /metrics), and the quality gate
    (6/6 scientific checks) verified live.
  • IOWarp connector verified end-to-end against a live clio-core v2.0.0 runtime
    (real blobs indexed and searched, not stubs); NDP verified with real CSV
    download/parse.
  • Parent launcher (clio-kit search …) verified to forward the new flags;
    its help text refreshed.

Notes

  • The SC26 paper benchmark scripts are intentionally not vendored (research
    artifacts, not part of the engine; would break the repo-wide ruff gate).
  • New optional extras: hdf5, netcdf, llm, eval. The IOWarp connector
    needs the separately-installed iowarp-core wheel; without it connect()
    raises a clear message and CI is unaffected.

Commits

15 focused commits, built and tested incrementally (deps → engine core →
agentic → connectors → namespaces → CLI → docs → mypy → IOWarp v2 → audit fixes).

SIslamMun added 15 commits June 4, 2026 05:38
…apabilities

Adopt the clio-search dependency manifest ahead of the feature migration so
the CI environment (uv sync --dev) carries the packages the upgraded engine
needs:

- core: httpx, h5py, aiohttp, matplotlib
- extras: hdf5, netcdf (xarray/netCDF4), llm (anthropic/openai), eval
- mypy: ignore_missing_imports for h5py, xarray, netCDF4, anthropic, openai

Regenerate uv.lock accordingly. No source changes yet; existing test suite
and lint remain green.
…e operators

Migrate the upgraded retrieval engine from clio-search. The measurement
encode/decode format now carries a content-quality flag, which is a shared
contract across indexing, storage, retrieval and connectors, so these move
together as one coherent unit:

- indexing: content-quality scoring (quality.py), structured CSV/table
  chunking (csv_parser.py), expanded SI canonicalization + formula
  normalization (scientific.py)
- retrieval: corpus profiling, metadata/sample schema inference, branch
  strategy selection, and quality-aware scientific operators; coordinator
  now emits a branch_plan_selected trace stage
- storage: science measurement/formula tables and quality columns in
  duckdb_store; updated storage contracts
- connectors: filesystem and object_store updated for the new pipeline
- evals: quality_gate aligned to the new operators
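
One plausible reading of "table-aware chunking" is that every chunk of a CSV keeps its header row, so column names stay attached to the values they describe. The sketch below illustrates that idea only; `chunk_csv` and `rows_per_chunk` are hypothetical names and are not taken from csv_parser.py.

```python
import csv
import io


def chunk_csv(text: str, rows_per_chunk: int = 2) -> list[str]:
    """Split CSV text into chunks that each repeat the header row.

    Hypothetical helper: the real chunker may size and shape chunks differently.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    chunks: list[str] = []
    for start in range(0, len(body), rows_per_chunk):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)  # keep column names with every chunk
        writer.writerows(body[start:start + rows_per_chunk])
        chunks.append(out.getvalue())
    return chunks
```

Repeating the header costs a few tokens per chunk but lets a retrieved chunk be interpreted without the rest of the file.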

Tests: 244 passing (was 138). Fixed a stale trace-sequence assertion in
test_retrieval_flow that predated the branch_plan_selected stage (the same
assertion is stale in clio-search). Lint/format applied to satisfy the
clio-kit ruff gate.
…rewriting

Migrate the agentic retrieval layer from clio-search:

- query_rewriter.py: LLM-backed query expansion with SI-unit variant
  inference, plus a no-LLM FallbackQueryRewriter for environments without
  the optional llm extra
- agentic.py: AgenticRetriever wrapping the coordinator in an
  observe-decide-act loop with convergence/hop limits and multi-namespace
  fan-out
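
The observe-decide-act loop with a hop limit and a convergence check can be sketched roughly as follows. This is an illustration under assumptions, not the AgenticRetriever code: `agentic_search`, `AgenticResult`, and the `search`/`rewrite` callables are hypothetical stand-ins for the coordinator and the query rewriter, and "no new hits" is only one possible convergence rule.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical signatures standing in for the coordinator and the rewriter.
SearchFn = Callable[[str], list[str]]          # query -> document ids
RewriteFn = Callable[[str, list[str]], str]    # (query, hits so far) -> next query


@dataclass
class AgenticResult:
    hits: list[str] = field(default_factory=list)
    hops: int = 0
    converged: bool = False


def agentic_search(
    query: str, search: SearchFn, rewrite: RewriteFn, max_hops: int = 3
) -> AgenticResult:
    result = AgenticResult()
    seen: set[str] = set()
    for _ in range(max_hops):
        result.hops += 1
        # Observe: run the current query, dropping duplicates and known hits.
        new_hits = [d for d in dict.fromkeys(search(query)) if d not in seen]
        # Decide: a hop that contributes nothing new is treated as convergence.
        if not new_hits:
            result.converged = True
            break
        seen.update(new_hits)
        result.hits.extend(new_hits)
        # Act: rewrite the query for the next hop.
        query = rewrite(query, result.hits)
    return result
```

Bounding the loop with `max_hops` keeps the worst-case cost predictable even when the rewriter never stops producing new queries.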

Removed an unused local in the multi-namespace loop (dead in clio-search;
removal preserves runtime behavior) to satisfy the ruff F841 gate.
Tests: 17 new (query rewriter + agentic retriever), full suite green.

Migrate two new storage connectors from clio-search:

- hdf5: indexes HDF5 datasets/attributes via h5py (core dependency)
- netcdf: indexes NetCDF variables/coordinates via xarray (optional netcdf
  extra), guarded so the module imports cleanly when xarray is absent

Both reuse the shared indexing/retrieval pipeline. Tests added for each;
NetCDF tests degrade gracefully without the optional extra, matching the
CI environment (uv sync --dev).

Migrate two federated-backend connectors from clio-search:

- iowarp: indexes blobs from an IOWarp content store
- ndp: indexes datasets from the National Data Platform, including
  structured CSV ingestion; MCP-backed discovery lives in ndp/mcp_client.py
  and imports the optional 'mcp' package lazily, so the connector module
  imports cleanly without it

Neither is wired into the default namespace registry, so the standard
import path and test collection never require their optional dependencies.
Removed an unused local in the NDP CSV chunker (dead in clio-search;
removal preserves the indexed chunk content) to satisfy the ruff gate.

Wire the new science-format connectors into the namespace layer:

- namespace_config: default bundles for hdf5_data and netcdf_data
- namespace_registry: build_default_registry now provisions five
  namespaces (local_fs, object_s3, vector_qdrant, hdf5_data, netcdf_data)

Update the production-hardening registry test to assert the five-namespace
default. Full suite green.

Expose the agentic retrieval loop through 'clio query':

- --agentic enables multi-hop retrieval
- --max-hops bounds the loop (default 3)
- --llm-rewrite opts into LLM query rewriting (falls back gracefully
  without the optional anthropic dependency)
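
The three flags and the default hop limit of 3 can be modelled with a small parser. The sketch assumes argparse and a positional `query` argument purely for illustration; the real `clio query` command may be built on a different CLI framework, and `build_query_parser` is a hypothetical name.

```python
import argparse


def build_query_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the flags described in the commit message.
    parser = argparse.ArgumentParser(prog="clio query")
    parser.add_argument("query", help="search text (positional form is an assumption)")
    parser.add_argument("--agentic", action="store_true",
                        help="enable multi-hop retrieval")
    parser.add_argument("--max-hops", type=int, default=3,
                        help="upper bound on agentic hops")
    parser.add_argument("--llm-rewrite", action="store_true",
                        help="opt into LLM query rewriting")
    return parser


args = build_query_parser().parse_args(["turbine pressure", "--agentic", "--max-hops", "5"])
```

Because both switches default to off, a plain query parses to the same settings as before the flags existed.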

Existing CLI tests remain green.

Update the README capability list to cover the migrated features: agentic
multi-hop retrieval with LLM query rewriting, corpus-adaptive strategy and
content-quality filtering, structured CSV/table ingestion, and the HDF5,
NetCDF, IOWarp and NDP connectors. Note the new clio query agentic flags.

The SC26 paper benchmark drivers from clio-search (top-level benchmarks/)
are intentionally not vendored into the monorepo package: they are research
artifacts, are not part of the importable engine, and would otherwise be
scanned by the repo-wide ruff gate.

Add a retrieval-pipeline overview, a connector table (namespace, default
registry membership, required extra), an optional-extras table, and an
agentic CLI example that makes clear the multi-hop loop is opt-in and a
plain query is unchanged.

Clear the 28 advisory mypy errors carried over from clio-search, with
type-only / behaviour-preserving changes:

- storage/duckdb_store: correct the measurement-row tuple annotation to
  include the quality field; add a _scalar_int helper so COUNT(*) queries
  no longer index a possibly-None fetchone() result
- storage/contracts: add the acceptable_quality parameter to the
  StorageAdapter protocol so it matches the DuckDB implementation
- connectors/iowarp: type the untyped iowarp_core client/tag handles as Any
  (fixes attr-defined), drop the now-needless type: ignore comments, and
  parameterise a bare list annotation
- connectors/hdf5, connectors/netcdf: drop type: ignore comments that are
  unused under the CI mypy config (h5py/xarray resolve to Any)
- connectors/ndp: annotate the mixed-value stats dict as dict[str, Any];
  bind the parsed organizations list to a typed local
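
The COUNT(*) fix works because DB-API cursors type `fetchone()` as possibly returning None, so indexing the result directly fails type checking even when a row is always present at runtime. Below is a minimal sketch of such a helper; it uses the standard-library sqlite3 module in place of DuckDB so that it runs anywhere, and `scalar_int` with its `default` parameter is a hypothetical signature rather than the one in duckdb_store.

```python
import sqlite3
from typing import Any, Optional, Sequence


def scalar_int(row: Optional[Sequence[Any]], default: int = 0) -> int:
    """Return the first column of a fetchone() row as int, or default if absent."""
    if row is None or row[0] is None:
        return default
    return int(row[0])


# sqlite3 stands in for DuckDB here; the table layout is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL, quality TEXT)")
conn.execute("INSERT INTO measurements VALUES (320.0, 'good')")
count = scalar_int(conn.execute("SELECT COUNT(*) FROM measurements").fetchone())
```

Centralising the None check in one helper avoids scattering `type: ignore` comments or assertions across every count query.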

mypy src/ --ignore-missing-imports: 0 errors (was 28). Ruff, format and the
full 271-test suite remain green; no runtime behaviour changed.

Add an Examples section with copy-pasteable, verified commands covering the
new capabilities: scientific numeric-range queries (incl. cross-unit Pa/kPa
matching), formula targeting, the agentic multi-hop loop, HDF5/NetCDF
indexing via CLIO_HDF5_ROOT/CLIO_NETCDF_ROOT, and the HTTP API. Env-var
names and the serve port (8000) verified against the CLI.
…ext)

clio-core v2.0.0 (pip install iowarp-core) rebrands the low-level CTE Python
bindings: the module wrp_cte_core_ext (imported as iowarp_core.wrp_cte_core_ext)
is now the top-level clio_cte_core_ext. The blob/tag API the connector relies on
(get_cte_client, Tag, BlobQuery, GetContainedBlobs, GetBlobSize, GetBlob,
PoolQuery.Dynamic) is unchanged across versions — verified against the v2.0.0
core_bindings.cc in iowarp/clio-core.

Import the new module first and fall back to the legacy wheel, so the connector
works with both clio-core v2.0.0+ and older iowarp_core builds. Update the
module docstring and the not-installed error message accordingly.
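
The new-module-first, legacy-fallback import can be sketched as a loop over candidate module names. The two module names come from the commit message; `load_first_available` and the error wording are hypothetical and do not reproduce the connector's actual message.

```python
import importlib
from types import ModuleType
from typing import Sequence

# Module names from the commit message, newest first.
CTE_MODULE_CANDIDATES = ("clio_cte_core_ext", "iowarp_core.wrp_cte_core_ext")


def load_first_available(
    candidates: Sequence[str] = CTE_MODULE_CANDIDATES,
) -> ModuleType:
    """Import and return the first candidate that is installed."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too, including a missing parent package.
            continue
    raise RuntimeError(
        "IOWarp CTE bindings not found; tried: " + ", ".join(candidates)
    )
```

Trying the new name first means an environment with both wheels installed picks up the v2.0.0 bindings.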

Verified end-to-end against fakes for both module names (index -> lexical +
scientific search -> citation) and the no-bindings path (clear RuntimeError).
Ruff, format, mypy clean; full suite 271 passed.
…rified)

Running against a live clio-core v2.0.0 runtime revealed that connect() called
get_cte_client() without first initialising the client, which segfaults on the
first Tag operation. clio-core requires the official bring-up sequence —
chimaera_init(kClient) then initialize_cte(config, PoolQuery.Dynamic()) — before
any blob/tag access.

Add an idempotent, per-process _ensure_cte_client() that performs this sequence
and is a no-op on the legacy module (which lacks these entry points). It honours
CLIO_SERVER_CONF (runtime YAML, default ~/.clio/clio.yaml) and CHI_WITH_RUNTIME
(start an in-process runtime for single-node, else attach to a clio_run daemon).
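
An idempotent, per-process bring-up of this kind might look like the sketch below. It is a hedged illustration: `ensure_cte_client` takes the bindings module as a parameter so it can be exercised with a fake, and the attribute paths `bindings.kClient` and `bindings.PoolQuery.Dynamic()` are assumptions inferred from the call sequence quoted above, not verified against the clio-core bindings. CHI_WITH_RUNTIME handling is omitted.

```python
import os
import threading
from typing import Any

_init_lock = threading.Lock()
_initialized = False


def ensure_cte_client(bindings: Any) -> None:
    """Run the client bring-up at most once per process."""
    global _initialized
    with _init_lock:
        if _initialized:
            return
        # Legacy builds lack these entry points, so bring-up is skipped for them.
        if hasattr(bindings, "chimaera_init") and hasattr(bindings, "initialize_cte"):
            config = os.path.expanduser(
                os.environ.get("CLIO_SERVER_CONF", "~/.clio/clio.yaml")
            )
            # Assumed attribute locations; the real bindings may namespace these.
            bindings.chimaera_init(bindings.kClient)
            bindings.initialize_cte(config, bindings.PoolQuery.Dynamic())
        _initialized = True
```

The lock matters when several threads call connect() concurrently: without it, two threads could both observe the flag as unset and initialise the client twice.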

Verified end-to-end against a real, running IOWarp v2 runtime (no stubs) by
seeding blobs into a live CTE tag and then running:

- connect()
- index(): scanned=3 indexed=3
- search_lexical: 3 hits
- search_scientific(300-400 kPa): correctly returned the 320 and 355 kPa
  blobs and excluded the 101 kPa blob
- build_citation: cte://turbine_study/blob_turbine
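
The 300-400 kPa check above depends on comparing values after conversion to a common SI unit. A minimal sketch of that idea follows, using the same three pressures; the conversion table, function names, and blob names are hypothetical, and one value is stored in Pa to show the cross-unit case. The engine's actual canonicalization covers far more units than this.

```python
# Multipliers to the SI unit (pascal) for a few pressure units.
TO_PASCAL = {"Pa": 1.0, "kPa": 1e3, "MPa": 1e6, "bar": 1e5}


def to_pascal(value: float, unit: str) -> float:
    return value * TO_PASCAL[unit]


def in_range(measurement: tuple[float, str], low: float, high: float, unit: str) -> bool:
    """True if the measurement lies in [low, high], compared in pascals."""
    value, measured_unit = measurement
    return to_pascal(low, unit) <= to_pascal(value, measured_unit) <= to_pascal(high, unit)


# Hypothetical blobs; the second is recorded in Pa rather than kPa.
blobs = {
    "blob_a": (320.0, "kPa"),
    "blob_b": (355_000.0, "Pa"),
    "blob_c": (101.0, "kPa"),
}
hits = sorted(name for name, m in blobs.items() if in_range(m, 300, 400, "kPa"))
```

Here `hits` contains `blob_a` and `blob_b`, while the 101 kPa blob falls outside the range.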

Ruff, format, mypy clean; full suite 271 passed (with and without iowarp-core
installed, so CI's uv sync --dev path is unaffected).

The hardcoded help for the search subcommand predated the engine upgrade.
Update it to mention the agentic multi-hop flags (--agentic / --max-hops /
--llm-rewrite), the formula operator, and the five default namespaces
(local_fs, object_s3, vector_qdrant, hdf5_data, netcdf_data).

Help text only — the launcher already forwards all args verbatim to the
clio entry point, so the new features worked without code changes; this just
makes the discovery output accurate.

An external audit run with 'uv sync --all-extras --dev' surfaced gaps that the
CI config (uv sync --dev + mypy --ignore-missing-imports) hid:

mypy (uv run mypy src/ with the optional typed packages installed): 17 -> 0
  - telemetry: annotate the Prometheus output as bytes rather than wrapping it
    in a cast; the annotation satisfies no-any-return when the typed package is
    absent and avoids redundant-cast when it is installed. Drop the now-unused
    cast import
  - query_rewriter: read the Anthropic response block's text defensively
    (the SDK types content as a block union; only text blocks carry .text)
  - netcdf: restore the xarray optional-import 'type: ignore[assignment]' (it is
    needed when xarray is installed); coerce data_var names to str before
    sort/join
  - pyproject: add clio_cte_core_ext and iowarp_core to ignore_missing_imports;
    set warn_unused_ignores = false so optional-backend ignores are clean whether
    or not the typed package is present
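
Reading a response block's text defensively, as in the query_rewriter item above, can be sketched without the SDK. The two dataclasses below are stand-ins for a text block and a non-text block, and `first_text` is a hypothetical helper; none of these names come from the Anthropic SDK or from query_rewriter.py.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class TextBlock:       # stand-in for a block that carries .text
    text: str


@dataclass
class ToolUseBlock:    # stand-in for a block without .text
    name: str


def first_text(content: Iterable[Any], default: str = "") -> str:
    """Return the text of the first block that has a string .text attribute."""
    for block in content:
        text = getattr(block, "text", None)
        if isinstance(text, str):
            return text
    return default
```

Using `getattr` with a default keeps both the type checker and the runtime safe when the first block in the response is not a text block.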

tests: 2 failures under --all-extras -> 0
  - test_falls_back_to_hash_when_no_sentence_transformers and
    test_no_results_prints_diagnostic assumed sentence-transformers was absent.
    Force the hash-embedder path via monkeypatch so they are deterministic with
    or without the semantic extra.

Verified clean in BOTH environments: mypy 0 errors and 271 tests pass under
uv sync --dev and under uv sync --all-extras --dev. Ruff/format unchanged.