Skip to content

[Router] Deterministic conditions: keywords / regex / min_chars / has_* / metadata (M4, #2380)#2491

Merged
SlawomirNowaczyk merged 6 commits into
mainfrom
router/2380-deterministic-conditions
Jun 30, 2026
Merged

[Router] Deterministic conditions: keywords / regex / min_chars / has_* / metadata (M4, #2380)#2491
SlawomirNowaczyk merged 6 commits into
mainfrom
router/2380-deterministic-conditions

Conversation

@ramkrishna2910

Copy link
Copy Markdown
Contributor

Implements the deterministic match conditions (module M4) for the generic routing engine — the pure-CPU leaf ops a collection.router policy uses to route without invoking any model. Built on the #2378 evaluator + #2379 registry seams (both merged), and rebased onto the unified classifier work in #2483.

Closes #2380.

What's here

Eight ops, implemented as file-local Condition subclasses and exposed through a single public symbol — make_deterministic_leaf_factories() — which the engine (#2382) passes to make_leaf_factory as deterministic_factories:

Op Frozen v1 semantics
keywords_any / keywords_all case-insensitive substring (ASCII case fold), any vs all
regex ECMAScript (std::regex), case-sensitive, compiled at policy-load
min_chars / max_chars inclusive bound on input length in UTF-8 bytes
has_tools / has_images boolean request feature equals the authored value
metadata equals (raw exact) / any (comma-split token-set intersection) / exists; case-sensitive

Config is validated when each condition is built (the factory throws std::invalid_argument on malformed leaves); evaluate() never throws, preserving the engine's fail-open contract. Multi-key leaves become an implicit all via the existing registry.

Frozen-edge decisions (locked now while the schema is unreleased)

These pin under-specified v1 semantics, following the non-empty / no-silent-match precedent set on the sibling semantic classifier (#2483):

  • Empty keyword arrays / items rejected (minItems:1 + minLength:1 + factory throw) rather than defining a vacuous fold — an empty keywords_any silently never-matching (and keywords_all silently always-matching) is exactly the footgun [Router] 2381 unified interface for semantic_similarity and model classifier #2483 designed against.
  • ASCII-only case fold for keywords — locale-independent, no ICU; pinned explicitly in the frozen-semantics table rather than left to "case-insensitive".
  • Empty regex rejected (minLength:1 + factory throw) — an empty pattern matches everything; express always-match by omitting the condition.
  • Whitespace-only metadata value counts as absent (matches only exists:false), keeping exists consistent with the comma-split token model equals/any already use.
  • Catastrophic-backtracking regex rejected at policy load — see below.

ReDoS hardening

std::regex is a backtracking engine with no portable step limit, and the text a regex condition matches is untrusted end-user input — so a policy author who writes a nested-unbounded-quantifier pattern ((a+)+, (a*)*, (.*)+, (\d+){2,}) creates a denial-of-service vector. reject_catastrophic_regex() walks the pattern at policy-load and rejects that family before it can reach the request hot path. The runtime try/catch on std::regex_error stays as cross-platform defense-in-depth (MSVC throws error_complexity; libstdc++/libc++ may not).

This is a mitigation, not a linear-time proof — alternation-overlap ReDoS like (a|a)+ is not detected. A hard guarantee would need a non-backtracking engine (RE2), which — being a different dialect — would ship as a separately named op, never a redefinition of this ECMAScript one. That residual risk + the RE2 path are documented in the code and schema rather than a side issue.

Testing

All against the merged base (semantic + deterministic + foundation suites green together):

  • RoutingPolicyDeterministicTest — 64 checks: keywords all-vs-any, ASCII case fold (and non-ASCII not folded), ECMAScript regex incl. case-sensitivity, UTF-8 byte-boundary min/max_chars, metadata equals/any/exists (scalar vs comma-list, missing/empty/whitespace-only key), 12 ReDoS accept/reject cases, malformed-config rejection, trace emission, and the registry's multi-key implicit-all.
  • Existing routing suites (Contract/Evaluator/Registry/Semantic) still pass.
  • Python test_schema_lock.py + test_routing_fixtures.py pass; schema-lock regenerated for the merged schema.
  • Registered in CMake and wired into the CI ctest filter (union with [Router] 2381 unified interface for semantic_similarity and model classifier #2483's Semantic).

Notes

  • Cross-platform: built clean under MSVC /W4; only standard library + nlohmann/json (no new dependency, header invariant preserved).
  • No new endpoints / backends — pure engine-internal contract.

ramkrishna2910 and others added 4 commits June 29, 2026 10:37
Implement the deterministic match conditions for the routing engine:
keywords_any/keywords_all, regex, min_chars/max_chars, has_tools/has_images,
and metadata. Pure CPU, no model, no tokenizer. Each is a Condition built by a
named leaf factory and plugged into make_leaf_factory as deterministic_factories.

Frozen v1 semantics match route_policy.schema.json exactly:
- keywords: case-insensitive substring, ASCII-only case fold (locale-independent)
- regex: ECMAScript (std::regex), compiled at policy-load (fail-fast on bad pattern)
- min_chars/max_chars: inclusive bounds on input length in UTF-8 bytes
- has_tools/has_images: boolean request feature equals authored value
- metadata: case-sensitive; value decoded into a comma-split, trimmed token set;
  equals (raw exact) / any (set intersection) / exists (presence); missing or
  empty value matches only exists:false

Public surface is a single symbol, make_deterministic_leaf_factories(); the
condition classes stay file-local. Config is validated when each condition is
built (factory throws std::invalid_argument on malformed leaves); evaluate()
never throws, preserving the engine's fail-open contract.

Frozen-edge decisions (aligned with the non-empty/no-silent-match precedent set
on the sibling semantic classifier): reject empty keyword arrays and empty
keyword items rather than defining a vacuous fold (minItems/minLength tightened
in the still-unreleased schema, lock refreshed); ASCII-only case fold pinned
explicitly in the frozen-semantics README rather than left to interpretation.

Adds test_routing_policy_deterministic.cpp (50 checks: keywords all-vs-any,
case-insensitivity, ECMAScript regex, UTF-8 byte-boundary chars, metadata
equals/any/exists incl. scalar vs comma-list and missing-key, malformed-config
rejection, trace emission, and the registry's multi-key implicit-all wiring),
registers RoutingPolicyDeterministicTest in CMake, and wires it into CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#2380)

Two frozen-edge refinements surfaced by scenario testing, both applying the same
"no silent always-match / empty == absent" principle already used for keywords:

1. regex must be non-empty (minLength:1). An empty pattern matches every input —
   an accidental always-match rule. Express always-match by omitting the
   condition, not an empty regex.

2. A whitespace-only metadata value now counts as absent (matches only
   exists:false), consistent with the comma-split token model where equals/any
   already see no content. Previously a "   " value read as exists:true while
   yielding no tokens for any/equals.

Schema descriptions + frozen-semantics README updated, schema-lock refreshed,
and the unit test gains whitespace-only metadata coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2380)

std::regex is a backtracking engine with no portable step limit, and the input
matched by a regex condition is untrusted end-user text — so a policy author who
writes a pattern with a nested unbounded quantifier (the (a+)+ / (a*)* / (.*)+ /
(\d+){2,} ReDoS family) creates a denial-of-service vector: a crafted request can
pin a worker thread.

reject_catastrophic_regex() walks the pattern at policy-load time and throws
std::invalid_argument when it finds an unbounded quantifier applied to a group
that itself contains an unbounded quantifier, so the dangerous pattern never
reaches the request hot path. This is a mitigation, not a linear-time proof —
alternation-overlap ReDoS like (a|a)+ is not detected; a hard guarantee would
need a non-backtracking engine (RE2), which would be a separate dialect op. The
runtime try/catch on std::regex_error stays as cross-platform defense-in-depth
(MSVC throws error_complexity; libstdc++/libc++ may not).

Also rejects an empty regex pattern in the factory (defense-in-depth behind the
schema minLength). Bounded quantifiers (outer {1,3} or inner {1,3}) and
inner-only groups remain accepted. Schema/README document the rejection; unit
test adds 12 ReDoS accept/reject cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stic-conditions

# Conflicts:
#	.github/workflows/cpp_server_build_test_release.yml
#	CMakeLists.txt
#	src/cpp/resources/schemas/schema-lock.json
#	src/cpp/server/routing_policy.cpp
@github-actions github-actions Bot added the enhancement New feature or request label Jun 30, 2026
@ramkrishna2910 ramkrishna2910 marked this pull request as ready for review June 30, 2026 00:50
@ramkrishna2910 ramkrishna2910 self-assigned this Jun 30, 2026

@fl0rianr fl0rianr left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the very thorough work here — the implementation is well structured, the semantics are clearly documented, and I like that the deterministic leaves stay pure CPU / engine-internal with no new backend dependency.

I’m going to request changes for a few small but important correctness/CI issues before merge:

  1. The new deterministic C++ test is added to the ctest filter but not built in CI.
    Both Linux and macOS workflow sections now include RoutingPolicyDeterministicTest in the ctest -R ... expression, but the workflow only builds the existing routing test targets up to test_routing_policy_semantic. CI currently fails with Unable to find executable: .../test_routing_policy_deterministic.

    Please add:

    cmake --build --preset default --target test_routing_policy_deterministic

    in the same places where the other routing test targets are built.

  2. Schema validation and factory validation differ for metadata.
    The C++ factory rejects an empty metadata key, empty metadata.any, and empty items inside metadata.any, but the schema currently allows those. Since these semantics are being frozen now, the schema should match the implementation.

    Suggested schema changes:

    "key": {
      "type": "string",
      "minLength": 1
    },
    "any": {
      "type": "array",
      "minItems": 1,
      "items": { "type": "string", "minLength": 1 }
    }

    Then please regenerate/update the schema lock as needed.

  3. Minor: Docs/style CI is failing due to stale generated backend docs.
    Can be solved by merging main, rebasing or going for merging directly as this one is only a non-required check atm.

Further minor non-blocking suggestions: the ReDoS wording in the schema could be softened slightly. The code comments correctly describe the regex rejection as a mitigation rather than a proof of linear-time behavior, so the schema description should avoid implying that all catastrophic patterns are prevented.

Once those are addressed, I think this is in good shape. The overall design and test coverage look strong.

@SlawomirNowaczyk SlawomirNowaczyk left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; I've left three comments with suggestions for minor changes

Comment thread src/cpp/server/routing_policy.cpp
Comment thread src/cpp/server/routing_policy.cpp
Comment thread src/cpp/server/routing_policy.cpp Outdated
ramkrishna2910 and others added 2 commits June 30, 2026 09:36
Review fixes from #2491 (fl0rianr, SlawomirNowaczyk):

- CI: build the test_routing_policy_deterministic target in both workflow
  sections; it was added to the ctest filter but never built, so CI failed with
  "Unable to find executable".
- ReDoS detector: a wrapper group hid its child's unbounded quantifier, so
  ((a+))+ and (a(b+))+ slipped through. Propagate an unbounded inner group to the
  enclosing group on close so a later quantifier on the wrapper is caught.
- regex: cap the input fed to std::regex at 1 MiB; an oversized input is a
  non-match (fail-safe) rather than a potentially hung worker, covering the case
  the static check cannot bound.
- keywords: memoize the ASCII-lowered input on EvalContext so multiple keyword
  leaves fold the (constant) request input at most once.
- Schema/factory parity for metadata: schema now requires non-empty key and a
  non-empty any[] of non-empty items, matching the factory's validation; lock
  refreshed.
- Soften the regex schema wording so it reads as a mitigation, not a guarantee
  that all catastrophic patterns are prevented.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ramkrishna2910

Copy link
Copy Markdown
Contributor Author

Thanks both — all points addressed in fdc99d69 (then merged latest main). Summary:

@fl0rianr

  1. CI build target — added cmake --build --preset default --target test_routing_policy_deterministic in both workflow sections (right after the semantic target). Good catch; the filter entry was there without the build line.
  2. Schema↔factory parity for metadata — schema now requires key minLength:1, any minItems:1, and any items minLength:1, matching the factory. Lock regenerated. Verified both directions agree (empty key / empty any[] / empty item now rejected by schema and factory).
  3. Docs-drift / .deb — merged latest main.
  4. Softened ReDoS wording — the schema description now reads as a mitigation ("the most common catastrophic-backtracking shape … mitigates but does not eliminate all backtracking risk"), not a guarantee, and points at the new input cap.

@SlawomirNowaczyk

  1. Wrapper group hiding a nested quantifier — real gap; ((a+))+ and (a(b+))+ slipped through. Added the propagation on group close exactly as suggested, plus your two test cases. Both now rejected.
  2. Input-size cap — added the 1 MiB guard in RegexCondition::evaluate (oversized input → non-match, fail-safe). Documented the defined behavior in the schema note; added under/over-cap tests.
  3. Memoize ascii_lower — added std::optional<std::string> lowered_input to EvalContext; keyword leaves now fold the input at most once per request.

Local verification (all against the merged base): the five routing C++ suites pass together (Deterministic now 70 checks incl. the new wrapper/​cap/​parity cases), /W4 clean, and test_schema_lock.py + test_routing_fixtures.py pass. Re-requesting review.

@fl0rianr fl0rianr left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, current CI fail flaky (timeout)

@eddierichter-amd eddierichter-amd left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SlawomirNowaczyk SlawomirNowaczyk added this pull request to the merge queue Jun 30, 2026
Merged via the queue into main with commit 71d7f74 Jun 30, 2026
191 of 193 checks passed
@SlawomirNowaczyk SlawomirNowaczyk deleted the router/2380-deterministic-conditions branch June 30, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Router] Deterministic conditions: keywords / regex / min_chars / has_* (M4)

4 participants