[Router] Deterministic conditions: keywords / regex / min_chars / has_* / metadata (M4, #2380)#2491
Conversation
Implement the deterministic match conditions for the routing engine: keywords_any/keywords_all, regex, min_chars/max_chars, has_tools/has_images, and metadata. Pure CPU, no model, no tokenizer. Each is a Condition built by a named leaf factory and plugged into make_leaf_factory as deterministic_factories. Frozen v1 semantics match route_policy.schema.json exactly: - keywords: case-insensitive substring, ASCII-only case fold (locale-independent) - regex: ECMAScript (std::regex), compiled at policy-load (fail-fast on bad pattern) - min_chars/max_chars: inclusive bounds on input length in UTF-8 bytes - has_tools/has_images: boolean request feature equals authored value - metadata: case-sensitive; value decoded into a comma-split, trimmed token set; equals (raw exact) / any (set intersection) / exists (presence); missing or empty value matches only exists:false Public surface is a single symbol, make_deterministic_leaf_factories(); the condition classes stay file-local. Config is validated when each condition is built (factory throws std::invalid_argument on malformed leaves); evaluate() never throws, preserving the engine's fail-open contract. Frozen-edge decisions (aligned with the non-empty/no-silent-match precedent set on the sibling semantic classifier): reject empty keyword arrays and empty keyword items rather than defining a vacuous fold (minItems/minLength tightened in the still-unreleased schema, lock refreshed); ASCII-only case fold pinned explicitly in the frozen-semantics README rather than left to interpretation. Adds test_routing_policy_deterministic.cpp (50 checks: keywords all-vs-any, case-insensitivity, ECMAScript regex, UTF-8 byte-boundary chars, metadata equals/any/exists incl. scalar vs comma-list and missing-key, malformed-config rejection, trace emission, and the registry's multi-key implicit-all wiring), registers RoutingPolicyDeterministicTest in CMake, and wires it into CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#2380) Two frozen-edge refinements surfaced by scenario testing, both applying the same "no silent always-match / empty == absent" principle already used for keywords: 1. regex must be non-empty (minLength:1). An empty pattern matches every input — an accidental always-match rule. Express always-match by omitting the condition, not an empty regex. 2. A whitespace-only metadata value now counts as absent (matches only exists:false), consistent with the comma-split token model where equals/any already see no content. Previously a " " value read as exists:true while yielding no tokens for any/equals. Schema descriptions + frozen-semantics README updated, schema-lock refreshed, and the unit test gains whitespace-only metadata coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2380) std::regex is a backtracking engine with no portable step limit, and the input matched by a regex condition is untrusted end-user text — so a policy author who writes a pattern with a nested unbounded quantifier (the (a+)+ / (a*)* / (.*)+ / (\d+){2,} ReDoS family) creates a denial-of-service vector: a crafted request can pin a worker thread. reject_catastrophic_regex() walks the pattern at policy-load time and throws std::invalid_argument when it finds an unbounded quantifier applied to a group that itself contains an unbounded quantifier, so the dangerous pattern never reaches the request hot path. This is a mitigation, not a linear-time proof — alternation-overlap ReDoS like (a|a)+ is not detected; a hard guarantee would need a non-backtracking engine (RE2), which would be a separate dialect op. The runtime try/catch on std::regex_error stays as cross-platform defense-in-depth (MSVC throws error_complexity; libstdc++/libc++ may not). Also rejects an empty regex pattern in the factory (defense-in-depth behind the schema minLength). Bounded quantifiers (outer {1,3} or inner {1,3}) and inner-only groups remain accepted. Schema/README document the rejection; unit test adds 12 ReDoS accept/reject cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stic-conditions # Conflicts: # .github/workflows/cpp_server_build_test_release.yml # CMakeLists.txt # src/cpp/resources/schemas/schema-lock.json # src/cpp/server/routing_policy.cpp
fl0rianr
left a comment
There was a problem hiding this comment.
Thanks for the very thorough work here — the implementation is well structured, the semantics are clearly documented, and I like that the deterministic leaves stay pure CPU / engine-internal with no new backend dependency.
I’m going to request changes for a few small but important correctness/CI issues before merge:
-
The new deterministic C++ test is added to the ctest filter but not built in CI.
Both Linux and macOS workflow sections now include RoutingPolicyDeterministicTest in thectest -R ...expression, but the workflow only builds the existing routing test targets up to test_routing_policy_semantic. CI currently fails withUnable to find executable: .../test_routing_policy_deterministic.Please add:
cmake --build --preset default --target test_routing_policy_deterministic
in the same places where the other routing test targets are built.
-
Schema validation and factory validation differ for
metadata.
The C++ factory rejects an empty metadata key, empty metadata.any, and empty items inside metadata.any, but the schema currently allows those. Since these semantics are being frozen now, the schema should match the implementation.Suggested schema changes:
"key": { "type": "string", "minLength": 1 }, "any": { "type": "array", "minItems": 1, "items": { "type": "string", "minLength": 1 } }
Then please regenerate/update the schema lock as needed.
-
Minor: Docs/style CI is failing due to stale generated backend docs.
Can be solved by merging main, rebasing or going for merging directly as this one is only a non-required check atm.
Further minor non-blocking suggestions: the ReDoS wording in the schema could be softened slightly. The code comments correctly describe the regex rejection as a mitigation rather than a proof of linear-time behavior, so the schema description should avoid implying that all catastrophic patterns are prevented.
Once those are addressed, I think this is in good shape. The overall design and test coverage look strong.
SlawomirNowaczyk
left a comment
There was a problem hiding this comment.
Looks good; I've left three comments with suggestions for minor changes
Review fixes from #2491 (fl0rianr, SlawomirNowaczyk): - CI: build the test_routing_policy_deterministic target in both workflow sections; it was added to the ctest filter but never built, so CI failed with "Unable to find executable". - ReDoS detector: a wrapper group hid its child's unbounded quantifier, so ((a+))+ and (a(b+))+ slipped through. Propagate an unbounded inner group to the enclosing group on close so a later quantifier on the wrapper is caught. - regex: cap the input fed to std::regex at 1 MiB; an oversized input is a non-match (fail-safe) rather than a potentially hung worker, covering the case the static check cannot bound. - keywords: memoize the ASCII-lowered input on EvalContext so multiple keyword leaves fold the (constant) request input at most once. - Schema/factory parity for metadata: schema now requires non-empty key and a non-empty any[] of non-empty items, matching the factory's validation; lock refreshed. - Soften the regex schema wording so it reads as a mitigation, not a guarantee that all catastrophic patterns are prevented. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks both — all points addressed in
Local verification (all against the merged base): the five routing C++ suites pass together ( |
fl0rianr
left a comment
There was a problem hiding this comment.
LGTM, current CI fail flaky (timeout)
Implements the deterministic match conditions (module M4) for the generic routing engine — the pure-CPU leaf ops a
collection.routerpolicy uses to route without invoking any model. Built on the #2378 evaluator + #2379 registry seams (both merged), and rebased onto the unified classifier work in #2483.Closes #2380.
What's here
Eight ops, implemented as file-local
Conditionsubclasses and exposed through a single public symbol —make_deterministic_leaf_factories()— which the engine (#2382) passes tomake_leaf_factoryasdeterministic_factories:keywords_any/keywords_allregexstd::regex), case-sensitive, compiled at policy-loadmin_chars/max_charshas_tools/has_imagesmetadataequals(raw exact) /any(comma-split token-set intersection) /exists; case-sensitiveConfig is validated when each condition is built (the factory throws
std::invalid_argumenton malformed leaves);evaluate()never throws, preserving the engine's fail-open contract. Multi-key leaves become an implicitallvia the existing registry.Frozen-edge decisions (locked now while the schema is unreleased)
These pin under-specified v1 semantics, following the non-empty / no-silent-match precedent set on the sibling semantic classifier (#2483):
minItems:1+minLength:1+ factory throw) rather than defining a vacuous fold — an emptykeywords_anysilently never-matching (andkeywords_allsilently always-matching) is exactly the footgun [Router] 2381 unified interface for semantic_similarity and model classifier #2483 designed against.minLength:1+ factory throw) — an empty pattern matches everything; express always-match by omitting the condition.exists:false), keepingexistsconsistent with the comma-split token modelequals/anyalready use.ReDoS hardening
std::regexis a backtracking engine with no portable step limit, and the text a regex condition matches is untrusted end-user input — so a policy author who writes a nested-unbounded-quantifier pattern ((a+)+,(a*)*,(.*)+,(\d+){2,}) creates a denial-of-service vector.reject_catastrophic_regex()walks the pattern at policy-load and rejects that family before it can reach the request hot path. The runtimetry/catchonstd::regex_errorstays as cross-platform defense-in-depth (MSVC throwserror_complexity; libstdc++/libc++ may not).This is a mitigation, not a linear-time proof — alternation-overlap ReDoS like
(a|a)+is not detected. A hard guarantee would need a non-backtracking engine (RE2), which — being a different dialect — would ship as a separately named op, never a redefinition of this ECMAScript one. That residual risk + the RE2 path are documented in the code and schema rather than a side issue.Testing
All against the merged base (semantic + deterministic + foundation suites green together):
RoutingPolicyDeterministicTest— 64 checks: keywords all-vs-any, ASCII case fold (and non-ASCII not folded), ECMAScript regex incl. case-sensitivity, UTF-8 byte-boundarymin/max_chars,metadataequals/any/exists (scalar vs comma-list, missing/empty/whitespace-only key), 12 ReDoS accept/reject cases, malformed-config rejection, trace emission, and the registry's multi-key implicit-all.Contract/Evaluator/Registry/Semantic) still pass.test_schema_lock.py+test_routing_fixtures.pypass; schema-lock regenerated for the merged schema.Semantic).Notes
/W4; only standard library + nlohmann/json (no new dependency, header invariant preserved).