Skip to content

Grammar-based query validation: StructureTable + SAT/THREAD engine for sequence & anchor impossibility #444

@zharinov

Description

@zharinov

Problem

The structural-check pass (#443) catches what can be inside what. It is blind to order and adjacency: it cannot reject patterns that demand an impossible sequence, a strict adjacency the grammar never produces, or an impossible quantified/anchored shape.

Example: (function_declaration .! (identifier)).! demands the identifier be the literal first child, but a function always begins with "async"/"function". Impossible, yet only a sequence-aware analysis sees it. The soft . variant is satisfiable (it skips anonymous tokens), so the checker must never reject it — false rejections are the one failure mode this feature must not have.

Construction

The set of valid visible child sequences is not regular for self-embedding grammars (balanced-paren shapes; 19/143 surveyed grammars — dart, devicetree, swift, sql, tree-sitter's own query language, …). So don't materialize the language. Materialize only whether the hidden structure can carry the query's child automaton between two states — a finite state-pair domain even when the language isn't regular.

A least-fixed-point over two mutually-recursive predicates (Knaster–Tarski; start everything false, iterate to stability):

  • SAT(p, v) — query pattern p's structure is realizable by a node built from grammar variable v.
  • THREAD(p, h, q, q') — the visible frontier of hidden variable h can drive automaton A_p from state q to q'.

No search bounds, no give-up paths — termination comes from finite judgment domains, not budgets. If an implementation reaches for a depth limit, it's being built wrong.

Inputs — StructureTable

Distill from SyntaxGrammar before it is discarded. Grammar::from_raw (crates/plotnik-core/src/grammar/types.rs:127-154) builds the flattened, repeat-free Production/ProductionStep ({symbol, alias, field_name}, prepared.rs) to derive node-shapes, then drops it. Retain a distilled table: per variable, productions as step sequences with (effective kind, named-ness, visibility class ∈ {visible, hidden-subtree, hidden-leaf}, field); plus kind→inhabiting-variables (incl. aliased occurrences), supertype→subtype closure, and extra kinds.

Query side — A_p

Per node pattern, a small NFA over its child list with pattern edges (a child subpattern) and gap edges (self-loops labeled with a node-class set). Derive the gap classes from anchor context by reusing compute_nav_modes (crates/plotnik-compiler/src/compile/navigation.rs) so checker and codegen can't drift:

Context Gap admits
no anchor any node
. both operands named anonymous + extras
. an anonymous operand extras only
.! nothing

Quantifiers → loops (no collapsing); alternation branches → per-branch entry classes; boundary anchors → entry/exit gap class.

Two implementation cruxes (not pinned by the design note)

  1. NodeTypeId ↔ variable/Symbol bridge. The query side resolves kinds to NodeTypeId (the link.rs world); SAT is keyed by variable deliberately — that's the fix for tree-sitter's alias blind spot (e.g. typescript member_expression inside typeof types is built by the aliased _type_query_member_expression, whose internals differ). The bridge must keep both spaces consistent: kind for edge labels, variable for the recursion.
  2. StructureTable shape + visibility classification — recovering token-ness (hidden-leaf) per ProductionStep.symbol post-flatten, and exposing the table cross-crate from plotnik-core.

Why exact w.r.t. the grammar

Vertical: subtree sets of a CFG depend only on the producing variable, not context → per-level decomposition keyed by variable is lossless. Horizontal: threading is the product of production structure with A_p; hidden recursion of any shape (R → R R, balanced parens, mutual recursion) just adds THREAD facts over the same finite state-pair domain.

How this beats tree-sitter (verified against tmp/tree-sitter/lib/src/query.c)

Tree-sitter solves the same problem in ts_query__analyze_patterns / ts_query__perform_analysis but gives up: stack capped at MAX_ANALYSIS_STATE_DEPTH = 8 (query.c:25), iterations at MAX_ANALYSIS_ITERATION_COUNT = 256 (query.c:26); on exceeding either it sets did_abort and the pattern is assumed valid (query.c:1915-1928). It also skips wildcard-rooted patterns with children (parent_step_indices excludes wildcards) and keys nesting by kind, not producing variable. This construction has none of those blind spots. ERROR handling matches tree-sitter's stance: ERROR query steps are always matchable, patterns rooted at ERROR skip structural validation (cf. query.c:1473-1476).

Approximation ledger (all err toward accepting; rejection stays sound)

precedence / associativity / conflicts, dynamic precedence / GLR, external scanners, reserved-word / lexical shadowing, extras placement, string/regex predicates (opaque), ERROR / MISSING (out of scope) — each line accepts more, never rejects more.

Acceptance

  • (function_declaration .! (identifier)) rejected; (function_declaration . (identifier)) accepted.
  • Self-embedding grammars (devicetree balanced parens) terminate with no depth limit.
  • Recursive named definitions (Nested = (call (Nested))) handled by the query-side LFP.
  • Zero false rejections on the valid-query corpus.
  • Diagnostic points at the deepest failing pattern node (e.g. "the first child of function_declaration is async or function, never identifier; the soft anchor . skips anonymous tokens").

Notes

Diagnostics fragments come from #430. A later literal-vs-literal predicate-evaluation pass (precedence-aware filtering) is an optional precision step — only if real queries miss it.

Sizing (design note): ~400 LOC table + ~700 engine + ~300 diagnostics + tests.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions