Skip to content

fix(pipeline): publisher fencing/WAL + agent gossip + network-sim parity (4/4)#334

Open
Bojan131 wants to merge 2 commits intomainfrom
fix-pipeline-publisher-agent-network
Open

fix(pipeline): publisher fencing/WAL + agent gossip + network-sim parity (4/4)#334
Bojan131 wants to merge 2 commits intomainfrom
fix-pipeline-publisher-agent-network

Conversation

@Bojan131
Copy link
Copy Markdown
Contributor

Summary

PR 4 of 4 — the final and heaviest slice in the split of #327. This carries the data-pipeline layer where the publisher↔agent gossip flow cross-cuts the most fixtures, which is why all the related changes ride together.

Changes by category

Publisher bug fixes

  • P-2 — fencing token in `dkg-publisher.ts`. Concurrent publish requests could step on each other's WAL entries; now each publish acquires a monotonic fencing token, and the WAL recovery path rejects entries whose token is older than the latest known.
  • `chain-event-poller.ts` — transient-error classification. Network blips now get retried with backoff instead of bubbling as fatal; permanent errors still surface.
  • `async-lift-publisher-impl.ts` — tighter key-bound subtraction so the lift accumulator can't drift across cg boundaries.
  • `update-handler.ts` — r23-4 fix: don't double-emit on the publish retry path.

Agent bug fixes

  • A-5 — gossip envelope signing in `dkg-agent.ts` (canonicalisation step previously dropped optional fields, causing signature mismatches on round-trip through the network-sim).
  • A-7 — DID-format validation hardened in `endorse.ts` — accept the full `did:*` spec rather than the v9 short-form only.
  • A-12 / A-13 — workspace-config validation now catches missing cg-private-key + asymmetric per-cg overrides at config-load time rather than at first publish.
  • A-15 — signed-gossip-publish egress now includes the cg-id binding so a malicious peer can't replay a signed envelope to a different cg.

Network-sim

  • K-4 / K-5 — deterministic RNG + libp2p parity in `sim-engine.ts`. The sim previously seeded its RNG from `Math.random()`, making test failures non-reproducible. The gossip-fanout topology also diverged from real libp2p peer-selection; now matches `pubsub.gossipsub` heartbeat semantics so sim-only passes can't lie about real-network behaviour. (These are the K-4/K-5 sentinels documented in `.test-audit/BUGS_FOUND.md`.)

Tests added

  • `publisher/test/wal-recovery.test.ts` (new, +1320) — full P-2 fencing-token lifecycle.
  • `publisher/test/chain-event-poller-r24-4.test.ts` (new, +478).
  • `publisher/test/chain-event-poller-transient-classifier.test.ts` (new, +214).
  • `publisher/test/async-lift-subtraction-key-bound.test.ts` (new, +167).
  • `publisher/test/per-cg-quorum-state.test.ts` (new, +245).
  • `publisher/test/update-handler-r23-4.test.ts` (new, +132).
  • `agent/test/op-wallets-and-workspace-config.test.ts` (new, +503) — A-12/A-13 matrix.
  • `agent/test/per-cg-quorum-rpc-failure-extra.test.ts` (new, +177).
  • `agent/test/signed-gossip-publish-egress.test.ts` (new, +150) — A-15 cg-id binding.
  • `agent/test/strict-gossip-envelope-extra.test.ts` (new, +106) — A-5 canonicalisation.
  • `agent/test/finalization-handler.test.ts` (new, +103).
  • `agent/test/wm-multi-agent-isolation-extra.test.ts` (+393).
  • `network-sim/test/network-sim-extra.test.ts` (+187) — K-4/K-5 assertions.

Coverage ratchet

  • `tornadoAgentCoverage` 67/68/57/66 → 75/78/63/74.

Sequencing context

This is slice 4 of 4 from the #327 split:

  1. contracts + chain — fix(contracts,chain): V10 dual-emit + ABI regen + hardhat test rewrites (1/4) #331 (open)
  2. storage + query — fix(storage,query): RDF correctness + crypto hardening + Blazegraph CI (2/4) #332 (open)
  3. operator surface — fix(operator): cli auth/keystore/autoupdater + mcp-server + openclaw (3/4) #333 (open)
  4. This PR — pipeline (~10.2k lines)

This PR sits on top of the other layers and may need a clean rebase if any of them merge first. No code-level conflicts expected — only fixture conflicts if storage adapter contracts change.

Test plan

  • CI green on this branch
  • `pnpm --filter @origintrail-official/dkg-publisher test` locally — wal-recovery is the heaviest suite
  • `pnpm --filter @origintrail-official/dkg-agent test` locally
  • `pnpm --filter @origintrail-official/dkg-network-sim test` locally — verify the sim is now deterministic across runs
  • Manual: kill+restart a publisher mid-WAL and verify recovery rejects the stale fencing token

Made with Cursor

Fixes the data-pipeline layer of the bug-fix backlog from PR #327.
Slice 4 of 4 (split by architectural layer for reviewability) — the
heaviest slice because the publisher↔agent gossip flow cross-cuts
the most fixtures.

Publisher bug fixes
- P-2: fencing token in dkg-publisher.ts. Concurrent publish requests
  could step on each other's WAL entries; now each publish acquires
  a monotonic fencing token, and the WAL recovery path rejects entries
  whose token is older than the latest known.
- chain-event-poller.ts: transient-error classification. Network blips
  now get retried with backoff instead of bubbling as fatal; permanent
  errors still surface.
- async-lift-publisher-impl.ts: tighter key-bound subtraction so the
  lift accumulator can't drift across cg boundaries.
- update-handler.ts: r23-4 fix — don't double-emit on the publish
  retry path.

Agent bug fixes
- A-5: gossip envelope signing in dkg-agent.ts (the canonicalisation
  step previously dropped optional fields, causing signature mismatches
  on round-trip through the network-sim).
- A-7: DID-format validation hardened in endorse.ts — accept the
  full did:* spec rather than the v9 short-form only.
- A-12 / A-13: workspace-config validation now catches missing
  cg-private-key + asymmetric per-cg overrides at config-load time
  rather than at first publish (much better operator UX).
- A-15: signed-gossip-publish egress now includes the cg-id binding
  so a malicious peer can't replay a signed envelope to a different cg.

Network-sim
- K-4 / K-5: deterministic RNG + libp2p parity in sim-engine.ts. The
  sim previously seeded its RNG from Math.random(), making test
  failures non-reproducible. Also, the gossip-fanout topology
  diverged from real libp2p peer-selection; now matches the
  pubsub.gossipsub heartbeat semantics so sim-only passes can't lie
  about real-network behaviour. (These are the K-4/K-5 sentinels
  documented in .test-audit/BUGS_FOUND.md.)

Tests added
- publisher/test/wal-recovery.test.ts (new, +1320 lines): full P-2
  fencing-token lifecycle including crash/restore, concurrent-writer
  starvation, and out-of-order replay rejection.
- publisher/test/chain-event-poller-r24-4.test.ts (new, +478 lines):
  transient-vs-fatal classification matrix.
- publisher/test/chain-event-poller-transient-classifier.test.ts
  (new, +214 lines): unit-level coverage for the classifier itself.
- publisher/test/async-lift-subtraction-key-bound.test.ts (new, +167):
  the cg-boundary regression for async-lift.
- publisher/test/per-cg-quorum-state.test.ts (new, +245).
- publisher/test/update-handler-r23-4.test.ts (new, +132).
- agent/test/op-wallets-and-workspace-config.test.ts (new, +503):
  A-12/A-13 operator-config validation matrix.
- agent/test/per-cg-quorum-rpc-failure-extra.test.ts (new, +177):
  RPC-failure isolation per cg quorum.
- agent/test/signed-gossip-publish-egress.test.ts (new, +150):
  A-15 cg-id binding on egress.
- agent/test/strict-gossip-envelope-extra.test.ts (new, +106):
  A-5 envelope canonicalisation regression.
- agent/test/finalization-handler.test.ts (new, +103).
- agent/test/wm-multi-agent-isolation-extra.test.ts (+393): expanded
  isolation coverage.
- network-sim/test/network-sim-extra.test.ts (+187): K-4/K-5
  deterministic-RNG + parity assertions.

Coverage ratchet
- tornadoAgentCoverage 67/68/57/66 → 75/78/63/74.

Sequencing: this is PR 4 of 4 — the final and largest slice.
PR 1 (#331) contracts+chain, PR 2 (#332) storage+query, PR 3 (#333)
operator surface. This PR sits on top of those layers and may need
a clean rebase if any of them merge first.

Made-with: Cursor
Two AA-2 assertions in attested-assets-extra.test.ts used to fan-out a
fixed number of `setTimeout(r, 0)` yields between an async gossip
publish (which enqueues an async ed25519 verification on the receiver)
and the assertion observing the resulting event. On slower CI runners
the verification didn't resolve before the assertions ran, so the test
false-failed.

Switched to a 10ms-poll `waitFor(predicate, timeoutMs)` helper that
polls the observable we are about to assert against. Faster on the
happy path AND immune to the race. Belongs in this PR because the race
is on the agent/publisher gossip-publish path covered by slice 4.

Made-with: Cursor
const message = err instanceof Error ? err.message : String(err);
const isTransientHeadRace =
/block range extends beyond current head block/i.test(message)
|| /code=UNKNOWN_ERROR.*32602/i.test(message);
&& opts.agentAddress
&& this.localAgents.size > 1
&& !callerSelfReadsOwnWm
&& !opts.adminAuthenticated
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this bypass never fires today because /api/query still only passes callerAgentAddress into agent.query() and does not set adminAuthenticated. On multi-agent nodes, valid node-level admin tokens will now hit the new WM signature gate and get empty results for legitimate cross-agent reads. Either plumb the flag from the daemon in this PR or derive the admin case here from the authenticated caller context.

throw new Error('workspace config: root must be an object');
}
const obj = raw as Record<string, unknown>;
const contextGraph = obj.contextGraph;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Bug: this parser hard-requires contextGraph, but the existing workspace config contract still accepts project as a backward-compatible alias (packages/mcp-dkg/src/config.ts already does contextGraph ?? project). Any repo that still uses project: in .dkg/config.* or dkg-config fences will start failing to load here. Accept both keys to avoid breaking existing workspaces.

);
if (bData.bindings.length > 0) break;
const confirmedAsk = await nodeB.query(
`ASK { GRAPH ?g { ?kc <http://dkg.io/ontology/status> "confirmed" } }`,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Issue: this ASK is unscoped, so any unrelated confirmed KC already present in node B will make bHasConfirmed true and let the loop pass even if ENTITY_1 only arrived via sync. Scope the check to the batch under test (for example by UAL/root entity in the target _meta graph) so the test actually proves FinalizationHandler ran for this publish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants