Feat/hot path optimization with external rpc validation #508

Open
oten91 wants to merge 42 commits into feat/hot-path-optimization from feat/hot-path-optimization-with-external-rpc-validation

Conversation

Contributor

@oten91 oten91 commented Feb 17, 2026

- Update to Go 1.25
- Add optional external RPC validation

When all suppliers in a session are behind the real chain tip, internal
consensus considers everything "in sync" because it only compares
endpoints against each other. Add optional per-service external RPC
endpoints that periodically fetch the real block height as a floor for
perceivedBlockNumber, correctly filtering stale suppliers.

Supports multiple sources per service for redundancy (max wins). Handles
EVM hex, Cosmos decimal, and Solana numeric response formats. Failure-safe:
unreachable sources log a warning and are skipped. External heights are
written to Redis for cross-replica benefit.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@oten91 oten91 requested a review from jorgecuesta February 17, 2026 22:49
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch 2 times, most recently from 113320b to ff9606e on February 17, 2026 23:06
Co-Authored-By: Claude Opus 4.6 <[email protected]>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from ff9606e to e932bd2 on February 17, 2026 23:19
Contributor

@jorgecuesta jorgecuesta left a comment


LGTM, just a minor question about the assumption that all endpoints are JSON-RPC compatible, and a possible optimization.

@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from d1bf40c to 1cb8631 on February 18, 2026 13:50
… mode to external block fetcher

- Replace docker/cli exclude with direct require of v29.2.1 which uses
  moby/moby packages with proper Go modules, fixing the range-over-function
  compilation error in E2E tests.
- Pin docker/docker to v28.5.1 and dockertest to v3.11.0 to fix broken pipe
  error when streaming Docker build context via the API in E2E tests.
- Add type field ("jsonrpc" default or "rest") to ExternalBlockSource config
  so Cosmos REST endpoints (GET /status) work alongside JSON-RPC POST.
- Expand .dockerignore to exclude non-essential directories (docusaurus, docs,
  research, etc.) reducing build context size.
- Add integration tests against real public RPC endpoints (ETH, Solana,
  Osmosis JSON-RPC, Osmosis REST) to validate all chain type parsing.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from 1cb8631 to 25be89b on February 18, 2026 14:10
oten91 and others added 20 commits February 18, 2026 15:20
Prevents cold-start filtering where the external block source reports the
real chain tip before suppliers have reported their heights, causing all
endpoints to appear "behind" and get filtered out.

Default grace period: 60s. Configurable per-service via `grace_period`
in `external_block_sources` config.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Adds ConsumeExternalBlockHeight to Cosmos and Solana QoS instances so
external_block_sources works for all chain types, not just EVM.

Also fixes EVM: skip Redis writes during grace period to prevent other
replicas from picking up the external height before suppliers report.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Docker image build + container startup can take 5+ minutes in CI,
which was consuming the entire 5-minute context timeout before tests
even started. This caused all Vegeta test methods to immediately
report "canceled" with 0 requests.

Move context creation to after getGatewayURLForTestMode() returns,
so the 5-minute timeout only governs actual test execution.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The CI workflow already builds the Docker image in Phase 1 and passes
it as an artifact to Phase 2. But force_rebuild_image was still true
in the default config, causing a redundant ~5 minute rebuild from
scratch (with NoCache) before every E2E test run.

Set force_rebuild_image=false in the CI config script so the test
reuses the pre-built image from Phase 1.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
All 5 matrix jobs (3 HTTP + 2 WebSocket) were downloading the Docker
image artifact simultaneously, triggering GitHub's secondary rate
limit (403 Forbidden). Stagger downloads at 0/5/10/15/20s intervals.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…o QoS instances

QoS instances are created at startup before external health check rules are
loaded, so syncAllowance starts at 0 (disabled). This means stale endpoints
were never filtered even when external rules specified a sync_allowance.

Fix by using atomic.Uint64 for syncAllowance in both EVM and Cosmos QoS
configs, and propagating the value from the health check executor when
external rules are loaded/refreshed.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…eneric QoS

Services using generic/NoOp QoS (near, sui, tron) previously had no block
height tracking, meaning stale endpoints were never filtered even when
external block sources were configured. This converts NoOpQoS from a
stateless value type to a stateful pointer type with per-endpoint block
height tracking, sync-allowance-based filtering, and external block height
consumption — matching the pattern used by Cosmos and Solana QoS.

Default behavior is unchanged: when syncAllowance=0 (default), endpoint
selection remains purely random.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
In multi-replica deployments, each replica updates perceivedBlockHeight
locally only. This means Replica A may see block 1000 while Replica B
stays at 950, causing B to select stale endpoints. EVM already had full
Redis sync; this extends the same pattern to the remaining QoS types.

Adds SetReputationService, StartBackgroundSync (periodic Redis reads),
and async Redis writes in UpdateFromExtractedData and
ConsumeExternalBlockHeight for Solana, Cosmos, and NoOp QoS. Uses simple
max semantics (if Redis > local, update) rather than EVM's consensus
mechanism. No changes to cmd/main.go needed — existing duck-typed
interface checks auto-detect the new methods.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
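The "simple max semantics" mentioned above reduce to a one-way merge, sketched here with illustrative names:

```go
package main

import "fmt"

// syncFromRedis applies simple max semantics: the local perceived
// height is only ever raised toward the shared Redis value, never
// lowered, so a replica with fresher local data is not downgraded.
func syncFromRedis(local, redis uint64) uint64 {
	if redis > local {
		return redis
	}
	return local
}

func main() {
	fmt.Println(syncFromRedis(950, 1000)) // replica B catches up: 1000
	fmt.Println(syncFromRedis(1000, 950)) // stale Redis value ignored: 1000
}
```

This is deliberately weaker than EVM's consensus mechanism, but sufficient for block heights, which are monotonically increasing.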
…onfigured

NewSimpleQoSInstance (used for all EVM services) sets getEVMChainID()
to "". However, isChainIDValid() still ran unconditionally, rejecting
every observed endpoint — either because it had no chain ID observation
yet (errNoChainIDObs) or because the observed chain ID (e.g. "0x38")
didn't match "". This caused a flood of warn logs and forced fallback
to random endpoint selection.

Skip chain ID validation and synthetic eth_chainId checks entirely when
no expected chain ID is configured, consistent with the constructor's
intent to delegate validation to active health checks.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…a stale supplier filtering

Health checks only run on the leader pod, so non-leader replicas had empty
per-endpoint block heights and could not filter stale suppliers. This adds
Redis HSET/HGETALL sync for per-endpoint block heights across all 4 QoS types
(EVM, Solana, Cosmos, NoOp), enabling all replicas to filter stale endpoints.

Write side: async HSET from UpdateFromExtractedData after each health check.
Read side: HGETALL on the existing 5s background sync ticker, updating only
existing local endpoint entries where Redis has a higher block height.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Tests cover both write side (UpdateFromExtractedData writes per-endpoint
block heights to Redis) and read side (StartBackgroundSync reads from Redis
and updates local endpoint store) for all 4 QoS types: EVM, Solana, Cosmos,
and NoOp. Verifies that:
- Higher Redis values update local endpoints
- Lower Redis values do NOT downgrade local endpoints
- Non-existent local endpoints are NOT created from Redis data
- Periodic sync picks up new Redis data after startup

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Non-leader replicas had empty endpoint stores at startup because health
checks only run on the leader pod. The syncEndpointBlocksFromRedis
closure would fetch per-endpoint block heights from Redis but silently
skip all entries since none existed locally. This meant stale suppliers
were never filtered on non-leader pods.

Now, EVM/Cosmos/NoOp QoS types create new endpoint entries from Redis
data when the endpoint doesn't exist in the local store, enabling
block height filtering from the first request. Solana is excluded
because its validateBasic() requires health+epoch data.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…tractBlockHeight

Extends extractBlockHeight to handle non-EVM response formats for external
block sources on generic chains (near, sui, tron):
- Decimal strings without 0x prefix parsed as base-10 (Sui)
- Numeric latest_block_height in sync_info objects (NEAR)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…e period

Add mdbx_panic to the heuristic error indicator patterns so that Erigon
nodes with corrupted/full MDBX databases trigger retries on a different
supplier instead of returning the error directly to users. Also reduce
the external block grace period from 60s to 30s to shorten the cold
start window where stale suppliers can leak through.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… comet_bft unsupported

CometBFT methods (status, block, validators, etc.) sent via JSON-RPC POST
were rejected with "unsupported RPC type" when suppliers hadn't staked
comet_bft endpoints. Since CometBFT nodes handle both REST and JSON-RPC
interfaces on the same endpoint, these requests can safely route to json_rpc
endpoints instead. This fallback is a no-op when comet_bft is properly staked.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ng UNKNOWN_RPC

NoOp QoS was discarding the RPC type detected by the gateway and always
setting UNKNOWN_RPC on service payloads. This caused all generic services
(tron, sui, near) to report unknown_rpc in metrics/observations. Now the
detected type (json_rpc, rest, etc.) flows through to the payload.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Endpoint block height entries accumulate indefinitely as sessions rotate,
causing non-leader replicas to recreate stale entries from Redis on every
sync cycle. This adds a lastSeen timestamp to each endpoint entry, updated
during filtering when endpoints appear in active sessions. A periodic sweep
(every 5s tick) removes entries older than 2h TTL from both local stores
and Redis via HDEL.

Changes across all QoS types (EVM, Cosmos, Solana, NoOp):
- Add lastSeen field to endpoint structs
- Add touchEndpoints() to update lastSeen during filtering (separate WLock)
- Add sweepStaleEndpoints() to remove entries past TTL
- Call sweep in StartBackgroundSync tick loop
- Add RemoveEndpointBlockHeights to ReputationService/Storage interfaces
- Implement HDEL-based cleanup in Redis storage, delete in memory storage
- Update all 6 test mocks with the new interface method

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ove them

Entries created by syncEndpointBlocksFromRedis had lastSeen=zero, which
the sweep intentionally skips. This meant Redis-synced entries were never
cleaned up. Set lastSeen=time.Now() on new entries so they get swept after
the 2h TTL if they never appear in an active session.

Also fix NoOp UpdateFromExtractedData to preserve existing lastSeen instead
of overwriting the entire endpointState struct.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Broken/lazy suppliers return empty responses like "result":[],
"result":{}, or bare {}. This adds two layers of defense:

Layer 1 — Heuristic detection (retry on different supplier):
- REST bare {} → rest_empty_object (confidence 0.80, retry)
- JSON-RPC "result":{} → confidence 0.95, MajorErrorSignal (never valid
  for any JSON-RPC method)
- JSON-RPC "result":[] → confidence 0.75, MinorErrorSignal (valid for
  some methods like eth_getLogs, retry is benign)

Layer 2 — Block height filter (prevent future requests):
- Add ErrInvalidBlockHeightResult sentinel error for extractors to
  signal "response had a result but it was unparseable"
- EVM extractor: returns sentinel when eth_blockNumber gets non-string
  result ([], {}, number, boolean) or unparseable hex
- Cosmos extractor: returns sentinel when CometBFT JSON-RPC response
  has empty container result ({} or [])
- UpdateFromExtractedData (EVM/Cosmos/Solana): stores block height 0
  when InvalidBlockHeight flag is set, causing the endpoint to be
  filtered (0 < perceived - sync_allowance)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…infrastructure

When a relay fails, retries now exclude ALL endpoints behind the same
domain (host), not just the exact endpoint address. This prevents wasting
retry budget on different supplier endpoints that share the same broken
backend (e.g., multiple suppliers behind rel.spacebelt.xyz all returning
empty responses).

Also stops retrying entirely when all available endpoints are from tried
domains, instead of resetting and cycling through the same broken
infrastructure with backoff.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
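The domain-level exclusion above can be sketched with the standard library's URL parsing; function names are illustrative:

```go
package main

import (
	"fmt"
	"net/url"
)

// endpointDomain reduces an endpoint URL to its host so a retry can
// exclude every endpoint behind the same broken infrastructure, not
// just the one address that failed.
func endpointDomain(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	return u.Hostname(), nil
}

// filterTriedDomains drops candidates whose domain already failed;
// when nothing is left, the caller stops retrying instead of cycling
// through the same broken infrastructure with backoff.
func filterTriedDomains(candidates []string, tried map[string]bool) []string {
	var out []string
	for _, c := range candidates {
		if d, err := endpointDomain(c); err == nil && !tried[d] {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	tried := map[string]bool{"rel.spacebelt.xyz": true}
	left := filterTriedDomains([]string{
		"https://rel.spacebelt.xyz/v1/a", // same broken backend
		"https://rel.spacebelt.xyz/v1/b", // also excluded
		"https://other.example.com/v1/c",
	}, tried)
	fmt.Println(left) // only the endpoint on a different domain remains
}
```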
oten91 and others added 19 commits February 22, 2026 12:25
…mains fleet-wide

When a domain's infrastructure is broken (e.g., returning empty CometBFT
responses), each pod previously discovered this independently on every request,
wasting the first relay attempt. Now when any pod discovers a broken domain,
it shares this via Redis so all pods skip that domain immediately — including
on the first attempt. Only ~1 request per TTL window probes the domain.

Standalone DomainCircuitBreaker component (not on ReputationService) avoids
touching 6 test mocks. Uses Redis HASH with per-field TTL for cross-pod
sharing and a local cache with lazy 5s refresh for zero hot-path Redis calls.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ositive)

CometBFT's "health" method legitimately returns {"result":{}} when the
node is healthy. The heuristic was unconditionally flagging "result":{}
as a broken supplier response at 0.95 confidence, causing every CometBFT
health request to waste a retry attempt across all Cosmos chains.

The fix passes rpcType to analyzeJSONRPC and skips the empty object check
for COMET_BFT, where "result":{} is a valid success response. EVM/Solana
chains still flag "result":{} since no EVM method returns an empty object.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…paths

CometBFT health method legitimately returns "result":{}, but the
COMET_BFT rpc type gets lost through two fallback paths before reaching
the heuristic analyzer:

1. QoS validator: COMET_BFT → JSON_RPC when not in supportedAPIs
2. Shannon URL fallback: COMET_BFT → JSON_RPC when endpoint lacks URL

Add Payload.OriginalRPCType + EffectiveRPCType() to preserve the
original detected type for heuristic analysis while routing through
the fallback type. All checkResponseSuccess calls now use the
effective type so CometBFT responses are analyzed correctly regardless
of which fallback path was taken.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…h_getLogs

"result":[] is a legitimate response for methods like eth_getLogs,
eth_accounts, and eth_getFilterChanges. Retrying was previously
considered "benign", but with the domain circuit breaker it now
punishes valid responses by circuit-breaking the entire domain.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… suppliers

"result":[] was removed in acd7df5 because it false-positived on eth_getLogs
and triggered the circuit breaker on valid responses. But most JSON-RPC methods
(eth_blockNumber, eth_getBalance, etc.) should never return an array — suppliers
doing so are broken. This re-adds the detection with a whitelist of 17 methods
where empty arrays are valid (eth_getLogs, trace_filter, debug_getBadBlocks, etc.).

The JSON-RPC method name is now threaded from QoS parsing (jsonrpc.BuildPayload)
through the Payload struct, checkResponseSuccess, and into heuristic.Analyze.
When the method is unknown (empty string), the check is conservatively skipped.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
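The whitelist-gated check described above can be sketched as follows. The method set here is an illustrative subset of the commit's 17-method whitelist, and the function names are hypothetical:

```go
package main

import "fmt"

// emptyArrayOKMethods: methods for which "result":[] is a legitimate
// response (illustrative subset of the commit's whitelist).
var emptyArrayOKMethods = map[string]bool{
	"eth_getLogs":          true,
	"eth_accounts":         true,
	"eth_getFilterChanges": true,
	"trace_filter":         true,
	"debug_getBadBlocks":   true,
}

// isSuspectEmptyArray flags "result":[] only when the method is known
// and not whitelisted; with an unknown (empty) method the check is
// conservatively skipped, as described in the commit message.
func isSuspectEmptyArray(method string, resultIsEmptyArray bool) bool {
	if !resultIsEmptyArray || method == "" {
		return false
	}
	return !emptyArrayOKMethods[method]
}

func main() {
	fmt.Println(isSuspectEmptyArray("eth_blockNumber", true)) // true: broken supplier
	fmt.Println(isSuspectEmptyArray("eth_getLogs", true))     // false: valid empty result
	fmt.Println(isSuspectEmptyArray("", true))                // false: method unknown, skip
}
```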
Circuit breaker now uses escalating TTLs (1m→2m→4m→8m→16m→30m cap)
for domains that repeatedly break, instead of a flat 60s. High-confidence
deceptive response patterns (empty array/object results) are upgraded from
MajorErrorSignal (-10) to CriticalErrorSignal (-25) to trigger the
strike system with exponential cooldowns. Also extends circuit breaker
filtering to websocket endpoint selection and adds Solana method-aware
empty array detection.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…alid responses

Solana's GetServicePayloads() was not setting JSONRPCMethod on the
protocol payload, so the heuristic analyzer received an empty method
and could not perform method-aware empty array detection. This caused
"result":[] from broken suppliers to be classified as jsonrpc_success
instead of jsonrpc_invalid_empty_array for scalar methods like getSlot.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…reset

A single success was resetting CriticalStrikes to 0, allowing deceptive
suppliers to avoid the cooldown system entirely by interleaving fake
successes with critical errors. Now strikes decay by 1 per success,
requiring sustained good behavior to fully recover.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
WebSocket handler was missing RPC type validation, allowing connections
for services that don't have "websocket" in their configured rpc_types.
This let suppliers pump unlimited success signals via WebSocket for
services like Solana (json_rpc only), bypassing reputation penalties.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
RPCTypeValidator was defined on the Gateway struct and checked in both
HTTP and WebSocket handlers, but never instantiated — always nil.
This meant RPC type validation was silently skipped for all requests,
allowing WebSocket connections to services like Solana that only
support json_rpc.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
At 50+ critical strikes, `1 << exponent` overflows int64, producing
zero or negative cooldown durations. This caused suppliers with 100+
strikes to have instantly-expiring cooldowns instead of the intended
1-hour max. Cap exponent at 7 (5min * 2^7 = 640min > 60min max).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…to all protocols

Some suppliers return a static {"jsonrpc":"2.0","id":1,"result":[]} for
ALL requests regardless of protocol type, including REST and CometBFT
paths. This evades current detection because:
- REST analyzeREST() only looks for error indicators
- CometBFT "result":[] check requires jsonrpcMethod which is empty for GET
- Cosmos health checks lack expected_response_contains validation

Changes:
- Add REST protocol mismatch detection (JSON-RPC response to REST request)
- Add CometBFT empty array detection (no CometBFT method returns arrays)
- Register both as deceptive patterns for CriticalErrorSignal + strikes
- Set JSONRPCMethod on Cosmos JSON-RPC payloads (aligns with EVM/Solana)
- Add expected_response_contains to all 15 Cosmos chain health checks

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Sei is a dual chain (Cosmos + EVM) but most suppliers only run EVM
nodes. The CometBFT health checks (health, status, syncing) were
penalizing EVM-only suppliers that correctly reject unknown methods
like "status" with "method does not exist". Switch to eth_blockNumber
and eth_chainId which all sei suppliers support.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spacebelt returns {"result":[]} to eth_blockNumber which passes basic
HTTP 200 check. Adding expected_response_contains: '"0x' catches this
since real EVM responses always contain hex strings.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Some suppliers return HTTP 200 with
{"error":{"code":-32000,"message":"Failed to call fallback API"}} when
they can't serve archival requests. This was leaking through to clients
as a 200 while the same class of error ("missing trie node") from other
suppliers was correctly caught and retried.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Suppliers returning fast canned responses like {"result":[]} pass
transport-level checks (no error, has bytes), win the hedge race, and
cause the correct response from another supplier to be discarded. The
heuristic later catches the bad response but the good one is gone.

Now the hedge layer runs the same heuristic analysis before declaring
a winner. If the first response fails heuristic validation, it waits
for the second response instead of immediately returning the bad one.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… major

The syncing REST health check had reputation_signal: critical_error,
causing CriticalErrorSignal (strikes + exponential cooldown) on any
health check failure. This was unfairly penalizing legitimate suppliers
whose REST endpoint occasionally failed.

Changed to major_error for all 14 Cosmos chains. The status JSON-RPC
check retains critical_error since its expected_response_contains
targets canned responses specifically.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…height

The session cache key previously included the current block height, which
changes every block (~60s). Since sessions span ~50 blocks, this caused a
cache miss on every new block for the same session, making the cache
ineffective and overloading full nodes with redundant GetSession gRPC calls.

The fix computes sessionStartHeight deterministically from the current
height and NumBlocksPerSession (already cached via GetSharedParams), so
the cache key remains stable for the entire session window.

Falls back to current height if SharedParams is unavailable, preserving
previous behavior rather than failing requests.

Closes #509
The original formula `h - (h % N)` is 0-based, but Shannon sessions
start at block 1: [1,60], [61,120], etc. This caused the last block
of each session to get a different cache key, resulting in 2 fetches
per session instead of 1.

Use Shannon's formula `h - ((h-1) % N)` to align cache keys with
actual session boundaries. Updated tests to use NumBlocksPerSession=60
matching production config.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
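The two formulas can be compared directly. A minimal sketch of the 1-based session-start computation described above (function name is illustrative):

```go
package main

import "fmt"

// sessionStartHeight maps a block height to the first block of its
// session using Shannon's 1-based formula, so every block in a session
// window like [1,60] produces the same cache key.
func sessionStartHeight(h, numBlocksPerSession int64) int64 {
	return h - ((h - 1) % numBlocksPerSession)
}

func main() {
	// The 0-based formula h-(h%N) put block 60 in a different bucket
	// than blocks 1..59, causing 2 session fetches instead of 1.
	for _, h := range []int64{1, 59, 60, 61} {
		fmt.Println(h, "->", sessionStartHeight(h, 60)) // 1, 1, 1, 61
	}
}
```

With N=60: block 60 gives 60 - (59 % 60) = 1, staying in session [1,60], while the old formula gave 60 - (60 % 60) = 60, splitting the session's last block onto its own cache key.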