Feat/hot path optimization with external rpc validation#508
Open
oten91 wants to merge 42 commits intofeat/hot-path-optimizationfrom
Open
Feat/hot path optimization with external rpc validation#508oten91 wants to merge 42 commits intofeat/hot-path-optimizationfrom
oten91 wants to merge 42 commits intofeat/hot-path-optimizationfrom
Conversation
When all suppliers in a session are behind the real chain tip, internal consensus considers everything "in sync" because it only compares endpoints against each other. Add optional per-service external RPC endpoints that periodically fetch the real block height as a floor for perceivedBlockNumber, correctly filtering stale suppliers. Supports multiple sources per service for redundancy (max wins). Handles EVM hex, Cosmos decimal, and Solana numeric response formats. Failure-safe: unreachable sources log a warning and are skipped. External heights are written to Redis for cross-replica benefit. Co-Authored-By: Claude Opus 4.6 <[email protected]>
113320b to
ff9606e
Compare
Co-Authored-By: Claude Opus 4.6 <[email protected]>
ff9606e to
e932bd2
Compare
jorgecuesta
approved these changes
Feb 18, 2026
Contributor
jorgecuesta
left a comment
There was a problem hiding this comment.
LGTM, just a minor question about it assuming all the endpoints are JSON rpc compatible and a possible optimization.
d1bf40c to
1cb8631
Compare
… mode to external block fetcher
- Replace docker/cli exclude with direct require of v29.2.1 which uses
moby/moby packages with proper Go modules, fixing the range-over-function
compilation error in E2E tests.
- Pin docker/docker to v28.5.1 and dockertest to v3.11.0 to fix broken pipe
error when streaming Docker build context via the API in E2E tests.
- Add type field ("jsonrpc" default or "rest") to ExternalBlockSource config
so Cosmos REST endpoints (GET /status) work alongside JSON-RPC POST.
- Expand .dockerignore to exclude non-essential directories (docusaurus, docs,
research, etc.) reducing build context size.
- Add integration tests against real public RPC endpoints (ETH, Solana,
Osmosis JSON-RPC, Osmosis REST) to validate all chain type parsing.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
1cb8631 to
25be89b
Compare
Prevents cold-start filtering where the external block source reports the real chain tip before suppliers have reported their heights, causing all endpoints to appear "behind" and get filtered out. Default grace period: 60s. Configurable per-service via `grace_period` in `external_block_sources` config. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Adds ConsumeExternalBlockHeight to Cosmos and Solana QoS instances so external_block_sources works for all chain types, not just EVM. Also fixes EVM: skip Redis writes during grace period to prevent other replicas from picking up the external height before suppliers report. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Docker image build + container startup can take 5+ minutes in CI, which was consuming the entire 5-minute context timeout before tests even started. This caused all Vegeta test methods to immediately report "canceled" with 0 requests. Move context creation to after getGatewayURLForTestMode() returns, so the 5-minute timeout only governs actual test execution. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The CI workflow already builds the Docker image in Phase 1 and passes it as an artifact to Phase 2. But force_rebuild_image was still true in the default config, causing a redundant ~5 minute rebuild from scratch (with NoCache) before every E2E test run. Set force_rebuild_image=false in the CI config script so the test reuses the pre-built image from Phase 1. Co-Authored-By: Claude Opus 4.6 <[email protected]>
All 5 matrix jobs (3 HTTP + 2 WebSocket) were downloading the Docker image artifact simultaneously, triggering GitHub's secondary rate limit (403 Forbidden). Stagger downloads at 0/5/10/15/20s intervals. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…o QoS instances QoS instances are created at startup before external health check rules are loaded, so syncAllowance starts at 0 (disabled). This means stale endpoints were never filtered even when external rules specified a sync_allowance. Fix by using atomic.Uint64 for syncAllowance in both EVM and Cosmos QoS configs, and propagating the value from the health check executor when external rules are loaded/refreshed. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…eneric QoS Services using generic/NoOp QoS (near, sui, tron) previously had no block height tracking, meaning stale endpoints were never filtered even when external block sources were configured. This converts NoOpQoS from a stateless value type to a stateful pointer type with per-endpoint block height tracking, sync-allowance-based filtering, and external block height consumption — matching the pattern used by Cosmos and Solana QoS. Default behavior is unchanged: when syncAllowance=0 (default), endpoint selection remains purely random. Co-Authored-By: Claude Opus 4.6 <[email protected]>
In multi-replica deployments, each replica updates perceivedBlockHeight locally only. This means Replica A may see block 1000 while Replica B stays at 950, causing B to select stale endpoints. EVM already had full Redis sync; this extends the same pattern to the remaining QoS types. Adds SetReputationService, StartBackgroundSync (periodic Redis reads), and async Redis writes in UpdateFromExtractedData and ConsumeExternalBlockHeight for Solana, Cosmos, and NoOp QoS. Uses simple max semantics (if Redis > local, update) rather than EVM's consensus mechanism. No changes to cmd/main.go needed — existing duck-typed interface checks auto-detect the new methods. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…onfigured NewSimpleQoSInstance (used for all EVM services) sets getEVMChainID() to "". However, isChainIDValid() still ran unconditionally, rejecting every observed endpoint — either because it had no chain ID observation yet (errNoChainIDObs) or because the observed chain ID (e.g. "0x38") didn't match "". This caused a flood of warn logs and forced fallback to random endpoint selection. Skip chain ID validation and synthetic eth_chainId checks entirely when no expected chain ID is configured, consistent with the constructor's intent to delegate validation to active health checks. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…a stale supplier filtering Health checks only run on the leader pod, so non-leader replicas had empty per-endpoint block heights and could not filter stale suppliers. This adds Redis HSET/HGETALL sync for per-endpoint block heights across all 4 QoS types (EVM, Solana, Cosmos, NoOp), enabling all replicas to filter stale endpoints. Write side: async HSET from UpdateFromExtractedData after each health check. Read side: HGETALL on the existing 5s background sync ticker, updating only existing local endpoint entries where Redis has a higher block height. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Tests cover both write side (UpdateFromExtractedData writes per-endpoint block heights to Redis) and read side (StartBackgroundSync reads from Redis and updates local endpoint store) for all 4 QoS types: EVM, Solana, Cosmos, and NoOp. Verifies that: - Higher Redis values update local endpoints - Lower Redis values do NOT downgrade local endpoints - Non-existent local endpoints are NOT created from Redis data - Periodic sync picks up new Redis data after startup Co-Authored-By: Claude Opus 4.6 <[email protected]>
Non-leader replicas had empty endpoint stores at startup because health checks only run on the leader pod. The syncEndpointBlocksFromRedis closure would fetch per-endpoint block heights from Redis but silently skip all entries since none existed locally. This meant stale suppliers were never filtered on non-leader pods. Now, EVM/Cosmos/NoOp QoS types create new endpoint entries from Redis data when the endpoint doesn't exist in the local store, enabling block height filtering from the first request. Solana is excluded because its validateBasic() requires health+epoch data. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…tractBlockHeight Extends extractBlockHeight to handle non-EVM response formats for external block sources on generic chains (near, sui, tron): - Decimal strings without 0x prefix parsed as base-10 (Sui) - Numeric latest_block_height in sync_info objects (NEAR) Co-Authored-By: Claude Opus 4.6 <[email protected]>
…e period Add mdbx_panic to the heuristic error indicator patterns so that Erigon nodes with corrupted/full MDBX databases trigger retries on a different supplier instead of returning the error directly to users. Also reduce the external block grace period from 60s to 30s to shorten the cold start window where stale suppliers can leak through. Co-Authored-By: Claude Opus 4.6 <[email protected]>
… comet_bft unsupported CometBFT methods (status, block, validators, etc.) sent via JSON-RPC POST were rejected with "unsupported RPC type" when suppliers hadn't staked comet_bft endpoints. Since CometBFT nodes handle both REST and JSON-RPC interfaces on the same endpoint, these requests can safely route to json_rpc endpoints instead. This fallback is a no-op when comet_bft is properly staked. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ng UNKNOWN_RPC NoOp QoS was discarding the RPC type detected by the gateway and always setting UNKNOWN_RPC on service payloads. This caused all generic services (tron, sui, near) to report unknown_rpc in metrics/observations. Now the detected type (json_rpc, rest, etc.) flows through to the payload. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Endpoint block height entries accumulate indefinitely as sessions rotate, causing non-leader replicas to recreate stale entries from Redis on every sync cycle. This adds a lastSeen timestamp to each endpoint entry, updated during filtering when endpoints appear in active sessions. A periodic sweep (every 5s tick) removes entries older than 2h TTL from both local stores and Redis via HDEL. Changes across all QoS types (EVM, Cosmos, Solana, NoOp): - Add lastSeen field to endpoint structs - Add touchEndpoints() to update lastSeen during filtering (separate WLock) - Add sweepStaleEndpoints() to remove entries past TTL - Call sweep in StartBackgroundSync tick loop - Add RemoveEndpointBlockHeights to ReputationService/Storage interfaces - Implement HDEL-based cleanup in Redis storage, delete in memory storage - Update all 6 test mocks with the new interface method Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ove them Entries created by syncEndpointBlocksFromRedis had lastSeen=zero, which the sweep intentionally skips. This meant Redis-synced entries were never cleaned up. Set lastSeen=time.Now() on new entries so they get swept after the 2h TTL if they never appear in an active session. Also fix NoOp UpdateFromExtractedData to preserve existing lastSeen instead of overwriting the entire endpointState struct. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Broken/lazy suppliers return empty responses like "result":[],
"result":{}, or bare {}. This adds two layers of defense:
Layer 1 — Heuristic detection (retry on different supplier):
- REST bare {} → rest_empty_object (confidence 0.80, retry)
- JSON-RPC "result":{} → confidence 0.95, MajorErrorSignal (never valid
for any JSON-RPC method)
- JSON-RPC "result":[] → confidence 0.75, MinorErrorSignal (valid for
some methods like eth_getLogs, retry is benign)
Layer 2 — Block height filter (prevent future requests):
- Add ErrInvalidBlockHeightResult sentinel error for extractors to
signal "response had a result but it was unparseable"
- EVM extractor: returns sentinel when eth_blockNumber gets non-string
result ([], {}, number, boolean) or unparseable hex
- Cosmos extractor: returns sentinel when CometBFT JSON-RPC response
has empty container result ({} or [])
- UpdateFromExtractedData (EVM/Cosmos/Solana): stores block height 0
when InvalidBlockHeight flag is set, causing the endpoint to be
filtered (0 < perceived - sync_allowance)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…infrastructure When a relay fails, retries now exclude ALL endpoints behind the same domain (host), not just the exact endpoint address. This prevents wasting retry budget on different supplier endpoints that share the same broken backend (e.g., multiple suppliers behind rel.spacebelt.xyz all returning empty responses). Also stops retrying entirely when all available endpoints are from tried domains, instead of resetting and cycling through the same broken infrastructure with backoff. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…mains fleet-wide When a domain's infrastructure is broken (e.g., returning empty CometBFT responses), each pod previously discovered this independently on every request, wasting the first relay attempt. Now when any pod discovers a broken domain, it shares this via Redis so all pods skip that domain immediately — including on the first attempt. Only ~1 request per TTL window probes the domain. Standalone DomainCircuitBreaker component (not on ReputationService) avoids touching 6 test mocks. Uses Redis HASH with per-field TTL for cross-pod sharing and a local cache with lazy 5s refresh for zero hot-path Redis calls. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ositive)
CometBFT's "health" method legitimately returns {"result":{}} when the
node is healthy. The heuristic was unconditionally flagging "result":{}
as a broken supplier response at 0.95 confidence, causing every CometBFT
health request to waste a retry attempt across all Cosmos chains.
The fix passes rpcType to analyzeJSONRPC and skips the empty object check
for COMET_BFT, where "result":{} is a valid success response. EVM/Solana
chains still flag "result":{}" since no EVM method returns an empty object.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…paths
CometBFT health method legitimately returns "result":{}, but the
COMET_BFT rpc type gets lost through two fallback paths before reaching
the heuristic analyzer:
1. QoS validator: COMET_BFT → JSON_RPC when not in supportedAPIs
2. Shannon URL fallback: COMET_BFT → JSON_RPC when endpoint lacks URL
Add Payload.OriginalRPCType + EffectiveRPCType() to preserve the
original detected type for heuristic analysis while routing through
the fallback type. All checkResponseSuccess calls now use the
effective type so CometBFT responses are analyzed correctly regardless
of which fallback path was taken.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
…h_getLogs "result":[] is a legitimate response for methods like eth_getLogs, eth_accounts, and eth_getFilterChanges. Retrying was previously considered "benign", but with the domain circuit breaker it now punishes valid responses by circuit-breaking the entire domain. Co-Authored-By: Claude Opus 4.6 <[email protected]>
… suppliers "result":[] was removed in acd7df5 because it false-positived on eth_getLogs and triggered the circuit breaker on valid responses. But most JSON-RPC methods (eth_blockNumber, eth_getBalance, etc.) should never return an array — suppliers doing so are broken. This re-adds the detection with a whitelist of 17 methods where empty arrays are valid (eth_getLogs, trace_filter, debug_getBadBlocks, etc.). The JSON-RPC method name is now threaded from QoS parsing (jsonrpc.BuildPayload) through the Payload struct, checkResponseSuccess, and into heuristic.Analyze. When the method is unknown (empty string), the check is conservatively skipped. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Circuit breaker now uses escalating TTLs (1m→2m→4m→8m→16m→30m cap) for domains that repeatedly break, instead of a flat 60s. High-confidence deceptive response patterns (empty array/object results) are upgraded from MajorErrorSignal (-10) to CriticalErrorSignal (-25) to trigger the strike system with exponential cooldowns. Also extends circuit breaker filtering to websocket endpoint selection and adds Solana method-aware empty array detection. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…alid responses Solana's GetServicePayloads() was not setting JSONRPCMethod on the protocol payload, so the heuristic analyzer received an empty method and could not perform method-aware empty array detection. This caused "result":[] from broken suppliers to be classified as jsonrpc_success instead of jsonrpc_invalid_empty_array for scalar methods like getSlot. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…reset A single success was resetting CriticalStrikes to 0, allowing deceptive suppliers to avoid the cooldown system entirely by interleaving fake successes with critical errors. Now strikes decay by 1 per success, requiring sustained good behavior to fully recover. Co-Authored-By: Claude Opus 4.6 <[email protected]>
WebSocket handler was missing RPC type validation, allowing connections for services that don't have "websocket" in their configured rpc_types. This let suppliers pump unlimited success signals via WebSocket for services like Solana (json_rpc only), bypassing reputation penalties. Co-Authored-By: Claude Opus 4.6 <[email protected]>
RPCTypeValidator was defined on the Gateway struct and checked in both HTTP and WebSocket handlers, but never instantiated — always nil. This meant RPC type validation was silently skipped for all requests, allowing WebSocket connections to services like Solana that only support json_rpc. Co-Authored-By: Claude Opus 4.6 <[email protected]>
At 50+ critical strikes, `1 << exponent` overflows int64, producing zero or negative cooldown durations. This caused suppliers with 100+ strikes to have instantly-expiring cooldowns instead of the intended 1-hour max. Cap exponent at 7 (5min * 2^7 = 640min > 60min max). Co-Authored-By: Claude Opus 4.6 <[email protected]>
…to all protocols
Some suppliers return a static {"jsonrpc":"2.0","id":1,"result":[]} for
ALL requests regardless of protocol type, including REST and CometBFT
paths. This evades current detection because:
- REST analyzeREST() only looks for error indicators
- CometBFT "result":[] check requires jsonrpcMethod which is empty for GET
- Cosmos health checks lack expected_response_contains validation
Changes:
- Add REST protocol mismatch detection (JSON-RPC response to REST request)
- Add CometBFT empty array detection (no CometBFT method returns arrays)
- Register both as deceptive patterns for CriticalErrorSignal + strikes
- Set JSONRPCMethod on Cosmos JSON-RPC payloads (aligns with EVM/Solana)
- Add expected_response_contains to all 15 Cosmos chain health checks
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Sei is a dual chain (Cosmos + EVM) but most suppliers only run EVM nodes. The CometBFT health checks (health, status, syncing) were penalizing EVM-only suppliers that correctly reject unknown methods like "status" with "method does not exist". Switch to eth_blockNumber and eth_chainId which all sei suppliers support. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Spacebelt returns {"result":[]} to eth_blockNumber which passes basic
HTTP 200 check. Adding expected_response_contains: '"0x' catches this
since real EVM responses always contain hex strings.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Some suppliers return HTTP 200 with
{"error":{"code":-32000,"message":"Failed to call fallback API"}} when
they can't serve archival requests. This was leaking through to clients
as a 200 while the same class of error ("missing trie node") from other
suppliers was correctly caught and retried.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Suppliers returning fast canned responses like {"result":[]} pass
transport-level checks (no error, has bytes), win the hedge race, and
cause the correct response from another supplier to be discarded. The
heuristic later catches the bad response but the good one is gone.
Now the hedge layer runs the same heuristic analysis before declaring
a winner. If the first response fails heuristic validation, it waits
for the second response instead of immediately returning the bad one.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
… major The syncing REST health check had reputation_signal: critical_error, causing CriticalErrorSignal (strikes + exponential cooldown) on any health check failure. This was unfairly penalizing legitimate suppliers whose REST endpoint occasionally failed. Changed to major_error for all 14 Cosmos chains. The status JSON-RPC check retains critical_error since its expected_response_contains targets canned responses specifically. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…height The session cache key previously included the current block height, which changes every block (~60s). Since sessions span ~50 blocks, this caused a cache miss on every new block for the same session, making the cache ineffective and overloading full nodes with redundant GetSession gRPC calls. The fix computes sessionStartHeight deterministically from the current height and NumBlocksPerSession (already cached via GetSharedParams), so the cache key remains stable for the entire session window. Falls back to current height if SharedParams is unavailable, preserving previous behavior rather than failing requests. Closes #509
The original formula `h - (h % N)` is 0-based, but Shannon sessions start at block 1: [1,60], [61,120], etc. This caused the last block of each session to get a different cache key, resulting in 2 fetches per session instead of 1. Use Shannon's formula `h - ((h-1) % N)` to align cache keys with actual session boundaries. Updated tests to use NumBlocksPerSession=60 matching production config. Co-Authored-By: Claude Opus 4.6 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
updating to go 1.25
adding optional external rpc validation