Skip to content

fix(chaintracker): retry previous slots on Solana block-not-available…#2254

Open
NadavLevi wants to merge 1 commit intomainfrom
fix/solana-block-not-available-retry
Open

fix(chaintracker): retry previous slots on Solana block-not-available…#2254
NadavLevi wants to merge 1 commit intomainfrom
fix/solana-block-not-available-retry

Conversation

@NadavLevi
Copy link
Copy Markdown
Collaborator

@NadavLevi NadavLevi commented Mar 29, 2026

User description

… (-32004)

Solana slots can be skipped or not yet propagated, causing -32004 errors during block hash fetching. This adds a two-phase retry mechanism for Solana-family chains in both provider and consumer chain fetchers: first, retry the same slot up to 2 times with a 150ms delay (propagation delays); then, fall back to up to 3 previous slots (permanently skipped slots).

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage


Generated description

Below is a concise technical summary of the changes proposed in this PR:
Add a Solana-specific two-phase retry around block hash fetchers so that ChainFetcher.FetchBlockHashByNum and the router’s EndpointChainFetcher fall back to the same slot or recent preceding slots when nodes surface -32004 errors. Tighten endpoint health tracking by making MarkUnhealthy/ResetHealth report state transitions and only emit metrics when an endpoint actually changes enabled state.

TopicDetails
Solana fetch retry Implement Solana-specific retries via FetchBlockHashWithSolanaRetry, add error classification/tests, and reroute slot fetches through the helper to cover both ChainFetcher and router-level fetchers.
Modified files (7)
  • protocol/chainlib/chain_fetcher.go
  • protocol/chainlib/svm_chain_tracker.go
  • protocol/chainlib/svm_chain_tracker_test.go
  • protocol/common/errors.go
  • protocol/common/svm.go
  • protocol/rpcsmartrouter/endpoint_chain_fetcher.go
  • protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go
Latest Contributors(2)
UserCommitDate
[email protected]perf(protocol): optimi...February 01, 2026
NadavLevifix: cap string len on...January 15, 2026
Endpoint health Refine endpoint health tracking by returning booleans from MarkUnhealthy/ResetHealth and only updating smart-router metrics when the enabled state actually flips.
Modified files (2)
  • protocol/lavasession/consumer_types.go
  • protocol/rpcsmartrouter/rpcsmartrouter_server.go
Latest Contributors(2)
UserCommitDate
NadavLevifeat(metrics): align p...March 25, 2026
[email protected]fix: make relay retry ...March 01, 2026
This pull request is reviewed by Baz. Review like a pro on (Baz).

@qodo-code-review
Copy link
Copy Markdown

Review Summary by Qodo

Add Solana block-not-available retry mechanism for skipped slots

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Add retry mechanism for Solana block-not-available (-32004) errors
  - Retries up to 5 previous slots when endpoint returns -32004
  - Handles skipped slots and propagation delays on Solana/SVM chains
• Implement block-not-available detection in both provider and consumer fetchers
  - Extract response data for error inspection in chain_fetcher.go
  - Add JSON-RPC error code parsing utility function
• Add comprehensive test coverage for error detection and chain identification
• Enhance logging for endpoint health tracking and retry attempts
Diagram
flowchart LR
  A["Fetch Block Hash Request"] --> B{"Is Solana Family?"}
  B -->|Yes| C["Try Current Block"]
  B -->|No| D["Single Attempt"]
  C --> E{"Success?"}
  E -->|Yes| F["Return Hash"]
  E -->|No| G{"Block Not Available -32004?"}
  G -->|Yes| H["Retry Previous Slot<br/>Max 5 Retries"]
  G -->|No| I["Return Error"]
  H --> E
  D --> E
  F --> J["Result"]
  I --> J
Loading

Grey Divider

File Changes

1. protocol/chainlib/chain_fetcher.go 🐞 Bug fix +101/-7

Implement Solana block retry with error detection

• Add constants for Solana block-not-available error code (-32004) and max retry count (5)
• Refactor FetchBlockHashByNum to implement retry loop for Solana-family chains
• Extract single block fetch logic into new fetchSingleBlockHashByNum helper method
• Add isBlockNotAvailableError function to detect -32004 errors from JSON-RPC responses
• Add isSolanaFamily function to identify Solana/SVM chains (SOLANA, SOLANAT, KOII, KOIIT)
• Update error returns to include response data for error inspection

protocol/chainlib/chain_fetcher.go


2. protocol/rpcsmartrouter/endpoint_chain_fetcher.go 🐞 Bug fix +124/-12

Add Solana block retry logic to endpoint chain fetcher

• Add constants for Solana block-not-available error code and max retry count
• Refactor FetchBlockHashByNum with retry loop for Solana-family chains
• Extract single block fetch logic into new fetchSingleBlockHash helper method
• Add isBlockNotAvailableError function to parse JSON-RPC error responses
• Add isSolanaFamily function for chain identification
• Enhance logging with endpoint URL and attempt tracking
• Update method signatures to return response data for error inspection

protocol/rpcsmartrouter/endpoint_chain_fetcher.go


3. protocol/lavasession/consumer_types.go ✨ Enhancement +7/-0

Enhance endpoint health tracking logging

• Add detailed logging to MarkUnhealthy method with endpoint address, refusal count, and threshold
• Log whether endpoint will be disabled based on connection attempt threshold
• Include direct RPC flag in warning output for better debugging

protocol/lavasession/consumer_types.go


View more (1)
4. protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go 🧪 Tests +99/-0

Add unit tests for error detection and chain identification

• Add comprehensive test suite for isBlockNotAvailableError function with 10 test cases
• Test Solana -32004 error detection, other error codes, success responses, and edge cases
• Add test suite for isSolanaFamily function covering all supported chains and non-Solana chains
• Verify case-sensitivity and empty chain ID handling

protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go


Grey Divider

Qodo Logo

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review bot commented Mar 29, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (2) 📎 Requirement gaps (0)

Grey Divider


Action required

1. LavaFormatWarning missing gofmt spacing 📘 Rule violation ⚙ Maintainability
Description
The new MarkUnhealthy log call uses "MarkUnhealthy called",nil, (missing space after the comma),
which indicates the file is not gofmt/goimports clean. This violates the repo requirement to keep Go
code formatted consistently.
Code

protocol/lavasession/consumer_types.go[R237-243]

+	utils.LavaFormatWarning("MarkUnhealthy called",nil,
+		utils.LogAttr("endpoint", e.NetworkAddress),
+		utils.LogAttr("refusals", e.ConnectionRefusals),
+		utils.LogAttr("threshold", MaxConsecutiveConnectionAttempts),
+		utils.LogAttr("is_direct_rpc", e.IsDirectRPC()),
+		utils.LogAttr("will_disable", e.ConnectionRefusals >= MaxConsecutiveConnectionAttempts),
+	)
Evidence
PR Compliance ID 1 requires Go files to be gofmt/goimports formatted; the added call shows non-gofmt
formatting (missing space after comma) on the modified lines.

AGENTS.md
protocol/lavasession/consumer_types.go[234-243]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The added `utils.LavaFormatWarning` call in `MarkUnhealthy()` is not gofmt-formatted (e.g., `"MarkUnhealthy called",nil,` should be `"MarkUnhealthy called", nil,`).

## Issue Context
Repo compliance requires Go code to be gofmt/goimports clean.

## Fix Focus Areas
- protocol/lavasession/consumer_types.go[234-243]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Block zero skipped 🐞 Bug ✓ Correctness
Description
FetchBlockHashByNum breaks out when currentBlock <= 0, so a request for block/slot 0 is never
attempted and instead returns a generic “retries exhausted” error. ChainTracker can compute
blockNumToFetch := latestBlock - idx (reaching 0 when blocksToSave > latestBlock), so this can fail
bootstrap/short-height scenarios and makes Solana slot 0 unreachable.
Code

protocol/chainlib/chain_fetcher.go[R488-492]

+	for attempt := 0; attempt < maxAttempts; attempt++ {
+		if currentBlock <= 0 {
+			break
+		}
+
Evidence
Both fetchers abort before attempting block/slot 0. ChainTracker’s backwards scan can legitimately
request 0 (latestBlock-idx), and Solana’s GET_BLOCK_BY_NUM uses getBlock(slot), where slot 0 is a
valid value.

protocol/chainlib/chain_fetcher.go[467-527]
protocol/rpcsmartrouter/endpoint_chain_fetcher.go[165-221]
protocol/chaintracker/chain_tracker.go[254-273]
specs/mainnet-1/specs/solana.json[980-1004]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`FetchBlockHashByNum` stops retries when `currentBlock <= 0`, which prevents fetching slot/block `0` entirely and can return a misleading “retries exhausted” error without ever attempting a request.

### Issue Context
This affects both provider-side (`chainlib.ChainFetcher`) and consumer/direct-RPC (`rpcsmartrouter.EndpointChainFetcher`) implementations. ChainTracker can request block 0 during backwards reads (`latestBlock - idx`).

### Fix Focus Areas
- Change the guard to allow `0` (e.g., `currentBlock < 0`), and explicitly handle negative `blockNum` as invalid.
- Ensure the “exhausted” path returns/wraps the last real error when available (avoid returning an error that drops the cause).

### Fix Focus Areas (code)
- protocol/chainlib/chain_fetcher.go[479-528]
- protocol/rpcsmartrouter/endpoint_chain_fetcher.go[165-222]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Endpoint health data race 🐞 Bug ⛯ Reliability
Description
Endpoint.MarkUnhealthy mutates ConnectionRefusals and Enabled without taking endpoint.mu even though
the struct documents mu protects these fields, and it is invoked from concurrent request paths. The
newly added unconditional LavaFormatWarning in MarkUnhealthy will also emit warning logs (and
allocate an error) on every call, amplifying impact under failure bursts.
Code

protocol/lavasession/consumer_types.go[R235-246]

func (e *Endpoint) MarkUnhealthy() {
	e.ConnectionRefusals++
+	utils.LavaFormatWarning("MarkUnhealthy called",nil,
+		utils.LogAttr("endpoint", e.NetworkAddress),
+		utils.LogAttr("refusals", e.ConnectionRefusals),
+		utils.LogAttr("threshold", MaxConsecutiveConnectionAttempts),
+		utils.LogAttr("is_direct_rpc", e.IsDirectRPC()),
+		utils.LogAttr("will_disable", e.ConnectionRefusals >= MaxConsecutiveConnectionAttempts),
+	)
	if e.ConnectionRefusals >= MaxConsecutiveConnectionAttempts {
		e.Enabled = false
		utils.LavaFormatWarning("disabled unhealthy endpoint", nil,
Evidence
Endpoint declares mu protects ConnectionRefusals/Enabled, and other code paths use
endpoint.mu.Lock() when updating these fields. Smart router request handling calls MarkUnhealthy
during failures, so concurrent invocations are expected. LavaFormatWarning constructs/returns an
error even when the caller ignores it, adding overhead for every MarkUnhealthy call.

protocol/lavasession/consumer_types.go[188-252]
protocol/lavasession/consumer_types.go[703-810]
protocol/rpcsmartrouter/rpcsmartrouter_server.go[1838-1844]
utils/lavalog.go[423-460]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`Endpoint.MarkUnhealthy` updates `ConnectionRefusals` and `Enabled` without synchronization, despite `Endpoint.mu` being documented as the guard for these fields. It can be called concurrently from request-handling code, causing data races and inconsistent endpoint disabling.

### Issue Context
This PR also adds an unconditional `LavaFormatWarning("MarkUnhealthy called", ...)` inside `MarkUnhealthy`, which increases log volume and per-call allocations during error bursts.

### Fix Focus Areas
- Add `e.mu.Lock()`/`e.mu.Unlock()` around `ConnectionRefusals`/`Enabled` mutations and any reads used for log fields.
- Consider reducing log volume (e.g., log only when the endpoint transitions to disabled, or when refusals crosses the threshold).

### Fix Focus Areas (code)
- protocol/lavasession/consumer_types.go[234-252]
- protocol/rpcsmartrouter/rpcsmartrouter_server.go[1838-1861]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

4. Tests not TestX_Y named 📘 Rule violation ⚙ Maintainability
Description
New tests are named TestIsBlockNotAvailableError and TestIsSolanaFamily, which do not follow the
required TestComponent_Scenario naming convention. This reduces consistency and makes test intent
harder to scan in large suites.
Code

protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go[R10-76]

+func TestIsBlockNotAvailableError(t *testing.T) {
+	tests := []struct {
+		name     string
+		response []byte
+		expected bool
+	}{
+		{
+			name:     "Solana -32004 block not available",
+			response: []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32004,"message":"Block not available for slot 12345"}}`),
+			expected: true,
+		},
+		{
+			name:     "Solana -32004 minimal error",
+			response: []byte(`{"error":{"code":-32004}}`),
+			expected: true,
+		},
+		{
+			name:     "different error code -32001",
+			response: []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32001,"message":"Block cleaned up"}}`),
+			expected: false,
+		},
+		{
+			name:     "different error code -32007",
+			response: []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32007,"message":"Slot skipped"}}`),
+			expected: false,
+		},
+		{
+			name:     "success response with result",
+			response: []byte(`{"jsonrpc":"2.0","id":1,"result":{"blockhash":"abc123","slot":100}}`),
+			expected: false,
+		},
+		{
+			name:     "empty response",
+			response: []byte{},
+			expected: false,
+		},
+		{
+			name:     "nil response",
+			response: nil,
+			expected: false,
+		},
+		{
+			name:     "invalid JSON",
+			response: []byte(`not json`),
+			expected: false,
+		},
+		{
+			name:     "no error field",
+			response: []byte(`{"jsonrpc":"2.0","id":1}`),
+			expected: false,
+		},
+		{
+			name:     "EVM -32004 method not supported (same code different chain)",
+			response: []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32004,"message":"Method not supported"}}`),
+			expected: true, // The function only checks the code, chain filtering is done by the caller
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			result := isBlockNotAvailableError(tt.response)
+			assert.Equal(t, tt.expected, result)
+		})
+	}
+}
+
+func TestIsSolanaFamily(t *testing.T) {
Evidence
PR Compliance ID 5 requires test functions to be named TestComponent_Scenario; the newly added
test function names omit the underscore-based convention.

AGENTS.md
protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go[10-99]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Test function names do not follow the required `TestComponent_Scenario` pattern.

## Issue Context
The repo standard expects underscore-separated test names for consistency.

## Fix Focus Areas
- protocol/rpcsmartrouter/endpoint_chain_fetcher_test.go[10-99]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Duplicated Solana retry helpers 🐞 Bug ⚙ Maintainability
Description
Solana-family detection and block-not-available retry constants/helpers are duplicated in chainlib
and rpcsmartrouter, while ChainTracker also hard-codes the same Solana-family list. This can drift
over time and create inconsistent behavior depending on which fetcher/tracker path is used.
Code

protocol/rpcsmartrouter/endpoint_chain_fetcher.go[R19-28]

+const (
+	// maxBlockNotAvailableRetries is the maximum number of previous slots to try
+	// when a Solana endpoint returns -32004 ("Block not available for slot X").
+	// This handles both propagation delays and skipped slots.
+	maxBlockNotAvailableRetries = 5
+
+	// solanaBlockNotAvailableCode is the Solana JSON-RPC error code for
+	// "Block not available for slot X".
+	solanaBlockNotAvailableCode = -32004
+)
Evidence
The same Solana-family list and constants appear in multiple packages (two fetchers plus
ChainTracker’s switch). Without a shared source of truth, future edits can easily update one list
and miss others.

protocol/chainlib/chain_fetcher.go[25-36]
protocol/chainlib/chain_fetcher.go[749-774]
protocol/rpcsmartrouter/endpoint_chain_fetcher.go[19-28]
protocol/rpcsmartrouter/endpoint_chain_fetcher.go[285-301]
protocol/rpcsmartrouter/endpoint_chain_fetcher.go[377-387]
protocol/chaintracker/chain_tracker.go[669-673]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Solana retry constants and helpers (`isSolanaFamily`, `isBlockNotAvailableError`, retry counts) are duplicated across packages, and the Solana-family list is also duplicated in ChainTracker selection logic. This increases the chance of divergent behavior as chains are added/renamed.

### Issue Context
Current code is consistent, but future updates (e.g., adding a new Solana/SVM chain ID) must be made in 3 places.

### Fix Focus Areas
- Extract Solana-family detection + retry constants into a shared package/module and reuse from both fetchers and ChainTracker.

### Fix Focus Areas (code)
- protocol/chainlib/chain_fetcher.go[25-36]
- protocol/chainlib/chain_fetcher.go[749-774]
- protocol/rpcsmartrouter/endpoint_chain_fetcher.go[19-28]
- protocol/rpcsmartrouter/endpoint_chain_fetcher.go[285-301]
- protocol/rpcsmartrouter/endpoint_chain_fetcher.go[377-387]
- protocol/chaintracker/chain_tracker.go[669-673]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 29, 2026

Codecov Report

❌ Patch coverage is 28.87324% with 101 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/rpcsmartrouter/endpoint_chain_fetcher.go 0.00% 41 Missing ⚠️
protocol/chainlib/chain_fetcher.go 22.22% 27 Missing and 1 partial ⚠️
protocol/lavasession/consumer_types.go 0.00% 20 Missing ⚠️
protocol/common/svm.go 0.00% 6 Missing ⚠️
protocol/rpcsmartrouter/rpcsmartrouter_server.go 0.00% 4 Missing ⚠️
protocol/chainlib/svm_chain_tracker.go 94.28% 1 Missing and 1 partial ⚠️
Flag Coverage Δ
consensus 8.70% <0.00%> (-0.01%) ⬇️
protocol 33.96% <28.87%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/common/errors.go 100.00% <ø> (ø)
protocol/chainlib/svm_chain_tracker.go 94.28% <94.28%> (ø)
protocol/rpcsmartrouter/rpcsmartrouter_server.go 13.53% <0.00%> (+0.03%) ⬆️
protocol/common/svm.go 0.00% <0.00%> (ø)
protocol/lavasession/consumer_types.go 72.72% <0.00%> (-2.04%) ⬇️
protocol/chainlib/chain_fetcher.go 23.77% <22.22%> (-0.43%) ⬇️
protocol/rpcsmartrouter/endpoint_chain_fetcher.go 0.00% <0.00%> (ø)

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 29, 2026

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit eecc870. ± Comparison against base commit 6b5b690.

♻️ This comment has been updated with latest results.

@NadavLevi NadavLevi force-pushed the fix/solana-block-not-available-retry branch from ce72aac to 008f163 Compare March 29, 2026 12:06
@NadavLevi NadavLevi force-pushed the fix/solana-block-not-available-retry branch from 008f163 to 8f057c7 Compare March 29, 2026 15:14
@NadavLevi NadavLevi force-pushed the fix/solana-block-not-available-retry branch from 8f057c7 to 527933a Compare March 29, 2026 16:05
@NadavLevi NadavLevi force-pushed the fix/solana-block-not-available-retry branch from 527933a to 3808812 Compare March 29, 2026 16:30
@NadavLevi NadavLevi force-pushed the fix/solana-block-not-available-retry branch from 3808812 to aca549b Compare March 30, 2026 09:57
@NadavLevi NadavLevi force-pushed the fix/solana-block-not-available-retry branch from aca549b to bf8de2b Compare March 31, 2026 09:52
…2004)

Solana slots can be skipped or not yet propagated, causing -32004 errors
during block hash fetching. This adds a two-phase retry mechanism for
Solana-family chains in both provider and consumer chain fetchers: first,
retry the same slot up to 2 times with a 150ms delay (propagation delays);
then, fall back to up to 3 previous slots (permanently skipped slots).
@NadavLevi NadavLevi force-pushed the fix/solana-block-not-available-retry branch from bf8de2b to eecc870 Compare March 31, 2026 13:02
@NadavLevi NadavLevi requested a review from nimrod-teich March 31, 2026 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant