Improve Contiguous Data P2P Retrieval
Problem
The gateway's contiguous data retrieval from AR.IO peers and trusted gateways has several gaps that become more important as the network grows:
No per-peer outbound concurrency limiting — There is no cap on concurrent outbound requests to a given peer or trusted gateway. Under load, many concurrent request handlers can all select the same high-weight peer, overwhelming it. This is bad neighbor behavior in a cooperative network and ties up local connections on slow peers.
Weight system blind spot — The peer weight system (ArIOPeerManager) handles quality (which peers are fast/reliable) but not quantity (how much load is placed on any single peer). High-performing peers get higher weights, which means they get selected more often under load — the opposite of what a healthy network wants.
No backpressure from saturated peers — When a peer is slow or overloaded, there's no mechanism to divert traffic to other peers until the per-request timeout fires (10 seconds). Concurrency limits would create natural backpressure: once a peer's slots fill, requests route elsewhere immediately.
Sequential peer attempts with no hedging — ArIODataSource.getData() tries peers one at a time. If the first peer is slow but doesn't fail, the full 10s timeout must elapse before the next peer is tried. Tail latency = retryCount * requestTimeoutMs.
No weight recovery from transient failures — Peers penalized to weight 1 after a brief outage recover only through successful requests. But at weight 1 they're rarely selected (weighted random), creating a negative feedback loop where recovery takes hours.
All failures treated equally — reportFailure() applies a flat -temperatureDelta regardless of cause. A 404 (peer lacks data), 503 (temporarily overloaded), and connection refused (peer down) all get the same penalty, despite very different implications.
Candidate pool tied to retry count — selectPeers(totalRetryCount) means the number of peers considered equals the max retry count (capped at 3). These are different concerns: you may want a larger candidate pool to draw from while still limiting total attempts.
No local peer list cache — ArIOPeerManager fetches the peer list from the AO network process on startup (fire-and-forget, not awaited). If the AO process is unreachable, this.peers stays {} and all P2P retrieval is dead until the next hourly refresh succeeds. A single bad startup can leave the gateway peerless for up to an hour.
No data locality in peer selection — Peer selection for contiguous data is purely weight-based with no consideration of which peers are likely to have the requested data. Two requests for the same ID may route to completely different peers, wasting cache potential. The chunk system has bucket-aware selection (selectBucketPeersForOffset), but contiguous data has no equivalent. This means every request is a cold miss from the peer's perspective unless it happens to have the data from an unrelated request.
Registry metadata discarded — updatePeerList() receives rich per-gateway metadata from the AR.IO registry (composite weight, performance ratio, epoch pass/fail stats, stake, tenure) but discards everything except the URL. Initial peer weights default to 50 for all peers regardless of network-observed quality, ignoring free quality signals.
This is an umbrella issue for iterative improvements to contiguous data P2P retrieval. We'll add items here as we identify them.
Requirements
Must Have
Per-peer outbound concurrency limiter — Shared concurrency limiter that caps concurrent outbound requests to any single peer or trusted gateway
Configurable via PEER_MAX_CONCURRENT_OUTBOUND (default TBD, likely 5-10)
Configurable via TRUSTED_GATEWAY_MAX_CONCURRENT_OUTBOUND (default TBD, likely 10-20, or shared setting)
When a peer's slots are full, skip it and try the next peer (fail-fast, don't queue)
Single shared instance across ArIODataSource and GatewaysDataSource
Lazy per-peer initialization (don't pre-allocate for all known peers)
Hedged requests — Fire parallel requests after a configurable delay to eliminate tail latency from slow peers
Configurable via PEER_HEDGE_DELAY_MS (default: 500ms, 0 = disabled)
Configurable via PEER_MAX_HEDGED_REQUESTS (default: 3, caps concurrent fan-out)
First peer starts immediately; each subsequent peer starts after hedge delay if no success yet
First success aborts all other in-flight requests via AbortController
Hard failures immediately trigger the next peer (no wait for hedge timer)
Failure-differentiated weight adjustment — Replace flat penalty with failure-type-aware multipliers
404 (not found): 1x temperatureDelta; peer may lack data but isn't broken
503 (temporarily overloaded): 0.2x temperatureDelta; transient, minimal long-term penalty
Timeout: 1x temperatureDelta; could be transient or systemic
Connection refused/DNS failure: 3x temperatureDelta; peer likely down
Other 5xx: 2x temperatureDelta; something is wrong with the peer
New PeerFailureType enum and classifyPeerFailure() helper
Local peer list cache — Persist the peer list to disk after each successful fetch; load on startup as fallback when the AO process is unreachable
Cache file at data/peers/ar-io-peers.json (follows existing data/ directory convention)
Write atomically (write to tmp file, rename) to avoid corrupted reads
On startup: attempt network fetch → on failure, load cached list → log warning with cache age
Include a timestamp in the cache file so staleness can be assessed
Optionally configurable max cache age via PEER_LIST_CACHE_MAX_AGE_MS (default: 86400000 = 24h); if cache is older, log a more urgent warning but still use it (stale peers > no peers)
Should Have
Metrics for outbound concurrency: active connections per peer, skip count due to saturation
Log warnings when peers are consistently saturated (indicates too-low limits or too-few peers)
Weight decay toward default — Exponential decay of weights toward default (50) over time
Configurable via PEER_WEIGHT_DECAY_HALF_LIFE_MS (default: 600000 = 10 min)
Track last-update timestamp per peer per category
Apply decay lazily on read (during peer selection), not on a timer
Prevents negative feedback loop: penalized peers gradually recover selection probability
Also decays over-boosted peers, preventing permanently high weights from early luck
Example: weight 1 → ~25.5 after 10 min → ~37.8 after 20 min → ~43.9 after 30 min → ~49.2 after 60 min
Decouple candidate pool from retry count — Select a larger pool of candidate peers, attempt a smaller number
Configurable via PEER_CANDIDATE_COUNT (default: 5-10)
selectPeers(candidateCount) picks from a wider pool; maxAttempts caps actual requests
Enables hedged requests to draw from a deeper bench without increasing max retries
Consistent hash routing for cache locality — Use consistent hashing on data IDs to create stable peer affinity, so repeated requests for the same data route to the same peers and benefit from their cache
Hash each peer to positions on a ring (using URL or wallet address); for a given data ID, select the N closest peers as its "home set"
Try the home set first (ranked by weight within the set), fall back to general weighted selection on miss
Creates emergent cache specialization: each peer naturally accumulates a warm cache for "its" slice of the ID space
Stable mapping — same ID always tries the same peers, strictly better for cache hit rates than random weighted selection
Composes with weight system: home set provides candidate ordering, weights rank within the set
Graceful degradation: if a home-set peer goes down, the ring falls through to the next peer, which builds its own warm cache over time
IDs are SHA-256 hashes so distribution across the ring is inherently uniform
ArNS peer affinity — Track which peers have recently served specific ArNS names successfully, prefer them for subsequent requests to the same name
Small in-memory map: Map<arnsName, { peers: string[], lastUpdated: number }>
On successful ArNS data retrieval, record the peer for that name
On subsequent requests, try recently-successful peers first before falling back to consistent hash routing or general selection
requestAttributes.arnsName is already passed to getData(), so the name is available at selection time
Entries expire after a TTL (e.g., 30 min) to adapt to changing conditions
Targets highest-value requests: ArNS names are the front door for most user traffic (dApp homepages, frequently accessed content)
Bounded size: thousands of names at most, negligible memory
Seed initial weights from registry quality scores — Use compositeWeight / gatewayPerformanceRatio / epoch stats from the AR.IO gateway registry to seed initial per-peer weights instead of defaulting all peers to 50
Store additional gateway metadata alongside the URL in ArIOPeerManager
Map normalizedCompositeWeight (0–1) to the weight range (1–100) for initial weight
Local weight adjustments (success/failure) still apply on top, so the system adapts from an informed baseline
Avoids wasting early requests on peers the network already knows are poor performers
Could Have (Future)
Optional brief queueing before skip (wait N ms for a slot before skipping)
Integration with weight system: factor saturation into peer selection
Adaptive concurrency limits based on peer response characteristics
Latency variance tracking — Track p50/p95 TTFB per peer using circular buffers, factor stability into weight adjustments (stable peers get bonus, volatile peers get penalty)
Technical Notes
Implementation Approach: Per-Peer Concurrency Limiter
New module src/data/outbound-request-limiter.ts — Wraps a Map<string, pLimit> with lazy per-peer initialization. Exposes tryAcquire(peerUrl) to check/reserve a slot and a release function.
Modify ArIODataSource — In the peer retry loop, check if the peer has available slots before requesting. If saturated, skip to next peer.
Modify GatewaysDataSource — Same pattern in the gateway tier loop. Skip saturated gateways, try next in tier.
Wire in system.ts — Create one shared limiter instance, inject into both data sources.
Config in src/config.ts — New env vars for concurrency limits.
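A minimal sketch of the limiter described above, assuming a fail-fast `tryAcquire` that returns a release function. The class name `OutboundRequestLimiter` and the `activeCount` helper are hypothetical, and plain counters stand in for p-limit here so the example stays self-contained.

```typescript
// Hypothetical sketch of src/data/outbound-request-limiter.ts.
// Assumption: plain counters replace p-limit to keep the example dependency-free.
type Release = () => void;

class OutboundRequestLimiter {
  // Active request count per peer URL, created lazily on first use.
  private active = new Map<string, number>();

  constructor(private readonly maxConcurrentPerPeer: number) {}

  // Reserve a slot for peerUrl. Returns a release function, or undefined
  // when the peer is saturated (fail-fast, no queueing).
  tryAcquire(peerUrl: string): Release | undefined {
    const current = this.active.get(peerUrl) ?? 0;
    if (current >= this.maxConcurrentPerPeer) return undefined;
    this.active.set(peerUrl, current + 1);

    let released = false;
    return () => {
      if (released) return; // releasing twice must not free a second slot
      released = true;
      const next = (this.active.get(peerUrl) ?? 1) - 1;
      if (next <= 0) this.active.delete(peerUrl);
      else this.active.set(peerUrl, next);
    };
  }

  // Number of in-flight requests to peerUrl (useful for metrics).
  activeCount(peerUrl: string): number {
    return this.active.get(peerUrl) ?? 0;
  }
}
```

A caller would skip to the next peer when `tryAcquire` returns `undefined`, and call the returned function in a `finally` block once the request settles.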
Implementation Approach: Hedged Requests
Replace the sequential for-loop in ArIODataSource.getData() with a hedged request pattern:
Request timeline (hedge delay = 300ms):
t=0ms → Request to Peer A
t=300ms → Peer A hasn't responded → Request to Peer B (hedge)
t=450ms → Peer B responds first → abort Peer A, return Peer B's data
Key design decisions:
Use AbortController to cancel losing requests (existing abort signal pattern)
Hard failures short-circuit the hedge delay — next peer starts immediately
Cap max concurrent hedged requests to limit fan-out (default 3)
If hedging disabled (PEER_HEDGE_DELAY_MS=0), fall back to existing sequential behavior
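The hedging behavior above could be sketched roughly as follows. `hedgedFetch` and the `PeerFetch` callback type are hypothetical names, and the sketch treats the hedge cap as a limit on total attempts (which also bounds concurrency); the real integration in `ArIODataSource.getData()` may differ.

```typescript
// Hypothetical helper; fetchFromPeer stands in for the real per-peer request.
type PeerFetch<T> = (peer: string, signal: AbortSignal) => Promise<T>;

async function hedgedFetch<T>(
  peers: string[],
  fetchFromPeer: PeerFetch<T>,
  hedgeDelayMs: number,
  maxHedged: number,
): Promise<T> {
  const candidates = peers.slice(0, maxHedged);
  if (candidates.length === 0) throw new Error('No peers available');
  const controllers: AbortController[] = [];

  return new Promise<T>((resolve, reject) => {
    let nextIndex = 0;
    let failures = 0;
    let settled = false;
    let timer: ReturnType<typeof setTimeout> | undefined;

    const finish = (fn: () => void) => {
      if (settled) return;
      settled = true;
      if (timer !== undefined) clearTimeout(timer);
      controllers.forEach((c) => c.abort()); // cancel the losing requests
      fn();
    };

    const startNext = () => {
      if (timer !== undefined) clearTimeout(timer);
      timer = undefined;
      if (settled || nextIndex >= candidates.length) return;
      const peer = candidates[nextIndex++];
      const controller = new AbortController();
      controllers.push(controller);
      fetchFromPeer(peer, controller.signal).then(
        (value) => finish(() => resolve(value)),
        () => {
          failures++;
          if (failures === candidates.length) {
            finish(() => reject(new Error('All hedged requests failed')));
          } else {
            startNext(); // hard failure: do not wait for the hedge timer
          }
        },
      );
      // Schedule the hedge for the following peer; a delay of 0 disables
      // hedging, so later peers start only after a failure (sequential).
      if (hedgeDelayMs > 0 && nextIndex < candidates.length) {
        timer = setTimeout(startNext, hedgeDelayMs);
      }
    };

    startNext();
  });
}
```

Because every request receives its own `AbortSignal`, the per-peer fetch is responsible for honoring the abort, for example by passing the signal through to the HTTP client.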
Implementation Approach: Failure Differentiation
Add a classifyPeerFailure(error) helper that maps errors to failure types:
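One possible shape for the classifier, as a sketch only. `PeerFailureType` and `classifyPeerFailure()` are named in this issue; the enum members, the `PeerErrorLike` error shape, the `failurePenalty` helper, and the multiplier for unclassified errors are assumptions. The other multipliers follow the requirements list.

```typescript
enum PeerFailureType {
  NotFound = 'not_found',
  Overloaded = 'overloaded',
  Timeout = 'timeout',
  ConnectionError = 'connection_error',
  ServerError = 'server_error',
  Unknown = 'unknown',
}

// Multipliers applied to temperatureDelta.
const FAILURE_MULTIPLIERS: Record<PeerFailureType, number> = {
  [PeerFailureType.NotFound]: 1,
  [PeerFailureType.Overloaded]: 0.2,
  [PeerFailureType.Timeout]: 1,
  [PeerFailureType.ConnectionError]: 3,
  [PeerFailureType.ServerError]: 2,
  [PeerFailureType.Unknown]: 1, // assumption: unclassified errors keep the flat penalty
};

// Assumed error shape: an HTTP status and/or a Node.js-style error code.
interface PeerErrorLike {
  status?: number;
  code?: string;
}

function classifyPeerFailure(error: PeerErrorLike): PeerFailureType {
  if (error.status === 404) return PeerFailureType.NotFound;
  if (error.status === 503) return PeerFailureType.Overloaded;
  if (error.status !== undefined && error.status >= 500) {
    return PeerFailureType.ServerError;
  }
  switch (error.code) {
    case 'ETIMEDOUT':
    case 'ECONNABORTED':
      return PeerFailureType.Timeout;
    case 'ECONNREFUSED':
    case 'ENOTFOUND':
    case 'EAI_AGAIN':
      return PeerFailureType.ConnectionError;
    default:
      return PeerFailureType.Unknown;
  }
}

// Penalty to subtract from a peer's weight for a given failure.
function failurePenalty(error: PeerErrorLike, temperatureDelta: number): number {
  return FAILURE_MULTIPLIERS[classifyPeerFailure(error)] * temperatureDelta;
}
```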
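Modify reportFailure() in ArIOPeerManager to accept the failure type and apply a multiplier to the temperature delta.
Implementation Approach: Weight Decay
Apply decay lazily in _selectPeersUncached() before building the weighted table. Track timestamps alongside weights in a parallel Map<WeightCategory, Map<string, number>>. Update timestamps on every reportSuccess/reportFailure call.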
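The lazy decay can be expressed as a pure function of the stored weight and its last-update timestamp. This is a sketch assuming exponential decay toward the default weight of 50; the function name `decayWeight` and its parameter names are hypothetical.

```typescript
const DEFAULT_WEIGHT = 50; // default weight stated in this issue

// Exponentially decay `weight` toward DEFAULT_WEIGHT based on elapsed time.
function decayWeight(
  weight: number,
  lastUpdatedMs: number,
  nowMs: number,
  halfLifeMs: number,
): number {
  if (halfLifeMs <= 0) return weight; // assumption: non-positive half-life disables decay
  const elapsedMs = Math.max(0, nowMs - lastUpdatedMs);
  // Fraction of the distance from the default that still remains.
  const remaining = Math.pow(0.5, elapsedMs / halfLifeMs);
  return DEFAULT_WEIGHT + (weight - DEFAULT_WEIGHT) * remaining;
}

// With a 10 minute half-life, a peer penalized to weight 1 recovers as:
const halfLife = 600_000;
console.log(decayWeight(1, 0, halfLife, halfLife)); // 25.5
console.log(decayWeight(1, 0, 2 * halfLife, halfLife)); // 37.75
```

The same function pulls over-boosted peers back down: a weight of 100 decays to 75 after one half-life.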
Implementation Approach: Local Peer List Cache
Persist the peer list to disk after each successful updatePeerList() call. On startup, if the network fetch fails, load the cached list as a fallback.
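Cache file format (data/peers/ar-io-peers.json):
{
  "updatedAt": 1707753600000,
  "peers": {
    "walletAddress1": "https://peer1.example.com",
    "walletAddress2": "https://peer2.example.com"
  }
}
Key changes to ArIOPeerManager:
Constructor: Accept an optional peerCachePath parameter. On initialization (when initialPeers is not provided), attempt a network fetch, then fall back to the cache:
private async loadCachedPeerList(): Promise<void> {
  const data = JSON.parse(await fs.readFile(this.peerCachePath, 'utf-8'));
  const age = Date.now() - data.updatedAt;
  if (age > this.peerCacheMaxAgeMs) {
    // Stale peers are still better than no peers, so warn and continue
    log.warn('Using stale peer list cache', { ageMs: age });
  } else {
    log.info('Loaded peer list from local cache', { ageMs: age, count: Object.keys(data.peers).length });
  }
  this.peers = data.peers;
  // Initialize category weights for cached peers
}
updatePeerList(): Write the cache atomically (write to a temporary file, then rename it over the cache file).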
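A minimal sketch of the atomic cache write, assuming Node's `fs/promises` API and the cache file format with an `updatedAt` timestamp. The function name `writePeerListCache` and the temporary-file naming are hypothetical.

```typescript
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

interface PeerListCache {
  updatedAt: number; // epoch milliseconds, used to assess staleness on load
  peers: Record<string, string>; // wallet address → peer URL
}

// Write the peer list to a temp file, then rename it over the cache file so
// readers never observe a partially written cache.
async function writePeerListCache(
  cachePath: string,
  peers: Record<string, string>,
  now: number = Date.now(),
): Promise<void> {
  const payload: PeerListCache = { updatedAt: now, peers };
  await fs.mkdir(path.dirname(cachePath), { recursive: true });
  // Keep the temp file in the same directory so the rename stays on one filesystem.
  const tmpPath = `${cachePath}.${process.pid}.tmp`;
  await fs.writeFile(tmpPath, JSON.stringify(payload), 'utf-8');
  await fs.rename(tmpPath, cachePath);
}
```

A failed cache write should probably be logged rather than thrown, since the in-memory peer list is still valid at that point.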
Implementation Approach: Consistent Hash Routing
Use a consistent hash ring to map data IDs to a stable "home set" of peers. This creates cache locality without protocol changes or network coordination.
Data structure:
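import { createHash } from 'node:crypto';

// Assumption: ring position = first 4 bytes of the key's SHA-256 digest
function hashToPosition(key: string): number {
  return createHash('sha256').update(key).digest().readUInt32BE(0);
}

class PeerHashRing {
  // Virtual nodes for even distribution (e.g., 50 vnodes per peer)
  private ring: Map<number, string> = new Map(); // hash position → peer URL
  private sortedPositions: number[] = [];
  private peerUrls: Set<string> = new Set();

  constructor(private vnodeCount = 50) {}

  addPeer(peerUrl: string): void {
    this.peerUrls.add(peerUrl);
    for (let i = 0; i < this.vnodeCount; i++) {
      this.ring.set(hashToPosition(`${peerUrl}:${i}`), peerUrl);
    }
    this.sortedPositions = [...this.ring.keys()].sort((a, b) => a - b);
  }

  getHomeSet(dataId: string, count: number): string[] {
    const peers: string[] = [];
    const total = this.sortedPositions.length;
    const wanted = Math.min(count, this.peerUrls.size);
    // Walk clockwise from the data ID's position, collecting unique peers
    let idx = this.lowerBound(hashToPosition(dataId));
    for (let step = 0; step < total && peers.length < wanted; step++, idx++) {
      const peer = this.ring.get(this.sortedPositions[idx % total]);
      if (peer !== undefined && !peers.includes(peer)) peers.push(peer);
    }
    return peers;
  }

  // Index of the first ring position >= target (may equal length; callers wrap)
  private lowerBound(target: number): number {
    let lo = 0;
    let hi = this.sortedPositions.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.sortedPositions[mid] < target) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }
}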
Integration with ArIODataSource.getData():
Peer selection for data ID X:
1. homeSet = hashRing.getHomeSet(X, homeSetSize) // stable candidates
2. ranked = rankByWeight(homeSet) // quality ordering
3. filtered = filterByConcurrencySlots(ranked) // skip saturated
4. fallback = generalWeightedSelection(excludeHomeSet) // if home set exhausted
5. candidates = [...filtered, ...fallback]
Ring maintenance:
Rebuild ring on updatePeerList() (peers join/leave)
Ring is read-heavy, write-infrequent — simple sorted array is sufficient
Virtual nodes (50 per peer) ensure even distribution despite varying peer count
Implementation Approach: ArNS Peer Affinity
A lightweight learned routing layer specifically for ArNS name resolution.
interface ArNSPeerAffinity {
  peers: string[]; // recently successful peers, most recent first
  lastUpdated: number;
}

class ArNSAffinityCache {
  private cache: Map<string, ArNSPeerAffinity> = new Map();
  private maxPeersPerName = 3;
  private ttlMs = 1_800_000; // 30 minutes

  recordSuccess(arnsName: string, peer: string): void {
    const entry = this.cache.get(arnsName) ?? { peers: [], lastUpdated: 0 };
    // Move peer to front, dedup, cap size
    entry.peers = [peer, ...entry.peers.filter((p) => p !== peer)].slice(0, this.maxPeersPerName);
    entry.lastUpdated = Date.now();
    this.cache.set(arnsName, entry);
  }

  getAffinityPeers(arnsName: string): string[] {
    const entry = this.cache.get(arnsName);
    if (!entry || Date.now() - entry.lastUpdated > this.ttlMs) return [];
    return entry.peers;
  }
}
The ArNS layer is a targeted optimization on top of consistent hashing. It handles the common case (popular names requested repeatedly) while consistent hashing handles the general case (all data IDs).
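The layering of ArNS affinity over consistent hashing and general selection might be combined as in the sketch below. `orderCandidates` is a hypothetical helper: it places affinity peers first, then the home set, then the general weighted fallback, dropping duplicates along the way.

```typescript
// Hypothetical helper: merge candidate sources in priority order without duplicates.
function orderCandidates(
  affinityPeers: string[], // recently successful peers for the ArNS name
  homeSet: string[], // consistent-hash home set for the data ID
  fallback: string[], // general weighted selection
  maxCandidates: number,
): string[] {
  const seen = new Set<string>();
  const ordered: string[] = [];
  for (const peer of [...affinityPeers, ...homeSet, ...fallback]) {
    if (ordered.length >= maxCandidates) break;
    if (seen.has(peer)) continue;
    seen.add(peer);
    ordered.push(peer);
  }
  return ordered;
}
```

For requests without an ArNS name, passing an empty `affinityPeers` array reduces this to the home-set-then-fallback ordering.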
Implementation Approach: Seed Weights from Registry
Retain gateway metadata during updatePeerList() and use it for initial weight seeding:
// In updatePeerList(), instead of just storing URL:
for (const gateway of items) {
  peers[gateway.gatewayAddress] = {
    url: `${gateway.settings.protocol}://${gateway.settings.fqdn}`,
    compositeWeight: gateway.weights?.normalizedCompositeWeight,
    performanceRatio: gateway.weights?.gatewayPerformanceRatio,
    passRate: gateway.stats.prescribedEpochCount > 0
      ? gateway.stats.passedEpochCount / gateway.stats.prescribedEpochCount
      : undefined,
  };
}

// When initializing weights for a new peer:
const registryWeight = peerMetadata.compositeWeight ?? peerMetadata.performanceRatio;
const initialWeight = registryWeight !== undefined
  ? Math.max(MIN_WEIGHT, Math.round(registryWeight * MAX_WEIGHT))
  : DEFAULT_WEIGHT;
This requires changing the peer storage from Record<string, string> to Record<string, PeerMetadata>, which affects getPeers(), getPeerUrls(), and callers. The URL-only interface can be preserved with accessor methods.
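As a rough illustration of the accessor idea, the sketch below keeps a URL-only view on top of the richer record. The `PeerMetadata` field names mirror the snippet above; the free functions `getPeerUrls` and `seedWeight` and the constant values are assumptions for illustration (the weight range of 1 to 100 and default of 50 come from this issue).

```typescript
interface PeerMetadata {
  url: string;
  compositeWeight?: number; // normalizedCompositeWeight, 0 to 1
  performanceRatio?: number;
  passRate?: number;
}

const MIN_WEIGHT = 1;
const MAX_WEIGHT = 100;
const DEFAULT_WEIGHT = 50;

// URL-only view so existing callers that expect Record<string, string> keep working.
function getPeerUrls(peers: Record<string, PeerMetadata>): Record<string, string> {
  const urls: Record<string, string> = {};
  for (const [address, meta] of Object.entries(peers)) {
    urls[address] = meta.url;
  }
  return urls;
}

// Initial weight from registry metadata, falling back to the default.
function seedWeight(meta: PeerMetadata): number {
  const registryWeight = meta.compositeWeight ?? meta.performanceRatio;
  return registryWeight !== undefined
    ? Math.max(MIN_WEIGHT, Math.round(registryWeight * MAX_WEIGHT))
    : DEFAULT_WEIGHT;
}
```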
Existing Patterns to Follow
p-limit (v6.2.0) already in dependencies, used in CompositeChunkDataSource and ArweaveCompositeClient
ArIOChunkSource uses parallel peer requests with concurrency control
RebroadcastingChunkSource uses p-limit for outbound concurrency + limiter TokenBucket for rate limiting
CHUNK_REBROADCAST_MAX_CONCURRENT
AbortSignal.any() and AbortSignal.timeout() used throughout for combined signal handling
Strategic Fit
These improvements create a multi-signal peer selection system:
Implementation Order
Key Files
src/data/outbound-request-limiter.ts
src/data/ar-io-data-source.ts
src/data/gateways-data-source.ts
src/peers/ar-io-peer-manager.ts
src/data/peer-hash-ring.ts
src/data/arns-affinity-cache.ts
src/system.ts
src/config.ts
src/data/ar-io-data-source.test.ts
src/data/gateways-data-source.test.ts
src/peers/ar-io-peer-manager.test.ts
src/data/peer-hash-ring.test.ts
src/data/arns-affinity-cache.test.ts
Related