Port bandwidth leak prevention: backend degradation detection & gating by full-bars · Pull Request #182 · urnetwork/connect

full-bars · 2026-05-30T04:29:35Z

Bandwidth Leak Prevention: Backend Degradation Detection & Gating

I've ported the proven backend degradation detection and contract gating mechanisms
from my v3.23-fix implementation
to upstream. This addresses the data leak during control API outages (issues #175, #181).

The Problem

When the control API is unreachable, providers continue attempting contract creation
and queueing transfers against a dead API. Without throttling, this becomes a massive
retry storm: thousands of route attempts per second, each spawning goroutines and
queuing transfers that will never complete because clients can't authorize.

The provider essentially screams at an API that isn't even listening, blasting the
network with data that has no chance of reaching a client. On metered bandwidth
connections (common for proxy providers), a sustained outage can consume an entire
month's bandwidth allocation in just a few hours. For proxy users relying on that
bandwidth, this translates to service loss beyond the API outage itself — the pipe
is wide open but the data is being wasted.

The core issue: there's no signal to the provider that the API is down. So it keeps
trying, keeps transferring, keeps wasting bandwidth, until either the outage ends or
the budget is exhausted. This fix adds that signal via degradation detection, then
uses it to stop the bleeding.

The Solution

Backend Degradation Detection:

Track consecutive backend failures (auth timeouts + OOB errors) with atomic counters
Require threshold (3+ consecutive) + recency check (within 2 minutes) to avoid false
positives from transient timeouts
Distinguish sustained outages from normal churn on a busy provider

Gating & Throttling When Degraded:

Skip contract creation goroutines (don't queue work against dead API)
Increase contract retry interval from 5s to 30s (reduce API call frequency)
Apply exponential backoff to resend timeouts (spread 16 retries across much longer
intervals instead of sending them in quick succession)

Production Validation

I validated this strategy during a recent control API outage on my fleet (2026-05-30, 3:10-3:21 PM PDT):

Scope: ~120 servers across North America + Europe
Duration: 11 minutes of outage window
Result: Bandwidth consumption dropped to ~12% of baseline during outage, confirming
that the mechanisms effectively reduce (not eliminate) the leak
Post-recovery: Traffic briefly spikes as in-flight data is retried, but this is
intended cleanup behavior (clients and contracts are long gone, so it amounts to
clearing queues). The strategy remains effective — the leak is contained during the
actual outage window.

Important caveat: An 11-minute window provides limited statistical power. The data
shows the mechanism works as designed, but edge cases or longer outages might reveal
new issues. The fix is based on production code that's been running in my
v3.23-fix implementation, so
confidence is high.

Implementation Details

Atomics ensure safe concurrent updates across multiple goroutines
Threshold + recency window (3+ failures within 2 min) prevents false triggers on
transient timeouts
Exponential backoff caps at 64× RTT to avoid excessively long waits
Contract creation gating at the goroutine launch site (not in API layer) to avoid
feedback loops

Related PRs

PR#180: Log spam fixes for auth/contract/drop errors (rate-limiting)
Issues [BUG] Massive Bandwidth Leak and Resource Exhaustion during Server-Side Outage #175, Bandwidth amplification during backend outages (resend queue + ack loss) #181: Bandwidth leak during API outages

full-bars · 2026-05-30T04:38:14Z

@xcolwell — requesting review on this bandwidth leak prevention PR.

This ports the backend degradation detection and contract gating mechanisms from the fork I maintain to upstream, addressing issues #175 and #181. The implementation distinguishes sustained API outages from transient timeouts, then gates contract creation and applies exponential backoff when degraded.

Validated in production during a recent 11-minute control API outage across ~120 servers—bandwidth consumption dropped to ~12% of baseline, confirming the mechanisms effectively reduce the leak.

Related: PR#180 (log spam fixes for auth/contract/drop errors).

full-bars · 2026-06-12T22:42:10Z

Rewrote the degradation detection model. The previous implementation set a boolean flag when suppressedAuthErrs > 100, which had three problems: it only fired on large fleets, it ignored OOB errors entirely, and it never cleared until the next 2-minute recheck even after the backend recovered.

The new model uses a consecutive failure counter fed by both auth and OOB errors, reset immediately on any success. Trips after 3 consecutive failures within 2 minutes; clears the moment a single connect or OOB call succeeds. The gating and backoff behavior (contract creation, 30s retry interval, 64× RTT resend cap) is unchanged, only the signal feeding them is corrected.

Also noticed a dealine typo in nearby comments (pre-existing upstream, 5 instances in transport.go) left untouched to keep this PR focused, worth a trivial follow-up fix.

During a control-API outage providers keep creating contracts and resending against a dead backend, leaking metered bandwidth. This adds a process-wide degradation signal and uses it to throttle that waste. - consecutive-fail degradation model: isBackendDegraded() trips after backendDegradedFailThreshold (3) consecutive backend failures (auth or OOB) with the last failure within backendDegradedWindow (2m). Any success resets the counter; the recency window clears stale counts on idle providers. Failures/successes are recorded in the auth (H1/H3) and OOB-create paths. - gating in SendSequence.updateContract: stretch the contract retry interval to 30s and skip CreateContract entirely while degraded, so no new contracts are queued against an unreachable platform; recovery resumes within one retry interval once isBackendDegraded() clears. - self-contained rate-limit helpers (shouldLogAuthErr/shouldLogOobErr + package-level atomics) so this PR can merge independently of urnetwork#180. Logging routes through the per-instance self.log/self.client.log (upstream removed the global glog). The degraded-only resend backoff from the original branch is dropped: main already backs off resends multiplicatively (capped at MaxResendInterval), which subsumes it. Squashed from the original 2-commit branch so there is no intermediate state that depends on the removed glog.

full-bars · 2026-06-15T13:00:50Z

Rebased onto current `main`, squashed to one commit

Rebased onto current main and squashed the two commits into one. Three things changed in the process:

Logging moved to self.log / self.client.log. main removed the global glog. The degradation helpers' rate-limited logging now happens at the call site (where the per-instance logger is available), returning (ok, suppressed) rather than logging internally. This is also why the two commits were squashed: the original first commit logged inside the package-level helpers via glog, which no longer exists, so there was no clean intermediate state.
Dropped the degraded-only resend backoff. main now backs off resends multiplicatively on every resend, capped at MaxResendInterval. That subsumes this branch's narrower degraded-only backoff, so I removed it. The two unique anti-leak mechanisms remain: the 30s contract-retry interval and gating CreateContract while degraded.
Heads-up on merge order with contract,transport: reduce log spam during backend outages #180. This PR and PR180 both define shouldLogAuthErr, shouldLogOobErr, and their four rate-limit atomics (kept duplicated so each PR is self-contained and can merge in any order). Whichever of the two merges second will need a ~25-line dedup: delete those four atomics and two helper functions from the second PR, keeping only this PR's degradation-detection additions. Whichever merges first needs no change.

full-bars mentioned this pull request Jun 13, 2026

perf: increase InitialContractTransferByteCount to 256KiB #177

Open

full-bars force-pushed the fix/bandwidth-leak-prevention branch from 6d2e486 to acded84 Compare June 15, 2026 13:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port bandwidth leak prevention: backend degradation detection & gating#182

Port bandwidth leak prevention: backend degradation detection & gating#182
full-bars wants to merge 1 commit into
urnetwork:mainfrom
full-bars:fix/bandwidth-leak-prevention

full-bars commented May 30, 2026 •

edited

Loading

Uh oh!

full-bars commented May 30, 2026

Uh oh!

full-bars commented Jun 12, 2026 •

edited

Loading

Uh oh!

full-bars commented Jun 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

full-bars commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bandwidth Leak Prevention: Backend Degradation Detection & Gating

The Problem

The Solution

Production Validation

Implementation Details

Related PRs

Uh oh!

full-bars commented May 30, 2026

Uh oh!

full-bars commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

full-bars commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rebased onto current main, squashed to one commit

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

full-bars commented May 30, 2026 •

edited

Loading

full-bars commented Jun 12, 2026 •

edited

Loading

full-bars commented Jun 15, 2026 •

edited

Loading

Rebased onto current `main`, squashed to one commit