0.5.3: TCP keepalive + lastSeen-aware stale detection (loopback peer restart)#22
Merged
v0.5.2 added stale-prior detection via the `_closed` flag. That fired
when transport.close() had been called explicitly, but missed the
common dead-but-ESTABLISHED case: a peer process is killed, no FIN
reaches the survivor, and the survivor's TCP socket sits in
ESTABLISHED state until OS keepalive reaps it.
On Wi-Fi this isn't very visible — mobile/Wi-Fi TCP routes are noisy
(route flaps, ARP timeouts, AP transitions, carrier hand-offs) and
keepalive idle defaults are short. Stale sockets die in seconds.
On loopback (two processes on the same host) it's a hard block.
macOS default TCP_KEEPALIVE = 7200s (2 hours) before the first probe.
A killed-and-relaunched same-host peer stays "ESTABLISHED" on the
survivor for 2h, and the dedup logic against this zombie entry
rejects every redial.
Observed: Mac Catalyst MeloMove ↔ claude-code-mac (Node) on the same
Mac. MeloMove restart for rebuild → claude-code-mac retains dead
ESTABLISHED socket → new MeloMove dial rejected until keepalive
reaps the socket (up to 2h). iPhone ↔ claude-code-mac on Wi-Fi works
because Wi-Fi noise reaps stale sockets quickly.
Three changes:
1. TcpTransport sets socket.setKeepAlive(true, 1000) — 1s initial
idle delay, OS-default probe cadence after. macOS detects dead
remote within ~10s instead of ~2h.
2. The inbound-connection handler and _createPeer now treat a peer
entry with lastSeen older than _heartbeatInterval (default 10s)
as stale, regardless of the _closed flag. The remote re-dialling
is itself strong evidence the prior is dead — it wouldn't dial
again if it had a healthy connection. Catches the gap before
keepalive fires + any future dead-ESTABLISHED scenarios.
3. Identity-aware close handlers. When a stale prior is closed and
replaced, its eventual close handler must NOT clobber the new
transport entry. Both close handlers (existingPeer branch and
new-peer branch) now check `transports.get(source) === transport`
before mutating the dict.
150/150 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
Companion fix to v0.5.2's same-host dedup. v0.5.2 added stale-prior detection via the `_closed` flag — set when `transport.close()` had been called explicitly. But the common dead-but-ESTABLISHED case (peer process killed; OS doesn't deliver FIN to the survivor before keepalive reaps it) leaves `_closed=false` forever.
On loopback this is a hard block: with the macOS default `TCP_KEEPALIVE = 7200s`, 2 hours pass before the first probe. The survivor sees the dead socket as alive, and dedup against the zombie rejects every redial.
Field evidence
Mac Catalyst MeloMove ↔ claude-code-mac (Node) on the same Mac. Each Mac MeloMove rebuild → claude-code-mac retains a dead ESTABLISHED socket → new Mac MeloMove dial rejected for 2h.
```
node 17910 hongwei 20u IPv4 ... TCP 192.168.1.102:59732->192.168.1.102:50598 (ESTABLISHED)
                                                         ^^^^^^^^^^^^^^^^^^^
                                                         old MeloMove process, dead
```
iPhone ↔ claude-code-mac on Wi-Fi works. Same logical bug, less visible — Wi-Fi noise (route flaps, ARP, AP transitions) reaps stale sockets within seconds. The bug only became user-visible on loopback.
Fix (three-part)
1. TCP keepalive on every TcpTransport socket
```js
socket.setKeepAlive(true, 1000);
```
1-second initial idle delay before OS keepalive probes start. macOS then detects dead remote in ~10s instead of ~2h.
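As a minimal sketch of how this might be wired on both sides of a connection, assuming plain Node `net` sockets: `enableKeepAlive` and `watchServer` are hypothetical names for illustration, not the actual `TcpTransport` code. Node takes the delay in milliseconds but rounds it down to whole seconds before handing it to the OS.

```js
const net = require('node:net');

// Hypothetical helper (not the real TcpTransport code): turn on TCP keepalive
// with a 1s idle delay so a dead peer is probed quickly instead of after the
// OS default idle time.
function enableKeepAlive(socket, initialDelayMs = 1000) {
  // net.Socket#setKeepAlive(enable, initialDelay) returns the socket itself.
  return socket.setKeepAlive(true, initialDelayMs);
}

// Inbound side: apply the same setting to every accepted connection.
function watchServer(server) {
  server.on('connection', (socket) => enableKeepAlive(socket));
  return server;
}

// Outbound side, for example:
//   const socket = enableKeepAlive(net.connect(port, host));
```

Both ends need the setting, since either process may be the survivor after the other is killed.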
2. lastSeen-aware stale detection in dedup paths
Both the `inbound-connection` handler and `_createPeer` now treat a peer entry with `lastSeen` older than `_heartbeatInterval` (default 10s) as stale, regardless of the `_closed` flag. The remote re-dialling is itself evidence that its prior connection is dead: a healthy peer wouldn't dial again. The dead prior is closed explicitly so that its close handler removes the dict entry before the new transport is registered.
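The dedup decision could be sketched roughly as follows. The entry shape (`closed`, `lastSeen`, `transport`), the `peers` map, and the function names `isStalePrior` and `admit` are assumptions for illustration; the real `_createPeer` and `inbound-connection` code is not shown in this description.

```js
// Sketch only: field and function names are hypothetical.
const DEFAULT_HEARTBEAT_INTERVAL_MS = 10000; // 10s default, as stated in this PR

// A prior entry is stale if it was closed explicitly, or if nothing has been
// heard from it for longer than one heartbeat interval.
function isStalePrior(prior, now, heartbeatIntervalMs = DEFAULT_HEARTBEAT_INTERVAL_MS) {
  if (!prior) return false;
  if (prior.closed) return true;
  return now - prior.lastSeen > heartbeatIntervalMs;
}

// Dedup decision for a new connection from `source`.
// Returns true if the new transport was registered, false if it was rejected.
function admit(peers, source, transport, now = Date.now()) {
  const prior = peers.get(source);
  if (prior && !isStalePrior(prior, now)) {
    return false; // healthy duplicate: keep the existing connection
  }
  if (prior) {
    prior.transport.close(); // close the dead prior explicitly
    peers.delete(source);
  }
  peers.set(source, { transport, lastSeen: now, closed: false });
  return true;
}
```

In this sketch a redial that arrives within one heartbeat interval of the last traffic is still rejected, which is the window that the keepalive change is meant to cover.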
3. Identity-aware close handlers
When a stale prior is closed and replaced, its eventual close handler must NOT clobber the new transport entry. Both close handlers in `_createPeer` now guard with `transports.get(source) === transport` before mutating. Prevents a late-firing close from a swapped-out prior tearing down its replacement.
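A minimal sketch of the identity guard, assuming transports are event emitters stored in a `Map` keyed by source. `registerTransport` is a hypothetical name; only the `transports.get(source) === transport` comparison comes from the description above.

```js
const { EventEmitter } = require('node:events');

// Hypothetical registration helper: the close handler only removes the map
// entry if that entry still points at the transport that is closing.
function registerTransport(transports, source, transport) {
  transports.set(source, transport);
  transport.once('close', () => {
    if (transports.get(source) === transport) {
      transports.delete(source);
    }
  });
}

// Usage: a stale prior is replaced, then its close event fires late.
const transports = new Map();
const prior = new EventEmitter();
const replacement = new EventEmitter();
registerTransport(transports, 'peer-1', prior);
registerTransport(transports, 'peer-1', replacement);
prior.emit('close'); // the entry for 'peer-1' still points at `replacement`
```

Without the guard, the late `close` from `prior` would delete the entry that now belongs to `replacement`.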
Tests
150/150 existing unit tests pass. No regressions in transport priority, peer lifecycle, multi-transport, or any other path.
Test plan
🤖 Generated with Claude Code