
0.5.3: TCP keepalive + lastSeen-aware stale detection (loopback peer restart)#22

Merged
sym-bot merged 2 commits into main from fix/stale-peer-detection
Apr 29, 2026


@sym-bot sym-bot commented Apr 29, 2026

Summary

Companion fix to v0.5.2's same-host dedup. v0.5.2 added stale-prior detection via the `_closed` flag — set when `transport.close()` had been called explicitly. But the common dead-but-ESTABLISHED case (peer process killed; OS doesn't deliver FIN to the survivor before keepalive reaps it) leaves `_closed=false` forever.

On loopback this is a hard block: the macOS default `TCP_KEEPALIVE` idle time is 7200s (2 hours) before the first probe. The survivor treats the dead socket as alive, and dedup against the zombie entry rejects every redial.

Field evidence

Mac Catalyst MeloMove ↔ claude-code-mac (Node) on the same Mac. Each Mac MeloMove rebuild → claude-code-mac retains a dead ESTABLISHED socket → new Mac MeloMove dial rejected for 2h.

```
node 17910 hongwei 20u IPv4 ... TCP 192.168.1.102:59732->192.168.1.102:50598 (ESTABLISHED)
^^^^^^^^^^^^^^^^^^^^^^^
old MeloMove process, dead
```

iPhone ↔ claude-code-mac on Wi-Fi works. Same logical bug, less visible — Wi-Fi noise (route flaps, ARP, AP transitions) reaps stale sockets within seconds. The bug only became user-visible on loopback.

Fix (three-part)

1. TCP keepalive on every TcpTransport socket

```js
socket.setKeepAlive(true, 1000);
```

This sets a 1-second initial idle delay before OS keepalive probes start. macOS then detects a dead remote in ~10s instead of ~2h.
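For context, here is a minimal sketch of where such a call could sit in a Node TCP transport. Only `socket.setKeepAlive(true, 1000)` comes from this change; the helper `enableKeepAlive`, the constant `KEEPALIVE_IDLE_MS`, and the `createServer` wrapper are hypothetical names used for illustration, not the actual `TcpTransport` code.

```js
const net = require('node:net');

// Hypothetical constant: idle time before the OS sends the first keepalive probe.
const KEEPALIVE_IDLE_MS = 1000;

// Hypothetical helper: enable TCP keepalive on a socket and return it.
// net.Socket#setKeepAlive(enable, initialDelay) takes the delay in milliseconds.
function enableKeepAlive(socket, idleMs = KEEPALIVE_IDLE_MS) {
  socket.setKeepAlive(true, idleMs);
  return socket;
}

// Usage sketch: apply the same setting to every accepted socket.
// A dialling path would call enableKeepAlive on the socket it creates.
function createServer(onSocket) {
  return net.createServer((socket) => onSocket(enableKeepAlive(socket)));
}
```

Routing both the accepted and the dialled sockets through one helper keeps the two sides of a same-host pair on the same idle setting.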

2. lastSeen-aware stale detection in dedup paths

Both the `inbound-connection` handler and `_createPeer` now treat a peer entry with `lastSeen` older than `_heartbeatInterval` (default 10s) as stale, regardless of the `_closed` flag. The remote re-dialling is itself evidence that its prior connection is dead: a healthy peer wouldn't dial again. The dedup path then closes the dead prior explicitly, so that its close handler removes the dict entry before the new transport is registered.
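The staleness rule can be sketched as a small predicate plus an admit/reject decision. This is an illustration under assumptions, not the code in this PR: the function names `isStalePrior` and `admitConnection`, the `{ transport, lastSeen }` entry shape, and the return strings are all hypothetical. Only the rule itself (stale if `_closed`, or if `lastSeen` is older than the heartbeat interval) is taken from the description above.

```js
// Hypothetical default, mirroring the 10s heartbeat interval mentioned above.
const DEFAULT_HEARTBEAT_INTERVAL_MS = 10_000;

// A prior entry is stale if it was explicitly closed (the v0.5.2 check)
// or if nothing has been heard from it for longer than one heartbeat interval.
function isStalePrior(prior, now = Date.now(), heartbeatIntervalMs = DEFAULT_HEARTBEAT_INTERVAL_MS) {
  if (!prior) return false;
  if (prior.transport._closed) return true;
  return now - prior.lastSeen > heartbeatIntervalMs;
}

// Decide what to do with an incoming connection for a peer id already in the map.
// Assumed entry shape: { transport, lastSeen } keyed by peer id.
function admitConnection(peers, peerId, transport, now = Date.now()) {
  const prior = peers.get(peerId);
  if (prior && !isStalePrior(prior, now)) {
    return 'rejected-duplicate'; // healthy prior: keep it, drop the new dial
  }
  if (prior) prior.transport.close(); // close the dead prior explicitly
  peers.set(peerId, { transport, lastSeen: now });
  return prior ? 'replaced-stale' : 'accepted';
}
```

Passing `now` as a parameter keeps the rule deterministic in unit tests, which makes the 10s boundary easy to exercise without timers.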

3. Identity-aware close handlers

When a stale prior is closed and replaced, its eventual close handler must NOT clobber the new transport entry. Both close handlers in `_createPeer` now guard with `transports.get(source) === transport` before mutating the dict. This prevents a late-firing close on a swapped-out prior from tearing down its replacement.
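A minimal sketch of the guard, assuming transports are `EventEmitter`s that emit `'close'` and are stored in a `Map` keyed by `source`. The `registerTransport` helper is a hypothetical name; the `transports.get(source) === transport` check is the part taken from this PR.

```js
const { EventEmitter } = require('node:events');

// Register a transport for `source` and wire a close handler that only
// removes the map entry if that entry is still this same transport.
function registerTransport(transports, source, transport) {
  transports.set(source, transport);
  transport.once('close', () => {
    if (transports.get(source) === transport) {
      transports.delete(source);
    }
    // Otherwise a replacement has already been registered; leave it alone.
  });
}

// Usage sketch: the old transport closes after it has been replaced.
const transports = new Map();
const stalePrior = new EventEmitter();
const replacement = new EventEmitter();
registerTransport(transports, 'peer-a', stalePrior);
registerTransport(transports, 'peer-a', replacement);
stalePrior.emit('close'); // late close from the swapped-out prior
console.log(transports.get('peer-a') === replacement); // true
```

Without the identity check, the late `'close'` from `stalePrior` would delete the `'peer-a'` entry that now belongs to `replacement`.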

Tests

150/150 existing unit tests pass. No regressions in transport priority, peer lifecycle, multi-transport, or any other path.

Test plan

  • All existing unit tests pass (`npm test`)
  • Verify on Mac Catalyst MeloMove ↔ claude-code-mac (same Mac, loopback): rebuild + redeploy Mac MeloMove, observe claude-code-mac shows up in Mac MeloMove's PEERS list within ~10s instead of being permanently blocked.

🤖 Generated with Claude Code

sym-bot and others added 2 commits April 29, 2026 17:45
v0.5.2 added stale-prior detection via the `_closed` flag. That fired
when transport.close() had been called explicitly, but missed the
common dead-but-ESTABLISHED case: a peer process is killed, no FIN
reaches the survivor, and on the survivor side the TCP socket
sits in ESTABLISHED state until OS keepalive reaps it.

On Wi-Fi this isn't very visible — mobile/Wi-Fi TCP routes are noisy
(route flaps, ARP timeouts, AP transitions, carrier hand-offs) and
keepalive idle defaults are short. Stale sockets die in seconds.

On loopback (two processes on the same host) it's a hard block.
macOS default TCP_KEEPALIVE = 7200s (2 hours) before the first probe.
A killed-and-relaunched same-host peer stays "ESTABLISHED" on the
survivor for 2h, and the dedup logic against this zombie entry
rejects every redial.

Observed: Mac Catalyst MeloMove ↔ claude-code-mac (Node) on the same
Mac. Mac MeloMove restart for rebuild → claude-code-mac retains dead
ESTABLISHED socket → new Mac MeloMove dial rejected for 2h. iPhone ↔
claude-code-mac on Wi-Fi works because Wi-Fi noise reaps stale
sockets quickly.

Three changes:

  1. TcpTransport sets socket.setKeepAlive(true, 1000) — 1s initial
     idle delay, OS-default probe cadence after. macOS detects dead
     remote within ~10s instead of ~2h.

  2. inbound-connection handler and _createPeer now treat a peer
     entry with lastSeen older than _heartbeatInterval (default 10s)
     as stale, regardless of the _closed flag. The remote re-dialling
     is itself strong evidence the prior is dead — it wouldn't dial
     again if it had a healthy connection. Catches the gap before
     keepalive fires + any future dead-ESTABLISHED scenarios.

  3. Identity-aware close handlers. When a stale prior is closed and
     replaced, its eventual close handler must NOT clobber the new
     transport entry. Both close handlers (existingPeer branch and
     new-peer branch) now check `transports.get(source) === transport`
     before mutating the dict.

150/150 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@sym-bot sym-bot merged commit 367b29e into main Apr 29, 2026
2 checks passed
@sym-bot sym-bot deleted the fix/stale-peer-detection branch April 29, 2026 16:53