fix(sandbox): target the sandbox's own gateway on non-default NEMOCLAW_GATEWAY_PORT #4987

Open
jahubba wants to merge 1 commit into NVIDIA:main from jahubba:fix/sandbox-gateway-name-nondefault-port


jahubba commented Jun 8, 2026


Summary

Sandbox lifecycle commands (connect/destroy/doctor/snapshot) and the shared gateway recovery/health path hardcoded the gateway registration name nemoclaw and ignored NEMOCLAW_GATEWAY_PORT. Since #4422 a non-default port registers the gateway as nemoclaw-<port>, so these paths targeted a gateway that doesn't exist: doctor reported a connected nemoclaw-<port> gateway as failed and tried to start one on the default port (often "Address already in use"); destroy/snapshot used the wrong gateway/container name; connect's readiness gate mis-resolved the gateway. Setting NEMOCLAW_GATEWAY_PORT did not help because the name was a literal.
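For readers unfamiliar with the per-port naming, the mismatch can be sketched roughly as follows. This is an illustration, not the project's code: the default port value (8080) and the `registered` set are placeholders assumed for the example; only the `nemoclaw` / `nemoclaw-<port>` naming and the `resolveGatewayName` name come from this PR.

```typescript
// Illustrative sketch only; not taken from the NemoClaw source.
// ASSUMPTION: 8080 is a placeholder for the real default gateway port.
const DEFAULT_GATEWAY_PORT = 8080;
const BASE_GATEWAY_NAME = "nemoclaw";

// Per-port naming as described for #4422: the default port keeps the bare
// name, any other port registers the gateway as nemoclaw-<port>.
function resolveGatewayName(port: number): string {
  return port === DEFAULT_GATEWAY_PORT
    ? BASE_GATEWAY_NAME
    : `${BASE_GATEWAY_NAME}-${port}`;
}

// Hypothetical stand-in for the set of registered gateways when the user
// runs with NEMOCLAW_GATEWAY_PORT=9090.
const registered = new Set<string>([resolveGatewayName(9090)]);

// Before the fix: commands looked up the literal name and missed the gateway.
const foundByLiteral = registered.has(BASE_GATEWAY_NAME);

// After the fix: commands look up the resolved per-port name.
const foundByResolvedName = registered.has(resolveGatewayName(9090));
```

With the literal lookup the running gateway is never found, which is consistent with doctor reporting it as failed and then attempting a start on the default port.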

Related Issue

Fixes #4985. Completes the per-port migration started in #4422 (distinct from the multi-instance issue #4865).

Changes

  • src/lib/onboard/gateway-binding.ts — new resolveSandboxGatewayName(sb) resolving the sandbox's persisted gatewayName/gatewayPort (falls back to the bare default for older entries).
  • src/lib/gateway-runtime-action.ts — getNamedGatewayLifecycleState / recoverNamedGatewayRuntime take a gatewayName, defaulting to the env-resolved per-port name (resolveGatewayName(GATEWAY_PORT)), so non-sandbox callers honor NEMOCLAW_GATEWAY_PORT; the healthy_named classifier now compares against the resolved name instead of the literal "nemoclaw".
  • src/lib/actions/sandbox/{connect,destroy,doctor,snapshot,gateway-state,policy-channel}.ts — resolve the gateway from the sandbox entry instead of the hardcoded NEMOCLAW_GATEWAY_NAME constant; thread it through gateway select/info/remove/destroy, the docker-driver container/DNS names, and the recovery/health calls. doctor status detail/hint now name the actual gateway.
  • src/lib/onboard/gateway-binding.test.ts — unit tests for resolveSandboxGatewayName (name / port-derived / fallback).
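The resolution order covered by the new tests (name / port-derived / fallback) might look roughly like the sketch below. The `SandboxEntry` shape, the placeholder default port, and the choice that a persisted `gatewayName` wins over `gatewayPort` are assumptions made for illustration; the actual signature and registry types in `gateway-binding.ts` may differ.

```typescript
// Illustrative sketch only; not the implementation in gateway-binding.ts.
// ASSUMPTION: 8080 is a placeholder for the real default gateway port.
const DEFAULT_GATEWAY_PORT = 8080;
const BASE_GATEWAY_NAME = "nemoclaw";

function resolveGatewayName(port: number): string {
  return port === DEFAULT_GATEWAY_PORT
    ? BASE_GATEWAY_NAME
    : `${BASE_GATEWAY_NAME}-${port}`;
}

// Hypothetical registry entry; only gatewayName and gatewayPort are fields
// named in the PR description.
interface SandboxEntry {
  name: string;
  gatewayName?: string;
  gatewayPort?: number;
}

function resolveSandboxGatewayName(sb: SandboxEntry): string {
  // 1. A persisted gateway name is used as-is (assumed to take precedence).
  if (sb.gatewayName) return sb.gatewayName;
  // 2. Otherwise derive the name from the persisted port.
  if (sb.gatewayPort !== undefined) return resolveGatewayName(sb.gatewayPort);
  // 3. Older entries without either field fall back to the bare default.
  return BASE_GATEWAY_NAME;
}

const fromName = resolveSandboxGatewayName({ name: "a", gatewayName: "nemoclaw-9090" });
const fromPort = resolveSandboxGatewayName({ name: "b", gatewayPort: 9090 });
const legacy = resolveSandboxGatewayName({ name: "c" });
```

Falling back to the bare name in the last step is what keeps entries created before the persisted fields existed working unchanged.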

Behavior

  • Sandbox-scoped commands resolve nemoclaw-<port> from the registry; non-sandbox commands honor NEMOCLAW_GATEWAY_PORT.
  • Backward compatible: default port and entries without the persisted fields keep the bare nemoclaw name.
  • Known follow-up: startGatewayForRecovery() (the gateway start path) still reads onboarding's process-global port; the common "gateway already running" path is fully resolved here.
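One way to picture the split between sandbox-scoped and non-sandbox callers is a classifier whose `gatewayName` parameter defaults to the env-resolved name. The sketch below is hypothetical: `classifyGateway`, `GatewayStatus`, `gatewayPortFromEnv`, the `ENV` stand-in for `process.env`, the `"other"` state, and the placeholder default port are all assumptions; only `healthy_named`, `resolveGatewayName`, and `NEMOCLAW_GATEWAY_PORT` appear in the PR.

```typescript
// Illustrative sketch only; not the code in gateway-runtime-action.ts.
// ASSUMPTION: 8080 is a placeholder for the real default gateway port.
const DEFAULT_GATEWAY_PORT = 8080;
const BASE_GATEWAY_NAME = "nemoclaw";

function resolveGatewayName(port: number): string {
  return port === DEFAULT_GATEWAY_PORT
    ? BASE_GATEWAY_NAME
    : `${BASE_GATEWAY_NAME}-${port}`;
}

// Hypothetical helper: read the port from an env-like map, falling back to
// the default when the variable is unset or not a positive integer.
function gatewayPortFromEnv(env: Record<string, string | undefined>): number {
  const raw = env["NEMOCLAW_GATEWAY_PORT"];
  const parsed = raw === undefined ? NaN : Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_GATEWAY_PORT;
}

// Stand-in for process.env so the sketch runs without Node typings.
const ENV: Record<string, string | undefined> = { NEMOCLAW_GATEWAY_PORT: "9090" };

interface GatewayStatus {
  connected: boolean;
  activeGateway: string | null;
}

// "healthy_named" is named in the PR; "other" is a hypothetical catch-all.
type LifecycleState = "healthy_named" | "other";

function classifyGateway(
  status: GatewayStatus,
  // Non-sandbox callers omit this and get the env-resolved per-port name.
  gatewayName: string = resolveGatewayName(gatewayPortFromEnv(ENV)),
): LifecycleState {
  return status.connected && status.activeGateway === gatewayName
    ? "healthy_named"
    : "other";
}

const status: GatewayStatus = { connected: true, activeGateway: "nemoclaw-9090" };
const viaEnvDefault = classifyGateway(status);
const viaOldLiteral = classifyGateway(status, BASE_GATEWAY_NAME);
```

Sandbox-scoped commands would pass the name resolved from the sandbox entry explicitly, so they do not depend on the environment of the invoking shell.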

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Ran the affected suites locally: gateway-binding, gateway-state-drift, sandbox actions, status-health, doctor-gateway-token (408 tests pass); npm run typecheck:cli clean; Biome clean. Relying on CI for the full matrix (a fixed set of NIM/GPU/install suites fail identically on a clean main on this aarch64 host).

Signed-off-by: Jason Hubbard <jasonahubbard@gmail.com>

Sandbox lifecycle commands (connect/destroy/doctor/snapshot) and the
shared gateway recovery/health path (recoverNamedGatewayRuntime /
getNamedGatewayLifecycleState) hardcoded the gateway name "nemoclaw" and
ignored NEMOCLAW_GATEWAY_PORT. Since NVIDIA#4422 a non-default port registers
the gateway as `nemoclaw-<port>`, so these paths targeted a gateway that
does not exist: `doctor` reported a connected `nemoclaw-<port>` gateway
as failed and tried to start one on the default port; destroy/snapshot
operated on the wrong gateway/container name; connect's readiness gate
mis-resolved the gateway.

Add `resolveSandboxGatewayName()` (reads the sandbox's persisted
gatewayName/gatewayPort) and thread it through these commands.
Parameterize getNamedGatewayLifecycleState/recoverNamedGatewayRuntime
with `gatewayName`, defaulting to the env-resolved per-port name so
non-sandbox callers are unaffected. Adds unit tests for the resolver.

Signed-off-by: Jason Hubbard <jasonahubbard@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

copy-pr-bot Bot commented Jun 8, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wscurran added the bug-fix label (PR fixes a bug or regression) Jun 8, 2026

Development

Successfully merging this pull request may close these issues.

Sandbox commands ignore NEMOCLAW_GATEWAY_PORT — hardcoded nemoclaw breaks non-default-port gateways