feat(vgpu-manager): default Fabric Manager to vGPU multitenancy mode #2602

Draft
lexfrei wants to merge 1 commit into NVIDIA:main from lexfrei:feat/fabricmanager-vgpu-mode

@lexfrei lexfrei commented Jul 4, 2026

Contributor

Description

RFC / Draft — seeking maintainer design direction before completing. Please do not merge.

On NVSwitch (HGX) systems running SR-IOV vGPU, NVIDIA Fabric Manager must run in vGPU multitenancy mode (FABRIC_MODE=2) instead of the default bare-metal / passthrough mode (FABRIC_MODE=0). In mode 0 a whole-card vGPU guest cannot initialize CUDA: cuInit fails with error 802 (CUDA_ERROR_SYSTEM_NOT_READY) because the guest's NVLink fabric never becomes usable. FABRIC_MODE=1 is the older shared-NVSwitch multitenancy model; FABRIC_MODE=2 is the mode required for SR-IOV vGPU (see the NVIDIA Fabric Manager User Guide, config-file section).

Today the operator provides no way to select the Fabric Manager mode. On the operator-managed-driver path the driver container starts Fabric Manager from the packaged fabricmanager.cfg unchanged (FABRIC_MODE=0) and reads no environment variable to override it, so SR-IOV vGPU on NVSwitch cannot be brought up through the operator.

This change defaults the vGPU Manager container (the vm-vgpu sandbox-workload path) to FABRIC_MODE=2. A user-supplied FABRIC_MODE in vgpuManager.env takes precedence. The value is scoped to the vGPU Manager container, so container / passthrough / bare-metal workloads are unchanged. On non-NVSwitch vGPU hosts the driver entrypoint does not start Fabric Manager, so the value is ignored.
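
The default-then-override behavior described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `EnvVar` is a local stand-in for the Kubernetes `corev1.EnvVar` type, and `applyFabricModeDefault`, `setEnv`, and `lookupEnv` are hypothetical names chosen for the example.

```go
package main

import "fmt"

// EnvVar is a local stand-in for corev1.EnvVar so the sketch compiles
// without Kubernetes dependencies (assumption for illustration only).
type EnvVar struct {
	Name  string
	Value string
}

const (
	fabricModeEnvName = "FABRIC_MODE" // variable name taken from the PR description
	fabricModeVGPU    = "2"           // vGPU multitenancy mode
)

// applyFabricModeDefault (hypothetical name) sets FABRIC_MODE=2 first and
// then applies the user-supplied entries, so a user value overrides the default.
func applyFabricModeDefault(containerEnv, userEnv []EnvVar) []EnvVar {
	out := setEnv(containerEnv, fabricModeEnvName, fabricModeVGPU)
	for _, e := range userEnv {
		out = setEnv(out, e.Name, e.Value)
	}
	return out
}

// setEnv returns a copy of env in which name is set to value, overwriting
// an existing entry or appending a new one. The input slice is not modified.
func setEnv(env []EnvVar, name, value string) []EnvVar {
	out := make([]EnvVar, 0, len(env)+1)
	found := false
	for _, e := range env {
		if e.Name == name {
			e.Value = value
			found = true
		}
		out = append(out, e)
	}
	if !found {
		out = append(out, EnvVar{Name: name, Value: value})
	}
	return out
}

// lookupEnv reports the value of name in env, if present.
func lookupEnv(env []EnvVar, name string) (string, bool) {
	for _, e := range env {
		if e.Name == name {
			return e.Value, true
		}
	}
	return "", false
}

func main() {
	defaulted := applyFabricModeDefault(nil, nil)
	overridden := applyFabricModeDefault(nil, []EnvVar{{Name: "FABRIC_MODE", Value: "1"}})

	d, _ := lookupEnv(defaulted, fabricModeEnvName)
	o, _ := lookupEnv(overridden, fabricModeEnvName)
	fmt.Println("default:", d, "user override:", o)
}
```

Applying the default before the user entries keeps the precedence rule in one place: whatever the user lists in vgpuManager.env is written last and therefore wins.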

Why this is an RFC (honest scope)

This is the operator half of a two-part change and is intentionally inert on its own:

  • The operator can only inject an env var; the container entrypoint must consume it. The vGPU Manager container image does not currently install or start Fabric Manager (it is installed only in the non-vGPU driver image), and no entrypoint reads FABRIC_MODE. A companion gpu-driver-container change is needed to (a) ship Fabric Manager in the vGPU Manager image and (b) write FABRIC_MODE into fabricmanager.cfg before starting Fabric Manager on NVSwitch hosts.
  • Open design question: in a vGPU-on-NVSwitch deployment, which component should own Fabric Manager? On NVSwitch systems it is started today by the driver container, but in a vGPU deployment the vGPU Manager container is what runs on the node (and ships no Fabric Manager). I chose the vGPU Manager container as the injection target because it is the container that actually runs on vGPU nodes, but I would like direction before implementing the driver-container side.
  • Out of scope: the host-installed-driver path (driver.enabled=false), where the operator manages neither the driver nor Fabric Manager; and per-VM fabric-partition activation, which is a separate device-plugin concern.
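
For the companion half, step (b) above amounts to rewriting one setting in fabricmanager.cfg before Fabric Manager starts. The sketch below shows that rewrite on the file contents as a string. It is written in Go only to match the operator's language; the real entrypoint may well be a shell script. `setFabricMode` is a hypothetical helper, `OTHER_SETTING` is a made-up key, and the one-`KEY=VALUE`-per-line layout is an assumption based on the `FABRIC_MODE=0` form quoted in this description.

```go
package main

import (
	"fmt"
	"strings"
)

// setFabricMode (hypothetical helper) returns cfg with its FABRIC_MODE
// setting changed to mode. It assumes one KEY=VALUE setting per line.
// Commented-out lines such as "# FABRIC_MODE=0" are left untouched.
// If no FABRIC_MODE line exists, one is appended.
func setFabricMode(cfg, mode string) string {
	lines := strings.Split(cfg, "\n")
	replaced := false
	for i, line := range lines {
		key, _, ok := strings.Cut(strings.TrimSpace(line), "=")
		if ok && strings.TrimSpace(key) == "FABRIC_MODE" {
			lines[i] = "FABRIC_MODE=" + mode
			replaced = true
		}
	}
	if !replaced {
		lines = append(lines, "FABRIC_MODE="+mode)
	}
	return strings.Join(lines, "\n")
}

func main() {
	// Illustrative config contents; OTHER_SETTING is not a real key.
	packaged := "OTHER_SETTING=1\nFABRIC_MODE=0\n"
	fmt.Print(setFabricMode(packaged, "2"))
}
```

In an entrypoint, the mode would come from the injected FABRIC_MODE environment variable, and the rewrite would run only on hosts where NVSwitch hardware is detected.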

Related: #2594 describes the adjacent SR-IOV / vendor-specific VFIO vGPU device-creation gap on the same class of hardware.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint) — the changed package is clean (golangci-lint run ./controllers/ reports 0 issues). Repo-wide make lint could not complete locally on macOS because an unrelated command (cmd/nvidia-validator) imports a Linux-only vendored package that is build-excluded on darwin; expected to pass on Linux CI.
  • Generated assets in-sync (make validate-generated-assets) — this change adds no API/CRD changes, so no generated assets change. The target could not run to completion on macOS for the same Linux-only-dependency reason.
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

  • go build ./controllers/ and go test ./controllers/ pass, including a new TestTransformVGPUManagerFabricMode that covers the FABRIC_MODE=2 default and the user-override precedence.
  • golangci-lint run ./controllers/ reports 0 issues.
  • End-to-end behavior on NVSwitch hardware is not exercised here — it depends on the companion gpu-driver-container change and cannot run in CI.

Companion (image half): NVIDIA/gpu-driver-container#854 — ships and starts Fabric Manager in the vGPU Manager image so the FABRIC_MODE this PR injects is actually consumed. The two are an RFC pair; please review the design together.

SR-IOV vGPU on NVSwitch (HGX) systems requires NVIDIA Fabric Manager to
run in vGPU multitenancy mode (FABRIC_MODE=2) rather than the packaged
bare-metal default (FABRIC_MODE=0). Without it a whole-card vGPU guest
cannot initialize CUDA (fails at cuInit).

Default the vGPU Manager container to FABRIC_MODE=2 so its entrypoint can
configure fabricmanager.cfg for NVSwitch hosts. Non-NVSwitch vGPU hosts
do not start Fabric Manager, so the value is ignored there. A
user-provided FABRIC_MODE in VGPUManager.Env takes precedence.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

copy-pr-bot Bot commented Jul 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
