feat(vgpu-manager): default Fabric Manager to vGPU multitenancy mode by lexfrei · Pull Request #2602 · NVIDIA/gpu-operator

lexfrei · 2026-07-04T21:18:54Z

Description

RFC / Draft — seeking maintainer design direction before completing. Please do not merge.

On NVSwitch (HGX) systems running SR-IOV vGPU, NVIDIA Fabric Manager must run in vGPU multitenancy mode (FABRIC_MODE=2) instead of the default bare-metal / passthrough mode (FABRIC_MODE=0). In mode 0 a whole-card vGPU guest cannot initialize CUDA — cuInit fails with error 802 because the guest's NVLink fabric never becomes usable. FABRIC_MODE=1 is the older shared-NVSwitch multitenancy model; FABRIC_MODE=2 is the mode required for SR-IOV vGPU (see the NVIDIA Fabric Manager User Guide, config-file section).

Today the operator provides no way to select the Fabric Manager mode. On the operator-managed-driver path the driver container starts Fabric Manager from the packaged fabricmanager.cfg unchanged (FABRIC_MODE=0) and reads no environment variable to override it, so SR-IOV vGPU on NVSwitch cannot be brought up through the operator.

This change defaults the vGPU Manager container (the vm-vgpu sandbox-workload path) to FABRIC_MODE=2. A user-supplied FABRIC_MODE in vgpuManager.env takes precedence. The value is scoped to the vGPU Manager container, so container / passthrough / bare-metal workloads are unchanged, and on non-NVSwitch vGPU hosts the driver entrypoint does not start Fabric Manager so the value is ignored.
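The default-with-override behavior described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `EnvVar` is a simplified stand-in for the Kubernetes `corev1.EnvVar` type, and `withDefaultFabricMode` and the two constants are hypothetical names chosen for the example.

```go
package main

// EnvVar is a simplified stand-in for corev1.EnvVar (assumption: only
// the Name and Value fields matter for this illustration).
type EnvVar struct {
	Name  string
	Value string
}

const (
	// fabricModeEnvName is the variable the container entrypoint is expected to read.
	fabricModeEnvName = "FABRIC_MODE"
	// fabricModeVGPUMultitenancy is the vGPU multitenancy mode described in this PR.
	fabricModeVGPUMultitenancy = "2"
)

// withDefaultFabricMode returns env unchanged when the user already set
// FABRIC_MODE; otherwise it returns a copy with FABRIC_MODE=2 appended.
// (Hypothetical helper; the operator's real transform function may differ.)
func withDefaultFabricMode(env []EnvVar) []EnvVar {
	for _, e := range env {
		if e.Name == fabricModeEnvName {
			// A user-supplied value takes precedence over the default.
			return env
		}
	}
	// Copy before appending so the caller's slice is never mutated.
	out := make([]EnvVar, 0, len(env)+1)
	out = append(out, env...)
	return append(out, EnvVar{Name: fabricModeEnvName, Value: fabricModeVGPUMultitenancy})
}
```

Checking for an existing entry before appending keeps the user's value authoritative without relying on how duplicate environment variable names are resolved in a container spec.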

Why this is an RFC (honest scope)

This is the operator half of a two-part change and is intentionally inert on its own:

The operator can only inject an env var; the container entrypoint must consume it. The vGPU Manager container image does not currently install or start Fabric Manager (it is installed only in the non-vGPU driver image), and no entrypoint reads FABRIC_MODE. A companion gpu-driver-container change is needed to (a) ship Fabric Manager in the vGPU Manager image and (b) write FABRIC_MODE into fabricmanager.cfg before starting Fabric Manager on NVSwitch hosts.
Open design question: in a vGPU-on-NVSwitch deployment, which component should own Fabric Manager? On NVSwitch systems it is started today by the driver container, but in a vGPU deployment the vGPU Manager container is what runs on the node (and ships no Fabric Manager). I chose the vGPU Manager container as the injection target because it is the container that actually runs on vGPU nodes, but I would like direction before implementing the driver-container side.
Out of scope: the host-installed-driver path (driver.enabled=false), where the operator manages neither the driver nor Fabric Manager; and per-VM fabric-partition activation, which is a separate device-plugin concern.
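Step (b) of the companion change, writing FABRIC_MODE into fabricmanager.cfg before Fabric Manager starts, could look roughly like the sketch below. It is written in Go only to stay consistent with the operator code; the actual entrypoint in gpu-driver-container may well be a shell script and may do this differently. `setFabricMode` is a hypothetical helper, and the sketch assumes the config file consists of `KEY=VALUE` lines with `#` comments.

```go
package main

import (
	"fmt"
	"strings"
)

// setFabricMode returns cfg with its FABRIC_MODE line set to mode.
// If no active FABRIC_MODE line exists, one is appended.
// (Hypothetical helper; assumes KEY=VALUE lines, one per line.)
func setFabricMode(cfg, mode string) (string, error) {
	// Only the three modes named in this PR are accepted.
	switch mode {
	case "0", "1", "2":
	default:
		return "", fmt.Errorf("unsupported FABRIC_MODE %q", mode)
	}

	// Drop trailing newlines so splitting does not yield an empty last line.
	cfg = strings.TrimRight(cfg, "\n")
	var lines []string
	if cfg != "" {
		lines = strings.Split(cfg, "\n")
	}

	replaced := false
	for i, line := range lines {
		// Commented-out lines ("# FABRIC_MODE=0") do not match and are kept.
		if strings.HasPrefix(strings.TrimSpace(line), "FABRIC_MODE=") {
			lines[i] = "FABRIC_MODE=" + mode
			replaced = true
		}
	}
	if !replaced {
		lines = append(lines, "FABRIC_MODE="+mode)
	}
	// Re-add a single trailing newline.
	return strings.Join(lines, "\n") + "\n", nil
}
```

An entrypoint following this approach would read the FABRIC_MODE environment variable, apply the rewrite only on NVSwitch hosts, and start Fabric Manager afterwards; on other hosts the variable would simply go unused.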

Related: #2594 describes the adjacent SR-IOV / vendor-specific VFIO vGPU device-creation gap on the same class of hardware.

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint) — the changed package is clean (golangci-lint run ./controllers/ reports 0 issues). Repo-wide make lint could not complete locally on macOS because an unrelated command (cmd/nvidia-validator) imports a Linux-only vendored package that is build-excluded on darwin; expected to pass on Linux CI.
Generated assets in-sync (make validate-generated-assets) — this change adds no API/CRD changes, so no generated assets change. The target could not run to completion on macOS for the same Linux-only-dependency reason.
Go mod artifacts in-sync (make validate-modules)
Test cases are added for new code paths

Testing

go build ./controllers/ and go test ./controllers/ pass, including a new TestTransformVGPUManagerFabricMode that covers the FABRIC_MODE=2 default and the user-override precedence.
golangci-lint run ./controllers/ reports 0 issues.
End-to-end behavior on NVSwitch hardware is not exercised here — it depends on the companion gpu-driver-container change and cannot run in CI.

Companion (image half): NVIDIA/gpu-driver-container#854 — ships and starts Fabric Manager in the vGPU Manager image so the FABRIC_MODE this PR injects is actually consumed. The two are an RFC pair; please review the design together.

SR-IOV vGPU on NVSwitch (HGX) systems requires NVIDIA Fabric Manager to run in vGPU multitenancy mode (FABRIC_MODE=2) rather than the packaged bare-metal default (FABRIC_MODE=0). Without it a whole-card vGPU guest cannot initialize CUDA (fails at cuInit).

Default the vGPU Manager container to FABRIC_MODE=2 so its entrypoint can configure fabricmanager.cfg for NVSwitch hosts. Non-NVSwitch vGPU hosts do not start Fabric Manager, so the value is ignored there. A user-provided FABRIC_MODE in VGPUManager.Env takes precedence.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

copy-pr-bot · 2026-07-04T21:18:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lexfrei mentioned this pull request Jul 4, 2026

feat(vgpu-manager): ship and start Fabric Manager on NVSwitch hosts NVIDIA/gpu-driver-container#854

Draft

lexfrei wants to merge 1 commit into NVIDIA:main from lexfrei:feat/fabricmanager-vgpu-mode
