feat(vgpu-manager): default Fabric Manager to vGPU multitenancy mode #2602
Draft
lexfrei wants to merge 1 commit into
SR-IOV vGPU on NVSwitch (HGX) systems requires NVIDIA Fabric Manager to run in vGPU multitenancy mode (FABRIC_MODE=2) rather than the packaged bare-metal default (FABRIC_MODE=0). Without it a whole-card vGPU guest cannot initialize CUDA (fails at cuInit).

Default the vGPU Manager container to FABRIC_MODE=2 so its entrypoint can configure fabricmanager.cfg for NVSwitch hosts. Non-NVSwitch vGPU hosts do not start Fabric Manager, so the value is ignored there. A user-provided FABRIC_MODE in VGPUManager.Env takes precedence.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Description
RFC / Draft — seeking maintainer design direction before completing. Please do not merge.
On NVSwitch (HGX) systems running SR-IOV vGPU, NVIDIA Fabric Manager must run in vGPU multitenancy mode (`FABRIC_MODE=2`) instead of the default bare-metal / passthrough mode (`FABRIC_MODE=0`). In mode 0 a whole-card vGPU guest cannot initialize CUDA: `cuInit` fails with error 802 because the guest's NVLink fabric never becomes usable. `FABRIC_MODE=1` is the older shared-NVSwitch multitenancy model; `FABRIC_MODE=2` is the mode required for SR-IOV vGPU (see the NVIDIA Fabric Manager User Guide, config-file section).

Today the operator provides no way to select the Fabric Manager mode. On the operator-managed-driver path the driver container starts Fabric Manager from the packaged `fabricmanager.cfg` unchanged (`FABRIC_MODE=0`) and reads no environment variable to override it, so SR-IOV vGPU on NVSwitch cannot be brought up through the operator.

This change defaults the vGPU Manager container (the `vm-vgpu` sandbox-workload path) to `FABRIC_MODE=2`. A user-supplied `FABRIC_MODE` in `vgpuManager.env` takes precedence. The value is scoped to the vGPU Manager container, so container / passthrough / bare-metal workloads are unchanged, and on non-NVSwitch vGPU hosts the driver entrypoint does not start Fabric Manager so the value is ignored.

Why this is an RFC (honest scope)
This is the operator half of a two-part change and is intentionally inert on its own:

- […] `FABRIC_MODE`. A companion `gpu-driver-container` change is needed to (a) ship Fabric Manager in the vGPU Manager image and (b) write `FABRIC_MODE` into `fabricmanager.cfg` before starting Fabric Manager on NVSwitch hosts.
- […] (`driver.enabled=false`), where the operator manages neither the driver nor Fabric Manager; and per-VM fabric-partition activation, which is a separate device-plugin concern.

Related: #2594 describes the adjacent SR-IOV / vendor-specific VFIO vGPU device-creation gap on the same class of hardware.
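Step (b) of the companion change, writing `FABRIC_MODE` into `fabricmanager.cfg`, could look roughly like the sketch below. It is an illustration only: the function name `setFabricMode` is hypothetical, the sample config content is made up for the example, and the code assumes plain `KEY=VALUE` lines. The real image entrypoint may well do this in shell; Go is used here only to match the operator's language.

```go
package main

import (
	"fmt"
	"strings"
)

// setFabricMode returns cfg with its FABRIC_MODE entry set to mode.
// An existing uncommented FABRIC_MODE= line is rewritten in place;
// if there is none, the entry is appended at the end.
// Assumption: the config consists of simple KEY=VALUE lines.
func setFabricMode(cfg, mode string) string {
	entry := "FABRIC_MODE=" + mode
	body := strings.TrimRight(cfg, "\n")
	if body == "" {
		return entry + "\n"
	}
	lines := strings.Split(body, "\n")
	found := false
	for i, line := range lines {
		if strings.HasPrefix(strings.TrimSpace(line), "FABRIC_MODE=") {
			lines[i] = entry
			found = true
		}
	}
	if !found {
		lines = append(lines, entry)
	}
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	// Illustrative content, not the full packaged fabricmanager.cfg.
	packaged := "LOG_LEVEL=4\nFABRIC_MODE=0\n"
	fmt.Print(setFabricMode(packaged, "2"))
}
```

Keeping the rewrite a pure string-to-string function makes it easy to unit test without touching the host filesystem; reading and writing the file would wrap this call.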
Checklist
- […] (`make lint`): the changed package is clean (`golangci-lint run ./controllers/` reports 0 issues). Repo-wide `make lint` could not complete locally on macOS because an unrelated command (`cmd/nvidia-validator`) imports a Linux-only vendored package that is build-excluded on darwin; it is expected to pass on Linux CI.
- […] (`make validate-generated-assets`): this change makes no API/CRD changes, so no generated assets change. The target could not run to completion on macOS for the same Linux-only-dependency reason.
- […] (`make validate-modules`)

Testing
- `go build ./controllers/` and `go test ./controllers/` pass, including a new `TestTransformVGPUManagerFabricMode` that covers the `FABRIC_MODE=2` default and the user-override precedence.
- `golangci-lint run ./controllers/` reports 0 issues.
- […] `gpu-driver-container` change and cannot run in CI.

Companion (image half): NVIDIA/gpu-driver-container#854, which ships and starts Fabric Manager in the vGPU Manager image so the `FABRIC_MODE` this PR injects is actually consumed. The two are an RFC pair; please review the design together.
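For reviewers who want the default-plus-override behaviour described under Testing in concrete terms, here is a minimal, dependency-free sketch. It is not the code in this PR: `EnvVar` stands in for the Kubernetes `corev1.EnvVar` type, and `applyFabricModeDefault`, `setEnv`, and `lookupEnv` are hypothetical helper names chosen for the example.

```go
package main

import "fmt"

// EnvVar is a stand-in for corev1.EnvVar so the sketch needs no imports
// beyond the standard library.
type EnvVar struct {
	Name  string
	Value string
}

// setEnv overwrites the variable if it is already present, else appends it.
func setEnv(env []EnvVar, name, value string) []EnvVar {
	for i := range env {
		if env[i].Name == name {
			env[i].Value = value
			return env
		}
	}
	return append(env, EnvVar{Name: name, Value: value})
}

// lookupEnv returns the value of name and whether it was found.
func lookupEnv(env []EnvVar, name string) (string, bool) {
	for _, e := range env {
		if e.Name == name {
			return e.Value, true
		}
	}
	return "", false
}

// applyFabricModeDefault copies the container environment, sets the
// FABRIC_MODE=2 default, then applies user-supplied variables so that a
// user-provided FABRIC_MODE wins over the default.
func applyFabricModeDefault(containerEnv, userEnv []EnvVar) []EnvVar {
	out := append([]EnvVar(nil), containerEnv...)
	out = setEnv(out, "FABRIC_MODE", "2")
	for _, u := range userEnv {
		out = setEnv(out, u.Name, u.Value)
	}
	return out
}

func main() {
	def := applyFabricModeDefault(nil, nil)
	v, _ := lookupEnv(def, "FABRIC_MODE")
	fmt.Println("default:", v) // prints "default: 2"

	over := applyFabricModeDefault(nil, []EnvVar{{Name: "FABRIC_MODE", Value: "1"}})
	v, _ = lookupEnv(over, "FABRIC_MODE")
	fmt.Println("override:", v) // prints "override: 1"
}
```

Applying the default first and the user environment second keeps the precedence rule in one place: whatever the user sets last is what the container sees.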