Cannot run kubevirt virtual machine using nvidia GPU plugin #89

Description

@keepthemomentum

Hello,

I have a Kubernetes cluster running virtual machines with KubeVirt.
One worker node in the cluster has a GPU, and I want to use that GPU in a VM in passthrough mode.
I enabled the feature gate and deployed the kubevirt-gpu-device-plugin on the cluster.

lspci -nn | grep -i nvidia
04:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)

lspci -nnk -d 10de:
04:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)
Subsystem: NVIDIA Corporation 12GB Computational Accelerator [10de:097e]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

These are the logs from the gpu-device-plugin pod:

kubectl logs pod/nvidia-kubevirt-gpu-dp-daemonset-4xmm8 -n kube-system

2024/01/05 12:59:06 Not a device, continuing
2024/01/05 12:59:06 Nvidia device 0000:03:00.0
2024/01/05 12:59:06 Iommu Group 22
2024/01/05 12:59:06 Device Id 1023
2024/01/05 12:59:06 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
2024/01/05 12:59:06 Iommu Map map[22:[{0000:03:00.0}]]
2024/01/05 12:59:06 Device Map map[1023:[22]]
2024/01/05 12:59:06 vGPU Map map[]
2024/01/05 12:59:06 GPU vGPU Map map[]
2024/01/05 12:59:06 DP Name GK110BGL_TESLA_K40M
2024/01/05 12:59:06 Devicename GK110BGL_TESLA_K40M
2024/01/05 12:59:06 GK110BGL_TESLA_K40M Device plugin server ready
2024/01/05 12:59:06 healthCheck(GK110BGL_TESLA_K40M): invoked

ls -l /var/lib/kubelet/device-plugins/
total 40
-rw------- 1 root root 39215 Jan 5 13:59 kubelet_internal_checkpoint
srwxr-xr-x 1 root root 0 Jan 4 11:40 kubelet.sock
srwxr-xr-x 1 root root 0 Jan 5 13:59 kubevirt-GK110BGL_TESLA_K40M.sock

The pod still cannot be scheduled; the events report 'No preemption victims found for incoming pod'.
What am I missing? Could someone help?

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 47m default-scheduler 0/5 nodes are available: 1 Insufficient nvidia.com/GK110BGL_Tesla_K40m, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
Warning FailedScheduling 16m (x6 over 41m) default-scheduler 0/5 nodes are available: 1 Insufficient nvidia.com/GK110BGL_Tesla_K40m, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
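
One detail that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the scheduler events ask for nvidia.com/GK110BGL_Tesla_K40m, while the plugin log registers GK110BGL_TESLA_K40M. The snippet below compares the two strings exactly as they appear above. It assumes the plugin advertises its device under the nvidia.com/ prefix and that resource names are matched as exact, case-sensitive strings; `<gpu-node>` in the trailing comment is a placeholder.

```shell
# Resource name requested by the VM, copied from the FailedScheduling event.
requested="nvidia.com/GK110BGL_Tesla_K40m"
# Device name from the plugin log; the nvidia.com/ prefix is an assumption.
advertised="nvidia.com/GK110BGL_TESLA_K40M"

# Helper that upper-cases a string for a case-insensitive comparison.
upper() { printf '%s' "$1" | tr '[:lower:]' '[:upper:]'; }

# Exact, case-sensitive comparison.
if [ "$requested" = "$advertised" ]; then
  result="match"
else
  result="mismatch"
fi

# Do the names differ only by letter case?
if [ "$(upper "$requested")" = "$(upper "$advertised")" ]; then
  case_only="yes"
else
  case_only="no"
fi

echo "exact comparison: $result"
echo "differs only by case: $case_only"

# On a live cluster, the names the node actually advertises can be listed with:
#   kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
```

If the node's allocatable list shows the upper-case name, the VM spec would need to request that exact string.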
