RAM and Recklessness: Four GPUs and a Prayer
Part 6 of Building a Diskless Datacenter
The Dell C4140 has four Tesla V100 SXM2 16GB GPUs connected via NVLink in a full mesh. They were originally designed for training neural networks in a datacenter. Now they're in my garage, and I need them inside a KubeVirt VM running NixOS while keeping NVLink working.
This is the GPU passthrough story.
The Hardware
The V100 SXM2 is a specific form factor. Unlike PCIe GPUs that plug into standard slots, SXM2 GPUs are soldered onto a custom board with a proprietary high-bandwidth connector. The key difference: NVLink.
NVLink is a direct GPU-to-GPU interconnect that bypasses PCIe entirely. Our four V100s are connected in an NV2 mesh: two bonded NVLink links between every pair of GPUs, which uses all six links each GPU has. At 25.781 GB/s per link per direction, that works out to roughly 300 GB/s of aggregate bidirectional bandwidth per GPU, versus about 32 GB/s over PCIe 3.0 x16.
The PCI topology:
1a:00.0 - GPU 0 (10de:1db1)
1c:00.0 - GPU 1 (10de:1db1)
1d:00.0 - GPU 2 (10de:1db1)
1e:00.0 - GPU 3 (10de:1db1)
Each GPU is in its own IOMMU group, which means they can be individually passed through to VMs via VFIO. But will NVLink survive VFIO passthrough? That was the question.
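Before trusting that, it's worth dumping the IOMMU topology from sysfs on the host. A generic sketch, nothing C4140-specific:

```shell
# List every PCI device by IOMMU group. No output means the IOMMU is
# disabled; check for intel_iommu=on on the kernel command line.
found=0
for dev in /sys/kernel/iommu_groups/*/devices/*; do
  [ -e "$dev" ] || continue            # glob matched nothing
  group=$(basename "$(dirname "$(dirname "$dev")")")
  printf 'group %s: %s\n' "$group" "$(basename "$dev")"
  found=$((found + 1))
done
echo "$found devices in IOMMU groups"
```

Each V100 should land in a group of its own; a group shared with a PCIe bridge would force the whole group through to the same VM together.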
VFIO Setup
VFIO (Virtual Function I/O) binds PCI devices to a userspace driver instead of the kernel's native driver. This lets QEMU (KubeVirt's hypervisor) directly assign the physical GPU to a VM.
The NixOS configuration:
# gpu-passthrough.nix
{
  boot.kernelParams = [
    "intel_iommu=on"
    "iommu=pt"
    "vfio-pci.ids=10de:1db1"  # V100 SXM2 PCI ID
  ];
  boot.initrd.kernelModules = [ "vfio_pci" "vfio" "vfio_iommu_type1" ];
}
Critical detail: the PCI ID for the V100 SXM2 is 10de:1db1, not 10de:1db6 (the V100 PCIe 32GB). Get this wrong and VFIO binds nothing; the kernel's nvidia driver grabs the GPUs instead.
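Assuming the host booted with these parameters, a quick check that vfio-pci actually won the binding race (PCI addresses from the topology above; adjust for your machine):

```shell
# Each GPU should report "Kernel driver in use: vfio-pci". Seeing
# "nvidia" or "nouveau" here means the ID filter didn't match.
if command -v lspci >/dev/null; then
  for addr in 1a:00.0 1c:00.0 1d:00.0 1e:00.0; do
    lspci -nnk -s "$addr" | grep -i 'driver in use' || echo "$addr: no driver bound"
  done
else
  echo "lspci not available on this system"
fi
```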
The iPXE kernelParams Problem
Here's a NixOS-specific gotcha: setting boot.kernelParams in your NixOS configuration doesn't automatically reach the iPXE boot script. NixOS generates kernel command-line parameters for its bootloader, but iPXE has its own separate kernel command line.
Our mkNetbootPackage function in flake.nix had to explicitly read nixosConfig.config.boot.kernelParams and embed them in the generated iPXE script. Without this fix, the host would boot without IOMMU enabled, and VFIO would silently fail to bind.
fix: pass NixOS kernelParams through iPXE boot script
One of those bugs that took hours to find because everything looked correct in the NixOS config.
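The shape of the fix, sketched in Nix. The function and attribute names here are illustrative (our real mkNetbootPackage does more), and base-url is an iPXE variable set elsewhere in the boot script:

```nix
# Sketch: evaluate the NixOS config's kernelParams and splice them
# into the generated iPXE script, so netboot and bootloader agree.
mkNetbootScript = nixosConfig:
  let
    params = builtins.concatStringsSep " "
      nixosConfig.config.boot.kernelParams;
    build = nixosConfig.config.system.build;
  in
  pkgs.writeText "netboot.ipxe" ''
    #!ipxe
    kernel ''${base-url}/bzImage init=${build.toplevel}/init ${params}
    initrd ''${base-url}/initrd
    boot
  '';
```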
NVLink Through VFIO
The big question: does NVLink work when GPUs are passed through to a VM via VFIO?
Yes. And it's not even degraded.
Inside the GPU VM:
$ nvidia-smi nvlink --status
GPU 0: Tesla V100-SXM2-16GB
Link 0: 25.781 GB/s (NV2)
Link 1: 25.781 GB/s (NV2)
...
Full NV2 mesh between all four GPUs. The NVLink connections are direct point-to-point on the SXM2 board — there are no separate PCI bridge devices for NVLink. VFIO passes through the GPU's PCI function, and NVLink is part of that function. It just works.
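For a one-shot view of the mesh, nvidia-smi can print the interconnect as a matrix (run inside the GPU VM; output omitted here since it varies by driver version):

```shell
# Print the GPU interconnect matrix. NV2 between every GPU pair
# confirms the mesh; PHB or PIX entries would mean PCIe fallback.
if command -v nvidia-smi >/dev/null; then
  nvidia-smi topo -m
else
  echo "nvidia-smi not available (run this inside the GPU VM)"
fi
```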
This was a genuinely pleasant surprise. I expected either no NVLink or degraded performance. Getting full bandwidth meant the GPU VM could do real multi-GPU training with the interconnect bandwidth that makes SXM2 worthwhile.
The GPU VM
KubeVirt exposes GPUs to VMs through permittedHostDevices:
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: "10DE:1DB1"  # uppercase!
          resourceName: "nvidia.com/GV100GL_TESLA_V100_SXM2_16GB"
The VM spec requests GPU resources:
spec:
  domain:
    devices:
      gpus:
        - name: gpu0
          deviceName: nvidia.com/GV100GL_TESLA_V100_SXM2_16GB
        - name: gpu1
          deviceName: nvidia.com/GV100GL_TESLA_V100_SXM2_16GB
        # ... all 4
    resources:
      requests:
        memory: 140Gi  # 128Gi guest + 12Gi QEMU overhead
The Memory Trap
That 140Gi memory request is not a typo. VFIO DMA requires the entire guest memory to be pinned (locked) in physical RAM — the hypervisor can't swap it out because the GPU is doing DMA directly into those pages.
Our first attempt used 128Gi. The VM was OOM-killed immediately. Why? QEMU itself needs memory for page tables (~259MB for 128GB of guest RAM), vhost/virtio buffers, and its own runtime. With only 128Gi allocated, the guest claimed all of it, QEMU had nothing left for overhead, and the kernel killed the process.
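That page-table figure is easy to sanity-check with back-of-envelope arithmetic, assuming 4 KiB pages and an 8-byte page-table entry per page (huge pages would shrink it considerably):

```shell
guest=$((128 * 1024 * 1024 * 1024))   # 128 GiB of guest RAM, in bytes
pages=$((guest / 4096))               # number of 4 KiB pages to map
pte_mib=$((pages * 8 / 1024 / 1024))  # 8-byte entry per page
echo "${pte_mib} MiB of page-table entries"   # prints "256 MiB ..."
```

Same ballpark as the ~259MB QEMU actually uses; the rest of the 12Gi headroom goes to vhost buffers, the QEMU process itself, and a safety margin.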
The fix: request 140Gi (128Gi guest + 12Gi overhead). The c4140 has 384GB of physical RAM, so this is fine — but on a smaller machine, running four VFIO GPUs with full memory pinning would be a real constraint.
NixOS + NVIDIA in the VM
Getting NVIDIA drivers working inside a NixOS VM is its own adventure. NixOS doesn't do things the normal way — there's no /usr/lib with driver files, everything is in /nix/store with symlinks through /run/opengl-driver/lib.
nvidia-device-plugin expects to find NVIDIA libraries in standard Linux paths. On NixOS, it needs:
- /nix/store mounted (the library symlinks resolve there)
- LD_LIBRARY_PATH=/run/nvidia/lib
- nodeSelector on nvidia.com/gpu.product=V100 (not nvidia.com/gpu=true, which also matches the VFIO host)
nvidia-container-runtime was supposed to handle GPU injection into containers automatically. On NixOS, it's broken. The NixOS module hardware.nvidia-container-toolkit.enable = true generates CDI specs, but containerd's config template has a duplicate TOML section that prevents the CDI runtime from loading.
Fix: cloud-init overwrites the containerd config template with an imports = ["/etc/containerd/cdi.toml"] pattern that avoids the duplication.
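A sketch of that cloud-init patch, assuming k3s's containerd template path (adjust for other containerd packagings; the template body here is illustrative):

```yaml
# cloud-init: replace the containerd config template so the CDI
# config is pulled in via imports rather than duplicated inline.
write_files:
  - path: /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
    content: |
      {{ template "base" . }}
      imports = ["/etc/containerd/cdi.toml"]
```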
In the end, GPU pods use a combination of:
- CDI for device exposure
- Manual volume mounts for NVIDIA libraries (as a fallback)
- Cloud-init patches for containerd configuration
It works, but it's held together with the kind of care that means "please don't update anything."
The Ephemeral Identity Problem
The GPU VM uses an ephemeral containerDisk for its root filesystem. This means every time the VM restarts, it gets a fresh disk image. Great for immutability, terrible for identity.
SSH host keys change every restart. There's no way around this with containerDisk — the filesystem is read from the container image each boot. Anyone SSH-ing to the GPU VM has to accept a new host key every time. We use root@10.42.67.93 (whiskey@ gets "connection closed" for reasons we haven't debugged).
K3s node password: K3s generates a unique node password on first registration and the server remembers it. On reboot, if the password doesn't match, the server rejects the node. With ephemeral disks, the password is different every boot.
Fix: write a stable password in cloud-init before K3s starts:
write_files:
- path: /etc/rancher/node/password
content: "stable-password-here"
Current GPU Usage
The four V100s are primarily used for:
- vLLM inference — running language models on the homelab cluster (more in the next post)
- Model quantization — FP8 dynamic quantization that needs ~48GB host RAM
- Training experiments — NVLink mesh gives real multi-GPU scaling
The VFIO passthrough adds negligible overhead — GPU compute performance in the VM matches bare metal within measurement noise. The only cost is the 12Gi memory overhead for QEMU, which is a rounding error when you have 384GB.