RAM and Recklessness: VMs All the Way Down
Part 5 of Building a Diskless Datacenter
Once you have a bare-metal Kubernetes cluster with eBPF networking and NVMe-oF storage, the natural next question is: "What if I ran more Kubernetes clusters inside this one?" And then: "What if those inner clusters also had their own CNI, their own ingress, and their own storage?"
This is the story of KubeVirt, nested Kubernetes, and learning that bridge-nf-call-iptables will ruin your life in ways you cannot predict.
Why Nested Clusters?
The metal cluster is infrastructure. It runs storage drivers, network configuration, and the control plane. I didn't want to deploy user-facing applications directly on it — that's mixing concerns and makes upgrades risky.
Instead: run KubeVirt on the metal cluster, spin up VMs, install K3s inside those VMs, and deploy applications there. Each nested cluster is a separate failure domain. If I break one, the others keep running. If I need to test a destructive Cilium upgrade, I can do it on a throwaway cluster without risking the metal control plane.
KubeVirt Setup
KubeVirt turns Kubernetes into a hypervisor. You define VMs as Kubernetes resources, and KubeVirt runs them using QEMU/KVM inside pods. The VM gets a virtual NIC bridged to the host network, virtual disks backed by PVCs, and cloud-init for configuration.
Our VM template:
- Root disk: 100Gi Block PVC on rds-csi-rwx (NVMe-oF from the RDS)
- Cloud-init: NoCloud disk with userdata/metadata for first-boot configuration
- Networking: Bridge to VLAN 66 (KubeVirt VM network)
- Resources: 8 vCPUs, 16Gi RAM per VM
The root disk is a NixOS image built as a qcow2, pushed to a container registry as a containerDisk, and pulled by KubeVirt on first boot. NixOS's growPartition automatically expands the root filesystem to fill the 100Gi PVC.
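Putting that template together, the VirtualMachine resource looks roughly like this. This is a sketch, not our exact manifest: the Multus attachment name, VM name, and claim name are placeholders; the fields themselves are standard KubeVirt API.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: srvlab-worker            # placeholder name
spec:
  running: true
  template:
    spec:
      domain:
        cpu:
          cores: 8
        resources:
          requests:
            memory: 16Gi
        devices:
          disks:
            - name: root
              disk:
                bus: virtio
            - name: cloudinit
              disk:
                bus: virtio
          interfaces:
            - name: vlan66
              bridge: {}         # bridged to the host, not masqueraded
      networks:
        - name: vlan66
          multus:
            networkName: br66-vlan66   # NetworkAttachmentDefinition for VLAN 66 (placeholder)
      volumes:
        - name: root
          persistentVolumeClaim:
            claimName: srvlab-worker-root   # 100Gi Block PVC on rds-csi-rwx
        - name: cloudinit
          cloudInitNoCloud:
            userData: |
              #cloud-config
              # first-boot configuration goes here
```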
The CDI Permission Problem
KubeVirt's Containerized Data Importer (CDI) is supposed to handle importing disk images into PVCs. But CDI's importer runs as non-root UID 107, and UID 107 has no permission to write raw block devices. Our PVCs are Block mode (not Filesystem) for performance, so the import just... fails:

```
Error: permission denied writing to /dev/xvda
```
The workaround: skip CDI entirely. We wrote manual import jobs that use crane to pull the container image, extract the qcow2, and use qemu-img convert (running as root) to write it directly to the block PVC. Ugly, but reliable.
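Sketched as a Kubernetes Job, the manual import looks roughly like this. The tools image and registry paths are placeholders; the parts that make it work are `runAsUser: 0` and mounting the Block PVC as a `volumeDevice` rather than a filesystem mount.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: import-nixos-root
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: import
          image: registry.example/tools/crane-qemu:latest  # placeholder image with crane + qemu-img
          securityContext:
            runAsUser: 0          # root, so we can write the raw block device
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -eu
              # Pull the containerDisk image and extract the qcow2 from it
              crane export registry.example/disks/nixos-k3s:latest - \
                | tar -xOf - disk/nixos.qcow2 > /tmp/nixos.qcow2
              # Write the image directly onto the Block-mode PVC
              qemu-img convert -O raw /tmp/nixos.qcow2 /dev/rootdisk
          volumeDevices:
            - name: root
              devicePath: /dev/rootdisk   # the Block PVC appears as a device node
      volumes:
        - name: root
          persistentVolumeClaim:
            claimName: srvlab-worker-root
```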
The NixOS VM Image
Building a VM image from NixOS is elegant. The VM configuration is just another NixOS module:
```nix
# vm-base-k3s.nix — base module for K3s worker VMs
{
  boot.loader.grub.enable = true;
  boot.growPartition = true;

  services.k3s = {
    enable = true;
    role = "agent";
    serverAddr = "https://10.42.66.120:6443"; # inner cluster VIP
  };

  # ... cloud-init, networking, SSH
}
```
The image builds with nix build .#nixosConfigurations.kubevirt-k3s.config.system.build.qcow2, producing a minimal qcow2 that boots to a K3s agent ready to join a cluster. We later split this into a vm-base-k3s generic module and a gpu-k3s module that adds NVIDIA drivers.
The container image wraps it:
```dockerfile
FROM scratch
COPY disk/nixos.qcow2 ./disk/nixos.qcow2
```
That's it. A FROM scratch image with just a disk. KubeVirt pulls it and boots it.
Nested Networking: Here Be Dragons
The VM network is VLAN 66 (10.42.66.0/24). Each VM gets a MAC address, and KubeVirt bridges it to br66 on the host. The VMs are directly on the VLAN — they get real IPs and can talk to everything on VLAN 66.
But the inner Kubernetes cluster needs its own networking. We chose Cilium again, with BGP peering to advertise pod and service CIDRs upstream.
The BGP chain for srvlab-cluster:
srvlab Cilium workers (AS 64516)
→ firewall-os router VM (AS 64515)
→ RDS (AS 65400)
→ CCR2116
Yes, there's a virtual router (firewall-os, our custom router distro) running as a KubeVirt VM, receiving BGP routes from the nested Cilium nodes and advertising them to the physical RDS. Three layers of routing to get a packet from a pod inside a VM to the internet.
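On the Cilium side, the first hop of that chain is configured with a BGP peering policy, roughly like this. A sketch only: the node label and the exact peer address are assumptions; the ASNs match the chain above.

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: srvlab-bgp
spec:
  nodeSelector:
    matchLabels:
      bgp: enabled              # hypothetical node label
  virtualRouters:
    - localASN: 64516           # srvlab Cilium workers
      exportPodCIDR: true       # advertise pod CIDRs upstream
      neighbors:
        - peerAddress: "10.42.67.201/32"   # firewall-os router VM (assumed VLAN 67 address)
          peerASN: 64515
      serviceSelector:          # match-all, so LB VIPs are advertised too
        matchExpressions:
          - { key: none, operator: NotIn, values: ["none"] }
```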
The LB VIP Problem
KubeVirt VMs bridged to br66 can't reach Cilium LB VIPs. The reason is subtle: bridged traffic bypasses Cilium's eBPF hooks. The eBPF programs are attached to the host's network interfaces, but bridged frames go directly between the bridge ports without hitting the host's IP stack.
So if a VM tries to reach a LoadBalancer service VIP, the packet goes: VM → bridge → ???. The bridge forwards frames by MAC address, no device on the segment answers ARP for the VIP, and the packet dies.
The fix: route LB traffic through the pod network instead of the bridge:
```shell
ip route add 10.42.69.0/24 via 10.0.2.1 dev enp1s0 onlink
```
This tells the VM to send LB-destined traffic out the masqueraded pod network interface (where Cilium can intercept it) instead of the bridge (where it can't).
The srvlab-cluster Build
The srvlab-cluster is the real-world implementation of all this nesting:
Metal cluster resources:
- 3 control plane VMs (10.42.66.121-123)
- N worker VMs (DHCP-assigned, KEDA-autoscaled)
- 1 firewall-os router VM (10.42.66.1 / 10.42.67.201)
- KEDA ScaledObject managing worker count
srvlab internal:
- K3s with Cilium CNI and BGP
- Flux CD for GitOps
- ingress-nginx with Let's Encrypt certs
- Tailscale proxy for external access (100.64.0.16)
- rds-csi for persistent volumes
- PostgreSQL, MariaDB for service databases
The worker VMs are managed by a KEDA ScaledObject on the metal cluster that watches a VirtualMachineInstanceReplicaSet. Currently it's pinned at 3 replicas (the cron trigger is a placeholder), but the infrastructure is there for real autoscaling once we deploy Prometheus to the srvlab cluster.
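That ScaledObject looks roughly like this (names are placeholders; the cron trigger is the placeholder mentioned above, holding the replica count steady until real metrics exist):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: srvlab-workers
spec:
  scaleTargetRef:
    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceReplicaSet
    name: srvlab-worker         # placeholder VMIRS name
  minReplicaCount: 3
  maxReplicaCount: 3            # pinned until Prometheus lands in srvlab
  triggers:
    - type: cron                # placeholder trigger: always-on, fixed replicas
      metadata:
        timezone: Etc/UTC
        start: 0 0 * * *
        end: 59 23 * * *
        desiredReplicas: "3"
```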
Cloud-Init Challenges
Cloud-init in KubeVirt VMs is finicky. The VM reads a NoCloud disk at boot and runs the configuration. Some things we learned:
Hostnames: Worker VMs need unique hostnames, but they're created from a ReplicaSet, so they all start identical. We derive the hostname from the VMI metadata:
```shell
# bootcmd — runs before networking
HOSTNAME=$(curl -s http://169.254.169.254/metadata/hostname 2>/dev/null)
hostnamectl set-hostname "$HOSTNAME"
```
Except that didn't work reliably — bootcmd runs before the network is up, so the metadata fetch often failed — and we switched to reading the hostname from the cidata disk directly.
Network interfaces: Cilium expects the VLAN 67 interface to be named v67. KubeVirt gives you enp2s0. The rename:
```shell
ip link set enp2s0 name v67
```
K3s node password: K3s generates a random password on first start and sends it to the server. On subsequent boots, the server expects the same password. With ephemeral containerDisk-based VMs, the filesystem is fresh every boot. Fix: write a stable password in cloud-init before K3s starts.
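Sketched as cloud-init, using K3s's standard node-password location (the password value is a placeholder — in practice it should be a stable per-node secret):

```yaml
#cloud-config
# Write a stable node password before K3s starts, so the server
# accepts this "new" node on every fresh boot of the ephemeral VM.
write_files:
  - path: /etc/rancher/node/password
    permissions: "0600"
    content: |
      <stable-per-node-secret>
```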
Flux CD and GitOps
The srvlab-cluster uses Flux CD for GitOps, with the configuration in a Git repository at git.whiskey.works:whiskey/flux-repo, path clusters/srvlab/.
The dependency chain:
sources → infrastructure → infrastructure-config → services
sources: HelmRepository and GitRepository definitions
infrastructure: cert-manager, ingress-nginx, external-dns, external-secrets, rds-csi, databases
infrastructure-config: CRD-dependent resources (ClusterIssuers, ClusterSecretStores) that need their CRDs installed first
services: Applications (Ghost, HedgeDoc, Uptime Kuma, etc.)
This layering matters because Flux applies resources in dependency order. If cert-manager's CRDs aren't installed when you try to create a ClusterIssuer, it fails. The extra infrastructure-config layer was added after we hit exactly that problem:
fix: split CRD-dependent resources into infrastructure-config layer
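The layering is expressed with Flux's dependsOn. The infrastructure-config Kustomization looks roughly like this (names and intervals are illustrative; the path matches the repo layout above):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-config
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: flux-repo             # placeholder source name
  path: ./clusters/srvlab/infrastructure-config
  prune: true
  dependsOn:
    - name: infrastructure      # won't apply until cert-manager's CRDs exist
```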
The Tailscale Trick
External access to srvlab services goes through Tailscale, not a public IP. A Tailscale proxy pod in the cluster:
- Authenticates to our Headscale server
- Gets a Tailscale IP (100.64.0.16)
- Forwards TCP ports 80/443 to ingress-nginx
DNS: *.srvlab.whiskey.works → 100.64.0.16 via external-dns + DNSimple
Anyone on our Tailscale network can reach blog.srvlab.whiskey.works or status.srvlab.whiskey.works with valid Let's Encrypt certificates. The TLS chain: client → Tailscale → proxy pod → ingress-nginx (terminates TLS) → backend service.
The gotcha: external-dns auto-creates A records pointing to the ingress-nginx LoadBalancer IP (10.42.66.201), which overrides the wildcard DNS. Every ingress needs:
```yaml
annotations:
  external-dns.alpha.kubernetes.io/target: "100.64.0.16"
```
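A full ingress then looks roughly like this (the service and issuer names are illustrative; the target annotation is the part that matters):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: blog
  annotations:
    external-dns.alpha.kubernetes.io/target: "100.64.0.16"  # Tailscale IP, not the LB VIP
    cert-manager.io/cluster-issuer: letsencrypt             # hypothetical issuer name
spec:
  ingressClassName: nginx
  rules:
    - host: blog.srvlab.whiskey.works
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ghost     # placeholder service name
                port:
                  number: 80
  tls:
    - hosts:
        - blog.srvlab.whiskey.works
      secretName: blog-tls
```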
What's Running
The srvlab-cluster currently hosts:
- Ghost (blog.srvlab.whiskey.works) — Blogging, backed by MariaDB
- HedgeDoc (docs.srvlab.whiskey.works) — Collaborative markdown, backed by PostgreSQL
- Uptime Kuma (status.srvlab.whiskey.works) — Status monitoring, SQLite
- echo-test — TLS verification endpoint
All with Let's Encrypt certificates via DNS-01 challenges, Tailscale access, and NVMe-oF persistent storage. It's a real platform, running inside VMs, inside Kubernetes, on diskless servers that boot from RAM.