RAM and Recklessness: Control Plane on SmartNICs

Part 4 of Building a Diskless Datacenter


Running Kubernetes control plane nodes on NVIDIA BlueField-2 DPUs sounded elegant in theory: dedicated hardware for etcd, isolated from workload noise, surviving host reboots. In practice, it was a months-long fight with Ansible, Cilium, and the fundamental assumption that your control plane nodes have more than 8 ARM cores and 16GB of RAM.

Why DPUs for Control Plane?

The BlueField-2 DPU is essentially an embedded ARM server attached to your NIC:

  • 8x ARM A72 cores @ 2.75 GHz
  • 16 GB DDR4
  • 64 GB eMMC storage
  • 2x 25 GbE ports
  • Runs Ubuntu 22.04

It has its own power state — even when the host is powered off, the DPU stays on (as long as the server has standby power). This makes it perfect for control plane work:

  • etcd survives host reboots — workers PXE boot and rejoin; the control plane never moved
  • Resource isolation — etcd and kube-apiserver run on dedicated hardware, not competing with workloads
  • Security boundary — the control plane is on physically separate silicon from the data plane

K3s, Not K8s

We chose K3s over full Kubernetes because the DPUs have limited resources. K3s bundles the control plane components into a single binary and uses less memory. Even so, running etcd + apiserver + controller-manager + scheduler on 16GB is tight.

The K3s configuration is opinionated:

--cluster-init                    # embedded etcd
--flannel-backend=none            # we're using Cilium
--disable-network-policy          # Cilium handles this
--disable-kube-proxy              # Cilium replaces kube-proxy
--disable=traefik                 # we'll use ingress-nginx
--disable=servicelb               # Cilium/MetalLB instead
--cluster-cidr=10.142.0.0/16      # pods
--service-cidr=10.143.0.0/16      # services
--cluster-dns=10.143.0.10         # CoreDNS

Disabling flannel, kube-proxy, traefik, and servicelb strips K3s down to its core. Cilium replaces all the networking components with eBPF-based alternatives that are faster and more capable.
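K3s also reads the same settings from a config file, where the keys are just the flag names without the leading dashes. A sketch of the equivalent /etc/rancher/k3s/config.yaml:

```yaml
# /etc/rancher/k3s/config.yaml — equivalent to the CLI flags above
cluster-init: true
flannel-backend: "none"
disable-network-policy: true
disable-kube-proxy: true
disable:
  - traefik
  - servicelb
cluster-cidr: "10.142.0.0/16"
service-cidr: "10.143.0.0/16"
cluster-dns: "10.143.0.10"
```

The install script regenerates the systemd unit on every run, but it leaves config.yaml alone, so keeping flags in the file is one way to avoid losing them on reinstall.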

The Bootstrap Sequence

Bootstrapping a three-node etcd cluster on DPUs via Ansible was the hardest part of the entire project. The sequence:

  1. Primary DPU (c4140 at 10.42.67.92): Install K3s with --cluster-init
  2. Wait for etcd to be healthy and kube-apiserver to respond
  3. Install Cilium — but the cluster has no CNI yet, so pods can't schedule
  4. Wait for Cilium to be ready (catch-22: Cilium's own pods need a reachable apiserver, yet nothing else in the cluster works until Cilium is up)
  5. Install kube-vip for the HA VIP (10.42.67.100)
  6. Replica DPUs (r640 at .82, r740xd at .42): Join the cluster using the primary's IP, not the VIP

The ordering is critical and we got it wrong multiple times:

fix: Install Cilium BEFORE kube-vip (swap Phase 4 and 5)
fix: Use primary node IP instead of VIP for Cilium k8sServiceHost
fix: Replicas join via primary IP instead of VIP during bootstrap
fix: Correct path resolution to kubernetes/ directory (three levels up)

The worst bug: installing kube-vip before Cilium. Kube-vip runs as a static pod and needs network connectivity to do leader election. Without Cilium, there's no CNI, so the pod has no network. It sits there forever, the VIP never comes up, and the replicas can't join because they're pointing at the VIP.
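With those fixes applied, the happy path looks roughly like this — a command sketch, not the actual playbook (the kube-vip manifest name is hypothetical, and the replica join needs the token from the primary):

```shell
# 1. Primary DPU: K3s server with embedded etcd
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --cluster-init" sh -s -

# 2. Wait for the apiserver before touching anything else
until kubectl get --raw /readyz >/dev/null 2>&1; do sleep 5; done

# 3. CNI first — and point Cilium at the primary's real IP, not the VIP
cilium install --set k8sServiceHost=10.42.67.92 --set k8sServicePort=6443
cilium status --wait

# 4. Only now does kube-vip's static pod have a network to do leader election
kubectl apply -f kube-vip.yaml        # hypothetical manifest name

# 5. Replicas join via the primary's IP; the VIP may not be stable yet
curl -sfL https://get.k3s.io | K3S_TOKEN="$TOKEN" \
  INSTALL_K3S_EXEC="server --server https://10.42.67.92:6443" sh -s -
```

The wait loops are the point: every later phase silently assumes the previous one actually converged.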

Cilium on BlueField-2

Cilium is an eBPF-based CNI, and eBPF is kernel-version-sensitive. The DPUs run Ubuntu 22.04 with kernel 5.15. The NixOS workers run kernel 6.12. This means:

  • DPUs: Cilium uses "Legacy TC" mode for packet processing (kernel too old for TCX)
  • Workers: Cilium uses the newer TCX attachment mode

Both work. But configuring Cilium to handle this mixed environment required:

kubeProxyReplacement: true
bpf:
  masquerade: true
nativeRoutingCIDR: 10.142.0.0/16
devices:
  - auto          # let Cilium figure out the interfaces

The devices: auto was important — the DPU interfaces have different names than the worker interfaces. Explicitly setting a device name breaks on one platform or the other.
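One more pair of values mattered, per the fix list above: telling Cilium which apiserver endpoint to use for its own connection, rather than the in-cluster Service IP. A sketch of the Helm values:

```yaml
# Helm values sketch — the apiserver endpoint Cilium itself uses.
# During bootstrap this had to be the primary DPU's real IP
# (10.42.67.92), because the VIP didn't exist yet; once kube-vip is
# up, the VIP (10.42.67.100) works too.
k8sServiceHost: 10.42.67.92
k8sServicePort: 6443
```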

The SSH Murder Problem

The most dangerous moment in the bootstrap: installing Cilium on a DPU. Cilium modifies iptables rules as part of its host firewall setup. On a BlueField-2, your SSH session comes in through the same interface that Cilium is configuring. If Cilium's iptables rules are wrong — even for a moment — your SSH session dies, and the only way back is through iDRAC and the rshim console.

We hit this. Multiple times. The fix:

fix: Preserve SSH access on NVIDIA BlueField DPUs during Cilium install
fix: Restore SSH immediately after K3s install inline
fix: Disable Cilium host firewall on BlueField DPUs

The final solution: disable Cilium's host firewall entirely on DPU nodes. The DPUs don't run workloads, so Cilium's host-level protection isn't needed, and it was the source of all the SSH-killing iptables rules.
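In Helm terms the DPU-side fix is tiny (assuming the host firewall had been switched on in the first place — it defaults to off):

```yaml
# Cilium Helm values — keep the host firewall off on nodes where
# your SSH session rides the same interface Cilium is configuring.
hostFirewall:
  enabled: false
```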

The kube-proxy Saga

Disabling kube-proxy sounds simple: add --disable-kube-proxy to the K3s server flags. But the aftermath is messy.

When kube-proxy was previously enabled, it left iptables chains everywhere: KUBE-SEP-*, KUBE-SVC-*, KUBE-EXT-*, KUBE-MARK-*. These orphaned chains don't do anything harmful, but they're confusing when you're debugging networking issues and see hundreds of iptables rules that shouldn't exist.

Worse: the --disable-kube-proxy flag on the K3s server propagates to agents automatically — but only on agent restart. Existing agents keep running kube-proxy until they restart. On our PXE-booted workers, that means "until the next reboot," which might be weeks.

The cleanup was manual: SSH to each node, flush every KUBE-* chain in nat/filter/mangle, and restart. Or just wait for the next PXE boot, which clears everything.
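The flush itself can be scripted rather than hand-typed. A sketch that turns an iptables-save dump into cleanup commands — it assumes the -j KUBE-* jump rules in the built-in chains have already been removed, since -X fails on a chain that's still referenced:

```shell
# Emit "iptables -F" / "iptables -X" commands for every KUBE-* chain
# found in an iptables-save dump. Review the output, then run it.
kube_chain_cleanup() {
  awk '
    /^\*/     { table = substr($0, 2) }            # *nat, *filter, *mangle
    /^:KUBE-/ { chain = substr($1, 2)              # :KUBE-SVC-... -> KUBE-SVC-...
                print "iptables -t " table " -F " chain
                chains[n++] = table " " chain }
    END       { for (i = 0; i < n; i++) {          # flush everything first,
                  split(chains[i], a, " ")         # then delete the chains
                  print "iptables -t " a[1] " -X " a[2] } }
  '
}

# Usage: iptables-save | kube_chain_cleanup
```

All flushes are emitted before any deletes, because KUBE-* chains jump into each other and a chain can only be deleted once nothing references it.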

br_netfilter: The Silent Killer

This one deserves special mention because it caused the most subtle bugs.

Linux's br_netfilter module passes bridged Ethernet frames through the iptables chains. This is required for Kubernetes networking to work on bridged interfaces. But it has side effects that are absolutely wild:

Problem 1: rpfilter drops bridged traffic

When bridged frames go through iptables mangle PREROUTING, the rpfilter module checks if the source IP would be routable through the incoming interface. For bridged traffic, the "incoming interface" is the bridge, not the physical port, so rpfilter fails and drops the packet.

Fix: Per-VLAN RETURN rules in mangle PREROUTING using -i brXX and --physdev-in vXX-br.

Problem 2: kube-proxy DNAT corrupts bridged VM traffic

This was the really nasty one. When kube-proxy was running, it had DNAT rules in nat PREROUTING for service ClusterIPs. With br_netfilter enabled, bridged VM traffic also hits these DNAT rules. If a VM happens to send traffic to an IP that matches a Kubernetes service ClusterIP, kube-proxy rewrites the destination. Your VM's legitimate traffic to some external IP gets DNAT'd to a random pod.

Fix: -i br+ RETURN rule in nat PREROUTING to skip all kube-proxy rules for bridged traffic.

Note: --physdev-is-bridged (the "right" way to detect bridged traffic) matched zero packets in our testing. The -i br+ interface match was the only reliable method.

All of these rules are now managed declaratively in the NixOS configuration via networking.firewall.extraCommands.
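In the NixOS module, the two fixes look roughly like this — a sketch with one VLAN shown; the interface names br20 and v20-br are illustrative, and the real config generates one rule pair per VLAN:

```nix
# Sketch: br_netfilter workarounds via networking.firewall.extraCommands
networking.firewall.extraCommands = ''
  # Problem 1: let bridged frames bypass rpfilter in mangle PREROUTING
  iptables -t mangle -I PREROUTING -i br20 -m physdev --physdev-in v20-br -j RETURN

  # Problem 2: keep bridged VM traffic away from kube-proxy's DNAT rules
  iptables -t nat -I PREROUTING -i br+ -j RETURN
'';
```

Using -I rather than -A matters: the RETURN rules have to sit in front of the rpfilter and DNAT rules they're meant to short-circuit.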

The VIP and High Availability

kube-vip provides the VIP at 10.42.67.100. All three DPUs participate in leader election, and one of them holds the VIP at any time. Workers and external clients connect to 10.42.67.100:6443 and don't need to know which DPU is primary.

This works well — except during the bootstrap, where the VIP doesn't exist yet, and during DPU upgrades, where moving the VIP between nodes takes a few seconds.

The upgrade procedure we settled on:

  1. One DPU at a time
  2. INSTALL_K3S_SKIP_START=true to replace the binary without restarting
  3. systemctl restart k3s manually
  4. Verify kubectl get nodes shows Ready and etcd is healthy
  5. Move to the next DPU

Never run the K3s install script without INSTALL_K3S_SKIP_START or INSTALL_K3S_EXEC. The default behavior overwrites the systemd service file, stripping all our custom flags. We learned this the hard way.
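Steps 2 through 4 in commands, as a sketch (pin whatever release you're targeting via INSTALL_K3S_VERSION):

```shell
# Replace the binary in place; don't let the script restart anything.
# Pass your full INSTALL_K3S_EXEC too, or the regenerated unit loses
# the custom flags.
curl -sfL https://get.k3s.io | INSTALL_K3S_SKIP_START=true sh -s -

# Restart on your own schedule
systemctl restart k3s

# Verify before moving to the next DPU
kubectl get nodes
kubectl get --raw '/readyz?verbose' | grep etcd
```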

Cilium Bootstrap After Reboot

There's a chicken-and-egg problem with kube-proxy disabled. After a worker PXE reboots:

  1. K3s agent starts and needs to reach the apiserver at the service ClusterIP (10.143.0.1)
  2. Without kube-proxy, there's no iptables DNAT rule to redirect 10.143.0.1 → 10.42.67.100:6443
  3. Cilium would normally handle this, but Cilium hasn't started yet
  4. Cilium can't start because it can't reach the apiserver

Deadlock. The fix: a static iptables DNAT rule in the NixOS configuration:

iptables -t nat -A OUTPUT -d 10.143.0.1/32 -p tcp --dport 443 \
  -j DNAT --to-destination 10.42.67.100:6443

This bridges the gap until Cilium takes over. The long-term fix is adding k8s-service-host: 10.42.67.100 to Cilium's Helm values so it doesn't rely on the service ClusterIP during bootstrap.

Where We Ended Up

Three DPUs running K3s server with embedded etcd. Cilium handling all networking with eBPF. kube-proxy disabled. kube-vip providing the HA endpoint. Workers PXE boot and join automatically.

The control plane has survived host reboots, worker reimages, and network reconfigurations without missing a beat. The DPUs quietly run etcd at about 1GB memory usage and negligible CPU. They're perfect for this role — they just don't want you to set them up.