RAM and Recklessness: Burn the Disks

Part 3 of Building a Diskless Datacenter


The whole point of this project was to run servers without local disks. No SSDs to fail, no RAID to rebuild, no state to migrate. Just servers that boot from the network, run from RAM, and treat local storage as someone else's problem.

Getting there was a journey through two operating systems, three attempts, and more initrd debugging than any human should endure.

Attempt 1: Talos Linux

Talos was the obvious first choice. It's designed for Kubernetes, it's immutable, it supports PXE boot, and it has a clean API for machine configuration. We built the whole iPXE infrastructure around it — a config server that dynamically generated machine configs based on the server's service tag.

It almost worked. The PXE boot chain was solid: iDRAC → iPXE → config server → Talos kernel + initrd. But Talos had opinions about storage that didn't align with ours. We wanted NVMe/TCP mounts for persistent state, and Talos's disk management layer fought us every step of the way.

The dealbreaker was flexibility, or rather the lack of it. Talos is opinionated by design — that's its strength for normal deployments. But when you're doing unusual things with DPU networking and NVMe-oF storage, you need access to the plumbing. Talos deliberately hides the plumbing.

Attempt 2: NixOS (with NVMe/TCP)

NixOS was the answer. Declarative configuration, reproducible builds, and — critically — full control over the initrd and boot process. If we needed a custom kernel module loaded in stage 1 to set up NVMe/TCP before mounting root, we could do that.

The first NixOS iteration used NVMe/TCP for /var:

Host boots via iPXE → loads NixOS kernel + initrd into RAM
  → initrd brings up VLAN 68 network interface
  → connects to RDS NVMe/TCP target (10.42.68.1:4420)
  → mounts NVMe volume as /var
  → switches to stage 2
  → K3s agent starts, joins cluster

The NVMe/TCP connection happened entirely in the initrd, handled by a custom NixOS module that:

  1. Loaded nvme-tcp and nvme-fabrics kernel modules
  2. Brought up the VLAN 68 interface using ip link commands
  3. Connected to the NVMe-oF subsystem using nvme connect
  4. Waited for the block device to appear
  5. Mounted it as /var
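A sketch of what such a module looks like with NixOS's systemd-based initrd. The option names (boot.initrd.systemd.services, boot.initrd.systemd.extraBin) are real NixOS options, but the interface name, addresses, unit ordering, and the subsystem NQN below are illustrative assumptions, not the production values:

```nix
# Hedged sketch of a stage-1 NVMe/TCP module; interface name, host
# address, and the NQN are illustrative, not the real deployment values.
{ config, pkgs, ... }:
{
  boot.initrd.kernelModules = [ "nvme-tcp" "nvme-fabrics" ];

  # Make nvme-cli and iproute2 available inside the systemd initrd.
  boot.initrd.systemd.extraBin = {
    nvme = "${pkgs.nvme-cli}/bin/nvme";
    ip = "${pkgs.iproute2}/bin/ip";
  };

  boot.initrd.systemd.services.nvme-tcp-var = {
    wantedBy = [ "initrd.target" ];
    before = [ "sysroot-var.mount" ];      # run before the /var mount unit
    serviceConfig.Type = "oneshot";
    script = ''
      ip link set eth0 up                  # native VLAN 68, so no tagging
      ip addr add 10.42.68.10/24 dev eth0  # illustrative host address
      nvme connect -t tcp -a 10.42.68.1 -s 4420 \
        -n nqn.2024-01.example:host-var    # hypothetical subsystem NQN
      # Wait for the namespace block device before stage 2 mounts it
      while [ ! -b /dev/nvme0n1 ]; do sleep 0.1; done
    '';
  };
}
```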

It worked. Each host had its own NVMe volume on the RDS, formatted ext4, mounted before any services started. K3s stored its state in /var/lib/rancher/k3s, containers used /var/lib/containerd, and logs went to /var/log. Clean separation between ephemeral OS (in RAM) and persistent state (on NVMe/TCP).

The Debugging

Getting the initrd right was the worst part. NixOS uses a systemd-based initrd, and debugging it means adding rd.break=pre-mount to kernel parameters and staring at a dracut shell wondering why the NVMe device didn't appear.

The progression of debug commits tells the story:

debug: Remove init= param and add verbose debug output
debug: Add rd.break=pre-mount to pause before root mount
debug: Temporarily disable NVMe/TCP modules to test base netboot
debug: Minimize r640 hardware.nix to match installer simplicity
debug: Add rd.break=pre-udev to pause before USB enumeration
debug: Use NixOS-specific debug params instead of dracut

Each commit represents a 30-minute cycle of: rebuild NixOS image → upload to RDS → reboot host → watch serial console → swear → try again.

The breakthroughs:

  • The VGA console was secondary to serial by default. We were watching a blank screen while boot messages went to ttyS0.
  • boot.initrd.network.enable conflicts with the netboot-minimal profile. You can't use NixOS's built-in initrd networking when you're already network-booting.
  • The DPU interface uses native VLAN 68, not tagged. One wrong VLAN configuration = no storage.

The Resilience Problem

NVMe/TCP over the network is fast — we benchmarked it at near-wire-speed on 25G. But it's also a single point of failure. If the RDS reboots, if the network blips, if a cable gets bumped — every host loses /var simultaneously.

We added a watchdog service that monitored the NVMe connection and triggered a graceful node drain if the link dropped. But the real issue was more fundamental: tying host availability to network storage availability is fragile.
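In spirit, the watchdog was a small systemd timer. A hedged sketch, with unit names, the health check, and the drain command all assumed rather than taken from the real service:

```nix
# Sketch of the NVMe link watchdog; names, intervals, and the exact
# drain mechanics are assumptions for illustration.
{ config, pkgs, ... }:
{
  systemd.services.nvme-watchdog = {
    description = "Drain this node if the NVMe/TCP connection drops";
    serviceConfig.Type = "oneshot";
    script = ''
      # nvme list-subsys reports "live" for healthy fabrics connections
      if ! ${pkgs.nvme-cli}/bin/nvme list-subsys | grep -q live; then
        ${pkgs.kubectl}/bin/kubectl drain "$(hostname)" \
          --ignore-daemonsets --delete-emptydir-data
      fi
    '';
  };

  systemd.timers.nvme-watchdog = {
    wantedBy = [ "timers.target" ];
    timerConfig.OnBootSec = "30s";
    timerConfig.OnUnitActiveSec = "10s";   # re-check every 10 seconds
  };
}
```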

Attempt 3: NixOS (tmpfs /var)

The final evolution was radical: drop NVMe/TCP for /var entirely. Run /var as a 16GB tmpfs in RAM.

Host boots via iPXE → loads NixOS kernel + initrd into RAM
  → mounts tmpfs as /var (16GB)
  → switches to stage 2
  → K3s agent starts, joins cluster

No storage dependency at boot. No NVMe/TCP. No watchdog. The host boots in seconds, joins the cluster, and starts running workloads. If the host reboots, /var is empty — K3s reconnects to the control plane, containers re-download, and it's back.

This works because:

  • K3s agent state is disposable — the control plane (on the DPUs) has the truth
  • Container images re-pull — yes, it's slower after a reboot, but it's also simpler
  • Persistent data lives in PVCs — served by rds-csi over NVMe/TCP to the pods, not the hosts
  • These servers have 256-384GB of RAM — 16GB for tmpfs is nothing
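The tmpfs mount itself is only a few lines of NixOS configuration. The size matches the text; the mode option is an assumption:

```nix
{
  # /var lives entirely in RAM; its contents vanish on reboot by design.
  fileSystems."/var" = {
    device = "tmpfs";
    fsType = "tmpfs";
    options = [ "size=16G" "mode=0755" ];  # mode is an assumed value
  };
}
```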

The tradeoff: emptyDir volumes in pods count against the container's memory cgroup (since they're backed by tmpfs). Jobs that write large temporary files need inflated memory limits. We've been bitten by this — build jobs that worked fine on disk-backed hosts get OOM-killed on tmpfs workers if you don't set memory limits high enough.

The iPXE Menu System

Every host boots from the same iPXE infrastructure. The RDS serves an iPXE script that:

  1. Identifies the machine by service tag (Dell's serial number)
  2. Shows an interactive menu: "Boot NixOS" or "Boot Installer"
  3. Downloads the appropriate kernel, initrd, and squashfs from the RDS
  4. Boots

The NixOS images are built by nix build on a macOS workstation, uploaded to the RDS via a deploy script. Each host configuration is a NixOS flake target:

# flake.nix (simplified)
nixosConfigurations = {
  r640 = mkNixosConfig "r640" ./nixos/hosts/r640;
  c4140 = mkNixosConfig "c4140" ./nixos/hosts/c4140;
  r740xd = mkNixosConfig "r740xd" ./nixos/hosts/r740xd;
};

netbootPackages = mapAttrs mkNetbootPackage nixosConfigurations;

mkNetbootPackage takes a NixOS configuration and produces a directory with kernel, initrd, and rootfs.squashfs — everything iPXE needs to boot the machine into a fully configured NixOS system.
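A plausible shape for mkNetbootPackage, assuming the standard NixOS netboot module outputs (system.build.kernel, system.build.netbootRamdisk, and system.build.squashfsStore are real derivations that module provides); the linkFarm layout is a guess at the repo's actual helper:

```nix
# Hypothetical reconstruction of mkNetbootPackage; the real helper may differ.
mkNetbootPackage = name: nixosConfig:
  let build = nixosConfig.config.system.build; in
  pkgs.linkFarm "netboot-${name}" [
    { name = "kernel"; path = "${build.kernel}/bzImage"; }
    { name = "initrd"; path = "${build.netbootRamdisk}/initrd"; }
    { name = "rootfs.squashfs"; path = build.squashfsStore; }
  ];
```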

SSH Host Keys: The Identity Problem

One consequence of stateless hosts: SSH host keys regenerate on every boot. This is terrible for automation (known_hosts breaks constantly) and for security (you can't verify host identity).

Our solution: bake SSH host keys into the NixOS image in /etc. Since /etc is part of the read-only squashfs root, the keys persist across reboots. They're stored encrypted in the git repo using sops-nix, with age keys derived from the host keys themselves (via ssh-to-age).

Yes, this is circular: the keys decrypt themselves. The bootstrap is: generate keys once, encrypt with sops, add to the Nix build, and from then on the image always contains the right keys.

The sops-nix Bootstrap

Secrets management on diskless hosts is interesting. We use sops-nix with age encryption:

  1. Each host has SSH host keys baked into /etc/ssh/
  2. ssh-to-age derives an age key from the ED25519 host key
  3. .sops.yaml maps each host's age key to the secrets it can decrypt
  4. sops-nix decrypts secrets at activation time using the derived age key

The K3s join token, for example, is encrypted in secrets.yaml and only decryptable by the hosts that need it. No secrets in plain text, no external vault dependency, and the decryption key is the host's own identity.
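The steps above map onto a small amount of configuration. sops.defaultSopsFile, sops.age.sshKeyPaths, and services.k3s.tokenFile are real sops-nix and NixOS options; the secret name and file layout here follow the text but are otherwise assumed:

```nix
{ config, ... }:
{
  # Decrypt with an age key derived from the baked-in ED25519 host key.
  sops.defaultSopsFile = ./secrets.yaml;
  sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];

  # The K3s join token, decrypted at activation time onto tmpfs.
  sops.secrets.k3s-token = {};
  services.k3s.tokenFile = config.sops.secrets.k3s-token.path;
}
```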

Where We Ended Up

The final architecture for host boot:

iDRAC PXE → iPXE → RDS HTTP → NixOS kernel + initrd + squashfs → RAM
  → / (squashfs, read-only)
  → /var (tmpfs, 16GB)
  → /etc (from squashfs, contains SSH keys + sops secrets)
  → K3s agent → joins DPU control plane → ready for workloads

Boot time from power-on to K3s Ready: about 90 seconds. No disks touched. No network storage mounted. Pure RAM.

The persistent storage story moved entirely to Kubernetes: rds-csi provides NVMe/TCP PVCs to pods that need them. The hosts themselves are truly cattle — identical, disposable, and rebuildable in the time it takes to POST.