RAM and Recklessness: The Ephemeral Storage Problem

There's a dirty secret hiding in every "stateless" Kubernetes worker node: container images aren't stateless at all.

We built this entire infrastructure around the idea that worker nodes are cattle. They boot from the network, run from RAM, and carry no local state. When a host reboots, /var is empty — a fresh 16GB tmpfs. K3s reconnects to the control plane, pods get rescheduled, and life goes on.

Except containerd needs to pull every image again. And those images have to go somewhere.

How Containerd Actually Stores Images

When you kubectl apply a deployment and a worker pulls the image, containerd breaks it into two distinct storage areas:

The content store (/var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content/) holds compressed blobs — the raw layers, manifests, and configs exactly as they came from the registry. These are content-addressable files named by their sha256 digest. Simple, immutable, and safe to share.
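Content addressing is easy to illustrate in a few lines (Python here purely for illustration; containerd itself is written in Go):

```python
import hashlib

def blob_path(data: bytes) -> str:
    """Path of a blob in the content store, named by its sha256 digest.

    Mirrors containerd's blobs/sha256/<digest> layout: the path depends
    only on the bytes, so identical layers from different images always
    resolve to the same file.
    """
    return f"blobs/sha256/{hashlib.sha256(data).hexdigest()}"

# Two images referencing the same base layer share one blob:
layer = b"compressed layer bytes"
assert blob_path(layer) == blob_path(bytes(layer))
```

That determinism is what makes the store safe to share: a writer either creates a blob that didn't exist, or writes the exact same bytes that are already there.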

The snapshotter (/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/) unpacks those compressed layers into the filesystem trees that containers actually mount. Overlayfs stacks them together — base image on the bottom, each layer on top, writable layer at the peak. This is inherently local because overlayfs needs a real filesystem underneath it.

On a normal server with a terabyte of NVMe, nobody thinks about this. Pull a hundred images, unpack them, who cares. But when your entire /var is a 16GB tmpfs backed by RAM, every pulled image eats into the same finite pool that your running containers use for logs, temp files, and emptyDir volumes.

The Math That Hurts

A modest workload on one of our srvlab workers:

Cilium agent:              ~180MB compressed, ~500MB unpacked
CoreDNS:                   ~15MB compressed, ~45MB unpacked
metrics-server:            ~10MB compressed, ~30MB unpacked
A typical app deployment:  ~200MB compressed, ~600MB unpacked

That's 400MB of content store blobs and over a gigabyte of snapshotter data — for four pods. Scale to a dozen services and you're looking at several gigabytes of tmpfs consumed before your actual workloads write a single byte of data.
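As a sanity check on those numbers (figures are the rough estimates from the table above, not measurements):

```python
# Approximate sizes in MB: (compressed blobs, unpacked snapshots)
images = {
    "cilium-agent":   (180, 500),
    "coredns":        (15, 45),
    "metrics-server": (10, 30),
    "app":            (200, 600),
}

content_store = sum(c for c, _ in images.values())
snapshotter = sum(u for _, u in images.values())
tmpfs_used = content_store + snapshotter  # both come out of the same pool

print(f"content store: {content_store} MB")  # 405 MB
print(f"snapshotter:   {snapshotter} MB")    # 1175 MB
print(f"total:         {tmpfs_used} MB of a 16384 MB tmpfs")
```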

And here's the kicker: when the worker reboots (which is routine for these PXE-booting machines), it pulls all of those images again. Every time. Over the network. The cluster doesn't cache anything for you — each node maintains its own content store, its own snapshotter. Three workers running the same image means three copies of every blob.

Attempt 1: Just Raise the Limits

The first reaction was predictable: make tmpfs bigger. These servers have 512-768GB of RAM — why not give /var 64GB or 128GB?

Because it's not just about capacity, it's about waste. Three workers each caching the same container images in RAM is three copies of identical data in three separate pools of expensive DDR4. With GPU workloads, ML model weights, and actual application memory competing for the same DIMM slots, dedicating hundreds of gigabytes to triplicate image caches feels wrong.

It also doesn't solve the cold-start problem. After a rolling reboot, every node hits the registry simultaneously, saturating the link and slowing pod scheduling while images download in parallel.

Attempt 2: NFS Content Store

The content store is just files named by their hash. No locking, no complex filesystem semantics, no overlayfs. Multiple readers, rare writes, and every file is immutable once written. This is practically designed for NFS.

The insight: move the content store to a shared NFS mount, keep the snapshotter local.

NFS (shared, persistent):
  /var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content/
    ├── blobs/sha256/abc123...  (compressed layer)
    ├── blobs/sha256/def456...  (compressed layer)
    └── blobs/sha256/789abc...  (manifest)

tmpfs (local, ephemeral):
  /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/
    └── snapshots/...  (unpacked filesystem trees)

We deployed this using a straightforward NFS mount in cloud-init — runs before K3s starts, mounts 10.42.68.4:/micron/nfs/srvlab-containerd/content on the content store path, with a fallback to local tmpfs if the mount fails. The NFS lives on the r740xd's ZFS pool, served over the dedicated storage VLAN (68) so image traffic doesn't compete with pod networking.
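A minimal sketch of that cloud-init fragment, assuming the standard cloud-init mounts module with illustrative mount options. The nofail option is the piece that provides the fallback: if the NFS server is unreachable at boot, the mount is skipped and the path stays on local tmpfs.

```yaml
#cloud-config
mounts:
  # [device, mountpoint, fstype, options, dump, pass]
  - [ "10.42.68.4:/micron/nfs/srvlab-containerd/content",
      "/var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content",
      "nfs4", "rw,nofail,_netdev", "0", "0" ]
```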

The result: the first worker to pull an image writes blobs to NFS. The second and third workers find the blobs already there. Snapshotter still runs locally — each node unpacks layers into its own tmpfs — but the compressed blob storage is shared and persistent.

What NFS Gets Right

  • Deduplication for free. Content-addressable storage on a shared mount means one copy of each blob across all workers.
  • Survives reboots. Workers reboot, tmpfs is gone, but the content store is still on NFS. Containerd sees the blobs, only needs to re-unpack the snapshotter layers.
  • Simple. It's a mount point. No new services, no new daemons, no new failure domains beyond the NFS server itself.
  • Graceful fallback. If NFS is down, the mount fails, containerd uses local tmpfs, and everything works like before — just slower.

What NFS Gets Wrong

NFS hard mounts are a trap. When the NFS server hiccups — and it will — a hard mount freezes any process doing I/O on that path. Containerd blocks. Kubelet blocks. The node goes NotReady. The cluster starts evicting pods. One storage blip cascades into a scheduling storm.

You can use soft mounts with timeouts, but then you get silent data corruption when reads return partial results. You can use hard,intr (deprecated) or hard,timeo=600, but that just delays the freeze by a few minutes.
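For concreteness, the two option sets in fstab form (server, export path, and timeout values here are illustrative; timeo is in tenths of a second):

```
# hard: retry forever; a hung server blocks the calling thread indefinitely
10.42.68.4:/export  /mnt/content  nfs4  hard,timeo=600,retrans=2  0 0

# soft: fail with EIO once retries are exhausted; unblocks the caller,
# but applications must now handle mid-read errors correctly
10.42.68.4:/export  /mnt/content  nfs4  soft,timeo=30,retrans=2   0 0
```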

The fundamental issue: NFS failure modes are synchronous. A read from a hung NFS mount doesn't return an error — it blocks the calling thread indefinitely. For a content store that containerd reads on every image pull, this is a loaded gun.

Attempt 3: Object Storage

The blobs in the content store are content-addressable, immutable, and accessed by hash. This is quite literally what object storage was built for.

We have MinIO running on the RDS — it's outside the Kubernetes cluster, on dedicated hardware, with its own storage pool. Using it as a blob store has structural advantages over NFS:

  • Failure modes are asynchronous. An S3 GET returns an error code, not a blocked thread. Containerd can handle "blob not found" gracefully — it just re-pulls from the upstream registry.
  • No mount semantics. No stale file handles, no client-server state, no mount table entries. Each request is independent.
  • Already exists. MinIO is there, it's running, it has capacity.

The problem is that containerd doesn't speak S3. Its content store plugin expects a POSIX filesystem: open(), read(), stat(), link(). You can't just point it at a bucket.

FUSE Bridges

The obvious solution is a FUSE filesystem that presents S3 as a local mount:

  • s3fs-fuse — the original, battle-tested, slow
  • goofys — faster, less POSIX-correct, abandoned
  • mountpoint-s3 — Amazon's official option, read-heavy optimized

Any of these would let you mount a MinIO bucket at the content store path and containerd would never know the difference. But FUSE has its own problems: every file operation crosses the kernel/userspace boundary twice, performance is middling, and — critically — a hung FUSE daemon has the same cascading failure as a hung NFS mount.

You've replaced one synchronous failure mode with another.

JuiceFS: The Interesting Middle Ground

JuiceFS sits between NFS and raw object storage. It's a POSIX-compatible filesystem that splits storage into separate layers:

  • Metadata — stored in a database (PostgreSQL, Redis, etc.)
  • Data chunks — stored in object storage (S3, MinIO, etc.)
  • Local cache — bounded read cache on the client

For our setup, the metadata could go in the PostgreSQL instance already running on the RDS, and the data chunks would go in MinIO. Each worker gets a configurable local read cache in tmpfs — say, 2GB — so hot blobs are served from RAM (fast) while cold blobs are fetched from MinIO (slow but fine).
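Under those assumptions, the setup would look roughly like this (hostnames, bucket name, database, and credentials are all placeholders; the flags are from the juicefs CLI, where --cache-size is in MiB):

```shell
# One-time: create the filesystem. Metadata goes to PostgreSQL,
# data chunks to a MinIO bucket (endpoints here are placeholders).
juicefs format \
  --storage minio \
  --bucket http://minio.rds.local:9000/containerd-content \
  --access-key "$MINIO_KEY" --secret-key "$MINIO_SECRET" \
  "postgres://juicefs@pg.rds.local:5432/juicefs" containerd-content

# Per worker: mount at the content store path with a bounded 2GB
# read cache kept in tmpfs; LRU eviction enforces the cap.
juicefs mount -d \
  --cache-dir /var/cache/juicefs \
  --cache-size 2048 \
  "postgres://juicefs@pg.rds.local:5432/juicefs" \
  /var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content
```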

The key insight: with a bounded local cache, you control exactly how much tmpfs goes to image storage. Not "whatever images happen to be pulled" but a hard cap that the cache respects. LRU eviction handles the rest.

And failure behavior is better than NFS: JuiceFS can serve reads from the local cache even if the backend is temporarily unreachable. It's not perfect — metadata operations still need the database — but it's more resilient than a hard NFS mount.

Pull-Through Registry Cache

Step back from the content store entirely. Instead of modifying how containerd stores blobs, change where it pulls from.

A pull-through registry cache (Zot, Harbor, or plain Distribution) sits between your workers and upstream registries:

Worker → Pull-through cache → Docker Hub / GHCR / Gitea
            ↓ (cached)
         S3 / MinIO storage

First pull hits upstream, subsequent pulls from any worker hit the local cache. The cache can use MinIO as its storage backend natively — registries are designed for blob storage in a way that POSIX filesystems aren't.
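With K3s, pointing workers at such a cache is a small registries.yaml on each node (the cache endpoint is a placeholder):

```yaml
# /etc/rancher/k3s/registries.yaml
mirrors:
  docker.io:
    endpoint:
      - "https://registry-cache.lab.local:5000"
  ghcr.io:
    endpoint:
      - "https://registry-cache.lab.local:5000"
```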

But here's the catch: this doesn't solve the local storage problem. Containerd still writes every pulled blob to its local content store, then unpacks it into the snapshotter. The pull-through cache reduces network bandwidth and pull latency, not local disk usage.

You'd still need to combine this with NFS or JuiceFS for the local content store to actually reduce tmpfs pressure. The pull-through cache is a complementary optimization, not a replacement.

What Actually Matters

After exploring all of these options, the decision matrix comes down to two questions:

1. How much local storage pressure do you need to eliminate?

If the answer is "just the duplicate blobs across workers" — NFS is fine. It's simple, it works today, and the failure modes are manageable with soft mounts or careful monitoring.

If the answer is "I want bounded, predictable tmpfs usage" — JuiceFS with a capped local cache is the right tool. More complexity, but real control over resource consumption.

2. How sensitive are you to storage infrastructure failures?

If a 30-second NFS hiccup cascading into node evictions is acceptable — NFS. If it's not — object storage with an async failure mode, whether through JuiceFS or a FUSE bridge.

Where We Landed

For now, we're running the NFS content store. It's deployed, it's working, and the workers are sharing blobs across a single NFS mount on the storage VLAN. The pragmatic choice.

But the JuiceFS path is tempting. PostgreSQL and MinIO are already on the RDS. The local cache semantics solve the tmpfs bounding problem properly. And the failure isolation is genuinely better than NFS.

Sometimes the right answer isn't the one you deploy today — it's the one you know you'll deploy when the current solution first bites you at 3 AM.


This is a bonus post in the RAM and Recklessness series. The main series covers the full infrastructure from bare metal to GPU passthrough.