RAM and Recklessness: Community Infrastructure
Part 8 of Building a Diskless Datacenter
At some point, a homelab stops being a personal experiment and starts being infrastructure that other people rely on. The srvlab community needed a blog, a status page, collaborative docs, and a way to access it all without VPN credentials or port forwarding. This is how we built that on top of everything we'd already built.
The srvlab-cluster
The community services run on the srvlab-cluster — the nested Kubernetes cluster from Part 5. This gives us complete isolation from the metal cluster. If someone finds a bug in Ghost and exploits it, they're inside a VM inside a pod inside a Kubernetes cluster that has no access to the metal control plane.
The stack:
- Flux CD for GitOps — everything in Git, nothing applied manually
- cert-manager with Let's Encrypt DNS-01 via DNSimple
- ingress-nginx as the ingress controller
- external-dns for automatic DNS record management
- external-secrets for pulling secrets from the metal cluster
- rds-csi for NVMe-oF persistent storage
- Tailscale for external access
Tailscale as the Front Door
None of the srvlab services are exposed to the public internet. Instead, they're accessible through Tailscale — our Headscale-powered mesh network.
A Tailscale proxy pod runs in the cluster:
env:
- name: TS_HOSTNAME
value: srvlab-ingress
- name: TS_SERVE_CONFIG
value: /config/serve.json # TCP forward 80/443 → ingress-nginx
This pod:
- Authenticates to Headscale (headscale.srvlab.io) with a non-ephemeral, reusable auth key
- Gets the Tailscale IP 100.64.0.16
- Forwards all HTTP/HTTPS traffic to ingress-nginx
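The serve.json referenced in the env block might look something like this — a sketch using Tailscale's serve config format; the exact ingress-nginx service DNS name is an assumption:

```json
{
  "TCP": {
    "80": {
      "TCPForward": "ingress-nginx-controller.ingress-nginx.svc.cluster.local:80"
    },
    "443": {
      "TCPForward": "ingress-nginx-controller.ingress-nginx.svc.cluster.local:443"
    }
  }
}
```

Plain TCP forwarding on 443 keeps TLS termination in ingress-nginx, so the Let's Encrypt certificates live in one place.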
DNS is handled by external-dns, which creates records in DNSimple:
blog.srvlab.whiskey.works → 100.64.0.16
docs.srvlab.whiskey.works → 100.64.0.16
status.srvlab.whiskey.works → 100.64.0.16
The result: anyone on the Tailscale network navigates to blog.srvlab.whiskey.works, gets routed through the mesh to the proxy pod, which forwards to ingress-nginx, which terminates TLS with a valid Let's Encrypt certificate and routes to the backend service.
The external-dns Override
There's a subtlety that cost us time. When ingress-nginx creates a LoadBalancer service, it gets IP 10.42.66.201 (from Cilium's LBIPPool). external-dns sees the ingress and creates an A record pointing to that IP.
But 10.42.66.201 is an internal VLAN 66 address — unreachable from outside the lab. We need DNS to point to 100.64.0.16 (the Tailscale IP).
The fix: every ingress needs this annotation:
external-dns.alpha.kubernetes.io/target: "100.64.0.16"
This tells external-dns to ignore the LoadBalancer IP and publish the Tailscale IP instead. Without it, resolution works at first (via the wildcard record), but external-dns eventually overwrites the wildcard with the unreachable internal IP.
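On a concrete ingress that looks like the following — a sketch where the resource names, issuer name, and service port are illustrative (2368 is Ghost's default port):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ghost
  annotations:
    # Publish the Tailscale IP, not the internal LoadBalancer IP
    external-dns.alpha.kubernetes.io/target: "100.64.0.16"
spec:
  ingressClassName: nginx
  rules:
    - host: blog.srvlab.whiskey.works
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ghost
                port:
                  number: 2368
```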
The Services
Ghost Blog
Ghost is the blog engine at blog.srvlab.whiskey.works. It's the classic Node.js blogging platform, running the Ghost 5 Alpine image.
Ghost only supports MySQL/MariaDB or SQLite for its database — no PostgreSQL. So we stood up a dedicated MariaDB instance:
# MariaDB init script
CREATE DATABASE IF NOT EXISTS ghost CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER IF NOT EXISTS 'ghost'@'%' IDENTIFIED BY 'ghost-srvlab';
GRANT ALL PRIVILEGES ON ghost.* TO 'ghost'@'%';
Ghost connects to mariadb.mariadb.svc.cluster.local:3306. The database and Ghost content directory (themes, images) are both on rds-csi PVCs — NVMe-oF backed, surviving pod restarts.
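The connection itself is configured through Ghost's double-underscore environment variables. A sketch, assuming the MariaDB service name from above; in practice the password would come from a Secret rather than a literal value:

```yaml
env:
  - name: database__client
    value: mysql
  - name: database__connection__host
    value: mariadb.mariadb.svc.cluster.local
  - name: database__connection__port
    value: "3306"
  - name: database__connection__user
    value: ghost
  - name: database__connection__database
    value: ghost
  # database__connection__password should reference a Secret via valueFrom
```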
HedgeDoc
HedgeDoc provides collaborative markdown editing at docs.srvlab.whiskey.works. Think Google Docs but self-hosted and markdown-native.
The HedgeDoc deployment went through several iterations:
- SQLite attempt: `CMD_DB_URL=sqlite:///data/hedgedoc.sqlite` — this should work according to the docs but doesn't. HedgeDoc's Sequelize integration doesn't parse SQLite URLs correctly.
- SQLite workaround: `CMD_DB_DIALECT=sqlite` + `CMD_DB_STORAGE=/data/hedgedoc.sqlite` — this works! But SQLite is not great for a service that multiple people use concurrently.
- PostgreSQL migration: we stood up a central PostgreSQL 16 instance and pointed HedgeDoc at it:
- name: CMD_DB_URL
value: postgres://hedgedoc:hedgedoc-srvlab@postgresql.postgresql.svc.cluster.local:5432/hedgedoc
The PostgreSQL instance serves as the central database for the cluster. Other services that support PostgreSQL can connect to it, each with their own database and credentials.
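The per-service provisioning mirrors the MariaDB init script shown earlier. A sketch of the equivalent PostgreSQL statements, using the credentials from the CMD_DB_URL above (how the init script is mounted is an assumption):

```sql
-- PostgreSQL init: one role + database per service
CREATE ROLE hedgedoc LOGIN PASSWORD 'hedgedoc-srvlab';
CREATE DATABASE hedgedoc OWNER hedgedoc;
```

Each new service gets its own role and database this way, so a compromised service credential can't read another service's data.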
Uptime Kuma
Uptime Kuma at status.srvlab.whiskey.works monitors our services and endpoints. It's SQLite-only — no PostgreSQL option — which is fine for a monitoring tool that mostly does reads.
It runs with a 1Gi PVC for its SQLite database and a simple deployment pinned to the control plane node.
The Ephemeral Worker Problem
The first deployment of these services failed spectacularly. Pods kept getting evicted with:
The node was low on resource: ephemeral-storage
The srvlab workers have a tmpfs /var (inherited from the metal cluster's diskless architecture). Container images count against the kubelet's ephemeral-storage budget, and on a tmpfs-backed worker that budget is the RAM allocated to /var.
When you pull a 200MB Ghost container image, that's 200MB of ephemeral storage consumed. Plus the container's writable layer. Plus any emptyDir volumes. On a worker with a 2GB ephemeral budget, you run out fast.
The fix: pin all stateful services to control plane nodes, which have real disk-backed storage:
nodeSelector:
node-role.kubernetes.io/control-plane: "true"
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
The control plane VMs have 100Gi root PVCs (on rds-csi), so ephemeral storage is plentiful.
The Storage Path
Persistent storage on srvlab-cluster uses rds-csi, the same CSI driver as the metal cluster, but with a twist: both clusters currently share the same user and storage path on the RDS.
The RDS segregates clusters by path:
- /storage-pool/metal-csi — metal cluster PVCs
- /storage-pool/homelab-csi — homelab cluster PVCs (paused)
- /storage-pool/srvlab-csi — planned, not yet created
We wanted a dedicated srvlab-csi user and path, but the RDS SSH credentials are cached through a YubiKey, and the PIN cache expired while we were deploying. Rather than block on that, we reused the metal-csi credentials.
This is a TODO: when YubiKey access is restored, create the srvlab-csi user and migrate PVCs to the dedicated path. For now, it works — the PVCs just live under the metal-csi path with srvlab-specific volume names.
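From the cluster side, claiming that storage is ordinary PVC YAML. A sketch, assuming the StorageClass is named after the rds-csi driver; the claim name and size are illustrative:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ghost-content
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: rds-csi  # assumption: class name matches the driver
  resources:
    requests:
      storage: 10Gi
```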
GitOps: The Full Loop
Every resource in the srvlab-cluster is managed by Flux CD. The workflow:
- Edit YAML in the flux-repo (clusters/srvlab/)
- git push
- Flux detects the change and applies it
- If it fails, check kubectl get kustomization -A for error messages
The Flux dependency chain ensures ordered deployment:
sources (HelmRepositories, GitRepositories)
→ infrastructure (cert-manager, ingress, databases)
→ infrastructure-config (ClusterIssuers, SecretStores)
→ services (Ghost, HedgeDoc, Uptime Kuma)
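In Flux, that ordering is expressed with `dependsOn` on each Kustomization. A sketch of the services layer; the Kustomization names, path, and GitRepository name are assumptions:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: services
  namespace: flux-system
spec:
  dependsOn:
    - name: infrastructure-config  # won't reconcile until issuers/secrets are ready
  interval: 10m
  path: ./clusters/srvlab/services
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-repo
```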
This ordering was learned the hard way. Our first attempt had everything in one flat kustomization. cert-manager CRDs weren't installed when the ClusterIssuer was applied. Secrets didn't exist when deployments referenced them. Chaos.
The four-layer model means: by the time a service deploys, its CRDs exist, its certificates can be issued, its secrets are available, and its database is running.
What's Next
The srvlab community infrastructure is live and serving real users. But there's more to do:
- Proper KEDA autoscaling — currently pinned at 3 workers, needs Prometheus metrics for demand-based scaling
- Segregated storage — dedicated srvlab-csi path on the RDS
- More services — the platform can host anything that runs in a container
- AI inference — exposing the DGX Spark's vLLM through the srvlab-cluster for community use via LiteLLM
The homelab has gone from three servers and a dream to a multi-cluster platform serving a community. It boots from RAM, runs VMs inside Kubernetes, passes GPUs through to virtualized NixOS, and serves AI inference from a desktop computer. Every layer has bugs. Every bug has a story. And every story starts with "it should be simple."
This is the final post in the Building a Diskless Datacenter series. The infrastructure continues to evolve — follow the blog for updates as we add new capabilities and inevitably break things.