RAM and Recklessness: The Smallest Supercomputer
Part 7 of Building a Diskless Datacenter
NVIDIA's DGX Spark is a desktop-sized computer with a GB10 Grace Blackwell chip: 20 ARM cores, 128GB of unified memory shared between CPU and GPU, and a compute capability (sm_121) that almost nothing supports yet. Getting vLLM to run inference on it required building everything from source, patching GPU kernels, and discovering bugs in three different software stacks.
The Hardware
The DGX Spark sits on a shelf next to the server rack. It's small — about the size of a Mac mini. Inside:
- GB10 Grace Blackwell — ARM CPU + Blackwell GPU on a single chip
- 128GB unified memory — shared between CPU and GPU, no PCIe bus between them
- Compute capability 12.1 (sm_121) — a new architecture that's binary-compatible with sm_120 but has its own tensor core instructions
- CUDA 13.1.1 — the bleeding edge
The unified memory is the killer feature. Traditional GPUs are limited by VRAM — a V100 has 16GB, an A100 has 80GB. The GB10 can address the full 128GB from both CPU and GPU. This means you can run models that would normally need multiple GPUs on a single chip.
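To make the capacity difference concrete, here's a back-of-the-envelope fit check. The device sizes come from the text; the 20GB KV-cache allowance is an illustrative assumption, not a measured figure:

```python
# Back-of-the-envelope: can the 62.5GB BF16 model plus a KV-cache
# allowance sit in one device's addressable memory?
# (Device sizes from the text; the 20GB KV allowance is an assumption.)
MODEL_GB = 62.5
KV_GB = 20  # assumed allowance for long-context KV cache

def fits(mem_gb, model_gb=MODEL_GB, kv_gb=KV_GB):
    return mem_gb >= model_gb + kv_gb

for name, mem in [("V100", 16), ("A100", 80), ("GB10 unified", 128)]:
    verdict = "fits" if fits(mem) else "needs quantization or sharding"
    print(f"{name} ({mem}GB): {verdict}")
```

Only the GB10's unified 128GB holds the unquantized model in one address space; everything else forces quantization or multi-GPU sharding.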
The downside: sm_121 is so new that most ML software doesn't know it exists.
The Software Stack Problem
Here's what doesn't work on sm_121 out of the box:
- FlashAttention — compiled against CUDA 12, crashes with libcudart.so.12 ABI errors on CUDA 13
- vLLM prebuilt wheels — compiled for sm_80/sm_90, wrong compute capability
- FlashInfer prebuilt — same problem
- Triton's bundled ptxas — doesn't know sm_121a arch
- CUTLASS Python DSL — doesn't include sm_121a in its admissible architectures
The only thing that worked was the NGC PyTorch container (26.01-py3), which NVIDIA ships with CUDA 13 and ARM64 support. Everything else had to be built from source.
Building vLLM for GB10
The build stack, layer by layer:
Base: nvcr.io/nvidia/pytorch:26.01-py3 — PyTorch 2.10.0, CUDA 13.1.1, ARM64
Step 1: Remove flash-attn
pip uninstall flash-attn -y
This package is compiled against CUDA 12 and causes immediate crashes. FlashInfer replaces its functionality.
Step 2: Build FlashInfer from source
FLASHINFER_CUDA_ARCH_LIST=12.1a pip install flashinfer-python --no-build-isolation
The 12.1a suffix is critical — 12.1 gives you binary-compatible sm_120 code, but 12.1a generates architecture-specific instructions that use the GB10's tensor cores.
Step 3: Set Triton's ptxas path
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
Triton bundles its own ptxas that doesn't know sm_121a. Pointing it to the system ptxas (from CUDA 13) fixes compilation.
Step 4: Build vLLM from source
pip install vllm --no-build-isolation
The whole build takes about 45 minutes on the GB10 itself.
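Rolled together, the four steps map naturally onto a container build. A sketch only — it condenses the commands above; pin versions and adjust paths for your environment:

```dockerfile
# Sketch: GB10 vLLM image, assembled from the steps described above.
FROM nvcr.io/nvidia/pytorch:26.01-py3

# Step 1: drop the CUDA 12-linked flash-attn shipped in the base image
RUN pip uninstall -y flash-attn

# Step 2: build FlashInfer with GB10-specific (sm_121a) kernels
RUN FLASHINFER_CUDA_ARCH_LIST=12.1a pip install flashinfer-python --no-build-isolation

# Step 3: make Triton use the CUDA 13 ptxas, which knows sm_121a
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Step 4: build vLLM against the container's PyTorch
RUN pip install vllm --no-build-isolation
```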
CUTLASS and FP4
The GB10 supports FP4 (4-bit floating point) through CUTLASS kernels. FP4 is exciting because it halves memory again relative to FP8 — the 62.5GB BF16 model quantizes to ~32GB in FP8 and to roughly half that in FP4 (a bit more once block scales are counted), fitting comfortably in the 128GB unified memory with ample room for KV cache.
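The arithmetic, with rough per-parameter costs (the FP4 figure assumes NVFP4-style block scaling with one 1-byte scale per 16 weights; embedding and norm overheads are ignored):

```python
# Rough weight-memory arithmetic for a 62.5GB BF16 model.
BF16_GB = 62.5
params_b = BF16_GB / 2            # 2 bytes/param -> ~31B parameters

bytes_per_param = {
    "bf16": 2.0,
    "fp8": 1.0,                    # per-channel scales are negligible
    "nvfp4": 0.5 + 1 / 16,         # 4-bit weight + one FP8 scale per 16-block
}

for fmt, bpp in bytes_per_param.items():
    print(f"{fmt}: ~{params_b * bpp:.1f} GB")
```

The FP8 result (~31GB) matches the 32GB on-disk figure below; NVFP4 lands around 17-18GB.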
But CUTLASS had issues:
C++ API: Works for sm_121. We benchmarked 356 TFLOPS dense FP4 GEMM, which is 71% of the theoretical peak. That's excellent for a desktop device.
Python DSL: Doesn't include sm_121a in BlockScaledMmaOp.admissible_archs. We had to patch CUTLASS to add the architecture. Without the patch, the Python API refuses to generate FP4 kernels for GB10.
Shared memory limit: The GB10 has 99 KiB of shared memory per SM, versus 228 KiB on the B200. Some CUTLASS tile configurations that work on B200 overflow on GB10. FlashInfer's autotuner tries these configs at startup, fails with 28 warnings and ~2800 lines of C++ traces, then falls back to configurations that fit. It's harmless but alarming the first time you see it.
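The fallback behavior amounts to filtering tile configurations by their shared-memory footprint. A simplified sketch — the tile shapes, stage count, and footprint model are illustrative, not FlashInfer's actual heuristics:

```python
# Sketch: admit only tile configs whose shared-memory footprint fits the SM.
# Footprint model (illustrative): A-tile + B-tile buffered per pipeline stage.
def smem_bytes(tile_m, tile_n, tile_k, stages, bytes_per_elem=1):
    per_stage = (tile_m * tile_k + tile_n * tile_k) * bytes_per_elem
    return per_stage * stages

GB10_SMEM = 99 * 1024    # 99 KiB per SM
B200_SMEM = 228 * 1024   # 228 KiB per SM

configs = [
    (128, 256, 128, 3),  # big B200-style tile — overflows the GB10
    (128, 128, 128, 3),
    (64, 128, 128, 3),
]

for limit, name in [(B200_SMEM, "B200"), (GB10_SMEM, "GB10")]:
    ok = [c for c in configs if smem_bytes(*c) <= limit]
    print(f"{name} admits {len(ok)} of {len(configs)} configs")
```

The configs that overflow are exactly the ones that produce the startup warnings; the autotuner then settles on the smaller tiles.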
The Scitrera Problem
The obvious approach was to use pre-built vLLM container images. Scitrera publishes ARM64 vLLM images for the DGX Spark. They don't work.
The flash attention kernels in those images are compiled for the wrong compute capability. Instead of sm_121, they target an older architecture. The result: silent wrong results or crashes, depending on which code path gets hit.
We ended up building and hosting our own image at gitea.whiskey.works/whiskey/spark-vllm, with CI/CD through a Gitea Actions runner running on the metal cluster. The build factory is its own little infrastructure:
kubernetes/dgx-spark/build-factory/
├── act-runner deployment
├── RBAC for build jobs
├── registry credentials
└── ExternalSecret for auth
Model Quantization Adventures
With the software stack working, the next challenge was fitting interesting models into 128GB. The target: GLM-4.7-Flash, a 62.5GB (BF16) mixture-of-experts model with 8 experts per layer.
FP8 Dynamic Quantization
FP8 (8-bit floating point) halves model size with minimal quality loss. We used llm-compressor's model_free_ptq (post-training quantization without calibration data):
- Needs ~2-4GB VRAM for the quantization process
- Needs ~48GB host RAM to hold the model weights
- Can actually run on a V100 (which has plenty of host RAM)
- Output: FP8 weights + per-channel scales
The FP8 model: 32GB, easily fits in 128GB with room for 200k+ context KV cache.
We uploaded the abliterated variant as whiskeywhiskey/Huihui-GLM-4.7-Flash-abliterated-FP8-Dynamic on Hugging Face.
NVFP4 Quantization
NVFP4 (NVIDIA 4-bit floating point) is more aggressive: 4-bit weights with block-level scaling factors. Half the size of FP8, but with more quality degradation.
NVFP4 has a catch: it needs the full model in GPU memory for quantization. The 62.5GB BF16 model needs ~80GB including workspace. Only the DGX Spark's 128GB unified memory can do this — a V100 with 16GB VRAM is out of the question.
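The contrast between the two quantization paths comes down to GPU-addressable memory. A quick check using the figures from the text (treating the GB10's unified 128GB as fully GPU-addressable):

```python
# Which devices can run each quantization job?
# FP8 streams weights through a small VRAM footprint; NVFP4 needs the
# full BF16 model plus workspace resident in GPU memory.
FP8_GPU_NEED_GB = 4      # ~2-4GB VRAM suffices for model_free_ptq
NVFP4_GPU_NEED_GB = 80   # 62.5GB model + workspace

def can_quantize(gpu_addressable_gb, needed_gb):
    return gpu_addressable_gb >= needed_gb

for name, mem in [("V100 (16GB VRAM)", 16), ("GB10 (128GB unified)", 128)]:
    print(name,
          "fp8:", can_quantize(mem, FP8_GPU_NEED_GB),
          "nvfp4:", can_quantize(mem, NVFP4_GPU_NEED_GB))
```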
The Fused MoE Expert Problem
GLM-4.7-Flash uses transformers v5's fused MoE implementation. Instead of storing each expert as separate weight tensors, it stores all experts in a single fused 3D tensor (Glm4MoeLiteNaiveMoe).
This is invisible to llm-compressor, which walks the model's named parameters looking for 2D weight matrices to quantize. A 3D tensor doesn't match, so it skips the experts entirely. The result: a "quantized" model where only the dense layers are actually quantized, and all the MoE experts (the majority of the model) are still BF16.
The fix: a zero-copy UnfusedExperts wrapper that presents the fused 3D tensor as individual 2D expert weight views. llm-compressor sees the 2D views, quantizes them, and the underlying fused tensor gets quantized in-place.
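The idea can be sketched in pure Python. The real wrapper works on torch tensors, where slicing a 3D tensor yields a zero-copy 2D view; the class and names here are illustrative, emulating that sharing with a flat buffer:

```python
# Sketch of the zero-copy trick: expose each expert inside a fused
# [experts, out, in] buffer as a 2D "view" whose writes hit the buffer.
# (The real implementation uses torch tensor views; this is pure Python.)
class ExpertView:
    def __init__(self, fused, expert, out_dim, in_dim):
        self.fused = fused                      # shared flat buffer
        self.base = expert * out_dim * in_dim   # offset of this expert
        self.out_dim, self.in_dim = out_dim, in_dim

    def __getitem__(self, rc):
        r, c = rc
        return self.fused[self.base + r * self.in_dim + c]

    def __setitem__(self, rc, value):           # writes through to fused
        r, c = rc
        self.fused[self.base + r * self.in_dim + c] = value

# 2 experts, each a 2x3 weight matrix, stored fused and flat
fused = [float(i) for i in range(2 * 2 * 3)]
views = [ExpertView(fused, e, 2, 3) for e in range(2)]

views[1][0, 0] = -1.0   # "quantize" one element through the view
print(fused[6])         # -1.0 — the fused tensor changed in place
```

llm-compressor only ever sees the 2D views, but every write lands in the fused storage, so no copy of the (large) expert weights is ever made.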
MTP + NVFP4: The Upstream Bug
vLLM supports Multi-Token Prediction (MTP), where a draft model predicts multiple tokens at once for speculative decoding. GLM-4.7-Flash uses MTP natively.
MTP + NVFP4 is broken across vLLM 0.15.x and 0.16.x. The MTP drafter's weight loader doesn't handle NVFP4's scale tensors. It expects standard weight shapes and crashes when it encounters the block-scaled format.
This is an upstream vLLM bug. Our workaround: use FP8 (which works with MTP) or disable MTP with NVFP4.
Serving Configuration
All vLLM deployments on the DGX Spark share a common pattern:
selector:
  matchLabels:
    app.kubernetes.io/name: vllm-spark
Model swapping is done by scaling deployments:
kubectl scale deployment/vllm-fp8 --replicas=1
kubectl scale deployment/vllm-nvfp4 --replicas=0
All deployments use the same service selector, so the service automatically routes to whichever deployment is scaled up. This lets us switch between FP8, NVFP4, and BF16 models without changing any downstream configuration.
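A sketch of the routing side of the pattern — the port and abbreviated manifest are illustrative; only the shared label matters:

```yaml
# One Service fronts every model deployment. Pods from vllm-fp8,
# vllm-nvfp4, etc. all carry the same label, so whichever deployment
# is scaled to 1 receives the traffic.
apiVersion: v1
kind: Service
metadata:
  name: vllm-spark
spec:
  selector:
    app.kubernetes.io/name: vllm-spark
  ports:
    - port: 8000        # port is illustrative
      targetPort: 8000
```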
Standard settings across all deployments:
- --gpu-memory-utilization 0.80 — leaves 20% for KV cache growth
- --max-model-len — varies by quantization (202k for FP8, less for larger models)
The Build Factory
The DGX Spark's build infrastructure runs on the metal cluster:
- act-runner: Gitea Actions runner that builds container images
- RBAC: ServiceAccount with permissions to run build jobs
- Registry secret: Credentials for pushing to gitea.whiskey.works
- ExternalSecret: Pulls credentials from the metal cluster's secret store
When we push to the spark-images repo, CI builds the vLLM image, pushes it to the registry, and the DGX Spark pulls it on next deployment. The whole pipeline is self-hosted — no external CI/CD dependencies.
Performance
On the GLM-4.7-Flash FP8 model with 202k context:
- Time to first token: ~200ms
- Throughput: ~50 tokens/sec for single requests
- Concurrent: Scales well due to continuous batching
For a desktop-sized device, this is remarkable. It's serving a ~31B-parameter model (62.5GB in BF16, 32GB quantized to FP8) with quality indistinguishable from the BF16 original, from a box that draws about 200W.
The V100 cluster can also serve models, but the 16GB per-GPU VRAM limitation means smaller models or more aggressive quantization. The DGX Spark's 128GB unified memory is the sweet spot for large model inference.