RAM and Recklessness: The Smallest Supercomputer
Part 7 of Building a Diskless Datacenter
NVIDIA's DGX Spark is a desktop-sized computer with a GB10 Grace Blackwell chip: 20 ARM cores, 128GB of unified memory shared between CPU and GPU, and a compute capability (sm_121) that almost nothing supports yet. Getting vLLM to run inference on it required building everything from source, patching GPU kernels, and discovering bugs in three different software stacks.
The Hardware
The DGX Spark sits on a shelf next to the server rack. It's small — about the size of a Mac mini. Inside:
- GB10 Grace Blackwell — ARM CPU + Blackwell GPU on a single chip
- 128GB unified memory — shared between CPU and GPU, no PCIe bus between them
- Compute capability 12.1 (sm_121) — a new architecture that's binary-compatible with sm_120 but has its own tensor core instructions
- CUDA 13.1.1 — the bleeding edge
The unified memory is the killer feature. Traditional GPUs are limited by VRAM — a V100 has 16GB, an A100 has 80GB. The GB10 can address the full 128GB from both CPU and GPU. This means you can run models that would normally need multiple GPUs on a single chip.
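To make the capacity difference concrete, here's a back-of-the-envelope fit check. The device sizes come from the text; the 20GB KV-cache allowance is an illustrative assumption, not a measured figure:

```python
# Back-of-the-envelope: can the 62.5GB BF16 model plus a KV-cache
# allowance sit in one device's addressable memory?
# (Device sizes from the text; the 20GB KV allowance is an assumption.)
MODEL_GB = 62.5
KV_GB = 20  # assumed allowance for long-context KV cache

def fits(mem_gb, model_gb=MODEL_GB, kv_gb=KV_GB):
    return mem_gb >= model_gb + kv_gb

for name, mem in [("V100", 16), ("A100", 80), ("GB10 unified", 128)]:
    verdict = "fits" if fits(mem) else "needs quantization or sharding"
    print(f"{name} ({mem}GB): {verdict}")
```

Only the GB10's unified 128GB holds the unquantized model in one address space; everything else forces quantization or multi-GPU sharding.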
The downside: sm_121 is so new that most ML software doesn't know it exists.
The Software Stack Problem
Here's what doesn't work on sm_121 out of the box:
- FlashAttention — compiled against CUDA 12, crashes with libcudart.so.12 ABI errors on CUDA 13
- vLLM prebuilt wheels — compiled for sm_80/sm_90, wrong compute capability
- FlashInfer prebuilt — same problem
- Triton's bundled ptxas — doesn't know sm_121a arch
- CUTLASS Python DSL — doesn't include sm_121a in its admissible architectures
The only thing that worked was the NGC PyTorch container (26.01-py3), which NVIDIA ships with CUDA 13 and ARM64 support. Everything else had to be built from source.
Building vLLM for GB10
The build stack, layer by layer:
Base: nvcr.io/nvidia/pytorch:26.01-py3 — PyTorch 2.10.0, CUDA 13.1.1, ARM64
Step 1: Remove flash-attn
pip uninstall flash-attn -y
This package is compiled against CUDA 12 and causes immediate crashes. FlashInfer replaces its functionality.
Step 2: Build FlashInfer from source
FLASHINFER_CUDA_ARCH_LIST=12.1a pip install flashinfer-python --no-build-isolation
The 12.1a suffix is critical — 12.1 gives you binary-compatible sm_120 code, but 12.1a generates architecture-specific instructions that use the GB10's tensor cores.
Step 3: Set Triton's ptxas path
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
Triton bundles its own ptxas that doesn't know sm_121a. Pointing it to the system ptxas (from CUDA 13) fixes compilation.
Step 4: Build vLLM from source
pip install vllm --no-build-isolation
The whole build takes about 45 minutes on the GB10 itself.
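Rolled together, the four steps map naturally onto a container build. A sketch only — it condenses the commands above; pin versions and adjust paths for your environment:

```dockerfile
# Sketch: GB10 vLLM image, assembled from the steps described above.
FROM nvcr.io/nvidia/pytorch:26.01-py3

# Step 1: drop the CUDA 12-linked flash-attn shipped in the base image
RUN pip uninstall -y flash-attn

# Step 2: build FlashInfer with GB10-specific (sm_121a) kernels
RUN FLASHINFER_CUDA_ARCH_LIST=12.1a pip install flashinfer-python --no-build-isolation

# Step 3: make Triton use the CUDA 13 ptxas, which knows sm_121a
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Step 4: build vLLM against the container's PyTorch
RUN pip install vllm --no-build-isolation
```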
CUTLASS and FP4
The GB10 supports FP4 (4-bit floating point) through CUTLASS kernels. FP4 is exciting because it halves memory again relative to FP8 — the 62.5GB BF16 model quantizes to ~32GB in FP8 and to roughly half that in FP4 (a bit more once block scales are counted), fitting comfortably in the 128GB unified memory with ample room for KV cache.
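The arithmetic, with rough per-parameter costs (the FP4 figure assumes NVFP4-style block scaling with one 1-byte scale per 16 weights; embedding and norm overheads are ignored):

```python
# Rough weight-memory arithmetic for a 62.5GB BF16 model.
BF16_GB = 62.5
params_b = BF16_GB / 2            # 2 bytes/param -> ~31B parameters

bytes_per_param = {
    "bf16": 2.0,
    "fp8": 1.0,                    # per-channel scales are negligible
    "nvfp4": 0.5 + 1 / 16,         # 4-bit weight + one FP8 scale per 16-block
}

for fmt, bpp in bytes_per_param.items():
    print(f"{fmt}: ~{params_b * bpp:.1f} GB")
```

The FP8 result (~31GB) matches the 32GB on-disk figure below; NVFP4 lands around 17-18GB.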
But CUTLASS had issues:
C++ API: Works for sm_121. We benchmarked 356 TFLOPS dense FP4 GEMM, which is 71% of the theoretical peak. That's excellent for a desktop device.
Python DSL: Doesn't include sm_121a in BlockScaledMmaOp.admissible_archs. We had to patch CUTLASS to add the architecture. Without the patch, the Python API refuses to generate FP4 kernels for GB10.
Shared memory limit: The GB10 has 99 KiB of shared memory per SM, versus 228 KiB on the B200. Some CUTLASS tile configurations that work on B200 overflow on GB10. FlashInfer's autotuner tries these configs at startup, fails with 28 warnings and ~2800 lines of C++ traces, then falls back to configurations that fit. It's harmless but alarming the first time you see it.
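The fallback behavior amounts to filtering tile configurations by their shared-memory footprint. A simplified sketch — the tile shapes, stage count, and footprint model are illustrative, not FlashInfer's actual heuristics:

```python
# Sketch: admit only tile configs whose shared-memory footprint fits the SM.
# Footprint model (illustrative): A-tile + B-tile buffered per pipeline stage.
def smem_bytes(tile_m, tile_n, tile_k, stages, bytes_per_elem=1):
    per_stage = (tile_m * tile_k + tile_n * tile_k) * bytes_per_elem
    return per_stage * stages

GB10_SMEM = 99 * 1024    # 99 KiB per SM
B200_SMEM = 228 * 1024   # 228 KiB per SM

configs = [
    (128, 256, 128, 3),  # big B200-style tile — overflows the GB10
    (128, 128, 128, 3),
    (64, 128, 128, 3),
]

for limit, name in [(B200_SMEM, "B200"), (GB10_SMEM, "GB10")]:
    ok = [c for c in configs if smem_bytes(*c) <= limit]
    print(f"{name} admits {len(ok)} of {len(configs)} configs")
```

The configs that overflow are exactly the ones that produce the startup warnings; the autotuner then settles on the smaller tiles.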
The Scitrera Problem
The obvious approach was to use pre-built vLLM container images. Scitrera publishes ARM64 vLLM images for the DGX Spark. They don't work.
The flash attention kernels in those images are compiled for the wrong compute capability. Instead of sm_121, they target an older architecture. The result: silent wrong results or crashes, depending on which code path gets hit.
We ended up building and hosting our own image at gitea.whiskey.works/whiskey/spark-vllm, with CI/CD through a Gitea Actions runner running on the metal cluster. The build factory is its own little infrastructure:
kubernetes/dgx-spark/build-factory/
├── act-runner deployment
├── RBAC for build jobs
├── registry credentials
└── ExternalSecret for auth
Model Quantization Adventures
With the software stack working, the next challenge was fitting interesting models into 128GB. The target: GLM-4.7-Flash, a 62.5GB (BF16) mixture-of-experts model with 8 experts per layer.
FP8 Dynamic Quantization
FP8 (8-bit floating point) halves model size with minimal quality loss. We used llm-compressor's model_free_ptq (post-training quantization without calibration data):
- Needs ~2-4GB VRAM for the quantization process
- Needs ~48GB host RAM to hold the model weights
- Can actually run on a V100 (which has plenty of host RAM)
- Output: FP8 weights + per-channel scales
The FP8 model: 32GB, easily fits in 128GB with room for 200k+ context KV cache.
We uploaded the abliterated variant as whiskeywhiskey/Huihui-GLM-4.7-Flash-abliterated-FP8-Dynamic on Hugging Face.
NVFP4 Quantization
NVFP4 (NVIDIA 4-bit floating point) is more aggressive: 4-bit weights with block-level scaling factors. Half the size of FP8, but with more quality degradation.
NVFP4 has a catch: it needs the full model in GPU memory for quantization. The 62.5GB BF16 model needs ~80GB including workspace. Only the DGX Spark's 128GB unified memory can do this — a V100 with 16GB VRAM is out of the question.
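The contrast between the two quantization paths comes down to GPU-addressable memory. A quick check using the figures from the text (treating the GB10's unified 128GB as fully GPU-addressable):

```python
# Which devices can run each quantization job?
# FP8 streams weights through a small VRAM footprint; NVFP4 needs the
# full BF16 model plus workspace resident in GPU memory.
FP8_GPU_NEED_GB = 4      # ~2-4GB VRAM suffices for model_free_ptq
NVFP4_GPU_NEED_GB = 80   # 62.5GB model + workspace

def can_quantize(gpu_addressable_gb, needed_gb):
    return gpu_addressable_gb >= needed_gb

for name, mem in [("V100 (16GB VRAM)", 16), ("GB10 (128GB unified)", 128)]:
    print(name,
          "fp8:", can_quantize(mem, FP8_GPU_NEED_GB),
          "nvfp4:", can_quantize(mem, NVFP4_GPU_NEED_GB))
```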
The Fused MoE Expert Problem
GLM-4.7-Flash uses transformers v5's fused MoE implementation. Instead of storing each expert as separate weight tensors, it stores all experts in a single fused 3D tensor (Glm4MoeLiteNaiveMoe).
This is invisible to llm-compressor, which walks the model's named parameters looking for 2D weight matrices to quantize. A 3D tensor doesn't match, so it skips the experts entirely. The result: a "quantized" model where only the dense layers are actually quantized, and all the MoE experts (the majority of the model) are still BF16.
The fix: a zero-copy UnfusedExperts wrapper that presents the fused 3D tensor as individual 2D expert weight views. llm-compressor sees the 2D views, quantizes them, and the underlying fused tensor gets quantized in-place.
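The idea can be sketched in pure Python. The real wrapper works on torch tensors, where slicing a 3D tensor yields a zero-copy 2D view; the class and names here are illustrative, emulating that sharing with a flat buffer:

```python
# Sketch of the zero-copy trick: expose each expert inside a fused
# [experts, out, in] buffer as a 2D "view" whose writes hit the buffer.
# (The real implementation uses torch tensor views; this is pure Python.)
class ExpertView:
    def __init__(self, fused, expert, out_dim, in_dim):
        self.fused = fused                      # shared flat buffer
        self.base = expert * out_dim * in_dim   # offset of this expert
        self.out_dim, self.in_dim = out_dim, in_dim

    def __getitem__(self, rc):
        r, c = rc
        return self.fused[self.base + r * self.in_dim + c]

    def __setitem__(self, rc, value):           # writes through to fused
        r, c = rc
        self.fused[self.base + r * self.in_dim + c] = value

# 2 experts, each a 2x3 weight matrix, stored fused and flat
fused = [float(i) for i in range(2 * 2 * 3)]
views = [ExpertView(fused, e, 2, 3) for e in range(2)]

views[1][0, 0] = -1.0   # "quantize" one element through the view
print(fused[6])         # -1.0 — the fused tensor changed in place
```

llm-compressor only ever sees the 2D views, but every write lands in the fused storage, so no copy of the (large) expert weights is ever made.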
MTP + NVFP4: The Upstream Bug
vLLM supports Multi-Token Prediction (MTP), where a draft model predicts multiple tokens at once for speculative decoding. GLM-4.7-Flash uses MTP natively.
MTP + NVFP4 is broken across vLLM 0.15.x and 0.16.x. The MTP drafter's weight loader doesn't handle NVFP4's scale tensors. It expects standard weight shapes and crashes when it encounters the block-scaled format.
This is an upstream vLLM bug. Our workaround: use FP8 (which works with MTP) or disable MTP with NVFP4.
Serving Configuration
All vLLM deployments on the DGX Spark share a common pattern:
selector:
  matchLabels:
    app.kubernetes.io/name: vllm-spark
Model swapping is done by scaling deployments:
kubectl scale deployment/vllm-fp8 --replicas=1
kubectl scale deployment/vllm-nvfp4 --replicas=0
All deployments use the same service selector, so the service automatically routes to whichever deployment is scaled up. This lets us switch between FP8, NVFP4, and BF16 models without changing any downstream configuration.
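A sketch of the routing side of the pattern — the port and abbreviated manifest are illustrative; only the shared label matters:

```yaml
# One Service fronts every model deployment. Pods from vllm-fp8,
# vllm-nvfp4, etc. all carry the same label, so whichever deployment
# is scaled to 1 receives the traffic.
apiVersion: v1
kind: Service
metadata:
  name: vllm-spark
spec:
  selector:
    app.kubernetes.io/name: vllm-spark
  ports:
    - port: 8000        # port is illustrative
      targetPort: 8000
```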
Standard settings across all deployments:
- --gpu-memory-utilization 0.80 — leaves 20% for KV cache growth
- --max-model-len — varies by quantization (202k for FP8, less for larger models)
The Build Factory
The DGX Spark's build infrastructure runs on the metal cluster:
- act-runner: Gitea Actions runner that builds container images
- RBAC: ServiceAccount with permissions to run build jobs
- Registry secret: Credentials for pushing to gitea.whiskey.works
- ExternalSecret: Pulls credentials from the metal cluster's secret store
When we push to the spark-images repo, CI builds the vLLM image, pushes it to the registry, and the DGX Spark pulls it on next deployment. The whole pipeline is self-hosted — no external CI/CD dependencies.
Performance
On the GLM-4.7-Flash FP8 model with 202k context:
- Time to first token: ~200ms
- Throughput: ~50 tokens/sec for single requests
- Concurrent: Scales well due to continuous batching
For a desktop-sized device, this is remarkable. It's serving a ~31B-parameter model (62.5GB in BF16, 32GB quantized to FP8) with quality indistinguishable from the BF16 original, from a box that draws about 200W.
The V100 cluster can also serve models, but the 16GB per-GPU VRAM limitation means smaller models or more aggressive quantization. The DGX Spark's 128GB unified memory is the sweet spot for large model inference.