P0 done: test-backend-ops MUL_MAT on CUDA0 = 1103/1103 (CUDA vs CPU ref, covers
Q4_0/Q4_K at m=4096,k=14336,n=1..512) - the correctness gate the W4A16 kernel must
keep green. Baseline llama-bench dense Q4 prefill ~750 t/s (~46 TFLOP/s, ~21% of
the 213 BF16 ceiling) - the number to beat toward ~3300. Reusable harness at
~/p0harness.sh (needed -DLLAMA_BUILD_TESTS=ON).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile
#ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests
never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048
690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121
GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library
shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is
required. Added the phased W4A16 Marlin-style implementation plan (P0 harness ->
P5 enable) as the committed multi-week build; corrected the cuBLAS note.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound).
vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is
best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on
dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3
head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps
Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate
(TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for
a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17%
of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the
kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to
beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s
(BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is
2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both
EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a
Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4
MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped
GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for
sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on
consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward
pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16
(4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-
apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would
be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer
#2577/#3294, cutlass #3096.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a
non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both
share the e2m1 block-scaled collective; dense is grouped-with-one-group.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37%
prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven).
Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu}
entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always
falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation
phases + parity harness + the dense-path follow-up (#28).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/
draft); CUTLASS not a dep; no fork has one; activation-quant gather already
fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months,
maintainers deferring to cuTile). No tractable patch closes the 27x.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger
physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on
a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048.
When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on
Blackwell instead of 512; explicit `batch:` always overrides. Detection is a
shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap
>= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at
the common ModelOptions builder, so it covers the C++ llama.cpp backend too.
Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316
t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone
captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is
prefill via Blackwell FP4 tensor cores.
Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the
no-Blackwell branch for determinism.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency
sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8
MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the
lever plan.
Measured wins:
- Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill
plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48),
near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap).
- Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch.
Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8
GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and
the complete paged attention (on-demand alloc + gather-read + continuous
batching + prefix sharing). Honest scope: each is multi-week kernel/systems
work.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Wire paged, non-contiguous fixed-size BLOCK placement into the real
llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate
Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to
the contiguous cache while its KV is physically scattered across permuted
blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via
LLAMA_KV_PAGED_DEBUG, not a silent fallback.
This retires the correctness premise of paged attention IN THE MODEL (not
just at the ggml-op level): attention is invariant to physical KV placement,
because reads use per-cell pos/seq metadata for masking. The patch lives at
patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f).
Scope: storage/placement layer, single sequence. Remaining (P4): the
gather-read compute path (attend only a seq's own blocks) for the throughput
win, and the multi-sequence driver. README updated with repro + status.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention
numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact
remaining work: wire build_attn_paged into llama-graph.cpp and validate
token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput.
Records the integration seams (create_memory, find_slot, get_k/get_v,
build_attn, mask) and the honest caveats (unified cache already shares a
pool; vLLM's classic kernel is deprecated) so the next session starts warm.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Quantify the two multi-tenant wins that are properties of the host-side
block model (vLLM-parity), independent of the in-model compute path:
WIN 1 concurrency capacity @ 512-block budget
contiguous (reserve n_ctx/seq): 4 sequences
paged (on-demand blocks): 37 sequences
--> 9.2x more concurrent sequences
WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
prefix-cache OFF: 2176 physical blocks
prefix-cache ON: 192 physical blocks
--> 11.3x less KV memory
WIN 2 (throughput) is deliberately reported as PENDING: it requires the
paged gather-read path wired into llama-graph.cpp (Gate 0) and is not
measurable at the allocation layer. The win-1 baseline is per-sequence
n_ctx reservation (stream mode); llama.cpp's unified cache already shares
one pool, so the honest win there is on-demand sizing + prefix dedup.
Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Retire the central numeric risk from the design: feeding gather-to-scratch
KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5])
into ggml's standard attention ops produces correct attention.
Path under test: set_rows write -> get_rows gather (K and V) ->
mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared
against an independent host-computed softmax attention over the same K/V/Q.
Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4).
This proves the paged read path is numerically sound on CPU with no new
ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate
Gate 0 (token-identical greedy generation in a real model).
Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Validate the paged KV read/write path at the ggml-op level, driven by
PagedKVManager:
- write: ggml_set_rows(pool, k_src, slot_mapping) scatter K rows by slot
- read: ggml_get_rows(pool, gather_idx) gather a seq's slots into
contiguous scratch (the tensor an attention kernel consumes)
The test forces a non-contiguous, out-of-order physical block layout
(allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and
proves gather(write(x)) == x plus cross-sequence isolation in the shared
pool. This de-risks the central question (does slot-addressed paged storage
round-trip correctly through ggml) before the llama-graph integration.
Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring
how llama.cpp allocates its KV cache. CPU backend, no new ggml op.
Built against ggml from the vendored llama.cpp checkout.
Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>