LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	aaf7b4112e	test(llama-cpp): NVFP4-dense FP4 quality+speed eval on GB10 NVFP4-dense is producible via --tensor-type attn=nvfp4 --tensor-type ffn=nvfp4 (GGML_TYPE_NVFP4 has a full quantize path; no top-level ftype needed). Clean-from-BF16 4B PPL: NVFP4 14.31 vs Q4_K 13.66 vs MXFP4 17.42 vs BF16 13.32 - Q4_K-class, not MXFP4-class. Prefill routes onto the FP4 MMA kernel (~1.29x Q4_K on 4B, within 5% of MXFP4). It is the quality-preserving FP4 win MXFP4 was not. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 18:44:57 +00:00
Ettore Di Giacinto	037ad82b7c	docs(paged): MXFP4-dense vs Q4_K quality gate on GB10 (do not recommend) Fair clean-source perplexity check on DGX Spark (GB10): quantize Qwen3-4B from one BF16 source to both Q4_K_M and MXFP4 (no imatrix, identical recipe). Q4_K_M is +2.6% PPL vs BF16; MXFP4-dense is +30.8% (+27.5% worse than Q4_K). The existing 32B MXFP4 was confirmed double-quant (Q4_K_M -> MXFP4 via --allow-requantize), but the clean 4B test shows the gap is intrinsic to the format, not the double-quant. Output stays coherent. Verdict: the ~1.58x prefill / ~1.2x decode win does not justify a Blackwell MXFP4-dense quality recommendation; keep Q4_K_M the dense default, pursue NVFP4 instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 17:25:14 +00:00
Ettore Di Giacinto	1887385b79	analysis: MXFP4-dense fails quality check (~27% worse PPL than Q4_K) - do not recommend Clean fair comparison (Qwen3-4B, all from same BF16 source, wikitext PPL): BF16 13.32, Q4_K_M 13.66 (+2.6%, near-lossless), MXFP4 17.42 (+30.8%). MXFP4 is ~27% worse than Q4_K even clean from BF16 (32B double-quant cross-check: 7.39 vs 8.46, +14.6%, same direction). MXFP4_MOE is built for MoE expert tensors; on dense attn/ffn it is far lossier than Q4_K's 6-bit superblock structure. The ~1.58x prefill is not worth ~27% PPL - Q4_K stays the dense default; FP4 only where the model is trained for it (MoE). Verdict: do NOT ship a Blackwell MXFP4-dense rec. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 17:24:24 +00:00
Ettore Di Giacinto	40ee9cdd13	docs(paged): evaluate llama.cpp PR #17004 (GPU/backend sampling) on GB10 PR #17004 is merged and already present in our pinned llama.cpp f3e1828. Measured on DGX Spark (GB10, sm_121, Qwen3-32B-Q4_K_M): - llama-batched-bench does no sampling (random tokens), so it cannot test the fix; its ~540 t/s plateau is not sampling-bound. - Real-sampling A/B via llama-batched (CPU vs -bs GPU sampler): +25% at np=32, +3% at np=64, GGML_ASSERT(obj_new) graph-alloc crash at np>=128. - nsys at np=64: GPU-busy time and kernel mix unchanged (392 vs 404 t/s); sampling kernels negligible. GPU utilization did not rise. Clean negative: the fix does not break the plateau toward the ~2700 ceiling or past vLLM 667, and is unusable at the multi-user parallelism in question. Adoption: code arrives via LLAMA_VERSION bump (prepare.sh vendors the modified upstream server-context.cpp), but grpc-server must set params.sampling.backend_sampling to enable it; grammar/tool-call/logprobs requests fall back to CPU. Defer adoption until #18547/#18550 stabilise it. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 15:44:21 +00:00
Ettore Di Giacinto	d6c91b7d62	analysis: finalize PR #22569 paged-KV eval (full detail + compute-bound note) Agent-finalized eval: builds (1-line Qwen3 reshape fix), but on GB10+32B paged is ~12% slower than contiguous and both cap at LLAMA_MAX_SEQ=256 (not OOM; 16GiB/119). Agent argues 32B is compute-bound + plateaus by npl=128 so raising the cap won't help - but 540 t/s << ~1900 bandwidth ceiling, so the plateau cause is unconfirmed (attention-over-KV or CPU sampling, not matmul saturation). Next: raise the cap + remeasure to settle it. Verdict: do not adopt #22569; paged KV not a GB10 lever. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 14:35:02 +00:00
Ettore Di Giacinto	92e93dfc34	analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever Full sweep, Qwen3-32B: contiguous decode 537/541 t/s at npl=128/256 (plateau); paged (#22569) 477/471 - SLOWER at matched concurrency. Both FAIL at npl=512/1024 with n_seq_max<=256 - paged does NOT bypass the LLAMA_MAX_SEQ=256 compile cap, its whole purpose. GB10's limit is the 256-seq cap + the ~540 decode plateau (flat by npl=128), NOT KV capacity/fragmentation (122 GB unified). Paged KV solves a problem GB10 doesn't have; it remains valid for memory-constrained datacenter GPUs (24-48GB) but must be validated there, not GB10. Do not adopt #22569; do not build paged KV for GB10. Real GB10 questions: the 256 cap (cheap) + the 540 plateau (vs vLLM 667). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 13:31:33 +00:00
Ettore Di Giacinto	fdb7f56bb7	docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the vendored llama.cpp server scheduler (update_slots) already implements chunked prefill with prefill/decode interleaving on the pinned version - decode tokens are seated first each iteration, prefill fills the leftover n_batch budget, both share one llama_decode. The draft upstream PR #10718 goal is already absorbed; no re-implementation needed. The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp (both set to nbatch()), which pins the logical scheduling window to the physical ubatch width. The plan scopes the decouple (C++ option + proto NUBatch + options.go), an optional decode-headroom prefill cap as a vendored patch, a token-identical verification harness, and keeps the work orthogonal to paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 12:54:22 +00:00
Ettore Di Giacinto	07985ba45b	analysis: measured llama.cpp aggregate vs vLLM - already ~75-80% at npl<=128 llama-batched-bench Qwen3-32B-Q4_K_M: aggregate decode 235/391/540 t/s at npl=32/64/128 vs vLLM 328/569/667 = 72/69/81%, multiplier 53x (vLLM 56x), still climbing at 128. The 30x headline is wrong at realistic concurrency: llama.cpp is ahead single-stream (MXFP4 1153 > 800) and ~75-80% aggregate. Aggregate prefill is flat ~760 but GB10-compute-capped (vLLM ~800 too), so chunked prefill is a latency/TTFT win not throughput; paged KV is the high-concurrency (thousands-seqs) lever for vLLM's 24k regime. ROI: MXFP4 ship -> chunked prefill -> paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 11:32:40 +00:00
Ettore Di Giacinto	fc589b3fad	analysis: vLLM GB10 advantage is the SCHEDULER, not the kernel (pivot) Code-grounded vLLM v0.23.0 analysis + DGX measurement: vLLM single-stream W4A16 prefill ~800 t/s (~52 TFLOPS) is TIED with llama.cpp MMQ (718/47), using the exact XOR-swizzle + 4-stage cp.async Marlin we proved collapses GB10 occupancy. vLLM has no FP4 cubins on sm_121 (forced W4A16 fallback), so llama.cpp MXFP4 (1153) already beats vLLM single-stream. vLLM's ~24k headline is the aggregate decode multiplier (~56x) from paged KV + chunked prefill + continuous batching - a scheduler win. llama.cpp lacks paged KV + chunked prefill. Kernel work (W4A16 178 t/s, FP4-MMA) banked as not-the-lever; effort pivots to the scheduler. Detail in VLLM_DECOMPOSITION.md; W4A16 plan marked STOPPED. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 07:09:42 +00:00
Ettore Di Giacinto	2b79083b71	feat(w4a16): grow tile to BN128/16w (q4_K +17%, pp512 148->178) P3b-2 for the Blackwell W4A16 Marlin GEMM. The q4_K dequant wall is partly cross-N-block-redundant: every N-block re-decodes the same weight strip, so halving the N-block count (BN 64->128) halves that redundant 6-bit superblock decode. A BN sweep showed this only pays off when BN is spread across more warps (16 warps, 8 m16n8 C-tiles/warp) rather than more fragments-per-warp - the FN=8 / FM=4 variants (16 C-tiles/warp) regressed to ~6.6 TFLOPS on register pressure. Shipping tile is now WM=4,WN=4,FM=2,FN=4 -> BM=128, BN=128, 16 warps. Thermally-bracketed cold A/B (q4_K n=512 / q4_0 n=512 via test-backend-ops perf; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M): BN64/8w (prev): 8.50 / 10.56 TFLOPS, measured 8.45/10.51 again (bracket) BN128/16w (this): 9.92 / 11.68 TFLOPS, pp512 177.6, pp2048 185.0 -> +17% q4_K, +11% q4_0, +20% pp512 vs the previous commit; +49% pp512 vs the original block-tiled kernel (119). Parity gate GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT = 1103/1103, flag set and unset (byte-identical when unset). Still ~4.7x under MMQ (47 TFLOPS) and does NOT beat MMQ; BN growth divides the redundant decode but cannot remove the per-k-step decode itself - the offline weight prepack remains the next unlock for q4_K. Plan doc P3 table + bottleneck notes updated. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 02:01:12 +00:00
Ettore Di Giacinto	2f648dc6a0	feat(w4a16): conflict-free skew-pad ldmatrix + BM128/8w tile (q4_K +28%, q4_0 +40%) P3b for the Blackwell (sm_120/121) W4A16 Marlin GEMM. Two combined changes over the prior block-tiled kernel, both verified by a thermally-bracketed cold A/B (committed measured identically before and after): - Skew-padded shared layout: store the staged weight/activation rows at a padded stride of 12 bf162 (8 data + 4 pad) and feed the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B). ldmatrix's per-lane address is row*stride; the natural stride 8 divides the 32-bank cycle and collides rows 0,4,8,12 (2-way bank conflict). Skewing to 12 (still 16-byte aligned) spreads {r*12 mod 32} across 8 distinct bank-quads, so both ldmatrix halves are conflict-free at only +50% on the ~6 KB staged tile - unlike a 128-byte-row XOR swizzle, which is conflict-free but needs 16 KB shared and collapses occupancy on GB10 (measured 2.84 TFLOPS, worse than baseline). - Larger tile: BM=128, BN=64, 8 warps (WM=4,WN=2,FM=2,FN=4), which cuts the redundant per-M-block activation re-reads. Cold A/B (q4_K n=512 / q4_0 n=512 via test-backend-ops perf; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M): committed: 6.63 / 7.53 TFLOPS, pp512 119 this: 8.52 / 10.49 TFLOPS, pp512 148.5, pp2048 153.9 (+28% / +40% / +25%) Parity gate GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT = 1103/1103, flag set and unset (byte-identical when unset). Still ~5.5x under MMQ (47 TFLOPS) and does NOT beat MMQ yet; the q4_K limiter has now moved from the mma feed to the per-element 6-bit superblock dequant (q4_0 scales to 15.8 TFLOPS with more warps while q4_K stays ~8.5), so the offline weight prepack is the next unlock. Plan doc P3 section updated with the sweep data and the corrected bottleneck. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 01:15:07 +00:00
Ettore Di Giacinto	9973fa995a	feat(w4a16): P3 step 1 - block-tiled multi-warp Marlin GEMM (GB10) Replace the P2 1-warp-per-16x8 W4A16 kernel with a block-tiled multi-warp kernel: blockDim=(32, WM*WN) so threadIdx.x is the warp lane (required by mma.cuh get_i/get_j) and threadIdx.y is the warp index. WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile, each warp owning an FM x FN grid of m16n8k16 BF16 mma fragments accumulated in F32. The BM x 16 dequantized Q4 weight strip is staged once per k-step in a small (~4 KB) shared buffer and reused across the block's whole BN span. Shipping config WM=2,WN=2,FM=2,FN=4. The P2 launch put all threads on threadIdx.x; with >1 warp that drove the mma tile get_j past the shared bound (out-of-bounds shared read, caught by compute-sanitizer). The new (32, nwarps) layout matches mmf.cu and fixes it. Parity gate holds 1103/1103 (test-backend-ops MUL_MAT CUDA0), flag set and unset (byte-identical when GGML_CUDA_W4A16 is unset; the seam returns false). Perf (q4_K m=4096 k=14336 n=512): ~2 TFLOPS (P2) -> ~7-9 TFLOPS (thermal dependent); llama-bench Qwen3-32B-Q4_K_M pp512 31.75 -> ~118-142 t/s. Still below the MMQ baseline (47 TFLOPS / 718 t/s): a tile sweep stayed flat and q4_0 vs q4_K differ by only ~12%, so dequant compute is not the limiter - the shared-load / mma-feed is. A naive double-buffered cp.async pipeline (32 KB shared) regressed via occupancy collapse and an ldmatrix swap was neutral (unswizzled layout bank-conflicts), both reverted. The path to >=150 TFLOPS is the full Marlin machinery (XOR-swizzled shared layout + offline weight reshuffle + tuned async pipeline + Stream-K), deferred to P3 step 4. See W4A16_MARLIN_KERNEL_PLAN.md for the per-step table and dead-end notes. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 23:36:58 +00:00
Ettore Di Giacinto	4de0c3b1b2	feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16 GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against BF16-converted f32 activations, f32 accumulate and write, reusing ggml's mma.cuh tile abstractions. Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32 activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous and f16-activation cases return false and fall back to MMQ so the gate stays green. M/N boundaries are zero-padded in-kernel. Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10): 1103/1103 passed; default flag-off build stays byte-identical 1103/1103. Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s (slow is expected for P2 - the naive single-warp kernel is the correctness checkpoint; P3 adds the cp.async pipeline and weight reshuffle). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 22:09:12 +00:00
Ettore Di Giacinto	9a71e81fc4	kernel: written subagent dispatch briefs for P3/P4/P5 Same strategy as P2: one fresh Opus-4.8 subagent per phase, each handed a complete zero-context brief, dispatched sequentially as each predecessor lands (P3 pipeline needs P2's correct kernel, P4 tune needs P3, P5 enable needs P4). Shared DGX/harness/commit boilerplate factored into a COMMON section; each phase brief carries its goal, incremental steps, acceptance gate, and a splice note for the prior phase's actual deliverable. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 22:01:18 +00:00
Ettore Di Giacinto	718b31d063	kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ) marlin-w4a16.{cuh,cu} + a gated hook in ggml_cuda_mul_mat (dense path), behind GGML_CUDA_W4A16 + sm_120/121 + Q4_0/Q4_K + f32. Returns false -> MMQ, so the default build is byte-identical. Verified on GB10: clean build, test-backend-ops MUL_MAT 1103/1103, llama-bench pp512 unchanged (717.77 default / 718.26 flagged), and GGML_CUDA_W4A16=1 reaches the seam ([w4a16] P1 warning) before falling back. Source + apply steps under kernel/w4a16/ (DGX checkout is volatile). The frame the P2 correctness kernel + P3 Marlin pipeline fill. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:46:38 +00:00
Ettore Di Giacinto	d291e15114	kernel(P0): record precise op-level baseline (q4_K n=512 = 47 TFLOPS, ~22% of ceiling) test-backend-ops perf MUL_MAT m=4096 k=14336: q4_K prefill (n=512) = 47.1 TFLOPS, q4_0 = 49.5; decode (n=1) = 761/817 GFLOPS (memory-bound). The prefill GEMM target is 47 -> ~213 TFLOPS (~4.5x). Cleaner per-shape target than end-to-end for kernel iteration. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:33:50 +00:00
Ettore Di Giacinto	dae2679c3b	kernel(P0): parity harness established + baseline (test-backend-ops 1103/1103 green) P0 done: test-backend-ops MUL_MAT on CUDA0 = 1103/1103 (CUDA vs CPU ref, covers Q4_0/Q4_K at m=4096,k=14336,n=1..512) - the correctness gate the W4A16 kernel must keep green. Baseline llama-bench dense Q4 prefill ~750 t/s (~46 TFLOP/s, ~21% of the 213 BF16 ceiling) - the number to beat toward ~3300. Reusable harness at ~/p0harness.sh (needed -DLLAMA_BUILD_TESTS=ON). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:29:21 +00:00
Ettore Di Giacinto	13e6ee89c7	kernel: validate cuBLAS dead-end (sm_80 fallback) + W4A16 Marlin impl plan Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile #ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048 690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121 GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is required. Added the phased W4A16 Marlin-style implementation plan (P0 harness -> P5 enable) as the committed multi-week build; corrected the cuBLAS note. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:16:13 +00:00
Ettore Di Giacinto	76cc0b6abc	docs(paged): phased plan to make llama.cpp a viable vLLM alternative Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch. Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill + n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend sampling. Cross-cutting: spec-dec for dense. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 09:35:53 +00:00
Ettore Di Giacinto	122df1c620	analysis: vLLM throughput gap decomposed - spec-dec is the per-user lever Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound). vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3 head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate (TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 08:40:20 +00:00
Ettore Di Giacinto	14e3da25b6	kernel: dense MXFP4 test = free 1.44x (765->1153) but FP4-MMA untuned (~17% of ceiling) MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17% of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 07:48:29 +00:00
Ettore Di Giacinto	f5e9caece1	kernel: reframed Blackwell kernel-gap map (research + profiles) Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s (BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is 2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4 MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 07:21:56 +00:00
Ettore Di Giacinto	d2651c86d9	bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 06:59:50 +00:00
Ettore Di Giacinto	ce60737fc5	kernel(doc): dense scope resolved - two FP4 kernels (dense first, then grouped) Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both share the e2m1 block-scaled collective; dense is grouped-with-one-group. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:56:33 +00:00
Ettore Di Giacinto	b7b2e8291c	kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3) The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37% prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven). Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu} entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation phases + parity harness + the dense-path follow-up (#28). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:44:31 +00:00
Ettore Di Giacinto	62f0ae17e3	docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 is from-scratch No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/draft); CUTLASS not a dep; no fork has one; activation-quant gather already fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months, maintainers deferring to cuTile). No tractable patch closes the 27x. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:44:11 +00:00
Ettore Di Giacinto	b14214620c	docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled) static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:32:02 +00:00
Ettore Di Giacinto	1449b806ab	docs(paged): Lever-3 + paged-attention implementation plans + upstream ggml issue draft Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:28:28 +00:00
Ettore Di Giacinto	9f16a907be	docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler. Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:56:46 +00:00
Ettore Di Giacinto	aba0bfd24f	feat(backend): auto-default physical batch to 2048 on Blackwell GPUs On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048. When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on Blackwell instead of 512; explicit `batch:` always overrides. Detection is a shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap >= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at the common ModelOptions builder, so it covers the C++ llama.cpp backend too. Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316 t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is prefill via Blackwell FP4 tensor cores. Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the no-Blackwell branch for determinism. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:46:45 +00:00
Ettore Di Giacinto	7aa61d4c32	docs(paged): DGX Blackwell gap analysis + lever plan (living doc) Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8 MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the lever plan. Measured wins: - Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48), near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap). - Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch. Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8 GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and the complete paged attention (on-demand alloc + gather-read + continuous batching + prefix sharing). Honest scope: each is multi-week kernel/systems work. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:15:14 +00:00
Ettore Di Giacinto	bbc84a9889	feat(paged): Gate 0 in-model - token-identical generation with paged KV placement Wire paged, non-contiguous fixed-size BLOCK placement into the real llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to the contiguous cache while its KV is physically scattered across permuted blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via LLAMA_KV_PAGED_DEBUG, not a silent fallback. This retires the correctness premise of paged attention IN THE MODEL (not just at the ggml-op level): attention is invariant to physical KV placement, because reads use per-cell pos/seq metadata for masking. The patch lives at patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f). Scope: storage/placement layer, single sequence. Remaining (P4): the gather-read compute path (attend only a seq's own blocks) for the throughput win, and the multi-sequence driver. README updated with repro + status. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:51:42 +00:00
Ettore Di Giacinto	3ed3279739	docs(paged): status + integration map for in-model Gate 0 Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact remaining work: wire build_attn_paged into llama-graph.cpp and validate token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput. Records the integration seams (create_memory, find_slot, get_k/get_v, build_attn, mask) and the honest caveats (unified cache already shares a pool; vLLM's classic kernel is deprecated) so the next session starts warm. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:45:51 +00:00
Ettore Di Giacinto	ddace5fb6a	feat(paged): paged-bench - measure capacity & prefix-sharing wins Quantify the two multi-tenant wins that are properties of the host-side block model (vLLM-parity), independent of the in-model compute path: WIN 1 concurrency capacity @ 512-block budget contiguous (reserve n_ctx/seq): 4 sequences paged (on-demand blocks): 37 sequences --> 9.2x more concurrent sequences WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix) prefix-cache OFF: 2176 physical blocks prefix-cache ON: 192 physical blocks --> 11.3x less KV memory WIN 2 (throughput) is deliberately reported as PENDING: it requires the paged gather-read path wired into llama-graph.cpp (Gate 0) and is not measurable at the allocation layer. The win-1 baseline is per-sequence n_ctx reservation (stream mode); llama.cpp's unified cache already shares one pool, so the honest win there is on-demand sizing + prefix dedup. Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:44:41 +00:00
Ettore Di Giacinto	5a5d3df8c8	feat(paged): Phase 2 core - attention over paged KV matches reference Retire the central numeric risk from the design: feeding gather-to-scratch KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5]) into ggml's standard attention ops produces correct attention. Path under test: set_rows write -> get_rows gather (K and V) -> mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared against an independent host-computed softmax attention over the same K/V/Q. Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4). This proves the paged read path is numerically sound on CPU with no new ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate Gate 0 (token-identical greedy generation in a real model). Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:35:35 +00:00
Ettore Di Giacinto	c6698dd4bf	feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU) Validate the paged KV read/write path at the ggml-op level, driven by PagedKVManager: - write: ggml_set_rows(pool, k_src, slot_mapping) scatter K rows by slot - read: ggml_get_rows(pool, gather_idx) gather a seq's slots into contiguous scratch (the tensor an attention kernel consumes) The test forces a non-contiguous, out-of-order physical block layout (allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and proves gather(write(x)) == x plus cross-sequence isolation in the shared pool. This de-risks the central question (does slot-addressed paged storage round-trip correctly through ggml) before the llama-graph integration. Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring how llama.cpp allocates its KV cache. CPU backend, no new ggml op. Built against ggml from the vendored llama.cpp checkout. Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:33:26 +00:00
Ettore Di Giacinto	edb1a11abc	feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype) Host-side paged-attention block manager ported faithfully from vLLM V1 (block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py): - KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal) - BlockPool: get_new_blocks / touch / free_blocks eviction ordering / cache_full_blocks / lazy eviction on reuse - PagedKVManager: on-demand allocate, block_table, slot arithmetic (slot = block_id*block_size + offset), free - Prefix caching: chained block hashing + find_longest_cache_hit (first-miss stop), enabling automatic cross-tenant prefix sharing Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes. Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness, benchmark wins, prefix-share serving) follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:26:31 +00:00
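
The slot arithmetic named in edb1a11abc (slot = block_id*block_size + offset) can be illustrated with a minimal sketch. It is not the repository's C++ PagedKVManager: the function names are hypothetical, the block size of 16 is an assumption inferred from the 16-cell runs (0-15, 144-159, 32-47) reported in bbc84a9889, and the block table [2,1,5] is the out-of-order layout from the Phase 1 test in c6698dd4bf.

```python
from typing import List

# Assumption: 16 cells per block, inferred from the cell runs quoted in bbc84a9889.
BLOCK_SIZE = 16


def slot_for(block_table: List[int], pos: int, block_size: int = BLOCK_SIZE) -> int:
    """Map a logical token position to a physical slot in the shared KV pool.

    block_table[i] is the physical block id that holds logical block i.
    """
    block_id = block_table[pos // block_size]  # which physical block
    offset = pos % block_size                  # position inside that block
    return block_id * block_size + offset


def slot_mapping(block_table: List[int], n_tokens: int,
                 block_size: int = BLOCK_SIZE) -> List[int]:
    """Slots for logical positions 0..n_tokens-1 (hypothetical helper)."""
    return [slot_for(block_table, p, block_size) for p in range(n_tokens)]


if __name__ == "__main__":
    table = [2, 1, 5]  # non-contiguous, out-of-order physical blocks
    slots = slot_mapping(table, 48)
    # First and last slot of each logical block.
    print(slots[0], slots[15], slots[16], slots[31], slots[32], slots[47])
```

In this sketch a list like `slots` plays the role of the index tensor that the Phase 1 commit passes to ggml_set_rows on write and ggml_get_rows on gather, so logically adjacent tokens may land in physically scattered rows of the pool.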

37 Commits
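
The skew-pad argument in 2f648dc6a0 (stride 8 collides, stride 12 is conflict-free) can be checked with a small bank model. This is a simplified sketch under stated assumptions, not the kernel's code: it assumes 32 shared-memory banks of 4 bytes each, one bf162 element per 32-bit word, and that one ldmatrix phase reads 8 rows of 16 bytes (4 words) each. The function names are hypothetical.

```python
NUM_BANKS = 32     # assumption: 32 banks, 4 bytes wide
WORDS_PER_ROW = 4  # one 16-byte ldmatrix row = 4 32-bit words (4 bf162)


def banks_touched(row: int, stride_words: int) -> list:
    """Banks hit by one 16-byte row that starts at word row*stride_words."""
    start = row * stride_words
    return [(start + w) % NUM_BANKS for w in range(WORDS_PER_ROW)]


def max_conflict(stride_words: int, rows: int = 8) -> int:
    """Worst-case number of rows in one 8-row phase mapping to the same bank."""
    counts = [0] * NUM_BANKS
    for r in range(rows):
        for b in banks_touched(r, stride_words):
            counts[b] += 1
    return max(counts)


if __name__ == "__main__":
    for stride in (8, 12):
        starts = [(r * stride) % NUM_BANKS for r in range(8)]
        print(f"stride {stride}: row start banks {starts}, "
              f"max conflict {max_conflict(stride)}-way")
```

Under this model the natural stride of 8 words sends rows 0 and 4 to the same banks (a 2-way conflict), while the padded stride of 12 places the eight rows in eight distinct groups of four banks. The padding costs 12/8 = 1.5x the staged-tile footprint, which matches the +50% quoted in the commit message.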