LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	67c6208b3a	feat(llama-cpp/paged): cross-request prefix caching patch 0006 Mirror patch 0006 of the paged-attention series into the vendored llama.cpp patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with host-side cross-request prefix sharing: place_with_prefix reuses cached physical blocks for a new sequence shared prefix (ref_cnt++) and allocates only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1) block before a divergent write so co-owners stay byte-correct; ref-counted free releases a shared block only at ref 0. Core kv-cache files untouched; gated behind LLAMA_KV_PAGED, default off. Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix greedy tokens byte-identical to the unshared baseline at both a block boundary and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated), and copy-on-write + seq_rm ref-count safety with no use-after-free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:14:27 +00:00
Ettore Di Giacinto	667a21c119	feat(llama-cpp): expose paged KV cache as a per-server option (patch 0005) Wire the continuous-batching serving path (update_slots) to the on-demand paged KV-cache engine (patches 0001-0004). update_slots already drives the engine transparently through the existing kv-cache seams: each slot's sequence allocates paged blocks on arrival (find_slot placement) and returns them on slot release (the seq_rm free seam). No serving-loop change is needed for correctness. This patch only exposes the enable cleanly: instead of forcing operators to export the process-wide LLAMA_KV_PAGED env, add `kv_paged` (aliases `paged_kv` / `paged_attention`) and `kv_paged_debug` model options that set the env before the model/context is created. Default off; when the option is absent nothing is touched, so an externally exported env still works and stock behaviour is unchanged. Verified on a dynamic continuous-batching harness (NP physical slots reused across M>NP queued prompts, single mixed llama_decode per step, greedy): 12 dynamically-arriving sequences over 4 slots are token-identical to the stock single-slot serial baseline under both the unified and per-sequence caches. The debug trace confirms per-slot [paged-alloc] grow on arrival and per-stream release on seq_rm. The per-slot allocate/free capacity benefit only materialises under a per-sequence cache (kv_unified:false), since paged block ownership is keyed by stream; the unified cache collapses every slot onto one stream and the run stays correct but degenerates to a single bounded, stock-recycled pool. We do not flip kv_unified here, to keep the default serving behaviour and idle-slot prompt cache unchanged. No core llama.cpp patch: no engine bug was found under dynamic slot churn. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:33:32 +00:00
Ettore Di Giacinto	04e3d04ab8	build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on) Move the paged-attention patch series (0001-0004 + docs) into patches/paged/, applied behind a new LLAMA_PAGED build flag (default on). The base patches/ dir is now clean, so a dep-bump that breaks a paged hook can be unblocked with LLAMA_PAGED=off (clean-against-upstream build) and the paged carry fixed independently - decoupling the paged-KV maintenance from routine bumps without a separate backend. Both apply paths wired (Makefile git-apply + prepare.sh re-apply, flag passed through). Runtime stays gated by LLAMA_KV_PAGED env, so an on build is byte-identical to stock until that env is set. Glob/flag logic verified in bash. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:22:36 +00:00
Ettore Di Giacinto	4968cd8a94	paged-attn 0004: on-demand KV block allocation Wire the paged placement in find_slot through the vendored PagedKVManager (0001) instead of a fixed full-pool permutation. Blocks are popped from a free pool on demand as a sequence crosses block boundaries, and returned on sequence end (full seq_rm / clear). One manager per (kv-cache, stream); all state lives in a new src/paged-alloc unit keyed by a static registry, so the core kv-cache struct is untouched (find_slot/clear/seq_rm gain only a gated call). Default off; stock path byte-identical. Gate 0 (CPU, Qwen3-0.6B-Q8_0), LLAMA_KV_PAGED=1 token-identical vs stock: - single-stream llama-simple, 48 tok: identical - multi-stream driver, 3 seqs x 40 tok: identical Demand-driven confirmed via debug log: blocks grow 0->1->2->3->4 at logical positions 16/32/48 (peak 4 blocks vs 16-block budget), per stream independently. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:50:57 +00:00
Ettore Di Giacinto	37e0e1ef55	paged-attn 0003: lift gather-read to multi-stream The 0003 gather-read was single-stream only (GGML_ASSERT k->ne[3]==1). Lift it to N streams: one index column per stream over the unified batch, gathered with a single ggml_get_rows along the stream axis. Each column is position-sorted (preserving the flash-attn online-softmax reduction order that makes the read byte-identical) and padded to the max non-empty count across streams with a masked (empty) cell, which contributes exp(-inf)=0. Core touch stays additive: the one-line build_attn hook is unchanged; only the two kv-cache gather helpers (now per-stream) and src/paged-attn.cpp grow. Gate 0 (CPU, Qwen3-0.6B-Q8_0): a multi-sequence greedy driver (non-unified KV, k->ne[3]>1) is token-identical between stock (env unset) and LLAMA_KV_PAGED=1: 3 seqs x 40 tok, 2 seqs x 32 tok, 5 seqs x 32 tok all identical; single-stream llama-simple unchanged. Debug log confirms n_stream=3 engaged the multi path. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:46:12 +00:00
Ettore Di Giacinto	d9d846e04b	feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive Implements the paged-attention gather-read (the real engine compute): attention reads ONLY a sequence's used cells by gathering K, V and the kq_mask by the non-empty-cell index list before build_attn_mha. Verified token-identical to stock greedy generation, 9/9 across 3 prompts x {32,96,128} tokens on Qwen3-0.6B, with n_gather=71 < n_kv=256 confirming real compaction (not an identity no-op). Built in the additive "hook, don't edit" form: all logic in new src/paged-attn.{h,cpp} (an llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + one CMake line. No edit to llm_graph_input_attn_kv or llama-graph.h. 216 insertions; default-off behind LLAMA_KV_PAGED so stock path stays byte-identical. Key correctness finding: get_gather_idxs emits cells sorted by token position. CPU flash-attn's online softmax reduces cells in physical-array order and is FP-order- sensitive, so 0002's scattered placement alone (full-window read) diverges from stock past the first block; the position-sorted gather reproduces stock's exact reduction order -> bit-identical. So 0003 is what makes paged placement token-identical under flash-attn. Verified on a dev tree at the pin (0001+0002+0003 on branch paged); not pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:26:46 +00:00
Ettore Di Giacinto	84d59e659b	docs(paged): additive "hook, don't edit" layout for the patch series Maintainers rejected PR #22569 (the upstream paged draft) as "slop" - it rewrites core attention and is unvendorable. Our own series must be additive so it survives llama.cpp pin bumps. This documents the rule and the per-patch core-touch budget: every change is either new code in a new vendored src/ file, or a single env-gated hook at one call site that delegates to it - no logic in core files, no core struct edits. Grounds it in the pinned source: llm_graph_input_i is pure-virtual and res->add_input() lets a new file register a graph input, so paged behavior plugs in without editing core graph types. Redesigns 0003 (gather-read) from the old 4-file surgery to one build_attn hook + a new paged-attn.{h,cpp} (a gather-input subclass) + two thin cache accessors (~8 core lines vs a core-struct rewrite). 0005 lands entirely in LocalAI's grpc-server.cpp (no core patch). Dev tree at the pin with 0001+0002 applied is set up; 0003 implementation is the next focused token-identical Gate-0 block. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 07:28:44 +00:00
Ettore Di Giacinto	931793aa24	feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection Deliverables for pushing paged KV toward the real target (2xH200), since GB10 is only the test box and its "no win" result is a low-bandwidth artifact: 1. Correctness verified. test-paged-kv-e2e is greedy-equivalent to the contiguous reference (top-5 argmax ref=paged=3743, overlap 5/5). Found + fixed the blocking bug: common_fit_paged_kv_blocks over-reports free VRAM on GB10's unified device and tried 245GB of KV on a 119GB box, OOM-aborting context creation. Patch in patches/0002; durable fix (clamp to free_vram, honor --fit off) noted. 2. paged-loadgen.cpp: a dynamic-load benchmark that actually exercises where paging wins - variable prompt/gen lengths, continuous arrival, shared prefix - and reports the capacity ratio (contiguous reserve / paged peak KV). The stock tools run fixed-length all-at-once load, which is why they never show a paged win. 3. Projection to 2xH200, grounded in measured GB10 plateaus. Decode is bandwidth- bound, so the ceiling (~16k t/s for 32B) needs ~3,800 concurrent seqs, but contiguous KV fits only ~490 in HBM at 2k ctx - so KV memory IS the binding constraint on the target (unlike GB10), and paged KV's ~5-10x capacity (no over-reservation + prefix sharing) is what reaches the ceiling. The thesis holds on the target; remaining work is hardening/finishing the paged op (PR22569 was 12-13% slower and lacks prefix sharing). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 23:16:28 +00:00
Ettore Di Giacinto	0337505dc8	docs(paged): measure paged KV at high concurrency (LLAMA_MAX_SEQ=2048) - no single-GB10 win Closes the open question from PR22569_EVAL: that eval was blocked by the 256-seq compile cap and used a compute-bound 32B. Recompiled LLAMA_MAX_SEQ=2048 and swept a bandwidth-bound model (Qwen3-1.7B) to npl=2048, both KV layouts. Result: aggregate decode plateaus at the hardware ceiling for BOTH layouts - 1.7B flattens ~3200-3700 t/s by npl=512 (contiguous and paged alike), 32B-dense ~540 by npl=128. Pushing concurrency past the plateau collapses per-seq tps (23->1.9) and explodes TTFT (0.6s->64s) with no aggregate gain. Paged KV is a memory-capacity / anti-fragmentation / prefix-sharing feature, not a single-node throughput lever; the 24k aggregate is a fleet-level (multi-GPU) result, unreachable on one GB10 regardless of KV layout. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 22:47:20 +00:00
Ettore Di Giacinto	faeb5b457c	analysis: NVFP4 closes the decode gap too (547->619, ~93% of vLLM) Measured npl=128 cold A/B: NVFP4 decode 619 vs Q4_K 547 (+13%), closing the gap to vLLM (667) from ~22% to ~7%. NVFP4's FP4-MMA kernel is more bandwidth-efficient at the thin n=128 decode shape than Q4_K int8-MMQ (which ran 2.1x above the floor), so it IS the better int4 decode GEMM the diagnosis called for - no multi-day Marlin-for-K-quants needed. With NVFP4, llama.cpp on GB10 is ahead on prefill (1209 vs 800) and within ~7% on decode. Remaining 7% = optional FP4 kernel tuning. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 21:42:17 +00:00
Ettore Di Giacinto	6e0b910210	analysis: decode gap is GPU/kernel-bound, NOT host overhead (corrects premise) Rigorous re-measurement on pr24423: concurrent decode is GPU-compute-bound (~96% util, sampled), CUDA graphs ARE enabled at npl=128 (94/98 calls replay a captured graph; n_kv padded to 256 keeps topology stable), and graphs ON vs OFF is only +1.5% at npl=128. The earlier '20% GPU util / 170ms host' read was a windowing error (whole-run nsys vs decode-windowed). So no host/graph patch helps. The real 547->667 gap is the quantized DECODE GEMM: mul_mat_q (Q4_K/Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10 bandwidth floor (poorly tuned for the thin n=128 shape); vLLM's Marlin int4 runs closer. Lever = a Marlin-style int4 decode kernel for K-quants (or a Marlin-friendly int4 serving format), not host work. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 21:32:58 +00:00
Ettore Di Giacinto	aaf7b4112e	test(llama-cpp): NVFP4-dense FP4 quality+speed eval on GB10 NVFP4-dense is producible via --tensor-type attn=nvfp4 --tensor-type ffn=nvfp4 (GGML_TYPE_NVFP4 has a full quantize path; no top-level ftype needed). Clean-from-BF16 4B PPL: NVFP4 14.31 vs Q4_K 13.66 vs MXFP4 17.42 vs BF16 13.32 - Q4_K-class, not MXFP4-class. Prefill routes onto the FP4 MMA kernel (~1.29x Q4_K on 4B, within 5% of MXFP4). It is the quality-preserving FP4 win MXFP4 was not. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 18:44:57 +00:00
Ettore Di Giacinto	037ad82b7c	docs(paged): MXFP4-dense vs Q4_K quality gate on GB10 (do not recommend) Fair clean-source perplexity check on DGX Spark (GB10): quantize Qwen3-4B from one BF16 source to both Q4_K_M and MXFP4 (no imatrix, identical recipe). Q4_K_M is +2.6% PPL vs BF16; MXFP4-dense is +30.8% (+27.5% worse than Q4_K). The existing 32B MXFP4 was confirmed double-quant (Q4_K_M -> MXFP4 via --allow-requantize), but the clean 4B test shows the gap is intrinsic to the format, not the double-quant. Output stays coherent. Verdict: the ~1.58x prefill / ~1.2x decode win does not justify a Blackwell MXFP4-dense quality recommendation; keep Q4_K_M the dense default, pursue NVFP4 instead. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 17:25:14 +00:00
Ettore Di Giacinto	1887385b79	analysis: MXFP4-dense fails quality check (~27% worse PPL than Q4_K) - do not recommend Clean fair comparison (Qwen3-4B, all from same BF16 source, wikitext PPL): BF16 13.32, Q4_K_M 13.66 (+2.6%, near-lossless), MXFP4 17.42 (+30.8%). MXFP4 is ~27% worse than Q4_K even clean from BF16 (32B double-quant cross-check: 7.39 vs 8.46, +14.6%, same direction). MXFP4_MOE is built for MoE expert tensors; on dense attn/ffn it is far lossier than Q4_K's 6-bit superblock structure. The ~1.58x prefill is not worth ~27% PPL - Q4_K stays the dense default; FP4 only where the model is trained for it (MoE). Verdict: do NOT ship a Blackwell MXFP4-dense rec. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 17:24:24 +00:00
Ettore Di Giacinto	40ee9cdd13	docs(paged): evaluate llama.cpp PR #17004 (GPU/backend sampling) on GB10 PR #17004 is merged and already present in our pinned llama.cpp f3e1828. Measured on DGX Spark (GB10, sm_121, Qwen3-32B-Q4_K_M): - llama-batched-bench does no sampling (random tokens), so it cannot test the fix; its ~540 t/s plateau is not sampling-bound. - Real-sampling A/B via llama-batched (CPU vs -bs GPU sampler): +25% at np=32, +3% at np=64, GGML_ASSERT(obj_new) graph-alloc crash at np>=128. - nsys at np=64: GPU-busy time and kernel mix unchanged (392 vs 404 t/s); sampling kernels negligible. GPU utilization did not rise. Clean negative: the fix does not break the plateau toward the ~2700 ceiling or past vLLM 667, and is unusable at the multi-user parallelism in question. Adoption: code arrives via LLAMA_VERSION bump (prepare.sh vendors the modified upstream server-context.cpp), but grpc-server must set params.sampling.backend_sampling to enable it; grammar/tool-call/logprobs requests fall back to CPU. Defer adoption until #18547/#18550 stabilise it. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 15:44:21 +00:00
Ettore Di Giacinto	d6c91b7d62	analysis: finalize PR #22569 paged-KV eval (full detail + compute-bound note) Agent-finalized eval: builds (1-line Qwen3 reshape fix), but on GB10+32B paged is ~12% slower than contiguous and both cap at LLAMA_MAX_SEQ=256 (not OOM; 16GiB/119). Agent argues 32B is compute-bound + plateaus by npl=128 so raising the cap won't help - but 540 t/s << ~1900 bandwidth ceiling, so the plateau cause is unconfirmed (attention-over-KV or CPU sampling, not matmul saturation). Next: raise the cap + remeasure to settle it. Verdict: do not adopt #22569; paged KV not a GB10 lever. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 14:35:02 +00:00
Ettore Di Giacinto	92e93dfc34	analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever Full sweep, Qwen3-32B: contiguous decode 537/541 t/s at npl=128/256 (plateau); paged (#22569) 477/471 - SLOWER at matched concurrency. Both FAIL at npl=512/1024 with n_seq_max<=256 - paged does NOT bypass the LLAMA_MAX_SEQ=256 compile cap, its whole purpose. GB10's limit is the 256-seq cap + the ~540 decode plateau (flat by npl=128), NOT KV capacity/fragmentation (122 GB unified). Paged KV solves a problem GB10 doesn't have; it remains valid for memory-constrained datacenter GPUs (24-48GB) but must be validated there, not GB10. Do not adopt #22569; do not build paged KV for GB10. Real GB10 questions: the 256 cap (cheap) + the 540 plateau (vs vLLM 667). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 13:31:33 +00:00
Ettore Di Giacinto	fdb7f56bb7	docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the vendored llama.cpp server scheduler (update_slots) already implements chunked prefill with prefill/decode interleaving on the pinned version - decode tokens are seated first each iteration, prefill fills the leftover n_batch budget, both share one llama_decode. The draft upstream PR #10718 goal is already absorbed; no re-implementation needed. The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp (both set to nbatch()), which pins the logical scheduling window to the physical ubatch width. The plan scopes the decouple (C++ option + proto NUBatch + options.go), an optional decode-headroom prefill cap as a vendored patch, a token-identical verification harness, and keeps the work orthogonal to paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 12:54:22 +00:00
Ettore Di Giacinto	07985ba45b	analysis: measured llama.cpp aggregate vs vLLM - already ~75-80% at npl<=128 llama-batched-bench Qwen3-32B-Q4_K_M: aggregate decode 235/391/540 t/s at npl=32/64/128 vs vLLM 328/569/667 = 72/69/81%, multiplier 53x (vLLM 56x), still climbing at 128. The 30x headline is wrong at realistic concurrency: llama.cpp is ahead single-stream (MXFP4 1153 > 800) and ~75-80% aggregate. Aggregate prefill is flat ~760 but GB10-compute-capped (vLLM ~800 too), so chunked prefill is a latency/TTFT win not throughput; paged KV is the high-concurrency (thousands-seqs) lever for vLLM's 24k regime. ROI: MXFP4 ship -> chunked prefill -> paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 11:32:40 +00:00
Ettore Di Giacinto	fc589b3fad	analysis: vLLM GB10 advantage is the SCHEDULER, not the kernel (pivot) Code-grounded vLLM v0.23.0 analysis + DGX measurement: vLLM single-stream W4A16 prefill ~800 t/s (~52 TFLOPS) is TIED with llama.cpp MMQ (718/47), using the exact XOR-swizzle + 4-stage cp.async Marlin we proved collapses GB10 occupancy. vLLM has no FP4 cubins on sm_121 (forced W4A16 fallback), so llama.cpp MXFP4 (1153) already beats vLLM single-stream. vLLM's ~24k headline is the aggregate decode multiplier (~56x) from paged KV + chunked prefill + continuous batching - a scheduler win. llama.cpp lacks paged KV + chunked prefill. Kernel work (W4A16 178 t/s, FP4-MMA) banked as not-the-lever; effort pivots to the scheduler. Detail in VLLM_DECOMPOSITION.md; W4A16 plan marked STOPPED. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 07:09:42 +00:00
Ettore Di Giacinto	2b79083b71	feat(w4a16): grow tile to BN128/16w (q4_K +17%, pp512 148->178) P3b-2 for the Blackwell W4A16 Marlin GEMM. The q4_K dequant wall is partly cross-N-block-redundant: every N-block re-decodes the same weight strip, so halving the N-block count (BN 64->128) halves that redundant 6-bit superblock decode. A BN sweep showed this only pays off when BN is spread across more warps (16 warps, 8 m16n8 C-tiles/warp) rather than more fragments-per-warp - the FN=8 / FM=4 variants (16 C-tiles/warp) regressed to ~6.6 TFLOPS on register pressure. Shipping tile is now WM=4,WN=4,FM=2,FN=4 -> BM=128, BN=128, 16 warps. Thermally-bracketed cold A/B (q4_K n=512 / q4_0 n=512 via test-backend-ops perf; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M): BN64/8w (prev): 8.50 / 10.56 TFLOPS, measured 8.45/10.51 again (bracket) BN128/16w (this): 9.92 / 11.68 TFLOPS, pp512 177.6, pp2048 185.0 -> +17% q4_K, +11% q4_0, +20% pp512 vs the previous commit; +49% pp512 vs the original block-tiled kernel (119). Parity gate GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT = 1103/1103, flag set and unset (byte-identical when unset). Still ~4.7x under MMQ (47 TFLOPS) and does NOT beat MMQ; BN growth divides the redundant decode but cannot remove the per-k-step decode itself - the offline weight prepack remains the next unlock for q4_K. Plan doc P3 table + bottleneck notes updated. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 02:01:12 +00:00
Ettore Di Giacinto	2f648dc6a0	feat(w4a16): conflict-free skew-pad ldmatrix + BM128/8w tile (q4_K +28%, q4_0 +40%) P3b for the Blackwell (sm_120/121) W4A16 Marlin GEMM. Two combined changes over the prior block-tiled kernel, both verified by a thermally-bracketed cold A/B (committed measured identically before and after): - Skew-padded shared layout: store the staged weight/activation rows at a padded stride of 12 bf162 (8 data + 4 pad) and feed the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B). ldmatrix's per-lane address is rowstride; the natural stride 8 divides the 32-bank cycle and collides rows 0,4,8,12 (2-way bank conflict). Skewing to 12 (still 16-byte aligned) spreads {r12 mod 32} across 8 distinct bank-quads, so both ldmatrix halves are conflict-free at only +50% on the ~6 KB staged tile - unlike a 128-byte -row XOR swizzle, which is conflict-free but needs 16 KB shared and collapses occupancy on GB10 (measured 2.84 TFLOPS, worse than baseline). - Larger tile: BM=128, BN=64, 8 warps (WM=4,WN=2,FM=2,FN=4), which cuts the redundant per-M-block activation re-reads. Cold A/B (q4_K n=512 / q4_0 n=512 via test-backend-ops perf; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M): committed: 6.63 / 7.53 TFLOPS, pp512 119 this: 8.52 / 10.49 TFLOPS, pp512 148.5, pp2048 153.9 (+28% / +40% / +25%) Parity gate GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT = 1103/1103, flag set and unset (byte-identical when unset). Still ~5.5x under MMQ (47 TFLOPS) and does NOT beat MMQ yet; the q4_K limiter has now moved from the mma feed to the per-element 6-bit superblock dequant (q4_0 scales to 15.8 TFLOPS with more warps while q4_K stays ~8.5), so the offline weight prepack is the next unlock. Plan doc P3 section updated with the sweep data and the corrected bottleneck. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 01:15:07 +00:00
Ettore Di Giacinto	9973fa995a	feat(w4a16): P3 step 1 - block-tiled multi-warp Marlin GEMM (GB10) Replace the P2 1-warp-per-16x8 W4A16 kernel with a block-tiled multi-warp kernel: blockDim=(32, WMWN) so threadIdx.x is the warp lane (required by mma.cuh get_i/get_j) and threadIdx.y is the warp index. WMWN warps compute a BM(=WMFM16) x BN(=WNFN8) output tile, each warp owning an FM x FN grid of m16n8k16 BF16 mma fragments accumulated in F32. The BM x 16 dequantized Q4 weight strip is staged once per k-step in a small (~4 KB) shared buffer and reused across the block's whole BN span. Shipping config WM=2,WN=2,FM=2,FN=4. The P2 launch put all threads on threadIdx.x; with >1 warp that drove the mma tile get_j past the shared bound (out-of-bounds shared read, caught by compute-sanitizer). The new (32, nwarps) layout matches mmf.cu and fixes it. Parity gate holds 1103/1103 (test-backend-ops MUL_MAT CUDA0), flag set and unset (byte-identical when GGML_CUDA_W4A16 is unset; the seam returns false). Perf (q4_K m=4096 k=14336 n=512): ~2 TFLOPS (P2) -> ~7-9 TFLOPS (thermal dependent); llama-bench Qwen3-32B-Q4_K_M pp512 31.75 -> ~118-142 t/s. Still below the MMQ baseline (47 TFLOPS / 718 t/s): a tile sweep stayed flat and q4_0 vs q4_K differ by only ~12%, so dequant compute is not the limiter - the shared-load / mma-feed is. A naive double-buffered cp.async pipeline (32 KB shared) regressed via occupancy collapse and an ldmatrix swap was neutral (unswizzled layout bank-conflicts), both reverted. The path to >=150 TFLOPS is the full Marlin machinery (XOR-swizzled shared layout + offline weight reshuffle + tuned async pipeline + Stream-K), deferred to P3 step 4. See W4A16_MARLIN_KERNEL_PLAN.md for the per-step table and dead-end notes. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 23:36:58 +00:00
Ettore Di Giacinto	4de0c3b1b2	feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16 GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against BF16-converted f32 activations, f32 accumulate and write, reusing ggml's mma.cuh tile abstractions. Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32 activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous and f16-activation cases return false and fall back to MMQ so the gate stays green. M/N boundaries are zero-padded in-kernel. Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10): 1103/1103 passed; default flag-off build stays byte-identical 1103/1103. Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s (slow is expected for P2 - the naive single-warp kernel is the correctness checkpoint; P3 adds the cp.async pipeline and weight reshuffle). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 22:09:12 +00:00
Ettore Di Giacinto	9a71e81fc4	kernel: written subagent dispatch briefs for P3/P4/P5 Same strategy as P2: one fresh Opus-4.8 subagent per phase, each handed a complete zero-context brief, dispatched sequentially as each predecessor lands (P3 pipeline needs P2's correct kernel, P4 tune needs P3, P5 enable needs P4). Shared DGX/harness/commit boilerplate factored into a COMMON section; each phase brief carries its goal, incremental steps, acceptance gate, and a splice note for the prior phase's actual deliverable. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 22:01:18 +00:00
Ettore Di Giacinto	718b31d063	kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ) marlin-w4a16.{cuh,cu} + a gated hook in ggml_cuda_mul_mat (dense path), behind GGML_CUDA_W4A16 + sm_120/121 + Q4_0/Q4_K + f32. Returns false -> MMQ, so the default build is byte-identical. Verified on GB10: clean build, test-backend-ops MUL_MAT 1103/1103, llama-bench pp512 unchanged (717.77 default / 718.26 flagged), and GGML_CUDA_W4A16=1 reaches the seam ([w4a16] P1 warning) before falling back. Source + apply steps under kernel/w4a16/ (DGX checkout is volatile). The frame the P2 correctness kernel + P3 Marlin pipeline fill. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:46:38 +00:00
Ettore Di Giacinto	d291e15114	kernel(P0): record precise op-level baseline (q4_K n=512 = 47 TFLOPS, ~22% of ceiling) test-backend-ops perf MUL_MAT m=4096 k=14336: q4_K prefill (n=512) = 47.1 TFLOPS, q4_0 = 49.5; decode (n=1) = 761/817 GFLOPS (memory-bound). The prefill GEMM target is 47 -> ~213 TFLOPS (~4.5x). Cleaner per-shape target than end-to-end for kernel iteration. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:33:50 +00:00
Ettore Di Giacinto	dae2679c3b	kernel(P0): parity harness established + baseline (test-backend-ops 1103/1103 green) P0 done: test-backend-ops MUL_MAT on CUDA0 = 1103/1103 (CUDA vs CPU ref, covers Q4_0/Q4_K at m=4096,k=14336,n=1..512) - the correctness gate the W4A16 kernel must keep green. Baseline llama-bench dense Q4 prefill ~750 t/s (~46 TFLOP/s, ~21% of the 213 BF16 ceiling) - the number to beat toward ~3300. Reusable harness at ~/p0harness.sh (needed -DLLAMA_BUILD_TESTS=ON). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:29:21 +00:00
Ettore Di Giacinto	13e6ee89c7	kernel: validate cuBLAS dead-end (sm_80 fallback) + W4A16 Marlin impl plan Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile #ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048 690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121 GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is required. Added the phased W4A16 Marlin-style implementation plan (P0 harness -> P5 enable) as the committed multi-week build; corrected the cuBLAS note. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:16:13 +00:00
Ettore Di Giacinto	76cc0b6abc	docs(paged): phased plan to make llama.cpp a viable vLLM alternative Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch. Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill + n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend sampling. Cross-cutting: spec-dec for dense. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 09:35:53 +00:00
Ettore Di Giacinto	122df1c620	analysis: vLLM throughput gap decomposed - spec-dec is the per-user lever Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound). vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3 head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate (TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 08:40:20 +00:00
Ettore Di Giacinto	14e3da25b6	kernel: dense MXFP4 test = free 1.44x (765->1153) but FP4-MMA untuned (~17% of ceiling) MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17% of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 07:48:29 +00:00
Ettore Di Giacinto	f5e9caece1	kernel: reframed Blackwell kernel-gap map (research + profiles) Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s (BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is 2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4 MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 07:21:56 +00:00
Ettore Di Giacinto	d2651c86d9	bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to- apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 06:59:50 +00:00
Ettore Di Giacinto	19742aee64	bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - every flag lever exhausted Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel; no config/flag shortcut remains. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:59:27 +00:00
Ettore Di Giacinto	ce60737fc5	kernel(doc): dense scope resolved - two FP4 kernels (dense first, then grouped) Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both share the e2m1 block-scaled collective; dense is grouped-with-one-group. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:56:33 +00:00
Ettore Di Giacinto	37cbc089b0	bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (PP 7.6-32x) vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765, vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision: the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just the MoE grouped GEMM (dense GEMM is the simpler first kernel to land). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:55:58 +00:00
Ettore Di Giacinto	b7b2e8291c	kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3) The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37% prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven). Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu} entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation phases + parity harness + the dense-path follow-up (#28). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:44:31 +00:00
Ettore Di Giacinto	cb28deda6b	bench(paged): decode profile overturns 'engine-addressable' - decode is 54.6% MoE GEMM too Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:27:35 +00:00
Ettore Di Giacinto	2a500c371f	bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix). Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the engine's domain (0004 pool + 0005 continuous batching target it). Baseline for the series to improve on. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:20:22 +00:00
Ettore Di Giacinto	48fbb9384f	docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild, single-stream first Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:14:25 +00:00
Ettore Di Giacinto	145e45b6f2	docs(paged): exact executable plan for 0003 gather-read Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by one aligned index; n_kv compaction; gated so stock stays byte-identical) with the token-identical gate and the known risks (mask transpose layout, v_trans). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:12:18 +00:00
Ettore Di Giacinto	c4b4f3a3e4	docs(paged): series status 0001/0002 done+verified; honest parity note Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:05:14 +00:00
Ettore Di Giacinto	61ff738177	patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-identical find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy generation is token-identical to stock (verified on Qwen3-0.6B at the pin), branch confirmed firing. Default off. The placement substrate for the gather-read. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:04:28 +00:00
Ettore Di Giacinto	ce48cc0751	patch(paged) 0001: vendor PagedKVManager into llama.cpp src First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the CPU-verified vLLM-parity block manager) + CMake entry. No behavior change. Generated against the pinned LLAMA_VERSION; applies clean. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:55:22 +00:00
Ettore Di Giacinto	ba3fa5a633	build(paged): stacking patch-series scaffolding for llama.cpp paged attention Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one small, independently-buildable patch so the work rebases cleanly across llama.cpp bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix caching) + the regen workflow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:53:20 +00:00
Ettore Di Giacinto	62f0ae17e3	docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 is from-scratch No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/ draft); CUTLASS not a dep; no fork has one; activation-quant gather already fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months, maintainers deferring to cuTile). No tractable patch closes the 27x. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:44:11 +00:00
Ettore Di Giacinto	b14214620c	docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled) static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:32:02 +00:00
Ettore Di Giacinto	1449b806ab	docs(paged): Lever-3 + paged-attention implementation plans + upstream ggml issue draft Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:28:28 +00:00
Ettore Di Giacinto	9f16a907be	docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler. Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:56:46 +00:00

1 2 3 4 5 ...

442 Commits