LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	92e93dfc34	analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever Full sweep, Qwen3-32B: contiguous decode 537/541 t/s at npl=128/256 (plateau); paged (#22569) 477/471 - SLOWER at matched concurrency. Both FAIL at npl=512/1024 with n_seq_max<=256 - paged does NOT bypass the LLAMA_MAX_SEQ=256 compile cap, its whole purpose. GB10's limit is the 256-seq cap + the ~540 decode plateau (flat by npl=128), NOT KV capacity/fragmentation (122 GB unified). Paged KV solves a problem GB10 doesn't have; it remains valid for memory-constrained datacenter GPUs (24-48GB) but must be validated there, not GB10. Do not adopt #22569; do not build paged KV for GB10. Real GB10 questions: the 256 cap (cheap) + the 540 plateau (vs vLLM 667). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 13:31:33 +00:00
Ettore Di Giacinto	fdb7f56bb7	docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple Add CHUNKED_PREFILL_PLAN.md for the llama.cpp backend. Key finding: the vendored llama.cpp server scheduler (update_slots) already implements chunked prefill with prefill/decode interleaving on the pinned version - decode tokens are seated first each iteration, prefill fills the leftover n_batch budget, both share one llama_decode. The draft upstream PR #10718 goal is already absorbed; no re-implementation needed. The real LocalAI gap is the n_batch/n_ubatch coupling at grpc-server.cpp (both set to nbatch()), which pins the logical scheduling window to the physical ubatch width. The plan scopes the decouple (C++ option + proto NUBatch + options.go), an optional decode-headroom prefill cap as a vendored patch, a token-identical verification harness, and keeps the work orthogonal to paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 12:54:22 +00:00
Ettore Di Giacinto	07985ba45b	analysis: measured llama.cpp aggregate vs vLLM - already ~75-80% at npl<=128 llama-batched-bench Qwen3-32B-Q4_K_M: aggregate decode 235/391/540 t/s at npl=32/64/128 vs vLLM 328/569/667 = 72/69/81%, multiplier 53x (vLLM 56x), still climbing at 128. The 30x headline is wrong at realistic concurrency: llama.cpp is ahead single-stream (MXFP4 1153 > 800) and ~75-80% aggregate. Aggregate prefill is flat ~760 but GB10-compute-capped (vLLM ~800 too), so chunked prefill is a latency/TTFT win not throughput; paged KV is the high-concurrency (thousands-seqs) lever for vLLM's 24k regime. ROI: MXFP4 ship -> chunked prefill -> paged KV. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 11:32:40 +00:00
Ettore Di Giacinto	fc589b3fad	analysis: vLLM GB10 advantage is the SCHEDULER, not the kernel (pivot) Code-grounded vLLM v0.23.0 analysis + DGX measurement: vLLM single-stream W4A16 prefill ~800 t/s (~52 TFLOPS) is TIED with llama.cpp MMQ (718/47), using the exact XOR-swizzle + 4-stage cp.async Marlin we proved collapses GB10 occupancy. vLLM has no FP4 cubins on sm_121 (forced W4A16 fallback), so llama.cpp MXFP4 (1153) already beats vLLM single-stream. vLLM's ~24k headline is the aggregate decode multiplier (~56x) from paged KV + chunked prefill + continuous batching - a scheduler win. llama.cpp lacks paged KV + chunked prefill. Kernel work (W4A16 178 t/s, FP4-MMA) banked as not-the-lever; effort pivots to the scheduler. Detail in VLLM_DECOMPOSITION.md; W4A16 plan marked STOPPED. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 07:09:42 +00:00
Ettore Di Giacinto	2b79083b71	feat(w4a16): grow tile to BN128/16w (q4_K +17%, pp512 148->178) P3b-2 for the Blackwell W4A16 Marlin GEMM. The q4_K dequant wall is partly cross-N-block-redundant: every N-block re-decodes the same weight strip, so halving the N-block count (BN 64->128) halves that redundant 6-bit superblock decode. A BN sweep showed this only pays off when BN is spread across more warps (16 warps, 8 m16n8 C-tiles/warp) rather than more fragments-per-warp - the FN=8 / FM=4 variants (16 C-tiles/warp) regressed to ~6.6 TFLOPS on register pressure. Shipping tile is now WM=4,WN=4,FM=2,FN=4 -> BM=128, BN=128, 16 warps. Thermally-bracketed cold A/B (q4_K n=512 / q4_0 n=512 via test-backend-ops perf; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M): BN64/8w (prev): 8.50 / 10.56 TFLOPS, measured 8.45/10.51 again (bracket) BN128/16w (this): 9.92 / 11.68 TFLOPS, pp512 177.6, pp2048 185.0 -> +17% q4_K, +11% q4_0, +20% pp512 vs the previous commit; +49% pp512 vs the original block-tiled kernel (119). Parity gate GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT = 1103/1103, flag set and unset (byte-identical when unset). Still ~4.7x under MMQ (47 TFLOPS) and does NOT beat MMQ; BN growth divides the redundant decode but cannot remove the per-k-step decode itself - the offline weight prepack remains the next unlock for q4_K. Plan doc P3 table + bottleneck notes updated. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 02:01:12 +00:00
Ettore Di Giacinto	2f648dc6a0	feat(w4a16): conflict-free skew-pad ldmatrix + BM128/8w tile (q4_K +28%, q4_0 +40%) P3b for the Blackwell (sm_120/121) W4A16 Marlin GEMM. Two combined changes over the prior block-tiled kernel, both verified by a thermally-bracketed cold A/B (committed measured identically before and after): - Skew-padded shared layout: store the staged weight/activation rows at a padded stride of 12 bf162 (8 data + 4 pad) and feed the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B). ldmatrix's per-lane address is row*stride; the natural stride 8 divides the 32-bank cycle and collides rows 0,4,8,12 (2-way bank conflict). Skewing to 12 (still 16-byte aligned) spreads {r*12 mod 32} across 8 distinct bank-quads, so both ldmatrix halves are conflict-free at only +50% on the ~6 KB staged tile - unlike a 128-byte-row XOR swizzle, which is conflict-free but needs 16 KB shared and collapses occupancy on GB10 (measured 2.84 TFLOPS, worse than baseline). - Larger tile: BM=128, BN=64, 8 warps (WM=4,WN=2,FM=2,FN=4), which cuts the redundant per-M-block activation re-reads. Cold A/B (q4_K n=512 / q4_0 n=512 via test-backend-ops perf; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M): committed: 6.63 / 7.53 TFLOPS, pp512 119 this: 8.52 / 10.49 TFLOPS, pp512 148.5, pp2048 153.9 (+28% / +40% / +25%) Parity gate GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT = 1103/1103, flag set and unset (byte-identical when unset). Still ~5.5x under MMQ (47 TFLOPS) and does NOT beat MMQ yet; the q4_K limiter has now moved from the mma feed to the per-element 6-bit superblock dequant (q4_0 scales to 15.8 TFLOPS with more warps while q4_K stays ~8.5), so the offline weight prepack is the next unlock. Plan doc P3 section updated with the sweep data and the corrected bottleneck. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-21 01:15:07 +00:00
Ettore Di Giacinto	9973fa995a	feat(w4a16): P3 step 1 - block-tiled multi-warp Marlin GEMM (GB10) Replace the P2 1-warp-per-16x8 W4A16 kernel with a block-tiled multi-warp kernel: blockDim=(32, WM*WN) so threadIdx.x is the warp lane (required by mma.cuh get_i/get_j) and threadIdx.y is the warp index. WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile, each warp owning an FM x FN grid of m16n8k16 BF16 mma fragments accumulated in F32. The BM x 16 dequantized Q4 weight strip is staged once per k-step in a small (~4 KB) shared buffer and reused across the block's whole BN span. Shipping config WM=2,WN=2,FM=2,FN=4. The P2 launch put all threads on threadIdx.x; with >1 warp that drove the mma tile get_j past the shared bound (out-of-bounds shared read, caught by compute-sanitizer). The new (32, nwarps) layout matches mmf.cu and fixes it. Parity gate holds 1103/1103 (test-backend-ops MUL_MAT CUDA0), flag set and unset (byte-identical when GGML_CUDA_W4A16 is unset; the seam returns false). Perf (q4_K m=4096 k=14336 n=512): ~2 TFLOPS (P2) -> ~7-9 TFLOPS (thermal dependent); llama-bench Qwen3-32B-Q4_K_M pp512 31.75 -> ~118-142 t/s. Still below the MMQ baseline (47 TFLOPS / 718 t/s): a tile sweep stayed flat and q4_0 vs q4_K differ by only ~12%, so dequant compute is not the limiter - the shared-load / mma-feed is. A naive double-buffered cp.async pipeline (32 KB shared) regressed via occupancy collapse and an ldmatrix swap was neutral (unswizzled layout bank-conflicts), both reverted. The path to >=150 TFLOPS is the full Marlin machinery (XOR-swizzled shared layout + offline weight reshuffle + tuned async pipeline + Stream-K), deferred to P3 step 4. See W4A16_MARLIN_KERNEL_PLAN.md for the per-step table and dead-end notes. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 23:36:58 +00:00
Ettore Di Giacinto	4de0c3b1b2	feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel Replace the P1 dispatch-seam TODO in marlin-w4a16.cu with a real W4A16 GEMM for consumer Blackwell (sm_120/121). In-kernel dequant of Q4 weights to BF16, mma.sync m16n8k16 f32.bf16.bf16.f32 tensor-core multiply against BF16-converted f32 activations, f32 accumulate and write, reusing ggml's mma.cuh tile abstractions. Handles the contiguous 2D GEMM prefill path for Q4_0 and Q4_K (f32 activations, ne2==ne3==1); batched, broadcast, permuted, non-contiguous and f16-activation cases return false and fall back to MMQ so the gate stays green. M/N boundaries are zero-padded in-kernel. Parity gate (GGML_CUDA_W4A16=1 test-backend-ops MUL_MAT on GB10): 1103/1103 passed; default flag-off build stays byte-identical 1103/1103. Model sanity: Qwen3-32B-Q4_K_M llama-bench pp512 31.75 t/s (slow is expected for P2 - the naive single-warp kernel is the correctness checkpoint; P3 adds the cp.async pipeline and weight reshuffle). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 22:09:12 +00:00
Ettore Di Giacinto	9a71e81fc4	kernel: written subagent dispatch briefs for P3/P4/P5 Same strategy as P2: one fresh Opus-4.8 subagent per phase, each handed a complete zero-context brief, dispatched sequentially as each predecessor lands (P3 pipeline needs P2's correct kernel, P4 tune needs P3, P5 enable needs P4). Shared DGX/harness/commit boilerplate factored into a COMMON section; each phase brief carries its goal, incremental steps, acceptance gate, and a splice note for the prior phase's actual deliverable. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 22:01:18 +00:00
Ettore Di Giacinto	718b31d063	kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ) marlin-w4a16.{cuh,cu} + a gated hook in ggml_cuda_mul_mat (dense path), behind GGML_CUDA_W4A16 + sm_120/121 + Q4_0/Q4_K + f32. Returns false -> MMQ, so the default build is byte-identical. Verified on GB10: clean build, test-backend-ops MUL_MAT 1103/1103, llama-bench pp512 unchanged (717.77 default / 718.26 flagged), and GGML_CUDA_W4A16=1 reaches the seam ([w4a16] P1 warning) before falling back. Source + apply steps under kernel/w4a16/ (DGX checkout is volatile). The frame the P2 correctness kernel + P3 Marlin pipeline fill. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:46:38 +00:00
Ettore Di Giacinto	d291e15114	kernel(P0): record precise op-level baseline (q4_K n=512 = 47 TFLOPS, ~22% of ceiling) test-backend-ops perf MUL_MAT m=4096 k=14336: q4_K prefill (n=512) = 47.1 TFLOPS, q4_0 = 49.5; decode (n=1) = 761/817 GFLOPS (memory-bound). The prefill GEMM target is 47 -> ~213 TFLOPS (~4.5x). Cleaner per-shape target than end-to-end for kernel iteration. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:33:50 +00:00
Ettore Di Giacinto	dae2679c3b	kernel(P0): parity harness established + baseline (test-backend-ops 1103/1103 green) P0 done: test-backend-ops MUL_MAT on CUDA0 = 1103/1103 (CUDA vs CPU ref, covers Q4_0/Q4_K at m=4096,k=14336,n=1..512) - the correctness gate the W4A16 kernel must keep green. Baseline llama-bench dense Q4 prefill ~750 t/s (~46 TFLOP/s, ~21% of the 213 BF16 ceiling) - the number to beat toward ~3300. Reusable harness at ~/p0harness.sh (needed -DLLAMA_BUILD_TESTS=ON). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:29:21 +00:00
Ettore Di Giacinto	13e6ee89c7	kernel: validate cuBLAS dead-end (sm_80 fallback) + W4A16 Marlin impl plan Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile #ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048 690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121 GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is required. Added the phased W4A16 Marlin-style implementation plan (P0 harness -> P5 enable) as the committed multi-week build; corrected the cuBLAS note. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 21:16:13 +00:00
Ettore Di Giacinto	76cc0b6abc	docs(paged): phased plan to make llama.cpp a viable vLLM alternative Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch. Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill + n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend sampling. Cross-cutting: spec-dec for dense. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 09:35:53 +00:00
Ettore Di Giacinto	122df1c620	analysis: vLLM throughput gap decomposed - spec-dec is the per-user lever Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound). vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3 head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate (TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 08:40:20 +00:00
Ettore Di Giacinto	14e3da25b6	kernel: dense MXFP4 test = free 1.44x (765->1153) but FP4-MMA untuned (~17% of ceiling) MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17% of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 07:48:29 +00:00
Ettore Di Giacinto	f5e9caece1	kernel: reframed Blackwell kernel-gap map (research + profiles) Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s (BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is 2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4 MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 07:21:56 +00:00
Ettore Di Giacinto	d2651c86d9	bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to-apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 06:59:50 +00:00
Ettore Di Giacinto	19742aee64	bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - every flag lever exhausted Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel; no config/flag shortcut remains. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:59:27 +00:00
Ettore Di Giacinto	ce60737fc5	kernel(doc): dense scope resolved - two FP4 kernels (dense first, then grouped) Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both share the e2m1 block-scaled collective; dense is grouped-with-one-group. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:56:33 +00:00
Ettore Di Giacinto	37cbc089b0	bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (PP 7.6-32x) vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765, vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision: the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just the MoE grouped GEMM (dense GEMM is the simpler first kernel to land). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:55:58 +00:00
Ettore Di Giacinto	b7b2e8291c	kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3) The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37% prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven). Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu} entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation phases + parity harness + the dense-path follow-up (#28). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:44:31 +00:00
Ettore Di Giacinto	cb28deda6b	bench(paged): decode profile overturns 'engine-addressable' - decode is 54.6% MoE GEMM too Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:27:35 +00:00
Ettore Di Giacinto	2a500c371f	bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix). Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the engine's domain (0004 pool + 0005 continuous batching target it). Baseline for the series to improve on. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:20:22 +00:00
Ettore Di Giacinto	48fbb9384f	docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild, single-stream first Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:14:25 +00:00
Ettore Di Giacinto	145e45b6f2	docs(paged): exact executable plan for 0003 gather-read Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by one aligned index; n_kv compaction; gated so stock stays byte-identical) with the token-identical gate and the known risks (mask transpose layout, v_trans). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:12:18 +00:00
Ettore Di Giacinto	c4b4f3a3e4	docs(paged): series status 0001/0002 done+verified; honest parity note Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:05:14 +00:00
Ettore Di Giacinto	61ff738177	patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-identical find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy generation is token-identical to stock (verified on Qwen3-0.6B at the pin), branch confirmed firing. Default off. The placement substrate for the gather-read. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:04:28 +00:00
Ettore Di Giacinto	ce48cc0751	patch(paged) 0001: vendor PagedKVManager into llama.cpp src First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the CPU-verified vLLM-parity block manager) + CMake entry. No behavior change. Generated against the pinned LLAMA_VERSION; applies clean. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:55:22 +00:00
Ettore Di Giacinto	ba3fa5a633	build(paged): stacking patch-series scaffolding for llama.cpp paged attention Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one small, independently-buildable patch so the work rebases cleanly across llama.cpp bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix caching) + the regen workflow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:53:20 +00:00
Ettore Di Giacinto	62f0ae17e3	docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 is from-scratch No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/draft); CUTLASS not a dep; no fork has one; activation-quant gather already fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months, maintainers deferring to cuTile). No tractable patch closes the 27x. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:44:11 +00:00
Ettore Di Giacinto	b14214620c	docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled) static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:32:02 +00:00
Ettore Di Giacinto	1449b806ab	docs(paged): Lever-3 + paged-attention implementation plans + upstream ggml issue draft Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:28:28 +00:00
Ettore Di Giacinto	9f16a907be	docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler. Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:56:46 +00:00
Ettore Di Giacinto	aba0bfd24f	feat(backend): auto-default physical batch to 2048 on Blackwell GPUs On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger physical batch (n_ubatch) materially lifts MoE prefill throughput - measured on a GB10 with Qwen3-30B-A3B to lift the prefill ceiling and saturate at ~2048. When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on Blackwell instead of 512; explicit `batch:` always overrides. Detection is a shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap >= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at the common ModelOptions builder, so it covers the C++ llama.cpp backend too. Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316 t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is prefill via Blackwell FP4 tensor cores. Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the no-Blackwell branch for determinism. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:46:45 +00:00
Ettore Di Giacinto	7aa61d4c32	docs(paged): DGX Blackwell gap analysis + lever plan (living doc) Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8 MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the lever plan. Measured wins: - Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48), near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap). - Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch. Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8 GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and the complete paged attention (on-demand alloc + gather-read + continuous batching + prefix sharing). Honest scope: each is multi-week kernel/systems work. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 20:15:14 +00:00
Ettore Di Giacinto	bbc84a9889	feat(paged): Gate 0 in-model - token-identical generation with paged KV placement Wire paged, non-contiguous fixed-size BLOCK placement into the real llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to the contiguous cache while its KV is physically scattered across permuted blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via LLAMA_KV_PAGED_DEBUG, not a silent fallback. This retires the correctness premise of paged attention IN THE MODEL (not just at the ggml-op level): attention is invariant to physical KV placement, because reads use per-cell pos/seq metadata for masking. The patch lives at patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f). Scope: storage/placement layer, single sequence. Remaining (P4): the gather-read compute path (attend only a seq's own blocks) for the throughput win, and the multi-sequence driver. README updated with repro + status. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:51:42 +00:00
Ettore Di Giacinto	3ed3279739	docs(paged): status + integration map for in-model Gate 0 Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact remaining work: wire build_attn_paged into llama-graph.cpp and validate token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput. Records the integration seams (create_memory, find_slot, get_k/get_v, build_attn, mask) and the honest caveats (unified cache already shares a pool; vLLM's classic kernel is deprecated) so the next session starts warm. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:45:51 +00:00
Ettore Di Giacinto	ddace5fb6a	feat(paged): paged-bench - measure capacity & prefix-sharing wins Quantify the two multi-tenant wins that are properties of the host-side block model (vLLM-parity), independent of the in-model compute path: WIN 1 concurrency capacity @ 512-block budget contiguous (reserve n_ctx/seq): 4 sequences paged (on-demand blocks): 37 sequences --> 9.2x more concurrent sequences WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix) prefix-cache OFF: 2176 physical blocks prefix-cache ON: 192 physical blocks --> 11.3x less KV memory WIN 2 (throughput) is deliberately reported as PENDING: it requires the paged gather-read path wired into llama-graph.cpp (Gate 0) and is not measurable at the allocation layer. The win-1 baseline is per-sequence n_ctx reservation (stream mode); llama.cpp's unified cache already shares one pool, so the honest win there is on-demand sizing + prefix dedup. Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:44:41 +00:00
Ettore Di Giacinto	5a5d3df8c8	feat(paged): Phase 2 core - attention over paged KV matches reference Retire the central numeric risk from the design: feeding gather-to-scratch KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5]) into ggml's standard attention ops produces correct attention. Path under test: set_rows write -> get_rows gather (K and V) -> mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared against an independent host-computed softmax attention over the same K/V/Q. Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4). This proves the paged read path is numerically sound on CPU with no new ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate Gate 0 (token-identical greedy generation in a real model). Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:35:35 +00:00
Ettore Di Giacinto	c6698dd4bf	feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU) Validate the paged KV read/write path at the ggml-op level, driven by PagedKVManager: - write: ggml_set_rows(pool, k_src, slot_mapping) scatter K rows by slot - read: ggml_get_rows(pool, gather_idx) gather a seq's slots into contiguous scratch (the tensor an attention kernel consumes) The test forces a non-contiguous, out-of-order physical block layout (allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and proves gather(write(x)) == x plus cross-sequence isolation in the shared pool. This de-risks the central question (does slot-addressed paged storage round-trip correctly through ggml) before the llama-graph integration. Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring how llama.cpp allocates its KV cache. CPU backend, no new ggml op. Built against ggml from the vendored llama.cpp checkout. Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:33:26 +00:00
Ettore Di Giacinto	edb1a11abc	feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype) Host-side paged-attention block manager ported faithfully from vLLM V1 (block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py): - KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal) - BlockPool: get_new_blocks / touch / free_blocks eviction ordering / cache_full_blocks / lazy eviction on reuse - PagedKVManager: on-demand allocate, block_table, slot arithmetic (slot = block_id*block_size + offset), free - Prefix caching: chained block hashing + find_longest_cache_hit (first-miss stop), enabling automatic cross-tenant prefix sharing Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes. Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness, benchmark wins, prefix-share serving) follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 08:26:31 +00:00
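The slot arithmetic quoted in the Phase 0 message (slot = block_id*block_size + offset) can be made concrete with a minimal sketch. `BlockTable` here is a hypothetical illustration, not the `PagedKVManager` from the commit, and the block size of 16 used in the usage note is an assumed value.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-sequence block table: physical block ids in logical order.
struct BlockTable {
    std::size_t block_size;            // tokens per block (assumed value)
    std::vector<std::int32_t> blocks;  // blocks[k] = physical id of logical block k

    // Physical slot of the token at logical position pos:
    // slot = block_id * block_size + offset.
    std::int64_t slot(std::size_t pos) const {
        const std::size_t logical_block = pos / block_size;
        const std::size_t offset = pos % block_size;
        return static_cast<std::int64_t>(blocks.at(logical_block)) *
                   static_cast<std::int64_t>(block_size) +
               static_cast<std::int64_t>(offset);
    }
};

// Slot mapping for the first n_tokens positions of the sequence.
std::vector<std::int64_t> slot_mapping(const BlockTable& table, std::size_t n_tokens) {
    std::vector<std::int64_t> slots(n_tokens);
    for (std::size_t pos = 0; pos < n_tokens; ++pos) slots[pos] = table.slot(pos);
    return slots;
}
```

For a table `{16, {2, 1, 5}}`, position 0 maps to slot 32, position 16 to slot 16, and position 40 to slot 88: consecutive tokens stay contiguous within a block while the blocks themselves can sit anywhere in the pool.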
LocalAI [bot]	4ad754eea3	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `b3dfb7858cfcb9166e92f366e5af87f19ebc94be` (#10395) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-19 00:03:37 +02:00
Richard Palethorpe	3fa7b2955c	feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360) Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see backup/pii-ner-tier-engine-prerebase). Net change: - privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan). TokenClassify moves off the patched llama.cpp path onto this backend. - PII filter reworked to be NER-centric (encoder/NER detection tier scanning whole conversations as one document), with a recreated bounded restricted-regex secret-matching pattern detector tier alongside it (per-model pii_detection.builtins / .patterns + core/services/routing/piipattern). - Detection labelled by source (ner vs pattern); backend trace / confidence / debug observability; analyze/redact exposed as a synchronous API. - Instance-wide default detector policy + per-usecase default-on; request filtering extended to completions, embeddings, edits & Ollama. - React UI: NER-centric PII editor, detector-models table, pattern/builtins editor, middleware default-policy UI. - Gallery: privacy-filter-multilingual token-classify model + NER install filter; token_classify known_usecase; batch sized to context for NER models. privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13 meta + image entries with a capabilities map) matching its CI matrix jobs, and an /import-model auto-detect importer (PrivacyFilterImporter, narrow privacy-filter GGUF detection) replacing the prior pref-only registration. Reconciled against master's independent evolution: - Dropped master's PIIPatternOverrides feature (global-pattern runtime overrides + /api/pii/patterns API + runtime_settings.json persistence). The per-model NER + pattern-detector design supersedes it; it was built on the global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score): removed the patch and restored master's grpc-server.cpp Score RPC (direct llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's model_config validation forbidding score + chat/completion/embeddings on llama-cpp. token_classify is unaffected (it runs on the privacy-filter backend, not llama-cpp). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-18 11:45:22 +01:00
LocalAI [bot]	c133ca39dc	chore: ⬆️ Update ggml-org/llama.cpp to `f3e182816421c648188b5eab269853bf1531d950` (#10379) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-18 11:43:23 +02:00
LocalAI [bot]	5c2ae7857a	chore: ⬆️ Update antirez/ds4 to `80ebbc396aee40eedc1d829222f3362d10fa4c6c` (#10378) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-18 00:32:13 +02:00
LocalAI [bot]	4af360300f	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `71af16a6b7f6fb7315b346b4a51aad530599c3f5` (#10381) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-18 00:12:25 +02:00
LocalAI [bot]	95e7149c87	chore: ⬆️ Update ggml-org/llama.cpp to `74ade52741203e5c8f81eaf06a96cb1cfe15f2a3` (#10368) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-17 13:25:29 +02:00
LocalAI [bot]	fd26c8c753	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `064d23a6f816d50491d8c9b35a0cafe546eaf4b5` (#10367) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-17 13:25:14 +02:00
LocalAI [bot]	e60c094a7d	feat(ds4): SSD streaming + quality engine options, 128GB DeepSeek gallery models (#10374) feat(ds4): wire SSD streaming + quality engine options, add 128GB DeepSeek gallery models The ds4 backend zero-initialized ds4_engine_options and exposed none of the engine's tunable knobs, so SSD streaming (run a model larger than RAM by streaming routed MoE experts from the GGUF on SSD) and the quality/perf knobs were unreachable from LocalAI model YAMLs. Map ModelOptions.Options onto ds4_engine_options through a declarative table (kEngineOptSpecs + apply_engine_option) instead of per-field branches: the struct is fixed C with no reflection, so the field set is enumerated once and a future knob is a one-line table row. Two fields use ds4's own typed parsers (GiB budgets, cache-experts count-or-NGB). Bare flags (e.g. "ssd_streaming") mean true; path-type options (mtp_path, expert_profile_path, directional_steering_file) resolve relative to the model directory so a gallery entry can reference a companion file by bare filename. mtp_draft/mtp_margin are now validated rather than parsed with throwing std::stoi/std::stof. Add gallery entries for the 128 GB class: - deepseek-v4-flash-q2-q4 (~91 GB, mixed q2/q4, fits RAM, higher quality) - deepseek-v4-flash-q4-ssd (~153 GB full 4-bit, runs on 128 GB via SSD streaming) - deepseek-v4-flash-q2-mtp (~81 GB + MTP speculative draft weights) - deepseek-v4-pro-q2-ssd (~433 GB Pro, experimental SSD streaming) SSD streaming is Metal (Darwin) only; the options are inert on CUDA/CPU. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-17 10:30:06 +02:00
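The declarative-table approach described in the ds4 commit (one row per option instead of per-field branches, bare flags meaning true, numeric values validated rather than parsed with throwing calls) might look roughly like the sketch below. Everything here is hypothetical: `EngineOptions` is a made-up stand-in for `ds4_engine_options`, `kSpecs` and `apply_option` are not the commit's `kEngineOptSpecs` / `apply_engine_option`, and the `name:value` separator is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for a fixed C options struct (fields are illustrative).
struct EngineOptions {
    bool  ssd_streaming;
    int   mtp_draft;
    float mtp_margin;
};

enum class OptKind { Bool, Int, Float };

// One table row per option: name, value type, and field offset in the struct.
struct OptSpec {
    const char* name;
    OptKind     kind;
    std::size_t offset;
};

static const OptSpec kSpecs[] = {
    {"ssd_streaming", OptKind::Bool,  offsetof(EngineOptions, ssd_streaming)},
    {"mtp_draft",     OptKind::Int,   offsetof(EngineOptions, mtp_draft)},
    {"mtp_margin",    OptKind::Float, offsetof(EngineOptions, mtp_margin)},
};

// Apply one "name" or "name:value" option; returns false for an unknown
// name or a value that does not parse completely. Rejected options leave
// the struct unchanged.
bool apply_option(EngineOptions& opts, const std::string& opt) {
    const std::size_t sep = opt.find(':');
    const bool bare = (sep == std::string::npos);
    const std::string name = opt.substr(0, sep);
    const std::string value = bare ? "" : opt.substr(sep + 1);
    char* base = reinterpret_cast<char*>(&opts);
    for (const OptSpec& spec : kSpecs) {
        if (name != spec.name) continue;
        char* end = nullptr;
        switch (spec.kind) {
        case OptKind::Bool: {
            bool* field = reinterpret_cast<bool*>(base + spec.offset);
            if (bare || value == "true") { *field = true; return true; }
            if (value == "false") { *field = false; return true; }
            return false;
        }
        case OptKind::Int: {
            if (value.empty()) return false;
            const long v = std::strtol(value.c_str(), &end, 10);
            if (*end != '\0') return false;  // trailing junk: reject
            *reinterpret_cast<int*>(base + spec.offset) = static_cast<int>(v);
            return true;
        }
        case OptKind::Float: {
            if (value.empty()) return false;
            const float v = std::strtof(value.c_str(), &end);
            if (*end != '\0') return false;
            *reinterpret_cast<float*>(base + spec.offset) = v;
            return true;
        }
        }
    }
    return false;
}
```

Under these assumptions, adding a knob means adding one row to `kSpecs`. The parsing logic stays untouched, which is the maintenance benefit the commit message points to.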

601 Commits