LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	4bc2b4a9b2	feat(paged): add patch 0013 decoupled per-step prefill-token budget Mirror of the dev-tree paged scheduler patch into the llama.cpp backend's vendored patch series. Adds LLAMA_PREFILL_BUDGET, a per-step prefill-token budget for the inherited update_slots() scheduler, decoupled from n_batch (the analogue of vLLM's --max-num-batched-tokens). It caps how many prompt tokens a single update_slots() step ingests, splitting a long prefill across more steps so co-batched decode keeps advancing instead of freezing for the duration of one fat ~n_batch prefill chunk. Default (env unset or <= 0) = disabled, so stock behaviour is byte-identical; orthogonal to LLAMA_KV_PAGED. Measured on GB10 (dense Qwen3-32B-NVFP4, 8 steady decoders + one injected 6000-token prefill, same binary, only the env differs): worst decode freeze 3380 -> 482 ms (7.0x) and decode_stall 3285 -> 387 ms (8.5x) at budget=256, for a +20% TTFT on the long request; budget=512 gives 4.8x at ~no TTFT cost. This is a latency/fairness lever, not an aggregate-throughput lever (steady decode is NVFP4 weight-read-bound on GB10, which the scheduler cannot lift). Correctness: budget unset or >= n_batch is byte-identical to stock; budget=N is byte-identical to stock -bN while preserving n_batch for decode width; the only deviation on long prompts is intrinsic flash-attn chunk-size FP grouping that pure stock -b exhibits too. Verified applying on the pinned llama.cpp f3e1828 after patch 0008. Productisation follow-up: surface as a grpc-server.cpp options knob (max_prefill_tokens) per CHUNKED_PREFILL_PLAN Phase B. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 09:55:32 +00:00
Ettore Di Giacinto	ba6bd94976	feat(paged): assert mask-pad invariant for the paged tile route (patch 0012) Patch 0012 of the paged-attention series. Adds a defensive GGML_ASSERT in src/paged-attn.cpp so the now-default paged decode route (GQA-grouped fattn-tile kernel) cannot silently start leaking past-end KV rows. The route stays correct only because the compacted mask/block-table length n_view = GGML_PAD(n_gather, 256) is a whole number of flash-attn KV tiles (nbatch_fa = 64 for head_dim 128 divides 256), so the last tile sits entirely inside the -inf pad window. The assert (n_view % 64 == 0) pins that implicit invariant: a future pad < 256 or tile > 256 that broke it now aborts instead of leaking. Additive only, no behaviour change. Verified on the DGX dev tree: build-cpu compiles and the paged CPU byte gate (LLAMA_KV_PAGED off vs on, Qwen3-0.6B-Q8_0, greedy) stays byte-identical with the assert silent. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 09:13:08 +00:00
Ettore Di Giacinto	e983919516	feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011) Increment 3 attention lever. In the paged in-kernel decode dispatch, route the common grouped-query F16 case to the tile kernel and keep the inc-1 vec kernel for everything else. Tile groups the q-heads that share a kv-head (ncols2) so each K/V row is loaded once per group instead of once per q-head, and runs at higher occupancy (108-128 regs vs vec 168 -> 25%). On GB10 (Qwen3-32B NVFP4, F16 cache, gqa 8, batch 32, 1024 ctx, same build, env-toggled) this cuts the decode step from 186.3 to 177.9 ms/step (-4.5%), within 1.8% of stock (174.8). The win grows with context (tile vs vec decode step, npl=8): 1024 -2.3%, 4096 -3.3%, 8192 -4.1%, 16384 -6.1%, as attention takes a larger share of the step. Routing guard: tile has no K/V type template (loads half2), so a non-F16 cache would be converted to a contiguous F16 copy by launch_fattn, breaking the in-kernel block-table read. So tile is correct only for an F16 cache, and the grouping only helps at gqa>=2. tile is used only for {F16 K and V, gqa_ratio>=2}; everything else falls back to the inc-1 vec path, exactly as before this change. LLAMA_KV_PAGED_VEC=1 forces vec for A/B. The inc-2 phys(j) tile read (patch 0010) was already plumbed; this only adds the default route. (Paged decode currently needs an F16 cache; quantized + paged is a pre-existing limitation unaffected by this change: stock+q8_0 works, paged+q8_0 aborts both before and after.) Split-K was ruled out: the vec decode grid is already block-saturated (~43 waves over 144 resident on 48 SM), so more parallel_blocks adds no SM fill; the under-saturation is intra-SM occupancy + 8x KV re-streaming, which GQA grouping attacks directly. Validated (greedy): CPU plumbing gate (0.6B, build-cpu, paged-on vs off) byte-identical; GPU 0.6B gqa=2 tile token-coherent with the inc-1 vec path (7/8 sequences identical, 8th in the same kernel-noise band where vec also drifts from stock); 32B gqa=8 tile tracks stock at least as well as vec. Stock (no block table) is byte-identical: the dispatch guard only diverts on src[5]. Full rationale and numbers in the patch header. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code]	2026-06-22 22:38:28 +00:00
Ettore Di Giacinto	2c5adda28c	feat(paged): tile in-kernel decode read + dispatch guard (patch 0010) Increment 2 (robustness): graft the patch-0009 phys(j) block-table read into the CUDA tile kernel (mirror of fattn-vec.cuh) and add a dispatch guard so a present block table (src[5]) routes ONLY to the vec or tile kernel, never to mma/wmma (which ignore the table and would silently read the wrong physical cells). Default route stays vec, the inc-1 byte-validated path. Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B) PASS; GPU vec-paged == stock at -s 1 PASS; the real Qwen3-32B NVFP4 batch decode confirmed dispatching to vec (Q ne=[128,1,64,N]). The tile graft is plumbed for the increment-3 GQA head-group reuse but is EXPERIMENTAL/not byte-validated (LLAMA_KV_PAGED_TILE=1): the GQA-grouped ncols2>1 tile path reads a full nbatch_fa tile unbounded while the compacted paged mask is not padded to cover it. Bounding that path is increment-3 work; the default vec route is unaffected. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 20:37:12 +00:00
Ettore Di Giacinto	ee13a94a8c	paged: in-kernel decode read patch 0009 (kill the gather regression) Mirror patch 0009 for the paged llama.cpp engine. It removes the patch-0003 per-layer per-step gather (ggml_get_rows of K/V to a contiguous buffer) on the decode step and instead reads paged blocks in-kernel: build_attn passes the physical K/V views plus a position-ordered block table (src[5] of ggml_flash_attn_ext, padded to FATTN_KQ_STRIDE), and the CUDA fattn vec kernel plus the CPU reference map each logical KV index to its physical cell and read in place. KV_max / parallel_blocks / stream_k split-K are unchanged; a nullptr block table is the stock contiguous read (byte-identical, gated by LLAMA_KV_PAGED). Verified on GB10 (sm_121, Qwen3-32B NVFP4, batch 32 / 1024 ctx): the decode step drops from 1279 ms (paged-gather) to 696 ms in-kernel (-46%), reaching stock parity (647 ms). CPU paged vs stock is bit-for-bit identical; GPU stays within the documented batch-shape non-determinism band. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 18:04:09 +00:00
Ettore Di Giacinto	4dcbcfcf92	docs(paged): decode-step gap study vs vLLM on GB10 Profiling decomposition of the llama-server batch-32 / 1024-ctx decode step vs vLLM on a DGX Spark (GB10, sm_121). Findings: decode is GPU-bound (~95% busy, sampling/loop fully hidden); at 1024 ctx the step is ~84% KV/attention and ~16% weight GEMM; the paged KV engine is a ~1.85x decode regression vs stock (per-layer gather-to-contiguous); even stock is ~4-5x slower than vLLM, gated by the long-context decode-attention and thin-batch FP4 GEMM kernels, not by the serving loop. Ranked closable-vs-structural levers included. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:44:24 +00:00
Ettore Di Giacinto	80e0c1ac6b	feat(paged): wire cross-request prefix share into llama-server (patch 0008) Ship patch 0008 of the paged-attention series: wire the paged cross-request prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam) into the llama-server continuous-batching loop so CONCURRENT requests sharing a long prefix reuse one committed copy of the prefix blocks and prefill ONLY their divergent suffix. The server's native prompt cache only reuses a slot's own prior prompt; it does not share across distinct concurrent slots. 0008 adds that cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical). The hook lives in tools/server/server-context.cpp update_slots (the only place with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated lines: a fresh-slot share() that advances n_past past the committed prefix, and a commit() at the prefill->generation transition. The n_past<block gate guarantees every positive share is adopted so the engine reservation matches the suffix-only batch (no stale paged blocks). Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder: K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s, K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed' (ref_cnt>1), greedy output within the documented CUDA batch-shape non-determinism band. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:03:16 +00:00
Ettore Di Giacinto	52f0f7b8cf	docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache) Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4 GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager, both prefix-cache on. Two findings: (1) the paged cross-request prefix recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable via paged_prefix_api::share/commit, which the server never calls; the server engages only physical paged block placement plus its own native prompt cache. (2) With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s; K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 14:16:52 +00:00
Ettore Di Giacinto	f347f7ca1d	docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison Two closing measurements on DGX Spark (GB10, sm_121): 1. Stock GPU determinism (no paging): with LLAMA_KV_PAGED unset, stock llama.cpp produces a different greedy token stream when the same prompt is decoded in a full-prefill batch vs a split (prefix-then-suffix) batch. At G=24 the generated stream diverges 1/5 prompts on CPU and 2/5 on CUDA (and earlier on CUDA). This confirms the patch-0007 GPU byte-identity failure is stock floating-point batch-shape non-determinism, not a paged bug. CPU exhibits it too, just less often, which is why 0007's short CPU scenarios passed 16/16 while the CUDA run flipped. 2. vLLM vs llama.cpp+paged on a shared-prefix fan-out (K reqs share a 1024-tok prefix + unique 32-tok suffix, gen 64). llama.cpp+paged prefix cache gives 7.15x (K=16) / 10.3x (K=32) prefill reduction vs its no-share baseline - the same cross-request prefix-skip vLLM's APC provides (97% hit rate confirmed). Head-to-head on cached prefill vLLM is ~5x faster (Q4_K_M vs nvfp4a16 quant, vLLM on FP4 emulation + eager), and wider end-to-end due to continuous batched decode. Competitive in kind, behind in absolute terms on this hardware. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 13:48:01 +00:00
Ettore Di Giacinto	0dd45f0da5	docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results Record the belt-and-suspenders GPU run of the 0007 prefix-engine driver and a shared-prefix throughput benchmark. The committed CPU driver passes ALL PASS; the CUDA build fails only the strict greedy-token-equality assertions (the same binary fails them at ngl=0 too), which is CUDA float-kernel non-determinism, not a paged-logic defect - every structural KV-reuse invariant passes on GPU. The shared-prefix benchmark shows a real, K-scaling win: prefill wall time drops 7.2x (32B K=16) to 10.3x (32B K=32) when the shared prefix is computed once and reused via the paged cross-request prefix cache. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 12:59:09 +00:00
Ettore Di Giacinto	d1ba327843	docs(paged): record GPU correctness + CUDA backend-build verification GPU (DGX Spark, GB10/sm_121, CUDA 13.0) verification of the paged-KV series: core token-identical gate and 4-stream multiseq are byte-identical stock-vs-paged at -ngl 99, the device gather is confirmed firing, and a 32B paged run is coherent. Full backend: patches/paged apply clean to the pin and grpc-server compiles+links under CUDA sm_121. Notes also flag a double patch-application in the LLAMA_PAGED=on make flow (git apply + prepare.sh) and a token divergence in the unshipped prefix-recompute-skip dev driver (same on CPU and GPU). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 11:50:01 +00:00
Ettore Di Giacinto	ecffd4b097	feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007) Mirror patch 0007 of the paged-attention series into the vendored llama.cpp patch set. It wires the host-side cross-request prefix cache (0006) into the engine so a new sequence physically shares the cached prefix blocks (ref-counted) and decodes only the divergent suffix - the shared prefix KV is never recomputed. paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream) keyed by the real seq_id (per-sequence ref-counted free); two gated llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the shared physical cells' seq-membership so the engine attention mask covers the already-computed prefix; find_slot anchors placement on each sequence's ubatch.pos. Existing-file core touch is llama-kv-cache.{cpp,h} (+71 -3); everything else is additive vendored units. Gated behind LLAMA_KV_PAGED, default off, stock byte-identical. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode from scratch at a block boundary and mid-block, prefill computing only the suffix (32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's removal, survivor intact and re-shareable, pool restored when all freed). The 0004 serving gate stays byte-identical stock vs paged in unified and non-unified mode. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:47:10 +00:00
Ettore Di Giacinto	67c6208b3a	feat(llama-cpp/paged): cross-request prefix caching patch 0006 Mirror patch 0006 of the paged-attention series into the vendored llama.cpp patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with host-side cross-request prefix sharing: place_with_prefix reuses cached physical blocks for a new sequence shared prefix (ref_cnt++) and allocates only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1) block before a divergent write so co-owners stay byte-correct; ref-counted free releases a shared block only at ref 0. Core kv-cache files untouched; gated behind LLAMA_KV_PAGED, default off. Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix greedy tokens byte-identical to the unshared baseline at both a block boundary and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated), and copy-on-write + seq_rm ref-count safety with no use-after-free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:14:27 +00:00
Ettore Di Giacinto	04e3d04ab8	build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on) Move the paged-attention patch series (0001-0004 + docs) into patches/paged/, applied behind a new LLAMA_PAGED build flag (default on). The base patches/ dir is now clean, so a dep-bump that breaks a paged hook can be unblocked with LLAMA_PAGED=off (clean-against-upstream build) and the paged carry fixed independently - decoupling the paged-KV maintenance from routine bumps without a separate backend. Both apply paths wired (Makefile git-apply + prepare.sh re-apply, flag passed through). Runtime stays gated by LLAMA_KV_PAGED env, so an on build is byte-identical to stock until that env is set. Glob/flag logic verified in bash. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:22:36 +00:00
Ettore Di Giacinto	4968cd8a94	paged-attn 0004: on-demand KV block allocation Wire the paged placement in find_slot through the vendored PagedKVManager (0001) instead of a fixed full-pool permutation. Blocks are popped from a free pool on demand as a sequence crosses block boundaries, and returned on sequence end (full seq_rm / clear). One manager per (kv-cache, stream); all state lives in a new src/paged-alloc unit keyed by a static registry, so the core kv-cache struct is untouched (find_slot/clear/seq_rm gain only a gated call). Default off; stock path byte-identical. Gate 0 (CPU, Qwen3-0.6B-Q8_0), LLAMA_KV_PAGED=1 token-identical vs stock: - single-stream llama-simple, 48 tok: identical - multi-stream driver, 3 seqs x 40 tok: identical Demand-driven confirmed via debug log: blocks grow 0->1->2->3->4 at logical positions 16/32/48 (peak 4 blocks vs 16-block budget), per stream independently. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:50:57 +00:00
Ettore Di Giacinto	37e0e1ef55	paged-attn 0003: lift gather-read to multi-stream The 0003 gather-read was single-stream only (GGML_ASSERT k->ne[3]==1). Lift it to N streams: one index column per stream over the unified batch, gathered with a single ggml_get_rows along the stream axis. Each column is position-sorted (preserving the flash-attn online-softmax reduction order that makes the read byte-identical) and padded to the max non-empty count across streams with a masked (empty) cell, which contributes exp(-inf)=0. Core touch stays additive: the one-line build_attn hook is unchanged; only the two kv-cache gather helpers (now per-stream) and src/paged-attn.cpp grow. Gate 0 (CPU, Qwen3-0.6B-Q8_0): a multi-sequence greedy driver (non-unified KV, k->ne[3]>1) is token-identical between stock (env unset) and LLAMA_KV_PAGED=1: 3 seqs x 40 tok, 2 seqs x 32 tok, 5 seqs x 32 tok all identical; single-stream llama-simple unchanged. Debug log confirms n_stream=3 engaged the multi path. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:46:12 +00:00
Ettore Di Giacinto	d9d846e04b	feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive Implements the paged-attention gather-read (the real engine compute): attention reads ONLY a sequence's used cells by gathering K, V and the kq_mask by the non-empty-cell index list before build_attn_mha. Verified token-identical to stock greedy generation, 9/9 across 3 prompts x {32,96,128} tokens on Qwen3-0.6B, with n_gather=71 < n_kv=256 confirming real compaction (not an identity no-op). Built in the additive "hook, don't edit" form: all logic in new src/paged-attn.{h,cpp} (an llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + one CMake line. No edit to llm_graph_input_attn_kv or llama-graph.h. 216 insertions; default-off behind LLAMA_KV_PAGED so stock path stays byte-identical. Key correctness finding: get_gather_idxs emits cells sorted by token position. CPU flash-attn's online softmax reduces cells in physical-array order and is FP-order- sensitive, so 0002's scattered placement alone (full-window read) diverges from stock past the first block; the position-sorted gather reproduces stock's exact reduction order -> bit-identical. So 0003 is what makes paged placement token-identical under flash-attn. Verified on a dev tree at the pin (0001+0002+0003 on branch paged); not pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 08:26:46 +00:00
Ettore Di Giacinto	84d59e659b	docs(paged): additive "hook, don't edit" layout for the patch series Maintainers rejected PR #22569 (the upstream paged draft) as "slop" - it rewrites core attention and is unvendorable. Our own series must be additive so it survives llama.cpp pin bumps. This documents the rule and the per-patch core-touch budget: every change is either new code in a new vendored src/ file, or a single env-gated hook at one call site that delegates to it - no logic in core files, no core struct edits. Grounds it in the pinned source: llm_graph_input_i is pure-virtual and res->add_input() lets a new file register a graph input, so paged behavior plugs in without editing core graph types. Redesigns 0003 (gather-read) from the old 4-file surgery to one build_attn hook + a new paged-attn.{h,cpp} (a gather-input subclass) + two thin cache accessors (~8 core lines vs a core-struct rewrite). 0005 lands entirely in LocalAI's grpc-server.cpp (no core patch). Dev tree at the pin with 0001+0002 applied is set up; 0003 implementation is the next focused token-identical Gate-0 block. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 07:28:44 +00:00
Ettore Di Giacinto	d2651c86d9	bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to- apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 06:59:50 +00:00
Ettore Di Giacinto	19742aee64	bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - every flag lever exhausted Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel; no config/flag shortcut remains. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:59:27 +00:00
Ettore Di Giacinto	37cbc089b0	bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (PP 7.6-32x) vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765, vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision: the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just the MoE grouped GEMM (dense GEMM is the simpler first kernel to land). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-20 03:55:58 +00:00
Ettore Di Giacinto	b7b2e8291c	kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3) The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37% prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven). Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu} entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation phases + parity harness + the dense-path follow-up (#28). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:44:31 +00:00
Ettore Di Giacinto	cb28deda6b	bench(paged): decode profile overturns 'engine-addressable' - decode is 54.6% MoE GEMM too Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:27:35 +00:00
Ettore Di Giacinto	2a500c371f	bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix). Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the engine's domain (0004 pool + 0005 continuous batching target it). Baseline for the series to improve on. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:20:22 +00:00
Ettore Di Giacinto	48fbb9384f	docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild, single-stream first Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:14:25 +00:00
Ettore Di Giacinto	145e45b6f2	docs(paged): exact executable plan for 0003 gather-read Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by one aligned index; n_kv compaction; gated so stock stays byte-identical) with the token-identical gate and the known risks (mask transpose layout, v_trans). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:12:18 +00:00
Ettore Di Giacinto	c4b4f3a3e4	docs(paged): series status 0001/0002 done+verified; honest parity note Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:05:14 +00:00
Ettore Di Giacinto	61ff738177	patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-identical find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy generation is token-identical to stock (verified on Qwen3-0.6B at the pin), branch confirmed firing. Default off. The placement substrate for the gather-read. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 23:04:28 +00:00
Ettore Di Giacinto	ce48cc0751	patch(paged) 0001: vendor PagedKVManager into llama.cpp src First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the CPU-verified vLLM-parity block manager) + CMake entry. No behavior change. Generated against the pinned LLAMA_VERSION; applies clean. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:55:22 +00:00
Ettore Di Giacinto	ba3fa5a633	build(paged): stacking patch-series scaffolding for llama.cpp paged attention Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one small, independently-buildable patch so the work rebases cleanly across llama.cpp bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix caching) + the regen workflow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-19 22:53:20 +00:00
Ettore Di Giacinto	e3bcba5c45	chore: ⬆️ Update ggml-org/llama.cpp to `7f8ef50cce40e3e7e4526a3696cb45658190e69a` (#7402 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-12-01 07:50:40 +01:00
Ettore Di Giacinto	294f7022f3	feat: do not bundle llama-cpp anymore (#5790 ) * Build llama.cpp separately Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Start to try to attach some tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add git and small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: correctly autoload external backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run AIO tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Slightly update the Makefile helps Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Adapt auto-bumper Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run linux test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add llama-cpp into build pipelines Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add default capability (for cpu) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop llama-cpp specific logic from the backend loader Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop grpc install in ci for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Pass by backends path for tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build protogen at start Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests): set backends path consistently Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly configure the backends path Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to build for darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Compile for metal on arm64/darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to run build off from cross-arch Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to the backend index nvidia-l4t and cpu's llama-cpp backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Build also darwin-x86 for llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable arm64 builds temporary Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Test backend build on PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup build backend reusable workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * pass by skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Use crane Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Skip drivers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * x86 darwin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add packaging step for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix leftover from bark-cpp extraction Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Try to fix hipblas build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2025-07-18 13:24:12 +02:00

32 Commits