LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	4bc2b4a9b2	feat(paged): add patch 0013 decoupled per-step prefill-token budget Mirror of the dev-tree paged scheduler patch into the llama.cpp backend's vendored patch series. Adds LLAMA_PREFILL_BUDGET, a per-step prefill-token budget for the inherited update_slots() scheduler, decoupled from n_batch (the analogue of vLLM's --max-num-batched-tokens). It caps how many prompt tokens a single update_slots() step ingests, splitting a long prefill across more steps so co-batched decode keeps advancing instead of freezing for the duration of one fat ~n_batch prefill chunk. Default (env unset or <= 0) = disabled, so stock behaviour is byte-identical; orthogonal to LLAMA_KV_PAGED. Measured on GB10 (dense Qwen3-32B-NVFP4, 8 steady decoders + one injected 6000-token prefill, same binary, only the env differs): worst decode freeze 3380 -> 482 ms (7.0x) and decode_stall 3285 -> 387 ms (8.5x) at budget=256, for a +20% TTFT on the long request; budget=512 gives 4.8x at ~no TTFT cost. This is a latency/fairness lever, not an aggregate-throughput lever (steady decode is NVFP4 weight-read-bound on GB10, which the scheduler cannot lift). Correctness: budget unset or >= n_batch is byte-identical to stock; budget=N is byte-identical to stock -bN while preserving n_batch for decode width; the only deviation on long prompts is intrinsic flash-attn chunk-size FP grouping that pure stock -b exhibits too. Verified applying on the pinned llama.cpp f3e1828 after patch 0008. Productisation follow-up: surface as a grpc-server.cpp options knob (max_prefill_tokens) per CHUNKED_PREFILL_PLAN Phase B. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 09:55:32 +00:00
Ettore Di Giacinto	ba6bd94976	feat(paged): assert mask-pad invariant for the paged tile route (patch 0012) Patch 0012 of the paged-attention series. Adds a defensive GGML_ASSERT in src/paged-attn.cpp so the now-default paged decode route (GQA-grouped fattn-tile kernel) cannot silently start leaking past-end KV rows. The route stays correct only because the compacted mask/block-table length n_view = GGML_PAD(n_gather, 256) is a whole number of flash-attn KV tiles (nbatch_fa = 64 for head_dim 128 divides 256), so the last tile sits entirely inside the -inf pad window. The assert (n_view % 64 == 0) pins that implicit invariant: a future pad < 256 or tile > 256 that broke it now aborts instead of leaking. Additive only, no behaviour change. Verified on the DGX dev tree: build-cpu compiles and the paged CPU byte gate (LLAMA_KV_PAGED off vs on, Qwen3-0.6B-Q8_0, greedy) stays byte-identical with the assert silent. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-23 09:13:08 +00:00
Ettore Di Giacinto	e983919516	feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011) Increment 3 attention lever. In the paged in-kernel decode dispatch, route the common grouped-query F16 case to the tile kernel and keep the inc-1 vec kernel for everything else. Tile groups the q-heads that share a kv-head (ncols2) so each K/V row is loaded once per group instead of once per q-head, and runs at higher occupancy (108-128 regs vs vec 168 -> 25%). On GB10 (Qwen3-32B NVFP4, F16 cache, gqa 8, batch 32, 1024 ctx, same build, env-toggled) this cuts the decode step from 186.3 to 177.9 ms/step (-4.5%), within 1.8% of stock (174.8). The win grows with context (tile vs vec decode step, npl=8): 1024 -2.3%, 4096 -3.3%, 8192 -4.1%, 16384 -6.1%, as attention takes a larger share of the step. Routing guard: tile has no K/V type template (loads half2), so a non-F16 cache would be converted to a contiguous F16 copy by launch_fattn, breaking the in-kernel block-table read. So tile is correct only for an F16 cache, and the grouping only helps at gqa>=2. tile is used only for {F16 K and V, gqa_ratio>=2}; everything else falls back to the inc-1 vec path, exactly as before this change. LLAMA_KV_PAGED_VEC=1 forces vec for A/B. The inc-2 phys(j) tile read (patch 0010) was already plumbed; this only adds the default route. (Paged decode currently needs an F16 cache; quantized + paged is a pre-existing limitation unaffected by this change: stock+q8_0 works, paged+q8_0 aborts both before and after.) Split-K was ruled out: the vec decode grid is already block-saturated (~43 waves over 144 resident on 48 SM), so more parallel_blocks adds no SM fill; the under-saturation is intra-SM occupancy + 8x KV re-streaming, which GQA grouping attacks directly. Validated (greedy): CPU plumbing gate (0.6B, build-cpu, paged-on vs off) byte-identical; GPU 0.6B gqa=2 tile token-coherent with the inc-1 vec path (7/8 sequences identical, 8th in the same kernel-noise band where vec also drifts from stock); 32B gqa=8 tile tracks stock at least as well as vec. Stock (no block table) is byte-identical: the dispatch guard only diverts on src[5]. Full rationale and numbers in the patch header. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code]	2026-06-22 22:38:28 +00:00
Ettore Di Giacinto	2c5adda28c	feat(paged): tile in-kernel decode read + dispatch guard (patch 0010) Increment 2 (robustness): graft the patch-0009 phys(j) block-table read into the CUDA tile kernel (mirror of fattn-vec.cuh) and add a dispatch guard so a present block table (src[5]) routes ONLY to the vec or tile kernel, never to mma/wmma (which ignore the table and would silently read the wrong physical cells). Default route stays vec, the inc-1 byte-validated path. Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B) PASS; GPU vec-paged == stock at -s 1 PASS; the real Qwen3-32B NVFP4 batch decode confirmed dispatching to vec (Q ne=[128,1,64,N]). The tile graft is plumbed for the increment-3 GQA head-group reuse but is EXPERIMENTAL/not byte-validated (LLAMA_KV_PAGED_TILE=1): the GQA-grouped ncols2>1 tile path reads a full nbatch_fa tile unbounded while the compacted paged mask is not padded to cover it. Bounding that path is increment-3 work; the default vec route is unaffected. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 20:37:12 +00:00
Ettore Di Giacinto	ee13a94a8c	paged: in-kernel decode read patch 0009 (kill the gather regression) Mirror patch 0009 for the paged llama.cpp engine. It removes the patch-0003 per-layer per-step gather (ggml_get_rows of K/V to a contiguous buffer) on the decode step and instead reads paged blocks in-kernel: build_attn passes the physical K/V views plus a position-ordered block table (src[5] of ggml_flash_attn_ext, padded to FATTN_KQ_STRIDE), and the CUDA fattn vec kernel plus the CPU reference map each logical KV index to its physical cell and read in place. KV_max / parallel_blocks / stream_k split-K are unchanged; a nullptr block table is the stock contiguous read (byte-identical, gated by LLAMA_KV_PAGED). Verified on GB10 (sm_121, Qwen3-32B NVFP4, batch 32 / 1024 ctx): the decode step drops from 1279 ms (paged-gather) to 696 ms in-kernel (-46%), reaching stock parity (647 ms). CPU paged vs stock is bit-for-bit identical; GPU stays within the documented batch-shape non-determinism band. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 18:04:09 +00:00
Ettore Di Giacinto	4dcbcfcf92	docs(paged): decode-step gap study vs vLLM on GB10 Profiling decomposition of the llama-server batch-32 / 1024-ctx decode step vs vLLM on a DGX Spark (GB10, sm_121). Findings: decode is GPU-bound (~95% busy, sampling/loop fully hidden); at 1024 ctx the step is ~84% KV/attention and ~16% weight GEMM; the paged KV engine is a ~1.85x decode regression vs stock (per-layer gather-to-contiguous); even stock is ~4-5x slower than vLLM, gated by the long-context decode-attention and thin-batch FP4 GEMM kernels, not by the serving loop. Ranked closable-vs-structural levers included. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:44:24 +00:00
Ettore Di Giacinto	80e0c1ac6b	feat(paged): wire cross-request prefix share into llama-server (patch 0008) Ship patch 0008 of the paged-attention series: wire the paged cross-request prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam) into the llama-server continuous-batching loop so CONCURRENT requests sharing a long prefix reuse one committed copy of the prefix blocks and prefill ONLY their divergent suffix. The server's native prompt cache only reuses a slot's own prior prompt; it does not share across distinct concurrent slots. 0008 adds that cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical). The hook lives in tools/server/server-context.cpp update_slots (the only place with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated lines: a fresh-slot share() that advances n_past past the committed prefix, and a commit() at the prefill->generation transition. The n_past<block gate guarantees every positive share is adopted so the engine reservation matches the suffix-only batch (no stale paged blocks). Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder: K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s, K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed' (ref_cnt>1), greedy output within the documented CUDA batch-shape non-determinism band. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:03:16 +00:00
Ettore Di Giacinto	52f0f7b8cf	docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache) Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4 GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager, both prefix-cache on. Two findings: (1) the paged cross-request prefix recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable via paged_prefix_api::share/commit, which the server never calls; the server engages only physical paged block placement plus its own native prompt cache. (2) With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s; K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 14:16:52 +00:00
Ettore Di Giacinto	f347f7ca1d	docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison Two closing measurements on DGX Spark (GB10, sm_121): 1. Stock GPU determinism (no paging): with LLAMA_KV_PAGED unset, stock llama.cpp produces a different greedy token stream when the same prompt is decoded in a full-prefill batch vs a split (prefix-then-suffix) batch. At G=24 the generated stream diverges 1/5 prompts on CPU and 2/5 on CUDA (and earlier on CUDA). This confirms the patch-0007 GPU byte-identity failure is stock floating-point batch-shape non-determinism, not a paged bug. CPU exhibits it too, just less often, which is why 0007's short CPU scenarios passed 16/16 while the CUDA run flipped. 2. vLLM vs llama.cpp+paged on a shared-prefix fan-out (K reqs share a 1024-tok prefix + unique 32-tok suffix, gen 64). llama.cpp+paged prefix cache gives 7.15x (K=16) / 10.3x (K=32) prefill reduction vs its no-share baseline - the same cross-request prefix-skip vLLM's APC provides (97% hit rate confirmed). Head-to-head on cached prefill vLLM is ~5x faster (Q4_K_M vs nvfp4a16 quant, vLLM on FP4 emulation + eager), and wider end-to-end due to continuous batched decode. Competitive in kind, behind in absolute terms on this hardware. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 13:48:01 +00:00
Ettore Di Giacinto	0dd45f0da5	docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results Record the belt-and-suspenders GPU run of the 0007 prefix-engine driver and a shared-prefix throughput benchmark. The committed CPU driver passes ALL PASS; the CUDA build fails only the strict greedy-token-equality assertions (the same binary fails them at ngl=0 too), which is CUDA float-kernel non-determinism, not a paged-logic defect - every structural KV-reuse invariant passes on GPU. The shared-prefix benchmark shows a real, K-scaling win: prefill wall time drops 7.2x (32B K=16) to 10.3x (32B K=32) when the shared prefix is computed once and reused via the paged cross-request prefix cache. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 12:59:09 +00:00
Ettore Di Giacinto	d1ba327843	docs(paged): record GPU correctness + CUDA backend-build verification GPU (DGX Spark, GB10/sm_121, CUDA 13.0) verification of the paged-KV series: core token-identical gate and 4-stream multiseq are byte-identical stock-vs-paged at -ngl 99, the device gather is confirmed firing, and a 32B paged run is coherent. Full backend: patches/paged apply clean to the pin and grpc-server compiles+links under CUDA sm_121. Notes also flag a double patch-application in the LLAMA_PAGED=on make flow (git apply + prepare.sh) and a token divergence in the unshipped prefix-recompute-skip dev driver (same on CPU and GPU). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 11:50:01 +00:00
Ettore Di Giacinto	ecffd4b097	feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007) Mirror patch 0007 of the paged-attention series into the vendored llama.cpp patch set. It wires the host-side cross-request prefix cache (0006) into the engine so a new sequence physically shares the cached prefix blocks (ref-counted) and decodes only the divergent suffix - the shared prefix KV is never recomputed. paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream) keyed by the real seq_id (per-sequence ref-counted free); two gated llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the shared physical cells' seq-membership so the engine attention mask covers the already-computed prefix; find_slot anchors placement on each sequence's ubatch.pos. Existing-file core touch is llama-kv-cache.{cpp,h} (+71 -3); everything else is additive vendored units. Gated behind LLAMA_KV_PAGED, default off, stock byte-identical. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode from scratch at a block boundary and mid-block, prefill computing only the suffix (32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's removal, survivor intact and re-shareable, pool restored when all freed). The 0004 serving gate stays byte-identical stock vs paged in unified and non-unified mode. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:47:10 +00:00
Ettore Di Giacinto	67c6208b3a	feat(llama-cpp/paged): cross-request prefix caching patch 0006 Mirror patch 0006 of the paged-attention series into the vendored llama.cpp patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with host-side cross-request prefix sharing: place_with_prefix reuses cached physical blocks for a new sequence shared prefix (ref_cnt++) and allocates only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1) block before a divergent write so co-owners stay byte-correct; ref-counted free releases a shared block only at ref 0. Core kv-cache files untouched; gated behind LLAMA_KV_PAGED, default off. Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix greedy tokens byte-identical to the unshared baseline at both a block boundary and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated), and copy-on-write + seq_rm ref-count safety with no use-after-free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:14:27 +00:00
Ettore Di Giacinto	04e3d04ab8	build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on) Move the paged-attention patch series (0001-0004 + docs) into patches/paged/, applied behind a new LLAMA_PAGED build flag (default on). The base patches/ dir is now clean, so a dep-bump that breaks a paged hook can be unblocked with LLAMA_PAGED=off (clean-against-upstream build) and the paged carry fixed independently - decoupling the paged-KV maintenance from routine bumps without a separate backend. Both apply paths wired (Makefile git-apply + prepare.sh re-apply, flag passed through). Runtime stays gated by LLAMA_KV_PAGED env, so an on build is byte-identical to stock until that env is set. Glob/flag logic verified in bash. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:22:36 +00:00
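
Note on the prefix-sharing commits above (patches 0006 and 0007): the messages describe ref-counted sharing of physical KV blocks, where a new sequence reuses the cached prefix blocks (ref_cnt++), allocates only its divergent suffix, copies a still-shared block before writing to it, and returns a block to the pool only at ref 0. The sketch below is one way to picture that bookkeeping. It is not the vendored PagedKVManager: it is written in Python rather than C++, the class name `BlockPool`, the method signatures and the `donor_id` argument are all hypothetical, and the actual K/V data copy on copy-on-write is left out.

```python
class BlockPool:
    """Toy ref-counted block pool (hypothetical, not the PagedKVManager API)."""

    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))  # physical block ids not in use
        self.ref_cnt = {}                  # block id -> number of owning sequences
        self.seqs = {}                     # seq_id -> ordered list of block ids

    def _alloc(self):
        blk = self.free.pop()
        self.ref_cnt[blk] = 1
        return blk

    def place(self, seq_id, n_blocks):
        # Unshared placement: every block is freshly allocated.
        self.seqs[seq_id] = [self._alloc() for _ in range(n_blocks)]

    def place_with_prefix(self, seq_id, donor_id, n_shared, n_suffix):
        # Reuse the donor's first n_shared blocks and allocate only the suffix.
        shared = self.seqs[donor_id][:n_shared]
        for blk in shared:
            self.ref_cnt[blk] += 1
        self.seqs[seq_id] = shared + [self._alloc() for _ in range(n_suffix)]

    def cow_block(self, seq_id, idx):
        # Before a divergent write, give seq_id a private block if the current
        # one is still shared. A real engine would also copy the K/V contents.
        blk = self.seqs[seq_id][idx]
        if self.ref_cnt[blk] > 1:
            self.ref_cnt[blk] -= 1
            blk = self._alloc()
            self.seqs[seq_id][idx] = blk
        return blk

    def free_seq(self, seq_id):
        # Drop one reference per block; a block returns to the pool only at 0.
        for blk in self.seqs.pop(seq_id):
            self.ref_cnt[blk] -= 1
            if self.ref_cnt[blk] == 0:
                del self.ref_cnt[blk]
                self.free.append(blk)
```

In this toy model, removing one of two sharers leaves the shared blocks at ref 1 and still usable by the survivor, and the pool is fully restored only once every sharer has been freed, which mirrors the "2->1 on one sharer's removal ... pool restored when all freed" check quoted in the patch 0007 message.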

14 Commits
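
Note on the patch 0013 commit: its message describes a per-step prefill-token budget that caps how many prompt tokens one update_slots() step ingests, with a budget that is unset, <= 0, or >= n_batch leaving stock behaviour unchanged. The sketch below illustrates only the chunking arithmetic. It is a simplification written in Python, not the patched scheduler: the function name `prefill_chunks` is hypothetical, decode tokens sharing the batch are ignored, and the n_batch value of 2048 in the usage lines is an assumption chosen for illustration (the commit message does not state the value used in the measurements).

```python
def prefill_chunks(n_prompt, n_batch, budget=0):
    """Return the prompt-token count ingested on each step (toy model).

    budget <= 0 means the budget is disabled, so chunks are capped by n_batch.
    budget >= n_batch has no effect, since n_batch already caps a step.
    """
    cap = n_batch if budget <= 0 else min(budget, n_batch)
    chunks = []
    remaining = n_prompt
    while remaining > 0:
        step = min(cap, remaining)
        chunks.append(step)
        remaining -= step
    return chunks


# A 6000-token prompt with an assumed n_batch of 2048.
stock = prefill_chunks(6000, 2048)          # budget disabled
capped = prefill_chunks(6000, 2048, 256)    # budget of 256 tokens per step
print(len(stock), max(stock))
print(len(capped), max(capped))
```

Under these assumptions the long prompt is spread over many small steps instead of a few large ones, which is the mechanism the commit message credits for shorter decode freezes at the cost of a later first token for the long request.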