LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	4dcbcfcf92	docs(paged): decode-step gap study vs vLLM on GB10 Profiling decomposition of the llama-server batch-32 / 1024-ctx decode step vs vLLM on a DGX Spark (GB10, sm_121). Findings: decode is GPU-bound (~95% busy, sampling/loop fully hidden); at 1024 ctx the step is ~84% KV/attention and ~16% weight GEMM; the paged KV engine is a ~1.85x decode regression vs stock (per-layer gather-to-contiguous); even stock is ~4-5x slower than vLLM, gated by the long-context decode-attention and thin-batch FP4 GEMM kernels, not by the serving loop. Ranked closable-vs-structural levers included. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:44:24 +00:00
Ettore Di Giacinto	80e0c1ac6b	feat(paged): wire cross-request prefix share into llama-server (patch 0008) Ship patch 0008 of the paged-attention series: wire the paged cross-request prefix recompute-skip (patch 0007's paged_prefix_api::share/commit engine seam) into the llama-server continuous-batching loop so CONCURRENT requests sharing a long prefix reuse one committed copy of the prefix blocks and prefill ONLY their divergent suffix. The server's native prompt cache only reuses a slot's own prior prompt; it does not share across distinct concurrent slots. 0008 adds that cross-slot share, fully gated behind LLAMA_KV_PAGED (stock byte-identical). The hook lives in tools/server/server-context.cpp update_slots (the only place with the slot prompt-processing loop; grpc-server.cpp includes it), ~50 gated lines: a fresh-slot share() that advances n_past past the committed prefix, and a commit() at the prefill->generation transition. The n_past<block gate guarantees every positive share is adopted so the engine reservation matches the suffix-only batch (no stale paged blocks). Verified in-server (32B NVFP4, CUDA, --kv-unified) with a live prefix holder: K=16/32 concurrent shared-prefix requests prefill only their ~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens; K=16 23.9s->1.5s, K=32 57.9s->2.3s), engine logs 'shares ... prefix blocks - NOT recomputed' (ref_cnt>1), greedy output within the documented CUDA batch-shape non-determinism band. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 15:03:16 +00:00
Ettore Di Giacinto	52f0f7b8cf	docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache) Matched comparison on DGX Spark (GB10, sm_121): batched llama-server with NVFP4 GGUF and the paged engine vs batched vLLM 0.23.0 NVFP4A16 with APC, both eager, both prefix-cache on. Two findings: (1) the paged cross-request prefix recompute-skip (patch 0007) does NOT engage in llama-server - it is only reachable via paged_prefix_api::share/commit, which the server never calls; the server engages only physical paged block placement plus its own native prompt cache. (2) With every confounder removed, vLLM is ~6x faster end-to-end (K=16: 8.6s vs 50.7s; K=32: 8.9s vs 58.3s), decode-bound not prefill-bound: llama ~828ms/decode-step at batch 32 vs vLLM ~185ms; CUDA graphs are not the differentiator (both eager). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 14:16:52 +00:00
Ettore Di Giacinto	f347f7ca1d	docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison Two closing measurements on DGX Spark (GB10, sm_121): 1. Stock GPU determinism (no paging): with LLAMA_KV_PAGED unset, stock llama.cpp produces a different greedy token stream when the same prompt is decoded in a full-prefill batch vs a split (prefix-then-suffix) batch. At G=24 the generated stream diverges 1/5 prompts on CPU and 2/5 on CUDA (and earlier on CUDA). This confirms the patch-0007 GPU byte-identity failure is stock floating-point batch-shape non-determinism, not a paged bug. CPU exhibits it too, just less often, which is why 0007's short CPU scenarios passed 16/16 while the CUDA run flipped. 2. vLLM vs llama.cpp+paged on a shared-prefix fan-out (K reqs share a 1024-tok prefix + unique 32-tok suffix, gen 64). llama.cpp+paged prefix cache gives 7.15x (K=16) / 10.3x (K=32) prefill reduction vs its no-share baseline - the same cross-request prefix-skip vLLM's APC provides (97% hit rate confirmed). Head-to-head on cached prefill vLLM is ~5x faster (Q4_K_M vs nvfp4a16 quant, vLLM on FP4 emulation + eager), and wider end-to-end due to continuous batched decode. Competitive in kind, behind in absolute terms on this hardware. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 13:48:01 +00:00
Ettore Di Giacinto	0dd45f0da5	docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results Record the belt-and-suspenders GPU run of the 0007 prefix-engine driver and a shared-prefix throughput benchmark. The committed CPU driver passes ALL PASS; the CUDA build fails only the strict greedy-token-equality assertions (the same binary fails them at ngl=0 too), which is CUDA float-kernel non-determinism, not a paged-logic defect - every structural KV-reuse invariant passes on GPU. The shared-prefix benchmark shows a real, K-scaling win: prefill wall time drops 7.2x (32B K=16) to 10.3x (32B K=32) when the shared prefix is computed once and reused via the paged cross-request prefix cache. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 12:59:09 +00:00
Ettore Di Giacinto	d1ba327843	docs(paged): record GPU correctness + CUDA backend-build verification GPU (DGX Spark, GB10/sm_121, CUDA 13.0) verification of the paged-KV series: core token-identical gate and 4-stream multiseq are byte-identical stock-vs-paged at -ngl 99, the device gather is confirmed firing, and a 32B paged run is coherent. Full backend: patches/paged apply clean to the pin and grpc-server compiles+links under CUDA sm_121. Notes also flag a double patch-application in the LLAMA_PAGED=on make flow (git apply + prepare.sh) and a token divergence in the unshipped prefix-recompute-skip dev driver (same on CPU and GPU). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 11:50:01 +00:00
Ettore Di Giacinto	ecffd4b097	feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007) Mirror patch 0007 of the paged-attention series into the vendored llama.cpp patch set. It wires the host-side cross-request prefix cache (0006) into the engine so a new sequence physically shares the cached prefix blocks (ref-counted) and decodes only the divergent suffix - the shared prefix KV is never recomputed. paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream) keyed by the real seq_id (per-sequence ref-counted free); two gated llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the shared physical cells' seq-membership so the engine attention mask covers the already-computed prefix; find_slot anchors placement on each sequence's ubatch.pos. Existing-file core touch is llama-kv-cache.{cpp,h} (+71 -3); everything else is additive vendored units. Gated behind LLAMA_KV_PAGED, default off, stock byte-identical. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode from scratch at a block boundary and mid-block, prefill computing only the suffix (32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's removal, survivor intact and re-shareable, pool restored when all freed). The 0004 serving gate stays byte-identical stock vs paged in unified and non-unified mode. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:47:10 +00:00
Ettore Di Giacinto	67c6208b3a	feat(llama-cpp/paged): cross-request prefix caching patch 0006 Mirror patch 0006 of the paged-attention series into the vendored llama.cpp patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with host-side cross-request prefix sharing: place_with_prefix reuses cached physical blocks for a new sequence shared prefix (ref_cnt++) and allocates only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1) block before a divergent write so co-owners stay byte-correct; ref-counted free releases a shared block only at ref 0. Core kv-cache files untouched; gated behind LLAMA_KV_PAGED, default off. Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix greedy tokens byte-identical to the unshared baseline at both a block boundary and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated), and copy-on-write + seq_rm ref-count safety with no use-after-free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:14:27 +00:00
Ettore Di Giacinto	04e3d04ab8	build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on) Move the paged-attention patch series (0001-0004 + docs) into patches/paged/, applied behind a new LLAMA_PAGED build flag (default on). The base patches/ dir is now clean, so a dep-bump that breaks a paged hook can be unblocked with LLAMA_PAGED=off (clean-against-upstream build) and the paged carry fixed independently - decoupling the paged-KV maintenance from routine bumps without a separate backend. Both apply paths wired (Makefile git-apply + prepare.sh re-apply, flag passed through). Runtime stays gated by LLAMA_KV_PAGED env, so an on build is byte-identical to stock until that env is set. Glob/flag logic verified in bash. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:22:36 +00:00

9 Commits