LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-24 00:28:55 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	d1ba327843	docs(paged): record GPU correctness + CUDA backend-build verification. GPU (DGX Spark, GB10/sm_121, CUDA 13.0) verification of the paged-KV series: core token-identical gate and 4-stream multiseq are byte-identical stock-vs-paged at -ngl 99, the device gather is confirmed firing, and a 32B paged run is coherent. Full backend: patches/paged apply cleanly to the pin and grpc-server compiles+links under CUDA sm_121. Notes also flag a double patch-application in the LLAMA_PAGED=on make flow (git apply + prepare.sh) and a token divergence in the unshipped prefix-recompute-skip dev driver (same on CPU and GPU). Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 11:50:01 +00:00
Ettore Di Giacinto	ecffd4b097	feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007). Mirror patch 0007 of the paged-attention series into the vendored llama.cpp patch set. It wires the host-side cross-request prefix cache (0006) into the engine so a new sequence physically shares the cached prefix blocks (ref-counted) and decodes only the divergent suffix - the shared prefix KV is never recomputed. paged-alloc becomes one persistent caching PagedKVManager per (kv-cache, stream) keyed by the real seq_id (per-sequence ref-counted free); two gated llama_kv_cache methods (paged_prefix_share / paged_prefix_commit) mark the shared physical cells' seq-membership so the engine attention mask covers the already-computed prefix; find_slot anchors placement on each sequence's ubatch.pos. Existing-file core touch is llama-kv-cache.{cpp,h} (+71 -3); everything else is additive vendored units. Gated behind LLAMA_KV_PAGED, default off, stock byte-identical. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): greedy byte-identity vs decode from scratch at a block boundary and mid-block, prefill computing only the suffix (32 prefix tokens skipped), and ref-counted free safety (2->1 on one sharer's removal, survivor intact and re-shareable, pool restored when all freed). The 0004 serving gate stays byte-identical stock vs paged in unified and non-unified mode. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:47:10 +00:00
Ettore Di Giacinto	67c6208b3a	feat(llama-cpp/paged): cross-request prefix caching patch 0006. Mirror patch 0006 of the paged-attention series into the vendored llama.cpp patch set. Extends the vendored PagedKVManager (src/paged-kv-manager) with host-side cross-request prefix sharing: place_with_prefix reuses cached physical blocks for a new sequence's shared prefix (ref_cnt++) and allocates only the divergent suffix; cow_block copy-on-writes a still-shared (ref>1) block before a divergent write so co-owners stay byte-correct; ref-counted free releases a shared block only at ref 0. Core kv-cache files untouched; gated behind LLAMA_KV_PAGED, default off. Gate 0 verified on the dev tree (CPU, Qwen3-0.6B-Q8_0): shared-prefix greedy tokens byte-identical to the unshared baseline at both a block boundary and mid-block, measured 2-block reuse (ref_cnt==2, only the suffix allocated), and copy-on-write + seq_rm ref-count safety with no use-after-free. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 10:14:27 +00:00
Ettore Di Giacinto	04e3d04ab8	build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on). Move the paged-attention patch series (0001-0004 + docs) into patches/paged/, applied behind a new LLAMA_PAGED build flag (default on). The base patches/ dir is now clean, so a dep-bump that breaks a paged hook can be unblocked with LLAMA_PAGED=off (clean-against-upstream build) and the paged carry fixed independently - decoupling the paged-KV maintenance from routine bumps without a separate backend. Both apply paths wired (Makefile git-apply + prepare.sh re-apply, flag passed through). Runtime stays gated by the LLAMA_KV_PAGED env, so a LLAMA_PAGED=on build is byte-identical to stock until that env is set. Glob/flag logic verified in bash. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-22 09:22:36 +00:00

4 Commits
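
The prefix-sharing behaviour described in commits 67c6208b3a and ecffd4b097 (reuse cached blocks with ref_cnt++, copy-on-write before a divergent write, free only at ref 0) can be illustrated with a small toy model. This is a minimal Python sketch, not the vendored C++ PagedKVManager: the class name PagedPool, the method signatures, the block size of 4 and the cache keyed on full-block token prefixes are all assumptions made for illustration. Only the method names place_with_prefix and cow_block come from the commit messages. The sketch tracks block ownership only (no KV data) and shares whole blocks only.

```python
class PagedPool:
    """Toy host-side block pool with ref-counted blocks (hypothetical API)."""

    def __init__(self, n_blocks, block_size=4):  # block_size is an assumed value
        self.block_size = block_size
        self.free = list(range(n_blocks))
        self.ref_cnt = [0] * n_blocks
        self.cache = {}    # full-block token prefix -> block id
        self.key_of = {}   # block id -> its cache key
        self.tables = {}   # seq_id -> list of block ids

    def _alloc(self):
        if not self.free:
            raise MemoryError("pool exhausted")
        b = self.free.pop()
        self.ref_cnt[b] = 1
        return b

    def _uncache(self, b):
        key = self.key_of.pop(b, None)
        if key is not None:
            del self.cache[key]

    def place_with_prefix(self, seq_id, tokens):
        """Reuse cached full blocks, allocate the rest; return reused count."""
        bs, table, reused = self.block_size, [], 0
        for end in range(bs, len(tokens) + 1, bs):
            key = tuple(tokens[:end])  # block contents depend on the whole prefix
            b = self.cache.get(key)
            if b is None:
                b = self._alloc()
                self.cache[key] = b
                self.key_of[b] = key
            else:
                self.ref_cnt[b] += 1   # shared, not recomputed
                reused += 1
            table.append(b)
        if len(tokens) % bs:
            table.append(self._alloc())  # partial tail block stays private
        self.tables[seq_id] = table
        return reused

    def cow_block(self, seq_id, idx):
        """Give seq_id a private block at idx before a divergent write."""
        table = self.tables[seq_id]
        b = table[idx]
        if self.ref_cnt[b] > 1:
            self.ref_cnt[b] -= 1       # co-owners keep the original block
            b = table[idx] = self._alloc()
        else:
            self._uncache(b)           # sole owner: cached key would go stale
        return b

    def free_seq(self, seq_id):
        """Drop one reference per block; release a block only at ref 0."""
        for b in self.tables.pop(seq_id):
            self.ref_cnt[b] -= 1
            if self.ref_cnt[b] == 0:
                self._uncache(b)
                self.free.append(b)


pool = PagedPool(n_blocks=8)
pool.place_with_prefix("a", list(range(10)))
reused = pool.place_with_prefix("b", list(range(8)) + [99, 100])
print(reused, len(pool.free))  # prints "2 4"
```

In this toy run the second sequence reuses two full blocks and allocates only its tail, which mirrors the "2-block reuse (ref_cnt==2, only the suffix allocated)" measurement reported in the commit, although the real block size and cache key are not stated there.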