Files
LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto 125d10a782 feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)
Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C
Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill
burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill
draws from a depleted/fragmented pool and its throughput collapses (the benchmark's
"restart per npl" crutch). Decode is unaffected. The fix changes only host-side
block accounting and placement, never KV values or compute, and is gated behind
LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior).

Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every
block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for
the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly.
Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the
free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving
content-cache hashes.
Fix-3 release on slot completion: server_slot::release() issues prompt_clear()
under the paged engine so a finished-idle slot returns its blocks promptly.

Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023):
- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM
  (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops
  unaffected (no ggml op touched).
- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores
  ascending popleft order. UNIT PASS.
- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998;
  64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527).
- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses
  488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to
  532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96
  (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both
  arms (bit-exact serving).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 10:44:33 +00:00
..