Files
LocalAI/backend/cpp/llama-cpp/patches/paged
Ettore Di Giacinto ee13a94a8c paged: in-kernel decode read patch 0009 (kill the gather regression)
Mirror patch 0009 for the paged llama.cpp engine. It removes the patch-0003
per-layer per-step gather (ggml_get_rows of K/V to a contiguous buffer) on the
decode step and instead reads paged blocks in-kernel: build_attn passes the
physical K/V views plus a position-ordered block table (src[5] of
ggml_flash_attn_ext, padded to FATTN_KQ_STRIDE), and the CUDA fattn vec kernel
plus the CPU reference map each logical KV index to its physical cell and read
in place. KV_max / parallel_blocks / stream_k split-K are unchanged; a nullptr
block table is the stock contiguous read (byte-identical, gated by
LLAMA_KV_PAGED).

Verified on GB10 (sm_121, Qwen3-32B NVFP4, batch 32 / 1024 ctx): the decode
step drops from 1279 ms (paged-gather) to 696 ms in-kernel (-46%), reaching
stock parity (647 ms). CPU paged vs stock is bit-for-bit identical; GPU stays
within the documented batch-shape non-determinism band.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 18:04:09 +00:00
..