Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-23 16:19:07 -04:00)
Patch 0012 of the paged-attention series.

Adds a defensive GGML_ASSERT in src/paged-attn.cpp so that the now-default paged decode route (the GQA-grouped fattn-tile kernel) cannot silently start leaking past-end KV rows.

The route stays correct only because the compacted mask/block-table length, n_view = GGML_PAD(n_gather, 256), is a whole number of flash-attn KV tiles: nbatch_fa = 64 for head_dim 128, and 64 divides 256. The last tile therefore sits entirely inside the -inf pad window. The assert (n_view % 64 == 0) pins that implicit invariant: a future pad < 256 or tile > 256 that broke it now aborts instead of leaking.

Additive only, no behaviour change.

Verified on the DGX dev tree: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off vs on, Qwen3-0.6B-Q8_0, greedy) stays byte-identical with the assert silent.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
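
The tile-divisibility invariant described above can be illustrated with a small standalone sketch. This is not the code from src/paged-attn.cpp: pad_up is a local stand-in that assumes the usual power-of-two round-up behaviour of GGML_PAD, a plain assert stands in for GGML_ASSERT, and padded_view_len, rows_past_end, kViewPad and kTileKV are hypothetical names introduced only for illustration. The overrun model (a tiled kernel that always processes whole tiles) is likewise an assumption about how a leak would arise.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for a power-of-two round-up such as GGML_PAD(x, n).
// Assumes n is a power of two.
constexpr int64_t pad_up(int64_t x, int64_t n) {
    return (x + n - 1) & ~(n - 1);
}

constexpr int64_t kViewPad = 256; // pad granularity quoted in the commit message
constexpr int64_t kTileKV  = 64;  // nbatch_fa for head_dim 128, per the commit message

// Hypothetical helper: compute the padded view length and pin the invariant
// that it is a whole number of KV tiles.
inline int64_t padded_view_len(int64_t n_gather) {
    const int64_t n_view = pad_up(n_gather, kViewPad);
    assert(n_view % kTileKV == 0); // aborts if pad and tile sizes ever disagree
    return n_view;
}

// Hypothetical model of the failure: a kernel that walks whole tiles touches
// ceil(n_view / tile) * tile rows. Anything beyond n_view lies past the end.
constexpr int64_t rows_past_end(int64_t n_view, int64_t tile) {
    return ((n_view + tile - 1) / tile) * tile - n_view;
}
```

With the sizes from the commit message, rows_past_end(padded_view_len(n), 64) is 0 for any n, because 64 divides 256. A hypothetical tile of 96 rows over a 256-row view would instead touch 32 rows past the end, which is the kind of mismatch the assert is meant to turn into an abort.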