Files
LocalAI/backend/cpp
Ettore Di Giacinto 2c5adda28c feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)
Increment 2 (robustness): graft the patch-0009 phys(j) block-table read into
the CUDA tile kernel (mirror of fattn-vec.cuh) and add a dispatch guard so a
present block table (src[5]) routes ONLY to the vec or tile kernel, never to
mma/wmma (which ignore the table and would silently read the wrong physical
cells). Default route stays vec, the inc-1 byte-validated path.

Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B) PASS; GPU vec-paged ==
stock at -s 1 PASS; the real Qwen3-32B NVFP4 batch decode confirmed dispatching
to vec (Q ne=[128,1,64,N]). The tile graft is plumbed for the increment-3 GQA
head-group reuse but is EXPERIMENTAL/not byte-validated (LLAMA_KV_PAGED_TILE=1):
the GQA-grouped ncols2>1 tile path reads a full nbatch_fa tile unbounded while
the compacted paged mask is not padded to cover it. Bounding that path is
increment-3 work; the default vec route is unaffected.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 20:37:12 +00:00
..