mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
Increment 2 (robustness): graft the patch-0009 phys(j) block-table read into the CUDA tile kernel (mirror of fattn-vec.cuh) and add a dispatch guard so a present block table (src[5]) routes ONLY to the vec or tile kernel, never to mma/wmma (which ignore the table and would silently read the wrong physical cells).

Default route stays vec, the inc-1 byte-validated path.

Gates:
- CPU byte-identical paged-on vs off (Qwen3-0.6B): PASS
- GPU vec-paged == stock at -s 1: PASS
- the real Qwen3-32B NVFP4 batch decode confirmed dispatching to vec (Q ne=[128,1,64,N])

The tile graft is plumbed for the increment-3 GQA head-group reuse but is EXPERIMENTAL/not byte-validated (LLAMA_KV_PAGED_TILE=1): the GQA-grouped ncols2>1 tile path reads a full nbatch_fa tile unbounded while the compacted paged mask is not padded to cover it. Bounding that path is increment-3 work; the default vec route is unaffected.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
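
The dispatch guard and the block-table read described above can be sketched on the host side as follows. This is a minimal illustration under stated assumptions, not the code from the patch: `FattnKernel`, `route_kernel`, and `phys` as written here are hypothetical names, the assumption that the tile opt-in flag selects tile for every paged call is a guess at how LLAMA_KV_PAGED_TILE=1 behaves, and the block-table layout (one physical block index per logical block, fixed block size) is assumed rather than taken from patch 0009.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical kernel families. Illustrative names only, not ggml identifiers.
enum class FattnKernel { Vec, Tile, Mma, Wmma };

// Hypothetical dispatch guard.
//   has_block_table: a block table is attached to the op (src[5] in the commit).
//   tile_opt_in:     stands in for LLAMA_KV_PAGED_TILE=1 (assumed semantics).
//   preferred:       what the stock heuristic would have picked.
// With a block table present, mma/wmma are never returned, because those
// kernels ignore the table and would read the wrong physical cells.
inline FattnKernel route_kernel(bool has_block_table, bool tile_opt_in,
                                FattnKernel preferred) {
    if (!has_block_table) {
        return preferred;  // unpaged case: stock dispatch is left unchanged
    }
    return tile_opt_in ? FattnKernel::Tile : FattnKernel::Vec;
}

// Hypothetical phys(j): map logical KV position j to a physical cell index.
// Assumed layout: block_table[b] holds the physical block backing logical
// block b, and every block spans block_size consecutive cells.
inline int64_t phys(const std::vector<int32_t> & block_table,
                    int64_t block_size, int64_t j) {
    assert(block_size > 0);
    assert(j >= 0 && j / block_size < static_cast<int64_t>(block_table.size()));
    const int64_t logical_block = j / block_size;
    const int64_t offset        = j % block_size;
    return static_cast<int64_t>(block_table[logical_block]) * block_size + offset;
}
```

Under these assumptions, a paged call whose stock heuristic prefers mma is rerouted to vec, and a kernel that skips `phys` would read cell j directly instead of the cell the table points at, which is the silent misread the guard is meant to prevent.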