feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)

Increment 3 attention lever. In the paged in-kernel decode dispatch, route the common grouped-query F16 case to the tile kernel and keep the inc-1 vec kernel for everything else.

Tile groups the q-heads that share a kv-head (ncols2), so each K/V row is loaded once per group instead of once per q-head, and it runs at higher occupancy (108-128 regs, vs 168 regs for vec, which caps vec at 25% occupancy). On GB10 (Qwen3-32B NVFP4, F16 cache, gqa 8, batch 32, 1024 ctx, same build, env-toggled) this cuts the decode step from 186.3 to 177.9 ms/step (-4.5%), within 1.8% of stock (174.8). The win grows with context (tile vs vec decode step, npl=8): 1024 -2.3%, 4096 -3.3%, 8192 -4.1%, 16384 -6.1%, as attention takes a larger share of the step.

Routing guard: tile has no K/V type template (it loads half2), so a non-F16 cache would be converted to a contiguous F16 copy by launch_fattn, breaking the in-kernel block-table read. Tile is therefore correct only for an F16 cache, and the grouping only helps at gqa>=2. Tile is used only for {F16 K and V, gqa_ratio>=2}; everything else falls back to the inc-1 vec path, exactly as before this change. LLAMA_KV_PAGED_VEC=1 forces vec for A/B. The inc-2 phys(j) tile read (patch 0010) was already plumbed; this only adds the default route. (Paged decode currently needs an F16 cache; quantized + paged is a pre-existing limitation unaffected by this change: stock+q8_0 works, paged+q8_0 aborts both before and after.)

Split-K was ruled out: the vec decode grid is already block-saturated (~43 waves over 144 resident blocks on 48 SMs), so more parallel_blocks adds no SM fill; the under-saturation is intra-SM (occupancy + 8x KV re-streaming), which GQA grouping attacks directly.

Validated (greedy): CPU plumbing gate (0.6B, build-cpu, paged-on vs off) byte-identical; GPU 0.6B gqa=2 tile token-coherent with the inc-1 vec path (7/8 sequences identical, 8th in the same kernel-noise band where vec also drifts from stock); 32B gqa=8 tile tracks stock at least as well as vec.

Stock (no block table) is byte-identical: the dispatch guard only diverts on src[5]. Full rationale and numbers in the patch header.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
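The split-K argument in the message above comes down to a few lines of arithmetic. The sketch below reproduces it with hypothetical helper names (grid_blocks, resident_blocks, waves, kv_restream_factor are not part of ggml). It takes 3 resident blocks per SM for vec as given by the message, and it assumes the 2048 factor in the patch header is q-heads x sequences (64 x 32), which the source does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, for illustration only; none of these exist in ggml.

// Blocks launched by the vec decode kernel: column tiles x split-K slices x
// (q-heads x sequences). The patch header quotes 1 x 3 x 2048.
inline int64_t grid_blocks(int64_t ntiles_x, int64_t parallel_blocks, int64_t heads_x_seqs) {
    return ntiles_x * parallel_blocks * heads_x_seqs;
}

// Blocks that can be resident at the same time across the whole GPU.
inline int64_t resident_blocks(int64_t blocks_per_sm, int64_t n_sm) {
    return blocks_per_sm * n_sm;
}

// Number of waves needed to drain the grid, rounded up.
inline int64_t waves(int64_t grid, int64_t resident) {
    assert(resident > 0);
    return (grid + resident - 1) / resident;
}

// How many times a kernel without head grouping streams each kv-head's K/V:
// once per q-head that shares that kv-head.
inline int64_t kv_restream_factor(int64_t n_head, int64_t n_head_kv) {
    assert(n_head_kv > 0 && n_head % n_head_kv == 0);
    return n_head / n_head_kv;
}
```

With the quoted numbers, grid_blocks(1, 3, 64 * 32) gives 6144 blocks, resident_blocks(3, 48) gives 144, and waves(6144, 144) gives 43, so the grid already oversubscribes the SMs many times over and extra split-K slices would only add more waves. kv_restream_factor(64, 8) gives the 8x re-streaming that GQA grouping removes.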
2026-06-23 08:08:52 -04:00 · 2026-06-22 22:19:35 +00:00
parent 2c5adda28c
commit e983919516
1 changed files with 147 additions and 0 deletions
--- a/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
@@ -0,0 +1,147 @@
+From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 00:18:35 +0200
+Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
+ gqa>=2) - patch 0011
+
+Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
+in-kernel decode to the tile kernel for the common grouped-query F16 case, and
+keep the inc-1 vec kernel for everything else.
+
+The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
+q-heads that share one kv-head, so each K/V row is loaded once for the whole
+group instead of once per q-head. vec re-streams each kv-head's K/V once per
+q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
+3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
+The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
+this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
+
+Routing guard (why conditional): the tile kernel has no K/V type template - it
+loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
+launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
+read (the table indexes the original paged layout, not the copy). So tile is
+correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
+fall back to the inc-1 vec path, exactly as before this change. The head-group
+reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
+Note: paged decode is currently exercised with an F16 cache only; quantized +
+paged is a separate pre-existing limitation, independent of this change
+(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
+after this patch, since both route the non-F16 cache to vec).
+
+Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
+1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
+same build, env-toggled:
+  STOCK (mma)            174.8 ms/step  183.1 t/s
+  PAGED-VEC  (inc-1)     186.3 ms/step  171.8 t/s   (+6.6% vs stock)
+  PAGED-TILE (inc-3)     177.9 ms/step  179.8 t/s   (+1.8% vs stock)
+GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
+paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
+vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 -4.1%, 16384 -6.1%, as attention
+takes a larger share of the step.
+
+Why not the split-K tune: the vec decode grid is already block-saturated
+(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
+SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
+intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
+directly; more split-K does not.
+
+Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
+  - CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
+  - GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
+    in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
+    band where vec also drifts from stock. Stock uses the mma kernel for this
+    multi-stream GQA shape, so a different kernel = different rounding =
+    autoregressive token drift; vec and tile agree with each other while both
+    differ from stock (both pick 15678 where stock picks 38835), confirming the
+    drift is kernel choice, not a paging error.
+  - GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
+    (seq3: tile == stock == 624 at the token where vec picked 13).
+
+Stock is byte-identical: the dispatch guard only diverts when the block table
+(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
+path reads the last nbatch_fa tile with oob_check=false and relies on the mask
+-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
+mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
+
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+Assisted-by: Claude:opus-4.8 [Claude Code]
+---
+ ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
+ 1 file changed, 36 insertions(+), 15 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index afcafa2..6b15810 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
+     // silently read the wrong (contiguous physical) cells. So when a block table
+     // is present we route here and NEVER fall through to the best-kernel switch
+     // below - no decode shape can silently reach an mma/wmma misread. build_attn
+-    // only sets src[5] for the 1-token-per-stream decode shape; the vec
++    // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
+     // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+     // and any shape that should not be paged must take the host-side gather path
+     // (LLAMA_KV_PAGED_GATHER=1) instead.
+     //
+-    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+-    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+-    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+-    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+-    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+-    // with oob_check=false while the compacted paged mask is not padded to cover
+-    // it, so it diverges from stock. Not for production paged decode until
+-    // increment-3 bounds that path; the default vec route is unaffected.
++    // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
++    // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
++    // kv-head (ncols2), loading each K/V row once for the whole group instead of
++    // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
++    // Two constraints make this conditional: (1) the tile kernel has no K/V type
++    // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
++    // converted by launch_fattn to a contiguous F16 copy, which breaks the
++    // in-kernel block-table read (the table indexes the original paged layout, not
++    // the copy); vec instead reads the original cache with in-kernel dequant, so it
++    // is the only correct paged path for non-F16 caches. (2) the head-group reuse
++    // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
++    // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
++    // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
++    // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
++    // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
++    // brings paged decode to within ~1.8% of stock. Validated token-coherent with vec:
++    // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
++    // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
++    // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
++    // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
++    // uses for ncols2>1); the compacted paged mask is gathered to the n_view
++    // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
++    // the inc-1 vec path for A/B.
+     if (dst->src[5] != nullptr) {
+-        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
++        const ggml_tensor * Qp = dst->src[0];
++        const ggml_tensor * Kp = dst->src[1];
++        const ggml_tensor * Vp = dst->src[2];
++        const bool kv_f16    = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
++        const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
++        const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
++        const bool use_tile  = !force_vec && kv_f16 && gqa_ratio >= 2;
+         if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+             static bool logged = false;
+             if (!logged) {
+                 logged = true;
+-                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+-                    paged_tile ? "TILE(experimental)" : "VEC",
+-                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+-                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
++                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
++                    use_tile ? "TILE(gqa)" : "VEC",
++                    (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
++                    (long) gqa_ratio, (int) kv_f16);
+             }
+         }
+-        if (paged_tile) {
++        if (use_tile) {
+             ggml_cuda_flash_attn_ext_tile(ctx, dst);
+         } else {
+             ggml_cuda_flash_attn_ext_vec(ctx, dst);
+-- 
+2.43.0
+
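The routing rule added by this hunk can be checked in isolation. The following is a minimal host-only sketch, not the code in fattn.cu: kv_type, paged_route and choose_paged_route are hypothetical stand-ins for ggml_type and the two dispatcher calls, and the environment toggle is passed in as a plain bool instead of being read with getenv.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration; the real code uses ggml_type and
// calls ggml_cuda_flash_attn_ext_tile / ggml_cuda_flash_attn_ext_vec directly.
enum class kv_type { f16, bf16, q8_0 };
enum class paged_route { vec, tile };

// Q heads per KV head, with the same guard against a zero KV head count as the patch.
inline int64_t gqa_ratio(int64_t n_head_q, int64_t n_head_kv) {
    assert(n_head_q >= 0 && n_head_kv >= 0);
    return n_head_kv > 0 ? n_head_q / n_head_kv : 1;
}

// Tile only when K and V are both F16, heads are actually grouped (ratio >= 2),
// and the A/B override is not set; everything else stays on vec.
inline paged_route choose_paged_route(kv_type k, kv_type v,
                                      int64_t n_head_q, int64_t n_head_kv,
                                      bool force_vec) {
    const bool kv_f16   = k == kv_type::f16 && v == kv_type::f16;
    const bool use_tile = !force_vec && kv_f16 && gqa_ratio(n_head_q, n_head_kv) >= 2;
    return use_tile ? paged_route::tile : paged_route::vec;
}
```

For the benchmarked shape (F16 cache, 64 q-heads over 8 kv-heads) this selects tile; a q8_0 cache, a gqa ratio of 1, or force_vec set each send the same call to vec, which matches the fallback behaviour described in the patch header.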