diff --git a/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch b/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
new file mode 100644
index 000000000..795fa6a72
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
@@ -0,0 +1,147 @@
+From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Tue, 23 Jun 2026 00:18:35 +0200
+Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
+ gqa>=2) - patch 0011
+
+Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
+in-kernel decode to the tile kernel for the common grouped-query F16 case, and
+keep the inc-1 vec kernel for everything else.
+
+The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
+q-heads that share one kv-head, so each K/V row is loaded once for the whole
+group instead of once per q-head. vec re-streams each kv-head's K/V once per
+q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
+3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
+The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
+this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
+
+Routing guard (why conditional): the tile kernel has no K/V type template - it
+loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
+launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
+read (the table indexes the original paged layout, not the copy). So tile is
+correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
+fall back to the inc-1 vec path, exactly as before this change. The head-group
+reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
+Note: paged decode is currently exercised with an F16 cache only; quantized +
+paged is a separate pre-existing limitation, independent of this change
+(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
+after this patch, since both route the non-F16 cache to vec).
+
+Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
+1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
+same build, env-toggled:
+  STOCK (mma)          174.8 ms/step   183.1 t/s
+  PAGED-VEC (inc-1)    186.3 ms/step   171.8 t/s   (+6.6% vs stock)
+  PAGED-TILE (inc-3)   177.9 ms/step   179.8 t/s   (+1.8% vs stock)
+GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
+paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
+vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
+takes a larger share of the step.
+
+Why not the split-K tune: the vec decode grid is already block-saturated
+(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
+SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
+intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
+directly; more split-K does not.
+
+Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
+  - CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
+  - GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
+    in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
+    band where vec also drifts from stock. Stock uses the mma kernel for this
+    multi-stream GQA shape, so a different kernel = different rounding =
+    autoregressive token drift; vec and tile agree with each other while both
+    differ from stock (both pick 15678 where stock picks 38835), confirming the
+    drift is kernel choice, not a paging error.
+  - GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
+    (seq3: tile == stock == 624 at the token where vec picked 13).
+
+Stock is byte-identical: the dispatch guard only diverts when the block table
+(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
+path reads the last nbatch_fa tile with oob_check=false and relies on the mask
+-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
+mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
+
+Signed-off-by: Ettore Di Giacinto
+Assisted-by: Claude:opus-4.8 [Claude Code]
+---
+ ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
+ 1 file changed, 36 insertions(+), 15 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index afcafa2..6b15810 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
++++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
+     // silently read the wrong (contiguous physical) cells. So when a block table
+     // is present we route here and NEVER fall through to the best-kernel switch
+     // below - no decode shape can silently reach an mma/wmma misread. build_attn
+-    // only sets src[5] for the 1-token-per-stream decode shape; the vec
++    // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
+     // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+     // and any shape that should not be paged must take the host-side gather path
+     // (LLAMA_KV_PAGED_GATHER=1) instead.
+     //
+-    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+-    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+-    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+-    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+-    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+-    // with oob_check=false while the compacted paged mask is not padded to cover
+-    // it, so it diverges from stock. Not for production paged decode until
+-    // increment-3 bounds that path; the default vec route is unaffected.
++    // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
++    // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
++    // kv-head (ncols2), loading each K/V row once for the whole group instead of
++    // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
++    // Two constraints make this conditional: (1) the tile kernel has no K/V type
++    // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
++    // converted by launch_fattn to a contiguous F16 copy, which breaks the
++    // in-kernel block-table read (the table indexes the original paged layout, not
++    // the copy); vec instead reads the original cache with in-kernel dequant, so it
++    // is the only correct paged path for non-F16 caches. (2) the head-group reuse
++    // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
++    // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
++    // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
++    // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
++    // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
++    // brings paged decode to ~1.8% of stock. Validated token-coherent with vec:
++    // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
++    // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
++    // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
++    // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
++    // uses for ncols2>1); the compacted paged mask is gathered to the n_view
++    // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
++    // the inc-1 vec path for A/B.
+     if (dst->src[5] != nullptr) {
+-        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
++        const ggml_tensor * Qp = dst->src[0];
++        const ggml_tensor * Kp = dst->src[1];
++        const ggml_tensor * Vp = dst->src[2];
++        const bool kv_f16 = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
++        const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
++        const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
++        const bool use_tile = !force_vec && kv_f16 && gqa_ratio >= 2;
+         if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+             static bool logged = false;
+             if (!logged) {
+                 logged = true;
+-                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+-                        paged_tile ? "TILE(experimental)" : "VEC",
+-                        (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+-                        (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
++                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
++                        use_tile ? "TILE(gqa)" : "VEC",
++                        (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
++                        (long) gqa_ratio, (int) kv_f16);
+             }
+         }
+-        if (paged_tile) {
++        if (use_tile) {
+             ggml_cuda_flash_attn_ext_tile(ctx, dst);
+         } else {
+             ggml_cuda_flash_attn_ext_vec(ctx, dst);
+-- 
+2.43.0
+
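Reviewer sketch (not part of the patch): the routing guard above reduces to a small pure predicate, which may be easier to reason about and unit-test in isolation. The version below is a minimal illustration under stated assumptions: KvType, PagedDecodeShape, PagedKernel and choose_paged_kernel are hypothetical names that do not exist in ggml, and the force_vec parameter stands in for the patch's getenv("LLAMA_KV_PAGED_VEC") check. Only the decision rule itself, tile for {F16 K and V, gqa_ratio >= 2} and vec otherwise, is taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the ggml tensor fields the guard reads.
// None of these types exist in ggml; they only model the inputs.
enum class KvType { F16, BF16, Q8_0 };
enum class PagedKernel { Vec, Tile };

struct PagedDecodeShape {
    KvType  k_type;
    KvType  v_type;
    int64_t n_head;     // number of Q heads   (Qp->ne[2] in the patch)
    int64_t n_head_kv;  // number of K/V heads (Kp->ne[2] in the patch)
};

// Integer GQA ratio, guarded against a zero K/V head count as in the patch.
constexpr int64_t gqa_ratio(const PagedDecodeShape & s) {
    return s.n_head_kv > 0 ? s.n_head / s.n_head_kv : 1;
}

// The tile kernel loads half2, so both K and V must already be F16.
constexpr bool kv_is_f16(const PagedDecodeShape & s) {
    return s.k_type == KvType::F16 && s.v_type == KvType::F16;
}

// Same rule as the patch: tile only when it is both correct (F16 cache)
// and useful (at least two q-heads share a kv-head); force_vec overrides.
constexpr PagedKernel choose_paged_kernel(const PagedDecodeShape & s, bool force_vec) {
    return (!force_vec && kv_is_f16(s) && gqa_ratio(s) >= 2) ? PagedKernel::Tile
                                                             : PagedKernel::Vec;
}

// Compile-time check for the Qwen3-32B shape from the commit message
// (n_head 64, n_head_kv 8, F16 cache).
static_assert(gqa_ratio(PagedDecodeShape{KvType::F16, KvType::F16, 64, 8}) == 8,
              "64 q-heads over 8 kv-heads gives a GQA ratio of 8");
static_assert(choose_paged_kernel(PagedDecodeShape{KvType::F16, KvType::F16, 64, 8}, false)
                  == PagedKernel::Tile,
              "F16 cache with gqa 8 routes to tile");
```

Passing force_vec as a parameter instead of reading the environment inside the function keeps the predicate deterministic, so the non-F16 and gqa==1 fallbacks can be checked without touching process state. Note that the ratio uses integer division, as the patch does, so a shape with fewer q-heads than kv-heads yields 0 and falls back to vec.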