mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 08:08:52 -04:00
feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)
Increment 3 attention lever. In the paged in-kernel decode dispatch, route the
common grouped-query F16 case to the tile kernel and keep the inc-1 vec kernel
for everything else. Tile groups the q-heads that share a kv-head (ncols2) so
each K/V row is loaded once per group instead of once per q-head, and runs at
higher occupancy (108-128 regs vs vec 168 -> 25%). On GB10 (Qwen3-32B NVFP4,
F16 cache, gqa 8, batch 32, 1024 ctx, same build, env-toggled) this cuts the
decode step from 186.3 to 177.9 ms/step (-4.5%), within 1.8% of stock (174.8).
The win grows with context (tile vs vec decode step, npl=8): 1024 -2.3%, 4096
-3.3%, 8192 -4.1%, 16384 -6.1%, as attention takes a larger share of the step.
Routing guard: tile has no K/V type template (loads half2), so a non-F16 cache
would be converted to a contiguous F16 copy by launch_fattn, breaking the
in-kernel block-table read. So tile is correct only for an F16 cache, and the
grouping only helps at gqa>=2. tile is used only for {F16 K and V, gqa_ratio>=2};
everything else falls back to the inc-1 vec path, exactly as before this change.
LLAMA_KV_PAGED_VEC=1 forces vec for A/B. The inc-2 phys(j) tile read (patch 0010)
was already plumbed; this only adds the default route. (Paged decode currently
needs an F16 cache; quantized + paged is a pre-existing limitation unaffected by
this change: stock+q8_0 works, paged+q8_0 aborts both before and after.)
Split-K was ruled out: the vec decode grid is already block-saturated (~43 waves
over 144 resident on 48 SM), so more parallel_blocks adds no SM fill; the
under-saturation is intra-SM occupancy + 8x KV re-streaming, which GQA grouping
attacks directly.
Validated (greedy): CPU plumbing gate (0.6B, build-cpu, paged-on vs off)
byte-identical; GPU 0.6B gqa=2 tile token-coherent with the inc-1 vec path
(7/8 sequences identical, 8th in the same kernel-noise band where vec also
drifts from stock); 32B gqa=8 tile tracks stock at least as well as vec. Stock
(no block table) is byte-identical: the dispatch guard only diverts on src[5].
Full rationale and numbers in the patch header.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
This commit is contained in:
@@ -0,0 +1,147 @@
|
||||
From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Tue, 23 Jun 2026 00:18:35 +0200
|
||||
Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
|
||||
gqa>=2) - patch 0011
|
||||
|
||||
Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
|
||||
in-kernel decode to the tile kernel for the common grouped-query F16 case, and
|
||||
keep the inc-1 vec kernel for everything else.
|
||||
|
||||
The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
|
||||
q-heads that share one kv-head, so each K/V row is loaded once for the whole
|
||||
group instead of once per q-head. vec re-streams each kv-head's K/V once per
|
||||
q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
|
||||
3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
|
||||
The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
|
||||
this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
|
||||
|
||||
Routing guard (why conditional): the tile kernel has no K/V type template - it
|
||||
loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
|
||||
launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
|
||||
read (the table indexes the original paged layout, not the copy). So tile is
|
||||
correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
|
||||
fall back to the inc-1 vec path, exactly as before this change. The head-group
|
||||
reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
|
||||
Note: paged decode is currently exercised with an F16 cache only; quantized +
|
||||
paged is a separate pre-existing limitation, independent of this change
|
||||
(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
|
||||
after this patch, since both route the non-F16 cache to vec).
|
||||
|
||||
Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
|
||||
1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
|
||||
same build, env-toggled:
|
||||
STOCK (mma) 174.8 ms/step 183.1 t/s
|
||||
PAGED-VEC (inc-1) 186.3 ms/step 171.8 t/s (+6.6% vs stock)
|
||||
PAGED-TILE (inc-3) 177.9 ms/step 179.8 t/s (+1.8% vs stock)
|
||||
GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
|
||||
paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
|
||||
vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
|
||||
takes a larger share of the step.
|
||||
|
||||
Why not the split-K tune: the vec decode grid is already block-saturated
|
||||
(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
|
||||
SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
|
||||
intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
|
||||
directly; more split-K does not.
|
||||
|
||||
Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
|
||||
- CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
|
||||
- GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
|
||||
in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
|
||||
band where vec also drifts from stock. Stock uses the mma kernel for this
|
||||
multi-stream GQA shape, so a different kernel = different rounding =
|
||||
autoregressive token drift; vec and tile agree with each other while both
|
||||
differ from stock (both pick 15678 where stock picks 38835), confirming the
|
||||
drift is kernel choice, not a paging error.
|
||||
- GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
|
||||
(seq3: tile == stock == 624 at the token where vec picked 13).
|
||||
|
||||
Stock is byte-identical: the dispatch guard only diverts when the block table
|
||||
(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
|
||||
path reads the last nbatch_fa tile with oob_check=false and relies on the mask
|
||||
-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
|
||||
mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
|
||||
|
||||
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
---
|
||||
ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
|
||||
1 file changed, 36 insertions(+), 15 deletions(-)
|
||||
|
||||
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
|
||||
index afcafa2..6b15810 100644
|
||||
--- a/ggml/src/ggml-cuda/fattn.cu
|
||||
+++ b/ggml/src/ggml-cuda/fattn.cu
|
||||
@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
|
||||
// silently read the wrong (contiguous physical) cells. So when a block table
|
||||
// is present we route here and NEVER fall through to the best-kernel switch
|
||||
// below - no decode shape can silently reach an mma/wmma misread. build_attn
|
||||
- // only sets src[5] for the 1-token-per-stream decode shape; the vec
|
||||
+ // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
|
||||
// dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
|
||||
// and any shape that should not be paged must take the host-side gather path
|
||||
// (LLAMA_KV_PAGED_GATHER=1) instead.
|
||||
//
|
||||
- // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
|
||||
- // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
|
||||
- // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
|
||||
- // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
|
||||
- // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
|
||||
- // with oob_check=false while the compacted paged mask is not padded to cover
|
||||
- // it, so it diverges from stock. Not for production paged decode until
|
||||
- // increment-3 bounds that path; the default vec route is unaffected.
|
||||
+ // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
|
||||
+ // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
|
||||
+ // kv-head (ncols2), loading each K/V row once for the whole group instead of
|
||||
+ // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
|
||||
+ // Two constraints make this conditional: (1) the tile kernel has no K/V type
|
||||
+ // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
|
||||
+ // converted by launch_fattn to a contiguous F16 copy, which breaks the
|
||||
+ // in-kernel block-table read (the table indexes the original paged layout, not
|
||||
+ // the copy); vec instead reads the original cache with in-kernel dequant, so it
|
||||
+ // is the only correct paged path for non-F16 caches. (2) the head-group reuse
|
||||
+ // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
|
||||
+ // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
|
||||
+ // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
|
||||
+ // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
|
||||
+ // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
|
||||
+ // brings paged decode to ~1.8% of stock. Validated token-coherent with vec:
|
||||
+ // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
|
||||
+ // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
|
||||
+ // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
|
||||
+ // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
|
||||
+ // uses for ncols2>1); the compacted paged mask is gathered to the n_view
|
||||
+ // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
|
||||
+ // the inc-1 vec path for A/B.
|
||||
if (dst->src[5] != nullptr) {
|
||||
- static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
|
||||
+ const ggml_tensor * Qp = dst->src[0];
|
||||
+ const ggml_tensor * Kp = dst->src[1];
|
||||
+ const ggml_tensor * Vp = dst->src[2];
|
||||
+ const bool kv_f16 = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
|
||||
+ const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
|
||||
+ const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
|
||||
+ const bool use_tile = !force_vec && kv_f16 && gqa_ratio >= 2;
|
||||
if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
|
||||
static bool logged = false;
|
||||
if (!logged) {
|
||||
logged = true;
|
||||
- fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
|
||||
- paged_tile ? "TILE(experimental)" : "VEC",
|
||||
- (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
|
||||
- (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
|
||||
+ fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
|
||||
+ use_tile ? "TILE(gqa)" : "VEC",
|
||||
+ (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
|
||||
+ (long) gqa_ratio, (int) kv_f16);
|
||||
}
|
||||
}
|
||||
- if (paged_tile) {
|
||||
+ if (use_tile) {
|
||||
ggml_cuda_flash_attn_ext_tile(ctx, dst);
|
||||
} else {
|
||||
ggml_cuda_flash_attn_ext_vec(ctx, dst);
|
||||
--
|
||||
2.43.0
|
||||
|
||||
Reference in New Issue
Block a user