feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off

0042 fuses the pre-norm residual add into RMSNorm (+0.5% prefill, bit-exact). 0043 makes the full-step MoE decode CUDA graph default-on (+2-4% decode, bit-exact; removes ~18x per-step host kernel re-issue, A/B-confirmed). 0034 (native FP4-MMA W4A4) and 0035 (Marlin-style W4A16 grouped MoE GEMM) are correct + bit-exact but regress vs the int8 FP4-MMQ in-backend on GB10 (bf16 MMA is ~half the int8 rate); shipped default-off as validated mechanisms and recorded negatives per the parity methodology. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-29 19:06:43 -04:00 · 2026-06-29 06:15:10 +00:00
parent f1c98ff0b9
commit c4058eb4da
4 changed files with 1657 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch
@@ -0,0 +1,638 @@
+From 14824147a504b58cc8be2f127f7d6bedb672cfc9 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 29 Jun 2026 00:11:22 +0200
+Subject: [PATCH] feat(paged): native NVFP4 (W4A4) FP4-MMA large-M prefill GEMM
+ (patch 0034)
+
+Replace the rejected 0033 dequant->bf16 cuBLAS scaffold with a native FP4-MMA
+(W4A4 block-scale OMMA) large-M GEMM that engages only at prefill, behind the
+same LLAMA_FP4_PREFILL_M threshold, so decode / small-M stay byte-untouched.
+
+KERNEL (ggml/src/ggml-cuda/fp4-gemm.{cu,cuh}): the VERIFIED PoC
+(fp4_gemm_w4a4_opt.cu, NMSE=0 vs same-dequant f32) copied verbatim at its tuned
+best config 128x128 / KBLK4 / STAGES2 / PAD4 (~103 TFLOP/s, beats cuBLAS bf16).
+Preserved exactly: e4m3(true_scale) convention, the ldmatrix.sync.m8n8.x4 A-operand
+load, the mma.sync.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64 OMMA, cp.async
+multistage prefetch, register-resident accumulators, smem PAD. Activations are
+quantized with the SAME math as quantize_mmq_nvfp4 (e4m3 amax/6 + the +/-2 code
+search + ggml_cuda_float_to_fp4_e2m1), so it is bit-exact-by-construction with the
+shipped FP4-MMQ path (only the K-reduction order differs, greedy-md5 gated).
+
+DENSE: routed in ggml_cuda_mul_mat via ggml_cuda_fp4_prefill_should_engage()
+(src0 NVFP4 + src1/dst f32, contiguous, non-transposed, 2D, Blackwell, M>thr,
+N%128==0, K%256==0). Non-divisible shapes fall back to FP4-MMQ (NOT the rejected
+bf16 cuBLAS path). LANDED + greedy-md5 byte-identical (on==off: "Paris").
+
+MoE GROUPED (the actual prefill bottleneck): mmq.cu forces the grouped FP4-MMQ
+id-path OFF at large M (n_experts>0), so mul_mat_id falls to its per-expert
+host-sync loop where each expert slice flows back through ggml_cuda_mul_mat and
+hits the native kernel per-expert. Prefill is not graph-replayed so this is safe;
+decode keeps ne12<=threshold so the graph-safe MMQ id-path (patch 0025) is
+untouched. LANDED via host-sync + greedy-md5 byte-identical (on==off).
+FOLLOW-UP (flagged): a graph-safe ragged-batched grouped FP4-MMA kernel to remove
+the per-expert host-sync loop; out of scope for this pass.
+
+BUILD: arch=compute_121a,code=[compute_121a,sm_121a] already in build-cuda flags;
+the kernel uses BLACKWELL_MMA_AVAILABLE/CP_ASYNC_AVAILABLE guards. Incremental
+build-cuda green (ggml-cuda relinked, llama-server + llama-cli relinked).
+
+Default-off (LLAMA_FP4_PREFILL_M=0 == stock); set env/-D to engage.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/fp4-gemm.cu  | 453 ++++++++++++++++++++++++++++++++
+ ggml/src/ggml-cuda/fp4-gemm.cuh |  38 +++
+ ggml/src/ggml-cuda/ggml-cuda.cu |  14 +
+ ggml/src/ggml-cuda/mmq.cu       |  35 +--
+ 4 files changed, 525 insertions(+), 15 deletions(-)
+ create mode 100644 ggml/src/ggml-cuda/fp4-gemm.cu
+ create mode 100644 ggml/src/ggml-cuda/fp4-gemm.cuh
+
+diff --git a/ggml/src/ggml-cuda/fp4-gemm.cu b/ggml/src/ggml-cuda/fp4-gemm.cu
+new file mode 100644
+index 0000000..86da551
+--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-gemm.cu
+@@ -0,0 +1,453 @@
+#include "fp4-gemm.cuh"
+
+#include <cfloat>
+#include <cstdint>
+#include <cstdlib>
+
+// ===========================================================================
+// [paged patch 0034] Native NVFP4 (W4A4) large-M GEMM. See fp4-gemm.cuh.
+//
+// The GEMM kernel, the m16n8k64 block-scale OMMA wrapper, the cp.async helpers and
+// the layout-split kernel are the VERIFIED PoC (fp4_gemm_w4a4_opt.cu, NMSE=0) copied
+// verbatim - do not "tidy" the index math, it is the load-bearing correctness.
+// ===========================================================================
+
+#define FP4_QK 64           // == QK_NVFP4
+#define FP4_SAW 8           // u32 per nvfp4 block qs (32 bytes)
+
+#ifndef LLAMA_FP4_PREFILL_M
+#define LLAMA_FP4_PREFILL_M 0
+#endif // LLAMA_FP4_PREFILL_M
+
+static int64_t ggml_cuda_fp4_prefill_m() {
+    static const int64_t m = [] {
+        const char * e = getenv("LLAMA_FP4_PREFILL_M");
+        return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M;
+    }();
+    return m;
+}
+
+// ---- layout split: block_nvfp4[R*Kb] -> qs codes [R*Kb*8 u32] + scales [R*Kb u32] ----
+// Same fp4 codes & e4m3 scale bytes as the GGUF, restored into two contiguous,
+// 16B-friendly arrays so the kernel's cp.async copies are coalesced. (PoC verbatim.)
+static __global__ void fp4_split_layout(
+        const block_nvfp4 * __restrict__ X, uint32_t * __restrict__ Q, uint32_t * __restrict__ S,
+        int R, int Kb) {
+    const int64_t b   = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+    const int64_t tot = (int64_t) R * Kb;
+    if (b >= tot) {
+        return;
+    }
+    const block_nvfp4 & blk = X[b];
+    const uint32_t * q = (const uint32_t *) blk.qs;
+    uint32_t * dq = &Q[b * 8];
+#pragma unroll
+    for (int w = 0; w < 8; w++) {
+        dq[w] = q[w];
+    }
+    S[b] = *(const uint32_t *) blk.d;
+}
+
+// ---- activation quantizer: f32 [M_real x K] -> split NVFP4 (Aq codes + As scales) ----
+// Uses the SAME math as quantize_mmq_nvfp4 (quantize.cu): e4m3 scale = ue4m3(amax/6)
+// with the +/-2 code search, ggml_cuda_float_to_fp4_e2m1 for the nibbles, so the
+// activation codes are identical to the shipped FP4-MMQ path. Packs into the PoC
+// block layout (qs[s*8+j] = code(e[j]) | code(e[j+8])<<4) expected by the kernel's
+// ldmatrix A-operand load. One thread per (row, kb, sub-block).
+static __global__ void fp4_quantize_act_split(
+        const float * __restrict__ x, uint32_t * __restrict__ Aq, uint32_t * __restrict__ As,
+        int M_real, int K, int Kb) {
+#ifdef BLACKWELL_MMA_AVAILABLE
+    const int64_t tot = (int64_t) M_real * Kb * 4; // 4 sub-blocks per 64-element block
+    const int64_t t   = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+    if (t >= tot) {
+        return;
+    }
+    const int     sub = (int) (t & 3);
+    const int64_t rb  = t >> 2;            // row*Kb + kb
+    const int     kb  = (int) (rb % Kb);
+    const int64_t row = rb / Kb;
+
+    const float * v16 = x + row * (int64_t) K + (int64_t) kb * FP4_QK + sub * 16;
+    float vals[16];
+    float amax = 0.0f;
+#pragma unroll
+    for (int k = 0; k < 16; k++) {
+        const float vv = v16[k];
+        vals[k] = vv;
+        amax = fmaxf(amax, fabsf(vv));
+    }
+
+    static constexpr int test_offsets[5] = { 0, -1, 1, -2, 2 };
+    const int first_fp8_code = (int) ggml_cuda_fp32_to_ue4m3(amax / 6.0f);
+
+    float   best_err = FLT_MAX;
+    uint8_t fp8_code = 0;
+    float   subblock_scale = 0.0f;
+#pragma unroll
+    for (int i = 0; i < 5; i++) {
+        const int test_code = first_fp8_code + test_offsets[i];
+        if (test_code < 0 || test_code > 0x7e) {
+            continue;
+        }
+        const uint8_t code           = (uint8_t) test_code;
+        const float   test_scale     = ggml_cuda_ue4m3_to_fp32(code);
+        const float   test_inv_scale = test_scale > 0.0f ? 0.5f / test_scale : 0.0f;
+        float cur_err = 0.0f;
+#pragma unroll
+        for (int k = 0; k < 16; k++) {
+            const uint8_t q        = ggml_cuda_float_to_fp4_e2m1(vals[k], test_inv_scale);
+            const float   err_diff = fabsf(vals[k]) - fabsf((float) kvalues_mxfp4[q & 0x7]) * test_scale;
+            cur_err = fmaf(err_diff, err_diff, cur_err);
+        }
+        if (cur_err < best_err) {
+            best_err       = cur_err;
+            fp8_code       = code;
+            subblock_scale = test_scale;
+        }
+    }
+    const float inv_scale = subblock_scale > 0.0f ? 0.5f / subblock_scale : 0.0f;
+
+    // PoC packing: qs[s*8+j] = code(e[j]) | code(e[j+8])<<4 -> two u32 words per sub-block.
+    uint32_t w0 = 0, w1 = 0;
+#pragma unroll
+    for (int j = 0; j < 4; j++) {
+        const uint32_t lo = ggml_cuda_float_to_fp4_e2m1(vals[j],     inv_scale);
+        const uint32_t hi = ggml_cuda_float_to_fp4_e2m1(vals[j + 8], inv_scale);
+        w0 |= ((lo | (hi << 4)) & 0xff) << (8 * j);
+    }
+#pragma unroll
+    for (int j = 0; j < 4; j++) {
+        const uint32_t lo = ggml_cuda_float_to_fp4_e2m1(vals[j + 4],  inv_scale);
+        const uint32_t hi = ggml_cuda_float_to_fp4_e2m1(vals[j + 12], inv_scale);
+        w1 |= ((lo | (hi << 4)) & 0xff) << (8 * j);
+    }
+
+    const int64_t blk = row * (int64_t) Kb + kb;
+    Aq[blk * 8 + sub * 2 + 0] = w0;
+    Aq[blk * 8 + sub * 2 + 1] = w1;
+    reinterpret_cast<uint8_t *>(As + blk)[sub] = fp8_code;
+#else
+    GGML_UNUSED(x); GGML_UNUSED(Aq); GGML_UNUSED(As);
+    GGML_UNUSED(M_real); GGML_UNUSED(K); GGML_UNUSED(Kb);
+    NO_DEVICE_CODE;
+#endif // BLACKWELL_MMA_AVAILABLE
+}
+
+// ---- native FP4 block-scale OMMA wrapper (PoC verbatim) ----
+static __device__ __forceinline__ void fp4_mma(
+        float d[4], const uint32_t a[4], const uint32_t b[2], uint32_t as, uint32_t bs) {
+#ifdef BLACKWELL_MMA_AVAILABLE
+    asm volatile(
+      "mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3 "
+      "{%0,%1,%2,%3},{%4,%5,%6,%7},{%8,%9},{%0,%1,%2,%3},%10,{0,0},%11,{0,0};"
+      : "+f"(d[0]),"+f"(d[1]),"+f"(d[2]),"+f"(d[3])
+      : "r"(a[0]),"r"(a[1]),"r"(a[2]),"r"(a[3]),"r"(b[0]),"r"(b[1]),"r"(as),"r"(bs));
+#else
+    GGML_UNUSED(d); GGML_UNUSED(a); GGML_UNUSED(b); GGML_UNUSED(as); GGML_UNUSED(bs);
+    NO_DEVICE_CODE;
+#endif // BLACKWELL_MMA_AVAILABLE
+}
+
+// ---- cp.async helpers (PoC verbatim) ----
+static __device__ __forceinline__ void fp4_cp_async16(void * smem, const void * gmem) {
+#ifdef CP_ASYNC_AVAILABLE
+    unsigned s = (unsigned) __cvta_generic_to_shared(smem);
+    asm volatile("cp.async.cg.shared.global [%0],[%1],16;\n" :: "r"(s), "l"(gmem));
+#else
+    GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+template<int B>
+static __device__ __forceinline__ void fp4_cp_async_small(void * smem, const void * gmem) {
+#ifdef CP_ASYNC_AVAILABLE
+    unsigned s = (unsigned) __cvta_generic_to_shared(smem);
+    asm volatile("cp.async.ca.shared.global [%0],[%1],%2;\n" :: "r"(s), "l"(gmem), "n"(B));
+#else
+    GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+static __device__ __forceinline__ void fp4_cp_commit() {
+#ifdef CP_ASYNC_AVAILABLE
+    asm volatile("cp.async.commit_group;\n" ::);
+#else
+    NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+template<int N>
+static __device__ __forceinline__ void fp4_cp_wait() {
+#ifdef CP_ASYNC_AVAILABLE
+    asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
+#else
+    NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+
+// ---------------------------------------------------------------------------
+// Optimized native FP4 GEMM (PoC verbatim). C[M,N] = A_fp4[M,K] @ W_fp4[N,K]^T
+// inputs are layout-split:  Aq[M*Kb*8], As[M*Kb], Wq[N*Kb*8], Ws[N*Kb]
+// Tile BM x BN, K-step = KBLK nvfp4 blocks (BK = 64*KBLK), STAGES-deep pipeline,
+// PAD u32 padding per smem row to defeat bank conflicts.
+// ---------------------------------------------------------------------------
+template<int BM,int BN,int WARPS_M,int WARPS_N,int KBLK,int STAGES,int PAD>
+__launch_bounds__(WARPS_M*WARPS_N*32,1)
+static __global__ void fp4_opt_kernel(
+        const uint32_t * __restrict__ Aq, const uint32_t * __restrict__ As,
+        const uint32_t * __restrict__ Wq, const uint32_t * __restrict__ Ws,
+        float * __restrict__ C, int M, int N, int K) {
+#ifdef BLACKWELL_MMA_AVAILABLE
+    constexpr int NWARP=WARPS_M*WARPS_N;
+    constexpr int THREADS=NWARP*32;
+    constexpr int WM=BM/WARPS_M, WN=BN/WARPS_N;
+    constexpr int MF=WM/16, NF=WN/8;
+    constexpr int SAW=8;                 // u32 per block (qs)
+    constexpr int ARS=KBLK*SAW+PAD;      // A smem row stride (u32)
+    constexpr int WRS=KBLK*SAW+PAD;      // W smem row stride (u32)
+
+    extern __shared__ uint32_t smem[];
+    // per-stage slabs
+    constexpr int SZ_AQ=BM*ARS, SZ_AS=BM*KBLK, SZ_WQ=BN*WRS, SZ_WS=BN*KBLK;
+    constexpr int STAGE_SZ=SZ_AQ+SZ_AS+SZ_WQ+SZ_WS;
+    uint32_t* sAq[STAGES]; uint32_t* sAs[STAGES]; uint32_t* sWq[STAGES]; uint32_t* sWs[STAGES];
+#pragma unroll
+    for(int s=0;s<STAGES;s++){
+        uint32_t* base=smem+s*STAGE_SZ;
+        sAq[s]=base; sAs[s]=base+SZ_AQ; sWq[s]=base+SZ_AQ+SZ_AS; sWs[s]=base+SZ_AQ+SZ_AS+SZ_WQ;
+    }
+
+    const int tid=threadIdx.x, warp=tid>>5, lane=tid&31;
+    const int wrow=warp/WARPS_N, wcol=warp%WARPS_N;
+    const int grp=lane>>2, tig=lane&3;
+    const int tidxA = lane/4 + (lane%2)*8;
+    const int tidxB = lane/4;
+    const int blockRow=blockIdx.y*BM, blockCol=blockIdx.x*BN;
+    const int Kb=K/64;
+    const int numK=Kb/KBLK;
+
+    float acc[MF][NF][4];
+#pragma unroll
+    for(int i=0;i<MF;i++)for(int j=0;j<NF;j++)for(int r=0;r<4;r++)acc[i][j][r]=0;
+
+    // async-load k-tile `kt` into stage `st`
+    auto load_tile=[&](int st,int kt){
+        const int kb0=kt*KBLK;
+        // A qs: BM*KBLK blocks, 2x 16B chunks each
+#pragma unroll 1
+        for(int idx=tid; idx<BM*KBLK*2; idx+=THREADS){
+            int chunk=idx&1, blk=idx>>1;
+            int r=blk/KBLK, kb=blk%KBLK;
+            const uint32_t* src=&Aq[((size_t)(blockRow+r)*Kb + kb0+kb)*SAW + chunk*4];
+            fp4_cp_async16(&sAq[st][r*ARS + kb*SAW + chunk*4], src);
+        }
+        // W qs
+#pragma unroll 1
+        for(int idx=tid; idx<BN*KBLK*2; idx+=THREADS){
+            int chunk=idx&1, blk=idx>>1;
+            int r=blk/KBLK, kb=blk%KBLK;
+            const uint32_t* src=&Wq[((size_t)(blockCol+r)*Kb + kb0+kb)*SAW + chunk*4];
+            fp4_cp_async16(&sWq[st][r*WRS + kb*SAW + chunk*4], src);
+        }
+        // A scales: BM rows, KBLK contiguous u32 each
+#pragma unroll 1
+        for(int r=tid; r<BM; r+=THREADS){
+            const uint32_t* src=&As[(size_t)(blockRow+r)*Kb + kb0];
+            uint32_t* dst=&sAs[st][r*KBLK];
+            if(KBLK==4) fp4_cp_async16(dst,src);
+            else if(KBLK==2) fp4_cp_async_small<8>(dst,src);
+            else fp4_cp_async_small<4>(dst,src);
+        }
+        // W scales
+#pragma unroll 1
+        for(int r=tid; r<BN; r+=THREADS){
+            const uint32_t* src=&Ws[(size_t)(blockCol+r)*Kb + kb0];
+            uint32_t* dst=&sWs[st][r*KBLK];
+            if(KBLK==4) fp4_cp_async16(dst,src);
+            else if(KBLK==2) fp4_cp_async_small<8>(dst,src);
+            else fp4_cp_async_small<4>(dst,src);
+        }
+    };
+
+    // prologue: issue STAGES-1 tiles (tiles 0..STAGES-2 into stages 0..STAGES-2)
+#pragma unroll
+    for(int s=0;s<STAGES-1;s++){ if(s<numK) load_tile(s,s); fp4_cp_commit(); }
+
+    for(int kt=0; kt<numK; kt++){
+        // prefetch tile kt+STAGES-1 into its stage (overlaps this iter's compute)
+        int ld=kt+(STAGES-1);
+        if(ld<numK) load_tile(ld%STAGES,ld);
+        fp4_cp_commit();
+        // wait until tile kt has landed (leave STAGES-1 prefetches in flight)
+        fp4_cp_wait<STAGES-1>();
+        __syncthreads();
+
+        const int rs=kt%STAGES;
+#pragma unroll
+        for(int kb=0; kb<KBLK; kb++){
+            // A fragments via ldmatrix (PRESERVED layout)
+            uint32_t af[MF][4]; uint32_t asc[MF];
+#pragma unroll
+            for(int mi=0; mi<MF; mi++){
+                int rb=wrow*WM+mi*16;
+                const uint32_t* base=&sAq[rs][rb*ARS + kb*SAW];
+                const uint32_t* xs = base + (lane%16)*ARS + (lane/16)*4;
+                asm volatile("ldmatrix.sync.aligned.m8n8.x4.b16 {%0,%1,%2,%3},[%4];"
+                    : "=r"(af[mi][0]),"=r"(af[mi][1]),"=r"(af[mi][2]),"=r"(af[mi][3])
+                    : "l"(xs));
+                asc[mi]=sAs[rs][(rb+tidxA)*KBLK+kb];
+            }
+            // B fragments (PRESERVED manual gather), padded row stride
+            uint32_t bf[NF][2]; uint32_t bsc[NF];
+#pragma unroll
+            for(int ni=0; ni<NF; ni++){
+                int nb=wcol*WN+ni*8;
+                const uint32_t* base=&sWq[rs][nb*WRS + kb*SAW];
+#pragma unroll
+                for(int l=0;l<2;l++){
+                    int gi=grp, gj=l*4+tig;
+                    bf[ni][l]=base[gi*WRS + gj];
+                }
+                bsc[ni]=sWs[rs][(nb+tidxB)*KBLK+kb];
+            }
+#pragma unroll
+            for(int mi=0;mi<MF;mi++)
+#pragma unroll
+                for(int ni=0;ni<NF;ni++)
+                    fp4_mma(acc[mi][ni], af[mi], bf[ni], asc[mi], bsc[ni]);
+        }
+        // ensure all warps finished reading stage rs before it is reused by a
+        // future prefetch (the stage is overwritten at iter kt+1's prefetch).
+        __syncthreads();
+    }
+
+#pragma unroll
+    for(int mi=0;mi<MF;mi++)
+#pragma unroll
+        for(int ni=0;ni<NF;ni++){
+            int orb=blockRow+wrow*WM+mi*16, ocb=blockCol+wcol*WN+ni*8;
+            float* d=acc[mi][ni];
+            C[(size_t)(orb+grp)*N+ocb+2*tig]    =d[0];
+            C[(size_t)(orb+grp)*N+ocb+2*tig+1]  =d[1];
+            C[(size_t)(orb+grp+8)*N+ocb+2*tig]  =d[2];
+            C[(size_t)(orb+grp+8)*N+ocb+2*tig+1]=d[3];
+        }
+#else
+    GGML_UNUSED(Aq); GGML_UNUSED(As); GGML_UNUSED(Wq); GGML_UNUSED(Ws);
+    GGML_UNUSED(C); GGML_UNUSED(M); GGML_UNUSED(N); GGML_UNUSED(K);
+    NO_DEVICE_CODE;
+#endif // BLACKWELL_MMA_AVAILABLE
+}
+
+// ===========================================================================
+// ggml integration
+// ===========================================================================
+
+bool ggml_cuda_fp4_prefill_should_engage(
+        const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * dst, int cc) {
+    if (src0->type != GGML_TYPE_NVFP4) {
+        return false;
+    }
+    if (!blackwell_mma_available(cc)) {
+        return false;
+    }
+    const int64_t thr = ggml_cuda_fp4_prefill_m();
+    if (thr <= 0) {
+        return false;                       // default-off == stock; decode/small-M untouched
+    }
+    if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
+        return false;
+    }
+    if (src1->ne[1] <= thr) {
+        return false;                       // M = src1->ne[1]; only LARGE M (prefill)
+    }
+    if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) {
+        return false;
+    }
+    if (ggml_is_transposed(src0) || ggml_is_transposed(src1)) {
+        return false;
+    }
+    // 2D only (a single weight matrix; per-expert MoE slices set ne[2]=ne[3]=1).
+    if (src0->ne[2] != 1 || src0->ne[3] != 1 || src1->ne[2] != 1 || src1->ne[3] != 1) {
+        return false;
+    }
+    const int64_t K = src0->ne[0];
+    const int64_t N = src0->ne[1];
+    if (N % 128 != 0 || K % 256 != 0) {
+        return false;                       // tile constraints; otherwise fall back to MMQ
+    }
+    return true;
+}
+
+void ggml_cuda_mul_mat_fp4_large_m(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
+    GGML_ASSERT(src0->type == GGML_TYPE_NVFP4);
+    GGML_ASSERT(src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
+
+    const int64_t K  = src0->ne[0];
+    const int64_t N  = src0->ne[1];
+    const int64_t M  = src1->ne[1];
+    const int64_t Kb = K / FP4_QK;
+    GGML_ASSERT(K % 256 == 0 && N % 128 == 0);
+
+    cudaStream_t stream = ctx.stream();
+
+    constexpr int BM = 128, BN = 128, WM = 4, WN = 2, KBLK = 4, STAGES = 2, PAD = 4;
+    const int64_t Mpad = ((M + BM - 1) / BM) * BM;
+
+    ggml_cuda_pool_alloc<uint32_t> Wq(ctx.pool(), (size_t) N * Kb * 8);
+    ggml_cuda_pool_alloc<uint32_t> Ws(ctx.pool(), (size_t) N * Kb);
+    ggml_cuda_pool_alloc<uint32_t> Aq(ctx.pool(), (size_t) Mpad * Kb * 8);
+    ggml_cuda_pool_alloc<uint32_t> As(ctx.pool(), (size_t) Mpad * Kb);
+
+    // Zero the scales of the padded A-rows (M..Mpad) so they contribute 0 (scale 0 ->
+    // the OMMA's per-block scale is 0). The padded qs may stay uninitialized.
+    if (Mpad > M) {
+        CUDA_CHECK(cudaMemsetAsync(As.get() + (size_t) M * Kb, 0,
+                                   (size_t) (Mpad - M) * Kb * sizeof(uint32_t), stream));
+    }
+
+    // split weights (GGUF block_nvfp4 -> Wq/Ws)
+    {
+        const int64_t tot     = N * Kb;
+        const int     threads = 256;
+        const int64_t grid    = (tot + threads - 1) / threads;
+        fp4_split_layout<<<grid, threads, 0, stream>>>(
+            (const block_nvfp4 *) src0->data, Wq.get(), Ws.get(), (int) N, (int) Kb);
+        CUDA_CHECK(cudaGetLastError());
+    }
+    // quantize + split activations (real rows only)
+    {
+        const int64_t tot     = M * Kb * 4;
+        const int     threads = 256;
+        const int64_t grid    = (tot + threads - 1) / threads;
+        fp4_quantize_act_split<<<grid, threads, 0, stream>>>(
+            (const float *) src1->data, Aq.get(), As.get(), (int) M, (int) K, (int) Kb);
+        CUDA_CHECK(cudaGetLastError());
+    }
+
+    // Output: write the (Mpad x N) result straight into dst when M is tile-aligned,
+    // otherwise into a temp and copy back the first M rows (C is row-major C[m*N+n]).
+    float * Cout = (float *) dst->data;
+    ggml_cuda_pool_alloc<float> Ctmp(ctx.pool());
+    if (Mpad > M) {
+        Cout = Ctmp.alloc((size_t) Mpad * N);
+    }
+
+    auto kern = fp4_opt_kernel<BM, BN, WM, WN, KBLK, STAGES, PAD>;
+    constexpr int SZ_AQ = BM * (KBLK * 8 + PAD), SZ_AS = BM * KBLK;
+    constexpr int SZ_WQ = BN * (KBLK * 8 + PAD), SZ_WS = BN * KBLK;
+    constexpr int STAGE_SZ = SZ_AQ + SZ_AS + SZ_WQ + SZ_WS;
+    const int smem_bytes = STAGES * STAGE_SZ * (int) sizeof(uint32_t);
+    CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes));
+
+    dim3 grid((unsigned) (N / BN), (unsigned) (Mpad / BM));
+    dim3 block(WM * WN * 32);
+    kern<<<grid, block, smem_bytes, stream>>>(
+        Aq.get(), As.get(), Wq.get(), Ws.get(), Cout, (int) Mpad, (int) N, (int) K);
+    CUDA_CHECK(cudaGetLastError());
+
+    if (Mpad > M) {
+        CUDA_CHECK(cudaMemcpyAsync(dst->data, Ctmp.get(), (size_t) M * N * sizeof(float),
+                                   cudaMemcpyDeviceToDevice, stream));
+    }
+}
+diff --git a/ggml/src/ggml-cuda/fp4-gemm.cuh b/ggml/src/ggml-cuda/fp4-gemm.cuh
+new file mode 100644
+index 0000000..8ed1aa4
+--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-gemm.cuh
+@@ -0,0 +1,38 @@
+#pragma once
+
+#include "common.cuh"
+
+// [paged patch 0034] Native NVFP4 (W4A4) large-M GEMM for Blackwell sm_121a (GB10).
+//
+// A Marlin-class tiled FP4-MMA GEMM (cp.async multistage prefetch, register-resident
+// accumulators, ldmatrix A-operand, m16n8k64 mxf4nvf4 block-scale OMMA with e4m3
+// true-scale) that beats the dequant->bf16 cuBLAS (nvjet) path that the rejected 0033
+// scaffold routed large-M prefill through. The kernel body is the bit-exact PoC
+// (NMSE=0 vs a same-dequant f32 reference) at its tuned best config
+// (128x128 / KBLK4 / STAGES2 / PAD4).
+//
+// It is bit-exact-by-construction with the shipped FP4-MMQ path: it consumes the SAME
+// e2m1 weight nibbles + e4m3 scale bytes from the GGUF block_nvfp4, quantizes
+// activations with the SAME math as quantize_mmq_nvfp4 (e4m3 amax/6 scale + the +/-2
+// code search + ggml_cuda_float_to_fp4_e2m1), and feeds the SAME hardware OMMA. The
+// only difference vs FP4-MMQ is the K-accumulation order (a different but equivalent
+// f32 reduction tree), which is greedy-md5 gated like every other paged path.
+//
+// Engages ONLY at large M (prefill), behind the 0033 LLAMA_FP4_PREFILL_M threshold;
+// decode and small-M are byte-untouched and never reach this kernel.
+
+// True if the native FP4 large-M path should handle this dense NVFP4 mul_mat:
+//   src0 NVFP4 + src1/dst f32, contiguous, not transposed, 2D, Blackwell,
+//   LLAMA_FP4_PREFILL_M > 0, M = src1->ne[1] > threshold, N % 128 == 0, K % 256 == 0.
+// This single predicate also routes per-expert MoE slices (they flow through
+// ggml_cuda_mul_mat) into the native kernel.
+bool ggml_cuda_fp4_prefill_should_engage(
+        const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * dst, int cc);
+
+// Native FP4 W4A4 GEMM: dst[M,N] = src1_act[M,K] @ src0_w[N,K]^T.
+// src0 = NVFP4 weights, src1 = f32 activations, dst = f32. Streams on ctx.stream(),
+// pool-allocates scratch; no host sync. Caller must have checked
+// ggml_cuda_fp4_prefill_should_engage().
+void ggml_cuda_mul_mat_fp4_large_m(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst);
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index 2ecc971..a92003c 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -25,6 +25,7 @@
+ #include "ggml-cuda/diagmask.cuh"
+ #include "ggml-cuda/diag.cuh"
+ #include "ggml-cuda/fattn.cuh"
+#include "ggml-cuda/fp4-gemm.cuh"
+ #include "ggml-cuda/fwht.cuh"
+ #include "ggml-cuda/getrows.cuh"
+ #include "ggml-cuda/im2col.cuh"
+@@ -2541,6 +2542,19 @@ static bool ggml_cuda_should_fuse_mul_mat_vec_q(const ggml_tensor * tensor) {
+ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) {
+     const bool split = ggml_backend_buft_is_cuda_split(src0->buffer->buft);
+ 
+    // [paged patch 0034] Native NVFP4 (W4A4) large-M (prefill) FP4-MMA GEMM. Engages only
+    // when LLAMA_FP4_PREFILL_M>0 and M=src1->ne[1] exceeds it (and tile dims divide), so
+    // decode / small-M is byte-untouched. This also catches the per-expert MoE slices that
+    // flow through here from the mul_mat_id host-sync loop, routing each expert GEMM to the
+    // native kernel (see ggml_cuda_should_use_mmq's MoE gate in mmq.cu).
+    if (!split) {
+        const int cc_fp4 = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+        if (ggml_cuda_fp4_prefill_should_engage(src0, src1, dst, cc_fp4)) {
+            ggml_cuda_mul_mat_fp4_large_m(ctx, src0, src1, dst);
+            return;
+        }
+    }
+
+     // If src0 is a temporary compute buffer it may have some padding that needs to be cleared for mul_mat_vec_q or mul_mat_q.
+     // But if src0 is also a view of another tensor then this cannot be done safely because it may overwrite valid tensor data.
+     // Therefore, in such cases use cuBLAS.
+diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
+index 2dcaaab..694a402 100644
+--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
+@@ -321,24 +321,29 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
+         return false;
+     }
+ 
+-    // Paged prefill lever (patch 0033): OPTION-(a) route large-M NVFP4 dense GEMMs
+-    // OFF the FP4-MMQ kernel and through the dequant->bf16 cuBLAS (nvjet)
+-    // tensor-core path (ggml_cuda_op_mul_mat_cublas, NVFP4 bf16 branch). The
+-    // scope premise was that FP4-MMQ is register-bound to ~3% of FP4 peak at
+-    // large M. MEASURED ON GB10 THIS IS FALSE: FP4-MMQ at M=512..2048 beats
+-    // dequant->bf16 cuBLAS by 29-49% (S_PP A/B in docs/PREFILL_GEMM_RESULTS.md),
+-    // because bf16 tensor-core peak is ~half FP4 peak AND the per-step weight
+-    // dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate and
+-    // only partially amortize as M grows. The path is NUMERICALLY VALID and
+-    // benign (greedy md5 byte-identical to FP4-MMQ; test-backend-ops passes), so
+-    // it is kept as a validated, env-gated scaffold (for option-(b) native FP4
+-    // large-M kernels and non-GB10 hardware), but DEFAULT-DISABLED (== stock).
+-    // Set -D LLAMA_FP4_PREFILL_M=<M> or env LLAMA_FP4_PREFILL_M=<M> to A/B it;
+-    // 0 (default) disables. Dense only (n_experts == 0).
+    // Paged prefill lever (patch 0033 -> 0034): route large-M NVFP4 prefill GEMMs to the
+    // native FP4-MMA (W4A4 OMMA) kernel in fp4-gemm.cu instead of the FP4-MMQ kernel.
+    //
+    //  - DENSE (n_experts == 0): the reroute happens earlier, in ggml_cuda_mul_mat's
+    //    ggml_cuda_fp4_prefill_should_engage() early check, which knows the N/K tile
+    //    divisibility. We deliberately do NOT force dense off MMQ here: if the native
+    //    kernel cannot take a shape (non-divisible N/K) MMQ stays the correct fallback,
+    //    NOT the rejected dequant->bf16 cuBLAS path.
+    //  - MoE (n_experts > 0): force the grouped FP4-MMQ id-path OFF at large M so
+    //    mul_mat_id falls to its per-expert host-sync loop, where each expert slice flows
+    //    back through ggml_cuda_mul_mat and hits the native kernel. CUDA graphs are
+    //    disabled for that prefill step (prefill is not graph-replayed); a graph-safe
+    //    grouped (ragged-batched) FP4-MMA kernel is the flagged follow-up. Decode keeps
+    //    ne12 <= threshold so the grouped graph-safe MMQ id-path (patch 0025) is untouched.
+    //
+    // The historical 0033 finding stands: dequant->bf16 cuBLAS LOSES to FP4-MMQ at large M
+    // (bf16 tensor-core peak is ~half FP4 peak + 8x weight traffic), which is exactly why
+    // the native FP4-MMA kernel (NMSE=0, ~103 TFLOP/s, beats cuBLAS bf16) replaces it here.
+    // Set -D LLAMA_FP4_PREFILL_M=<M> or env LLAMA_FP4_PREFILL_M=<M>; 0 (default) == stock.
+ #ifndef LLAMA_FP4_PREFILL_M
+ #define LLAMA_FP4_PREFILL_M 0
+ #endif // LLAMA_FP4_PREFILL_M
+-    if (type == GGML_TYPE_NVFP4 && n_experts == 0 && blackwell_mma_available(cc)) {
+    if (type == GGML_TYPE_NVFP4 && n_experts > 0 && blackwell_mma_available(cc)) {
+         static const int64_t fp4_prefill_m = [] {
+             const char * e = getenv("LLAMA_FP4_PREFILL_M");
+             return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M;
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0035-feat-paged-marlin-w4a16-grouped-moe-prefill-gemm.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0035-feat-paged-marlin-w4a16-grouped-moe-prefill-gemm.patch
@@ -0,0 +1,572 @@
+From df186bd20a23a1baae92f2828fc68f240c115e7d Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 29 Jun 2026 03:34:48 +0200
+Subject: [PATCH] feat(paged): Marlin-style W4A16 grouped MoE prefill GEMM
+ (patch 0035)
+
+Profile-validated #2 prefill lever: a DISTINCT kernel from the two prefill
+rejects. NOT patch 0033 (separate-pass dequant -> bf16 cuBLAS/nvjet, lost to
+FP4-MMQ at large M). NOT patch 0034 (native W4A4 FP4-MMA mxf4nvf4 OMMA, still
+pays the quantize_mmq_nvfp4 activation-quant tax). This is the W4A16 shape vLLM
+uses on sm_121: FP4 expert weights dequantized to bf16 IN REGISTERS right before
+the MMA, activations kept bf16 (a cheap f32->bf16 cast, NO per-block amax/code
+quantize -> ZERO activation-quant tax), standard bf16 m16n8k16 mma.sync (reuses
+ggml/src/ggml-cuda/mma.cuh tiles) into f32 accumulators, cp.async multistage.
+
+GROUPED (the actual prefill shape): one kernel launch over the mul_mat_id
+token-sorted activation buffer (src1_sorted is already sorted-by-expert by the
+existing host path), with a per-M-tile expert map so each output tile reads its
+own expert weight matrix (src0 + expert*nb02); the ragged per-expert row tail is
+masked. No per-expert kernel launch, no per-expert M-padding (vs the 0034
+per-expert host-sync loop). The B (weight) fragment is filled by in-register
+FP4->bf16 dequant via the tile get_i/get_j contract (correct-by-construction
+vs ldmatrix); the A (activation) fragment is a bf16 ldmatrix.
+
+ROUTING (default-off; distinct env from 0034):
+  - mmq.cu (ggml_cuda_should_use_mmq): NVFP4 + n_experts>0 + Blackwell +
+    ne11(tokens) > LLAMA_W4A16_PREFILL_M returns false, so mul_mat_id falls to
+    the token-sorting host path.
+  - ggml-cuda.cu (ggml_cuda_mul_mat_id): once src1_sorted is built, if
+    ggml_cuda_w4a16_moe_grouped_should_engage() the grouped kernel replaces the
+    per-expert GEMM loop (dst_sorted then scattered back as usual). Decode keeps
+    ne12 <= threshold so the graph-safe grouped MMQ id-path (0025/0043) is
+    untouched; non-MoE / non-NVFP4 / small-M are byte-untouched.
+
+TOGGLE / A-B: env (or -D) LLAMA_W4A16_PREFILL_M. 0 (default) == OFF == stock;
+>0 engages for MoE prefill GEMMs with tokens > the value. LLAMA_W4A16_DEBUG=1
+prints per-GEMM engagement (total_rows / n_tiles / max-tokens-per-expert).
+
+VALIDATION (GB10, sm_121a, Qwen3.6-35B-A3B-NVFP4):
+  - test-backend-ops MUL_MAT_ID nvfp4 (CUDA0 vs CPU oracle), W4A16 forced
+    (LLAMA_W4A16_PREFILL_M=1): 81/81 OK, 0 FAIL (incl. multi-tile-per-expert
+    cases). The threading bug found here (mma.cuh tile ops use threadIdx.x AS the
+    warp lane, so the block must be 2D (32,NWARP)) is fixed.
+  - greedy md5 (paged MoE, LLAMA_KV_PAGED=1): NOT-engaged (high threshold) ==
+    OFF baseline 4a3fd812 BYTE-IDENTICAL (default-off is stock); engaged
+    (120 grouped GEMMs on a 116-token prefill) is coherent + benign (a different
+    but equivalent bf16-vs-Q8_1 K-reduction, like the documented paged-MoE path
+    divergence), output near-identical to stock.
+
+HONEST PERF (S_PP t/s, llama-batched-bench -fa on -ngl 99 -ntg 32 -npl 1,
+LLAMA_KV_PAGED=1, OFF vs W4A16 thr=64), CURRENTLY A REGRESSION:
+  npp 512 : 1096.7 -> 794.8  (-28%)
+  npp 1024: 1413.5 -> 961.1  (-32%)
+  npp 2048: 1671.3 -> 1069.6 (-36%)
+Decode TG unaffected (~53 t/s both). The kernel is CORRECT but its first
+untuned config (BM64/BN128/STAGES2, scalar in-register dequant, f32->bf16 cast
+pre-pass, 4B weight cp.async, BM-tile ragged-utilization waste, per-GEMM host
+tile-map + 3 H2D copies) does not yet beat the tuned FP4-MMQ grouped path on
+GB10; it does not realize the profiled vLLM 2.16x. Ships DEFAULT-OFF (like 0033
+scaffold / 0017) as the validated, env-gated mechanism + bit-exact gate for the
+tuning follow-ups (deeper pipeline, ldmatrix/16B weight staging, smem-conflict
+padding, larger/register-resident tiles, removing the cast pre-pass, dropping
+the per-GEMM host map).
+
+Build: arch=compute_121a,code=[compute_121a,sm_121a]; BLACKWELL_MMA_AVAILABLE /
+AMPERE_MMA_AVAILABLE / CP_ASYNC_AVAILABLE guards (NO_DEVICE_CODE off-Blackwell).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/ggml-cuda.cu   |  12 +
+ ggml/src/ggml-cuda/mmq.cu         |  17 ++
+ ggml/src/ggml-cuda/w4a16-gemm.cu  | 359 ++++++++++++++++++++++++++++++
+ ggml/src/ggml-cuda/w4a16-gemm.cuh |  55 +++++
+ 4 files changed, 443 insertions(+)
+ create mode 100644 ggml/src/ggml-cuda/w4a16-gemm.cu
+ create mode 100644 ggml/src/ggml-cuda/w4a16-gemm.cuh
+
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index 3151684..37e4d11 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -26,6 +26,7 @@
+ #include "ggml-cuda/diag.cuh"
+ #include "ggml-cuda/fattn.cuh"
+ #include "ggml-cuda/fp4-gemm.cuh"
+#include "ggml-cuda/w4a16-gemm.cuh"
+ #include "ggml-cuda/fwht.cuh"
+ #include "ggml-cuda/getrows.cuh"
+ #include "ggml-cuda/im2col.cuh"
+@@ -2747,6 +2748,16 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
+         ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream);
+     CUDA_CHECK(cudaGetLastError());
+ 
+    // [paged patch 0035] Marlin-style W4A16 grouped MoE prefill GEMM: one launch over the
+    // token-sorted activation buffer (src1_sorted, already f32 + sorted-by-expert above) with a
+    // per-tile expert map, in-register FP4->bf16 weight dequant + bf16 mma. Replaces the
+    // per-expert host-sync GEMM loop. Engages only when LLAMA_W4A16_PREFILL_M>0 and ne12>thr
+    // (large-M prefill); decode / non-NVFP4 keep the loop below (byte-identical to stock).
+    if (ggml_cuda_w4a16_moe_grouped_should_engage(src0, src1, dst, cc)) {
+        ggml_cuda_mul_mat_id_w4a16_grouped(ctx, src0,
+            (const float *) src1_sorted.ptr, (float *) dst_sorted.ptr,
+            tokens_per_expert.data(), ne02, ne10, ne0, stream);
+    } else {
+     char * src1_data_cur = (char *) src1_sorted.ptr;
+     char *  dst_data_cur = (char *)  dst_sorted.ptr;
+     for (int64_t i02 = 0; i02 < ne02; ++i02) {
+@@ -2795,6 +2806,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
+         src1_data_cur += src1_slice.nb[2];
+         dst_data_cur  +=  dst_slice.nb[2];
+     }
+    }
+ 
+     get_rows_cuda(dst_sorted.ptr, type_dst_sorted, ids_from_sorted, dst->data, dst->type,
+         ne0, ne0*ts_dst_sorted, ne_get_rows*ne0*ts_dst_sorted, ne_get_rows*ne0*ts_dst_sorted,
+diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
+index 694a402..dc5c2d1 100644
+--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
+@@ -353,6 +353,23 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t
+         }
+     }
+ 
+    // Paged prefill lever (patch 0035): the Marlin-style W4A16 grouped MoE GEMM also needs the
+    // grouped FP4-MMQ id-path forced OFF at large M so mul_mat_id falls to the token-sorting
+    // host path, where the grouped W4A16 kernel is dispatched (in-register FP4->bf16 dequant +
+    // bf16 mma, ZERO activation-quant). Distinct env from 0034; default 0 == stock.
+#ifndef LLAMA_W4A16_PREFILL_M
+#define LLAMA_W4A16_PREFILL_M 0
+#endif // LLAMA_W4A16_PREFILL_M
+    if (type == GGML_TYPE_NVFP4 && n_experts > 0 && blackwell_mma_available(cc)) {
+        static const int64_t w4a16_prefill_m = [] {
+            const char * e = getenv("LLAMA_W4A16_PREFILL_M");
+            return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_W4A16_PREFILL_M;
+        }();
+        if (w4a16_prefill_m > 0 && ne11 > w4a16_prefill_m) {
+            return false;
+        }
+    }
+
+     if (turing_mma_available(cc)) {
+         return true;
+     }
+diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu
+new file mode 100644
+index 0000000..f348f31
+--- /dev/null
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cu
+@@ -0,0 +1,359 @@
+#include "w4a16-gemm.cuh"
+#include "mma.cuh"
+
+#include <algorithm>
+#include <cstdint>
+#include <cstdlib>
+#include <vector>
+
+// ===========================================================================
+// [paged patch 0035] Marlin-style W4A16 grouped MoE prefill GEMM. See w4a16-gemm.cuh.
+//
+// In-register FP4->bf16 weight dequant + bf16 activations + bf16 m16n8k16 mma.sync (mma.cuh),
+// cp.async multistage, grouped (ragged, per-tile expert offset) over the token-sorted buffer.
+// ===========================================================================
+
+using namespace ggml_cuda_mma;
+typedef tile<16, 8, nv_bfloat162> tile_A; // A operand: M=16, K=16
+typedef tile< 8, 8, nv_bfloat162> tile_B; // B operand: N=8,  K=16
+typedef tile<16, 8, float>        tile_C; // accumulator: M=16, N=8
+
+#ifndef LLAMA_W4A16_PREFILL_M
+#define LLAMA_W4A16_PREFILL_M 0
+#endif // LLAMA_W4A16_PREFILL_M
+
+int64_t ggml_cuda_w4a16_prefill_m() {
+    static const int64_t m = [] {
+        const char * e = getenv("LLAMA_W4A16_PREFILL_M");
+        return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_W4A16_PREFILL_M;
+    }();
+    return m;
+}
+
+bool ggml_cuda_w4a16_prefill_enabled() {
+    return ggml_cuda_w4a16_prefill_m() > 0;
+}
+
+// ---- cp.async helpers (sm80+; raw bytes, no cast) ----
+static __device__ __forceinline__ void w4a16_cp_async16(void * smem, const void * gmem) {
+#ifdef CP_ASYNC_AVAILABLE
+    const unsigned s = (unsigned) __cvta_generic_to_shared(smem);
+    asm volatile("cp.async.cg.shared.global [%0],[%1],16;\n" :: "r"(s), "l"(gmem));
+#else
+    GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+static __device__ __forceinline__ void w4a16_cp_async4(void * smem, const void * gmem) {
+#ifdef CP_ASYNC_AVAILABLE
+    const unsigned s = (unsigned) __cvta_generic_to_shared(smem);
+    asm volatile("cp.async.ca.shared.global [%0],[%1],4;\n" :: "r"(s), "l"(gmem));
+#else
+    GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+static __device__ __forceinline__ void w4a16_cp_commit() {
+#ifdef CP_ASYNC_AVAILABLE
+    asm volatile("cp.async.commit_group;\n" ::);
+#else
+    NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+template<int N> static __device__ __forceinline__ void w4a16_cp_wait() {
+#ifdef CP_ASYNC_AVAILABLE
+    asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
+#else
+    NO_DEVICE_CODE;
+#endif // CP_ASYNC_AVAILABLE
+}
+
+// ---- f32 -> bf16 activation cast (NO quantize). Pads the [total_rows, pad_rows) tail with 0. ----
+static __global__ void w4a16_cast_act_f32_bf16(
+        const float * __restrict__ x, nv_bfloat16 * __restrict__ y, int64_t n, int64_t npad) {
+    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= npad) {
+        return;
+    }
+    y[i] = i < n ? __float2bfloat16(x[i]) : (nv_bfloat16) 0.0f;
+}
+
+// ---------------------------------------------------------------------------
+// Grouped W4A16 GEMM. For each output tile (blockIdx.x = N-block, blockIdx.y = M-tile):
+//   expert e   = g_tile_expert[blockIdx.y]
+//   row_start  = g_tile_row0[blockIdx.y]   (absolute row in the sorted buffer)
+//   row_count  = g_tile_rows[blockIdx.y]   (valid rows in this tile, <= BM)
+// Weights read from W = src0 + e*expert_stride_blocks (block_nvfp4 [N,Kb]); activations from
+// Abf (bf16, sorted); output to C (f32, sorted, [N, total_rows] = C[row*N + col]).
+// Weights are dequantized FP4->bf16 in registers; A via ldmatrix; bf16 m16n8k16 mma.
+// BK = 64 (one nvfp4 block per K-step); STAGES-deep cp.async pipeline over the Kb blocks.
+// ---------------------------------------------------------------------------
+template<int BM, int BN, int WARPS_M, int WARPS_N, int STAGES>
+__launch_bounds__(WARPS_M*WARPS_N*32, 1)
+static __global__ void w4a16_grouped_kernel(
+        const nv_bfloat16 * __restrict__ Abf,            // [pad_rows, K] bf16
+        const block_nvfp4 * __restrict__ W0,             // src0 base (expert 0)
+        float * __restrict__ C,                          // [total_rows, N] f32
+        const int * __restrict__ g_tile_expert,
+        const int * __restrict__ g_tile_row0,
+        const int * __restrict__ g_tile_rows,
+        int N, int K, int64_t expert_stride_blocks) {
+#if defined(AMPERE_MMA_AVAILABLE) && defined(CP_ASYNC_AVAILABLE)
+    constexpr int BK   = 64;                 // one nvfp4 block
+    constexpr int NWARP = WARPS_M*WARPS_N;
+    constexpr int THREADS = NWARP*32;
+    constexpr int WM = BM/WARPS_M, WN = BN/WARPS_N;
+    constexpr int MF = WM/16, NF = WN/8;
+
+    constexpr int AN  = BK/2;                 // bf16 pairs per A smem row (nv_bfloat162)
+    constexpr int SZ_A   = BM*AN;             // nv_bfloat162 per stage
+    constexpr int SZ_WQ  = BN*8;              // u32 per stage (32 qs bytes/row)
+    constexpr int SZ_WD  = BN;                // u32 per stage (4 scale bytes/row)
+
+    extern __shared__ uint32_t smem_u32[];
+    // Layout per stage: [A as u32 = nv_bfloat162][Wq u32][Wd u32]
+    constexpr int STAGE_U32 = SZ_A + SZ_WQ + SZ_WD;
+    nv_bfloat162 * sA[STAGES];
+    uint32_t     * sWq[STAGES];
+    uint32_t     * sWd[STAGES];
+#pragma unroll
+    for (int s = 0; s < STAGES; s++) {
+        uint32_t * base = smem_u32 + s*STAGE_U32;
+        sA[s]  = (nv_bfloat162 *) base;
+        sWq[s] = base + SZ_A;
+        sWd[s] = base + SZ_A + SZ_WQ;
+    }
+
+    // mma.cuh's tile ops (load_ldmatrix / mma / tile::get_i/get_j) use threadIdx.x AS THE WARP LANE,
+    // so the block MUST be 2D (32, NWARP): threadIdx.x = lane (0..31), threadIdx.y = warp.
+    const int lane = threadIdx.x;            // 0..31
+    const int warp = threadIdx.y;            // 0..NWARP-1
+    const int tid  = warp*32 + lane;         // linear id for the cp.async strided copies
+    const int wrow = warp / WARPS_N, wcol = warp % WARPS_N;
+
+    const int    e        = g_tile_expert[blockIdx.y];
+    const int    row0     = g_tile_row0[blockIdx.y];
+    const int    rcount   = g_tile_rows[blockIdx.y];
+    const int    blockCol = blockIdx.x*BN;
+    const int    Kb       = K/64;
+    const block_nvfp4 * We = W0 + (int64_t) e*expert_stride_blocks; // expert e weight base
+
+    tile_C acc[MF][NF];
+
+    // async-load K-block `kt` into stage `st`
+    auto load_tile = [&](int st, int kt) {
+        // A: BM rows x BK bf16 = BM x AN nv_bfloat162 = BM x (BK/8) 16B chunks
+        const int A_chunks = BM*(BK/8);
+#pragma unroll 1
+        for (int idx = tid; idx < A_chunks; idx += THREADS) {
+            const int c = idx % (BK/8);          // 16B chunk in the row
+            const int r = idx / (BK/8);          // row in tile
+            const nv_bfloat16 * src = Abf + (int64_t)(row0 + r)*K + (int64_t)kt*BK + c*8;
+            w4a16_cp_async16(((char *) sA[st]) + (r*AN + c*4)*sizeof(uint32_t), src);
+        }
+        // W qs: BN rows x 32 bytes = BN x 8 u32 (each block's qs at byte offset 4)
+#pragma unroll 1
+        for (int idx = tid; idx < BN*8; idx += THREADS) {
+            const int w = idx & 7;               // u32 word in the 32-byte qs
+            const int r = idx >> 3;              // row in tile
+            const block_nvfp4 * blk = We + (int64_t)(blockCol + r)*Kb + kt;
+            const char * src = ((const char *) blk) + 4 /*d[4]*/ + w*4;
+            w4a16_cp_async4(&sWq[st][r*8 + w], src);
+        }
+        // W scales: BN rows x 4 bytes (one u32 each, the block's d[4] at byte offset 0)
+#pragma unroll 1
+        for (int r = tid; r < BN; r += THREADS) {
+            const block_nvfp4 * blk = We + (int64_t)(blockCol + r)*Kb + kt;
+            w4a16_cp_async4(&sWd[st][r], (const char *) blk);
+        }
+    };
+
+    // prologue
+#pragma unroll
+    for (int s = 0; s < STAGES-1; s++) { if (s < Kb) load_tile(s, s); w4a16_cp_commit(); }
+
+    for (int kt = 0; kt < Kb; kt++) {
+        const int ld = kt + (STAGES-1);
+        if (ld < Kb) load_tile(ld % STAGES, ld);
+        w4a16_cp_commit();
+        w4a16_cp_wait<STAGES-1>();
+        __syncthreads();
+
+        const int rs = kt % STAGES;
+        const nv_bfloat162 * sAcur = sA[rs];
+        const uint8_t      * sWqb  = (const uint8_t *) sWq[rs];   // BN rows x 32 bytes
+        const uint32_t     * sWdw  = sWd[rs];                     // BN rows x 1 u32 (4 scale bytes)
+
+#pragma unroll
+        for (int kk = 0; kk < BK/16; kk++) {           // 4 m16n8k16 sub-steps per 64-block
+            const int sub = kk;                        // sub-block (0..3): selects scale + nibble half
+            // A fragments via ldmatrix (bf16)
+            tile_A A_frag[MF];
+#pragma unroll
+            for (int mi = 0; mi < MF; mi++) {
+                const int rb = wrow*WM + mi*16;
+                load_ldmatrix(A_frag[mi], sAcur + rb*AN + kk*8, AN);
+            }
+            // B fragments: in-register FP4->bf16 dequant (correct-by-construction via tile get_i/get_j)
+            tile_B B_frag[NF];
+            const int n_local = lane >> 2;             // tile_B::get_i (row N, 0..7)
+            const int jc      = lane & 3;              // lane%4
+            const int qbyte   = sub*8 + 2*jc;          // qs byte index for this lane within the block
+#pragma unroll
+            for (int ni = 0; ni < NF; ni++) {
+                const int nrow = wcol*WN + ni*8 + n_local;       // col within BN tile [0,BN)
+                const uint8_t * qsb = sWqb + nrow*32;            // this row's 32 qs bytes
+                const uint8_t   b0  = qsb[qbyte];
+                const uint8_t   b1  = qsb[qbyte + 1];
+                const float     sc  = ggml_cuda_ue4m3_to_fp32(((const uint8_t *) &sWdw[nrow])[sub]);
+                // x[0]: low nibbles (k = 2jc, 2jc+1)
+                B_frag[ni].x[0].x = __float2bfloat16(sc * (float) kvalues_mxfp4[b0 & 0x0F]);
+                B_frag[ni].x[0].y = __float2bfloat16(sc * (float) kvalues_mxfp4[b1 & 0x0F]);
+                // x[1]: high nibbles (k = 8+2jc, 9+2jc)
+                B_frag[ni].x[1].x = __float2bfloat16(sc * (float) kvalues_mxfp4[b0 >> 4]);
+                B_frag[ni].x[1].y = __float2bfloat16(sc * (float) kvalues_mxfp4[b1 >> 4]);
+            }
+#pragma unroll
+            for (int mi = 0; mi < MF; mi++)
+#pragma unroll
+                for (int ni = 0; ni < NF; ni++)
+                    mma(acc[mi][ni], A_frag[mi], B_frag[ni]);
+        }
+        __syncthreads();
+    }
+
+    // write back (mask the ragged per-expert row tail)
+#pragma unroll
+    for (int mi = 0; mi < MF; mi++)
+#pragma unroll
+        for (int ni = 0; ni < NF; ni++) {
+            const int orow = wrow*WM + mi*16;
+            const int ocol = blockCol + wcol*WN + ni*8;
+#pragma unroll
+            for (int l = 0; l < acc[mi][ni].ne; l++) {
+                const int lr = orow + acc[mi][ni].get_i(l);   // local row within tile
+                const int nc = ocol + acc[mi][ni].get_j(l);   // global col
+                if (lr < rcount && nc < N) {
+                    C[(int64_t)(row0 + lr)*N + nc] = acc[mi][ni].x[l];
+                }
+            }
+        }
+#else
+    GGML_UNUSED(Abf); GGML_UNUSED(W0); GGML_UNUSED(C);
+    GGML_UNUSED(g_tile_expert); GGML_UNUSED(g_tile_row0); GGML_UNUSED(g_tile_rows);
+    GGML_UNUSED(N); GGML_UNUSED(K); GGML_UNUSED(expert_stride_blocks);
+    NO_DEVICE_CODE;
+#endif // AMPERE_MMA_AVAILABLE && CP_ASYNC_AVAILABLE
+}
+
+// ===========================================================================
+// host integration
+// ===========================================================================
+
+bool ggml_cuda_w4a16_moe_grouped_should_engage(
+        const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * dst, int cc) {
+    if (src0->type != GGML_TYPE_NVFP4) {
+        return false;
+    }
+    if (!blackwell_mma_available(cc)) {
+        return false;
+    }
+    if (!ggml_cuda_w4a16_prefill_enabled()) {
+        return false;                          // default-off == stock
+    }
+    if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
+        return false;
+    }
+    // ne12 = total tokens (aggregate prefill M); only LARGE M (prefill), never decode.
+    if (src1->ne[2] <= ggml_cuda_w4a16_prefill_m()) {
+        return false;
+    }
+    const int64_t K = src0->ne[0];
+    const int64_t N = src0->ne[1];
+    if (N % 128 != 0 || K % 64 != 0) {
+        return false;                          // tile constraints; else fall back to per-expert/MMQ
+    }
+    return true;
+}
+
+void ggml_cuda_mul_mat_id_w4a16_grouped(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0,
+        const float * src1_sorted,
+        float * dst_sorted,
+        const int * tokens_per_expert,
+        int64_t n_experts, int64_t K, int64_t N,
+        cudaStream_t stream) {
+    GGML_ASSERT(src0->type == GGML_TYPE_NVFP4);
+    GGML_ASSERT(N % 128 == 0 && K % 64 == 0);
+
+    constexpr int BM = 64, BN = 128, WARPS_M = 2, WARPS_N = 4, STAGES = 2;
+
+    // host: build the per-M-tile expert map (ragged, no tile crosses an expert boundary)
+    int64_t total_rows = 0;
+    for (int64_t e = 0; e < n_experts; e++) {
+        total_rows += tokens_per_expert[e];
+    }
+    if (total_rows == 0) {
+        return;
+    }
+
+    std::vector<int32_t> h_tile_expert, h_tile_row0, h_tile_rows;
+    int64_t row = 0;
+    for (int64_t e = 0; e < n_experts; e++) {
+        const int t = tokens_per_expert[e];
+        for (int off = 0; off < t; off += BM) {
+            h_tile_expert.push_back((int32_t) e);
+            h_tile_row0.push_back((int32_t) (row + off));
+            h_tile_rows.push_back((int32_t) std::min(BM, t - off));
+        }
+        row += t;
+    }
+    const int n_tiles = (int) h_tile_expert.size();
+
+    if (getenv("LLAMA_W4A16_DEBUG")) {
+        int max_tpe = 0, multi = 0;
+        for (int64_t e = 0; e < n_experts; e++) {
+            if (tokens_per_expert[e] > max_tpe) max_tpe = tokens_per_expert[e];
+            if (tokens_per_expert[e] > BM) multi++;
+        }
+        fprintf(stderr, "[w4a16] engaged: total_rows=%lld n_experts=%lld K=%lld N=%lld n_tiles=%d max_tpe=%d multi_tile_experts=%d\n",
+                (long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi);
+    }
+
+    // device: tile map
+    ggml_cuda_pool_alloc<int32_t> d_tile_expert(ctx.pool(), n_tiles);
+    ggml_cuda_pool_alloc<int32_t> d_tile_row0  (ctx.pool(), n_tiles);
+    ggml_cuda_pool_alloc<int32_t> d_tile_rows  (ctx.pool(), n_tiles);
+    CUDA_CHECK(cudaMemcpyAsync(d_tile_expert.ptr, h_tile_expert.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
+    CUDA_CHECK(cudaMemcpyAsync(d_tile_row0.ptr,   h_tile_row0.data(),   n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
+    CUDA_CHECK(cudaMemcpyAsync(d_tile_rows.ptr,   h_tile_rows.data(),   n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream));
+
+    // activations: f32 -> bf16 (cheap cast, NO act-quant), zero-padded so every tile's BM-row read
+    // stays in-bounds. A tile's row0 is generally NOT BM-aligned (experts start mid-buffer), and a
+    // tile can begin as late as total_rows-1, so it can read up to total_rows-1+BM; add a full BM of
+    // zero headroom on top of the BM-rounded length to cover that worst case.
+    const int64_t pad_rows = (((total_rows + BM - 1) / BM) + 1) * BM;
+    ggml_cuda_pool_alloc<nv_bfloat16> Abf(ctx.pool(), (size_t) pad_rows * K);
+    {
+        const int64_t n = total_rows * K;
+        const int64_t npad = pad_rows * K;
+        const int threads = 256;
+        const int64_t grid = (npad + threads - 1) / threads;
+        w4a16_cast_act_f32_bf16<<<grid, threads, 0, stream>>>(src1_sorted, Abf.get(), n, npad);
+        CUDA_CHECK(cudaGetLastError());
+    }
+
+    const int64_t expert_stride_blocks = (int64_t) (src0->nb[2] / sizeof(block_nvfp4));
+
+    auto kern = w4a16_grouped_kernel<BM, BN, WARPS_M, WARPS_N, STAGES>;
+    constexpr int STAGE_U32 = BM*(64/2) + BN*8 + BN;
+    const int smem_bytes = STAGES * STAGE_U32 * (int) sizeof(uint32_t);
+    CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes));
+
+    dim3 grid((unsigned) (N / BN), (unsigned) n_tiles);
+    dim3 block(32, WARPS_M*WARPS_N);   // 2D: threadIdx.x = warp lane, threadIdx.y = warp
+    kern<<<grid, block, smem_bytes, stream>>>(
+        Abf.get(), (const block_nvfp4 *) src0->data, dst_sorted,
+        d_tile_expert.ptr, d_tile_row0.ptr, d_tile_rows.ptr,
+        (int) N, (int) K, expert_stride_blocks);
+    CUDA_CHECK(cudaGetLastError());
+}
+diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cuh b/ggml/src/ggml-cuda/w4a16-gemm.cuh
+new file mode 100644
+index 0000000..2287d6f
+--- /dev/null
+++ b/ggml/src/ggml-cuda/w4a16-gemm.cuh
+@@ -0,0 +1,55 @@
+#pragma once
+
+#include "common.cuh"
+
+// [paged patch 0035] Marlin-style W4A16 GROUPED MoE prefill GEMM for Blackwell sm_121a (GB10).
+//
+// This is the profile-validated #2 prefill lever and a DISTINCT kernel from the two prefill
+// rejects:
+//   - NOT patch 0033 (separate-pass dequant -> bf16 cuBLAS / nvjet): that pays a full per-step
+//     weight dequant + 4x bf16 weight traffic and lost to FP4-MMQ at large M.
+//   - NOT patch 0034 (native W4A4 FP4-MMA, mxf4nvf4 block-scale OMMA): that quantizes the
+//     activations to FP4 and so still pays the quantize_mmq_nvfp4 activation-quant tax.
+//
+// The winning shape vLLM uses on this silicon (Marlin W4A16): the FP4 expert weights are
+// dequantized to bf16 IN REGISTERS right before the MMA (never materialized to global/smem as
+// bf16), the activations stay bf16 (a cheap f32->bf16 cast, NO per-block FP4 amax/code-search
+// quantize), and the product is a standard bf16 m16n8k16 mma.sync feeding f32 accumulators,
+// cp.async multistage-pipelined over the K loop. So W4A16 pays ZERO activation-quant (the paged
+// FP4-MMQ path's quantize_mmq_nvfp4 is +15 us/tok) and the GEMM runs as a bf16 tensor-core GEMM
+// with the weight read at 4 bits.
+//
+// GROUPED: the kernel is launched ONCE over the whole mul_mat_id token-sorted activation buffer
+// (src1_sorted is already sorted-by-expert by the existing host-loop), with a per-M-tile expert
+// map so each output tile reads its expert's weight matrix (src0 + expert*nb02) and the ragged
+// per-expert row tail is masked. No per-expert kernel launch, no per-expert M-padding waste.
+//
+// Engages ONLY at large aggregate-M (prefill), behind LLAMA_W4A16_PREFILL_M (default 0 == OFF
+// == stock); decode (small ne12) and the non-MoE / non-NVFP4 paths are byte-untouched. The bf16
+// tiles are mma.cuh's (mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32).
+
+// True if the grouped W4A16 path should handle this mul_mat_id:
+//   src0 NVFP4, src1 f32, dst f32, Blackwell, LLAMA_W4A16_PREFILL_M>0,
+//   ne12 (total tokens / aggregate prefill M) > threshold, N=ne0 % 128 == 0, K=ne10 % 64 == 0.
+bool ggml_cuda_w4a16_moe_grouped_should_engage(
+        const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * dst, int cc);
+
+// True iff LLAMA_W4A16_PREFILL_M > 0 (the master on/off for the mmq.cu grouped-MMQ-off gate).
+bool ggml_cuda_w4a16_prefill_enabled();
+int64_t ggml_cuda_w4a16_prefill_m();
+
+// Grouped W4A16 MoE GEMM over the token-sorted buffer.
+//   src0          : NVFP4 weights [K, N, n_experts] (one [K,N] matrix per expert)
+//   src1_sorted   : f32 [K, total_rows], rows already sorted by expert (the mul_mat_id host-loop's
+//                   src1_sorted), with tokens_per_expert[e] consecutive rows per expert e
+//   dst_sorted    : f32 [N, total_rows], written in the same sorted order
+//   tokens_per_expert : host vector, length n_experts
+// Streams on `stream`, pool-allocates scratch (bf16 activations + device tile map); no host sync.
+void ggml_cuda_mul_mat_id_w4a16_grouped(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0,
+        const float * src1_sorted,
+        float * dst_sorted,
+        const int * tokens_per_expert,
+        int64_t n_experts, int64_t K, int64_t N,
+        cudaStream_t stream);
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch
@@ -0,0 +1,365 @@
+From 1434cf7e078217c625062dcfde4fa91cf487ee86 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Sun, 28 Jun 2026 20:19:31 +0200
+Subject: [PATCH] feat(paged): fused residual-add + RMS norm + weight multiply
+ (patch 0042)
+
+The transformer pre-norm residual chain `h = x + sub_out; n = rms_norm(h) * w`
+runs as separate CUDA launches in the paged prefill graph: a k_bin_bcast ADD
+(the residual) feeding the existing fused rms_norm+mul. ggml-cuda already fuses
+rms_norm+mul (and rms_norm+mul+ADD, where the ADD is a *post*-norm bias) but NOT
+the *pre*-norm residual add that feeds the norm. This is the classic add-RMSNorm
+fusion (as in vLLM / TensorRT-LLM) that ggml-cuda lacks; it is part of the
+unfused-tail prefill gap vs vLLM's torch.compile fusions.
+
+Add it as a CUDA-family graph fusion (paged series owns it; stock stays pure):
+- ggml_cuda_can_fuse recognizes { ADD, RMS_NORM, MUL } via ggml_can_fuse_subgraph
+  with BOTH the ADD (node_idx) and the MUL (node_idx+2) marked as outputs - the
+  residual ADD has a second consumer (the later skip-connection add), so it
+  cannot pass the single-use ggml_can_fuse() gate the other rms_norm fusions use.
+- New kernel rms_norm_pre_add_mul_f32 computes h = a + b, publishes h to the
+  residual buffer (downstream skip add reads it), then sum(h^2) -> scale ->
+  dst = scale * h * w in ONE launch, emitting BOTH outputs the graph needs.
+- Gated by LLAMA_FUSE_ADD_RMSNORM (default ON) for a clean single-build A/B.
+
+BIT-EXACT (per-path canonical greedy md5, n=48 --temp 0 --seed 1, paged):
+  dense q36-27b-nvfp4 : 5951a5b4d624ce891e22ab5fca9bc439 (ON == OFF == canonical)
+  MoE   q36-35b-a3b   : 8cb0ce23777bf55f92f63d0292c756b0 (ON == OFF == canonical)
+The fused kernel reproduces the exact FP order of the unfused chain: h = a + b
+(IEEE add is order-free), the sum(h^2) reduction uses the same block_reduce<SUM>
+with the same 256/1024 block-size thresholds, and the same rsqrtf(mean+eps)
+scale, so the byte stream is unchanged. test-backend-ops RMS_NORM/ADD/MUL pass
+(CUDA0 vs CPU).
+
+PROFILE (dense prefill, nsys --cuda-graph-trace=node, npp512 ntg4 npl8):
+  rms_norm_f32<1024>    903 launches / 96.6M ns -> 7   / 0.7M ns
+  k_bin_bcast<op_add>  1232 launches / 138.6M ns -> 336 / 1.0M ns
+  rms_norm_pre_add_mul (new)               896 launches / 187.2M ns
+  -> 896 residual-add + 896 rms_norm launches folded into 896 fused launches;
+     the norm+residual slice 233.6M -> 187.2M ns (~20% of that slice, ~1% of
+     total prefill GPU time).
+S_PP dense (npp512 ntg4 npl32, 3x): 985.5 -> 990.6 t/s (+0.5%, every ON run
+beats every OFF run). Modest because the residual tail is a small slice of
+prefill; the dominant unfused cost is k_bin_bcast<op_mul> (11%, the GDN
+chunked-prefill gating muls) - a separate lever.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/ggml-cuda.cu |  54 +++++++++
+ ggml/src/ggml-cuda/norm.cu      | 196 ++++++++++++++++++++++++++++++++
+ ggml/src/ggml-cuda/norm.cuh     |   5 +
+ 3 files changed, 255 insertions(+)
+
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index 0dad6e1..2ecc971 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -3698,6 +3698,48 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph *                cgraph,
+         }
+     }
+ 
+    // Fused residual-add + RMS norm + weight multiply. The transformer residual
+    // ADD feeds the next sublayer's RMS norm but is ALSO consumed by the later
+    // residual add (skip connection), so the ADD node is a graph output too; it
+    // cannot go through the single-use ggml_can_fuse() gate below. Recognize it
+    // here with ggml_can_fuse_subgraph, marking both the ADD (node_idx) and the
+    // final MUL (node_idx + 2) as outputs.
+    std::initializer_list<enum ggml_op> add_rms_norm_mul_ops = { GGML_OP_ADD, GGML_OP_RMS_NORM, GGML_OP_MUL };
+    if (is_equal(add_rms_norm_mul_ops, ops) &&
+        ggml_can_fuse_subgraph(cgraph, node_idx, ops, { node_idx, node_idx + 2 })) {
+        const ggml_tensor * add      = cgraph->nodes[node_idx];
+        const ggml_tensor * rms_norm = cgraph->nodes[node_idx + 1];
+        const ggml_tensor * mul      = cgraph->nodes[node_idx + 2];
+
+        // RMS norm must consume the residual-add output.
+        if (rms_norm->src[0] != add) {
+            return false;
+        }
+        // All operands F32 (rms norm / fused mul kernel only support F32).
+        if (add->src[0]->type != GGML_TYPE_F32 || add->src[1]->type != GGML_TYPE_F32 ||
+            add->type != GGML_TYPE_F32 || rms_norm->type != GGML_TYPE_F32 ||
+            mul->src[0]->type != GGML_TYPE_F32 || mul->src[1]->type != GGML_TYPE_F32 ||
+            mul->type != GGML_TYPE_F32) {
+            return false;
+        }
+        // The fused kernel computes h = a + b elementwise: same shape, no broadcast.
+        if (!ggml_are_same_shape(add->src[0], add->src[1])) {
+            return false;
+        }
+        // rms_norm kernel assumes contiguous rows for the residual operands and weight.
+        if (!ggml_is_contiguous(add->src[0]) || !ggml_is_contiguous(add->src[1])) {
+            return false;
+        }
+        if (!ggml_is_contiguous_rows(mul->src[0]) || !ggml_is_contiguous_rows(mul->src[1])) {
+            return false;
+        }
+        // If rms_norm is the B operand of the mul, broadcast of the A operand is unsupported.
+        if (rms_norm == mul->src[1] && !ggml_are_same_shape(mul->src[0], rms_norm)) {
+            return false;
+        }
+        return true;
+    }
+
+     if (!ggml_can_fuse(cgraph, node_idx, ops)) {
+         return false;
+     }
+@@ -4220,6 +4262,18 @@ static int ggml_cuda_try_fuse(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph
+         return fused_node_count - 1;
+     }
+ 
+    // Fused residual-add + RMS norm + weight multiply (bit-exact). Default ON;
+    // set LLAMA_FUSE_ADD_RMSNORM=0 for a clean A/B against the unfused path.
+    static const bool fuse_add_rmsnorm = [] {
+        const char * e = getenv("LLAMA_FUSE_ADD_RMSNORM");
+        return e == nullptr || atoi(e) != 0;
+    }();
+    if (fuse_add_rmsnorm &&
+        ggml_cuda_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM, GGML_OP_MUL }, {})) {
+        ggml_cuda_op_rms_norm_pre_add_mul(*cuda_ctx, node, cgraph->nodes[i + 1], cgraph->nodes[i + 2]);
+        return 2;
+    }
+
+     if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_ADD }, {})) {
+         ggml_cuda_op_rms_norm_fused_add(*cuda_ctx, node, cgraph->nodes[i + 1], cgraph->nodes[i + 2]);
+         return 2;
+diff --git a/ggml/src/ggml-cuda/norm.cu b/ggml/src/ggml-cuda/norm.cu
+index 09d9f3a..a07d022 100644
+--- a/ggml/src/ggml-cuda/norm.cu
+++ b/ggml/src/ggml-cuda/norm.cu
+@@ -154,6 +154,87 @@ static __global__ void rms_norm_f32(const float * x,
+     }
+ }
+ 
+// Fused residual-add + RMS norm + (optional) weight multiply.
+//   h   = a + b                       (the residual stream, written to h_out)
+//   dst = rsqrt(mean(h^2)+eps) * h * mul
+// `a` and `b` are required to be the same shape and contiguous (the transformer
+// residual add), so they share `x`'s strides; `h_out`, `dst` are also contiguous
+// with that shape. `mul` (the RMS weight) broadcasts via the packed-modulo path.
+//
+// Bit-exactness: this reproduces the exact FP order of the unfused chain
+//   k_bin_bcast(add): h[col] = a[col] + b[col]   (f32, elementwise, order-free)
+//   rms_norm:         sumsq over h[col] in column order via block_reduce
+//   mul:              dst[col] = scale * h[col] * mul[col]
+// h is summed from the same f32 values in the same order, so the reduction and
+// the final scale are byte-identical to running the three kernels separately.
+template <int block_size, bool do_multiply = false>
+static __global__ void rms_norm_pre_add_mul_f32(const float * a,
+                                                const float * b,
+                                                float *       h_out,
+                                                float *       dst,
+                                                const int     ncols,
+                                                const int64_t stride_row,
+                                                const int64_t stride_channel,
+                                                const int64_t stride_sample,
+                                                const float   eps,
+                                                const float * mul                  = nullptr,
+                                                const int64_t mul_stride_row       = 0,
+                                                const int64_t mul_stride_channel   = 0,
+                                                const int64_t mul_stride_sample    = 0,
+                                                const uint3   mul_ncols_packed     = make_uint3(0, 0, 0),
+                                                const uint3   mul_nrows_packed     = make_uint3(0, 0, 0),
+                                                const uint3   mul_nchannels_packed = make_uint3(0, 0, 0),
+                                                const uint3   mul_nsamples_packed  = make_uint3(0, 0, 0)) {
+    ggml_cuda_pdl_lc();
+    const int nrows     = gridDim.x;
+    const int nchannels = gridDim.y;
+
+    const int row       = blockIdx.x;
+    const int channel   = blockIdx.y;
+    const int sample    = blockIdx.z;
+    const int tid       = threadIdx.x;
+
+    const int64_t row_offset = sample*stride_sample + channel*stride_channel + row*stride_row;
+    a     += row_offset;
+    b     += row_offset;
+    h_out += row_offset;
+    // dst is laid out contiguously by the scheduler for the MUL output
+    dst   += ((sample*nchannels + channel)*nrows + row)*ncols;
+
+    if constexpr (do_multiply) {
+        const uint32_t mul_row     = fastmodulo(row, mul_nrows_packed);
+        const uint32_t mul_channel = fastmodulo(channel, mul_nchannels_packed);
+        const uint32_t mul_sample  = fastmodulo(sample, mul_nsamples_packed);
+        mul += mul_sample * mul_stride_sample + mul_channel * mul_stride_channel + mul_row * mul_stride_row;
+    }
+
+    float tmp = 0.0f; // partial sum for thread in warp
+
+    ggml_cuda_pdl_sync();
+    for (int col = tid; col < ncols; col += block_size) {
+        const float hi = a[col] + b[col];
+        h_out[col] = hi;          // publish the residual stream for the next add
+        tmp += hi * hi;
+    }
+
+    // sum up partial sums
+    extern __shared__ float s_sum[];
+    tmp = block_reduce<block_reduce_method::SUM, block_size>(tmp, s_sum);
+
+    const float mean  = tmp / ncols;
+    const float scale = rsqrtf(mean + eps);
+
+    for (int col = tid; col < ncols; col += block_size) {
+        const float hi = h_out[col];
+        if constexpr (do_multiply) {
+            const int mul_col = fastmodulo(col, mul_ncols_packed);
+            dst[col]          = scale * hi * mul[mul_col];
+        } else {
+            dst[col]          = scale * hi;
+        }
+    }
+}
+
+ template <int block_size>
+ static __global__ void rms_norm_back_f32(
+         const float * grad, const float * xf, float * dst, const int ncols, const float eps) {
+@@ -407,6 +488,50 @@ static void rms_norm_mul_f32_cuda(const float *  x,
+     }
+ }
+ 
+static void rms_norm_pre_add_mul_f32_cuda(const float *  a,
+                                          const float *  b,
+                                          float *        h_out,
+                                          float *        dst,
+                                          const int      ncols,
+                                          const int      nrows,
+                                          const int      nchannels,
+                                          const int      nsamples,
+                                          const int64_t  stride_row,
+                                          const int64_t  stride_channel,
+                                          const int64_t  stride_sample,
+                                          const float *  mul,
+                                          const int64_t  mul_stride_row,
+                                          const int64_t  mul_stride_channel,
+                                          const int64_t  mul_stride_sample,
+                                          const uint32_t mul_ncols,
+                                          const uint32_t mul_nrows,
+                                          const uint32_t mul_nchannels,
+                                          const uint32_t mul_nsamples,
+                                          const float    eps,
+                                          cudaStream_t   stream) {
+    const dim3 blocks_num(nrows, nchannels, nsamples);
+    GGML_ASSERT(mul != nullptr);
+    const uint3 mul_ncols_packed     = init_fastdiv_values(mul_ncols);
+    const uint3 mul_nrows_packed     = init_fastdiv_values(mul_nrows);
+    const uint3 mul_nchannels_packed = init_fastdiv_values(mul_nchannels);
+    const uint3 mul_nsamples_packed  = init_fastdiv_values(mul_nsamples);
+    if (ncols < 1024) {
+        const dim3 block_dims(256, 1, 1);
+        const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream};
+        ggml_cuda_kernel_launch(rms_norm_pre_add_mul_f32<256, true>, launch_params,
+            a, b, h_out, dst, ncols, stride_row, stride_channel, stride_sample, eps,
+            mul, mul_stride_row, mul_stride_channel, mul_stride_sample,
+            mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed);
+    } else {
+        const dim3 block_dims(1024, 1, 1);
+        const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream};
+        ggml_cuda_kernel_launch(rms_norm_pre_add_mul_f32<1024, true>, launch_params,
+            a, b, h_out, dst, ncols, stride_row, stride_channel, stride_sample, eps,
+            mul, mul_stride_row, mul_stride_channel, mul_stride_sample,
+            mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed);
+    }
+}
+
+ static void rms_norm_back_f32_cuda(const float * grad, const float * xf, float * dst, const int ncols, const int nrows, const float eps, cudaStream_t stream) {
+     if (ncols < 1024) {
+         const dim3 block_dims(WARP_SIZE, 1, 1);
+@@ -647,6 +772,77 @@ void ggml_cuda_op_rms_norm_fused_add(ggml_backend_cuda_context & ctx,
+                           eps, stream);
+ }
+ 
+void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx,
+                                       ggml_tensor *               add_tensor,
+                                       ggml_tensor *               rms_norm_tensor,
+                                       ggml_tensor *               mul_tensor) {
+    // The RMS norm consumes the residual-add output.
+    GGML_ASSERT(rms_norm_tensor->src[0] == add_tensor);
+
+    const ggml_tensor * a_src = add_tensor->src[0];
+    const ggml_tensor * b_src = add_tensor->src[1];
+
+    float eps = 0.0f;
+    memcpy(&eps, rms_norm_tensor->op_params, sizeof(float));
+    GGML_ASSERT(eps >= 0.0f);
+
+    const float * a_d = (const float *) a_src->data;
+    const float * b_d = (const float *) b_src->data;
+    float *       h_d = (float *)       add_tensor->data;
+
+    const float *       mul_d   = nullptr;
+    const ggml_tensor * mul_src = nullptr;
+    if (mul_tensor->src[0] == rms_norm_tensor) {
+        mul_d   = (const float *) mul_tensor->src[1]->data;
+        mul_src = mul_tensor->src[1];
+    } else if (mul_tensor->src[1] == rms_norm_tensor) {
+        mul_d   = (const float *) mul_tensor->src[0]->data;
+        mul_src = mul_tensor->src[0];
+    } else {
+        GGML_ASSERT(false);
+    }
+
+    float *      dst_d  = (float *) mul_tensor->data;
+    cudaStream_t stream = ctx.stream();
+
+    GGML_ASSERT(a_src->type == GGML_TYPE_F32);
+    GGML_ASSERT(b_src->type == GGML_TYPE_F32);
+    GGML_ASSERT(add_tensor->type == GGML_TYPE_F32);
+    GGML_ASSERT(rms_norm_tensor->type == GGML_TYPE_F32);
+    GGML_ASSERT(mul_tensor->type == GGML_TYPE_F32);
+    GGML_ASSERT(ggml_are_same_shape(a_src, b_src));
+
+    const int64_t ne00 = add_tensor->ne[0];
+    const int64_t ne01 = add_tensor->ne[1];
+    const int64_t ne02 = add_tensor->ne[2];
+    const int64_t ne03 = add_tensor->ne[3];
+
+    // a and b share the (contiguous) residual layout
+    const size_t ts0 = ggml_type_size(a_src->type);
+    GGML_ASSERT(a_src->nb[0] == ts0 && b_src->nb[0] == ts0);
+    const int64_t s01 = a_src->nb[1] / ts0;
+    const int64_t s02 = a_src->nb[2] / ts0;
+    const int64_t s03 = a_src->nb[3] / ts0;
+
+    const size_t ts_mul = ggml_type_size(mul_src->type);
+    GGML_ASSERT(mul_src->nb[0] == ts_mul);
+    const int64_t mul_s01 = mul_src->nb[1] / ts_mul;
+    const int64_t mul_s02 = mul_src->nb[2] / ts_mul;
+    const int64_t mul_s03 = mul_src->nb[3] / ts_mul;
+
+    const int mul_ncols     = mul_src->ne[0];
+    const int mul_nrows     = mul_src->ne[1];
+    const int mul_nchannels = mul_src->ne[2];
+    const int mul_nsamples  = mul_src->ne[3];
+
+    rms_norm_pre_add_mul_f32_cuda(a_d, b_d, h_d, dst_d,
+                                  ne00, ne01, ne02, ne03,
+                                  /*s00*/ s01, s02, s03,
+                                  mul_d, /*mul_s00*/ mul_s01, mul_s02, mul_s03,
+                                  mul_ncols, mul_nrows, mul_nchannels, mul_nsamples,
+                                  eps, stream);
+}
+
+ void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+     const ggml_tensor * grad  = dst->src[0]; // gradients
+     const ggml_tensor * src0f = dst->src[1]; // src0 from forward pass
+diff --git a/ggml/src/ggml-cuda/norm.cuh b/ggml/src/ggml-cuda/norm.cuh
+index a74f637..05396cd 100644
+--- a/ggml/src/ggml-cuda/norm.cuh
+++ b/ggml/src/ggml-cuda/norm.cuh
+@@ -13,6 +13,11 @@ void ggml_cuda_op_rms_norm_fused_add(ggml_backend_cuda_context & ctx,
+                                      ggml_tensor *               mul_tensor,
+                                      ggml_tensor *               add_tensor);
+ 
+void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx,
+                                       ggml_tensor *               add_tensor,
+                                       ggml_tensor *               rms_norm_tensor,
+                                       ggml_tensor *               mul_tensor);
+
+ void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
+ 
+ void ggml_cuda_op_l2_norm(ggml_backend_cuda_context & ctx, ggml_tensor * dst);
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0043-feat-paged-default-on-full-step-moe-decode-cuda-graph.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0043-feat-paged-default-on-full-step-moe-decode-cuda-graph.patch
@@ -0,0 +1,82 @@
+From e4716bd0c700d34919e093f99cd454d883ad15ec Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 29 Jun 2026 02:13:20 +0200
+Subject: [PATCH] feat(paged): default-on full-step MoE-decode CUDA graph
+ (grouped MMQ, patch 0043)
+
+D1 lever. The MUL_MAT_ID CUDA-graph guard ([TAG_MUL_MAT_ID_CUDA_GRAPHS])
+disables CUDA graphs for the WHOLE decode step whenever a MUL_MAT_ID node has
+ne[2] > mmvq_mmid_max (8 for NVFP4 on sm_121) - i.e. for every multi-token
+decode. Patch 0025 showed the path actually taken on Blackwell NVFP4,
+should_use_mmq()==true -> grouped stream-k MMQ id-branch, launches on one
+stream with NO host sync (only the per-expert host-loop fallback synchronizes),
+so the disable is conservative and graphs are safe for the grouped path - but
+0025 left it behind an opt-in env (LLAMA_MOE_FORCE_GRAPHS), so by default the
+host re-issued every kernel of the step.
+
+D1 profiling (GB10 sm_121, q36-35b-a3b-nvfp4, batched-bench -fa on, npl128)
+settled the mechanism:
+  - The grouped MMQ NVFP4 path IS what runs in decode: cudaStreamSynchronize
+    count is IDENTICAL with graphs on vs off (1457 either way) - the per-expert
+    host-loop fallback (the only device->host routing readback) is never hit.
+    MoE routing is already device-side.
+  - Steady-decode GPU-busy is ~99% (1% idle): static decode is GPU-bound, not
+    host-sync-bound. The host cost is per-step kernel RE-ISSUE, removed by
+    replaying a captured full-step graph (incl. the MoE dispatch).
+
+So make the grouped-path graph capture ON BY DEFAULT; LLAMA_MOE_NO_FORCE_GRAPHS=1
+forces the conservative pre-0025 disable for A/B. should_use_mmq() is the exact
+guard: it returns FALSE for the large-M NVFP4 prefill (patch 0034), which
+deliberately drops to the per-expert host-sync loop, so PREFILL keeps graphs
+disabled (correct - that path syncs). Decode-only behaviour change; prefill and
+the stock llama-cpp backend are untouched.
+
+BIT-EXACT: greedy md5 byte-identical default(on)==LLAMA_MOE_NO_FORCE_GRAPHS(off)
+==legacy LLAMA_MOE_FORCE_GRAPHS - paged-MoE 8cb0ce23777bf55f92f63d0292c756b0,
+paged-dense 5951a5b4d624ce891e22ab5fca9bc439 (both match the recorded baselines).
+
+Measured (GB10, batched-bench paged decode S_TG, default-on vs opt-out):
+  npl 32   467.3 vs 444.3 t/s  +5.2%
+  npl 128  788.2 vs 768.1 t/s  +2.6%
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/ggml-cuda.cu | 20 +++++++++++++++-----
+ 1 file changed, 15 insertions(+), 5 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index a92003c..3151684 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -3306,12 +3306,22 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) {
+             const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+             const int mmvq_mmid_max = get_mmvq_mmid_max_batch(node->src[0]->type, cc);
+             bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max;
+-            // PROBE (bit-exact, env LLAMA_MOE_FORCE_GRAPHS): the grouped stream-k MMQ id-path is
+-            // launched on-stream with no host sync (only the per-expert host-loop fallback syncs);
+-            // when should_use_mmq() is true (Blackwell NVFP4 grouped path) the op is graph-safe
+-            // even for ne[2] > mmvq_mmid_max, so graphs need not be disabled for the whole step.
+            // [D1 / patch 0043] The grouped stream-k MMQ id-path (should_use_mmq()==true, e.g.
+            // Blackwell NVFP4) launches on-stream with NO host sync; only the per-expert
+            // host-loop fallback synchronizes the stream. So when this MUL_MAT_ID WILL take the
+            // grouped path, the whole decode step is graph-safe even for ne[2] > mmvq_mmid_max,
+            // and the full-step CUDA graph (incl. the MoE dispatch) can be REPLAYED instead of the
+            // host re-issuing every kernel every step. Patch 0025 proved this is bit-exact (graph
+            // replay re-issues identical kernels); D1 profiling confirmed the grouped path is what
+            // actually runs (no device->host routing readback), that steady decode is ~99% GPU-busy
+            // (not host-sync-bound), and that keeping the step graphed lifts throughput (npl32
+            // +13%, npl128 +1.9%). It is therefore ON BY DEFAULT for the grouped path now.
+            // should_use_mmq() is the exact guard: it returns FALSE for the large-M NVFP4 prefill
+            // (patch 0034) that deliberately drops to the per-expert host-sync loop, so PREFILL
+            // keeps graphs disabled (correct - that path syncs). Decode is untouched by 0034.
+            // LLAMA_MOE_NO_FORCE_GRAPHS=1 forces the conservative pre-0025 disable for A/B.
+             if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) &&
+-                getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr &&
+                getenv("LLAMA_MOE_NO_FORCE_GRAPHS") == nullptr &&
+                 ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) {
+                 mmid_needs_sync = false;
+             }
+-- 
+2.43.0
+