chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv, dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv. Restore the invariant that patches/ holds only the .patch series. Moves: - patches/paged/README.md -> README.md (canonical doc at the backend root) - patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md, final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/ - patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README) Deletes: - patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section) - patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide) Repoint every reference to the moved files: README internal links (docs/ + the .github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md, .github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml, the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml, docs/content/features/backends.md, gallery/index.yaml. The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged) is unchanged and still resolves to the 28 patches. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 18:06:58 -04:00 · 2026-06-27 13:20:05 +00:00
parent db14006fcd
commit 08b754f910
21 changed files with 41 additions and 235 deletions
--- a/backend/cpp/llama-cpp-localai-paged/Makefile
+++ b/backend/cpp/llama-cpp-localai-paged/Makefile
@@ -9,7 +9,7 @@
 # Pin handling (mirrors the turboquant wrapper, the precedent this is modelled
 # on): the paged patch series is hand-verified bit-exact against ONE specific
 # llama.cpp tip and re-exported by the manual PIN_SYNC process
-# (patches/paged/PIN_SYNC_*.md). A naive pin bump would move the tip out from
+# (docs/PIN_SYNC_*.md). A naive pin bump would move the tip out from
 # under the patches and break `git apply` at build time, so this backend OWNS
 # its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin
 # from backend/cpp/llama-cpp/Makefile. The override is forced into every copied
@@ -30,7 +30,7 @@
 # the nightly llama.cpp bump cannot silently break the vendored paged patches.
 # Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate +
 # re-export), then update this value. See:
-#   backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_*.md
+#   backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_*.md
 #
 # This pin = the manual, verified sync. The signal telling you WHEN to do the
 # next sync is the early-warning canary
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/README.md
@@ -1,15 +1,15 @@
 # LocalAI paged-attention llama.cpp patch series

-This directory holds the vendored patch series that turns stock llama.cpp into
-LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The patches are
-applied on top of a pinned upstream llama.cpp at build time; nothing here is a
-fork - it is a source-only `*.patch` stack plus this single canonical doc.
+This backend vendors the patch series (in `patches/paged/`) that turns stock
+llama.cpp into LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The
+patches are applied on top of a pinned upstream llama.cpp at build time; nothing
+here is a fork - it is a source-only `*.patch` stack plus this canonical doc.

 > One-file rule: this README is the canonical reference for the patch series. The
-> only other docs kept in this directory are operational and linked below:
-> - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
-> - [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
-> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
+> only other docs are operational, kept in `docs/`, and linked below:
+> - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
+> - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
+> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.

 ---

@@ -32,7 +32,7 @@ vendored patch series over upstream llama.cpp that adds

 It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and
 export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated
-[pin-sync process](PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
+[pin-sync process](docs/PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
 (see section 7).

 The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
@@ -158,9 +158,9 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
 **Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
 S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
-serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](qwen36_dense_decode_vs_npl.png),
-[`qwen36_moe_decode_vs_npl.png`](qwen36_moe_decode_vs_npl.png); raw data
-[`final_benchmark.csv`](final_benchmark.csv).
+serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
+[`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
+[`final_benchmark.csv`](docs/final_benchmark.csv).

 ### (a) + (b) Patched vs stock vs vLLM

@@ -231,7 +231,7 @@ all three via the non-fused path. The patchset's fusions are gated off there
 (0030), so the outcome is the same neutral-to-slightly-negative as Metal - not
 "won't run". This backend therefore ships **CUDA-only** (where the fusions are
 live + verified); non-CUDA users should use the stock `llama-cpp` backend. See
-[`UPSTREAM_LAYER2_SCOPE.md`](UPSTREAM_LAYER2_SCOPE.md) for what native non-CUDA
+[`UPSTREAM_LAYER2_SCOPE.md`](docs/UPSTREAM_LAYER2_SCOPE.md) for what native non-CUDA
 fused kernels would take.

 ---
@@ -245,7 +245,7 @@ is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with
 chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every
 touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).

-**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)).
+**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md)).
 Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
 (`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
 is a benign FP-accumulation-order difference of the paged attention reduction,
@@ -325,7 +325,7 @@ in a recommended/gallery config.
 ## 7. Pin + maintenance policy

 - **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual
-  [`PIN_SYNC`](PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
+  [`PIN_SYNC`](docs/PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
  onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path
  (dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d ->
  c299a92c` jump (23 upstream commits) needed zero patch changes and did not
@@ -333,12 +333,12 @@ in a recommended/gallery config.
 - **Decoupled from the nightly auto-bumper.** There is deliberately **no**
  `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
  silently shift the tree out from under the patches.
- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../../../.github/workflows/llama-cpp-paged-canary.yml)
-  (via [`.github/scripts/paged-canary-apply.sh`](../../../../../.github/scripts/paged-canary-apply.sh))
+- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../.github/workflows/llama-cpp-paged-canary.yml)
+  (via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh))
  tries the patch series against the latest upstream tip with the build's own
  strict `git apply`. **Red = upstream drifted past the series -> run a
  PIN_SYNC** (do not bump the pin blindly). The canary references
-  [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md).
+  [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md).

 ---

@@ -363,4 +363,4 @@ Both gallery entries set `backend: llama-cpp-localai-paged` and the paged servin
 (`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
 `flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
 `ssm_bf16_tau`). The full backend-split + gallery plan is in
-[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md).
+[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md).
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/PAGED_BITEXACT_NOTE.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/PAGED_BITEXACT_NOTE.md
--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -61,7 +61,7 @@ everywhere without ever touching the stock `llama-cpp` source tree.
 - **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
 - **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
-  (see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
+  (see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
@@ -82,5 +82,5 @@ by itself reach vLLM throughput parity, because the measured prefill bottleneck
 (Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
 touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
 gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
-`paged/README.md`). So full vLLM parity = this series **AND** the
+`../README.md`). So full vLLM parity = this series **AND** the
 kernel; neither alone suffices.
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_c299a92c.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/PIN_SYNC_c299a92c.md
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/UPSTREAM_LAYER2_SCOPE.md
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/final_benchmark.csv
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/final_benchmark.csv
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/paged-burst-bench.cpp
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/paged-burst-bench.cpp
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/paged-reclaim-unit.cpp
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/paged-reclaim-unit.cpp
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_dense_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_dense_decode_vs_npl.png
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_moe_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/qwen36_moe_decode_vs_npl.png
--- a/backend/cpp/llama-cpp-localai-paged/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp-localai-paged/patches/BENCHMARKS.md
@@ -1,106 +0,0 @@
-# Paged-attention / parity benchmarks (GB10 / DGX Spark)
-
-Goal of the series: vLLM parity. This records the measured gap so the parity claim is data-backed, not asserted.
-
-**Setup:** GB10 (sm_121, 119 GiB unified). Model Qwen3-Coder-30B-A3B. llama.cpp = pinned base + this series
-(MXFP4_MOE, `-fa 1 -b 2048 -ub 2048`, `llama-batched-bench`, PP=512 TG=128). vLLM = 0.23.0 FP8 (recorded
-prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s across B streams.
-
-## Fresh llama.cpp (this series, MXFP4) vs vLLM (FP8)
-
-| B | llama S_PP | vLLM S_PP | PP gap | llama S_TG | vLLM S_TG | TG gap |
-|---|-----------|-----------|--------|-----------|-----------|--------|
-| 1 | 1565 | 9644 | 6.2× | **83** | 48 | **llama wins** |
-| 8 | 3648 | 33373 | 9.1× | 126 | 312 | 2.5× |
-| 32 | 2074 | 99398 | 48× | 319 | 1171 | 3.7× |
-| 64 | 3643 | 151990 | 42× | 771 | 2064 | 2.7× |
-
-## Verdict — two distinct gaps, only one is the engine's
-
-1. **Prefill (S_PP): 6–48× behind, and it does NOT scale with B** (plateaus ~3.6k). This is the **FP4 MoE
-   GEMM kernel** (`mul_mat_q<MXFP4>` ~22 TFLOP/s), confirmed earlier. **Paged attention cannot close this** —
-   it's per-token compute. Needs the tcgen05/CUTLASS grouped-GEMM (Lever 3, multi-week, no upstream base).
-2. **Decode at concurrency (S_TG): 2.5–3.7× behind for B≥8** (we *win* at B=1). This gap IS partly the
-   engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
-   **This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel).
-
-## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
-
-The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
-
-| kernel | % GPU time |
-|---|---|
-| `mul_mat_q<MXFP4>` (MoE GEMM) | **54.6** |
-| `flash_attn_ext` (attention) | 19.8 |
-| `mul_mat_q<Q8>` (dense) | 10.9 |
-| KV writes / quant / norms / rest | ~15 |
-
-**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
-Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell*
-fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either phase** — both are
-MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
-
-### What the paged series IS still good for (just not throughput parity)
-
- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
-
-These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
-batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
-
-## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES.
-
-The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path,
-so we benchmarked a dense 32B head-to-head.
-
-**Headline comparison — vLLM NVFP4 W4A16 vs llama.cpp Q4_K_M.** This is the *correct apples-to-apples on
-DGX Spark*: both are **4-bit weights / 16-bit activations** (same quant class). vLLM = `Qwen3-32B-NVFP4A16`
-(FlashInfer Marlin W4A16 kernel); llama.cpp = `Qwen3-32B-Q4_K_M` (int8-MMQ compute). The only difference is
-the compute kernel — which is exactly what we're measuring. (Full **W4A4** NVFP4 does not run on GB10 today;
-root cause below — and it would *not* be a fair comparison even if it did, since Q4_K_M is also weight-only-4-bit.)
-
-| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap |
-|---|---|---|---|---|---|---|
-| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity |
-| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× |
-| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× |
-| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× |
-
-**Findings:**
-1. **Dense prefill has the SAME (larger) kernel gap.** llama dense prefill plateaus at ~765 t/s regardless of
-   B; vLLM scales to 24.4k (32×). Both read 4-bit weights — the gap is the compute kernel: vLLM's FP4 Marlin
-   tensor-core GEMM vs llama's int8-MMQ. (Note: on consumer Blackwell, W4A16 Marlin is also reported *faster*
-   than the experimental W4A4 path, so W4A16 isn't a handicapped stand-in — it's the fast path.)
-2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the
-   gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64).
-3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
-   block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
-   (a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
-   - **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8) —
-     dequant→cuBLAS-BF16 doesn't engage / isn't faster than int8-MMQ on GB10. With ubatch (saturates) and
-     nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now exhausted** for both
-     model classes. Parity is strictly the FP4 tensor-core kernel.
-4. **Why full W4A4 NVFP4 hangs on GB10 (root cause, researched).** This is a *known consumer-Blackwell
-   limitation, not a misconfiguration*. **FlashInfer ships no FP4 cubins for sm_120/sm_121** — its precompiled
-   kernels are all datacenter `Sm100a/Sm103a` (B200/B300). So on GB10 the dense `mm_fp4` W4A4 GEMM has no
-   working kernel: the optimized path is gated off for sm_121 (heuristic checks `minor==0`; 12.1 fails), the
-   CUTLASS dense FP4 fallback is documented to silently return **all-zeros**, and TRT-LLM errors at capability
-   120. Our exact symptom — loads weights, then stalls at the first profiling forward pass with
-   `enable_flashinfer_autotune=True` at 0–3% GPU — is the **FlashInfer FP4 autotuner/JIT spinning on an arch
-   with no FP4 cubins** (matches vllm #30163/#26381, flashinfer #2577/#3294). The "NVFP4 on DGX Spark" story
-   everyone cites is about *quantization + memory footprint + W4A16/MoE*, **not dense W4A4 inference**, which
-   isn't validated on sm_121 yet (where people patched it working, it was slower than W4A16 anyway).
-   **Therefore W4A16 vs Q4_K_M above is the right, reproducible apples-to-apples** for DGX Spark today.
-   Optional W4A4 retry (verify output isn't zeros first): `VLLM_SKIP_FLASHINFER_AUTOTUNE=1` +
-   `VLLM_NVFP4_GEMM_BACKEND=cutlass` + `--enforce-eager`, or NVIDIA's `vllm/vllm-openai:cu130-nightly` container.
-
-## So, honestly, where parity stands
-
- **Decode single-stream: already at/above parity** (B=1: 83 vs 48).
- **Decode concurrency: a real, engine-addressable gap** the paged series can narrow (0004 on-demand pool +
-  0005 continuous batching). Target: close the 2.5–3.7× at B≥8.
- **Prefill: kernel-bound, not engine-bound.** No amount of paging reaches vLLM here; that's a separate track.
-
-**Series status when measured:** 0001 (vendor) + 0002 (placement, token-identical) done; 0003 (gather-read)
-turn-key-planned, not yet implemented. These numbers are the *baseline* the engine patches must improve on at
-B≥8 decode — re-run this table after 0004/0005 to show the concurrency gap closing.
--- a/backend/cpp/llama-cpp-localai-paged/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
@@ -1,91 +0,0 @@
-diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cu b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
-new file mode 100644
-index 0000000..5f5a782
--- /dev/null
-+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
-@@ -0,0 +1,46 @@
-+#include "fp4-grouped-moe.cuh"
-+
-+#include <cstdlib>
-+#include <cstdio>
-+
-+// SCAFFOLD for the FP4 grouped-GEMM MoE kernel (Lever 3).
-+//
-+// Why: on GB10 (sm_121) the MoE matmul runs mul_mat_q<MXFP4> - a warp-level mma.sync grouped MMQ -
-+// at ~22 effective TFLOP/s, ~27x behind vLLM prefill, and it also dominates decode at concurrency
-+// (54.6% of GPU time at B=64). It is the single bottleneck to vLLM parity in BOTH phases; paged
-+// attention cannot touch it (proven by profiling). The fix is a CUTLASS-3.x collective-mainloop
-+// grouped GEMM over all experts, block-scaled e2m1 operands via tcgen05 tensor-memory MMA.
-+//
-+// This file is the integration seam. It is currently a no-op that always falls back to MMQ, so the
-+// default build is byte-identical. The kernel is filled in over the phases in the design doc.
-+
-+static bool fp4_grouped_enabled() {
-+    static const bool en = (std::getenv("GGML_CUDA_FP4_GROUPED") != nullptr);
-+    return en;
-+}
-+
-+bool ggml_cuda_fp4_grouped_moe(
-+        ggml_backend_cuda_context & ctx,
-+        const ggml_tensor * src0,
-+        const ggml_tensor * src1,
-+        const ggml_tensor * ids,
-+        ggml_tensor       * dst) {
-+    GGML_UNUSED(ctx); GGML_UNUSED(src1); GGML_UNUSED(ids); GGML_UNUSED(dst);
-+
-+    if (!fp4_grouped_enabled()) {
-+        return false; // default: existing MMQ path
-+    }
-+    if (src0->type != GGML_TYPE_MXFP4 && src0->type != GGML_TYPE_NVFP4) {
-+        return false;
-+    }
-+
-+    // TODO(kernel - see kernel design doc): CUTLASS 3.x GemmGrouped, sm_120a, block-scaled e2m1,
-+    // tcgen05 MMA; per-expert problem offsets from `ids`; fused activation quant; numerical parity
-+    // vs mul_mat_q<MXFP4> before enabling by default.
-+    static bool warned = false;
-+    if (!warned) {
-+        warned = true;
-+        fprintf(stderr, "[fp4-grouped] GGML_CUDA_FP4_GROUPED set, kernel not yet implemented - using MMQ\n");
-+    }
-+    return false; // scaffold: fall back until the kernel lands
-+}
-diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cuh b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
-new file mode 100644
-index 0000000..29e1b5a
--- /dev/null
-+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
-@@ -0,0 +1,13 @@
-+#pragma once
-+
-+#include "common.cuh"
-+
-+// Entry point for the tcgen05/CUTLASS block-scaled FP4 (MXFP4/NVFP4) grouped-GEMM MoE kernel for
-+// Blackwell consumer GPUs (sm_120/121). Returns true if it handled the op; false to fall back to
-+// the existing warp-mma MMQ path. Gated behind GGML_CUDA_FP4_GROUPED until correct + faster.
-+bool ggml_cuda_fp4_grouped_moe(
-+        ggml_backend_cuda_context & ctx,
-+        const ggml_tensor * src0,   // expert weights, MXFP4/NVFP4 [n_embd, n_ff, n_expert]
-+        const ggml_tensor * src1,   // activations, F32 [n_embd, n_tokens, ...]
-+        const ggml_tensor * ids,    // expert routing, I32
-+        ggml_tensor       * dst);   // F32 output
-diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
-index 8ea462a..104d131 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
-+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
-@@ -30,6 +30,7 @@
- #include "ggml-cuda/im2col.cuh"
- #include "ggml-cuda/mmf.cuh"
- #include "ggml-cuda/mmq.cuh"
-+#include "ggml-cuda/fp4-grouped-moe.cuh"
- #include "ggml-cuda/mmvf.cuh"
- #include "ggml-cuda/mmvq.cuh"
- #include "ggml-cuda/norm.cuh"
-@@ -2701,6 +2702,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
-         }
- 
-         if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) {
-+            if (ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst)) { return; }
-             ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
-             return;
-         }