docs(paged): record mmq occupancy rejection

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 04:18:12 +00:00
parent 3c2cb9f4ab
commit 3b9ec3e1f1
5 changed files with 180 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -629,3 +629,13 @@ so an artifact proves inferencing gates without reading full logs.
 Do not use the stale DGX
 `~/bench/combined_definitive.sh` without first porting it to the current mirror
 and lock discipline.
+
+Phase 28 challenged the remaining low-conflict NVFP4 grouped-MMQ occupancy
+knobs on the same DGX mirror
+(`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`). The only buildable
+variant, `GGML_CUDA_FP4_MINBLOCKS=2`, was inference-safe before and after
+serving (MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID 806/806`) but regressed
+n128 decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). The row-tile
+knob `GGML_CUDA_FP4_MMQ_Y=64` failed the NVFP4 writeback compile-time
+invariant. Do not promote these knobs; grouped-MMQ parity work now requires a
+structural kernel change, not launch-bounds or row-tile tweaks.
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1715,3 +1715,58 @@ Decision:
  path still should not be reopened.
 - The serving profile does not change the Phase 26 parity verdict: n128 paged
  decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`.
+
+## Phase 28 NVFP4 MMQ Occupancy Build-Knob A/B
+
+Phase 28 tested the remaining small, additive grouped-MMQ occupancy knobs
+already present in the llama.cpp fork. This was a build-vs-build A/B only; no
+source change was promoted.
+
+Artifact:
+
+- `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`
+
+Source and hardware:
+
+- `/home/mudler/llama-phase6-source`
+- `f2521ab12 feat(server): trace speculative batch shapes`
+- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`
+
+Build/gate results:
+
+| variant | build result | MoE md5 | dense md5 | `MUL_MAT_ID` |
+|---------|--------------|---------|-----------|--------------|
+| baseline | existing `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
+| `GGML_CUDA_FP4_MINBLOCKS=2` | built | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
+| `GGML_CUDA_FP4_MMQ_Y=64` | compile-time reject | n/a | n/a | n/a |
+
+`GGML_CUDA_FP4_MMQ_Y=64` fails the NVFP4 writeback invariant:
+`static_assert(nwarps*tile_C::I == mmq_y)`. The combined
+`MMQ_Y=64+MINBLOCKS=2` variant cannot build either, so it yields no evidence.
+`MMQ_Y=96` is not a valid low-conflict shortcut either, for the same row-tile
+specialization reason, so it was not promoted to a serving A/B.
+
+Same-session n128 serving A/B (`PTOK=128`, `GEN=64`, two reps per arm):
+
+| arm | reps | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms |
+|-----|------|---------|----------------|--------------------|-------------|--------------|
+| baseline | 2 | 328.8 | 705.1 | 3.970 | 1607.4 | 7868.8 |
+| `MINBLOCKS=2` | 2 | 326.4 | 689.9 | 3.905 | 1644.9 | 7778.1 |
+| ratio (`MINBLOCKS=2` / baseline) | 2 | 0.9927 | 0.9784 | 0.9836 | 1.0233 | 0.9885 |
+
+Post-serving variant gate remained green:
+
+| phase | check | status | actual |
+|-------|-------|--------|--------|
+| post serving | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post serving | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post serving | `MUL_MAT_ID` | ok | `806/806` |
+
+Decision:
+
+- `GGML_CUDA_FP4_MINBLOCKS=2` is inference-safe but does not clear the serving
+  A/B gate; it regressed n128 `decode_agg_tps` by about `2.2%`.
+- `GGML_CUDA_FP4_MMQ_Y` is not a valid additive shortcut without deeper NVFP4
+  writeback retile work.
+- Do not promote either knob or add a LocalAI patch. The grouped-MMQ bucket
+  still needs a structural kernel change, not a launch-bounds/row-tile tweak.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -376,6 +376,16 @@ helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`,
 `argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on
 GB10.

+Phase 28 tested the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs.
+Artifact: `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`.
+`GGML_CUDA_FP4_MINBLOCKS=2` passed md5/op gates before and after serving
+(MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`) but regressed
+n128 same-session decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`).
+`GGML_CUDA_FP4_MMQ_Y=64` failed to compile because the NVFP4 writeback
+specialization asserts `nwarps*tile_C::I == mmq_y`. Do not promote either knob;
+future grouped-MMQ work must be structural kernel work.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -443,6 +453,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
 - `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
 - `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green.
+- `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -746,6 +746,25 @@ and `argsort_topk` is `0.40%`. Do not reopen metadata/helper-only MoE dispatch
 work on GB10. Any credible source patch must directly reduce GDN, grouped-MMQ,
 or projection work and still pass the md5/op gates.

+### Phase 28 NVFP4 MMQ occupancy A/B
+
+Phase 28 challenged the small grouped-MMQ build knobs before funding structural
+kernel work. Artifact:
+`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`.
+
+`GGML_CUDA_FP4_MINBLOCKS=2` built and passed the canonical safety gates before
+and after serving: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Same-session
+n128 serving A/B rejected it on throughput: baseline `705.1` decode_agg_tps vs
+`689.9` with `MINBLOCKS=2` (`0.9784x`). `GGML_CUDA_FP4_MMQ_Y=64` does not
+compile against the current NVFP4 writeback invariant
+`nwarps*tile_C::I == mmq_y`, so the row-tile knob is not a valid low-conflict
+shortcut.
+
+Decision: do not promote the occupancy knobs and do not add a LocalAI patch.
+The grouped-MMQ bucket still requires structural kernel work; launch-bounds and
+row-tile build tweaks are closed on GB10.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md
+++ b/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md
@@ -0,0 +1,85 @@
+# MMQ Occupancy Phase 28 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:subagent-driven-development (recommended) or
+> superpowers:executing-plans to implement this plan task-by-task. Steps use
+> checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Test the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs
+against the current GB10 serving gap, with md5/op gates before accepting any
+performance result.
+
+**Architecture:** Build-vs-build A/B only. The knobs are existing default-off
+compile-time macros in the llama.cpp fork, so this phase does not edit source
+unless a variant clears the serving gate.
+
+**Tech Stack:** DGX GB10, llama.cpp CUDA backend, `paged-inference-gates.sh`,
+h2h n128 serving client, LocalAI parity docs.
+
+---
+
+## Checklist
+
+- [x] **Step 1: Confirm candidate scope**
+  - Projection/FP8 follow-up was rejected by source/docs review: it is already
+    documented as too small or KL-failing.
+  - The remaining small candidate was NVFP4 MMQ occupancy:
+    `GGML_CUDA_FP4_MINBLOCKS=2` and `GGML_CUDA_FP4_MMQ_Y`.
+
+- [x] **Step 2: Check DGX preflight**
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+  - GPU owner file was `FREE`.
+
+- [x] **Step 3: Run baseline inference gates**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_baseline`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 4: Build and gate `GGML_CUDA_FP4_MINBLOCKS=2`**
+  - Build dir: `/home/mudler/llama-phase6-source/build-phase28-minblocks2`
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 5: Try `GGML_CUDA_FP4_MMQ_Y=64`**
+  - Build dir: `/home/mudler/llama-phase6-source/build-phase28-mmqy64`
+  - Result: compile-time reject.
+  - Failure invariant: `static_assert(nwarps*tile_C::I == mmq_y)`.
+  - Decision: do not run combined `MMQ_Y=64+MINBLOCKS=2`; the row-tile
+    specialization is invalid before serving can be measured.
+
+- [x] **Step 6: Run same-session n128 serving A/B for the viable variant**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/serving_ab`
+  - Baseline mean, two reps: `decode_agg_tps=705.1`,
+    `decode_perseq_tps=3.970`, `agg_tps=328.8`.
+  - `MINBLOCKS=2` mean, two reps: `decode_agg_tps=689.9`,
+    `decode_perseq_tps=3.905`, `agg_tps=326.4`.
+  - Ratio: `decode_agg_tps=0.9784`, `decode_perseq_tps=0.9836`,
+    `agg_tps=0.9927`.
+
+- [x] **Step 7: Run post-serving inference gates**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2_post_serving`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 8: Record decision**
+  - `MINBLOCKS=2` is inference-safe but rejected on throughput.
+  - `MMQ_Y` is rejected as a low-conflict shortcut because the current NVFP4
+    writeback specialization only accepts the stock row tile.
+  - No llama.cpp source patch or LocalAI patch mirror is justified.
+
+## Result
+
+Phase 28 closes the small NVFP4 MMQ occupancy branch. The only buildable knob
+kept md5/op gates green but regressed n128 decode serving, and the row-tile knob
+does not compile against the current specialization. Future grouped-MMQ work
+must be structural kernel work, not a launch-bounds or row-tile build tweak.
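
The Step 6 ratios can be rechecked from the reported means. The sketch below is illustrative only: the helper names `ratios` and `clears_gate` are hypothetical, and the `1.0` threshold is an assumption (the docs state that `MINBLOCKS=2` failed the serving A/B gate but do not give a numeric cutoff).

```python
# Reported means from the Phase 28 n128 serving A/B (two reps per arm).
baseline = {"agg_tps": 328.8, "decode_agg_tps": 705.1, "decode_perseq_tps": 3.970}
minblocks2 = {"agg_tps": 326.4, "decode_agg_tps": 689.9, "decode_perseq_tps": 3.905}


def ratios(variant, base):
    """Return variant/base for each metric present in both arms, rounded to 4 places."""
    return {k: round(variant[k] / base[k], 4) for k in base if k in variant}


def clears_gate(ratio, threshold=1.0):
    """Hypothetical gate: the variant must at least match baseline decode throughput.

    The threshold is an assumption for illustration, not a documented value.
    """
    return ratio["decode_agg_tps"] >= threshold


r = ratios(minblocks2, baseline)
print(r)
print("promote" if clears_gate(r) else "reject")
```

With the recorded means, the `decode_agg_tps` ratio falls below 1.0, which matches the decision to reject `MINBLOCKS=2` on throughput while keeping it classified as inference-safe.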