From 3b9ec3e1f17e333124d21ab4566dfc2ea21368ad Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Wed, 1 Jul 2026 04:18:12 +0000 Subject: [PATCH] docs(paged): record mmq occupancy rejection Assisted-by: Codex:gpt-5 --- backend/cpp/llama-cpp-localai-paged/README.md | 10 +++ .../docs/GB10_PARITY_PHASE0_RESULTS.md | 55 ++++++++++++ .../docs/PARITY_HANDOFF.md | 11 +++ .../docs/VLLM_PARITY_LEVER_MAP.md | 19 +++++ .../plans/2026-07-01-mmq-occupancy-phase28.md | 85 +++++++++++++++++++ 5 files changed, 180 insertions(+) create mode 100644 docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md index 6b8eee31e..493b3bfe2 100644 --- a/backend/cpp/llama-cpp-localai-paged/README.md +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -629,3 +629,13 @@ so an artifact proves inferencing gates without reading full logs. Do not use the stale DGX `~/bench/combined_definitive.sh` without first porting it to the current mirror and lock discipline. + +Phase 28 challenged the remaining low-conflict NVFP4 grouped-MMQ occupancy +knobs on the same DGX mirror +(`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`). The only buildable +variant, `GGML_CUDA_FP4_MINBLOCKS=2`, was inference-safe before and after +serving (MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID 806/806`) but regressed +n128 decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). The row-tile +knob `GGML_CUDA_FP4_MMQ_Y=64` failed the NVFP4 writeback compile-time +invariant. Do not promote these knobs; grouped-MMQ parity work now requires a +structural kernel change, not launch-bounds or row-tile tweaks. 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index f4531d7d8..f665fd913 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -1715,3 +1715,58 @@ Decision: path still should not be reopened. - The serving profile does not change the Phase 26 parity verdict: n128 paged decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`. + +## Phase 28 NVFP4 MMQ Occupancy Build-Knob A/B + +Phase 28 tested the remaining small, additive grouped-MMQ occupancy knobs +already present in the llama.cpp fork. This was a build-vs-build A/B only; no +source change was promoted. + +Artifact: + +- `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450` + +Source and hardware: + +- `/home/mudler/llama-phase6-source` +- `f2521ab12 feat(server): trace speculative batch shapes` +- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1` + +Build/gate results: + +| variant | build result | MoE md5 | dense md5 | `MUL_MAT_ID` | +|---------|--------------|---------|-----------|--------------| +| baseline | existing `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | +| `GGML_CUDA_FP4_MINBLOCKS=2` | built | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | +| `GGML_CUDA_FP4_MMQ_Y=64` | compile-time reject | n/a | n/a | n/a | + +`GGML_CUDA_FP4_MMQ_Y=64` fails the NVFP4 writeback invariant: +`static_assert(nwarps*tile_C::I == mmq_y)`. That also rejects combined +`MMQ_Y=64+MINBLOCKS=2` as a source of evidence. `MMQ_Y=96` is not a valid +low-conflict shortcut for the same row-tile specialization reason, so it was +not promoted to a serving A/B. 
+ +Same-session n128 serving A/B (`PTOK=128`, `GEN=64`, two reps per arm): + +| arm | reps | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms | +|-----|------|---------|----------------|--------------------|-------------|--------------| +| baseline | 2 | 328.8 | 705.1 | 3.970 | 1607.4 | 7868.8 | +| `MINBLOCKS=2` | 2 | 326.4 | 689.9 | 3.905 | 1644.9 | 7778.1 | +| ratio | 2 | 0.9927 | 0.9784 | 0.9836 | 1.0233 | 0.9885 | + +Post-serving variant gate remained green: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| post serving | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post serving | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post serving | `MUL_MAT_ID` | ok | `806/806` | + +Decision: + +- `GGML_CUDA_FP4_MINBLOCKS=2` is inference-safe but does not clear the serving + A/B gate; it regressed n128 decode aggregate by about `2.2%`. +- `GGML_CUDA_FP4_MMQ_Y` is not a valid additive shortcut without deeper NVFP4 + writeback retile work. +- Do not promote either knob or add a LocalAI patch. The grouped-MMQ bucket + still needs a structural kernel change, not a launch-bounds/row-tile tweak. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index 87bec651b..23de0f789 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -376,6 +376,16 @@ helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`, `argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on GB10. +Phase 28 tested the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs. +Artifact: `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`. 
+`GGML_CUDA_FP4_MINBLOCKS=2` passed md5/op gates before and after serving +(MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`) but regressed +n128 same-session decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). +`GGML_CUDA_FP4_MMQ_Y=64` failed to compile because the NVFP4 writeback +specialization asserts `nwarps*tile_C::I == mmq_y`. Do not promote either knob; +future grouped-MMQ work must be structural kernel work. + --- ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) @@ -443,6 +453,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual - `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist. - `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`. - `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green. +- `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected. - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. 
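Note on the `GGML_CUDA_FP4_MMQ_Y=64` compile rejection recorded in the Phase 28 paragraph above: the invariant `nwarps*tile_C::I == mmq_y` can be mirrored as a plain predicate to pre-screen candidate row tiles before spending a build. The sketch below is illustrative only. `writeback_tile_ok` is a hypothetical helper name, and the values `nwarps = 8` and `tile_C::I = 16` are assumptions that are not stated in this patch; read the real values from the fork's MMQ source before relying on the output.

```python
# Hypothetical pre-screen for row-tile build knobs. It mirrors the invariant
# quoted in the docs: static_assert(nwarps*tile_C::I == mmq_y).

# ASSUMED values, not taken from this patch: confirm them in the fork.
ASSUMED_NWARPS = 8
ASSUMED_TILE_C_ROWS = 16  # stands in for tile_C::I


def writeback_tile_ok(nwarps: int, tile_c_rows: int, mmq_y: int) -> bool:
    """Return True when the warps exactly cover the mmq_y row tile."""
    return nwarps * tile_c_rows == mmq_y


def screen_candidates(candidates, nwarps=ASSUMED_NWARPS,
                      tile_c_rows=ASSUMED_TILE_C_ROWS):
    """Map each candidate mmq_y to whether it satisfies the invariant."""
    return {y: writeback_tile_ok(nwarps, tile_c_rows, y) for y in candidates}


if __name__ == "__main__":
    for mmq_y, ok in screen_candidates([64, 96, 128]).items():
        verdict = "would compile" if ok else "would hit the static_assert"
        print(f"mmq_y={mmq_y}: {verdict}")
```

Under those assumed values only a 128-row tile passes, which is consistent with the docs rejecting both `MMQ_Y=64` and `MMQ_Y=96`. It does not show that 128 is the fork's actual stock tile.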
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index f2eb4abca..0614fe287 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -746,6 +746,25 @@ and `argsort_topk` is `0.40%`. Do not reopen metadata/helper-only MoE dispatch work on GB10. Any credible source patch must directly reduce GDN, grouped-MMQ, or projection work and still pass the md5/op gates. +### Phase 28 NVFP4 MMQ occupancy A/B + +Phase 28 challenged the small grouped-MMQ build knobs before funding structural +kernel work. Artifact: +`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`. + +`GGML_CUDA_FP4_MINBLOCKS=2` built and passed the canonical safety gates before +and after serving: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Same-session +n128 serving A/B rejected it on throughput: baseline `705.1` decode_agg_tps vs +`689.9` with `MINBLOCKS=2` (`0.9784x`). `GGML_CUDA_FP4_MMQ_Y=64` does not +compile against the current NVFP4 writeback invariant +`nwarps*tile_C::I == mmq_y`, so the row-tile knob is not a valid low-conflict +shortcut. + +Decision: do not promote the occupancy knobs and do not add a LocalAI patch. +The grouped-MMQ bucket still requires structural kernel work; launch-bounds and +row-tile build tweaks are closed on GB10. 
+ Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. ### Phase 10 GDN C32 slab update diff --git a/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md b/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md new file mode 100644 index 000000000..3b2606ca9 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md @@ -0,0 +1,85 @@ +# MMQ Occupancy Phase 28 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:subagent-driven-development (recommended) or +> superpowers:executing-plans to implement this plan task-by-task. Steps use +> checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs +against the current GB10 serving gap, with md5/op gates before accepting any +performance result. + +**Architecture:** Build-vs-build A/B only. The knobs are existing default-off +compile-time macros in the llama.cpp fork, so this phase does not edit source +unless a variant clears the serving gate. + +**Tech Stack:** DGX GB10, llama.cpp CUDA backend, `paged-inference-gates.sh`, +h2h n128 serving client, LocalAI parity docs. + +--- + +## Checklist + +- [x] **Step 1: Confirm candidate scope** + - Projection/FP8 follow-up was rejected by source/docs review: it is already + documented as too small or KL-failing. 
+  - The remaining small candidate was NVFP4 MMQ occupancy:
+    `GGML_CUDA_FP4_MINBLOCKS=2` and `GGML_CUDA_FP4_MMQ_Y`.
+
+- [x] **Step 2: Check DGX preflight**
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+  - GPU owner file was `FREE`.
+
+- [x] **Step 3: Run baseline inference gates**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_baseline`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 4: Build and gate `GGML_CUDA_FP4_MINBLOCKS=2`**
+  - Build dir: `/home/mudler/llama-phase6-source/build-phase28-minblocks2`
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 5: Try `GGML_CUDA_FP4_MMQ_Y=64`**
+  - Build dir: `/home/mudler/llama-phase6-source/build-phase28-mmqy64`
+  - Result: compile-time reject.
+  - Failure invariant: `static_assert(nwarps*tile_C::I == mmq_y)`.
+  - Decision: do not run combined `MMQ_Y=64+MINBLOCKS=2`; the row-tile
+    specialization is invalid before serving can be measured.
+
+- [x] **Step 6: Run same-session n128 serving A/B for the viable variant**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/serving_ab`
+  - Baseline mean, two reps: `decode_agg_tps=705.1`,
+    `decode_perseq_tps=3.970`, `agg_tps=328.8`.
+  - `MINBLOCKS=2` mean, two reps: `decode_agg_tps=689.9`,
+    `decode_perseq_tps=3.905`, `agg_tps=326.4`.
+  - Ratio: `decode_agg_tps=0.9784`, `decode_perseq_tps=0.9836`,
+    `agg_tps=0.9927`.
+
+- [x] **Step 7: Run post-serving inference gates**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2_post_serving`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 8: Record decision**
+  - `MINBLOCKS=2` is inference-safe but rejected on throughput.
+  - `MMQ_Y` is rejected as a low-conflict shortcut because the current NVFP4
+    writeback specialization only accepts the stock row tile.
+  - No llama.cpp source patch or LocalAI patch mirror is justified.
+
+## Result
+
+Phase 28 closes the small NVFP4 MMQ occupancy branch. The only buildable knob
+kept md5/op gates green but regressed n128 decode serving, and the row-tile knob
+does not compile against the current specialization. Future grouped-MMQ work
+must be structural kernel work, not a launch-bounds or row-tile build tweak.
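Reviewer note: the ratios quoted for the n128 serving A/B can be re-derived from the two-rep means recorded in this patch. The sketch below is a minimal check, not part of the benchmark tooling. The names `ab_ratios`, `BASELINE`, and `MINBLOCKS2` are hypothetical; the numbers are copied from the Phase 28 serving table.

```python
# Re-derive the Phase 28 A/B ratios from the recorded two-rep means.
# Names are illustrative; only the numbers come from the patch.

BASELINE = {
    "agg_tps": 328.8,
    "decode_agg_tps": 705.1,
    "decode_perseq_tps": 3.970,
    "prefill_tps": 1607.4,
    "ttft_mean_ms": 7868.8,
}

MINBLOCKS2 = {
    "agg_tps": 326.4,
    "decode_agg_tps": 689.9,
    "decode_perseq_tps": 3.905,
    "prefill_tps": 1644.9,
    "ttft_mean_ms": 7778.1,
}


def ab_ratios(baseline, variant, digits=4):
    """Return variant/baseline for every metric, rounded like the table."""
    return {name: round(variant[name] / baseline[name], digits)
            for name in baseline}


RATIOS = ab_ratios(BASELINE, MINBLOCKS2)

# Decode regression as a percentage, as quoted in the decision text.
DECODE_REGRESSION_PCT = round((1.0 - RATIOS["decode_agg_tps"]) * 100.0, 1)

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name}: {ratio:.4f}")
    print(f"decode_agg_tps regression: {DECODE_REGRESSION_PCT}%")
```

The recomputed values match the recorded ratio row, including the roughly `2.2%` decode aggregate regression. With only two reps per arm this checks the arithmetic, not the statistical significance of the difference.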