docs(paged): record mmq occupancy rejection

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 04:18:12 +00:00
parent 3c2cb9f4ab
commit 3b9ec3e1f1
5 changed files with 180 additions and 0 deletions


@@ -629,3 +629,13 @@ so an artifact proves inferencing gates without reading full logs.
Do not use the stale DGX
`~/bench/combined_definitive.sh` without first porting it to the current mirror
and lock discipline.
Phase 28 challenged the remaining low-conflict NVFP4 grouped-MMQ occupancy
knobs on the same DGX mirror
(`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`). The only buildable
variant, `GGML_CUDA_FP4_MINBLOCKS=2`, was inference-safe before and after
serving (MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID 806/806`) but regressed
n128 decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). The row-tile
knob `GGML_CUDA_FP4_MMQ_Y=64` failed the NVFP4 writeback compile-time
invariant. Do not promote these knobs; grouped-MMQ parity work now requires a
structural kernel change, not launch-bounds or row-tile tweaks.


@@ -1715,3 +1715,58 @@ Decision:
path still should not be reopened.
- The serving profile does not change the Phase 26 parity verdict: n128 paged
decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`.
## Phase 28 NVFP4 MMQ Occupancy Build-Knob A/B
Phase 28 tested the remaining small, additive grouped-MMQ occupancy knobs
already present in the llama.cpp fork. This was a build-vs-build A/B only; no
source change was promoted.
Artifact:
- `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`
Source and hardware:
- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`
- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`
Build/gate results:
| variant | build result | MoE md5 | dense md5 | `MUL_MAT_ID` |
|---------|--------------|---------|-----------|--------------|
| baseline | existing `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
| `GGML_CUDA_FP4_MINBLOCKS=2` | built | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
| `GGML_CUDA_FP4_MMQ_Y=64` | compile-time reject | n/a | n/a | n/a |
`GGML_CUDA_FP4_MMQ_Y=64` fails the NVFP4 writeback invariant:
`static_assert(nwarps*tile_C::I == mmq_y)`. That also rules out the combined
`MMQ_Y=64+MINBLOCKS=2` build as a source of evidence. `MMQ_Y=96` is not a valid
low-conflict shortcut either, for the same row-tile specialization reason, so
it was not promoted to a serving A/B.
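The invariant can be illustrated with a small arithmetic sketch. The values of `nwarps` and `tile_C::I` below are assumptions chosen for illustration (8 warps and a 16-row C tile, which would make the stock row tile 128); the fork's actual values are not recorded in this phase and should be read from the source before relying on this.

```python
def writeback_tile_ok(nwarps: int, tile_c_rows: int, mmq_y: int) -> bool:
    """Python mirror of static_assert(nwarps*tile_C::I == mmq_y)."""
    return nwarps * tile_c_rows == mmq_y


# ASSUMPTION: illustrative values only, not read from the llama.cpp fork.
ASSUMED_NWARPS = 8
ASSUMED_TILE_C_I = 16

# Check the two row tiles discussed above plus the assumed stock tile.
results = {
    mmq_y: writeback_tile_ok(ASSUMED_NWARPS, ASSUMED_TILE_C_I, mmq_y)
    for mmq_y in (64, 96, 128)
}

for mmq_y, ok in results.items():
    verdict = "compiles" if ok else "static_assert fails"
    print(f"MMQ_Y={mmq_y}: {verdict}")
```

Under these assumed values, any row tile other than the product `nwarps*tile_C::I` is rejected at compile time, which is consistent with both `MMQ_Y=64` and `MMQ_Y=96` being unavailable without retiling the writeback.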
Same-session n128 serving A/B (`PTOK=128`, `GEN=64`, two reps per arm):
| arm | reps | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms |
|-----|------|---------|----------------|--------------------|-------------|--------------|
| baseline | 2 | 328.8 | 705.1 | 3.970 | 1607.4 | 7868.8 |
| `MINBLOCKS=2` | 2 | 326.4 | 689.9 | 3.905 | 1644.9 | 7778.1 |
| ratio (`MINBLOCKS=2` / baseline) | 2 | 0.9927 | 0.9784 | 0.9836 | 1.0233 | 0.9885 |
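The ratio row can be re-derived from the two arm means. The following is a minimal sketch that only uses the numbers in the table above; the dictionary keys are ad hoc names, not fields from the benchmark's `summary.tsv`.

```python
# Mean values copied from the serving A/B table (two reps per arm).
baseline = {
    "agg_tps": 328.8,
    "decode_agg_tps": 705.1,
    "decode_perseq_tps": 3.970,
    "prefill_tps": 1607.4,
    "ttft_mean_ms": 7868.8,
}
minblocks2 = {
    "agg_tps": 326.4,
    "decode_agg_tps": 689.9,
    "decode_perseq_tps": 3.905,
    "prefill_tps": 1644.9,
    "ttft_mean_ms": 7778.1,
}


def ratios(variant, base):
    """Return variant/base per metric, rounded to four decimals like the table."""
    return {name: round(variant[name] / base[name], 4) for name in base}


ratio = ratios(minblocks2, baseline)

# Relative decode regression in percent (positive means the variant is slower).
decode_regression_pct = (
    1.0 - minblocks2["decode_agg_tps"] / baseline["decode_agg_tps"]
) * 100.0

for name, value in ratio.items():
    print(f"{name}: {value:.4f}")
print(f"decode_agg_tps regression: {decode_regression_pct:.1f}%")
```

Note that a TTFT ratio below 1 is an improvement (lower latency), while a throughput ratio below 1 is a regression. The decode aggregate regression works out to about `2.2%`.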
Post-serving variant gate remained green:
| phase | check | status | actual |
|-------|-------|--------|--------|
| post serving | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post serving | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post serving | `MUL_MAT_ID` | ok | `806/806` |
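The pre/post comparison reduces to an exact-match check against the baseline gate values. The sketch below is a hypothetical in-memory version of that check; it does not reproduce `paged-inference-gates.sh` or the layout of `gate_summary.tsv`, neither of which is shown here.

```python
# Expected values are the baseline gate results recorded in this phase.
EXPECTED = {
    "MoE md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT_ID": "806/806",
}


def check_gates(actual, expected=EXPECTED):
    """Return (check, status, actual) rows; status is 'ok' only on exact match."""
    rows = []
    for check, want in expected.items():
        got = actual.get(check, "missing")
        rows.append((check, "ok" if got == want else "FAIL", got))
    return rows


def all_green(rows):
    """True when every gate row matched its expected value."""
    return all(status == "ok" for _, status, _ in rows)


# Post-serving values for the MINBLOCKS=2 build, copied from the table above.
post_serving = {
    "MoE md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT_ID": "806/806",
}

rows = check_gates(post_serving)
for check, status, actual in rows:
    print(f"post serving | {check} | {status} | {actual}")
print("gate green:", all_green(rows))
```

A missing or changed digest yields a `FAIL` row, so a variant cannot pass by omitting a check.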
Decision:
- `GGML_CUDA_FP4_MINBLOCKS=2` is inference-safe but does not clear the serving
A/B gate; it regressed n128 decode aggregate by about `2.2%`.
- `GGML_CUDA_FP4_MMQ_Y` is not a valid additive shortcut without deeper NVFP4
writeback retile work.
- Do not promote either knob or add a LocalAI patch. The grouped-MMQ bucket
still needs a structural kernel change, not a launch-bounds/row-tile tweak.


@@ -376,6 +376,16 @@ helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`,
`argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on
GB10.
Phase 28 tested the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs.
Artifact: `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`.
`GGML_CUDA_FP4_MINBLOCKS=2` passed md5/op gates before and after serving
(MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`) but regressed
n128 same-session decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`).
`GGML_CUDA_FP4_MMQ_Y=64` failed to compile because the NVFP4 writeback
specialization asserts `nwarps*tile_C::I == mmq_y`. Do not promote either knob;
future grouped-MMQ work must be structural kernel work.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -443,6 +453,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
- `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green.
- `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.


@@ -746,6 +746,25 @@ and `argsort_topk` is `0.40%`. Do not reopen metadata/helper-only MoE dispatch
work on GB10. Any credible source patch must directly reduce GDN, grouped-MMQ,
or projection work and still pass the md5/op gates.
### Phase 28 NVFP4 MMQ occupancy A/B
Phase 28 challenged the small grouped-MMQ build knobs before funding structural
kernel work. Artifact:
`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`.
`GGML_CUDA_FP4_MINBLOCKS=2` built and passed the canonical safety gates before
and after serving: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Same-session
n128 serving A/B rejected it on throughput: baseline `705.1` decode_agg_tps vs
`689.9` with `MINBLOCKS=2` (`0.9784x`). `GGML_CUDA_FP4_MMQ_Y=64` does not
compile against the current NVFP4 writeback invariant
`nwarps*tile_C::I == mmq_y`, so the row-tile knob is not a valid low-conflict
shortcut.
Decision: do not promote the occupancy knobs and do not add a LocalAI patch.
The grouped-MMQ bucket still requires structural kernel work; launch-bounds and
row-tile build tweaks are closed on GB10.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,85 @@
# MMQ Occupancy Phase 28 Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:subagent-driven-development (recommended) or
> superpowers:executing-plans to implement this plan task-by-task. Steps use
> checkbox (`- [ ]`) syntax for tracking.
**Goal:** Test the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs
against the current GB10 serving gap, with md5/op gates before accepting any
performance result.
**Architecture:** Build-vs-build A/B only. The knobs are existing default-off
compile-time macros in the llama.cpp fork, so this phase does not edit source
unless a variant clears the serving gate.
**Tech Stack:** DGX GB10, llama.cpp CUDA backend, `paged-inference-gates.sh`,
h2h n128 serving client, LocalAI parity docs.
---
## Checklist
- [x] **Step 1: Confirm candidate scope**
- Projection/FP8 follow-up was rejected by source/docs review: it is already
documented as too small or KL-failing.
- The remaining small candidate was NVFP4 MMQ occupancy:
`GGML_CUDA_FP4_MINBLOCKS=2` and `GGML_CUDA_FP4_MMQ_Y`.
- [x] **Step 2: Check DGX preflight**
- `docker=0`
- `local_ai_worker=0`
- `compute=0`
- GPU owner file was `FREE`.
- [x] **Step 3: Run baseline inference gates**
- Artifact:
`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_baseline`
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT_ID`: `806/806`
- [x] **Step 4: Build and gate `GGML_CUDA_FP4_MINBLOCKS=2`**
- Build dir: `/home/mudler/llama-phase6-source/build-phase28-minblocks2`
- Artifact:
`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2`
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT_ID`: `806/806`
- [x] **Step 5: Try `GGML_CUDA_FP4_MMQ_Y=64`**
- Build dir: `/home/mudler/llama-phase6-source/build-phase28-mmqy64`
- Result: compile-time reject.
- Failure invariant: `static_assert(nwarps*tile_C::I == mmq_y)`.
- Decision: do not run combined `MMQ_Y=64+MINBLOCKS=2`; the row-tile
specialization is invalid before serving can be measured.
- [x] **Step 6: Run same-session n128 serving A/B for the viable variant**
- Artifact:
`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/serving_ab`
- Baseline mean, two reps: `decode_agg_tps=705.1`,
`decode_perseq_tps=3.970`, `agg_tps=328.8`.
- `MINBLOCKS=2` mean, two reps: `decode_agg_tps=689.9`,
`decode_perseq_tps=3.905`, `agg_tps=326.4`.
- Ratio: `decode_agg_tps=0.9784`, `decode_perseq_tps=0.9836`,
`agg_tps=0.9927`.
- [x] **Step 7: Run post-serving inference gates**
- Artifact:
`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2_post_serving`
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT_ID`: `806/806`
- [x] **Step 8: Record decision**
- `MINBLOCKS=2` is inference-safe but rejected on throughput.
- `MMQ_Y` is rejected as a low-conflict shortcut because the current NVFP4
writeback specialization only accepts the stock row tile.
- No llama.cpp source patch or LocalAI patch mirror is justified.
## Result
Phase 28 closes the small NVFP4 MMQ occupancy branch. The only buildable knob
kept md5/op gates green but regressed n128 decode serving, and the row-tile knob
does not compile against the current specialization. Future grouped-MMQ work
must be structural kernel work, not a launch-bounds or row-tile build tweak.
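The order of rejection used in this phase (build first, then inference gates, then serving throughput) can be sketched as a small decision function. The `Variant` type and the `MIN_DECODE_RATIO` threshold of `1.0` are assumptions for illustration; the plan states only that a regressing variant does not clear the serving gate, not the exact threshold.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: hypothetical threshold; the plan does not state a numeric gate.
MIN_DECODE_RATIO = 1.0


@dataclass
class Variant:
    name: str
    built: bool
    gates_green: Optional[bool] = None    # None when gates could not run
    decode_ratio: Optional[float] = None  # variant/baseline decode_agg_tps


def decide(v: Variant, min_ratio: float = MIN_DECODE_RATIO) -> str:
    """Apply the checks in order: build, inference gates, serving throughput."""
    if not v.built:
        return "reject: compile-time"
    if not v.gates_green:
        return "reject: inference gates"
    if v.decode_ratio is None or v.decode_ratio < min_ratio:
        return "reject: throughput"
    return "candidate for promotion"


# Inputs copied from the checklist above.
variants = [
    Variant("GGML_CUDA_FP4_MINBLOCKS=2", built=True, gates_green=True,
            decode_ratio=0.9784),
    Variant("GGML_CUDA_FP4_MMQ_Y=64", built=False),
]

decisions = {v.name: decide(v) for v in variants}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

Running the throughput check last matches the checklist: a serving number is only meaningful once the build exists and the md5/op gates are green.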