mirror of https://github.com/mudler/LocalAI.git, synced 2026-07-03 12:57:02 -04:00
docs(paged): record mmq occupancy rejection
Assisted-by: Codex:gpt-5
@@ -629,3 +629,13 @@ so an artifact proves inferencing gates without reading full logs.
Do not use the stale DGX
`~/bench/combined_definitive.sh` without first porting it to the current mirror
and lock discipline.

Phase 28 challenged the remaining low-conflict NVFP4 grouped-MMQ occupancy
knobs on the same DGX mirror
(`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`). The only buildable
variant, `GGML_CUDA_FP4_MINBLOCKS=2`, was inference-safe before and after
serving (MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID 806/806`) but regressed
n128 decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). The row-tile
knob `GGML_CUDA_FP4_MMQ_Y=64` failed the NVFP4 writeback compile-time
invariant. Do not promote these knobs; grouped-MMQ parity work now requires a
structural kernel change, not launch-bounds or row-tile tweaks.

@@ -1715,3 +1715,58 @@ Decision:
  path still should not be reopened.
- The serving profile does not change the Phase 26 parity verdict: n128 paged
  decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`.

## Phase 28 NVFP4 MMQ Occupancy Build-Knob A/B

Phase 28 tested the remaining small, additive grouped-MMQ occupancy knobs
already present in the llama.cpp fork. This was a build-vs-build A/B only; no
source change was promoted.

Artifact:

- `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`

Source and hardware:

- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`
- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`

Build/gate results:

| variant | build result | MoE md5 | dense md5 | `MUL_MAT_ID` |
|---------|--------------|---------|-----------|--------------|
| baseline | existing `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
| `GGML_CUDA_FP4_MINBLOCKS=2` | built | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
| `GGML_CUDA_FP4_MMQ_Y=64` | compile-time reject | n/a | n/a | n/a |

`GGML_CUDA_FP4_MMQ_Y=64` fails the NVFP4 writeback invariant:
`static_assert(nwarps*tile_C::I == mmq_y)`. Because it cannot compile, the
combined `MMQ_Y=64+MINBLOCKS=2` build is also ruled out as a source of
evidence. `MMQ_Y=96` is not a valid low-conflict shortcut either, for the same
row-tile specialization reason, so it was not promoted to a serving A/B.
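
The arithmetic behind the compile-time rejection can be sketched as below. This is a minimal illustration in Python, not the fork's CUDA code. The source only states the invariant `nwarps*tile_C::I == mmq_y`; the concrete values `NWARPS = 8` and `TILE_C_ROWS = 16` are assumptions picked for illustration, and `writeback_invariant_holds` is a hypothetical helper.

```python
# Hypothetical model of the NVFP4 writeback invariant
#   static_assert(nwarps*tile_C::I == mmq_y)
# NWARPS and TILE_C_ROWS are illustrative assumptions, not values read
# from the llama.cpp fork.
NWARPS = 8
TILE_C_ROWS = 16  # stands in for tile_C::I


def writeback_invariant_holds(mmq_y: int, nwarps: int = NWARPS,
                              tile_rows: int = TILE_C_ROWS) -> bool:
    """Return True when nwarps tiles of tile_rows rows cover exactly mmq_y rows."""
    return nwarps * tile_rows == mmq_y


if __name__ == "__main__":
    for mmq_y in (64, 96, 128):
        status = "ok" if writeback_invariant_holds(mmq_y) else "compile-time reject"
        print(f"MMQ_Y={mmq_y}: {status}")
```

Under these assumed values only a row tile of 128 satisfies the equality, which is consistent with `MMQ_Y=64` and `MMQ_Y=96` both requiring a writeback retile rather than a build-flag change.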

Same-session n128 serving A/B (`PTOK=128`, `GEN=64`, two reps per arm):

| arm | reps | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms |
|-----|------|---------|----------------|--------------------|-------------|--------------|
| baseline | 2 | 328.8 | 705.1 | 3.970 | 1607.4 | 7868.8 |
| `MINBLOCKS=2` | 2 | 326.4 | 689.9 | 3.905 | 1644.9 | 7778.1 |
| ratio (`MINBLOCKS=2` / baseline) | 2 | 0.9927 | 0.9784 | 0.9836 | 1.0233 | 0.9885 |
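
The ratio row can be reproduced from the two per-arm rows. The sketch below recomputes it in Python from the values in the table. The helper names and the `min_ratio=1.0` threshold are assumptions for illustration; the source says the variant did not clear the serving A/B gate but does not state the gate's exact threshold.

```python
# Per-arm values copied from the serving A/B table.
BASELINE = {
    "agg_tps": 328.8,
    "decode_agg_tps": 705.1,
    "decode_perseq_tps": 3.970,
    "prefill_tps": 1607.4,
    "ttft_mean_ms": 7868.8,
}
MINBLOCKS_2 = {
    "agg_tps": 326.4,
    "decode_agg_tps": 689.9,
    "decode_perseq_tps": 3.905,
    "prefill_tps": 1644.9,
    "ttft_mean_ms": 7778.1,
}


def ratios(baseline: dict, variant: dict) -> dict:
    """Variant-over-baseline ratio per metric, rounded to four decimals."""
    return {name: round(variant[name] / baseline[name], 4) for name in baseline}


def clears_decode_gate(baseline: dict, variant: dict, min_ratio: float = 1.0) -> bool:
    """Hypothetical gate: the variant must not lose decode aggregate throughput."""
    return variant["decode_agg_tps"] / baseline["decode_agg_tps"] >= min_ratio


if __name__ == "__main__":
    for name, value in ratios(BASELINE, MINBLOCKS_2).items():
        print(f"{name}: {value}")
    print("clears decode gate:", clears_decode_gate(BASELINE, MINBLOCKS_2))
```

Note that prefill throughput and TTFT move slightly in the variant's favor, so a gate keyed on those metrics alone would reach a different verdict; the decision here rests on decode aggregate throughput.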

Post-serving variant gate remained green:

| phase | check | status | actual |
|-------|-------|--------|--------|
| post serving | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post serving | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post serving | `MUL_MAT_ID` | ok | `806/806` |
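
A gate table of this shape amounts to comparing each observed value against the baseline reference. The following is a minimal sketch of that comparison, not the actual gate script: `gate_rows` and `gate_is_green` are hypothetical helpers, and only the expected values and the column layout (phase, check, status, actual) come from the tables in this section.

```python
# Reference values taken from the baseline row of the build/gate table.
EXPECTED = {
    "MoE md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT_ID": "806/806",
}


def gate_rows(phase: str, actual: dict, expected: dict = EXPECTED) -> list:
    """Build (phase, check, status, actual) rows; a missing check fails."""
    rows = []
    for check, want in expected.items():
        got = actual.get(check, "missing")
        status = "ok" if got == want else "FAIL"
        rows.append((phase, check, status, got))
    return rows


def gate_is_green(rows: list) -> bool:
    """The gate is green only when every row reports ok."""
    return all(status == "ok" for _, _, status, _ in rows)


if __name__ == "__main__":
    observed = dict(EXPECTED)  # the post-serving run matched all three checks
    rows = gate_rows("post serving", observed)
    for row in rows:
        print("\t".join(row))
    print("green:", gate_is_green(rows))
```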

Decision:

- `GGML_CUDA_FP4_MINBLOCKS=2` is inference-safe but does not clear the serving
  A/B gate; it regressed n128 decode aggregate by about `2.2%`.
- `GGML_CUDA_FP4_MMQ_Y` is not a valid additive shortcut without deeper NVFP4
  writeback retile work.
- Do not promote either knob or add a LocalAI patch. The grouped-MMQ bucket
  still needs a structural kernel change, not a launch-bounds/row-tile tweak.

@@ -376,6 +376,16 @@ helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`,
`argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on
GB10.

Phase 28 tested the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs.
Artifact: `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`.
`GGML_CUDA_FP4_MINBLOCKS=2` passed md5/op gates before and after serving
(MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`) but regressed
n128 same-session decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`).
`GGML_CUDA_FP4_MMQ_Y=64` failed to compile because the NVFP4 writeback
specialization asserts `nwarps*tile_C::I == mmq_y`. Do not promote either knob;
future grouped-MMQ work must be structural kernel work.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -443,6 +453,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist.
- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
- `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green.
- `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -746,6 +746,25 @@ and `argsort_topk` is `0.40%`. Do not reopen metadata/helper-only MoE dispatch
work on GB10. Any credible source patch must directly reduce GDN, grouped-MMQ,
or projection work and still pass the md5/op gates.

### Phase 28 NVFP4 MMQ occupancy A/B

Phase 28 challenged the small grouped-MMQ build knobs before funding structural
kernel work. Artifact:
`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`.

`GGML_CUDA_FP4_MINBLOCKS=2` built and passed the canonical safety gates before
and after serving: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Same-session
n128 serving A/B rejected it on throughput: baseline `705.1` decode_agg_tps vs
`689.9` with `MINBLOCKS=2` (`0.9784x`). `GGML_CUDA_FP4_MMQ_Y=64` does not
compile against the current NVFP4 writeback invariant
`nwarps*tile_C::I == mmq_y`, so the row-tile knob is not a valid low-conflict
shortcut.

Decision: do not promote the occupancy knobs and do not add a LocalAI patch.
The grouped-MMQ bucket still requires structural kernel work; launch-bounds and
row-tile build tweaks are closed on GB10.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update