feat(paged): add moe mmq launch trace patch

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 04:54:33 +00:00
parent b28b448c68
commit e189e5a4ca
7 changed files with 403 additions and 14 deletions


@@ -1861,3 +1861,49 @@ Decision:
separate A/B.
- Every traced call used stream-k, so a replacement kernel must account for the
current stream-k/fixup behavior rather than only conventional tiling.
## Phase 31 Live MoE MMQ Launch Shape Distribution
Phase 31 added patch `0057`, a default-off launch trace paired with the Phase 29
selector trace. It records the actual launch policy after `ntiles_dst`,
`tiles_efficiency_percent`, `stream_k_blocks`, and `fixup_needed` are known.
Artifact:
- `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`
Run:
- Fork commit: `/home/mudler/_git/llama.cpp` `c78e537b5`
- DGX mirror commit: `dgx:~/llama-phase6-source` `8b75905e9`
- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096`
- Workload: h2h `n=128`, `PTOK=128`, `GEN=64`
- Throughput while tracing: `decode_agg_tps=691.0`, `agg_tps=337.0`,
`prefill_tps=1500.4`, `TTFT mean=7671.0 ms`
Launch summary:
| bucket | launch lines | `fixup=1` | `stream_k_blocks == ntiles_dst` | tile efficiency (%) | `ncols_max` range |
|--------|--------------|-----------|----------------------------------|-----------------|-------------------|
| decode-like (`ncols_max <= 128`) | 4800 | 0 | 4800 | 96-99 | 12-128 |
| prefill-like (`ncols_max > 128`) | 4920 | 0 | 4920 | 99-100 | 129-510 |
Gates:
| check | status | actual |
|-------|--------|--------|
| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| `MUL_MAT_ID` | ok | `806/806` in all three gate runs |
Decision:
- Do not pursue a no-fixup/no-stream-k shortcut for n128 serving: the measured
launch path already uses `stream_k_blocks == ntiles_dst` and never runs fixup.
- The remaining grouped-MMQ work is structural small-M kernel work, not launch
overhead. A follow-up should target the decode-like `mmq_x <= 64`, low-density
kernel shape directly and keep the prefill `mmq_x=128` path separate.
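
The bucket counts in the launch summary can be recomputed from a trace log with a short script. The sketch below is hypothetical: it assumes each `[LLAMA_MOE_MMQ_LAUNCH]` line carries space-separated `key=value` integer fields named `ncols_max`, `ntiles_dst`, `stream_k_blocks`, and `fixup`. The actual line format emitted by patch `0057` is not reproduced in this document and may differ, so adjust the regex and field names before use.

```python
import re
from collections import Counter

# Hypothetical line format (an assumption, not taken from patch 0057):
#   [LLAMA_MOE_MMQ_LAUNCH] ncols_max=64 ntiles_dst=96 stream_k_blocks=96 fixup=0
TAG = "[LLAMA_MOE_MMQ_LAUNCH]"
FIELD = re.compile(r"(\w+)=(-?\d+)")


def summarize_launches(lines, decode_max_cols=128):
    """Bucket launch lines by ncols_max and count fixup and one-block-per-tile launches."""
    summary = {"decode": Counter(), "prefill": Counter()}
    for line in lines:
        if TAG not in line:
            continue  # ignore everything that is not a launch trace line
        fields = {key: int(value) for key, value in FIELD.findall(line)}
        bucket = "decode" if fields["ncols_max"] <= decode_max_cols else "prefill"
        counts = summary[bucket]
        counts["launches"] += 1
        counts["fixup"] += fields["fixup"]
        counts["blocks_eq_tiles"] += int(fields["stream_k_blocks"] == fields["ntiles_dst"])
    return summary
```

For the n128 run described above, the expectation would be `fixup == 0` and `blocks_eq_tiles == launches` in both buckets.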


@@ -64,7 +64,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `20a99518a` is mirrored by worktree patch `0056-feat-cuda-trace-moe-mmq-batch-shapes.patch`; applying all `47` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `8a7779726a81689a14f10a64523f2cc380d4801f`, exactly matching the fork.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the number of patch files is smaller than the highest patch number; what must hold is tree-hash equality, not the count. Current verified state: fork HEAD `c78e537b5` is mirrored by worktree patch `0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`; applying all `48` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching the fork.
### 2.6 Bench hygiene gates
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -322,11 +322,11 @@ Use this harness for future current-stack GB10 snapshots. Do not reuse
`~/bench/combined_definitive.sh` unless it is first ported away from stale
`~/llama-paged-dev` paths and old lock assumptions.
Phase 22 re-verified the patch-series mirror invariant after patch `0055`:
Phase 31 re-verified the patch-series mirror invariant after patch `0057`:
applying every LocalAI `patches/paged/0*.patch` with strict `git apply` on top of
Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree
`5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching fork branch
`localai-paged` HEAD `20a99518a feat(cuda): trace moe mmq batch shapes`.
`4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching fork branch
`localai-paged` HEAD `c78e537b5 feat(cuda): trace moe mmq launch shapes`.
Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot
hardware report. DGX dry run passed at
@@ -405,6 +405,20 @@ grouped-MMQ calls split into 1200 decode-like calls (`ncols_max <= 128`) and
gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`.
Phase 31 added patch `0057` for default-off grouped-MMQ launch tracing.
Artifact: `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`.
Fork commit: `c78e537b5 feat(cuda): trace moe mmq launch shapes`; DGX mirror
commit: `8b75905e9`. The trace adds `[LLAMA_MOE_MMQ_LAUNCH]` lines under
`LLAMA_MOE_MMQ_SHAPE_TRACE=<n>`, recording `ntiles_dst`, `stream_k_blocks`,
tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. The
default-off, trace-enabled, and post-serving gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The n128 serving
trace showed decode-like `4800/4800` and prefill-like `4920/4920` launch lines
with `fixup=0` and `stream_k_blocks == ntiles_dst`. Do not pursue a
no-fixup/no-stream-k shortcut for this workload; the remaining grouped-MMQ work
is structural small-M kernel work.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -454,15 +468,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
## 7. KEY FILE / ARTIFACT INDEX
### Fork (canonical source of truth)
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `20a99518a39acbb4474fa9c97121fc7b9f07c1ef` ("trace moe mmq batch shapes", patch `0056`).
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `826c97a05` with the Phase 29 shape-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `c78e537b56e3446f8aa645c6700aacf263639bd8` ("trace moe mmq launch shapes", patch `0057`).
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `8b75905e9` with the Phase 31 launch-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
- Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.
### LocalAI worktree
- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
- `patches/paged/`: **46** `.patch` files spanning 0001-0055 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055.
- `patches/paged/`: **48** `.patch` files spanning 0001-0057 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch traces are 0056-0057.
### Bench artifacts (DGX)
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
@@ -475,6 +489,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected.
- `~/bench/phase29_mmq_shape_trace/20260701_042428` - default-off MoE MMQ shape trace patch `0056`; CUDA build plus default/trace md5 gates green.
- `~/bench/phase30_mmq_shape_serving/20260701_043300` - live n128 serving MMQ shape distribution from patch `0056`; post-run md5/op gates green.
- `~/bench/phase31_mmq_launch_trace/20260701_064424` - default-off MoE MMQ launch trace patch `0057`; default/trace/post-serving md5 gates green; n128 launch trace rejects stream-k/fixup shortcut (`fixup=0`, `stream_k_blocks == ntiles_dst`).
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -487,8 +502,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
2. **Current fork/mirror are clean and verified.** Local fork HEAD is `20a99518a`, DGX clean mirror HEAD is `826c97a05`, and Phase 29 re-proved the LocalAI patch series tree equals the fork tree (`8a7779726a81689a14f10a64523f2cc380d4801f`). The old `llama-paged-dev` tree is historical only.
3. **Worktree patch series is tracked through 0056.** The only current untracked path in this worktree is `.claude/`.
2. **Current fork/mirror are clean and verified.** Local fork HEAD is `c78e537b5`, DGX clean mirror HEAD is `8b75905e9`, and Phase 31 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
3. **Worktree patch series is tracked through 0057.** The only expected unrelated untracked path in this worktree is `.claude/`.
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.


@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.
## Latest mirror check
Phase 29 re-verified the mirror invariant after adding patch `0056`:
Phase 31 re-verified the mirror invariant after adding patch `0057`:
```text
base=0ed235ea2c17a19fc8238668653946721ed136fd
applied_tree=8a7779726a81689a14f10a64523f2cc380d4801f
fork_tree=8a7779726a81689a14f10a64523f2cc380d4801f
applied_tree=4eae628e4ba6f2defa14a19d19f7e4abef9a2647
fork_tree=4eae628e4ba6f2defa14a19d19f7e4abef9a2647
```
The check used a fresh worktree at `LLAMA_VERSION`, applied every
`patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
`git write-tree` to canonical fork branch `localai-paged` at
`20a99518a feat(cuda): trace moe mmq batch shapes`.
`c78e537b5 feat(cuda): trace moe mmq launch shapes`.
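
The check described above can be scripted. This is a minimal sketch rather than the script that was actually run: the helper names, the scratch path, and the choice to drive `git` from Python are assumptions. It uses only standard git subcommands (`worktree add`, `apply`, `add -A`, `write-tree`, `rev-parse`).

```python
import subprocess
from pathlib import Path


def git(*args, cwd):
    """Run a git command in cwd and return its stripped stdout."""
    result = subprocess.run(["git", *args], cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def mirror_matches(fork_repo, pin, patches_dir, branch="localai-paged",
                   scratch="/tmp/paged-mirror-check"):
    """Apply the patch series on the pin and compare tree hashes with the fork branch."""
    # Fresh detached worktree at the pinned commit (scratch path is hypothetical).
    git("worktree", "add", "--detach", scratch, pin, cwd=fork_repo)
    try:
        # Apply patches in filename order; git apply aborts on any context mismatch.
        for patch in sorted(Path(patches_dir).glob("0*.patch")):
            git("apply", str(patch.resolve()), cwd=scratch)
        git("add", "-A", cwd=scratch)               # stage the patched tree
        applied_tree = git("write-tree", cwd=scratch)
    finally:
        git("worktree", "remove", "--force", scratch, cwd=fork_repo)
    fork_tree = git("rev-parse", f"{branch}^{{tree}}", cwd=fork_repo)
    return applied_tree == fork_tree, applied_tree, fork_tree
```

Called with the fork checkout, the Makefile pin, and the `patches/paged/` directory, the function returns whether `applied_tree` equals `fork_tree`, along with both hashes for the log.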
## Status


@@ -802,6 +802,29 @@ small-M decode tiles (`ncols_max` 26-111, density 1-4) separately from prefill.
The current stream-k/fixup path is part of the measured shape and cannot be
ignored by a replacement kernel.
### Phase 31 live serving MMQ launch distribution
Phase 31 added patch `0057`, extending `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` with
`[LLAMA_MOE_MMQ_LAUNCH]` lines emitted from `launch_mul_mat_q` after the actual
stream-k launch policy is known. Artifact:
`/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`.
The default-off, trace-enabled, and post-serving gates all stayed bit-exact:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
Live n128 serving with `LLAMA_MOE_MMQ_SHAPE_TRACE=4096` produced:
| bucket | launch lines | `fixup=1` | `stream_k_blocks == ntiles_dst` | tile efficiency (%) |
|--------|--------------|-----------|----------------------------------|-----------------|
| decode-like (`ncols_max <= 128`) | 4800 | 0 | 4800 | 96-99 |
| prefill-like (`ncols_max > 128`) | 4920 | 0 | 4920 | 99-100 |
Lever implication: a no-fixup/no-stream-k shortcut is rejected for the measured
n128 serving workload. Every traced launch already runs one stream-k block per
destination tile (`stream_k_blocks == ntiles_dst`) with no fixup pass, so the
remaining gap is the small-M grouped-MMQ kernel shape itself, not launch/fixup
overhead.
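
To see why high tile efficiency goes together with `stream_k_blocks == ntiles_dst` and `fixup=0`, consider a launch policy of the following form. This is a sketch under stated assumptions, not a copy of `launch_mul_mat_q`: the 90 percent threshold, the function name, and the SM count used in the example are assumptions and should be checked against the patched source.

```python
def stream_k_launch_policy(ntiles_dst, nsm, min_efficiency_percent=90):
    """Return (tiles_efficiency_percent, stream_k_blocks, fixup_needed).

    Assumed policy, not copied from the patched source: use one block per SM
    only when plain tiling would leave the last wave mostly idle.
    """
    tiles_nwaves = (ntiles_dst + nsm - 1) // nsm  # ceil(ntiles_dst / nsm)
    tiles_efficiency_percent = 100 * ntiles_dst // (nsm * tiles_nwaves)
    if tiles_efficiency_percent < min_efficiency_percent:
        stream_k_blocks = nsm          # spread the work evenly across all SMs
    else:
        stream_k_blocks = ntiles_dst   # one block per destination tile
    # A fixup pass is only needed when tiles do not divide evenly among blocks.
    fixup_needed = ntiles_dst % stream_k_blocks != 0
    return tiles_efficiency_percent, stream_k_blocks, fixup_needed
```

Under these assumptions, with a hypothetical `nsm=48`, `stream_k_launch_policy(95, 48)` gives `(98, 95, False)`, which matches the pattern in the table, while `stream_k_launch_policy(50, 48)` gives `(52, 48, True)`, a case the n128 trace never hit.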
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update