From e189e5a4caa0bb36ffb2944faee6f7478ac8ade2 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 04:54:33 +0000
Subject: [PATCH] feat(paged): add moe mmq launch trace patch

Assisted-by: Codex:gpt-5
---
 backend/cpp/llama-cpp-localai-paged/README.md | 12 +-
 .../docs/GB10_PARITY_PHASE0_RESULTS.md | 46 ++++
 .../docs/PARITY_HANDOFF.md | 33 ++-
 .../docs/PATCH_MAINTENANCE.md | 8 +-
 .../docs/VLLM_PARITY_LEVER_MAP.md | 23 ++
 ...eat-cuda-trace-moe-mmq-launch-shapes.patch | 223 ++++++++++++++++++
 .../2026-07-01-mmq-launch-trace-phase31.md | 72 ++++++
 7 files changed, 403 insertions(+), 14 deletions(-)
 create mode 100644 backend/cpp/llama-cpp-localai-paged/patches/paged/0057-feat-cuda-trace-moe-mmq-launch-shapes.patch
 create mode 100644 docs/superpowers/plans/2026-07-01-mmq-launch-trace-phase31.md

diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md
index a189ce360..dfb132ec0 100644
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.
 
 ---
 
-## 3. Patch series (0001-0055)
+## 3. Patch series (0001-0057)
 
 Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
 decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -214,6 +214,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
 | 0054 | **Disable backend sampling for MTP drafts** - forces server MTP draft generation through the target-side sampler acceptance path instead of letting the draft backend sample independently. This was required for the Phase 14 rollback/prefix safety gate. | yes for canonical non-MTP gates; Phase 14 MTP normalized greedy-prefix gate passed |
 | 0055 | **Trace speculative batch shapes** - adds default-off `LLAMA_SPEC_SHAPE_TRACE=1` server logs around `server_slot::handle_last_sampled_token()`, reporting normal decode rows and MTP verification `K + 1` rows (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). This is instrumentation only for Phase 18 shape-entropy measurement before any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) |
 | 0056 | **Trace MoE MMQ batch shapes** - adds default-off `LLAMA_MOE_MMQ_SHAPE_TRACE=` logs from the grouped-MMQ host selector, reporting routed assignment count, estimated active experts, density, selected `mmq_x`, `mmq_y`, and stream-k. This is evidence-only instrumentation for sizing structural grouped-MMQ work after Phase 28 rejected launch-bounds/row-tile knobs. | yes (env unset and trace-enabled gates both green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; trace cap verified with 4 lines) |
+| 0057 | **Trace MoE MMQ launch shapes** - extends `LLAMA_MOE_MMQ_SHAPE_TRACE=` with bounded `[LLAMA_MOE_MMQ_LAUNCH]` lines from `launch_mul_mat_q`, recording actual `ntiles_dst`, `stream_k_blocks`, tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. This is evidence-only instrumentation to distinguish real stream-k/fixup overhead from small-M kernel-shape cost. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 31 n128 trace showed decode and prefill `fixup=0`, `stream_k_blocks == ntiles_dst`) |
 
 > **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
 > the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
@@ -647,3 +648,12 @@ added test-first (`test-cuda-mmq-shape-trace`), compiled under CUDA on DGX, and
 kept inference stable with the trace disabled and enabled: MoE `8cb0ce23`, dense
 `5951a5b4`, `MUL_MAT_ID 806/806`. Example trace line:
 `[LLAMA_MOE_MMQ_SHAPE] type=40 moe=1 ncols_dst=104 nchannels_x=256 ncols_max=13 n_active_est=104 density=1 mmq_x_max=128 mmq_x_lim=64 mmq_x_best=16 mmq_y=128 stream_k=1`.
+
+Phase 31 extended that trace as patch `0057`
+(`/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`) with
+`[LLAMA_MOE_MMQ_LAUNCH]` lines from `launch_mul_mat_q`. Default-off,
+trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense
+`5951a5b4`, `MUL_MAT_ID 806/806`. The n128 serving trace showed decode-like
+`4800/4800` and prefill-like `4920/4920` launch lines with `fixup=0` and
+`stream_k_blocks == ntiles_dst`, rejecting a no-fixup/no-stream-k shortcut for
+this workload.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 44c1a25ac..4463d80ce 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1861,3 +1861,49 @@ Decision:
   separate A/B.
 - Every traced call used stream-k, so a replacement kernel must account for the
   current stream-k/fixup behavior rather than only conventional tiling.
+
+## Phase 31 Live MoE MMQ Launch Shape Distribution
+
+Phase 31 added patch `0057`, a default-off launch trace paired with the Phase 29
+selector trace. It records the actual launch policy after `ntiles_dst`,
+`tiles_efficiency_percent`, `stream_k_blocks`, and `fixup_needed` are known.
+
+Artifact:
+
+- `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`
+
+Run:
+
+- Fork commit: `/home/mudler/_git/llama.cpp` `c78e537b5`
+- DGX mirror commit: `dgx:~/llama-phase6-source` `8b75905e9`
+- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096`
+- Workload: h2h `n=128`, `PTOK=128`, `GEN=64`
+- Throughput while tracing: `decode_agg_tps=691.0`, `agg_tps=337.0`,
+  `prefill_tps=1500.4`, `TTFT mean=7671.0 ms`
+
+Launch summary:
+
+| bucket | launch lines | `fixup=1` | `stream_k_blocks == ntiles_dst` | tile efficiency | `ncols_max` range |
+|--------|--------------|-----------|----------------------------------|-----------------|-------------------|
+| decode-like (`ncols_max <= 128`) | 4800 | 0 | 4800 | 96-99 | 12-128 |
+| prefill-like (`ncols_max > 128`) | 4920 | 0 | 4920 | 99-100 | 129-510 |
+
+Gates:
+
+| check | status | actual |
+|-------|--------|--------|
+| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT_ID` | ok | `806/806` in all three gate runs |
+
+Decision:
+
+- Do not pursue a no-fixup/no-stream-k shortcut for n128 serving: the measured
+  launch path already uses `stream_k_blocks == ntiles_dst` and never runs fixup.
+- The remaining grouped-MMQ work is structural small-M kernel work, not launch
+  overhead. A follow-up should target the decode-like `mmq_x <= 64`, low-density
+  kernel shape directly and keep the prefill `mmq_x=128` path separate.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 9ba6f4a2a..4880099d0 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -64,7 +64,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
 - **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
 - **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
 - The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
-- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `20a99518a` is mirrored by worktree patch `0056-feat-cuda-trace-moe-mmq-batch-shapes.patch`; applying all `47` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `8a7779726a81689a14f10a64523f2cc380d4801f`, exactly matching the fork.
+- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `c78e537b5` is mirrored by worktree patch `0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`; applying all `48` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching the fork.
 
 ### 2.6 Bench hygiene gates
 - **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -322,11 +322,11 @@ Use this harness for future current-stack GB10 snapshots. Do not reuse
 `~/bench/combined_definitive.sh` unless it is first ported away from stale
 `~/llama-paged-dev` paths and old lock assumptions.
 
-Phase 22 re-verified the patch-series mirror invariant after patch `0055`:
+Phase 31 re-verified the patch-series mirror invariant after patch `0057`:
 applying every LocalAI `patches/paged/0*.patch` with strict `git apply` on top
 of Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree
-`5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching fork branch
-`localai-paged` HEAD `20a99518a feat(cuda): trace moe mmq batch shapes`.
+`4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching fork branch
+`localai-paged` HEAD `c78e537b5 feat(cuda): trace moe mmq launch shapes`.
 
 Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot
 hardware report. DGX dry run passed at
@@ -405,6 +405,20 @@ grouped-MMQ calls split into 1200 decode-like calls (`ncols_max <= 128`) and
 gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`.
 
+Phase 31 added patch `0057` for default-off grouped-MMQ launch tracing.
+Artifact: `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`.
+Fork commit: `c78e537b5 feat(cuda): trace moe mmq launch shapes`; DGX mirror
+commit: `8b75905e9`. The trace adds `[LLAMA_MOE_MMQ_LAUNCH]` lines under
+`LLAMA_MOE_MMQ_SHAPE_TRACE=`, recording `ntiles_dst`, `stream_k_blocks`,
+tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. Default
+off, trace-enabled, and post-serving gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The n128 serving
+trace showed decode-like `4800/4800` and prefill-like `4920/4920` launch lines
+with `fixup=0` and `stream_k_blocks == ntiles_dst`. Do not pursue a
+no-fixup/no-stream-k shortcut for this workload; the remaining grouped-MMQ work
+is structural small-M kernel work.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -454,15 +468,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 ## 7. KEY FILE / ARTIFACT INDEX
 
 ### Fork (canonical source of truth)
-- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `20a99518a39acbb4474fa9c97121fc7b9f07c1ef` ("trace moe mmq batch shapes", patch `0056`).
-- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `826c97a05` with the Phase 29 shape-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
+- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `c78e537b56e3446f8aa645c6700aacf263639bd8` ("trace moe mmq launch shapes", patch `0057`).
+- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `8b75905e9` with the Phase 31 launch-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
 - Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.
 
 ### LocalAI worktree
 - Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
 - Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
 - `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
-- `patches/paged/`: **46** `.patch` files spanning 0001-0055 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055.
+- `patches/paged/`: **48** `.patch` files spanning 0001-0057 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch traces are 0056-0057.
 
 ### Bench artifacts (DGX)
 - `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
@@ -475,6 +489,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected.
 - `~/bench/phase29_mmq_shape_trace/20260701_042428` - default-off MoE MMQ shape trace patch `0056`; CUDA build plus default/trace md5 gates green.
 - `~/bench/phase30_mmq_shape_serving/20260701_043300` - live n128 serving MMQ shape distribution from patch `0056`; post-run md5/op gates green.
+- `~/bench/phase31_mmq_launch_trace/20260701_064424` - default-off MoE MMQ launch trace patch `0057`; default/trace/post-serving md5 gates green; n128 launch trace rejects stream-k/fixup shortcut (`fixup=0`, `stream_k_blocks == ntiles_dst`).
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -487,8 +502,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 
 ### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
 1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Current fork/mirror are clean and verified.** Local fork HEAD is `20a99518a`, DGX clean mirror HEAD is `826c97a05`, and Phase 29 re-proved the LocalAI patch series tree equals the fork tree (`8a7779726a81689a14f10a64523f2cc380d4801f`). The old `llama-paged-dev` tree is historical only.
-3. **Worktree patch series is tracked through 0056.** The only current untracked path in this worktree is `.claude/`.
+2. **Current fork/mirror are clean and verified.** Local fork HEAD is `c78e537b5`, DGX clean mirror HEAD is `8b75905e9`, and Phase 31 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
+3. **Worktree patch series is tracked through 0057.** The only expected unrelated untracked path in this worktree is `.claude/`.
 4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
 5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
index a9886a9c3..59a97cde1 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.
 
 ## Latest mirror check
 
-Phase 29 re-verified the mirror invariant after adding patch `0056`:
+Phase 31 re-verified the mirror invariant after adding patch `0057`:
 
 ```text
 base=0ed235ea2c17a19fc8238668653946721ed136fd
-applied_tree=8a7779726a81689a14f10a64523f2cc380d4801f
-fork_tree=8a7779726a81689a14f10a64523f2cc380d4801f
+applied_tree=4eae628e4ba6f2defa14a19d19f7e4abef9a2647
+fork_tree=4eae628e4ba6f2defa14a19d19f7e4abef9a2647
 ```
 
 The check used a fresh worktree at `LLAMA_VERSION`, applied every
 `patches/paged/0*.patch` with strict `git apply`, staged the result, and
 compared `git write-tree` to canonical fork branch `localai-paged` at
-`20a99518a feat(cuda): trace moe mmq batch shapes`.
+`c78e537b5 feat(cuda): trace moe mmq launch shapes`.
 
 ## Status
 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 76bb61928..d83daf195 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -802,6 +802,29 @@ small-M decode tiles (`ncols_max` 26-111, density 1-4) separately from prefill.
 The current stream-k/fixup path is part of the measured shape and cannot be
 ignored by a replacement kernel.
 
+### Phase 31 live serving MMQ launch distribution
+
+Phase 31 added patch `0057`, extending `LLAMA_MOE_MMQ_SHAPE_TRACE=` with
+`[LLAMA_MOE_MMQ_LAUNCH]` lines emitted from `launch_mul_mat_q` after the actual
+stream-k launch policy is known. Artifact:
+`/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`.
+
+The default-off, trace-enabled, and post-serving gates all stayed bit-exact:
+MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+
+Live n128 serving with `LLAMA_MOE_MMQ_SHAPE_TRACE=4096` produced:
+
+| bucket | launch lines | `fixup=1` | `stream_k_blocks == ntiles_dst` | tile efficiency |
+|--------|--------------|-----------|----------------------------------|-----------------|
+| decode-like (`ncols_max <= 128`) | 4800 | 0 | 4800 | 96-99 |
+| prefill-like (`ncols_max > 128`) | 4920 | 0 | 4920 | 99-100 |
+
+Lever implication: a no-fixup/no-stream-k shortcut is rejected for the measured
+n128 serving workload. The launch code is already choosing conventional
+stream-k tiling with no fixup; the remaining gap is the small-M grouped-MMQ
+kernel shape itself, not launch/fixup overhead.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0057-feat-cuda-trace-moe-mmq-launch-shapes.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0057-feat-cuda-trace-moe-mmq-launch-shapes.patch
new file mode 100644
index 000000000..a44ba7be6
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0057-feat-cuda-trace-moe-mmq-launch-shapes.patch
@@ -0,0 +1,223 @@
+From c78e537b56e3446f8aa645c6700aacf263639bd8 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Wed, 1 Jul 2026 04:49:55 +0000
+Subject: [PATCH] feat(cuda): trace moe mmq launch shapes
+
+Assisted-by: Codex:gpt-5
+---
+ ggml/src/ggml-cuda/mmq-shape-trace.h | 65 ++++++++++++++++++++++++++++
+ ggml/src/ggml-cuda/mmq.cuh | 47 +++++++++++++-------
+ tests/test-cuda-mmq-shape-trace.cpp | 35 +++++++++++++++
+ 3 files changed, 132 insertions(+), 15 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h
+index 9d41b7c80..98bc21f7f 100644
+--- a/ggml/src/ggml-cuda/mmq-shape-trace.h
++++ b/ggml/src/ggml-cuda/mmq-shape-trace.h
+@@ -19,6 +19,24 @@ struct ggml_cuda_mmq_shape {
+     bool use_stream_k;
+ };
+ 
++struct ggml_cuda_mmq_launch_shape {
++    int type;
++    bool is_moe;
++    int64_t ncols_dst;
++    int64_t ncols_max;
++    int mmq_x;
++    int mmq_y;
++    int ntx;
++    int nty;
++    int ntzw;
++    int ntiles_dst;
++    int nsm;
++    int tiles_nwaves;
++    int tiles_efficiency_percent;
++    int stream_k_blocks;
++    bool fixup_needed;
++};
++
+ static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
+         const int type, const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x,
+         const int64_t ncols_max, const int mmq_x_max, const int mmq_x_lim, const int mmq_x_best,
+@@ -46,6 +64,30 @@ static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
+     };
+ }
+ 
++static inline ggml_cuda_mmq_launch_shape ggml_cuda_mmq_launch_shape_make(
++        const int type, const bool is_moe, const int64_t ncols_dst, const int64_t ncols_max,
++        const int mmq_x, const int mmq_y, const int ntx, const int nty, const int ntzw,
++        const int ntiles_dst, const int nsm, const int tiles_nwaves, const int tiles_efficiency_percent,
++        const int stream_k_blocks, const bool fixup_needed) {
++    return {
++        type,
++        is_moe,
++        ncols_dst,
++        ncols_max,
++        mmq_x,
++        mmq_y,
++        ntx,
++        nty,
++        ntzw,
++        ntiles_dst,
++        nsm,
++        tiles_nwaves,
++        tiles_efficiency_percent,
++        stream_k_blocks,
++        fixup_needed,
++    };
++}
++
+ static inline int ggml_cuda_mmq_shape_format(char * buf, const size_t size, const ggml_cuda_mmq_shape & shape) {
+     return std::snprintf(buf, size,
+         "type=%d moe=%d ncols_dst=%lld nchannels_x=%lld ncols_max=%lld "
+@@ -64,3 +106,26 @@ static inline int ggml_cuda_mmq_shape_format(char * buf, const size_t size, cons
+         shape.mmq_y,
+         shape.use_stream_k ? 1 : 0);
+ }
++
++static inline int ggml_cuda_mmq_launch_shape_format(
++        char * buf, const size_t size, const ggml_cuda_mmq_launch_shape & shape) {
++    return std::snprintf(buf, size,
++        "type=%d moe=%d ncols_dst=%lld ncols_max=%lld mmq_x=%d mmq_y=%d "
++        "ntx=%d nty=%d ntzw=%d ntiles_dst=%d nsm=%d tiles_nwaves=%d "
++        "tiles_efficiency=%d stream_k_blocks=%d fixup=%d",
++        shape.type,
++        shape.is_moe ? 1 : 0,
++        (long long) shape.ncols_dst,
++        (long long) shape.ncols_max,
++        shape.mmq_x,
++        shape.mmq_y,
++        shape.ntx,
++        shape.nty,
++        shape.ntzw,
++        shape.ntiles_dst,
++        shape.nsm,
++        shape.tiles_nwaves,
++        shape.tiles_efficiency_percent,
++        shape.stream_k_blocks,
++        shape.fixup_needed ? 1 : 0);
++}
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index 6bc943738..34002edf7 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
++++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -3989,6 +3989,24 @@ static size_t mmq_get_nbytes_shared(const int mmq_x, const int mmq_y, const int
+     return nbs_ids + nbs_x + GGML_PAD(nbs_y, nwarps*warp_size*sizeof(int));
+ }
+ 
++static inline int ggml_cuda_moe_mmq_shape_trace_limit() {
++    static const int limit = []() -> int {
++        const char * s = getenv("LLAMA_MOE_MMQ_SHAPE_TRACE");
++        if (s == nullptr || strcmp(s, "0") == 0) {
++            return 0;
++        }
++        const int parsed = atoi(s);
++        return parsed > 0 ? parsed : 256;
++    }();
++    return limit;
++}
++
++static inline bool ggml_cuda_moe_mmq_trace_take(std::atomic<int> & counter) {
++    const int trace_limit = ggml_cuda_moe_mmq_shape_trace_limit();
++    const int trace_index = trace_limit > 0 ? counter.fetch_add(1, std::memory_order_relaxed) : trace_limit;
++    return trace_index >= 0 && trace_index < trace_limit;
++}
++
+ template <ggml_type type, int mmq_x>
+ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int id = ggml_cuda_get_device();
+@@ -4054,6 +4072,19 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+ 
+     const bool fixup_needed = ntiles_dst % block_nums_stream_k.x != 0;
+ 
++    if (args.expert_bounds != nullptr) {
++        static std::atomic<int> trace_count{0};
++        if (ggml_cuda_moe_mmq_trace_take(trace_count)) {
++            char buf[256];
++            const ggml_cuda_mmq_launch_shape shape = ggml_cuda_mmq_launch_shape_make(
++                (int) type, true, args.ncols_dst, args.ncols_max, mmq_x, mmq_y,
++                ntx, nty, ntzw, ntiles_dst, nsm, tiles_nwaves, tiles_efficiency_percent,
++                block_nums_stream_k.x, fixup_needed);
++            ggml_cuda_mmq_launch_shape_format(buf, sizeof(buf), shape);
++            fprintf(stderr, "[LLAMA_MOE_MMQ_LAUNCH] %s\n", buf);
++        }
++    }
++
+     ggml_cuda_pool & pool = ctx.pool(id);
+     ggml_cuda_pool_alloc<float> tmp_fixup(pool);
+     if (fixup_needed) {
+@@ -4167,18 +4198,6 @@ static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
+     return c;
+ }
+ 
+-static inline int ggml_cuda_moe_mmq_shape_trace_limit() {
+-    static const int limit = []() -> int {
+-        const char * s = getenv("LLAMA_MOE_MMQ_SHAPE_TRACE");
+-        if (s == nullptr || strcmp(s, "0") == 0) {
+-            return 0;
+-        }
+-        const int parsed = atoi(s);
+-        return parsed > 0 ? parsed : 256;
+-    }();
+-    return limit;
+-}
+-
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int id = ggml_cuda_get_device();
+@@ -4267,9 +4286,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+ 
+     if (args.expert_bounds != nullptr) {
+         static std::atomic<int> trace_count{0};
+-        const int trace_limit = ggml_cuda_moe_mmq_shape_trace_limit();
+-        const int trace_index = trace_limit > 0 ? trace_count.fetch_add(1, std::memory_order_relaxed) : trace_limit;
+-        if (trace_index >= 0 && trace_index < trace_limit) {
++        if (ggml_cuda_moe_mmq_trace_take(trace_count)) {
+             char buf[256];
+             const ggml_cuda_mmq_shape shape = ggml_cuda_mmq_shape_make(
+                 (int) type, true, args.ncols_dst, args.nchannels_x, args.ncols_max,
+diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp
+index 8620169c0..86ee15e02 100644
+--- a/tests/test-cuda-mmq-shape-trace.cpp
++++ b/tests/test-cuda-mmq-shape-trace.cpp
+@@ -38,5 +38,40 @@ int main() {
+     require(std::strstr(buf, "mmq_x_best=64") != nullptr, "trace includes selected tile");
+     require(std::strstr(buf, "stream_k=1") != nullptr, "trace includes stream-k flag");
+ 
++    const ggml_cuda_mmq_launch_shape launch = ggml_cuda_mmq_launch_shape_make(
++        /* type */ 39,
++        /* is_moe */ true,
++        /* ncols_dst */ 1024,
++        /* ncols_max */ 128,
++        /* mmq_x */ 64,
++        /* mmq_y */ 128,
++        /* ntx */ 2,
++        /* nty */ 4,
++        /* ntzw */ 3,
++        /* ntiles_dst */ 24,
++        /* nsm */ 16,
++        /* tiles_nwaves */ 2,
++        /* tiles_efficiency_percent */ 75,
++        /* stream_k_blocks */ 16,
++        /* fixup_needed */ true);
++
++    const int launch_n = ggml_cuda_mmq_launch_shape_format(buf, sizeof(buf), launch);
++
++    require(launch_n > 0, "launch format returns byte count");
++    require(std::strstr(buf, "moe=1") != nullptr, "launch trace includes moe flag");
++    require(std::strstr(buf, "ncols_dst=1024") != nullptr, "launch trace includes routed assignment count");
++    require(std::strstr(buf, "ncols_max=128") != nullptr, "launch trace includes max column count");
++    require(std::strstr(buf, "mmq_x=64") != nullptr, "launch trace includes compiled x tile");
++    require(std::strstr(buf, "mmq_y=128") != nullptr, "launch trace includes compiled y tile");
++    require(std::strstr(buf, "ntx=2") != nullptr, "launch trace includes x tile count");
++    require(std::strstr(buf, "nty=4") != nullptr, "launch trace includes y tile count");
++    require(std::strstr(buf, "ntzw=3") != nullptr, "launch trace includes batch tile count");
++    require(std::strstr(buf, "ntiles_dst=24") != nullptr, "launch trace includes total tile count");
++    require(std::strstr(buf, "nsm=16") != nullptr, "launch trace includes SM count");
++    require(std::strstr(buf, "tiles_nwaves=2") != nullptr, "launch trace includes wave count");
++    require(std::strstr(buf, "tiles_efficiency=75") != nullptr, "launch trace includes stream-k efficiency");
++    require(std::strstr(buf, "stream_k_blocks=16") != nullptr, "launch trace includes actual stream-k block count");
++    require(std::strstr(buf, "fixup=1") != nullptr, "launch trace includes fixup flag");
++
+     return 0;
+ }
diff --git a/docs/superpowers/plans/2026-07-01-mmq-launch-trace-phase31.md b/docs/superpowers/plans/2026-07-01-mmq-launch-trace-phase31.md
new file mode 100644
index 000000000..511e1e915
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mmq-launch-trace-phase31.md
@@ -0,0 +1,72 @@
+# MMQ Launch Trace Phase 31 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Extend the default-off MoE MMQ trace so decode serving records launch-shape, stream-k block, efficiency, and fixup facts without changing inference behavior.
+
+**Architecture:** Keep patch `0056` selector tracing intact and add a second bounded trace line inside `launch_mul_mat_q`, where the actual stream-k block policy and `fixup_needed` are known. The new helper is host-only and tested without CUDA execution; DGX gates validate that default-off and trace-enabled inference md5/op outputs are unchanged.
+
+**Tech Stack:** llama.cpp CUDA backend, host-only C++ unit test, LocalAI paged patch series, DGX GB10 gate scripts.
+
+---
+
+## Files
+
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq-shape-trace.h`
+  - Add `ggml_cuda_mmq_launch_shape`, make/format helpers for launch metrics.
+- Modify: `/home/mudler/_git/llama.cpp/tests/test-cuda-mmq-shape-trace.cpp`
+  - Add host-only assertions for launch trace formatting.
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh`
+  - Emit `[LLAMA_MOE_MMQ_LAUNCH]` when `LLAMA_MOE_MMQ_SHAPE_TRACE` is enabled and grouped-MMQ uses stream-k.
+- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`
+  - Mirror the fork commit as the next incremental patch.
+- Modify docs in `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/`
+  - README patch table, GB10 results, lever map, handoff, patch maintenance.
+
+## Checklist
+
+- [x] **Step 1: Write RED host test**
+  - Add assertions in `tests/test-cuda-mmq-shape-trace.cpp` that call `ggml_cuda_mmq_launch_shape_make` and expect formatted fields: `ntiles_dst`, `stream_k_blocks`, `tiles_efficiency`, `fixup`, `nsm`, `ntx`, `nty`, `ntzw`.
+  - Run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4`
+  - Expected: compile failure because the launch helper does not exist yet.
+
+- [x] **Step 2: Implement host launch trace helper**
+  - Add `ggml_cuda_mmq_launch_shape` plus make/format helpers in `mmq-shape-trace.h`.
+  - Re-run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`
+  - Expected: test passes.
+
+- [x] **Step 3: Wire bounded launch trace**
+  - In `launch_mul_mat_q`, after `fixup_needed` is known, emit `[LLAMA_MOE_MMQ_LAUNCH]` only when `args.expert_bounds != nullptr`, `args.use_stream_k`, and `LLAMA_MOE_MMQ_SHAPE_TRACE` limit allows it.
+  - Use a separate static counter from selector trace so the user can see up to N selector and N launch lines.
+
+- [x] **Step 4: Build and gate on DGX**
+  - Preflight: verify `docker=0`, `local_ai_worker=0`, `compute=0`, and take the owner lock.
+  - Build: `cmake --build build-cuda --target llama-completion test-backend-ops test-cuda-mmq-shape-trace -j $(nproc)`
+  - Default-off gate expected: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID 806/806`.
+  - Trace gate expected: same md5/op values with bounded shape and launch trace lines.
+
+- [x] **Step 5: Run n128 serving launch trace**
+  - Run h2h n128 with `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096`.
+  - Parse `[LLAMA_MOE_MMQ_LAUNCH]` lines into decode-like and prefill-like buckets.
+  - Decide whether a no-fixup/no-stream-k shortcut is justified from measured `stream_k_blocks`, `tiles_efficiency`, and `fixup`.
+
+- [x] **Step 6: Mirror patch and update docs**
+  - Commit llama.cpp fork.
+  - Generate LocalAI patch `0057` from the fork commit.
+  - Verify strict patch-series application reaches the fork tree.
+  - Mark this plan complete with artifact path and gate results.
+  - Commit LocalAI docs and patch with `Assisted-by: Codex:gpt-5`.
+ - Run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4` + - Expected: compile failure because the launch helper does not exist yet. + +- [x] **Step 2: Implement host launch trace helper** + - Add `ggml_cuda_mmq_launch_shape` plus make/format helpers in `mmq-shape-trace.h`. + - Re-run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace` + - Expected: test passes. + +- [x] **Step 3: Wire bounded launch trace** + - In `launch_mul_mat_q`, after `fixup_needed` is known, emit `[LLAMA_MOE_MMQ_LAUNCH]` only when `args.expert_bounds != nullptr`, `args.use_stream_k`, and `LLAMA_MOE_MMQ_SHAPE_TRACE` limit allows it. + - Use a separate static counter from selector trace so the user can see up to N selector and N launch lines. + +- [x] **Step 4: Build and gate on DGX** + - Preflight: verify `docker=0`, `local_ai_worker=0`, `compute=0`, and take the owner lock. + - Build: `cmake --build build-cuda --target llama-completion test-backend-ops test-cuda-mmq-shape-trace -j $(nproc)` + - Default-off gate expected: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID 806/806`. + - Trace gate expected: same md5/op values with bounded shape and launch trace lines. + +- [x] **Step 5: Run n128 serving launch trace** + - Run h2h n128 with `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096`. + - Parse `[LLAMA_MOE_MMQ_LAUNCH]` lines into decode-like and prefill-like buckets. + - Decide whether a no-fixup/no-stream-k shortcut is justified from measured `stream_k_blocks`, `tiles_efficiency`, and `fixup`. + +- [x] **Step 6: Mirror patch and update docs** + - Commit llama.cpp fork. + - Generate LocalAI patch `0057` from the fork commit. + - Verify strict patch-series application reaches the fork tree. + - Mark this plan complete with artifact path and gate results. + - Commit LocalAI docs and patch with `Assisted-by: Codex:gpt-5`. 
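The bounded-trace rule in Step 3 can be sketched as a small host-only helper. This is a minimal sketch, not the patch's actual code: `trace_limit_from_string` and `trace_take` are hypothetical names, and the parsing rule (unset or `"0"` disables, a positive integer is the cap, anything else falls back to 256) is copied from the limit function that the diff removes from `mul_mat_q_case`. The real `ggml_cuda_moe_mmq_trace_take` may differ in signature and behavior.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: turn the value of LLAMA_MOE_MMQ_SHAPE_TRACE into a cap.
// Unset or "0" disables tracing, a positive integer is used as the cap,
// and any other value falls back to 256 lines.
static int trace_limit_from_string(const char * s) {
    if (s == nullptr || std::strcmp(s, "0") == 0) {
        return 0;
    }
    const int parsed = std::atoi(s);
    return parsed > 0 ? parsed : 256;
}

// Hypothetical helper: returns true for the first `limit` calls on `counter`.
// When tracing is disabled the counter is never touched, so the default-off
// path does no atomic work.
static bool trace_take(std::atomic<int> & counter, int limit) {
    if (limit <= 0) {
        return false;
    }
    const int index = counter.fetch_add(1, std::memory_order_relaxed);
    // index >= 0 guards against the counter wrapping around after very long runs.
    return index >= 0 && index < limit;
}
```

Keeping one `std::atomic<int>` per trace site (one for the selector line, one for the launch line) is what lets a single limit of N yield up to N lines of each kind, as Step 3 describes.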
+ +## Result + +- Fork commit: `/home/mudler/_git/llama.cpp` `c78e537b5 feat(cuda): trace moe mmq launch shapes`. +- DGX mirror commit: `dgx:~/llama-phase6-source` `8b75905e9 feat(cuda): trace moe mmq launch shapes`. +- Artifact: `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`. +- RED verified: `cmake --build build --target test-cuda-mmq-shape-trace -j 4` failed on missing `ggml_cuda_mmq_launch_shape`. +- GREEN verified locally: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`. +- DGX CUDA build verified: `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`. +- Default-off, trace-enabled, and post-serving gates all matched MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. +- n128 traced serving: `decode_agg_tps=691.0`, `agg_tps=337.0`, `prefill_tps=1500.4`, `TTFT mean=7671.0 ms`. +- Launch result: decode-like `4800/4800` and prefill-like `4920/4920` launch lines had `fixup=0` and `stream_k_blocks == ntiles_dst`. + +Decision: do not pursue a no-fixup/no-stream-k shortcut for the current n128 workload. The actual launch path is already taking conventional stream-k tiling with no fixup; the remaining grouped-MMQ gap is the small-M tile/kernel shape itself, not stream-k fixup overhead.
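The launch-line check behind this decision can be sketched as a small log tally. This is a sketch under stated assumptions: the field names (`ntiles_dst`, `stream_k_blocks`, `fixup`) come from the host test in this patch, but the exact line layout is assumed here to be space-separated `key=value` pairs after the `[LLAMA_MOE_MMQ_LAUNCH]` tag. `trace_field`, `launch_tally`, and `tally_launch_line` are hypothetical names, not part of the patch.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical: read an integer `key=value` field from one trace line.
// Assumes fields are separated by spaces; returns false if the key is absent.
static bool trace_field(const std::string & line, const std::string & key, long & out) {
    const std::string needle = key + "=";
    std::string::size_type pos = 0;
    while ((pos = line.find(needle, pos)) != std::string::npos) {
        // Only accept a match at a token boundary, so a key cannot match
        // the tail of a longer field name.
        if (pos == 0 || line[pos - 1] == ' ') {
            out = std::strtol(line.c_str() + pos + needle.size(), nullptr, 10);
            return true;
        }
        pos += needle.size();
    }
    return false;
}

// Hypothetical tally of launch lines and how many used no fixup
// with one stream-k block per destination tile.
struct launch_tally {
    long total = 0;
    long no_fixup_full_tiles = 0;
};

static void tally_launch_line(const std::string & line, launch_tally & tally) {
    if (line.find("[LLAMA_MOE_MMQ_LAUNCH]") == std::string::npos) {
        return; // not a launch trace line
    }
    long fixup = 0;
    long blocks = 0;
    long tiles = 0;
    if (!trace_field(line, "fixup", fixup) ||
        !trace_field(line, "stream_k_blocks", blocks) ||
        !trace_field(line, "ntiles_dst", tiles)) {
        return; // skip lines missing any of the three fields
    }
    tally.total++;
    if (fixup == 0 && blocks == tiles) {
        tally.no_fixup_full_tiles++;
    }
}
```

Feeding every log line through `tally_launch_line` and comparing `no_fixup_full_tiles` with `total` reproduces the kind of `4800/4800` ratio reported above. Splitting lines into decode-like and prefill-like buckets would need an extra criterion that this sketch leaves out.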