From b28b448c681e5f9766d790fe1b1bb3b41b4a7f41 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 04:36:04 +0000
Subject: [PATCH] docs(paged): record mmq shape serving profile

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 50 ++++++++++++++++++
 .../docs/PARITY_HANDOFF.md                    | 10 ++++
 .../docs/VLLM_PARITY_LEVER_MAP.md             | 18 +++++++
 .../2026-07-01-mmq-shape-serving-phase30.md   | 51 +++++++++++++++++++
 4 files changed, 129 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-07-01-mmq-shape-serving-phase30.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index a1eed5517..44c1a25ac 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -1811,3 +1811,53 @@ Decision:
   itself.
 - It gives a bounded, md5-safe way to collect live serving grouped-MMQ shape
   evidence before designing the next structural kernel.
+
+## Phase 30 Live MoE MMQ Shape Distribution
+
+Phase 30 used patch `0056` under the n128 h2h serving workload to collect the
+first 4096 grouped-MMQ selector shapes. This is a measurement-only phase.
+
+Artifact:
+
+- `/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`
+
+Run:
+
+- Source: `dgx:~/llama-phase6-source`, commit `826c97a05`
+- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096`
+- Workload: h2h `n=128`, `PTOK=128`, `GEN=64`
+- Throughput while tracing: `decode_agg_tps=645.8`, `agg_tps=313.3`,
+  `prefill_tps=1597.9`, `TTFT mean=8192.3 ms`
+
+Trace summary:
+
+| bucket | total traced calls | dominant `mmq_x_best` | density range | `ncols_max` range |
+|--------|--------------------|-----------------------|---------------|-------------------|
+| decode-like (`ncols_max <= 128`) | 1200 | `64` (480), `32` (360), `40` (240), `48` (120) | 1-4 | 26-111 |
+| prefill-like (`ncols_max > 128`) | 2896 | `128` (1816), `64` (720), `112` (240), `48` (120) | 5-16 | 132-512 |
+
+Overall first-4096 distribution:
+
+| metric | notable values |
+|--------|----------------|
+| `mmq_x_best` | `128`: 1816, `64`: 1200, `32`: 360, `40`: 240, `48`: 240, `112`: 240 |
+| `density` | `16`: 1680, `2`: 480, `1`: 360, `6`: 360, `4`: 240, `5`: 240 |
+| `stream_k` | `1`: 4096 |
+
+Post-run gates:
+
+| check | status | actual |
+|-------|--------|--------|
+| MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
+| dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT_ID` | ok | `806/806` |
+
+Decision:
+
+- Decode serving really is feeding grouped-MMQ small-M tiles: in this trace,
+  decode-like calls stay at density `1-4` and `mmq_x_best <= 64`.
+- Prefill-like calls mostly select `mmq_x_best=128` and density `16`, so a
+  decode-only structural kernel should not be generalized to prefill without a
+  separate A/B.
+- Every traced call used stream-k, so a replacement kernel must account for the
+  current stream-k/fixup behavior rather than only conventional tiling.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 04e40a97e..9ba6f4a2a 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -396,6 +396,15 @@ Default-off and `LLAMA_MOE_MMQ_SHAPE_TRACE=4` gates both passed: MoE gate
 emitted exactly four `[LLAMA_MOE_MMQ_SHAPE]` lines. This is evidence-only
 instrumentation; it does not close the speed gap.
 
+Phase 30 used patch `0056` for a live n128 serving shape trace. Artifact:
+`/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`. The first 4096
+grouped-MMQ calls split into 1200 decode-like calls (`ncols_max <= 128`) and
+2896 prefill-like calls. Decode-like calls had density `1-4` and selected
+`mmq_x_best` only in `{32,40,48,64}`; prefill-like calls were mostly density
+`16` and selected `mmq_x_best=128`. All traced calls had `stream_k=1`. Post-run
+gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -465,6 +474,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green.
 - `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected.
 - `~/bench/phase29_mmq_shape_trace/20260701_042428` - default-off MoE MMQ shape trace patch `0056`; CUDA build plus default/trace md5 gates green.
+- `~/bench/phase30_mmq_shape_serving/20260701_043300` - live n128 serving MMQ shape distribution from patch `0056`; post-run md5/op gates green.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 5892da554..76bb61928 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -784,6 +784,24 @@ Use this only to size the next grouped-MMQ structural kernel. It
 intentionally does not perform device readback of `expert_bounds`, so it
 records selector inputs and estimated density rather than exact
 per-expert histograms.
+
+### Phase 30 live serving MMQ shape distribution
+
+Phase 30 ran n128 serving with `LLAMA_MOE_MMQ_SHAPE_TRACE=4096` on the patched
+DGX mirror (`826c97a05`). Artifact:
+`/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`.
+
+The first 4096 grouped-MMQ calls split into 1200 decode-like calls
+(`ncols_max <= 128`) and 2896 prefill-like calls. Decode-like calls used
+densities `1-4` and selected only `mmq_x_best` `32/40/48/64`
+(`64`: 480, `32`: 360, `40`: 240, `48`: 120). Prefill-like calls were mostly
+density `16` and selected `mmq_x_best=128` for 1816 calls. Every traced call had
+`stream_k=1`.
+
+Kernel implication: the next grouped-MMQ structural experiment should target
+small-M decode tiles (`ncols_max` 26-111, density 1-4) separately from prefill.
+The current stream-k/fixup path is part of the measured shape and cannot be
+ignored by a replacement kernel.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-mmq-shape-serving-phase30.md b/docs/superpowers/plans/2026-07-01-mmq-shape-serving-phase30.md
new file mode 100644
index 000000000..6c59119a8
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mmq-shape-serving-phase30.md
@@ -0,0 +1,51 @@
+# MMQ Shape Serving Phase 30 Plan
+
+> **For agentic workers:** Use verification-before-completion before claiming
+> trace or gate results. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Use patch `0056` to collect live grouped-MMQ selector shapes under the
+n128 serving workload and derive the next structural-kernel target shape.
+
+**Architecture:** Measurement-only. Start `llama-server` with
+`LLAMA_MOE_MMQ_SHAPE_TRACE=4096`, run h2h n128, parse the server log, then run
+post-serving md5/op gates.
+
+**Tech Stack:** DGX GB10, llama.cpp CUDA backend, h2h client,
+`paged-inference-gates.sh`.
+
+---
+
+## Checklist
+
+- [x] **Step 1: Check DGX preflight and lock**
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+  - owner file set to `codex-phase30-mmq-shape-serving`
+
+- [x] **Step 2: Run traced n128 serving workload**
+  - Artifact: `/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`
+  - Source: `dgx:~/llama-phase6-source`, commit `826c97a05`
+  - Env: `LLAMA_MOE_MMQ_SHAPE_TRACE=4096`
+  - h2h result: `decode_agg_tps=645.8`, `agg_tps=313.3`,
+    `prefill_tps=1597.9`, `TTFT mean=8192.3 ms`
+
+- [x] **Step 3: Parse trace distribution**
+  - Total traced calls: `4096`
+  - Decode-like (`ncols_max <= 128`): `1200`
+  - Prefill-like (`ncols_max > 128`): `2896`
+  - Decode-like selected `mmq_x_best` only in `{32,40,48,64}` with density
+    `1-4`.
+  - Prefill-like was mostly density `16` with `mmq_x_best=128`.
+  - `stream_k=1` for all traced calls.
+
+- [x] **Step 4: Run post-serving inference gates**
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+## Result
+
+The next grouped-MMQ structural experiment should target decode small-M shapes
+separately from prefill: `ncols_max` 26-111, density 1-4, selected tile <= 64,
+with stream-k/fixup behavior accounted for.
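
Reviewer note (not part of the patch): the bucketing described in Step 3 could be reproduced from the server log with something like the sketch below. The patch does not show the exact line format that patch `0056` emits, so the `key=value` field syntax after the `[LLAMA_MOE_MMQ_SHAPE]` tag is an assumption, as are the helper names `parse_trace` and `summarize`. Only the tag, the field names, and the `ncols_max <= 128` boundary come from the documents above.

```python
import re
from collections import Counter

TAG = "[LLAMA_MOE_MMQ_SHAPE]"
# ASSUMPTION: fields appear as whitespace-separated key=value integers
# after the tag; adjust the regex to the real output of patch 0056.
FIELD_RE = re.compile(r"(\w+)=(-?\d+)")
DECODE_MAX_NCOLS = 128  # bucket boundary used in the Phase 30 tables


def parse_trace(lines):
    """Yield one dict of integer fields per tagged trace line."""
    for line in lines:
        _, tag, rest = line.partition(TAG)
        if tag:  # untagged log lines are skipped
            yield {key: int(value) for key, value in FIELD_RE.findall(rest)}


def summarize(records):
    """Per bucket: call count, mmq_x_best and stream_k histograms, ranges."""
    buckets = {}
    for rec in records:
        decode = rec["ncols_max"] <= DECODE_MAX_NCOLS
        name = "decode-like" if decode else "prefill-like"
        bucket = buckets.setdefault(name, {
            "calls": 0,
            "mmq_x_best": Counter(),
            "stream_k": Counter(),
            "density": (rec["density"], rec["density"]),
            "ncols_max": (rec["ncols_max"], rec["ncols_max"]),
        })
        bucket["calls"] += 1
        bucket["mmq_x_best"][rec["mmq_x_best"]] += 1
        bucket["stream_k"][rec["stream_k"]] += 1
        for key in ("density", "ncols_max"):
            low, high = bucket[key]
            bucket[key] = (min(low, rec[key]), max(high, rec[key]))
    return buckets


if __name__ == "__main__":
    # Synthetic lines for illustration only; these are not real trace output.
    sample = [
        "[LLAMA_MOE_MMQ_SHAPE] ncols_max=26 density=1 mmq_x_best=32 stream_k=1",
        "[LLAMA_MOE_MMQ_SHAPE] ncols_max=512 density=16 mmq_x_best=128 stream_k=1",
        "an unrelated server log line",
    ]
    for name, bucket in summarize(parse_trace(sample)).items():
        print(name, bucket["calls"], dict(bucket["mmq_x_best"]),
              bucket["density"], bucket["ncols_max"])
```

Keeping the boundary in a single constant makes it easy to re-bucket the same log if a later phase chooses a different decode/prefill split.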