mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
feat(paged): add moe mmq shape trace patch
Assisted-by: Codex:gpt-5
@@ -213,6 +213,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact

| 0054 | **Disable backend sampling for MTP drafts** - forces server MTP draft generation through the target-side sampler acceptance path instead of letting the draft backend sample independently. This was required for the Phase 14 rollback/prefix safety gate. | yes for canonical non-MTP gates; Phase 14 MTP normalized greedy-prefix gate passed |
| 0055 | **Trace speculative batch shapes** - adds default-off `LLAMA_SPEC_SHAPE_TRACE=1` server logs around `server_slot::handle_last_sampled_token()`, reporting normal decode rows and MTP verification `K + 1` rows (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). This is instrumentation only for Phase 18 shape-entropy measurement before any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) |
| 0056 | **Trace MoE MMQ batch shapes** - adds default-off `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` logs from the grouped-MMQ host selector, reporting routed assignment count, estimated active experts, density, selected `mmq_x`, `mmq_y`, and stream-k. This is evidence-only instrumentation for sizing structural grouped-MMQ work after Phase 28 rejected launch-bounds/row-tile knobs. | yes (env unset and trace-enabled gates both green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; trace cap verified with 4 lines) |

> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
@@ -639,3 +640,10 @@ n128 decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). The row-tile
knob `GGML_CUDA_FP4_MMQ_Y=64` failed the NVFP4 writeback compile-time
invariant. Do not promote these knobs; grouped-MMQ parity work now requires a
structural kernel change, not launch-bounds or row-tile tweaks.

Phase 29 added the default-off grouped-MMQ shape trace as patch `0056`
(`/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`). The helper was
added test-first (`test-cuda-mmq-shape-trace`), compiled under CUDA on DGX, and
kept inference stable with the trace disabled and enabled:
MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID 806/806`. Example trace line:
`[LLAMA_MOE_MMQ_SHAPE] type=40 moe=1 ncols_dst=104 nchannels_x=256 ncols_max=13 n_active_est=104 density=1 mmq_x_max=128 mmq_x_lim=64 mmq_x_best=16 mmq_y=128 stream_k=1`.
@@ -1770,3 +1770,44 @@ Decision:
  writeback retile work.
- Do not promote either knob or add a LocalAI patch. The grouped-MMQ bucket
  still needs a structural kernel change, not a launch-bounds/row-tile tweak.

## Phase 29 Default-Off MoE MMQ Shape Trace

Phase 29 added evidence-only instrumentation for the structural grouped-MMQ
path that remains after Phase 28. The trace is default-off and lives at the
host-side grouped-MMQ selector so it does not read `expert_bounds` back from the
device or add a synchronization.

Patch and artifact:

- Fork commit: `20a99518a feat(cuda): trace moe mmq batch shapes`
- LocalAI patch: `0056-feat-cuda-trace-moe-mmq-batch-shapes.patch`
- Artifact: `/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`

TDD/build checks:

| check | result |
|-------|--------|
| RED | `test-cuda-mmq-shape-trace` first failed on missing `ggml-cuda/mmq-shape-trace.h` |
| local GREEN | `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace` |
| DGX CUDA build | `cmake --build build-cuda --target llama-completion test-backend-ops test-cuda-mmq-shape-trace` |

Safety gates:

| gate | MoE md5 | dense md5 | `MUL_MAT_ID` | trace lines |
|------|---------|-----------|--------------|-------------|
| default-off | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | `0` |
| `LLAMA_MOE_MMQ_SHAPE_TRACE=4` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | `4` |

Example trace line:

```text
[LLAMA_MOE_MMQ_SHAPE] type=40 moe=1 ncols_dst=104 nchannels_x=256 ncols_max=13 n_active_est=104 density=1 mmq_x_max=128 mmq_x_lim=64 mmq_x_best=16 mmq_y=128 stream_k=1
```
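
For offline analysis, a trace line in this format can be split into `key=value` pairs. The following is a minimal sketch, not part of patch `0056`: the function name `parse_mmq_shape_line` is hypothetical, and it assumes every value is an integer, as in the example line above.

```cpp
#include <cstddef>
#include <exception>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of patch 0056): splits the payload of a
// "[LLAMA_MOE_MMQ_SHAPE] key=value ..." log line into a key -> value map.
// Returns an empty map when the line does not carry the trace tag.
static std::map<std::string, long long> parse_mmq_shape_line(const std::string & line) {
    static const std::string tag = "[LLAMA_MOE_MMQ_SHAPE]";
    std::map<std::string, long long> fields;

    const std::size_t pos = line.find(tag);
    if (pos == std::string::npos) {
        return fields;
    }

    std::istringstream in(line.substr(pos + tag.size()));
    std::string token;
    while (in >> token) {
        const std::size_t eq = token.find('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == token.size()) {
            continue; // not a key=value token
        }
        try {
            fields[token.substr(0, eq)] = std::stoll(token.substr(eq + 1));
        } catch (const std::exception &) {
            // value is not an integer: skip the token
        }
    }
    return fields;
}
```

Collecting the parsed `ncols_dst`, `density`, and `mmq_x_best` values over a serving run would give the shape distribution that the next structural kernel needs to be sized against.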

Decision:

- This is not a speed patch and should not be counted as parity progress by
  itself.
- It gives a bounded, md5-safe way to collect live serving grouped-MMQ shape
  evidence before designing the next structural kernel.
@@ -64,7 +64,7 @@ A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It
- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `fb9402661` is mirrored by worktree patch `0055-feat-server-trace-speculative-batch-shapes.patch`; applying all `46` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching the fork.
- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count. Current verified state: fork HEAD `20a99518a` is mirrored by worktree patch `0056-feat-cuda-trace-moe-mmq-batch-shapes.patch`; applying all `47` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `8a7779726a81689a14f10a64523f2cc380d4801f`, exactly matching the fork.

### 2.6 Bench hygiene gates
- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
@@ -326,7 +326,7 @@ Phase 22 re-verified the patch-series mirror invariant after patch `0055`:
applying every LocalAI `patches/paged/0*.patch` with strict `git apply` on top of
Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree
`5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`, exactly matching fork branch
`localai-paged` HEAD `fb9402661 feat(server): trace speculative batch shapes`.
`localai-paged` HEAD `20a99518a feat(cuda): trace moe mmq batch shapes`.

Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot
hardware report. DGX dry run passed at
@@ -386,6 +386,16 @@ n128 same-session decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`).
specialization asserts `nwarps*tile_C::I == mmq_y`. Do not promote either knob;
future grouped-MMQ work must be structural kernel work.

Phase 29 added the default-off grouped-MMQ shape trace as patch `0056`.
Artifact: `/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`.
Fork commit: `20a99518a feat(cuda): trace moe mmq batch shapes`. The helper was
added test-first (`test-cuda-mmq-shape-trace`) and built under CUDA on DGX.
Default-off and `LLAMA_MOE_MMQ_SHAPE_TRACE=4` gates both passed: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The trace-enabled
gate emitted exactly four `[LLAMA_MOE_MMQ_SHAPE]` lines. This is evidence-only
instrumentation; it does not close the speed gap.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -435,8 +445,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
## 7. KEY FILE / ARTIFACT INDEX

### Fork (canonical source of truth)
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `fb9402661291e0488a3e2bf2f3948ebcd18e18c9` ("trace speculative batch shapes", patch `0055`).
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `f2521ab12` with the same tree as the local fork; this is what Phase 20 and the current snapshot harness use.
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `20a99518a39acbb4474fa9c97121fc7b9f07c1ef` ("trace moe mmq batch shapes", patch `0056`).
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `826c97a05` with the Phase 29 shape-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
- Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.

### LocalAI worktree
@@ -454,6 +464,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`.
- `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green.
- `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected.
- `~/bench/phase29_mmq_shape_trace/20260701_042428` - default-off MoE MMQ shape trace patch `0056`; CUDA build plus default/trace md5 gates green.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -466,8 +477,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
2. **Current fork/mirror are clean and verified.** Local fork HEAD is `fb9402661`, DGX clean mirror HEAD is `f2521ab12`, and Phase 22 proved the LocalAI patch series tree equals the fork tree. The old `llama-paged-dev` tree is historical only.
3. **Worktree patch series is tracked through 0055.** The only current untracked path in this worktree is `.claude/`.
2. **Current fork/mirror are clean and verified.** Local fork HEAD is `20a99518a`, DGX clean mirror HEAD is `826c97a05`, and Phase 29 re-proved the LocalAI patch series tree equals the fork tree (`8a7779726a81689a14f10a64523f2cc380d4801f`). The old `llama-paged-dev` tree is historical only.
3. **Worktree patch series is tracked through 0056.** The only current untracked path in this worktree is `.claude/`.
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.

## Latest mirror check

Phase 22 re-verified the mirror invariant after adding patch `0055`:
Phase 29 re-verified the mirror invariant after adding patch `0056`:

```text
base=0ed235ea2c17a19fc8238668653946721ed136fd
applied_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb
fork_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb
applied_tree=8a7779726a81689a14f10a64523f2cc380d4801f
fork_tree=8a7779726a81689a14f10a64523f2cc380d4801f
```

The check used a fresh worktree at `LLAMA_VERSION`, applied every
`patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
`git write-tree` to canonical fork branch `localai-paged` at
`fb9402661 feat(server): trace speculative batch shapes`.
`20a99518a feat(cuda): trace moe mmq batch shapes`.

## Status
@@ -765,6 +765,25 @@ Decision: do not promote the occupancy knobs and do not add a LocalAI patch.
The grouped-MMQ bucket still requires structural kernel work; launch-bounds and
row-tile build tweaks are closed on GB10.

### Phase 29 default-off MoE MMQ shape trace

Patch `0056` adds `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` as bounded, default-off
instrumentation at the grouped-MMQ host selector. Artifact:
`/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`. Fork commit:
`20a99518a feat(cuda): trace moe mmq batch shapes`.

The helper was added test-first (`test-cuda-mmq-shape-trace` failed on the
missing header before implementation, then passed locally and under the DGX CUDA
build). Default-off and trace-enabled gates both passed: MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. The
trace-enabled gate with `LLAMA_MOE_MMQ_SHAPE_TRACE=4` emitted exactly four
shape lines.
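
The "exactly four lines" behaviour follows from a process-wide counter that is compared against the cap. A minimal sketch of that idea, assuming a caller-owned `std::atomic<int>`; the helper name `trace_slot_available` is hypothetical and the patch inlines equivalent logic in `mul_mat_q_case` rather than calling a function like this:

```cpp
#include <atomic>

// Hypothetical helper illustrating the bounded-trace idea: returns true for
// the first `limit` calls that share `counter`, and false afterwards.
// With limit <= 0 the counter is never touched, so the disabled path does
// no atomic work.
static bool trace_slot_available(std::atomic<int> & counter, const int limit) {
    if (limit <= 0) {
        return false;
    }
    // Relaxed ordering is enough here: only the count matters, not the
    // ordering of the log line relative to other memory operations.
    const int index = counter.fetch_add(1, std::memory_order_relaxed);
    return index >= 0 && index < limit;
}
```

Because the check is a single host-side atomic increment, it stays independent of device state, which is consistent with the trace not adding a synchronization.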

Use this only to size the next grouped-MMQ structural kernel. It intentionally
does not perform device readback of `expert_bounds`, so it records selector
inputs and estimated density rather than exact per-expert histograms.
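
The estimate itself is simple integer arithmetic on the selector inputs. The sketch below restates the computation from `mmq-shape-trace.h` as a standalone function; the names `moe_density_estimate` and `estimate_moe_density` are hypothetical and exist only for illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical standalone restatement of the estimate used by the trace.
struct moe_density_estimate {
    int64_t n_active_est; // min(routed assignments, expert count)
    int64_t density;      // ceil(routed assignments / n_active_est)
};

static moe_density_estimate estimate_moe_density(const int64_t ncols_dst, const int64_t nchannels_x) {
    if (ncols_dst <= 0 || nchannels_x <= 0) {
        return {0, 0}; // nothing routed, or no experts: report zeros
    }
    // At most one distinct expert per assignment, and no more than exist.
    const int64_t n_active_est = std::min(ncols_dst, nchannels_x);
    // Integer ceiling division.
    const int64_t density = (ncols_dst + n_active_est - 1) / n_active_est;
    return {n_active_est, density};
}
```

For the example trace line (`ncols_dst=104`, `nchannels_x=256`) this gives `n_active_est=104` and `density=1`; for the unit-test inputs (`1024`, `256`) it gives `256` and `4`. Since `n_active_est` is an upper bound on the number of distinct active experts, the real per-expert distribution can differ from the reported density.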

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
@@ -0,0 +1,212 @@
From 20a99518a39acbb4474fa9c97121fc7b9f07c1ef Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 1 Jul 2026 04:27:19 +0000
Subject: [PATCH] feat(cuda): trace moe mmq batch shapes

Assisted-by: Codex:gpt-5
---
 ggml/src/ggml-cuda/mmq-shape-trace.h | 66 ++++++++++++++++++++++++++++
 ggml/src/ggml-cuda/mmq.cuh           | 31 ++++++++++++-
 tests/CMakeLists.txt                 |  2 +
 tests/test-cuda-mmq-shape-trace.cpp  | 42 ++++++++++++++++++
 4 files changed, 140 insertions(+), 1 deletion(-)
 create mode 100644 ggml/src/ggml-cuda/mmq-shape-trace.h
 create mode 100644 tests/test-cuda-mmq-shape-trace.cpp

diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h
new file mode 100644
index 000000000..9d41b7c80
--- /dev/null
+++ b/ggml/src/ggml-cuda/mmq-shape-trace.h
@@ -0,0 +1,66 @@
+#pragma once
+
+#include <cstddef>
+#include <cstdint>
+#include <cstdio>
+
+struct ggml_cuda_mmq_shape {
+    int type;
+    bool is_moe;
+    int64_t ncols_dst;
+    int64_t nchannels_x;
+    int64_t ncols_max;
+    int64_t n_active_est;
+    int64_t density;
+    int mmq_x_max;
+    int mmq_x_lim;
+    int mmq_x_best;
+    int mmq_y;
+    bool use_stream_k;
+};
+
+static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
+        const int type, const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x,
+        const int64_t ncols_max, const int mmq_x_max, const int mmq_x_lim, const int mmq_x_best,
+        const int mmq_y, const bool use_stream_k) {
+    int64_t n_active_est = 0;
+    int64_t density = 0;
+    if (is_moe && ncols_dst > 0 && nchannels_x > 0) {
+        n_active_est = ncols_dst < nchannels_x ? ncols_dst : nchannels_x;
+        density = (ncols_dst + n_active_est - 1) / n_active_est;
+    }
+
+    return {
+        type,
+        is_moe,
+        ncols_dst,
+        nchannels_x,
+        ncols_max,
+        n_active_est,
+        density,
+        mmq_x_max,
+        mmq_x_lim,
+        mmq_x_best,
+        mmq_y,
+        use_stream_k,
+    };
+}
+
+static inline int ggml_cuda_mmq_shape_format(char * buf, const size_t size, const ggml_cuda_mmq_shape & shape) {
+    return std::snprintf(buf, size,
+        "type=%d moe=%d ncols_dst=%lld nchannels_x=%lld ncols_max=%lld "
+        "n_active_est=%lld density=%lld mmq_x_max=%d mmq_x_lim=%d "
+        "mmq_x_best=%d mmq_y=%d stream_k=%d",
+        shape.type,
+        shape.is_moe ? 1 : 0,
+        (long long) shape.ncols_dst,
+        (long long) shape.nchannels_x,
+        (long long) shape.ncols_max,
+        (long long) shape.n_active_est,
+        (long long) shape.density,
+        shape.mmq_x_max,
+        shape.mmq_x_lim,
+        shape.mmq_x_best,
+        shape.mmq_y,
+        shape.use_stream_k ? 1 : 0);
+}
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index b53e38a8b..6bc943738 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -3,10 +3,14 @@
 #include "common.cuh"
 #include "vecdotq.cuh"
 #include "mma.cuh"
+#include "mmq-shape-trace.h"

+#include <atomic>
 #include <climits>
 #include <cstdint>
+#include <cstdio>
 #include <cstdlib>
+#include <cstring>

 using namespace ggml_cuda_mma;

@@ -4163,6 +4167,18 @@ static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
     return c;
 }

+static inline int ggml_cuda_moe_mmq_shape_trace_limit() {
+    static const int limit = []() -> int {
+        const char * s = getenv("LLAMA_MOE_MMQ_SHAPE_TRACE");
+        if (s == nullptr || strcmp(s, "0") == 0) {
+            return 0;
+        }
+        const int parsed = atoi(s);
+        return parsed > 0 ? parsed : 256;
+    }();
+    return limit;
+}
+
 template <ggml_type type>
 void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
     const int id = ggml_cuda_get_device();
@@ -4249,6 +4265,20 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
         }
     }

+    if (args.expert_bounds != nullptr) {
+        static std::atomic<int> trace_count{0};
+        const int trace_limit = ggml_cuda_moe_mmq_shape_trace_limit();
+        const int trace_index = trace_limit > 0 ? trace_count.fetch_add(1, std::memory_order_relaxed) : trace_limit;
+        if (trace_index >= 0 && trace_index < trace_limit) {
+            char buf[256];
+            const ggml_cuda_mmq_shape shape = ggml_cuda_mmq_shape_make(
+                (int) type, true, args.ncols_dst, args.nchannels_x, args.ncols_max,
+                mmq_x_max, mmq_x_lim, mmq_x_best, mmq_y, args.use_stream_k);
+            ggml_cuda_mmq_shape_format(buf, sizeof(buf), shape);
+            fprintf(stderr, "[LLAMA_MOE_MMQ_SHAPE] %s\n", buf);
+        }
+    }
+
     switch (mmq_x_best) {
         case 8:
             launch_mul_mat_q<type, 8>(ctx, args, stream);
@@ -4341,4 +4371,3 @@ void ggml_cuda_op_mul_mat_q(
     const int64_t src1_padded_row_size, cudaStream_t stream);

 bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t n_experts);
-
diff --git a/tests/CMakeLists.txt b/tests/CMakeLists.txt
index 24592a279..0a5194c87 100644
--- a/tests/CMakeLists.txt
+++ b/tests/CMakeLists.txt
@@ -234,6 +234,8 @@ llama_build_and_test(test-thread-safety.cpp ARGS -m "${MODEL_DEST}" -ngl 99 -p "
 set_tests_properties(test-thread-safety PROPERTIES FIXTURES_REQUIRED test-download-model)

 llama_build_and_test(test-arg-parser.cpp)
+llama_build_and_test(test-cuda-mmq-shape-trace.cpp)
+target_include_directories(test-cuda-mmq-shape-trace PRIVATE ${PROJECT_SOURCE_DIR}/ggml/src)

 if (NOT LLAMA_SANITIZE_ADDRESS AND NOT GGML_SCHED_NO_REALLOC)
     # TODO: repair known memory leaks
diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp
new file mode 100644
index 000000000..8620169c0
--- /dev/null
+++ b/tests/test-cuda-mmq-shape-trace.cpp
@@ -0,0 +1,42 @@
+#include "ggml-cuda/mmq-shape-trace.h"
+
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+
+static void require(bool ok, const char * what) {
+    if (!ok) {
+        std::fprintf(stderr, "require failed: %s\n", what);
+        std::exit(1);
+    }
+}
+
+int main() {
+    const ggml_cuda_mmq_shape shape = ggml_cuda_mmq_shape_make(
+        /* type */ 39,
+        /* is_moe */ true,
+        /* ncols_dst */ 1024,
+        /* nchannels_x */ 256,
+        /* ncols_max */ 128,
+        /* mmq_x_max */ 128,
+        /* mmq_x_lim */ 64,
+        /* mmq_x_best */ 64,
+        /* mmq_y */ 128,
+        /* use_stream_k */ true);
+
+    require(shape.n_active_est == 256, "active expert estimate is capped by expert count");
+    require(shape.density == 4, "density is ceil(assignments / active experts)");
+
+    char buf[256];
+    const int n = ggml_cuda_mmq_shape_format(buf, sizeof(buf), shape);
+
+    require(n > 0, "format returns byte count");
+    require(std::strstr(buf, "moe=1") != nullptr, "trace includes moe flag");
+    require(std::strstr(buf, "ncols_dst=1024") != nullptr, "trace includes routed assignment count");
+    require(std::strstr(buf, "n_active_est=256") != nullptr, "trace includes active estimate");
+    require(std::strstr(buf, "density=4") != nullptr, "trace includes density");
+    require(std::strstr(buf, "mmq_x_best=64") != nullptr, "trace includes selected tile");
+    require(std::strstr(buf, "stream_k=1") != nullptr, "trace includes stream-k flag");
+
+    return 0;
+}
64
docs/superpowers/plans/2026-07-01-mmq-shape-trace-phase29.md
Normal file
@@ -0,0 +1,64 @@
# MMQ Shape Trace Phase 29 Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use
> superpowers:test-driven-development for source changes and
> superpowers:verification-before-completion before claiming the phase is green.
> Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Add a default-off, md5-safe MoE grouped-MMQ shape trace so the next
structural grouped-MMQ kernel can be sized from live serving evidence.

**Architecture:** Host-side instrumentation only. The trace records selector
inputs and estimated density at `mul_mat_q_case`, without reading device
`expert_bounds` or adding synchronization.

**Tech Stack:** llama.cpp CUDA backend, local host-only unit test, DGX CUDA
build, `paged-inference-gates.sh`.

---

## Checklist

- [x] **Step 1: Write the RED test**
  - Added `tests/test-cuda-mmq-shape-trace.cpp`.
  - First build failed on missing `ggml-cuda/mmq-shape-trace.h`, proving the
    test covered the new API before implementation.

- [x] **Step 2: Implement the minimal helper**
  - Added `ggml/src/ggml-cuda/mmq-shape-trace.h`.
  - Helper computes `n_active_est`, `density`, and formats a stable trace line.

- [x] **Step 3: Wire default-off instrumentation**
  - Added `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` in `mmq.cuh`.
  - Trace is capped by the env value; nonnumeric truthy values default to 256.
  - Env unset or `0` stays silent.

- [x] **Step 4: Verify local GREEN**
  - `cmake --build build --target test-cuda-mmq-shape-trace -j 4`
  - `./build/bin/test-cuda-mmq-shape-trace`

- [x] **Step 5: Verify DGX CUDA build**
  - Artifact: `/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`
  - `cmake --build build-cuda --target llama-completion test-backend-ops test-cuda-mmq-shape-trace`

- [x] **Step 6: Run default-off inference gates**
  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
  - `MUL_MAT_ID`: `806/806`

- [x] **Step 7: Run trace-enabled inference gates**
  - `EXTRA_ENV=LLAMA_MOE_MMQ_SHAPE_TRACE=4`
  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
  - `MUL_MAT_ID`: `806/806`
  - Trace lines: `4`

- [x] **Step 8: Mirror into LocalAI**
  - Fork commit: `20a99518a feat(cuda): trace moe mmq batch shapes`
  - LocalAI patch: `0056-feat-cuda-trace-moe-mmq-batch-shapes.patch`
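
The env semantics from Step 3 can be restated as a pure function, which makes the edge cases easy to check without touching the process environment. This is a sketch only: `moe_mmq_shape_trace_limit_from` is a hypothetical name, and `value` stands in for the result of `getenv("LLAMA_MOE_MMQ_SHAPE_TRACE")`; the patch itself caches the parsed limit in a function-local static.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical pure-function restatement of the limit parsing in patch 0056.
// `value` is what getenv("LLAMA_MOE_MMQ_SHAPE_TRACE") would return.
static int moe_mmq_shape_trace_limit_from(const char * value) {
    if (value == nullptr || std::strcmp(value, "0") == 0) {
        return 0; // unset or "0": the trace stays silent
    }
    const int parsed = std::atoi(value);
    // A positive number is the cap; anything else that is set
    // (non-numeric, empty, or non-positive) falls back to 256 lines.
    return parsed > 0 ? parsed : 256;
}
```

Under this logic `4` caps the trace at four lines, while values such as `yes` or `-3` enable it with the default cap of 256.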

## Result

Phase 29 is instrumentation-only. It does not claim a speed win, but it gives a
bounded and gate-safe way to collect grouped-MMQ selector shape evidence for the
next structural kernel phase.