mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
feat(paged): add moe mmid route trace patch
Add LocalAI patch 0060 from the llama.cpp fork and record Phase 34 gates, serving route counts, and the updated patch mirror invariant. Assisted-by: Codex:gpt-5
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.

---

-## 3. Patch series (0001-0059)
+## 3. Patch series (0001-0060)

Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
@@ -217,6 +217,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
| 0057 | **Trace MoE MMQ launch shapes** - extends `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` with bounded `[LLAMA_MOE_MMQ_LAUNCH]` lines from `launch_mul_mat_q`, recording actual `ntiles_dst`, `stream_k_blocks`, tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. This is evidence-only instrumentation to distinguish real stream-k/fixup overhead from small-M kernel-shape cost. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 31 n128 trace showed decode and prefill `fixup=0`, `stream_k_blocks == ntiles_dst`) |
| 0058 | **Trace MoE small-M MMQ candidates** - adds `LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` and a host-only classifier for decode-like low-density grouped-MMQ shapes (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). It only counts candidate calls for the next structural tile-policy A/B; no numeric branch is added. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 32 n128 trace found 4096 candidates, mostly `mmq_x_best=64/48`) |
| 0059 | **Gate MoE small-M MMQ tile policy** - adds default-off `LLAMA_MOE_SMALL_M_TILE=<n>` to cap only classified small-M MoE grouped-MMQ calls. This was used to A/B vLLM-like smaller M blocks without changing default inference. | yes (default-off, tile16, tile8, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 33 rejected tile16 and tile8 as slower) |
| 0060 | **Trace MoE MMID dispatch routes** - adds default-off `LLAMA_MOE_MMID_ROUTE_TRACE=<n>` around `MUL_MAT_ID` dispatch, classifying each call as `mmvq`, `mmvf`, grouped `mmq`, `mmf`, or host-sync `fallback`. This is evidence-only instrumentation to resolve whether serving hits the per-expert host-sync fallback. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 34 n128 trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`) |

> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
@@ -673,3 +674,12 @@ Phase 33 added default-off `LLAMA_MOE_SMALL_M_TILE=<n>` as patch `0059`
md5/op safe, but both tested values were slower in same-session n128 serving:
baseline `672.1` decode_agg_tps, tile16 `640.3` (`0.953x`), tile8 `583.2`
(`0.868x`). Do not promote simple smaller `mmq_x` caps for this workload.

Phase 34 added default-off `LLAMA_MOE_MMID_ROUTE_TRACE=<n>` as patch `0060`
(`/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`). Default-off,
trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense
`5951a5b4`, `MUL_MAT_ID 806/806`. Live n128 serving with trace cap 4096 produced
`mmq=2776`, `mmvq=1320`, and `host_sync=0/4096`; the top shapes were
`mmq ne2=12` (1096), `mmq ne2=18` (480), and `mmvq ne2=8` (360). This refutes
host-sync fallback as the current n128 `MUL_MAT_ID` problem; follow-up work should
target grouped-MMQ small-M kernel partitioning or another measured bucket.
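
Route counts like the ones above can be recovered from a captured stderr log. The following is a minimal sketch, not the tooling used for Phase 34: it assumes only the line format emitted by patch `0060` (`[LLAMA_MOE_MMID_ROUTE] route=... host_sync=... ne2=...`), and the names `route_tally`, `field_value`, and `tally_routes` are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Aggregate counts recovered from a log capture (hypothetical helper type).
struct route_tally {
    std::map<std::string, long> by_route; // route name -> number of traced calls
    long traced    = 0;                   // lines carrying the trace tag
    long host_sync = 0;                   // traced lines reporting host_sync=1
};

// Returns the value of a space-delimited `key=value` field, or "" if absent.
static std::string field_value(const std::string & body, const std::string & key) {
    const std::string needle = key + "=";
    std::size_t pos = 0;
    while ((pos = body.find(needle, pos)) != std::string::npos) {
        // Require a field boundary so "mmq=" never matches inside "use_mmq=".
        if (pos == 0 || body[pos - 1] == ' ') {
            const std::size_t begin = pos + needle.size();
            const std::size_t end   = body.find(' ', begin);
            return body.substr(begin, end == std::string::npos ? std::string::npos : end - begin);
        }
        pos += needle.size();
    }
    return "";
}

// Scans a log stream and tallies every line that carries the route-trace tag.
static route_tally tally_routes(std::istream & in) {
    const std::string tag = "[LLAMA_MOE_MMID_ROUTE] ";
    route_tally tally;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t pos = line.find(tag);
        if (pos == std::string::npos) {
            continue; // unrelated server log line
        }
        const std::string body  = line.substr(pos + tag.size());
        const std::string route = field_value(body, "route");
        if (route.empty()) {
            continue; // tagged but malformed; skip rather than guess
        }
        tally.traced++;
        tally.by_route[route]++;
        if (field_value(body, "host_sync") == "1") {
            tally.host_sync++;
        }
    }
    return tally;
}
```

Feeding the function a `std::ifstream` over the server's stderr capture (or a `std::istringstream` in a test) yields the per-route totals; grouping by `route` plus `ne2` instead of `route` alone would give the "top shapes" view.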
@@ -1990,3 +1990,51 @@ Decision:
  inference-safe but slower.
- A future grouped-MMQ kernel must change the work shape more deeply than the
  host-side tile cap, or pivot to a different bucket.

## Phase 34 MoE MMID Dispatch Route Trace

Phase 34 added patch `0060`, a default-off `LLAMA_MOE_MMID_ROUTE_TRACE=<n>`
diagnostic around `MUL_MAT_ID` dispatch. It does not alter routing; it logs the
existing route decision as `mmvq`, `mmvf`, grouped `mmq`, `mmf`, or host-sync
`fallback`.

Artifact:

- `/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`

Run:

- Fork commit: `/home/mudler/_git/llama.cpp` `6c332094c`
- DGX mirror commit: `dgx:~/llama-phase6-source` `34a256d14`
- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMID_ROUTE_TRACE=4096`
- Workload: staggered n128 `llama-server`, `GEN=64`

Route summary:

| metric | value |
|--------|-------|
| traced `MUL_MAT_ID` calls | 4096 |
| grouped MMQ | 2776 |
| MMVQ | 1320 |
| host-sync fallback | 0 |
| top shapes | `mmq ne2=12`: 1096, `mmq ne2=18`: 480, `mmvq ne2=8`: 360 |

Gates:

| check | status | actual |
|-------|--------|--------|
| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| `MUL_MAT_ID` | ok | `806/806` in all three gate runs |

Decision:

- The current n128 serving path is not hitting the host-sync fallback in traced
  `MUL_MAT_ID` calls. The route is graph-safe MMVQ for very small widths and
  grouped MMQ above that.
- Do not scope the next parity phase around avoiding fallback dispatch. Scope it
  around grouped-MMQ small-M kernel partitioning or another measured bucket.
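
The route precedence behind this decision can be sketched in isolation. This is a simplified, hypothetical restatement of the host-only classifier that patch `0060` adds (`ggml_cuda_mmid_route_shape_make`), not the fork's code: `predicates`, `classify`, and `needs_host_sync` are illustrative names, and the real predicates (`use_mmq`, `use_mmf`, the MMVQ cap) come from the existing CUDA dispatch helpers.

```cpp
#include <cstdint>

// Route labels matching the trace output (illustrative enum, not the fork's).
enum class route { mmvq, mmvf, mmq, mmf, fallback };

// Inputs the classifier needs; in the fork these come from the dispatch code.
struct predicates {
    int64_t ne2;          // destination batch width of the MUL_MAT_ID call
    int     mmvq_max;     // per-type/arch cap for the vector kernels
    bool    is_quantized; // src0 is a quantized type
    bool    is_amd;       // float vector route is only taken on AMD
    bool    use_mmq;      // grouped MMQ predicate
    bool    use_mmf;      // MMF predicate
};

// Vector kernels win inside the cap, then grouped MMQ, then MMF;
// anything left over is the per-expert host-sync fallback.
static route classify(const predicates & p) {
    if (p.ne2 <= p.mmvq_max) {
        if (p.is_quantized) {
            return route::mmvq;
        }
        if (p.is_amd) {
            return route::mmvf;
        }
    }
    if (p.use_mmq) {
        return route::mmq;
    }
    if (p.use_mmf) {
        return route::mmf;
    }
    return route::fallback;
}

// Only the fallback route forces host synchronization.
static bool needs_host_sync(const route r) {
    return r == route::fallback;
}
```

Read against the trace, a quantized call inside the cap lands on `mmvq` and a wider one on `mmq` whenever the MMQ predicate holds, which is consistent with `host_sync=0` across the traced calls.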
@@ -442,6 +442,18 @@ MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
serving rejected both caps: baseline `672.1` decode_agg_tps, tile16 `640.3`
(`0.953x`), tile8 `583.2` (`0.868x`). Do not promote smaller `mmq_x` caps.

Phase 34 added patch `0060`, default-off `LLAMA_MOE_MMID_ROUTE_TRACE=<n>`, to
classify the live `MUL_MAT_ID` dispatch route without changing route behavior.
Artifact: `/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`. Fork
commit: `6c332094c feat(cuda): trace moe mmid routes`; DGX mirror commit:
`34a256d14`. Default-off, trace-enabled, and post-serving gates stayed green:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. Live n128 serving
with trace cap 4096 found `mmq=2776`, `mmvq=1320`, and `host_sync=0/4096`.
Treat the old current-stack host-sync-fallback concern as refuted for this
workload; the remaining MoE work is grouped-MMQ small-M efficiency or another
measured bucket.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -491,15 +503,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
## 7. KEY FILE / ARTIFACT INDEX

### Fork (canonical source of truth)
-- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `fbed2abaa9f5af8e500f95c8dda86b305450ceff` ("gate moe small-m mmq tile policy", patch `0059`).
-- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `dfd1eaea8` with the Phase 33 small-M tile-policy patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
+- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `6c332094ca2fbb1e3211427c5f919adcaa89c588` ("trace moe mmid routes", patch `0060`).
+- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `34a256d14` with the Phase 34 MMID route-trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
- Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.

### LocalAI worktree
- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
-- `patches/paged/`: **50** `.patch` files spanning 0001-0059 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy instrumentation is 0056-0059.
+- `patches/paged/`: **51** `.patch` files spanning 0001-0060 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060.

### Bench artifacts (DGX)
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
@@ -515,6 +527,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase31_mmq_launch_trace/20260701_064424` - default-off MoE MMQ launch trace patch `0057`; default/trace/post-serving md5 gates green; n128 launch trace rejects stream-k/fixup shortcut (`fixup=0`, `stream_k_blocks == ntiles_dst`).
- `~/bench/phase32_small_m_classifier/20260701_070127` - default-off MoE MMQ small-M classifier patch `0058`; default/trace/post-serving md5 gates green; n128 trace found 4096 candidate calls.
- `~/bench/phase33_small_m_tile_policy/20260701_071136` - default-off MoE MMQ small-M tile policy patch `0059`; tile16/tile8 md5/op safe but both slower in n128 serving.
- `~/bench/phase34_mmid_route_trace/20260701_072737` - default-off MoE MMID route trace patch `0060`; default/trace/post-serving md5 gates green; n128 route trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -527,8 +540,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
-2. **Current fork/mirror are clean and verified.** Local fork HEAD is `fbed2abaa`, DGX clean mirror HEAD is `dfd1eaea8`, and Phase 33 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
-3. **Worktree patch series is tracked through 0059.** The only expected unrelated untracked path in this worktree is `.claude/`.
+2. **Current fork/mirror are clean and verified.** Local fork HEAD is `6c332094c`, DGX clean mirror HEAD is `34a256d14`, and Phase 34 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
+3. **Worktree patch series is tracked through 0060.** The only expected unrelated untracked path in this worktree is `.claude/`.
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.

## Latest mirror check

-Phase 33 re-verified the mirror invariant after adding patch `0059`:
+Phase 34 re-verified the mirror invariant after adding patch `0060`:

```text
base=0ed235ea2c17a19fc8238668653946721ed136fd
-applied_tree=4dc5498ac86b100eddf777c4e7f4c4d11f59415d
-fork_tree=4dc5498ac86b100eddf777c4e7f4c4d11f59415d
+applied_tree=433720590dfafbde8cc5b23a80e13f88349ff90f
+fork_tree=433720590dfafbde8cc5b23a80e13f88349ff90f
```

The check used a fresh worktree at `LLAMA_VERSION`, applied every
`patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
`git write-tree` to canonical fork branch `localai-paged` at
-`fbed2abaa feat(cuda): gate moe small-m mmq tile policy`.
+`6c332094c feat(cuda): trace moe mmid routes`.

## Status
@@ -873,6 +873,31 @@ remaining grouped-MMQ gap is not solved by emulating Marlin's small `block_size_
with the current MMQ kernel; a future attempt must alter the kernel's internal
work partitioning or move to a different bottleneck.

### Phase 34 MMID route trace

Phase 34 added patch `0060`, default-off `LLAMA_MOE_MMID_ROUTE_TRACE=<n>`, to
classify the live `MUL_MAT_ID` dispatch route without changing the route. Artifact:
`/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`.

Default-off, trace-enabled, and post-serving gates were all bit-exact: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.

Live n128 serving with `LLAMA_MOE_MMID_ROUTE_TRACE=4096` produced:

| route | count | host sync |
|-------|-------|-----------|
| grouped `mmq` | 2776 | 0 |
| `mmvq` | 1320 | 0 |
| `mmf` | 0 | 0 |
| fallback | 0 | 0 |

Top route shapes were `mmq ne2=12` (1096), `mmq ne2=18` (480), and
`mmvq ne2=8` (360). Lever implication: the old D1 concern that current n128
serving might fall into the per-expert host-sync fallback is refuted for this
stack. The remaining MoE route issue is grouped-MMQ small-M efficiency, not
fallback dispatch avoidance.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
@@ -0,0 +1,292 @@
From 6c332094ca2fbb1e3211427c5f919adcaa89c588 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 1 Jul 2026 05:32:27 +0000
Subject: [PATCH] feat(cuda): trace moe mmid routes

Add a default-off LLAMA_MOE_MMID_ROUTE_TRACE diagnostic for MUL_MAT_ID dispatch routes.

The trace reports whether a call uses MMVQ, MMVF, grouped MMQ, MMF, or host-sync fallback while preserving the existing dispatch predicates.

Assisted-by: Codex:gpt-5
---
 ggml/src/ggml-cuda/ggml-cuda.cu      | 34 +++++++++--
 ggml/src/ggml-cuda/mmq-shape-trace.h | 88 ++++++++++++++++++++++++++++
 tests/test-cuda-mmq-shape-trace.cpp  | 82 ++++++++++++++++++++++++++
 3 files changed, 200 insertions(+), 4 deletions(-)

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 374949f25..a1754df39 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2685,6 +2685,15 @@ static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor
     }
 }
 
+static inline int ggml_cuda_moe_mmid_route_trace_limit() {
+    static const int value = []() {
+        const char * s = getenv("LLAMA_MOE_MMID_ROUTE_TRACE");
+        return s ? atoi(s) : 0;
+    }();
+
+    return value;
+}
+
 static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
     const ggml_tensor * src0 = dst->src[0];
     const ggml_tensor * src1 = dst->src[1];
@@ -2697,13 +2706,30 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
     GGML_TENSOR_BINARY_OP_LOCALS
 
     const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+    const bool src0_quantized = ggml_is_quantized(src0->type);
+    const int mmvq_mmid_max = src0_quantized ? get_mmvq_mmid_max_batch(src0->type, cc) : MMVQ_MAX_BATCH_SIZE;
+    const bool use_mmq = ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02);
+    const bool use_mmf = ggml_cuda_should_use_mmf(src0->type, cc, WARP_SIZE, src0->ne, src0->nb, src1->ne[2], /*mul_mat_id=*/true);
+
+    const int mmid_trace_limit = ggml_cuda_moe_mmid_route_trace_limit();
+    if (mmid_trace_limit > 0) {
+        static std::atomic<int> trace_count{0};
+        const int trace_idx = trace_count.fetch_add(1, std::memory_order_relaxed);
+        if (trace_idx < mmid_trace_limit) {
+            const ggml_cuda_mmid_route_shape route_shape = ggml_cuda_mmid_route_shape_make(
+                src0->type, ne2, ne12, /*n_experts=*/ne02, mmvq_mmid_max, use_mmq, use_mmf,
+                GGML_CUDA_CC_IS_AMD(cc), src0_quantized);
+            char buf[256];
+            ggml_cuda_mmid_route_shape_format(buf, sizeof(buf), route_shape);
+            fprintf(stderr, "[LLAMA_MOE_MMID_ROUTE] %s\n", buf);
+        }
+    }
 
     // [TAG_MUL_MAT_ID_CUDA_GRAPHS]
     if (src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32) {
         static_assert(MMVQ_MAX_BATCH_SIZE == MMVF_MAX_BATCH_SIZE);
         if (ne2 <= MMVQ_MAX_BATCH_SIZE) {
-            if (ggml_is_quantized(src0->type)) {
-                const int mmvq_mmid_max = get_mmvq_mmid_max_batch(src0->type, cc);
+            if (src0_quantized) {
                 if (ne2 <= mmvq_mmid_max) {
                     ggml_cuda_mul_mat_vec_q(ctx, src0, src1, ids, dst);
                     return;
@@ -2716,12 +2742,12 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
         }
     }
 
-    if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) {
+    if (use_mmq) {
         ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
         return;
     }
 
-    if (ggml_cuda_should_use_mmf(src0->type, cc, WARP_SIZE, src0->ne, src0->nb, src1->ne[2], /*mul_mat_id=*/true)) {
+    if (use_mmf) {
         ggml_cuda_mul_mat_f(ctx, src0, src1, ids, dst);
         return;
     }
diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h
index dfb4e898a..da234a302 100644
--- a/ggml/src/ggml-cuda/mmq-shape-trace.h
+++ b/ggml/src/ggml-cuda/mmq-shape-trace.h
@@ -49,6 +49,40 @@ struct ggml_cuda_mmq_small_m_shape {
     bool is_candidate;
 };
 
+enum ggml_cuda_mmid_route {
+    GGML_CUDA_MMID_ROUTE_MMVQ,
+    GGML_CUDA_MMID_ROUTE_MMVF,
+    GGML_CUDA_MMID_ROUTE_MMQ,
+    GGML_CUDA_MMID_ROUTE_MMF,
+    GGML_CUDA_MMID_ROUTE_FALLBACK,
+};
+
+struct ggml_cuda_mmid_route_shape {
+    ggml_cuda_mmid_route route;
+    int type;
+    int64_t ne2;
+    int64_t ne12;
+    int64_t n_experts;
+    int mmvq_max;
+    bool use_mmq;
+    bool use_mmf;
+    bool is_amd;
+    bool is_quantized;
+    bool host_sync;
+};
+
+static inline const char * ggml_cuda_mmid_route_name(const ggml_cuda_mmid_route route) {
+    switch (route) {
+        case GGML_CUDA_MMID_ROUTE_MMVQ: return "mmvq";
+        case GGML_CUDA_MMID_ROUTE_MMVF: return "mmvf";
+        case GGML_CUDA_MMID_ROUTE_MMQ: return "mmq";
+        case GGML_CUDA_MMID_ROUTE_MMF: return "mmf";
+        case GGML_CUDA_MMID_ROUTE_FALLBACK: return "fallback";
+    }
+
+    return "unknown";
+}
+
 static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
         const int type, const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x,
         const int64_t ncols_max, const int mmq_x_max, const int mmq_x_lim, const int mmq_x_best,
@@ -76,6 +110,42 @@ static inline ggml_cuda_mmq_shape ggml_cuda_mmq_shape_make(
     };
 }
 
+static inline ggml_cuda_mmid_route_shape ggml_cuda_mmid_route_shape_make(
+        const int type, const int64_t ne2, const int64_t ne12, const int64_t n_experts,
+        const int mmvq_max, const bool use_mmq, const bool use_mmf, const bool is_amd,
+        const bool is_quantized) {
+    ggml_cuda_mmid_route route = GGML_CUDA_MMID_ROUTE_FALLBACK;
+    if (ne2 <= mmvq_max) {
+        if (is_quantized) {
+            route = GGML_CUDA_MMID_ROUTE_MMVQ;
+        } else if (is_amd) {
+            route = GGML_CUDA_MMID_ROUTE_MMVF;
+        }
+    }
+
+    if (route == GGML_CUDA_MMID_ROUTE_FALLBACK) {
+        if (use_mmq) {
+            route = GGML_CUDA_MMID_ROUTE_MMQ;
+        } else if (use_mmf) {
+            route = GGML_CUDA_MMID_ROUTE_MMF;
+        }
+    }
+
+    return {
+        route,
+        type,
+        ne2,
+        ne12,
+        n_experts,
+        mmvq_max,
+        use_mmq,
+        use_mmf,
+        is_amd,
+        is_quantized,
+        route == GGML_CUDA_MMID_ROUTE_FALLBACK,
+    };
+}
+
 static inline ggml_cuda_mmq_small_m_shape ggml_cuda_mmq_small_m_shape_make(
         const bool is_moe, const int64_t ncols_dst, const int64_t nchannels_x,
         const int64_t ncols_max, const int mmq_x_best, const bool use_stream_k) {
@@ -172,6 +242,24 @@ static inline int ggml_cuda_mmq_launch_shape_format(
         shape.fixup_needed ? 1 : 0);
 }
 
+static inline int ggml_cuda_mmid_route_shape_format(
+        char * buf, const size_t size, const ggml_cuda_mmid_route_shape & shape) {
+    return std::snprintf(buf, size,
+        "route=%s type=%d host_sync=%d ne2=%lld ne12=%lld n_experts=%lld "
+        "mmvq_max=%d use_mmq=%d use_mmf=%d is_amd=%d is_quantized=%d",
+        ggml_cuda_mmid_route_name(shape.route),
+        shape.type,
+        shape.host_sync ? 1 : 0,
+        (long long) shape.ne2,
+        (long long) shape.ne12,
+        (long long) shape.n_experts,
+        shape.mmvq_max,
+        shape.use_mmq ? 1 : 0,
+        shape.use_mmf ? 1 : 0,
+        shape.is_amd ? 1 : 0,
+        shape.is_quantized ? 1 : 0);
+}
+
 static inline int ggml_cuda_mmq_small_m_shape_format(
         char * buf, const size_t size, const ggml_cuda_mmq_small_m_shape & shape) {
     return std::snprintf(buf, size,
diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp
index f7863f03a..e190cf1ac 100644
--- a/tests/test-cuda-mmq-shape-trace.cpp
+++ b/tests/test-cuda-mmq-shape-trace.cpp
@@ -118,5 +118,87 @@ int main() {
             ggml_cuda_mmq_small_m_shape_make(/* is_moe */ true, 4096, 256, 512, 128, true), 128, 16) == 128,
         "small-M tile override excludes prefill-like shapes");
 
+    const ggml_cuda_mmid_route_shape mmvq = ggml_cuda_mmid_route_shape_make(
+        /* type */ 39,
+        /* ne2 */ 1,
+        /* ne12 */ 1,
+        /* ne02 */ 256,
+        /* mmvq_max */ 4,
+        /* use_mmq */ true,
+        /* use_mmf */ true,
+        /* is_amd */ false,
+        /* is_quantized */ true);
+
+    require(mmvq.route == GGML_CUDA_MMID_ROUTE_MMVQ, "MMVQ wins when within its batch cap");
+    require(!mmvq.host_sync, "MMVQ route is graph-safe");
+
+    const ggml_cuda_mmid_route_shape mmq = ggml_cuda_mmid_route_shape_make(
+        /* type */ 39,
+        /* ne2 */ 128,
+        /* ne12 */ 128,
+        /* ne02 */ 256,
+        /* mmvq_max */ 4,
+        /* use_mmq */ true,
+        /* use_mmf */ true,
+        /* is_amd */ false,
+        /* is_quantized */ true);
+
+    require(mmq.route == GGML_CUDA_MMID_ROUTE_MMQ, "grouped MMQ wins after MMVQ cap");
+    require(!mmq.host_sync, "grouped MMQ route is graph-safe");
+
+    const ggml_cuda_mmid_route_shape mmf = ggml_cuda_mmid_route_shape_make(
+        /* type */ 1,
+        /* ne2 */ 128,
+        /* ne12 */ 128,
+        /* ne02 */ 256,
+        /* mmvq_max */ 4,
+        /* use_mmq */ false,
+        /* use_mmf */ true,
+        /* is_amd */ false,
+        /* is_quantized */ false);
+
+    require(mmf.route == GGML_CUDA_MMID_ROUTE_MMF, "MMF wins when grouped MMQ is unavailable");
+    require(!mmf.host_sync, "MMF route is graph-safe");
+
+    const ggml_cuda_mmid_route_shape fallback = ggml_cuda_mmid_route_shape_make(
+        /* type */ 0,
+        /* ne2 */ 128,
+        /* ne12 */ 128,
+        /* ne02 */ 256,
+        /* mmvq_max */ 4,
+        /* use_mmq */ false,
+        /* use_mmf */ false,
+        /* is_amd */ false,
+        /* is_quantized */ false);
+
+    require(fallback.route == GGML_CUDA_MMID_ROUTE_FALLBACK, "fallback is used when device routes do not match");
+    require(fallback.host_sync, "fallback route requires host synchronization");
+
+    const ggml_cuda_mmid_route_shape amd_mmvf = ggml_cuda_mmid_route_shape_make(
+        /* type */ 1,
+        /* ne2 */ 1,
+        /* ne12 */ 1,
+        /* ne02 */ 256,
+        /* mmvq_max */ 4,
+        /* use_mmq */ false,
+        /* use_mmf */ false,
+        /* is_amd */ true,
+        /* is_quantized */ false);
+
+    require(amd_mmvf.route == GGML_CUDA_MMID_ROUTE_MMVF, "AMD float vector route wins within MMVQ cap");
+    require(!amd_mmvf.host_sync, "AMD float vector route is graph-safe");
+
+    const int route_n = ggml_cuda_mmid_route_shape_format(buf, sizeof(buf), mmq);
+
+    require(route_n > 0, "MMID route format returns byte count");
+    require(std::strstr(buf, "route=mmq") != nullptr, "MMID trace includes route name");
+    require(std::strstr(buf, "host_sync=0") != nullptr, "MMID trace includes host sync flag");
+    require(std::strstr(buf, "ne2=128") != nullptr, "MMID trace includes destination batch");
+    require(std::strstr(buf, "ne12=128") != nullptr, "MMID trace includes routed token count");
+    require(std::strstr(buf, "n_experts=256") != nullptr, "MMID trace includes expert count");
+    require(std::strstr(buf, "mmvq_max=4") != nullptr, "MMID trace includes MMVQ cap");
+    require(std::strstr(buf, "use_mmq=1") != nullptr, "MMID trace includes MMQ predicate");
+    require(std::strstr(buf, "use_mmf=1") != nullptr, "MMID trace includes MMF predicate");
+
     return 0;
 }
--
2.43.0
@@ -0,0 +1,49 @@
# Phase 34: MMID Route Trace

**Goal:** Add a default-off `MUL_MAT_ID` route classifier so serving traces can prove whether current n128 MoE inference uses graph-safe MMVQ/grouped-MMQ paths or the host-sync fallback.

**Scope:** llama.cpp fork first, then LocalAI patch `0060`. Instrumentation only; no route or numeric behavior change.

## Plan

- [x] Inspect the current `ggml_cuda_mul_mat_id` dispatch order.
- [x] Add a failing host-only test for route classification and trace formatting.
- [x] Implement `ggml_cuda_mmid_route_shape_make()` and formatter in the existing CUDA trace helper.
- [x] Wire `LLAMA_MOE_MMID_ROUTE_TRACE=<n>` in `ggml_cuda_mul_mat_id` using the same predicates as dispatch.
- [x] Build and run `test-cuda-mmq-shape-trace` locally.
- [x] Build `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace` on DGX.
- [x] Run default-off and trace-enabled md5/op gates.
- [x] Run n128 serving trace and parse route counts.
- [x] Run post-serving md5/op gates.
- [x] Commit fork and DGX mirror, export LocalAI patch `0060`.
- [x] Update README, parity docs, handoff, and patch maintenance.
- [x] Re-run strict patch-series mirror invariant.

## Results

Artifact: `/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`.

Commits:

- Fork: `6c332094c feat(cuda): trace moe mmid routes`
- DGX mirror: `34a256d14 feat(cuda): trace moe mmid routes`
- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0060-feat-cuda-trace-moe-mmid-routes.patch`

Gates:

- Default-off MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Default-off dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- Trace-enabled MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Trace-enabled dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- Post-serving MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
- Post-serving dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
- `MUL_MAT_ID`: `806/806` in default, trace-enabled, and post-serving gates

n128 route trace:

- `mmq`: 2776
- `mmvq`: 1320
- `host_sync=0`: 4096
- Top shapes: `mmq ne2=12` 1096, `mmq ne2=18` 480, `mmvq ne2=8` 360

Decision: host-sync fallback is not firing in the current n128 serving path. The next phase should not chase fallback avoidance; it should either target grouped-MMQ small-M internal partitioning or pivot to the next measured bottleneck.
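
The `=<n>` cap wired in the plan bounds the trace to the first `n` dispatches the process makes, so the counts above describe those calls only. A minimal sketch of that pattern follows, assuming nothing beyond the C++ standard library; `trace_limit_from_env` and `should_trace` are hypothetical names, and the fork's own helper is `ggml_cuda_moe_mmid_route_trace_limit()`.

```cpp
#include <atomic>
#include <cstdlib>

// Reads the cap from the environment; unset or non-numeric values yield 0 (tracing off).
static int trace_limit_from_env(const char * name) {
    const char * s = std::getenv(name);
    return s ? std::atoi(s) : 0;
}

// Returns true for the first `limit` calls that share `counter`, false afterwards.
// With limit <= 0 the counter is never touched, so the default-off path does no atomic work.
static bool should_trace(std::atomic<int> & counter, const int limit) {
    if (limit <= 0) {
        return false;
    }
    return counter.fetch_add(1, std::memory_order_relaxed) < limit;
}
```

A call site would keep one static counter and a cached limit, then emit its `fprintf(stderr, ...)` line only when `should_trace` returns true; relaxed ordering is enough here because the counter only rations log lines and orders no other memory.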