mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
feat(paged): add moe small-m mmq tile policy gate
Assisted-by: Codex:gpt-5
This commit is contained in:
@@ -87,7 +87,7 @@ orthogonal to the paged allocator.
|
||||
|
||||
---
|
||||
|
||||
## 3. Patch series (0001-0058)
|
||||
## 3. Patch series (0001-0059)
|
||||
|
||||
Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The
|
||||
decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 /
|
||||
@@ -216,6 +216,7 @@ These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact
|
||||
| 0056 | **Trace MoE MMQ batch shapes** - adds default-off `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` logs from the grouped-MMQ host selector, reporting routed assignment count, estimated active experts, density, selected `mmq_x`, `mmq_y`, and stream-k. This is evidence-only instrumentation for sizing structural grouped-MMQ work after Phase 28 rejected launch-bounds/row-tile knobs. | yes (env unset and trace-enabled gates both green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; trace cap verified with 4 lines) |
|
||||
| 0057 | **Trace MoE MMQ launch shapes** - extends `LLAMA_MOE_MMQ_SHAPE_TRACE=<n>` with bounded `[LLAMA_MOE_MMQ_LAUNCH]` lines from `launch_mul_mat_q`, recording actual `ntiles_dst`, `stream_k_blocks`, tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. This is evidence-only instrumentation to distinguish real stream-k/fixup overhead from small-M kernel-shape cost. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 31 n128 trace showed decode and prefill `fixup=0`, `stream_k_blocks == ntiles_dst`) |
|
||||
| 0058 | **Trace MoE small-M MMQ candidates** - adds `LLAMA_MOE_MMQ_SMALL_M_TRACE=<n>` and a host-only classifier for decode-like low-density grouped-MMQ shapes (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). It only counts candidate calls for the next structural tile-policy A/B; no numeric branch is added. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 32 n128 trace found 4096 candidates, mostly `mmq_x_best=64/48`) |
|
||||
| 0059 | **Gate MoE small-M MMQ tile policy** - adds default-off `LLAMA_MOE_SMALL_M_TILE=<n>` to cap only classified small-M MoE grouped-MMQ calls. This was used to A/B vLLM-like smaller M blocks without changing default inference. | yes (default-off, tile16, tile8, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 33 rejected tile16 and tile8 as slower) |
|
||||
|
||||
> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once
|
||||
> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
|
||||
@@ -666,3 +667,9 @@ trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense
|
||||
candidate calls: `mmq_x_best=64` 1800, `48` 1096, `40` 360, `32` 360, `16`
|
||||
360, `24` 120. This justifies Phase 33 as a default-off tile-policy A/B
|
||||
(`mmq_x=16`, possibly `8`) rather than a broad kernel rewrite.
|
||||
|
||||
Phase 33 added default-off `LLAMA_MOE_SMALL_M_TILE=<n>` as patch `0059`
|
||||
(`/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`). The knob is
|
||||
md5/op safe, but both tested values were slower in same-session n128 serving:
|
||||
baseline `672.1` decode_agg_tps, tile16 `640.3` (`0.953x`), tile8 `583.2`
|
||||
(`0.868x`). Do not promote simple smaller `mmq_x` caps for this workload.
|
||||
|
||||
@@ -1955,3 +1955,38 @@ Decision:
|
||||
A/B in Phase 33.
|
||||
- Start with a small-M MoE-only `mmq_x=16` cap, and consider `8` only if it
|
||||
compiles and preserves the existing NVFP4 tile invariants.
|
||||
|
||||
## Phase 33 Small-M MoE MMQ Tile Policy A/B
|
||||
|
||||
Phase 33 added patch `0059`, default-off `LLAMA_MOE_SMALL_M_TILE=<n>`, to cap
|
||||
only the Phase 32 classified small-M MoE grouped-MMQ calls. This tested whether
|
||||
a vLLM-like smaller M block could improve n128 decode without rewriting the
|
||||
kernel.
|
||||
|
||||
Artifact:
|
||||
|
||||
- `/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`
|
||||
|
||||
Gates:
|
||||
|
||||
| mode | MoE md5 | dense md5 | `MUL_MAT_ID` |
|
||||
|------|---------|-----------|--------------|
|
||||
| default-off | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
|
||||
| `LLAMA_MOE_SMALL_M_TILE=16` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
|
||||
| `LLAMA_MOE_SMALL_M_TILE=8` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
|
||||
| post-serving | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` |
|
||||
|
||||
Same-session n128 serving:
|
||||
|
||||
| mode | decode_agg_tps | agg_tps | prefill_tps | ratio vs baseline |
|
||||
|------|----------------|---------|-------------|-------------------|
|
||||
| baseline | 672.1 | 339.5 | 1511.4 | 1.000x |
|
||||
| `LLAMA_MOE_SMALL_M_TILE=16` | 640.3 | 328.9 | 1522.2 | 0.953x |
|
||||
| `LLAMA_MOE_SMALL_M_TILE=8` | 583.2 | 307.4 | 1442.6 | 0.868x |
|
||||
|
||||
Decision:
|
||||
|
||||
- Reject simple smaller `mmq_x` caps for classified n128 small-M calls. They are
|
||||
inference-safe but slower.
|
||||
- A future grouped-MMQ kernel must change the work shape more deeply than the
|
||||
host-side tile cap, or pivot to a different bucket.
|
||||
|
||||
@@ -432,6 +432,16 @@ trace found 4096 candidate calls, mostly `mmq_x_best=64` (1800) and `48`
|
||||
(1096). Phase 33 should A/B a default-off small-M tile policy starting at
|
||||
`mmq_x=16`.
|
||||
|
||||
Phase 33 added patch `0059`, default-off `LLAMA_MOE_SMALL_M_TILE=<n>`, and
|
||||
rejected the simple smaller-tile policy. Artifact:
|
||||
`/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`. Fork commit:
|
||||
`fbed2abaa feat(cuda): gate moe small-m mmq tile policy`; DGX mirror commit:
|
||||
`dfd1eaea8`. Default-off, tile16, tile8, and post-serving gates stayed green:
|
||||
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
|
||||
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. Same-session n128
|
||||
serving rejected both caps: baseline `672.1` decode_agg_tps, tile16 `640.3`
|
||||
(`0.953x`), tile8 `583.2` (`0.868x`). Do not promote smaller `mmq_x` caps.
|
||||
|
||||
---
|
||||
|
||||
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
|
||||
@@ -481,15 +491,15 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
## 7. KEY FILE / ARTIFACT INDEX
|
||||
|
||||
### Fork (canonical source of truth)
|
||||
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `2a9964d290a543d14db972d8d2927ee9d2974f7e` ("trace moe small-m mmq candidates", patch `0058`).
|
||||
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `024f494d0` with the Phase 32 small-M classifier patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
|
||||
- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `fbed2abaa9f5af8e500f95c8dda86b305450ceff` ("gate moe small-m mmq tile policy", patch `0059`).
|
||||
- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `dfd1eaea8` with the Phase 33 small-M tile-policy patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes.
|
||||
- Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical.
|
||||
|
||||
### LocalAI worktree
|
||||
- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting).
|
||||
- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`).
|
||||
- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`.
|
||||
- `patches/paged/`: **49** `.patch` files spanning 0001-0058 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate traces are 0056-0058.
|
||||
- `patches/paged/`: **50** `.patch` files spanning 0001-0059 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy instrumentation is 0056-0059.
|
||||
|
||||
### Bench artifacts (DGX)
|
||||
- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run.
|
||||
@@ -504,6 +514,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
- `~/bench/phase30_mmq_shape_serving/20260701_043300` - live n128 serving MMQ shape distribution from patch `0056`; post-run md5/op gates green.
|
||||
- `~/bench/phase31_mmq_launch_trace/20260701_064424` - default-off MoE MMQ launch trace patch `0057`; default/trace/post-serving md5 gates green; n128 launch trace rejects stream-k/fixup shortcut (`fixup=0`, `stream_k_blocks == ntiles_dst`).
|
||||
- `~/bench/phase32_small_m_classifier/20260701_070127` - default-off MoE MMQ small-M classifier patch `0058`; default/trace/post-serving md5 gates green; n128 trace found 4096 candidate calls.
|
||||
- `~/bench/phase33_small_m_tile_policy/20260701_071136` - default-off MoE MMQ small-M tile policy patch `0059`; tile16/tile8 md5/op safe but both slower in n128 serving.
|
||||
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
|
||||
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
|
||||
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
|
||||
@@ -516,8 +527,8 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
|
||||
|
||||
### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels)
|
||||
1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building.
|
||||
2. **Current fork/mirror are clean and verified.** Local fork HEAD is `2a9964d29`, DGX clean mirror HEAD is `024f494d0`, and Phase 32 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
|
||||
3. **Worktree patch series is tracked through 0058.** The only expected unrelated untracked path in this worktree is `.claude/`.
|
||||
2. **Current fork/mirror are clean and verified.** Local fork HEAD is `fbed2abaa`, DGX clean mirror HEAD is `dfd1eaea8`, and Phase 33 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only.
|
||||
3. **Worktree patch series is tracked through 0059.** The only expected unrelated untracked path in this worktree is `.claude/`.
|
||||
4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch).
|
||||
5. **The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign.
|
||||
|
||||
|
||||
@@ -57,18 +57,18 @@ everywhere without ever touching the stock `llama-cpp` source tree.
|
||||
|
||||
## Latest mirror check
|
||||
|
||||
Phase 32 re-verified the mirror invariant after adding patch `0058`:
|
||||
Phase 33 re-verified the mirror invariant after adding patch `0059`:
|
||||
|
||||
```text
|
||||
base=0ed235ea2c17a19fc8238668653946721ed136fd
|
||||
applied_tree=de1bdd1892ab87aee947ec19c5efed8f53b93d40
|
||||
fork_tree=de1bdd1892ab87aee947ec19c5efed8f53b93d40
|
||||
applied_tree=4dc5498ac86b100eddf777c4e7f4c4d11f59415d
|
||||
fork_tree=4dc5498ac86b100eddf777c4e7f4c4d11f59415d
|
||||
```
|
||||
|
||||
The check used a fresh worktree at `LLAMA_VERSION`, applied every
|
||||
`patches/paged/0*.patch` with strict `git apply`, staged the result, and compared
|
||||
`git write-tree` to canonical fork branch `localai-paged` at
|
||||
`2a9964d29 feat(cuda): trace moe small-m mmq candidates`.
|
||||
`fbed2abaa feat(cuda): gate moe small-m mmq tile policy`.
|
||||
|
||||
## Status
|
||||
|
||||
|
||||
@@ -850,6 +850,29 @@ and keeps the NVFP4 tile invariants. This matches the vLLM/Marlin lesson that
|
||||
low-density routed expert rows want smaller M blocks, without porting Marlin,
|
||||
Triton, TMA, tcgen05, or layout repack machinery.
|
||||
|
||||
### Phase 33 small-M tile policy rejection
|
||||
|
||||
Phase 33 added patch `0059`, default-off `LLAMA_MOE_SMALL_M_TILE=<n>`, and
|
||||
tested the obvious vLLM-like shortcut on the Phase 32 candidate population.
|
||||
Artifact: `/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`.
|
||||
|
||||
Default-off, tile16, tile8, and post-serving gates were all bit-exact: MoE
|
||||
`8cb0ce23777bf55f92f63d0292c756b0`, dense
|
||||
`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
|
||||
|
||||
Same-session n128 serving:
|
||||
|
||||
| mode | decode_agg_tps | ratio |
|
||||
|------|----------------|-------|
|
||||
| baseline | 672.1 | 1.000x |
|
||||
| `LLAMA_MOE_SMALL_M_TILE=16` | 640.3 | 0.953x |
|
||||
| `LLAMA_MOE_SMALL_M_TILE=8` | 583.2 | 0.868x |
|
||||
|
||||
Lever implication: smaller `mmq_x` alone is rejected for n128 serving. The
|
||||
remaining grouped-MMQ gap is not solved by emulating Marlin's small `block_size_m`
|
||||
with the current MMQ kernel; a future attempt must alter the kernel's internal
|
||||
work partitioning or move to a different bottleneck.
|
||||
|
||||
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
|
||||
|
||||
### Phase 10 GDN C32 slab update
|
||||
|
||||
@@ -0,0 +1,80 @@
|
||||
From fbed2abaa9f5af8e500f95c8dda86b305450ceff Mon Sep 17 00:00:00 2001
|
||||
From: Ettore Di Giacinto <mudler@localai.io>
|
||||
Date: Wed, 1 Jul 2026 05:17:39 +0000
|
||||
Subject: [PATCH] feat(cuda): gate moe small-m mmq tile policy
|
||||
|
||||
Assisted-by: Codex:gpt-5
|
||||
---
|
||||
ggml/src/ggml-cuda/mmq-shape-trace.h | 9 +++++++++
|
||||
ggml/src/ggml-cuda/mmq.cuh | 13 +++++++++++++
|
||||
tests/test-cuda-mmq-shape-trace.cpp | 10 ++++++++++
|
||||
3 files changed, 32 insertions(+)
|
||||
|
||||
diff --git a/ggml/src/ggml-cuda/mmq-shape-trace.h b/ggml/src/ggml-cuda/mmq-shape-trace.h
|
||||
index 47453d91f..dfb4e898a 100644
|
||||
--- a/ggml/src/ggml-cuda/mmq-shape-trace.h
|
||||
+++ b/ggml/src/ggml-cuda/mmq-shape-trace.h
|
||||
@@ -187,3 +187,12 @@ static inline int ggml_cuda_mmq_small_m_shape_format(
|
||||
shape.mmq_x_best,
|
||||
shape.use_stream_k ? 1 : 0);
|
||||
}
|
||||
+
|
||||
+static inline int ggml_cuda_mmq_small_m_tile_limit(
|
||||
+ const ggml_cuda_mmq_small_m_shape & shape, const int current_limit, const int requested_tile) {
|
||||
+ if (!shape.is_candidate || requested_tile < 8 || requested_tile >= current_limit) {
|
||||
+ return current_limit;
|
||||
+ }
|
||||
+
|
||||
+ return requested_tile;
|
||||
+}
|
||||
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
|
||||
index 25ead9e7b..16b3fcca4 100644
|
||||
--- a/ggml/src/ggml-cuda/mmq.cuh
|
||||
+++ b/ggml/src/ggml-cuda/mmq.cuh
|
||||
@@ -4201,6 +4201,15 @@ static inline int ggml_cuda_moe_density_max() {
|
||||
return d;
|
||||
}
|
||||
|
||||
+static inline int ggml_cuda_moe_small_m_tile() {
|
||||
+ static const int t = []() -> int {
|
||||
+ const char * s = getenv("LLAMA_MOE_SMALL_M_TILE");
|
||||
+ const int v = s ? atoi(s) : 0;
|
||||
+ return v >= 8 ? v : 0;
|
||||
+ }();
|
||||
+ return t;
|
||||
+}
|
||||
+
|
||||
// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
|
||||
// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
|
||||
// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
|
||||
@@ -4282,6 +4291,10 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
|
||||
}
|
||||
}
|
||||
}
|
||||
+
|
||||
+ const ggml_cuda_mmq_small_m_shape small_m = ggml_cuda_mmq_small_m_shape_make(
|
||||
+ true, args.ncols_dst, args.nchannels_x, args.ncols_max, mmq_x_lim, args.use_stream_k);
|
||||
+ mmq_x_lim = ggml_cuda_mmq_small_m_tile_limit(small_m, mmq_x_lim, ggml_cuda_moe_small_m_tile());
|
||||
}
|
||||
|
||||
int mmq_x_best = 0;
|
||||
diff --git a/tests/test-cuda-mmq-shape-trace.cpp b/tests/test-cuda-mmq-shape-trace.cpp
|
||||
index 9f36ce1a1..f7863f03a 100644
|
||||
--- a/tests/test-cuda-mmq-shape-trace.cpp
|
||||
+++ b/tests/test-cuda-mmq-shape-trace.cpp
|
||||
@@ -108,5 +108,15 @@ int main() {
|
||||
require(std::strstr(buf, "density=4") != nullptr, "small-M trace includes density");
|
||||
require(std::strstr(buf, "mmq_x_best=64") != nullptr, "small-M trace includes selected tile");
|
||||
|
||||
+ require(ggml_cuda_mmq_small_m_tile_limit(small_m, 64, 0) == 64,
|
||||
+ "small-M tile override is default-off");
|
||||
+ require(ggml_cuda_mmq_small_m_tile_limit(small_m, 64, 16) == 16,
|
||||
+ "small-M tile override caps candidate tile limit");
|
||||
+ require(ggml_cuda_mmq_small_m_tile_limit(small_m, 64, 128) == 64,
|
||||
+ "small-M tile override ignores non-smaller tiles");
|
||||
+ require(ggml_cuda_mmq_small_m_tile_limit(
|
||||
+ ggml_cuda_mmq_small_m_shape_make(/* is_moe */ true, 4096, 256, 512, 128, true), 128, 16) == 128,
|
||||
+ "small-M tile override excludes prefill-like shapes");
|
||||
+
|
||||
return 0;
|
||||
}
|
||||
@@ -0,0 +1,56 @@
|
||||
# Small-M Tile Policy Phase 33 Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** A/B a default-off MoE-only small-M tile policy using Phase 32 candidate criteria, starting with `LLAMA_MOE_SMALL_M_TILE=16`.
|
||||
|
||||
**Architecture:** Add a narrow host-side override in `mul_mat_q_case`: after the normal MoE density auto-tile logic, if `LLAMA_MOE_SMALL_M_TILE=<n>` is set and the call is decode-like (`ncols_max <= 128`, density `<=4`, stream-k), cap `mmq_x_lim` to that tile. The existing MMQ kernels and launch path remain unchanged; unsupported/default cases fall through unchanged.
|
||||
|
||||
**Tech Stack:** llama.cpp CUDA backend, host-only selector tests, DGX GB10 md5/op gates and n128 h2h serving A/B.
|
||||
|
||||
---
|
||||
|
||||
## Checklist
|
||||
|
||||
- [x] **Step 1: RED selector test**
|
||||
- Add host helper assertions for `ggml_cuda_mmq_small_m_tile_limit`.
|
||||
- Expected: compile failure before helper exists.
|
||||
|
||||
- [x] **Step 2: GREEN helper**
|
||||
- Implement helper in `mmq-shape-trace.h`.
|
||||
- Local test passes.
|
||||
|
||||
- [x] **Step 3: Wire env policy**
|
||||
- Add `LLAMA_MOE_SMALL_M_TILE`.
|
||||
- Apply only to MoE grouped-MMQ small-M candidates.
|
||||
- Default path unchanged.
|
||||
|
||||
- [x] **Step 4: DGX gates**
|
||||
- Build CUDA targets.
|
||||
- Run default-off gates.
|
||||
- Run `EXTRA_ENV=LLAMA_MOE_SMALL_M_TILE=16` gates.
|
||||
|
||||
- [x] **Step 5: n128 A/B**
|
||||
- Same-session baseline vs `LLAMA_MOE_SMALL_M_TILE=16`, h2h n128.
|
||||
- Post-serving gates.
|
||||
|
||||
- [x] **Step 6: Mirror/docs**
|
||||
- Generate patch `0059`.
|
||||
- Strict patch-series tree check.
|
||||
- Update docs and commit LocalAI.
|
||||
|
||||
## Result
|
||||
|
||||
- Fork commit: `/home/mudler/_git/llama.cpp` `fbed2abaa feat(cuda): gate moe small-m mmq tile policy`.
|
||||
- DGX mirror commit: `dgx:~/llama-phase6-source` `dfd1eaea8 feat(cuda): gate moe small-m mmq tile policy`.
|
||||
- Artifact: `/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`.
|
||||
- RED verified: `cmake --build build --target test-cuda-mmq-shape-trace -j 4` failed on missing `ggml_cuda_mmq_small_m_tile_limit`.
|
||||
- GREEN verified locally: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`.
|
||||
- DGX CUDA build verified: `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`.
|
||||
- Default-off, tile16, tile8, and post-serving gates all matched MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
|
||||
- Same-session n128 serving:
|
||||
- baseline: `decode_agg_tps=672.1`, `agg_tps=339.5`, `prefill_tps=1511.4`.
|
||||
- `LLAMA_MOE_SMALL_M_TILE=16`: `decode_agg_tps=640.3`, `agg_tps=328.9`, `prefill_tps=1522.2`, ratio `0.953x`.
|
||||
- `LLAMA_MOE_SMALL_M_TILE=8`: `decode_agg_tps=583.2`, `agg_tps=307.4`, `prefill_tps=1442.6`, ratio `0.868x`.
|
||||
|
||||
Decision: reject smaller `mmq_x` caps for the classified n128 small-M calls. They are md5/op safe but slower. The next structural direction must not be a simple smaller tile cap; it needs a different kernel shape or a different target bucket.
|
||||
Reference in New Issue
Block a user