mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Ettore Di Giacinto 1b5ae227eb docs(paged): reject GDN M5 QS-early phase

Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.

Assisted-by: Codex:gpt-5

2026-07-01 01:29:44 +00:00

GB10 Parity Phase 0 Results

Status: in progress.

Preflight

DGX host: promaxgb10-4ad8
Docker containers: none
GPU compute apps: none
GPU lock owner: FREE released-by-claude-fp4norm-profile 1782828229
LocalAI worktree SHA: d288a0300f36f7c126d62d997809bb03f297a3ac
Local llama.cpp fork SHA: 51168c5eee2e35348d9006f0b2fab3dc6e7c01cc
DGX artifact directory: ~/bench/reopen_phase0

Baseline Runs

Clean prefill baseline artifacts:

MoE: ~/bench/reopen_phase0/paged_moe_prefill.txt
Dense: ~/bench/reopen_phase0/paged_dense_prefill.txt

MoE paged prefill:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	4	32	16512	7.181	2281.66	0.355	360.57	7.536	2191.16
2048	4	32	65664	27.131	2415.53	0.328	390.84	27.459	2391.38

Dense paged prefill:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	4	32	16512	16.749	978.18	0.842	152.03	17.591	938.64
2048	4	32	65664	63.791	1027.35	0.687	186.29	64.479	1018.38

Decode Difference-Method Reproduction

Paged llama.cpp artifacts:

~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg16.nsys-rep
~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg16.bench.log
~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg64.nsys-rep
~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg64.bench.log

Paged llama.cpp rows:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
128	16	256	36864	14.933	2194.39	4.502	909.80	19.435	1896.81
128	64	256	49152	14.949	2191.96	17.924	914.09	32.873	1495.21

Paged difference-method decode:

Token delta: 256 * (64 - 16) = 12288
Wall delta: 17.924 - 4.502 = 13.422 s
Decode throughput: 915.51 t/s

vLLM artifacts:

~/bench/reopen_phase0/vllm_decode_nsys/vllm_version.txt
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.nsys-rep
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.run.log
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.kern.csv
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.gpu_trace.csv
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.nsys-rep
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.run.log
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.kern.csv
~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.gpu_trace.csv

vLLM version: 0.23.0

vLLM profiled rows:

NSEQ	GEN	Generated tokens	Wall s	Logged tok/s
256	16	4096	6.195	661.2
256	64	16384	17.607	930.5

vLLM difference-method decode:

Token delta: 16384 - 4096 = 12288
Wall delta: 17.607 - 6.195 = 11.412 s
Decode throughput: 1076.76 t/s

Clean reproduced paged/vLLM decode ratio: 85.0%.
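
The difference-method arithmetic above can be reproduced with a few lines. This is a minimal sketch; the helper name `difference_decode_tps` is hypothetical, and the inputs are the row values listed above.

```python
def difference_decode_tps(nseq, gen_short, wall_short, gen_long, wall_long):
    """Decode throughput from two runs that differ only in generated length.

    Subtracting the short run cancels the cost both runs share, leaving
    only the extra decode tokens and the extra wall time.
    """
    token_delta = nseq * (gen_long - gen_short)
    wall_delta = wall_long - wall_short
    return token_delta / wall_delta


# Paged llama.cpp uses the T_TG column; vLLM uses the logged wall time.
paged = difference_decode_tps(256, 16, 4.502, 64, 17.924)
vllm = difference_decode_tps(256, 16, 6.195, 64, 17.607)
print(f"paged {paged:.2f} t/s, vLLM {vllm:.2f} t/s, ratio {paged / vllm:.1%}")
```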

W4A16 Kill-Gate Baseline

Artifacts:

Default FP4-MMQ: ~/bench/reopen_phase0/w4a16_off.txt
Forced W4A16 with debug: ~/bench/reopen_phase0/w4a16_on_thr64.txt
Forced W4A16 without debug: ~/bench/reopen_phase0/w4a16_on_thr64_nodebug.txt

Default FP4-MMQ:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	4	32	16512	7.105	2306.06	0.321	399.00	7.426	2223.68
2048	4	32	65664	27.047	2423.00	0.329	388.89	27.377	2398.55

Forced W4A16, LLAMA_W4A16_PREFILL_M=64, debug off:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	4	32	16512	12.517	1308.92	0.321	398.82	12.838	1286.17
2048	4	32	65664	49.165	1332.98	0.330	387.57	49.495	1326.67

Delta:

npp=512: -43.2% S_PP versus default FP4-MMQ.
npp=2048: -45.0% S_PP versus default FP4-MMQ.

Debug evidence:

Forced W4A16 debug run emitted 19200 engagement lines.
Observed n_tiles range: 139..282.
Observed multi_tile_experts range: 7..21.

First implementation target:

Option B: device-side or cached tile metadata.
Rationale: w4a16-gemm.cu currently builds h_tile_expert, h_tile_row0, and h_tile_rows on the host, pool-allocates three device tile-map buffers, and issues three H2D cudaMemcpyAsync calls per grouped W4A16 launch. The debug run shows this path is repeatedly exercised across many small ragged tile maps. The first fork-first experiment should remove or amortize that host-built tile-map path before retuning MMA tile shapes.
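
To make the ragged tile map concrete, here is a minimal sketch of how three parallel arrays could be derived from per-expert row counts and then interleaved into one buffer. The tile height of 64, the row offsets relative to each expert, and the interleaved layout are assumptions for illustration; this does not reproduce the code in w4a16-gemm.cu.

```python
def build_tile_map(rows_per_expert, tile_m=64):
    """Return parallel lists (tile_expert, tile_row0, tile_rows) for a ragged batch."""
    tile_expert, tile_row0, tile_rows = [], [], []
    for expert, n_rows in enumerate(rows_per_expert):
        # Experts with zero rows contribute no tiles at all.
        for row0 in range(0, n_rows, tile_m):
            tile_expert.append(expert)
            tile_row0.append(row0)
            tile_rows.append(min(tile_m, n_rows - row0))
    return tile_expert, tile_row0, tile_rows


def pack_descriptors(tile_expert, tile_row0, tile_rows):
    """Interleave the three arrays so one upload could replace three (assumed layout)."""
    packed = []
    for expert, row0, rows in zip(tile_expert, tile_row0, tile_rows):
        packed.extend((expert, row0, rows))
    return packed


experts, row0s, rows = build_tile_map([0, 70, 5, 130])
n_tiles = len(experts)
multi_tile_experts = sum(1 for e in set(experts) if experts.count(e) > 1)
print(n_tiles, multi_tile_experts, pack_descriptors(experts, row0s, rows)[:6])
```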

W4A16 Metadata Phase 1

Fork commit: 4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17 (feat(paged): pack W4A16 grouped tile metadata).

LocalAI patch mirror: 0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch.

Mirror invariant: applying the full LocalAI patches/paged/*.patch series to base pin 0ed235ea2c17a19fc8238668653946721ed136fd tree-matches fork HEAD 4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17.

Artifacts:

Diff: ~/bench/w4a16_phase1/packed_desc.diff
Build mtimes: ~/bench/w4a16_phase1/build_binary_mtimes.txt
MoE gate: ~/bench/w4a16_phase1/gate_moe.md5
Dense gate: ~/bench/w4a16_phase1/gate_dense.md5
Default FP4-MMQ: ~/bench/w4a16_phase1/w4a16_off.txt
Packed W4A16: ~/bench/w4a16_phase1/w4a16_on_thr64.txt

Canonical gates:

MoE greedy md5: 8cb0ce23777bf55f92f63d0292c756b0 (matched expected)
Dense greedy md5: 5951a5b4d624ce891e22ab5fca9bc439 (matched expected)

Packed descriptor A/B:

Path	PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
FP4-MMQ	512	4	32	16512	7.114	2303.07	0.323	396.55	7.437	2220.32
FP4-MMQ	2048	4	32	65664	27.045	2423.23	0.331	387.14	27.376	2398.64
W4A16 packed	512	4	32	16512	12.468	1314.08	0.322	397.97	12.790	1291.04
W4A16 packed	2048	4	32	65664	48.930	1339.39	0.330	387.44	49.260	1333.00

Result:

Packed descriptors improved forced W4A16 by +0.39% at npp=512 and +0.48% at npp=2048 versus the Phase 0 no-debug W4A16 baseline.
W4A16 remains -42.9% at npp=512 and -44.7% at npp=2048 versus same-run default FP4-MMQ.
Decision: keep patch 0048 as a small simplification, but pivot the next W4A16 iteration to the activation cast or MMA/dequant tile body.

W4A16 Kernel Shape Phase 2

Profile-guided target:

Phase 1 forced W4A16 profile at npp=512: w4a16_grouped_kernel dominated at 5231.667 ms (47.8%) while w4a16_cast_act_f32_bf16 was 517.195 ms (4.7%).
Phase 2 therefore targeted grouped-kernel tile shape/body before activation cast fusion.

Shape sweep artifacts:

Build: ~/llama-w4a16-phase2
Benchmarks: ~/bench/w4a16_phase2/shape_*.txt
Winning profile: ~/bench/w4a16_phase2/profile/w4a16_bm32_npp512.*

Shape A/B:

Shape	512 S_PP t/s	2048 S_PP t/s	Decision
`base` / `64x128`	1308.02	1339.46	old baseline
`bn256`	1286.99	1311.56	rejected
`bm32` / `32x128`	1442.99	1475.65	selected
`bn64`	1334.80	1362.55	diagnostic only
`stages3`	1271.01	1295.96	rejected
`bn256x16`	1084.66	1100.95	rejected

Only bm32 and the old base selector are shipped in patch 0049. The other candidate shapes were benchmarked in the Phase 2 build and then deliberately left out to keep the upstream conflict surface small.

Default verification after selecting bm32:

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
512	4	32	16512	11.360	1442.28	0.321	397.00	11.682	1413.43
2048	4	32	65664	44.529	1471.77	0.331	386.06	44.860	1463.75

Result:

bm32 improves forced W4A16 by about +10.3% at npp=512 (1442.99 vs 1308.02) and +10.2% at npp=2048 (1475.65 vs 1339.46) versus the old 64x128 shape in the same sweep.
The profiled bm32 grouped kernel dropped to 4107.355 ms (41.7%) at npp=512, from Phase 1's 5231.667 ms (47.8%).
Canonical post-change gates matched: MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439.
Forced W4A16 shape gates matched each other: LLAMA_W4A16_PREFILL_M=1 default bm32 and LLAMA_W4A16_SHAPE=base both produced 07db32c2bcb78d17a43ed18bc22705cd on the canonical gate prompt.
Forced W4A16 MUL_MAT_ID op checks passed for both shapes: test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1 reported 806/806 for default bm32 and 806/806 for base.
Decision: make bm32 the W4A16 default shape while keeping LLAMA_W4A16_SHAPE=base for old-shape A/B and leaving other candidates as diagnostics.

Mirror invariant after patch 0049:

Applying all 40 LocalAI patches/paged/*.patch files to base pin 0ed235ea2c17a19fc8238668653946721ed136fd tree-matches fork HEAD 7dfa0e17548c5f04f83d2cc2a057b0a9941b599a.
Tree hash after patch application: dabe225efbf20ec047b8309d1e1f19b34fc7c5c9.

W4A16 Scale Broadcast Phase 3

Goal: reduce duplicate FP4 scale conversion inside w4a16_grouped_kernel by having one lane per 4-lane group convert the ue4m3 scale and broadcast it with __shfl_sync.
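
The idea can be emulated on the host to show what is saved. This is a sketch under stated assumptions: all four lanes of a group share one scale byte, NaN encodings are ignored, and `shfl_width` is a hypothetical Python stand-in for the `width` form of `__shfl_sync`, not the kernel code.

```python
def decode_ue4m3(byte):
    """Decode an unsigned E4M3 value (4 exponent bits, 3 mantissa bits, bias 7)."""
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0:  # subnormal: no implicit leading one
        return (man / 8.0) * 2.0 ** -6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def shfl_width(values, src_lane, width):
    """Each lane reads the value held by src_lane within its own width-sized group."""
    out = []
    for lane in range(len(values)):
        group_base = (lane // width) * width
        out.append(values[group_base + src_lane % width])
    return out


def broadcast_scales(raw_scales, width=4):
    """Only the group leader converts; the other lanes receive the result by shuffle."""
    conversions = 0
    converted = [None] * len(raw_scales)
    for lane, raw in enumerate(raw_scales):
        if lane % width == 0:
            converted[lane] = decode_ue4m3(raw)
            conversions += 1
    return shfl_width(converted, 0, width), conversions


raw = [0x38] * 4 + [0x40] * 4 + [0x38] * 24  # one warp of 32 lanes
scales, n_conversions = broadcast_scales(raw)
print(n_conversions, scales[:8])
```

The emulation only counts conversions. It says nothing about shuffle latency or scheduling, which is where the measured regression below comes from.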

Artifacts:

Build: ~/llama-w4a16-phase3
Logs: ~/bench/w4a16_phase3

Gates:

Canonical paged MoE md5: 8cb0ce23777bf55f92f63d0292c756b0.
Canonical dense md5: 5951a5b4d624ce891e22ab5fca9bc439.
Forced W4A16 bm32 and old base shape md5s matched each other: 07db32c2bcb78d17a43ed18bc22705cd.
Forced W4A16 MUL_MAT_ID: 806/806 on CUDA0.

Performance:

Shape	512 S_PP t/s	2048 S_PP t/s	Decision
Phase 2 `bm32`	1442.28	1471.77	baseline
Phase 3 scale-broadcast `bm32`	1392.46	1422.74	rejected
Phase 2 `base`	1310.13	1336.02	baseline
Phase 3 scale-broadcast `base`	1201.69	1221.25	rejected

Result:

Rejected. No fork commit and no LocalAI patch were created for this experiment; patch number 0050 was later used by the Phase 4 A-pad change.
The local fork experiment was reverted.
Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or scheduling cost exceeds the saved duplicate scale conversion.

W4A16 Shared-Memory Padding Phase 4

Goal: reduce bank pressure in w4a16_grouped_kernel by padding the A operand shared-memory row stride while preserving math order and launch shape.
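
Why a padded row stride helps can be shown with a small bank model. This is a simplified sketch, assuming 32 four-byte banks and a pattern where lane i reads the same column of row i; the real access pattern of w4a16_grouped_kernel is not reproduced here.

```python
from collections import Counter

NUM_BANKS = 32  # shared memory is modeled as 32 banks of 4-byte words


def worst_conflict(row_stride_words, lanes=32, column=0):
    """Largest number of lanes that hit the same bank when lane i reads row i."""
    banks = Counter(
        (lane * row_stride_words + column) % NUM_BANKS for lane in range(lanes)
    )
    return max(banks.values())


print(worst_conflict(32))  # unpadded 32-word rows: every lane maps to one bank
print(worst_conflict(33))  # one word of padding per row spreads lanes across banks
```

Padding changes only the address mapping, so the math order and launch shape stay the same. That is why the gates below can require identical md5 values.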

Fork commit: d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3 (feat(paged): pad W4A16 A shared tile stride).

LocalAI patch mirror: 0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch.

Artifacts:

Build: ~/llama-w4a16-phase4
Logs: ~/bench/w4a16_phase4

Gates:

Canonical paged MoE md5: 8cb0ce23777bf55f92f63d0292c756b0.
Canonical dense md5: 5951a5b4d624ce891e22ab5fca9bc439.
Forced W4A16 bm32 and old base shape md5s matched each other: 07db32c2bcb78d17a43ed18bc22705cd.
Forced W4A16 MUL_MAT_ID: 806/806 on CUDA0.

Performance:

Shape	512 S_PP t/s	2048 S_PP t/s	Decision
Phase 2 `bm32`	1442.28	1471.77	baseline
Phase 4 A-pad `bm32`	1466.62	1495.93	selected
Phase 2 `base`	1310.13	1336.02	baseline
Phase 4 A-pad `base`	1337.88	1364.98	positive diagnostic

Result:

Kept. Default W4A16 bm32 improves another +1.7% at npp=512 and +1.6% at npp=2048 versus Phase 2.
Applying all 41 LocalAI patches/paged/*.patch files to base pin 0ed235ea2c17a19fc8238668653946721ed136fd tree-matches fork HEAD d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3.
Tree hash after patch application: 8fcb151e0620fd0fc82b80c04318e5c34320b087.

W4A16 Wq Padding Phase 5

Goal: test whether padding the quantized-weight shared-memory row stride gives another low-conflict W4A16 grouped-kernel body win after 0050.

Artifacts:

Build: ~/llama-w4a16-phase5
Logs: ~/bench/w4a16_phase5

Gates:

Canonical paged MoE md5: 8cb0ce23777bf55f92f63d0292c756b0.
Canonical dense md5: 5951a5b4d624ce891e22ab5fca9bc439.
Forced W4A16 bm32 and old base shape md5s matched each other: 07db32c2bcb78d17a43ed18bc22705cd.
Forced W4A16 MUL_MAT_ID: 806/806 on CUDA0.

Performance:

Shape	512 S_PP t/s	2048 S_PP t/s	Decision
Phase 4 A-pad `bm32`	1466.62	1495.93	baseline
Phase 5 Wq-pad `bm32`	1472.36	1504.82	rejected: below 1% gate
Phase 4 A-pad `base`	1337.88	1364.98	baseline
Phase 5 Wq-pad `base`	1337.70	1368.48	diagnostic

Result:

Rejected. No fork commit and no LocalAI patch was created for that experiment.
The local fork experiment was reverted.
Do not ship Wq padding alone; the measured +0.4% / +0.6% default-shape gain is below the maintenance threshold.
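
The 1% gate used for the keep/reject decisions can be written down explicitly. This is a minimal sketch; requiring every measured prompt length to clear the threshold is an assumption about how the gate is applied, and the function names are hypothetical.

```python
def gain(baseline, candidate):
    """Relative S_PP change of the candidate over the baseline."""
    return candidate / baseline - 1.0


def clears_gate(baseline_rows, candidate_rows, threshold=0.01):
    """Accept only if every measured prompt length improves by at least threshold."""
    return all(gain(b, c) >= threshold for b, c in zip(baseline_rows, candidate_rows))


# (npp=512, npp=2048) S_PP values for the default bm32 shape.
phase2_bm32 = (1442.28, 1471.77)
phase4_apad = (1466.62, 1495.93)
phase5_wqpad = (1472.36, 1504.82)

print(clears_gate(phase2_bm32, phase4_apad))   # A-pad versus Phase 2
print(clears_gate(phase4_apad, phase5_wqpad))  # Wq-pad versus Phase 4
```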

Clean Build

First clean build attempt:

PID: 625392
Source checkout: ~/llama-paged-reopen-clean
Result: failed during CMake configure.
Root cause: nvcc was not discoverable on PATH. CUDA headers were found under /usr/local/cuda/targets/sbsa-linux/include, and the compiler exists at /usr/local/cuda-13.0/bin/nvcc.
Retry plan: rebuild the clean checkout with CUDACXX=/usr/local/cuda-13.0/bin/nvcc.
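
The retry plan amounts to setting `CUDACXX` when `nvcc` is not on `PATH`. This is a minimal sketch of that check; the helper name `cuda_compiler_env` is hypothetical, and the fallback path is the one recorded above.

```python
import os
import shutil


def cuda_compiler_env(fallbacks=("/usr/local/cuda-13.0/bin/nvcc",), env=None):
    """Return an environment with CUDACXX set when nvcc is not already discoverable."""
    env = dict(os.environ if env is None else env)
    if "CUDACXX" in env or shutil.which("nvcc", path=env.get("PATH")):
        return env  # a compiler is already configured or on PATH
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            env["CUDACXX"] = candidate
            return env
    raise FileNotFoundError("nvcc not found on PATH or in the fallback list")
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when invoking the CMake configure step.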

Second clean build attempt:

PID: 631100
Source checkout: ~/llama-paged-reopen-clean
Source status: ## HEAD (no branch)
Build HEAD: 51168c5eee2e35348d9006f0b2fab3dc6e7c01cc
CUDA compiler: /usr/local/cuda-13.0/bin/nvcc
Result: succeeded.
Binary mtimes:
- build-cuda/bin/llama-server 2026-06-30 22:14:34.091312112 +0200
- build-cuda/bin/llama-batched-bench 2026-06-30 22:14:35.156287566 +0200
- build-cuda/bin/llama-completion 2026-06-30 22:14:37.095750242 +0200
- build-cuda/bin/test-backend-ops 2026-06-30 22:14:47.360078186 +0200

Canonical Gates

MoE greedy md5: 8cb0ce23777bf55f92f63d0292c756b0 (matched expected)
Dense greedy md5: 5951a5b4d624ce891e22ab5fca9bc439 (matched expected)
Artifacts:
- ~/bench/reopen_phase0/gate_moe.txt
- ~/bench/reopen_phase0/gate_moe.md5
- ~/bench/reopen_phase0/gate_dense.txt
- ~/bench/reopen_phase0/gate_dense.md5
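
A gate check of this kind can be scripted as follows. This is a sketch assuming the md5 is taken over the raw bytes of the transcript file; how the `.md5` artifacts were actually produced is not recorded here, and the function names are hypothetical.

```python
import hashlib

EXPECTED = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a transcript file in chunks so large outputs are not loaded at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gate_passes(model, transcript_path):
    """True when the greedy transcript hashes to the canonical value for the model."""
    return md5_of_file(transcript_path) == EXPECTED[model]
```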

Source Provenance

Local llama.cpp fork: /home/mudler/_git/llama.cpp
Branch: localai-paged
Working tree: clean after fork commit d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3
Phase 0 HEAD: 51168c5eee2e35348d9006f0b2fab3dc6e7c01cc
Current HEAD: cd56cf037379b084d6bb0ed47db8b785c828be86
Base pin: 0ed235ea2c17a19fc8238668653946721ed136fd
Merge-base with base pin: 0ed235ea2c17a19fc8238668653946721ed136fd
LocalAI patch count: 38 at Phase 0; current mirror count is 42 after patch 0051.
LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork HEAD.
Tree hash after patch application: 623b7cb008a929455ca3d9deae35494c02622fef

Existing Artifact Gap Review

Read-only DGX artifact inspection was performed after confirming the machine was idle: docker ps returned no running containers, nvidia-smi --query-compute-apps returned no compute-app rows, and ~/gpu_bench_lock/owner read FREE released-by-claude-fp4norm-profile 1782828229.
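
The idle condition described above can be expressed as a small predicate. This is a sketch assuming the lock owner line has three whitespace-separated fields (state, tag, and a number taken to be a Unix timestamp); the field meanings and helper names are hypothetical.

```python
from collections import namedtuple

GpuLock = namedtuple("GpuLock", ["state", "tag", "stamp"])


def parse_gpu_lock(text):
    """Split a lock owner line into its three whitespace-separated fields."""
    state, tag, stamp = text.split()
    return GpuLock(state, tag, int(stamp))


def host_is_idle(lock, containers, compute_apps):
    """Idle means the lock is FREE and no containers or GPU compute apps are listed."""
    return lock.state == "FREE" and not containers and not compute_apps


lock = parse_gpu_lock("FREE released-by-claude-fp4norm-profile 1782828229")
print(host_is_idle(lock, containers=[], compute_apps=[]))
```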

Existing paged llama.cpp decode and prefill numbers are supported by /home/mudler/bench/COMBINED_DEFINITIVE.txt: MoE paged prefill lines 13-18, MoE paged serving decode lines 23-26, dense paged prefill lines 43-48, and dense paged serving decode lines 53-56. Supporting comparison artifacts are /home/mudler/bench/STOCK3WAY.txt, /home/mudler/bench/PREFILL_KNOB.txt, /home/mudler/bench/DEFINITIVE_S3ab.txt, and the adjacent raw logs.

No self-contained vLLM 1078 t/s GPU-steady ntg16/ntg64 difference-method artifact was found. The available vLLM evidence is serving-run output in /home/mudler/bench/COMBINED_DEFINITIVE.txt plus nsys/run artifacts under /home/mudler/bench/profgap/ and /home/mudler/bench/postssm_decomp/; these do not form a packaged ntg16/ntg64 difference-method report.

W4A16/Marlin evidence exists in /home/mudler/bench/vllm_prefix.log, /home/mudler/bench/profgap/vllm_moe_decode.run.log, and /home/mudler/bench/marlin_gate/kl_marlin.log. /home/mudler/llama-paged-dev/LEVER3_ACTQUANT_FUSION_RESULTS.md records the parity conclusion: W4A16/Marlin is a precision-change lever, not a bit-exact llama.cpp parity lever.

GDN M5/M8 evidence exists in /home/mudler/bench/COMBINED_DEFINITIVE.txt (GDN CONFIG C (M8) and production defaults noting GDN M5), /home/mudler/llama-paged-dev/LEVER1_GATHER_RESULTS.md, and /home/mudler/llama-paged-dev/CONV_STATE_FUSION_RESULTS.md.

S3 evidence exists in /home/mudler/bench/DEFINITIVE_S3ab.txt; that A/B shows S3-on was worse unless paired with LLAMA_PAGED_PREFILL_PERIOD=1, matching /home/mudler/bench/COMBINED_DEFINITIVE.txt where S3 is recorded as off by default. No separate self-contained adaptive-scheduling proof artifact was found beyond the S3 and prefill-knob artifacts.

Open Items

Phase 6 Serving nsys Classifier

Exact fork head d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3 was mirrored to /home/mudler/llama-phase6-source on DGX and rebuilt with CUDA Release, CMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc, and CMAKE_CUDA_ARCHITECTURES=121.

Pre-profile gates passed:

MoE greedy md5: 8cb0ce23777bf55f92f63d0292c756b0.
Dense greedy md5: 5951a5b4d624ce891e22ab5fca9bc439.

Serving nsys artifacts:

llama.cpp: /home/mudler/bench/phase6_serving_nsys/llama_server_n128/.
vLLM: /home/mudler/bench/phase6_serving_nsys/vllm_server_n128/.

Same h2h shape (n=128, ptok=128, gen=128) under nsys:

Engine	decode tok/s/seq	decode agg tok/s	prefill tok/s
llama.cpp	4.05	591.0	1567.4
vLLM	6.95	961.1	5073.6

llama.cpp bucket highlights:

gated_delta_net_cuda: 33.7% GPU kernel time, 10.21s.
NVFP4 mul_mat_q: 24.3% + 5.5% for the largest grouped variants, 9.04s combined.
quantize_mmq_nvfp4: 2.7%, 0.81s.
flash_attn_tile: 1.3%, 0.38s.
CUDA API: cudaStreamSynchronize 76.5% API time, 23.66s over 106585 calls; 8028 synchronizes followed cudaMemcpyAsync and summed 21.41s.
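
The "synchronize followed cudaMemcpyAsync" count can be derived from an ordered API trace. This is a sketch assuming the trace has already been exported as `(name, duration_seconds)` tuples for one thread and that "followed" means the immediately preceding API call; the export format is an assumption, not the nsys schema.

```python
def sync_after_memcpy(api_calls):
    """Count cudaStreamSynchronize calls that directly follow cudaMemcpyAsync.

    Returns (number of such synchronizes, their summed duration in seconds).
    """
    count, total = 0, 0.0
    previous = None
    for name, duration in api_calls:
        if name == "cudaStreamSynchronize" and previous == "cudaMemcpyAsync":
            count += 1
            total += duration
        previous = name
    return count, total


trace = [
    ("cudaMemcpyAsync", 0.001),
    ("cudaStreamSynchronize", 0.25),   # counted: follows a copy
    ("cudaLaunchKernel", 0.002),
    ("cudaStreamSynchronize", 0.5),    # not counted: follows a launch
    ("cudaMemcpyAsync", 0.001),
    ("cudaStreamSynchronize", 0.25),   # counted
]
print(sync_after_memcpy(trace))
```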

vLLM bucket highlights:

fused_recurrent_gated_delta_rule_packed_decode_kernel: 16.6%, 8.95s.
marlin_moe_wna16::Marlin: 11.9% plus smaller Marlin-MoE variants.
flash_fwd_splitkv_kernel: visible split-K FA decode rows at 0.6% + 0.1%.
The vLLM delayed profile still contains startup/module-load API noise; prefer h2h and GPU kernel buckets over API percentages for vLLM.

Rejected Phase 6 sampler experiment:

Patch idea: in backend distribution sampling, skip the random uniform upload when prior backend filters already collapsed candidates to one token (temperature=0 path).
Gates passed:
- MoE md5 8cb0ce23777bf55f92f63d0292c756b0.
- Dense md5 5951a5b4d624ce891e22ab5fca9bc439.
- MUL_MAT_ID: 806/806 on CUDA0.
Serving A/B did not clear the performance gate: no-nsys reps were 4.19 and 3.55 tok/s/seq. The fork patch was reverted; no commit and no LocalAI patch were created.

Next measured target:

H3 is elevated above another W4A16/kernel-shape pass: llama.cpp spends 33.7% of GPU time in GDN decode versus vLLM's 16.6%, and vLLM remains 1.63x faster on aggregate decode for the same serving shape. Use existing GDN_NW and GDN_CPW controls to grid-search live-width-adaptive GDN launch parameters before changing source.

Phase 6 GDN Narrow-Serving Env Grid

Artifact: /home/mudler/bench/phase6_serving_nsys/gdn_grid/.

Clean binaries were rebuilt after reverting the rejected sampler experiment. Grid shape was n=128, ptok=128, gen=64 to keep each isolated server run bounded.

Setting	decode tok/s/seq	decode agg tok/s	Decision
default	3.91	647.9	baseline
`GDN_NW=4 GDN_CPW=1`	3.80	628.9	reject
`GDN_NW=8 GDN_CPW=2`	3.94	624.5	reject
`GDN_NW=8 GDN_CPW=4`	3.91	647.6	reject
`GDN_NW=8 GDN_CPW=8`	4.00	636.9	no material win
`GDN_NW=16 GDN_CPW=4`	3.85	637.5	reject
`GDN_NW=16 GDN_CPW=8`	3.96	652.0	no material win

Result:

Rejected as an env-only lever. Existing GDN geometry variants are too close in this serving gate to justify a source change.
Next focus moves back to the largest differentiating kernel bucket: llama.cpp's NVFP4 grouped mul_mat_q bucket (~30% GPU time) versus vLLM's Marlin-MoE bucket.
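
An env-only grid like the one above can be driven from a list of environment overlays, one isolated server run per entry. This is a minimal sketch: only the `GDN_NW` and `GDN_CPW` names and the pairs come from the table, the helper is hypothetical, and the server launch command is deliberately left out.

```python
import os

# (GDN_NW, GDN_CPW) pairs taken from the grid table.
GRID = [(4, 1), (8, 2), (8, 4), (8, 8), (16, 4), (16, 8)]


def grid_environments(base_env=None):
    """Yield (label, env) pairs: the untouched default first, then one env per grid point."""
    base = dict(os.environ if base_env is None else base_env)
    yield "default", dict(base)
    for nw, cpw in GRID:
        env = dict(base)
        env["GDN_NW"] = str(nw)
        env["GDN_CPW"] = str(cpw)
        yield f"GDN_NW={nw} GDN_CPW={cpw}", env


for label, env in grid_environments({}):
    print(label, env)
```

Each `env` mapping can be handed to `subprocess.Popen(..., env=env)` so that settings never leak between runs.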

Phase 6 MoE MMQ Tile Env Grid

Artifact: /home/mudler/bench/phase6_serving_nsys/mmq_grid/.

Shape: n=128, ptok=128, gen=64.

Setting	decode tok/s/seq	decode agg tok/s	Decision
default	3.90	645.3	baseline
`LLAMA_MOE_AUTO_TILE=0`	3.90	655.3	tied/no material win
`LLAMA_MOE_DECODE_TILE=32`	3.82	635.9	reject
`LLAMA_MOE_DECODE_TILE=48`	3.81	637.3	reject
`LLAMA_MOE_DECODE_TILE=96`	3.84	642.8	reject
`LLAMA_MOE_DECODE_TILE=128`	3.84	640.6	reject
`LLAMA_MOE_MMQ_X=32`	3.76	642.0	reject; prefill worsened

Result:

Rejected as an env-only lever. Existing grouped-MMQ tile and auto-selector knobs do not materially close the serving gap.
A source patch that only retunes the current tile selector is not justified. The next useful MoE lever would need a structural change closer to vLLM's Marlin-MoE/fused-MoE shape, or the work should move to the synchronous serving input/sampler path with a measurable non-greedy workload.

Open Items

No current env-only lever clears the serving performance gate. Scope the next source candidate against either structural MoE decode fusion or async serving input/sampler uploads, with a workload that proves the target bucket matters.
Phase 7 must keep the canonical MoE and dense md5 gates as the first inference-safety check before any performance result is accepted.

Phase 7 Source-Candidate Test Gate

Fork commit cd56cf037379b084d6bb0ed47db8b785c828be86 added patch 0051-test-paged-cover-MoE-swiglu-down-chain.patch. This is a test-only patch; it does not change the production inference path.

Fresh DGX gates from /home/mudler/bench/phase7_source_scope/:

MoE greedy md5: 8cb0ce23777bf55f92f63d0292c756b0.
Dense greedy md5: 5951a5b4d624ce891e22ab5fca9bc439.
Baseline MUL_MAT_ID: 806/806.
New MOE_SWIGLU_DOWN: 7/7.

The new gate covers the merged MoE gate_up -> SWIGLU -> down-projection graph shape needed before attempting a batched NVFP4 down-input quantization fusion.

Phase 7 SWIGLU-Down Fusion Candidate Rejected

Attempted candidate: fuse GGML_OP_GLU(SWIGLU) into the NVFP4 activation quantization feeding the MoE down-projection MUL_MAT_ID, while keeping the existing grouped-MMQ kernel. The patch was kept behind GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1 during validation.

DGX artifacts:

/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_swiglu_down_optin.txt
/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_after_optin.txt
/home/mudler/bench/phase7_source_scope/default_gates_after_optin/
/home/mudler/bench/phase7_source_scope/optin_gates/
/home/mudler/bench/phase7_source_scope/serving_ab/

Correctness and inference gates:

Forced fusion MOE_SWIGLU_DOWN: 7/7.
Broad default MUL_MAT_ID: 806/806.
Default md5 after opt-in gating stayed canonical:
- MoE 8cb0ce23777bf55f92f63d0292c756b0.
- Dense 5951a5b4d624ce891e22ab5fca9bc439.
Opt-in fusion md5:
- MoE 07db32c2bcb78d17a43ed18bc22705cd.
- Dense 5951a5b4d624ce891e22ab5fca9bc439.

Serving A/B (n=128, ptok=128, gen=64, /v1/completions, --no-cache):

path	decode tok/s/seq	decode agg tok/s	prefill tok/s	verdict
default	3.92	657.1	1456.0	baseline
`GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1`	3.88	667.4	1462.9	reject; md5 drift and flat A/B

Result:

Rejected as a production patch. The opt-in path moves the paged-MoE md5 off the canonical value (8cb0ce23777bf55f92f63d0292c756b0 becomes 07db32c2bcb78d17a43ed18bc22705cd, the non-paged namespace) and does not materially improve serving.
Root-cause note for future attempts: the first fused-op gate failed because the fused quantizer used compact GLU-output strides to read split gate/up views. Split views stride over the merged gate/up tensor; using source-view strides fixed the op gate but not the end-to-end md5 drift.
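
The stride mistake is easy to reproduce on a toy buffer. This is a sketch assuming a merged layout where each row stores the gate half followed by the up half; the real tensor layout and the quantizer code are not reproduced here.

```python
def read_view(buffer, offset, rows, cols, row_stride):
    """Read a rows x cols view that starts at offset and advances row_stride per row."""
    return [
        buffer[offset + r * row_stride: offset + r * row_stride + cols]
        for r in range(rows)
    ]


rows, cols = 2, 3
# Merged tensor: gate values then up values within every row.
merged = ["g00", "g01", "g02", "u00", "u01", "u02",
          "g10", "g11", "g12", "u10", "u11", "u12"]

gate_ok = read_view(merged, 0, rows, cols, row_stride=2 * cols)  # source-view stride
gate_bad = read_view(merged, 0, rows, cols, row_stride=cols)     # compact output stride
print(gate_ok)
print(gate_bad)  # the second row lands in the up half of row 0
```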

Phase 7 Weighted-Combine Test Gate

Fork commit 3ef7eb9e4d added patch 0052-test-paged-cover-MoE-weighted-combine-chain.patch. This is a test-only patch; it does not change the production inference path.

The new MOE_WEIGHTED_COMBINE whole-graph gate covers:

down MUL_MAT_ID -> router-weight ggml_mul -> rank-ordered expert views/adds.

DGX artifact:

/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_green.txt

DGX result:

test-backend-ops test -b CUDA0 -o MOE_WEIGHTED_COMBINE -j 1: 7/7.

This gate is the correctness target for the next candidate: a deterministic post-down MoE weighted-combine fusion that preserves current f32 product and rank-order add semantics while avoiding the rejected SWIGLU/FP4-quantization shortcut.
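
Why the add order is part of the contract can be seen with single-precision rounding. This is a scalar sketch that emulates f32 arithmetic with `struct`; the function names are hypothetical, and the real graph operates on tensors rather than scalars.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def weighted_combine(expert_out, weights):
    """Sum weight * output over expert ranks in order, rounding to f32 at each step."""
    acc = f32(expert_out[0] * weights[0])
    for out, w in zip(expert_out[1:], weights[1:]):
        acc = f32(acc + f32(out * w))
    return acc


outs = [1.0e8, -1.0e8, 1.0]
ones = [1.0, 1.0, 1.0]
in_rank_order = weighted_combine(outs, ones)
reordered = weighted_combine([outs[1], outs[2], outs[0]], ones)
print(in_rank_order, reordered)
```

The two orders give different f32 results for the same inputs, so a fused kernel that changes the order could shift the transcript md5 even when each product is exact.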

Phase 7 Weighted-Combine Fusion Candidate Rejected

Attempted candidate: fuse the post-down MoE router-weight multiply and rank-ordered add fan-in:

ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in.

The candidate was fork-first, default-on during validation, and had a rollback env switch: LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1.

DGX artifacts:

/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_orderfix.txt
/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt
/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_gates_chat/
/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_nsys_completion/
/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_serving_ab/
Rejected diff: /home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff

Correctness and inference gates:

MOE_WEIGHTED_COMBINE: 7/7.
Broad MUL_MAT_ID: 806/806.
Canonical transcript md5:
- MoE 8cb0ce23777bf55f92f63d0292c756b0.
- Dense 5951a5b4d624ce891e22ab5fca9bc439.

Nsight proof:

Disabled run: no k_moe_weighted_combine kernels.
Fused run: 110 k_moe_weighted_combine launches.

Serving A/B (n=128, ptok=128, gen=64, /v1/completions):

path	decode tok/s/seq	decode agg tok/s	prefill tok/s	verdict
`LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1`	2.63	417.5	1345.2	baseline
fused default	2.63	417.0	1346.9	reject; kernel fires but A/B is flat

Result:

Rejected as a production patch. The patch is md5-safe and the kernel fires, but it does not improve the bounded serving workload. Keep patch 0052 as a useful regression gate; do not retry this exact fan-in-only fusion unless a fresh profile shows the weighted/add fan-in as a material bucket.

Phase 8 Ragged MoE Dispatch Scope

Plan: docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md.

The next candidate is profile-gated before source work:

Target a fused routed-expert MUL_MAT_ID dispatch path for ragged serving decode, not another post-down fan-in fusion.
First decompose live llama.cpp and vLLM MoE serving at n=128, ptok=128, gen=64 with Nsight and /home/mudler/bench/bucket.py.
Promote only if mm_ids_helper, activation quant/gather, grouped MMQ, or related MoE dispatch rows are material and not hidden by GDN or FA.
Keep the backend-sampling/logit-bias upload cache as a non-default follow-up; it requires --backend-sampling and request backend_sampling: true with non-empty logit_bias or ignore_eos.

Required promotion gates remain:

MoE md5 8cb0ce23777bf55f92f63d0292c756b0.
Dense md5 5951a5b4d624ce891e22ab5fca9bc439.
MUL_MAT_ID: 806/806 on CUDA0.
Any fused dispatch prototype must start default-off behind LLAMA_MOE_FUSED_DISPATCH=1.

Profile-gate result:

Clean llama.cpp artifact: /home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/.
vLLM artifact: /home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/.
A stale first llama profile under llama_n128/ is intentionally ignored because the binary still contained the rejected weighted-combine kernel before the clean-source rebuild.

Throughput:

Engine	decode tok/s/seq	decode agg tok/s	prefill tok/s
llama.cpp	2.70	412.1	1368.3
vLLM	7.02	1036.6	5277.7

llama.cpp bucket highlights from the clean profile:

GDN: 4680.27 ms, 38.12%.
mmq_nvfp4: 2745.11 ms, 22.36%.
act_quant: 441.42 ms, 3.60%.
MoE dispatch: 183.67 ms, 1.50%.
ew_add fan-in: 280.15 ms, 2.28%.

Decision:

Promote to a test-only ragged MUL_MAT_ID gate before production source.
Do not implement fused dispatch yet. Standalone mm_ids/gather_mmq helper time is small; a source patch must reduce the larger grouped-MMQ/activation movement bucket and still beat the +5% serving A/B gate.

Phase 8 Ragged MoE Dispatch Test Gate

Fork commit e21732fc4 added patch 0053-test-paged-cover-ragged-MoE-dispatch.patch. This is a test-only patch; it does not change the production inference path.

The new MUL_MAT_ID_RAGGED_MOE gate covers:

one small F32 wiring case,
NVFP4 with n_mats=256, n_used=8, m=768, k=2048, n in {1, 8, 33, 128, 257},
deterministic unique top-k ids skewed toward hot experts, including expert 255, leaving many experts empty.

DGX artifact:

/home/mudler/bench/phase8_ragged_moe_dispatch/test_backend_ops_mul_mat_id_ragged_moe_fixed.txt

DGX result:

test-backend-ops test -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE -j 1: 6/6.

Debug note:

The first version of the gate failed because the deterministic IDs produced duplicate expert IDs within token 0. That is not a valid top-k routing shape and caused a CPU/CUDA mismatch followed by a CUDA fault. The committed gate preserves unique expert IDs per token while keeping cross-token load skew.
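
A routing generator with the properties described above might look like this. It is an illustrative sketch only: the hot set, the cold pool, and the rotation rule are assumptions, and the committed test uses its own pattern.

```python
N_EXPERTS, N_USED = 256, 8
HOT = (0, 1, 2, 255)              # experts every token routes to, including expert 255
COLD_POOL = tuple(range(4, 20))   # small rotating pool; all other experts stay empty


def topk_ids(token):
    """Deterministic, duplicate-free expert ids for one token."""
    n_cold = N_USED - len(HOT)
    # Consecutive pool slots are distinct because n_cold <= len(COLD_POOL).
    cold = [COLD_POOL[(token * n_cold + j) % len(COLD_POOL)] for j in range(n_cold)]
    return list(HOT) + cold


def routing(n_tokens):
    """Expert ids for every token in a batch."""
    return [topk_ids(t) for t in range(n_tokens)]


ids = routing(257)
used = {e for row in ids for e in row}
print(len(used), N_EXPERTS - len(used))
```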

Production-source decision:

Do not start a Phase 8 production CUDA patch yet.
Code inspection found that the existing native-FP4 MoE path already de-dups broadcast activation quantization when ne11 == 1, then gathers FP4 blocks before grouped MMQ.
The measured helper rows are small (mm_ids=0.66%, gather_mmq=0.42%). A metadata-only fused-dispatch hook would not plausibly clear the +5% serving A/B gate.
A future source candidate must reduce mmq_nvfp4 (22.36%) or act_quant (3.60%) directly, without D2H id readback, new stream synchronizations, or md5 drift.

Phase 9 MTP Draft Smoke Gate

Phase 9 challenged the older "MTP absent" assumption. The current fork has Qwen3.5/3.6 draft-mtp support and the DGX MoE GGUF contains MTP metadata and tensors:

qwen35moe.nextn_predict_layers
blk.40.nextn.eh_proj.weight
blk.40.nextn.shared_head_norm.weight
blk.40.nextn.enorm.weight
blk.40.nextn.hnorm.weight

Smoke artifacts:

Failing default pre-patch: /home/mudler/bench/phase9_mtp_smoke/mtp_smoke.err.
Passing explicit CPU-sampled draft: /home/mudler/bench/phase9_mtp_smoke/mtp_smoke_no_backend_sampling.err.
Passing default after patch: /home/mudler/bench/phase9_mtp_smoke/mtp_smoke_default_after_patch.err.

Finding:

draft-mtp runs with the current model when backend draft sampling is off.
The default path previously emitted: backend sampling requires at most one output token per sequence (seq_id 0 had 2).
Patch 0054-fix-speculative-disable-backend-sampling-for-MTP-drafts.patch disables backend draft sampling inside the MTP implementation until the backend sampler supports multi-output verification batches.

DGX smoke after patch:

rc=0.
Warning emitted: backend draft sampling is disabled for MTP.
n_drafted=5, n_accept=4, acceptance 80.000%.
Output tail: The capital of France is Paris, a city renowned for its rich history.

Normal inference gates after patch:

MoE md5: 8cb0ce23777bf55f92f63d0292c756b0.
Dense md5: 5951a5b4d624ce891e22ab5fca9bc439.

Decision:

Keep Phase 9 as an opt-in speculative smoke/fix only.
Do not enable MTP by default in LocalAI or llama-server.
Do not benchmark MTP as a parity win until a serving/API phase adds rollback gates for hybrid SSM/KV state and measures target verification throughput.

Phase 10 GDN C32 Slab Baseline and Source Check

Phase 10 starts a separate GDN prefill path; it does not reopen the rejected decode GDN_NW/GDN_CPW grid.

Current M5 baseline artifacts:

/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_moe_prefill.txt
/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_dense_prefill.txt
/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/summary_rows.txt
/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/provenance.txt

Current M5 baseline:

Model	PP	TG	B	S_PP t/s	S_TG t/s	S t/s
MoE	512	4	32	2314.18	359.16	2220.48
MoE	2048	4	32	2439.95	389.43	2415.16
Dense	512	4	32	978.97	143.56	936.71
Dense	2048	4	32	1023.61	184.09	1014.59

Source check:

A C32 M5 candidate cannot be implemented as a launcher-only shortcut.
The current M5 form-T apply path stores one 16-row tile of U=T*RHS in registers, syncs, then overwrites Ud. That is safe for C=16.
For C=32, a naive two-row-tile loop would overwrite RHS rows before all output rows are computed, and the current apply call only covers rowbase 0.
A correct C32 slab candidate must add a separate staging strategy for all C*DV_TILE U values, then run focused GATED_DELTA_NET op gates before any S_PP comparison.
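
The overwrite hazard can be reproduced with a toy in-place apply. This is a simplified model, assuming a small lower-triangular T and a buffer that holds RHS on entry and U on exit; it illustrates the data dependency only, not the M5 kernel.

```python
def matmul_rows(T, RHS, row_lo, row_hi):
    """Rows row_lo..row_hi-1 of T @ RHS as nested lists."""
    cols = len(RHS[0])
    return [
        [sum(T[r][k] * RHS[k][c] for k in range(len(RHS))) for c in range(cols)]
        for r in range(row_lo, row_hi)
    ]


def apply_in_place(T, buf, tile):
    """Naive loop: each row tile of U overwrites the buffer that still holds RHS."""
    for lo in range(0, len(buf), tile):
        buf[lo:lo + tile] = matmul_rows(T, buf, lo, lo + tile)
    return buf


def apply_staged(T, buf, tile):
    """Compute every row tile from the untouched RHS, then copy back once."""
    staged = []
    for lo in range(0, len(buf), tile):
        staged.extend(matmul_rows(T, buf, lo, lo + tile))
    buf[:] = staged
    return buf


T = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
rhs = [[1], [2], [3], [4]]
one_tile = apply_in_place(T, [row[:] for row in rhs], tile=4)   # single tile: safe
two_tiles = apply_in_place(T, [row[:] for row in rhs], tile=2)  # second tile reads overwritten rows
staged = apply_staged(T, [row[:] for row in rhs], tile=2)
print(one_tile, two_tiles, staged)
```

With one tile the whole product is formed before the write-back, which matches the C=16 case. With two tiles the second tile reads rows the first tile already replaced, so a staging buffer is needed.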

Decision:

A default-off C32 slab candidate was implemented and rejected by the performance gate.
The candidate was correctness-clean only after fixing a tail-chunk staging bug: rows t >= Cc in the staged U=T*RHS copy-back must be zeroed before state/output math. Before that fix, the dense gate produced a degenerate transcript even though the focused op gate passed.
After the tail fix, both default and forced-C32 modes matched the canonical md5 gates exactly:
- MoE: 8cb0ce23777bf55f92f63d0292c756b0.
- Dense: 5951a5b4d624ce891e22ab5fca9bc439.
KL was not needed because md5 stayed stable after the tail fix.

Correctness artifacts:

/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt
/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt
/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_default_after_tailfix.md5
/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_default_after_tailfix.md5
/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_c32_after_tailfix.md5
/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_c32_after_tailfix.md5

Performance A/B artifacts:

/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt
/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt
/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt
/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt

Performance A/B:

Model	Mode	PP	TG	B	S_PP t/s	S_TG t/s	S t/s
MoE	M5 base	512	4	32	2323.48	397.57	2239.39
MoE	C32 slab	512	4	32	2069.12	357.43	1995.06
MoE	M5 base	2048	4	32	2430.32	388.29	2405.66
MoE	C32 slab	2048	4	32	2054.86	388.01	2037.79
Dense	M5 base	512	4	32	975.10	140.53	932.19
Dense	C32 slab	512	4	32	866.29	144.03	833.87
Dense	M5 base	2048	4	32	1019.25	183.25	1010.26
Dense	C32 slab	2048	4	32	903.73	183.47	896.86
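The size of the regression follows directly from the S_PP column. A minimal sketch of that arithmetic, using only the values in the table above (the helper name `pct_delta` is illustrative):

```python
# (model, PP, S_PP of M5 base, S_PP of C32 slab), copied from the A/B table.
AB_ROWS = [
    ("MoE", 512, 2323.48, 2069.12),
    ("MoE", 2048, 2430.32, 2054.86),
    ("Dense", 512, 975.10, 866.29),
    ("Dense", 2048, 1019.25, 903.73),
]

def pct_delta(base, candidate):
    """Relative change of candidate versus base, in percent (negative = slower)."""
    return (candidate - base) / base * 100.0

deltas = {
    (model, pp): round(pct_delta(base, cand), 2)
    for model, pp, base, cand in AB_ROWS
}

for (model, pp), delta in deltas.items():
    print(f"{model:5s} PP={pp:4d}  S_PP delta = {delta:+.2f}%")
```

Every row comes out negative, at roughly 11% to 15% below the M5 base.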

Rejected diff:

/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff

Conclusion:

Do not ship Phase 10 C32 slab as implemented: S_PP regressed by roughly 11% to 15% in every A/B row.
C32 slab is not a maintainable shortcut toward parity because duplicated A/T recomputation per value slab outweighs the intended state-traffic reduction.
A future GDN prefill attempt should either share the A/T work across value slabs or switch to a different FLA-style chunk design; it should not repeat this env-gated two-slab M5 variant.

Phase 11 GDN M5 QS-Early Rejection

Phase 11 tested a smaller C=16 M5 scheduling shortcut instead of reopening C32. The change moves the QS = Qc * S0 state-boundary tensor-core pass earlier and stays default-off behind GDN_M5_QS_EARLY=1.

Correctness artifacts:

/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_default.txt
/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_qs_early.txt
/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_default.md5
/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_default.md5
/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_qs_early.md5
/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_qs_early.md5

Correctness result:

Default and QS-early paths matched the canonical md5 gates exactly:
- MoE: 8cb0ce23777bf55f92f63d0292c756b0.
- Dense: 5951a5b4d624ce891e22ab5fca9bc439.
KL was not needed.
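An md5 gate of this kind reduces to a byte-exact comparison against a stored digest. The sketch below assumes the gate hashes the generated transcript bytes. The function names are hypothetical and the real gate script is not shown in this document. The two digests are the canonical values quoted above.

```python
import hashlib

# Canonical digests as recorded in this document.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}

def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()

def gate_passes(model: str, transcript: bytes) -> bool:
    """True only when the transcript is byte-identical to the canonical run."""
    return md5_hex(transcript) == CANONICAL_MD5[model]

# Placeholder input: any transcript that differs from the canonical one fails the gate.
example_ok = gate_passes("moe", b"placeholder transcript, not the real output")
```

A matching digest means the output is byte-identical, so a distribution-level comparison such as KL adds nothing in that case.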

Performance artifacts:

/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_base.txt
/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_qs_early.txt
/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_base.txt
/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_qs_early.txt

Performance A/B:

Model	Mode	PP	TG	B	S_PP t/s	S_TG t/s	S t/s
MoE	M5 base	512	4	32	2325.67	355.60	2229.90
MoE	QS-early	512	4	32	2315.77	353.27	2220.16
MoE	M5 base	2048	4	32	2441.54	390.53	2416.80
MoE	QS-early	2048	4	32	2420.26	389.89	2395.94
Dense	M5 base	512	4	32	975.15	142.71	932.97
Dense	QS-early	512	4	32	968.23	144.24	927.17
Dense	M5 base	2048	4	32	1021.06	183.34	1012.04
Dense	QS-early	2048	4	32	1015.77	183.73	1006.88
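The tab-separated rows above can be paired and checked mechanically. A small sketch, assuming the table text is available as a string in exactly this layout. The helper `spp_deltas` is illustrative and is not part of any benchmark tooling.

```python
import csv
import io

# The A/B table above, with tabs written as \t.
TABLE = """\
Model\tMode\tPP\tTG\tB\tS_PP t/s\tS_TG t/s\tS t/s
MoE\tM5 base\t512\t4\t32\t2325.67\t355.60\t2229.90
MoE\tQS-early\t512\t4\t32\t2315.77\t353.27\t2220.16
MoE\tM5 base\t2048\t4\t32\t2441.54\t390.53\t2416.80
MoE\tQS-early\t2048\t4\t32\t2420.26\t389.89\t2395.94
Dense\tM5 base\t512\t4\t32\t975.15\t142.71\t932.97
Dense\tQS-early\t512\t4\t32\t968.23\t144.24\t927.17
Dense\tM5 base\t2048\t4\t32\t1021.06\t183.34\t1012.04
Dense\tQS-early\t2048\t4\t32\t1015.77\t183.73\t1006.88
"""

def spp_deltas(table_text, base_mode, cand_mode):
    """Pair rows by (Model, PP) and return the S_PP change of cand vs base in percent."""
    by_key = {}
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        key = (row["Model"], int(row["PP"]))
        by_key.setdefault(key, {})[row["Mode"]] = float(row["S_PP t/s"])
    return {
        key: (modes[cand_mode] - modes[base_mode]) / modes[base_mode] * 100.0
        for key, modes in by_key.items()
    }

deltas = spp_deltas(TABLE, "M5 base", "QS-early")
all_regress = all(delta < 0 for delta in deltas.values())
```

For this table all four pairs are negative, with S_PP between about 0.4% and 0.9% below the same-session M5 base.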

Rejected diff:

/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff

Conclusion:

Do not ship Phase 11 QS-early as implemented: S_PP regressed by roughly 0.4% to 0.9% in every same-session A/B row.
Merely moving the QS state-boundary product earlier is not enough; it remains an extra MMA pass and does not reduce the M5 critical path.
The next GDN attempt should skip local scheduling-only changes and scope a true shared-A/Ai blocked-solve or global-scratch design, with an explicit scratch/synchronization cost model before coding.
