GB10 Parity Phase 0 Results
Status: in progress.
Preflight
- DGX host: `promaxgb10-4ad8`
- Docker containers: `none`
- GPU compute apps: `none`
- GPU lock owner: `FREE released-by-claude-fp4norm-profile 1782828229`
- LocalAI worktree SHA: `d288a0300f36f7c126d62d997809bb03f297a3ac`
- Local llama.cpp fork SHA: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
- DGX artifact directory: `~/bench/reopen_phase0`
Baseline Runs
Clean prefill baseline artifacts:
- MoE: `~/bench/reopen_phase0/paged_moe_prefill.txt`
- Dense: `~/bench/reopen_phase0/paged_dense_prefill.txt`
MoE paged prefill:
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 4 | 32 | 16512 | 7.181 | 2281.66 | 0.355 | 360.57 | 7.536 | 2191.16 |
| 2048 | 4 | 32 | 65664 | 27.131 | 2415.53 | 0.328 | 390.84 | 27.459 | 2391.38 |
Dense paged prefill:
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 4 | 32 | 16512 | 16.749 | 978.18 | 0.842 | 152.03 | 17.591 | 938.64 |
| 2048 | 4 | 32 | 65664 | 63.791 | 1027.35 | 0.687 | 186.29 | 64.479 | 1018.38 |
Decode Difference-Method Reproduction
Paged llama.cpp artifacts:
- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg16.nsys-rep`
- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg16.bench.log`
- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg64.nsys-rep`
- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg64.bench.log`
Paged llama.cpp rows:
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 128 | 16 | 256 | 36864 | 14.933 | 2194.39 | 4.502 | 909.80 | 19.435 | 1896.81 |
| 128 | 64 | 256 | 49152 | 14.949 | 2191.96 | 17.924 | 914.09 | 32.873 | 1495.21 |
Paged difference-method decode:
- Token delta: `256 * (64 - 16) = 12288`
- Wall delta: `17.924 - 4.502 = 13.422 s`
- Decode throughput: `915.51 t/s`
vLLM artifacts:
- `~/bench/reopen_phase0/vllm_decode_nsys/vllm_version.txt`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.nsys-rep`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.run.log`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.kern.csv`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.gpu_trace.csv`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.nsys-rep`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.run.log`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.kern.csv`
- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.gpu_trace.csv`
vLLM version: 0.23.0
vLLM profiled rows:
| NSEQ | GEN | Generated tokens | Wall s | Logged tok/s |
|---|---|---|---|---|
| 256 | 16 | 4096 | 6.195 | 661.2 |
| 256 | 64 | 16384 | 17.607 | 930.5 |
vLLM difference-method decode:
- Token delta: `16384 - 4096 = 12288`
- Wall delta: `17.607 - 6.195 = 11.412 s`
- Decode throughput: `1076.76 t/s`
Clean reproduced paged/vLLM decode ratio: 85.0%.
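The difference-method arithmetic above can be reproduced with a short sketch. The function name and argument names are illustrative and not part of any harness; the inputs are the `T_TG s` and `Wall s` values from the two tables.

```python
def difference_decode_tps(n_seq, gen_short, gen_long, wall_short_s, wall_long_s):
    """Decode throughput from two runs that differ only in generated length.

    Subtracting the short run cancels costs common to both runs (prefill,
    startup), leaving only the extra decode tokens and the extra wall time.
    """
    token_delta = n_seq * (gen_long - gen_short)
    wall_delta_s = wall_long_s - wall_short_s
    return token_delta / wall_delta_s


# Values copied from the paged llama.cpp and vLLM rows above.
paged = difference_decode_tps(256, 16, 64, 4.502, 17.924)
vllm = difference_decode_tps(256, 16, 64, 6.195, 17.607)
print(f"paged {paged:.2f} t/s, vLLM {vllm:.2f} t/s, ratio {paged / vllm:.1%}")
```

With these inputs the sketch yields the 915.51 t/s and 1076.76 t/s figures and the 85.0% ratio recorded above.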
W4A16 Kill-Gate Baseline
Artifacts:
- Default FP4-MMQ: `~/bench/reopen_phase0/w4a16_off.txt`
- Forced W4A16 with debug: `~/bench/reopen_phase0/w4a16_on_thr64.txt`
- Forced W4A16 without debug: `~/bench/reopen_phase0/w4a16_on_thr64_nodebug.txt`
Default FP4-MMQ:
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 4 | 32 | 16512 | 7.105 | 2306.06 | 0.321 | 399.00 | 7.426 | 2223.68 |
| 2048 | 4 | 32 | 65664 | 27.047 | 2423.00 | 0.329 | 388.89 | 27.377 | 2398.55 |
Forced W4A16, LLAMA_W4A16_PREFILL_M=64, debug off:
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 4 | 32 | 16512 | 12.517 | 1308.92 | 0.321 | 398.82 | 12.838 | 1286.17 |
| 2048 | 4 | 32 | 65664 | 49.165 | 1332.98 | 0.330 | 387.57 | 49.495 | 1326.67 |
Delta:
- `npp=512`: `-43.2%` `S_PP` versus default FP4-MMQ.
- `npp=2048`: `-45.0%` `S_PP` versus default FP4-MMQ.
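As a check on the deltas, a minimal sketch of the relative-throughput calculation follows. The helper name is illustrative; the inputs are the `S_PP t/s` columns of the two tables above.

```python
def relative_delta_pct(candidate_tps, baseline_tps):
    """Signed percentage change of candidate throughput versus baseline."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


# (forced W4A16 S_PP, default FP4-MMQ S_PP) keyed by prompt length.
rows = {512: (1308.92, 2306.06), 2048: (1332.98, 2423.00)}
for npp, (w4a16, fp4_mmq) in rows.items():
    print(f"npp={npp}: {relative_delta_pct(w4a16, fp4_mmq):+.1f}% S_PP")
```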
Debug evidence:
- Forced W4A16 debug run emitted `19200` engagement lines.
- Observed `n_tiles` range: `139..282`.
- Observed `multi_tile_experts` range: `7..21`.
First implementation target:
- Option B: device-side or cached tile metadata.
- Rationale: `w4a16-gemm.cu` currently builds `h_tile_expert`, `h_tile_row0`, and `h_tile_rows` on the host, pool-allocates three device tile-map buffers, and issues three H2D `cudaMemcpyAsync` calls per grouped W4A16 launch. The debug run shows this path is repeatedly exercised across many small ragged tile maps. The first fork-first experiment should remove or amortize that host-built tile-map path before retuning MMA tile shapes.
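To make the tile-map idea concrete, here is a host-side sketch in Python of how a ragged tile map could be built and then packed into one buffer so that a single upload replaces three. This is an illustration only: the function names, the `bm=64` tile height, the row-offset convention, and the interleaved `(expert, row0, rows)` layout are assumptions, not the layout used in `w4a16-gemm.cu`.

```python
from array import array


def build_tile_map(rows_per_expert, bm=64):
    """Split each expert's contiguous row range into tiles of at most bm rows."""
    tile_expert, tile_row0, tile_rows = [], [], []
    row_base = 0  # first row of the current expert in the expert-sorted batch
    for expert, n_rows in enumerate(rows_per_expert):
        for off in range(0, n_rows, bm):
            tile_expert.append(expert)
            tile_row0.append(row_base + off)
            tile_rows.append(min(bm, n_rows - off))
        row_base += n_rows
    return tile_expert, tile_row0, tile_rows


def pack_tile_map(tile_expert, tile_row0, tile_rows):
    """Interleave the three arrays into one int buffer (hypothetical layout)."""
    packed = array("i")
    for expert, row0, rows in zip(tile_expert, tile_row0, tile_rows):
        packed.extend((expert, row0, rows))
    return packed


rows_per_expert = [0, 130, 5, 64, 0, 70]  # ragged: some experts are empty
te, r0, tr = build_tile_map(rows_per_expert)
packed = pack_tile_map(te, r0, tr)
multi_tile_experts = sum(1 for n in rows_per_expert if n > 64)
print(len(te), multi_tile_experts, len(packed))
```

A packed buffer of this kind would need one allocation and one host-to-device copy per launch instead of three, which is the cost the rationale above proposes to remove or amortize.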
W4A16 Metadata Phase 1
Fork commit: 4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17
(feat(paged): pack W4A16 grouped tile metadata).
LocalAI patch mirror: 0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch.
Mirror invariant: applying the full LocalAI patches/paged/*.patch series to
base pin 0ed235ea2c17a19fc8238668653946721ed136fd tree-matches fork HEAD
4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17.
Artifacts:
- Diff: `~/bench/w4a16_phase1/packed_desc.diff`
- Build mtimes: `~/bench/w4a16_phase1/build_binary_mtimes.txt`
- MoE gate: `~/bench/w4a16_phase1/gate_moe.md5`
- Dense gate: `~/bench/w4a16_phase1/gate_dense.md5`
- Default FP4-MMQ: `~/bench/w4a16_phase1/w4a16_off.txt`
- Packed W4A16: `~/bench/w4a16_phase1/w4a16_on_thr64.txt`
Canonical gates:
- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched expected)
- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched expected)
Packed descriptor A/B:
| Path | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| FP4-MMQ | 512 | 4 | 32 | 16512 | 7.114 | 2303.07 | 0.323 | 396.55 | 7.437 | 2220.32 |
| FP4-MMQ | 2048 | 4 | 32 | 65664 | 27.045 | 2423.23 | 0.331 | 387.14 | 27.376 | 2398.64 |
| W4A16 packed | 512 | 4 | 32 | 16512 | 12.468 | 1314.08 | 0.322 | 397.97 | 12.790 | 1291.04 |
| W4A16 packed | 2048 | 4 | 32 | 65664 | 48.930 | 1339.39 | 0.330 | 387.44 | 49.260 | 1333.00 |
Result:
- Packed descriptors improved forced W4A16 by `+0.39%` at `npp=512` and `+0.48%` at `npp=2048` versus the Phase 0 no-debug W4A16 baseline.
- W4A16 remains `-42.9%` at `npp=512` and `-44.7%` at `npp=2048` versus same-run default FP4-MMQ.
- Decision: keep patch `0048` as a small simplification, but pivot the next W4A16 iteration to the activation cast or MMA/dequant tile body.
W4A16 Kernel Shape Phase 2
Profile-guided target:
- Phase 1 forced W4A16 profile at `npp=512`: `w4a16_grouped_kernel` dominated at `5231.667 ms` (`47.8%`) while `w4a16_cast_act_f32_bf16` was `517.195 ms` (`4.7%`).
- Phase 2 therefore targeted grouped-kernel tile shape/body before activation cast fusion.
Shape sweep artifacts:
- Build: `~/llama-w4a16-phase2`
- Benchmarks: `~/bench/w4a16_phase2/shape_*.txt`
- Winning profile: `~/bench/w4a16_phase2/profile/w4a16_bm32_npp512.*`
Shape A/B:
| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
|---|---|---|---|
| `base` / `64x128` | 1308.02 | 1339.46 | old baseline |
| `bn256` | 1286.99 | 1311.56 | rejected |
| `bm32` / `32x128` | 1442.99 | 1475.65 | selected |
| `bn64` | 1334.80 | 1362.55 | diagnostic only |
| `stages3` | 1271.01 | 1295.96 | rejected |
| `bn256x16` | 1084.66 | 1100.95 | rejected |
Only bm32 and the old base selector are shipped in patch 0049. The other
candidate shapes were benchmarked in the Phase 2 build and then deliberately
left out to keep the upstream conflict surface small.
Default-verification after selecting bm32:
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 |
| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 |
Result:
- `bm32` improves forced W4A16 by about `+10.4%` at `npp=512` and `+10.2%` at `npp=2048` versus the old `64x128` shape in the same sweep.
- The profiled `bm32` grouped kernel dropped to `4107.355 ms` (`41.7%`) at `npp=512`, from Phase 1's `5231.667 ms` (`47.8%`).
- Canonical post-change gates matched: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`.
- Forced W4A16 shape gates matched each other: `LLAMA_W4A16_PREFILL_M=1` default `bm32` and `LLAMA_W4A16_SHAPE=base` both produced `07db32c2bcb78d17a43ed18bc22705cd` on the canonical gate prompt.
- Forced W4A16 `MUL_MAT_ID` op checks passed for both shapes: `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` reported `806/806` for default `bm32` and `806/806` for `base`.
- Decision: make `bm32` the W4A16 default shape while keeping `LLAMA_W4A16_SHAPE=base` for old-shape A/B and leaving other candidates as diagnostics.
Mirror invariant after patch 0049:
- Applying all 40 LocalAI `patches/paged/*.patch` files to base pin `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`.
- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`.
W4A16 Scale Broadcast Phase 3
Goal: reduce duplicate FP4 scale conversion inside w4a16_grouped_kernel by
having one lane per 4-lane group convert the ue4m3 scale and broadcast it with
__shfl_sync.
Artifacts:
- Build: `~/llama-w4a16-phase3`
- Logs: `~/bench/w4a16_phase3`
Gates:
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Forced W4A16 `bm32` and old `base` shape md5s matched each other: `07db32c2bcb78d17a43ed18bc22705cd`.
- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
Performance:
| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
|---|---|---|---|
| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
| Phase 3 scale-broadcast `bm32` | 1392.46 | 1422.74 | rejected |
| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
| Phase 3 scale-broadcast `base` | 1201.69 | 1221.25 | rejected |
Result:
- Rejected. No fork commit and no LocalAI patch `0050`.
- The local fork experiment was reverted.
- Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or scheduling cost exceeds the saved duplicate scale conversion.
W4A16 Shared-Memory Padding Phase 4
Goal: reduce bank pressure in w4a16_grouped_kernel by padding the A operand
shared-memory row stride while preserving math order and launch shape.
Fork commit: d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3
(feat(paged): pad W4A16 A shared tile stride).
LocalAI patch mirror: 0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch.
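Why a padded row stride can relieve bank pressure is easiest to see with a toy model. The sketch below assumes the usual NVIDIA arrangement of 32 shared-memory banks of 4-byte words and a hypothetical access pattern in which thread `t` of a warp reads column 0 of row `t`. The real kernel's stride, element type, and access pattern are not given here, so the numbers are illustrative only.

```python
from collections import Counter


def worst_bank_conflict(row_stride_words, n_threads=32, n_banks=32):
    """Largest number of threads in a warp that land on the same bank.

    Thread t is assumed to read word offset t * row_stride_words, i.e.
    column 0 of row t in a row-major shared-memory tile.
    """
    banks = Counter((t * row_stride_words) % n_banks for t in range(n_threads))
    return max(banks.values())


print(worst_bank_conflict(32))  # unpadded stride: every thread hits bank 0
print(worst_bank_conflict(33))  # one word of padding spreads rows across banks
```

In this model an unpadded stride of 32 words serializes the whole warp, while a stride of 33 words is conflict-free. Padding changes only the address layout, which is consistent with the goal of preserving math order and launch shape.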
Artifacts:
- Build: `~/llama-w4a16-phase4`
- Logs: `~/bench/w4a16_phase4`
Gates:
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Forced W4A16 `bm32` and old `base` shape md5s matched each other: `07db32c2bcb78d17a43ed18bc22705cd`.
- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
Performance:
| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
|---|---|---|---|
| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline |
| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected |
| Phase 2 `base` | 1310.13 | 1336.02 | baseline |
| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic |
Result:
- Kept. Default W4A16 `bm32` improves another `+1.7%` at `npp=512` and `+1.6%` at `npp=2048` versus Phase 2.
- Applying all 41 LocalAI `patches/paged/*.patch` files to base pin `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`.
- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`.
W4A16 Wq Padding Phase 5
Goal: test whether padding the quantized-weight shared-memory row stride gives
another low-conflict W4A16 grouped-kernel body win after 0050.
Artifacts:
- Build: `~/llama-w4a16-phase5`
- Logs: `~/bench/w4a16_phase5`
Gates:
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Forced W4A16 `bm32` and old `base` shape md5s matched each other: `07db32c2bcb78d17a43ed18bc22705cd`.
- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0.
Performance:
| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision |
|---|---|---|---|
| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | baseline |
| Phase 5 Wq-pad `bm32` | 1472.36 | 1504.82 | rejected: below 1% gate |
| Phase 4 A-pad `base` | 1337.88 | 1364.98 | baseline |
| Phase 5 Wq-pad `base` | 1337.70 | 1368.48 | diagnostic |
Result:
- Rejected. No fork commit and no LocalAI patch was created for that experiment.
- The local fork experiment was reverted.
- Do not ship Wq padding alone; the measured `+0.4%`/`+0.6%` default-shape gain is below the maintenance threshold.
Clean Build
First clean build attempt:
- PID: `625392`
- Source checkout: `~/llama-paged-reopen-clean`
- Result: failed during CMake configure.
- Root cause: `nvcc` was not discoverable on PATH. CUDA headers were found under `/usr/local/cuda/targets/sbsa-linux/include`, and the compiler exists at `/usr/local/cuda-13.0/bin/nvcc`.
- Retry plan: rebuild the clean checkout with `CUDACXX=/usr/local/cuda-13.0/bin/nvcc`.
Second clean build attempt:
- PID: `631100`
- Source checkout: `~/llama-paged-reopen-clean`
- Source status: `## HEAD (no branch)`
- Build HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
- CUDA compiler: `/usr/local/cuda-13.0/bin/nvcc`
- Result: succeeded.
- Binary mtimes:
  - `build-cuda/bin/llama-server 2026-06-30 22:14:34.091312112 +0200`
  - `build-cuda/bin/llama-batched-bench 2026-06-30 22:14:35.156287566 +0200`
  - `build-cuda/bin/llama-completion 2026-06-30 22:14:37.095750242 +0200`
  - `build-cuda/bin/test-backend-ops 2026-06-30 22:14:47.360078186 +0200`
Canonical Gates
- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched expected)
- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched expected)
- Artifacts:
  - `~/bench/reopen_phase0/gate_moe.txt`
  - `~/bench/reopen_phase0/gate_moe.md5`
  - `~/bench/reopen_phase0/gate_dense.txt`
  - `~/bench/reopen_phase0/gate_dense.md5`
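A gate of this kind can be checked mechanically. The sketch below assumes the md5 is taken over the raw bytes of the saved greedy transcript, which this log does not spell out; the helper names are illustrative and only the two expected digests come from the results above.

```python
import hashlib

# Expected digests as recorded in the canonical gates above.
EXPECTED_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(data: bytes) -> str:
    """Hex md5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def gate_passes(name: str, transcript: bytes) -> bool:
    """True when the transcript hashes to the canonical digest for `name`."""
    return md5_hex(transcript) == EXPECTED_MD5[name]


# Any transcript other than the canonical one is expected to fail the gate.
print(gate_passes("moe", b"not the canonical transcript"))
```

In practice the bytes would come from a file such as `gate_moe.txt`, read with `pathlib.Path(...).expanduser().read_bytes()`.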
Source Provenance
- Local llama.cpp fork: `/home/mudler/_git/llama.cpp`
- Branch: `localai-paged`
- Working tree: clean after fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`
- Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc`
- Current HEAD: `cd56cf037379b084d6bb0ed47db8b785c828be86`
- Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
- Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd`
- LocalAI patch count: `38` at Phase 0; current mirror count is `42` after patch `0051`.
- LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork HEAD.
- Tree hash after patch application: `623b7cb008a929455ca3d9deae35494c02622fef`
Existing Artifact Gap Review
Read-only DGX artifact inspection was performed after confirming the machine was
idle: docker ps returned no running containers,
nvidia-smi --query-compute-apps returned no compute-app rows, and
~/gpu_bench_lock/owner read
FREE released-by-claude-fp4norm-profile 1782828229.
Existing paged llama.cpp decode and prefill numbers are supported by
/home/mudler/bench/COMBINED_DEFINITIVE.txt: MoE paged prefill lines 13-18,
MoE paged serving decode lines 23-26, dense paged prefill lines 43-48, and
dense paged serving decode lines 53-56. Supporting comparison artifacts are
/home/mudler/bench/STOCK3WAY.txt, /home/mudler/bench/PREFILL_KNOB.txt,
/home/mudler/bench/DEFINITIVE_S3ab.txt, and the adjacent raw logs.
No self-contained vLLM 1078 t/s GPU-steady ntg16/ntg64
difference-method artifact was found. The available vLLM evidence is
serving-run output in /home/mudler/bench/COMBINED_DEFINITIVE.txt plus
nsys/run artifacts under /home/mudler/bench/profgap/ and
/home/mudler/bench/postssm_decomp/; these do not form a packaged
ntg16/ntg64 difference-method report.
W4A16/Marlin evidence exists in /home/mudler/bench/vllm_prefix.log,
/home/mudler/bench/profgap/vllm_moe_decode.run.log, and
/home/mudler/bench/marlin_gate/kl_marlin.log.
/home/mudler/llama-paged-dev/LEVER3_ACTQUANT_FUSION_RESULTS.md records the
parity conclusion: W4A16/Marlin is a precision-change lever, not a bit-exact
llama.cpp parity lever.
GDN M5/M8 evidence exists in /home/mudler/bench/COMBINED_DEFINITIVE.txt
(GDN CONFIG C (M8) and production defaults noting GDN M5),
/home/mudler/llama-paged-dev/LEVER1_GATHER_RESULTS.md, and
/home/mudler/llama-paged-dev/CONV_STATE_FUSION_RESULTS.md.
S3 evidence exists in /home/mudler/bench/DEFINITIVE_S3ab.txt; that A/B shows
S3-on was worse unless paired with LLAMA_PAGED_PREFILL_PERIOD=1, matching
/home/mudler/bench/COMBINED_DEFINITIVE.txt where S3 is recorded as off by
default. No separate self-contained adaptive-scheduling proof artifact was
found beyond the S3 and prefill-knob artifacts.
Open Items
Phase 6 Serving nsys Classifier
Exact fork head d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3 was mirrored to
/home/mudler/llama-phase6-source on DGX and rebuilt with CUDA Release,
CMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc, and
CMAKE_CUDA_ARCHITECTURES=121.
Pre-profile gates passed:
- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439`.
Serving nsys artifacts:
- llama.cpp: `/home/mudler/bench/phase6_serving_nsys/llama_server_n128/`.
- vLLM: `/home/mudler/bench/phase6_serving_nsys/vllm_server_n128/`.
Same h2h shape (n=128, ptok=128, gen=128) under nsys:
| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s |
|---|---|---|---|
| llama.cpp | 4.05 | 591.0 | 1567.4 |
| vLLM | 6.95 | 961.1 | 5073.6 |
llama.cpp bucket highlights:
- `gated_delta_net_cuda`: 33.7% GPU kernel time, 10.21s.
- NVFP4 `mul_mat_q`: 24.3% + 5.5% for the largest grouped variants, 9.04s combined.
- `quantize_mmq_nvfp4`: 2.7%, 0.81s.
- `flash_attn_tile`: 1.3%, 0.38s.
- CUDA API: `cudaStreamSynchronize` 76.5% API time, 23.66s over 106585 calls; 8028 synchronizes followed `cudaMemcpyAsync` and summed 21.41s.
vLLM bucket highlights:
- `fused_recurrent_gated_delta_rule_packed_decode_kernel`: 16.6%, 8.95s.
- `marlin_moe_wna16::Marlin`: 11.9% plus smaller Marlin-MoE variants.
- `flash_fwd_splitkv_kernel`: visible split-K FA decode rows at 0.6% + 0.1%.
- The vLLM delayed profile still contains startup/module-load API noise; prefer h2h and GPU kernel buckets over API percentages for vLLM.
Rejected Phase 6 sampler experiment:
- Patch idea: in backend distribution sampling, skip the random uniform upload when prior backend filters already collapsed candidates to one token (`temperature=0` path).
- Gates passed:
  - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`.
  - `MUL_MAT_ID`: `806/806` on CUDA0.
- Serving A/B did not clear the performance gate: no-nsys reps were `4.19` and `3.55` tok/s/seq. The fork patch was reverted; no commit and no LocalAI patch were created.
Next measured target:
- H3 is elevated above another W4A16/kernel-shape pass: llama.cpp spends 33.7% of GPU time in GDN decode versus vLLM's 16.6%, and vLLM remains 1.63x faster on aggregate decode for the same serving shape. Use existing `GDN_NW` and `GDN_CPW` controls to grid-search live-width-adaptive GDN launch parameters before changing source.
Phase 6 GDN Narrow-Serving Env Grid
Artifact: /home/mudler/bench/phase6_serving_nsys/gdn_grid/.
Clean binaries were rebuilt after reverting the rejected sampler experiment.
Grid shape was n=128, ptok=128, gen=64 to keep each isolated server run
bounded.
| Setting | decode tok/s/seq | decode agg tok/s | Decision |
|---|---|---|---|
| default | 3.91 | 647.9 | baseline |
| `GDN_NW=4 GDN_CPW=1` | 3.80 | 628.9 | reject |
| `GDN_NW=8 GDN_CPW=2` | 3.94 | 624.5 | reject |
| `GDN_NW=8 GDN_CPW=4` | 3.91 | 647.6 | reject |
| `GDN_NW=8 GDN_CPW=8` | 4.00 | 636.9 | no material win |
| `GDN_NW=16 GDN_CPW=4` | 3.85 | 637.5 | reject |
| `GDN_NW=16 GDN_CPW=8` | 3.96 | 652.0 | no material win |
Result:
- Rejected as an env-only lever. Existing GDN geometry variants are too close in this serving gate to justify a source change.
- Next focus moves back to the largest differentiating kernel bucket: llama.cpp's NVFP4 grouped `mul_mat_q` bucket (~30% GPU time) versus vLLM's Marlin-MoE bucket.
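The reject decisions for the GDN grid can be expressed as a simple threshold check on aggregate decode throughput. The sketch below is illustrative: the function names are hypothetical, and the `+5%` threshold is borrowed from the serving A/B gate stated in the Phase 8 sections of this log, assumed here to apply to this grid as well.

```python
def gain_pct(candidate, baseline):
    """Signed percentage gain of candidate over baseline."""
    return (candidate / baseline - 1.0) * 100.0


def clears_gate(candidate, baseline, min_gain_pct=5.0):
    """True when the candidate beats baseline by at least min_gain_pct."""
    return gain_pct(candidate, baseline) >= min_gain_pct


baseline_agg = 647.9  # default row, decode agg tok/s
grid = {
    "GDN_NW=4 GDN_CPW=1": 628.9,
    "GDN_NW=8 GDN_CPW=2": 624.5,
    "GDN_NW=8 GDN_CPW=4": 647.6,
    "GDN_NW=8 GDN_CPW=8": 636.9,
    "GDN_NW=16 GDN_CPW=4": 637.5,
    "GDN_NW=16 GDN_CPW=8": 652.0,
}
winners = [name for name, agg in grid.items() if clears_gate(agg, baseline_agg)]
best = max(grid, key=grid.get)
print(winners, best, f"{gain_pct(grid[best], baseline_agg):+.2f}%")
```

Under that assumed threshold no setting qualifies; the best aggregate result is well under a one percent gain over the default.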
Phase 6 MoE MMQ Tile Env Grid
Artifact: /home/mudler/bench/phase6_serving_nsys/mmq_grid/.
Shape: n=128, ptok=128, gen=64.
| Setting | decode tok/s/seq | decode agg tok/s | Decision |
|---|---|---|---|
| default | 3.90 | 645.3 | baseline |
| `LLAMA_MOE_AUTO_TILE=0` | 3.90 | 655.3 | tied/no material win |
| `LLAMA_MOE_DECODE_TILE=32` | 3.82 | 635.9 | reject |
| `LLAMA_MOE_DECODE_TILE=48` | 3.81 | 637.3 | reject |
| `LLAMA_MOE_DECODE_TILE=96` | 3.84 | 642.8 | reject |
| `LLAMA_MOE_DECODE_TILE=128` | 3.84 | 640.6 | reject |
| `LLAMA_MOE_MMQ_X=32` | 3.76 | 642.0 | reject; prefill worsened |
Result:
- Rejected as an env-only lever. Existing grouped-MMQ tile and auto-selector knobs do not materially close the serving gap.
- A source patch that only retunes the current tile selector is not justified. The next useful MoE lever would need a structural change closer to vLLM's Marlin-MoE/fused-MoE shape, or the work should move to the synchronous serving input/sampler path with a measurable non-greedy workload.
Open Items
- No current env-only lever clears the serving performance gate. Scope the next source candidate against either structural MoE decode fusion or async serving input/sampler uploads, with a workload that proves the target bucket matters.
- Phase 7 must keep the canonical MoE and dense md5 gates as the first inference-safety check before any performance result is accepted.
Phase 7 Source-Candidate Test Gate
Fork commit cd56cf037379b084d6bb0ed47db8b785c828be86 added patch
0051-test-paged-cover-MoE-swiglu-down-chain.patch. This is a test-only patch;
it does not change the production inference path.
Fresh DGX gates from /home/mudler/bench/phase7_source_scope/:
- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Baseline `MUL_MAT_ID`: `806/806`.
- New `MOE_SWIGLU_DOWN`: `7/7`.
The new gate covers the merged MoE gate_up -> SWIGLU -> down-projection graph shape needed before attempting a batched NVFP4 down-input quantization fusion.
Phase 7 SWIGLU-Down Fusion Candidate Rejected
Attempted candidate: fuse GGML_OP_GLU(SWIGLU) into the NVFP4 activation
quantization feeding the MoE down-projection MUL_MAT_ID, while keeping the
existing grouped-MMQ kernel. The patch was kept behind
GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1 during validation.
DGX artifacts:
- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_swiglu_down_optin.txt`
- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_after_optin.txt`
- `/home/mudler/bench/phase7_source_scope/default_gates_after_optin/`
- `/home/mudler/bench/phase7_source_scope/optin_gates/`
- `/home/mudler/bench/phase7_source_scope/serving_ab/`
Correctness and inference gates:
- Forced fusion `MOE_SWIGLU_DOWN`: `7/7`.
- Broad default `MUL_MAT_ID`: `806/806`.
- Default md5 after opt-in gating stayed canonical:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- Opt-in fusion md5:
  - MoE `07db32c2bcb78d17a43ed18bc22705cd`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
Serving A/B (n=128, ptok=128, gen=64, /v1/completions, --no-cache):
| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict |
|---|---|---|---|---|
| default | 3.92 | 657.1 | 1456.0 | baseline |
| `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1` | 3.88 | 667.4 | 1462.9 | reject; md5 drift and flat A/B |
Result:
- Rejected as a production patch. The opt-in path changes the paged-MoE md5 into the non-paged namespace and does not materially improve serving.
- Root-cause note for future attempts: the first fused-op gate failed because the fused quantizer used compact GLU-output strides to read split `gate`/`up` views. Split views stride over the merged gate/up tensor; using source-view strides fixed the op gate but not the end-to-end md5 drift.
Phase 7 Weighted-Combine Test Gate
Fork commit 3ef7eb9e4d added patch
0052-test-paged-cover-MoE-weighted-combine-chain.patch. This is a test-only
patch; it does not change the production inference path.
The new MOE_WEIGHTED_COMBINE whole-graph gate covers:
down MUL_MAT_ID -> router-weight ggml_mul -> rank-ordered expert views/adds.
DGX artifact:
- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_green.txt`

DGX result:
- `test-backend-ops test -b CUDA0 -o MOE_WEIGHTED_COMBINE -j 1`: `7/7`.
This gate is the correctness target for the next candidate: a deterministic post-down MoE weighted-combine fusion that preserves current f32 product and rank-order add semantics while avoiding the rejected SWIGLU/FP4-quantization shortcut.
Phase 7 Weighted-Combine Fusion Candidate Rejected
Attempted candidate: fuse the post-down MoE router-weight multiply and rank-ordered add fan-in:
ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in.
The candidate was fork-first, default-on during validation, and had a rollback
env switch: LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1.
DGX artifacts:
- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_orderfix.txt`
- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt`
- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_gates_chat/`
- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_nsys_completion/`
- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_serving_ab/`
- Rejected diff: `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff`
Correctness and inference gates:
- `MOE_WEIGHTED_COMBINE`: `7/7`.
- Broad `MUL_MAT_ID`: `806/806`.
- Canonical transcript md5:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
Nsight proof:
- Disabled run: no `k_moe_weighted_combine` kernels.
- Fused run: `110` `k_moe_weighted_combine` launches.
Serving A/B (n=128, ptok=128, gen=64, /v1/completions):
| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict |
|---|---|---|---|---|
| `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` | 2.63 | 417.5 | 1345.2 | baseline |
| fused default | 2.63 | 417.0 | 1346.9 | reject; kernel fires but A/B is flat |
Result:
- Rejected as a production patch. The patch is md5-safe and the kernel fires, but it does not improve the bounded serving workload. Keep patch `0052` as a useful regression gate; do not retry this exact fan-in-only fusion unless a fresh profile shows the weighted/add fan-in as a material bucket.
Phase 8 Ragged MoE Dispatch Scope
Plan: docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md.
The next candidate is profile-gated before source work:
- Target a fused routed-expert `MUL_MAT_ID` dispatch path for ragged serving decode, not another post-down fan-in fusion.
- First decompose live llama.cpp and vLLM MoE serving at `n=128`, `ptok=128`, `gen=64` with Nsight and `/home/mudler/bench/bucket.py`.
- Promote only if `mm_ids_helper`, activation quant/gather, grouped MMQ, or related MoE dispatch rows are material and not hidden by GDN or FA.
- Keep the backend-sampling/logit-bias upload cache as a non-default follow-up; it requires `--backend-sampling` and request `backend_sampling: true` with non-empty `logit_bias` or `ignore_eos`.
Required promotion gates remain:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`.
- `MUL_MAT_ID`: `806/806` on CUDA0.
- Any fused dispatch prototype must start default-off behind `LLAMA_MOE_FUSED_DISPATCH=1`.
Profile-gate result:
- Clean llama.cpp artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`.
- vLLM artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`.
- A stale first llama profile under `llama_n128/` is intentionally ignored because the binary still contained the rejected weighted-combine kernel before the clean-source rebuild.
Throughput:
| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s |
|---|---|---|---|
| llama.cpp | 2.70 | 412.1 | 1368.3 |
| vLLM | 7.02 | 1036.6 | 5277.7 |
llama.cpp bucket highlights from the clean profile:
- GDN: `4680.27 ms`, `38.12%`.
- `mmq_nvfp4`: `2745.11 ms`, `22.36%`.
- `act_quant`: `441.42 ms`, `3.60%`.
- MoE dispatch: `183.67 ms`, `1.50%`.
- `ew_add` fan-in: `280.15 ms`, `2.28%`.
Decision:
- Promote to a test-only ragged `MUL_MAT_ID` gate before production source.
- Do not implement fused dispatch yet. Standalone `mm_ids`/`gather_mmq` helper time is small; a source patch must reduce the larger grouped-MMQ/activation movement bucket and still beat the `+5%` serving A/B gate.
Phase 8 Ragged MoE Dispatch Test Gate
Fork commit e21732fc4 added patch
0053-test-paged-cover-ragged-MoE-dispatch.patch. This is a test-only patch;
it does not change the production inference path.
The new MUL_MAT_ID_RAGGED_MOE gate covers:
- one small F32 wiring case,
- NVFP4 with `n_mats=256`, `n_used=8`, `m=768`, `k=2048`, `n in {1, 8, 33, 128, 257}`,
- deterministic unique top-k ids skewed toward hot experts, including expert `255`, leaving many experts empty.
DGX artifact:
- `/home/mudler/bench/phase8_ragged_moe_dispatch/test_backend_ops_mul_mat_id_ragged_moe_fixed.txt`

DGX result:
- `test-backend-ops test -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE -j 1`: `6/6`.
Debug note:
- The first version of the gate failed because the deterministic IDs produced duplicate expert IDs within token 0. That is not a valid top-k routing shape and caused a CPU/CUDA mismatch followed by a CUDA fault. The committed gate preserves unique expert IDs per token while keeping cross-token load skew.
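The routing property the gate needs (unique expert IDs within each token, skewed load across tokens) can be illustrated with a small generator. This is a hypothetical construction, not the one in patch `0053`: the hot-set size, the rotation constants, and the function name are assumptions chosen only to show the invariant.

```python
def skewed_topk_ids(n_tokens, n_mats=256, n_used=8, n_hot=4):
    """Deterministic top-k expert ids: unique per token, skewed across tokens.

    Assumes n_hot < n_used and n_hot + 16 + n_used < n_mats - 1, so the hot
    set, the rotating cold window, and the last expert never collide.
    """
    ids = []
    for t in range(n_tokens):
        row = list(range(n_hot))        # hot experts shared by every token
        if t % 2 == 0:
            row[-1] = n_mats - 1        # even tokens also route to the last expert
        cold_start = n_hot + (t * 3) % 16
        # Fill the remaining slots with consecutive, distinct colder experts.
        row.extend(range(cold_start, cold_start + n_used - len(row)))
        ids.append(row)
    return ids


ids = skewed_topk_ids(33)
used = {e for row in ids for e in row}
print(len(used), all(len(set(row)) == len(row) for row in ids))
```

Checking `len(set(row)) == len(row)` for every token is the invariant that the first, failing version of the gate violated.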
Production-source decision:
- Do not start a Phase 8 production CUDA patch yet.
- Code inspection found that the existing native-FP4 MoE path already de-dups broadcast activation quantization when `ne11 == 1`, then gathers FP4 blocks before grouped MMQ.
- The measured helper rows are small (`mm_ids=0.66%`, `gather_mmq=0.42%`). A metadata-only fused-dispatch hook would not plausibly clear the `+5%` serving A/B gate.
- A future source candidate must reduce `mmq_nvfp4` (`22.36%`) or `act_quant` (`3.60%`) directly, without D2H id readback, new stream synchronizations, or md5 drift.
Phase 9 MTP Draft Smoke Gate
Phase 9 challenged the older "MTP absent" assumption. The current fork has
Qwen3.5/3.6 draft-mtp support and the DGX MoE GGUF contains MTP metadata and
tensors:
- `qwen35moe.nextn_predict_layers`
- `blk.40.nextn.eh_proj.weight`
- `blk.40.nextn.shared_head_norm.weight`
- `blk.40.nextn.enorm.weight`
- `blk.40.nextn.hnorm.weight`
Smoke artifacts:
- Failing default pre-patch: `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke.err`.
- Passing explicit CPU-sampled draft: `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_no_backend_sampling.err`.
- Passing default after patch: `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_default_after_patch.err`.
Finding:
- `draft-mtp` runs with the current model when backend draft sampling is off.
- The default path previously emitted: `backend sampling requires at most one output token per sequence (seq_id 0 had 2)`.
- Patch `0054-fix-speculative-disable-backend-sampling-for-MTP-drafts.patch` disables backend draft sampling inside the MTP implementation until the backend sampler supports multi-output verification batches.
DGX smoke after patch:
- `rc=0`.
- Warning emitted: `backend draft sampling is disabled for MTP`.
- `n_drafted=5`, `n_accept=4`, acceptance `80.000%`.
- Output tail: `The capital of France is Paris, a city renowned for its rich history.`
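The acceptance figure is the share of drafted tokens that the target model accepted. A minimal sketch, with an illustrative helper name:

```python
def acceptance_pct(n_drafted, n_accept):
    """Percentage of drafted tokens accepted by target verification."""
    return 100.0 * n_accept / n_drafted


# Counters from the smoke run above.
print(f"{acceptance_pct(5, 4):.3f}%")
```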
Normal inference gates after patch:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
Decision:
- Keep Phase 9 as an opt-in speculative smoke/fix only.
- Do not enable MTP by default in LocalAI or llama-server.
- Do not benchmark MTP as a parity win until a serving/API phase adds rollback gates for hybrid SSM/KV state and measures target verification throughput.
Phase 14 MTP Rollback and Inference-Safety Gate
Phase 14 tested the missing safety question from Phase 9: whether MTP speculative rejection can run against the actual Qwen3.6 MoE GGUF without corrupting paged KV or recurrent GDN state.
Artifacts:
- `/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err`
- `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`
- `/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.out`
- `/home/mudler/bench/phase14_mtp_rollback/mtp_n{8,16,24,48}.out`
- `/home/mudler/bench/paged_inference_gates/20260701_041117`
Safety evidence:
- `test-recurrent-state-rollback` on `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` exited `0` and logged `recurrent rollback checkpoint restored successfully`.
- MTP stderr logged bounded recurrent rollback support: `the context supports bounded partial sequence removal`.
- MTP partial rejection occurred at `temp=0`: `n_drafted=39`, `n_accept=20`, `accept=51.282%`.
- The backend sampler multi-output error stayed absent; the expected `backend draft sampling is disabled for MTP` warning was present.
- Raw greedy text was prefix-equivalent after normalization for `n=8,16,24,32,48`; no first differing token was found. Exact transcript md5 is not used for this cross-frontend gate because `llama-speculative-simple` emits accepted token groups and can overrun `llama-completion -no-cnv` for the same `-n`.
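The prefix-equivalence check described above can be sketched as follows. This is a minimal illustration, not the harness used for the gate: the whitespace-collapsing `normalize` step and the character-level comparison are assumptions, since the notes do not specify how normalization or token comparison was done.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Collapse whitespace runs (assumed normalization; the real gate may differ)."""
    return re.sub(r"\s+", " ", text).strip()


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the index of the first differing character within the shared
    length, or None when one normalized text is a prefix of the other."""
    a, b = normalize(a), normalize(b)
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    return None


# The speculative frontend may overrun the baseline for the same -n,
# so only the shared prefix is compared.
baseline = "The capital of France is Paris"
speculative = "The capital  of France is Paris, a city renowned"
print(first_divergence(baseline, speculative))
```

A `None` result corresponds to "no first differing token was found"; an integer would point at the first mismatch to inspect.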
Normal inference gates after Phase 14:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
Decision:
- MTP rollback safety is green enough to scope a Phase 15 serving/API throughput gate.
- Do not enable MTP by default.
- Do not count MTP as a GB10 speed-parity win until serving results show useful target-verification throughput under the canonical inference gates.
Phase 15 MTP Serving Throughput Gate
Phase 15 measured the direct llama-server serving path after Phase 14 proved
rollback safety. The test compared two same-shape arms:
- baseline: no speculative decoding,
- MTP: `--spec-type draft-mtp --spec-draft-n-max 3 --no-spec-draft-backend-sampling`.
Artifact:
/home/mudler/bench/phase15_mtp_serving/20260701_042005
Harness:
- `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh`
- `NPL="8 32 128" PTOK=128 GEN=128 CTX=131072 PARALLEL=128`
- client: `/home/mudler/bench/h2h_cli3.py` against `/v1/completions`
Result:
| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | TTFT mean ms | wall s |
|---|---|---|---|---|---|---|
| baseline | 8 | 192.5 | 247.8 | 30.70 | 1181.1 | 5.318 |
| MTP | 8 | 92.9 | 109.8 | 14.26 | 1691.5 | 11.017 |
| baseline | 32 | 305.4 | 406.0 | 12.02 | 2762.2 | 13.412 |
| MTP | 32 | 95.8 | 111.7 | 3.61 | 4545.6 | 42.727 |
| baseline | 128 | 429.5 | 662.4 | 4.31 | 7747.2 | 38.144 |
| MTP | 128 | 100.3 | 138.5 | 0.97 | 20385.7 | 163.289 |
MTP did actually run:
- server initialized `draft-mtp` with bounded partial sequence removal,
- response/server timings included draft counters,
- server log tail included `#gen tokens = 17293`, `#acc tokens = 15493`.
Normal inference gates before and after the A/B:
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`.
Decision:
- Reject current `llama-server` MTP as a GB10 serving parity lever.
- Do not enable MTP by default in LocalAI or llama-server.
- Do not tune `spec-draft-n-max` blindly. The regression is large enough that the next MTP phase, if any, must start with graph/batch-shape profiling.
Likely root cause:
- Baseline serving preserved heavy graph reuse (`graphs reused = 361` in the `n=128` tail).
- MTP serving showed `graphs reused = 1` and high per-slot eval time at high concurrency.
- The working hypothesis is that MTP verification/draft batch shape churn defeats the paged decode graph-reuse wins, so extra verification dominates despite high draft acceptance.
Phase 16 MTP Graph-Reuse Profile
Phase 16 profiled the Phase 15 hypothesis with
nsys --cuda-graph-trace=node on a smaller direct serving shape:
- server: `-c 32768 -b 2048 -ub 512 --parallel 32`,
- client: `h2h_cli3.py -n 8 --ptok 64 --gen 64`,
- arms: baseline vs `--spec-type draft-mtp --spec-draft-n-max 3`.
Artifact:
/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016
Result:
| arm | decode agg t/s | decode per-seq t/s | wall s | graph reuse |
|---|---|---|---|---|
| baseline | 230.5 | 28.07 | 3.523 | graphs reused = 62 |
| MTP | 97.7 | 12.83 | 7.049 | graphs reused = 1 |
MTP drafted and accepted tokens:
- `draft acceptance = 0.81481 (44 accepted / 54 generated)`,
- `#gen tokens = 460`,
- `#acc tokens = 346`.
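As a sanity check on the counters above, the reported acceptance can be recomputed from the accepted/generated pair. The sketch below parses the log line exactly as quoted in these notes; the regular expression is an assumption about that one line's layout, not a documented server log format.

```python
import re

# Log line copied from the Phase 16 notes.
LINE = "draft acceptance = 0.81481 (44 accepted / 54 generated)"


def parse_acceptance(line: str):
    """Return (reported_ratio, recomputed_ratio) for one acceptance line."""
    m = re.search(
        r"draft acceptance = ([0-9.]+) \((\d+) accepted / (\d+) generated\)", line
    )
    if m is None:
        raise ValueError("not a draft acceptance line")
    reported = float(m.group(1))
    accepted, generated = int(m.group(2)), int(m.group(3))
    return reported, accepted / generated


reported, recomputed = parse_acceptance(LINE)
print(f"reported={reported:.5f} recomputed={recomputed:.5f}")
```

The two values agree to the printed precision, so the high acceptance is real and the throughput regression has to come from elsewhere.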
Nsight kernel summaries also show materially more GPU work in the MTP run:
roughly 5.89 s top-level GPU kernel time versus 2.59 s for the baseline
small profile.
Decision:
- Phase 16 supports the Phase 15 root-cause hypothesis: current MTP serving defeats the paged decode graph-reuse advantage and increases GPU work.
- A future source phase must start at speculative verification batch shapes and graph-reuse keys, not at MTP draft-length tuning.
Phase 10 GDN C32 Slab Baseline and Source Check
Phase 10 starts a separate GDN prefill path; it does not reopen the rejected
decode GDN_NW/GDN_CPW grid.
Current M5 baseline artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_moe_prefill.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_dense_prefill.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/summary_rows.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/provenance.txt`
Current M5 baseline:
| Model | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|---|---|---|---|---|---|---|
| MoE | 512 | 4 | 32 | 2314.18 | 359.16 | 2220.48 |
| MoE | 2048 | 4 | 32 | 2439.95 | 389.43 | 2415.16 |
| Dense | 512 | 4 | 32 | 978.97 | 143.56 | 936.71 |
| Dense | 2048 | 4 | 32 | 1023.61 | 184.09 | 1014.59 |
Source check:
- A C32 M5 candidate cannot be implemented as a launcher-only shortcut.
- The current M5 form-T apply path stores one 16-row tile of `U=T*RHS` in registers, syncs, then overwrites `Ud`. That is safe for `C=16`.
- For `C=32`, a naive two-row-tile loop would overwrite RHS rows before all output rows are computed, and the current apply call only covers rowbase `0`.
- A correct C32 slab candidate must add a separate staging strategy for all `C*DV_TILE` U values, then run focused `GATED_DELTA_NET` op gates before any S_PP comparison.
Decision:
- A default-off C32 slab candidate was implemented and rejected by the performance gate.
- The candidate was correctness-clean only after fixing a tail-chunk staging bug: rows `t >= Cc` in the staged `U=T*RHS` copy-back must be zeroed before state/output math. Before that fix, the dense gate produced a degenerate transcript even though the focused op gate passed.
- After the tail fix, both default and forced-C32 modes matched the canonical md5 gates exactly:
  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed because md5 stayed stable after the tail fix.
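The tail-chunk bug can be illustrated with a toy staging buffer. This is only a sketch of the failure mode, not the CUDA kernel: the chunk size of 4, the helper names, and the use of a column sum as a stand-in for the state/output math are all assumptions made for illustration.

```python
C = 4  # toy chunk rows (the rejected candidate used C=32)


def copy_back(staging, u_valid, zero_tail):
    """Write the valid rows of U into the staging buffer in place.

    `staging` has C rows and may hold stale values from a previous chunk.
    Rows t >= Cc are zeroed only when `zero_tail` is set.
    """
    cc = len(u_valid)
    for t in range(len(staging)):
        if t < cc:
            staging[t] = list(u_valid[t])
        elif zero_tail:
            staging[t] = [0.0] * len(staging[t])


def column_sums(staging):
    """Stand-in for downstream math that reads all C rows of the chunk."""
    return [sum(col) for col in zip(*staging)]


stale = [[9.0, 9.0] for _ in range(C)]  # leftovers from a full previous chunk
tail_chunk = [[1.0, 2.0], [3.0, 4.0]]   # Cc = 2 valid rows

buggy = [row[:] for row in stale]
copy_back(buggy, tail_chunk, zero_tail=False)

fixed = [row[:] for row in stale]
copy_back(fixed, tail_chunk, zero_tail=True)

print(column_sums(buggy), column_sums(fixed))
```

Without the zeroing, stale rows leak into any reduction over the full chunk. That is consistent with a focused op gate passing on full chunks while a transcript gate with a ragged tail degenerates.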
Correctness artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_default_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_default_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_c32_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_c32_after_tailfix.md5`
Performance A/B artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`
Performance A/B:
| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|---|---|---|---|---|---|---|---|
| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
Rejected diff:
/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff
Conclusion:
- Do not ship Phase 10 C32 slab as implemented.
- C32 slab is not a maintainable shortcut toward parity because duplicated A/T recomputation per value slab outweighs the intended state-traffic reduction.
- A future GDN prefill attempt should either share the `A/T` work across value slabs or switch to a different FLA-style chunk design; it should not repeat this env-gated two-slab M5 variant.
Phase 11 GDN M5 QS-Early Rejection
Phase 11 tested a smaller C=16 M5 scheduling shortcut instead of reopening C32:
move the QS = Qc * S0 state-boundary tensor-core pass earlier and keep it
default-off behind GDN_M5_QS_EARLY=1.
Correctness artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_default.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_qs_early.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_default.md5`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_default.md5`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_qs_early.md5`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_qs_early.md5`
Correctness result:
- Default and QS-early paths matched canonical md5 exactly:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed.
Performance artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_base.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_qs_early.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_base.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_qs_early.txt`
Performance A/B:
| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|---|---|---|---|---|---|---|---|
| MoE | M5 base | 512 | 4 | 32 | 2325.67 | 355.60 | 2229.90 |
| MoE | QS-early | 512 | 4 | 32 | 2315.77 | 353.27 | 2220.16 |
| MoE | M5 base | 2048 | 4 | 32 | 2441.54 | 390.53 | 2416.80 |
| MoE | QS-early | 2048 | 4 | 32 | 2420.26 | 389.89 | 2395.94 |
| Dense | M5 base | 512 | 4 | 32 | 975.15 | 142.71 | 932.97 |
| Dense | QS-early | 512 | 4 | 32 | 968.23 | 144.24 | 927.17 |
| Dense | M5 base | 2048 | 4 | 32 | 1021.06 | 183.34 | 1012.04 |
| Dense | QS-early | 2048 | 4 | 32 | 1015.77 | 183.73 | 1006.88 |
Rejected diff:
/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff
Conclusion:
- Do not ship Phase 11 QS-early as implemented.
- Merely moving the QS state-boundary product earlier is not enough; it remains an extra MMA pass and does not reduce the M5 critical path.
- The next GDN attempt should skip local scheduling-only changes and scope a true shared-A/Ai blocked-solve or global-scratch design, with an explicit scratch/synchronization cost model before coding.
Phase 12 GDN Shared-A/Ai Cost Model
Phase 12 evaluated whether a real shared-A/Ai design is credible enough to prototype after the C32 slab and QS-early shortcut rejections.
Cost-model doc:
backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md
Metadata artifact:
/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt
Model dimensions:
| Model | GDN layers | H | S_v | Metadata basis |
|---|---|---|---|---|
| MoE | 30 inferred | 32 inferred | 128 | ssm.inner_size=4096, ssm.state_size=128 |
| Dense | 48 inferred | 48 inferred | 128 | ssm.inner_size=6144, ssm.state_size=128 |
Dynamic-smem result for S_v=128:
| Shape | Bytes | KiB | Fits GB10 dynamic smem? |
|---|---|---|---|
| C16 full-width | 93,376 | 91.19 | yes |
| C32 full-width | 127,360 | 124.38 | no |
| C32 slab64 + U staging | 94,592 | 92.38 | yes |
Ai scratch result at `npp=2048`, `npl=32`, `BT=32`, f32:
| Model | Ai scratch MiB | 3x Ai traffic MiB |
|---|---|---|
| MoE | 256.0 | 768.0 |
| Dense | 384.0 | 1152.0 |
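The scratch figures above can be reproduced with a small size model. This is a sketch under stated assumptions: one `BT x BT` f32 Ai block per chunk per head, `H = ssm.inner_size / ssm.state_size`, and sizes counted for a single GDN layer. These assumptions reproduce the table, but the authoritative model is the cost-model doc, and the function names here are hypothetical.

```python
MIB = 1024 * 1024


def heads_from_metadata(inner_size: int, state_size: int) -> int:
    """Infer the head count H from GGUF SSM metadata (assumed relation)."""
    return inner_size // state_size


def ai_scratch_mib(npp: int, npl: int, heads: int, bt: int = 32, elem_bytes: int = 4) -> float:
    """Scratch for one BT x BT Ai block per chunk per head, in MiB."""
    chunks = (npp // bt) * npl  # chunks per head across all sequences
    return chunks * heads * bt * bt * elem_bytes / MIB


for name, inner in (("MoE", 4096), ("Dense", 6144)):
    h = heads_from_metadata(inner, 128)
    scratch = ai_scratch_mib(npp=2048, npl=32, heads=h)
    # "3x Ai traffic" is taken as three passes over the scratch buffer.
    print(f"{name}: H={h} scratch={scratch:.1f} MiB traffic={3 * scratch:.1f} MiB")
```

The model makes the scaling explicit: scratch grows linearly in `npp * npl * H * BT`, so halving `BT` or the element size halves the footprint.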
Decision:
- GO for a default-off Phase 13 global-Ai32 prototype.
- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`.
- The prototype must be rejected if it is flat or slower; do not iterate into f16/BF16 Ai unless f32 proves the schedule can win.
Phase 13 GDN Global-Ai32 Prototype Rejection
Phase 13 implemented the Phase 12 design in the llama.cpp fork as a default-off
prototype behind GDN_GLOBAL_AI32=1.
Implementation summary:
- Added an f32 Ai precompute kernel.
- Added C32, `dv_tile=64` slab consumption through the chunked GDN path.
- Allocated Ai scratch from the ggml CUDA pool only for supported calls.
- Kept the default C16 M5 path unchanged.
Correctness artifacts:
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_default.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_global_ai32.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_default.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_default.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_global_ai32.md5`
- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_global_ai32.md5`
Correctness result:
- Default and Global-Ai32 paths matched canonical md5 exactly:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed.
Performance artifacts:
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt`
- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt`
Performance A/B:
| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|---|---|---|---|---|---|---|---|
| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 |
| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 |
| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 |
| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 |
| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 |
| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 |
| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 |
| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 |
Rejected diff:
/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff
Conclusion:
- Do not ship Phase 13 Global-Ai32 as implemented.
- The global scratch split is correctness-safe but slower than shipped C16 M5.
- Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware assumptions that do not fit this GB10 patch stack without a regression.
Phase 8 Ragged MoE Dispatch Safety Rerun
Phase 8 had already closed the live ragged MoE helper path by profile:
mm_ids=0.66%, gather_mmq=0.42%, while mmq_nvfp4=22.36% and
act_quant=3.60%. The only source patch kept from the phase is the test gate
(0053-test-paged-cover-ragged-MoE-dispatch.patch); the metadata-only
LLAMA_MOE_FUSED_DISPATCH shortcut is rejected.
Rerun artifacts:
- `/home/mudler/bench/phase8_ragged_moe_dispatch/ragged_gate_rerun_20260701_035529.txt`
- `/home/mudler/bench/phase8_ragged_moe_dispatch/safety_rerun_20260701_035549/`
Safety result:
- `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
Conclusion:
- The inferencing gates remain canonical on the unchanged production path.
- Do not add a metadata/helper-only fused-dispatch hook. A future Phase 8 production candidate must reduce `mmq_nvfp4` or activation movement directly, stay free of D2H id readback and new stream synchronizations, and then pass the same md5/op gates before any serving A/B is considered.
Phase 18 MTP Shape Trace
Phase 18 implemented the Phase 17 instrumentation-only recommendation as
patch 0055-feat-server-trace-speculative-batch-shapes.patch.
Implementation summary:
- Added default-off `LLAMA_SPEC_SHAPE_TRACE=1` logging in `server_slot::handle_last_sampled_token()`.
- Normal decode logs one row/output per slot.
- MTP verification logs `K + 1` rows/outputs per speculative slot, including draft length and `slot.spec_i_batch` range.
- No scheduler, graph-key, KV, logits, acceptance, or rollback behavior changed.
Red/green trace artifacts:
- Red check before patch: `/home/mudler/bench/phase18_mtp_shape_trace_red`
- Green check after patch: `/home/mudler/bench/phase18_mtp_shape_trace_green`
Green trace sample:
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5
spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6
spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9
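A trace row like the ones above is a flat list of `key=value` tokens, so it can be parsed without any project-specific tooling. The sketch below is an assumed post-processing helper, not part of patch 0055; it also checks the `K + 1` relation between `rows` and `draft` described in the implementation summary.

```python
def parse_spec_shape(line: str) -> dict:
    """Parse one `spec shape:` trace row into a dict of ints and strings."""
    prefix = "spec shape:"
    if not line.startswith(prefix):
        raise ValueError("not a spec shape line")
    fields = {}
    for token in line[len(prefix):].split():
        key, _, value = token.partition("=")
        fields[key] = int(value) if value.isdigit() else value
    return fields


# Rows copied from the green trace sample.
SAMPLE = [
    "spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5",
    "spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9",
]

parsed = [parse_spec_shape(line) for line in SAMPLE]
for row in parsed:
    # Verification should emit K + 1 rows for a draft of length K.
    print(row["kind"], row["rows"], row["draft"], row["rows"] == row["draft"] + 1)
```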
Disabled-env check:
- `LLAMA_SPEC_SHAPE_TRACE` unset emitted no `spec shape:` lines.
Inference gate artifact:
/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after
Safety result:
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
Conclusion:
- Patch 0055 is safe instrumentation and does not break inferencing on the canonical gated paths.
- The trace confirms per-step MTP verification shape variation even in a tiny request (`rows=4` and `rows=3`).
- A follow-up scheduler experiment is not yet justified. First use this trace under real serving load to measure draft-length bucket entropy.
Phase 19 MTP Serving Shape Entropy
Phase 19 ran Phase 18's shape trace under the direct serving harness with
LLAMA_SPEC_SHAPE_TRACE=1, NPL="8 32 128", GEN=64, and PTOK=128.
Artifact:
/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534
Pre/post gate result:
- Pre-gate and post-gate both passed.
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
Serving A/B:
| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms |
|---|---|---|---|---|---|
| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 |
| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 |
| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 |
Shape entropy summaries:
- `shape_entropy_summary.tsv`
- `step_shape_summary.tsv`
Per-slot draft distribution:
| window | verify slots | draft counts | top draft share | unique batch_before |
|---|---|---|---|---|
| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 |
| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 |
| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 |
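The "top draft share" column and a draft-length entropy can be recomputed directly from the draft counts. The sketch below uses Shannon entropy in bits as one plausible reading of "bucket entropy"; whether `shape_entropy_summary.tsv` uses the same definition is an assumption.

```python
from math import log2

# Draft-length histograms copied from the per-slot table.
DRAFT_COUNTS = {
    "n8": {1: 4, 2: 2, 3: 156},
    "n32": {1: 8, 2: 11, 3: 591},
    "n128": {1: 40, 2: 49, 3: 2264},
}


def top_share(counts: dict) -> float:
    """Fraction of verify slots that fall in the most common draft length."""
    return max(counts.values()) / sum(counts.values())


def shannon_entropy_bits(counts: dict) -> float:
    """Shannon entropy of the draft-length distribution, in bits."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c)


for window, counts in DRAFT_COUNTS.items():
    print(
        f"{window}: slots={sum(counts.values())} "
        f"top={100 * top_share(counts):.1f}% "
        f"H={shannon_entropy_bits(counts):.3f} bits"
    )
```

A distribution this concentrated sits far below the `log2(3)` maximum for three buckets, which is why mixed draft lengths are a weak explanation for the shape churn.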
Per-step aggregate shape:
| window | steps | unique total rows | top full-shape rows |
|---|---|---|---|
| n8 | 26 | 12 | 32 rows for 14 steps |
| n32 | 32 | 20 | 128 rows for 13 steps |
| n128 | 37 | 34 | 512 rows for 4 steps |
Decision:
- Do not implement the Phase 20 group/defer-by-draft scheduler shortcut on this evidence.
- Draft length is already stable (`draft=3` is >96% of verify slots), yet MTP still regresses decode throughput hard and worsens TTFT.
- The residual shape churn is dominated by active-slot/tail churn and the MTP `K + 1` verification-row expansion, not mixed draft lengths.
- Any future MTP parity work needs a deeper target-verify graph/state design, not a small server scheduling shortcut.
Phase 20 Current-Stack Serving Snapshot
Phase 20 refreshed the MoE paged-vs-vLLM serving baseline on the current clean DGX mirror after the MTP investigation.
Artifact:
/home/mudler/bench/phase20_current_snapshot/20260701_050621
Current source:
- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`
Pre/post gate result:
- Pre-gate and post-gate both passed.
- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Full `MUL_MAT_ID`: `806/806` on CUDA0.
Serving snapshot:
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|---|---|---|---|---|---|
| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% |
| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% |
| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% |
Latency/prefill snapshot:
| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---|---|---|---|---|
| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 |
| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 |
| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 |
Decision:
- The latest clean stack is still not at vLLM serving parity on GB10.
- The user-visible gap is dominated by prefill/TTFT and e2e serving throughput, not by a now-open MTP or scheduler shortcut.
- Keep MTP scheduler work closed. The next credible parity path is either a datacenter-Blackwell rerun or a larger fused-kernel project outside the low-conflict GB10 patch stack.
Phase 21 Current-Stack Serving Harness
Phase 21 made the Phase 20 current-stack serving snapshot repeatable from the LocalAI backend tree.
New script:
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
Purpose:
- targets the clean `~/llama-phase6-source` mirror by default;
- rejects busy docker, `local-ai-worker`, GPU compute, or owned GPU-lock state;
- builds the current llama.cpp targets;
- runs pre/post `paged-inference-gates.sh`;
- runs paged and vLLM serving arms with the same h2h client;
- writes paged/vLLM ratio summaries.
Verification:
- local `bash -n` passed;
- local `--help` passed;
- DGX `DRY_RUN=1` validated required paths and preflight without launching servers.
Dry-run artifact:
/home/mudler/bench/phase21_harness_dryrun/20260701_051757
Decision:
- Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving snapshots.
- Do not use stale DGX `~/bench/combined_definitive.sh` without porting it to `~/llama-phase6-source` and the owner-file lock discipline.
Phase 22 Patch-Series Mirror Invariant
Phase 22 verified that the LocalAI on-disk paged patch series still reconstructs
the canonical llama.cpp fork tree after patch 0055.
Method:
- Create a fresh worktree at Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd`.
- Apply every `backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch` with strict `git apply`, matching the LocalAI build path.
- Stage the result and compare `git write-tree` with the fork branch HEAD tree.
Result:
base=0ed235ea2c17a19fc8238668653946721ed136fd
applied_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb
fork_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb
Decision:
- The patch series is drift-free against fork branch `localai-paged` at `fb9402661 feat(server): trace speculative batch shapes`.
Phase 24 Snapshot Hardware Report
Phase 24 made the current-stack serving harness record hardware identity before any server starts. This keeps GB10/workstation Blackwell evidence separate from future datacenter-Blackwell reruns.
Script change:
- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now writes `hardware.txt` after preflight and before the `DRY_RUN=1` exit.
Recorded fields:
- `nvidia-smi -L`;
- `nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap`, with fallback to name/driver/memory if `compute_cap` is unavailable;
- `gpu_name`;
- `hardware_class`;
- parity note for that hardware class.
Verification:
- local `bash -n` passed;
- local `--help` passed;
- DGX `DRY_RUN=1` validated preflight and wrote `hardware.txt` without launching servers.
Dry-run artifact:
/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741
DGX hardware result:
GPU 0: NVIDIA GB10
driver=580.159.03
compute_cap=12.1
hardware_class=gb10_or_workstation_blackwell
Decision:
- Future snapshot artifacts are self-describing enough to prevent accidental GB10-to-datacenter generalization.
- The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`; datacenter Blackwell needs a fresh run of the same methodology.
Phase 25 Snapshot Gate Summary
Phase 25 made current-stack serving artifacts self-auditing for the inference gates that protect the paged path.
Script change:
- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now writes `gate_summary.tsv` after the post gate in a full run.
- The script also supports `--summarize-gates ART` to generate the same summary from existing `gate_pre/` and `gate_post/` artifacts without launching servers.
Recorded rows:
- pre/post MoE transcript md5 versus `8cb0ce23777bf55f92f63d0292c756b0`;
- pre/post dense transcript md5 versus `5951a5b4d624ce891e22ab5fca9bc439`;
- pre/post backend op rows, currently `MUL_MAT_ID`, with the parsed passed/total count.
Verification:
- Red check: Phase 20 initially had gate artifacts but no `gate_summary.tsv`.
- local `bash -n` passed;
- local `--help` passed;
- DGX `--summarize-gates` against Phase 20 wrote six green rows;
- DGX `DRY_RUN=1` validated the normal path still preflights and writes `hardware.txt` without launching servers or writing a gate summary before gates exist.
Artifacts:
- Backfilled summary: `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`
- Dry run: `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353`
Backfilled Phase 20 gate summary:
pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
pre op_MUL_MAT_ID ok 806/806
post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
post op_MUL_MAT_ID ok 806/806
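A summary in this shape is easy to audit mechanically. The sketch below assumes four whitespace- or tab-separated columns (phase, check, status, actual), as in the rows printed above; the `audit` helper and its return shape are illustrative and are not part of `paged-current-serving-snapshot.sh`.

```python
# Canonical values taken from the gates recorded in these notes.
EXPECTED = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "op_MUL_MAT_ID": "806/806",
}

SUMMARY = """\
pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
pre op_MUL_MAT_ID ok 806/806
post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
post op_MUL_MAT_ID ok 806/806
"""


def audit(summary_text: str):
    """Return (failing_lines, missing_rows) for a gate summary."""
    failures, seen = [], set()
    for line in summary_text.splitlines():
        if not line.strip():
            continue
        phase, check, status, actual = line.split()
        seen.add((phase, check))
        if status != "ok" or EXPECTED.get(check) != actual:
            failures.append(line)
    wanted = {(p, c) for p in ("pre", "post") for c in EXPECTED}
    return failures, wanted - seen


failures, missing = audit(SUMMARY)
print("green" if not failures and not missing else "red", failures, missing)
```

Checking for missing rows as well as failing ones matters here, because a summary that silently drops the post gate would otherwise look green.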
Decision:
- Future full serving snapshots carry compact proof that inference md5/op gates stayed green before and after the paged-vs-vLLM run.
- Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before accepting a parity snapshot.
Phase 26 Audited Current-Stack Serving Snapshot
Phase 26 ran a full current-stack paged-vs-vLLM MoE serving snapshot with the Phase 24/25 audit files enabled.
Artifact:
/home/mudler/bench/phase26_audited_snapshot/20260701_053650
Current source:
- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`
Hardware report:
- `hardware_class=gb10_or_workstation_blackwell`
- `GPU 0: NVIDIA GB10`
- driver `580.159.03`
- compute capability `12.1`
Pre/post gate summary:
| phase | check | status | actual |
|---|---|---|---|
| pre | MoE md5 | ok | 8cb0ce23777bf55f92f63d0292c756b0 |
| pre | dense md5 | ok | 5951a5b4d624ce891e22ab5fca9bc439 |
| pre | `MUL_MAT_ID` | ok | 806/806 |
| post | MoE md5 | ok | 8cb0ce23777bf55f92f63d0292c756b0 |
| post | dense md5 | ok | 5951a5b4d624ce891e22ab5fca9bc439 |
| post | `MUL_MAT_ID` | ok | 806/806 |
Serving snapshot:
| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg |
|---|---|---|---|---|---|---|
| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% |
| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% |
| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% |
Latency/prefill snapshot:
| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps |
|---|---|---|---|---|---|
| 8 | 778.6 | 271.1 | 2.87x | 1679.9 | 4485.6 |
| 32 | 2607.4 | 749.4 | 3.48x | 1698.8 | 5427.8 |
| 128 | 7569.6 | 2534.3 | 2.99x | 1668.7 | 5122.0 |
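The ratio columns in the two tables above follow directly from the raw paged and vLLM numbers. The sketch below recomputes the decode percentage and the TTFT multiple from values copied out of the Phase 26 tables; the tuple layout and function name are illustrative only.

```python
# (n, paged decode_agg, vLLM decode_agg, paged TTFT ms, vLLM TTFT ms),
# copied from the Phase 26 serving and latency tables.
PHASE26 = [
    (8, 230.8, 283.2, 778.6, 271.1),
    (32, 420.0, 609.0, 2607.4, 749.4),
    (128, 673.4, 1025.0, 7569.6, 2534.3),
]


def parity_ratios(rows):
    """Return {n: (decode percent of vLLM, TTFT multiple of vLLM)}."""
    out = {}
    for n, paged_dec, vllm_dec, paged_ttft, vllm_ttft in rows:
        out[n] = (100.0 * paged_dec / vllm_dec, paged_ttft / vllm_ttft)
    return out


ratios = parity_ratios(PHASE26)
for n, (decode_pct, ttft_x) in ratios.items():
    print(f"n={n}: decode {decode_pct:.1f}% of vLLM, TTFT {ttft_x:.2f}x vLLM")
```

Keeping the raw numbers next to the derived ratios makes it easy to rerun the same arithmetic on a future snapshot and compare like with like.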
vLLM startup notes:
- vLLM selected the expected GB10 backend mix: FlashInfer FP8 projection kernels, Triton/FLA GDN prefill, FlashAttention, and MARLIN NVFP4 MoE.
- Startup was long because the server loaded three checkpoint shards, loaded cached torch-compile graphs, ran FlashInfer fp8 GEMM autotuning, and captured CUDA graphs before the API became ready.
Decision:
- The audited current stack still is not at vLLM serving parity on GB10.
- The Phase 20 conclusion is reproduced with stronger audit artifacts: `hardware.txt`, `gate_summary.tsv`, pre/post full gates, and same-session paged/vLLM ratios.
- Current paged/vLLM decode ratios remain about `81.5%` at n8, `69.0%` at n32, and `65.7%` at n128; e2e aggregate ratios remain about `70.6%`, `54.6%`, and `49.4%`.
Phase 27 Graph-Node-Traced Current-Stack Serving Profile
Phase 27 re-profiled the current clean llama.cpp serving path with CUDA graph
node tracing enabled. This checks the Phase 8 bucket picture against the decode
profiling rule: serving/decode profiles must use --cuda-graph-trace=node.
Artifact:
/home/mudler/bench/phase27_graph_node_serving/20260701_055519
Source and hardware:
- `/home/mudler/llama-phase6-source`
- `f2521ab12 feat(server): trace speculative batch shapes`
- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`
- Nsight Systems `2025.3.2.474-253236389321v0`
Safety gates:
| phase | check | status | actual |
|---|---|---|---|
| pre | MoE md5 | ok | 8cb0ce23777bf55f92f63d0292c756b0 |
| pre | dense md5 | ok | 5951a5b4d624ce891e22ab5fca9bc439 |
| pre | `MUL_MAT_ID` | ok | 806/806 |
| post retry | MoE md5 | ok | 8cb0ce23777bf55f92f63d0292c756b0 |
| post retry | dense md5 | ok | 5951a5b4d624ce891e22ab5fca9bc439 |
| post retry | `MUL_MAT_ID` | ok | 806/806 |
The first immediate post-gate attempt raced with Nsight teardown and rejected
the run because it detected one compute process even though nvidia-smi already
printed no running processes. The post-gate retry started from docker=0,
local_ai_worker=0, compute=0, and a FREE owner file.
Serving sample (n=128, PTOK=128, GEN=64):
| agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms |
|---|---|---|---|---|
| 319.9 | 675.5 | 3.9 | 1671.1 | 8363.4 |
This matches Phase 26's n128 paged decode rate (673.4 decode_agg_tps) closely
enough to treat the profile as representative for bucket direction.
Graph-node-traced kernel buckets:
| macro bucket | time ms | share |
|---|---|---|
| GDN | 6706.33 | 33.47% |
| MoE/FFN-GEMM | 5871.92 | 29.31% |
| bf16-proj | 2725.07 | 13.60% |
| layout-copy | 1309.99 | 6.54% |
| ew-mul(weight/norm/GDN) | 724.29 | 3.61% |
| act-quant | 697.75 | 3.48% |
| norms/residual | 405.29 | 2.02% |
| ew-add(resid/MoE-fanin) | 361.81 | 1.81% |
| MoE-dispatch | 275.99 | 1.38% |
| FA | 271.03 | 1.35% |
Fine buckets:
- `gdn_core`: `5929.85 ms` (`29.59%`)
- `mmq_nvfp4`: `5697.79 ms` (`28.44%`)
- `cublas_bf16_gemm`: `1892.81 ms` (`9.45%`)
- `act_quant`: `697.75 ms` (`3.48%`)
- `mm_ids`: `121.99 ms` (`0.61%`)
- `gather_mmq`: `73.88 ms` (`0.37%`)
- `argsort_topk`: `80.11 ms` (`0.40%`)
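The combined weight of the MoE dispatch helpers can be cross-checked against the macro table. The sketch below sums the three helper buckets and divides by a total kernel time implied by the GDN row (`time / share`); deriving the total this way is an assumption, since the notes do not print it.

```python
# Fine-bucket times in ms, copied from the list above.
HELPER_MS = {"mm_ids": 121.99, "gather_mmq": 73.88, "argsort_topk": 80.11}

# GDN macro bucket (time ms, share), used only to back out the profile total.
GDN_MS, GDN_SHARE = 6706.33, 0.3347

total_ms = GDN_MS / GDN_SHARE           # implied total GPU kernel time
helper_ms = sum(HELPER_MS.values())     # should match the MoE-dispatch macro row
helper_share_pct = 100.0 * helper_ms / total_ms

print(f"helpers={helper_ms:.2f} ms of ~{total_ms:.0f} ms -> {helper_share_pct:.2f}%")
```

The helper sum lands on the `MoE-dispatch` macro bucket to within rounding, so the fine and macro views of the profile are consistent with each other.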
Decision:
- The graph-node-traced current-stack profile confirms the Phase 8 source shortcut decision. Metadata/helper work is still too small: `mm_ids`, `gather_mmq`, and `argsort_topk` together are about `1.38%`.
- A credible GB10 source patch would have to reduce `gdn_core` or `mmq_nvfp4`/bf16 projection work directly. The low-conflict helper-dispatch path still should not be reopened.
- The serving profile does not change the Phase 26 parity verdict: n128 paged decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`.