docs(paged): reject GDN M5 QS-early phase

Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 01:29:44 +00:00
parent 24e778de47
commit 1b5ae227eb
5 changed files with 157 additions and 16 deletions


@@ -928,3 +928,58 @@ Conclusion:
- A future GDN prefill attempt should either share the `A/T` work across value
slabs or switch to a different FLA-style chunk design; it should not repeat
this env-gated two-slab M5 variant.
## Phase 11 GDN M5 QS-Early Rejection
Phase 11 tested a smaller C=16 M5 scheduling shortcut instead of reopening C32:
move the `QS = Qc * S0` state-boundary tensor-core pass earlier and keep it
default-off behind `GDN_M5_QS_EARLY=1`.
Correctness artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_default.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_qs_early.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_default.md5`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_default.md5`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_qs_early.md5`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_qs_early.md5`
Correctness result:
- Default and QS-early paths matched canonical md5 exactly:
  - MoE `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed.
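
The gate comparison above amounts to checking each run's digest against the canonical value for its model family. A minimal sketch, assuming the digests are available as hex strings; the helper names `md5_hex` and `md5_gate` are hypothetical, and the actual gate scripts and the layout of the `.md5` files are not shown in this log:

```python
import hashlib

# Canonical digests as recorded in this log.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(payload: bytes) -> str:
    """Hex md5 digest of a raw output payload."""
    return hashlib.md5(payload).hexdigest()


def md5_gate(observed: dict, canonical: dict = CANONICAL_MD5) -> dict:
    """Per-model pass/fail: True only when the observed digest equals the canonical one."""
    return {
        model: observed.get(model, "").strip().lower() == want
        for model, want in canonical.items()
    }


# Hypothetical usage: digests reported by a QS-early run.
qs_early = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}
print(md5_gate(qs_early))
```

An exact digest match is a strict gate: any single-bit difference in the output fails it, which is consistent with KL not being needed once both paths match.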
Performance artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_base.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_qs_early.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_base.txt`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_qs_early.txt`
Performance A/B:
| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|-------|------|----|----|---|----------|----------|-------|
| MoE | M5 base | 512 | 4 | 32 | 2325.67 | 355.60 | 2229.90 |
| MoE | QS-early | 512 | 4 | 32 | 2315.77 | 353.27 | 2220.16 |
| MoE | M5 base | 2048 | 4 | 32 | 2441.54 | 390.53 | 2416.80 |
| MoE | QS-early | 2048 | 4 | 32 | 2420.26 | 389.89 | 2395.94 |
| Dense | M5 base | 512 | 4 | 32 | 975.15 | 142.71 | 932.97 |
| Dense | QS-early | 512 | 4 | 32 | 968.23 | 144.24 | 927.17 |
| Dense | M5 base | 2048 | 4 | 32 | 1021.06 | 183.34 | 1012.04 |
| Dense | QS-early | 2048 | 4 | 32 | 1015.77 | 183.73 | 1006.88 |
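
The size of the regression is easier to judge as a relative delta. A small sketch that recomputes the S_PP change for each row pair; the numbers are copied from the table above, while the dictionary layout and the `pct_delta` helper are illustrative only:

```python
# (model, PP) -> (M5 base S_PP, QS-early S_PP), copied from the A/B table.
S_PP = {
    ("MoE", 512): (2325.67, 2315.77),
    ("MoE", 2048): (2441.54, 2420.26),
    ("Dense", 512): (975.15, 968.23),
    ("Dense", 2048): (1021.06, 1015.77),
}


def pct_delta(base: float, candidate: float) -> float:
    """Relative change of candidate versus base, in percent."""
    return (candidate - base) / base * 100.0


deltas = {key: pct_delta(base, cand) for key, (base, cand) in S_PP.items()}

for (model, pp), delta in deltas.items():
    print(f"{model:5s} PP={pp:<4d} S_PP delta {delta:+.2f}%")
```

All four S_PP deltas come out negative and below 1% in magnitude, so the candidate is slightly slower on prefill in every measured configuration.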
Rejected diff:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`
Conclusion:
- Do not ship Phase 11 QS-early as implemented.
- Merely moving the QS state-boundary product earlier is not enough; it remains
an extra MMA pass and does not reduce the M5 critical path.
- The next GDN attempt should skip local scheduling-only changes and scope a
true shared-A/Ai blocked-solve or global-scratch design, with an explicit
scratch/synchronization cost model before coding.


@@ -174,6 +174,7 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) |
| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 |
Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close the 2.62x gap because vLLM's FLA blocked-solve is a more complete TC implementation.
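
The shared-memory arithmetic behind that constraint can be checked directly. A rough sketch, assuming "KB" means 1024 bytes; what occupies the remaining budget is not specified here and is not modeled:

```python
BYTES_PER_F32 = 4
SMEM_CAP_KB = 99   # GB10 dynamic shared-memory cap quoted above
STATE_DIM = 128    # the 128x128 f32 recurrent state

state_kb = STATE_DIM * STATE_DIM * BYTES_PER_F32 / 1024
remaining_kb = SMEM_CAP_KB - state_kb

print(f"state: {state_kb:.0f} KB, left for everything else: {remaining_kb:.0f} KB")
```

With roughly two thirds of the cap taken by the state alone, all per-chunk buffers have to fit in the remainder, which is the pressure that holds C at 16.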


@@ -173,6 +173,7 @@ products through tensor cores. The series chased that headroom.
| BV block-occupancy A/B (tf32) | raise blocks/SM to test if occupancy is the bound | **REJECTED** (occupancy is NOT the bound; latency is wave-hidden) | two arms statistically equal: **1844 vs 1814 S_PP (-1.04%, within noise)** | GDNAB armA/armB |
| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |
| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab |
**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not


@@ -493,6 +493,34 @@ Artifacts:
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/`
- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`
### Phase 11 GDN M5 QS-early update
Phase 11 tested the smallest possible C=16 follow-up after the C32 slab
rejection: move the `QS = Qc * S0` state-boundary product earlier in the M5
chunk loop behind `GDN_M5_QS_EARLY=1`.
Result:
- The candidate built on DGX and stayed md5-exact:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`.
- It regressed S_PP slightly in both families:
MoE 2048 `2441.54 -> 2420.26`, dense 2048 `1021.06 -> 1015.77`.
Decision:
- **REJECT** QS-early.
- Do not add it to the LocalAI patch stack.
- A scheduling-only move that still performs the same QS MMA does not close the
GDN gap. The next GDN scope should be a real shared-A/Ai blocked-solve or
global-scratch design, not another local reorder.
Artifacts:
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/`
- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`
---
# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)