mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): reject GDN C32 slab phase
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch. Assisted-by: Codex:gpt-5
@@ -874,6 +874,57 @@ Source check:
Decision:

- Do not ship a Phase 10 source patch yet.
- Keep the baseline and source check as the entry gate for the next C32 slab
  implementation task.
- A default-off C32 slab candidate was implemented and rejected by the
  performance gate.
- The candidate was correctness-clean only after fixing a tail-chunk staging
  bug: rows `t >= Cc` in the staged `U=T*RHS` copy-back must be zeroed before
  state/output math. Before that fix, the dense gate produced a degenerate
  transcript even though the focused op gate passed.
- After the tail fix, both default and forced-C32 modes matched the canonical
  md5 gates exactly:
  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
- KL was not needed because md5 stayed stable after the tail fix.

Correctness artifacts:

- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_default_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_default_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_c32_after_tailfix.md5`
- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_c32_after_tailfix.md5`

Performance A/B artifacts:

- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`

Performance A/B:

| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|-------|------|----|----|---|----------|----------|-------|
| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |

Rejected diff:

- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`

Conclusion:

- Do not ship Phase 10 C32 slab as implemented.
- C32 slab is not a maintainable shortcut toward parity because duplicated
  A/T recomputation per value slab outweighs the intended state-traffic
  reduction.
- A future GDN prefill attempt should either share the `A/T` work across value
  slabs or switch to a different FLA-style chunk design; it should not repeat
  this env-gated two-slab M5 variant.

@@ -173,6 +173,7 @@ GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_ga
| bf16-C16 | bf16 Gram at C=16 | REJECTED | no win; bf16 mantissa unsafe on state-coupled products |
| BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) |
| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates |
| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 |

Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation.
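
The shared-memory pressure behind the `C=16` limit and the two-slab idea can be checked with back-of-envelope arithmetic. The sketch below uses only the 99 KB cap and the 128x128 f32 state quoted above; treating one `dv_tile=64` slab as a 128 x 64 f32 tile is an assumption about the layout, and `state_tile_bytes` is a hypothetical helper, not a function from the kernel.

```python
# Back-of-envelope shared-memory budget for the GDN state tile.
# The 99 KB cap and the 128x128 f32 state come from the text above.
# Assumption: a dv_tile=64 slab is stored as a 128 x 64 f32 tile.

SMEM_CAP_BYTES = 99 * 1024  # GB10 dynamic shared-memory cap
F32_BYTES = 4


def state_tile_bytes(s_k: int, dv_tile: int) -> int:
    """Bytes needed for an s_k x dv_tile f32 state tile."""
    return s_k * dv_tile * F32_BYTES


full = state_tile_bytes(128, 128)  # whole state in one tile
slab = state_tile_bytes(128, 64)   # one of two value slabs (assumed layout)

for name, nbytes in (("full state", full), ("dv_tile=64 slab", slab)):
    left = SMEM_CAP_BYTES - nbytes
    print(f"{name}: {nbytes // 1024} KB used, {left // 1024} KB left")
```

Under these assumptions, halving the value tile frees roughly 32 KB for chunk buffers, which is the headroom the C32 slab candidate tried to spend on a larger chunk.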
@@ -172,6 +172,7 @@ products through tensor cores. The series chased that headroom.
| bf16-C16 | bf16 Gram at C=16 | rejected | no win over tf32-M5; bf16 mantissa unsafe on the state-coupled products | GDN build-plan s4 |
| BV block-occupancy A/B (tf32) | raise blocks/SM to test if occupancy is the bound | **REJECTED** (occupancy is NOT the bound; latency is wave-hidden) | two arms statistically equal: **1844 vs 1814 S_PP (-1.04%, within noise)** | GDNAB armA/armB |
| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan |
| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab |

**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2)
intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not
@@ -459,6 +459,40 @@ scope until a serving phase proves target-verification cost and rollback safety.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update

Phase 10 tested the tempting low-conflict shortcut for #101: keep the current
M5 tensor-core GDN form, raise the chunk to `C=32`, and split the value
dimension into two `dv_tile=64` slabs to stay within shared memory.

Result:

- The shortcut cannot be a launcher-only change. C32 requires staging
  `U=T*RHS` because the existing M5 apply path relies on one 16-row tile being
  held in registers before overwriting `Ud`.
- A default-off `GDN_C32_SLAB=1` candidate was built and md5-gated.
- The first candidate exposed a dense-only transcript failure on tail chunks;
  root cause was copying uninitialized staged rows for `t >= Cc` back into
  `Ud`. Zeroing those rows restored both canonical md5 gates:
  MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
  `5951a5b4d624ce891e22ab5fca9bc439`.
- Performance regressed after correctness was fixed:
  MoE 2048 S_PP `2430.32 -> 2054.86`; dense 2048 S_PP `1019.25 -> 903.73`.

Decision:

- **REJECT** the two-slab C32 M5 variant.
- Do not add it to the LocalAI patch stack.
- The likely blocker is duplicated A/T recomputation per value slab; future GDN
  work must share that work across slabs or move to a different FLA-style
  chunked design rather than repeating this env-gated shortcut.

Artifacts:

- `/home/mudler/bench/phase10_gdn_c32_slab/gates/`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/`
- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`

---

# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)

@@ -121,7 +121,7 @@ Implication:
- Because the candidate changes the solve/apply mechanics, it requires a
  focused `GATED_DELTA_NET` op gate before any prefill A/B.

- [ ] **Step 1: Add an explicit env selector**
- [x] **Step 1: Add an explicit env selector**

Use an env var such as:

@@ -131,7 +131,7 @@ GDN_C32_SLAB=1

The default path must stay current M5.

- [ ] **Step 2: Introduce a C=32, dv_tile=64 launch**
- [x] **Step 2: Introduce a C=32, dv_tile=64 launch**

Target shape:

@@ -147,7 +147,7 @@ Initial prototype rules:
- no decode routing,
- no D2H synchronization.

- [ ] **Step 3: Build on DGX**
- [x] **Step 3: Build on DGX**

Run:

@@ -157,12 +157,25 @@ ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build .

Expected: build succeeds.

Result:

- Candidate implemented in the llama.cpp fork as a default-off
  `GDN_C32_SLAB=1` path.
- Kernel generalized to `DV_TILE=64`, with two value slabs for `S_v=128`.
- C32 `U=T*RHS` writes were staged through a slab-local `Utmp` buffer to avoid
  read/write aliasing against the RHS in `Ud`.
- Initial md5 failed on dense because tail chunks copied uninitialized staged
  rows back into `Ud`; the root-cause fix zeroed `t >= Cc` rows during the
  staged copy-back.
- DGX build succeeded after the tail fix:
  `cmake --build . --target test-backend-ops llama-completion -j 8`.
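
The tail-chunk bug described in the result list can be illustrated with a toy host-side model. This is a minimal sketch, not the kernel code: `copy_back`, the list-of-rows layout, the `width`, and the use of NaN as a stand-in for uninitialized staging memory are all assumptions made for illustration. Only the names `Utmp`, `Ud`, `Cc`, and the rule "zero rows `t >= Cc`" come from the notes above.

```python
# Toy model of the staged copy-back for one C=32 chunk.
# Assumption: later state/output math reads all C rows of Ud, so any row
# t >= Cc that is not zeroed leaks stale staging data into the result.

C = 32  # chunk length of the candidate


def copy_back(utmp, ud, cc, zero_tail=True):
    """Copy staged rows into ud; rows t >= cc hold no real token."""
    for t in range(C):
        if t < cc:
            ud[t] = list(utmp[t])
        elif zero_tail:
            ud[t] = [0.0] * len(ud[t])  # the tail fix
        else:
            ud[t] = list(utmp[t])       # pre-fix behaviour
    return ud


def first_column_sum(u):
    """Stand-in for a reduction that touches every row of the chunk."""
    return sum(row[0] for row in u)


width = 4
cc = 20  # a tail chunk with 20 valid rows
garbage = float("nan")  # stand-in for uninitialized staged memory
utmp = [[1.0] * width if t < cc else [garbage] * width for t in range(C)]

fixed = copy_back(utmp, [[0.0] * width for _ in range(C)], cc)
buggy = copy_back(utmp, [[0.0] * width for _ in range(C)], cc, zero_tail=False)

print("with tail zeroing:   ", first_column_sum(fixed))
print("without tail zeroing:", first_column_sum(buggy))
```

In this model a focused check that only inspects rows `t < Cc` would pass in both cases, which is consistent with the op gate passing while the dense transcript degenerated.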

## Task 3: Correctness Gates

**Files:**
- Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/gates/`

- [ ] **Step 1: Run `GATED_DELTA_NET` op gate**
- [x] **Step 1: Run `GATED_DELTA_NET` op gate**

Run default and forced C32 slab modes:

@@ -180,7 +193,7 @@ Required coverage to inspect in logs:
- permuted layout,
- adversarial decay.

- [ ] **Step 2: Run canonical md5 gates**
- [x] **Step 2: Run canonical md5 gates**

Run MoE and dense greedy gates with and without `GDN_C32_SLAB=1`.

@@ -191,18 +204,34 @@ MoE 8cb0ce23777bf55f92f63d0292c756b0
Dense 5951a5b4d624ce891e22ab5fca9bc439
```

- [ ] **Step 3: Run KL gate if md5 changes**
- [x] **Step 3: Run KL gate if md5 changes**

If the C32 slab path changes reduction order and therefore md5, stop and run the
existing KL procedure from `PAGED_BITEXACT_NOTE.md`. Keep the patch only if the
new path is KL-benign and no worse than current M5.

Result:

- Default op gate:
  `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt`
- Forced C32 op gate:
  `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt`
- Both `GATED_DELTA_NET` CUDA0 gates passed.
- Canonical default md5 after tail fix:
  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`
  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`
- Forced C32 md5 after tail fix:
  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`
  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`
- KL gate was not needed because the md5 gates matched the canonical outputs
  exactly after the tail-row fix.
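
The md5 decision in Steps 2 and 3 amounts to comparing each mode's digest with the canonical value. The sketch below is one way to express that check. The two canonical digests are copied from this plan; the assumption that the gate hashes the raw greedy transcript bytes, and the helper names `transcript_md5` and `gate`, are hypothetical and not taken from the gate scripts.

```python
import hashlib

# Canonical digests quoted in this plan.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def transcript_md5(transcript: bytes) -> str:
    """Hex md5 of a transcript (assumed to be hashed as raw bytes)."""
    return hashlib.md5(transcript).hexdigest()


def gate(model: str, default_md5: str, c32_md5: str) -> str:
    """Return 'pass' when both modes match canonical, else 'investigate'."""
    want = CANONICAL_MD5[model]
    if default_md5 == want and c32_md5 == want:
        return "pass"
    return "investigate"  # debug first; the KL procedure applies to benign drift


# Both modes reproduce the canonical digest: no KL run is needed.
print(gate("moe", CANONICAL_MD5["moe"], CANONICAL_MD5["moe"]))

# A drifting forced-C32 digest, such as the pre-fix dense failure, is flagged.
print(gate("dense", CANONICAL_MD5["dense"], transcript_md5(b"degenerate")))
```

Returning `investigate` rather than going straight to KL reflects the experience recorded here: the first md5 mismatch was a real staging bug, not a reduction-order change.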

## Task 4: Performance A/B

**Files:**
- Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/ab/`

- [ ] **Step 1: Run C32 slab prefill at `npp=512`**
- [x] **Step 1: Run C32 slab prefill at `npp=512`**

Compare:

@@ -213,30 +242,69 @@ candidate: GDN_TC=5 GDN_CHUNK_MIN=64 GDN_C32_SLAB=1

Pass: candidate beats current M5 S_PP outside noise.

- [ ] **Step 2: Run C32 slab prefill at `npp=2048`**
- [x] **Step 2: Run C32 slab prefill at `npp=2048`**

Use the same A/B. Pass requires a net S_PP improvement or a clear GDN bucket
reduction without a larger regression elsewhere.

- [ ] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**
- [x] **Step 3: Reject if duplicated A/T work cancels the state-traffic win**

If the candidate only shifts time between A/T recomputation and state traffic
without a net win, save the diff as a rejected artifact and update this plan.

Result:

Artifacts:

- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt`
- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt`

| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s |
|-------|------|----|----|---|----------|----------|-------|
| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 |
| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 |
| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 |
| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 |
| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 |
| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 |
| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 |
| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 |
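
The relative S_PP change per row pair makes the size of the regression explicit. The snippet below is a small sketch that recomputes it from the S_PP column of the table above; the dictionary layout and the `pct_change` helper are illustrative choices, not part of the benchmark tooling.

```python
# S_PP values copied from the table above: (M5 base, C32 slab) in tokens/s.
S_PP = {
    ("MoE", 512): (2323.48, 2069.12),
    ("MoE", 2048): (2430.32, 2054.86),
    ("Dense", 512): (975.10, 866.29),
    ("Dense", 2048): (1019.25, 903.73),
}


def pct_change(base: float, cand: float) -> float:
    """Relative change of the candidate against the baseline, in percent."""
    return (cand - base) / base * 100.0


deltas = {key: pct_change(*pair) for key, pair in S_PP.items()}

for (model, pp), delta in deltas.items():
    print(f"{model} PP={pp}: {delta:+.2f}% S_PP")
```

Every pair comes out more than 10% slower, roughly an order of magnitude beyond the -1.04% difference that the earlier occupancy A/B treated as noise.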

Decision:

- Reject the C32 slab source patch.
- The candidate is correctness-clean after tail-row zeroing, but it regresses
  S_PP in both model families.
- The likely mechanism is that recomputing `A/T` once per value slab cancels
  the intended state-traffic win; optimizing this would require a broader
  shared-work design rather than a small, low-conflict shortcut patch.
- Rejected diff saved at:
  `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`.

## Task 5: Mirror or Reject

**Files:**
- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`

- [ ] **Step 1: Commit accepted fork patch**
- [x] **Step 1: Commit accepted fork patch**

Commit only after correctness and performance gates pass.

- [ ] **Step 2: Generate LocalAI patch**
- [x] **Step 2: Generate LocalAI patch**

Use `git format-patch`; do not hand-edit the generated patch.

- [ ] **Step 3: Update docs**
- [x] **Step 3: Update docs**

Record exact artifacts, md5/KL results, and performance decision.

Result:

- No fork commit and no LocalAI patch were generated because Phase 10 was
  rejected by the performance gate.
- The llama.cpp fork and DGX mirror were restored to the prior accepted state.
- This plan and the parity docs record the rejected source candidate so it is
  not repeated as an accidental "obvious" follow-up.