From 04ed7fe52fca20b96216cc6d8b55e7a749c03120 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 15:13:45 +0000
Subject: [PATCH] docs(paged): record GDN launch sweep phase

Assisted-by: Codex:gpt-5
---
 .../llama-cpp-localai-paged/docs/BENCHMARK.md | 48 ++++++++++++++++---
 .../docs/PARITY_HANDOFF.md                    | 16 +++++++
 2 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
index c0b77c980..5d92f0876 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -12,12 +12,13 @@ with artifact path, gates, benchmark rows, and decision.
 
 - Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
 - Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
-- Latest attempt: Phase77.
-- Latest decision: decode-only GB10 graph-node profile confirms GDN recurrence
-  is a real current decode bucket. In an isolated n=128 decode window, GDN was
-  `41.20%` of GPU kernel time and `gdn_core` alone was `38.95%`, slightly above
-  `mmq_nvfp4` (`38.26%`). This funds a default-off GDN decode A/B/PoC, with
-  md5/op gates and bucket reduction required before any merge/default change.
+- Latest attempt: Phase78.
+- Latest decision: GDN decode launch-shape env sweep did not beat the current
+  default. `GDN_NW=8 GDN_CPW=8` was correctness-clean but slower
+  (`gdn_core 1443.55 ms` vs Phase77 default `1408.33 ms`), and
+  `GDN_NW=16 GDN_CPW=4` failed the `MUL_MAT_ID` op gate. Keep the current
+  default launch shape; the next source path must be a real decode-kernel
+  structural A/B, not another `GDN_NW`/`GDN_CPW` retune.
 
 ## Current Serving Record
 
@@ -57,6 +58,41 @@ Decision:
 
 ## Attempt Log
 
+### Phase78: GDN Decode Launch-Shape Sweep
+
+- Date: 2026-07-01.
+- Baseline artifact:
+  `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`.
+- Sweep artifacts:
+  - `/home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654`
+  - `/home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954`
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: env-gated launch-shape sweep only; no source change.
+- Shape: same as Phase77 decode-only graph-node profile.
+
+Result:
+
+| arm | env | gate status | GDN ms | GDN share | `gdn_core` ms | `gdn_core` share | `mmq_nvfp4` ms |
+|-----|-----|-------------|-------:|----------:|--------------:|-----------------:|---------------:|
+| Phase77 default | none | pre/post green | `1489.71` | `41.20%` | `1408.33` | `38.95%` | `1383.50` |
+| sweep `8x8` | `GDN_NW=8 GDN_CPW=8` | pre/post green | `1525.86` | `41.94%` | `1443.55` | `39.68%` | `1366.33` |
+| sweep `16x4` | `GDN_NW=16 GDN_CPW=4` | rejected | not run | not run | not run | not run | not run |
+
+Gate detail:
+
+- `8x8`: pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+  `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`,
+  `MUL_MAT_ID 806/806`.
+- `16x4`: completion md5 and `MUL_MAT 1146/1146` passed, but
+  `MUL_MAT_ID` failed `805/806`; rejected before profiling.
+
+Decision:
+
+- Keep the current default `GDN_NW=16 GDN_CPW=8`.
+- Do not spend more GB10 time on launch-shape retunes without a new hypothesis.
+- The funded source path remains a structural default-off GDN decode A/B/PoC
+  that reduces the Phase77 `gdn_core` bucket, not another existing-env sweep.
+
 ### Phase77: MoE Decode-Only Graph-Node Profile
 
 - Date: 2026-07-01.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 85a5be39d..6a21c05e0 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1326,3 +1326,19 @@ based on vLLM's direct recurrent/packed decode structure.
 The next patch must prove a material reduction in the Phase77 `gdn_core` bucket,
 keep canonical md5 and op gates green, and avoid serving/decode throughput regression
 under the same decode-only capture shape before it can be considered for merge or default.
+
+Phase78 launch-shape sweep:
+
+- Baseline: Phase77 default launch shape (`GDN_NW=16 GDN_CPW=8`) had
+  `gdn_core 1408.33 ms` (`38.95%`) in the decode-only window.
+- `GDN_NW=8 GDN_CPW=8` artifact:
+  `/home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654`.
+  Gates were green, but `gdn_core` worsened to `1443.55 ms` (`39.68%`).
+- `GDN_NW=16 GDN_CPW=4` artifact:
+  `/home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954`.
+  Rejected before profiling: `MUL_MAT_ID` failed `805/806`.
+
+Decision: keep default `GDN_NW=16 GDN_CPW=8`. Do not retry existing
+`GDN_NW`/`GDN_CPW` launch-shape retunes unless a new profile gives a specific
+reason. The next GB10 source-funded work must be structural, default-off, and
+measured against the Phase77 decode-only `gdn_core` bucket.