docs(paged): record BF16 F32 output dense serving phase

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 13:06:49 +00:00
parent e67b329eb1
commit 2b2b1f0b25
4 changed files with 225 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3718,3 +3718,52 @@ Decision:
  removes the profiled BF16-to-F32 conversion row for this shape.
 - Do not make it default-on yet. The gain is modest and needs dense plus serving
  A/B before a default policy change.
+
+## BF16 F32 Output Dense Serving Phase68 Result
+
+Phase68 is recorded in
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+It reused the Phase67 source commit and did not change llama.cpp source.
+
+- Fork commit under test: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`
+- DGX mirror commit under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
+- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`
+- DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`
+- Serving A/B artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`
+
+Correctness basis for this exact source commit remains the Phase67 default and
+opt-in gates:
+
+| mode | MoE md5 | dense md5 | `MUL_MAT` |
+|------|---------|-----------|-----------|
+| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+
+Dense same-window prefill A/B:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision:
+
+- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. The dense prefill gain is
+  positive but too small to justify a default policy change.
+- The opt-in is now worth carrying forward: MoE prefill, dense prefill, and the
+  small MoE serving window all moved in the right direction without changing the
+  Phase67 md5/op correctness gates.
+- Any future default-on proposal requires regenerating the LocalAI patch series
+  from the fork and rerunning the broader current serving snapshot gates. Do not
+  default it on from Phase68 alone.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -995,3 +995,37 @@ The opt-in `npp=512` profile removed the BF16-to-F32 conversion row:
 `convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this
 as default-off for now. It is correctness-clean and measurable, but the win is
 small and needs dense plus serving A/B before any default-on decision.
+
+## 13. PHASE68 RESULT: BF16 F32 OUTPUT DENSE + SERVING A/B
+
+Phase68 reused Phase67 source unchanged. Plan:
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`;
+serving A/B artifact:
+`/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.
+
+Correctness basis for the exact source commit remains Phase67: default and
+`LLAMA_BF16_CUBLAS_F32_OUT=1` both produced MoE md5 `8cb0ce23`, dense md5
+`5951a5b4`, and `MUL_MAT 1146/1146`.
+
+The dense prefill gain stayed positive but tiny:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving A/B at `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision: carry the shortcut as a default-off opt-in candidate. It is no longer
+a prefill-only win, but Phase68 alone is not enough to default it on. Any future
+default-on proposal must mirror the fork commit into the LocalAI patch series
+and rerun a broader current serving snapshot with pre/post md5 and op gates.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -123,6 +123,16 @@ It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved
 Keep it default-off until dense and serving A/B decide whether it is worth a
 default policy change.

+Phase68 ran that dense and serving A/B without changing source. Dense prefill
+was positive but tiny (`973.13 -> 975.52` at `npp=512`, `1019.88 -> 1021.39` at
+`npp=2048`). A small MoE serving window at `N=128`, prompt `128`, generation
+`128` also moved in the right direction: `agg_tps` `409.8 -> 415.0`,
+`decode_agg_tps` `615.3 -> 627.2`, mean TTFT `8574.7 -> 8085.9 ms`, wall time
+`39.978 -> 39.480 s`. Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off
+and carry it as an opt-in shortcut candidate. Do not default it on until
+the fork commit is mirrored into the LocalAI patch series and a broader serving
+snapshot passes pre/post md5 and op gates.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

 ## 2. Decode-serving compute hypotheses (ranked)