docs(paged): record BF16 F32 output dense serving phase

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 13:06:49 +00:00
parent e67b329eb1
commit 2b2b1f0b25
4 changed files with 225 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3718,3 +3718,52 @@ Decision:
  removes the profiled BF16-to-F32 conversion row for this shape.
 - Do not make it default-on yet. The gain is modest and needs dense plus serving
  A/B before a default policy change.
+
+## BF16 F32 Output Dense Serving Phase68 Result
+
+Phase68 is recorded in
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+It reused the Phase67 source commit and did not change llama.cpp source.
+
+- Fork commit under test: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`
+- DGX mirror commit under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
+- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`
+- DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`
+- Serving A/B artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`
+
+Correctness basis for this exact source commit remains the Phase67 default and
+opt-in gates:
+
+| mode | MoE md5 | dense md5 | `MUL_MAT` |
+|------|---------|-----------|-----------|
+| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+
+Dense same-window prefill A/B:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision:
+
+- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. The dense prefill gain is
+  positive but too small to justify a default policy change.
+- The opt-in is now worth carrying forward: MoE prefill, dense prefill, and the
+  small MoE serving window all moved in the right direction without changing the
+  Phase67 md5/op correctness gates.
+- Next default-on consideration requires regenerating the LocalAI patch series
+  from the fork and rerunning the broader current serving snapshot gates. Do not
+  default it from Phase68 alone.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -995,3 +995,37 @@ The opt-in `npp=512` profile removed the BF16-to-F32 conversion row:
 `convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this
 as default-off for now. It is correctness-clean and measurable, but the win is
 small and needs dense plus serving A/B before any default-on decision.
+
+## 13. PHASE68 RESULT: BF16 F32 OUTPUT DENSE + SERVING A/B
+
+Phase68 reused Phase67 source unchanged. Plan:
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`;
+serving A/B artifact:
+`/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.
+
+Correctness basis for the exact source commit remains Phase67: default and
+`LLAMA_BF16_CUBLAS_F32_OUT=1` both produced MoE md5 `8cb0ce23`, dense md5
+`5951a5b4`, and `MUL_MAT 1146/1146`.
+
+Dense prefill stayed positive but tiny:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving A/B at `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision: carry the shortcut as a default-off opt-in candidate. It is no longer
+a prefill-only win, but Phase68 alone is not enough to default it on. Any future
+default-on proposal must mirror the fork commit into the LocalAI patch series
+and rerun a broader current serving snapshot with pre/post md5 and op gates.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -123,6 +123,16 @@ It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved
 Keep it default-off until dense and serving A/B decide whether it is worth a
 default policy change.

+Phase68 ran that dense and serving A/B without changing source. Dense prefill
+was positive but tiny (`973.13 -> 975.52` at `npp=512`, `1019.88 -> 1021.39` at
+`npp=2048`). A small MoE serving window at `N=128`, prompt `128`, generation
+`128` also moved in the right direction: aggregate `409.8 -> 415.0`,
+decode aggregate `615.3 -> 627.2`, mean TTFT `8574.7 -> 8085.9 ms`, wall
+`39.978 -> 39.480 s`. Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off
+but carry it forward as an opt-in shortcut candidate. Do not default it on until
+the fork commit is mirrored into the LocalAI patch series and a broader serving
+snapshot passes pre/post md5 and op gates.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

 ## 2. Decode-serving compute hypotheses (ranked)
--- a/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md
+++ b/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md
@@ -0,0 +1,132 @@
+# BF16 F32 Output Dense Serving Phase68 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Decide whether `LLAMA_BF16_CUBLAS_F32_OUT=1` has enough dense and serving value to consider a default policy change.
+
+**Architecture:** Reuse the Phase67 source patch and DGX build. Run dense prefill A/B first because it is fast and directly targets BF16 projections. Run serving A/B only if dense or MoE evidence supports a broader default-on question.
+
+**Tech Stack:** llama.cpp CUDA backend, DGX GB10, `llama-batched-bench`, optional LocalAI serving snapshot harness, LocalAI parity docs.
+
+---
+
+## Guardrails
+
+- Do not change source in Phase68.
+- Do not make `LLAMA_BF16_CUBLAS_F32_OUT=1` default-on from MoE prefill alone.
+- Keep DGX lock discipline: before acquiring the lock, confirm it is free and that running Docker containers, `local-ai-worker` processes, and GPU compute apps all count `0`.
+- Keep existing md5/op gate evidence from Phase67 as the correctness basis for this exact source commit.
+- Record no-go results as explicitly as wins.
+
+## Files
+
+- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+---
+
+### Task 1: Dense Prefill A/B
+
+- [x] **Step 1: Confirm DGX idle and acquire lock**
+
+Run:
+
+```bash
+ssh dgx.casa 'cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps --format "{{.Names}}" | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader | wc -l'
+ssh dgx.casa 'printf "codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
+```
+
+- [x] **Step 2: Run dense prefill default and opt-in**
+
+Run:
+
+```bash
+./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf \
+  -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32
+```
+
+Run it twice in the same window: once with the default environment and once with `LLAMA_BF16_CUBLAS_F32_OUT=1` set.
+
+- [x] **Step 3: Dense decision**
+
+Dense improved slightly in the same window and did not regress:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+Decision: run a small MoE serving A/B because Phase67 MoE prefill was positive
+and dense did not regress. The dense win is too small to justify default-on by
+itself.
+
+---
+
+### Task 2: Serving A/B If Funded
+
+- [x] **Step 1: Run a small same-window serving A/B**
+
+Use the current clean source tree and the existing h2h client or snapshot harness.
+Compare default versus:
+
+```bash
+LLAMA_BF16_CUBLAS_F32_OUT=1
+```
+
+At minimum capture MoE `N=128`, prompt `128`, generation `128` aggregate,
+decode aggregate, mean TTFT, wall time, and md5 gate summary.
+
+- [x] **Step 2: Serving decision**
+
+Keep default-off unless serving improves or is flat without dense regression.
+Do not default it on from prefill-only evidence.
+
+Serving artifact:
+
+- `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`
+
+MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off, but promote it to a
+safe opt-in shortcut candidate. It now has Phase67 MoE md5/op gates, Phase67
+dense md5/op gates, a tiny positive dense prefill result, and a positive small
+MoE serving A/B. Do not make it default-on until the fork commit is mirrored
+into the LocalAI patch series and retested in a broader serving snapshot.
+
+---
+
+### Task 3: Record and Commit
+
+- [x] **Step 1: Release DGX lock**
+
+Run:
+
+```bash
+ssh dgx.casa 'printf "FREE released-by-codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
+```
+
+- [x] **Step 2: Record docs**
+
+Record artifact path, dense A/B, serving A/B if run, and decision.
+
+- [x] **Step 3: Commit LocalAI docs**
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+        backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+        backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+git commit -m "docs(paged): record BF16 F32 output dense serving phase" \
+  -m "Assisted-by: Codex:gpt-5"
+```
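
Reviewer note: the `change` columns in the dense and serving tables can be rechecked from the recorded default and opt-in values. The helper below is a minimal sketch and is not part of the commit; `pct_change` is a hypothetical name, and it assumes a POSIX shell with `awk` available.

```shell
# pct_change OLD NEW
# Prints the signed relative change from OLD to NEW as a percentage
# with two decimals, matching the format used in the result tables.
# NOTE: hypothetical helper for review only; not part of the repository.
pct_change() {
  awk -v old="$1" -v new="$2" \
    'BEGIN { printf "%+.2f%%\n", (new - old) / old * 100 }'
}

# Dense prefill, npp=512 (default S_PP -> opt-in S_PP)
pct_change 973.13 975.52    # prints +0.25%

# MoE serving, ttft_mean_ms (lower is better, so the sign is negative)
pct_change 8574.7 8085.9    # prints -5.70%
```

Throughput rows improve when the sign is positive, while `ttft_mean_ms` and `wall_s` improve when it is negative, so the sign has to be read against the metric.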