diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index d26f063f5..90d13d15a 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3718,3 +3718,52 @@ Decision:
   removes the profiled BF16-to-F32 conversion row for this shape.
 - Do not make it default-on yet. The gain is modest and needs dense plus serving
   A/B before a default policy change.
+
+## BF16 F32 Output Dense Serving Phase68 Result
+
+Phase68 is recorded in
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+It reused the Phase67 source commit and did not change llama.cpp source.
+
+- Fork commit under test: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`
+- DGX mirror commit under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
+- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`
+- DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`
+- Serving A/B artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`
+
+Correctness basis for this exact source commit remains the Phase67 default and
+opt-in gates:
+
+| mode | MoE md5 | dense md5 | `MUL_MAT` |
+|------|---------|-----------|-----------|
+| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+
+Dense same-window prefill A/B:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision:
+
+- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. The dense prefill gain is
+  positive but too small to justify a default policy change.
+- The opt-in is now worth carrying forward: MoE prefill, dense prefill, and the
+  small MoE serving window all moved in the right direction without changing the
+  Phase67 md5/op correctness gates.
+- Next default-on consideration requires regenerating the LocalAI patch series
+  from the fork and rerunning the broader current serving snapshot gates. Do not
+  default it from Phase68 alone.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 5381afb91..0fc115a9c 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -995,3 +995,37 @@ The opt-in `npp=512` profile removed the BF16-to-F32 conversion row:
 `convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this as
 default-off for now. It is correctness-clean and measurable, but the win is
 small and needs dense plus serving A/B before any default-on decision.
+
+## 13. PHASE68 RESULT: BF16 F32 OUTPUT DENSE + SERVING A/B
+
+Phase68 reused Phase67 source unchanged. Plan:
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`;
+serving A/B artifact:
+`/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.
+
+Correctness basis for the exact source commit remains Phase67: default and
+`LLAMA_BF16_CUBLAS_F32_OUT=1` both produced MoE md5 `8cb0ce23`, dense md5
+`5951a5b4`, and `MUL_MAT 1146/1146`.
+
+Dense prefill stayed positive but tiny:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving A/B at `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision: carry the shortcut as a default-off opt-in candidate. It is no longer
+just a prefill-only win, but Phase68 is not enough to default it on. Any future
+default-on proposal must mirror the fork commit into the LocalAI patch series
+and rerun a broader current serving snapshot with pre/post md5 and op gates.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 2f11d0923..c3e857777 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -123,6 +123,16 @@ It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved
 Keep it default-off until dense and serving A/B decide whether it is worth a
 default policy change.
 
+Phase68 ran that dense and serving A/B without changing source. Dense prefill
+was positive but tiny (`973.13 -> 975.52` at `npp=512`, `1019.88 -> 1021.39` at
+`npp=2048`). A small MoE serving window at `N=128`, prompt `128`, generation
+`128` also moved in the right direction: aggregate `409.8 -> 415.0`,
+decode aggregate `615.3 -> 627.2`, mean TTFT `8574.7 -> 8085.9 ms`, wall
+`39.978 -> 39.480 s`. Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off
+but worth carrying as an opt-in shortcut candidate. Do not default it on until
+the fork commit is mirrored into the LocalAI patch series and a broader serving
+snapshot passes pre/post md5 and op gates.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
 
 ## 2. Decode-serving compute hypotheses (ranked)
diff --git a/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md b/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md
new file mode 100644
index 000000000..207263757
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md
@@ -0,0 +1,132 @@
+# BF16 F32 Output Dense Serving Phase68 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Decide whether `LLAMA_BF16_CUBLAS_F32_OUT=1` has enough dense and serving value to consider a default policy change.
+
+**Architecture:** Reuse the Phase67 source patch and DGX build. Run dense prefill A/B first because it is fast and directly targets BF16 projections. Run serving A/B only if dense or MoE evidence supports a broader default-on question.
+
+**Tech Stack:** llama.cpp CUDA backend, DGX GB10, `llama-batched-bench`, optional LocalAI serving snapshot harness, LocalAI parity docs.
+
+---
+
+## Guardrails
+
+- Do not change source in Phase68.
+- Do not make `LLAMA_BF16_CUBLAS_F32_OUT=1` default-on from MoE prefill alone.
+- Keep DGX lock discipline: lock free, Docker `0`, `local-ai-worker` `0`, compute apps `0`.
+- Keep existing md5/op gate evidence from Phase67 as the correctness basis for this exact source commit.
+- Record no-go results as explicitly as wins.
+
+## Files
+
+- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+---
+
+### Task 1: Dense Prefill A/B
+
+- [x] **Step 1: Confirm DGX idle and acquire lock**
+
+Run:
+
+```bash
+ssh dgx.casa 'cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps --format "{{.Names}}" | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader | wc -l'
+ssh dgx.casa 'printf "codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
+```
+
+- [x] **Step 2: Run dense prefill default and opt-in**
+
+Run:
+
+```bash
+./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf \
+  -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32
+```
+
+with and without `LLAMA_BF16_CUBLAS_F32_OUT=1`.
+
+- [x] **Step 3: Dense decision**
+
+Dense improved slightly in the same window and did not regress:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+Decision: run a small MoE serving A/B because Phase67 MoE prefill was positive
+and dense did not regress. The dense win is too small to justify default-on by
+itself.
+
+---
+
+### Task 2: Serving A/B If Funded
+
+- [x] **Step 1: Run a small same-window serving A/B**
+
+Use the current clean source tree and the existing h2h client or snapshot harness.
+Compare default versus:
+
+```bash
+LLAMA_BF16_CUBLAS_F32_OUT=1
+```
+
+At minimum capture MoE `N=128`, prompt `128`, generation `128` aggregate,
+decode aggregate, mean TTFT, wall time, and md5 gate summary.
+
+- [x] **Step 2: Serving decision**
+
+Keep default-off unless serving improves or is flat without dense regression.
+Do not default-on from prefill-only evidence.
+
+Serving artifact:
+
+- `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`
+
+MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off but promoted as a
+safe opt-in shortcut candidate. It now has Phase67 MoE md5/op gates, Phase67
+dense md5/op gates, a tiny positive dense prefill result, and a positive small
+MoE serving A/B. Do not make it default-on until it is patch-series mirrored and
+retested in a broader serving snapshot.
+
+---
+
+### Task 3: Record and Commit
+
+- [x] **Step 1: Release DGX lock**
+
+Run:
+
+```bash
+ssh dgx.casa 'printf "FREE released-by-codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
+```
+
+- [x] **Step 2: Record docs**
+
+Record artifact path, dense A/B, serving A/B if run, and decision.
+
+- [x] **Step 3: Commit LocalAI docs**
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+git commit -m "docs(paged): record BF16 F32 output dense serving phase" \
+  -m "Assisted-by: Codex:gpt-5"
+```
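
Review note: the `change` columns in the Phase68 dense and serving tables can be re-derived from the recorded default and opt-in values. The following is a minimal Python sketch, assuming the usual relative-change formula `(opt_in - default) / default`. The helper names `pct_change` and `format_changes` are hypothetical and are not part of the benchmark harness; the numbers are copied from the tables in this patch.

```python
def pct_change(default: float, opt_in: float) -> float:
    """Relative change of opt_in versus default, in percent (assumed formula)."""
    return (opt_in - default) / default * 100.0


# (default, opt-in) pairs copied from the Phase68 tables in this patch.
PHASE68 = {
    "dense_npp512_S_PP": (973.13, 975.52),
    "dense_npp2048_S_PP": (1019.88, 1021.39),
    "agg_tps": (409.8, 415.0),
    "decode_agg_tps": (615.3, 627.2),
    "decode_perseq_tps": (4.15, 4.16),
    "prefill_tps": (1630.2, 1648.0),
    "ttft_mean_ms": (8574.7, 8085.9),
    "wall_s": (39.978, 39.480),
}


def format_changes(rows):
    """Format each row as a signed percentage with two decimals, as in the tables."""
    return {name: f"{pct_change(d, o):+.2f}%" for name, (d, o) in rows.items()}


if __name__ == "__main__":
    for name, change in format_changes(PHASE68).items():
        print(f"{name:>20} {change}")
```

For the throughput rows a positive change is an improvement, while for `ttft_mean_ms` and `wall_s` a negative change is the improvement, which is why the serving table mixes signs.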