diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
new file mode 100644
index 000000000..8a8baf3cd
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
@@ -0,0 +1,143 @@
+# llama.cpp vLLM Parity Benchmark Ledger
+
+This file tracks each parity attempt from Phase70 onward, plus the immediate
+context needed to interpret the current record. Append every new attempt here
+with artifact path, gates, benchmark rows, and decision.
+
+## Current Status
+
+- Goal: reach vLLM speed parity in llama.cpp on GB10.
+- Current decision model: MoE `q36-35b-a3b-nvfp4`.
+- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Latest attempt: Phase70.
+- Latest decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. It is
+  correctness-clean but not serving-safe enough to default on.
+
+## Current Serving Record
+
+Phase70 broader serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`.
+
+Artifact:
+
+- `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
+
+| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s |
+|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:|
+| llama default | `8` | `178.5` | `242.6` | `29.82` | `1767.2` | `754.8` | `2.868` |
+| llama opt-in | `8` | `158.8` | `218.3` | `26.60` | `1541.1` | `848.9` | `3.225` |
+| vLLM | `8` | `260.9` | `299.5` | `36.67` | `5415.6` | `239.0` | `1.917` |
+| llama default | `32` | `250.1` | `418.7` | `11.75` | `1661.2` | `2717.0` | `8.187` |
+| llama opt-in | `32` | `247.9` | `417.6` | `11.79` | `1650.3` | `2803.9` | `8.261` |
+| vLLM | `32` | `465.3` | `608.4` | `17.74` | `5394.4` | `782.7` | `4.314` |
+| llama default | `128` | `322.5` | `706.2` | `3.87` | `1613.9` | `7836.5` | `25.401` |
+| llama opt-in | `128` | `324.8` | `697.9` | `3.88` | `1671.1` | `7720.9` | `25.220` |
+| vLLM | `128` | `659.9` | `1020.4` | `6.75` | `5228.0` | `2543.1` | `12.060` |
+
+Ratios:
+
+| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM | default agg/vLLM | opt agg/vLLM |
+|--:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|-----------------:|-------------:|
+| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` | `0.6842` | `0.6087` |
+| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` | `0.5375` | `0.5328` |
+| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` | `0.4887` | `0.4922` |
+
+Decision:
+
+- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`.
+- Keep as default-off opt-in only.
+- The opt-in regressed `n=8` throughput and TTFT materially, and slightly
+  widened the vLLM decode gap at `n=32` and `n=128`.
+
+## Attempt Log
+
+### Phase70: BF16 F32 Output Broader Serving
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
+- Artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`.
+- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
+  `PARALLEL=128`, `CTX=131072`.
+
+Gates:
+
+| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|------|---------|-----------|-----------|--------------|
+| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
+| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
+
+Result:
+
+- Default-on rejected.
+- Opt-in remains correctness-clean, but broad serving is mixed-to-negative.
+
+### Phase69: Patch Series Mirror Readiness
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md`.
+- Artifact: local dry-run only.
+- Result: current `0001..0063` series matched Phase37 tree
+  `dedb1182910eafe9f6875588dc8285bfb544cce5`; projected `0064..0073`
+  matched fork HEAD tree `fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4`.
+- Decision: patch regeneration is technically ready but blocked on explicit
+  push approval by policy.
+
+### Phase68: BF16 F32 Output Dense Serving
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+- Artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`.
+- Serving artifact:
+  `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.
+
+Dense prefill:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving `N=128`, prompt `128`, generation `128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision:
+
+- Carry as default-off opt-in candidate pending broader serving evidence.
+
+### Phase67: BF16 cuBLAS F32 Output
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md`.
+- Artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`.
+- Fork commit: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`.
+- DGX mirror commit: `14fd69f1e`.
+- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`.
+
+Gates:
+
+| mode | MoE md5 | dense md5 | `MUL_MAT` |
+|------|---------|-----------|-----------|
+| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+
+MoE prefill:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `2347.41` | `2402.34` | `+2.34%` |
+| `2048` | `2440.18` | `2456.54` | `+0.67%` |
+
+Decision:
+
+- Keep default-off pending dense and serving A/B.
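The `change` columns in the Phase67 and Phase68 tables above read as the relative difference of the opt-in arm against the default arm. The following is a minimal sketch for re-deriving them when appending a new ledger entry, assuming the convention `(opt_in / default - 1) * 100` with two decimals. `pct_change` is a hypothetical helper for illustration and is not part of `h2h_cli3.py` or the gate scripts.

```python
def pct_change(default: float, opt_in: float) -> str:
    """Relative change of opt-in vs default, formatted like the ledger's `change` column."""
    return f"{(opt_in / default - 1.0) * 100.0:+.2f}%"


# (default, opt-in) pairs copied from the ledger tables above.
phase67_moe_prefill = {512: (2347.41, 2402.34), 2048: (2440.18, 2456.54)}
phase68_dense_prefill = {512: (973.13, 975.52), 2048: (1019.88, 1021.39)}
phase68_moe_serving = {
    "agg_tps": (409.8, 415.0),
    "decode_agg_tps": (615.3, 627.2),
    "prefill_tps": (1630.2, 1648.0),
    "ttft_mean_ms": (8574.7, 8085.9),  # lower is better
    "wall_s": (39.978, 39.480),        # lower is better
}

for label, table in (
    ("Phase67 MoE prefill", phase67_moe_prefill),
    ("Phase68 dense prefill", phase68_dense_prefill),
    ("Phase68 MoE serving", phase68_moe_serving),
):
    for key, (default, opt_in) in table.items():
        print(f"{label} {key}: {pct_change(default, opt_in)}")
```

For `ttft_mean_ms` and `wall_s` a negative change is an improvement, so the sign alone does not tell whether a row favors the opt-in.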
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 473673ed5..928c07a94 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3825,3 +3825,50 @@ Decision:
   regenerating the LocalAI patch series. Push still requires explicit approval.
 - After push approval, regenerate `0064..0073`, repeat the tree hash check, and
   only then run broader serving gates for any default-on BF16 policy decision.
+
+## BF16 F32 Output Broader Serving Phase70 Result
+
+Phase70 is recorded in
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
+It did not change llama.cpp source and did not edit generated LocalAI patches.
+It also creates the running benchmark ledger at
+`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
+
+- DGX artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
+- Source under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
+- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
+  `PARALLEL=128`, `CTX=131072`
+
+Pre/post gates passed:
+
+| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|------|---------|-----------|-----------|--------------|
+| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
+| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
+
+Serving A/B and vLLM comparison:
+
+| n | default agg | opt-in agg | vLLM agg | default decode | opt-in decode | vLLM decode |
+|---:|------------:|-----------:|---------:|---------------:|--------------:|------------:|
+| `8` | `178.5` | `158.8` | `260.9` | `242.6` | `218.3` | `299.5` |
+| `32` | `250.1` | `247.9` | `465.3` | `418.7` | `417.6` | `608.4` |
+| `128` | `322.5` | `324.8` | `659.9` | `706.2` | `697.9` | `1020.4` |
+
+Ratios:
+
+| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM |
+|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|
+| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` |
+| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` |
+| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` |
+
+Decision:
+
+- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`.
+- Keep the shortcut as default-off only. It is correctness-clean, but the
+  broader serving window regressed `n=8` materially and slightly widened the
+  vLLM decode gap at `n=32` and `n=128`.
+- The next parity phase should not spend more time on this default policy. Use
+  the benchmark ledger for every following attempt.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 538ea00a9..9854db763 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -1079,3 +1079,31 @@ requires pushing before regenerating the LocalAI series. Do not push without
 explicit approval. After approval, push the fork, regenerate `0064..0073`,
 rerun the same tree-hash check, and then run the broader serving gates before
 any default-on BF16 policy change.
+
+## 15. PHASE70 RESULT: BF16 F32 OUTPUT BROADER SERVING
+
+Phase70 broadened the Phase68 serving evidence without source changes. Plan:
+`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
+Benchmark ledger:
+`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
+DGX artifact:
+`/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`.
+
+Gates stayed green. Default pre/post gates matched MoE md5 `8cb0ce23`, dense
+md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Opt-in pre/post
+gates matched MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, and `MUL_MAT
+1146/1146`.
+
+Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`,
+`PARALLEL=128`.
+
+| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM |
+|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|
+| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` |
+| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` |
+| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` |
+
+Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is
+correctness-clean, but it materially regressed low-concurrency serving and
+slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it
+default-off only and move the next parity effort to a different lever.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index c3e857777..aeff3ab02 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -133,6 +133,13 @@ but worth carrying as an opt-in shortcut candidate. Do not default it on until
 the fork commit is mirrored into the LocalAI patch series and a broader serving
 snapshot passes pre/post md5 and op gates.
 
+Phase70 ran that broader serving snapshot. Gates stayed green, but the broader
+window rejected default-on: at `N=8`, opt-in aggregate and decode fell to
+`0.8896x` and `0.8998x` of default, and mean TTFT worsened to `1.1247x`.
+At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap
+(`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep
+`LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
 
 ## 2. Decode-serving compute hypotheses (ranked)
diff --git a/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md b/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md
new file mode 100644
index 000000000..592f9da54
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md
@@ -0,0 +1,156 @@
+# BF16 F32 Output Broader Serving Phase70 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** decide whether `LLAMA_BF16_CUBLAS_F32_OUT=1` has enough broader serving evidence to move beyond default-off opt-in status.
+
+**Architecture:** Do not change source. Reuse the Phase67 DGX mirror and binary, bracket the benchmark with canonical inference gates, then run same-window llama.cpp default, llama.cpp opt-in, and vLLM serving arms across multiple concurrencies.
+
+**Tech Stack:** llama.cpp CUDA backend, DGX GB10, `llama-server`, vLLM 0.23.0, `h2h_cli3.py`, LocalAI parity docs.
+
+---
+
+## Guardrails
+
+- Do not change llama.cpp source in Phase70.
+- Do not regenerate LocalAI generated patches.
+- Do not push any repository.
+- Confirm Docker `0`, `local-ai-worker` `0`, and GPU compute apps `0` before taking the DGX lock.
+- Bracket serving with md5/op gates so inferencing safety is explicit.
+- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off unless broad serving is consistently flat-to-positive with gates green.
+
+## Files
+
+- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+---
+
+### Task 1: DGX Preflight And Gates
+
+- [x] **Step 1: Confirm DGX idle**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -e; cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps -q | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l'
+```
+
+Expected:
+
+```text
+FREE...
+0
+0
+0
+```
+
+- [x] **Step 2: Run pre gates**
+
+Run canonical gates with default env and opt-in completion env:
+
+```bash
+ssh dgx.casa 'ART=$HOME/bench/phase70_bf16_broader_serving//gate_pre_default OPS=MUL_MAT,MUL_MAT_ID ~/paged-inference-gates.sh'
+ssh dgx.casa 'ART=$HOME/bench/phase70_bf16_broader_serving//gate_pre_optin OPS=MUL_MAT EXTRA_ENV="LLAMA_BF16_CUBLAS_F32_OUT=1" ~/paged-inference-gates.sh'
+```
+
+Expected:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- op gates green.
+
+Result:
+
+- Artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
+- Default pre gates: MoE/dense md5 matched, `MUL_MAT 1146/1146`,
+  `MUL_MAT_ID 806/806`.
+- Opt-in pre gates: MoE/dense md5 matched, `MUL_MAT 1146/1146`.
+
+### Task 2: Same-Window Serving Snapshot
+
+- [x] **Step 1: Acquire lock**
+
+Use both active lock conventions:
+
+```bash
+ssh dgx.casa 'mkdir -p ~/gpu_bench_lock; echo "codex-phase70-bf16-broader-serving $(date +%s)" > ~/gpu_bench_lock/owner; printf "codex-phase70-bf16-broader-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
+```
+
+- [x] **Step 2: Run three serving arms**
+
+Run:
+
+- llama.cpp default
+- llama.cpp with `LLAMA_BF16_CUBLAS_F32_OUT=1`
+- vLLM
+
+Shape:
+
+```text
+model=MoE q36-35b-a3b-nvfp4
+NPL=8 32 128
+PTOK=128
+GEN=64
+PARALLEL=128
+CTX=131072
+```
+
+- [x] **Step 3: Release lock**
+
+Run:
+
+```bash
+ssh dgx.casa 'echo "FREE released-by-codex-phase70-bf16-broader-serving $(date +%s)" > ~/gpu_bench_lock/owner; printf "FREE released-by-codex-phase70-bf16-broader-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
+```
+
+### Task 3: Post Gates And Decision
+
+- [x] **Step 1: Run post gates**
+
+Repeat default and opt-in gates after serving.
+
+- [x] **Step 2: Summarize metrics**
+
+Capture for each `N`:
+
+- default vs opt-in aggregate throughput
+- default vs opt-in decode aggregate throughput
+- default vs opt-in TTFT
+- opt-in vs vLLM decode and aggregate ratios
+
+- [x] **Step 3: Decision**
+
+Keep default-off if any concurrency materially regresses or if the result is mixed. Consider default-on only if all concurrencies are flat-to-positive, post gates are green, and the opt-in does not widen the vLLM parity gap.
+
+Result summary:
+
+| n | default agg | opt-in agg | opt/default agg | default decode | opt-in decode | opt/default decode |
+|---:|------------:|-----------:|----------------:|---------------:|--------------:|-------------------:|
+| `8` | `178.5` | `158.8` | `0.8896` | `242.6` | `218.3` | `0.8998` |
+| `32` | `250.1` | `247.9` | `0.9912` | `418.7` | `417.6` | `0.9974` |
+| `128` | `322.5` | `324.8` | `1.0071` | `706.2` | `697.9` | `0.9882` |
+
+Decision: reject default-on. The opt-in materially regressed low-concurrency
+serving and slightly widened the vLLM decode gap at `n=32` and `n=128`, despite
+green gates.
+
+### Task 4: Record And Commit
+
+- [x] **Step 1: Update docs**
+
+Record artifact path, gates, serving table, ratio table, and decision.
+
+- [x] **Step 2: Commit docs**
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md
+git add backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md \
+    backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+    backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+    backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+git commit -m "docs(paged): record BF16 F32 output broader serving phase" \
+    -m "Assisted-by: Codex:gpt-5"
+```
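The Task 3 decision rule (default-on only if every concurrency is flat-to-positive, gates are green, and the opt-in does not widen the vLLM gap) could be expressed as a small check over the recorded rows. The sketch below is illustrative only: `Row`, `ratios`, and `default_on_allowed` are hypothetical names, and the 1% `FLAT_TOLERANCE` is an assumption, since the plan says "flat-to-positive" and "materially" without giving a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    agg_tps: float
    decode_agg_tps: float
    ttft_mean_ms: float


# Values copied from the Phase70 serving record, keyed by concurrency n.
DEFAULT = {8: Row(178.5, 242.6, 754.8), 32: Row(250.1, 418.7, 2717.0), 128: Row(322.5, 706.2, 7836.5)}
OPT_IN = {8: Row(158.8, 218.3, 848.9), 32: Row(247.9, 417.6, 2803.9), 128: Row(324.8, 697.9, 7720.9)}
VLLM = {8: Row(260.9, 299.5, 239.0), 32: Row(465.3, 608.4, 782.7), 128: Row(659.9, 1020.4, 2543.1)}

FLAT_TOLERANCE = 0.01  # ASSUMPTION: "flat" means within 1%; the plan gives no number.


def ratios(n: int) -> dict[str, float]:
    """Ratios in the same orientation as the ledger's ratio table."""
    d, o, v = DEFAULT[n], OPT_IN[n], VLLM[n]
    return {
        "opt/default agg": o.agg_tps / d.agg_tps,
        "opt/default decode": o.decode_agg_tps / d.decode_agg_tps,
        "opt/default TTFT": o.ttft_mean_ms / d.ttft_mean_ms,
        "default decode/vLLM": d.decode_agg_tps / v.decode_agg_tps,
        "opt decode/vLLM": o.decode_agg_tps / v.decode_agg_tps,
    }


def default_on_allowed(gates_green: bool) -> bool:
    """True only if every concurrency is flat-to-positive and the vLLM gap is not wider."""
    if not gates_green:
        return False
    for n in DEFAULT:
        r = ratios(n)
        flat_or_better = (
            r["opt/default agg"] >= 1.0 - FLAT_TOLERANCE
            and r["opt/default decode"] >= 1.0 - FLAT_TOLERANCE
            and r["opt/default TTFT"] <= 1.0 + FLAT_TOLERANCE  # TTFT: lower is better
        )
        gap_not_wider = r["opt decode/vLLM"] >= r["default decode/vLLM"]
        if not (flat_or_better and gap_not_wider):
            return False
    return True


for n in DEFAULT:
    print(n, {name: round(value, 4) for name, value in ratios(n).items()})
print("default-on allowed:", default_on_allowed(gates_green=True))
```

Under this assumed tolerance the check fails already at `n=8`, which agrees with the recorded decision to reject default-on. A different tolerance would not change that outcome for this data, because the decode-gap condition also fails at every recorded concurrency.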