docs(paged): record BF16 F32 output broader serving phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 13:26:50 +00:00
parent e573194799
commit 6cf8b782d1
5 changed files with 381 additions and 0 deletions

@@ -0,0 +1,143 @@
# llama.cpp vLLM Parity Benchmark Ledger
This file tracks each parity attempt from Phase70 onward, plus the immediate
context needed to interpret the current record. Append every new attempt here
with artifact path, gates, benchmark rows, and decision.
## Current Status
- Goal: reach vLLM speed parity in llama.cpp on GB10.
- Current decision model: MoE `q36-35b-a3b-nvfp4`.
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase70.
- Latest decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. It is
correctness-clean, but the broader serving results are too mixed to enable it
by default.
## Current Serving Record
Phase70 broader serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`.
Artifact:
- `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s |
|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:|
| llama default | `8` | `178.5` | `242.6` | `29.82` | `1767.2` | `754.8` | `2.868` |
| llama opt-in | `8` | `158.8` | `218.3` | `26.60` | `1541.1` | `848.9` | `3.225` |
| vLLM | `8` | `260.9` | `299.5` | `36.67` | `5415.6` | `239.0` | `1.917` |
| llama default | `32` | `250.1` | `418.7` | `11.75` | `1661.2` | `2717.0` | `8.187` |
| llama opt-in | `32` | `247.9` | `417.6` | `11.79` | `1650.3` | `2803.9` | `8.261` |
| vLLM | `32` | `465.3` | `608.4` | `17.74` | `5394.4` | `782.7` | `4.314` |
| llama default | `128` | `322.5` | `706.2` | `3.87` | `1613.9` | `7836.5` | `25.401` |
| llama opt-in | `128` | `324.8` | `697.9` | `3.88` | `1671.1` | `7720.9` | `25.220` |
| vLLM | `128` | `659.9` | `1020.4` | `6.75` | `5228.0` | `2543.1` | `12.060` |
Ratios:
| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM | default agg/vLLM | opt agg/vLLM |
|--:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|-----------------:|-------------:|
| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` | `0.6842` | `0.6087` |
| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` | `0.5375` | `0.5328` |
| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` | `0.4887` | `0.4922` |
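The ratio columns can be recomputed from the serving rows. The following is a minimal sketch, assuming the ratios are plain quotients of the table values rounded to four places; the `ROWS` layout and the `ratios` helper are illustrative names, not part of the benchmark tooling.

```python
# Rows copied from the Phase70 serving table:
# (agg_tps, decode_agg_tps, ttft_mean_ms) per arm and concurrency.
ROWS = {
    8: {"default": (178.5, 242.6, 754.8),
        "optin": (158.8, 218.3, 848.9),
        "vllm": (260.9, 299.5, 239.0)},
    32: {"default": (250.1, 418.7, 2717.0),
         "optin": (247.9, 417.6, 2803.9),
         "vllm": (465.3, 608.4, 782.7)},
    128: {"default": (322.5, 706.2, 7836.5),
          "optin": (324.8, 697.9, 7720.9),
          "vllm": (659.9, 1020.4, 2543.1)},
}


def ratios(n):
    """Return the seven ratio columns for concurrency n, rounded to 4 places."""
    d_agg, d_dec, d_ttft = ROWS[n]["default"]
    o_agg, o_dec, o_ttft = ROWS[n]["optin"]
    v_agg, v_dec, _ = ROWS[n]["vllm"]
    return {
        "opt/default agg": round(o_agg / d_agg, 4),
        "opt/default decode": round(o_dec / d_dec, 4),
        "opt/default TTFT": round(o_ttft / d_ttft, 4),  # above 1.0 means slower
        "default decode/vLLM": round(d_dec / v_dec, 4),
        "opt decode/vLLM": round(o_dec / v_dec, 4),
        "default agg/vLLM": round(d_agg / v_agg, 4),
        "opt agg/vLLM": round(o_agg / v_agg, 4),
    }


if __name__ == "__main__":
    for n in sorted(ROWS):
        print(n, ratios(n))
```

Throughput ratios below `1.0` and TTFT ratios above `1.0` both mean the opt-in arm was worse than default.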
Decision:
- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`.
- Keep as default-off opt-in only.
- At `n=8` the opt-in cut aggregate throughput by about 11% and decode
throughput by about 10%, and raised mean TTFT by about 12%. At `n=32` and
`n=128` it slightly widened the vLLM decode gap.
## Attempt Log
### Phase70: BF16 F32 Output Broader Serving
- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
- Artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`.
- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
`PARALLEL=128`, `CTX=131072`.
Gates:
| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
Result:
- Default-on rejected.
- Opt-in remains correctness-clean, but broad serving is mixed-to-negative.
### Phase69: Patch Series Mirror Readiness
- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md`.
- Artifact: local dry-run only.
- Result: current `0001..0063` series matched Phase37 tree
`dedb1182910eafe9f6875588dc8285bfb544cce5`; projected `0064..0073`
matched fork HEAD tree `fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4`.
- Decision: patch regeneration is technically ready but blocked on explicit
push approval by policy.
### Phase68: BF16 F32 Output Dense Serving
- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
- Artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`.
- Serving artifact:
`/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.
Dense prefill:
| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |
MoE serving `N=128`, prompt `128`, generation `128`:
| metric | default | opt-in | change |
|--------|--------:|-------:|-------:|
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |
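The `change` column is a relative difference against the default arm. A short sketch, assuming the percentages are `(opt-in / default - 1) * 100` formatted to two decimals; `pct_change` and `PHASE68` are illustrative names.

```python
def pct_change(default, optin):
    """Relative change of the opt-in arm versus default, in percent."""
    return (optin / default - 1.0) * 100.0


# (default, opt-in) pairs copied from the Phase68 MoE serving table.
PHASE68 = {
    "agg_tps": (409.8, 415.0),
    "decode_agg_tps": (615.3, 627.2),
    "prefill_tps": (1630.2, 1648.0),
    "ttft_mean_ms": (8574.7, 8085.9),  # lower is better
    "wall_s": (39.978, 39.480),        # lower is better
}

if __name__ == "__main__":
    for metric, (default, optin) in PHASE68.items():
        print(f"{metric}: {pct_change(default, optin):+.2f}%")
```

For `ttft_mean_ms` and `wall_s` a negative change is an improvement; for the throughput metrics a positive change is.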
Decision:
- Carry as default-off opt-in candidate pending broader serving evidence.
### Phase67: BF16 cuBLAS F32 Output
- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md`.
- Artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`.
- Fork commit: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`.
- DGX mirror commit: `14fd69f1e`.
- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`.
Gates:
| mode | MoE md5 | dense md5 | `MUL_MAT` |
|------|---------|-----------|-----------|
| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
MoE prefill:
| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `2347.41` | `2402.34` | `+2.34%` |
| `2048` | `2440.18` | `2456.54` | `+0.67%` |
Decision:
- Keep default-off pending dense and serving A/B.

@@ -3825,3 +3825,50 @@ Decision:
regenerating the LocalAI patch series. Push still requires explicit approval.
- After push approval, regenerate `0064..0073`, repeat the tree hash check, and
only then run broader serving gates for any default-on BF16 policy decision.
## BF16 F32 Output Broader Serving Phase70 Result
Phase70 is recorded in
`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
It did not change llama.cpp source or edit the generated LocalAI patches.
It also created the running benchmark ledger at
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
- DGX artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
- Source under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
`PARALLEL=128`, `CTX=131072`
Pre/post gates passed:
| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
Serving A/B and vLLM comparison:
| n | default agg | opt-in agg | vLLM agg | default decode | opt-in decode | vLLM decode |
|---:|------------:|-----------:|---------:|---------------:|--------------:|------------:|
| `8` | `178.5` | `158.8` | `260.9` | `242.6` | `218.3` | `299.5` |
| `32` | `250.1` | `247.9` | `465.3` | `418.7` | `417.6` | `608.4` |
| `128` | `322.5` | `324.8` | `659.9` | `706.2` | `697.9` | `1020.4` |
Ratios:
| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM |
|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|
| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` |
| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` |
| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` |
Decision:
- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`.
- Keep the shortcut as default-off only. It is correctness-clean, but the
broader serving window regressed `n=8` materially and slightly widened the
vLLM decode gap at `n=32` and `n=128`.
- The next parity phase should not spend more time on this default policy. Use
the benchmark ledger for every following attempt.

@@ -1079,3 +1079,31 @@ requires pushing before regenerating the LocalAI series. Do not push without
explicit approval. After approval, push the fork, regenerate `0064..0073`, rerun
the same tree-hash check, and then run the broader serving gates before any
default-on BF16 policy change.
## 15. PHASE70 RESULT: BF16 F32 OUTPUT BROADER SERVING
Phase70 broadened the Phase68 serving evidence without source changes. Plan:
`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
Benchmark ledger:
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
DGX artifact:
`/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`.
Gates stayed green. Default pre/post gates matched MoE md5 `8cb0ce23`, dense
md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Opt-in pre/post
gates matched MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, and `MUL_MAT
1146/1146`.
Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`,
`PARALLEL=128`.
| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM |
|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|
| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` |
| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` |
| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` |
Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is
correctness-clean, but it materially regressed low-concurrency serving and
slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it
default-off only and move the next parity effort to a different lever.
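To put a number on "slightly widened", the decode gap can be expressed in percentage points of vLLM parity. This is a minimal sketch, assuming the widening is simply the default ratio minus the opt-in ratio; `gap_widening_pp` is an illustrative helper, not part of the benchmark tooling.

```python
# (default decode/vLLM, opt decode/vLLM) copied from the ratio table.
DECODE_VS_VLLM = {
    8: (0.8100, 0.7289),
    32: (0.6882, 0.6864),
    128: (0.6921, 0.6839),
}


def gap_widening_pp(default_ratio, optin_ratio):
    """Percentage points of vLLM decode parity lost by enabling the opt-in."""
    return round((default_ratio - optin_ratio) * 100.0, 2)


if __name__ == "__main__":
    for n, (default_ratio, optin_ratio) in sorted(DECODE_VS_VLLM.items()):
        print(f"n={n}: {gap_widening_pp(default_ratio, optin_ratio)} pp")
```

On these numbers the gap grows by roughly 8.1 points at `n=8`, 0.2 points at `n=32`, and 0.8 points at `n=128`.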

@@ -133,6 +133,13 @@ but worth carrying as an opt-in shortcut candidate. Do not default it on until
the fork commit is mirrored into the LocalAI patch series and a broader serving
snapshot passes pre/post md5 and op gates.
Phase70 ran that broader serving snapshot. Gates stayed green, but the broader
window rejected default-on: at `N=8`, opt-in aggregate and decode fell to
`0.8896x` and `0.8998x` of default, and mean TTFT worsened to `1.1247x`.
At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap
(`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep
`LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever.
Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
## 2. Decode-serving compute hypotheses (ranked)

@@ -0,0 +1,156 @@
# BF16 F32 Output Broader Serving Phase70 Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** decide whether `LLAMA_BF16_CUBLAS_F32_OUT=1` has enough broader serving evidence to move beyond default-off opt-in status.
**Architecture:** Do not change source. Reuse the Phase67 DGX mirror and binary, bracket the benchmark with canonical inference gates, then run same-window llama.cpp default, llama.cpp opt-in, and vLLM serving arms across multiple concurrencies.
**Tech Stack:** llama.cpp CUDA backend, DGX GB10, `llama-server`, vLLM 0.23.0, `h2h_cli3.py`, LocalAI parity docs.
---
## Guardrails
- Do not change llama.cpp source in Phase70.
- Do not regenerate LocalAI generated patches.
- Do not push any repository.
- Confirm zero running Docker containers, zero `local-ai-worker` processes, and zero GPU compute apps before taking the DGX lock.
- Bracket serving with md5/op gates so inference safety is explicit.
- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off unless broad serving is consistently flat-to-positive with gates green.
## Files
- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`
- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
---
### Task 1: DGX Preflight And Gates
- [x] **Step 1: Confirm DGX idle**
Run:
```bash
ssh dgx.casa 'set -e; cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps -q | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l'
```
Expected:
```text
FREE...
0
0
0
```
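The expected output can also be checked mechanically. The following is a hedged sketch, assuming the last three non-empty lines are the Docker, `local-ai-worker`, and GPU compute-app counts, and that any preceding line is lock-file content that must start with `FREE`; `dgx_is_idle` is a hypothetical helper, not an existing script.

```python
def dgx_is_idle(output: str) -> bool:
    """Return True when the preflight output shows a free lock and zero counts."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    if len(lines) < 3:
        return False
    # Everything before the three counts is lock-file content. A missing
    # lock file prints nothing, which this sketch treats as free.
    *lock_lines, docker, workers, gpu_apps = lines
    lock_free = all(ln.startswith("FREE") for ln in lock_lines)
    try:
        counts_zero = all(int(x) == 0 for x in (docker, workers, gpu_apps))
    except ValueError:
        return False
    return lock_free and counts_zero


if __name__ == "__main__":
    sample = "FREE released-by-someone 1700000000\n0\n0\n0\n"
    print(dgx_is_idle(sample))
```

A lock line that names an owner instead of `FREE`, or any non-zero count, makes the check fail.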
- [x] **Step 2: Run pre gates**
Run the canonical gates under the default environment and under the opt-in environment:
```bash
ssh dgx.casa 'ART=$HOME/bench/phase70_bf16_broader_serving/<ts>/gate_pre_default OPS=MUL_MAT,MUL_MAT_ID ~/paged-inference-gates.sh'
ssh dgx.casa 'ART=$HOME/bench/phase70_bf16_broader_serving/<ts>/gate_pre_optin OPS=MUL_MAT EXTRA_ENV="LLAMA_BF16_CUBLAS_F32_OUT=1" ~/paged-inference-gates.sh'
```
Expected:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
- op gates green.
Result:
- Artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
- Default pre gates: MoE/dense md5 matched, `MUL_MAT 1146/1146`,
`MUL_MAT_ID 806/806`.
- Opt-in pre gates: MoE/dense md5 matched, `MUL_MAT 1146/1146`.
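The pass criteria above can be written as a small check. This is a sketch under stated assumptions: the md5 values and the `passed/total` op-count format come from this plan, while `gate_green` and `op_gate_passed` are hypothetical helpers and do not describe how `paged-inference-gates.sh` works internally.

```python
CANONICAL_MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"
CANONICAL_DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"


def op_gate_passed(result: str) -> bool:
    """Parse a 'passed/total' string such as '1146/1146'."""
    passed, total = (int(part) for part in result.split("/"))
    return total > 0 and passed == total


def gate_green(moe_md5: str, dense_md5: str, op_results: dict) -> bool:
    """True when both md5 values match and every op gate that ran passed.

    Ops that were not run (for example MUL_MAT_ID on the opt-in arm)
    are simply left out of op_results.
    """
    return (
        moe_md5 == CANONICAL_MOE_MD5
        and dense_md5 == CANONICAL_DENSE_MD5
        and all(op_gate_passed(r) for r in op_results.values())
    )


if __name__ == "__main__":
    default_ops = {"MUL_MAT": "1146/1146", "MUL_MAT_ID": "806/806"}
    print(gate_green(CANONICAL_MOE_MD5, CANONICAL_DENSE_MD5, default_ops))
```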
### Task 2: Same-Window Serving Snapshot
- [x] **Step 1: Acquire lock**
Use both active lock conventions:
```bash
ssh dgx.casa 'mkdir -p ~/gpu_bench_lock; echo "codex-phase70-bf16-broader-serving $(date +%s)" > ~/gpu_bench_lock/owner; printf "codex-phase70-bf16-broader-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
```
- [x] **Step 2: Run three serving arms**
Run:
- llama.cpp default
- llama.cpp with `LLAMA_BF16_CUBLAS_F32_OUT=1`
- vLLM
Shape:
```text
model=MoE q36-35b-a3b-nvfp4
NPL=8 32 128
PTOK=128
GEN=64
PARALLEL=128
CTX=131072
```
- [x] **Step 3: Release lock**
Run:
```bash
ssh dgx.casa 'echo "FREE released-by-codex-phase70-bf16-broader-serving $(date +%s)" > ~/gpu_bench_lock/owner; printf "FREE released-by-codex-phase70-bf16-broader-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
```
### Task 3: Post Gates And Decision
- [x] **Step 1: Run post gates**
Repeat default and opt-in gates after serving.
- [x] **Step 2: Summarize metrics**
Capture for each `N`:
- default vs opt-in aggregate throughput
- default vs opt-in decode aggregate throughput
- default vs opt-in TTFT
- opt-in vs vLLM decode and aggregate ratios
- [x] **Step 3: Decision**
Keep default-off if any concurrency materially regresses or if the result is mixed. Consider default-on only if all concurrencies are flat-to-positive, post gates are green, and the opt-in does not widen the vLLM parity gap.
Result summary:
| n | default agg | opt-in agg | opt/default agg | default decode | opt-in decode | opt/default decode |
|---:|------------:|-----------:|----------------:|---------------:|--------------:|-------------------:|
| `8` | `178.5` | `158.8` | `0.8896` | `242.6` | `218.3` | `0.8998` |
| `32` | `250.1` | `247.9` | `0.9912` | `418.7` | `417.6` | `0.9974` |
| `128` | `322.5` | `324.8` | `1.0071` | `706.2` | `697.9` | `0.9882` |
Decision: reject default-on. The opt-in materially regressed low-concurrency
serving and slightly widened the vLLM decode gap at `n=32` and `n=128`, despite
green gates.
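The Step 3 rule can be sketched as a function. The plan does not define "materially", so the 1% tolerance below is a hypothetical threshold chosen for illustration; the `Arm` type and `decide` function are likewise assumptions, not existing tooling.

```python
from dataclasses import dataclass

TOLERANCE = 0.01  # hypothetical: the plan does not quantify "materially"


@dataclass
class Arm:
    n: int
    agg_ratio: float          # opt-in / default aggregate throughput
    decode_ratio: float       # opt-in / default decode throughput
    ttft_ratio: float         # opt-in / default mean TTFT, lower is better
    default_vs_vllm: float    # default decode / vLLM decode
    optin_vs_vllm: float      # opt-in decode / vLLM decode


def decide(arms, gates_green, tol=TOLERANCE):
    """Return 'default-off' unless every concurrency is flat-to-positive."""
    if not gates_green:
        return "default-off"
    for arm in arms:
        regressed = (
            arm.agg_ratio < 1.0 - tol
            or arm.decode_ratio < 1.0 - tol
            or arm.ttft_ratio > 1.0 + tol
        )
        widened = arm.optin_vs_vllm < arm.default_vs_vllm
        if regressed or widened:
            return "default-off"
    return "consider default-on"


# Ratios copied from the Phase70 ratio table.
PHASE70 = [
    Arm(8, 0.8896, 0.8998, 1.1247, 0.8100, 0.7289),
    Arm(32, 0.9912, 0.9974, 1.0320, 0.6882, 0.6864),
    Arm(128, 1.0071, 0.9882, 0.9852, 0.6921, 0.6839),
]

if __name__ == "__main__":
    print(decide(PHASE70, gates_green=True))
```

With the Phase70 ratios, the `n=8` arm alone fails the tolerance check, so the function returns `default-off`.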
### Task 4: Record And Commit
- [x] **Step 1: Update docs**
Record artifact path, gates, serving table, ratio table, and decision.
- [x] **Step 2: Commit docs**
```bash
git add -f docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md
git add backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md \
backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
git commit -m "docs(paged): record BF16 F32 output broader serving phase" \
-m "Assisted-by: Codex:gpt-5"
```