mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record BF16 F32 output broader serving phase
Assisted-by: Codex:gpt-5
143
backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md
Normal file
@@ -0,0 +1,143 @@
# llama.cpp vLLM Parity Benchmark Ledger

This file tracks each parity attempt from Phase70 onward, plus the immediate
context needed to interpret the current record. Append every new attempt here
with artifact path, gates, benchmark rows, and decision.

## Current Status

- Goal: reach vLLM speed parity in llama.cpp on GB10.
- Current decision model: MoE `q36-35b-a3b-nvfp4`.
- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase70.
- Latest decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. It is
  correctness-clean but not serving-safe enough to default on.

## Current Serving Record

Phase70 broader serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`.

Artifact:

- `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`

| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s |
|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:|
| llama default | `8` | `178.5` | `242.6` | `29.82` | `1767.2` | `754.8` | `2.868` |
| llama opt-in | `8` | `158.8` | `218.3` | `26.60` | `1541.1` | `848.9` | `3.225` |
| vLLM | `8` | `260.9` | `299.5` | `36.67` | `5415.6` | `239.0` | `1.917` |
| llama default | `32` | `250.1` | `418.7` | `11.75` | `1661.2` | `2717.0` | `8.187` |
| llama opt-in | `32` | `247.9` | `417.6` | `11.79` | `1650.3` | `2803.9` | `8.261` |
| vLLM | `32` | `465.3` | `608.4` | `17.74` | `5394.4` | `782.7` | `4.314` |
| llama default | `128` | `322.5` | `706.2` | `3.87` | `1613.9` | `7836.5` | `25.401` |
| llama opt-in | `128` | `324.8` | `697.9` | `3.88` | `1671.1` | `7720.9` | `25.220` |
| vLLM | `128` | `659.9` | `1020.4` | `6.75` | `5228.0` | `2543.1` | `12.060` |

Ratios:

| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM | default agg/vLLM | opt agg/vLLM |
|--:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|-----------------:|-------------:|
| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` | `0.6842` | `0.6087` |
| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` | `0.5375` | `0.5328` |
| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` | `0.4887` | `0.4922` |

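For anyone re-deriving the ratio columns, the following is a minimal sketch of how they follow from the serving rows. It is not the tooling used to produce the artifact; the `ROWS` layout and the `ratio` and `ratio_row` helpers are hypothetical names introduced here, and only three metrics per arm are copied from the serving table.

```python
# Serving rows copied from the Phase70 table: (arm, n) -> metrics.
# The dict layout and helper names are illustrative assumptions.
ROWS = {
    ("default", 8): {"agg_tps": 178.5, "decode_agg_tps": 242.6, "ttft_mean_ms": 754.8},
    ("opt-in", 8): {"agg_tps": 158.8, "decode_agg_tps": 218.3, "ttft_mean_ms": 848.9},
    ("vllm", 8): {"agg_tps": 260.9, "decode_agg_tps": 299.5, "ttft_mean_ms": 239.0},
    ("default", 32): {"agg_tps": 250.1, "decode_agg_tps": 418.7, "ttft_mean_ms": 2717.0},
    ("opt-in", 32): {"agg_tps": 247.9, "decode_agg_tps": 417.6, "ttft_mean_ms": 2803.9},
    ("vllm", 32): {"agg_tps": 465.3, "decode_agg_tps": 608.4, "ttft_mean_ms": 782.7},
    ("default", 128): {"agg_tps": 322.5, "decode_agg_tps": 706.2, "ttft_mean_ms": 7836.5},
    ("opt-in", 128): {"agg_tps": 324.8, "decode_agg_tps": 697.9, "ttft_mean_ms": 7720.9},
    ("vllm", 128): {"agg_tps": 659.9, "decode_agg_tps": 1020.4, "ttft_mean_ms": 2543.1},
}


def ratio(num_arm: str, den_arm: str, n: int, metric: str) -> float:
    """Return metric(num_arm) / metric(den_arm) at concurrency n, to 4 decimals."""
    return round(ROWS[(num_arm, n)][metric] / ROWS[(den_arm, n)][metric], 4)


def ratio_row(n: int) -> dict:
    """Build one row of the ratio table for concurrency n."""
    return {
        "opt/default agg": ratio("opt-in", "default", n, "agg_tps"),
        "opt/default decode": ratio("opt-in", "default", n, "decode_agg_tps"),
        "opt/default TTFT": ratio("opt-in", "default", n, "ttft_mean_ms"),
        "default decode/vLLM": ratio("default", "vllm", n, "decode_agg_tps"),
        "opt decode/vLLM": ratio("opt-in", "vllm", n, "decode_agg_tps"),
        "default agg/vLLM": ratio("default", "vllm", n, "agg_tps"),
        "opt agg/vLLM": ratio("opt-in", "vllm", n, "agg_tps"),
    }


if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, ratio_row(n))
```

Note the direction of each column: throughput ratios below `1.0` are regressions, while a TTFT ratio above `1.0` is a regression because lower TTFT is better.
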
Decision:

- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`.
- Keep as default-off opt-in only.
- The opt-in regressed `n=8` throughput and TTFT materially, and slightly
  widened the vLLM decode gap at `n=32` and `n=128`.

## Attempt Log

### Phase70: BF16 F32 Output Broader Serving

- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
- Artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`.
- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
  `PARALLEL=128`, `CTX=131072`.

Gates:

| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |

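The gate table can be read as a simple pass/fail rule: both md5 values must equal the canonical ones, and every op count that was run must be fully passing. The sketch below encodes that reading. The `GateRow` type and the `gate_failures` helper are hypothetical, and treating "not run" as a skipped check rather than a failure is an assumption about how the ledger interprets it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Canonical values from the Current Status section of this ledger.
CANONICAL_MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"
CANONICAL_DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"


@dataclass
class GateRow:
    """One row of a gate table (hypothetical structure for illustration)."""
    name: str
    moe_md5: str
    dense_md5: str
    mul_mat: str                # "passed/total", e.g. "1146/1146"
    mul_mat_id: Optional[str]   # None when the check was not run


def all_passed(counts: str) -> bool:
    """True when a 'passed/total' string reports every case passing."""
    passed, total = (int(part) for part in counts.split("/"))
    return total > 0 and passed == total


def gate_failures(row: GateRow) -> List[str]:
    """Return human-readable failures for one gate row (empty list means green)."""
    failures = []
    if row.moe_md5 != CANONICAL_MOE_MD5:
        failures.append(f"{row.name}: MoE md5 mismatch")
    if row.dense_md5 != CANONICAL_DENSE_MD5:
        failures.append(f"{row.name}: dense md5 mismatch")
    if not all_passed(row.mul_mat):
        failures.append(f"{row.name}: MUL_MAT {row.mul_mat}")
    # Assumption: a check that was not run is skipped, not failed.
    if row.mul_mat_id is not None and not all_passed(row.mul_mat_id):
        failures.append(f"{row.name}: MUL_MAT_ID {row.mul_mat_id}")
    return failures


if __name__ == "__main__":
    rows = [
        GateRow("pre default", CANONICAL_MOE_MD5, CANONICAL_DENSE_MD5, "1146/1146", "806/806"),
        GateRow("pre opt-in", CANONICAL_MOE_MD5, CANONICAL_DENSE_MD5, "1146/1146", None),
    ]
    for row in rows:
        print(row.name, gate_failures(row) or "green")
```
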
Result:

- Default-on rejected.
- Opt-in remains correctness-clean, but broad serving is mixed-to-negative.

### Phase69: Patch Series Mirror Readiness

- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md`.
- Artifact: local dry-run only.
- Result: current `0001..0063` series matched Phase37 tree
  `dedb1182910eafe9f6875588dc8285bfb544cce5`; projected `0064..0073`
  matched fork HEAD tree `fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4`.
- Decision: patch regeneration is technically ready but blocked on explicit
  push approval by policy.

### Phase68: BF16 F32 Output Dense Serving

- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
- Artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`.
- Serving artifact:
  `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.

Dense prefill:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |

MoE serving `N=128`, prompt `128`, generation `128`:

| metric | default | opt-in | change |
|--------|--------:|-------:|-------:|
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |

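The `change` column is the relative difference of opt-in against default. A minimal sketch of that arithmetic follows; the `pct_change` and `verdict` helpers and the `LOWER_IS_BETTER` set are hypothetical names, and classifying `ttft_mean_ms` and `wall_s` as lower-is-better is an interpretation of the metrics, not something the artifact states.

```python
# Metrics where a negative change is an improvement (assumed classification).
LOWER_IS_BETTER = {"ttft_mean_ms", "wall_s"}


def pct_change(default: float, opt_in: float) -> float:
    """Relative change of opt-in versus default, in percent."""
    return (opt_in - default) / default * 100.0


def verdict(metric: str, default: float, opt_in: float):
    """Return the formatted change and whether it is an improvement."""
    change = pct_change(default, opt_in)
    improved = change < 0 if metric in LOWER_IS_BETTER else change > 0
    return f"{change:+.2f}%", improved


if __name__ == "__main__":
    # Rows copied from the Phase68 MoE serving table.
    phase68 = {
        "agg_tps": (409.8, 415.0),
        "decode_agg_tps": (615.3, 627.2),
        "prefill_tps": (1630.2, 1648.0),
        "ttft_mean_ms": (8574.7, 8085.9),
        "wall_s": (39.978, 39.480),
    }
    for metric, (default, opt_in) in phase68.items():
        print(metric, *verdict(metric, default, opt_in))
```

Under this reading every Phase68 MoE serving row is an improvement for the opt-in arm, which is why the phase carried it forward as a candidate instead of rejecting it.
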
Decision:

- Carry as default-off opt-in candidate pending broader serving evidence.

### Phase67: BF16 cuBLAS F32 Output

- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md`.
- Artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`.
- Fork commit: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`.
- DGX mirror commit: `14fd69f1e`.
- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`.

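Because the feature is gated by an environment variable, an A/B harness only needs to differ in the environment it passes to each arm. The sketch below shows one way to build those environments. The `arm_env` and `run_arm` helpers are hypothetical, the probe command is a stand-in for the real benchmark binary (which this ledger does not name), and it assumes the default arm corresponds to the variable being unset.

```python
import os
import subprocess
import sys

GATE = "LLAMA_BF16_CUBLAS_F32_OUT"


def arm_env(opt_in: bool, base=None) -> dict:
    """Return a copy of the environment for one A/B arm.

    Assumption: the default arm leaves the gate unset; the opt-in arm sets it to "1".
    """
    env = dict(os.environ if base is None else base)
    env.pop(GATE, None)  # never inherit the gate from the caller's shell
    if opt_in:
        env[GATE] = "1"
    return env


def run_arm(cmd, opt_in: bool) -> subprocess.CompletedProcess:
    """Run one arm's command with the matching environment."""
    return subprocess.run(
        cmd, env=arm_env(opt_in), capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Stand-in command that only reports what the child process sees.
    probe = [sys.executable, "-c",
             f"import os; print(os.environ.get('{GATE}', 'unset'))"]
    for opt_in in (False, True):
        arm = "opt-in" if opt_in else "default"
        print(arm, run_arm(probe, opt_in).stdout.strip())
```

Clearing the variable before building each arm avoids an accidental opt-in when the harness itself is launched from a shell that already exports the gate.
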
Gates:

| mode | MoE md5 | dense md5 | `MUL_MAT` |
|------|---------|-----------|-----------|
| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |

MoE prefill:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `2347.41` | `2402.34` | `+2.34%` |
| `2048` | `2440.18` | `2456.54` | `+0.67%` |

Decision:

- Keep default-off pending dense and serving A/B.
@@ -3825,3 +3825,50 @@ Decision:
regenerating the LocalAI patch series. Push still requires explicit approval.
- After push approval, regenerate `0064..0073`, repeat the tree hash check, and
  only then run broader serving gates for any default-on BF16 policy decision.

## BF16 F32 Output Broader Serving Phase70 Result

Phase70 is recorded in
`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
It did not change llama.cpp source and did not edit generated LocalAI patches.
It also creates the running benchmark ledger at
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.

- DGX artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
- Source under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
  `PARALLEL=128`, `CTX=131072`

Pre/post gates passed:

| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |

Serving A/B and vLLM comparison:

| n | default agg | opt-in agg | vLLM agg | default decode | opt-in decode | vLLM decode |
|---:|------------:|-----------:|---------:|---------------:|--------------:|------------:|
| `8` | `178.5` | `158.8` | `260.9` | `242.6` | `218.3` | `299.5` |
| `32` | `250.1` | `247.9` | `465.3` | `418.7` | `417.6` | `608.4` |
| `128` | `322.5` | `324.8` | `659.9` | `706.2` | `697.9` | `1020.4` |

Ratios:

| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM |
|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|
| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` |
| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` |
| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` |

Decision:

- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`.
- Keep the shortcut as default-off only. It is correctness-clean, but the
  broader serving window regressed `n=8` materially and slightly widened the
  vLLM decode gap at `n=32` and `n=128`.
- The next parity phase should not spend more time on this default policy. Use
  the benchmark ledger for every following attempt.
@@ -1079,3 +1079,31 @@ requires pushing before regenerating the LocalAI series. Do not push without
explicit approval. After approval, push the fork, regenerate `0064..0073`, rerun
the same tree-hash check, and then run the broader serving gates before any
default-on BF16 policy change.

## 15. PHASE70 RESULT: BF16 F32 OUTPUT BROADER SERVING

Phase70 broadened the Phase68 serving evidence without source changes. Plan:
`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
Benchmark ledger:
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
DGX artifact:
`/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`.

Gates stayed green. Default pre/post gates matched MoE md5 `8cb0ce23`, dense
md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Opt-in pre/post
gates matched MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, and
`MUL_MAT 1146/1146`; `MUL_MAT_ID` was not run for the opt-in arm.

Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`,
`PARALLEL=128`.

| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM |
|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|
| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` |
| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` |
| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` |

Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is
correctness-clean, but it materially regressed low-concurrency serving and
slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it
default-off only and move the next parity effort to a different lever.
@@ -133,6 +133,13 @@ but worth carrying as an opt-in shortcut candidate. Do not default it on until
the fork commit is mirrored into the LocalAI patch series and a broader serving
snapshot passes pre/post md5 and op gates.

Phase70 ran that broader serving snapshot. Gates stayed green, but the broader
window rejected default-on: at `N=8`, opt-in aggregate and decode fell to
`0.8896x` and `0.8998x` of default, and mean TTFT worsened to `1.1247x`.
At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap
(`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep
`LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever.

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)