mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record BF16 F32 output dense serving phase
Assisted-by: Codex:gpt-5
@@ -3718,3 +3718,52 @@ Decision:
removes the profiled BF16-to-F32 conversion row for this shape.
- Do not make it default-on yet. The gain is modest and needs dense plus serving
A/B before a default policy change.

## BF16 F32 Output Dense Serving Phase68 Result

Phase68 is recorded in
`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
It reused the Phase67 source commit and did not change llama.cpp source.

- Fork commit under test: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`
- DGX mirror commit under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`
- DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`
- Serving A/B artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`

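A minimal sketch of how the two A/B arms could be launched so that they differ only in the env gate. The helper names `gated_env` and `run_arm` are hypothetical, the probe command is a stand-in for the real benchmark binary, and treating an unset variable as the default arm is an assumption based on the gate being default-off.

```python
import os
import subprocess
import sys

GATE = "LLAMA_BF16_CUBLAS_F32_OUT"  # env gate named in the list above


def gated_env(enabled, base=None):
    """Copy the environment; set the gate for opt-in, remove it for default."""
    env = dict(os.environ if base is None else base)
    if enabled:
        env[GATE] = "1"
    else:
        env.pop(GATE, None)  # assumption: unset means default behaviour
    return env


def run_arm(cmd, enabled):
    """Run one A/B arm with the gate on or off and return the completed process."""
    return subprocess.run(
        cmd, env=gated_env(enabled), capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # Stand-in command: prints what the child process sees for the gate.
    probe = [sys.executable, "-c",
             f"import os; print(os.environ.get('{GATE}', 'unset'))"]
    for enabled in (False, True):
        label = "opt-in" if enabled else "default"
        print(label, run_arm(probe, enabled).stdout.strip())
```

Building each arm's environment from the same base keeps everything except the gate identical between runs, which is the point of a same-window A/B.
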
Correctness basis for this exact source commit remains the Phase67 default and
opt-in gates:

| mode | MoE md5 | dense md5 | `MUL_MAT` |
|------|---------|-----------|-----------|
| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |

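The md5 columns above amount to a byte-identity check between the default and opt-in outputs. A minimal sketch of such a gate, assuming each arm's output was saved to a file; the helper names `md5_of_file` and `gate_matches` are hypothetical and are not the Phase67 tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gate_matches(default_path, optin_path):
    """True when both arms produced byte-identical output files."""
    return md5_of_file(default_path) == md5_of_file(optin_path)
```

An md5 match only shows the outputs are identical for the prompts that were run; it complements the per-op `MUL_MAT` check rather than replacing it.
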
Dense same-window prefill A/B:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |

MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:

| metric | default | opt-in | change |
|--------|--------:|-------:|-------:|
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |

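The `change` columns are plain relative differences against the default arm. A short sketch that recomputes them from the table values; the function name `pct_change` is illustrative only. Note that for `ttft_mean_ms` and `wall_s` a negative change is the improvement.

```python
def pct_change(default, optin):
    """Relative change of the opt-in arm versus the default arm, in percent."""
    return (optin - default) / default * 100.0


# (default, opt-in) pairs copied from the tables above.
rows = {
    "dense npp=512 S_PP": (973.13, 975.52),
    "dense npp=2048 S_PP": (1019.88, 1021.39),
    "agg_tps": (409.8, 415.0),
    "decode_agg_tps": (615.3, 627.2),
    "decode_perseq_tps": (4.15, 4.16),
    "prefill_tps": (1630.2, 1648.0),
    "ttft_mean_ms": (8574.7, 8085.9),
    "wall_s": (39.978, 39.480),
}

for name, (default, optin) in rows.items():
    print(f"{name}: {pct_change(default, optin):+.2f}%")
```
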
Decision:

- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. The dense prefill gain is
positive but too small to justify a default policy change.
- The opt-in is now worth carrying forward: MoE prefill, dense prefill, and the
small MoE serving window all moved in the right direction without changing the
Phase67 md5/op correctness gates.
- Next default-on consideration requires regenerating the LocalAI patch series
from the fork and rerunning the broader current serving snapshot gates. Do not
default it from Phase68 alone.

@@ -995,3 +995,37 @@ The opt-in `npp=512` profile removed the BF16-to-F32 conversion row:
`convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this
as default-off for now. It is correctness-clean and measurable, but the win is
small and needs dense plus serving A/B before any default-on decision.

## 13. PHASE68 RESULT: BF16 F32 OUTPUT DENSE + SERVING A/B

Phase68 reused Phase67 source unchanged. Plan:
`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`;
serving A/B artifact:
`/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.

Correctness basis for the exact source commit remains Phase67: default and
`LLAMA_BF16_CUBLAS_F32_OUT=1` both produced MoE md5 `8cb0ce23`, dense md5
`5951a5b4`, and `MUL_MAT 1146/1146`.

Dense prefill stayed positive but tiny:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |

MoE serving A/B at `N=128`, prompt `128`, generation `128`, `--parallel 128`:

| metric | default | opt-in | change |
|--------|--------:|-------:|-------:|
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |

Decision: carry the shortcut as a default-off opt-in candidate. It is no longer
just a prefill-only win, but Phase68 is not enough to default it on. Any future
default-on proposal must mirror the fork commit into the LocalAI patch series
and rerun a broader current serving snapshot with pre/post md5 and op gates.

@@ -123,6 +123,16 @@ It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved
Keep it default-off until dense and serving A/B decide whether it is worth a
default policy change.

Phase68 ran that dense and serving A/B without changing source. Dense prefill
was positive but tiny (`973.13 -> 975.52` at `npp=512`, `1019.88 -> 1021.39` at
`npp=2048`). A small MoE serving window at `N=128`, prompt `128`, generation
`128` also moved in the right direction: aggregate `409.8 -> 415.0`,
decode aggregate `615.3 -> 627.2`, mean TTFT `8574.7 -> 8085.9 ms`, wall
`39.978 -> 39.480 s`. Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off
but worth carrying as an opt-in shortcut candidate. Do not default it on until
the fork commit is mirrored into the LocalAI patch series and a broader serving
snapshot passes pre/post md5 and op gates.

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)
