mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record BF16 cuBLAS F32 output phase
Assisted-by: Codex:gpt-5
@@ -3678,3 +3678,43 @@ Decision:
- Do not reopen W4A16/no-activation-quant from this evidence. Earlier W4A16
  phases already rejected that rewrite; Phase66 only rules out a smaller
  gather/quant shortcut.

## BF16 cuBLAS F32 Output Phase67 Result

Phase67 is recorded in
`docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md`.
It added a default-off BF16 projection shortcut:

- Fork commit: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`
- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`
- DGX mirror commit: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
- DGX artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`

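For a same-window A/B, the two runs should differ only in the env gate. The sketch below builds the two environments; it is a minimal illustration, not the harness used for Phase67. The helper name `ab_envs` is hypothetical, and how the fork parses the variable (for example whether values other than `1` enable it) is not stated in these notes.

```python
import os

# Env gate named in this document.
GATE = "LLAMA_BF16_CUBLAS_F32_OUT"


def ab_envs(base=None):
    """Return (default_env, optin_env) that differ only in the gate.

    Hypothetical helper: the gate is removed for the default run and
    set to "1" for the opt-in run, as recorded in the notes above.
    """
    base = dict(os.environ if base is None else base)
    default_env = {k: v for k, v in base.items() if k != GATE}
    optin_env = dict(default_env)
    optin_env[GATE] = "1"
    return default_env, optin_env


default_env, optin_env = ab_envs({"PATH": "/usr/bin", GATE: "1"})
print(GATE in default_env, optin_env[GATE])  # False 1
```

Either dictionary can be passed as `env=` to `subprocess.run` when launching the benchmark binary of your choice.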
Default and opt-in gates passed:

| mode | MoE md5 | dense md5 | `MUL_MAT` |
|------|---------|-----------|-----------|
| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |

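The gate condition in the table can be restated as a small check over the recorded values. This is a sketch under two assumptions: that a gate is "green" when the opt-in digests match the default digests, and that `1146/1146` reads as passed/total. The function name `gates_green` is hypothetical.

```python
# Values copied from the table above.
recorded = {
    "default": {
        "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
        "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
        "mul_mat": (1146, 1146),  # assumed to mean (passed, total)
    },
    "opt-in": {
        "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
        "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
        "mul_mat": (1146, 1146),
    },
}


def gates_green(rows):
    """True if every mode matches the default digests and passes all ops."""
    ref = rows["default"]
    return all(
        r["moe_md5"] == ref["moe_md5"]
        and r["dense_md5"] == ref["dense_md5"]
        and r["mul_mat"][0] == r["mul_mat"][1]
        for r in rows.values()
    )


print(gates_green(recorded))  # True
```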
Same-window MoE prefill A/B:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `2347.41` | `2402.34` | `+2.34%` |
| `2048` | `2440.18` | `2456.54` | `+0.67%` |

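The `change` column is the relative difference of opt-in over default. A short sketch that recomputes it from the two `S_PP` columns (the function name `pct_change` is illustrative):

```python
def pct_change(default, optin):
    """Relative change of opt-in over default, in percent."""
    return (optin / default - 1.0) * 100.0


# (default S_PP, opt-in S_PP) per npp, copied from the table above.
rows = {512: (2347.41, 2402.34), 2048: (2440.18, 2456.54)}

for npp, (default, optin) in rows.items():
    print(f"npp={npp}: {pct_change(default, optin):+.2f}%")
# npp=512: +2.34%
# npp=2048: +0.67%
```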
Opt-in `npp=512` nsys profile:

| row | value |
|-----|------:|
| total GPU kernel time | `7020867757 ns` |
| `convert_unary<__nv_bfloat16, float>` | `0 ns`, `0` instances |
| `convert_unary<float, __nv_bfloat16>` | `159651026 ns`, `6840` instances, `2.27%` |

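The `2.27%` figure is the remaining F32-to-BF16 conversion time as a share of total GPU kernel time. The sketch below recomputes it and also derives the mean time per instance, which is not in the table and is shown only as a derived number:

```python
# Values copied from the nsys table above (nanoseconds).
total_kernel_ns = 7_020_867_757
f32_to_bf16_ns = 159_651_026
f32_to_bf16_instances = 6840

share_pct = f32_to_bf16_ns / total_kernel_ns * 100.0
mean_ns = f32_to_bf16_ns / f32_to_bf16_instances

print(f"share of kernel time: {share_pct:.2f}%")  # 2.27%
print(f"mean per instance: {mean_ns / 1000:.1f} us")  # 23.3 us
```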
Decision:

- Keep the patch as a default-off opt-in shortcut. It is md5/op clean and
  removes the profiled BF16-to-F32 conversion row for this shape.
- Do not make it default-on yet. The gain is modest and needs dense plus serving
  A/B before a default policy change.
@@ -973,3 +973,25 @@ Decision: reject a Phase66 gather/quant source patch. The gather is too small
to target, and quantize plus gather is below the `8%` source-funding threshold.
Do not reopen W4A16/no-activation-quant from this evidence; that larger rewrite
was already rejected in earlier phases.

## 12. PHASE67 RESULT: BF16 CUBLAS F32 OUTPUT

Phase67 added a default-off BF16 projection shortcut in the llama.cpp fork:
`ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`. The env gate is
`LLAMA_BF16_CUBLAS_F32_OUT=1`. DGX mirror commit: `14fd69f1e`.

DGX artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`.
Default and opt-in gates stayed green: MoE md5 `8cb0ce23`, dense md5
`5951a5b4`, `MUL_MAT 1146/1146`.

Same-window MoE prefill A/B:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `2347.41` | `2402.34` | `+2.34%` |
| `2048` | `2440.18` | `2456.54` | `+0.67%` |

The opt-in `npp=512` profile removed the BF16-to-F32 conversion row:
`convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this
as default-off for now. It is correctness-clean and measurable, but the win is
small and needs dense plus serving A/B before any default-on decision.
@@ -115,6 +115,14 @@ Phase66 ran that timing pass. At MoE `npp=512`, total GPU kernel time was
gather/quant shortcut on GB10 for now: the gather is not material and the
combined route is below the `8%` source-funding threshold.

Phase67 tested the `bf16-proj` conversion half directly. Fork commit
`ea0875d14` adds default-off `LLAMA_BF16_CUBLAS_F32_OUT=1`, letting BF16 cuBLAS
write F32 output instead of writing BF16 then launching a BF16-to-F32 conversion.
It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved
`2347.41 -> 2402.34` at `npp=512` and `2440.18 -> 2456.54` at `npp=2048`.
Keep it default-off until dense and serving A/B decide whether it is worth a
default policy change.

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)