mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record BF16 F32 output dense serving phase
Assisted-by: Codex:gpt-5
@@ -3718,3 +3718,52 @@ Decision:
removes the profiled BF16-to-F32 conversion row for this shape.
- Do not make it default-on yet. The gain is modest and needs dense plus serving
  A/B before a default policy change.

## BF16 F32 Output Dense Serving Phase68 Result

Phase68 is recorded in
`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
It reused the Phase67 source commit and did not change llama.cpp source.

- Fork commit under test: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`
- DGX mirror commit under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`
- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`
- DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`
- Serving A/B artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`

Correctness basis for this exact source commit remains the Phase67 default and
opt-in gates:

| mode | MoE md5 | dense md5 | `MUL_MAT` |
|------|---------|-----------|-----------|
| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |

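The md5 gate above amounts to requiring identical digests in both modes. A minimal sketch of that comparison, assuming each mode's output has been saved to a file; the helper names and the file layout are hypothetical, since the source does not describe how the digests were produced:

```bash
# md5_of FILE: print only the hex digest of FILE (uses coreutils md5sum).
md5_of() {
    md5sum "$1" | awk '{ print $1 }'
}

# same_md5 DEFAULT_FILE OPTIN_FILE: succeed only if both digests match.
# Hypothetical helper; which output gets hashed is an assumption.
same_md5() {
    [ "$(md5_of "$1")" = "$(md5_of "$2")" ]
}
```

A gate could then be written as `same_md5 default_out.txt optin_out.txt || echo "md5 gate FAILED"`, where both file names are placeholders.
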
Dense same-window prefill A/B:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |

MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:

| metric | default | opt-in | change |
|--------|--------:|-------:|-------:|
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |

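The change columns above are the relative difference of opt-in versus default. A small sketch for recomputing them, assuming two-decimal rounding; `pct_change` is a hypothetical helper, not part of any harness named here:

```bash
# pct_change BASELINE CANDIDATE
# Prints (CANDIDATE - BASELINE) / BASELINE as a signed percentage
# with two decimals. LC_ALL=C keeps "." as the decimal separator.
pct_change() {
    LC_ALL=C awk -v base="$1" -v cand="$2" \
        'BEGIN { printf "%+.2f%%\n", (cand - base) / base * 100 }'
}

# Dense prefill S_PP rows (higher is better).
pct_change 973.13 975.52     # npp=512  -> +0.25%
pct_change 1019.88 1021.39   # npp=2048 -> +0.15%

# MoE serving rows where lower is better.
pct_change 8574.7 8085.9     # ttft_mean_ms -> -5.70%
pct_change 39.978 39.480     # wall_s       -> -1.25%
```

For `ttft_mean_ms` and `wall_s` a negative change is the improvement, which is why the sign is printed explicitly.
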
Decision:

- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. The dense prefill gain is
  positive but too small to justify a default policy change.
- The opt-in is now worth carrying forward: MoE prefill, dense prefill, and the
  small MoE serving window all moved in the right direction without changing the
  Phase67 md5/op correctness gates.
- Next default-on consideration requires regenerating the LocalAI patch series
  from the fork and rerunning the broader current serving snapshot gates. Do not
  default it from Phase68 alone.

@@ -995,3 +995,37 @@ The opt-in `npp=512` profile removed the BF16-to-F32 conversion row:
`convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this
as default-off for now. It is correctness-clean and measurable, but the win is
small and needs dense plus serving A/B before any default-on decision.

## 13. PHASE68 RESULT: BF16 F32 OUTPUT DENSE + SERVING A/B

Phase68 reused Phase67 source unchanged. Plan:
`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`;
serving A/B artifact:
`/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.

Correctness basis for the exact source commit remains Phase67: default and
`LLAMA_BF16_CUBLAS_F32_OUT=1` both produced MoE md5 `8cb0ce23`, dense md5
`5951a5b4`, and `MUL_MAT 1146/1146`.

Dense prefill stayed positive but tiny:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |

MoE serving A/B at `N=128`, prompt `128`, generation `128`, `--parallel 128`:

| metric | default | opt-in | change |
|--------|--------:|-------:|-------:|
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |

Decision: carry the shortcut as a default-off opt-in candidate. It is no longer
just a prefill-only win, but Phase68 is not enough to default it on. Any future
default-on proposal must mirror the fork commit into the LocalAI patch series
and rerun a broader current serving snapshot with pre/post md5 and op gates.

@@ -123,6 +123,16 @@ It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved
Keep it default-off until dense and serving A/B decide whether it is worth a
default policy change.

Phase68 ran that dense and serving A/B without changing source. Dense prefill
was positive but tiny (`973.13 -> 975.52` at `npp=512`, `1019.88 -> 1021.39` at
`npp=2048`). A small MoE serving window at `N=128`, prompt `128`, generation
`128` also moved in the right direction: aggregate `409.8 -> 415.0`,
decode aggregate `615.3 -> 627.2`, mean TTFT `8574.7 -> 8085.9 ms`, wall
`39.978 -> 39.480 s`. Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off,
but carry it forward as an opt-in shortcut candidate. Do not default it on until
the fork commit is mirrored into the LocalAI patch series and a broader serving
snapshot passes pre/post md5 and op gates.

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)

@@ -0,0 +1,132 @@
# BF16 F32 Output Dense Serving Phase68 Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Decide whether `LLAMA_BF16_CUBLAS_F32_OUT=1` has enough dense and serving value to consider a default policy change.

**Architecture:** Reuse the Phase67 source patch and DGX build. Run dense prefill A/B first because it is fast and directly targets BF16 projections. Run serving A/B only if dense or MoE evidence supports a broader default-on question.

**Tech Stack:** llama.cpp CUDA backend, DGX GB10, `llama-batched-bench`, optional LocalAI serving snapshot harness, LocalAI parity docs.

---

## Guardrails

- Do not change source in Phase68.
- Do not make `LLAMA_BF16_CUBLAS_F32_OUT=1` default-on from MoE prefill alone.
- Keep DGX lock discipline: lock free, Docker `0`, `local-ai-worker` `0`, compute apps `0`.
- Keep existing md5/op gate evidence from Phase67 as the correctness basis for this exact source commit.
- Record no-go results as explicitly as wins.

## Files

- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`
- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`

---

### Task 1: Dense Prefill A/B

- [x] **Step 1: Confirm DGX idle and acquire lock**

Run:

```bash
ssh dgx.casa 'cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps --format "{{.Names}}" | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader | wc -l'
ssh dgx.casa 'printf "codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
```

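The first command only prints the lock contents and three counts, which a human then reads before running the second. A rough sketch of turning that reading into an explicit check, assuming a lock counts as free when the file is missing, empty, or starts with the `FREE` marker written at release; both helper names are hypothetical:

```bash
# lock_is_free LOCK_FILE: succeed if the lock looks free.
# Assumption: a missing or empty file, or a first line starting with
# "FREE" (the marker written when the lock is released), means free.
lock_is_free() {
    local lock_file=$1
    [ -s "$lock_file" ] || return 0
    head -n 1 "$lock_file" | grep -q '^FREE'
}

# all_zero COUNT...: succeed only if every count is 0
# (Docker containers, local-ai-worker processes, GPU compute apps).
all_zero() {
    local count
    for count in "$@"; do
        [ "$count" -eq 0 ] || return 1
    done
}
```

Checking and then writing the lock remain two separate steps, so this does not close the race between two workers that check at the same time.
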
- [x] **Step 2: Run dense prefill default and opt-in**

Run:

```bash
./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf \
  -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32
```

with and without `LLAMA_BF16_CUBLAS_F32_OUT=1`.

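One way to run both modes back to back in the same window is a small wrapper. This is a sketch only: `run_ab`, the log file names, and the default-first ordering are assumptions, not the harness used for the recorded numbers.

```bash
# run_ab OUT_DIR CMD [ARGS...]
# Runs CMD twice: first with the gate unset (default), then with
# LLAMA_BF16_CUBLAS_F32_OUT=1 (opt-in). Output goes to OUT_DIR.
# CMD must be an external program because the default run uses env -u.
run_ab() {
    local out_dir=$1
    shift
    mkdir -p "$out_dir" || return 1
    # Default: make sure the gate is not inherited from the caller.
    env -u LLAMA_BF16_CUBLAS_F32_OUT "$@" > "$out_dir/default.log" 2>&1 || return 1
    # Opt-in: set the gate for this one invocation only.
    LLAMA_BF16_CUBLAS_F32_OUT=1 "$@" > "$out_dir/optin.log" 2>&1 || return 1
}
```

It would be called as `run_ab <out_dir> ./llama-batched-bench ...` with the arguments from the command above, after which the `S_PP` rows of the two logs can be compared.
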
- [x] **Step 3: Dense decision**

Dense improved slightly in the same window and did not regress:

| npp | default S_PP | opt-in S_PP | change |
|-----|-------------:|------------:|-------:|
| `512` | `973.13` | `975.52` | `+0.25%` |
| `2048` | `1019.88` | `1021.39` | `+0.15%` |

Decision: run a small MoE serving A/B because Phase67 MoE prefill was positive
and dense did not regress. The dense win is too small to justify default-on by
itself.

---

### Task 2: Serving A/B If Funded

- [x] **Step 1: Run a small same-window serving A/B**

Use the current clean source tree and the existing h2h client or snapshot harness.
Compare default versus:

```bash
LLAMA_BF16_CUBLAS_F32_OUT=1
```

At minimum capture MoE `N=128`, prompt `128`, generation `128` aggregate,
decode aggregate, mean TTFT, wall time, and md5 gate summary.

- [x] **Step 2: Serving decision**

Keep default-off unless serving improves or is flat without dense regression.
Do not switch to default-on from prefill-only evidence.

Serving artifact:

- `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`

MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`:

| metric | default | opt-in | change |
|--------|--------:|-------:|-------:|
| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` |
| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
| `wall_s` | `39.978` | `39.480` | `-1.25%` |

Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off, but promote it to a
safe opt-in shortcut candidate. It now has Phase67 MoE md5/op gates, Phase67
dense md5/op gates, a tiny positive dense prefill result, and a positive small
MoE serving A/B. Do not make it default-on until it is mirrored into the patch
series and retested in a broader serving snapshot.

---

### Task 3: Record and Commit

- [x] **Step 1: Release DGX lock**

Run:

```bash
ssh dgx.casa 'printf "FREE released-by-codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock'
```

- [x] **Step 2: Record docs**

Record artifact path, dense A/B, serving A/B if run, and decision.

- [x] **Step 3: Commit LocalAI docs**

```bash
git add -f docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md
git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
git commit -m "docs(paged): record BF16 F32 output dense serving phase" \
  -m "Assisted-by: Codex:gpt-5"
```