diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md index 27880c453..d26f063f5 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -3678,3 +3678,43 @@ Decision: - Do not reopen W4A16/no-activation-quant from this evidence. Earlier W4A16 phases already rejected that rewrite; Phase66 only rules out a smaller gather/quant shortcut. + +## BF16 cuBLAS F32 Output Phase67 Result + +Phase67 is recorded in +`docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md`. +It added a default-off BF16 projection shortcut: + +- Fork commit: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output` +- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1` +- DGX mirror commit: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output` +- DGX artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909` + +Default and opt-in gates passed: + +| mode | MoE md5 | dense md5 | `MUL_MAT` | +|------|---------|-----------|-----------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | +| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | + +Same-window MoE prefill A/B: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `2347.41` | `2402.34` | `+2.34%` | +| `2048` | `2440.18` | `2456.54` | `+0.67%` | + +Opt-in `npp=512` nsys profile: + +| row | value | +|-----|------:| +| total GPU kernel time | `7020867757 ns` | +| `convert_unary<__nv_bfloat16, float>` | `0 ns`, `0` instances | +| `convert_unary` | `159651026 ns`, `6840` instances, `2.27%` | + +Decision: + +- Keep the patch as a default-off opt-in shortcut. It is md5/op clean and + removes the profiled BF16-to-F32 conversion row for this shape. +- Do not make it default-on yet. The gain is modest and needs dense plus serving + A/B before a default policy change. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md index cf53d32a8..5381afb91 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -973,3 +973,25 @@ Decision: reject a Phase66 gather/quant source patch. The gather is too small to target, and quantize plus gather is below the `8%` source-funding threshold. Do not reopen W4A16/no-activation-quant from this evidence; that larger rewrite was already rejected in earlier phases. + +## 12. PHASE67 RESULT: BF16 CUBLAS F32 OUTPUT + +Phase67 added a default-off BF16 projection shortcut in the llama.cpp fork: +`ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`. The env gate is +`LLAMA_BF16_CUBLAS_F32_OUT=1`. DGX mirror commit: `14fd69f1e`. + +DGX artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`. +Default and opt-in gates stayed green: MoE md5 `8cb0ce23`, dense md5 +`5951a5b4`, `MUL_MAT 1146/1146`. + +Same-window MoE prefill A/B: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `2347.41` | `2402.34` | `+2.34%` | +| `2048` | `2440.18` | `2456.54` | `+0.67%` | + +The opt-in `npp=512` profile removed the BF16-to-F32 conversion row: +`convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this +as default-off for now. It is correctness-clean and measurable, but the win is +small and needs dense plus serving A/B before any default-on decision. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md index 8fff2aad9..2f11d0923 100644 --- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -115,6 +115,14 @@ Phase66 ran that timing pass. At MoE `npp=512`, total GPU kernel time was gather/quant shortcut on GB10 for now: the gather is not material and the combined route is below the `8%` source-funding threshold. +Phase67 tested the `bf16-proj` conversion half directly. Fork commit +`ea0875d14` adds default-off `LLAMA_BF16_CUBLAS_F32_OUT=1`, letting BF16 cuBLAS +write F32 output instead of writing BF16 then launching a BF16-to-F32 conversion. +It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved +`2347.41 -> 2402.34` at `npp=512` and `2440.18 -> 2456.54` at `npp=2048`. +Keep it default-off until dense and serving A/B decide whether it is worth a +default policy change. + Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189). ## 2. Decode-serving compute hypotheses (ranked) diff --git a/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md b/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md new file mode 100644 index 000000000..24e3daa20 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md @@ -0,0 +1,244 @@ +# BF16 cuBLAS F32 Output Phase67 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test whether BF16 projection GEMMs can write F32 output directly and remove the BF16-to-F32 conversion kernel without breaking inference. + +**Architecture:** Add a default-off `LLAMA_BF16_CUBLAS_F32_OUT=1` branch in the CUDA BF16 cuBLAS path. The default path remains byte-identical. The opt-in path is accepted only if canonical md5/op gates pass and the measured GPU kernel-time reduction is material. + +**Tech Stack:** llama.cpp CUDA backend, cuBLAS `cublasGemmEx`, DGX GB10, LocalAI parity docs, canonical md5 and backend-op gates. + +--- + +## Guardrails + +- Default behavior must remain unchanged when `LLAMA_BF16_CUBLAS_F32_OUT` is unset. +- The source patch must be small and local to the BF16 cuBLAS path. +- The opt-in path is rejected unless it passes: + - MoE paged md5 `8cb0ce23777bf55f92f63d0292c756b0` + - dense md5 `5951a5b4d624ce891e22ab5fca9bc439` + - `test-backend-ops` `MUL_MAT` +- If the opt-in path changes md5, do not benchmark it as a parity shortcut unless a KL plan is explicitly created later. +- Do not regenerate LocalAI patch files in this phase. +- Do not push without explicit approval. + +## Files + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +--- + +### Task 1: Add Default-Off BF16 F32 Output Branch + +- [x] **Step 1: Add env helper** + +Add near the cuBLAS route helpers in `ggml-cuda.cu`: + +```c++ +static inline bool ggml_cuda_bf16_cublas_f32_out_enabled() { + static const bool value = []() { + const char * s = getenv("LLAMA_BF16_CUBLAS_F32_OUT"); + return s != nullptr && atoi(s) != 0; + }(); + + return value; +} +``` + +- [x] **Step 2: Branch BF16 cuBLAS output** + +In the `src0->type == GGML_TYPE_BF16` cuBLAS branch, keep the current BF16 +temporary path as the default. When the env is enabled, call `cublasGemmEx` with +the existing BF16 inputs and `dst_dd_i` as `CUDA_R_32F`, then skip the +`to_fp32_cuda` conversion: + +```c++ +if (ggml_cuda_bf16_cublas_f32_out_enabled()) { + CUBLAS_CHECK(cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, + row_diff, src1_ncols, ne10, + &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, + src1_ptr, CUDA_R_16BF, ne10, + &beta_f32, dst_dd_i, CUDA_R_32F, ldc, + CUBLAS_COMPUTE_32F, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); +} else { + // existing BF16 temp plus BF16-to-F32 conversion +} +``` + +- [x] **Step 3: Local diff check** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp diff --check +``` + +Expected: exit `0`. + +--- + +### Task 2: DGX Build and Default Gates + +- [x] **Step 1: Confirm DGX is idle** + +Run: + +```bash +ssh dgx.casa 'cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps --format "{{.Names}}" | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader | wc -l' +``` + +Expected: lock `FREE*`, Docker `0`, worker `0`, compute apps `0`. + +- [x] **Step 2: Acquire lock** + +Run: + +```bash +ssh dgx.casa 'printf "codex-phase67-bf16-f32-out %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +- [x] **Step 3: Apply patch and build** + +Apply the patch to `/home/mudler/llama-phase6-source`, then run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && cmake --build build-cuda --target llama-completion llama-batched-bench test-backend-ops -j $(nproc)' +``` + +Expected: exit `0`. + +- [x] **Step 4: Run default gates** + +Run canonical MoE and dense md5 plus: + +```bash +./test-backend-ops test -o MUL_MAT +``` + +Expected default path: + +```text +MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 5951a5b4d624ce891e22ab5fca9bc439 +MUL_MAT 1146/1146 +``` + +Observed artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`. + +```text +default MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +default dense md5 5951a5b4d624ce891e22ab5fca9bc439 +default MUL_MAT 1146/1146 +``` + +--- + +### Task 3: Opt-In Correctness Gate + +- [x] **Step 1: Run opt-in md5 gates** + +Run the same MoE and dense commands with: + +```bash +LLAMA_BF16_CUBLAS_F32_OUT=1 +``` + +Expected: exact same md5s. If either md5 differs, reject the source shortcut. + +Observed: + +```text +opt-in MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +opt-in dense md5 5951a5b4d624ce891e22ab5fca9bc439 +``` + +- [x] **Step 2: Run opt-in backend-op gate** + +Run: + +```bash +LLAMA_BF16_CUBLAS_F32_OUT=1 ./test-backend-ops test -o MUL_MAT +``` + +Expected: `1146/1146`. + +Observed: `1146/1146`. + +--- + +### Task 4: Benchmark if Correct + +- [x] **Step 1: Run same-shape prefill A/B** + +Only if Task 3 passes, run baseline and opt-in: + +```bash +./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \ + -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 +``` + +with and without `LLAMA_BF16_CUBLAS_F32_OUT=1`. + +Observed MoE A/B: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `2347.41` | `2402.34` | `+2.34%` | +| `2048` | `2440.18` | `2456.54` | `+0.67%` | + +- [x] **Step 2: Profile opt-in if A/B improves** + +Use nsys kernel summary to verify the BF16-to-F32 conversion rows shrink. + +Observed opt-in `npp=512` profile: + +| row | value | +|-----|------:| +| total GPU kernel time | `7020867757 ns` | +| `convert_unary<__nv_bfloat16, float>` | `0 ns`, `0` instances | +| `convert_unary` | `159651026 ns`, `6840` instances, `2.27%` | + +- [x] **Step 3: Source decision** + +Keep the patch only if opt-in gates pass and it produces a material speedup. +Otherwise revert the source patch locally and record the rejection. + +Decision: keep as a default-off opt-in path. It is correctness-clean and removes +the profiled BF16-to-F32 conversion row for this shape, but the speedup is small +and needs dense plus serving A/B before any default-on decision. + +--- + +### Task 5: Commit and Record + +- [x] **Step 1: Commit source only if accepted as default-off diagnostic or opt-in** + +```bash +git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/ggml-cuda.cu +git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): gate BF16 cuBLAS F32 output" -m "Assisted-by: Codex:gpt-5" +``` + +Result: + +- Local fork: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output` +- DGX mirror: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output` + +- [x] **Step 2: Record LocalAI docs** + +Record artifact path, gates, A/B result, and decision. + +- [x] **Step 3: Commit LocalAI docs** + +```bash +git add -f docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +git commit -m "docs(paged): record BF16 cuBLAS F32 output phase" \ + -m "Assisted-by: Codex:gpt-5" +```