docs(paged): record prefill bucket attribution phase

Assisted-by: Codex:gpt-5
2026-07-03 12:57:02 -04:00 · 2026-07-01 12:20:29 +00:00
parent 6a2618b6dc
commit 2e19e5c90f
4 changed files with 485 additions and 1 deletion
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3507,3 +3507,49 @@ Decision:
 - Do not tune `spec-draft-n-max` blindly. Phase15, Phase19, and Phase62 all
  showed high acceptance with poor serving throughput, so the remaining question
  is verify cost, not whether MTP can draft.
+
+## Prefill Bucket Attribution Phase63 Result
+
+Phase63 is recorded in
+`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`.
+It was a measurement and decision phase, not a source patch phase.
+
+Artifact:
+
+- `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`
+
+Pre/post inference gates passed:
+
+| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+llama.cpp MoE prefill, `npl=32`, `ntg=4`:
+
+| npp | S_PP | MoE/FFN-GEMM | GDN | bf16-proj | layout-copy | act-quant | MoE-dispatch | gather | FA |
+|-----|------|--------------|-----|-----------|-------------|-----------|--------------|--------|----|
+| 512 | `2248.20` | `40.48%` | `18.00%` | `10.19%` | `7.82%` | `4.47%` | `1.94%` | `1.26%` | `0.71%` |
+| 2048 | `2385.22` | `41.06%` | `16.15%` | `9.97%` | `7.96%` | `4.61%` | `2.12%` | `1.36%` | `1.18%` |
+
+vLLM MoE prefill, `NSEQ=32`, `GEN=1`, `NREP=3`, eager profile path:
+
+| PT | S_PP | ew/glue | GDN | FA | bf16-proj | MoE-dispatch | top unclassified |
+|----|------|---------|-----|----|-----------|--------------|------------------|
+| 512 | `5315.6` | `32.97%` | `18.34%` | `0.73%` | `3.41%` | `1.37%` | Marlin MoE `1940.99ms`, FP8 projection `565.74ms` |
+| 2048 | `5384.4` | `33.48%` | `18.00%` | `1.75%` | `1.06%` | `0.49%` | Marlin MoE `7745.84ms`, FP8 projection `3047.75ms` |
+
+Decision:
+
+- Reject a Phase63 paged FlashAttention mask/block-table source patch. llama.cpp
+  FA is only `1.18%` of prefill GPU kernel time at `npp=2048`, which triggers the
+  `<5%` reject rule and is far below the `8%` source-funding threshold.
+- The `npp=2048` FA cost is about `4.9 us/tok` for llama.cpp and `3.1 us/tok`
+  for vLLM, so the cross-engine FA delta is only about `1.7 us/tok`, below the
+  `15 us/tok` funding threshold.
+- The dominant remaining llama.cpp buckets are still MoE/FFN GEMM, GDN,
+  bf16 projections, layout copies, and activation quantization. Phase63 did not
+  identify a new low-conflict source patch that can move GB10 parity without
+  reopening already-rejected W4A16/GDN/MTP/small-M work.
+- No llama.cpp source files were modified. Default inferencing stayed green with
+  the canonical md5/op gates.
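Review note: the thresholds quoted in the Phase63 decision bullets above can be restated as a small check. This is a hedged sketch only. The function name `fa_patch_decision` and the way the thresholds combine are assumptions, since the text does not say whether funding needs both the share and the delta threshold. The threshold values (`5%`, `8%`, `15 us/tok`) and the measured inputs (`1.18%`, `1.7 us/tok`) are taken from the text.

```python
# Thresholds quoted in the Phase63 decision text.
REJECT_SHARE_PCT = 5.0         # FA share below this: the reject rule applies
FUND_SHARE_PCT = 8.0           # FA share needed to fund source work
FUND_DELTA_US_PER_TOK = 15.0   # cross-engine FA delta needed to fund source work


def fa_patch_decision(fa_share_pct: float, fa_delta_us_per_tok: float) -> str:
    """Classify a candidate FA source patch as 'reject', 'fund', or 'undecided'.

    Assumption: funding requires BOTH the share and the delta thresholds.
    The source text does not state how the rules combine.
    """
    if fa_share_pct < REJECT_SHARE_PCT:
        return "reject"
    if (fa_share_pct >= FUND_SHARE_PCT
            and fa_delta_us_per_tok >= FUND_DELTA_US_PER_TOK):
        return "fund"
    return "undecided"


# Phase63 llama.cpp measurement at npp=2048: FA share 1.18%, delta about 1.7 us/tok.
decision = fa_patch_decision(1.18, 1.7)
print(decision)  # prints "reject"
```

With the recorded numbers the check returns `reject` on the share rule alone; the delta rule is never reached, which matches the order in which the bullets argue the case.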
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -878,4 +878,24 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

 ---

-*Status: investigation CLOSED. This handoff is procedure; `VLLM_PARITY_FINAL.md` is the record. The path to parity is datacenter Blackwell, not GB10 kernels.*
+## 8. PHASE63 RESULT: PREFILL BUCKET ATTRIBUTION
+
+Phase63 is complete as a measurement-only no-go. The plan is
+`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`; the
+DGX artifact is `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.
+
+Pre/post gates stayed green:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`;
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`;
+- `MUL_MAT` `1146/1146`;
+- `MUL_MAT_ID` `806/806`.
+
+The candidate paged FlashAttention mask/block-table cleanup is rejected for now:
+llama.cpp FA is only `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
+`npp=2048` cross-engine FA delta is about `1.7 us/tok`, not the `15 us/tok`
+needed to fund source work. No llama.cpp source files were modified.
+
+*Status: Phase63 closed. `VLLM_PARITY_FINAL.md` remains the GB10 shortcut record;
+the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16 projections,
+layout copies, and activation quantization.*
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -85,6 +85,15 @@ The 10-16 full-attention layers' QK^T·softmax·PV is a separate kernel covered
 ## Bottom line
 Two prefill levers (GEMM, GDN) are correctly the top-2 and own ~the gap's majority, but they are **not** the whole gap. The op-walk surfaces **MoE router+combine/scatter** and the **W4A4 activation-quant pass** as genuine, currently-untracked prefill contributors on the MoE decision model (~8-14% combined), plus **FA prefill** as a context-dependent risk the npp=128 bench hides. Per the methodology, step 0 is an nsys prefill-only window that explicitly breaks out `argsort/add(combine)`, `quantize_mmq_nvfp4`, and `flash_attn` as separate rows to size these three before funding a kernel.

+Phase63 executed that step-0 discipline after the W4A16 direct-A and MTP
+rejections. It stayed profile-first and inference-gated: pre/post canonical md5
+and backend-op gates wrapped same-shape llama.cpp/vLLM prefill profiles at
+`npp/PT=512` and `2048`. Result: FA is not a source lever on GB10 right now.
+llama.cpp FA was `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
+`npp=2048` cross-engine FA delta was about `1.7 us/tok`. The paged
+FlashAttention mask/block-table cleanup remains a correctness/test gap worth
+keeping in mind, but Phase63 rejects it as a parity patch.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

 ## 2. Decode-serving compute hypotheses (ranked)