docs(paged): record prefill bucket attribution phase

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 12:20:29 +00:00
parent 6a2618b6dc
commit 2e19e5c90f
4 changed files with 485 additions and 1 deletions


@@ -3507,3 +3507,49 @@ Decision:
- Do not tune `spec-draft-n-max` blindly. Phase15, Phase19, and Phase62 all
showed high acceptance with poor serving throughput, so the remaining question
is verify cost, not whether MTP can draft.
## Prefill Bucket Attribution Phase63 Result
Phase63 is recorded in
`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`.
It was a measurement and decision phase, not a source patch phase.
Artifact:
- `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`
Pre/post inference gates passed:
| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
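
The pre/post gate comparison above can be sketched as a small check. This is only an illustration under stated assumptions: the `GateResult` type, its field names, and the `gates_green` helper are hypothetical and are not part of the Phase63 tooling. It treats the gate as green when both md5 values are unchanged between the pre and post runs and every backend-op count reads `passed/total` with no failures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """One inference-gate snapshot (hypothetical structure, illustrative names)."""
    moe_md5: str
    dense_md5: str
    mul_mat: str      # backend-op result written as "passed/total"
    mul_mat_id: str   # backend-op result written as "passed/total"


def ops_all_passed(result: str) -> bool:
    """Return True when a "passed/total" string reports no failing ops."""
    passed, total = (int(part) for part in result.split("/"))
    return total > 0 and passed == total


def gates_green(pre: GateResult, post: GateResult) -> bool:
    """Green means: output hashes did not move and all backend ops passed."""
    hashes_stable = (
        pre.moe_md5 == post.moe_md5 and pre.dense_md5 == post.dense_md5
    )
    ops_ok = all(
        ops_all_passed(r)
        for r in (pre.mul_mat, pre.mul_mat_id, post.mul_mat, post.mul_mat_id)
    )
    return hashes_stable and ops_ok


# Values copied from the gate table above.
pre = GateResult(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
post = GateResult(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
print("gates green:", gates_green(pre, post))
```

Comparing pre against post only shows that the profiling run left inference output unchanged. Checking against a stored canonical hash would be a separate, stricter comparison.
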
llama.cpp MoE prefill, `npl=32`, `ntg=4`:
| npp | S_PP | MoE/FFN-GEMM | GDN | bf16-proj | layout-copy | act-quant | MoE-dispatch | gather | FA |
|-----|------|--------------|-----|-----------|-------------|-----------|--------------|--------|----|
| 512 | `2248.20` | `40.48%` | `18.00%` | `10.19%` | `7.82%` | `4.47%` | `1.94%` | `1.26%` | `0.71%` |
| 2048 | `2385.22` | `41.06%` | `16.15%` | `9.97%` | `7.96%` | `4.61%` | `2.12%` | `1.36%` | `1.18%` |
vLLM MoE prefill, `NSEQ=32`, `GEN=1`, `NREP=3`, eager profile path:
| PT | S_PP | ew/glue | GDN | FA | bf16-proj | MoE-dispatch | top unclassified |
|----|------|---------|-----|----|-----------|--------------|------------------|
| 512 | `5315.6` | `32.97%` | `18.34%` | `0.73%` | `3.41%` | `1.37%` | Marlin MoE `1940.99ms`, FP8 projection `565.74ms` |
| 2048 | `5384.4` | `33.48%` | `18.00%` | `1.75%` | `1.06%` | `0.49%` | Marlin MoE `7745.84ms`, FP8 projection `3047.75ms` |
Decision:
- Reject a Phase63 paged FlashAttention mask/block-table source patch. llama.cpp
FA is only `1.18%` of prefill GPU kernel time at `npp=2048`, which triggers the
`<5%` reject rule and is far below the `8%` source-funding threshold.
- The `npp=2048` FA cost is about `4.9 us/tok` for llama.cpp and `3.1 us/tok`
for vLLM, so the cross-engine FA delta is only about `1.7 us/tok`, below the
`15 us/tok` funding threshold.
- The dominant remaining llama.cpp buckets are still MoE/FFN GEMM, GDN,
bf16 projections, layout copies, and activation quantization. Phase63 did not
identify a new low-conflict source patch that can move GB10 parity without
reopening already-rejected W4A16/GDN/MTP/small-M work.
- No llama.cpp source files were modified. Default inferencing stayed green with
the canonical md5/op gates.
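
The per-token FA arithmetic behind the decision above can be approximated from the two tables. The sketch below is a rough reconstruction, not the script used in Phase63. It assumes `S_PP` is prefill throughput in tokens per second and that a bucket's share of GPU kernel time can be applied directly to wall time per token. The function names and the way the three thresholds are combined (including the `"undecided"` outcome) are hypothetical.

```python
REJECT_SHARE_PCT = 5.0   # FA share below this triggers the reject rule
FUND_SHARE_PCT = 8.0     # FA share needed to fund source work
FUND_DELTA_US = 15.0     # cross-engine delta (us/tok) needed to fund source work


def us_per_token(tokens_per_second: float, bucket_share_pct: float) -> float:
    """Estimate one bucket's cost per token in microseconds.

    Assumption: the bucket's share of GPU kernel time is applied to the
    wall time per token implied by the measured throughput.
    """
    return 1e6 / tokens_per_second * bucket_share_pct / 100.0


def fa_decision(share_pct: float, delta_us: float) -> str:
    """Hypothetical combination of the thresholds quoted in the decision."""
    if share_pct < REJECT_SHARE_PCT:
        return "reject"
    if share_pct >= FUND_SHARE_PCT and delta_us >= FUND_DELTA_US:
        return "fund"
    return "undecided"


# npp/PT = 2048 rows from the llama.cpp and vLLM tables.
llama_fa_us = us_per_token(2385.22, 1.18)
vllm_fa_us = us_per_token(5384.4, 1.75)
fa_delta_us = llama_fa_us - vllm_fa_us
decision = fa_decision(1.18, fa_delta_us)

print(f"llama.cpp FA: {llama_fa_us:.2f} us/tok")
print(f"vLLM FA:      {vllm_fa_us:.2f} us/tok")
print(f"delta:        {fa_delta_us:.2f} us/tok -> {decision}")
```

With the table values this approximation gives roughly `4.9 us/tok` for llama.cpp and a delta near `1.7 us/tok`. The vLLM estimate comes out slightly above the recorded `3.1 us/tok`, presumably because the recorded figure was taken from profiled kernel times rather than from throughput and share. Either way the delta is far below the `15 us/tok` threshold.
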


@@ -878,4 +878,24 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
---
*Status: investigation CLOSED. This handoff is procedure; `VLLM_PARITY_FINAL.md` is the record. The path to parity is datacenter Blackwell, not GB10 kernels.*
## 8. PHASE63 RESULT: PREFILL BUCKET ATTRIBUTION
Phase63 is complete as a measurement-only no-go. The plan is
`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`; the
DGX artifact is `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.
Pre/post gates stayed green:
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`;
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`;
- `MUL_MAT` `1146/1146`;
- `MUL_MAT_ID` `806/806`.
The candidate paged FlashAttention mask/block-table cleanup is rejected for now:
llama.cpp FA is only `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
`npp=2048` cross-engine FA delta is about `1.7 us/tok`, not the `15 us/tok`
needed to fund source work. No llama.cpp source files were modified.
*Status: Phase63 closed. `VLLM_PARITY_FINAL.md` remains the GB10 shortcut record;
the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16 projections,
layout copies, and activation quantization.*


@@ -85,6 +85,15 @@ The 10-16 full-attention layers' QK^T·softmax·PV is a separate kernel covered
## Bottom line
Two prefill levers (GEMM, GDN) are correctly the top-2 and account for roughly the majority of the gap, but they are **not** the whole gap. The op-walk surfaces **MoE router+combine/scatter** and the **W4A4 activation-quant pass** as genuine, currently-untracked prefill contributors on the MoE decision model (~8-14% combined), plus **FA prefill** as a context-dependent risk that the npp=128 bench hides. Per the methodology, step 0 is an nsys prefill-only window that explicitly breaks out `argsort/add(combine)`, `quantize_mmq_nvfp4`, and `flash_attn` as separate rows to size these three before funding a kernel.
Phase63 executed that step-0 discipline after the W4A16 direct-A and MTP
rejections. It stayed profile-first and inference-gated: pre/post canonical md5
and backend-op gates wrapped same-shape llama.cpp/vLLM prefill profiles at
`npp/PT=512` and `2048`. Result: FA is not a source lever on GB10 right now.
llama.cpp FA was `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
`npp=2048` cross-engine FA delta was about `1.7 us/tok`. The paged
FlashAttention mask/block-table cleanup remains a correctness/test gap worth
keeping in mind, but Phase63 rejects it as a parity patch.
Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
## 2. Decode-serving compute hypotheses (ranked)