mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 12:57:02 -04:00
docs(paged): record prefill bucket attribution phase
Assisted-by: Codex:gpt-5
@@ -3507,3 +3507,49 @@ Decision:
- Do not tune `spec-draft-n-max` blindly. Phase15, Phase19, and Phase62 all
showed high acceptance with poor serving throughput, so the remaining question
is verify cost, not whether MTP can draft.

## Prefill Bucket Attribution Phase63 Result

Phase63 is recorded in
`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`.
It was a measurement and decision phase, not a source patch phase.

Artifact:

- `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`

Pre/post inference gates passed:

| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
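
The pre/post comparison in the gate table can be sketched as a small check. This is an illustrative Python sketch, not the project's gate script: `GateResult`, `ops_all_passed`, and `gates_match` are hypothetical names, and the pass rule (identical rows, every op counter at `passed == total`) is an assumption about what "passed" means here. Only the md5 values and op counts are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """One row of the gate table (hypothetical container, not a LocalAI type)."""
    moe_md5: str
    dense_md5: str
    mul_mat: str      # backend-op counter written as "passed/total"
    mul_mat_id: str   # backend-op counter written as "passed/total"


def ops_all_passed(counter: str) -> bool:
    """Return True when a "passed/total" counter reports every op passing."""
    passed, total = counter.split("/")
    return int(passed) == int(total)


def gates_match(pre: GateResult, post: GateResult) -> bool:
    """Assumed pass rule: pre and post rows are identical and all ops passed."""
    counters = (post.mul_mat, post.mul_mat_id)
    return pre == post and all(ops_all_passed(c) for c in counters)


# Values copied from the gate table above.
PRE = GateResult(
    moe_md5="8cb0ce23777bf55f92f63d0292c756b0",
    dense_md5="5951a5b4d624ce891e22ab5fca9bc439",
    mul_mat="1146/1146",
    mul_mat_id="806/806",
)
POST = GateResult(
    moe_md5="8cb0ce23777bf55f92f63d0292c756b0",
    dense_md5="5951a5b4d624ce891e22ab5fca9bc439",
    mul_mat="1146/1146",
    mul_mat_id="806/806",
)

if __name__ == "__main__":
    print("gates green:", gates_match(PRE, POST))
```

Comparing whole rows keeps the check strict: any drift in either md5 or either op counter between the pre and post runs fails the gate.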

llama.cpp MoE prefill, `npl=32`, `ntg=4`:

| npp | S_PP | MoE/FFN-GEMM | GDN | bf16-proj | layout-copy | act-quant | MoE-dispatch | gather | FA |
|-----|------|--------------|-----|-----------|-------------|-----------|--------------|--------|----|
| 512 | `2248.20` | `40.48%` | `18.00%` | `10.19%` | `7.82%` | `4.47%` | `1.94%` | `1.26%` | `0.71%` |
| 2048 | `2385.22` | `41.06%` | `16.15%` | `9.97%` | `7.96%` | `4.61%` | `2.12%` | `1.36%` | `1.18%` |

vLLM MoE prefill, `NSEQ=32`, `GEN=1`, `NREP=3`, eager profile path:

| PT | S_PP | ew/glue | GDN | FA | bf16-proj | MoE-dispatch | top unclassified |
|----|------|---------|-----|----|-----------|--------------|------------------|
| 512 | `5315.6` | `32.97%` | `18.34%` | `0.73%` | `3.41%` | `1.37%` | Marlin MoE `1940.99ms`, FP8 projection `565.74ms` |
| 2048 | `5384.4` | `33.48%` | `18.00%` | `1.75%` | `1.06%` | `0.49%` | Marlin MoE `7745.84ms`, FP8 projection `3047.75ms` |

Decision:

- Reject a Phase63 paged FlashAttention mask/block-table source patch. llama.cpp
FA is only `1.18%` of prefill GPU kernel time at `npp=2048`, below the `<5%`
reject rule and far below the `8%` source-funding threshold.
- The `npp=2048` FA cost is about `4.9 us/tok` for llama.cpp and `3.1 us/tok`
for vLLM, so the cross-engine FA delta is only about `1.7 us/tok`, below the
`15 us/tok` funding threshold.
- The dominant remaining llama.cpp buckets are still MoE/FFN GEMM, GDN,
bf16 projections, layout copies, and activation quantization. Phase63 did not
identify a new low-conflict source patch that can move GB10 parity without
reopening already-rejected W4A16/GDN/MTP/small-M work.
- No llama.cpp source files were modified. Default inferencing stayed green with
the canonical md5/op gates.
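
The per-token FA figures in the decision list can be approximated from the two tables. The sketch below is a rough cross-check in Python, not the method used to produce the recorded numbers: it assumes per-token bucket cost is `1e6 / S_PP` multiplied by the bucket's share, which treats GPU kernel-time share as if it were wall-time share. The `decide` helper and the way it combines the thresholds are also assumptions; only the threshold values, `S_PP`, and the FA percentages come from the text.

```python
REJECT_SHARE_PCT = 5.0   # "<5%" reject rule from the decision list
FUND_SHARE_PCT = 8.0     # source-funding share threshold
FUND_DELTA_US = 15.0     # cross-engine delta funding threshold, us/tok


def us_per_token(tokens_per_s: float, share_pct: float) -> float:
    """Approximate one bucket's cost in microseconds per prompt token.

    Assumption: bucket cost = (wall time per token) * (bucket share).
    """
    return 1e6 / tokens_per_s * share_pct / 100.0


def decide(share_pct: float, delta_us: float) -> str:
    """Hypothetical combination of the thresholds named in the text."""
    if share_pct < REJECT_SHARE_PCT:
        return "reject"
    if share_pct >= FUND_SHARE_PCT and delta_us >= FUND_DELTA_US:
        return "fund"
    return "undecided"


# npp/PT = 2048 rows from the llama.cpp and vLLM tables.
llama_fa = us_per_token(2385.22, 1.18)
vllm_fa = us_per_token(5384.4, 1.75)
delta = llama_fa - vllm_fa

if __name__ == "__main__":
    print(f"llama.cpp FA: {llama_fa:.2f} us/tok")
    print(f"vLLM FA:      {vllm_fa:.2f} us/tok")
    print(f"delta:        {delta:.2f} us/tok -> {decide(1.18, delta)}")
```

Under this approximation llama.cpp FA comes out near `4.95 us/tok` and the delta near `1.70 us/tok`, in line with the recorded values. The vLLM figure comes out near `3.25 us/tok` rather than the recorded `3.1 us/tok`, so the recorded number was presumably derived differently. Either way the delta is roughly an order of magnitude below the `15 us/tok` threshold.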
@@ -878,4 +878,24 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual

---

*Status: investigation CLOSED. This handoff is procedure; `VLLM_PARITY_FINAL.md` is the record. The path to parity is datacenter Blackwell, not GB10 kernels.*
## 8. PHASE63 RESULT: PREFILL BUCKET ATTRIBUTION

Phase63 is complete as a measurement-only no-go. The plan is
`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`; the
DGX artifact is `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.

Pre/post gates stayed green:

- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`;
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`;
- `MUL_MAT` `1146/1146`;
- `MUL_MAT_ID` `806/806`.

The candidate paged FlashAttention mask/block-table cleanup is rejected for now:
llama.cpp FA is only `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
`npp=2048` cross-engine FA delta is about `1.7 us/tok`, not the `15 us/tok`
needed to fund source work. No llama.cpp source files were modified.

*Status: Phase63 closed. `VLLM_PARITY_FINAL.md` remains the GB10 shortcut record;
the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16 projections,
layout copies, and activation quantization.*

@@ -85,6 +85,15 @@ The 10-16 full-attention layers' QK^T·softmax·PV is a separate kernel covered
## Bottom line
Two prefill levers (GEMM, GDN) are correctly the top-2 and account for most of the gap, but they are **not** the whole gap. The op-walk surfaces **MoE router+combine/scatter** and the **W4A4 activation-quant pass** as genuine, currently-untracked prefill contributors on the MoE decision model (~8-14% combined), plus **FA prefill** as a context-dependent risk the npp=128 bench hides. Per the methodology, step 0 is an nsys prefill-only window that explicitly breaks out `argsort/add(combine)`, `quantize_mmq_nvfp4`, and `flash_attn` as separate rows to size these three before funding a kernel.

Phase63 executed that step-0 discipline after the W4A16 direct-A and MTP
rejections. It stayed profile-first and inference-gated: pre/post canonical md5
and backend-op gates wrapped same-shape llama.cpp/vLLM prefill profiles at
`npp/PT=512` and `2048`. Result: FA is not a source lever on GB10 right now.
llama.cpp FA was `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
`npp=2048` cross-engine FA delta was about `1.7 us/tok`. The paged
FlashAttention mask/block-table cleanup remains a correctness/test gap worth
keeping in mind, but Phase63 rejects it as a parity patch.
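
The step-0 breakout described above (separate rows for `argsort/add(combine)`, `quantize_mmq_nvfp4`, and `flash_attn`) amounts to grouping profiled kernels by name and reporting each group's share of total kernel time. A minimal Python sketch follows, assuming kernel timings are already exported as `(name, milliseconds)` pairs. The bucket labels, the `bucket_shares` helper, and the sample kernel names and timings are all made up for illustration; only the three matching substrings come from the text, and this is not the project's attribution script.

```python
from collections import defaultdict

# (substring, bucket) rules, checked in order. The substrings are the three
# rows named in the text; the bucket labels are illustrative assumptions.
BUCKET_RULES = [
    ("flash_attn", "FA"),
    ("quantize_mmq_nvfp4", "act-quant"),
    ("argsort", "router+combine"),
]


def bucket_shares(kernel_times_ms):
    """Group (kernel_name, ms) pairs into buckets and return percent shares."""
    totals = defaultdict(float)
    for name, ms in kernel_times_ms:
        bucket = next(
            (label for pattern, label in BUCKET_RULES if pattern in name),
            "unclassified",
        )
        totals[bucket] += ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * ms / grand_total for label, ms in totals.items()}


# Made-up sample rows, NOT measured data.
SAMPLE = [
    ("flash_attn_ext_f16", 10.0),
    ("quantize_mmq_nvfp4", 40.0),
    ("argsort_f32_i32", 20.0),
    ("mul_mat_q", 130.0),
]

if __name__ == "__main__":
    for label, pct in sorted(bucket_shares(SAMPLE).items()):
        print(f"{label:16s} {pct:6.2f}%")
```

Keeping an explicit `unclassified` bucket makes it visible when a large kernel falls outside the rules, which is how the "top unclassified" column in a table like the vLLM one would be populated.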

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)
