mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record layout trace phase
Assisted-by: Codex:gpt-5
@@ -3553,3 +3553,50 @@ Decision:
reopening already-rejected W4A16/GDN/MTP/small-M work.
- No llama.cpp source files were modified. Default inferencing stayed green with
  the canonical md5/op gates.

## Layout Trace Phase64 Result

Phase64 is recorded in
`docs/superpowers/plans/2026-07-01-layout-trace-phase64.md`.
It added default-off CUDA layout attribution to the llama.cpp fork:

- Fork commit: `fa944bb5f feat(cuda): trace layout tensor names`
- Env gate: `LLAMA_LAYOUT_TRACE=<n>`
- Traced runtime routes: `GET_ROWS`, `CPY`, `CONT`, `DUP`, `CONCAT`
- DGX artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`

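The default-off env gate above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the fork's C++/CUDA code. It assumes `<n>` is a non-negative integer cap on emitted trace lines, which these notes do not state, and `trace_budget` is a hypothetical helper name.

```python
import os


def trace_budget(environ=os.environ, name="LLAMA_LAYOUT_TRACE"):
    """Return how many trace lines may be emitted; 0 means tracing is off.

    Assumption (not confirmed by the notes): <n> is a line cap. Unset, empty,
    non-numeric, or negative values all leave tracing disabled, so the gate
    stays default-off.
    """
    raw = environ.get(name, "").strip()
    try:
        n = int(raw)
    except ValueError:
        return 0
    return max(n, 0)
```

Treating every malformed value as "off" keeps a diagnostic patch from changing default behaviour by accident.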
Patched build gates passed:

| check | value |
|-------|-------|
| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` |
| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` |
| `MUL_MAT` | `1146/1146` |
| `MUL_MAT_ID` | `806/806` |

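A gate check of this shape could be sketched as follows. The two digests are the ones recorded above. What exactly gets hashed is not stated here, so the sketch assumes the raw bytes of a canonical generation output, and `md5_gate` and `op_gate` are hypothetical helper names rather than part of the actual gate scripts.

```python
import hashlib

# Digests recorded in the gate table above.
EXPECTED_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_gate(output: bytes, expected_hex: str) -> bool:
    """True when the md5 of the captured output matches the recorded digest."""
    return hashlib.md5(output).hexdigest() == expected_hex


def op_gate(result: str) -> bool:
    """True when a 'passed/total' op-test summary shows every case passing."""
    passed, total = (int(part) for part in result.split("/"))
    return total > 0 and passed == total
```

For example, `op_gate("1146/1146")` holds, while a summary such as `"805/806"` would fail the gate.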
Bounded `npp=512` trace distribution:

| route | lines |
|-------|------:|
| `get_rows` | `7268` |
| `cpy` | `2008` |
| `cont` | `1734` |
| `concat` | `990` |

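To read the distribution as shares, the short sketch below normalizes the four recorded counts. They sum to 12000 trace lines, of which `get_rows` is roughly 60.6%. These are counts of trace lines, not time, so they indicate how often each route dispatches rather than what it costs.

```python
# Line counts per route, copied from the table above.
counts = {"get_rows": 7268, "cpy": 2008, "cont": 1734, "concat": 990}

total = sum(counts.values())
shares = {route: 100.0 * n / total for route, n in counts.items()}

for route, pct in shares.items():
    print(f"{route:<9}{counts[route]:>6}{pct:>8.2f}%")
```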
Top traced layout sources:

- `concat conv_states_reshaped-N + qkv_mixed_transposed-N -> conv_input-N`
- `cpy conv_state_last-N -> conv_state_update-N`
- `get_rows cache_r_lN -> conv_states-N`
- `get_rows ffn_moe_probs-N -> ffn_moe_weights-N`
- `get_rows node_* with ffn_moe_topk-N` for expert fan-in weights
- attention mask/KV reshapes and f32-to-f16 copies for paged full-attention layers

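The `-N` and `_lN` placeholders above stand for per-layer indices. One way to collapse per-layer tensor names into such repeated chains is sketched below. The record shape `(route, sources, destination)` is an assumption, since the trace line format is not shown here, and `normalize` and `chain_counts` are hypothetical helpers, not the tooling used to produce the list.

```python
import re
from collections import Counter

# A trailing layer index written either as "-12" or as "_l12".
_LAYER_SUFFIX = re.compile(r"(-|_l)\d+$")


def normalize(name: str) -> str:
    """Map conv_input-12 -> conv_input-N and cache_r_l12 -> cache_r_lN."""
    return _LAYER_SUFFIX.sub(r"\g<1>N", name)


def chain_counts(records):
    """Count layout chains after folding layer indices into N.

    records: iterable of (route, [source names], destination name) tuples,
    an assumed in-memory form of already-parsed trace lines.
    """
    tally = Counter()
    for route, sources, destination in records:
        lhs = " + ".join(normalize(s) for s in sources)
        tally[f"{route} {lhs} -> {normalize(destination)}"] += 1
    return tally
```

Calling `chain_counts(records).most_common()` would then rank chains by how often they repeat across layers.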
Decision:

- Keep the instrumentation in the fork as a default-off diagnostic patch.
- Do not fund a Phase64 layout optimization yet. The trace points at GDN
  conv-state materialization, MoE top-k fan-in gathers, and paged-attention
  mask/KV reshapes, not a single clean projection/layout shortcut.
- Any Phase65 source work must either remove a named repeated layout chain with
  md5/op gates, or close as another measured no-go.

@@ -899,3 +899,29 @@ needed to fund source work. No llama.cpp source files were modified.
*Status: Phase63 closed. `VLLM_PARITY_FINAL.md` remains the GB10 shortcut record;
the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16 projections,
layout copies, and activation quantization.*

## 9. PHASE64 RESULT: LAYOUT TRACE

Phase64 added default-off layout attribution in the llama.cpp fork:
`fa944bb5f feat(cuda): trace layout tensor names`. The env gate is
`LLAMA_LAYOUT_TRACE=<n>`. It traces CUDA `GET_ROWS`, `CPY`, `CONT`, `DUP`, and
`CONCAT` runtime dispatch with tensor names, types, shapes, and contiguity flags.

DGX artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`.
Patched build gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
`MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`.

Trace result at MoE `npp=512`, `ntg=4`, `npl=32`:

- `get_rows`: `7268`
- `cpy`: `2008`
- `cont`: `1734`
- `concat`: `990`

The named layout sources are GDN conv-state gather/concat/update
(`cache_r_lN`, `conv_states_reshaped-N`, `qkv_mixed_transposed-N`,
`conv_input-N`, `conv_state_update-N`), MoE top-k fan-in gathers
(`ffn_moe_probs-N`, `ffn_moe_topk-N`, `ffn_moe_weights-N`), and paged-attention
mask/KV reshape/copy paths. This does not fund a clean layout optimization yet;
it gives Phase65 the exact names needed to either remove one repeated chain or
reject it with evidence.

@@ -94,6 +94,13 @@ llama.cpp FA was `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
FlashAttention mask/block-table cleanup remains a correctness/test gap worth
keeping in mind, but Phase63 rejects it as a parity patch.

Phase64 then attributed the remaining `layout-copy` bucket with default-off
`LLAMA_LAYOUT_TRACE=<n>` in fork commit `fa944bb5f`. The trace showed the
layout bucket is a mix of GDN conv-state materialization, MoE top-k fan-in
gathers, and paged-attention mask/KV reshape/copy paths. It did not expose a
single low-conflict projection/layout shortcut; use the Phase64 names before
funding any Phase65 source work.

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)
