diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index ca2c73143..a9c77bae1 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3553,3 +3553,50 @@ Decision: reopening already-rejected W4A16/GDN/MTP/small-M work.
 
 - No llama.cpp source files were modified. Default inferencing stayed green with
   the canonical md5/op gates.
+
+## Layout Trace Phase64 Result
+
+Phase64 is recorded in
+`docs/superpowers/plans/2026-07-01-layout-trace-phase64.md`.
+It added default-off CUDA layout attribution to the llama.cpp fork:
+
+- Fork commit: `fa944bb5f feat(cuda): trace layout tensor names`
+- Env gate: `LLAMA_LAYOUT_TRACE=`
+- Traced runtime routes: `GET_ROWS`, `CPY`, `CONT`, `DUP`, `CONCAT`
+- DGX artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`
+
+Patched build gates passed:
+
+| check | value |
+|-------|-------|
+| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` |
+| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` | `1146/1146` |
+| `MUL_MAT_ID` | `806/806` |
+
+Bounded `npp=512` trace distribution:
+
+| route | lines |
+|-------|------:|
+| `get_rows` | `7268` |
+| `cpy` | `2008` |
+| `cont` | `1734` |
+| `concat` | `990` |
+
+Top traced layout sources:
+
+- `concat conv_states_reshaped-N + qkv_mixed_transposed-N -> conv_input-N`
+- `cpy conv_state_last-N -> conv_state_update-N`
+- `get_rows cache_r_lN -> conv_states-N`
+- `get_rows ffn_moe_probs-N -> ffn_moe_weights-N`
+- `get_rows node_* with ffn_moe_topk-N` for expert fan-in weights
+- attention mask/KV reshapes and f32-to-f16 copies for paged full-attention layers
+
+Decision:
+
+- Keep the instrumentation in the fork as a default-off diagnostic patch.
+- Do not fund a Phase64 layout optimization yet. The trace points at GDN
+  conv-state materialization, MoE top-k fan-in gathers, and paged-attention
+  mask/KV reshapes, not a single clean projection/layout shortcut.
+- Any Phase65 source work must either remove a named repeated layout chain with
+  md5/op gates, or close as another measured no-go.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index c3cd56ed5..50883ae3c 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -899,3 +899,29 @@ needed to fund source work. No llama.cpp source files were modified.
 *Status: Phase63 closed. `VLLM_PARITY_FINAL.md` remains the GB10 shortcut
 record; the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16
 projections, layout copies, and activation quantization.*
+
+## 9. PHASE64 RESULT: LAYOUT TRACE
+
+Phase64 added default-off layout attribution in the llama.cpp fork:
+`fa944bb5f feat(cuda): trace layout tensor names`. The env gate is
+`LLAMA_LAYOUT_TRACE=`. It traces CUDA `GET_ROWS`, `CPY`, `CONT`, `DUP`, and
+`CONCAT` runtime dispatch with tensor names, types, shapes, and contiguity flags.
+
+DGX artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`.
+Patched build gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+`MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`.
+
+Trace result at MoE `npp=512`, `ntg=4`, `npl=32`:
+
+- `get_rows`: `7268`
+- `cpy`: `2008`
+- `cont`: `1734`
+- `concat`: `990`
+
+The named layout sources are GDN conv-state gather/concat/update
+(`cache_r_lN`, `conv_states_reshaped-N`, `qkv_mixed_transposed-N`,
+`conv_input-N`, `conv_state_update-N`), MoE top-k fan-in gathers
+(`ffn_moe_probs-N`, `ffn_moe_topk-N`, `ffn_moe_weights-N`), and paged-attention
+mask/KV reshape/copy paths. This does not fund a clean layout optimization yet;
+it gives Phase65 the exact names needed to either remove one repeated chain or
+reject it with evidence.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 8987bc44d..9ee9560d4 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -94,6 +94,13 @@ llama.cpp FA was `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the
 FlashAttention mask/block-table cleanup remains a correctness/test gap worth
 keeping in mind, but Phase63 rejects it as a parity patch.
 
+Phase64 then attributed the remaining `layout-copy` bucket with default-off
+`LLAMA_LAYOUT_TRACE=` in fork commit `fa944bb5f`. The trace showed the
+layout bucket is a mix of GDN conv-state materialization, MoE top-k fan-in
+gathers, and paged-attention mask/KV reshape/copy paths. It did not expose a
+single low-conflict projection/layout shortcut; use the Phase64 names before
+funding any Phase65 source work.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
 
 ## 2. Decode-serving compute hypotheses (ranked)
diff --git a/docs/superpowers/plans/2026-07-01-layout-trace-phase64.md b/docs/superpowers/plans/2026-07-01-layout-trace-phase64.md
new file mode 100644
index 000000000..61bc0b82e
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-layout-trace-phase64.md
@@ -0,0 +1,201 @@
+# Layout Trace Phase64 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Attribute the remaining llama.cpp `layout-copy` prefill bucket to concrete graph tensors without changing inference behavior.
+
+**Architecture:** Add default-off CUDA layout tracing for `GET_ROWS`, `CPY`, `CONT`, `DUP`, and `CONCAT`, gated by `LLAMA_LAYOUT_TRACE=`. Use the same md5/op gates before accepting the instrumentation, then run a bounded MoE prefill trace to decide whether the layout bucket exposes a low-conflict Phase65 source patch.
+
+**Tech Stack:** llama.cpp CUDA backend, LocalAI paged parity docs, DGX `dgx.casa`, `llama-batched-bench`, canonical md5/op gates.
+
+---
+
+## Guardrails
+
+- Trace must be silent when `LLAMA_LAYOUT_TRACE` is unset.
+- Trace must not alter tensor data or route decisions.
+- Do not regenerate LocalAI patch series in this phase.
+- Canonical gates:
+  - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+  - dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT` `1146/1146`
+  - `MUL_MAT_ID` `806/806`
+
+## Files
+
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`
+- Create: `docs/superpowers/plans/2026-07-01-layout-trace-phase64.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+---
+
+### Task 1: Add Default-Off Layout Trace
+
+- [x] **Step 1: Inspect measured layout rows**
+
+Phase63 kernel names at `npp=2048`:
+
+```text
+convert_unary: 721.23ms
+convert_unary: 634.91ms
+concat_non_cont: 566.04ms
+k_get_rows_float: 307.52ms
+cpy_scalar: 107.05ms
+```
+
+- [x] **Step 2: Add trace helper in `ggml-cuda.cu`**
+
+Implemented `LLAMA_LAYOUT_TRACE=` with route, op, dst/src names, types,
+shapes, and contiguity flags.
+
+- [x] **Step 3: Wire trace calls to runtime dispatch**
+
+Runtime cases traced:
+
+- `GGML_OP_GET_ROWS`
+- `GGML_OP_DUP`
+- `GGML_OP_CPY`
+- `GGML_OP_CONT`
+- `GGML_OP_CONCAT`
+
+- [x] **Step 4: Verify local diff**
+
+Run:
+
+```bash
+git -C /home/mudler/_git/llama.cpp diff --check
+```
+
+Expected: no output.
+
+Result: no output.
+
+---
+
+### Task 2: Build and Gate on DGX
+
+- [x] **Step 1: Acquire DGX lock**
+
+Result:
+
+```text
+docker=0 local_ai_worker=0 compute=0 lock=FREE released-by-codex-phase63-prefill-bucket 1782908317
+codex-phase64-layout-trace 1782908645
+```
+
+- [x] **Step 2: Apply the patch to DGX clean build tree**
+
+Result: applied to `/home/mudler/llama-phase6-source`; remote diff was
+`ggml/src/ggml-cuda/ggml-cuda.cu | 51 +++++++++++++++++++++++++++++++++++++++++`.
+
+- [x] **Step 3: Build CUDA targets**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source && cmake --build build-cuda --target llama-completion llama-batched-bench test-backend-ops -j $(nproc)'
+```
+
+Result: build passed.
+
+- [x] **Step 4: Run patched md5/op gates**
+
+Artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`
+
+```text
+patched moe_md5 8cb0ce23777bf55f92f63d0292c756b0 8cb0ce23777bf55f92f63d0292c756b0 ok
+patched dense_md5 5951a5b4d624ce891e22ab5fca9bc439 5951a5b4d624ce891e22ab5fca9bc439 ok
+patched MUL_MAT 1146/1146 1146/1146 ok
+patched MUL_MAT_ID 806/806 806/806 ok
+```
+
+---
+
+### Task 3: Run Bounded Layout Trace
+
+- [x] **Step 1: Run MoE prefill trace**
+
+Run:
+
+```bash
+LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 GGML_CUDA_DISABLE_GRAPHS=1 LLAMA_LAYOUT_TRACE=12000 \
+  ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \
+  -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512 -ntg 4 -npl 32
+```
+
+Result files:
+
+- `/home/mudler/bench/phase64_layout_trace/20260701_142519/layout_trace_npp512.trace`
+- `/home/mudler/bench/phase64_layout_trace/20260701_142519/layout_trace_summary2.txt`
+
+- [x] **Step 2: Reduce trace**
+
+Route distribution:
+
+| route | lines |
+|-------|------:|
+| `get_rows` | `7268` |
+| `cpy` | `2008` |
+| `cont` | `1734` |
+| `concat` | `990` |
+
+Top type pairs:
+
+| route/type | count |
+|------------|------:|
+| `get_rows f32 -> f32` | `6250` |
+| `get_rows f16 -> f32` | `1018` |
+| `concat f32 -> f32` | `990` |
+| `cpy f32 -> f32 noncontig -> contig` | `990` |
+| `cont f16 -> f16 noncontig -> contig` | `970` |
+| `cont f32 -> f32 noncontig -> contig` | `688` |
+| `cpy f32 -> f16 noncontig -> contig` | `660` |
+| `cpy f32 -> f16 contig -> contig` | `358` |
+
+Named sources:
+
+- `concat conv_states_reshaped-N + qkv_mixed_transposed-N -> conv_input-N`
+- `cpy conv_state_last-N -> conv_state_update-N`
+- `get_rows cache_r_lN -> conv_states-N`
+- `get_rows ffn_moe_probs-N -> ffn_moe_weights-N`
+- `get_rows node_* with ffn_moe_topk-N` for expert fan-in weights
+- attention mask/KV reshapes and f32-to-f16 copies for paged full-attention layers
+
+---
+
+### Task 4: Commit and Record
+
+- [x] **Step 1: Commit fork instrumentation**
+
+Result: `/home/mudler/_git/llama.cpp` commit
+`fa944bb5f feat(cuda): trace layout tensor names`.
+
+- [x] **Step 2: Record LocalAI docs**
+
+Result: this plan and parity docs updated.
+
+- [x] **Step 3: Commit LocalAI docs**
+
+Result: this commit records the Phase64 LocalAI docs.
+
+Command:
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-layout-trace-phase64.md
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+git commit -m "docs(paged): record layout trace phase" \
+  -m "Assisted-by: Codex:gpt-5"
+```
+
+---
+
+## Decision
+
+Phase64 keeps the instrumentation patch because it is default-off, low-conflict,
+and md5/op gated. It does not yet fund a layout optimization: the trace points at
+GDN conv-state materialization, MoE top-k fan-in gathers, and paged-attention
+mask/KV reshapes, not a single clean projection/layout shortcut.
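
Reviewer note on the "Reduce trace" step in Task 3: the route distribution could be reproduced with a small script along these lines. This is a minimal sketch, not the reducer that produced `layout_trace_summary2.txt`. The trace line format (`layout_trace route=<name> key=value ...`), the sample lines, and the function name `route_distribution` are assumptions made for illustration; only the four recorded route counts are taken from the plan.

```python
from collections import Counter


def route_distribution(lines, prefix="layout_trace"):
    """Count trace lines per route.

    Assumes a hypothetical line format: '<prefix> route=<name> key=value ...'.
    Lines that do not start with the prefix are ignored.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != prefix:
            continue
        for field in fields[1:]:
            key, sep, value = field.partition("=")
            if sep and key == "route":
                counts[value] += 1
                break
    return counts


# Hypothetical sample lines; tensor names follow the "Named sources" list.
sample = [
    "layout_trace route=get_rows dst=conv_states-0 src0=cache_r_l0",
    "layout_trace route=concat dst=conv_input-0 src0=conv_states_reshaped-0",
    "layout_trace route=cpy dst=conv_state_update-0 src0=conv_state_last-0",
    "layout_trace route=get_rows dst=ffn_moe_weights-0 src0=ffn_moe_probs-0",
    "unrelated log line",
]
sample_counts = route_distribution(sample)

# Route counts recorded in the plan's "Route distribution" table.
recorded = {"get_rows": 7268, "cpy": 2008, "cont": 1734, "concat": 990}
recorded_total = sum(recorded.values())

if __name__ == "__main__":
    print(dict(sample_counts))
    print(recorded_total)
```

The recorded counts add up to 12000, the value passed as `LLAMA_LAYOUT_TRACE=12000` in Task 3. That is consistent with the value acting as a cap on emitted trace lines, although the plan does not state this explicitly.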