docs(paged): record quant trace phase

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 12:42:13 +00:00
parent 55df9100dc
commit 3fbdfc21c9
4 changed files with 361 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3600,3 +3600,52 @@ Decision:
  mask/KV reshapes, not a single clean projection/layout shortcut.
 - Any Phase65 source work must either remove a named repeated layout chain with
  md5/op gates, or close as another measured no-go.
+
+## Quant Trace Phase65 Result
+
+Phase65 is recorded in
+`docs/superpowers/plans/2026-07-01-quant-trace-phase65.md`.
+It added default-off activation-quant route attribution to the llama.cpp fork:
+
+- Fork commit: `afc2c7030 feat(cuda): trace activation quant routes`
+- Env gate: `LLAMA_QUANT_TRACE=<n>`
+- DGX mirror commit: `7863194bd feat(cuda): trace activation quant routes`
+- DGX artifact: `/home/mudler/bench/phase65_quant_trace/20260701_143729`
+
+Patched build gates passed:
+
+| check | value |
+|-------|-------|
+| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` |
+| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` | `1146/1146` |
+| `MUL_MAT_ID` | `806/806` |
+
+Bounded MoE `npp=512`, `ntg=4`, `npl=32` quant trace:
+
+| route | lines |
+|-------|------:|
+| `mmq_dense` | `4444` |
+| `mmq_moe_dedup_unique` | `2960` |
+| `mmq_moe_gather` | `2960` |
+| `mmq_moe_flat` | `1480` |
+
+Dominant default-path shapes:
+
+| count | route | source family | K | rows | ne12 |
+|------:|-------|---------------|---:|-----:|-----:|
+| `2560` | `mmq_moe_dedup_unique` | gate/up experts | `2048` | `512` | `512` |
+| `2560` | `mmq_moe_gather` | gate/up experts | `2048` | `4096` | `512` |
+| `2560` | `mmq_dense` | shared expert gate/up | `2048` | `512` | `1` |
+| `1280` | `mmq_moe_flat` | down experts | `512` | `4096` | `512` |
+| `1280` | `mmq_dense` | shared expert down | `512` | `512` | `1` |
+
+Decision:
+
+- Keep the instrumentation in the fork as a default-off diagnostic patch.
+- Do not fund a quantization optimization from route counts alone. The trace
+  confirms the activation-quant bucket is concentrated in MoE gate/up dedup plus
+  gather, MoE down flat quantization, and shared-expert dense quantization, but
+  it does not prove which sub-kernel is material.
+- Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with
+  nsys/NVTX before changing source behavior.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -925,3 +925,30 @@ The named layout sources are GDN conv-state gather/concat/update
 mask/KV reshape/copy paths. This does not fund a clean layout optimization yet;
 it gives Phase65 the exact names needed to either remove one repeated chain or
 reject it with evidence.
+
+## 10. PHASE65 RESULT: QUANT TRACE
+
+Phase65 added default-off activation-quant route attribution in the llama.cpp
+fork: `afc2c7030 feat(cuda): trace activation quant routes`. The env gate is
+`LLAMA_QUANT_TRACE=<n>`. DGX mirror commit: `7863194bd`.
+
+DGX artifact: `/home/mudler/bench/phase65_quant_trace/20260701_143729`.
+Patched build gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`,
+`MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`.
+
+Trace result at MoE `npp=512`, `ntg=4`, `npl=32`:
+
+- `mmq_dense`: `4444`
+- `mmq_moe_dedup_unique`: `2960`
+- `mmq_moe_gather`: `2960`
+- `mmq_moe_flat`: `1480`
+
+The dominant default-path shapes are MoE gate/up expert activation quant
+deduplication (`K=2048`, `rows=512`) followed by gather to expert-token rows
+(`rows=4096`), shared-expert dense gate/up quantization (`K=2048`, `rows=512`),
+MoE down expert flat quantization (`K=512`, `rows=4096`), and shared-expert down
+quantization (`K=512`, `rows=512`). This confirms the activation-quant bucket is
+concentrated in named MoE/shared-expert FFN paths, but it does not prove whether
+`gather_mmq_fp4` is material or just a cheap cost of the existing dedup win.
+Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX
+before funding any behavior-changing source patch.
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -101,6 +101,14 @@ gathers, and paged-attention mask/KV reshape/copy paths. It did not expose a
 single low-conflict projection/layout shortcut; use the Phase64 names before
 funding any Phase65 source work.

+Phase65 attributed the activation-quant bucket with default-off
+`LLAMA_QUANT_TRACE=<n>` in fork commit `afc2c7030`. The default MoE prefill path
+emitted `mmq_dense 4444`, `mmq_moe_dedup_unique 2960`, `mmq_moe_gather 2960`,
+and `mmq_moe_flat 1480` trace lines at `npp=512`. The named paths are MoE
+gate/up expert quant dedup plus gather, MoE down expert flat quantization, and
+shared-expert dense quantization. Do not optimize from counts alone; Phase66
+should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX first.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

 ## 2. Decode-serving compute hypotheses (ranked)