From 60954d484a2c9f15ef469b25a7f77dcf80d89718 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 12:45:19 +0000
Subject: [PATCH] docs(paged): record quant kernel timing phase

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        |  29 +++++
 .../docs/PARITY_HANDOFF.md                    |  21 ++++
 .../docs/VLLM_PARITY_LEVER_MAP.md             |   6 +
 .../2026-07-01-quant-kernel-timing-phase66.md | 107 ++++++++++++++++++
 4 files changed, 163 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index ad2805ae4..27880c453 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3649,3 +3649,32 @@ Decision:
   it does not prove which sub-kernel is material.
 - Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX
   before changing source behavior.
+
+## Quant Kernel Timing Phase66 Result
+
+Phase66 is recorded in
+`docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md`.
+It used the Phase65-gated binary and Nsight Systems to time the activation-quant
+candidate kernels directly.
+
+- DGX artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`
+- Profile: `quant_npp512.nsys-rep`
+- Kernel summary: `quant_npp512_kern_sum_cuda_gpu_kern_sum.csv`
+- Shape: MoE `npp=512`, `ntg=4`, `npl=32`
+
+Observed total GPU kernel time: `7108388986 ns`.
+
+| kernel | time | instances | share |
+|--------|-----:|----------:|------:|
+| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` |
+| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` |
+| combined | `362580384 ns` | - | `5.10%` |
+
+Decision:
+
+- Reject a Phase66 gather/quant source optimization. `gather_mmq_fp4` is not a
+  material standalone target, and `quantize_mmq_nvfp4 + gather_mmq_fp4` is below
+  the `8%` source-funding threshold for this shape.
+- Do not reopen W4A16/no-activation-quant from this evidence. Earlier W4A16
+  phases already rejected that rewrite; Phase66 only rules out a smaller
+  gather/quant shortcut.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 29b287570..cf53d32a8 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -952,3 +952,24 @@ concentrated in named MoE/shared-expert FFN paths, but it does not prove whether
 `gather_mmq_fp4` is material or just a cheap cost of the existing dedup win.
 Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX
 before funding any behavior-changing source patch.
+
+## 11. PHASE66 RESULT: QUANT KERNEL TIMING
+
+Phase66 timed the Phase65 candidate kernels directly with Nsight Systems.
+Artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`.
+Profile: `quant_npp512.nsys-rep`; summary:
+`quant_npp512_kern_sum_cuda_gpu_kern_sum.csv`.
+
+Shape: MoE `npp=512`, `ntg=4`, `npl=32`. Total GPU kernel time:
+`7108388986 ns`.
+
+| kernel | time | instances | share |
+|--------|-----:|----------:|------:|
+| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` |
+| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` |
+| combined | `362580384 ns` | - | `5.10%` |
+
+Decision: reject a Phase66 gather/quant source patch. The gather is too small
+to target, and quantize plus gather is below the `8%` source-funding threshold.
+Do not reopen W4A16/no-activation-quant from this evidence; that larger rewrite
+was already rejected in earlier phases.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 4b69e7e20..8fff2aad9 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -109,6 +109,12 @@ gate/up expert quant dedup plus gather, MoE down expert flat quantization, and
 shared-expert dense quantization. Do not optimize from counts alone; Phase66
 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX first.
 
+Phase66 ran that timing pass. At MoE `npp=512`, total GPU kernel time was
+`7108388986 ns`; `quantize_mmq_nvfp4` was `317205504 ns` (`4.46%`),
+`gather_mmq_fp4` was `45374880 ns` (`0.64%`), combined `5.10%`. Reject a
+gather/quant shortcut on GB10 for now: the gather is not material and the
+combined route is below the `8%` source-funding threshold.
+
 Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).
 
 ## 2. Decode-serving compute hypotheses (ranked)
diff --git a/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md b/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md
new file mode 100644
index 000000000..61a8a1e03
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md
@@ -0,0 +1,107 @@
+# Quant Kernel Timing Phase66 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Time the Phase65 activation-quant candidate kernels directly and decide whether a source optimization is funded.
+
+**Architecture:** Use the already-gated Phase65 llama.cpp binary on DGX and collect an Nsight Systems CUDA kernel summary for the same MoE `npp=512`, `ntg=4`, `npl=32` prefill shape. Compare `quantize_mmq_nvfp4` and `gather_mmq_fp4` against total GPU kernel time.
+
+**Tech Stack:** llama.cpp CUDA backend, Nsight Systems 2025.3.2, DGX GB10 benchmark host, LocalAI parity docs.
+
+---
+
+## Files
+
+- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md`
+- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+
+---
+
+### Task 1: DGX Profile
+
+- [x] **Step 1: Confirm DGX is idle**
+
+Observed before profiling: lock `FREE`, Docker `0`, `local-ai-worker` `0`,
+compute apps `0`.
+
+- [x] **Step 2: Acquire lock**
+
+Observed lock owner: `codex-phase66-quant-kernel-timing 1782909776`.
+
+- [x] **Step 3: Run Nsight Systems profile**
+
+Artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`.
+
+Command shape:
+
+```bash
+LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 GGML_CUDA_DISABLE_GRAPHS=1 \
+  nsys profile --trace=cuda,nvtx --cuda-graph-trace=node --force-overwrite=true \
+  --sample=none --cpuctxsw=none \
+  -o "$ART/quant_npp512" \
+  ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \
+  -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512 -ntg 4 -npl 32
+```
+
+- [x] **Step 4: Generate CUDA kernel summary**
+
+Generated:
+
+```text
+/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256/quant_npp512_kern_sum_cuda_gpu_kern_sum.csv
+```
+
+---
+
+### Task 2: Decide
+
+- [x] **Step 1: Extract candidate kernel timing**
+
+Observed total GPU kernel time: `7108388986 ns`.
+
+| kernel | time | instances | share |
+|--------|-----:|----------:|------:|
+| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` |
+| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` |
+| combined | `362580384 ns` | - | `5.10%` |
+
+- [x] **Step 2: Source decision**
+
+Reject a Phase66 gather/quant source optimization. `gather_mmq_fp4` is not
+material, and `quantize_mmq_nvfp4 + gather_mmq_fp4` is below the `8%` source
+funding threshold for this profiled shape. A W4A16/no-activation-quant rewrite
+has already been rejected in earlier phases, so do not reopen it from this data.
+
+- [x] **Step 3: Release lock**
+
+Observed release state:
+
+```text
+FREE released-by-codex-phase66-quant-kernel-timing 1782909826
+docker=0
+local_ai_worker=0
+compute_apps=0
+```
+
+---
+
+### Task 3: Commit and Record
+
+- [x] **Step 1: Record LocalAI docs**
+
+This plan and parity docs record the Phase66 no-go decision.
+
+- [x] **Step 2: Commit LocalAI docs**
+
+Expected commit:
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+git commit -m "docs(paged): record quant kernel timing phase" \
+  -m "Assisted-by: Codex:gpt-5"
+```
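
Review note (not part of the patch): the share column and the no-go decision recorded in this patch can be re-derived from the three recorded durations. The sketch below is a minimal check under stated assumptions. It hardcodes the figures from the Phase66 table instead of parsing the Nsight Systems CSV, it assumes the funding rule is "combined share at or above `8%`" (the patch only states that the result is below the threshold), and the helper names `share_pct` and `decide` are hypothetical.

```python
# Figures copied from the Phase66 table in this patch (nanoseconds).
TOTAL_GPU_KERNEL_NS = 7_108_388_986
KERNEL_NS = {
    "quantize_mmq_nvfp4": 317_205_504,
    "gather_mmq_fp4": 45_374_880,
}
# Assumption: a source patch is funded only when the combined share
# reaches this percentage; the patch only says 5.10% is below it.
FUNDING_THRESHOLD_PCT = 8.0


def share_pct(kernel_ns: int, total_ns: int = TOTAL_GPU_KERNEL_NS) -> float:
    """Return a kernel's share of total GPU kernel time, in percent."""
    return 100.0 * kernel_ns / total_ns


def decide(kernel_ns=KERNEL_NS, total_ns=TOTAL_GPU_KERNEL_NS,
           threshold_pct=FUNDING_THRESHOLD_PCT):
    """Return (combined_ns, shares rounded to 2 decimals, funded flag)."""
    combined_ns = sum(kernel_ns.values())
    shares = {name: round(share_pct(ns, total_ns), 2)
              for name, ns in kernel_ns.items()}
    shares["combined"] = round(share_pct(combined_ns, total_ns), 2)
    # Compare the unrounded share so rounding cannot flip the decision.
    funded = share_pct(combined_ns, total_ns) >= threshold_pct
    return combined_ns, shares, funded


if __name__ == "__main__":
    combined_ns, shares, funded = decide()
    print(f"combined: {combined_ns} ns")
    for name, pct in shares.items():
        print(f"{name}: {pct:.2f}%")
    print("decision:", "fund source patch" if funded else "reject (below threshold)")
```

Run as a script, this reproduces the recorded `4.46%`, `0.64%`, and `5.10%` shares and the reject decision. To see how sensitive the decision is, pass a different `threshold_pct` to `decide`.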