mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record quant kernel timing phase
Assisted-by: Codex:gpt-5
@@ -3649,3 +3649,32 @@ Decision:
it does not prove which sub-kernel is material.
- Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with
nsys/NVTX before changing source behavior.

## Quant Kernel Timing Phase66 Result

Phase66 is recorded in
`docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md`.
It used the Phase65-gated binary and Nsight Systems to time the activation-quant
candidate kernels directly.

- DGX artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`
- Profile: `quant_npp512.nsys-rep`
- Kernel summary: `quant_npp512_kern_sum_cuda_gpu_kern_sum.csv`
- Shape: MoE `npp=512`, `ntg=4`, `npl=32`

Observed total GPU kernel time: `7108388986 ns`.

| kernel | time | instances | share |
|--------|-----:|----------:|------:|
| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` |
| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` |
| combined | `362580384 ns` | - | `5.10%` |

Decision:

- Reject a Phase66 gather/quant source optimization. `gather_mmq_fp4` is not a
material standalone target, and `quantize_mmq_nvfp4 + gather_mmq_fp4` is below
the `8%` source-funding threshold for this shape.
- Do not reopen W4A16/no-activation-quant from this evidence. Earlier W4A16
phases already rejected that rewrite; Phase66 only rules out a smaller
gather/quant shortcut.
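
The share column and the threshold decision above can be reproduced from the recorded totals. The following is a minimal sketch, not part of the commit: the function names are hypothetical, and treating "at or above 8%" as the funding condition is an assumption (the record only states that the combined share falls below the threshold).

```python
# Values copied from the Phase66 kernel summary table.
TOTAL_GPU_KERNEL_NS = 7_108_388_986
KERNEL_NS = {
    "quantize_mmq_nvfp4": 317_205_504,
    "gather_mmq_fp4": 45_374_880,
}
# The 8% source-funding threshold named in the decision.
FUNDING_THRESHOLD = 0.08


def share(time_ns: int, total_ns: int = TOTAL_GPU_KERNEL_NS) -> float:
    """Fraction of total GPU kernel time spent in one kernel (or group)."""
    return time_ns / total_ns


def meets_threshold(time_ns: int, threshold: float = FUNDING_THRESHOLD) -> bool:
    """Assumption: a candidate is fundable when its share is >= the threshold."""
    return share(time_ns) >= threshold


# Combined time of the two candidate kernels.
combined_ns = sum(KERNEL_NS.values())

if __name__ == "__main__":
    for name, ns in KERNEL_NS.items():
        print(f"{name}: {share(ns):.2%}")
    print(f"combined: {share(combined_ns):.2%}, "
          f"fundable={meets_threshold(combined_ns)}")
```

Under these inputs neither kernel alone nor the pair reaches the threshold, which matches the rejection recorded above.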
@@ -952,3 +952,24 @@ concentrated in named MoE/shared-expert FFN paths, but it does not prove whether
`gather_mmq_fp4` is material or just a cheap cost of the existing dedup win.
Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX
before funding any behavior-changing source patch.

## 11. PHASE66 RESULT: QUANT KERNEL TIMING

Phase66 timed the Phase65 candidate kernels directly with Nsight Systems.
Artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`.
Profile: `quant_npp512.nsys-rep`; summary:
`quant_npp512_kern_sum_cuda_gpu_kern_sum.csv`.

Shape: MoE `npp=512`, `ntg=4`, `npl=32`. Total GPU kernel time:
`7108388986 ns`.

| kernel | time | instances | share |
|--------|-----:|----------:|------:|
| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` |
| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` |
| combined | `362580384 ns` | - | `5.10%` |

Decision: reject a Phase66 gather/quant source patch. The gather is too small
to target, and quantize plus gather is below the `8%` source-funding threshold.
Do not reopen W4A16/no-activation-quant from this evidence; that larger rewrite
was already rejected in earlier phases.
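
One way the per-kernel rows could be pulled out of a kernel summary CSV is sketched below. This is an illustration, not the procedure the commit used. The column names are an assumption about the `cuda_gpu_kern_sum` CSV layout and should be checked against the header of the real file; the sample text is a stand-in built from the two table rows above, not the contents of `quant_npp512_kern_sum_cuda_gpu_kern_sum.csv`. Matching is by substring because real kernel names may carry template arguments.

```python
import csv
import io

# Assumed column names; verify against the actual CSV header before use.
TIME_COL = "Total Time (ns)"
COUNT_COL = "Instances"
NAME_COL = "Name"

# Illustrative stand-in built from the table rows, not the real summary file.
SAMPLE_CSV = """Total Time (ns),Instances,Name
317205504,8884,quantize_mmq_nvfp4
45374880,2960,gather_mmq_fp4
"""


def sum_matching(csv_text: str, needle: str) -> tuple[int, int]:
    """Sum total time (ns) and launch count over rows whose name contains needle."""
    total_ns = 0
    instances = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        if needle in row[NAME_COL]:
            total_ns += int(row[TIME_COL])
            instances += int(row[COUNT_COL])
    return total_ns, instances


if __name__ == "__main__":
    for needle in ("quantize_mmq_nvfp4", "gather_mmq_fp4"):
        ns, count = sum_matching(SAMPLE_CSV, needle)
        # Mean time per launch; says nothing about the distribution.
        print(f"{needle}: {ns} ns over {count} launches, "
              f"mean {ns / count:.0f} ns per launch")
```

Summing over substring matches keeps the result stable if one logical kernel appears as several template instantiations in the summary.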
@@ -109,6 +109,12 @@ gate/up expert quant dedup plus gather, MoE down expert flat quantization, and
shared-expert dense quantization. Do not optimize from counts alone; Phase66
should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX first.

Phase66 ran that timing pass. At MoE `npp=512`, total GPU kernel time was
`7108388986 ns`; `quantize_mmq_nvfp4` was `317205504 ns` (`4.46%`),
`gather_mmq_fp4` was `45374880 ns` (`0.64%`), combined `5.10%`. Reject a
gather/quant shortcut on GB10 for now: the gather is not material and the
combined route is below the `8%` source-funding threshold.

Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189).

## 2. Decode-serving compute hypotheses (ranked)