docs(paged): profile current W4A16 prefill

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 10:56:48 +00:00
parent ef7dbfa5f7
commit fc5d5e4ff3
4 changed files with 179 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3352,3 +3352,55 @@ Decision:
  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet.
 - Treat this as latency tuning, not the next parity track. The larger gap is
  still prefill / MoE compute.
+
+## Phase 60 Current W4A16 Prefill Profile
+
+Phase 60 re-profiles the current clean W4A16 grouped MoE prefill path after the
+Phase1-5 W4A16 work, to decide whether another low-conflict W4A16 patch is
+justified.
+
+Artifact:
+
+- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`
+
+Pre/post gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+MoE `llama-batched-bench`, `npl=32`, `ntg=4`:
+
+| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
+|------|----|----------|--------|----------|-------------|
+| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` |
+| default FP4-MMQ | `2048` | `2423.20` | `27.045` | `391.58` | `2398.94` |
+| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` |
+| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` |
+
+Forced W4A16 S_PP remains at `0.623x` of default FP4-MMQ at `npp=512` and
+`0.612x` at `npp=2048`.
+
+`npp=512` profile:
+
+| path | top bucket | time % | total time |
+|------|------------|--------|------------|
+| default FP4-MMQ | `mul_mat_q<nvfp4,128>` | `39.2%` | `2.712s` |
+| default FP4-MMQ | `quantize_mmq_nvfp4` | `4.5%` | `0.314s` |
+| forced W4A16 | `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` |
+| forced W4A16 | `k_get_rows_float<float,float>` | `11.2%` | `1.094s` |
+| forced W4A16 | `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` |
+| forced W4A16 | residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` |
+
+Decision:
+
+- Reject another small W4A16 body/metadata/cast tweak as the next parity phase.
+  The current forced path avoids most activation quantization, but the grouped
+  W4A16 kernel itself takes `1.53x` the time of default MMQ's main `mul_mat_q`
+  bucket at `npp=512`, and sorted activation gathers add another `1.094s`.
+- Eliminating the cast kernel entirely would recover at most `5.3%` of forced
+  W4A16 profile time, far short of the `37-39%` end-to-end S_PP loss.
+- Any future W4A16 parity work must be a larger redesign that improves the
+  grouped kernel body and removes or fuses the sorted activation gather. Do not
+  reopen the low-conflict micro-patch track.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -717,6 +717,22 @@ knob. It does not prove vLLM parity progress by itself. Do not default it until
 more workload coverage exists, and do not regenerate LocalAI patches until the
 fork commits are pushed with explicit approval.

+Phase 60 re-profiled the current W4A16 grouped MoE prefill path to check whether
+there was still a low-conflict W4A16 shortcut after Phase1-5. Artifact:
+`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`. Pre/post
+gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. Default FP4-MMQ S_PP was `2327.69` at `npp=512` and `2423.20` at
+`npp=2048`; forced W4A16 was `1451.00` and `1482.76`, only `0.623x` and
+`0.612x` of default. The `npp=512` profile showed W4A16 still dominated by
+`w4a16_grouped_kernel` (`4.142s`, `42.5%`) plus sorted activation gathers
+(`1.094s`, `11.2%`), while the cast kernel was only `0.517s` (`5.3%`).
+
+Decision: do not add another small W4A16 metadata/body/cast patch. Future W4A16
+work needs a larger redesign that improves the grouped kernel body and removes
+or fuses sorted activation movement. Near-term GB10 parity work should return to
+broader prefill/GDN/MoE design or hardware-pivot benchmarking.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1488,6 +1488,36 @@ lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
 `0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
 scheduler knob opt-in and return parity work to the prefill / MoE compute gap.

+### Phase 60 current W4A16 prefill profile
+
+Phase 60 re-profiled the current W4A16 grouped MoE prefill path after the
+Phase1-5 W4A16 work. Artifact:
+`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`.
+
+Pre/post md5 and op gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+MoE prefill A/B (`npl=32`, `ntg=4`) still rules out W4A16 as an incremental
+parity path:
+
+| path | `npp=512` S_PP t/s | `npp=2048` S_PP t/s |
+|------|--------------------|---------------------|
+| default FP4-MMQ | `2327.69` | `2423.20` |
+| forced W4A16 | `1451.00` | `1482.76` |
+
+At `npp=512`, default MMQ spends `2.712s` (`39.2%`) in its main
+`mul_mat_q<nvfp4,128>` bucket. Forced W4A16 spends `4.142s` (`42.5%`) in
+`w4a16_grouped_kernel<32,128,1,4,2>`, plus `1.094s` (`11.2%`) in
+`k_get_rows_float<float,float>` sorted activation gathers and `0.517s` (`5.3%`)
+in `w4a16_cast_act_f32_bf16`.
+
+Decision: do not add another W4A16 micro-patch. Cast elimination alone cannot
+close a `37-39%` S_PP loss, and the dominant loss is the grouped kernel body
+plus sorted activation movement. Future W4A16 parity work must be a larger
+design that changes those structures, not another metadata/body shortcut.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-w4a16-current-profile-phase60.md
+++ b/docs/superpowers/plans/2026-07-01-w4a16-current-profile-phase60.md
@@ -0,0 +1,81 @@
+# Phase 60: Current W4A16 Prefill Profile
+
+## Goal
+
+Re-profile the current clean W4A16 grouped MoE prefill path after the Phase1-5
+W4A16 work, then decide whether another low-conflict W4A16 patch is justified.
+
+## Artifact
+
+- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`
+
+## Source State
+
+- DGX mirror: `~/llama-phase6-source`
+- Branch: `localai-paged`
+- Commit: `2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`
+
+## Gates
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+DGX cleanup:
+
+- Docker containers: `0`
+- GPU compute apps: `0`
+- Lock released: `FREE phase60-cleanup 20260701T105438Z`
+
+## End-to-End A/B
+
+MoE `llama-batched-bench`, `npl=32`, `ntg=4`, `npp=512,2048`:
+
+| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
+|------|----|----------|--------|----------|-------------|
+| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` |
+| default FP4-MMQ | `2048` | `2423.20` | `27.045` | `391.58` | `2398.94` |
+| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` |
+| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` |
+
+Forced W4A16 remains:
+
+- `0.623x` default FP4-MMQ at `npp=512` (`-37.7%` S_PP).
+- `0.612x` default FP4-MMQ at `npp=2048` (`-38.8%` S_PP).
+
+## `npp=512` Kernel Summary
+
+Default FP4-MMQ top rows:
+
+| bucket | time % | total time |
+|--------|--------|------------|
+| `mul_mat_q<nvfp4,128>` | `39.2%` | `2.712s` |
+| `gated_delta_net_chunked_cuda` | `12.2%` | `0.843s` |
+| `quantize_mmq_nvfp4` | `4.5%` | `0.314s` |
+
+Forced W4A16 top rows:
+
+| bucket | time % | total time |
+|--------|--------|------------|
+| `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` |
+| `k_get_rows_float<float,float>` | `11.2%` | `1.094s` |
+| `gated_delta_net_chunked_cuda` | `8.6%` | `0.838s` |
+| `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` |
+| residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` |
+
+## Decision
+
+Reject another small W4A16 body/metadata/cast tweak as the next parity phase.
+
+The current W4A16 path avoids most activation quantization, but the grouped
+kernel still takes `1.53x` the time of default MMQ's main `mul_mat_q` bucket at
+`npp=512` (`4.142s` versus `2.712s`) and sorted activation gathers add another
+`1.094s`. Eliminating the cast kernel entirely would recover only `5.3%` of the
+forced-W4A16 profile and would not close the `37-39%` end-to-end S_PP loss.
+
+Next W4A16 work would need a larger redesign that both improves the grouped
+kernel body and removes or fuses the sorted activation gather. That is outside
+the low-conflict incremental patch track. For near-term parity work, return to
+the broader prefill/GDN/MoE design track or a hardware-pivot benchmark rather
+than another W4A16 micro-patch.
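
The ratios quoted in the Phase 60 decision can be re-derived from the table values. The following is a minimal sketch in Python, with figures copied from the tables above. All variable and function names are illustrative and not part of the benchmark tooling. The best-case estimate assumes the cast kernel's share of profiled time maps directly onto end-to-end prefill time, which the profile itself does not establish.

```python
# Figures copied from the Phase 60 tables. Names here are hypothetical helpers.
S_PP = {  # prefill throughput in tokens/s, keyed by path and npp
    "default FP4-MMQ": {512: 2327.69, 2048: 2423.20},
    "forced W4A16": {512: 1451.00, 2048: 1482.76},
}

# npp=512 profile buckets, total time in seconds.
MUL_MAT_Q_S = 2.712       # default path: mul_mat_q<nvfp4,128>
W4A16_GROUPED_S = 4.142   # forced path: w4a16_grouped_kernel<32,128,1,4,2>
CAST_SHARE = 0.053        # forced path: w4a16_cast_act_f32_bf16 time share


def spp_ratio(npp: int) -> float:
    """Forced W4A16 S_PP as a fraction of default FP4-MMQ S_PP."""
    return S_PP["forced W4A16"][npp] / S_PP["default FP4-MMQ"][npp]


def spp_loss_pct(npp: int) -> float:
    """End-to-end S_PP loss of forced W4A16 versus default, in percent."""
    return (1.0 - spp_ratio(npp)) * 100.0


# Time ratio of the grouped W4A16 kernel to default MMQ's main bucket.
kernel_time_ratio = W4A16_GROUPED_S / MUL_MAT_Q_S

# ASSUMPTION: removing the cast kernel shortens end-to-end prefill by exactly
# its profile share. Under that assumption the best-case ratio stays well
# below parity (1.0).
best_case_ratio_512 = spp_ratio(512) / (1.0 - CAST_SHARE)

if __name__ == "__main__":
    for npp in (512, 2048):
        print(f"npp={npp}: {spp_ratio(npp):.3f}x default, "
              f"{spp_loss_pct(npp):.1f}% S_PP loss")
    print(f"grouped kernel vs mul_mat_q: {kernel_time_ratio:.2f}x time")
    print(f"best case without cast (npp=512): {best_case_ratio_512:.3f}x default")
```

Under these assumptions the cast-free upper bound at `npp=512` stays well short of parity, which is consistent with the decision to stop the micro-patch track.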