docs(paged): profile current W4A16 prefill

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 10:56:48 +00:00
parent ef7dbfa5f7
commit fc5d5e4ff3
4 changed files with 179 additions and 0 deletions


@@ -3352,3 +3352,55 @@ Decision:
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet.
- Treat this as latency tuning, not the next parity track. The larger gap is
still prefill / MoE compute.
## Phase 60 Current W4A16 Prefill Profile
Phase 60 re-profiles the current clean W4A16 grouped MoE prefill path after the
Phase1-5 W4A16 work, to decide whether another low-conflict W4A16 patch is
justified.
Artifact:
- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`
Pre/post gates:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
MoE `llama-batched-bench`, `npl=32`, `ntg=4`:
| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
|------|----|----------|--------|----------|-------------|
| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` |
| default FP4-MMQ | `2048` | `2423.20` | `27.045` | `391.58` | `2398.94` |
| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` |
| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` |
Forced W4A16 S_PP remains at `0.623x` of default FP4-MMQ at `npp=512` and
`0.612x` at `npp=2048`.
`npp=512` profile:
| path | top bucket | time % | total time |
|------|------------|--------|------------|
| default FP4-MMQ | `mul_mat_q<nvfp4,128>` | `39.2%` | `2.712s` |
| default FP4-MMQ | `quantize_mmq_nvfp4` | `4.5%` | `0.314s` |
| forced W4A16 | `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` |
| forced W4A16 | `k_get_rows_float<float,float>` | `11.2%` | `1.094s` |
| forced W4A16 | `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` |
| forced W4A16 | residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` |
Decision:
- Reject another small W4A16 body/metadata/cast tweak as the next parity phase.
The current forced path avoids most activation quantization, but the grouped
W4A16 kernel itself is `1.53x` slower than default MMQ's main `mul_mat_q`
bucket at `npp=512`, and sorted activation gathers add another `1.094s`.
- Eliminating the cast kernel entirely would recover at most `5.3%` of the
forced W4A16 profile time, far short of the `37-39%` end-to-end S_PP loss.
- Any future W4A16 parity work must be a larger redesign that improves the
grouped kernel body and removes or fuses the sorted activation gather. Do not
reopen the low-conflict micro-patch track.


@@ -717,6 +717,22 @@ knob. It does not prove vLLM parity progress by itself. Do not default it until
more workload coverage exists, and do not regenerate LocalAI patches until the
fork commits are pushed with explicit approval.
Phase 60 re-profiled the current W4A16 grouped MoE prefill path to check whether
there was still a low-conflict W4A16 shortcut after Phase1-5. Artifact:
`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`. Pre/post
gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Default FP4-MMQ S_PP was `2327.69` at `npp=512` and `2423.20` at
`npp=2048`; forced W4A16 was `1451.00` and `1482.76`, only `0.623x` and
`0.612x` of default. The `npp=512` profile showed W4A16 still dominated by
`w4a16_grouped_kernel` (`4.142s`, `42.5%`) plus sorted activation gathers
(`1.094s`, `11.2%`), while the cast kernel was only `0.517s` (`5.3%`).
Decision: do not add another small W4A16 metadata/body/cast patch. Future W4A16
work needs a larger redesign that improves the grouped kernel body and removes
or fuses sorted activation movement. Near-term GB10 parity work should return to
broader prefill/GDN/MoE design or hardware-pivot benchmarking.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)


@@ -1488,6 +1488,36 @@ lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
scheduler knob opt-in and return parity work to the prefill / MoE compute gap.
### Phase 60 current W4A16 prefill profile
Phase 60 re-profiled the current W4A16 grouped MoE prefill path after the
Phase1-5 W4A16 work. Artifact:
`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`.
Pre/post md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
MoE prefill A/B (`npl=32`, `ntg=4`) still rejects W4A16 as an incremental
parity path:
| path | npp512 S_PP | npp2048 S_PP |
|------|-------------|--------------|
| default FP4-MMQ | `2327.69` | `2423.20` |
| forced W4A16 | `1451.00` | `1482.76` |
At `npp=512`, default MMQ spends `2.712s` (`39.2%`) in its main
`mul_mat_q<nvfp4,128>` bucket. Forced W4A16 spends `4.142s` (`42.5%`) in
`w4a16_grouped_kernel<32,128,1,4,2>`, plus `1.094s` (`11.2%`) in
`k_get_rows_float<float,float>` sorted activation gathers and `0.517s` (`5.3%`)
in `w4a16_cast_act_f32_bf16`.
Decision: do not add another W4A16 micro-patch. Cast elimination alone cannot
close a `37-39%` S_PP loss, and the dominant loss is the grouped kernel body
plus sorted activation movement. Future W4A16 parity work must be a larger
design that changes those structures, not another metadata/body shortcut.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,81 @@
# Phase 60: Current W4A16 Prefill Profile
## Goal
Re-profile the current clean W4A16 grouped MoE prefill path after the Phase1-5
W4A16 work, then decide whether another low-conflict W4A16 patch is justified.
## Artifact
- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`
## Source State
- DGX mirror: `~/llama-phase6-source`
- Branch: `localai-paged`
- Commit: `2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`
## Gates
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
DGX cleanup:
- Docker containers: `0`
- GPU compute apps: `0`
- Lock released: `FREE phase60-cleanup 20260701T105438Z`
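The gate table above can be checked mechanically. The sketch below is illustrative only: it assumes the `MUL_MAT` and `MUL_MAT_ID` cells are `passed/total` counts, and the dict layout and function names are hypothetical rather than the format of the real gate logs.

```python
# Gate values copied from the table above. The dict layout and key names are
# illustrative, not the format of the real gate logs.
PRE = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": "1146/1146",
    "MUL_MAT_ID": "806/806",
}
POST = dict(PRE)  # the post-run row is identical to the pre-run row


def op_gate_passed(result: str) -> bool:
    """Return True when a 'passed/total' string reports every test passing."""
    passed, total = (int(part) for part in result.split("/"))
    return total > 0 and passed == total


def gates_green(pre: dict, post: dict) -> bool:
    """Green means: md5 sums unchanged across the run and all op tests pass."""
    md5_stable = all(pre[k] == post[k] for k in ("moe_md5", "dense_md5"))
    ops_ok = all(
        op_gate_passed(run[k])
        for run in (pre, post)
        for k in ("MUL_MAT", "MUL_MAT_ID")
    )
    return md5_stable and ops_ok


print(gates_green(PRE, POST))
```

A changed md5 or any `passed < total` cell in either row would turn the gate red under this reading.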
## End-to-End A/B
MoE `llama-batched-bench`, `npl=32`, `ntg=4`, `npp=512,2048`:
| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
|------|----|----------|--------|----------|-------------|
| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` |
| default FP4-MMQ | `2048` | `2423.20` | `27.045` | `391.58` | `2398.94` |
| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` |
| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` |
Forced W4A16 remains:
- `0.623x` default FP4-MMQ at `npp=512` (`-37.7%` S_PP).
- `0.612x` default FP4-MMQ at `npp=2048` (`-38.8%` S_PP).
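The ratios above follow directly from the S_PP column. The sketch below recomputes them and also cross-checks T_PP, assuming S_PP is total prompt tokens (`npl * npp`) divided by T_PP, which the table values are consistent with. The variable and function names are illustrative.

```python
NPL = 32  # parallel sequences, from the benchmark settings above

# (path, npp) -> (S_PP in t/s, T_PP in s), copied from the A/B table.
RUNS = {
    ("default", 512): (2327.69, 7.039),
    ("default", 2048): (2423.20, 27.045),
    ("forced", 512): (1451.00, 11.291),
    ("forced", 2048): (1482.76, 44.199),
}


def s_pp_ratio(npp: int) -> float:
    """Forced W4A16 prefill throughput as a fraction of default FP4-MMQ."""
    return RUNS[("forced", npp)][0] / RUNS[("default", npp)][0]


def implied_t_pp(path: str, npp: int) -> float:
    """Prefill time implied by S_PP, assuming S_PP = npl * npp / T_PP."""
    return NPL * npp / RUNS[(path, npp)][0]


for npp in (512, 2048):
    ratio = s_pp_ratio(npp)
    print(f"npp={npp}: {ratio:.3f}x default, {100 * (ratio - 1):+.1f}% S_PP")

for (path, npp), (_, t_pp) in RUNS.items():
    print(f"{path} npp={npp}: table T_PP={t_pp}, implied={implied_t_pp(path, npp):.3f}")
```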
## `npp=512` Kernel Summary
Default FP4-MMQ top rows:
| bucket | time % | total time |
|--------|--------|------------|
| `mul_mat_q<nvfp4,128>` | `39.2%` | `2.712s` |
| `gated_delta_net_chunked_cuda` | `12.2%` | `0.843s` |
| `quantize_mmq_nvfp4` | `4.5%` | `0.314s` |
Forced W4A16 top rows:
| bucket | time % | total time |
|--------|--------|------------|
| `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` |
| `k_get_rows_float<float,float>` | `11.2%` | `1.094s` |
| `gated_delta_net_chunked_cuda` | `8.6%` | `0.838s` |
| `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` |
| residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` |
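As a quick reading aid for the two tables, the sketch below derives the main-kernel time ratio and the total profile time each row implies (`time / share`). The implied totals are approximate, because the percentages are rounded to one decimal place. The names are illustrative.

```python
# bucket -> (time %, seconds), copied from the npp=512 tables above.
DEFAULT = {
    "mul_mat_q<nvfp4,128>": (39.2, 2.712),
    "gated_delta_net_chunked_cuda": (12.2, 0.843),
    "quantize_mmq_nvfp4": (4.5, 0.314),
}
FORCED = {
    "w4a16_grouped_kernel<32,128,1,4,2>": (42.5, 4.142),
    "k_get_rows_float<float,float>": (11.2, 1.094),
    "gated_delta_net_chunked_cuda": (8.6, 0.838),
    "w4a16_cast_act_f32_bf16": (5.3, 0.517),
    "quantize_mmq_nvfp4": (1.4, 0.132),
}


def implied_total(pct: float, seconds: float) -> float:
    """Total profiled time implied by one bucket's share and duration."""
    return seconds / (pct / 100.0)


# Ratio of the two dominant GEMM buckets.
main_ratio = (
    FORCED["w4a16_grouped_kernel<32,128,1,4,2>"][1]
    / DEFAULT["mul_mat_q<nvfp4,128>"][1]
)
print(f"grouped W4A16 kernel vs mul_mat_q: {main_ratio:.2f}x")

for name, table in (("default", DEFAULT), ("forced", FORCED)):
    top_pct, top_sec = next(iter(table.values()))
    print(f"{name}: implied profile total ~{implied_total(top_pct, top_sec):.2f}s")
```

The GDN bucket takes nearly the same absolute time in both profiles (`0.843s` versus `0.838s`), so its smaller share under forced W4A16 reflects the larger total rather than a faster kernel.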
## Decision
Reject another small W4A16 body/metadata/cast tweak as the next parity phase.
The current W4A16 path avoids most activation quantization, but the grouped
kernel is still `1.53x` slower than default MMQ's main `mul_mat_q` bucket at
`npp=512` (`4.142s` versus `2.712s`) and sorted activation gathers add another
`1.094s`. Eliminating the cast kernel entirely would recover only `5.3%` of the
forced-W4A16 profile and would not close the `37-39%` end-to-end S_PP loss.
Next W4A16 work would need a larger redesign that both improves the grouped
kernel body and removes or fuses the sorted activation gather. That is outside
the low-conflict incremental patch track. For near-term parity work, return to
the broader prefill/GDN/MoE design track or a hardware-pivot benchmark rather
than another W4A16 micro-patch.
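The decision can be sanity-checked with a rough upper bound. The sketch below assumes a bucket's profile share maps one-to-one onto prefill wall time, so removing a bucket entirely scales S_PP by `1 / (1 - share)`. That is an optimistic simplification, not a measurement, and the function name is hypothetical.

```python
DEFAULT_S_PP = 2327.69  # default FP4-MMQ, npp=512
FORCED_S_PP = 1451.00   # forced W4A16, npp=512

CAST_PCT = 5.3     # w4a16_cast_act_f32_bf16 share of the forced profile
GATHER_PCT = 11.2  # k_get_rows_float share of the forced profile


def s_pp_if_removed(s_pp: float, removed_pct: float) -> float:
    """Best-case S_PP if `removed_pct` of the profiled time vanished."""
    return s_pp / (1.0 - removed_pct / 100.0)


# Share of forced-path time that must disappear just to match default S_PP.
required_pct = 100.0 * (1.0 - FORCED_S_PP / DEFAULT_S_PP)

cast_only = s_pp_if_removed(FORCED_S_PP, CAST_PCT) / DEFAULT_S_PP
cast_and_gather = s_pp_if_removed(FORCED_S_PP, CAST_PCT + GATHER_PCT) / DEFAULT_S_PP

print(f"time reduction needed for parity: {required_pct:.1f}%")
print(f"cast removed:          {cast_only:.3f}x default")
print(f"cast + gather removed: {cast_and_gather:.3f}x default")
```

Even under this best case, removing both the cast and the gather leaves the forced path well below default, which is consistent with the conclusion that the grouped kernel body itself has to change.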