mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): profile current W4A16 prefill
Assisted-by: Codex:gpt-5
@@ -3352,3 +3352,55 @@ Decision:
  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet.
- Treat this as latency tuning, not the next parity track. The larger gap is
  still prefill / MoE compute.

## Phase 60 Current W4A16 Prefill Profile

Phase 60 re-profiles the current clean W4A16 grouped MoE prefill path after the
Phase1-5 W4A16 work, to decide whether another low-conflict W4A16 patch is
justified.

Artifact:

- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`

Pre/post gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

MoE `llama-batched-bench`, `npl=32`, `ntg=4`:

| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
|------|----|----------|--------|----------|-------------|
| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` |
| default FP4-MMQ | `2048` | `2423.20` | `27.045` | `391.58` | `2398.94` |
| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` |
| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` |

Forced W4A16 remains `0.623x` default FP4-MMQ at `npp=512` and `0.612x` at
`npp=2048`.

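The ratios above can be re-derived from the table rows. The sketch below is illustrative only: it copies the four rows by hand and assumes `S_PP = npp * npl / T_PP`, which is consistent with the reported numbers but is not taken from the benchmark source. The names `rows`, `spp_ratio`, and `implied_t_pp` are hypothetical helpers, not part of any tool.

```python
# Rows copied by hand from the llama-batched-bench table above (npl=32, ntg=4).
NPL = 32

rows = {
    ("default FP4-MMQ", 512): {"s_pp": 2327.69, "t_pp": 7.039},
    ("default FP4-MMQ", 2048): {"s_pp": 2423.20, "t_pp": 27.045},
    ("forced W4A16", 512): {"s_pp": 1451.00, "t_pp": 11.291},
    ("forced W4A16", 2048): {"s_pp": 1482.76, "t_pp": 44.199},
}


def spp_ratio(npp):
    """Forced W4A16 prefill throughput as a fraction of default FP4-MMQ."""
    return rows[("forced W4A16", npp)]["s_pp"] / rows[("default FP4-MMQ", npp)]["s_pp"]


def implied_t_pp(path, npp):
    """Prefill time implied by S_PP, assuming S_PP = npp * npl / T_PP."""
    return npp * NPL / rows[(path, npp)]["s_pp"]


for npp in (512, 2048):
    ratio = spp_ratio(npp)
    print(f"npp={npp}: W4A16/default = {ratio:.3f}x ({(ratio - 1) * 100:+.1f}% S_PP)")

for (path, npp), row in rows.items():
    print(f"{path} npp={npp}: implied T_PP {implied_t_pp(path, npp):.3f}s, "
          f"reported {row['t_pp']:.3f}s")
```

The implied prefill times agree with the reported `T_PP` column to within rounding, which is a quick sanity check that the rows were transcribed consistently.
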
`npp=512` profile:

| path | top bucket | time % | total time |
|------|------------|--------|------------|
| default FP4-MMQ | `mul_mat_q<nvfp4,128>` | `39.2%` | `2.712s` |
| default FP4-MMQ | `quantize_mmq_nvfp4` | `4.5%` | `0.314s` |
| forced W4A16 | `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` |
| forced W4A16 | `k_get_rows_float<float,float>` | `11.2%` | `1.094s` |
| forced W4A16 | `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` |
| forced W4A16 | residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` |

Decision:

- Reject another small W4A16 body/metadata/cast tweak as the next parity phase.
  The current forced path avoids most activation quantization, but the grouped
  W4A16 kernel itself takes `1.53x` as long as default MMQ's main `mul_mat_q`
  bucket at `npp=512` (`4.142s` versus `2.712s`), and sorted activation gathers
  add another `1.094s`.
- Eliminating the cast kernel entirely would recover at most `5.3%` of the
  forced W4A16 profile time, far short of the `37-39%` end-to-end S_PP loss.
- Any future W4A16 parity work must be a larger redesign that improves the
  grouped kernel body and removes or fuses the sorted activation gather. Do not
  reopen the low-conflict micro-patch track.

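The second bullet is an Amdahl-style argument, and it can be made concrete. The sketch below is a rough bound, not a measurement: it assumes the profiled kernel-time shares map directly onto prefill wall time and that a removed kernel costs nothing afterwards. `best_case_speedup` is a hypothetical helper name.

```python
# Shares of the forced-W4A16 npp=512 profile, copied from the table above.
CAST_FRACTION = 0.053    # w4a16_cast_act_f32_bf16
GATHER_FRACTION = 0.112  # k_get_rows_float<float,float>

# Forced W4A16 S_PP divided by default FP4-MMQ S_PP at npp=512.
SPP_RATIO_512 = 1451.00 / 2327.69


def best_case_speedup(removed_fraction):
    """Upper bound on speedup if `removed_fraction` of the time became free."""
    return 1.0 / (1.0 - removed_fraction)


required = 1.0 / SPP_RATIO_512  # speedup needed to match default S_PP
cast_only = best_case_speedup(CAST_FRACTION)
cast_and_gather = best_case_speedup(CAST_FRACTION + GATHER_FRACTION)

print(f"required for parity:       {required:.2f}x")
print(f"cast removed entirely:     {cast_only:.2f}x at best")
print(f"cast and gather removed:   {cast_and_gather:.2f}x at best")
```

Under these assumptions, even removing both the cast and the gather for free stays well below the speedup needed for parity, which is why the remaining gap has to come from the grouped kernel body.
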
@@ -717,6 +717,22 @@ knob. It does not prove vLLM parity progress by itself. Do not default it until
more workload coverage exists, and do not regenerate LocalAI patches until the
fork commits are pushed with explicit approval.

Phase 60 re-profiled the current W4A16 grouped MoE prefill path to check whether
there was still a low-conflict W4A16 shortcut after Phase1-5. Artifact:
`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`. Pre/post
gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Default FP4-MMQ S_PP was `2327.69` at `npp=512` and `2423.20` at
`npp=2048`; forced W4A16 was `1451.00` and `1482.76`, only `0.623x` and
`0.612x` of default. The `npp=512` profile showed W4A16 still dominated by
`w4a16_grouped_kernel` (`4.142s`, `42.5%`) plus sorted activation gathers
(`1.094s`, `11.2%`), while the cast kernel was only `0.517s` (`5.3%`).

Decision: do not add another small W4A16 metadata/body/cast patch. Future W4A16
work needs a larger redesign that improves the grouped kernel body and removes
or fuses sorted activation movement. Near-term GB10 parity work should return to
broader prefill/GDN/MoE design or hardware-pivot benchmarking.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -1488,6 +1488,36 @@ lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
scheduler knob opt-in and return parity work to the prefill / MoE compute gap.

### Phase 60 current W4A16 prefill profile

Phase 60 re-profiled the current W4A16 grouped MoE prefill path after the
Phase1-5 W4A16 work. Artifact:
`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`.

Pre/post md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.

MoE prefill A/B (`npl=32`, `ntg=4`) still rejects W4A16 as an incremental
parity path:

| path | npp512 S_PP | npp2048 S_PP |
|------|-------------|--------------|
| default FP4-MMQ | `2327.69` | `2423.20` |
| forced W4A16 | `1451.00` | `1482.76` |

At `npp=512`, default MMQ spends `2.712s` (`39.2%`) in its main
`mul_mat_q<nvfp4,128>` bucket. Forced W4A16 spends `4.142s` (`42.5%`) in
`w4a16_grouped_kernel<32,128,1,4,2>`, plus `1.094s` (`11.2%`) in
`k_get_rows_float<float,float>` sorted activation gathers and `0.517s` (`5.3%`)
in `w4a16_cast_act_f32_bf16`.

Decision: do not add another W4A16 micro-patch. Cast elimination alone cannot
close a `37-39%` S_PP loss, and the dominant loss is the grouped kernel body
plus sorted activation movement. Future W4A16 parity work must be a larger
design that changes those structures, not another metadata/body shortcut.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update

@@ -0,0 +1,81 @@
# Phase 60: Current W4A16 Prefill Profile

## Goal

Re-profile the current clean W4A16 grouped MoE prefill path after the Phase1-5
W4A16 work, then decide whether another low-conflict W4A16 patch is justified.

## Artifact

- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`

## Source State

- DGX mirror: `~/llama-phase6-source`
- Branch: `localai-paged`
- Commit: `2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`

## Gates

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

DGX cleanup:

- Docker containers: `0`
- GPU compute apps: `0`
- Lock released: `FREE phase60-cleanup 20260701T105438Z`

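A pre/post gate comparison like the one above is easy to automate. The following is a minimal sketch, assuming "green" means that the pre and post rows are identical and that every op test counter reads `passed/total` with both numbers equal. `GateResult`, `all_passed`, and `gates_green` are hypothetical names and do not correspond to any script in the artifact directory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """One row of the gate table (hypothetical record layout)."""
    moe_md5: str
    dense_md5: str
    mul_mat: str      # "passed/total", e.g. "1146/1146"
    mul_mat_id: str   # "passed/total", e.g. "806/806"


def all_passed(counter):
    """True when a 'passed/total' counter reports every test passing."""
    passed, total = (int(part) for part in counter.split("/"))
    return total > 0 and passed == total


def gates_green(pre, post):
    """Assumed rule: rows must match exactly and all op tests must pass."""
    counters = (pre.mul_mat, pre.mul_mat_id, post.mul_mat, post.mul_mat_id)
    return pre == post and all(all_passed(c) for c in counters)


pre = GateResult(
    moe_md5="8cb0ce23777bf55f92f63d0292c756b0",
    dense_md5="5951a5b4d624ce891e22ab5fca9bc439",
    mul_mat="1146/1146",
    mul_mat_id="806/806",
)
post = GateResult(
    moe_md5="8cb0ce23777bf55f92f63d0292c756b0",
    dense_md5="5951a5b4d624ce891e22ab5fca9bc439",
    mul_mat="1146/1146",
    mul_mat_id="806/806",
)

print("gates green" if gates_green(pre, post) else "gates RED")
```

Comparing whole records rather than individual fields means a drift in any single column fails the gate.
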
## End-to-End A/B

MoE `llama-batched-bench`, `npl=32`, `ntg=4`, `npp=512,2048`:

| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
|------|----|----------|--------|----------|-------------|
| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` |
| default FP4-MMQ | `2048` | `2423.20` | `27.045` | `391.58` | `2398.94` |
| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` |
| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` |

Forced W4A16 remains:

- `0.623x` default FP4-MMQ at `npp=512` (`-37.7%` S_PP).
- `0.612x` default FP4-MMQ at `npp=2048` (`-38.8%` S_PP).

## `npp=512` Kernel Summary

Default FP4-MMQ top rows:

| bucket | time % | total time |
|--------|--------|------------|
| `mul_mat_q<nvfp4,128>` | `39.2%` | `2.712s` |
| `gated_delta_net_chunked_cuda` | `12.2%` | `0.843s` |
| `quantize_mmq_nvfp4` | `4.5%` | `0.314s` |

Forced W4A16 top rows:

| bucket | time % | total time |
|--------|--------|------------|
| `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` |
| `k_get_rows_float<float,float>` | `11.2%` | `1.094s` |
| `gated_delta_net_chunked_cuda` | `8.6%` | `0.838s` |
| `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` |
| residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` |

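The two tables can be combined into a rough breakdown of where the extra forced-W4A16 kernel time goes. This is a back-of-the-envelope sketch: the profile totals are inferred by dividing the top bucket's time by its reported share, so they carry rounding error, and only the buckets listed above are compared. All variable names are illustrative.

```python
# Totals inferred from the top bucket of each table (assumption: time / share).
default_total = 2.712 / 0.392  # mul_mat_q<nvfp4,128> at 39.2%
w4a16_total = 4.142 / 0.425    # w4a16_grouped_kernel at 42.5%
gap = w4a16_total - default_total

# Per-bucket differences in seconds (forced W4A16 minus default FP4-MMQ).
contributions = {
    "grouped kernel vs mul_mat_q": 4.142 - 2.712,
    "sorted activation gather": 1.094,
    "activation cast": 0.517,
    "activation quantization": 0.132 - 0.314,
    "gated_delta_net_chunked_cuda": 0.838 - 0.843,
}
explained = sum(contributions.values())
kernel_ratio = 4.142 / 2.712

print(f"inferred totals: default {default_total:.2f}s, W4A16 {w4a16_total:.2f}s")
for name, seconds in contributions.items():
    print(f"{name:30s} {seconds:+.3f}s ({seconds / gap:+.0%} of gap)")
print(f"listed buckets explain {explained:.2f}s of a {gap:.2f}s gap")
print(f"grouped kernel / mul_mat_q = {kernel_ratio:.2f}x")
```

On these numbers the grouped kernel body and the sorted gather account for most of the gap, while the activation-quantization savings offset only a small part of it.
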
## Decision

Reject another small W4A16 body/metadata/cast tweak as the next parity phase.

The current W4A16 path avoids most activation quantization, but the grouped
kernel still takes `1.53x` as long as default MMQ's main `mul_mat_q` bucket at
`npp=512` (`4.142s` versus `2.712s`), and sorted activation gathers add another
`1.094s`. Eliminating the cast kernel entirely would recover at most `5.3%` of
the forced-W4A16 profile time and would not close the `37-39%` end-to-end S_PP
loss.

Next W4A16 work would need a larger redesign that both improves the grouped
kernel body and removes or fuses the sorted activation gather. That is outside
the low-conflict incremental patch track. For near-term parity work, return to
the broader prefill/GDN/MoE design track or a hardware-pivot benchmark rather
than another W4A16 micro-patch.