mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): record W4A16 direct activation rejection
Assisted-by: Codex:gpt-5
This commit is contained in:
@@ -3404,3 +3404,35 @@ Decision:
|
||||
- Any future W4A16 parity work must be a larger redesign that improves the
|
||||
grouped kernel body and removes or fuses the sorted activation gather. Do not
|
||||
reopen the low-conflict micro-patch track.
|
||||
|
||||
## W4A16 Direct-Activation Phase61 Result
|
||||
|
||||
Phase61 tested the larger W4A16 direct-activation redesign. It passed default
|
||||
inference gates and opt-in direct-A correctness:
|
||||
|
||||
- Default gates artifact:
|
||||
`/home/mudler/bench/phase61_direct_default_gates/20260701_132057`
|
||||
- A/B artifact: `/home/mudler/bench/phase61_direct_ab/20260701_132237`
|
||||
- Default MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
|
||||
- Default dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
|
||||
- `MUL_MAT`: `1146/1146`
|
||||
- `MUL_MAT_ID`: `806/806`
|
||||
- Forced W4A16 and direct-A MoE md5:
|
||||
`07db32c2bcb78d17a43ed18bc22705cd`
|
||||
|
||||
The direct path had to mirror `get_rows_cuda` flat-row source addressing. A
|
||||
token/slot decode of `ids_to_sorted` failed `b=1` NVFP4 op cases; flat
|
||||
`src_row*nb11` addressing fixed the gate.
|
||||
|
||||
MoE prefill A/B (`npl=32`, `ntg=4`):
|
||||
|
||||
| path | npp512 S_PP | npp2048 S_PP |
|
||||
|------|-------------|--------------|
|
||||
| default FP4-MMQ | `2325.45` | `2423.18` |
|
||||
| forced W4A16 | `1471.05` | `1502.46` |
|
||||
| forced W4A16 direct-A | `1566.30` | `1605.82` |
|
||||
|
||||
Decision: reject. Direct-A improved forced W4A16 by only `+6.5%` and `+6.9%`,
|
||||
and still reached only `0.67x` / `0.66x` of default FP4-MMQ. The rejected direct
|
||||
kernel diff was saved to `/tmp/phase61-w4a16-direct-a-rejected.diff` and not
|
||||
committed. Do not continue W4A16 body tuning on GB10 as the next parity lever.
|
||||
|
||||
@@ -742,6 +742,16 @@ grouped-kernel body rewrite. Keep it only if it improves forced W4A16 S_PP by at
|
||||
least `+12%` and reaches at least `0.75x` default FP4-MMQ; otherwise reject and
|
||||
do not continue W4A16 body tuning.
|
||||
|
||||
Phase61 result: rejected. The direct-A kernel passed correctness after matching
|
||||
`get_rows_cuda` flat-row addressing (`MUL_MAT_ID` `806/806`; forced/direct-A
|
||||
MoE transcript md5 both `07db32c2bcb78d17a43ed18bc22705cd`) and default gates
|
||||
remained green (`8cb0ce23`, `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
|
||||
`806/806`). But direct-A only improved forced W4A16 S_PP `1471.05 -> 1566.30`
|
||||
at `npp=512` and `1502.46 -> 1605.82` at `npp=2048` (`+6.5%` / `+6.9%`), still
|
||||
just `0.67x` / `0.66x` of default FP4-MMQ. The direct kernel diff was not
|
||||
committed; only the safe policy/routing stub remains in the fork. Do not pursue
|
||||
more W4A16 body tuning on GB10 as the next parity lever.
|
||||
|
||||
---
|
||||
|
||||
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
|
||||
|
||||
@@ -1518,6 +1518,43 @@ close a `37-39%` S_PP loss, and the dominant loss is the grouped kernel body
|
||||
plus sorted activation movement. Future W4A16 parity work must be a larger
|
||||
design that changes those structures, not another metadata/body shortcut.
|
||||
|
||||
### Phase 61 W4A16 direct activation kill-gate
|
||||
|
||||
Phase61 implemented the larger direct-activation experiment behind
|
||||
`LLAMA_W4A16_DIRECT_A=1`, consuming original `src1` and `ids_to_sorted` directly
|
||||
instead of materializing `src1_sorted` and then casting it to bf16. The correct
|
||||
source addressing matched `get_rows_cuda`: `ids_to_sorted` is a flat source-row
|
||||
index addressed with `nb11`. The initial token/slot decode failed `b=1` op
|
||||
tests; the flat-row fix passed forced direct-A `MUL_MAT_ID` `806/806`.
|
||||
|
||||
Artifacts:
|
||||
|
||||
- default gates: `/home/mudler/bench/phase61_direct_default_gates/20260701_132057`
|
||||
- A/B: `/home/mudler/bench/phase61_direct_ab/20260701_132237`
|
||||
|
||||
Gates:
|
||||
|
||||
- default MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
|
||||
- default dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
|
||||
- `MUL_MAT` `1146/1146`
|
||||
- `MUL_MAT_ID` `806/806`
|
||||
- forced W4A16 and direct-A MoE transcripts both
|
||||
`07db32c2bcb78d17a43ed18bc22705cd`
|
||||
|
||||
MoE prefill A/B (`npl=32`, `ntg=4`):
|
||||
|
||||
| path | npp512 S_PP | npp2048 S_PP |
|
||||
|------|-------------|--------------|
|
||||
| default FP4-MMQ | `2325.45` | `2423.18` |
|
||||
| forced W4A16 | `1471.05` | `1502.46` |
|
||||
| forced W4A16 direct-A | `1566.30` | `1605.82` |
|
||||
|
||||
Decision: reject. Direct-A improved forced W4A16 by only `+6.5%` / `+6.9%` and
|
||||
remained `0.67x` / `0.66x` of default FP4-MMQ, below the `+12%` and `0.75x`
|
||||
keep gates. The direct kernel diff was saved to
|
||||
`/tmp/phase61-w4a16-direct-a-rejected.diff` and not committed. W4A16 body
|
||||
tuning is no longer the next GB10 parity lever.
|
||||
|
||||
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
|
||||
|
||||
### Phase 10 GDN C32 slab update
|
||||
|
||||
@@ -0,0 +1,61 @@
|
||||
# W4A16 Direct-Activation Phase61 Result
|
||||
|
||||
Verdict: rejected.
|
||||
|
||||
The default-off direct-A kernel was implemented and gated, but it failed the
|
||||
performance keep gate. The rejected local diff was saved at:
|
||||
|
||||
- `/tmp/phase61-w4a16-direct-a-rejected.diff`
|
||||
|
||||
The llama.cpp fork keeps only the safe routing stub:
|
||||
|
||||
- `41be3da5b test(cuda): cover W4A16 direct activation policy`
|
||||
- `7967ad47f feat(cuda): route W4A16 direct activation stub`
|
||||
|
||||
## Correctness
|
||||
|
||||
Default inference gates:
|
||||
|
||||
- Artifact: `/home/mudler/bench/phase61_direct_default_gates/20260701_132057`
|
||||
- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
|
||||
- dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
|
||||
- `MUL_MAT`: `1146/1146`
|
||||
- `MUL_MAT_ID`: `806/806`
|
||||
|
||||
Forced direct-A op gate:
|
||||
|
||||
- Initial direct kernel: `794/806`, failed only `b=1` NVFP4 cases.
|
||||
- Root cause: `ids_to_sorted` is a flat source-row index for `get_rows_cuda`,
|
||||
not a `(token, expert-slot)` pair.
|
||||
- Fixed direct load: `src_base = src1 + src_row*nb11`.
|
||||
- Final direct gate: `806/806`.
|
||||
|
||||
Opt-in transcript check:
|
||||
|
||||
- Artifact: `/home/mudler/bench/phase61_direct_ab/20260701_132237`
|
||||
- forced W4A16 MoE md5: `07db32c2bcb78d17a43ed18bc22705cd`
|
||||
- direct-A MoE md5: `07db32c2bcb78d17a43ed18bc22705cd`
|
||||
- forced and direct-A transcripts were byte-identical.
|
||||
|
||||
## Performance
|
||||
|
||||
MoE prefill, `npl=32`, `ntg=4`:
|
||||
|
||||
| path | npp512 S_PP | npp2048 S_PP |
|
||||
|------|-------------|--------------|
|
||||
| default FP4-MMQ | `2325.45` | `2423.18` |
|
||||
| forced W4A16 | `1471.05` | `1502.46` |
|
||||
| forced W4A16 direct-A | `1566.30` | `1605.82` |
|
||||
|
||||
Direct-A improved forced W4A16 by `+6.5%` at `npp=512` and `+6.9%` at
|
||||
`npp=2048`. It reached only `0.67x` and `0.66x` of default FP4-MMQ.
|
||||
|
||||
The keep gate required at least `+12%` over forced W4A16 and at least `0.75x`
|
||||
of default FP4-MMQ. Phase61 failed both thresholds.
|
||||
|
||||
## Decision
|
||||
|
||||
Do not commit the direct-A kernel. Do not continue W4A16 body tuning as the next
|
||||
GB10 parity lever. The sorted activation gather and cast were real overhead, but
|
||||
removing them is not enough: the W4A16 grouped kernel body remains too slow
|
||||
relative to default FP4-MMQ on GB10.
|
||||
@@ -434,7 +434,7 @@ Fork commit:
|
||||
|
||||
- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu`
|
||||
|
||||
- [ ] **Step 1: Add the direct kernel variant**
|
||||
- [x] **Step 1: Add the direct kernel variant**
|
||||
|
||||
Copy `w4a16_grouped_kernel` into a new template named `w4a16_grouped_direct_a_kernel`. Change only the A-load section:
|
||||
|
||||
@@ -460,7 +460,7 @@ uint4 packed = *reinterpret_cast<const uint4 *>(tmp);
|
||||
|
||||
Keep W `cp.async` unchanged.
|
||||
|
||||
- [ ] **Step 2: Wire the direct launcher to the new kernel**
|
||||
- [x] **Step 2: Wire the direct launcher to the new kernel**
|
||||
|
||||
Replace the stub body with a launcher that mirrors `ggml_cuda_mul_mat_id_w4a16_grouped_impl`, but:
|
||||
|
||||
@@ -468,7 +468,7 @@ Replace the stub body with a launcher that mirrors `ggml_cuda_mul_mat_id_w4a16_g
|
||||
- does not call `w4a16_cast_act_f32_bf16`;
|
||||
- passes `src1`, `ids_to_sorted`, `n_expert_used`, `src1_nb1`, and `src1_nb2` into the direct kernel.
|
||||
|
||||
- [ ] **Step 3: Build**
|
||||
- [x] **Step 3: Build**
|
||||
|
||||
Run:
|
||||
|
||||
@@ -480,11 +480,29 @@ cmake --build build --target test-cuda-w4a16-policy test-backend-ops llama-batch
|
||||
|
||||
Expected: build succeeds and policy test passes.
|
||||
|
||||
Actual implementation note: the first direct kernel decoded `ids_to_sorted` as
|
||||
`token = src_row / n_expert_used` and `slot = src_row % n_expert_used`. That was
|
||||
wrong for `b=1` backend-op shapes. The existing `get_rows_cuda` call treats
|
||||
`ids_to_sorted` as a flat row index and addresses `src1 + src_row*nb11`, so the
|
||||
working direct kernel used the same flat-row addressing. The first forced
|
||||
direct-A gate failed `794/806`; the flat-row fix passed `806/806`.
|
||||
|
||||
Actual local build:
|
||||
|
||||
```bash
|
||||
cd /home/mudler/_git/llama.cpp
|
||||
git diff --check
|
||||
cmake --build build --target test-cuda-w4a16-policy -j2
|
||||
./build/bin/test-cuda-w4a16-policy
|
||||
```
|
||||
|
||||
Result: `test-cuda-w4a16-policy: OK`.
|
||||
|
||||
## Task 6: Local CUDA Correctness Gate
|
||||
|
||||
**Files:** none.
|
||||
|
||||
- [ ] **Step 1: Run forced W4A16 direct-A op gate**
|
||||
- [x] **Step 1: Run forced W4A16 direct-A op gate**
|
||||
|
||||
Run on a CUDA host:
|
||||
|
||||
@@ -495,7 +513,17 @@ LLAMA_W4A16_PREFILL_M=1 LLAMA_W4A16_DIRECT_A=1 ./build/bin/test-backend-ops test
|
||||
|
||||
Expected: `806/806 tests passed`.
|
||||
|
||||
- [ ] **Step 2: Run default op gate**
|
||||
Actual RED before implementation: abort at
|
||||
`LLAMA_W4A16_DIRECT_A selected before direct-A kernel implementation`, as
|
||||
expected.
|
||||
|
||||
Actual GREEN on DGX after flat-row fix:
|
||||
|
||||
- `LLAMA_W4A16_PREFILL_M=1 LLAMA_W4A16_DIRECT_A=1 test-backend-ops ... MUL_MAT_ID`
|
||||
- Result: `806/806 tests passed`, `Backend CUDA0: OK`.
|
||||
- Cleanup lock: `FREE phase61-direct-kernel-gate2 20260701T112013Z`.
|
||||
|
||||
- [x] **Step 2: Run default op gate**
|
||||
|
||||
Run:
|
||||
|
||||
@@ -506,11 +534,14 @@ cd /home/mudler/_git/llama.cpp
|
||||
|
||||
Expected: `806/806 tests passed`.
|
||||
|
||||
Actual default-path gate was run as part of the full default inference gate in
|
||||
Task 7: `MUL_MAT_ID` `806/806`.
|
||||
|
||||
## Task 7: DGX Inference and Performance Gate
|
||||
|
||||
**Files:** none.
|
||||
|
||||
- [ ] **Step 1: Preflight DGX**
|
||||
- [x] **Step 1: Preflight DGX**
|
||||
|
||||
Run:
|
||||
|
||||
@@ -520,11 +551,17 @@ ssh dgx.casa 'echo docker=$(docker ps -q | wc -l); echo compute=$(nvidia-smi --q
|
||||
|
||||
Expected: Docker `0`, compute `0`, lock `FREE*`, and no worker/server process.
|
||||
|
||||
- [ ] **Step 2: Apply patch to clean DGX mirror and build**
|
||||
Actual: DGX checks were clean before the phase, and each run acquired/released
|
||||
`/tmp/localai-gpu.lock`.
|
||||
|
||||
- [x] **Step 2: Apply patch to clean DGX mirror and build**
|
||||
|
||||
Use the fork diff for this one patch only, apply it to `~/llama-phase6-source`, and build `build-cuda`. Do not leave the DGX mirror dirty after the phase.
|
||||
|
||||
- [ ] **Step 3: Run pre gates**
|
||||
Actual: the cumulative fork diff was applied to `~/llama-phase6-source` for each
|
||||
DGX gate and reverted by cleanup traps. The final mirror status was clean.
|
||||
|
||||
- [x] **Step 3: Run pre gates**
|
||||
|
||||
Run the canonical MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates:
|
||||
|
||||
@@ -542,7 +579,15 @@ Expected:
|
||||
- `MUL_MAT` `1146/1146`
|
||||
- `MUL_MAT_ID` `806/806`
|
||||
|
||||
- [ ] **Step 4: Run W4A16 A/B**
|
||||
Actual default inference gate artifact:
|
||||
|
||||
- `/home/mudler/bench/phase61_direct_default_gates/20260701_132057`
|
||||
- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
|
||||
- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
|
||||
- `MUL_MAT` `1146/1146`
|
||||
- `MUL_MAT_ID` `806/806`
|
||||
|
||||
- [x] **Step 4: Run W4A16 A/B**
|
||||
|
||||
Run:
|
||||
|
||||
@@ -559,10 +604,37 @@ Expected decision gate:
|
||||
- Continue deeper W4A16 body work only if direct-A reaches at least `0.75x` default FP4-MMQ S_PP.
|
||||
- Otherwise revert the code patch and record Phase61 as rejected.
|
||||
|
||||
- [ ] **Step 5: Run post gates and cleanup**
|
||||
Actual artifact:
|
||||
|
||||
- `/home/mudler/bench/phase61_direct_ab/20260701_132237`
|
||||
|
||||
Opt-in transcript md5:
|
||||
|
||||
- forced W4A16 MoE md5 `07db32c2bcb78d17a43ed18bc22705cd`
|
||||
- direct-A MoE md5 `07db32c2bcb78d17a43ed18bc22705cd`
|
||||
- the two transcripts were byte-identical.
|
||||
|
||||
MoE prefill A/B:
|
||||
|
||||
| path | npp512 S_PP | npp2048 S_PP |
|
||||
|------|-------------|--------------|
|
||||
| default FP4-MMQ | `2325.45` | `2423.18` |
|
||||
| forced W4A16 | `1471.05` | `1502.46` |
|
||||
| forced W4A16 direct-A | `1566.30` | `1605.82` |
|
||||
|
||||
Direct-A improved forced W4A16 by only `+6.5%` at `npp=512` and `+6.9%` at
|
||||
`npp=2048`, and reached only `0.67x` / `0.66x` of default FP4-MMQ. This fails
|
||||
both keep gates. Verdict: reject the direct-A kernel implementation and do not
|
||||
continue W4A16 body tuning on GB10 as the next parity lever.
|
||||
|
||||
- [x] **Step 5: Run post gates and cleanup**
|
||||
|
||||
Run the same md5/op gates as Step 3, revert the temporary DGX patch, confirm `git status --short` is clean, and release `/tmp/localai-gpu.lock` as `FREE phase61-cleanup ...`.
|
||||
|
||||
Actual: the final default gates above ran on the final patch before A/B, the A/B
|
||||
run reverted the temporary DGX patch, and `/tmp/localai-gpu.lock` was released as
|
||||
`FREE phase61-direct-ab 20260701T112517Z`.
|
||||
|
||||
## Task 8: Commit or Revert
|
||||
|
||||
**Files:**
|
||||
@@ -581,7 +653,7 @@ git add ggml/src/ggml-cuda/w4a16-gemm.cuh ggml/src/ggml-cuda/w4a16-gemm.cu ggml/
|
||||
git commit -m "feat(cuda): add W4A16 direct activation prefill path" -m "Assisted-by: Codex:gpt-5"
|
||||
```
|
||||
|
||||
- [ ] **Step 2: If performance gate fails, revert fork code**
|
||||
- [x] **Step 2: If performance gate fails, revert fork code**
|
||||
|
||||
Run:
|
||||
|
||||
@@ -593,7 +665,12 @@ rm -f ggml/src/ggml-cuda/w4a16-policy.h tests/test-cuda-w4a16-policy.cpp
|
||||
git status --short
|
||||
```
|
||||
|
||||
- [ ] **Step 3: Update LocalAI docs**
|
||||
Actual: saved rejected local diff to
|
||||
`/tmp/phase61-w4a16-direct-a-rejected.diff` and reverted it. The fork remains at
|
||||
committed routing-stub HEAD `7967ad47f`; the direct kernel implementation was
|
||||
not committed.
|
||||
|
||||
- [x] **Step 3: Update LocalAI docs**
|
||||
|
||||
Create `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md` with the artifact path, gate table, A/B table, and keep/reject decision. Update:
|
||||
|
||||
@@ -601,7 +678,11 @@ Create `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result
|
||||
- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
|
||||
- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
|
||||
|
||||
- [ ] **Step 4: Commit LocalAI docs**
|
||||
Actual: created the Phase61 result file and updated the three parity docs with
|
||||
the reject decision, artifacts, md5/op gates, A/B table, and direct-A
|
||||
flat-row-addressing correction.
|
||||
|
||||
- [x] **Step 4: Commit LocalAI docs**
|
||||
|
||||
Run:
|
||||
|
||||
|
||||
Reference in New Issue
Block a user