mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): close ragged MoE dispatch shortcut
Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work.

Assisted-by: Codex:gpt-5
@@ -1085,3 +1085,31 @@ Conclusion:
 - Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining
 vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware
 assumptions that do not fit this GB10 patch stack without a regression.
+
+## Phase 8 Ragged MoE Dispatch Safety Rerun
+
+Phase 8 had already closed the live ragged MoE helper path by profile:
+`mm_ids=0.66%`, `gather_mmq=0.42%`, while `mmq_nvfp4=22.36%` and
+`act_quant=3.60%`. The only source patch kept from the phase is the test gate
+(`0053-test-paged-cover-ragged-MoE-dispatch.patch`); the metadata-only
+`LLAMA_MOE_FUSED_DISPATCH` shortcut is rejected.
+
+Rerun artifacts:
+
+- `/home/mudler/bench/phase8_ragged_moe_dispatch/ragged_gate_rerun_20260701_035529.txt`
+- `/home/mudler/bench/phase8_ragged_moe_dispatch/safety_rerun_20260701_035549/`
+
+Safety result:
+
+- `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Conclusion:
+
+- The inferencing gates remain canonical on the unchanged production path.
+- Do not add a metadata/helper-only fused-dispatch hook. A future Phase 8
+production candidate must reduce `mmq_nvfp4` or activation movement directly,
+stay free of D2H id readback and new stream synchronizations, and then pass
+the same md5/op gates before any serving A/B is considered.
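The transcript md5 gate recorded in the safety result above can be sketched as a small check. This is a minimal illustration, not the project's actual gate script: the function names `transcript_md5` and `md5_gate` are hypothetical, and it assumes each transcript is a single file, which the commit does not state. Only the two digests come from the recorded results.

```python
import hashlib
from pathlib import Path

# Canonical digests as recorded in the safety rerun.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def transcript_md5(path: Path) -> str:
    """Return the hex md5 of a transcript file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_gate(kind: str, path: Path) -> bool:
    """True only if the transcript digest equals the canonical value.

    `kind` is "moe" or "dense" (hypothetical labels for the two transcripts).
    """
    return transcript_md5(path) == CANONICAL_MD5[kind]
```

Any byte-level change in the transcript flips the gate to `False`, which is why the gate is useful for detecting output drift on a path that is supposed to be unchanged.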
@@ -579,6 +579,29 @@ Artifacts:
 - `/home/mudler/bench/phase13_gdn_global_ai32/ab/`
 - `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`
 
+### Phase 8 ragged MoE dispatch closure
+
+The remaining Phase 8 source shortcut was closed without production CUDA edits.
+The live ragged serving profile showed helper metadata buckets too small to clear
+the `+5%` serving A/B gate (`mm_ids=0.66%`, `gather_mmq=0.42%`). Patch `0023`
+already handles the broadcast-activation NVFP4 path by quantizing unique tokens
+once and gathering FP4 blocks, so a metadata-only `LLAMA_MOE_FUSED_DISPATCH`
+hook would add conflict surface without attacking the dominant buckets.
+
+Safety rerun:
+
+- `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+
+Decision:
+
+- Keep test patch `0053`.
+- Do not add a Phase 8 production patch unless it directly reduces
+`mmq_nvfp4` or activation movement without D2H id readback, new
+synchronizations, or md5 drift.
+
 ---
 
 # PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026)
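The claim that the helper buckets are too small to clear the `+5%` gate can be checked with an Amdahl-style upper bound. The sketch below assumes the profile percentages are shares of total serving time and that throughput scales inversely with that time; neither assumption is stated in the commit, and `max_speedup_pct` is a hypothetical helper.

```python
def max_speedup_pct(removed_share_pct: float) -> float:
    """Upper bound on throughput gain (in percent) if a bucket's time fell to zero.

    Assumes the bucket share is a fraction of total time and nothing else changes.
    """
    remaining = 1.0 - removed_share_pct / 100.0
    return (1.0 / remaining - 1.0) * 100.0


GATE_PCT = 5.0                      # the +5% serving A/B gate
helper_share = 0.66 + 0.42          # mm_ids + gather_mmq, percent of profile

helper_bound = max_speedup_pct(helper_share)
mmq_bound = max_speedup_pct(22.36)  # mmq_nvfp4 share from the same profile

print(f"helpers: at most +{helper_bound:.2f}% (gate +{GATE_PCT:.0f}%)")
print(f"mmq_nvfp4: at most +{mmq_bound:.2f}% (gate +{GATE_PCT:.0f}%)")
```

Even with both helper buckets removed entirely, the bound stays close to 1%, well under the gate, while the bound for `mmq_nvfp4` sits far above it. Under these assumptions, only work on the dominant buckets can plausibly clear the gate.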
@@ -253,7 +253,7 @@ Selected Phase 8 candidate:
 - GDN remains the single largest bucket, so any Phase 8 source patch still
 must clear the `+5%` serving A/B gate before being kept.
 
-- [ ] **Step 6: Commit the profile decision**
+- [x] **Step 6: Commit the profile decision**
 
 If promoted:
 
@@ -275,6 +275,15 @@ Selected Phase 8 candidate:
 -m "Assisted-by: Codex:gpt-5"
 ```
 
+Result:
+
+- Committed the profile decision as `89ef3a402`
+(`docs(paged): record ragged MoE profile gate`).
+- The follow-up test gate landed as fork commit `e21732fc4` and LocalAI
+mirror commit `b009de0ee`.
+- The source shortcut rejection landed as `b862e2c56`
+(`docs(paged): stop ragged dispatch source shortcut`).
+
 ## Task 2: Add Ragged `MUL_MAT_ID` Test Gate If Promoted
 
 **Files:**
@@ -349,8 +358,8 @@ Selected Phase 8 candidate:
 - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh`
 - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`
 
-**Status:** Not started. The profile and code inspection do not justify a
-metadata/helper-only prototype.
+**Status:** Rejected before production CUDA edits. The profile and code
+inspection do not justify a metadata/helper-only prototype.
 
 Inspection result:
 
@@ -372,7 +381,11 @@ Stop condition:
 reducing `mmq_nvfp4` or `act_quant` time without D2H id readback, new stream
 synchronizations, or md5 drift.
 
-- [ ] **Step 1: Add env-gated entry point**
+- [x] **Step 1: Add env-gated entry point**
 
+Decision: not implemented. Adding a default-off env hook without a concrete
+`mmq_nvfp4` or activation-movement reduction would add patch-stack conflict
+surface while preserving the same slow path.
+
 Add a default-off env gate:
 
@@ -389,7 +402,12 @@ Stop condition:
 The default path must remain byte-identical and use the existing
 `ggml_cuda_mul_mat_id` implementation.
 
-- [ ] **Step 2: Add the smallest measurable fused metadata path**
+- [x] **Step 2: Add the smallest measurable fused metadata path**
 
+Decision: not implemented. The live profile puts the metadata helpers below
+the `+5%` serving A/B threshold (`mm_ids=0.66%`, `gather_mmq=0.42%`), and
+patch `0023` already avoids repeated activation quantization for the
+broadcast-activation NVFP4 MoE case.
+
 Start by replacing repeated host/device metadata setup only when all are true:
 
@@ -401,7 +419,17 @@ Stop condition:
 
 If this cannot be done without syncs, stop and reject the prototype.
 
-- [ ] **Step 3: Run gates**
+- [x] **Step 3: Run gates**
 
+Rerun result from the unchanged production path:
+
+- Artifact:
+`/home/mudler/bench/phase8_ragged_moe_dispatch/safety_rerun_20260701_035549/`
+- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
+- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Full `MUL_MAT_ID`: `806/806` on CUDA0.
+- Specific `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0, rerun artifact
+`/home/mudler/bench/phase8_ragged_moe_dispatch/ragged_gate_rerun_20260701_035529.txt`.
+
 Run on DGX:
 
@@ -421,7 +449,12 @@ Stop condition:
 
 Expected MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`.
 
-- [ ] **Step 4: Run serving A/B**
+- [x] **Step 4: Run serving A/B**
 
+Decision: not run because no production CUDA candidate was added. The existing
+profile already rejects metadata-only work: the helper buckets are too small,
+and a valid source candidate must attack `mmq_nvfp4` or `act_quant` directly
+before it earns a serving A/B run.
+
 Compare:
 
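The conditions a future candidate must meet before a serving A/B run, as listed in the decisions above, can be written as one predicate. This is an illustrative sketch only: the `Candidate` type, its field names, and `earns_serving_ab` are hypothetical and are not part of the patch stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of a Phase 8 production candidate."""
    reduces_dominant_bucket: bool   # cuts mmq_nvfp4 / act_quant or activation movement
    adds_d2h_id_readback: bool      # introduces device-to-host id readback
    adds_stream_sync: bool          # introduces new stream synchronizations
    md5_gates_match: bool           # MoE and dense transcript md5 unchanged
    op_gates_pass: bool             # full and ragged MUL_MAT_ID op gates all pass


def earns_serving_ab(c: Candidate) -> bool:
    """True only when every documented precondition holds."""
    return (
        c.reduces_dominant_bucket
        and not c.adds_d2h_id_readback
        and not c.adds_stream_sync
        and c.md5_gates_match
        and c.op_gates_pass
    )


# A metadata-only hook keeps the gates green but does not touch the dominant
# buckets, so it never reaches the serving A/B stage.
metadata_only_hook = Candidate(False, False, False, True, True)
print(earns_serving_ab(metadata_only_hook))
```

Ordering the checks this way mirrors the text: the cheap structural questions come first, and the serving A/B run is reserved for candidates that already pass them.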