docs(paged): compare MoE min32 against vLLM

Assisted-by: Codex:gpt-5
2026-07-03 04:46:54 -04:00 · 2026-07-01 10:46:32 +00:00
parent c41d1a5b4f
commit ef7dbfa5f7
4 changed files with 188 additions and 5 deletions
--- a/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
+++ b/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
@@ -0,0 +1,75 @@
+# Phase 59: MoE Min32 Repeat and vLLM H2H
+
+## Goal
+
+Repeat the Phase58 MoE `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` result in a
+fresh DGX window, then compare against a matching vLLM `n=128`, `ptok=128`,
+`gen=64` serving run.
+
+## Patch Under Test
+
+The temporary DGX patch stack was generated from the local llama.cpp fork
+through:
+
+- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
+
+The patch was applied to the clean DGX mirror for llama.cpp runs, then reverted
+before the vLLM run.
+
+## Verification
+
+Pre and post llama gates stayed green:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+## Results
+
+Artifact:
+
+- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
+
+Min32 repeat delta versus llama default:
+
+- Aggregate throughput: `+0.1%`
+- Mean TTFT: `-8.1%`
+- Max TTFT: `-2.7%`
+- Wall time: `-0.1%`
+- Prefill throughput: `+2.8%`
+- Decode aggregate throughput: `-2.3%`
+
+Llama min32 versus vLLM:
+
+- Aggregate throughput ratio: `0.560`
+- Mean TTFT: llama is `2.415x` slower
+- Wall time: llama is `1.793x` slower
+- Prefill throughput ratio: `0.430`
+- Decode aggregate throughput ratio: `0.673`
+
+## Decision
+
+`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` repeated as a real, inference-gated
+llama.cpp scheduler QoS improvement for MoE `n=128`: it cuts mean TTFT without
+moving aggregate throughput or wall time materially.
+
+It is not a vLLM parity lever by itself. vLLM remains far ahead on the same
+serving shape, especially prefill and TTFT. Keep the scheduler path opt-in and
+treat it as user-visible latency tuning while parity work returns to the larger
+prefill / MoE compute gap.
+
+## Status
+
+- Phase59 docs recorded.
+- DGX lock released as `FREE phase59-cleanup`.
+- No push performed.
+- LocalAI `patches/paged/` not regenerated.