docs(paged): compare MoE min32 against vLLM

Assisted-by: Codex:gpt-5
2026-07-02 20:37:03 -04:00 · 2026-07-01 10:46:32 +00:00
parent c41d1a5b4f
commit ef7dbfa5f7
4 changed files with 188 additions and 5 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3297,3 +3297,58 @@ Decision:
  and wall regressed slightly. Do not default-on yet.
 - Next step should repeat the MoE min32 result and run the matching vLLM h2h
  comparison before treating this as real parity progress rather than run noise.
+
+## Phase 59 MoE Min32 Repeat and vLLM H2H
+
+Phase 59 repeats the Phase58 MoE min32 point and compares it to a matching vLLM
+serving run. The Phase51+Phase54+Phase55+Phase57+Phase58 stack was applied
+temporarily to the clean DGX mirror for the llama.cpp runs, then reverted before
+the vLLM run.
+
+Artifact:
+
+- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
+
+Pre/post llama gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
+
+Llama min32 repeat versus llama default:
+
+- Aggregate throughput: `+0.1%`
+- Mean TTFT: `-8.1%`
+- Max TTFT: `-2.7%`
+- Wall time: `-0.1%`
+- Prefill throughput: `+2.8%`
+- Decode aggregate throughput: `-2.3%`
+
+Llama min32 versus vLLM:
+
+- Aggregate throughput ratio: `0.560`
+- Mean TTFT: llama is `2.415x` slower
+- Wall time: llama is `1.793x` slower
+- Prefill throughput ratio: `0.430`
+- Decode aggregate throughput ratio: `0.673`
+
+Decision:
+
+- The min32 repeat confirms a real, inference-gated llama.cpp scheduler QoS
+  improvement for MoE `n=128`: mean TTFT drops `8.1%` without material
+  aggregate or wall-time loss.
+- It does not close parity with vLLM, which remains far ahead on the same
+  request shape: llama prefill is `0.430x` vLLM and mean TTFT `2.415x` slower.
+- Keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
+  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet.
+- Treat this as latency tuning, not the next parity track. The larger gap is
+  still prefill / MoE compute.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -20,7 +20,8 @@ Read order for a cold start:

 ## 1. TL;DR STATE

-> 2026-07-01 active update: Phase50-55 reopened the dense serving question.
+> 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving
+> scheduler question.
 > True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
 > than the Phase47 h2h aggregate suggested, while traced serving still shows
 > no pure decode-only steps and high TTFT. Phase53 rejected static lower
@@ -37,10 +38,13 @@ Read order for a cold start:
 > dense caps lost aggregate. Phase58 added a prompt-backlog threshold; min32
 > improved MoE `n=128` aggregate `339.0 -> 341.9`, mean TTFT
 > `7743.1 -> 7420.1 ms`, and wall `24.167 -> 23.950 s` in the same window, while
-> dense `n=128` was mixed. Next step should repeat min32 and run matching vLLM
-> h2h before any default-on discussion. The trace and scheduler commits are
-> local and DGX-gated but not pushed, so the LocalAI patch series has not been
-> regenerated.
+> dense `n=128` was mixed. Phase59 repeated MoE min32: aggregate stayed flat
+> (`336.6 -> 336.9`), mean TTFT improved (`7798.5 -> 7167.8 ms`), and wall stayed
+> flat (`24.334 -> 24.316 s`) with md5/op gates green. Matching vLLM was still
+> far ahead (`601.3` aggregate, `2968.1 ms` mean TTFT), so min32 is an opt-in
+> llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits
+> are local and DGX-gated but not pushed, so the LocalAI patch series has not
+> been regenerated.

 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
  unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -689,6 +693,30 @@ lowering aggregate/prefill throughput, and it does not materially solve TTFT.
 Next scheduler work should be per-step histograms or a targeted first-token
 admission policy.

+Phase 54 through Phase 59 tested that targeted scheduler path. The fork commits
+are still local-only and default-off:
+
+- `c6cb8460e feat(server): trace serving admission batches`
+- `bd7b2e952 feat(server): add admission trace histograms`
+- `8a97629a4 feat(server): add TTFT prefill-first scheduler mode`
+- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`
+- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
+
+Phase59 is the current verdict. Artifact:
+`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`. Pre/post
+llama gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. MoE `n=128`, `ptok=128`, `gen=64` repeated the Phase58 min32 TTFT
+signal: llama default `agg=336.6` t/s, mean `TTFT=7798.5ms`, wall `24.334s`;
+llama min32 `agg=336.9` t/s, mean `TTFT=7167.8ms`, wall `24.316s`. Matching
+vLLM was still ahead at `agg=601.3` t/s, mean `TTFT=2968.1ms`, wall `13.563s`.
+
+Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` and
+`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as an opt-in llama.cpp latency/QoS
+knob. It does not prove vLLM parity progress by itself. Do not default it until
+more workload coverage exists, and do not regenerate LocalAI patches until the
+fork commits are pushed with explicit approval.
+
 ---

 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1463,6 +1463,31 @@ scheduler A/B so far: MoE `n=128` improved aggregate, TTFT, and wall in the same
 window, while dense `n=128` was roughly neutral but slightly worse on aggregate
 and wall. Keep it opt-in until repeated and compared against matching vLLM h2h.

+### Phase 59 MoE min32 repeat and vLLM H2H
+
+Phase59 repeated the Phase58 MoE min32 point, then ran matching vLLM serving.
+Artifact:
+`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`.
+
+Pre/post llama md5 and op gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` |
+
+Decision: min32 repeated as a real llama.cpp scheduler QoS improvement
+(`-8.1%` mean TTFT with flat aggregate and wall), but it is not a vLLM parity
+lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
+`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
+scheduler knob opt-in and return parity work to the prefill / MoE compute gap.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

 ### Phase 10 GDN C32 slab update
--- a/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
+++ b/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
@@ -0,0 +1,75 @@
+# Phase 59: MoE Min32 Repeat and vLLM H2H
+
+## Goal
+
+Repeat the Phase58 MoE `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` result in a
+fresh DGX window, then compare against a matching vLLM `n=128`, `ptok=128`,
+`gen=64` serving run.
+
+## Patch Under Test
+
+The temporary DGX patch stack was generated from the local llama.cpp fork, up
+to and including:
+
+- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
+
+The stack was applied to the clean DGX mirror for the llama.cpp runs, then
+reverted before the vLLM run.
+
+## Verification
+
+Pre and post llama gates stayed green:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+## Results
+
+Artifact:
+
+- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
+
+Min32 repeat delta versus llama default:
+
+- Aggregate throughput: `+0.1%`
+- Mean TTFT: `-8.1%`
+- Max TTFT: `-2.7%`
+- Wall time: `-0.1%`
+- Prefill throughput: `+2.8%`
+- Decode aggregate throughput: `-2.3%`
+
+Llama min32 versus vLLM:
+
+- Aggregate throughput ratio: `0.560`
+- Mean TTFT: llama is `2.415x` slower
+- Wall time: llama is `1.793x` slower
+- Prefill throughput ratio: `0.430`
+- Decode aggregate throughput ratio: `0.673`
+
+## Decision
+
+`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` repeated as a real, inference-gated
+llama.cpp scheduler QoS improvement for MoE `n=128`: it cuts mean TTFT without
+moving aggregate throughput or wall time materially.
+
+It is not a vLLM parity lever by itself. vLLM remains far ahead on the same
+serving shape, especially prefill and TTFT. Keep the scheduler path opt-in and
+treat it as user-visible latency tuning while parity work returns to the larger
+prefill / MoE compute gap.
+
+## Status
+
+- Phase59 docs recorded.
+- DGX lock released as `FREE phase59-cleanup`.
+- No push performed.
+- LocalAI `patches/paged/` not regenerated.
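
The Phase 59 deltas and ratios can be recomputed from the MoE `n=128`, `ptok=128`, `gen=64` table. The following is a minimal sketch, assuming the rounded table figures as inputs; the dictionary keys and function names are illustrative and not part of the benchmark tooling. Because the inputs are already rounded, the last digit can differ from the reported value (for example, the prefill ratio computes to about `0.4295` against the reported `0.430`).

```python
# Values copied from the Phase 59 MoE n=128, ptok=128, gen=64 results table.
# Key names are hypothetical labels chosen for this sketch.
RUNS = {
    "llama default": {
        "agg": 336.6, "decode": 646.7, "prefill": 1525.1,
        "ttft_mean": 7798.5, "ttft_max": 11666.8, "wall": 24.334,
    },
    "llama min32": {
        "agg": 336.9, "decode": 632.0, "prefill": 1567.1,
        "ttft_mean": 7167.8, "ttft_max": 11353.4, "wall": 24.316,
    },
    "vLLM": {
        "agg": 601.3, "decode": 938.8, "prefill": 3648.7,
        "ttft_mean": 2968.1, "ttft_max": 4871.6, "wall": 13.563,
    },
}


def pct_delta(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new / base - 1.0) * 100.0


def min32_vs_default() -> dict:
    """Percent change of llama min32 against llama default, per metric."""
    base, new = RUNS["llama default"], RUNS["llama min32"]
    return {metric: pct_delta(new[metric], base[metric]) for metric in base}


def min32_vs_vllm() -> dict:
    """llama min32 divided by vLLM, per metric.

    For throughput metrics a value below 1 means llama is slower;
    for TTFT and wall time a value above 1 means llama is slower.
    """
    llama, vllm = RUNS["llama min32"], RUNS["vLLM"]
    return {metric: llama[metric] / vllm[metric] for metric in llama}


if __name__ == "__main__":
    for metric, delta in min32_vs_default().items():
        print(f"min32 vs default {metric}: {delta:+.1f}%")
    for metric, ratio in min32_vs_vllm().items():
        print(f"min32 / vLLM {metric}: {ratio:.3f}")
```

Returning unrounded floats keeps rounding a presentation choice, so the same helpers can be reused if the unrounded artifact values are substituted for the table figures.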