diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index d1c5263f0..a0159846e 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -3297,3 +3297,58 @@ Decision:
   and wall regressed slightly. Do not default-on yet.
 - Next step should repeat the MoE min32 result and run the matching vLLM h2h
   comparison before treating this as real parity progress rather than run noise.
+
+## Phase 59 MoE Min32 Repeat and vLLM H2H
+
+Phase 59 repeats the Phase58 MoE min32 point and compares it to a matching vLLM
+serving run. The Phase51+Phase54+Phase55+Phase57+Phase58 stack was applied
+temporarily to the clean DGX mirror for the llama.cpp runs, then reverted before
+the vLLM run.
+
+Artifact:
+
+- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
+
+Pre/post llama gates:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
+
+Llama min32 repeat versus llama default:
+
+- Aggregate throughput: `+0.1%`
+- Mean TTFT: `-8.1%`
+- Max TTFT: `-2.7%`
+- Wall time: `-0.1%`
+- Prefill throughput: `+2.8%`
+- Decode aggregate throughput: `-2.3%`
+
+Llama min32 versus vLLM:
+
+- Aggregate throughput ratio: `0.560`
+- Mean TTFT: llama is `2.415x` slower
+- Wall time: llama is `1.793x` slower
+- Prefill throughput ratio: `0.430`
+- Decode aggregate throughput ratio: `0.673`
+
+Decision:
+
+- The min32 repeat confirms a real, inference-gated llama.cpp scheduler QoS
+  improvement for MoE `n=128`: mean TTFT drops without material aggregate or
+  wall-time loss.
+- It does not close parity with vLLM. vLLM remains much faster on the same
+  request shape, especially prefill throughput and TTFT.
+- Keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
+  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet.
+- Treat this as latency tuning, not the next parity track. The larger gap is
+  still prefill / MoE compute.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 7b638f9aa..54f7cf1c9 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -20,7 +20,8 @@ Read order for a cold start:
 
 ## 1. TL;DR STATE
 
-> 2026-07-01 active update: Phase50-55 reopened the dense serving question.
+> 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving
+> scheduler question.
 > True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
 > than the Phase47 h2h aggregate suggested, while traced serving still shows
 > no pure decode-only steps and high TTFT. Phase53 rejected static lower
@@ -37,10 +38,13 @@ Read order for a cold start:
 > dense caps lost aggregate. Phase58 added a prompt-backlog threshold; min32
 > improved MoE `n=128` aggregate `339.0 -> 341.9`, mean TTFT
 > `7743.1 -> 7420.1 ms`, and wall `24.167 -> 23.950 s` in the same window, while
-> dense `n=128` was mixed. Next step should repeat min32 and run matching vLLM
-> h2h before any default-on discussion. The trace and scheduler commits are
-> local and DGX-gated but not pushed, so the LocalAI patch series has not been
-> regenerated.
+> dense `n=128` was mixed. Phase59 repeated MoE min32: aggregate stayed flat
+> (`336.6 -> 336.9`), mean TTFT improved (`7798.5 -> 7167.8 ms`), and wall stayed
+> flat (`24.334 -> 24.316 s`) with md5/op gates green. Matching vLLM was still
+> far ahead (`601.3` aggregate, `2968.1 ms` mean TTFT), so min32 is an opt-in
+> llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits
+> are local and DGX-gated but not pushed, so the LocalAI patch series has not
+> been regenerated.
 
 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
   unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -689,6 +693,30 @@ lowering aggregate/prefill throughput, and it does not materially solve TTFT.
 Next scheduler work should be per-step histograms or a targeted first-token
 admission policy.
 
+Phase 54 through Phase 59 tested that targeted scheduler path. The fork commits
+are still local-only and default-off:
+
+- `c6cb8460e feat(server): trace serving admission batches`
+- `bd7b2e952 feat(server): add admission trace histograms`
+- `8a97629a4 feat(server): add TTFT prefill-first scheduler mode`
+- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`
+- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
+
+Phase59 is the current verdict. Artifact:
+`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`. Pre/post
+llama gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
+`806/806`. MoE `n=128`, `ptok=128`, `gen=64` repeated the Phase58 min32 signal:
+llama default `agg=336.6`, `TTFT=7798.5ms`, wall `24.334s`; llama min32
+`agg=336.9`, `TTFT=7167.8ms`, wall `24.316s`. Matching vLLM was still
+`agg=601.3`, `TTFT=2968.1ms`, wall `13.563s`.
+
+Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` and
+`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as an opt-in llama.cpp latency/QoS
+knob. It does not prove vLLM parity progress by itself. Do not default it until
+more workload coverage exists, and do not regenerate LocalAI patches until the
+fork commits are pushed with explicit approval.
+
 ---
 
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index 929dea1b5..41b738b40 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1463,6 +1463,31 @@ scheduler A/B so far: MoE `n=128` improved aggregate, TTFT, and wall in the same
 window, while dense `n=128` was roughly neutral but slightly worse on aggregate
 and wall. Keep it opt-in until repeated and compared against matching vLLM h2h.
 
+### Phase 59 MoE min32 repeat and vLLM H2H
+
+Phase59 repeated the Phase58 MoE min32 point, then ran matching vLLM serving.
+Artifact:
+`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`.
+
+Pre/post llama md5 and op gates stayed green: MoE
+`8cb0ce23777bf55f92f63d0292c756b0`, dense
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`.
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` |
+
+Decision: min32 repeated as a real llama.cpp scheduler QoS improvement
+(`-8.1%` mean TTFT with flat aggregate and wall), but it is not a vLLM parity
+lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
+`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
+scheduler knob opt-in and return parity work to the prefill / MoE compute gap.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md b/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
new file mode 100644
index 000000000..b71bf8404
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
@@ -0,0 +1,75 @@
+# Phase 59: MoE Min32 Repeat and vLLM H2H
+
+## Goal
+
+Repeat the Phase58 MoE `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` result in a
+fresh DGX window, then compare against a matching vLLM `n=128`, `ptok=128`,
+`gen=64` serving run.
+
+## Patch Under Test
+
+The temporary DGX patch stack was generated from the local llama.cpp fork
+through:
+
+- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
+
+The patch was applied to the clean DGX mirror for llama.cpp runs, then reverted
+before the vLLM run.
+
+## Verification
+
+Pre and post llama gates stayed green:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+## Results
+
+Artifact:
+
+- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
+
+Min32 repeat delta versus llama default:
+
+- Aggregate throughput: `+0.1%`
+- Mean TTFT: `-8.1%`
+- Max TTFT: `-2.7%`
+- Wall time: `-0.1%`
+- Prefill throughput: `+2.8%`
+- Decode aggregate throughput: `-2.3%`
+
+Llama min32 versus vLLM:
+
+- Aggregate throughput ratio: `0.560`
+- Mean TTFT: llama is `2.415x` slower
+- Wall time: llama is `1.793x` slower
+- Prefill throughput ratio: `0.430`
+- Decode aggregate throughput ratio: `0.673`
+
+## Decision
+
+`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` repeated as a real, inference-gated
+llama.cpp scheduler QoS improvement for MoE `n=128`: it cuts mean TTFT without
+moving aggregate throughput or wall time materially.
+
+It is not a vLLM parity lever by itself. vLLM remains far ahead on the same
+serving shape, especially prefill and TTFT. Keep the scheduler path opt-in and
+treat it as user-visible latency tuning while parity work returns to the larger
+prefill / MoE compute gap.
+
+## Status
+
+- Phase59 docs recorded.
+- DGX lock released as `FREE phase59-cleanup`.
+- No push performed.
+- LocalAI `patches/paged/` not regenerated.
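
Reviewer note: the percentage deltas and ratios quoted in the Phase 59 sections can be re-derived from the MoE `n=128` table. The following is a minimal sketch, assuming the rounded table values as inputs. The helper names `pct_delta` and `ratio` and the `ROWS` dictionary are illustrative and are not part of the benchmark tooling. Because the inputs are already rounded, a figure that sits on a rounding edge can differ in the last digit from one computed on raw data: the prefill ratio comes out at roughly `0.4295` here, against the reported `0.430`.

```python
# Re-derive the Phase 59 MoE n=128 deltas from the rounded table values.
# ROWS, pct_delta and ratio are hypothetical names used for illustration only.

ROWS = {
    "llama default": {"agg": 336.6, "decode": 646.7, "prefill": 1525.1,
                      "ttft_mean": 7798.5, "ttft_max": 11666.8, "wall": 24.334},
    "llama min32":   {"agg": 336.9, "decode": 632.0, "prefill": 1567.1,
                      "ttft_mean": 7167.8, "ttft_max": 11353.4, "wall": 24.316},
    "vLLM":          {"agg": 601.3, "decode": 938.8, "prefill": 3648.7,
                      "ttft_mean": 2968.1, "ttft_max": 4871.6, "wall": 13.563},
}


def pct_delta(new: float, old: float) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


def ratio(a: float, b: float) -> float:
    """Plain ratio a / b."""
    return a / b


base, min32, vllm = ROWS["llama default"], ROWS["llama min32"], ROWS["vLLM"]

# min32 versus llama default: signed percentage change per metric.
for key in base:
    print(f"min32 vs default {key}: {pct_delta(min32[key], base[key]):+.1f}%")

# min32 versus vLLM: throughput as a fraction of vLLM (higher is better).
for key in ("agg", "prefill", "decode"):
    print(f"min32 / vLLM {key}: {ratio(min32[key], vllm[key]):.4f}")

# min32 versus vLLM: latency as a slowdown factor (lower is better).
for key in ("ttft_mean", "wall"):
    print(f"min32 slowdown vs vLLM {key}: {ratio(min32[key], vllm[key]):.3f}x")
```

Throughput metrics are expressed as llama/vLLM so that values below 1 mean llama is behind, while latency metrics are expressed the same way so that values above 1 read directly as "x times slower", matching how the lists above phrase them.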