docs(paged): compare MoE min32 against vLLM

Assisted-by: Codex:gpt-5
Ettore Di Giacinto
2026-07-01 10:46:32 +00:00
parent c41d1a5b4f
commit ef7dbfa5f7
4 changed files with 188 additions and 5 deletions


@@ -3297,3 +3297,58 @@ Decision:
and wall regressed slightly. Do not default-on yet.
- Next step should repeat the MoE min32 result and run the matching vLLM h2h
comparison before treating this as real parity progress rather than run noise.
## Phase 59 MoE Min32 Repeat and vLLM H2H
Phase 59 repeats the Phase58 MoE min32 point and compares it to a matching vLLM
serving run. The Phase51+Phase54+Phase55+Phase57+Phase58 stack was applied
temporarily to the clean DGX mirror for the llama.cpp runs, then reverted before
the vLLM run.
Artifact:
- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
Pre/post llama gates:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
MoE `n=128`, `ptok=128`, `gen=64`:
| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
Llama min32 repeat versus llama default:
- Aggregate throughput: `+0.1%`
- Mean TTFT: `-8.1%`
- Max TTFT: `-2.7%`
- Wall time: `-0.1%`
- Prefill throughput: `+2.8%`
- Decode aggregate throughput: `-2.3%`
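The percentages above can be recomputed from the results table. This is a minimal sketch that assumes the rounded table values as inputs; the dictionary keys and the `pct_delta` helper are illustrative names, not part of the benchmark tooling.

```python
# Values copied from the MoE n=128 results table (llama default vs llama min32).
DEFAULT = {
    "agg_tps": 336.6,
    "decode_tps": 646.7,
    "prefill_tps": 1525.1,
    "ttft_mean_ms": 7798.5,
    "ttft_max_ms": 11666.8,
    "wall_s": 24.334,
}
MIN32 = {
    "agg_tps": 336.9,
    "decode_tps": 632.0,
    "prefill_tps": 1567.1,
    "ttft_mean_ms": 7167.8,
    "ttft_max_ms": 11353.4,
    "wall_s": 24.316,
}


def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# One rounded delta per metric, matching the one-decimal precision of the list.
deltas = {key: round(pct_delta(DEFAULT[key], MIN32[key]), 1) for key in DEFAULT}

for key, value in deltas.items():
    print(f"{key:>13}: {value:+.1f}%")
```

For throughput metrics a positive delta is an improvement; for TTFT and wall time a negative delta is an improvement.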
Llama min32 versus vLLM:
- Aggregate throughput ratio: `0.560`
- Mean TTFT: llama is `2.415x` slower
- Wall time: llama is `1.793x` slower
- Prefill throughput ratio: `0.430`
- Decode aggregate throughput ratio: `0.673`
Decision:
- The min32 repeat confirms a real, inference-gated llama.cpp scheduler QoS
improvement for MoE `n=128`: mean TTFT drops `8.1%` (`7798.5 -> 7167.8 ms`)
without material aggregate or wall-time loss.
- It does not close parity with vLLM. vLLM remains much faster on the same
request shape, especially on prefill throughput (llama at `0.430x`) and mean
TTFT (llama `2.415x` slower).
- Keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet.
- Treat this as latency tuning, not the next parity track. The larger gap is
still prefill / MoE compute.
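Because the knob stays opt-in, it has to be set explicitly in the server's environment. The sketch below only builds that environment; the two variable names come from the decision above, while the `prefill_first_env` helper and its arguments are hypothetical.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the decision above; values are the min32 setting.
MIN32_VARS = {
    "LLAMA_TTFT_PREFILL_FIRST": "1",
    "LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING": "32",
}


def prefill_first_env(base: Optional[Mapping[str, str]] = None,
                      enabled: bool = False) -> dict:
    """Return a copy of the environment with the min32 knob set or cleared.

    The default is enabled=False, mirroring the "do not default it" decision.
    """
    env = dict(os.environ if base is None else base)
    for name, value in MIN32_VARS.items():
        if enabled:
            env[name] = value
        else:
            env.pop(name, None)  # make sure a stale setting does not leak in
    return env


if __name__ == "__main__":
    env = prefill_first_env(enabled=True)
    print({name: env[name] for name in MIN32_VARS})
```

The resulting dictionary could be passed as `env=` to `subprocess.Popen` when launching the server under test; the launch command itself is left out because it is not recorded here.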


@@ -20,7 +20,8 @@ Read order for a cold start:
## 1. TL;DR STATE
> 2026-07-01 active update: Phase50-55 reopened the dense serving question.
> 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving
> scheduler question.
> True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
> than the Phase47 h2h aggregate suggested, while traced serving still shows
> no pure decode-only steps and high TTFT. Phase53 rejected static lower
@@ -37,10 +38,13 @@ Read order for a cold start:
> dense caps lost aggregate. Phase58 added a prompt-backlog threshold; min32
> improved MoE `n=128` aggregate `339.0 -> 341.9`, mean TTFT
> `7743.1 -> 7420.1 ms`, and wall `24.167 -> 23.950 s` in the same window, while
> dense `n=128` was mixed. Next step should repeat min32 and run matching vLLM
> h2h before any default-on discussion. The trace and scheduler commits are
> local and DGX-gated but not pushed, so the LocalAI patch series has not been
> regenerated.
> dense `n=128` was mixed. Phase59 repeated MoE min32: aggregate stayed flat
> (`336.6 -> 336.9`), mean TTFT improved (`7798.5 -> 7167.8 ms`), and wall stayed
> flat (`24.334 -> 24.316 s`) with md5/op gates green. Matching vLLM was still
> far ahead (`601.3` aggregate, `2968.1 ms` mean TTFT), so min32 is an opt-in
> llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits
> are local and DGX-gated but not pushed, so the LocalAI patch series has not
> been regenerated.
- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -689,6 +693,30 @@ lowering aggregate/prefill throughput, and it does not materially solve TTFT.
Next scheduler work should be per-step histograms or a targeted first-token
admission policy.
Phase 54 through Phase 59 tested that targeted scheduler path. The fork commits
are still local-only and default-off:
- `c6cb8460e feat(server): trace serving admission batches`
- `bd7b2e952 feat(server): add admission trace histograms`
- `8a97629a4 feat(server): add TTFT prefill-first scheduler mode`
- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`
- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
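The commit subjects describe a prefill-first mode, a cap on decode deferral, and a gate on prompt backlog. The fork's actual logic is not shown in these notes, so the following is only a toy model of how such a rule could compose. The class, its fields, the `>=` comparison, and the cap value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefillFirstPolicy:
    """Toy model only; NOT the fork's implementation."""

    enabled: bool = False               # stands in for LLAMA_TTFT_PREFILL_FIRST=1
    min_waiting: int = 32               # stands in for ..._MIN_WAITING=32
    max_consecutive_deferrals: int = 4  # hypothetical cap value
    consecutive: int = 0                # deferrals in the current streak
    deferred_total: int = 0             # analogous to a "deferred" counter

    def defer_decode(self, waiting_prompts: int) -> bool:
        """Return True if this step should favor prefill and defer decode."""
        backlog_gate = self.enabled and waiting_prompts >= self.min_waiting
        if backlog_gate and self.consecutive < self.max_consecutive_deferrals:
            self.consecutive += 1
            self.deferred_total += 1
            return True
        # Gate closed or cap reached: run decode and reset the streak.
        self.consecutive = 0
        return False


if __name__ == "__main__":
    policy = PrefillFirstPolicy(enabled=True)
    for backlog in (8, 40, 40, 40, 40, 40, 16):
        print(backlog, policy.defer_decode(backlog))
```

In this model a small backlog never defers decode, which is one plausible reason a backlog threshold would protect aggregate throughput at low load.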
Phase59 is the current verdict. Artifact:
`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`. Pre/post
llama gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. MoE `n=128`, `ptok=128`, `gen=64` repeated the Phase58 min32 signal:
llama default `agg=336.6`, `TTFT=7798.5ms`, wall `24.334s`; llama min32
`agg=336.9`, `TTFT=7167.8ms`, wall `24.316s`. Matching vLLM was still
`agg=601.3`, `TTFT=2968.1ms`, wall `13.563s`.
Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` and
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as an opt-in llama.cpp latency/QoS
knob. It does not prove vLLM parity progress by itself. Do not default it until
more workload coverage exists, and do not regenerate LocalAI patches until the
fork commits are pushed with explicit approval.
---
## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)


@@ -1463,6 +1463,31 @@ scheduler A/B so far: MoE `n=128` improved aggregate, TTFT, and wall in the same
window, while dense `n=128` was roughly neutral but slightly worse on aggregate
and wall. Keep it opt-in until repeated and compared against matching vLLM h2h.
### Phase 59 MoE min32 repeat and vLLM H2H
Phase59 repeated the Phase58 MoE min32 point, then ran matching vLLM serving.
Artifact:
`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`.
Pre/post llama md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.
MoE `n=128`, `ptok=128`, `gen=64`:
| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|------------------|---------|-----------------|-------------|--------------|-------------|--------|
| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` |
| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` |
| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` |
Decision: min32 reproduced as a real llama.cpp scheduler QoS improvement
(`-8.1%` mean TTFT with flat aggregate and wall), but it is not a vLLM parity
lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
scheduler knob opt-in and return parity work to the prefill / MoE compute gap.
Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
### Phase 10 GDN C32 slab update


@@ -0,0 +1,75 @@
# Phase 59: MoE Min32 Repeat and vLLM H2H
## Goal
Repeat the Phase58 MoE `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` result in a
fresh DGX window, then compare against a matching vLLM `n=128`, `ptok=128`,
`gen=64` serving run.
## Patch Under Test
The temporary DGX patch stack was generated from the local llama.cpp fork, up to
and including:
- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
The patch was applied to the clean DGX mirror for llama.cpp runs, then reverted
before the vLLM run.
## Verification
Pre and post llama gates stayed green:
| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
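The gate table can be checked mechanically. This sketch assumes that "green" means the pre and post md5 values match and that each op field reads as `passed/total` with every case passing; the `GateSnapshot` type and helper names are hypothetical, and the real gate scripts may compare against other references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateSnapshot:
    moe_md5: str
    dense_md5: str
    mul_mat: str      # assumed format: "passed/total"
    mul_mat_id: str   # assumed format: "passed/total"


def ops_all_passed(field: str) -> bool:
    """True when a 'passed/total' field reports every case passing."""
    passed, total = (int(part) for part in field.split("/"))
    return total > 0 and passed == total


def gates_green(pre: GateSnapshot, post: GateSnapshot) -> bool:
    """Outputs unchanged across the run and all op tests passing both times."""
    md5_stable = pre.moe_md5 == post.moe_md5 and pre.dense_md5 == post.dense_md5
    ops_ok = all(
        ops_all_passed(field)
        for snap in (pre, post)
        for field in (snap.mul_mat, snap.mul_mat_id)
    )
    return md5_stable and ops_ok


# Values copied from the table above.
PRE = GateSnapshot(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)
POST = GateSnapshot(
    "8cb0ce23777bf55f92f63d0292c756b0",
    "5951a5b4d624ce891e22ab5fca9bc439",
    "1146/1146",
    "806/806",
)

if __name__ == "__main__":
    print("gates green:", gates_green(PRE, POST))
```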
## Results
Artifact:
- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
MoE `n=128`, `ptok=128`, `gen=64`:
| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
Min32 repeat delta versus llama default:
- Aggregate throughput: `+0.1%`
- Mean TTFT: `-8.1%`
- Max TTFT: `-2.7%`
- Wall time: `-0.1%`
- Prefill throughput: `+2.8%`
- Decode aggregate throughput: `-2.3%`
Llama min32 versus vLLM:
- Aggregate throughput ratio: `0.560`
- Mean TTFT: llama is `2.415x` slower
- Wall time: llama is `1.793x` slower
- Prefill throughput ratio: `0.430`
- Decode aggregate throughput ratio: `0.673`
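The llama-versus-vLLM figures follow two conventions: throughput is reported as llama divided by vLLM (below 1 means llama is slower), while TTFT and wall time are reported as a slowdown factor (llama divided by vLLM, above 1 means llama is slower). A minimal sketch, assuming the rounded table values as inputs; the names are illustrative, and results recomputed from rounded inputs can differ from the recorded figures in the last digit.

```python
# Values copied from the MoE n=128 results table.
LLAMA_MIN32 = {
    "agg_tps": 336.9,
    "decode_tps": 632.0,
    "prefill_tps": 1567.1,
    "ttft_mean_ms": 7167.8,
    "wall_s": 24.316,
}
VLLM = {
    "agg_tps": 601.3,
    "decode_tps": 938.8,
    "prefill_tps": 3648.7,
    "ttft_mean_ms": 2968.1,
    "wall_s": 13.563,
}


def ratio(metric: str) -> float:
    """llama min32 divided by vLLM for the given metric."""
    return LLAMA_MIN32[metric] / VLLM[metric]


ratios = {metric: ratio(metric) for metric in LLAMA_MIN32}

if __name__ == "__main__":
    for metric in ("agg_tps", "prefill_tps", "decode_tps"):
        print(f"{metric:>13}: {ratios[metric]:.3f}x of vLLM throughput")
    for metric in ("ttft_mean_ms", "wall_s"):
        print(f"{metric:>13}: llama is {ratios[metric]:.3f}x slower")
```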
## Decision
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` reproduced as a real, inference-gated
llama.cpp scheduler QoS improvement for MoE `n=128`: it cuts mean TTFT by `8.1%`
without materially moving aggregate throughput or wall time.
It is not a vLLM parity lever by itself. vLLM remains far ahead on the same
serving shape, especially prefill and TTFT. Keep the scheduler path opt-in and
treat it as user-visible latency tuning while parity work returns to the larger
prefill / MoE compute gap.
## Status
- Phase59 docs recorded.
- DGX lock released as `FREE phase59-cleanup`.
- No push performed.
- LocalAI `patches/paged/` not regenerated.