mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): compare MoE min32 against vLLM
Assisted-by: Codex:gpt-5
@@ -3297,3 +3297,58 @@ Decision:
and wall regressed slightly. Do not default-on yet.
- Next step should repeat the MoE min32 result and run the matching vLLM h2h
  comparison before treating this as real parity progress rather than run noise.

## Phase 59 MoE Min32 Repeat and vLLM H2H

Phase 59 repeats the Phase58 MoE min32 point and compares it to a matching vLLM
serving run. The Phase51+Phase54+Phase55+Phase57+Phase58 stack was applied
temporarily to the clean DGX mirror for the llama.cpp runs, then reverted before
the vLLM run.

Artifact:

- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`

Pre/post llama gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

MoE `n=128`, `ptok=128`, `gen=64`:

| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |

Llama min32 repeat versus llama default:

- Aggregate throughput: `+0.1%`
- Mean TTFT: `-8.1%`
- Max TTFT: `-2.7%`
- Wall time: `-0.1%`
- Prefill throughput: `+2.8%`
- Decode aggregate throughput: `-2.3%`

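The percentages above are relative changes of the min32 row against the default row. A minimal sketch of that arithmetic, assuming the rounded table values as inputs; the dictionary keys and the `pct_delta` helper are illustrative names, not part of the benchmark tooling:

```python
# Rounded values copied from the MoE n=128 table (llama default vs llama min32).
default = {"agg": 336.6, "decode_agg": 646.7, "prefill": 1525.1,
           "ttft_mean_ms": 7798.5, "ttft_max_ms": 11666.8, "wall_s": 24.334}
min32 = {"agg": 336.9, "decode_agg": 632.0, "prefill": 1567.1,
         "ttft_mean_ms": 7167.8, "ttft_max_ms": 11353.4, "wall_s": 24.316}


def pct_delta(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent."""
    return (new - base) / base * 100.0


# One-decimal deltas, matching the precision used in the bullets.
deltas = {key: round(pct_delta(min32[key], default[key]), 1) for key in default}

for key, value in deltas.items():
    print(f"{key}: {value:+.1f}%")
```

For throughput metrics a positive delta is an improvement; for TTFT and wall time a negative delta is an improvement.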

Llama min32 versus vLLM:

- Aggregate throughput ratio (llama / vLLM): `0.560`
- Mean TTFT: llama is `2.415x` slower
- Wall time: llama is `1.793x` slower
- Prefill throughput ratio (llama / vLLM): `0.430`
- Decode aggregate throughput ratio (llama / vLLM): `0.673`

Decision:

- The min32 repeat confirms a real, inference-gated llama.cpp scheduler QoS
  improvement for MoE `n=128`: mean TTFT drops without material aggregate or
  wall-time loss.
- It does not close parity with vLLM. vLLM remains much faster on the same
  request shape, especially prefill throughput and TTFT.
- Keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet.
- Treat this as latency tuning, not the next parity track. The larger gap is
  still prefill / MoE compute.

@@ -20,7 +20,8 @@ Read order for a cold start:

 ## 1. TL;DR STATE

-> 2026-07-01 active update: Phase50-55 reopened the dense serving question.
+> 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving
+> scheduler question.
 > True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
 > than the Phase47 h2h aggregate suggested, while traced serving still shows
 > no pure decode-only steps and high TTFT. Phase53 rejected static lower
@@ -37,10 +38,13 @@ Read order for a cold start:
 > dense caps lost aggregate. Phase58 added a prompt-backlog threshold; min32
 > improved MoE `n=128` aggregate `339.0 -> 341.9`, mean TTFT
 > `7743.1 -> 7420.1 ms`, and wall `24.167 -> 23.950 s` in the same window, while
-> dense `n=128` was mixed. Next step should repeat min32 and run matching vLLM
-> h2h before any default-on discussion. The trace and scheduler commits are
-> local and DGX-gated but not pushed, so the LocalAI patch series has not been
-> regenerated.
+> dense `n=128` was mixed. Phase59 repeated MoE min32: aggregate stayed flat
+> (`336.6 -> 336.9`), mean TTFT improved (`7798.5 -> 7167.8 ms`), and wall stayed
+> flat (`24.334 -> 24.316 s`) with md5/op gates green. Matching vLLM was still
+> far ahead (`601.3` aggregate, `2968.1 ms` mean TTFT), so min32 is an opt-in
+> llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits
+> are local and DGX-gated but not pushed, so the LocalAI patch series has not
+> been regenerated.

 - Historical verdict: the older investigation marked GB10 parity **CLOSED** and
   unreachable. Treat that as superseded where Phase50-54 provide newer dense
@@ -689,6 +693,30 @@ lowering aggregate/prefill throughput, and it does not materially solve TTFT.
Next scheduler work should be per-step histograms or a targeted first-token
admission policy.

Phase 54 through Phase 59 tested that targeted scheduler path. The fork commits
are still local-only and default-off:

- `c6cb8460e feat(server): trace serving admission batches`
- `bd7b2e952 feat(server): add admission trace histograms`
- `8a97629a4 feat(server): add TTFT prefill-first scheduler mode`
- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral`
- `8759213e3 feat(server): gate TTFT defer by prompt backlog`

Phase59 is the current verdict. Artifact:
`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`. Pre/post
llama gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. MoE `n=128`, `ptok=128`, `gen=64` repeated the Phase58 min32 signal:
llama default `agg=336.6`, `TTFT=7798.5ms`, wall `24.334s`; llama min32
`agg=336.9`, `TTFT=7167.8ms`, wall `24.316s`. Matching vLLM was still
`agg=601.3`, `TTFT=2968.1ms`, wall `13.563s`.

Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` and
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as an opt-in llama.cpp latency/QoS
knob. It does not prove vLLM parity progress by itself. Do not default it until
more workload coverage exists, and do not regenerate LocalAI patches until the
fork commits are pushed with explicit approval.

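Opting in for a single benchmark run could look like the sketch below. It assumes the two settings are read from the server process environment, which their `NAME=value` spelling suggests but these notes do not state; `min32_env` is a hypothetical helper, and the command that launches the patched server is deliberately left to the caller.

```python
import os
from typing import Mapping, Optional


def min32_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of `base` (default: the current environment) with the
    opt-in min32 scheduler settings added. Nothing is changed globally."""
    env = dict(os.environ if base is None else base)
    env["LLAMA_TTFT_PREFILL_FIRST"] = "1"
    env["LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING"] = "32"
    return env


# Example: inspect the environment that a min32 run would receive.
example = min32_env({"PATH": "/usr/bin"})
print(sorted(example))
```

Passing the result as `env=` to `subprocess.Popen` keeps the knob scoped to that one server process, so a default run in the same window is unaffected.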

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -1463,6 +1463,31 @@ scheduler A/B so far: MoE `n=128` improved aggregate, TTFT, and wall in the same
window, while dense `n=128` was roughly neutral but slightly worse on aggregate
and wall. Keep it opt-in until repeated and compared against matching vLLM h2h.

### Phase 59 MoE min32 repeat and vLLM H2H

Phase59 repeated the Phase58 MoE min32 point, then ran matching vLLM serving.
Artifact:
`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`.

Pre/post llama md5 and op gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.

MoE `n=128`, `ptok=128`, `gen=64`:

| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s |
|------------------|---------|-----------------|-------------|--------------|-------------|--------|
| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` |
| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` |
| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` |

Decision: min32 repeated as a real llama.cpp scheduler QoS improvement
(`-8.1%` mean TTFT with flat aggregate and wall), but it is not a vLLM parity
lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
scheduler knob opt-in and return parity work to the prefill / MoE compute gap.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update

@@ -0,0 +1,75 @@
# Phase 59: MoE Min32 Repeat and vLLM H2H

## Goal

Repeat the Phase58 MoE `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` result in a
fresh DGX window, then compare against a matching vLLM `n=128`, `ptok=128`,
`gen=64` serving run.

## Patch Under Test

The temporary DGX patch stack was generated from the local llama.cpp fork
through:

- `8759213e3 feat(server): gate TTFT defer by prompt backlog`

The patch was applied to the clean DGX mirror for llama.cpp runs, then reverted
before the vLLM run.

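For orientation, the idea named by the commit titles (prefer prefill, cap how long decode can be deferred, and only defer once enough prompts are waiting) can be sketched as below. This is not the fork's code: the class, its fields, the `>=` comparison, and the cap value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrefillFirstPolicy:
    """Hypothetical model of a backlog-gated prefill-first scheduler decision."""
    enabled: bool = False            # stands in for LLAMA_TTFT_PREFILL_FIRST=1
    min_waiting: int = 32            # stands in for ..._MIN_WAITING=32
    max_consecutive_defers: int = 4  # assumed cap; the real value is not in these notes
    consecutive_defers: int = 0

    def should_defer_decode(self, waiting_prompts: int) -> bool:
        """Return True if this step should run prefill and postpone decode."""
        backlog_large = waiting_prompts >= self.min_waiting
        under_cap = self.consecutive_defers < self.max_consecutive_defers
        if self.enabled and backlog_large and under_cap:
            self.consecutive_defers += 1
            return True
        # Decode runs this step, so the deferral streak starts over.
        self.consecutive_defers = 0
        return False


policy = PrefillFirstPolicy(enabled=True, max_consecutive_defers=2)
decisions = [policy.should_defer_decode(40) for _ in range(4)]
print(decisions)
```

Under these assumptions a small backlog never defers decode, which is consistent with the `deferred` count being `0` for the default run and non-zero only for min32.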

## Verification

Pre and post llama gates stayed green:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

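A gate comparison like the one in this table can be checked mechanically. The sketch below assumes "green" means the md5 values are identical before and after the llama runs and every op counter reads `passed == total`; the `Gate` type and helper names are illustrative, not the harness's own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    moe_md5: str
    dense_md5: str
    mul_mat: str      # "passed/total"
    mul_mat_id: str   # "passed/total"


def ops_all_pass(counter: str) -> bool:
    """True when a 'passed/total' counter reports every case passing."""
    passed, total = (int(part) for part in counter.split("/"))
    return total > 0 and passed == total


def gates_green(pre: Gate, post: Gate) -> bool:
    """Assumed definition: nothing drifted and all op checks pass on both sides."""
    counters = (pre.mul_mat, pre.mul_mat_id, post.mul_mat, post.mul_mat_id)
    return pre == post and all(ops_all_pass(c) for c in counters)


pre = Gate("8cb0ce23777bf55f92f63d0292c756b0",
           "5951a5b4d624ce891e22ab5fca9bc439", "1146/1146", "806/806")
post = Gate("8cb0ce23777bf55f92f63d0292c756b0",
            "5951a5b4d624ce891e22ab5fca9bc439", "1146/1146", "806/806")
print(gates_green(pre, post))
```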

## Results

Artifact:

- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`

MoE `n=128`, `ptok=128`, `gen=64`:

| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |

Min32 repeat delta versus llama default:

- Aggregate throughput: `+0.1%`
- Mean TTFT: `-8.1%`
- Max TTFT: `-2.7%`
- Wall time: `-0.1%`
- Prefill throughput: `+2.8%`
- Decode aggregate throughput: `-2.3%`

Llama min32 versus vLLM:

- Aggregate throughput ratio (llama / vLLM): `0.560`
- Mean TTFT: llama is `2.415x` slower
- Wall time: llama is `1.793x` slower
- Prefill throughput ratio (llama / vLLM): `0.430`
- Decode aggregate throughput ratio (llama / vLLM): `0.673`

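The head-to-head figures are plain ratios of the llama min32 row over the vLLM row. A minimal sketch using the rounded table values; the variable names are illustrative. Because the inputs are already rounded, the last digit can differ from figures derived from the raw measurements (the prefill ratio comes out at about `0.4295` here).

```python
# Rounded values copied from the MoE n=128 table.
llama_min32 = {"agg": 336.9, "decode_agg": 632.0, "prefill": 1567.1,
               "ttft_mean_ms": 7167.8, "wall_s": 24.316}
vllm = {"agg": 601.3, "decode_agg": 938.8, "prefill": 3648.7,
        "ttft_mean_ms": 2968.1, "wall_s": 13.563}

# Throughput: higher is better, so llama / vLLM below 1.0 means llama is behind.
throughput_ratio = {key: llama_min32[key] / vllm[key]
                    for key in ("agg", "decode_agg", "prefill")}

# Latency and wall time: lower is better, so llama / vLLM is a slowdown factor.
slowdown = {key: llama_min32[key] / vllm[key]
            for key in ("ttft_mean_ms", "wall_s")}

for key, value in throughput_ratio.items():
    print(f"{key}: {value:.3f} of vLLM")
for key, value in slowdown.items():
    print(f"{key}: llama takes {value:.3f}x as long")
```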

## Decision

`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` repeated as a real, inference-gated
llama.cpp scheduler QoS improvement for MoE `n=128`: it cuts mean TTFT without
moving aggregate throughput or wall time materially.

It is not a vLLM parity lever by itself. vLLM remains far ahead on the same
serving shape, especially prefill and TTFT. Keep the scheduler path opt-in and
treat it as user-visible latency tuning while parity work returns to the larger
prefill / MoE compute gap.

## Status

- Phase59 docs recorded.
- DGX lock released as `FREE phase59-cleanup`.
- No push performed.
- LocalAI `patches/paged/` not regenerated.