mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record admission histogram trace
Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision. Assisted-by: Codex:gpt-5
@@ -2976,3 +2976,82 @@ Decision:
- Do not promote simple budget shrinkage as a parity lever. The next useful
  scheduler work is a richer per-step histogram trace or a targeted first-token
  admission policy, not a static lower `LLAMA_MAX_BATCH_TOKENS`.

## Phase 54 Admission Histogram Trace

Phase 54 extends the Phase51 trace with compact per-step histograms for prompt
tokens, decode tokens, and waiting prompt slots. This is still trace-only and
default-off behind `LLAMA_SERVING_TRACE=1`; it does not change scheduling or
inference.

Fork commits:

- `c6cb8460e feat(server): trace serving admission batches`
- `bd7b2e952 feat(server): add admission trace histograms`

Artifact:

- `/home/mudler/bench/phase54_admission_hist_trace/20260701_113201`

Pre/post gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

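The pre/post comparison in the gate table can be sketched in Python. This is a minimal illustration only: it assumes the gate values have already been collected into plain dictionaries, and the key names and the helpers `op_gate_passes` and `gates_unchanged` are hypothetical, not part of the fork or the bench scripts.

```python
# Hypothetical gate records; the values are copied from the gate table.
PRE = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": "1146/1146",
    "MUL_MAT_ID": "806/806",
}
POST = dict(PRE)  # the post-change run reported identical values


def op_gate_passes(result: str) -> bool:
    """Return True when a 'passed/total' op result has every case passing."""
    passed, total = (int(part) for part in result.split("/"))
    return total > 0 and passed == total


def gates_unchanged(pre: dict, post: dict) -> list:
    """Return the names of gates whose value differs between the two runs."""
    return [name for name in pre if pre[name] != post.get(name)]


changed = gates_unchanged(PRE, POST)
ops_green = all(op_gate_passes(POST[name]) for name in ("MUL_MAT", "MUL_MAT_ID"))
print("changed gates:", changed)
print("op gates green:", ops_green)
```

An empty list of changed gates together with green op gates corresponds to the "pre equals post" reading of the table.
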
Focused test/build:

- Red test first: histogram assertions failed before implementation.
- Local fork: `test-server-admission-trace` passed, CTest passed, and
  `llama-server` built.
- DGX `build-cuda`: `test-server-admission-trace` passed under CTest after the
  temporary Phase51+Phase54 patch stack was applied.

Phase52-aligned dense trace:

- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf`
- `LLAMA_SERVING_TRACE=1`
- `N=128`, `PTOK=168`, `GEN=64`
- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512`

H2H result:

| n | prompt tokens | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s |
|---|---------------|---------|-----------------|---------------------|-------------|--------------|--------|
| 128 | `22913` | `138.1` | `360.2` | `1.92` | `626.7` | `23393.2` | `59.303` |

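As a sanity check on the H2H row, the aggregate figures can be related to each other with a few lines of arithmetic. The sketch below assumes that `agg t/s` means total generated tokens divided by wall time, and that the prefill and decode rates each imply a time share of the wall clock. Those definitions are inferred from the numbers, not taken from the benchmark source.

```python
# Values copied from the H2H table and the run parameters.
N = 128                 # concurrent requests
GEN = 64                # generated tokens per request
PROMPT_TOKENS = 22913   # total prompt tokens across all requests
WALL_S = 59.303
PREFILL_TPS = 626.7
DECODE_AGG_TPS = 360.2

generated = N * GEN                       # total generated tokens
agg_tps = generated / WALL_S              # assumed definition of "agg t/s"
prefill_s = PROMPT_TOKENS / PREFILL_TPS   # time implied by the prefill rate
decode_s = generated / DECODE_AGG_TPS     # time implied by the decode rate

print(f"agg t/s ~ {agg_tps:.1f}")
print(f"prefill ~ {prefill_s:.2f} s, decode ~ {decode_s:.2f} s, "
      f"sum ~ {prefill_s + decode_s:.2f} s (wall {WALL_S} s)")
print(f"mean prompt tokens per request ~ {PROMPT_TOKENS / N:.1f}")
```

Under these assumptions the computed aggregate rate rounds to the reported `138.1`, and the two implied time shares add up to within a few hundredths of a second of the reported wall time.
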
Trace:

```text
serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1
```

Interpretation:

- The Phase54 run matches the Phase52 serving envelope: same `76` steps, same
  `8064` trace decode tokens, same `267` waiting prompt slots, and throughput
  within noise.
- `63/76` steps have `prompt_tokens=0` and `waiting_prompt_slots=0`.
- Prompt admission is concentrated in a small number of very large chunks:
  `prompt_hist=513+:12`.
- Decode is mostly full-width during active decode:
  `decode_hist=128-255:53`.
- The scheduler still emits no pure decode-only steps for this shape.

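The readings in the interpretation list can be reproduced from the trace line with a small parser. This is a sketch, assuming the trace keeps the `key=value` layout and the `bucket:count` histogram encoding shown in the trace block. The `parse_trace` helper is hypothetical and not part of `llama-server`.

```python
# Trace line copied from the Phase54 run.
TRACE = (
    "serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 "
    "prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 "
    "started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 "
    "last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 "
    "prompt_hist=0:63,1-64:1,513+:12 "
    "decode_hist=0:3,1-63:10,64-127:10,128-255:53 "
    "waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1"
)


def parse_trace(line: str) -> dict:
    """Parse 'key=value' tokens; '*_hist' values become {bucket: count} dicts."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # skip label words such as "serving admission trace:"
        if key.endswith("_hist"):
            pairs = (item.rsplit(":", 1) for item in value.split(","))
            fields[key] = {bucket: int(count) for bucket, count in pairs}
        else:
            fields[key] = int(value)
    return fields


trace = parse_trace(TRACE)
steps = trace["steps"]

# Every histogram should account for each scheduler step exactly once.
for name in ("prompt_hist", "decode_hist", "waiting_hist"):
    print(name, "covers", sum(trace[name].values()), "of", steps, "steps")

print("steps without prompt tokens:", trace["prompt_hist"]["0"])
print("steps with 513+ prompt tokens:", trace["prompt_hist"]["513+"])
print("started + continued slots:",
      trace["started_prompt_slots"] + trace["continued_prompt_slots"])
```

In this trace each histogram sums to the step count, and the started plus continued prompt slots add up to the reported `waiting_prompt_slots`. Whether that last identity holds by construction in the tracer is an assumption; here it is only observed numerically.
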
Decision:

- The histogram strengthens the Phase53 rejection of static lower batch
  budgets. The issue is not a uniformly oversized prompt budget every step;
  prompt work arrives in a few large chunks and first-token latency remains high.
- The next scheduler A/B should be a targeted first-token admission or prompt
  front-loading policy that changes when first prompt chunks are admitted, while
  keeping md5/op gates unchanged. Do not reduce `LLAMA_MAX_BATCH_TOKENS` globally
  as the next parity lever.

Mirror status:

- Both trace commits are local and DGX-gated.
- The LocalAI `patches/paged/` series is not regenerated yet because the
  handoff requires pushing the fork branch first, and pushes require explicit
  approval.

@@ -20,7 +20,20 @@ Read order for a cold start:

## 1. TL;DR STATE

- The investigation is **CLOSED**. Parity is **not reachable on GB10** silicon; the residual is a hardware ceiling, not engineering debt.
> 2026-07-01 active update: Phase50-54 reopened the dense serving question.
> True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`)
> than the Phase47 h2h aggregate suggested, while traced serving still shows
> no pure decode-only steps and high TTFT. Phase53 rejected static lower
> admission budgets; Phase54 histograms show prompt admission concentrated in a
> few large chunks (`prompt_hist=513+:12`) with mostly full-width decode
> (`decode_hist=128-255:53`). Next scheduler work should be a targeted
> first-token admission or prompt-front-loading A/B, not another global
> `LLAMA_MAX_BATCH_TOKENS` reduction. The trace commits are local and DGX-gated
> but not pushed, so the LocalAI patch series has not been regenerated.

- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
  unreachable. Treat that as superseded where Phase50-54 provide newer dense
  serving evidence.
- **Prefill** is a genuine floor at **~36% (MoE) / ~43% (dense)** of vLLM. Prefill is **not** CUDA-graph-replayed, so these numbers are real, not measurement artifacts.
- **Decode** is **near-parity: ~86% of vLLM's TRUE GPU-steady decode** (924 vs 1078 t/s). The long-standing **~56% headline was a CUDA-graph measurement artifact** (nsys without `--cuda-graph-trace=node` collapses each graph replay into one opaque launch). Decode is also **ahead of vLLM at low concurrency** (dense 116.7% at N=8) and uses **1.5-3x less memory**, bit-exact per-path.
- The lever search was **exhaustive**: every attempt (prefill GEMM, GDN chunked scan, decode fusions, serving/scheduler) is recorded with its verdict and number so it is **not re-run**.

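The parity percentages quoted in the TL;DR follow directly from the paired throughput figures. A minimal check, using only numbers stated in the bullets and the blockquote (the `percent_of` helper is illustrative):

```python
def percent_of(ours: float, reference: float) -> float:
    """Express one throughput as a percentage of a reference throughput."""
    return 100.0 * ours / reference


# Dense decode from the 2026-07-01 update: 383.66 vs 435.00 t/s.
dense_decode = percent_of(383.66, 435.00)
# GPU-steady decode from the Decode bullet: 924 vs 1078 t/s.
steady_decode = percent_of(924, 1078)

print(f"dense decode: {dense_decode:.1f}% of vLLM")
print(f"GPU-steady decode: {steady_decode:.1f}% of vLLM")
```
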
@@ -1291,6 +1291,38 @@ throughput and prefill throughput fall, and TTFT does not materially improve.
Next scheduler work should collect per-step histograms or test a targeted
first-token admission policy.

### Phase 54 admission histogram trace

Phase54 extended the Phase51 default-off trace with prompt-token,
decode-token, and waiting-slot histograms. Fork stack:
`c6cb8460e feat(server): trace serving admission batches` and
`bd7b2e952 feat(server): add admission trace histograms`.

Artifact:
`/home/mudler/bench/phase54_admission_hist_trace/20260701_113201`.

Pre/post md5 and op gates stayed green on the temporary DGX patch stack:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`.

The Phase52-aligned dense run used `n=128`, `ptok=168`, `gen=64`, producing
`prompt_tok_total=22913`, `agg_tps=138.1`, `decode_agg_tps=360.2`,
`prefill_tps=626.7`, `ttft_mean_ms=23393.2`, and `wall_s=59.303`.

Trace:

```text
steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1
```

Decision: the scheduler does not spend every step over-admitting prompt work.
Most steps have no waiting prompts and no prompt tokens, while prompt admission
is concentrated into a small number of large chunks. This rejects global
budget-shrinkage as the next path and points to a targeted first-token
admission or prompt-front-loading A/B, gated by the same md5 and backend-op
checks.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
