mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record dense admission trace
Assisted-by: Codex:gpt-5
@@ -2879,3 +2879,62 @@ Mirror status:
- The LocalAI `patches/paged/` series is not regenerated yet because the
  handoff requires pushing the fork branch first, and pushes require explicit
  approval.

## Phase 52 Dense Admission Trace

Phase 52 used the Phase 51 trace to capture the actual dense `n=128` serving
admission shape. The Phase 51 patch was applied temporarily to the clean DGX
mirror, built, gated, used for the trace, and then reverted from the mirror.

Artifact:

- `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`

Pre/post gates:

| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|-------|---------|-----------|-----------|--------------|
| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |

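
The pre/post gate comparison can be expressed as a small check. This is a minimal sketch, not the harness that produced the table: the dictionary layout and the function names `op_gate_passed` and `gates_green` are hypothetical, and only the hash and count values are taken from the rows above.

```python
# Gate rows copied from the pre/post table; the dict layout is illustrative only.
ROW = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": "1146/1146",
    "MUL_MAT_ID": "806/806",
}
GATES = {"pre": dict(ROW), "post": dict(ROW)}


def op_gate_passed(result: str) -> bool:
    """An op gate written as 'passed/total' is green when both counts match."""
    passed, total = result.split("/")
    return int(passed) == int(total)


def gates_green(gates: dict) -> bool:
    """Green means: md5 values unchanged pre -> post, and every op gate fully passed."""
    pre, post = gates["pre"], gates["post"]
    md5_stable = all(pre[k] == post[k] for k in ("moe_md5", "dense_md5"))
    ops_ok = all(
        op_gate_passed(row[k])
        for row in (pre, post)
        for k in ("MUL_MAT", "MUL_MAT_ID")
    )
    return md5_stable and ops_ok


print(gates_green(GATES))
```

The point of bracketing is that a temporary patch is only trusted if both sides of the bracket agree, so the check compares pre against post rather than each side against a hard-coded constant.
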
Clean run shape:

- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf`
- `LLAMA_SERVING_TRACE=1`
- `N=128`, `PTOK=128`, `GEN=64`
- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512`

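
For orientation, the run shape implies a few simple budgets. The sketch below is arithmetic only and rests on stated assumptions: that the context is divided evenly across parallel slots, and that every request generates exactly `GEN` tokens. The `budgets` helper and its key names are hypothetical and do not come from the harness.

```python
# Run-shape values copied from the list above.
RUN = {
    "N": 128,
    "PTOK": 128,
    "GEN": 64,
    "CTX": 131072,
    "PARALLEL": 128,
    "BATCH": 2048,
    "UBATCH": 512,
}


def budgets(run: dict) -> dict:
    """Derive rough per-slot and per-step budgets from the run shape."""
    return {
        # Assumption: context is split evenly across the parallel slots.
        "ctx_per_slot": run["CTX"] // run["PARALLEL"],
        # Nominal tokens one request needs (prompt target plus generation).
        "nominal_tokens_per_request": run["PTOK"] + run["GEN"],
        # Assumption: every request generates exactly GEN tokens.
        "gen_total": run["N"] * run["GEN"],
        # How many micro-batches fit in one logical batch.
        "ubatches_per_batch": run["BATCH"] // run["UBATCH"],
    }


print(budgets(RUN))
```

Under these assumptions each slot has room for 1024 tokens against a nominal need of 192, so context capacity is unlikely to be what limits admission in this run.
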
H2H result:

| n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s |
|---|---------|-----------------|---------------------|-------------|--------------|--------|
| 128 | `139.0` | `360.5` | `1.93` | `629.5` | `23171.5` | `58.921` |

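
One reading that reproduces the `agg t/s` column is total generated tokens divided by wall time. This is an inference from the numbers, not a documented definition of the h2h metric, and it assumes every request generated exactly `GEN=64` tokens. The function name `aggregate_tps` is hypothetical.

```python
def aggregate_tps(n_requests: int, gen_per_request: int, wall_s: float) -> float:
    """Generated tokens per second over the whole run (assumed metric definition)."""
    return n_requests * gen_per_request / wall_s


# N=128, GEN=64 from the run shape; wall time from the H2H row.
print(round(aggregate_tps(128, 64, 58.921), 1))
```

If this reading is right, the gap between `agg t/s` and `decode agg t/s` reflects wall time spent outside the decode phase, which fits the high mean TTFT in the same row.
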
Admission trace:

| steps | decode-only steps | decode tokens | prompt tokens | waiting prompt slots | max waiting prompt slots | started prompt slots | continued prompt slots |
|-------|-------------------|---------------|---------------|----------------------|--------------------------|----------------------|------------------------|
| `76` | `0` | `8064` | `22785` | `267` | `35` | `128` | `139` |

Derived values:

- `prompt_tokens` matched h2h `prompt_tok_total` exactly: `22785`.
- `decode_tokens` was `128` lower than h2h `gen_total`, as expected with one
  first-token transition per request (`N=128`).
- Average prompt tokens per scheduler step: `299.8`.
- Average decode tokens per scheduler step: `106.11`.
- Average waiting prompt slots per scheduler step: `3.51`.
- `prefill_budget_step=0` and `prefill_cap_per_slot=0`, confirming the default
  stock n-batch-only prompt admission path.

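
The derived values above follow directly from the admission trace counters. The sketch below recomputes them; the `AdmissionTrace` dataclass and `derive` function are hypothetical names for illustration, and `gen_total` is assumed to equal `N * GEN`, which is consistent with the stated gap of `128`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdmissionTrace:
    """Counters copied from the admission trace table (field names illustrative)."""
    steps: int
    decode_tokens: int
    prompt_tokens: int
    waiting_prompt_slots: int


def derive(trace: AdmissionTrace, n_requests: int, gen_per_request: int) -> dict:
    """Per-step averages plus the first-token gap against the assumed gen_total."""
    gen_total = n_requests * gen_per_request  # assumption: GEN tokens per request
    return {
        "prompt_per_step": round(trace.prompt_tokens / trace.steps, 1),
        "decode_per_step": round(trace.decode_tokens / trace.steps, 2),
        "waiting_per_step": round(trace.waiting_prompt_slots / trace.steps, 2),
        "first_token_gap": gen_total - trace.decode_tokens,
    }


TRACE = AdmissionTrace(
    steps=76, decode_tokens=8064, prompt_tokens=22785, waiting_prompt_slots=267
)
print(derive(TRACE, n_requests=128, gen_per_request=64))
```
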
Decision:

- The default dense `n=128` scheduler emits no pure decode steps
  (`decode_only_steps=0`) and admits prompt work across mixed steps. That
  explains why Phase47 h2h serving decode can lag the Phase50 true-decode ratio:
  serving is shaped by mixed prompt/decode admission and TTFT, not just dense
  decode kernels.
- The next code phase should be a small, default-off scheduler A/B or a richer
  per-step histogram trace to test whether prefill chunking/admission can reduce
  TTFT without regressing aggregate throughput. Do not move to another GDN/GEMM
  rewrite until this scheduler hypothesis is tested.

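
To make the proposed per-step histogram trace concrete, here is a minimal sketch of how per-step records could be classified and bucketed. Everything in it is an assumption: the current trace reports only run totals, so the `(prompt_tokens, decode_tokens)` record format, the `classify` and `step_histogram` names, the bucket width, and the sample values are all hypothetical and are not measured data.

```python
from collections import Counter


def classify(prompt_tokens: int, decode_tokens: int) -> str:
    """Label one scheduler step by the kind of work it carried."""
    if prompt_tokens and decode_tokens:
        return "mixed"
    if prompt_tokens:
        return "prefill_only"
    if decode_tokens:
        return "decode_only"
    return "idle"


def step_histogram(steps, bucket: int = 256):
    """Count step kinds and bucket prompt tokens per step by `bucket` width."""
    kinds = Counter()
    prompt_buckets = Counter()
    for prompt_tokens, decode_tokens in steps:
        kinds[classify(prompt_tokens, decode_tokens)] += 1
        prompt_buckets[(prompt_tokens // bucket) * bucket] += 1
    return kinds, prompt_buckets


# Made-up sample steps, for illustration only (not from the Phase 52 artifact).
SAMPLE = [(512, 0), (300, 40), (0, 128), (256, 100)]
kinds, prompt_buckets = step_histogram(SAMPLE)
print(dict(kinds), dict(prompt_buckets))
```

A histogram of this kind would show whether prompt admission is spread thinly over many mixed steps or concentrated in a few large ones, which a run total such as `decode_only_steps=0` cannot distinguish.
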
@@ -640,6 +640,18 @@ patched `build-cuda` CTest passed and inference gates stayed green: MoE
`806/806`. Push and LocalAI patch-series regeneration are still pending because
push requires explicit approval.

Phase 52 used the Phase 51 trace on DGX for dense `n=128`, `ptok=128`, `gen=64`.
Artifact: `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`.
Pre/post md5 and op gates stayed green. The clean traced h2h row was
`decode_agg_tps=360.5`, `prefill_tps=629.5`, `ttft_mean_ms=23171.5`, wall
`58.921s`. The admission trace reported `steps=76`, `decode_only_steps=0`,
`decode_tokens=8064`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`,
`started_prompt_slots=128`, `continued_prompt_slots=139`,
`prefill_budget_step=0`, and `prefill_cap_per_slot=0`. The prompt token count
matches h2h exactly, so the trace corresponds to the target h2h run. The next
GB10 lever should be a default-off scheduler/admission A/B or a per-step
histogram trace, not an immediate GDN/GEMM rewrite.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -734,6 +746,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment.
- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing.
- `~/bench/phase51_serving_admission_trace/20260701_110130` - default-off `LLAMA_SERVING_TRACE=1` fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval.
- `~/bench/phase52_dense_admission_trace/20260701_111017` - clean dense `n=128` admission trace; pre/post gates green; `decode_only_steps=0`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`; next lever is scheduler/admission A/B or per-step histogram trace.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -1243,6 +1243,30 @@ Verification:
Mirror status: pending explicit approval to push the fork branch, then
regenerate the LocalAI patch series from the pushed fork commit.

### Phase 52 dense admission trace

Phase 52 used the Phase 51 trace on DGX to measure dense `n=128`, `ptok=128`,
`gen=64` llama-server admission. Artifact:
`/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`.

The traced build was bracketed by canonical gates, all green before and after:
MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`.

Clean trace:

| h2h wall s | decode agg t/s | TTFT mean ms | steps | decode-only steps | decode tokens | prompt tokens | max waiting prompt slots |
|------------|-----------------|--------------|-------|-------------------|---------------|---------------|--------------------------|
| `58.921` | `360.5` | `23171.5` | `76` | `0` | `8064` | `22785` | `35` |

Decision: the default scheduler never emitted pure decode steps for this
high-N dense run. Prompt tokens matched h2h exactly, and prompt admission used
the stock path (`prefill_budget_step=0`, `prefill_cap_per_slot=0`). This
supports the Phase 50 conclusion that the remaining high-N serving gap is
shaped by scheduler/admission behavior and TTFT. The next lever should be a
default-off admission-policy A/B or per-step histogram trace, not immediate
kernel work.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
