Mirror of https://github.com/mudler/LocalAI.git, synced 2026-07-03 12:57:02 -04:00
docs(paged): record max-concurrency parity check
Assisted-by: Codex:gpt-5
@@ -2295,3 +2295,78 @@ Decision:
default-off, keep gate weights in F32, avoid graph-time weight concat, and
pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving
benchmark. If md5 changes, run KL first and reject on KL regression.

## Phase 40 Max-Concurrency C1 Check

Phase 40 tested the remaining C1 hypothesis from the lever map: use paged KV's
lower memory footprint to run a higher-concurrency serving point where vLLM
falls behind or fails to fit.

Artifacts:

- `/home/mudler/bench/phase40_max_concurrency_dryrun/20260701_090002`
- `/home/mudler/bench/phase40_max_concurrency/20260701_090012`

Preflight:

| check | actual |
|-------|--------|
| GPU | `NVIDIA GB10, 580.159.03` |
| docker containers | `0` |
| `local-ai-worker` containers | `0` |
| GPU compute apps | `0` |
| GPU lock owner | `FREE phase39-gate-sgemm-profile-done 1782888737` |

Harness change:

- `paged-current-serving-snapshot.sh` now accepts `BUILD_DIR` and defaults
  `BIN` from that same directory. This keeps the benchmark build step and runtime
  binaries pointed at the same CMake tree.
- Phase 40 used `BUILD_DIR=$HOME/llama-phase6-source/build-phase36`,
  `BIN=$HOME/llama-phase6-source/build-phase36/bin`,
  `OPS=MUL_MAT,MUL_MAT_ID`, `PARALLEL=256`, `CTX=262144`, `PTOK=128`,
  `GEN=64`, `NPL="128 192 256"`.

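The defaulting described in the harness change can be sketched in isolation. This is a minimal illustration, not the harness itself: the function name `resolve_paths` is hypothetical, while the variable names and default values are the ones the script's environment overrides list.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the BUILD_DIR/BIN defaulting in the snapshot script.
resolve_paths() {
    local src build_dir bin
    src=${SRC:-"$HOME/llama-phase6-source"}
    build_dir=${BUILD_DIR:-"$src/build-cuda"}   # CMake tree used by the build step
    bin=${BIN:-"$build_dir/bin"}                # runtime binaries follow the same tree
    printf '%s\n%s\n' "$build_dir" "$bin"
}

# Setting only BUILD_DIR is enough: BIN resolves to the bin dir of the same tree.
( unset BIN; BUILD_DIR="$HOME/llama-phase6-source/build-phase36"; resolve_paths )
```

An explicit `BIN` still wins over the derived value, so existing invocations that set `BIN` directly keep working.
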
Pre/post inference gates:

| phase | check | status | actual |
|-------|-------|--------|--------|
| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| pre | `MUL_MAT` | ok | `1146/1146` |
| pre | `MUL_MAT_ID` | ok | `806/806` |
| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` |
| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` |
| post | `MUL_MAT` | ok | `1146/1146` |
| post | `MUL_MAT_ID` | ok | `806/806` |

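The gate table reduces to two checks: every op-test count is fully passed, and each md5 is identical before and after the run. A rough sketch follows, assuming hypothetical helper names `ops_green` and `md5_stable`; the harness's real `run_gate` is not shown in this diff and may work differently.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not the harness's run_gate implementation.

# True when a "passed/total" op-test count (e.g. 1146/1146) is fully green.
ops_green() {
    case "$1" in
        */*) ;;
        *) return 1 ;;          # not in passed/total form
    esac
    local passed total
    passed=${1%%/*}
    total=${1##*/}
    [ -n "$total" ] && [ "$passed" = "$total" ]
}

# True when the post-run md5 equals the pre-run md5.
md5_stable() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Values copied from the gate table.
if ops_green "1146/1146" && ops_green "806/806" \
   && md5_stable 8cb0ce23777bf55f92f63d0292c756b0 8cb0ce23777bf55f92f63d0292c756b0 \
   && md5_stable 5951a5b4d624ce891e22ab5fca9bc439 5951a5b4d624ce891e22ab5fca9bc439; then
    echo "gates green"
else
    echo "gates red"
fi
```
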
Serving result:

| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms |
|-----|---|---------|----------------|--------------------|-------------|--------------|
| paged | 128 | `326.3` | `671.8` | `3.97` | `1695.2` | `8182.3` |
| paged | 192 | `318.3` | `679.9` | `2.50` | `1605.2` | `11151.6` |
| paged | 256 | `337.1` | `829.9` | `2.09` | `1525.7` | `15065.7` |
| vLLM | 128 | `654.4` | `1013.3` | `6.72` | `5206.0` | `2582.6` |
| vLLM | 192 | `697.7` | `1185.2` | `4.88` | `4787.1` | `3690.6` |
| vLLM | 256 | `714.1` | `1306.1` | `3.90` | `4471.0` | `5124.2` |

Ratios:

| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
|---|---------------------|----------------------|------------------|-------------------|
| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` |
| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` |
| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` |

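Each ratio is the paged value divided by the vLLM value from the same `n` row of the serving result, rounded to four decimals. For throughput columns a value below 1 means paged is slower; for TTFT a value above 1 means paged is slower. A small sketch for re-deriving a row, assuming a hypothetical `ratio` helper built on `awk`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: paged metric divided by the matching vLLM metric, 4 decimals.
ratio() {
    awk -v paged="$1" -v vllm="$2" 'BEGIN {
        if (vllm == 0) exit 1           # refuse to divide by zero
        printf "%.4f\n", paged / vllm
    }'
}

# n=256 row, inputs copied from the serving result table.
echo "decode:  $(ratio 829.9 1306.1)"
echo "per-seq: $(ratio 2.09 3.90)"
echo "agg:     $(ratio 337.1 714.1)"
echo "TTFT:    $(ratio 15065.7 5124.2)"
```
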
Decision:

- C1 does not close GB10 parity for this workload. Paged safely serves `n=256`
  with canonical md5/op gates green before and after the run, but vLLM also
  fits and remains materially faster.
- Do not claim a GB10 parity win from higher max concurrency at
  `PTOK=128`, `GEN=64`, `n<=256`.
- The next GB10 work should stay on the profile-validated root causes:
  prefill GDN, prefill MoE GEMM, and low-concurrency/full-step graph capture.
  Any future C1 rerun must push beyond this tested point and keep the same
  md5 plus `MUL_MAT`/`MUL_MAT_ID` gates.

@@ -513,6 +513,18 @@ The only future fused-gate design worth scoping is a persistent/load-time F32
combined gate weight with output views, default-off until MoE/dense md5,
`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.

Phase 40 closes the tested GB10 max-concurrency C1 shortcut. Artifact:
`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The snapshot ran
with `PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`,
and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE
`8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
`806/806`. Paged safely served `n=256`, but vLLM also fit and remained faster:
`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`,
`paged_ttft_over_vllm=2.9401`. Do not claim GB10 parity from higher max
concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push
beyond this tested point and keep the same md5/op gates.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -595,6 +607,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`.
- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
- `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -1012,6 +1012,32 @@ into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight
layout feature, not a graph shortcut. It must stay default-off until MoE/dense
md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass.

### Phase 40 max-concurrency C1 check

Phase 40 tested whether paged KV's memory advantage creates a higher-concurrency
GB10 serving point that closes the vLLM gap. Artifact:
`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The run used
`PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`, and
`OPS=MUL_MAT,MUL_MAT_ID`.

Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`.

Result:

| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM |
|---|---------------------|----------------------|------------------|-------------------|
| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` |
| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` |
| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` |

Decision: C1 does not close GB10 parity at `PTOK=128`, `GEN=64`, and `n<=256`.
Paged safely serves `n=256`, but vLLM also fits and remains faster. Do not use
the memory-footprint advantage as a parity claim at this tested point; any
future C1 retry must push beyond it and keep md5 plus `MUL_MAT`/`MUL_MAT_ID`
gates.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update

@@ -13,6 +13,7 @@ comparison with the h2h client.

Environment overrides:
SRC llama.cpp source dir (default: ~/llama-phase6-source)
+BUILD_DIR llama.cpp CMake build dir (default: $SRC/build-cuda)
BIN llama.cpp build bin dir (default: $SRC/build-cuda/bin)
MODEL paged GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
VLLM_MODEL vLLM model dir (default: ~/bench/q36-35b-a3b-nvfp4-vllm)
@@ -58,7 +59,8 @@ case "${1:-}" in
esac

SRC=${SRC:-"$HOME/llama-phase6-source"}
-BIN=${BIN:-"$SRC/build-cuda/bin"}
+BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"}
+BIN=${BIN:-"$BUILD_DIR/bin"}
MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
VLLM_MODEL=${VLLM_MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4-vllm"}
H2H=${H2H:-"$HOME/bench/h2h_cli3.py"}
@@ -376,14 +378,14 @@ log "source=$(git -C "$SRC" log --oneline -1)"

if [[ "$DRY_RUN" == "1" ]]; then
  log "dry run only; commands validated"
-  log "would build: cmake --build $SRC/build-cuda --target llama-server llama-completion test-backend-ops -j8"
+  log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8"
  log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
  log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
  exit 0
fi

log "building llama-server, llama-completion, and test-backend-ops"
-cmake --build "$SRC/build-cuda" --target llama-server llama-completion test-backend-ops -j 8 \
+cmake --build "$BUILD_DIR" --target llama-server llama-completion test-backend-ops -j 8 \
  > "$ART/build.log" 2>&1

run_gate pre