mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 12:57:02 -04:00
docs(paged): record inference gate guard
Assisted-by: Codex:gpt-5
@@ -2539,3 +2539,45 @@ Decision:
hardware-pivot benchmark still needs the normal preflight, `hardware.txt`,
pre/post MoE/dense md5 gates, `MUL_MAT`/`MUL_MAT_ID` checks, and
KL-if-md5-changes before interpreting throughput.

## Phase 45 Inference Gate Guard

Phase 45 answers the inference-safety question after the harness-only Phase44
change by running the canonical paged inference gates on DGX. This is a
gate-only phase: it does not benchmark serving throughput and does not change
inference code.

Artifact:

- `/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`

Preflight:

- Docker containers: `0`
- `local-ai-worker` containers: `0`
- GPU compute apps: `0`
- GPU lock owner: `FREE released-by-codex-current-serving-snapshot 1782890417`
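
The three zero counts above can be checked mechanically before a gate run. The sketch below is an assumption, not the contents of `paged-inference-gates.sh`: the helper names `count_nonempty` and `preflight_ok` are hypothetical, and the commented `docker` and `nvidia-smi` invocations are only one plausible way to obtain the counts.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers; not taken from paged-inference-gates.sh.

# Count non-empty lines on stdin (one line per container ID or GPU PID).
count_nonempty() {
    grep -c . || true   # grep exits 1 on zero matches but still prints "0"
}

# Succeed only when all three counts are zero.
preflight_ok() {
    local docker_n=$1 worker_n=$2 gpu_n=$3
    [ "$docker_n" -eq 0 ] && [ "$worker_n" -eq 0 ] && [ "$gpu_n" -eq 0 ]
}

# Assumed way to gather the counts on the benchmark host:
#   docker_n=$(docker ps -q | count_nonempty)
#   worker_n=$(docker ps -q --filter name=local-ai-worker | count_nonempty)
#   gpu_n=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | count_nonempty)
#   preflight_ok "$docker_n" "$worker_n" "$gpu_n" || { echo "host busy" >&2; exit 1; }
```

Keeping the counting and the decision in separate functions lets the same check run both before and after the gates.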

Gate command:

```bash
BIN=$HOME/llama-phase6-source/build-phase36/bin \
ART=$HOME/bench/phase45_inference_gate_guard/20260701_094320 \
OPS=MUL_MAT,MUL_MAT_ID \
~/paged-inference-gates.sh
```

Results:

| check | result |
|-------|--------|
| MoE paged md5 | `8cb0ce23777bf55f92f63d0292c756b0` |
| Dense paged md5 | `5951a5b4d624ce891e22ab5fca9bc439` |
| `MUL_MAT` backend op | `1146/1146`, `Backend CUDA0: OK` |
| `MUL_MAT_ID` backend op | `806/806`, `Backend CUDA0: OK` |
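
Reading a results table like this one comes down to two comparisons: an md5 against a recorded value, and a `passed/total` counter. A minimal sketch follows, assuming the gate output being hashed is available as a file; `md5_matches` and `ops_all_passed` are hypothetical names, and which file the real gate hashes is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical result checks; not taken from paged-inference-gates.sh.

# Succeed when the md5 of FILE equals the recorded hash.
md5_matches() {
    local expected=$1 file=$2 actual
    actual=$(md5sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Succeed when a "passed/total" counter (e.g. "1146/1146") shows every test passing.
ops_all_passed() {
    local passed=${1%%/*} total=${1##*/}
    [ "$total" -gt 0 ] && [ "$passed" -eq "$total" ]
}

# Example use with the recorded Phase 45 values (output path is an assumption):
#   md5_matches 8cb0ce23777bf55f92f63d0292c756b0 "$ART/moe_paged.out" || echo "MoE md5 changed"
#   ops_all_passed 1146/1146 && ops_all_passed 806/806 && echo "backend ops green"
```

A changed md5 does not fail the guard by itself; as noted elsewhere in these docs, it triggers the KL-if-md5-changes gate.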

Decision:

- Current DGX phase36 build still passes the canonical inference md5/op gates.
- Phase44 did not touch inference code; Phase45 provides the post-change guard
  artifact for future handoff and comparison.

@@ -567,6 +567,13 @@ gate behavior. Use it when the next parity run targets datacenter Blackwell or
another non-GB10 vLLM serving shape, while keeping `hardware.txt`, pre/post
MoE/dense md5, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes as mandatory gates.

Phase 45 records the immediate inference-safety guard after Phase44. Artifact:
`/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`. The DGX
phase36 build passed MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
`806/806`. Docker, `local-ai-worker`, and GPU compute preflight were all zero
before and after the run.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)

@@ -652,6 +659,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
- `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1.
- `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts.
- `~/bench/phase45_inference_gate_guard/20260701_094320` - post-Phase44 inference guard; MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` backend-op gates green.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.

@@ -1111,6 +1111,18 @@ future non-GB10 snapshots can carry the same `hardware.txt`, pre/post md5,
`MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes gates while using hardware-specific
vLLM serving limits.

### Phase 45 inference gate guard

Phase 45 ran the canonical paged inference safety gate after the Phase44 harness
change. Artifact:
`/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`.

Results stayed green on the DGX phase36 build: MoE md5
`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
`MUL_MAT_ID` `806/806`. This confirms the current build still satisfies the
inference-safety gates before any later hardware-pivot or larger kernel work.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
