From 2a0fc0f4b9de2e779de57fd2d46d47ce29bfa60e Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 1 Jul 2026 07:45:52 +0000
Subject: [PATCH] docs(paged): record inference gate guard

Assisted-by: Codex:gpt-5
---
 .../docs/GB10_PARITY_PHASE0_RESULTS.md        | 42 +++++++++
 .../docs/PARITY_HANDOFF.md                    |  8 ++
 .../docs/VLLM_PARITY_LEVER_MAP.md             | 12 +++
 ...2026-07-01-inference-gate-guard-phase45.md | 87 +++++++++++++++++++
 4 files changed, 149 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
index 7cba7fe61..be8dfb65e 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md
@@ -2539,3 +2539,45 @@ Decision: hardware-pivot benchmark still needs the normal preflight, `hardware.txt`, pre/post MoE/dense md5 gates, `MUL_MAT`/`MUL_MAT_ID` checks, and KL-if-md5-changes before interpreting throughput.
+
+## Phase 45 Inference Gate Guard
+
+Phase 45 answers the inference-safety question after the harness-only Phase44
+change by running the canonical paged inference gates on DGX. This is a
+gate-only phase: it does not benchmark serving throughput and does not change
+inference code.
+
+Artifact:
+
+- `/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`
+
+Preflight:
+
+- Docker containers: `0`
+- `local-ai-worker` containers: `0`
+- GPU compute apps: `0`
+- GPU lock owner: `FREE released-by-codex-current-serving-snapshot 1782890417`
+
+Gate command:
+
+```bash
+BIN=$HOME/llama-phase6-source/build-phase36/bin \
+ART=$HOME/bench/phase45_inference_gate_guard/20260701_094320 \
+OPS=MUL_MAT,MUL_MAT_ID \
+~/paged-inference-gates.sh
+```
+
+Results:
+
+| check | result |
+|-------|--------|
+| MoE paged md5 | `8cb0ce23777bf55f92f63d0292c756b0` |
+| Dense paged md5 | `5951a5b4d624ce891e22ab5fca9bc439` |
+| `MUL_MAT` backend op | `1146/1146`, `Backend CUDA0: OK` |
+| `MUL_MAT_ID` backend op | `806/806`, `Backend CUDA0: OK` |
+
+Decision:
+
+- Current DGX phase36 build still passes the canonical inference md5/op gates.
+- Phase44 did not touch inference code; Phase45 provides the post-change guard
+  artifact for future handoff and comparison.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index 71dff2cde..a769d79ca 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -567,6 +567,13 @@ gate behavior. Use it when the next parity run targets datacenter Blackwell or another non-GB10 vLLM serving shape, while keeping `hardware.txt`, pre/post MoE/dense md5, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes as mandatory gates.
+Phase 45 records the immediate inference-safety guard after Phase44. Artifact:
+`/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`. The DGX
+phase36 build passed MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID`
+`806/806`. Docker, `local-ai-worker`, and GPU compute preflight were all zero
+before and after the run.
+
 ---
 ## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -652,6 +659,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
 - `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1.
 - `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts.
+- `~/bench/phase45_inference_gate_guard/20260701_094320` - post-Phase44 inference guard; MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` backend-op gates green.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
index aa0276a58..2b9cc2959 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
@@ -1111,6 +1111,18 @@ future non-GB10 snapshots can carry the same `hardware.txt`, pre/post md5, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes gates while using hardware-specific vLLM serving limits.
+### Phase 45 inference gate guard
+
+Phase 45 ran the canonical paged inference safety gate after the Phase44 harness
+change. Artifact:
+`/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`.
+
+Results stayed green on the DGX phase36 build: MoE md5
+`8cb0ce23777bf55f92f63d0292c756b0`, dense md5
+`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and
+`MUL_MAT_ID` `806/806`. This confirms the current build still satisfies the
+inference-safety gates before any later hardware-pivot or larger kernel work.
+
 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.
 ### Phase 10 GDN C32 slab update
diff --git a/docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md b/docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md
new file mode 100644
index 000000000..3c6acb988
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md
@@ -0,0 +1,87 @@
+# Phase45 Inference Gate Guard Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Prove the current DGX build still passes the canonical paged inference md5 and backend-op gates after the harness-only Phase44 change.
+
+**Architecture:** Run the existing DGX `~/paged-inference-gates.sh` script against `~/llama-phase6-source/build-phase36/bin` with both `MUL_MAT` and `MUL_MAT_ID` op filters. Record the artifact in the parity docs; do not change llama.cpp inference source.
+
+**Tech Stack:** DGX ssh, Bash gate harness, LocalAI parity documentation.
+
+---
+
+### Task 1: Confirm DGX gate preflight
+
+**Files:**
+- Test only: DGX runtime state.
+
+- [x] **Step 1: Check docker, LocalAI worker, GPU compute, and lock owner**
+
+```bash
+ssh dgx.casa 'set -euo pipefail; docker_count=$(docker ps -q | wc -l); local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true); compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l); owner=FREE-no-lock-file; if [ -f "$HOME/gpu_bench_lock/owner" ]; then owner=$(cat "$HOME/gpu_bench_lock/owner"); fi; printf "docker=%s\nlocal_ai_worker=%s\ncompute=%s\nowner=%s\n" "$docker_count" "$local_ai" "$compute" "$owner"'
+```
+
+Expected: `docker=0`, `local_ai_worker=0`, `compute=0`, and owner starts with `FREE`.
+
+### Task 2: Run canonical inference gates
+
+**Files:**
+- Test only: `~/paged-inference-gates.sh` on DGX.
+
+- [x] **Step 1: Run md5 and backend-op gates**
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase45_inference_gate_guard/$(date +%Y%m%d_%H%M%S); BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART OPS=MUL_MAT,MUL_MAT_ID ~/paged-inference-gates.sh'
+```
+
+Expected:
+
+```text
+moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
+dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
+1146/1146 tests passed
+Backend CUDA0: OK
+806/806 tests passed
+Backend CUDA0: OK
+paged inference gates OK
+```
+
+### Task 3: Record Phase45
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md`
+
+- [x] **Step 1: Append gate artifact and verdict**
+
+Record the exact artifact directory and the md5/op results.
+
+- [x] **Step 2: Mark this plan complete**
+
+Only mark the remaining steps complete after the gate and docs update are done.
+
+### Task 4: Commit
+
+**Files:**
+- Commit the Phase45 docs and plan.
+
+- [x] **Step 1: Run final checks**
+
+```bash
+git diff --check
+git status --short
+```
+
+Expected: no whitespace errors; only intended docs/plan changes plus the pre-existing untracked `.claude/`.
+
+- [x] **Step 2: Commit**
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+    backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+    backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git add -f docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md
+git commit -m "docs(paged): record inference gate guard" -m "Assisted-by: Codex:gpt-5"
+```
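
Reviewer note (outside the patch): Task 2 of the plan lists the expected gate output as plain text, and the recorded verdict depends on those lines matching. A minimal sketch of how that comparison could be checked mechanically is shown here, assuming the gate output has been captured to a file. The function name `verify_gate_log` and the saved log file are hypothetical; they are not part of this patch or of `~/paged-inference-gates.sh`. Only the line formats are taken from the plan's "Expected" block.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of this patch or of paged-inference-gates.sh.
# Assumption: gate output was saved to a text file using the line formats
# shown in the plan's "Expected" block.
set -euo pipefail

verify_gate_log() {
    local log=$1 moe_md5=$2 dense_md5=$3
    local failed=0 ops=0 line
    local re='^([0-9]+)/([0-9]+) tests passed$'

    # The md5 lines must match the recorded hashes exactly.
    grep -qxF "moe md5 OK: ${moe_md5}" "$log" || { echo "FAIL: moe md5" >&2; failed=1; }
    grep -qxF "dense md5 OK: ${dense_md5}" "$log" || { echo "FAIL: dense md5" >&2; failed=1; }

    # Every "N/M tests passed" line must report N equal to M.
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            ops=$((ops + 1))
            if [[ ${BASH_REMATCH[1]} != "${BASH_REMATCH[2]}" ]]; then
                echo "FAIL: $line" >&2
                failed=1
            fi
        fi
    done < "$log"
    [[ $ops -gt 0 ]] || { echo "FAIL: no backend-op result lines" >&2; failed=1; }

    # The final summary line must be present.
    grep -qxF "paged inference gates OK" "$log" || { echo "FAIL: no summary line" >&2; failed=1; }

    return "$failed"
}
```

Called as `verify_gate_log <log> <moe-md5> <dense-md5>`, the sketch returns zero only when both hashes match, every backend-op line reports all tests passed, and the summary line is present. Failures are collected rather than aborting on the first one, so a single run reports every gate that regressed.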