Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries.
Assisted-by: Codex:gpt-5
Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.
Assisted-by: Codex:gpt-5
Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.
Assisted-by: Codex:gpt-5
Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.
Assisted-by: Codex:gpt-5
Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.
Assisted-by: Codex:gpt-5
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.
Assisted-by: Codex:gpt-5
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.
Assisted-by: Codex:gpt-5
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.
Assisted-by: Codex:gpt-5
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.
Assisted-by: Codex:gpt-5
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.
Assisted-by: Codex:gpt-5
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.
Assisted-by: Codex:gpt-5
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification.
Assisted-by: Codex:gpt-5
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.
Assisted-by: Codex:gpt-5
Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.
Assisted-by: Codex:gpt-5
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for all paged-backend kernel/patch work. Always update it FIRST: commit
the change on the fork branch and push it, then regenerate the LocalAI patch
series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via
git format-patch so the series is a 1:1 drift-free mirror of the branch. Never
edit the LocalAI patch files directly, and never add a patch with no
corresponding fork-branch commit. The series is a derivative; the fork is the
source. The fork branch is also where the build and the per-path bit-exact md5
gate actually run, so it is the only place a change is truly validated.
Codified in two places:
- .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow
(MANDATORY)" section at the top of the patch/pin-sync material, plus the
"Encapsulating your work" bullet now points at it.
- backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the
hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale
numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the
f32-only M5 which is worktree 0047).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ground-check follow-up to 2431090ff. Two factual corrections:
- Section 7 worktree line had the ahead/behind counts swapped ("25 ahead,
197 behind"); the branch is actually ~199 ahead / 25 behind origin/master.
- Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...);
the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for
dense, which likewise differs from the canonical dense gate 5951a5b4). Both
CDEF values come from combined_definitive.sh's own gate command, not the
canonical bit-exact gate in section 3.3, so neither is sanctioned and both
must be KL-validated.
Everything else in the handoff verified accurate: fork branch localai-paged
HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e,
all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all
referenced file/artifact paths.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Adds docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero
context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md
(the why/record) with TL;DR state, the hard gates (per-path bit-exact md5,
KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable
operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node
decode-profiling rule that caused 4 wrong analyses), the complete tested-and-
rejected lever map, methodology lessons, the three forward directions, and a
key file/artifact index with the open discrepancies to reconcile.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>