Record the Phase 27 current-stack llama.cpp n128 serving profile captured with CUDA graph node tracing and gated before and after the run.
Assisted-by: Codex:gpt-5
Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates.
Assisted-by: Codex:gpt-5
Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot.
Assisted-by: Codex:gpt-5
Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons.
Assisted-by: Codex:gpt-5
Update the paged parity handoff to reflect the current fork head, patch count, mirror invariant, current serving harness, and LocalAI AI-attribution policy after Phases 20-22.
Assisted-by: Codex:gpt-5
Record the Phase 22 strict git-apply mirror check proving the LocalAI paged patch series reconstructs the canonical llama.cpp fork tree after patch 0055.
Assisted-by: Codex:gpt-5
Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries.
Assisted-by: Codex:gpt-5
Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.
Assisted-by: Codex:gpt-5
Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.
Assisted-by: Codex:gpt-5
Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.
Assisted-by: Codex:gpt-5
Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.
Assisted-by: Codex:gpt-5
Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work.
Assisted-by: Codex:gpt-5
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.
Assisted-by: Codex:gpt-5
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.
Assisted-by: Codex:gpt-5
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.
Assisted-by: Codex:gpt-5
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.
Assisted-by: Codex:gpt-5
Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut.
Assisted-by: Codex:gpt-5
Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase.
Assisted-by: Codex:gpt-5
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.
Assisted-by: Codex:gpt-5
Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work.
Assisted-by: Codex:gpt-5
Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.
Assisted-by: Codex:gpt-5
Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification.
Assisted-by: Codex:gpt-5
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.
Assisted-by: Codex:gpt-5
Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Add the read-only DGX artifact review for the Phase 0 parity reopen, including supported paged measurements and missing vLLM difference-method evidence.
Assisted-by: Codex:gpt-5
Add the clean llama.cpp fork state, base merge point, patch count, and tree-match result for the Phase 0 parity reopen workflow.
Assisted-by: Codex:gpt-5
Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries.
Assisted-by: Codex:gpt-5
Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.
Assisted-by: Codex:gpt-5
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for all paged-backend kernel/patch work. Always update it FIRST: commit
the change on the fork branch and push it, then regenerate the LocalAI patch
series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via
git format-patch so the series is a 1:1 drift-free mirror of the branch. Never
edit the LocalAI patch files directly, and never add a patch with no
corresponding fork-branch commit. The series is a derivative; the fork is the
source. The fork branch is also where the build and the per-path bit-exact md5
gate actually run, so it is the only place a change is truly validated.
Codified in two places:
- .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow
(MANDATORY)" section at the top of the patch/pin-sync material, plus the
"Encapsulating your work" bullet now points at it.
- backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the
hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale
numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the
f32-only M5 which is worktree 0047).
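The fork-first ordering described above can be sketched as a small helper that
only builds the git command sequence. This is an illustrative sketch, not the
project's tooling: the function name, the remote name "origin", and the
base_ref argument are assumptions (the pinned base is not stated in this
message).

```python
from pathlib import Path

# Branch and series path as named in the commit message above.
FORK_BRANCH = "localai-paged"
SERIES_DIR = Path("backend/cpp/llama-cpp-localai-paged/patches/paged")


def fork_first_commands(base_ref: str, message: str,
                        series_dir: Path = SERIES_DIR) -> list[list[str]]:
    """Return the git commands, in order, for one fork-first change.

    base_ref is a placeholder for the pinned upstream base; "origin" is an
    assumed remote name for the fork.
    """
    return [
        # 1. Commit and push on the fork branch FIRST.
        ["git", "commit", "-a", "-m", message],
        ["git", "push", "origin", FORK_BRANCH],
        # 2. Regenerate the LocalAI series from the fork; never hand-edit it.
        ["git", "format-patch", f"{base_ref}..{FORK_BRANCH}",
         "-o", str(series_dir)],
    ]


if __name__ == "__main__":
    for cmd in fork_first_commands("BASE_REF", "feat(paged): example change"):
        print(" ".join(cmd))
```

Returning the commands instead of running them keeps the ordering reviewable:
the format-patch step always comes after the fork commit and push.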
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for the paged-backend patch series. This file is the git format-patch
of fork commit 51168c5ee ("feat(paged): fused gated RMSNorm + SiLU gate-mul
CUDA op (patch 0044)"), verified byte-identical to that commit's format-patch
output. The full on-disk series applies clean in numeric order on the pinned
base and the resulting tree is byte-identical to the fork commit tree (tree
hash a73d759350277532a14e853e1fe78f08bbb74ce8), so the LocalAI series is a
drift-free 1:1 mirror of the fork branch.
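A minimal sketch of the tree-hash comparison behind that claim, assuming the
series has already been applied in a scratch checkout. The helper names and
the repository paths are hypothetical; only git rev-parse with the ^{tree}
suffix is relied on.

```python
import string
import subprocess


def tree_hash(repo: str, rev: str) -> str:
    """Return the tree object hash of a revision in the given repository."""
    out = subprocess.run(
        ["git", "-C", repo, "rev-parse", f"{rev}^{{tree}}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def is_drift_free(mirror_tree: str, fork_tree: str) -> bool:
    """True only when both values are full 40-hex hashes and identical."""
    def full_sha1(value: str) -> bool:
        return len(value) == 40 and all(c in string.hexdigits for c in value)

    return full_sha1(mirror_tree) and full_sha1(fork_tree) \
        and mirror_tree.lower() == fork_tree.lower()


if __name__ == "__main__":
    # Hypothetical paths: a checkout with the series applied, and the fork.
    mirror = tree_hash("/path/to/applied-series", "HEAD")
    fork = tree_hash("/path/to/llama-paged-fork", "51168c5ee")
    print("drift-free" if is_drift_free(mirror, fork) else "DRIFT")
```

Comparing tree hashes rather than commit hashes ignores author, date, and
message differences, so only the file contents decide the result.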
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ground-check follow-up to 2431090ff. Two factual corrections:
- Section 7 worktree line had the ahead/behind counts swapped ("25 ahead,
197 behind"); the branch is actually ~199 ahead / 25 behind origin/master.
- Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...);
the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for
dense, which likewise differs from the canonical dense gate 5951a5b4). Both
CDEF values come from combined_definitive.sh's own gate command, not the
canonical bit-exact gate in section 3.3, so neither is sanctioned and both
must be KL-validated.
Everything else in the handoff was verified as accurate: fork branch localai-paged
HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e,
all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all
referenced file/artifact paths.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero
context picking up the GB10 vLLM-parity work. It complements VLLM_PARITY_FINAL.md
(the why/record) with TL;DR state, the hard gates (per-path bit-exact md5,
KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable
operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node
decode-profiling rule that caused 4 wrong analyses), the complete tested-and-
rejected lever map, methodology lessons, the three forward directions, and a
key file/artifact index with the open discrepancies to reconcile.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>