Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons.
Assisted-by: Codex:gpt-5
Update the paged parity handoff to the current fork head, patch count, mirror invariant, current serving harness, and LocalAI AI-attribution policy after Phases 20-22.
Assisted-by: Codex:gpt-5
Record the Phase 22 strict git-apply mirror check proving the LocalAI paged patch series reconstructs the canonical llama.cpp fork tree after patch 0055.
Assisted-by: Codex:gpt-5
Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries.
Assisted-by: Codex:gpt-5
Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.
Assisted-by: Codex:gpt-5
Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.
Assisted-by: Codex:gpt-5
Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.
Assisted-by: Codex:gpt-5
Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.
Assisted-by: Codex:gpt-5
Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work.
Assisted-by: Codex:gpt-5
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.
Assisted-by: Codex:gpt-5
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.
Assisted-by: Codex:gpt-5
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.
Assisted-by: Codex:gpt-5
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.
Assisted-by: Codex:gpt-5
Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut.
Assisted-by: Codex:gpt-5
Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase.
Assisted-by: Codex:gpt-5
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.
Assisted-by: Codex:gpt-5
Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work.
Assisted-by: Codex:gpt-5
Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.
Assisted-by: Codex:gpt-5
Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification.
Assisted-by: Codex:gpt-5
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.
Assisted-by: Codex:gpt-5
Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Add the read-only DGX artifact review for the Phase 0 parity reopen, including supported paged measurements and missing vLLM difference-method evidence.
Assisted-by: Codex:gpt-5
Add the clean llama.cpp fork state, base merge point, patch count, and tree-match result for the Phase 0 parity reopen workflow.
Assisted-by: Codex:gpt-5
Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries.
Assisted-by: Codex:gpt-5
Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.
Assisted-by: Codex:gpt-5
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for all paged-backend kernel/patch work. Always update it FIRST: commit
the change on the fork branch and push it, then regenerate the LocalAI patch
series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via
git format-patch so the series is a 1:1 drift-free mirror of the branch. Never
edit the LocalAI patch files directly, and never add a patch with no
corresponding fork-branch commit. The series is a derivative; the fork is the
source. The fork branch is also where the build and the per-path bit-exact md5
gate actually run, so it is the only place a change is truly validated.
Codified in two places:
- .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow
(MANDATORY)" section at the top of the patch/pin-sync material, plus the
"Encapsulating your work" bullet now points at it.
- backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the
hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale
numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the
f32-only M5 which is worktree 0047).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ground-check follow-up to 2431090ff. Two factual corrections:
- Section 7 worktree line had the ahead/behind counts swapped ("25 ahead,
197 behind"); the branch is actually ~199 ahead / 25 behind origin/master.
- Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...);
the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for
dense, which likewise differs from the canonical dense gate 5951a5b4). Both
CDEF values come from combined_definitive.sh's own gate command, not the
canonical bit-exact gate in section 3.3, so neither is sanctioned and both
must be KL-validated.
Everything else in the handoff verified accurate: fork branch localai-paged
HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e,
all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all
referenced file/artifact paths.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero
context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md
(the why/record) with TL;DR state, the hard gates (per-path bit-exact md5,
KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable
operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node
decode-profiling rule that caused 4 wrong analyses), the complete tested-and-
rejected lever map, methodology lessons, the three forward directions, and a
key file/artifact index with the open discrepancies to reconcile.
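The per-path bit-exact md5 gate named above can be sketched as follows. This is a minimal illustration, not the gate command from the handoff: it assumes the greedy transcript has already been captured as bytes, and the helper name `md5_gate` is hypothetical. Only the idea (hash the transcript, compare against the recorded digest or its quoted prefix) comes from the text.

```python
import hashlib


def md5_gate(transcript: bytes, expected: str) -> bool:
    """Return True when the transcript's md5 matches the recorded digest.

    `expected` may be a full 32-character hex digest or a shorter prefix,
    since the commit messages quote digests in abbreviated form.
    How the transcript is produced is outside this sketch (assumption).
    """
    digest = hashlib.md5(transcript).hexdigest()
    return digest.startswith(expected.lower())


if __name__ == "__main__":
    # Illustrative input only; not a real model transcript.
    print(md5_gate(b"hello", "5d41402a"))
```

A prefix comparison is used here only because the log abbreviates digests; a real gate would presumably compare the full digest.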
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The decode-serving section characterized the high-N gap as "BW-floored, vLLM
pays equally / 56-68%". A clean uncontended graph-node-traced profile
(dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling
artifact: decode runs as a replayed CUDA graph, and nsys without
--cuda-graph-trace=node collapses each replay into one opaque launch, so every
prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient")
was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16
difference method.
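The difference method can be illustrated with a short sketch. It assumes each profiled run costs a fixed overhead plus a per-token decode cost, so subtracting the ntg=16 run from the ntg=64 run cancels the fixed part. The function name and the sample timings are hypothetical; only the 64-minus-16 subtraction comes from the text.

```python
def steady_state_us_per_token(total_us_short: float,
                              total_us_long: float,
                              ntg_short: int = 16,
                              ntg_long: int = 64) -> float:
    """Per-token decode cost with run-constant overhead subtracted out.

    total_us_* are total measured times (microseconds) for runs that
    generate ntg_short and ntg_long tokens respectively. What exactly is
    summed into those totals is an assumption of this sketch.
    """
    if ntg_long <= ntg_short:
        raise ValueError("ntg_long must exceed ntg_short")
    return (total_us_long - total_us_short) / (ntg_long - ntg_short)


if __name__ == "__main__":
    # Made-up numbers: 5000 us fixed overhead + 100 us per generated token.
    short_run = 5000 + 100 * 16
    long_run = 5000 + 100 * 64
    print(steady_state_us_per_token(short_run, long_run))
```

Because the fixed overhead appears in both runs, it drops out of the result regardless of its size.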
Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. Per-token
cost in us/tok: GDN recurrent scan 553 (51%, linear in batch, dominant), NVFP4
expert GEMM 254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap
reconciled: vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill
overlap inflates its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) ->
llama-server 718 (61% of vLLM-server; the ~17pt drop from 924 is the
S3-recoverable serving graph-reuse overhead). So vs vLLM's true GPU-steady decode
we are ~86%, not 56%. GDN is a shared BW floor where paged leads (83% vs 79% of
273 GB/s peak; both 1.17-1.18x for 2x batch).
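The gap reconciliation is plain ratio arithmetic on the four quoted throughput figures, which the sketch below reproduces. The variable and key names are illustrative; the numbers are taken verbatim from the text.

```python
# Throughput figures exactly as quoted in the commit message.
VLLM_SERVER = 1177
VLLM_GPU_STEADY = 1078
LLAMA_GPU_STEADY = 924
LLAMA_SERVER = 718


def pct(value: float, reference: float) -> float:
    """Express value as a percentage of reference."""
    return 100.0 * value / reference


def reconcile() -> dict:
    """Recompute the percentages quoted in the gap reconciliation."""
    server = pct(LLAMA_SERVER, VLLM_SERVER)
    return {
        # llama GPU-steady as a share of vLLM true GPU-steady (~86%)
        "steady_vs_steady": pct(LLAMA_GPU_STEADY, VLLM_GPU_STEADY),
        # points by which chunked-prefill overlap inflates vLLM-server (~8pt)
        "vllm_overlap_pt": 100.0 - pct(VLLM_GPU_STEADY, VLLM_SERVER),
        # llama-server as a share of vLLM-server (~61%)
        "server_vs_server": server,
        # serving overhead in points of vLLM-server (~17pt)
        "serving_overhead_pt": pct(LLAMA_GPU_STEADY, VLLM_SERVER) - server,
    }


if __name__ == "__main__":
    for name, value in reconcile().items():
        print(f"{name}: {value:.1f}")
```

Expressing every step against the same vLLM-server denominator is what makes the ~8pt and ~17pt figures directly comparable.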
The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton
elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4%
(ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling),
norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers:
Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from
hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block
GDN occupancy to -1.04% (wave-hidden).
Revised verdict: PREFILL genuinely capped (36-43% of vLLM; prefill is not
graph-replayed, so that number is real); DECODE-SERVING near-parity at ~86% of
vLLM true GPU-steady (the 56% headline was a measurement/operating-point
artifact). GB10-vs-datacenter framing kept.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the
exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid
models. Captures the definitive same-session both-engine benchmark (prefill
S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for
both the MoE 35B-A3B and dense 27B models), the complete lever map (every
prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict
and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality,
GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the
shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on
GB10, not missing optimizations; the path to parity is datacenter Blackwell.
Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the
marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP,
DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are
marked estimated. Add a section-9 summary + link in the backend README.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push
serving graph reuse from ~72% toward ~100%) was implemented in an isolated
worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10.
Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving
throughput at every concurrency and does not close the vLLM gap.
Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step
(n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so
n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a
release()-side guard keeps a finished slot warm under padding. Each dummy is its
own sequence (private recurrent state, per-stream paged attention, logits
discarded), so it cannot perturb a real stream.
Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The
literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA
decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF
diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed:
ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly
(MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.
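The inertness gate compares ON-vs-OFF agreement against the OFF-vs-OFF noise floor. The sketch below shows one way to compute such a gate. The commit does not define the prefix-agreement metric, so the definition used here (shared token prefix length divided by the longer stream's length, averaged over streams) is an assumption, as are all function names and the tolerance.

```python
def prefix_agreement(a: list, b: list) -> float:
    """Fraction of the longer stream covered by the common token prefix."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / longest


def mean_prefix_agreement(run_a: list, run_b: list) -> float:
    """Average per-stream prefix agreement between two runs."""
    if not run_a or len(run_a) != len(run_b):
        raise ValueError("runs must be non-empty with equal stream counts")
    pairs = zip(run_a, run_b)
    return sum(prefix_agreement(a, b) for a, b in pairs) / len(run_a)


def inertness_gate(off_1: list, off_2: list, on: list,
                   tol: float = 1e-3) -> bool:
    """Pass when ON-vs-OFF agreement is no worse than the OFF-vs-OFF floor."""
    noise_floor = mean_prefix_agreement(off_1, off_2)
    on_vs_off = mean_prefix_agreement(off_1, on)
    return on_vs_off >= noise_floor - tol


if __name__ == "__main__":
    off_1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
    off_2 = [[1, 2, 3, 4], [5, 6, 0, 0]]   # run-to-run noise in stream 2
    on = [[1, 2, 3, 4], [5, 6, 9, 9]]      # diverges no earlier than noise
    print(inertness_gate(off_1, off_2, on))
```

Gating against a measured noise floor rather than strict identity is the design point: it stays meaningful when the baseline itself is not reproducible run to run.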
Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16
/ PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered
aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD
(reuse 38%) 558.0.
Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound
- baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys
~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding
adds dummy-row compute proportional to pad_width - real_load, catastrophic at low
load; (3) in continuous serving padding cannot hold a constant width (perpetual
prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink
that padding prevents is itself a throughput win in a compute-bound regime. The
residual burst gap is GPU-compute, which a host-side reuse lever cannot close.
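Point (2) and the bench numbers can be put side by side in a small sketch. The throughput ratios use only figures quoted above; the pad width in the dummy-row example is hypothetical, since the commit does not state the width that was used.

```python
def dummy_rows(pad_width: int, real_load: int) -> int:
    """Extra decode rows emitted when padding a step to a constant width."""
    if real_load > pad_width:
        raise ValueError("real_load cannot exceed pad_width")
    return pad_width - real_load


def wasted_fraction(pad_width: int, real_load: int) -> float:
    """Share of a padded step spent on masked dummy rows."""
    return dummy_rows(pad_width, real_load) / pad_width


# Bench figures quoted in the commit message (S1+S3 baseline vs PAD).
BURST_N8 = {"s1s3": 28.16, "pad": 6.05}
BURST_N128 = {"s1s3": 4.53, "pad": 4.32}
STAGGERED = {"s1s3": 763.3, "pad": 558.0}


def pad_ratio(bench: dict) -> float:
    """PAD throughput as a fraction of the S1+S3 throughput."""
    return bench["pad"] / bench["s1s3"]


if __name__ == "__main__":
    # Hypothetical pad width of 128 slots with 8 real streams.
    print(f"wasted at low load: {wasted_fraction(128, 8):.2%}")
    for name, bench in [("n=8", BURST_N8), ("n=128", BURST_N128),
                        ("staggered", STAGGERED)]:
        print(f"{name}: PAD at {pad_ratio(bench):.1%} of S1+S3")
```

The ratios show the regression is worst at low load, consistent with dummy-row compute scaling with the gap between pad width and real load.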
Patch series unchanged: this rejected lever is NOT added to patches/paged/.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add the two decode-serving graph-reuse levers (validated on GB10) that close the
host-bound serving gap: in real continuous serving paged dropped to ~3.7
tok/s/seq vs vLLM's ~5.9, while tying it in static batched-bench.
- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode
llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml
graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape
can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5
byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode
steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse.
Default-off policy on top of 0016; bit-exact (per-stream independent).
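The idea behind a bucketed-shape reuse check can be modeled in a few lines. This is a conceptual sketch, not the patch: the real change overrides llm_graph_input_i::can_reuse in C++, and which dimension is bucketed is not stated in the text. The `DecodeShape` type, its fields, and the choice to bucket a KV-length-like dimension are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

BUCKET = 256  # bucket granularity named in the commit message


def bucket(n: int) -> int:
    """Round n up to the next multiple of BUCKET."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return ((n + BUCKET - 1) // BUCKET) * BUCKET


@dataclass(frozen=True)
class DecodeShape:
    """Hypothetical shape key for one decode step."""
    n_tokens: int  # rows in the step; must match exactly
    n_kv: int      # growing dimension; compared after bucketing (assumed)

    def key(self) -> tuple:
        return (self.n_tokens, bucket(self.n_kv))


def can_reuse(prev: Optional[DecodeShape], cur: DecodeShape) -> bool:
    """A previous graph is reusable when the bucketed shape is unchanged."""
    return prev is not None and prev.key() == cur.key()


if __name__ == "__main__":
    prev = DecodeShape(n_tokens=8, n_kv=1000)
    print(can_reuse(prev, DecodeShape(n_tokens=8, n_kv=1001)))
    print(can_reuse(prev, DecodeShape(n_tokens=8, n_kv=1025)))
```

Bucketing trades a little over-allocation for stability: the key changes once per 256 steps of growth instead of every step, which is what lets consecutive decode steps share one graph.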
S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph
reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq
median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%);
S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put
set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table +
DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>