LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	70394364a3	docs(paged): gate MTP rollback safety Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:15:11 +00:00
Ettore Di Giacinto	abc70c209e	docs(paged): close ragged MoE dispatch shortcut Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work. Assisted-by: Codex:gpt-5	2026-07-01 01:57:45 +00:00
Ettore Di Giacinto	2074b4fb5b	docs(paged): reject GDN global Ai32 prototype Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10. Assisted-by: Codex:gpt-5	2026-07-01 01:51:53 +00:00
Ettore Di Giacinto	adabd11919	docs(paged): scope GDN global Ai32 prototype Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan. Assisted-by: Codex:gpt-5	2026-07-01 01:38:51 +00:00
Ettore Di Giacinto	1b5ae227eb	docs(paged): reject GDN M5 QS-early phase Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact. Assisted-by: Codex:gpt-5	2026-07-01 01:29:44 +00:00
Ettore Di Giacinto	3da3b169fb	docs(paged): reject GDN C32 slab phase Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch. Assisted-by: Codex:gpt-5	2026-07-01 01:15:00 +00:00
Ettore Di Giacinto	ff3ad84191	docs(paged): record GDN C32 slab baseline Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut. Assisted-by: Codex:gpt-5	2026-07-01 00:58:54 +00:00
Ettore Di Giacinto	9bbe02c161	fix(paged): gate MTP backend sampling Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase. Assisted-by: Codex:gpt-5	2026-07-01 00:54:25 +00:00
Ettore Di Giacinto	b862e2c568	docs(paged): stop ragged dispatch source shortcut Assisted-by: Codex:gpt-5	2026-07-01 00:42:36 +00:00
Ettore Di Giacinto	b009de0ee0	test(paged): mirror ragged MoE dispatch gate Assisted-by: Codex:gpt-5	2026-07-01 00:41:21 +00:00
Ettore Di Giacinto	89ef3a4020	docs(paged): record ragged MoE profile gate Assisted-by: Codex:gpt-5	2026-07-01 00:35:21 +00:00
Ettore Di Giacinto	ef14748f06	docs(paged): scope ragged MoE dispatch phase Assisted-by: Codex:gpt-5	2026-07-01 00:26:01 +00:00
Ettore Di Giacinto	b6885aa446	docs(paged): reject weighted combine fusion candidate Assisted-by: Codex:gpt-5	2026-07-01 00:20:53 +00:00
Ettore Di Giacinto	4b6fc0fa1c	test(paged): mirror MoE weighted combine gate Assisted-by: Codex:gpt-5	2026-06-30 23:51:52 +00:00
Ettore Di Giacinto	3cf7fa1715	docs(paged): reject swiglu down fusion candidate Assisted-by: Codex:gpt-5	2026-06-30 23:41:38 +00:00
Ettore Di Giacinto	d0fa463eac	test(paged): mirror MoE swiglu down gate Mirror the llama.cpp Phase 7 test gate for the merged MoE gate_up/SWIGLU/down chain and record the DGX md5/op gate evidence. Assisted-by: Codex:gpt-5	2026-06-30 23:20:52 +00:00
Ettore Di Giacinto	34c4b5ce8d	docs(paged): scope phase7 serving candidates Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates. Assisted-by: Codex:gpt-5	2026-06-30 23:12:09 +00:00
Ettore Di Giacinto	b647460dee	docs(paged): record phase6 serving classifier Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work. Assisted-by: Codex:gpt-5	2026-06-30 23:04:15 +00:00
Ettore Di Giacinto	f9e015d8e2	docs(paged): record W4A16 Wq padding rejection Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch. Assisted-by: Codex:gpt-5	2026-06-30 22:23:14 +00:00
Ettore Di Giacinto	85c88320ef	patches(paged): pad W4A16 A shared tile stride Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 22:15:21 +00:00
Ettore Di Giacinto	8b413d1cbd	docs(paged): record W4A16 scale broadcast rejection Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch. Assisted-by: Codex:gpt-5	2026-06-30 22:06:17 +00:00
Ettore Di Giacinto	c5f2545cdd	patches(paged): tune W4A16 grouped tile shape Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 21:57:42 +00:00
Ettore Di Giacinto	d8edc615e7	patches(paged): mirror W4A16 packed metadata Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off. Assisted-by: Codex:gpt-5	2026-06-30 21:21:53 +00:00
Ettore Di Giacinto	1c0709b700	docs(paged): record W4A16 phase1 kill gate Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:40:40 +00:00
Ettore Di Giacinto	337ebb8a37	docs(paged): record phase0 decode repro Record comparable graph-node-traced paged and vLLM decode difference-method artifacts for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:35:43 +00:00
Ettore Di Giacinto	ef5d4af203	docs(paged): record phase0 prefill baseline Record clean-source MoE and dense prefill baselines for the GB10 parity reopen and mark the plan checkpoint complete. Assisted-by: Codex:gpt-5	2026-06-30 20:22:18 +00:00
Ettore Di Giacinto	a9a2efb296	docs(paged): record phase0 clean build gates Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:19:14 +00:00
Ettore Di Giacinto	b1a1b721bd	docs(paged): record GB10 parity artifact gaps Add the read-only DGX artifact review for the Phase 0 parity reopen, including supported paged measurements and missing vLLM difference-method evidence. Assisted-by: Codex:gpt-5	2026-06-30 15:55:16 +00:00
Ettore Di Giacinto	b3cfdfac4a	docs(paged): record GB10 parity source provenance Add the clean llama.cpp fork state, base merge point, patch count, and tree-match result for the Phase 0 parity reopen workflow. Assisted-by: Codex:gpt-5	2026-06-30 15:54:23 +00:00
Ettore Di Giacinto	6ac06734e9	docs(paged): start GB10 parity phase0 record Create the Phase 0 results record for the parity reopen workflow, including preflight, provenance, and baseline sections. Assisted-by: Codex:gpt-5	2026-06-30 15:51:57 +00:00
Ettore Di Giacinto	f8d7b026cf	docs(paged): scope GB10 parity reopen plan Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries. Assisted-by: Codex:gpt-5	2026-06-30 15:44:11 +00:00
Ettore Di Giacinto	de34cd5954	docs(paged): refresh parity handoff state Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight. Assisted-by: Codex:gpt-5	2026-06-30 15:27:44 +00:00
Ettore Di Giacinto	1b9176c2c8	docs(paged): codify fork-first patch workflow as mandatory policy The fork mudler/llama.cpp branch localai-paged is the canonical source of truth for all paged-backend kernel/patch work. Always update it FIRST: commit the change on the fork branch and push it, then regenerate the LocalAI patch series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via git format-patch so the series is a 1:1 drift-free mirror of the branch. Never edit the LocalAI patch files directly, and never add a patch with no corresponding fork-branch commit. The series is a derivative; the fork is the source. The fork branch is also where the build and the per-path bit-exact md5 gate actually run, so it is the only place a change is truly validated. Codified in two places: - .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow (MANDATORY)" section at the top of the patch/pin-sync material, plus the "Encapsulating your work" bullet now points at it. - backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the f32-only M5 which is worktree 0047). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 15:12:36 +00:00
Ettore Di Giacinto	8bb47e5a8a	docs(paged): correct PARITY_HANDOFF ahead/behind + note dense CDEF gate md5 Ground-check follow-up to `2431090ff`. Two factual corrections: - Section 7 worktree line had the ahead/behind counts swapped ("25 ahead, 197 behind"); the branch is actually ~199 ahead / 25 behind origin/master. - Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...); the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for dense, which likewise differs from the canonical dense gate 5951a5b4). Both CDEF values come from combined_definitive.sh's own gate command, not the canonical bit-exact gate in section 3.3, so neither is sanctioned and both must be KL-validated. Everything else in the handoff verified accurate: fork branch localai-paged HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e, all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all referenced file/artifact paths. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:49:06 +00:00
Ettore Di Giacinto	2431090ff3	docs(paged): future-agent vLLM-parity HANDOFF guide (GB10, how-to companion to FINAL) Adds docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md (the why/record) with TL;DR state, the hard gates (per-path bit-exact md5, KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node decode-profiling rule that caused 4 wrong analyses), the complete tested-and- rejected lever map, methodology lessons, the three forward directions, and a key file/artifact index with the open discrepancies to reconcile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:42:44 +00:00
Ettore Di Giacinto	baf1025245	docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced) The decode-serving section characterized the high-N gap as "BW-floored, vLLM pays equally / 56-68%". A clean uncontended graph-node-traced profile (dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling artifact: decode runs as a replayed CUDA graph, and nsys without --cuda-graph-trace=node collapses each replay into one opaque launch, so every prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient") was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16 difference method. Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM 254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled: vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718 (61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch). The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4% (ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling), norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers: Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block GDN occupancy to -1.04% (wave-hidden). Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real); DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a measurement/operating-point artifact). GB10-vs-datacenter framing kept. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:16:06 +00:00
Ettore Di Giacinto	6edbb56b06	docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED) Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid models. Captures the definitive same-session both-engine benchmark (prefill S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for both the MoE 35B-A3B and dense 27B models), the complete lever map (every prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality, GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on GB10, not missing optimizations; the path to parity is datacenter Blackwell. Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP, DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are marked estimated. Add a section-9 summary + link in the backend README. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 11:57:36 +00:00
Ettore Di Giacinto	042deab40e	docs(paged): vLLM-parity lever map + tensor-core GDN build plan (both-engine profile-validated) Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 06:15:10 +00:00
Ettore Di Giacinto	b028c81eda	docs(paged): record padded/fixed-slot decode shape as tested-and-rejected The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap. Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream. Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing. Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0. Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load; (3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime. The residual burst gap is GPU-compute, which a host-side reuse lever cannot close. Patch series unchanged: this rejected lever is NOT added to patches/paged/. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 20:47:43 +00:00
Ettore Di Giacinto	d706980c2b	feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041) Add the two decode-serving graph-reuse levers (validated on GB10) that close the host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real continuous serving while tying it in static batched-bench). - 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5 byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%. - 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse. Default-off policy on top of 0016; bit-exact (per-stream independent). S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%); S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table + DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 18:04:28 +00:00
Ettore Di Giacinto	000705321f	feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off) Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet) tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the validated, bit-exact-gated mechanism and records the honest GB10 result: it is a regression, so it ships default-off (== stock), mirroring the patch-0017 default-off discipline. Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda gains the NVFP4 case. Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4 large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for short prefill (5951a5b4, decode untouched) and for a >threshold prefill that exercises the bf16 path (5f3967df - no greedy argmax flips). Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99 -> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42 (-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here: FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate, only partially amortizing as M grows. Default-off keeps stock S_PP (966.98). Phase 2 (MoE grouped large-M) not implemented: it inherits the same bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS would regress for the same reason; a real prefill GEMM win needs option (b), a native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:42:15 +00:00
Ettore Di Giacinto	4bdd26a7f0	docs(paged): scope tensor-core (mma) chunked GDN prefill kernel Scopes the follow-up recorded by patch 0031 + README section 5: replace the serial per-thread reductions of the chunked gated-DeltaNet prefill scan with mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the path that would beat the tuned sequential scan and close the GDN prefill bucket toward vLLM's ~2.5x-cheaper chunked scan. Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a data-center only); the usable path is the extended mma.sync family. So the kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma' shorthand reads as mma only on this part. Design: register-resident state frees the 64KB that forced C=16, admitting C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32 precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the bounded de-gating; A-inverse via blocked forward substitution (FLA UT transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW ~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not. Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains beat the sequential scan and close most of the bucket but not full sm_100-class parity. KL-gate binding (NMSE likely fails at reduced precision). Phased: re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state -> occupancy/cp.async; opt-in default-OFF until A/B-proven. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:23:51 +00:00
Ettore Di Giacinto	9a28f23134	docs(paged): scope the continuous-serving decode gap (host-bound, design-only) Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains ~5.9. Scope shows the gap is the scheduler/host loop, not the kernel. Root-cause hypothesis from source: continuous batching's batch-shape + seq-set churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/ allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, NOT decode-step graph reuse, which is why the serving gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was static-regime and is reopened here for serving only. Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/ get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel code; no GPU run in this pass. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:14:51 +00:00
Ettore Di Giacinto	11128cb080	docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only) Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than vLLM's marlin/cutlass large-M path (~51% of the prefill gap). Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first, MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path (falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4). Design only - no kernel, no GPU run. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-28 16:42:23 +00:00
Ettore Di Giacinto	4cd90bfae9	paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit) The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit. Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent). Removed: - patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch - the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads) - the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention. The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 16:06:06 +00:00
Ettore Di Giacinto	c51ff4cec9	docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm) Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged backend's accelerator-portable wins off the CUDA family. It builds on (does not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion detail (benefit #1), and adds: - Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011): new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants are full). Universal constraint: only the vec/scalar decode kernel admits the per-cell indexed read, so route block-table ops onto vec (as CUDA's 0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path contiguous-only. This is the lever that flips paged KV from neutral-to-slightly-negative to non-negative off CUDA. - Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable win - host-side update_slots() policy, zero kernel work, runs on any accelerator as-is. - Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only); flags the backend-agnostic analogues of the act-quant dedup and the graph-coverage lever without over-claiming a port. - A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port); FP4-MMA stays Blackwell-only. Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they bundle into one per-backend PR behind a shared ops-first PR. Cross-link added from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no Metal/SYCL/Vulkan/ROCm hardware here). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 08:34:32 +00:00
Ettore Di Giacinto	0b84fda496	docs(paged): add the bf16-tau opt-in line to the decode plots Per request, the plots now show all four series: llama.cpp (standard), vLLM, LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau (opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:25:02 +00:00
Ettore Di Giacinto	1431f72b92	docs(paged): regenerate decode plots (3-way) from re-measured data + overview Rebuild the two committed decode plots from the re-measured CSV and add a combined overview. Three series per the comparison that matters: llama.cpp (standard) vs vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau stays out of the plot (it remains in the CSV + the README table as the opt-in row). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:20:12 +00:00
Ettore Di Giacinto	3466094c68	docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau) Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs: - stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated). - patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE). - patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p). final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows). peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session. Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:05:59 +00:00
Ettore Di Giacinto	ed5eb705c7	docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7 The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md. Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 21:34:10 +00:00

1 2

52 Commits