LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 12:57:02 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	85c88320ef	patches(paged): pad W4A16 A shared tile stride Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 22:15:21 +00:00
Ettore Di Giacinto	8b413d1cbd	docs(paged): record W4A16 scale broadcast rejection Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch. Assisted-by: Codex:gpt-5	2026-06-30 22:06:17 +00:00
Ettore Di Giacinto	c5f2545cdd	patches(paged): tune W4A16 grouped tile shape Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 21:57:42 +00:00
Ettore Di Giacinto	d8edc615e7	patches(paged): mirror W4A16 packed metadata Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off. Assisted-by: Codex:gpt-5	2026-06-30 21:21:53 +00:00
Ettore Di Giacinto	1c0709b700	docs(paged): record W4A16 phase1 kill gate Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:40:40 +00:00
Ettore Di Giacinto	337ebb8a37	docs(paged): record phase0 decode repro Record comparable graph-node-traced paged and vLLM decode difference-method artifacts for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:35:43 +00:00
Ettore Di Giacinto	ef5d4af203	docs(paged): record phase0 prefill baseline Record clean-source MoE and dense prefill baselines for the GB10 parity reopen and mark the plan checkpoint complete. Assisted-by: Codex:gpt-5	2026-06-30 20:22:18 +00:00
Ettore Di Giacinto	a9a2efb296	docs(paged): record phase0 clean build gates Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:19:14 +00:00
Ettore Di Giacinto	b1a1b721bd	docs(paged): record GB10 parity artifact gaps Add the read-only DGX artifact review for the Phase 0 parity reopen, including supported paged measurements and missing vLLM difference-method evidence. Assisted-by: Codex:gpt-5	2026-06-30 15:55:16 +00:00
Ettore Di Giacinto	b3cfdfac4a	docs(paged): record GB10 parity source provenance Add the clean llama.cpp fork state, base merge point, patch count, and tree-match result for the Phase 0 parity reopen workflow. Assisted-by: Codex:gpt-5	2026-06-30 15:54:23 +00:00
Ettore Di Giacinto	6ac06734e9	docs(paged): start GB10 parity phase0 record Create the Phase 0 results record for the parity reopen workflow, including preflight, provenance, and baseline sections. Assisted-by: Codex:gpt-5	2026-06-30 15:51:57 +00:00
Ettore Di Giacinto	f8d7b026cf	docs(paged): scope GB10 parity reopen plan Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries. Assisted-by: Codex:gpt-5	2026-06-30 15:44:11 +00:00
Ettore Di Giacinto	de34cd5954	docs(paged): refresh parity handoff state Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight. Assisted-by: Codex:gpt-5	2026-06-30 15:27:44 +00:00
Ettore Di Giacinto	1b9176c2c8	docs(paged): codify fork-first patch workflow as mandatory policy The fork mudler/llama.cpp branch localai-paged is the canonical source of truth for all paged-backend kernel/patch work. Always update it FIRST: commit the change on the fork branch and push it, then regenerate the LocalAI patch series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via git format-patch so the series is a 1:1 drift-free mirror of the branch. Never edit the LocalAI patch files directly, and never add a patch with no corresponding fork-branch commit. The series is a derivative; the fork is the source. The fork branch is also where the build and the per-path bit-exact md5 gate actually run, so it is the only place a change is truly validated. Codified in two places: - .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow (MANDATORY)" section at the top of the patch/pin-sync material, plus the "Encapsulating your work" bullet now points at it. - backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the f32-only M5 which is worktree 0047). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 15:12:36 +00:00
Ettore Di Giacinto	2033086f60	patches(paged): track 0044 GatedRMSNorm patch, sync LocalAI series to fork 51168c5 The fork mudler/llama.cpp branch localai-paged is the canonical source of truth for the paged-backend patch series. This file is the git format-patch of fork commit 51168c5ee ("feat(paged): fused gated RMSNorm + SiLU gate-mul CUDA op (patch 0044)"), verified byte-identical to that commit's format-patch output. The full on-disk series applies clean in numeric order on the pinned base and the resulting tree is byte-identical to the fork commit tree (tree hash a73d759350277532a14e853e1fe78f08bbb74ce8), so the LocalAI series is a drift-free 1:1 mirror of the fork branch. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 15:10:13 +00:00
Ettore Di Giacinto	8bb47e5a8a	docs(paged): correct PARITY_HANDOFF ahead/behind + note dense CDEF gate md5 Ground-check follow-up to `2431090ff`. Two factual corrections: - Section 7 worktree line had the ahead/behind counts swapped ("25 ahead, 197 behind"); the branch is actually ~199 ahead / 25 behind origin/master. - Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...); the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for dense, which likewise differs from the canonical dense gate 5951a5b4). Both CDEF values come from combined_definitive.sh's own gate command, not the canonical bit-exact gate in section 3.3, so neither is sanctioned and both must be KL-validated. Everything else in the handoff verified accurate: fork branch localai-paged HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e, all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all referenced file/artifact paths. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:49:06 +00:00
Ettore Di Giacinto	2431090ff3	docs(paged): future-agent vLLM-parity HANDOFF guide (GB10, how-to companion to FINAL) Adds docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md (the why/record) with TL;DR state, the hard gates (per-path bit-exact md5, KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node decode-profiling rule that caused 4 wrong analyses), the complete tested-and- rejected lever map, methodology lessons, the three forward directions, and a key file/artifact index with the open discrepancies to reconcile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:42:44 +00:00
Ettore Di Giacinto	baf1025245	docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced) The decode-serving section characterized the high-N gap as "BW-floored, vLLM pays equally / 56-68%". A clean uncontended graph-node-traced profile (dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling artifact: decode runs as a replayed CUDA graph, and nsys without --cuda-graph-trace=node collapses each replay into one opaque launch, so every prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient") was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16 difference method. Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM 254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled: vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718 (61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch). The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4% (ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling), norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers: Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block GDN occupancy to -1.04% (wave-hidden). Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real); DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a measurement/operating-point artifact). GB10-vs-datacenter framing kept. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:16:06 +00:00
Ettore Di Giacinto	6edbb56b06	docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED) Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid models. Captures the definitive same-session both-engine benchmark (prefill S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for both the MoE 35B-A3B and dense 27B models), the complete lever map (every prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality, GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on GB10, not missing optimizations; the path to parity is datacenter Blackwell. Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP, DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are marked estimated. Add a section-9 summary + link in the backend README. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 11:57:36 +00:00
Ettore Di Giacinto	bd100dd20a	fix(paged): repair the patch series, sync to the fork branch (drop dev-tree 0044/0045, add f32-only M5 as 0047) The 0044/0045 patches were exported from the old bf16/hybrid dev tree and no longer apply on the f32-only series (0026 ssm_bf16_tau is dropped), so the build broke at `git apply`. Re-sync the vendored series to the now feature-complete fork branch mudler/llama.cpp:localai-paged, which is the canonical source (pin 0ed235ea + the paged patch commits in order). - git rm the dev-tree-based 0044 (GDN M5, bf16-machinery base) and 0045 (Marlin W4A16 offline-repack, not part of the fork branch). - Add the fork branch's newest commit (2c32ab8b7, "GDN M5 tensor-core chunked-scan prefill, f32-only re-port") as 0047, generated with a single git format-patch off that branch. It sequences after 0046 (its parent on the branch) and recovers the prefill win 0044 encoded (+3.5% S_PP @npp512, +17.7% @npp2048), bit-exact per-path (test-backend-ops GATED_DELTA_NET 46/46 default and force-M5; greedy md5 default-on == M5-forced == canonical). - Track patch 0046 (dense-prefill geometry gate), which was on disk but never committed, so the series is complete in git. - README: patch-table header 0001-0046 -> 0001-0047, replace the 0044 row with the f32-only 0047 row, fix the dangling 0044 prose references, note the bf16 M6/M7/M8 variants are not part of this f32-only series, and add a maintenance bullet that the series is now generated from the fork branch so there is no more patch-export drift. Verified: on a pristine llama.cpp at pin 0ed235ea the full series 0001-0043, 0046, 0047 applies clean in sorted order with the Makefile's exact `git apply --verbose` method (37/37 OK), and the resulting tree is byte-identical to the fork branch tip 2c32ab8b7. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 07:54:46 +00:00
Ettore Di Giacinto	be65438eac	docs(paged): record MoE-prefill engine-gap decomposition + GEMM-port negatives (default-off) nsys cross-engine decomposition: the MoE prefill 64% gap vs vLLM is engine plumbing, not the kernel (GPU 97% busy, 443 vs 197 us/tok). Three buckets: per-expert W4A4 M-fragmentation (58%), GDN scan (24%), f32<->bf16 casts (15%). Offline-repack (0045) and verbatim vLLM-marlin port both trail FP4-MMQ via wrapper overhead, kept default-off as recorded negatives. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 17:20:07 +00:00
Ettore Di Giacinto	7b38c6b2a3	feat(paged): GDN M5 tensor-core chunked-scan prefill, default-on under paged KV (patch 0044) Land the tensor-core forms of the chunked gated-DeltaNet prefill scan (0031) as a single GDN_TC-selected build and ship the M5 variant (full TC form-T solve + state-update mma) default-ON when LLAMA_KV_PAGED is set. The dispatch defaults GDN_TC=5 and GDN_CHUNK_MIN=64 under paged KV (both env-overridable; OFF/INT_MAX when not paged, so stock/non-paged stays regression-free). GDN_CHUNK_MIN is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence; 64 was tuned from a {1,32,64,128,256} sweep (32/64/128 all win on prefill, 256 barely fires because the MoE-prefill per-call count is < 256, 1 collapses decode S_TG ~25%). Measured GB10, q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1, llama-batched-bench -ngl 99 -fa on -ntg 4 -npl 32: -npp 512 S_PP 2208.96 -> 2286.5 t/s (+3.5%, mean of 3 interleaved A/B) -npp 2048 S_PP 2021.5 -> 2379.8 t/s (+17.7%) Decode S_TG unchanged (~399 vs ~397 t/s, within noise). Bit-exactness (per-path greedy md5, n=48 --temp 0 --seed 1, paged): default-on == M5-forced == canonical on the gate prompt - MoE 8cb0ce23, dense 5951a5b4. test-backend-ops GATED_DELTA_NET 94/94 vs CPU with M5 forced (incl. multi-chunk up to n_tokens=256). On a long MoE prompt the default (M5 fires at >=64 tokens) and the sequential path agree word-for-word until one benign greedy token-flip; dense is byte-identical. The chunked scan is a NEW per-path result (different FP reduction order), NMSE-validated benign. CUDA-only, gencode arch=compute_121a,code=sm_121a (GB10 / sm_121a). README sections 3 (0044 row, 0031 superseded note) and 5 (dev-notes verdict) updated. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 06:42:11 +00:00
Ettore Di Giacinto	042deab40e	docs(paged): vLLM-parity lever map + tensor-core GDN build plan (both-engine profile-validated) Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 06:15:10 +00:00
Ettore Di Giacinto	c4058eb4da	feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off 0042 fuses the pre-norm residual add into RMSNorm (+0.5% prefill, bit-exact). 0043 makes the full-step MoE decode CUDA graph default-on (+2-4% decode, bit-exact; removes ~18x per-step host kernel re-issue, A/B-confirmed). 0034 (native FP4-MMA W4A4) and 0035 (Marlin-style W4A16 grouped MoE GEMM) are correct + bit-exact but regress vs the int8 FP4-MMQ in-backend on GB10 (bf16 MMA is ~half the int8 rate); shipped default-off as validated mechanisms and recorded negatives per the parity methodology. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 06:15:10 +00:00
Ettore Di Giacinto	f1c98ff0b9	fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B regression) Patch 0041 (LLAMA_PAGED_DECODE_STABLE) was made default-on-when-paged, but a measured end-to-end A/B proved that is a serving mistake. S3 defers prefill admission on the period-8 cadence, which delays prompt admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency. Its apparent decode_agg gain was a metric artifact (faster per-step decode bought by starving prefill). Flip the s3_enabled default so an unset LLAMA_PAGED_DECODE_STABLE means OFF; the mechanism stays available as an explicit opt-in (LLAMA_PAGED_DECODE_STABLE=1) for decode-dominated, low-arrival traffic where TTFT is not a concern. The default now prefers prompt prefill admission for good TTFT. S1 (patch 0040) keeps shipping default-on; only S3's default changes. Re-exports patch 0041 (change folded into its source commit) and updates the README 0041 row plus the decode-serving narrative to record the A/B finding. Greedy md5 gate unchanged (single-sequence llama-completion path, not update_slots): paged MoE 8cb0ce23, dense 5951a5b4. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 05:00:11 +00:00
Ettore Di Giacinto	b028c81eda	docs(paged): record padded/fixed-slot decode shape as tested-and-rejected The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap. Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream. Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing. Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0. Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load; (3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime. The residual burst gap is GPU-compute, which a host-side reuse lever cannot close. Patch series unchanged: this rejected lever is NOT added to patches/paged/. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 20:47:43 +00:00
Ettore Di Giacinto	2fa8ef8fc5	fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV FIX A (patch 0031 compose break): the chunked GDN prefill patch carried '#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but those were introduced by the dropped bf16-tau patch 0026, so on the bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git apply' failed. The same 0026 drop also shifted 0031's later hunks off their context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA && !keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF (GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies clean on 0ed235ea ('git apply --check' green for every patch). FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF. Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in update_slots(); re-exported from the dev tree, README 0041 row updated to match. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 19:37:05 +00:00
Ettore Di Giacinto	d706980c2b	feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041) Add the two decode-serving graph-reuse levers (validated on GB10) that close the host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real continuous serving while tying it in static batched-bench). - 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5 byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%. - 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse. Default-off policy on top of 0016; bit-exact (per-stream independent). S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%); S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table + DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 18:04:28 +00:00
Ettore Di Giacinto	000705321f	feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off) Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet) tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the validated, bit-exact-gated mechanism and records the honest GB10 result: it is a regression, so it ships default-off (== stock), mirroring the patch-0017 default-off discipline. Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda gains the NVFP4 case. Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4 large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for short prefill (5951a5b4, decode untouched) and for a >threshold prefill that exercises the bf16 path (5f3967df - no greedy argmax flips). Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99 -> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42 (-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here: FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate, only partially amortizing as M grows. Default-off keeps stock S_PP (966.98). Phase 2 (MoE grouped large-M) not implemented: it inherits the same bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS would regress for the same reason; a real prefill GEMM win needs option (b), a native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:42:15 +00:00
Ettore Di Giacinto	4bdd26a7f0	docs(paged): scope tensor-core (mma) chunked GDN prefill kernel Scopes the follow-up recorded by patch 0031 + README section 5: replace the serial per-thread reductions of the chunked gated-DeltaNet prefill scan with mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the path that would beat the tuned sequential scan and close the GDN prefill bucket toward vLLM's ~2.5x-cheaper chunked scan. Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a data-center only); the usable path is the extended mma.sync family. So the kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma' shorthand reads as mma only on this part. Design: register-resident state frees the 64KB that forced C=16, admitting C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32 precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the bounded de-gating; A-inverse via blocked forward substitution (FLA UT transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW ~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not. Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains beat the sequential scan and close most of the bucket but not full sm_100-class parity. KL-gate binding (NMSE likely fails at reduced precision). Phased: re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state -> occupancy/cp.async; opt-in default-OFF until A/B-proven. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:23:51 +00:00
Ettore Di Giacinto	9a28f23134	docs(paged): scope the continuous-serving decode gap (host-bound, design-only) Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains ~5.9. Scope shows the gap is the scheduler/host loop, not the kernel. Root-cause hypothesis from source: continuous batching's batch-shape + seq-set churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/ allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, NOT decode-step graph reuse, which is why the serving gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was static-regime and is reopened here for serving only. Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/ get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel code; no GPU run in this pass. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:14:51 +00:00
Ettore Di Giacinto	e610347367	feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031) Adds patch 0031 to the paged llama.cpp series: an FLA-style chunked parallel-scan prefill kernel for gated DeltaNet (the upstream gated_delta_net.cu "Add chunked kernel for even faster pre-fill" TODO). Scope: non-KDA scalar gate, f32 state, final-state-only, homogeneous. Bit-exact-benign (NEW per-path): test-backend-ops GATED_DELTA_NET 91/91 within the 1e-7 NMSE gate vs the CPU reference (patch adds 8 S_v=128 prefill cases: exact-multiple / tail / multi-seq / GQA / permuted); numpy prototype confirms f32 chunked-vs-sequential NMSE ~1e-13. OPT-IN, default OFF: GB10's 99KB dynamic-smem opt-in forces C=16 (the 128x128 f32 state is 64KB of the all-shared layout), pinning the kernel to 1 block/SM with serial dk-reductions. Measured ~761 t/s chunked vs ~971 t/s sequential (~22%% slower) on q36-27b-nvfp4 prefill, so it defaults OFF (enable with GDN_CHUNK_MIN=<n>); the backend default is regression-free. Beating the 84.7%-of-peak sequential scan needs tensor-core matmuls / register-resident state with larger chunks (recorded in README section 5). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:09:38 +00:00
Ettore Di Giacinto	11128cb080	docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only) Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than vLLM's marlin/cutlass large-M path (~51% of the prefill gap). Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first, MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path (falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4). Design only - no kernel, no GPU run. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-28 16:42:23 +00:00
Ettore Di Giacinto	4cd90bfae9	paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit) The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit. Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent). Removed: - patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch - the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads) - the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention. The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 16:06:06 +00:00
Ettore Di Giacinto	2c59805267	fix(paged): rpc cmake target renamed rpc-server -> ggml-rpc-server at pin 0ed235ea llama.cpp renamed the RPC tool target (tools/rpc/CMakeLists.txt: set(TARGET ggml-rpc-server)) at the 0ed235ea pin. master already updated the stock llama-cpp Makefile to match (--target ggml-rpc-server, cp bin/ggml-rpc-server); the paged backend's separate Makefile copy was left stale and its -grpc (RPC) variant failed with 'No rule to make target rpc-server' (grpc-server itself built to 100%). Mirror the stock rename in the paged Makefile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 11:10:07 +00:00
Ettore Di Giacinto	c51ff4cec9	docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm) Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged backend's accelerator-portable wins off the CUDA family. It builds on (does not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion detail (benefit #1), and adds: - Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011): new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants are full). Universal constraint: only the vec/scalar decode kernel admits the per-cell indexed read, so route block-table ops onto vec (as CUDA's 0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path contiguous-only. This is the lever that flips paged KV from neutral-to-slightly-negative to non-negative off CUDA. - Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable win - host-side update_slots() policy, zero kernel work, runs on any accelerator as-is. - Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only); flags the backend-agnostic analogues of the act-quant dedup and the graph-coverage lever without over-claiming a port. - A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port); FP4-MMA stays Blackwell-only. Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they bundle into one per-backend PR behind a shared ops-first PR. Cross-link added from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no Metal/SYCL/Vulkan/ROCm hardware here). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 08:34:32 +00:00
Ettore Di Giacinto	ea72a56e2c	Merge origin/master + pin-sync paged backend to 0ed235ea master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the shared grpc-server.cpp. The paged backend's pin must track the stock pin (the grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged patches apply clean on 0ed235ea (verified against a fresh upstream clone). The bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full grpc-server build verify on GPU/CI to follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 07:56:47 +00:00
Ettore Di Giacinto	1f3e5ba301	fix(paged): serialize both SSM partitions in hybrid bf16-tau state save/restore (patch 0026) The opt-in ssm_bf16_tau hybrid mode splits a gated-DeltaNet layer's recurrent SSM state into an f32 partition (s_l) and a bf16 partition (s_l_bf16). The recurrent state serialization paths (state_write_data / state_read_data) were never updated for the split: they read/wrote s_l using the FULL hparams.n_embd_s() (S_vS_vH) row width, but a split layer's s_l only holds S_vS_vn_f32, so the access overruns the smaller tensor (a ggml_backend tensor read out of bounds), and the bf16 fast-head partition was never persisted at all. This is what broke high-concurrency serving with --ssm-bf16-tau: the server's context-checkpoint feature serializes per-sequence state via state_seq_get_data. With a checkpoint enabled, even a single request triggered the out-of-bounds read; at higher concurrency the cell range starts at a higher base slot so the overrun reaches further (hard abort in a debug build, silent state corruption then 1-token-then-EOS on restore in a release build). The static batched-bench never exercises save/restore so it did not catch it; the GDN decode kernel and per-head partition offsets were already correct (decode with checkpoints disabled is fine at N=8/16/32). Fix: serialize the f32 partition and, when the layer is split, the bf16 partition right after it, each with its OWN row width (tensor ne[0]). head_slot is rebuilt deterministically at load (same model + tau), so it is not serialized. Non-split layers have ne[0] == n_embd_s() and no bf16 partition, so their on-disk format and behavior are byte-identical (the default f32 path and the bit-exact gate are unaffected). Verified on GB10/DGX with Qwen3.6-35B-A3B-NVFP4 + --ssm-bf16-tau 64 via a continuous-batching llama-server: with context checkpoints enabled, N=8, N=16 and N=32 (slot reuse + restore) all now produce full coherent 128-token output and the server stays up; pre-fix the same config aborted on the first checkpoint. Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 07:47:17 +00:00
LocalAI [bot]	d3a26f961d	fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534 ) (#10568 ) * fix(ik-llama): port multimodal path to mtmd API and bump to f96eaddb (#10534) The IK_LLAMA_VERSION bump to f96eaddba8bed6a9a5e628bbf6a566775c70b49c pulls in upstream commit "Prune examples/llava", which deletes examples/llava (clip.* / llava.). The ik-llama backend's grpc-server.cpp built a local `myclip` library from those files and called the removed clip/llava C API, so the bump no longer builds. ik_llama keeps its multimodal stack in the surviving `mtmd` library (examples/mtmd/, public headers mtmd.h + mtmd-helper.h). This ports the backend's multimodal path onto the high-level mtmd_ / mtmd_helper_* API in place, leaving the text path (which still uses ik_llama's retained old common API) untouched: - Makefile: bump IK_LLAMA_VERSION to f96eaddb. - prepare.sh: drop the clip/llava source copy + sed block; mtmd is a library target, no source copy needed. - CMakeLists.txt: remove the `myclip` target; link `mtmd` and add its include dir; build grpc-server as C++17 (mtmd headers require it). - patches: drop 0002 (targeted the deleted examples/llava/clip.cpp; the mtmd clip.cpp never calls ggml_quantize_chunk, so the fix is unneeded). Keep 0001 (verified still applies). - grpc-server.cpp / utils.hpp: replace clip_model_load + clip_image_load_from_bytes + llava_image_embed_make_with_clip_img + the manual [img-N] prefix splitting and per-image llava_embd_batch decode loop with mtmd_init_from_file (moved after the model load, which it requires), mtmd_helper_bitmap_init_from_buf, mtmd_tokenize and mtmd_helper_eval_chunks. Legacy [img-N] tags are translated, in order, into mtmd media markers (mtmd_default_marker()); the post-image suffix text stays on the normal token path so the sampling loop is unchanged. Supersedes #10534. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(ik-llama): align json alias to ordered_json to resolve mtmd.h conflict (#10534) mtmd.h declares `using json = nlohmann::ordered_json` at global scope (and its mtmd.cpp depends on it), while ik_llama's whole server/common stack also uses ordered_json. Our grpc-server.cpp/utils.hpp kept a plain `nlohmann::json` alias, which now collides with mtmd.h once it is included for the multimodal port: "conflicting declaration 'using json = ...'". Switch our two aliases to ordered_json to match; it is API-compatible (utils.hpp already used ordered_json for its log helper) and our json never crosses into an unordered-json API. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 08:57:11 +02:00
LocalAI [bot]	13b1ae53bc	chore: ⬆️ Update ggml-org/llama.cpp to `0ed235ea2c17a19fc8238668653946721ed136fd` (#10536 ) * ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(llama-cpp): link server-stream.cpp TU into grpc-server for upstream 0ed235ea (#10536) Upstream llama.cpp 0ed235ea added an SSE stream-resumption layer in a new translation unit tools/server/server-stream.cpp, which defines stream_session, stream_pipe_producer and the g_stream_sessions manager. server-context.cpp (already #included into grpc-server.cpp) now calls into it via spipe->cleanup(), stream_aware_should_stop() and stream_session_attach_pipe(), so without the new TU the grpc-server link fails on every arch with: undefined reference to `stream_pipe_producer::cleanup()' prepare.sh already copies every tools/server/* file into tools/grpc-server/, so the source is present; the only missing piece was including its definitions. Add an __has_include-guarded #include "server-stream.cpp" before server-context.cpp, mirroring the existing server-chat.cpp and server-schema.cpp guards, keeping the source compatible with older pins/forks that predate the split. The file is self-contained (its only external symbols come from server-common, already in the TU) so it adds no new undefined references; the http route-handler factories it also defines are unused in the grpc path but harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * fix(llama-cpp): build renamed ggml-rpc-server target for upstream 0ed235ea (#10536) Upstream renamed the RPC server CMake target and binary from `rpc-server` to `ggml-rpc-server` (tools/rpc/CMakeLists.txt: `set(TARGET ggml-rpc-server)`), so the RPC-enabled grpc build failed with "No rule to make target 'rpc-server'". The grpc-server itself links fine after the server-stream.cpp fix; this only updates the RPC target name and the binary path copied to llama-cpp-rpc-server. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 08:56:40 +02:00
Ettore Di Giacinto	4da769c1ca	paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (fix amd64/non-arm64 build; compile-only) Vendored paged headers used size_t / uintN_t without including <cstddef> / <cstdint>. The arm64 DGX toolchain provides them transitively so the build passed there, but amd64/older toolchains do not, failing the CI amd64 build one header at a time ('size_t' does not name a type -> cascade). paged-kv-manager.h was already fixed. This adds the missing includes to the remaining vendored headers at the point each is created/rewritten in the patch series so every src/paged.h self-includes both: paged-attn.h (0003): add <cstddef> (had <cstdint>) * paged-alloc.h (0007): add <cstddef> (had <cstdint>) * paged-prefix-api.h (0007): add <cstddef> + <cstdint> (had only llama.h) The .cpp units include their own paged header, so they inherit the includes transitively. Whole series still applies clean on the pinned llama.cpp. Compile-only change: no runtime behavior change, bit-exactness unaffected. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 06:18:56 +00:00
Ettore Di Giacinto	23b11a5239	paged-kv-manager.h: add missing <cstddef> for size_t Fixes cuda-13 amd64 / non-arm64 build where size_t was used without the header (arm64 cuda-13 pulled it in transitively; amd64/cuda-12 toolchains do not). Compile-only change, bit-exactness unaffected. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 04:09:16 +00:00
Ettore Di Giacinto	0b84fda496	docs(paged): add the bf16-tau opt-in line to the decode plots Per request, the plots now show all four series: llama.cpp (standard), vLLM, LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau (opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:25:02 +00:00
Ettore Di Giacinto	1431f72b92	docs(paged): regenerate decode plots (3-way) from re-measured data + overview Rebuild the two committed decode plots from the re-measured CSV and add a combined overview. Three series per the comparison that matters: llama.cpp (standard) vs vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau stays out of the plot (it remains in the CSV + the README table as the opt-in row). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:20:12 +00:00
Ettore Di Giacinto	3466094c68	docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau) Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs: - stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated). - patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE). - patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p). final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows). peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session. Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:05:59 +00:00
Ettore Di Giacinto	ed5eb705c7	docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7 The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md. Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 21:34:10 +00:00
Ettore Di Giacinto	53f66a6f03	fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-server link The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin. grpc-server.cpp is SHARED with the stock backend and tracks the stock pin; c299a92c's upstream server-API refactor pulled stream_* helpers into the headers grpc-server.cpp includes, whose definitions the stock-aligned build does not compile -> every paged variant failed to LINK (undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup / stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact gate never exercises the full grpc-server build, so it slipped through. Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact AND grpc-server links - the original DGX-proven baseline). Document the hard constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the paged pin must track the stock pin, and a pin-sync must pass the full CI grpc-server build, not only the bit-exact gate. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 20:28:28 +00:00
Ettore Di Giacinto	08b754f910	chore(paged): keep patches/ patch-only; README to backend root, docs to docs/ The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv, dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv. Restore the invariant that patches/ holds only the .patch series. Moves: - patches/paged/README.md -> README.md (canonical doc at the backend root) - patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md, final_benchmark.csv, qwen36_.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/ - patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README) Deletes: - patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section) - patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide) Repoint every reference to the moved files: README internal links (docs/ + the .github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md, .github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml, the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml, docs/content/features/backends.md, gallery/index.yaml. The build apply glob PAGED_PATCHES_DIR/0.patch (PAGED_PATCHES_DIR := .../patches/paged) is unchanged and still resolves to the 28 patches. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 13:20:05 +00:00
Ettore Di Giacinto	a4e730979d	feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets The paged backend previously built for cublas/cuda, cpu, vulkan, sycl, hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-negative there (README section 4c). The darwin grpc-server link also fails on undefined upstream server symbols, turning CI red. Both broken and pointless off-CUDA, so ship CUDA-only. - backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64, vulkan amd64/arm64 and metal-darwin rows for this backend; keep the four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13). - index.yaml: meta-backend (and -development) capabilities are now CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts); removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries. - Removed the now-unused darwin build script and its Makefile target / .NOTPARALLEL entry / backend_build_darwin.yml step. - Documented the CUDA-only build coverage in the patch README and plan. Non-CUDA users should use the stock llama-cpp backend. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:29:15 +00:00
Ettore Di Giacinto	9115c2c52c	docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-only rationale The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was wrong. The patchset's fusions are gated off off-CUDA, so the backend ships CUDA-only; non-CUDA users use stock llama-cpp. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 12:18:11 +00:00

1 2 3 4 5 ...

785 Commits