Commit Graph

7028 Commits

Author SHA1 Message Date
Ettore Di Giacinto
23b11a5239 paged-kv-manager.h: add missing <cstddef> for size_t
Fixes cuda-13 amd64 / non-arm64 build where size_t was used without the
header (arm64 cuda-13 pulled it in transitively; amd64/cuda-12 toolchains
do not). Compile-only change, bit-exactness unaffected.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 04:09:16 +00:00
Ettore Di Giacinto
9bb8994c4e chore(paged): drop CUDA-12 variants of llama-cpp-localai-paged, keep CUDA-13 only
The paged backend targets Blackwell sm_121a, which CUDA 12.0 cannot target
at all, so the CUDA-12 variants were pointless. They were also broken: the
cublas-12 / nvidia-l4t / arm64 build failed to compile paged-kv-manager.cpp
("no declaration matches ...", a ~10-function mismatch the older
cuda-12-base gcc rejects). CUDA-13 compiles it fine (confirmed on GB10).

Removed (config-only, scoped to the paged backend):
- backend-matrix.yml: the two CUDA-12 paged rows
  (-gpu-nvidia-cuda-12-llama-cpp-localai-paged,
   -nvidia-l4t-arm64-llama-cpp-localai-paged)
- backend/index.yaml: removed the CUDA-12 capability keys (nvidia-cuda-12,
  nvidia-l4t-cuda-12, nvidia-l4t) from both meta-backends, repointed
  default/nvidia to the cuda13 amd64 variant, and dropped the orphaned
  cuda12-* / nvidia-l4t-arm64-* variant definitions (latest + -development).

Kept CUDA-13 only: cuda13-llama-cpp-localai-paged (amd64) and
cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged (l4t arm64). Matrix
tag-suffixes <-> index variant URIs form a clean 2:2 bijection.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 01:37:54 +00:00
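
The 2:2 bijection claimed in the commit above (matrix tag-suffixes <-> index variant URIs) is the kind of invariant that can be checked mechanically. The following is a minimal sketch, not the repository's tooling: `is_bijection` is a hypothetical helper, and it assumes that a variant URI belongs to a matrix row when the URI ends with that row's tag-suffix. The suffixes and the registry prefix in the usage lines are placeholders, not values from backend-matrix.yml or backend/index.yaml.

```python
def is_bijection(tag_suffixes, variant_uris):
    """Return True when suffixes and URIs pair up one-to-one.

    Assumption: a URI belongs to a matrix row when it ends with that row's
    tag-suffix. The real matching rule may differ.
    """
    # Duplicates on either side already rule out a one-to-one pairing.
    if len(set(tag_suffixes)) != len(tag_suffixes):
        return False
    if len(set(variant_uris)) != len(variant_uris):
        return False
    # Every suffix must be matched by exactly one URI ...
    for suffix in tag_suffixes:
        if sum(uri.endswith(suffix) for uri in variant_uris) != 1:
            return False
    # ... and every URI must match exactly one suffix.
    for uri in variant_uris:
        if sum(uri.endswith(suffix) for suffix in tag_suffixes) != 1:
            return False
    return True


# Placeholder data, for illustration only.
suffixes = ["-variant-a", "-variant-b"]
uris = [
    "registry.example/backends:latest-variant-a",
    "registry.example/backends:latest-variant-b",
]
print(is_bijection(suffixes, uris))
```

Checking both directions matters: an orphaned variant definition (a URI with no matrix row) and a matrix row with no variant entry are both caught.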
Ettore Di Giacinto
0b84fda496 docs(paged): add the bf16-tau opt-in line to the decode plots
Per request, the plots now show all four series: llama.cpp (standard), vLLM,
LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau
(opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width
and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:25:02 +00:00
Ettore Di Giacinto
1431f72b92 docs(paged): regenerate decode plots (3-way) from re-measured data + overview
Rebuild the two committed decode plots from the re-measured CSV and add a combined
overview. Three series per the comparison that matters: llama.cpp (standard) vs
vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau
stays out of the plot (it remains in the CSV + the README table as the opt-in row).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:20:12 +00:00
Ettore Di Giacinto
266fcc79ad docs(agents): fix A/B-bench gotcha - env-toggle != stock for compiled-in wins
The DGX re-run showed toggling LLAMA_KV_PAGED on/off on the patched binary does
NOT reproduce stock: the dominant SSM decode fusions are compiled in, not
runtime-gated, so the toggle measures only the (here ~neutral) paged-KV part.
True stock needs a separately-built unpatched binary at the same pin. Correct the
methodology skill's per-lever discipline + apples-to-apples rule accordingly.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:09:05 +00:00
Ettore Di Giacinto
3466094c68 docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau)
Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE
q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and
README section 4 carry a single consistent set of llama numbers with all three
configs:

- stock: separately-built unpatched llama.cpp at this backend's exact pin
  9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce
  stock - the SSM decode fusions are compiled in, not env-gated).
- patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE).
- patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact,
  ~91% same-top-p).

final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both
models at all four widths (the prior CSV had no stock and no bf16-tau rows).
peak_gb is dropped: nvidia-smi reports [N/A] for the GB10's unified LPDDR5x
memory and the bench does not print it, so per-run peak could not be captured
this session.

Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in
bf16-tau adds a further +3% to +17% on top of patched (growing with width).
vLLM column is kept from the prior session (not re-run) and labeled as such.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 22:05:59 +00:00
Ettore Di Giacinto
ed5eb705c7 docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7
The paged backend's llama.cpp pin was reverted from c299a92c back to
9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the
reverted sync) is dead weight. The pin-sync PROCESS stays documented in
the three live places: the Makefile comment, README section 7 (Pin +
maintenance policy), and .agents/llama-cpp-localai-paged-backend.md.

Delete the doc and repoint every reference to it (Makefile, README,
.agents, canary script + workflow) at README section 7. No functional
paths change: the canary's patches-dir glob (patches/paged/0*.patch)
is untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 21:34:10 +00:00
Ettore Di Giacinto
53f66a6f03 fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-server link
The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin.
grpc-server.cpp is SHARED with the stock backend and tracks the stock pin;
c299a92c's upstream server-API refactor pulled stream_* helpers into the headers
grpc-server.cpp includes, whose definitions the stock-aligned build does not
compile -> every paged variant failed to LINK (undefined reference to
stream_aware_should_stop / stream_pipe_producer::cleanup /
stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact
gate never exercises the full grpc-server build, so it slipped through.

Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact
AND grpc-server links - the original DGX-proven baseline). Document the hard
constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the
paged pin must track the stock pin, and a pin-sync must pass the full CI
grpc-server build, not only the bit-exact gate.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 20:28:28 +00:00
Ettore Di Giacinto
08b754f910 chore(paged): keep patches/ patch-only; README to backend root, docs to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 13:20:05 +00:00
Ettore Di Giacinto
db14006fcd docs(agents): add paged-backend maintenance + vLLM-parity methodology skills
Two .agents guides (indexed in AGENTS.md):
- llama-cpp-localai-paged-backend.md: what the CUDA-only paged backend is, the
  patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, the
  CUDA-only / stock-stays-pure invariants, and the Metal/SYCL/Vulkan follow-up scope.
- vllm-parity-methodology.md: the decode-parity playbook (bit-exact gating,
  profile-don't-assume, both-engine ground-truth, per-lever A/B, recording rejected
  levers, multi-agent GPU orchestration).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 12:58:01 +00:00
Ettore Di Giacinto
a4e730979d feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets
The paged backend previously built for cublas/cuda, cpu, vulkan, sycl,
hipblas and darwin/metal. On non-CUDA the patchset's wins are inert: the
GDN fusions are gated off (patch 0030) and NVFP4 falls back to dequant,
so the backend is neutral-to-negative there (README section 4c). The
darwin grpc-server link also fails on undefined upstream server symbols,
turning CI red. The backend is both broken and pointless off-CUDA, so ship CUDA-only.

- backend-matrix.yml: drop the hipblas, sycl f32/f16, cpu amd64/arm64,
  vulkan amd64/arm64 and metal-darwin rows for this backend; keep the
  four cublas rows (cuda-12, cuda-13, nvidia-l4t cuda-12 and cuda-13).
- index.yaml: meta-backend (and -development) capabilities are now
  CUDA-only with default pointing at cuda12 (mirrors faster-qwen3-tts);
  removed the orphaned cpu/rocm/sycl/vulkan/metal variant entries.
- Removed the now-unused darwin build script and its Makefile target /
  .NOTPARALLEL entry / backend_build_darwin.yml step.
- Documented the CUDA-only build coverage in the patch README and plan.

Non-CUDA users should use the stock llama-cpp backend.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 12:29:15 +00:00
Ettore Di Giacinto
9115c2c52c docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-only rationale
The gated-DeltaNet + SSM_CONV ops have upstream Metal/Vulkan/SYCL kernels, so the
Qwen3.6 hybrids run there (non-fused) - the earlier 'no Vulkan kernel' note was
wrong. The patchset's fusions are gated off on non-CUDA backends, so the backend ships
CUDA-only; non-CUDA users use stock llama-cpp.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 12:18:11 +00:00
Ettore Di Giacinto
984c8fcbea docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Metal/Vulkan/SYCL)
Source-only analysis of what it would take to give the gated-DeltaNet decode
fusions (0018 in-place state write-back, 0019 fused recurrent-state gather,
0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on
the non-CUDA compute backends, so the patch-series decode win extends past
CUDA-family hardware.

Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist
  upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is
  stale). The Qwen3.6 hybrids run on all three today via the non-fused path;
  Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state
  write (in-place), add the ids read, write one new conv-update kernel + its
  ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan)
  pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and
  test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect
  addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU)
  is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op
  authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup =
  simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to
  author), then Vulkan (widest hardware reach but the shader-gen + variant
  matrix + subgroup variance make it the capstone).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 12:11:24 +00:00
Ettore Di Giacinto
4a9a1dd247 docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend learnings
Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) -
patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read
is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon.
Vulkan: same picture, worse (no upstream GDN op). Section 6: cross-backend learnings
+ upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact,
CPU-mirrored win worth upstreaming).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 11:05:37 +00:00
Ettore Di Giacinto
78fac9a28f refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series
Move ALL paged-attention content out of the stock backend/cpp/llama-cpp
backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is
pure upstream llama.cpp and the paged backend owns and applies its own vendored
patch series.

- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/
  (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen,
  its own 0001-0002 patches, dense-era design docs, tests). Zero references
  repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README
  + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged
  README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock
  backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the
  patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh;
  remove the paged-series handling from prepare.sh. The stock llama.cpp target
  now only clones the pin and applies its own (currently empty) base patches/
  series. The runtime paged option hooks in the shared grpc-server.cpp are
  untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto
  each freshly cloned tree via strict git apply (apply-paged-patches), after the
  copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary
  workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs,
  backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and
  the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on
  build-toggle from comments.

Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to
a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed
canary apply script resolves and applies the series end to end.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 11:01:22 +00:00
Ettore Di Giacinto
fb2dc33d52 docs(paged): consolidate the dev-trail docs into one canonical README
The paged-attention patch directory had accumulated ~55 scattered dev docs
(results, progress, scope, lever, and gap-analysis notes). Consolidate the
durable content of all of them into one canonical
backend/cpp/llama-cpp/patches/paged/README.md covering: what the patchset is,
the architecture (paged KV + block-table flash-attn, the gated-DeltaNet SSM
decode path, NVFP4 FP4-MMA, the decode-first scheduler), the full 0001-0030
patch series table with bit-exact status, the GB10 benchmarks
(patched-vs-stock-vs-vLLM + the Apple M4 architectural note), the dev notes
(bit-exact methodology, the per-path gate, the MoE-parity conclusion, the
rejected/flat levers, the opt-in bf16-SSM mode), arch+quant generality, the
pin + canary maintenance policy, and the published NVFP4 gallery models.

Delete the consolidated-away dev trail. Keep the three operational docs the
README links to: PIN_SYNC_c299a92c.md (canary reference), PAGED_BITEXACT_NOTE.md
(per-path gate reference) and LOCALAI_LLAMACPP_BACKEND_PLAN.md (the
ship-as-own-backend design-of-record), plus the benchmark plots + csv. The
.patch files and the unit/bench .cpp are untouched.

Repoint every external reference to a deleted doc at the new README:
grpc-server.cpp, docs/content/features/backends.md, gallery/index.yaml, the
canary apply script (PIN_BUMP_APPLY_CHECK.md -> README), and the base
patches/README.md (ADDITIVE_DESIGN.md -> README). The canary's PIN_SYNC
reference still resolves; its inert SSM_DECODE_FIX_RESULTS.md glob (a
patch-internal path matcher, not a repo-doc link) is left intact.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:25:47 +00:00
Ettore Di Giacinto
a5a5b2ad80 feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verified)
Advance the paged-attention backend's owned llama.cpp pin by 23 upstream
commits. The shipped source-only patch series (0001-0030, 28 patches) applies
strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export
needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):

- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense
  non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all
  match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16,
  SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146,
  MUL_MAT_ID 806/806; all OK.

The 23-commit upstream jump did not change our decode output. The .patch files
are kept byte-identical (they already apply strict-clean at the new pin); only
the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:57:33 +00:00
Ettore Di Giacinto
7e1832b868 fix(paged): strip stray dev-doc hunks so patch series applies on a clean checkout
The shipped from-patches build applies the paged series with strict `git apply`
(backend/cpp/llama-cpp/Makefile `llama.cpp` target:
`git apply --verbose "$p" || { ...; exit 1; }`), which is atomic: a hunk against
a file missing from the tree rejects the whole patch and fails the build. Four
patches carried hunks against dev-only docs that live in the DGX dev tree but are
absent from a clean ggml-org/llama.cpp checkout, so the build only succeeded on
the DGX and FAILED on CI / any clean checkout:

  0019 -> SSM_DECODE_FIX_RESULTS.md   (modify hunk = the root reject)
  0020 -> LEVER1_OPROJ_MMQ_RESULTS.md (create)
  0021 -> CONV_STATE_FUSION_RESULTS.md (create)
  0028 -> LEVER1_GATHER_PROGRESS.md, LEVER1_GATHER_RESULTS.md (create)

0019's reject cascaded to 0021/0022/0026/0028 (which build on 0019's code). Strip
each `diff --git a/<devdoc>` section plus its diffstat line and `create mode`
trailer, and correct the summary count. Every llama.cpp SOURCE hunk is left
byte-identical (verified by sha256 of each patch's source-diff tail).

Verified on a fresh clone of ggml-org/llama.cpp at the pin 9d5d882d: BEFORE,
strict `git apply` failed at 0019 (cascade 0019/0021/0022/0026/0028); AFTER, the
full series 0001-0030 applies with exit 0 (sentinel created, zero stray docs).
The tolerant `patch -p1` fallback in prepare.sh also applies with zero rejects.

PIN_SYNC_9d5d882d.md documents the durable fix: re-exports/pin-syncs must keep
patches source-only (export with a source pathspec / `:!*.md`, gate with a strict
`git apply` on a clean checkout). The upcoming c299a92c pin-bump re-export must
produce source-only patches too.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:39:27 +00:00
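
The commit above describes stripping dev-doc sections out of patch files so that a strict `git apply` no longer rejects the whole patch. The commit does not say which tool performed the stripping, so the following is only a sketch of the idea under stated assumptions: `strip_doc_sections` is a hypothetical helper, it treats any path ending in `.md` as a dev doc, and it does not handle quoted paths containing spaces. It also leaves the diffstat line, the `create mode` trailer and the summary count untouched, so those would still need the manual correction the commit mentions. If the last section of a patch is a doc, the trailing signature lines after it are dropped as well.

```python
import re

# Matches the header that opens each per-file section of a git patch.
_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def strip_doc_sections(patch_text, doc_suffix=".md"):
    """Drop every `diff --git` section whose target path ends in doc_suffix.

    Lines before the first `diff --git` header (commit message, diffstat)
    are kept verbatim.
    """
    kept = []
    dropping = False
    for line in patch_text.splitlines(keepends=True):
        match = _DIFF_HEADER.match(line.rstrip("\n"))
        if match:
            # A new file section starts: decide once whether to keep it.
            dropping = match.group(2).endswith(doc_suffix)
        if not dropping:
            kept.append(line)
    return "".join(kept)


# Illustrative input, not a real patch from the series.
sample = (
    "Subject: example\n"
    "diff --git a/src/op.cpp b/src/op.cpp\n"
    "+int x;\n"
    "diff --git a/NOTES.md b/NOTES.md\n"
    "+dev notes\n"
)
cleaned = strip_doc_sections(sample)
```

Because source sections are copied line for line, the surviving source hunks stay byte-identical, which is the property the commit verifies with sha256.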
Ettore Di Giacinto
2bee7a5ab1 ci(paged): add early-warning canary for vendored llama.cpp paged patches
The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified
llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can
never silently break the shipped build. That exclusion also removed the early
warning of upstream drift. This restores the signal without touching the pin.

Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch):

- apply-check job (ubuntu-latest, toolchain-free): resolve the latest
  ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged
  series 0001-0030 in order with the build's own git-apply method via the new
  shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break.
- compile job (needs apply-check): on the exact tip it validated, build the
  paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the
  same `make grpc-server` target the shipped build uses, so a red means upstream
  drift, not toolchain noise. nvcc compiles the kernels with no GPU present.

Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the
paged Makefile pin. The canary is signal-only: it opens no PR and never moves
the pin, so the shipped build and the dep-bump PRs stay green regardless. It is
fully separate from bump_deps.

The lone pre-existing quirk in the series (patch 0019 carries a stray modify
hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean
upstream checkout; git apply is atomic so it rejects the whole patch and
cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes
only that dev-doc and still applies 0019's real code hunks atomically, mirroring
prepare.sh's tolerance, so the quirk never false-positives the canary but a
genuine code break in 0019 still turns it red.

Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile
and .github/workflows/bump_deps.yaml at this canary as the drift signal, and
document it in the PIN_SYNC doc: canary red -> do a pin-sync.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:29:09 +00:00
Ettore Di Giacinto
e160041f05 chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper
The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's LLAMA_VERSION,
which .github/workflows/bump_deps.yaml auto-bumps nightly to the latest
ggml-org/llama.cpp master tip. The stock backend is patch-free so that bump is
safe, but the paged backend applies a vendored patch series
(backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against ONE
specific tip. A naive bump moves the tip out from under the patches and breaks
'git apply' at build time - a dep-bump PR would go red (or, worse, the break
surfaces later in a release build).

Mirror the turboquant precedent: give the paged wrapper its OWN LLAMA_VERSION
pin (the verified 9d5d882d) and force it into every copied build via
LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no longer drags the
paged build to an unverified tip. Unlike turboquant (whose fork branch carries
the patches and is safe to auto-bump), the paged series is vendored, so it gets
NO bump_deps.yaml entry: it is advanced only by the manual PIN_SYNC process.
Add cross-referencing comments in both Makefiles and bump_deps.yaml.

Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the latest tip
(c299a92c, 23 commits ahead). The full series applies CLEAN under 'git apply'
with only benign line offsets and zero conflicts; the lone failure (0019) is a
pre-existing stray dev-doc hunk, identical on the current pin, not a bump
regression.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 08:02:37 +00:00
Ettore Di Giacinto
400930db19 Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention
# Conflicts:
#	gallery/index.yaml
2026-06-27 07:48:49 +00:00
LocalAI [bot]
e95018ef70 chore(model gallery): 🤖 add 1 new models via gallery agent (#10544)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-27 09:42:46 +02:00
LocalAI [bot]
0258f8af55 fix(backends): repair release CI build/test breaks (kokoros, fish-speech, llama-cpp-quantization, sglang) (#10547)
* fix(kokoros): implement new Backend RPCs to fix the build

The backend.proto grew six RPCs (SoundDetection, Depth, TokenClassify,
Score and the bidi-streaming Forward) that the kokoros gRPC service never
implemented, so the trait impl no longer satisfies `Backend`:

    error[E0046]: not all trait items implemented, missing:
      `sound_detection`, `depth`, `token_classify`, `score`,
      `ForwardStream`, `forward`

kokoros is a TTS backend with no use for these, so add `unimplemented`
stubs (plus the `ForwardStream` associated type) matching the existing
pattern for every other unsupported RPC in this file.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(fish-speech): add setuptools-rust for the editable source install

install.sh installs the fish-speech source tree editable with
`--no-build-isolation`, which means the build backends of its transitive
dependencies must already be present in the venv. One of them builds a
Rust extension and its metadata step fails with:

    ModuleNotFoundError: No module named 'setuptools_rust'

Add setuptools-rust to requirements.txt so installRequirements provisions
it before the editable install runs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(llama-cpp-quantization): vendor convert_hf_to_gguf.py with conversion/

Upstream llama.cpp split the model-specific logic out of the single
convert_hf_to_gguf.py file into a sibling `conversion/` package, so the
script now starts with `from conversion import ...`. Downloading just the
one file therefore fails at runtime with:

    ModuleNotFoundError: No module named 'conversion'

Clone the repo (reusing the clone already needed to build llama-quantize)
and copy both the script and the `conversion/` package into the backend
dir. Python puts the script's own directory on sys.path[0], so the package
resolves when it sits beside the script.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(sglang): pin the CPU source build to sglang v0.5.11

The CPU profile builds sgl-kernel from a `git clone` of sglang with no
ref, so it always tracks master. Recent master added CPU kernels (e.g.
mamba/fla.cpp) that fail to compile in our builder:

    constexpr variable 'scale' must be initialized by a constant
    static library kineto_LIBRARY-NOTFOUND not found

Pin the clone to v0.5.11, the same release the GPU path already floors on
(requirements-cublas12-after.txt). Overridable via SGLANG_VERSION so the
pin can be bumped deliberately.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 09:42:22 +02:00
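
The llama-cpp-quantization fix in the commit above relies on Python placing the script's own directory at the front of `sys.path`, so a package sitting beside the script is importable regardless of the working directory. The sketch below demonstrates that behaviour with stand-in files: `convert_demo.py` and the one-line `conversion/__init__.py` are hypothetical and are not the upstream script or package. It assumes a default interpreter configuration (no `-P` flag or `PYTHONSAFEPATH`, which disable this path entry).

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_with_sibling_package():
    """Create a script next to a package, run it from another cwd, return stdout."""
    with tempfile.TemporaryDirectory() as root:
        backend_dir = Path(root) / "backend"
        package_dir = backend_dir / "conversion"
        package_dir.mkdir(parents=True)
        # Stand-in package: exposes one name for the script to import.
        (package_dir / "__init__.py").write_text("NAME = 'conversion'\n")
        # Stand-in script: imports the sibling package.
        script = backend_dir / "convert_demo.py"
        script.write_text("from conversion import NAME\nprint(NAME)\n")

        other_cwd = Path(root) / "elsewhere"
        other_cwd.mkdir()
        # Run from an unrelated working directory: the import still resolves
        # because the script's directory comes first on sys.path.
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=other_cwd,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


print(run_script_with_sibling_package())
```

Deleting the `conversion/` directory from the sketch reproduces the `ModuleNotFoundError` quoted in the commit message, which is why copying the script alone was not enough.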
Ettore Di Giacinto
202a29f980 feat(paged): Metal/darwin build availability for llama-cpp-localai-paged
Close the single build-targeting gap the cross-arch audit (ARCH_GENERALITY_AUDIT.md
section 6, item 2) flagged: the paged backend had no Metal/darwin variant and no
metal: capability key, so a Mac user selecting llama-cpp-localai-paged fell back to
default=cpu (a Linux image) that does not run, with no fallthrough to stock llama-cpp.

Mirror exactly how stock llama-cpp does darwin:

- .github/backend-matrix.yml: add the includeDarwin row
  (-metal-darwin-arm64-llama-cpp-localai-paged, arch arm64, lang go) next to the
  stock llama-cpp darwin row.
- backend/index.yaml: add the metal: capability key to the
  llama-cpp-localai-paged meta-backend plus the metal-llama-cpp-localai-paged and
  -development variant entries (URIs match the matrix tag-suffix); add Metal to tags.
- scripts/build/llama-cpp-localai-paged-darwin.sh: new bespoke darwin build,
  a line-for-line mirror of llama-cpp-darwin.sh swapping the paged wrapper dir,
  binary names, ggml-shared-libs dir and output tar. Same CPU_ALL_VARIANTS + Metal
  path (GGML_METAL=ON via the reused llama-cpp Makefile when OS=Darwin; --target ggml
  pulls in ggml-metal via add_dependencies) with LLAMA_PAGED=on.
- Makefile: add backends/llama-cpp-localai-paged-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: give the paged backend the same
  bespoke darwin build step as stock llama-cpp, share the llama ccache restore (save
  stays stock-only to avoid a same-run key collision), and exclude it from the
  generic build-darwin-go-backend step.
- scripts/changed-backends.js: comment-only - the paged darwin path mapping was
  already present (forward-looking); update the stale "if a metal row is ever added"
  note now that the row exists.

Metal delivers paged-KV only (NVFP4 FP4-MMA is CUDA/Blackwell-only); the GDN/conv
fused ops have no Metal kernel, so a gated-DeltaNet (qwen35) model falls back to the
CPU reference op at runtime - made SAFE by the fused-op backend gate (patch 0030).
This is config; the Metal build runs in CI on the next push and is runtime-tested on
the M4 Mac.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:42:08 +00:00
Ettore Di Giacinto
621a20d2b5 feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (patch 0030)
Closes audit RISKY-1 (the one latent silent-miscompute hazard). The fused/in-place
Gated Delta Net op (0018/0019/0026) and the discriminated SSM_CONV decode op
(0021/0028, which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET via a non-null
src[3]/src[4] discriminator) are CUDA+CPU-only but were emitted DEFAULT-ON
(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) with no backend guard. A backend
that supports plain SSM_CONV but ignores the discriminator (Vulkan/SYCL/Metal)
would run the wrong plain conv => silent corruption.

Fix: in llama_context::sched_reserve(), before the auto_fgdn resolution, force
fused_gdn_ar = fused_gdn_ch = auto_fgdn = false when any non-CPU compute backend
is not CUDA-family (reg name not "CUDA"/"ROCm"/"MUSA"). Every emission site keys
off these flags, so the graph falls back to the upstream non-fused plain
ggml_ssm_conv + ggml_silu path that every backend handles. On CUDA the reg name is
"CUDA", the flags are left untouched, and the decode graph is byte-identical.

Mirror of DGX paged patch 0030; adds FUSED_OP_BACKEND_GATE_RESULTS.md.

Verified GPU-free: reconstructed pin 9d5d882d + paged 0001-0029 + 0030, CPU-only
build (GGML_CUDA=OFF) of libllama + test-backend-ops links with 0 errors; 0030
applies cleanly via git apply and patch -p1. test-backend-ops correctness for
SSM_CONV/SSM_CONV_UPDATE(_IDS)/GATED_DELTA_NET is CUDA0-vs-CPU (pending DGX,
tunnel offline this session); registered test cases will exercise it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:32:49 +00:00
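
The gating rule in the commit above can be modelled in a few lines. The real change lives in C++ inside `llama_context::sched_reserve()`; the following is only a Python model of the rule as the commit message states it, not the patch itself. `resolve_fused_flags` is a hypothetical name, and identifying the CPU backend by the registry name "CPU" is an assumption made for the sketch. The "Vulkan" name in the usage lines is illustrative.

```python
# Registry names the commit message lists as CUDA-family.
CUDA_FAMILY = {"CUDA", "ROCm", "MUSA"}


def resolve_fused_flags(backend_reg_names, fused_gdn_ar=True,
                        fused_gdn_ch=True, auto_fgdn=True):
    """Return (fused_gdn_ar, fused_gdn_ch, auto_fgdn) after the backend gate.

    "CPU" entries are ignored (assumption: the CPU backend is identified by
    this name). Any other registry name outside CUDA_FAMILY forces all three
    flags off, which selects the non-fused path.
    """
    for name in backend_reg_names:
        if name != "CPU" and name not in CUDA_FAMILY:
            return (False, False, False)
    # Only CPU and CUDA-family backends present: flags are left untouched.
    return (fused_gdn_ar, fused_gdn_ch, auto_fgdn)


# Illustrative backend sets.
on_cuda = resolve_fused_flags(["CPU", "CUDA"])
on_vulkan = resolve_fused_flags(["CPU", "Vulkan"])
```

Gating on the flags rather than on each emission site keeps the change small: every site that emits a fused op already keys off these three values, so forcing them off is enough to fall back to the plain path.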
Ettore Di Giacinto
2332587fdc fix(gallery): scope NVFP4-paged entries to Blackwell + consistent tags
The six LocalAI-paged NVFP4 entries advertised GB10 throughput figures with
no machine-readable hardware signal, and the four qwopus/MTP entries lacked
the nvfp4 tag entirely (not discoverable as NVFP4). Per the cross-arch audit
(ARCH_GENERALITY_AUDIT.md section gallery-targeting), NVFP4 GGUFs run
everywhere via dequant (never fail), so the gap is performance-expectation,
not correctness; the only available lever is description + tags.

- Add the nvfp4 tag to the four qwopus/MTP entries that lacked it; the two
  base qwen3.6 entries already had it.
- Add a blackwell tag to all six (precedent: the nvidia hardware tag is
  already used on many gallery entries as a filter chip).
- Lead each of the six descriptions with a one-line Blackwell-recommended /
  runs-slower-off-Blackwell caveat.
- Scope the qwen3.6-27b 90-117% of vLLM claim explicitly to GB10 / DGX Spark
  (consumer Blackwell) so it is not read as a universal figure.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:19:42 +00:00
Ettore Di Giacinto
af6e133759 docs(paged): cross-arch synthesis - ship verdict + minimum non-Blackwell fixes
Synthesizes the four ARCH_GENERALITY_AUDIT sections (build-matrix,
gguf-gallery-targeting, optimization-generality, patch-arch-safety) into a
single cross-arch ship decision: build-safety table per target, every
patch bucketed (SAFE-EVERYWHERE / BLACKWELL-ONLY-clean-fallback / GB10-TUNED
/ RISKY), the NVFP4 gallery recommendation, a per-arch roadmap ranked by
value/effort, the empirical-verification matrix (GB10 + M4 cover all but
non-Blackwell NVIDIA), and the ship verdict.

Verdict: SAFE to ship as Blackwell/Linux today; the build is arch-general
(no GB10 pin; FP4 code is default-off + #if-guarded) and NVFP4 GGUFs run
everywhere via dequant. The one hard prerequisite before extending paged to
Metal/Vulkan/SYCL is closing the backend-ungated, default-on fused GDN/conv
op emission (discriminated GGML_OP_SSM_CONV via non-null src[3], CUDA+CPU
only, no supports_op guard) - latent on current Linux targets, silent
miscompute on a future non-CUDA paged build of a gated-DeltaNet model.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:07:20 +00:00
Ettore Di Giacinto
87cfd1fadb docs(paged): quant-generality audit - SSM/serving opts are quant-agnostic (0013-0029)
Source-verify each paged decode optimization as quant-agnostic (operates on the
f32 gated-DeltaNet/conv recurrent state, the paged serving host path, or the
generic MMQ/CUDA-graph routing) vs NVFP4-specific (only fires inside the
use_native_fp4 / GGML_TYPE_NVFP4 branch).

Findings: 14 of 16 landed patches are quant-agnostic (0013/0014/0015/0016/0018/
0019/0020/0021/0022/0024/0025/0026/0028/0029). Only 0023 (MoE FP4 act-quant
de-dup, inside use_native_fp4) is NVFP4-specific; 0017 is NVFP4-only but
default-off/inert (kill-gate, no win).

Corrects the hypothesis on 0025: the actual patch is the MUL_MAT_ID CUDA-graph
guard relaxation gated on ggml_is_quantized + ggml_cuda_should_use_mmq (the
generic quantized grouped-MMQ path), NOT NVFP4. The NVFP4-specific act-quant /
quantize_mmq_nvfp4 work is LEVER 3, which was a measurement STOP and never
landed (no patch); LEVER 4 (NVFP4 projections) KL-failed and never shipped.

Adds the relative-impact-by-quant estimate (fixed f32-recurrence/host ms is the
largest step fraction at NVFP4, shrinks at Q8/bf16 as the weight read grows) and
the A/B plan to prove generality on a Q4_K_M requant of the same Qwen3.6 (build
the control first, md5/KLD bit-exact gate per path, decode_agg npl 32/128, with
0023 as the NVFP4-only negative control).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:06:14 +00:00
Ettore Di Giacinto
2a2de1d6c1 docs(paged): patch-arch-safety classification for patches 0018-0029
Build-break / miscompile audit of the paged patch series. Classifies each
patch general/Blackwell-gated/risky, records the only conditional arch surface
(0017, fully #if-gated + default-off), and gives the per-target build-safety
verdict (sm_80-90 CUDA / sm_100 / Metal-not-a-target / CPU / ROCm-SYCL-Vulkan).
Flags the one latent silent-correctness hazard: fused GDN/conv ops reuse
GGML_OP_SSM_CONV via a src discriminator with CUDA+CPU-only kernels and
backend-ungated emission.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:04:05 +00:00
Ettore Di Giacinto
5667dfe461 docs(paged): arch-generality audit - optimization classification (0017-0029)
Classify the paged-attention optimizations as arch-GENERAL (ship everywhere),
GB10-TUNED (per-arch retune), or Blackwell-precision-specific; add the per-arch
expected story (sm_100/Hopper/Ada/Metal/CPU) and the SAFETY gap (fused GDN/conv
ops are CUDA+CPU-only with backend-ungated emission). Extends the prior
build/gallery-targeting audit in the same file.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:02:54 +00:00
Ettore Di Giacinto
34abf392fc docs(paged): ARCH audit - NVFP4 GGUF off-Blackwell portability + gallery targeting gap
NVFP4 (GGML_TYPE_NVFP4=40 / MOSTLY_NVFP4) GGUFs are portable: full CPU/CUDA-DP4A/
generic-MMA/Vulkan dequant coverage. FP4-MMA is a runtime Blackwell-only speed
tier (mmq.cu use_native_fp4 flag), not a load/run gate. Off-Blackwell = runs via
dequant, correct-but-slow, never fail/garbage. Gallery has no microarch-gating
primitive (tags are search-only, capabilities map is family-level nvidia/amd/
metal, model struct has no hardware field), so the 6 -paged entries can only
express Blackwell-targeting via description prose + tags.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 07:00:34 +00:00
Ettore Di Giacinto
683e22500f docs(paged): arch-generality audit - build-targeting (CUDA arch fan + variants + Metal gap)
The llama-cpp-localai-paged backend sets NO explicit CUDA arch list anywhere
(CUDA_DOCKER_ARCH empty in every matrix row; compile.sh only injects
-DCMAKE_CUDA_ARCHITECTURES when non-empty), so it compiles the full upstream
ggml default arch fan - bit-identical targeting to stock llama-cpp, NOT
Blackwell-only. NVFP4 FP4-MMA is gated inside the kernel by
BLACKWELL_MMA_AVAILABLE, not by the build matrix, so the binary is arch-portable.

Variants: CUDA 12/13 + l4t arm64, ROCm, SYCL f32/f16, Vulkan amd64/arm64, CPU
amd64/arm64 (CPU_ALL_VARIANTS) - same Linux set as stock llama-cpp, not CUDA-only.

Single gap vs stock: NO Metal/Darwin row in includeDarwin and NO metal:
capability key in the meta-backend. macOS hosts fall back to the default cpu
(Linux) image, which will not run, and do not auto-fall to stock llama-cpp.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 06:59:36 +00:00
Ettore Di Giacinto
db6ebc53b2 feat(paged): block-table within-step host cache (patch 0029)
Mirror of paged-dev commit e2acb3b (lever 5). get_block_table() is recomputed
once per full-attention layer per decode step, but the KV cell layout is fixed
for the whole step (it only changes in apply()). This caches the table the first
time it is built in a step and memcpy-reuses the identical bytes for the rest,
invalidating in apply(). Bit-exact; toggle off with LLAMA_PAGED_NO_BT_CACHE=1.
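
A minimal sketch of the within-step caching idea, assuming a builder callback that produces the table; the BlockTableCache class, its build_fn parameter, and the enabled switch (standing in for the LLAMA_PAGED_NO_BT_CACHE toggle) are hypothetical and do not reflect the real patch layout.

```python
class BlockTableCache:
    """Build the block table once per step and reuse the same bytes."""

    def __init__(self, build_fn, enabled=True):
        self._build = build_fn      # hypothetical: recomputes the table
        self._enabled = enabled     # stand-in for the env-var toggle
        self._cached = None

    def get_block_table(self):
        if not self._enabled:
            return bytes(self._build())
        if self._cached is None:
            # First full-attention layer in this step: build and keep.
            self._cached = bytes(self._build())
        # Later layers in the same step reuse the identical bytes.
        return self._cached

    def apply(self):
        # The KV cell layout may change here, so drop the cached table.
        self._cached = None
```

Because the cached bytes are exactly what the builder returned, the consumers see the same table they would have seen without the cache; only the number of rebuilds per step changes.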

Host-side get_block_table time (llama-batched-bench, npp128 ntg128 npl128,
cache OFF -> ON): MoE 112.94 -> 14.82 ms (-87%), dense 193.78 -> 16.90 ms (-91%).
Dense decode is partly host-bound and gains (TG 364.8 -> 374.7 t/s, ~96% of the
vLLM 391 t/s @npl128 reference); MoE decode is compute-bound (FP4 GEMM) so the
saved host time is off the critical path and MoE TG is flat. Details in
LEVER5_HOSTPIPE_RESULTS.md.

Also records the per-path bit-exactness gate (PAGED_BITEXACT_NOTE.md): the
paged-MoE greedy md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a
benign FP-accumulation-order difference of the paged attention reduction, not a
bug. KL-validated vs the f16 reference (16 chunks, c512): KLD(paged||f16) =
0.13600 <= KLD(nonpaged||f16) = 0.13660, PPL(paged) = 7.4009 ~ PPL(nonpaged) =
7.3896 (within +/- 0.29). Canonical references are now per path: non-paged MoE
07db32c2 and paged MoE 8cb0ce23; dense is bit-exact across paths (5951a5b4).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 01:47:08 +00:00
Ettore Di Giacinto
9b0e4e544c docs(paged): residual-assess FINAL - MoE at bit-exact ceiling, hunt DONE
Conclude the MoE-parity hunt. The two remaining sub-levers in the
20.3-vs-13.8 ms projection bucket are both bit-changing or at the BW floor:

- convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output
  bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32
  GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that
  bf16-rounded value is load-bearing for the shipped md5. Removing the
  round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights) all
  rebaseline md5. A precision boundary, like lever 4.
- bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor
  (~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet
  already TMA-streams the weights; cutlass reads the same bytes. No kernel
  lever; only fewer bytes (quantize) helps - rejected on quality.
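
The bandwidth-floor arithmetic in the second bullet can be checked with a few lines. The figures (4.7 GB/step, 273 GB/s, M=128) are taken from the bullet; the assumption that only the 2-byte bf16 weight read counts toward memory traffic is a simplification made for this sketch.

```python
bytes_per_step = 4.7e9   # ~4.7 GB of weights read per decode step (from the bullet)
bandwidth = 273e9        # 273 GB/s LPDDR5x (from the bullet)

# Time to stream the weights once at full bandwidth, in milliseconds.
floor_ms = bytes_per_step / bandwidth * 1e3


def gemm_intensity(m, bytes_per_weight=2):
    """FLOP per weight byte for an (m x k) @ (k x n) GEMM.

    FLOPs = 2*m*n*k and weight bytes = bytes_per_weight*n*k, so n*k cancels.
    Activation and output traffic are ignored (assumption of this sketch).
    """
    return 2 * m / bytes_per_weight
```

Here floor_ms comes out at roughly 17.2 ms, in line with the measured 17.27 ms/step, and gemm_intensity(128) gives 128 FLOP/byte, far below the >900 ridge quoted above, which is why the bullet calls the GEMM bandwidth-bound.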

Corrects the body premise that vLLM runs these projections as NVFP4-Marlin:
vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
projection bucket is a matched-precision gap, not a quant gap.

Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is
at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5),
gated behind resolving the paged-MoE baseline md5 drift.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 01:05:06 +00:00
LocalAI [bot]
14b29ebf4e fix(backends): derive darwin RUN_BINARY from the exec line only (#10541)
golang-darwin.sh's packaging check derived the launch binary by grepping every
$CURDIR/... reference in run.sh and taking the last one. Backends that pick a
runtime CPU variant assign it via unquoted `LIBRARY=$CURDIR/libgo<x>-avx512.so`
lines, so the heuristic returned `libgo<x>-avx512.so`, a variant Darwin never
builds (arm64 builds only the fallback), and the check then failed with
"package/libgo<x>-avx512.so not found ... refusing to package (#10267)",
breaking the darwin builds for whisper, sam3-cpp, vibevoice-cpp and friends.

Scan only the `exec` line(s) (the actual launch contract) and tolerate a
quoted `exec "$CURDIR"/<binary>`. parakeet-cpp's parakeet-cpp-grpc and the
quoted-only backends (sherpa/piper/opus) resolve correctly; no Linux change.
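
The difference between the old and new heuristic can be illustrated with a small Python model. The real check lives in a shell script; the run_binary function and the regular expression below are assumptions made for this sketch, not a transcription of golang-darwin.sh.

```python
import re

# Matches $CURDIR/<name>, "$CURDIR"/<name> and "$CURDIR/<name>".
CURDIR_REF = re.compile(r'"?\$CURDIR"?/([^\s"]+)')


def run_binary(run_sh_text):
    """Return the last $CURDIR reference found on exec lines only."""
    refs = []
    for line in run_sh_text.splitlines():
        if line.lstrip().startswith("exec "):
            refs.extend(CURDIR_REF.findall(line))
    return refs[-1] if refs else None
```

Given a run.sh containing `LIBRARY=$CURDIR/libgowhisper-avx512.so` followed by `exec "$CURDIR"/whisper "$@"` (a made-up example), the sketch returns `whisper`, because the LIBRARY assignment is never scanned.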

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.4
2026-06-27 02:05:40 +02:00
LocalAI [bot]
f0d0bff232 fix(llama-cpp): stop reinterpreting plain-string message content as JSON (#10524) (#10538)
The llama-cpp gRPC backend reconstructs OpenAI messages from proto for the
tokenizer-template path and blindly json::parse'd each message's content
string. LocalAI's Go layer always flattens content to a plain string, so a
user prompt that merely looks like JSON (e.g. mealie's ingredient array
["1/4 cup brown sugar", ...]) was reinterpreted as structured content parts and
rejected by oaicompat_chat_params_parse with "unsupported content[].type".

Normalize content per role instead: user/system/developer content is opaque
text and is never JSON-sniffed; assistant/tool content still collapses a literal
JSON null/object (tool-call bookkeeping) to a string, but a plain string is
never turned into an array/scalar. The array defense is role-independent, so the
role gate only governs the benign null/object case.
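
A rough Python model of the per-role rule, for illustration only. The actual helpers are C++ (message_content.h); the normalize_content function is hypothetical, and re-serializing a JSON object with json.dumps is an assumption about what "collapses to a string" means.

```python
import json

# Roles whose content is opaque text and is never JSON-sniffed.
OPAQUE_ROLES = {"user", "system", "developer"}


def normalize_content(role, content):
    """Return the content string to hand to the chat template."""
    if role in OPAQUE_ROLES:
        return content
    try:
        parsed = json.loads(content)
    except (ValueError, TypeError):
        return content              # not JSON at all: keep verbatim
    if parsed is None:
        return ""                   # literal null becomes an empty string
    if isinstance(parsed, dict):
        return json.dumps(parsed)   # object stays a string (assumed form)
    return content                  # arrays and scalars are never reinterpreted
```

Under this model a user prompt such as `["1/4 cup brown sugar"]` is passed through untouched, and an array-looking string is also left alone for assistant and tool roles, which matches the role-independent array defense described above.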

While here, extract the duplicated per-message reconstruction and the
pre-template content sanitization into shared, unit-tested helpers
(message_content.h) so the streaming (PredictStream) and non-streaming (Predict)
paths cannot drift. This removes ~490 lines of copy-pasted defensive code, the
dead tool-role parse branches, and the redundant Predict-only tool_calls branch,
while preserving the prior #7324 (null content -> "") and #7528 (tool array
content -> string) fixes.

Tests:
- backend/cpp/llama-cpp/message_content_test.cpp: standalone C++ unit tests for
  all three helpers (#10524, #7324, #7528, multimodal), discovered and run by
  `make test-backend-cpp` and a new generic tests-backend-cpp CI job. Also wired
  as an opt-in CMake/ctest target (-DLLAMA_GRPC_BUILD_TESTS=ON).
- core/schema/message_test.go: Go regression pinning that ToProto flattens a
  JSON-array-looking text part to the verbatim string.
- prepare.sh now copies message_content.h into the build tree.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.5.3
2026-06-27 01:42:05 +02:00
LocalAI [bot]
64150ca7ab fix(distributed): broadcast admin model-config changes across replicas (#10540)
In distributed mode the admin model endpoints (/models/edit, /models/import,
/models/toggle-state and the PATCH config-json endpoint) wrote the YAML to the
shared models dir but reloaded only the local replica's in-memory
ModelConfigLoader. With multiple frontend replicas behind one service, a save
landed on whichever replica handled the request; peers kept serving their stale
in-memory view, so a load-balanced request was a coin-flip between old and new
config (a created alias visible on one replica and missing on the other, an
edited alias target diverging, etc.).

The NATS cache-invalidation channel (SubjectCacheInvalidateModels +
OnModelsChanged) already existed for the gallery install/delete path; these
admin endpoints simply never published on it. Wire them up via a new
GalleryService.BroadcastModelsChanged helper (no-op in standalone mode).

Also fix delete propagation: LoadModelConfigsFromPath is additive and never
drops an entry whose file is gone, so the subscriber hook (which only reloaded
from disk) could not propagate a removal. ApplyRemoteChange now honors the
event op - pruning the element on "delete" and reloading otherwise - and shuts
down any running instance of the affected model so the new config takes effect.
This closes the same latent gap on the gallery delete path.
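
Why an additive reload cannot propagate a removal, and how honoring the event op fixes it, can be sketched as follows. The apply_remote_change function and its callback parameters are hypothetical Python stand-ins; the real ApplyRemoteChange is Go and its signature is not shown in this message.

```python
def apply_remote_change(configs, op, name, load_from_disk, shutdown):
    """Apply a peer's model-config event to the local in-memory view.

    configs:        dict of model name -> config (local replica state)
    load_from_disk: callable returning the configs currently on disk
    shutdown:       callable stopping a running instance of a model
    """
    if op == "delete":
        # An additive reload would keep the stale entry, so prune it.
        configs.pop(name, None)
    else:
        # Additive merge: adds and overwrites entries, never removes any.
        configs.update(load_from_disk())
    # Stop any running instance so the new config takes effect.
    shutdown(name)
```

The additive branch alone is the behavior the message describes as unable to propagate a removal: update() never drops a key whose file is gone.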

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 01:36:57 +02:00
Ettore Di Giacinto
e3f8149f3b docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PPL, no-ship
Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the lever-4
scope hypothesis) fails the KL gate on every axis vs the shipping NVFP4 baseline:
PPL +6.51% (FULL) / +6.15% (CONS) against a <1% gate, mean KLD-to-f16 0.164/0.172
vs baseline 0.137, top-1 argmax agreement down ~2.2-2.6 points. Both projq
variants rejected; in_proj_ba being kept bf16 (CONS) recovered almost nothing, so
the damage is in the bulk attn/GDN projections.

Root cause: the bf16 projections are a deliberate modelopt precision choice, not a
provenance accident. vLLM runs the same modelopt checkpoint, so it keeps these
projections bf16 too - the baseline GGUF already matches vLLM. The ~20.3ms
projection-GEMM bucket is the price of high-precision projections that vLLM also
pays; it is not the llama-vs-vLLM lever it appeared to be. The speed win is only
purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @ npl128.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 23:36:38 +00:00
LocalAI [bot]
f98b0f1c1e fix(gpu-libs): bundle transitive deps of GPU runtime libs (#10537) (#10539)
The per-vendor packagers in package-gpu-libs.sh copy an explicit allowlist
of top-level GPU runtime libraries (libamdhip64, libhipblas, librocblas, the
CUDA/Intel equivalents, ...) but never resolved their transitive
dependencies. Backends run through the bundled lib/ld.so with
LD_LIBRARY_PATH=lib, so any transitive dep not in the allowlist is a fatal
"cannot open shared object file" at load time.

On recent ROCm (base image rocm 7.2.1) the runtime libs link against
librocprofiler-register.so.0, which is not in the allowlist, so the rocm
llama-cpp backend (and every other GPU backend sharing this script) failed
to load with:

  librocprofiler-register.so.0: cannot open shared object file

The Vulkan path already solved this class of problem with copy_elf_deps
(ldd-based transitive resolution), but that sweep was only wired into the
Vulkan ICD path. This adds a generic sweep_transitive_deps that runs the
same ldd resolution over everything the allowlist already bundled, and wires
it into the ROCm, CUDA and Intel packagers. ldd returns the full recursive
closure, so one pass suffices; core libc-family deps are skipped via
is_core_lib so we never shadow the loader's own libc/libstdc++.
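
The sweep can be modelled in a few lines of Python. The real script is shell and uses ldd; here resolve_closure stands in for ldd (returning the full recursive closure of one library) and the function signature is an assumption made for the sketch.

```python
def sweep_transitive_deps(bundled, resolve_closure, is_core_lib):
    """Return the extra libraries to bundle next to the allowlisted ones.

    bundled:         set of library names already copied by the allowlist
    resolve_closure: callable(lib) -> iterable of all recursive deps
                     (stand-in for ldd)
    is_core_lib:     callable(lib) -> True for libc-family libs to skip
    """
    extra = set()
    for lib in bundled:
        for dep in resolve_closure(lib):
            if dep not in bundled and not is_core_lib(dep):
                extra.add(dep)
    return extra
```

Because the resolver already returns the recursive closure, a single pass over the bundled set is enough; no fixed-point loop is needed.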

Adds a self-contained regression test (gcc + ldd) that fabricates a primary
lib linking a transitive lib and asserts the sweep bundles the dependency.

Fixes #10537

Assisted-by: Claude:opus-4.8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 01:36:33 +02:00
Ettore Di Giacinto
9a1be79f04 docs(paged): lever-4 scope - NVFP4 the still-bf16 MoE GDN/attn projections (+6.5ms bucket)
Scopes lever 4 (read-only, no GPU) on top of the flat levers 2+3. Root cause: the
MoE GGUF (nvidia modelopt, 241 NVFP4 tensors) quantized only the experts and left the
GDN/attn linear projections in BF16, while the dense GGUF (unsloth, 304 NVFP4 tensors)
already has them NVFP4 (proven: dense ssm_out runs FP4 MMQ; dense decode at 96.6% of
vLLM). Lever 4 = re-quantize the MoE GGUF's bf16 GDN/attn projections to NVFP4, the same
move vLLM makes on the identical weights - the +6.5ms projections bucket, the largest
single banked MoE gain available.

Path: offline re-quantize to a new GGUF variant (expanded --tensor-type); zero kernel
code - the loader sidecar-scale path + tuned mul_mat_q<NVFP4> are already in tree and
proven by the dense GGUF. Bit-changing => KL-gate, not md5. KL expected to pass (per-step
non-accumulating weight quant, unlike the failed bf16-state; experts already W4A4-clean);
lm_head is the one risky tensor (gate on argmax-agreement). Expected ~+4-6.5ms => MoE
86.3% -> ~88-91% of vLLM. Recommend a separate OPT-IN gallery variant (preserve the
bit-exact default; promote to default only if the KL gate is clean).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 22:57:05 +00:00
LocalAI [bot]
2c96c2d08e chore: ⬆️ Update mudler/parakeet.cpp to f469a57270a1cc4554acb15febf60e56619673b9 (#10530)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-27 00:50:51 +02:00
LocalAI [bot]
f01a969f7b docs: ⬆️ update docs version mudler/LocalAI (#10531)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-27 00:29:29 +02:00
Ettore Di Giacinto
62c407ed55 docs(paged): lever1 gather-fusion bench landed - checkpoint + attribution (patch 0028)
Anchors the rigorous same-session A/B validation of patch 0028 (residual conv-state
tap k_get_rows fusion) on this worktree branch with sign-off attribution. The
regenerated 0028 patch + bench-updated LEVER1_GATHER_RESULTS.md first landed via a
concurrent origin/master merge (c1f1d1e8e) that swept the staged files; this records
the provenance and the bench summary in the checkpoint.

Gate (bit-exact, greedy --temp 0 --seed 1 -n 48): dense q36-27b-nvfp4
5951a5b4d624ce891e22ab5fca9bc439, MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd
(both == baseline; base == lever1). decode_agg npl128: dense 369.95 -> 377.83 t/s
(+2.13%, 96.6% of vLLM), MoE 763.47 -> 777.95 t/s (+1.90%, 86.3% of vLLM). nsys MoE
decode: k_get_rows_float 17334 -> 15414 inst (-1920), 358.37 -> 133.52 ms, step -3.13 ms.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 21:41:45 +00:00
Ettore Di Giacinto
c1f1d1e8ea Merge remote-tracking branch 'origin/master' into worktree-feat+paged-attention
# Conflicts:
#	gallery/index.yaml
2026-06-26 21:38:56 +00:00
Ettore Di Giacinto
6dd8a3d895 docs(gallery): NVFP4 GGUFs published to mudler/ - update header note
The dense + MoE base NVFP4 GGUFs are live (huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF
and .../Qwen3.6-35B-A3B-NVFP4-GGUF), sha256 verified vs the Hub LFS hash, uris resolve.
Replaces the placeholder/not-yet-published TODO.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 21:31:16 +00:00
Ettore Di Giacinto
79edfd26a3 feat(gallery): -paged suffix rename + qwopus NVFP4-MTP paged variants
Rename the two base NVFP4 entries to a consistent -paged suffix
(qwen3.6-27b-nvfp4 -> qwen3.6-27b-nvfp4-paged, qwen3.6-35b-a3b-nvfp4 ->
qwen3.6-35b-a3b-nvfp4-paged) so all four base/MTP paged entries share the
naming convention. Update the two matching examples in the backend plan doc.

Add qwopus3.6-27b-v2-mtp-nvfp4-paged and qwopus3.6-27b-coder-mtp-nvfp4-paged:
verbatim copies of the stock qwopus NVFP4-MTP entries (same GGUF uri/sha256,
sampling, template, tags, function block) rewired onto the LocalAI
paged-attention stack (backend llama-cpp-localai-paged; f16, flash_attention,
131072 context, 99 gpu_layers, batch 512; paged_kv + max_batch_tokens:512 +
kv_unified:false + parallel:128). The stock entries are left untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 21:26:14 +00:00
Ettore Di Giacinto
bf9b4fafa8 feat(gallery): NVFP4-MTP Qwen3.6 entries for the LocalAI paged backend
Add qwen3.6-27b-nvfp4-mtp-paged and qwen3.6-35b-a3b-nvfp4-mtp-paged: the
existing michaelw9999 NVFP4-MTP GGUFs (same uri/sha256/filename and the
recommended Qwen3.6 sampling defaults) wired to backend
llama-cpp-localai-paged with our optimized paged options (f16, flash
attention, 128k context, gpu_layers 99, batch 512, paged_kv, decode-first
max_batch_tokens, kv_unified:false, parallel:128).

These coexist with the stock llama-cpp *-nvfp4-mtp entries (distinct
-paged names) so the four LocalAI-paged NVFP4 entries sit together at the
top of the gallery.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 21:19:52 +00:00
LocalAI [bot]
56600eec3e fix(nodes): show a node's existing labels on the detail view (#10529)
fix(nodes): return labels in single-node GET so the detail view shows them

The node detail view (/app/nodes/:id) reads `node.labels` to render a
node's existing labels, but the single-node GET endpoint returned a bare
BackendNode whose Labels live in a separate table - so the list was always
empty and operators could only add labels, never see what was already set
(#10527). The same response also lacked in_flight_count and model_count.

Add NodeRegistry.GetWithExtras, mirroring the existing List vs ListWithExtras
split: bare Get stays cheap for the routing hot paths and existence checks,
while the detail endpoint uses the enriched variant to attach the labels map
and live counts. No frontend change is needed - the UI already renders
existing labels once the data is present.

Closes #10527


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 23:06:42 +02:00
Ettore Di Giacinto
b1667b48ea feat(paged): qwen35 recurrent-state gather fusion (patch 0028)
Fuse the residual k_get_rows_float in the gated-DeltaNet decode path (the biggest
single kernel vLLM lacks per MOE_GAP_VS_VLLM.md, ~5.2 ms/step MoE). 0019 fused the
SSM-state gather, 0021 fused the conv compute but kept a build_rs gather for the
conv taps; nsys located that conv-state tap gather (n_embd_r=24576 floats x 128
seqs, ~720 x ~115 us per 24-step window) as the last k_get_rows in the GDN path.

New op ggml_ssm_conv_update_inplace_ids reads each sequence's prior conv taps from
cache[ids[s]] in-kernel (identity in place from the write slot, non-identity via a
disjoint scratch), mirroring the 0019 in-place + ids fusion. Bit-exact: read VALUES
unchanged, only the read path changes. Helps both dense and MoE (shared GDN conv).

GATE test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS, SSM_CONV_UPDATE, SSM_CONV,
GATED_DELTA_NET, GET_ROWS all PASS. GATE greedy md5 (--temp 0 --seed 1 -n 48)
BYTE-IDENTICAL both models: q36-27b-nvfp4 5951a5b4..., q36-35b-a3b-nvfp4 07db32c2...
nsys: k_get_rows<float,float> 10174 -> 9454 instances, 186.3 -> 102.8 ms (720 conv
gathers eliminated, replaced by a ~1.1 us no-op gather).

Built and gated on the DGX llama tree (branch paged, commit 944636c, f32 default).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 20:59:59 +00:00
LocalAI [bot]
c4fa256cdf chore(model gallery): 🤖 add 1 new models via gallery agent (#10526)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-26 22:31:22 +02:00