1257 Commits

Author SHA1 Message Date
LocalAI [bot]
957619af53 chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 (#9795)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:48 +02:00
LocalAI [bot]
0b81e36504 chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f (#9794)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:09 +02:00
LocalAI [bot]
602866a9d8 chore: ⬆️ Update ggml-org/whisper.cpp to 338cce1e58133261753243802a0e7a430118866d (#9793)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:39:57 +02:00
LocalAI [bot]
bc4cd3dd85 feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.
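
A minimal sketch of the comma-separated `spec_type` parsing described above. The `SpecType` enum and both helper names are hypothetical stand-ins for illustration, not the upstream `common_speculative_type` API, and silently skipping unknown names is a choice of this sketch rather than confirmed upstream behaviour:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for upstream's common_speculative_type enum.
enum class SpecType { Draft, NgramMod, NgramMapK, NgramMapK4v, NgramCache };

// Strip leading/trailing spaces and tabs from one list entry.
static std::string trim(const std::string& s) {
    const char* ws = " \t";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Map one name to a type; returns false for unknown names.
static bool spec_type_from_name(const std::string& name, SpecType& out) {
    if (name == "draft")         { out = SpecType::Draft;       return true; }
    if (name == "ngram_mod")     { out = SpecType::NgramMod;    return true; }
    if (name == "ngram_map_k")   { out = SpecType::NgramMapK;   return true; }
    if (name == "ngram_map_k4v") { out = SpecType::NgramMapK4v; return true; }
    if (name == "ngram_cache")   { out = SpecType::NgramCache;  return true; }
    return false;
}

// Parse "draft,ngram_mod" into an ordered list of types.
// Unknown or empty entries are skipped (an assumption of this sketch).
std::vector<SpecType> parse_spec_types(const std::string& value) {
    std::vector<SpecType> types;
    std::size_t start = 0;
    while (start <= value.size()) {
        std::size_t comma = value.find(',', start);
        if (comma == std::string::npos) comma = value.size();
        SpecType t = SpecType::Draft;
        if (spec_type_from_name(trim(value.substr(start, comma - start)), t)) {
            types.push_back(t);
        }
        start = comma + 1;
    }
    return types;
}
```

A single name such as `draft` yields a one-element list, so the old scalar form stays a special case of the list form.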

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)
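
A rough sketch of parsing the comma-separated `<tensor regex>=<buffer type>` form. The function name, the split on the first `=`, and keeping the buffer type as a plain string are assumptions of this sketch; the real handler presumably has to resolve the name to an actual buffer type:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One override: a tensor-name regex and the buffer type name it maps to.
using TensorOverride = std::pair<std::string, std::string>;

// Parse "<regex>=<buffer type>,<regex>=<buffer type>,...".
// Entries without '=' or with an empty side are skipped in this sketch.
// Note: a regex that itself contains ',' cannot be expressed in this format.
std::vector<TensorOverride> parse_tensor_overrides(const std::string& value) {
    std::vector<TensorOverride> overrides;
    std::size_t start = 0;
    while (start <= value.size()) {
        std::size_t comma = value.find(',', start);
        if (comma == std::string::npos) comma = value.size();
        const std::string entry = value.substr(start, comma - start);
        const std::size_t eq = entry.find('=');
        if (eq != std::string::npos && eq > 0 && eq + 1 < entry.size()) {
            overrides.emplace_back(entry.substr(0, eq), entry.substr(eq + 1));
        }
        start = comma + 1;
    }
    return overrides;
}
```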

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:37 +02:00
dependabot[bot]
61c9b187fa chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm (#9779)
chore(deps): update charset-normalizer requirement

Updates the requirements on [charset-normalizer](https://github.com/jawah/charset_normalizer) to permit the latest version.
- [Release notes](https://github.com/jawah/charset_normalizer/releases)
- [Changelog](https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jawah/charset_normalizer/compare/3.4.0...3.4.7)

---
updated-dependencies:
- dependency-name: charset-normalizer
  dependency-version: 3.4.7
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:22:23 +02:00
dependabot[bot]
abc2a51641 chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers (#9775)
chore(deps): update transformers requirement

Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v5.0.0...v5.8.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.8.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:21:05 +02:00
LocalAI [bot]
78722caedc chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de (#9764)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-12 00:02:22 +02:00
LocalAI [bot]
621c612b2d ci(bump-deps): register ds4 + move version pin into the Makefile (#9761)
* ci(bump-deps): register ds4 + move version pin into the Makefile

The initial ds4 PR (#9758) put the upstream commit pin in
backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at
.github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION
was invisible to it - other backends (llama-cpp, ik-llama-cpp,
turboquant, voxtral, etc.) all pin in their Makefile.

This change:

- Moves DS4_VERSION?= and DS4_REPO?= to the top of
  backend/cpp/ds4/Makefile.
- Inlines the git init/fetch/checkout recipe into the 'ds4:' target
  (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts
  as the target so make only re-clones when missing.
- Deletes the now-redundant prepare.sh.
- Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to
  the .github/workflows/bump_deps.yaml matrix so the daily bot opens
  PRs against this pin.
- Updates .agents/ds4-backend.md to point at the Makefile.

Verified:
  $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile
  DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
  $ make -C backend/cpp/ds4 ds4   # clones into ds4/ at the pin
  $ make -C backend/cpp/ds4 ds4   # no-op on second invocation
  make: 'ds4' is up to date.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: route backend/cpp/ds4/ changes through changed-backends.js

scripts/changed-backends.js:inferBackendPath has an explicit branch per
cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a
matching branch the function returns null, the backend never lands in
the path map, and PR change-detection cannot map "backend/cpp/ds4/X
changed" -> "rebuild ds4 image".

This is why PR #9761 produced zero ds4 jobs even though it directly
edits backend/cpp/ds4/Makefile.

Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed
before the llama-cpp branch (since both share the .cpp ancestry but
ds4 is more specific - same ordering rule documented in
.agents/adding-backends.md).

Verified with a local Node simulation of the script against this PR's
diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a
'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend
in the rebuild set.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(adding-backends): harden the two gotchas that bit ds4

Both omissions are silent at the time you ADD a backend - the failure
mode only appears later (the bump bot stays silent forever, or the
missing path-filter branch surfaces on the next PR that touches your
backend, which gets zero CI jobs and looks broken for unrelated reasons).
Expands the `scripts/changed-backends.js` paragraph from a one-liner to a
fully worked example, and adds a new sibling paragraph for the
`bump_deps.yaml` + Makefile-pin contract.

Both call out the specific mistakes from the ds4 timeline (#9758 -> #9761) so future contributors can pattern-match on the cause.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:46:02 +02:00
LocalAI [bot]
d892e4af80 feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)
* test(e2e-backends): allow BACKEND_BINARY for native-built backends

Adds an escape hatch for hardware-gated backends (e.g. ds4) where the
model is too large for Docker build context. When BACKEND_BINARY points
at a run.sh produced by 'make -C backend/cpp/<name> package', the suite
skips docker image extraction and drives the binary directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e-backends): validate BACKEND_BINARY basename + log actual source

Two follow-ups from the cbcf5148 code review:

- BACKEND_BINARY now requires a path whose basename is `run.sh`. Without
  this check, `filepath.Dir(binary)` silently discarded the filename, so
  pointing the env var at an arbitrary binary failed later with a
  confusing assertion that named a path the user never typed.
- The "Testing image=..." debug line printed an empty string when the
  binary path was used, hiding the actual source in CI logs. The line
  now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect
  as `src=...`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): scaffold ds4 backend dir

Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the
implementation arrive in follow-up commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add backend Makefile

Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux
when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then
invokes CMake on our wrapper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add CMakeLists for grpc-server

Generates protoc stubs from backend.proto, links grpc-server.cpp +
dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built
ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): grpc-server skeleton + module stubs

The minimum that links: Backend service with Health + Free; other RPCs
default to UNIMPLEMENTED. Stub headers/sources for dsml_parser,
dsml_renderer, and kv_cache are in place so CMake links cleanly even
before those modules ship.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement LoadModel

Opens engine + creates session sized to ContextSize (default 32768).
Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else
CUDA. MTP/speculative options are accepted via ModelOptions.Options[]
(mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into
g_kv_cache_dir for the cache module (Task 19 wires it in).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement TokenizeString

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Predict (plain text)

Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement PredictStream (plain text)

ChatDelta + reasoning/tool_calls split arrives in Task 14.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Status RPC

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add DSML streaming parser

Classifies raw model-emitted token text into CONTENT / REASONING /
TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the
literal DSML strings rendered by ds4_server.c's prompt template
(<|DSML|tool_calls>, <|DSML|invoke name=...>, <think>, etc.) - these are
plain text the model emits, not special tokens.

Partial markers split across token chunks are buffered until a full marker
or a definitively-not-a-marker '<' is observed. RandomToolId() generates
the API-side tool call id (call_xxx) that exact-replay would key on.
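
The partial-marker buffering can be illustrated with a much smaller splitter that only knows `<think>` and `</think>`. This is a sketch under that simplification, not the actual DsmlParser; the class and type names are hypothetical:

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Kind { Content, Reasoning };

struct Event {
    Kind kind;
    std::string text;
};

// Splits streamed text into reasoning (inside <think>...</think>) and content.
class ThinkSplitter {
public:
    std::vector<Event> Feed(const std::string& chunk) {
        std::vector<Event> out;
        buf_ += chunk;
        while (!buf_.empty()) {
            const std::size_t lt = buf_.find('<');
            if (lt == std::string::npos) {      // no possible marker: emit all
                Emit(out, buf_);
                buf_.clear();
                break;
            }
            if (lt > 0) {                       // emit text before the '<'
                Emit(out, buf_.substr(0, lt));
                buf_.erase(0, lt);
            }
            // Only the marker that can legally come next is considered.
            const std::string marker = in_think_ ? "</think>" : "<think>";
            if (buf_.compare(0, marker.size(), marker) == 0) {
                buf_.erase(0, marker.size());   // full marker: switch state
                in_think_ = !in_think_;
            } else if (marker.compare(0, buf_.size(), buf_) == 0) {
                break;                          // partial marker: wait for more
            } else {
                Emit(out, "<");                 // definitively not a marker
                buf_.erase(0, 1);
            }
        }
        return out;
    }

    // Drain whatever is still buffered at end of stream.
    std::vector<Event> Flush() {
        std::vector<Event> out;
        if (!buf_.empty()) {
            Emit(out, buf_);
            buf_.clear();
        }
        return out;
    }

private:
    // Append text under the current state, merging with the previous event.
    void Emit(std::vector<Event>& out, const std::string& text) {
        const Kind kind = in_think_ ? Kind::Reasoning : Kind::Content;
        if (!out.empty() && out.back().kind == kind) {
            out.back().text += text;
        } else {
            out.push_back(Event{kind, text});
        }
    }

    std::string buf_;
    bool in_think_ = false;
};
```

Feeding `"<thi"` produces no events, because the buffer could still become `<think>`; the text is only classified once the following chunk settles the question.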

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes

C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape
producing byte 0xCD, eating the 'D'. The markers were never actually matching
the DSML text the model emits. Split each escape with adjacent string literal
concatenation so the byte sequence is exactly EF BD 9C 44 (|D) at runtime.
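
A small illustration of the split-literal fix, assuming the marker character is U+FF5C (UTF-8 bytes EF BD 9C) as the byte sequence above suggests. The function name is hypothetical:

```cpp
#include <cassert>
#include <string>

// Writing "\xEF\xBD\x9CD" would let the last escape absorb the 'D', because
// 'D' is a valid hex digit and \x escapes have no length limit.
// Ending the literal after \x9C and starting a new one keeps 'D' separate;
// adjacent string literals are concatenated by the compiler.
inline std::string bar_then_d() {
    return std::string("\xEF\xBD\x9C" "D");
}
```

The resulting string is four bytes long: the three bytes of the marker character followed by a literal `D`.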

Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively
expose std::strlen / std::snprintf via <string>).

The local plan file (uncommitted) was also updated with the same fixes so
Task 16's dsml_renderer.cpp does not re-introduce the bug.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta)

Non-streaming Predict now emits one ChatDelta carrying content,
reasoning_content, and tool_calls[] parsed from the model's DSML output.
Reply.message still carries the raw model bytes for backends that prefer
the regex fallback path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into PredictStream

Per-token ChatDelta writes: content/reasoning_content go incrementally,
tool_calls emit TOOL_START as one delta (id + name) followed by
TOOL_ARGS deltas with incremental JSON. The Go-side aggregator
(pkg/functions/chat_deltas.go) reassembles them.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): chat template + reasoning_effort mapping

UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append /
assistant_prefix. PredictOptions.Metadata['enable_thinking'] and
['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default;
'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE).
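
The mapping can be sketched as below. The `ThinkMode` enum stands in for the `DS4_THINK_*` values and the function name is hypothetical; treating "disabled" as `enable_thinking` being false is an assumption of this sketch:

```cpp
#include <cassert>
#include <string>

// Stand-ins for DS4_THINK_NONE / DS4_THINK_HIGH / DS4_THINK_MAX.
enum class ThinkMode { None, High, Max };

// Map the request metadata to a think mode.
ThinkMode map_think_mode(bool enable_thinking, const std::string& reasoning_effort) {
    if (!enable_thinking) return ThinkMode::None;   // thinking disabled
    if (reasoning_effort == "max" || reasoning_effort == "xhigh") {
        return ThinkMode::Max;
    }
    return ThinkMode::High;                         // default
}
```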

Tool-call rendering for assistant turns with tool_calls JSON arrives in
the next commit (dsml_renderer).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML

Closes the round-trip: when an OpenAI client sends a multi-turn chat
where prior turns contain tool_calls or role=tool messages, build_prompt
serializes them back to the DSML shape the model was trained on. Mirrors
ds4_server.c's prompt renderer; uses nlohmann::json for parsing the
OpenAI tool_calls payload.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): disk KV cache module

Dir-based cache keyed by SHA1(rendered prompt prefix). File format:
'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes
+ ds4_session_save_payload output. NOT bit-compatible with ds4-server's
KVC files - that interop is a follow-up plan. LoadLongestPrefix walks
the dir picking the longest stored prefix that prefixes the incoming
prompt.
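
The selection rule behind LoadLongestPrefix can be sketched over in-memory strings. This leaves out the directory walk, the SHA1 keys, and the file format, and the function name is hypothetical:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Return the index of the longest stored prefix that is a prefix of `prompt`,
// or -1 when none matches. Empty stored prefixes are ignored.
int longest_prefix_index(const std::vector<std::string>& stored,
                         const std::string& prompt) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < stored.size(); ++i) {
        const std::string& p = stored[i];
        if (p.empty() || p.size() > prompt.size()) continue;
        if (prompt.compare(0, p.size(), p) != 0) continue;  // not a prefix
        if (p.size() > best_len) {
            best = static_cast<int>(i);
            best_len = p.size();
        }
    }
    return best;
}
```

Picking the longest match means the session restores as much state as possible and only the remaining suffix of the prompt has to be evaluated.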

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream

LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to
g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for
the request, tries LoadLongestPrefix to recover state, then Saves the
new state after generation. ds4_session_sync handles the live-cache
fast path internally, so the disk cache only matters for cold-starts
and cross-session reuse.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add package.sh

Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into
package/lib so the FROM scratch image boots without a host libc.
Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp

ds4.h defines 'typedef enum {...} ds4_backend' which collides with our
C++ 'namespace ds4_backend' anywhere a TU includes both. kv_cache.h
includes ds4.h directly and surfaces the conflict immediately; other
TUs would hit it once gRPC dev headers are available.

Renames the C++ namespace to ds4cpp across all wrapper files and the
plan, leaving the upstream ds4 typedef untouched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): add Dockerfile.ds4

Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu)
-> FROM scratch with packaged grpc-server + bundled runtime libs.
nlohmann-json3-dev is required for dsml_renderer's JSON handling.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile

BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4
in docker-build-backends + .NOTPARALLEL guards. Also adds the
backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh
(landed in Task 24).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch)

Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a
multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13.
Darwin Metal is handled outside this matrix by backend_build_darwin.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add ds4 meta + image entries

cpu + cuda13 x latest + master. Darwin Metal builds publish under
ds4-darwin via the existing llama-cpp-darwin OCI pipeline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(scripts/build): add ds4-darwin.sh

Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
make grpc-server -> otool -L for dylib bundling -> OCI tar that
'local-ai backends install' consumes via the backends/ds4-darwin
Makefile target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(darwin): build ds4-darwin in backend_build_darwin

Adds a 'Build ds4 backend (Darwin Metal)' step that runs the
backends/ds4-darwin Makefile target on the macOS runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(import): auto-detect ds4 weights via DS4Importer

Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf
repo URI and the DeepSeek-V4-Flash-*.gguf filename pattern. Registered before
LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling
through to llama-cpp.

Also lists ds4 in /backends/known so the /import-model UI surfaces it as a
manual choice for users who want to force the backend on a non-canonical URI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add deepseek-v4-flash-q2 (ds4 backend)

One-click install of the q2 weights with backend: ds4.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(.agents): add ds4-backend.md

Documents the backend shape, DSML state machine, thinking-mode mapping,
disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY
hardware-validation path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps

The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to
2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported
into the environment so it can pick the right cuda-keyring / cudss / nvpl
debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not
re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern.

Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13'
failed at:
  /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add Metal image entries for ds4

Adds metal-ds4 + metal-ds4-development image entries pointing at
quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4
(built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the
'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and
ds4-development variant.

Closes a gap from the initial Task 23 landing - the Darwin Metal build
script and CI workflow step were already wired (Tasks 24-25), but the
gallery had no image entry for users to install the Metal variant.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries

The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04'
which clashes with install-base-deps.sh's cuda-keyring step:

  E: Conflicting values set for option Signed-By regarding source
     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain
'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA
from scratch via its own keyring setup. Adopting that here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): drop install-base-deps.sh dependency

The .docker/install-base-deps.sh pipeline is built around the llama-cpp
needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at
/opt/grpc. For ds4 we don't need any of that:
- CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda
  ready to go; install-base-deps's keyring step then conflicts with
  the pre-installed Signed-By.
- gRPC: ds4's grpc-server.cpp only links against grpc++; system
  libgrpc++-dev (apt) is sufficient, no source build needed.

Replaced the install-base-deps invocation in Dockerfile.ds4 with a
direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries
go back to the nvidia/cuda base + skip-drivers=true so install-base-deps would
no-op even if some downstream tooling calls it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus

Two compile bugs caught by the docker build:

1. proto::Message uses snake_case accessors. The build_prompt loop called
   m.toolcalls() / m.toolcallid() - the protoc-generated names are
   m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the
   wrapper.

2. The Status RPC method shadowed the 'using grpc::Status' alias, so any
   later method declaration using Status as a return type failed to parse
   ('Status does not name a type' starting at LoadModel). Solution: alias
   grpc::Status as GStatus instead, with no 'using' clause that would
   conflict. All RPC method declarations and return-statement constructions
   now use GStatus.
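
The alias approach from point 2 can be shown without gRPC. In this sketch `rpc::Status` is a hypothetical stand-in for `grpc::Status`, and the class is reduced to two methods:

```cpp
#include <cassert>

namespace rpc {  // hypothetical stand-in for the grpc namespace
struct Status {
    bool ok;
    static Status OK() { return Status{true}; }
};
}  // namespace rpc

// Alias under a name that no RPC method uses.
using GStatus = rpc::Status;

class Backend {
public:
    // The RPC handler is literally named Status. Inside this class the
    // unqualified name Status now refers to this method, not to the type,
    // so later declarations spell the return type as GStatus.
    GStatus Status() { return GStatus::OK(); }
    GStatus LoadModel() { return GStatus::OK(); }
};
```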

An earlier code review flagged the Status-shadow concern as 'minor'
in the original Task 10 commit; it turned out to be a real compile blocker
under libstdc++ 13 once the surrounding methods were filled in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush

When the model emitted a parameter value that arrived in the same buffer
as the surrounding tool_call markers (e.g. the buffered tail after a
literal '</think>' opened the model output), the parser deferred all
buffered bytes to Flush() because looks_like_prefix() always returns
true while buf starts with '<'. Flush() then drained the buffer as
plain CONTENT/REASONING regardless of parser state, so the bytes
between the parameter open and close markers were classified as
CONTENT instead of TOOL_ARGS.

Symptom: the model emitted

  <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter>

and the assembled tool_call arguments came out as {"location":""} -
the opener and closer were emitted into the args stream but the
"Paris, France" content went to the assistant message instead.

Fix:

1. Flush() now uses the same state-aware emit logic as DrainPlain:
   PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string),
   THINK bytes become REASONING, TEXT bytes become CONTENT, and
   INVOKE / TOOL_CALLS structural whitespace is discarded.

2. looks_like_prefix() restricts its leading-'<' fallback to buffers
   that have not yet seen a '>'. Without that change, char-by-char
   feeds would discard the '<' of '<|DSML|invoke name="..."' once
   the marker prefix length was reached but the closing quote/'>'
   were still in flight.

Verified with a standalone harness that runs the failing input three
ways (single Feed, split-after-'>', and char-by-char) and aggregates
TOOL_ARGS for tool index 0: all three now produce
{"location":"Paris, France"}.

Assisted-by: Claude:opus-4.7 [Read,Edit,Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence

ds4_engine_generate_argmax() is a self-contained helper that doesn't take or
update a ds4_session - it manages its own internal state. Our Predict and
PredictStream methods created g_session via ds4_session_create() but then
called ds4_engine_generate_argmax(), so g_session's KV state never advanced.
ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save
was correctly rejected with 'session has no valid checkpoint to save'.

Switch both RPCs to the proper session API:
  ds4_session_sync(g_session, &prompt, ...)
  loop:
    int token = ds4_session_argmax(g_session)
    if token == eos: break
    emit(token)
    ds4_session_eval(g_session, token, ...)

After the loop the session has a real checkpoint and ds4_session_save_payload
writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three
.kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets
kv_cache_dir, and the e2e tool-call assertion still passes.

Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save
path + payload_bytes + result) so future failures are visible instead of
silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream
when the cache is enabled - and skipped entirely when the option is unset.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded

Wires MTP (Multi-Token Prediction) speculative decoding into the manual
generation loop in both Predict and PredictStream. When the upstream MTP
weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal,
ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to
ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per
verifier step. When MTP is not loaded (no option, CPU backend, or weights
absent), we fall through to the simple ds4_session_argmax + ds4_session_eval
path with no behavior change.

Validated on a DGX Spark GB10 with the optional MTP GGUF
(DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs
'ds4: MTP support model loaded ... (draft=2)' on stderr.

Caveat per upstream README: 'currently provides at most a slight speedup,
not a meaningful generation-speed win'. Wired now mainly to track the
upstream API; bigger speedups arrive when ds4 improves the speculative path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override

Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI
gRPC side. The generation loop now consults compute_sample_params() per
token to pick the effective (temperature, top_k, top_p, min_p), based on:

  1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp
  2. Thinking-mode override: when enable_thinking != false, force T=1.0,
     top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and
     the trailing content)
  3. DSML structural override: when DsmlParser::IsInDsmlStructural()
     returns true (we are between tool-call markers but NOT in a param
     value payload), force T=0.0 so protocol bytes parse cleanly

When the effective temperature is 0, we keep using ds4_session_argmax +
MTP speculative path (matches ds4-server's gate that only enables MTP for
greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with
a per-thread RNG seeded from system_clock and fall back to single-token
ds4_session_eval.

New public method on DsmlParser: IsInDsmlStructural() encodes which states
need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user
sampling); TEXT and THINK are excluded (no tool-call context to protect).
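
A rough Python sketch of the override order described above, assuming the three rules are applied as layers (request defaults, then thinking mode, then DSML structural). The `SampleParams` type and the argument names are hypothetical; only the forced values are taken from the text, and the commit does not say what top_k / top_p / min_p become under the structural override, so they are left as inherited here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SampleParams:
    temperature: float
    top_k: int
    top_p: float
    min_p: float


def compute_sample_params(request, thinking_enabled, in_dsml_structural):
    """Pick the effective sampling parameters for the next token."""
    params = request  # 1. request defaults
    if thinking_enabled:
        # 2. thinking-mode override
        params = SampleParams(temperature=1.0, top_k=0, top_p=1.0, min_p=0.0)
    if in_dsml_structural:
        # 3. structural override: greedy, so protocol bytes parse cleanly
        params = replace(params, temperature=0.0)
    return params


def is_greedy(params):
    # Greedy positions keep the argmax (and MTP speculative) path.
    return params.temperature == 0.0
```

Because the function is consulted per token, a single response can alternate between sampled payload text and greedy structural markers.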

Verified on the DGX Spark GB10: the e2e suite still passes with all 5
specs including tools, and the Predict output now varies between runs
(creative sampling active) while the tool-call args remain a clean
'{"location":"Paris, France"}' because the parser-state check forces
greedy on the structural bytes.

UX note: thinking mode is ON by default (matching ds4-server). Users who
want deterministic output should set Metadata.enable_thinking = false.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add sha256 to deepseek-v4-flash-q2 entry

Per HF LFS metadata for antirez/deepseek-v4-gguf:
  size: 86720111200 bytes (~80.76 GiB)
  sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c

LocalAI's downloader verifies sha256 when present, so users who install
deepseek-v4-flash-q2 from the gallery get integrity-checked weights and
the partial-download issue (an ~81 GiB file is easy to truncate) becomes
recoverable instead of silently producing a broken backend.
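
For illustration, a streamed sha256 check of the kind described above could look like this in Python. It is a sketch using only hashlib, not LocalAI's downloader; the function names are hypothetical.

```python
import hashlib


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so a multi-GiB file never sits in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True when the file's sha256 matches the expected hex digest."""
    with open(path, "rb") as f:
        return sha256_of_stream(f) == expected_hex.lower()
```

A truncated download yields a different digest, so `verify_file` returns False and the caller can retry instead of loading a broken model.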

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:15:47 +02:00
LocalAI [bot]
b9e81dbfd4 chore: ⬆️ Update ggml-org/llama.cpp to 389ff61d77b5c71cec0cf92fe4e5d01ace80b797 (#9752)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-11 08:14:07 +02:00
LocalAI [bot]
19d59102d5 feat(whisper-cpp): implement streaming transcription (#9751)
* test(whisper): wire e2e streaming transcription target

Adds test-extra-backend-whisper-transcription, mirroring the existing
llama-cpp / sherpa-onnx / vibevoice-cpp targets. The generic
AudioTranscriptionStream spec at tests/e2e-backends/backend_test.go:644
fails today because backend/go/whisper has no streaming impl - this
target is the failing TDD gate that the next phase makes pass.

Confirmed RED locally: 3 Passed (health, load, offline transcription),
1 Failed (streaming spec hits its 300s context deadline because the
base implementation returns 'unimplemented' but doesn't close the
result channel, leaving the gRPC stream open until the client times
out).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): expose new_segment_callback to the Go side

Adds set_new_segment_callback() and a C-side trampoline that whisper.cpp
invokes once per new text segment during whisper_full(). The trampoline
dispatches (idx_first, n_new, user_data) to a Go function pointer
registered via purego.NewCallback - text and timings are pulled by Go
through the existing get_segment_text/get_segment_t0/get_segment_t1
getters.

Wires the hook only when streaming is actually requested, to avoid a
per-segment function-pointer dispatch on the offline path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): implement AudioTranscriptionStream

Wires whisper.cpp's new_segment_callback through purego back to Go so
the streaming transcription RPC produces real, time-correlated deltas
while whisper_full() is still decoding. Each segment becomes one
TranscriptStreamResponse{Delta}; whisper_full's return is the
TranscriptStreamResponse{FinalResult} carrying the full segment list,
language, and duration.

Per-call state is tracked in a sync.Map keyed by an atomic counter; the
Go callback registered via purego.NewCallback is a singleton, dispatched
through user_data. SingleThread today means only one entry is ever live,
but the map shape matches the sherpa-onnx TTS callback pattern.

The streaming path's final.Text is the literal concat of every emitted
delta (a strings.Builder accumulated by onNewSegment) so the e2e
invariant `final.Text == concat(deltas)` holds exactly. The first delta
has no leading space; subsequent deltas are space-prefixed. The offline
AudioTranscription path is unchanged.
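
The delta/final-text invariant can be sketched in Python as below. This mirrors the description only (the backend is Go and uses a strings.Builder); the `DeltaAccumulator` class is hypothetical, and trimming each segment before prefixing is an assumption.

```python
class DeltaAccumulator:
    """Collects streamed segments so that final_text == "".join(deltas)."""

    def __init__(self):
        self.deltas = []

    def on_new_segment(self, text):
        text = text.strip()  # assumption: segments are trimmed before emission
        # The first delta has no leading space; later deltas are space-prefixed.
        delta = text if not self.deltas else " " + text
        self.deltas.append(delta)
        return delta

    @property
    def final_text(self):
        # Built from the emitted deltas themselves, so the invariant holds exactly.
        return "".join(self.deltas)
```

Deriving the final text from the emitted deltas, rather than re-reading the segments afterwards, is what makes the equality exact instead of approximate.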

Closes the gap with sherpa-onnx, vibevoice-cpp, llama-cpp, and tinygrad,
which already implement AudioTranscriptionStream.

Verified GREEN locally: make test-extra-backend-whisper-transcription
passes 4/4 specs (3 Passed initially under RED, +1 streaming spec now).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(whisper-cpp): assert progressive multi-segment streaming

Drives AudioTranscriptionStream against a real long-audio fixture and
asserts len(deltas) >= 2. The generic e2e spec at
tests/e2e-backends/backend_test.go:644 only checks len(deltas) >= 1
which is satisfied by both real and faked streaming - this spec is the
guardrail that a future "fake" impl can't sneak past.

Skipped by default (env-gated, like the cancellation spec); set
WHISPER_LIBRARY, WHISPER_MODEL_PATH, and WHISPER_AUDIO_PATH to a 30+
second clip to run.

Verified locally with a 55s 5x-JFK concat against ggml-base.en.bin:
1 Passed in 7.3s, deltas >= 2, finalSegmentCount >= 2,
concat(deltas) == final.Text.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(whisper-cpp): add transcription gRPC e2e job

Mirrors tests-sherpa-onnx-grpc-transcription /
tests-llama-cpp-grpc-transcription. Runs make
test-extra-backend-whisper-transcription whenever the whisper backend
or the run-all switch fires, so a pin-bump or refactor that breaks
streaming transcription gets caught before merge.

The whisper output on detect-changes is already emitted by
scripts/changed-backends.js (it iterates allBackendPaths); this PR
just exposes it as a workflow output and consumes it.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper-cpp): silence errcheck on AudioTranscriptionStream defers

golangci-lint runs with new-from-merge-base=origin/master, so the
identical defer patterns in the existing offline AudioTranscription
path are grandfathered while the new ones in AudioTranscriptionStream
trip errcheck. Wrap both defers in `func() { _ = ... }()` to match what
errcheck wants without altering behavior. The errors from os.RemoveAll
and *os.File.Close are not actionable inside a defer here (we're
already returning), matching the offline path's contract.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 23:11:46 +02:00
LocalAI [bot]
4715a68660 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.20.2 (#9750)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-10 21:33:07 +02:00
LocalAI [bot]
28f33be48f chore: ⬆️ Update ggml-org/whisper.cpp to c33c5618b72bb345df029b730b36bc0e369845a3 (#9749)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:32:47 +02:00
LocalAI [bot]
a435f7cc69 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 23127139cb6fa314899c3b5f4935b88b3374c56c (#9748)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-10 21:32:28 +02:00
LocalAI [bot]
f6c9c20911 chore: ⬆️ Update ggml-org/llama.cpp to 2b2babd1243c67ca811c0a5852cedf92b1a20024 (#9747)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:17:38 +02:00
LocalAI [bot]
6cbf69dc29 chore: ⬆️ Update ggml-org/llama.cpp to 1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0 (#9739)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 00:06:29 +02:00
LocalAI [bot]
593f3a8648 ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738)
* ci(backend_build): plumb builder-base-image and BUILDER_TARGET build-args

Adds an optional builder-base-image input. When set, BUILDER_BASE_IMAGE
is forwarded as a build-arg AND BUILDER_TARGET=builder-prebuilt is set
to select the variant Dockerfile's prebuilt-base stage. When empty,
BUILDER_TARGET=builder-fromsource (the default) keeps the existing
from-source build path.

This makes the prebuilt-base optimization opt-in per matrix entry
without breaking local `make backends/<name>` invocations or backends
whose Dockerfile doesn't have a prebuilt path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(llama-cpp,ik-llama-cpp,turboquant): multi-target Dockerfiles for prebuilt + from-source

Restructure the three llama.cpp-derived Dockerfiles so each supports
two builder paths in a single file, selected via the BUILDER_TARGET
build-arg:

  BUILDER_TARGET=builder-fromsource (default)
    - Standalone build: gRPC stage + apt installs + (conditionally)
      CUDA/ROCm/Vulkan + compile.
    - Used by `make backends/llama-cpp` locally and any caller that
      doesn't supply a prebuilt base.

  BUILDER_TARGET=builder-prebuilt
    - FROM ${BUILDER_BASE_IMAGE} (one of quay.io/go-skynet/ci-cache:
      base-grpc-* shipped in PR #9737).
    - Skips ~25-35 min of gRPC compile + ~5-10 min of toolchain installs.
    - Used by CI when the matrix entry sets builder-base-image.

Final FROM scratch resolves BUILDER_TARGET via an aliasing FROM stage
(BuildKit doesn't support variable expansion directly in COPY --from),
then COPY --from=builder pulls package output from the chosen path.
BuildKit prunes the unreferenced builder, so each build only does the
work for the chosen path.

The compile RUN is identical between both builder stages, so it's
factored into .docker/<name>-compile.sh and bind-mounted into both.
ccache mount + cache-id stay per-arch / per-build-type.

Local DX preserved: `make backends/llama-cpp` (no extra args) defaults
to BUILDER_TARGET=builder-fromsource and works exactly as before.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend.yml,backend_pr.yml): forward builder-base-image from matrix

Plumbs the new optional builder-base-image input from matrix into
backend_build.yml. backend_build.yml derives BUILDER_TARGET from
whether builder-base-image is set, so matrix entries that map to a
prebuilt base get the prebuilt path; entries that don't (python/go/
rust backends) fall through to the default builder-fromsource (which
their own Dockerfiles don't reference, so it's a no-op for them).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend-matrix): wire builder-base-image to llama-cpp variants

For every entry whose Dockerfile is llama-cpp/ik-llama-cpp/turboquant,
add a builder-base-image field pointing at the appropriate prebuilt
quay.io/go-skynet/ci-cache:base-grpc-* tag.

backend_build.yml derives BUILDER_TARGET from this field's presence:
non-empty -> builder-prebuilt; empty -> builder-fromsource. So this
commit alone activates the prebuilt-base path for these 23 backends
in CI, while local `make backends/<name>` (no extra args) keeps the
from-source path.

Mapping by (build-type, arch):
- '' / amd64        -> base-grpc-amd64
- '' / arm64        -> base-grpc-arm64
- cublas-12 / amd64 -> base-grpc-cuda-12-amd64
- cublas-13 / amd64 -> base-grpc-cuda-13-amd64
- cublas-13 / arm64 -> base-grpc-cuda-13-arm64
- hipblas / amd64   -> base-grpc-rocm-amd64
- vulkan / amd64    -> base-grpc-vulkan-amd64
- vulkan / arm64    -> base-grpc-vulkan-arm64
- sycl_* / amd64    -> base-grpc-intel-amd64
- cublas-12 + JetPack r36.4.0 / arm64 -> base-grpc-l4t-cuda-12-arm64
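
The mapping above, together with the BUILDER_TARGET derivation, can be expressed as a small lookup. This Python sketch is illustrative only: the real logic lives in the workflow YAML and matrix, the function names and the `jetpack` flag are hypothetical, and the build-type labels are copied from the list as written.

```python
REGISTRY = "quay.io/go-skynet/ci-cache"

# (build-type, arch) -> prebuilt base tag, as listed in the commit message.
BASE_TAGS = {
    ("", "amd64"): "base-grpc-amd64",
    ("", "arm64"): "base-grpc-arm64",
    ("cublas-12", "amd64"): "base-grpc-cuda-12-amd64",
    ("cublas-13", "amd64"): "base-grpc-cuda-13-amd64",
    ("cublas-13", "arm64"): "base-grpc-cuda-13-arm64",
    ("hipblas", "amd64"): "base-grpc-rocm-amd64",
    ("vulkan", "amd64"): "base-grpc-vulkan-amd64",
    ("vulkan", "arm64"): "base-grpc-vulkan-arm64",
    ("sycl", "amd64"): "base-grpc-intel-amd64",
}
L4T_TAG = "base-grpc-l4t-cuda-12-arm64"


def builder_base_image(build_type, arch, jetpack=False):
    """Return the prebuilt base image reference, or "" when none applies."""
    if jetpack and (build_type, arch) == ("cublas-12", "arm64"):
        return f"{REGISTRY}:{L4T_TAG}"
    # All sycl_* build types share the Intel oneAPI base.
    key_type = "sycl" if build_type.startswith("sycl_") else build_type
    tag = BASE_TAGS.get((key_type, arch))
    return f"{REGISTRY}:{tag}" if tag else ""


def builder_target(base_image):
    # Non-empty base image selects the prebuilt stage; empty keeps from-source.
    return "builder-prebuilt" if base_image else "builder-fromsource"
```

Entries with no match return an empty string, which keeps them on the from-source path.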

Cold-build savings expected: ~25-35 min per variant (skips the gRPC
compile + toolchain install that's now in the base).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add base-grpc-l4t-cuda-12-arm64 variant for legacy JetPack entries

Two matrix entries (-nvidia-l4t-arm64-llama-cpp, -nvidia-l4t-arm64-
turboquant) build against nvcr.io/nvidia/l4t-jetpack:r36.4.0 + CUDA
12 ARM64. They're distinct from -nvidia-l4t-cuda-13-arm64-* which use
Ubuntu 24.04 + CUDA 13 sbsa. Add the missing JetPack-based variant
to base-images.yml so those two entries' builder-base-image mapping
in the previous commit resolves.

Bootstrap order before merging this PR (re-run base-images.yml on
this branch — 9 existing variants hit BuildKit cache, only the new
l4t-cuda-12-arm64 builds cold):

  gh workflow run base-images.yml --ref ci/base-images-consumers

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract base-builder install logic into .docker/install-base-deps.sh

Pre-extraction, the apt + protoc + cmake + conditional CUDA/ROCm/Vulkan
+ gRPC install logic was duplicated across four files:
  - backend/Dockerfile.base-grpc-builder (CI prebuilt-base source of truth)
  - backend/Dockerfile.llama-cpp (builder-fromsource stage)
  - backend/Dockerfile.ik-llama-cpp (builder-fromsource stage)
  - backend/Dockerfile.turboquant (builder-fromsource stage)

A bump to e.g. CUDA toolkit packages had to be made in 4 places, and
drift between the prebuilt base and the variant-Dockerfile from-source
path was a real concern (ik-llama-cpp's hipblas branch was already
missing the rocBLAS Kernels echo that llama-cpp / turboquant /
base-grpc-builder all had).

Factor the install logic into a single .docker/install-base-deps.sh
that reads its inputs from env vars and runs conditionally on
BUILD_TYPE / CUDA_*_VERSION / TARGETARCH. Each Dockerfile now bind-
mounts the script alongside .docker/apt-mirror.sh and invokes it from
a single RUN step.

The variant Dockerfiles' grpc-source stage is removed entirely — the
script handles gRPC compile + install at /opt/grpc, and the
builder-fromsource stage mirrors builder-prebuilt by copying
/opt/grpc/. to /usr/local/.

Result:
  - install-base-deps.sh: 244 lines (one source of truth)
  - Dockerfile.base-grpc-builder: 268 -> 98 lines
  - Dockerfile.llama-cpp: 361 -> 157 lines
  - Dockerfile.ik-llama-cpp: 348 -> 151 lines
  - Dockerfile.turboquant: 355 -> 154 lines
  - Total Dockerfile lines: 1332 -> 560 (58% reduction)
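
The totals follow directly from the per-file line counts listed above; a quick Python check of the arithmetic (the variable names are illustrative):

```python
# Line counts per Dockerfile, before and after the extraction.
before = {"base-grpc-builder": 268, "llama-cpp": 361, "ik-llama-cpp": 348, "turboquant": 355}
after = {"base-grpc-builder": 98, "llama-cpp": 157, "ik-llama-cpp": 151, "turboquant": 154}

total_before = sum(before.values())
total_after = sum(after.values())
# Percentage of lines removed, rounded to the nearest whole percent.
reduction_pct = round(100 * (1 - total_after / total_before))
```

The 244 lines of install-base-deps.sh are not counted in either total.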

Bit-equivalence between prebuilt and from-source paths is now enforced
by construction: both invoke the same script with the same inputs.
A side-effect is that ik-llama-cpp now also gets the rocBLAS Kernels
echo + clblas block parity it was previously missing.

Includes the BUILD_TYPE=clblas branch (libclblast-dev) for parity even
though no current CI matrix entry uses it.

After this commit's force-push, base-images.yml needs to be redispatched
on this branch — the Dockerfile.base-grpc-builder content shifts so the
existing cache won't apply for the install layer (gRPC layer also
rebuilds since it's now in the same RUN step).

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(base-images): skip-drivers on JetPack l4t variant

cuda-nvcc-12-0 isn't installable via apt on the JetPack r36.4.0 base
image — JetPack ships CUDA preinstalled at /usr/local/cuda and its
apt feed doesn't carry the cuda-nvcc-* packages from the public
repositories. The original matrix entry for -nvidia-l4t-arm64-llama-cpp
on master sets skip-drivers: 'true' for exactly this reason; the
new base-grpc-l4t-cuda-12-arm64 base needs to match.

Also forwards SKIP_DRIVERS as a build-arg from matrix into the build
(was missing entirely before this commit).

Caught by run 25612030775 — l4t-cuda-12-arm64 failed at:
  E: Package 'cuda-nvcc-12-0' has no installation candidate

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 00:03:52 +02:00
LocalAI [bot]
28e29625a2 ci: add pre-built base-grpc-builder image infrastructure (PR 1/2) (#9737)
Introduces a parameterized Dockerfile.base-grpc-builder that produces
a fully-prepped builder base image (apt deps + protoc + cmake + gRPC
at /opt/grpc + conditional CUDA/ROCm/Vulkan toolchains) and a
base-images.yml workflow that builds + pushes 9 variants to
quay.io/go-skynet/ci-cache:base-grpc-*:

  base-grpc-amd64                 (Ubuntu 24.04, CPU-only)
  base-grpc-arm64                 (Ubuntu 24.04, CPU-only)
  base-grpc-cuda-12-amd64         (Ubuntu 24.04 + CUDA 12.8)
  base-grpc-cuda-13-amd64         (Ubuntu 22.04 + CUDA 13.0)
  base-grpc-cuda-13-arm64         (Ubuntu 24.04 + CUDA 13.0 sbsa)
  base-grpc-rocm-amd64            (rocm/dev-ubuntu-24.04:7.2.1 + hipblas)
  base-grpc-vulkan-amd64          (Ubuntu 24.04 + Vulkan SDK 1.4.335)
  base-grpc-vulkan-arm64          (Ubuntu 24.04 + Vulkan SDK ARM 1.4.335)
  base-grpc-intel-amd64           (intel/oneapi-basekit:2025.3.2)

The variant Dockerfiles (Dockerfile.llama-cpp, ik-llama-cpp, turboquant)
are NOT touched in this PR. PR 2 will refactor them to FROM these
prebuilt bases. This PR is intentionally inert - landing it changes no
existing CI behavior. The base images don't exist on quay until
someone manually triggers the workflow.

Bootstrap after merge:
  gh workflow run base-images.yml --ref master
Wait ~30 min for all 9 variants to push, then merge PR 2 (the
consumer-side refactor that uses BUILDER_BASE_IMAGE build-arg to
FROM these tags).

Triggers afterwards:
  - Saturdays 05:00 UTC (cron) - picks up upstream security updates,
    runs ~24h before the backend.yml Sunday cron so bases are fresh.
  - workflow_dispatch - manual ad-hoc rebuild.
  - master push touching Dockerfile.base-grpc-builder or this workflow.

Why split into two PRs: the variant Dockerfiles in PR 2 will FROM the
prebuilt bases and have no from-source fallback. Their CI builds fail
if the bases don't exist on quay yet. Landing infrastructure first +
manual bootstrap + then consumer refactor avoids a broken-master window.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 18:44:42 +02:00
Ettore Di Giacinto
31aa0582a5 ci(ik-llama-cpp,turboquant): add BuildKit ccache mount to compile steps
Mirror the ccache mount added to Dockerfile.llama-cpp in 9228e5b4 for
the other two llama.cpp-derived backends. Same shape, distinct mount
ids so each backend's cache is independent:

  ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE}
  turboquant-ccache-${TARGETARCH}-${BUILD_TYPE}

ik_llama.cpp is a different upstream fork; no source overlap with
llama-cpp, separate cache makes sense.

turboquant is a llama.cpp fork that reuses backend/cpp/llama-cpp
source via a thin wrapper Makefile — most TUs would in principle hit
llama-cpp's ccache too. Keeping them separate for now to avoid one
fork's regressions poisoning the other; revisit sharing after we have
hit-rate numbers.

Same registry-export behavior as llama-cpp: the cache mount rides on
backend_build.yml's existing cache-to: type=registry,mode=max.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 16:21:49 +00:00
Ettore Di Giacinto
9228e5b412 ci(llama-cpp): add BuildKit ccache mount to the compile step
The big RUN at line 268 of Dockerfile.llama-cpp re-runs from scratch on
every LLAMA_VERSION bump (or any LocalAI source change due to
COPY . /LocalAI just before). For CUDA-13 specifically that compile
recently hit the GHA 6h hard limit and failed:

  https://github.com/mudler/LocalAI/actions/runs/25598418931/job/75148244557

Add a BuildKit cache mount on /root/.ccache and thread ccache through
CMake (CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER) so most translation units
hit cache when their preprocessed source is byte-identical to the
previous build.

The cache mount is exported to the registry as part of the existing
cache-to: type=registry,mode=max in backend_build.yml, so it persists
across runs. mount id is keyed on TARGETARCH + BUILD_TYPE so different
variants don't thrash the same cache slot; sharing=locked serializes
concurrent writes.

Cold-build effect (first run after enable, or on LLAMA_VERSION bump
that touches every TU): unchanged. Hot-build effect (subsequent runs
with the same source, or LLAMA_VERSION bumps that touch a handful of
files): ~5-15 min for the llama.cpp compile vs the previous 1-3h cold.
For CUDA-13 specifically this should bring rebuilds well under the 6h
GHA limit.

Does NOT help the *first* post-bump build — that's still cold. For
that, follow-up work would be: (a) trim CUDA_DOCKER_ARCH to modern
GPUs only, (b) audit which CMake variants the published images
actually need, (c) pre-built CUDA+gRPC base image.

ccache package is already installed in the builder stage (line 90).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 16:16:46 +00:00
LocalAI [bot]
a91e718473 chore: ⬆️ Update ggml-org/llama.cpp to 00d56b11c3477b99bc18562dc1d1834f0d961778 (#9733)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-09 12:05:11 +02:00
LocalAI [bot]
d1eef05852 chore: ⬆️ Update ikawrakow/ik_llama.cpp to ab0f22b819ac57b7e7484f69c00c10fc755d5c6c (#9734)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 11:18:59 +02:00
LocalAI [bot]
4542833cb4 chore: ⬆️ Update ggml-org/llama.cpp to 9f5f0e689c9e977e5f23a27e344aa36082f44738 (#9724)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 10:18:05 +02:00
Ettore Di Giacinto
fe7b27eb66 test(ci): trigger faster-whisper rebuild to observe per-arch+merge
The PR that introduced the per-arch + manifest-merge pilot (#9727)
only touched CI infrastructure files, so the path filter correctly
skipped backend builds on its merge commit. To observe the new
backend-merge-jobs flow assemble a real manifest list, this commit
touches faster-whisper's Makefile so its two new per-arch entries
schedule and the merge job runs.

The trailing comment is the smallest possible diff and is harmless
to the build.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 22:09:46 +00:00
LocalAI [bot]
14a3275329 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 98950267c67fd95937a54ebd6e3c66cf2679b710 (#9725)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 00:06:05 +02:00
LocalAI [bot]
2be07f61da feat(whisper): honor client cancellation via ggml abort_callback (#9710)
* refactor(transcription): propagate request ctx through ModelTranscription*

Replaces context.Background() with the HTTP request ctx so client
disconnects start cancelling the gRPC call. No backend-side abort wiring
yet — that comes in a later commit. Pure plumbing.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(cli): pass ctx to backend.ModelTranscription

Follow-up to e65d3e1f which threaded ctx through ModelTranscription
but missed the CLI caller. CLI commands have no request-scoped ctx,
so context.Background() is correct here.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(audio): propagate request ctx into TTS, sound-gen, audio-transform

Same ctx-plumbing pattern applied to the rest of the audio path. CLI
callers use context.Background() since there is no request scope; HTTP
callers use c.Request().Context().

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(backend): propagate request ctx into biometric, detection, rerank, diarization paths

Replaces remaining context.Background() sites in core/backend with the
caller's ctx. After this commit, every core/backend/*.go entry point
threads the request ctx end-to-end to the gRPC client.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(grpc): plumb ctx through AIModel.AudioTranscription{,Stream}

Adds context.Context as first parameter to the AIModel interface methods
that wrap whisper-style transcription. Server-side gRPC handler now
forwards the per-RPC ctx (server-streaming uses stream.Context()).
Whisper, Voxtral, vibevoice-cpp, and sherpa-onnx accept the parameter;
none uses it yet — the actual cancellation primitive lands in the next
commit so this is pure plumbing.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): add abort_callback hook in the C++ bridge

Installs a std::atomic<int> flag, wires it into
whisper_full_params.abort_callback, and exposes a set_abort(int) C
symbol so Go can flip the flag from a goroutine watching the request
context. transcribe() now distinguishes abort (return 2) from real
whisper_full failure (return 1).

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): register set_abort symbol in the purego loader

Adds the Go-side binding for the new C export so the next commit can
call CppSetAbort(1) from a watcher goroutine on ctx.Done().

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): honor ctx cancellation and return codes.Canceled

A watcher goroutine watches ctx.Done() during AudioTranscription and
calls CppSetAbort(1) on cancel. whisper_full sees abort_callback return
true at the next compute graph step, returns non-zero, and the bridge
returns 2 -> AudioTranscription maps that to codes.Canceled.

Adds an opt-in test (gated on WHISPER_MODEL_PATH / WHISPER_AUDIO_PATH)
that asserts cancellation latency under 5s and proves the abort flag
resets cleanly so the next transcription succeeds.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper): join the cancel watcher goroutine before returning

Follow-up to 85edf9d2. The previous commit used `defer close(done)` and
called the watcher "joined synchronously" — but close() only signals,
it does not block until the goroutine exits. That left a window where
a late CppSetAbort(1) from a cancelled call could land on the next
call, after its C-side g_abort reset but before whisper_full() began
polling the abort callback, corrupting the second transcription.

Switch to a sync.WaitGroup join so wg.Wait() blocks until the watcher
has actually returned from its select.
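
The difference between signalling and joining can be shown with a Python threading analogue. This is a sketch of the pattern only, not the Go code: Python has no select over channels, so the watcher polls; `AbortFlag`, `run_with_watcher`, and `slow_work` are hypothetical names.

```python
import threading
import time


class AbortFlag:
    """Stand-in for the C-side abort flag."""

    def __init__(self):
        self.value = 0


def run_with_watcher(cancel, abort, work):
    abort.value = 0  # reset at the start of every call
    done = threading.Event()

    def watcher():
        # Poll until the work is done or the caller cancels.
        while not done.is_set():
            if cancel.wait(timeout=0.01):
                abort.value = 1
                return

    t = threading.Thread(target=watcher)
    t.start()
    try:
        return work(abort)
    finally:
        done.set()  # signal only: the watcher may still be running here
        t.join()    # join: after this the watcher cannot touch the flag again


def slow_work(abort, timeout=2.0):
    """Pretend to transcribe, checking the abort flag between steps."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if abort.value:
            return "canceled"
        time.sleep(0.01)
    return "finished"
```

Without the `t.join()`, a late `abort.value = 1` from a cancelled call could land after the next call's reset, which is the window the fix closes.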

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper): short-circuit pre-cancelled ctx in AudioTranscription

If ctx is already Done() at entry, return codes.Canceled immediately
instead of running the full transcription. The C-side g_abort reset
happens at the start of transcribe() and would otherwise overwrite a
watcher-set abort flag from an already-cancelled ctx, producing a
spurious successful transcription on a request the client has already
abandoned.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(tests/distributed): update testLLM mock for new AudioTranscription signature

Phase B (93c48e19) added context.Context to AIModel.AudioTranscription
but missed the testLLM mock in tests/e2e/distributed. CI golangci-lint
caught it: *testLLM did not implement grpc.AIModel because the method
signature lacked the ctx parameter, which broke the distributed test
suite compilation and cascaded through every backend-build job that
runs `go build ./...`.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(whisper): port cancellation test to Ginkgo/Gomega

Project policy (.agents/coding-style.md, enforced by golangci-lint
forbidigo) is that all Go tests must use Ginkgo v2 + Gomega — no
stdlib testing patterns (t.Skip, t.Fatalf, etc.). Convert the
cancellation test to a Describe/It block with Skip(...) for env
gating and Expect/HaveOccurred for assertions.

Same coverage: cancel mid-flight returns codes.Canceled within 5s and
a follow-up transcription succeeds, proving the C-side g_abort flag
resets cleanly.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 01:44:47 +02:00
LocalAI [bot]
806130bbc0 chore: ⬆️ Update ggml-org/whisper.cpp to c81b2dabbc45484dee2ca6658cfe39c841df5c70 (#9712)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:32 +02:00
LocalAI [bot]
3b84582567 chore: ⬆️ Update ggml-org/llama.cpp to 05ff59cb57860cc992fc6dcede32c696efea711c (#9714)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:17 +02:00
LocalAI [bot]
907929ce60 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9a26522af234f8db079ae3735f35ab6c20fe2c66 (#9713)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:43:44 +02:00
Richard Palethorpe
c894d9c826 feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
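
The validation and precedence rules described above can be pictured with a minimal sketch. It is not the backend's code: `FakeServerArgs` is a hypothetical stand-in for sglang's `ServerArgs` with illustrative fields, `apply_engine_args` is an illustrative name, and the typed ModelOptions are modelled as a plain dict.

```python
import dataclasses
import difflib


@dataclasses.dataclass
class FakeServerArgs:
    """Hypothetical stand-in for sglang's ServerArgs; fields are illustrative."""
    model_path: str = ""
    mem_fraction_static: float = 0.85
    speculative_algorithm: str = ""


def apply_engine_args(typed: dict, engine_args: dict) -> dict:
    """Merge engine_args over the typed options, rejecting unknown keys."""
    valid = {f.name for f in dataclasses.fields(FakeServerArgs)}
    for key in engine_args:
        if key not in valid:
            # Suggest the closest known field name, if any is similar enough.
            close = difflib.get_close_matches(key, sorted(valid), n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            raise ValueError(f"unknown engine_args key {key!r}{hint}")
    merged = dict(typed)
    merged.update(engine_args)  # engine_args override the typed fields
    return merged
```

Failing at load time with a close-match hint turns a misspelt key in a model YAML into an immediate error instead of a silently ignored option.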

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over default-PyPI's cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (default PyPI now
  ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-07 17:27:29 +02:00
Ettore Di Giacinto
048daa0cdc fix(chatterbox): install chatterbox-tts with --no-deps and pin runtime deps
The previous omegaconf pin only addressed one symptom of a deeper problem:
chatterbox-tts upstream depends on `russian-text-stresser` (unpinned git URL),
which transitively pins `spacy==3.6.*` and other ancient packages. That cascade
forces pip to backtrack through Jinja2/MarkupSafe/omegaconf into Python-2-era
sdists that no longer build (e.g. ruamel.yaml<0.15, Jinja2 2.6 importing the
long-removed `setuptools.Feature`).

Install chatterbox-tts itself with --no-deps in install.sh and list its real
runtime deps explicitly in each requirements-*.txt, dropping the optional
russian-text-stresser. This unblocks the darwin (and other) builds without
playing whack-a-mole on each newly-discovered transitive pin.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-07 09:03:40 +00:00
Ettore Di Giacinto
7c77d3506a fix(chatterbox): pin omegaconf in every profile requirements file
The previous pin in requirements.txt was ineffective: installRequirements
runs a separate `pip install --requirement` per file, so resolution does
not carry over to the per-profile file where chatterbox-tts is declared.
With chatterbox-tts's unpinned `omegaconf` dep, pip backtracked through
1.x sdists into ruamel.yaml<0.15, whose Python-2-era setup.py fails on
Python 3.10+.

Pin omegaconf==2.3.0 next to chatterbox-tts in every profile file
(matches what upstream chatterbox uses). Drop the dead pin from
requirements.txt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-07 07:44:37 +00:00
dependabot[bot]
c96ce99742 chore(deps): bump openssl from 0.10.76 to 0.10.79 in /backend/rust/kokoros in the cargo group across 1 directory (#9694)
chore(deps): bump openssl

Bumps the cargo group with 1 update in the /backend/rust/kokoros directory: [openssl](https://github.com/rust-openssl/rust-openssl).


Updates `openssl` from 0.10.76 to 0.10.79
- [Release notes](https://github.com/rust-openssl/rust-openssl/releases)
- [Commits](https://github.com/rust-openssl/rust-openssl/compare/openssl-v0.10.76...openssl-v0.10.79)

---
updated-dependencies:
- dependency-name: openssl
  dependency-version: 0.10.79
  dependency-type: indirect
  dependency-group: cargo
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-07 08:30:18 +02:00
LocalAI [bot]
0b9344ef3d chore: ⬆️ Update leejet/stable-diffusion.cpp to 90e87bc846f17059771efb8aaa31e9ef0cab6f78 (#9701)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:41 +02:00
LocalAI [bot]
151d6c9cf0 chore: ⬆️ Update ggml-org/llama.cpp to 2496f9c14965c39589f53eea31bdb6d762b1d360 (#9698)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:27 +02:00
LocalAI [bot]
659939db9b chore: ⬆️ Update ikawrakow/ik_llama.cpp to b93721902b4662f9b973b1c412006081c958d085 (#9697)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:12 +02:00
Ettore Di Giacinto
b2be9729ef fix(chatterbox): pin omegaconf>=2.0 to prevent resolver backtracking
Without a lower-bound pin, pip's resolver backtracks through omegaconf 1.x
sdists when installing chatterbox-tts. Old 1.x setups depend on
ruamel.yaml<0.15, whose setup.py uses Python-2-era names (Str, Bytes) and
fails to build on Python 3.10+, breaking the darwin python backend build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-06 18:07:32 +00:00
LocalAI [bot]
4e154b59e5 fix(ci): unbreak rerankers (torch bump) and vllm-omni on aarch64 (#9688)
Two unrelated CI breakages bundled together since both are one-liners:

- rerankers: bump torch 2.4.1 -> 2.7.1 on cpu/cublas12. The unpinned
  transformers resolves to 5.x, whose moe.py registers a custom_op with
  string-typed `'torch.Tensor'` annotations that torch 2.4.1's
  infer_schema rejects, blocking the gRPC server from starting and
  failing all 5 backend tests with "Connection refused" on :50051.
  Matches the version used by the transformers backend.

- vllm-omni: strip fa3-fwd from the upstream requirements/cuda.txt
  before resolving on aarch64. fa3-fwd 0.0.3 ships only an
  x86_64 wheel and has no sdist, making the cuda profile unsatisfiable
  on Jetson/SBSA. fa3-fwd is a soft runtime dep — vllm-omni's
  attention backends fall back to FA2 then SDPA when it's missing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 17:07:24 +02:00
LocalAI [bot]
a8d7d37a3c fix: unbreak master CI (docs, kokoros, vibevoice-cpp ABI) (#9682)
* fix(docs): correct broken Hugo relrefs

The Hugo build has been failing on master since the relevant pages
landed:

- text-generation.md:720 referenced `/docs/features/distributed-mode`,
  but Hugo `relref` paths are relative to the content root, not the
  rendered URL. Drop the `/docs/` prefix so the lookup matches the
  existing `features/...` form used elsewhere in the file.
- audio-transform.md:144 referenced `tts.md`; the actual page is
  `text-to-audio.md`.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(kokoros): stub Diarize and AudioTransform Backend trait methods

The recent backend.proto additions (Diarize, AudioTransform,
AudioTransformStream) extended the gRPC Backend trait, breaking
kokoros-grpc compilation with E0046 because the Rust implementation
hadn't picked up the new methods. Add Unimplemented stubs matching the
existing pattern for non-applicable RPCs in this TTS-only backend.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(vibevoice-cpp): track upstream ABI + wire 1.5B voice cloning

Two recent commits in mudler/vibevoice.cpp reshaped the vv_capi_tts
signature without a corresponding bump on the LocalAI side:

  3bd759c "1.5b: unify into a single tts entry point" inserted a
          ref_audio_path parameter between voice_path and dst_wav_path.
  ad856bd "1.5b: multi-speaker dialog support" promoted that to a
          (const char* const* ref_audio_paths, int n_ref_audio_paths)
          pair for per-speaker conditioning.

Because purego resolves symbols by name and not by signature, the
build kept linking; at runtime the misaligned arguments turned the
TTS->ASR closed-loop test into a SIGSEGV inside cgo. Track HEAD
explicitly and bring the bridge in line with it:

  * Update the CppTTS purego binding to the 9-arg form. purego
    marshals []*byte as a **char by handing the C side the underlying
    array address; nil/empty maps to NULL, which matches the C
    contract for "no reference audio" on the realtime-0.5B path.
  * Add a `ref_audio` gallery option (comma-separated, repeatable)
    that the 1.5B path consumes for runtime voice cloning. Multiple
    entries are interpreted as one WAV per speaker (Speaker 0..n-1).
  * TTSRequest.Voice now routes by extension/shape: `.wav` or a
    comma-separated list goes to ref_audio_paths; anything else stays
    on voice_path (realtime-0.5B's pre-baked voice gguf).
  * Pin VIBEVOICE_CPP_VERSION to ad856bd and wire the Makefile into
    the existing bump_deps matrix so future upstream rolls land as
    reviewable PRs instead of a silent CI break.
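
The Voice routing rule from the list above could look roughly like the sketch below. The real bridge is Go; this is a Python illustration in which `route_voice` is a hypothetical name, and the whitespace trimming and case-insensitive `.wav` match are assumptions rather than documented behaviour.

```python
from typing import List, Tuple


def route_voice(voice: str) -> Tuple[str, List[str]]:
    """Return (voice_path, ref_audio_paths) for a TTS request's Voice value."""
    voice = voice.strip()
    if not voice:
        return "", []
    # A `.wav` file or a comma-separated list is treated as reference audio,
    # one entry per speaker.
    if "," in voice or voice.lower().endswith(".wav"):
        refs = [p.strip() for p in voice.split(",") if p.strip()]
        return "", refs
    # Anything else stays on voice_path (e.g. a pre-baked voice gguf).
    return voice, []
```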

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(vibevoice-cpp): use ModelOptions.AudioPath for 1.5B ref audio

Use the existing audio_path field from ModelOptions (already plumbed
through config_file's `audio_path:` YAML and consumed by other audio
backends like kokoros) instead of inventing a custom `ref_audio:`
Options[] string. Multi-speaker setups stay on a single comma-
separated value.

No behavior change beyond the gallery key name; per-call routing via
TTSRequest.Voice is unchanged.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 10:36:59 +02:00
LocalAI [bot]
7fab5e3d21 chore: ⬆️ Update ggml-org/whisper.cpp to 4bf733672b2871d4153158af4f621a6dd9104f4a (#9636)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-06 00:34:16 +02:00
Andreas Egli
af83518532 feat: support word-level timestamps for faster-whisper (#9621)
Signed-off-by: Andreas Egli <github@kharan.ch>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-06 00:32:52 +02:00
LocalAI [bot]
a315c321c1 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 69d8e4be47243e83b3d0d71e932bc7aa61c644dc (#9638)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-06 00:29:05 +02:00
Richard Palethorpe
16b2d4c807 fix(python-backend): make JIT subprocesses work on hosts of any size (#9679)
Two related runtime fixes for Python backends that JIT-compile CUDA
kernels at first model load (FlashInfer, PyTorch inductor, triton):

1. libbackend.sh: replace `source ${EDIR}/venv/bin/activate` with a
   minimal manual setup (_activateVenv: export VIRTUAL_ENV, prepend
   PATH, unset PYTHONHOME) computed from $EDIR at runtime. `uv venv`
   and `python -m venv` both bake the create-time absolute path into
   bin/activate (e.g. VIRTUAL_ENV='/vllm/venv' from the Docker build
   stage), so sourcing activate on a relocated venv — copied out of
   the build container and unpacked at an arbitrary backend dir —
   prepends a stale, non-existent path to $PATH. Pip-installed CLI
   tools (e.g. ninja, used by FlashInfer's NVFP4 GEMM JIT) are then
   never found and the load aborts with FileNotFoundError. Doing the
   env setup ourselves matches what `uv run` does internally and
   sidesteps the relocation problem entirely. Generic — every Python
   backend benefits.

2. vllm/run.sh: replace ninja's default -j$(nproc)+2 with an adaptive
   MAX_JOBS = min(nproc, (MemAvailable-4)/4). Each concurrent
   nvcc/cudafe++ peaks at multiple GiB; the default OOM-kills on
   memory-tight hosts (e.g. a 16 GiB desktop loading a 27B NVFP4
   model) but underutilises 100-core / 1 TB boxes. User-set MAX_JOBS
   still wins. Also pin NVCC_THREADS=2 unless overridden.
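
The MAX_JOBS rule in item 2 amounts to the arithmetic sketched below. The real logic lives in run.sh; this Python version assumes MemAvailable is expressed in whole GiB, uses integer division, and clamps the result to at least 1, none of which the description above states. Both function names are hypothetical.

```python
def adaptive_max_jobs(nproc: int, mem_available_gib: int) -> int:
    """MAX_JOBS = min(nproc, (MemAvailable - 4) / 4), clamped to at least 1."""
    by_memory = (mem_available_gib - 4) // 4  # assumed: integer division, GiB
    return max(1, min(nproc, by_memory))


def resolve_max_jobs(env: dict, nproc: int, mem_available_gib: int) -> int:
    """A user-set MAX_JOBS wins over the adaptive value."""
    if env.get("MAX_JOBS"):
        return int(env["MAX_JOBS"])
    return adaptive_max_jobs(nproc, mem_available_gib)
```

Under these assumptions a 16-core host with 16 GiB available gets 3 jobs, while a 100-core host with 1000 GiB available is limited by its core count.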

Refs: https://github.com/vllm-project/vllm/issues/20079

Assisted-by: Claude:claude-opus-4-7 [Edit] [Bash]
2026-05-06 00:28:01 +02:00
Richard Palethorpe
8e43842175 feat(vllm, distributed): tensor parallel distributed workers (#9612)
* feat(vllm): build vllm from source for Intel XPU

Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.

Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):

  - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
    cp312 wheel.
  - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
    the dpcpp/sycl compiler from the oneapi-basekit base image.
  - Hide requirements-intel-after.txt during installRequirements
    (it used to 'pip install vllm'); install vllm's deps from a
    fresh git clone of vllm via 'uv pip install -r
    requirements/xpu.txt', swap stock triton for
    triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
    --no-deps .'.
  - requirements-intel.txt trimmed to LocalAI's direct deps
    (accelerate / transformers / bitsandbytes); torch-xpu, vllm,
    vllm_xpu_kernels and the rest come from upstream's xpu.txt
    during the source build.
  - requirements.txt: add pillow + charset-normalizer + chardet --
    used by backend.py and missing on the Intel install profile.
  - run.sh: 'set -x' so backend startup is visible in container
    logs (the gRPC startup error path was previously opaque).

Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.

Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(vllm): add multi-node data-parallel follower worker

vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.

Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:

  - Optionally self-registers with the frontend as an agent-type node
    tagged `node.role=vllm-follower` so it's visible in the admin UI
    and operators can scope ordinary models away via inverse
    selectors.
  - Resolves the platform-specific vllm backend via the gallery's
    "vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
  - Runs vLLM as a child process so the heartbeat goroutine survives
    until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
    ZMQ sockets before we tear down.
  - Rejects --headless combined with --start-rank 0 (rank 0 is the head
    and must serve the API).

Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.

Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.

Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.

Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(vllm): CPU-only end-to-end test for multi-node DP

Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.

Two pre-existing bugs surfaced by the test:

1. extract-backend-% (Makefile) failed for every backend, because all
   backend images end with `FROM scratch` and `docker create` rejects
   an image with no CMD/ENTRYPOINT. Fixed by passing
   --entrypoint=/run.sh -- the container is never started, only
   docker-cp'd, so the path doesn't have to exist; we just need
   anything that satisfies the daemon's create-time validation.

2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
   follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
   absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
   longer resolves once the backend is relocated to BackendsPath.
   _makeVenvPortable's shebang rewriter only matches paths that
   already point at ${EDIR}, so the original shebang slips through
   unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
   as an argument -- Python ignores the script's shebang in that case.

The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image

torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:

  torch._inductor.exc.InductorError:
    InvalidCxxCompiler: No working C++ compiler found in
    torch._inductor.config.cpp.cxx: (None, 'g++')

Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).

`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.

The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot again, which gcc also passes through to the linker).

`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.

The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.

Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml

The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.

Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.

`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.

Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.

Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-06 00:22:50 +02:00
Arkadiusz Tymiński
503904d311 fix(faster-whisper): cast segment timestamps to int after multiplication (#9674)
`int(x) * 1e9` returns a float because `1e9` is a float literal, but
TranscriptSegment.start/end are integer protobuf fields. This caused
every transcription request to fail with:

  TypeError: 'float' object cannot be interpreted as an integer

Multiply first, then cast — `int(x * 1e9)` — to get an int as required.
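
A small sketch of the difference, with hypothetical function names standing in for the backend's timestamp conversion:

```python
def scale_timestamp_buggy(seconds: float):
    # int * float yields a float, because 1e9 is a float literal.
    return int(seconds) * 1e9


def scale_timestamp(seconds: float) -> int:
    # Multiply first, then cast, so the result is an int.
    return int(seconds * 1e9)
```

The old order also truncated the timestamp to whole seconds before scaling, so casting after the multiplication keeps the sub-second part as well.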
2026-05-05 23:46:39 +02:00
LocalAI [bot]
d5ce823b83 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 8b56d813a9ed04fa7b7fe2588fddd845cf64eccb (#9677)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 23:46:09 +02:00
LocalAI [bot]
c9141098b6 chore: ⬆️ Update ggml-org/llama.cpp to bbeb89d76c41bc250f16e4a6fefcc9b530d6e3f3 (#9676)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 23:45:54 +02:00
Ettore Di Giacinto
e86ade54a6 feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp (#9654)
* feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp

Closes #1648.

OpenAI-style multipart endpoint that returns "who spoke when". Single
endpoint instead of the issue's three-endpoint sketch (refactor /vad,
/vad/embedding, /diarization) — the typical client wants one call, and
embeddings can land later as a sibling without breaking this surface.

Response shape borrows from Pyannote/Deepgram: segments carry a
normalised SPEAKER_NN id (zero-padded, stable across the response) plus
the raw backend label, optional per-segment text when the backend bundles
ASR, and a speakers summary in verbose_json. response_format also accepts
rttm so consumers can pipe straight into pyannote.metrics / dscore.
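
The speaker-id normalisation can be sketched as follows. The real implementation is Go code in core/backend; this Python version is only an illustration. It assumes ids are zero-based and two digits wide, as the SPEAKER_NN pattern suggests, and `normalise_speakers` is a hypothetical name.

```python
from typing import Dict, List


def normalise_speakers(raw_labels: List[str]) -> List[str]:
    """Map raw backend labels to SPEAKER_NN ids in first-seen order."""
    ids: Dict[str, str] = {}
    out: List[str] = []
    for label in raw_labels:
        if label not in ids:
            # The first new label becomes SPEAKER_00, the next SPEAKER_01, ...
            ids[label] = f"SPEAKER_{len(ids):02d}"
        out.append(ids[label])
    return out
```

Because the mapping is built once per response, a given raw label always maps to the same id within that response.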

Backends:

* vibevoice-cpp — Diarize() reuses the existing vv_capi_asr pass.
  vibevoice's ASR prompt asks the model to emit
  [{Start,End,Speaker,Content}] natively, so diarization is a by-product
  of the same pass; include_text=true preserves the transcript per
  segment, otherwise we drop it.

* sherpa-onnx — wraps the upstream SherpaOnnxOfflineSpeakerDiarization
  C API (pyannote segmentation + speaker-embedding extractor + fast
  clustering). libsherpa-shim grew config builders, a SetClustering
  wrapper for per-call num_clusters/threshold overrides, and a
  segment_at accessor (purego can't read field arrays out of
  SherpaOnnxOfflineSpeakerDiarizationSegment[] directly).

Plumbing: new Diarize gRPC RPC + DiarizeRequest / DiarizeSegment /
DiarizeResponse messages, threaded through interface.go, base, server,
client, embed. Default Base impl returns unimplemented.

Capability surfaces all updated: FLAG_DIARIZATION usecase,
FeatureAudioDiarization permission (default-on), RouteFeatureRegistry
entries for /v1/audio/diarization and /audio/diarization, audio
instruction-def description widened, CAP_DIARIZATION JS symbol,
swagger regenerated, /api/instructions discovery map updated.

Tests:

* core/backend: speaker-label normalisation (first-seen → SPEAKER_NN,
  per-speaker totals, nil-safety, fallback to backend NumSpeakers when
  no segments).

* core/http/endpoints/openai: RTTM rendering (file-id basename, negative
  duration clamping, fallback id).

* tests/e2e: mock-backend grew a deterministic Diarize that emits
  raw labels "5","2","5" so the e2e suite verifies SPEAKER_NN
  remapping, verbose_json speakers summary + transcript pass-through
  (gated by include_text), RTTM bytes content-type, and rejection of
  unknown response_format. mock-diarize model config registered with
  known_usecases=[FLAG_DIARIZATION] to bypass the backend-name guard.
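
The RTTM behaviours covered by the endpoint tests (file-id basename, negative-duration clamping, fallback id) can be pictured with the sketch below. It emits the standard ten-field `SPEAKER` line, but it is not the endpoint's code: `render_rttm` is a hypothetical name, and the three-decimal formatting, the stripped file extension, and the `"audio"` fallback id are all assumptions.

```python
import os
from typing import List, Tuple


def render_rttm(filename: str,
                segments: List[Tuple[float, float, str]],
                fallback_id: str = "audio") -> str:
    """Render (start, end, speaker) segments as RTTM SPEAKER lines."""
    # Use the basename without extension as the file id; fall back if empty.
    file_id = os.path.splitext(os.path.basename(filename))[0] or fallback_id
    lines = []
    for start, end, speaker in segments:
        duration = max(0.0, end - start)  # clamp negative durations to zero
        lines.append(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
    return "\n".join(lines) + ("\n" if lines else "")
```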

Docs: new features/audio-diarization.md (request/response, RTTM example,
sherpa-onnx + vibevoice setup), cross-link from audio-to-text.md, entry
in whats-new.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(diarization): correct sherpa-onnx symbol name + lint cleanup

CI failures on #9654:

* sherpa-onnx-grpc-{tts,transcription} and sherpa-onnx-realtime panicked
  at backend startup with `undefined symbol: SherpaOnnxDestroyOfflineSpeakerDiarizationResult`.
  Upstream's actual symbol is SherpaOnnxOfflineSpeakerDiarizationDestroyResult
  (Destroy in the middle, not the prefix); the rest of the diarization
  surface follows the same naming pattern. The mismatched name made
  purego.RegisterLibFunc fail while resolving symbols right after dlopen
  and crashed the gRPC server before the BeforeAll could probe Health,
  taking down every sherpa-onnx test job, not just the
  diarization-related ones.

* golangci-lint flagged 5 errcheck violations on new defer cleanups
  (os.RemoveAll / Close / conn.Close); wrap each in a `defer func() { _ = X() }()`
  closure (matches the pattern other LocalAI files use for new code, since
  pre-existing bare defers are grandfathered in via new-from-merge-base).

* golangci-lint also flagged forbidigo violations: the new
  diarization_test.go files used testing.T-style `t.Errorf` / `t.Fatalf`,
  which are forbidden by the project's coding-style policy
  (.agents/coding-style.md). Convert both files to Ginkgo/Gomega
  Describe/It with Expect(...) — they get picked up by the existing
  TestBackend / TestOpenAI suites, no new suite plumbing needed.

* modernize linter: tightened the diarization segment loop to
  `for i := range int(numSegments)` (Go 1.22+ idiom).

Verified locally: golangci-lint with new-from-merge-base=origin/master
reports 0 issues across all touched packages, and the four mocked
diarization e2e specs in tests/e2e/mock_backend_test.go still pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(vibevoice-cpp): convert non-WAV input via ffmpeg + raise ASR token budget

Confirmed end-to-end against a real LocalAI instance with vibevoice-asr-q4_k
loaded and the multi-speaker MP3 sample at vibevoice.cpp/samples/2p_argument.mp3:
both /v1/audio/transcriptions and /v1/audio/diarization now succeed and
return correctly attributed speaker turns for the full clip.

Two latent issues (1 and 2 below) surfaced once the diarization endpoint
actually exercised the backend with a non-trivial input; item 3 is a
defensive addition:

1. vv_capi_asr only accepts WAV via load_wav_24k_mono. The previous code
   passed the uploaded path straight through, so anything that wasn't
   already a 24 kHz mono s16le WAV failed at the C side with rc=-8 and
   the very unhelpful "vv_capi_asr failed". prepareWavInput shells out
   to ffmpeg ("-ar 24000 -ac 1 -acodec pcm_s16le") in a per-call temp
   dir, matching the rate the model was trained on; both AudioTranscription
   and Diarize now route through it. This is the same shape sherpa-onnx
   uses (utils.AudioToWav), but vibevoice needs 24 kHz rather than 16 kHz
   so we don't reuse that helper.

2. The C ABI's max_new_tokens defaults to 256 when 0 is passed. That's
   fine for a five-second clip but not for anything past ~10 s: vibevoice
   stops mid-JSON, the parse fails, and the caller sees a hard error.
   Pass a much larger budget (16 384 tokens, about 9 minutes of speech at
   the model's ~30 tok/s rate); generation stops at EOS so this is a cap
   rather than a target.

3. As a belt-and-braces measure, mirror AudioTranscription's existing
   "fall back to a single segment if the model emits non-JSON text"
   pattern in Diarize, so partial / unusual model output never produces
   a 500. This kept the endpoint usable while diagnosing (1) and (2),
   and is the right behaviour to keep.
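
The fallback in item 3 can be pictured as below: try to parse the model output as a JSON array of turns, and if that fails (for example because generation stopped mid-JSON), return one segment carrying the raw text. This is a sketch under assumptions, not the backend's code: the `turn` type, the field types (Speaker as a string in particular), the `parseTurns` name and the `clipSeconds` parameter are all hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// turn mirrors the {Start,End,Speaker,Content} shape the ASR prompt asks for.
// Field types are assumptions made for this sketch.
type turn struct {
	Start   float64 `json:"Start"`
	End     float64 `json:"End"`
	Speaker string  `json:"Speaker"`
	Content string  `json:"Content"`
}

// parseTurns returns the parsed turns, or a single segment spanning the clip
// with the raw text when the output is not a valid, non-empty JSON array.
func parseTurns(raw string, clipSeconds float64) []turn {
	var turns []turn
	if err := json.Unmarshal([]byte(raw), &turns); err == nil && len(turns) > 0 {
		return turns
	}
	return []turn{{Start: 0, End: clipSeconds, Content: strings.TrimSpace(raw)}}
}

func main() {
	good := `[{"Start":0,"End":1.5,"Speaker":"0","Content":"hi"}]`
	truncated := `[{"Start":0,"End":1.5,"Spe`
	fmt.Printf("%+v\n", parseTurns(good, 10))
	fmt.Printf("%+v\n", parseTurns(truncated, 10))
}
```

The design point is that a malformed response degrades to a usable single-segment result instead of an error surfaced to the caller.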

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(vibevoice-cpp): pass valid WAVs through directly so ffmpeg is not required at runtime

Spotted by tests-e2e-backend (1.25.x): the previous fix forced every
incoming audio file through `ffmpeg -ar 24000 ...`, which meant the
backend container — which does not ship ffmpeg — failed even for the
existing happy path where the caller already uploads a WAV. The
container-side error was:

    rpc error: code = Unknown desc = vibevoice-cpp: ffmpeg convert to
    24k mono wav: exec: "ffmpeg": executable file not found in $PATH

Reading vibevoice.cpp's audio_io.cpp, `load_wav_24k_mono` uses drwav and
already accepts any PCM/IEEE-float WAV at any sample rate, downmixes
multi-channel input to mono, and resamples to 24 kHz internally. So the
only inputs that genuinely need an external converter are non-WAV
formats (MP3, OGG, FLAC, ...).

Detect WAVs by RIFF/WAVE magic at bytes 0..3 / 8..11 and pass them
straight through with a no-op cleanup; everything else still goes
through ffmpeg with the same 24 kHz mono s16le target. The result:

* Container builds without ffmpeg keep working for WAV uploads
  (the e2e-backends fixture is jfk.wav at 16 kHz mono s16le).
* MP3 and other non-WAV inputs still get the new ffmpeg conversion
  path so the diarization endpoint stays useful.
* If the caller uploads a non-WAV file but ffmpeg isn't on PATH, the
  surfaced error is still descriptive enough to act on.
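
The RIFF/WAVE check described above amounts to comparing bytes 0..3 and 8..11 of the file header. A minimal sketch, assuming hypothetical helper names `hasWAVMagic` and `isWAVFile` (the backend's actual function names are not given in this message):

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
)

// hasWAVMagic reports whether header starts with "RIFF" at bytes 0..3 and
// carries "WAVE" at bytes 8..11 (bytes 4..7 hold the RIFF chunk size).
func hasWAVMagic(header []byte) bool {
	return len(header) >= 12 &&
		bytes.Equal(header[0:4], []byte("RIFF")) &&
		bytes.Equal(header[8:12], []byte("WAVE"))
}

// isWAVFile reads the first 12 bytes of path and applies hasWAVMagic.
// Files shorter than 12 bytes are reported as non-WAV rather than as errors.
func isWAVFile(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer func() { _ = f.Close() }()

	header := make([]byte, 12)
	if _, err := io.ReadFull(f, header); err != nil {
		if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
			return false, nil
		}
		return false, err
	}
	return hasWAVMagic(header), nil
}

func main() {
	wav := []byte("RIFF\x24\x00\x00\x00WAVEfmt ")
	mp3 := []byte("ID3\x04\x00\x00\x00\x00\x00\x00\x00\x00")
	fmt.Println(hasWAVMagic(wav), hasWAVMagic(mp3))
}
```

A caller would pass the path through untouched when the check succeeds and only invoke the external converter otherwise.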

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(ci): make gcc-14 install in Dockerfile.golang best-effort for jammy bases

The LocalVQE PR (bb033b16) made `gcc-14 g++-14` an unconditional apt
install in backend/Dockerfile.golang and pointed update-alternatives at
them. That works on the default `BASE_IMAGE=ubuntu:24.04` (noble has
gcc-14 in main), but every Go backend that builds on
`nvcr.io/nvidia/l4t-jetpack:r36.4.0` — jammy under the hood — now fails
at the apt step:

    E: Unable to locate package gcc-14

This blocked unrelated jobs:
backend-jobs(*-nvidia-l4t-arm64-{stablediffusion-ggml, sam3-cpp, whisper,
acestep-cpp, qwen3-tts-cpp, vibevoice-cpp}). LocalVQE itself is only
matrix-built on ubuntu:24.04 (CPU + Vulkan), so it doesn't actually
need gcc-14 anywhere else.

Make the gcc-14 install conditional on the package being available in
the configured apt repos. On noble: identical behaviour to today (gcc-14
installed, update-alternatives points at it). On jammy: skip the
gcc-14 stanza entirely and let build-essential's default gcc take over,
which is what the other Go backends compile with anyway.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-05 15:10:13 +02:00
LocalAI [bot]
1634eece6b chore: ⬆️ Update ikawrakow/ik_llama.cpp to 45dfd80371785731bc2ed05a76252497a4e7a282 (#9644)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 15:09:40 +02:00
LocalAI [bot]
b88ddce0f3 chore: ⬆️ Update ggml-org/llama.cpp to eff06702b2a52e1020ea009ebd86cb9f5acabab5 (#9637)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 09:52:28 +02:00