Commit Graph

6319 Commits

Author SHA1 Message Date
dependabot[bot]
d75173dd2a chore(deps): bump actions/download-artifact from 4 to 8 (#9771)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 8.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v8)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:14 +02:00
dependabot[bot]
9be5310394 chore(deps): bump actions/upload-artifact from 4 to 7 (#9770)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:03 +02:00
dependabot[bot]
cdf50fd723 chore(deps): bump node from 25-slim to 26-slim (#9769)
Bumps node from 25-slim to 26-slim.

---
updated-dependencies:
- dependency-name: node
  dependency-version: 26-slim
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:19:51 +02:00
LocalAI [bot]
bc3fb16105 feat(ollama): report model capabilities + details on /api/tags and /api/show (#9766)
Ollama-compatible clients (Open WebUI, Enchanted, ollama-grid-search,
etc.) rely on the `capabilities` list and `details.{parameter_size,
quantization_level,families}` fields returned by /api/tags and
/api/show to decide which models are eligible for a given task --
for example to filter the "embedding model" picker. Upstream Ollama
returns these; LocalAI's compat layer was leaving them empty, so
embedding models were silently rejected by clients that only allow
chat models for chat and only allow embedding models for embeddings.

This wires up the existing config signals already present in
ModelConfig:

- modelCapabilities() derives the Ollama capability strings from the
  config: "embedding" (FLAG_EMBEDDINGS), "completion" (FLAG_CHAT /
  FLAG_COMPLETION), "vision" (explicit KnownUsecases bit or MMProj /
  multimodal template / backend media marker), "tools" (auto-detected
  ToolFormatMarkers, JSON/Response regex, XML format, grammar
  triggers), "thinking" (ReasoningConfig with reasoning not disabled)
  and "insert" (presence of a completion template).
- modelDetailsFromModelConfig() now fills families, parameter_size
  and quantization_level. The latter two are parsed from the GGUF
  filename via regex -- conservative tokens only (Q*/IQ*/F16/F32/BF16
  and \d+(\.\d+)?[BM] surrounded by separators) so we don't accidentally
  match "Qwen3" as "3B".
- modelInfoFromModelConfig() exposes general.architecture and
  general.context_length in the new ShowResponse.model_info map.

Note: HasUsecases(FLAG_VISION) cannot be used directly -- GuessUsecases
has no FLAG_VISION case and returns true at the end for any chat model.
hasVisionSupport() instead reads KnownUsecases explicitly plus MMProj /
template / media-marker signals.

Tests are written first (TDD) using Ginkgo/Gomega -- DescribeTable for
the capability mapping (embedding-only, chat, vision, thinking, tools
via markers, tools via JSON regex, no-capability rerank) plus
integration tests against ShowModelEndpoint that round-trip JSON
through a real ModelConfigLoader populated from a temp YAML file.

Fixes #9760.


Assisted-by: Claude Code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
v4.2.1
2026-05-12 00:16:19 +02:00
LocalAI [bot]
78722caedc chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de (#9764)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-12 00:02:22 +02:00
LocalAI [bot]
621c612b2d ci(bump-deps): register ds4 + move version pin into the Makefile (#9761)
* ci(bump-deps): register ds4 + move version pin into the Makefile

The initial ds4 PR (#9758) put the upstream commit pin in
backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at
.github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION
was invisible to it - other backends (llama-cpp, ik-llama-cpp,
turboquant, voxtral, etc.) all pin in their Makefile.

This change:

- Moves DS4_VERSION?= and DS4_REPO?= to the top of
  backend/cpp/ds4/Makefile.
- Inlines the git init/fetch/checkout recipe into the 'ds4:' target
  (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts
  as the target so make only re-clones when missing.
- Deletes the now-redundant prepare.sh.
- Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to
  the .github/workflows/bump_deps.yaml matrix so the daily bot opens
  PRs against this pin.
- Updates .agents/ds4-backend.md to point at the Makefile.

Verified:
  $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile
  DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
  $ make -C backend/cpp/ds4 ds4   # clones into ds4/ at the pin
  $ make -C backend/cpp/ds4 ds4   # no-op on second invocation
  make: 'ds4' is up to date.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: route backend/cpp/ds4/ changes through changed-backends.js

scripts/changed-backends.js:inferBackendPath has an explicit branch per
cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a
matching branch the function returns null, the backend never lands in
the path map, and PR change-detection cannot map "backend/cpp/ds4/X
changed" -> "rebuild ds4 image".

This is why PR #9761 produced zero ds4 jobs even though it directly
edits backend/cpp/ds4/Makefile.

Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed
before the llama-cpp branch (since both share the .cpp ancestry but
ds4 is more specific - same ordering rule documented in
.agents/adding-backends.md).

Verified with a local Node simulation of the script against this PR's
diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a
'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend
in the rebuild set.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(adding-backends): harden the two gotchas that bit ds4

Both omissions are silent at the time you ADD a backend - the failure
mode only appears later (the bump bot stays silent forever, or the path
filter shows up on the next PR that touches your backend with zero CI
jobs and looks broken for unrelated reasons). Expanding the
`scripts/changed-backends.js` paragraph from a one-liner to a fully
worked example, and adding a new sibling paragraph for the
`bump_deps.yaml` + Makefile-pin contract.

Both call out the specific mistakes from the ds4 timeline (#9758#9761) so future contributors can pattern-match on the cause.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:46:02 +02:00
LocalAI [bot]
e3f9de1026 docs: ⬆️ update docs version mudler/LocalAI (#9762)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-11 22:37:06 +02:00
LocalAI [bot]
d892e4af80 feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)
* test(e2e-backends): allow BACKEND_BINARY for native-built backends

Adds an escape hatch for hardware-gated backends (e.g. ds4) where the
model is too large for Docker build context. When BACKEND_BINARY points
at a run.sh produced by 'make -C backend/cpp/<name> package', the suite
skips docker image extraction and drives the binary directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e-backends): validate BACKEND_BINARY basename + log actual source

Two follow-ups from the cbcf5148 code review:

- BACKEND_BINARY now requires a path whose basename is `run.sh`. Without
  this check, `filepath.Dir(binary)` silently discarded the filename, so
  pointing the env var at an arbitrary binary failed later with a
  confusing assertion that named a path the user never typed.
- The "Testing image=..." debug line printed an empty string when the
  binary path was used, hiding the actual source in CI logs. The line
  now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect
  as `src=...`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): scaffold ds4 backend dir

Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the
implementation arrive in follow-up commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add backend Makefile

Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux
when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then
invokes CMake on our wrapper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add CMakeLists for grpc-server

Generates protoc stubs from backend.proto, links grpc-server.cpp +
dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built
ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): grpc-server skeleton + module stubs

The minimum that links: Backend service with Health + Free; other RPCs
default to UNIMPLEMENTED. Stub headers/sources for dsml_parser,
dsml_renderer, and kv_cache are in place so CMake links cleanly even
before those modules ship.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement LoadModel

Opens engine + creates session sized to ContextSize (default 32768).
Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else
CUDA. MTP/speculative options are accepted via ModelOptions.Options[]
(mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into
g_kv_cache_dir for the cache module (Task 19 wires it in).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement TokenizeString

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Predict (plain text)

Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement PredictStream (plain text)

ChatDelta + reasoning/tool_calls split arrives in Task 14.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Status RPC

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add DSML streaming parser

Classifies raw model-emitted token text into CONTENT / REASONING /
TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the
literal DSML strings rendered by ds4_server.c's prompt template
(<|DSML|tool_calls>, <|DSML|invoke name=...>, <think>, etc.) - these are
plain text the model emits, not special tokens.

Partial markers split across token chunks are buffered until a full marker
or a definitively-not-a-marker '<' is observed. RandomToolId() generates
the API-side tool call id (call_xxx) that exact-replay would key on.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes

C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape
producing byte 0xCD, eating the 'D'. The markers were never actually matching
the DSML text the model emits. Split each escape with adjacent string literal
concatenation so the byte sequence is exactly EF BD 9C 44 (|D) at runtime.

Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively
expose std::strlen / std::snprintf via <string>).

The local plan file (uncommitted) was also updated with the same fixes so
Task 16's dsml_renderer.cpp does not re-introduce the bug.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta)

Non-streaming Predict now emits one ChatDelta carrying content,
reasoning_content, and tool_calls[] parsed from the model's DSML output.
Reply.message still carries the raw model bytes for backends that prefer
the regex fallback path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into PredictStream

Per-token ChatDelta writes: content/reasoning_content go incrementally,
tool_calls emit TOOL_START as one delta (id + name) followed by
TOOL_ARGS deltas with incremental JSON. The Go-side aggregator
(pkg/functions/chat_deltas.go) reassembles them.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): chat template + reasoning_effort mapping

UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append /
assistant_prefix. PredictOptions.Metadata['enable_thinking'] and
['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default;
'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE).

Tool-call rendering for assistant turns with tool_calls JSON arrives in
the next commit (dsml_renderer).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML

Closes the round-trip: when an OpenAI client sends a multi-turn chat
where prior turns contain tool_calls or role=tool messages, build_prompt
serializes them back to the DSML shape the model was trained on. Mirrors
ds4_server.c's prompt renderer; uses nlohmann::json for parsing the
OpenAI tool_calls payload.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): disk KV cache module

Dir-based cache keyed by SHA1(rendered prompt prefix). File format:
'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes
+ ds4_session_save_payload output. NOT bit-compatible with ds4-server's
KVC files - that interop is a follow-up plan. LoadLongestPrefix walks
the dir picking the longest stored prefix that prefixes the incoming
prompt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream

LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to
g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for
the request, tries LoadLongestPrefix to recover state, then Saves the
new state after generation. ds4_session_sync handles the live-cache
fast path internally, so the disk cache only matters for cold-starts
and cross-session reuse.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add package.sh

Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into
package/lib so the FROM scratch image boots without a host libc.
Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp

ds4.h defines 'typedef enum {...} ds4_backend' which collides with our
C++ 'namespace ds4_backend' anywhere a TU includes both. kv_cache.h
includes ds4.h directly and surfaces the conflict immediately; other
TUs would hit it once gRPC dev headers are available.

Renames the C++ namespace to ds4cpp across all wrapper files and the
plan, leaving the upstream ds4 typedef untouched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): add Dockerfile.ds4

Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu)
-> FROM scratch with packaged grpc-server + bundled runtime libs.
nlohmann-json3-dev is required for dsml_renderer's JSON handling.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile

BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4
in docker-build-backends + .NOTPARALLEL guards. Also adds the
backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh
(landed in Task 24).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch)

Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a
multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13.
Darwin Metal is handled outside this matrix by backend_build_darwin.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add ds4 meta + image entries

cpu + cuda13 x latest + master. Darwin Metal builds publish under
ds4-darwin via the existing llama-cpp-darwin OCI pipeline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(scripts/build): add ds4-darwin.sh

Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
make grpc-server -> otool -L for dylib bundling -> OCI tar that
'local-ai backends install' consumes via the backends/ds4-darwin
Makefile target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(darwin): build ds4-darwin in backend_build_darwin

Adds a 'Build ds4 backend (Darwin Metal)' step that runs the
backends/ds4-darwin Makefile target on the macOS runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(import): auto-detect ds4 weights via DS4Importer

Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf
repo URI and the DeepSeek-V4-Flash-*.gguf filename pattern. Registered before
LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling
through to llama-cpp.

Also lists ds4 in /backends/known so the /import-model UI surfaces it as a
manual choice for users who want to force the backend on a non-canonical URI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add deepseek-v4-flash-q2 (ds4 backend)

One-click install of the q2 weights with backend: ds4.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(.agents): add ds4-backend.md

Documents the backend shape, DSML state machine, thinking-mode mapping,
disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY
hardware-validation path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps

The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to
2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported
into the environment so it can pick the right cuda-keyring / cudss / nvpl
debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not
re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern.

Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13'
failed at:
  /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add Metal image entries for ds4

Adds metal-ds4 + metal-ds4-development image entries pointing at
quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4
(built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the
'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and
ds4-development variant.

Closes a gap from the initial Task 23 landing - the Darwin Metal build
script and CI workflow step were already wired (Tasks 24-25), but the
gallery had no image entry for users to install the Metal variant.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries

The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04'
which clashes with install-base-deps.sh's cuda-keyring step:

  E: Conflicting values set for option Signed-By regarding source
     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain
'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA
from scratch via its own keyring setup. Adopting that here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): drop install-base-deps.sh dependency

The .docker/install-base-deps.sh pipeline is built around the llama-cpp
needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at
/opt/grpc. For ds4 we don't need any of that:
- CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda
  ready to go; install-base-deps's keyring step then conflicts with
  the pre-installed Signed-By.
- gRPC: ds4's grpc-server.cpp only links against grpc++; system
  libgrpc++-dev (apt) is sufficient, no source build needed.

Replaced the install-base-deps invocation in Dockerfile.ds4 with a
direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries
back to nvidia/cuda base + skip-drivers=true so install-base-deps would
no-op even if some downstream tooling calls it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus

Two compile bugs caught by the docker build:

1. proto::Message uses snake_case accessors. The build_prompt loop called
   m.toolcalls() / m.toolcallid() - the protoc-generated names are
   m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the
   wrapper.

2. The Status RPC method shadowed the 'using grpc::Status' alias, so any
   later method declaration using Status as a return type failed to parse
   ('Status does not name a type' starting at LoadModel). Solution: alias
   grpc::Status as GStatus instead, with no 'using' clause that would
   conflict. All RPC method declarations and return-statement constructions
   now use GStatus.

Pre-existing code reviewer flagged the Status-shadow concern as 'minor'
in the original Task 10 commit; it turned out to be a real compile blocker
under libstdc++ 13 once the surrounding methods were filled in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush

When the model emitted a parameter value that arrived in the same buffer
as the surrounding tool_call markers (e.g. the buffered tail after a
literal '</think>' opened the model output), the parser deferred all
buffered bytes to Flush() because looks_like_prefix() always returns
true while buf starts with '<'. Flush() then drained the buffer as
plain CONTENT/REASONING regardless of parser state, so the bytes
between the parameter open and close markers were classified as
CONTENT instead of TOOL_ARGS.

Symptom: the model emitted

  <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter>

and the assembled tool_call arguments came out as {"location":""} -
the opener and closer were emitted into the args stream but the
"Paris, France" content went to the assistant message instead.

Fix:

1. Flush() now uses the same state-aware emit logic as DrainPlain:
   PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string),
   THINK bytes become REASONING, TEXT bytes become CONTENT, and
   INVOKE / TOOL_CALLS structural whitespace is discarded.

2. looks_like_prefix() restricts its leading-'<' fallback to buffers
   that have not yet seen a '>'. Without that change, char-by-char
   feeds would discard the '<' of '<|DSML|invoke name="..."' once
   the marker prefix length was reached but the closing quote/'>'
   were still in flight.

Verified with a standalone harness that runs the failing input three
ways (single Feed, split-after-'>', and char-by-char) and aggregates
TOOL_ARGS for tool index 0: all three now produce
{"location":"Paris, France"}.

Assisted-by: Claude:opus-4.7 [Read,Edit,Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence

ds4_engine_generate_argmax() is a self-contained helper that doesn't take or
update a ds4_session - it manages its own internal state. Our Predict and
PredictStream methods created g_session via ds4_session_create() but then
called ds4_engine_generate_argmax(), so g_session's KV state never advanced.
ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save
correctly rejected with 'session has no valid checkpoint to save'.

Switch both RPCs to the proper session API:
  ds4_session_sync(g_session, &prompt, ...)
  loop:
    int token = ds4_session_argmax(g_session)
    if token == eos: break
    emit(token)
    ds4_session_eval(g_session, token, ...)

After the loop the session has a real checkpoint and ds4_session_save_payload
writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three
.kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets
kv_cache_dir, and the e2e tool-call assertion still passes.

Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save
path + payload_bytes + result) so future failures are visible instead of
silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream
when the cache is enabled - and skipped entirely when the option is unset.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded

Wires MTP (Multi-Token Prediction) speculative decoding into the manual
generation loop in both Predict and PredictStream. When the upstream MTP
weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal,
ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to
ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per
verifier step. When MTP is not loaded (no option, CPU backend, or weights
absent), we fall through to the simple ds4_session_argmax + ds4_session_eval
path with no behavior change.

Validated on a DGX Spark GB10 with the optional MTP GGUF
(DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs
'ds4: MTP support model loaded ... (draft=2)' on stderr.

Caveat per upstream README: 'currently provides at most a slight speedup,
not a meaningful generation-speed win'. Wired now mainly to track the
upstream API; bigger speedups arrive when ds4 improves the speculative path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override

Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI
gRPC side. The generation loop now consults compute_sample_params() per
token to pick the effective (temperature, top_k, top_p, min_p), based on:

  1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp
  2. Thinking-mode override: when enable_thinking != false, force T=1.0,
     top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and
     the trailing content)
  3. DSML structural override: when DsmlParser::IsInDsmlStructural()
     returns true (we are between tool-call markers but NOT in a param
     value payload), force T=0.0 so protocol bytes parse cleanly

When the effective temperature is 0, we keep using ds4_session_argmax +
MTP speculative path (matches ds4-server's gate that only enables MTP for
greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with
a per-thread RNG seeded from system_clock and fall back to single-token
ds4_session_eval.

New public method on DsmlParser: IsInDsmlStructural() encodes which states
need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user
sampling); TEXT and THINK are excluded (no tool-call context to protect).

Verified on the DGX Spark GB10: the e2e suite still passes with all 5
specs including tools, and the Predict output now varies between runs
(creative sampling active) while the tool-call args remain a clean
'{"location":"Paris, France"}' because the parser-state check forces
greedy on the structural bytes.

UX note: thinking mode is ON by default (matching ds4-server). Users who
want deterministic output should set Metadata.enable_thinking = false.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add sha256 to deepseek-v4-flash-q2 entry

Per HF LFS metadata for antirez/deepseek-v4-gguf:
  size: 86720111200 bytes (~80.76 GiB)
  sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c

LocalAI's downloader verifies sha256 when present, so users who install
deepseek-v4-flash-q2 from the gallery get integrity-checked weights and
the partial-download issue (an 81 GB file is easy to truncate) becomes
recoverable instead of silently producing a broken backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:15:47 +02:00
dependabot[bot]
5d0f732b16 chore(deps): bump the go_modules group across 1 directory with 2 updates (#9759)
Bumps the go_modules group with 2 updates in the / directory: [github.com/gofiber/utils](https://github.com/gofiber/utils) and [github.com/go-git/go-git/v5](https://github.com/go-git/go-git).


Updates `github.com/gofiber/utils` from 1.1.0 to 1.2.0
- [Release notes](https://github.com/gofiber/utils/releases)
- [Commits](https://github.com/gofiber/utils/compare/v1.1.0...v1.2.0)

Updates `github.com/go-git/go-git/v5` from 5.18.0 to 5.19.0
- [Release notes](https://github.com/go-git/go-git/releases)
- [Changelog](https://github.com/go-git/go-git/blob/main/HISTORY.md)
- [Commits](https://github.com/go-git/go-git/compare/v5.18.0...v5.19.0)

---
updated-dependencies:
- dependency-name: github.com/gofiber/utils
  dependency-version: 1.2.0
  dependency-type: indirect
  dependency-group: go_modules
- dependency-name: github.com/go-git/go-git/v5
  dependency-version: 5.19.0
  dependency-type: indirect
  dependency-group: go_modules
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-11 18:37:00 +02:00
Ettore Di Giacinto
ea00199554 ci: tag every backend digest, including singletons
backend_build.yml pushes by canonical digest only (push-by-digest=true,
no tags applied at build time). User-facing tagging happens in
backend_merge.yml's `imagetools create` step. Before this commit,
scripts/changed-backends.js emitted a merge entry only for tag-suffixes
with 2+ legs, so every single-arch backend (CUDA/ROCm/Intel Python
images, vLLM, sglang, transformers, diffusers, ...) pushed its digest
untagged and stayed that way until quay's GC reaped it. Symptom: tag
releases shipped multi-arch backends tagged correctly, but no
v<X>-gpu-nvidia-cuda-12-vllm (or any singleton variant) ever appeared
in the registry.

Changes:

- scripts/changed-backends.js drops the `group.length < 2` skip and
  emits two merge matrices, one per arch class, so each downstream
  merge job can `needs:` only its corresponding build matrix.
- backend.yml splits backend-merge-jobs into multiarch and singlearch
  variants. The split preserves PR #9746's fix: slow singlearch CUDA
  builds (~6h) must not gate multiarch merges, or quay's GC reaps the
  multiarch per-arch digests before they're tagged.
- backend_pr.yml mirrors the split.
- backend_build.yml renames the digest artifact from
  `digests<suffix>-<platform-tag>` to
  `digests<suffix>--<platform-tag-or-"single">`. The `--` separator
  prevents the merge-side glob from over-matching sibling backends
  whose tag-suffix is a prefix of ours (e.g. -cpu-vllm vs
  -cpu-vllm-omni, -cpu-mlx vs -cpu-mlx-audio); the `single` placeholder
  keeps the name well-formed when platform-tag is empty.
- backend_merge.yml updates the download pattern to match.

Verified locally: a tag-push event now expands to 36 multiarch merge
entries (= 72 builds / 2 legs) and 199 singlearch merge entries (one
per singleton, including -gpu-nvidia-cuda-12-vllm at index 24).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 13:22:00 +00:00
LocalAI [bot]
b9e81dbfd4 chore: ⬆️ Update ggml-org/llama.cpp to 389ff61d77b5c71cec0cf92fe4e5d01ace80b797 (#9752)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
v4.2.0
2026-05-11 08:14:07 +02:00
Ettore Di Giacinto
059c493641 ci(darwin): brew reinstall ccache to handle transitive dep drift
Symptom (PR #9752, run 25638825961, job 75256261163):

  dyld[11144]: Library not loaded: /opt/homebrew/opt/fmt/lib/libfmt.12.dylib
    Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache
  Abort trap: 6

Previous fix (commit 3f6e4934) added blake3, hiredis, xxhash, zstd as
explicit installs + cache paths because ccache's runtime dep closure
wasn't in the brew cache. But ccache 4.13 also depends on fmt — which
I missed. This is going to keep happening as upstream ccache adds or
shuffles deps over time.

Durable fix: `brew reinstall ccache` after the install step forces
brew to re-resolve and install ccache's full transitive dep closure
every run, immune to future formula changes. The brew downloads cache
makes the reinstall cheap (~5s on a cache hit).

Also adds fmt to the explicit install/link/Cellar-cache lists for the
fresh-runner path. The reinstall covers the cache-hit path; the
explicit install covers the brand-new-runner path where neither the
downloads cache nor the Cellar cache has been populated yet.

Caught by PR #9752's CI; would also have caught any future
LLAMA_VERSION bump triggering the Darwin matrix.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 21:17:30 +00:00
LocalAI [bot]
19d59102d5 feat(whisper-cpp): implement streaming transcription (#9751)
* test(whisper): wire e2e streaming transcription target

Adds test-extra-backend-whisper-transcription, mirroring the existing
llama-cpp / sherpa-onnx / vibevoice-cpp targets. The generic
AudioTranscriptionStream spec at tests/e2e-backends/backend_test.go:644
fails today because backend/go/whisper has no streaming impl - this
target is the failing TDD gate that the next phase makes pass.

Confirmed RED locally: 3 Passed (health, load, offline transcription),
1 Failed (streaming spec hits its 300s context deadline because the
base implementation returns 'unimplemented' but doesn't close the
result channel, leaving the gRPC stream open until the client times
out).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): expose new_segment_callback to the Go side

Adds set_new_segment_callback() and a C-side trampoline that whisper.cpp
invokes once per new text segment during whisper_full(). The trampoline
dispatches (idx_first, n_new, user_data) to a Go function pointer
registered via purego.NewCallback - text and timings are pulled by Go
through the existing get_segment_text/get_segment_t0/get_segment_t1
getters.

Wires the hook only when streaming is actually requested, to avoid a
per-segment function-pointer dispatch on the offline path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): implement AudioTranscriptionStream

Wires whisper.cpp's new_segment_callback through purego back to Go so
the streaming transcription RPC produces real, time-correlated deltas
while whisper_full() is still decoding. Each segment becomes one
TranscriptStreamResponse{Delta}; whisper_full's return is the
TranscriptStreamResponse{FinalResult} carrying the full segment list,
language, and duration.

Per-call state is tracked in a sync.Map keyed by an atomic counter; the
Go callback registered via purego.NewCallback is a singleton, dispatched
through user_data. SingleThread today means only one entry is ever live,
but the map shape matches the sherpa-onnx TTS callback pattern.

The streaming path's final.Text is the literal concat of every emitted
delta (a strings.Builder accumulated by onNewSegment) so the e2e
invariant `final.Text == concat(deltas)` holds exactly. The first delta
has no leading space; subsequent deltas are space-prefixed. The offline
AudioTranscription path is unchanged.

Closes the gap with sherpa-onnx, vibevoice-cpp, llama-cpp, and tinygrad,
which already implement AudioTranscriptionStream.

Verified GREEN locally: make test-extra-backend-whisper-transcription
passes 4/4 specs (3 Passed initially under RED, +1 streaming spec now).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(whisper-cpp): assert progressive multi-segment streaming

Drives AudioTranscriptionStream against a real long-audio fixture and
asserts len(deltas) >= 2. The generic e2e spec at
tests/e2e-backends/backend_test.go:644 only checks len(deltas) >= 1
which is satisfied by both real and faked streaming - this spec is the
guardrail that a future "fake" impl can't sneak past.

Skipped by default (env-gated, like the cancellation spec); set
WHISPER_LIBRARY, WHISPER_MODEL_PATH, and WHISPER_AUDIO_PATH to a 30+
second clip to run.

Verified locally with a 55s 5x-JFK concat against ggml-base.en.bin:
1 Passed in 7.3s, deltas >= 2, finalSegmentCount >= 2,
concat(deltas) == final.Text.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(whisper-cpp): add transcription gRPC e2e job

Mirrors tests-sherpa-onnx-grpc-transcription /
tests-llama-cpp-grpc-transcription. Runs make
test-extra-backend-whisper-transcription whenever the whisper backend
or the run-all switch fires, so a pin-bump or refactor that breaks
streaming transcription gets caught before merge.

The whisper output on detect-changes is already emitted by
scripts/changed-backends.js (it iterates allBackendPaths); this PR
just exposes it as a workflow output and consumes it.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper-cpp): silence errcheck on AudioTranscriptionStream defers

golangci-lint runs with new-from-merge-base=origin/master, so the
identical defer patterns in the existing offline AudioTranscription
path are grandfathered while the new ones in AudioTranscriptionStream
trip errcheck. Wrap both defers in `func() { _ = ... }()` to match what
errcheck wants without altering behavior. The errors from os.RemoveAll
and *os.File.Close are not actionable inside a defer here (we're
already returning), matching the offline path's contract.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 23:11:46 +02:00
LocalAI [bot]
4715a68660 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.20.2 (#9750)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-10 21:33:07 +02:00
LocalAI [bot]
28f33be48f chore: ⬆️ Update ggml-org/whisper.cpp to c33c5618b72bb345df029b730b36bc0e369845a3 (#9749)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:32:47 +02:00
LocalAI [bot]
a435f7cc69 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 23127139cb6fa314899c3b5f4935b88b3374c56c (#9748)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-10 21:32:28 +02:00
LocalAI [bot]
f6c9c20911 chore: ⬆️ Update ggml-org/llama.cpp to 2b2babd1243c67ca811c0a5852cedf92b1a20024 (#9747)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:17:38 +02:00
Ettore Di Giacinto
3f6e493439 ci(darwin): install ccache's runtime dylib deps (blake3, hiredis, xxhash, zstd)
Symptom (run 25634195866, job 75244019809): the Configure ccache step
on the Darwin llama-cpp build aborted with:

  dyld[5647]: Library not loaded: /opt/homebrew/opt/blake3/lib/libblake3.0.dylib
    Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache
  Abort trap: 6

The previous Darwin fix (acc5588d) addressed missing /opt/homebrew/bin
symlinks after a brew cache restore by force-linking. This is a
different layer: ccache's Cellar dir IS restored from cache and IS
linked, but ccache 4.13 dynamically links against blake3 / hiredis /
xxhash / zstd at runtime, and those dependencies are NOT in the
restored Cellar paths. brew install ccache sees the ccache Cellar
present and skips the install — including skipping installation of
those transitive deps.

Two-part fix:
  - Add /opt/homebrew/Cellar/{blake3,hiredis,xxhash,zstd} to the brew
    cache restore/save paths so future cache-hit runs restore them.
  - Explicitly install + link them in the Dependencies step so even
    a fresh runner (cache miss on a new key) gets them, and brew has
    them on hand for ccache to dlopen.

Caught by run 25634195866. Pre-existing condition on Darwin runners;
surfaced because Darwin builds run more often after the llama-cpp-
darwin consolidation in #9731.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 17:09:01 +00:00
LocalAI [bot]
35f6db8c76 ci: split backend-jobs into single-arch and multi-arch matrices (#9746)
Symptom (run 25612992409): backend-merge-jobs failed with
"quay.io/go-skynet/local-ai-backends@sha256:fdbd93ca...: not found"
even though the per-arch build for -cpu-llama-cpp pushed that exact
digest 14h31m earlier.

Root cause: backend-merge-jobs was gated on the WHOLE backend-jobs
matrix (`needs: backend-jobs`). The multi-arch -cpu-llama-cpp legs
finished within 30 min, but a single-arch CUDA-12-llama-cpp slot in
the same matrix queued for ~8h (max-parallel: 8 throttle) and then
took ~6h to build cold. By the time it freed the merge to run, quay's
GC had reaped the per-arch digests pushed by the fast multi-arch legs
the day before.

Fix: split the linux backend matrix in two.

  backend-jobs-multiarch  - entries with `platform-tag` set (paired
    per-arch legs that feed backend-merge-jobs).
  backend-jobs-singlearch - entries without `platform-tag` (heavy
    standalone builds: CUDA, ROCm, Intel oneAPI, vLLM, sglang, etc.).

backend-merge-jobs now `needs:` only backend-jobs-multiarch. The
multi-arch matrix completes in ~2-3h, well inside quay's GC window.
Heavy single-arch entries keep running independently with no merge
dependency.

scripts/changed-backends.js gains a splitByArch() helper that
partitions filtered entries by whether `platform-tag` is set, and
emits matrix-singlearch + matrix-multiarch + has-backends-singlearch
+ has-backends-multiarch outputs (replacing the previous combined
matrix / has-backends pair). Applied in both the full-matrix and
filtered-matrix code paths. Smoke test: 199 single-arch + 72 multi-
arch + 35 darwin = 271 total entries; 36 merge-matrix entries
(one per multi-arch backend pair). Matches expectation.

Local `make backends/<name>` is unaffected — the script's outputs
only feed CI workflow matrices.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 18:15:53 +02:00
Ettore Di Giacinto
6113e5a4d0 docs(ci-caching): list all paths that retrigger base-images.yml
Now that base-images.yml's master-push trigger includes the install
script and apt-mirror script (commit 7fff8584), enumerate them in
the workflow-surfaces table instead of the vaguer "base Dockerfile/
workflow" phrasing.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 22:31:37 +00:00
Ettore Di Giacinto
7fff858408 ci(base-images): also trigger rebuild on .docker/install-base-deps.sh changes
base-images.yml's master-push trigger had a path filter listing only
backend/Dockerfile.base-grpc-builder and .github/workflows/base-images.yml.
That misses .docker/install-base-deps.sh — which is the actual source
of truth for what goes into each base image (apt deps, gRPC, conditional
CUDA/ROCm/Vulkan installs). The script is bind-mounted into the base
Dockerfile at build time; changes to it would change the produced
images, but without this path filter, the workflow wouldn't auto-rebuild
on those changes. Stale bases would persist until Saturday's cron or a
manual workflow_dispatch.

Same applies to .docker/apt-mirror.sh, also bind-mounted by the base
Dockerfile.

Add both to the trigger paths so consumer-affecting changes to either
file rebuild the bases automatically.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 22:30:46 +00:00
LocalAI [bot]
6fd21d5cf3 docs(agents): update CI caching docs after the GHA-free-tier migration (#9742)
The migration shipped over a sequence of PRs (#9726#9727#9730#9731#9737#9738 plus a handful of direct-to-master fixes) and
left the .agents/ docs significantly out of date.

Updated:

- .agents/ci-caching.md (significant rewrite)
  - Cache key shape: now includes per-arch suffix (cache<suffix>-<arch>).
  - New "Workflow surfaces" overview table.
  - New "Pre-built base images (base-grpc-*)" section covering the 10
    quay.io/go-skynet/ci-cache:base-grpc-* tags, the multi-target
    Dockerfile pattern (builder-fromsource / builder-prebuilt /
    aliasing FROM), the BUILDER_BASE_IMAGE → BUILDER_TARGET derivation,
    the bootstrap-on-branch order for new variants.
  - New "Per-arch native builds + manifest merge" section: split
    matrix entries, push-by-digest, backend_merge.yml, why
    provenance: false matters.
  - New "Path filter on master push" section: changed-backends.js
    handles push events via the Compare API; weekly Sunday cron is
    the safety net for unpinned Python deps.
  - New "ccache for C++ backend builds" section.
  - New "Composite actions" section: free-disk-space and
    setup-build-disk.
  - New "Concurrency" section documenting the per-PR-per-commit group
    fix.
  - Darwin section gains the brew link --overwrite note (after-
    cache-restore symlinks weren't restored) and the llama-cpp-darwin
    consolidation context.
  - "Self-hosted runners" section confirming the matrix is free of
    arc-runner-set / bigger-runner references except the residual
    test-extra.yml vibevoice case.
  - "Touching the cache pipeline" rule list extended (provenance,
    install-base-deps.sh single-source-of-truth, base-images bootstrap
    order).

- .agents/adding-backends.md
  - Section 2 title: backend.yml -> backend-matrix.yml (path moved).
  - New paragraph on per-arch entries (platform-tag + paired matrix
    rows + auto-firing merge job).
  - New paragraph on builder-base-image for llama-cpp / ik-llama-cpp /
    turboquant.
  - Final checklist line updated accordingly.

- .agents/building-and-testing.md
  - Reference: backend.yml -> backend-matrix.yml.
  - Note about builder-base-image and BUILDER_TARGET defaulting to
    builder-fromsource for local builds.

- AGENTS.md
  - One-line description update for ci-caching.md to mention the new
    infrastructure (per-arch keys, base-grpc-*, manifest-merge,
    setup-build-disk, path filter).

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 00:28:57 +02:00
LocalAI [bot]
6cbf69dc29 chore: ⬆️ Update ggml-org/llama.cpp to 1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0 (#9739)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 00:06:29 +02:00
LocalAI [bot]
593f3a8648 ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738)
* ci(backend_build): plumb builder-base-image and BUILDER_TARGET build-args

Adds an optional builder-base-image input. When set, BUILDER_BASE_IMAGE
is forwarded as a build-arg AND BUILDER_TARGET=builder-prebuilt is set
to select the variant Dockerfile's prebuilt-base stage. When empty,
BUILDER_TARGET=builder-fromsource (the default) keeps the existing
from-source build path.

This makes the prebuilt-base optimization opt-in per matrix entry
without breaking local `make backends/<name>` invocations or backends
whose Dockerfile doesn't have a prebuilt path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(llama-cpp,ik-llama-cpp,turboquant): multi-target Dockerfiles for prebuilt + from-source

Restructure the three llama.cpp-derived Dockerfiles so each supports
two builder paths in a single file, selected via the BUILDER_TARGET
build-arg:

  BUILDER_TARGET=builder-fromsource (default)
    - Standalone build: gRPC stage + apt installs + (conditionally)
      CUDA/ROCm/Vulkan + compile.
    - Used by `make backends/llama-cpp` locally and any caller that
      doesn't supply a prebuilt base.

  BUILDER_TARGET=builder-prebuilt
    - FROM \${BUILDER_BASE_IMAGE} (one of quay.io/go-skynet/ci-cache:
      base-grpc-* shipped in PR #9737).
    - Skips ~25-35 min of gRPC compile + ~5-10 min of toolchain installs.
    - Used by CI when the matrix entry sets builder-base-image.

Final FROM scratch resolves BUILDER_TARGET via an aliasing FROM stage
(BuildKit doesn't support variable expansion directly in COPY --from),
then COPY --from=builder pulls package output from the chosen path.
BuildKit prunes the unreferenced builder, so each build only does the
work for the chosen path.

The compile RUN is identical between both builder stages, so it's
factored into .docker/<name>-compile.sh and bind-mounted into both.
ccache mount + cache-id stay per-arch / per-build-type.

Local DX preserved: `make backends/llama-cpp` (no extra args) defaults
to BUILDER_TARGET=builder-fromsource and works exactly as before.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend.yml,backend_pr.yml): forward builder-base-image from matrix

Plumbs the new optional builder-base-image input from matrix into
backend_build.yml. backend_build.yml derives BUILDER_TARGET from
whether builder-base-image is set, so matrix entries that map to a
prebuilt base get the prebuilt path; entries that don't (python/go/
rust backends) fall through to the default builder-fromsource (which
their own Dockerfiles don't reference, so it's a no-op for them).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend-matrix): wire builder-base-image to llama-cpp variants

For every entry whose Dockerfile is llama-cpp/ik-llama-cpp/turboquant,
add a builder-base-image field pointing at the appropriate prebuilt
quay.io/go-skynet/ci-cache:base-grpc-* tag.

backend_build.yml derives BUILDER_TARGET from this field's presence:
non-empty -> builder-prebuilt; empty -> builder-fromsource. So this
commit alone activates the prebuilt-base path for these 23 backends
in CI, while local `make backends/<name>` (no extra args) keeps the
from-source path.

Mapping by (build-type, arch):
- '' / amd64        -> base-grpc-amd64
- '' / arm64        -> base-grpc-arm64
- cublas-12 / amd64 -> base-grpc-cuda-12-amd64
- cublas-13 / amd64 -> base-grpc-cuda-13-amd64
- cublas-13 / arm64 -> base-grpc-cuda-13-arm64
- hipblas / amd64   -> base-grpc-rocm-amd64
- vulkan / amd64    -> base-grpc-vulkan-amd64
- vulkan / arm64    -> base-grpc-vulkan-arm64
- sycl_* / amd64    -> base-grpc-intel-amd64
- cublas-12 + JetPack r36.4.0 / arm64 -> base-grpc-l4t-cuda-12-arm64

Cold-build savings expected: ~25-35 min per variant (skips the gRPC
compile + toolchain install that's now in the base).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add base-grpc-l4t-cuda-12-arm64 variant for legacy JetPack entries

Two matrix entries (-nvidia-l4t-arm64-llama-cpp, -nvidia-l4t-arm64-
turboquant) build against nvcr.io/nvidia/l4t-jetpack:r36.4.0 + CUDA
12 ARM64. They're distinct from -nvidia-l4t-cuda-13-arm64-* which use
Ubuntu 24.04 + CUDA 13 sbsa. Add the missing JetPack-based variant
to base-images.yml so those two entries' builder-base-image mapping
in the previous commit resolves.

Bootstrap order before merging this PR (re-run base-images.yml on
this branch — 9 existing variants hit BuildKit cache, only the new
l4t-cuda-12-arm64 builds cold):

  gh workflow run base-images.yml --ref ci/base-images-consumers

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract base-builder install logic into .docker/install-base-deps.sh

Pre-extraction, the apt + protoc + cmake + conditional CUDA/ROCm/Vulkan
+ gRPC install logic was duplicated across four files:
  - backend/Dockerfile.base-grpc-builder (CI prebuilt-base source of truth)
  - backend/Dockerfile.llama-cpp (builder-fromsource stage)
  - backend/Dockerfile.ik-llama-cpp (builder-fromsource stage)
  - backend/Dockerfile.turboquant (builder-fromsource stage)

A bump to e.g. CUDA toolkit packages had to be made in 4 places, and
drift between the prebuilt base and the variant-Dockerfile from-source
path was a real concern (ik-llama-cpp's hipblas branch was already
missing the rocBLAS Kernels echo that llama-cpp / turboquant /
base-grpc-builder all had).

Factor the install logic into a single .docker/install-base-deps.sh
that reads its inputs from env vars and runs conditionally on
BUILD_TYPE / CUDA_*_VERSION / TARGETARCH. Each Dockerfile now bind-
mounts the script alongside .docker/apt-mirror.sh and invokes it from
a single RUN step.

The variant Dockerfiles' grpc-source stage is removed entirely — the
script handles gRPC compile + install at /opt/grpc, and the
builder-fromsource stage mirrors builder-prebuilt by copying
/opt/grpc/. to /usr/local/.

Result:
  - install-base-deps.sh: 244 lines (one source of truth)
  - Dockerfile.base-grpc-builder: 268 -> 98 lines
  - Dockerfile.llama-cpp: 361 -> 157 lines
  - Dockerfile.ik-llama-cpp: 348 -> 151 lines
  - Dockerfile.turboquant: 355 -> 154 lines
  - Total Dockerfile bytes: 1332 -> 560 lines (58% reduction)

Bit-equivalence between prebuilt and from-source paths is now enforced
by construction: both invoke the same script with the same inputs.
A side-effect is that ik-llama-cpp now also gets the rocBLAS Kernels
echo + clblas block parity it was previously missing.

Includes the BUILD_TYPE=clblas branch (libclblast-dev) for parity even
though no current CI matrix entry uses it.

After this commit's force-push, base-images.yml needs to be redispatched
on this branch — the Dockerfile.base-grpc-builder content shifts so the
existing cache won't apply for the install layer (gRPC layer also
rebuilds since it's now in the same RUN step).

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(base-images): skip-drivers on JetPack l4t variant

cuda-nvcc-12-0 isn't installable via apt on the JetPack r36.4.0 base
image — JetPack ships CUDA preinstalled at /usr/local/cuda and its
apt feed doesn't carry the cuda-nvcc-* packages from the public
repositories. The original matrix entry for -nvidia-l4t-arm64-llama-cpp
on master sets skip-drivers: 'true' for exactly this reason; the
new base-grpc-l4t-cuda-12-arm64 base needs to match.

Also forwards SKIP_DRIVERS as a build-arg from matrix into the build
(was missing entirely before this commit).

Caught by run 25612030775 — l4t-cuda-12-arm64 failed at:
  E: Package 'cuda-nvcc-12-0' has no installation candidate

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 00:03:52 +02:00
Ettore Di Giacinto
acc5588d2c ci(darwin): force-link brew formulas after cache restore
Symptom: `ccache: command not found` in the Configure ccache step on
runs that hit the brew cache.

Root cause: actions/cache restores /opt/homebrew/Cellar/<formula> but
NOT the bin symlinks at /opt/homebrew/bin/*. The subsequent
`brew install` sees the Cellar entries present and decides "already
installed" — without re-running the link step. So on cache-hit runs
none of the cached formulas are actually on PATH.

Fix: explicit `brew link --overwrite` for every formula we install,
right after `brew install`. --overwrite tolerates leftover symlinks
from a partial earlier install. The 2>/dev/null + || true keeps the
step from failing if a formula is already correctly linked.

Pre-existing flake; surfaces more often as Darwin matrix coverage
grows after the llama-cpp-darwin consolidation in #9731.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 20:41:03 +00:00
LocalAI [bot]
28e29625a2 ci: add pre-built base-grpc-builder image infrastructure (PR 1/2) (#9737)
Introduces a parameterized Dockerfile.base-grpc-builder that produces
a fully-prepped builder base image (apt deps + protoc + cmake + gRPC
at /opt/grpc + conditional CUDA/ROCm/Vulkan toolchains) and a
base-images.yml workflow that builds + pushes 9 variants to
quay.io/go-skynet/ci-cache:base-grpc-*:

  base-grpc-amd64                 (Ubuntu 24.04, CPU-only)
  base-grpc-arm64                 (Ubuntu 24.04, CPU-only)
  base-grpc-cuda-12-amd64         (Ubuntu 24.04 + CUDA 12.8)
  base-grpc-cuda-13-amd64         (Ubuntu 22.04 + CUDA 13.0)
  base-grpc-cuda-13-arm64         (Ubuntu 24.04 + CUDA 13.0 sbsa)
  base-grpc-rocm-amd64            (rocm/dev-ubuntu-24.04:7.2.1 + hipblas)
  base-grpc-vulkan-amd64          (Ubuntu 24.04 + Vulkan SDK 1.4.335)
  base-grpc-vulkan-arm64          (Ubuntu 24.04 + Vulkan SDK ARM 1.4.335)
  base-grpc-intel-amd64           (intel/oneapi-basekit:2025.3.2)

The variant Dockerfiles (Dockerfile.llama-cpp, ik-llama-cpp, turboquant)
are NOT touched in this PR. PR 2 will refactor them to FROM these
prebuilt bases. This PR is intentionally inert - landing it changes no
existing CI behavior. The base images don't exist on quay until
someone manually triggers the workflow.

Bootstrap after merge:
  gh workflow run base-images.yml --ref master
Wait ~30 min for all 9 variants to push, then merge PR 2 (the
consumer-side refactor that uses BUILDER_BASE_IMAGE build-arg to
FROM these tags).

Triggers afterwards:
  - Saturdays 05:00 UTC (cron) - picks up upstream security updates,
    runs ~24h before the backend.yml Sunday cron so bases are fresh.
  - workflow_dispatch - manual ad-hoc rebuild.
  - master push touching Dockerfile.base-grpc-builder or this workflow.

Why split into two PRs: the variant Dockerfiles in PR 2 will FROM the
prebuilt bases and have no from-source fallback. Their CI builds fail
if the bases don't exist on quay yet. Landing infrastructure first +
manual bootstrap + then consumer refactor avoids a broken-master window.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 18:44:42 +02:00
Ettore Di Giacinto
31aa0582a5 ci(ik-llama-cpp,turboquant): add BuildKit ccache mount to compile steps
Mirror the ccache mount added to Dockerfile.llama-cpp in 9228e5b4 for
the other two llama.cpp-derived backends. Same shape, distinct mount
ids so each backend's cache is independent:

  ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE}
  turboquant-ccache-${TARGETARCH}-${BUILD_TYPE}

ik_llama.cpp is a different upstream fork; no source overlap with
llama-cpp, separate cache makes sense.

turboquant is a llama.cpp fork that reuses backend/cpp/llama-cpp
source via a thin wrapper Makefile — most TUs would in principle hit
llama-cpp's ccache too. Keeping them separate for now to avoid one
fork's regressions poisoning the other; revisit sharing after we have
hit-rate numbers.

Same registry-export behavior as llama-cpp: the cache mount rides on
backend_build.yml's existing cache-to: type=registry,mode=max.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 16:21:49 +00:00
Ettore Di Giacinto
3568b2819d fix(gallery): keep auto-upgrade off non-dev backends when -development is installed (#9736)
A `-development` backend variant (e.g. `cuda12-llama-cpp-development`)
shares its `alias` with the stable counterpart and is meant to be a
drop-in replacement via ListSystemBackends alias resolution. Two paths
in the auto-upgrade flow let the stable variant slip back in on top of
the user's explicit dev pick:

1. ListSystemBackends emits a synthetic alias row keyed by the alias
   name that re-uses the chosen concrete's metadata pointer. In
   distributed mode, the worker's handleBackendList serialised that
   row over NATS as `{Name: <alias>, URI: <dev URI>, Digest: <dev>}`
   — the frontend can't reconstruct the alias relationship, and the
   wire-rebuilt row then carried `Metadata.Name = <alias>` and
   resolved against an unrelated gallery entry on the next upgrade
   check.
2. CheckUpgradesAgainst happily iterated the synthetic row in
   single-node too. Today the duplicate gallery lookup is harmless
   because both rows share the same `Metadata.Name`, but any gallery
   change that gives a meta backend a version, or any concrete
   sharing its alias with a dev counterpart, would surface a phantom
   non-dev upgrade and auto-upgrade would install it — shadowing the
   dev one through alias-token preference.

Two layered fixes:

- `core/services/worker/lifecycle.go` (`handleBackendList`): drop
  rows where the map key differs from `b.Metadata.Name`. Concrete
  and meta entries always have `key == Metadata.Name`; only synthetic
  aliases violate it. Workers now report only what's actually on disk;
  the per-node UI listing and CheckUpgrades both stop seeing phantoms.
- `core/gallery/upgrade.go` (`CheckUpgradesAgainst`): iterate by key,
  skip rows where `key != Metadata.Name` (belt-and-suspenders for any
  caller-supplied installed set), and apply the dev-aware rule —
  build a set of installed `Metadata.Name`s and drop any non-dev
  candidate `X` whose `X-<devSuffix>` counterpart is installed. Uses
  the configured dev suffix from `getFallbackTagValues(systemState)`.

Manual `POST /api/backends/upgrade/<name>` is unaffected: it goes
straight through `bm.UpgradeBackend(name)` without consulting the
suppression list, so users who genuinely want the stable variant
upgraded can still trigger it explicitly.

Tests in core/gallery/upgrade_test.go cover three cases under
"CheckUpgradesAgainst (distributed)": dev-only installed → only the
dev surfaces; both variants installed → dev still wins; synthetic
alias row is ignored. Generic backend names are used to avoid the
capability filter dropping cuda-prefixed entries on a CPU-only host.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 18:20:00 +02:00
Ettore Di Giacinto
9228e5b412 ci(llama-cpp): add BuildKit ccache mount to the compile step
The big RUN at line 268 of Dockerfile.llama-cpp re-runs from scratch on
every LLAMA_VERSION bump (or any LocalAI source change due to
COPY . /LocalAI just before). For CUDA-13 specifically that compile
recently hit the GHA 6h hard limit and failed:

  https://github.com/mudler/LocalAI/actions/runs/25598418931/job/75148244557

Add a BuildKit cache mount on /root/.ccache and thread ccache through
CMake (CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER) so most translation units
hit cache when their preprocessed source is byte-identical to the
previous build.

The cache mount is exported to the registry as part of the existing
cache-to: type=registry,mode=max in backend_build.yml, so it persists
across runs. mount id is keyed on TARGETARCH + BUILD_TYPE so different
variants don't thrash the same cache slot; sharing=locked serializes
concurrent writes.

Cold-build effect (first run after enable, or on LLAMA_VERSION bump
that touches every TU): unchanged. Hot-build effect (subsequent runs
with the same source, or LLAMA_VERSION bumps that touch a handful of
files): ~5-15 min for the llama.cpp compile vs the previous 1-3h cold.
For CUDA-13 specifically this should bring rebuilds well under the 6h
GHA limit.

Does NOT help the *first* post-bump build — that's still cold. For
that, follow-up work would be: (a) trim CUDA_DOCKER_ARCH to modern
GPUs only, (b) audit which CMake variants the published images
actually need, (c) pre-built CUDA+gRPC base image.

ccache package is already installed in the builder stage (line 90).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 16:16:46 +00:00
LocalAI [bot]
a91e718473 chore: ⬆️ Update ggml-org/llama.cpp to 00d56b11c3477b99bc18562dc1d1834f0d961778 (#9733)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-09 12:05:11 +02:00
Ettore Di Giacinto
6d2b7d893a ci: drop paths-ignore from test.yml and tests-e2e.yml
These workflows are configured as required status checks in branch
protection. With paths-ignore matching the PR diff, the workflow
doesn't trigger and no status is reported — branch protection then
blocks the PR with "Expected — Waiting for status to be reported"
indefinitely. Especially common for backend-only PRs since the ignore
list included backend/**.

Run the full test suite on every PR. Cost is ~5 min per PR for
tests-linux + ~similar for tests-apple + the e2e backend smoke; small
trade for unblocking PR merges.

Workflows affected:
- tests-linux (1.26.x), tests-apple (1.26.x) in test.yml
- tests-e2e-backend (1.25.x) in tests-e2e.yml

Other workflows that still have paths-ignore (none currently in the
required-checks list) are left as-is — adding them to required later
would re-introduce the same problem.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 09:23:51 +00:00
LocalAI [bot]
d1eef05852 chore: ⬆️ Update ikawrakow/ik_llama.cpp to ab0f22b819ac57b7e7484f69c00c10fc755d5c6c (#9734)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 11:18:59 +02:00
Ettore Di Giacinto
5a12392570 ci(concurrency): make cancel-in-progress event-aware, group by sha on push
Yesterday two PRs (#9724 llama.cpp bump, #9731 llama-cpp-darwin
consolidation) merged 11 seconds apart. Both shared the same
backend.yml concurrency group (ci-backends-refs/heads/master-...) due
to "${{ github.head_ref || github.ref }}" — empty head_ref on push
events falls through to the static refs/heads/master. With
cancel-in-progress: true that meant the second merge cancelled the
first's in-flight backend builds. The first PR's CI never finished;
the second PR only touched CI files so its run was a no-op.

Two changes per workflow:
- group: replace "${{ github.head_ref || github.ref }}" with
  "${{ github.event.pull_request.number || github.sha }}". On PRs
  this groups by PR number (same as before, just keyed on number not
  branch name); on push events it groups per-commit, so two master
  pushes never share a group.
- cancel-in-progress: gate on github.event_name == 'pull_request' so
  rapid pushes to a PR still cancel old runs (newer push wins) but
  master pushes never cancel each other.

Trade-off vs alternatives:
- Merge queue would also solve this and additionally test the merged
  commit before it lands. Heavier process change; out of scope here.
- Allowing per-commit master concurrency means two simultaneous master
  runs may overlap and race on tag pushes, but each commit's manifest
  digest is unique and the registry is last-writer-wins on tags —
  newer commit's tag overwrites older.

Applied to 11 workflows that share the same concurrency pattern:
backend.yml, backend_pr.yml, image.yml, image-pr.yml, lint.yml,
test.yml, test-extra.yml, tests-e2e.yml, tests-aio.yml,
tests-ui-e2e.yml, generate_intel_image.yaml.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 08:30:55 +00:00
Ettore Di Giacinto
05d6383393 Change vibevoice.cpp repository reference
Updated repository reference for vibevoice.cpp in bump_deps.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-09 10:30:11 +02:00
Ettore Di Giacinto
733c254b32 ci: consolidate llama-cpp-darwin into the matrix-driven Darwin flow (#9731)
The bespoke llama-cpp-darwin + llama-cpp-darwin-publish top-level jobs
in backend.yml ran unconditionally on every backend.yml trigger
(push/cron), bypassing the path filter that all 34 other Darwin
backends already honor via backend-jobs-darwin -> backend_build_darwin.yml.

Move llama-cpp into the includeDarwin matrix:
- New entry in .github/backend-matrix.yml (lang=go, no build-type).
- backend_build_darwin.yml gains an `if: inputs.backend == 'llama-cpp'`
  build step that drives `make backends/llama-cpp-darwin`. The bespoke
  script (scripts/build/llama-cpp-darwin.sh) compiles three CMake
  variants from backend/cpp/llama-cpp and bundles dylibs via otool, so
  it doesn't fit the build-darwin-go-backend mold; the existing
  llama-cpp-aware ccache setup blocks already in this workflow are
  what motivated the consolidation in the first place.
- scripts/changed-backends.js's inferBackendPathDarwin gains a special
  case so llama-cpp on Darwin maps to backend/cpp/llama-cpp/ (the C++
  source tree) rather than the non-existent backend/go/llama-cpp/.
- Bumps Darwin go-version from 1.24.x -> 1.25.x in backend.yml and
  backend_pr.yml so llama-cpp keeps the Go toolchain it had under the
  bespoke job; the other 34 Darwin backends pick this up too with no
  known reason to pin 1.24.
- Removes ~80 lines of bespoke YAML from backend.yml.

The publish path is unchanged in shape - every Darwin backend now uses
the same crane-push leg from ubuntu-latest in
backend_build_darwin.yml; only the build target differs per backend.

After this commit, llama-cpp-darwin only rebuilds when
backend/cpp/llama-cpp/ is touched (verified locally) - same behavior
as every other Darwin backend.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 10:18:17 +02:00
LocalAI [bot]
4542833cb4 chore: ⬆️ Update ggml-org/llama.cpp to 9f5f0e689c9e977e5f23a27e344aa36082f44738 (#9724)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 10:18:05 +02:00
LocalAI [bot]
f0374aa0e8 ci: finish GHA free-tier migration (per-arch fan-out, image splits, retire self-hosted, fix provenance) (#9730)
* ci: add per-arch + manifest-merge support for LocalAI server image

Mirror the backend_build.yml + backend_merge.yml pattern shipped in
PR #9726 for the LocalAI server image:

- image_build.yml accepts optional platform-tag (default ''), scopes
  registry cache to cache-localai<suffix>-<platform-tag>, and pushes
  by canonical digest only on push events. Digests upload as artifacts
  named digests-localai<suffix>-<platform-tag>, with a "-core"
  placeholder when tag-suffix is empty so the merge job's download
  pattern doesn't over-match across multiple suffixes.
- image_merge.yml is a new reusable workflow that downloads matching
  digest artifacts and assembles the final tagged manifest list via
  docker buildx imagetools create.

Image names differ from backend_*.yml: the LocalAI server is published
under quay.io/go-skynet/local-ai and localai/localai (not -backends).

Not yet wired into image.yml / image-pr.yml — Commit C does that.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: fan out per-arch split to remaining 34 backends

Convert all remaining linux/amd64,linux/arm64 entries in
backend-matrix.yml to per-arch + manifest-merge form. Each was a
single matrix entry running both arches on x86 under QEMU emulation;
each becomes two entries — amd64 on ubuntu-latest, arm64 on
ubuntu-24.04-arm (native).

Four backends that were on bigger-runner (-cpu-llama-cpp,
-cpu-turboquant, -gpu-vulkan-llama-cpp, -gpu-vulkan-turboquant) have
both legs moved to free tier as part of the same change. They are
compile-only (no torch/CUDA install) and fit comfortably with the
setup-build-disk /mnt relocation. Phase 4 (next commit) retires the
remaining 5 single-arch bigger-runner entries.

After this commit:
- 271 total matrix entries (was 237)
- 0 multi-arch entries left
- 36 per-arch pairs (34 new + 2 pilots from PR #9727)
- 5 bigger-runner entries remaining (single-arch, Phase 4 target)

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: split LocalAI image multi-arch entries per arch + merge

Mirror the backend per-arch split for the main LocalAI image:

- image.yml's core-image-build matrix: split the core ('') and
  -gpu-vulkan entries into amd64 + arm64 legs each. amd64 on
  ubuntu-latest, arm64 on ubuntu-24.04-arm (native).
- New top-level core-image-merge and gpu-vulkan-image-merge jobs
  call image_merge.yml after core-image-build completes.
- image-pr.yml's image-build matrix: split the -vulkan-core entry.
  No merge job added on the PR side — image_build.yml's digest-push
  is push-only-event-gated, so a PR-side merge would have nothing
  to download.

After this commit, no workflow file references
linux/amd64,linux/arm64 in a single matrix slot.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: retire bigger-runner from backend matrix (Phase 4)

Migrate the remaining 5 single-arch bigger-runner entries to
ubuntu-latest. Combined with the Phase 3 setup-build-disk /mnt
relocation (PR #9726), free-tier ubuntu-latest now has ~100 GB of
working space — enough for ROCm dev image (~16 GB), CUDA toolkit
(~5 GB), and the per-backend compile/install steps these entries do.

Backends migrated:
- -gpu-nvidia-cuda-12-llama-cpp
- -gpu-nvidia-cuda-12-turboquant
- -gpu-rocm-hipblas-faster-whisper
- -gpu-rocm-hipblas-coqui
- -cpu-ik-llama-cpp

After this commit, .github/backend-matrix.yml has zero bigger-runner
references. The bigger-runner used in tests-vibevoice-cpp-grpc-
transcription (test-extra.yml) is a separate concern handled in a
follow-up.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 9 Intel oneAPI backends to free tier (Phase 5.1)

Intel oneAPI base image is ~6 GB; each backend's wheel install
stays well within the ~100 GB working space provided by Phase 3's
setup-build-disk /mnt relocation. Lowest-risk batch of the
arc-runner-set retirement.

Backends migrated:
  vllm, sglang, vibevoice, qwen-asr, nemo, qwen-tts,
  fish-speech, voxcpm, pocket-tts (all -gpu-intel-* variants).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 15 ROCm Python backends to free tier (Phase 5.2)

ROCm dev image (~16 GB) plus per-backend torch/wheels install fits
on ubuntu-latest with the /mnt-relocated Docker root. These entries
include the heavier vLLM/sglang/transformers/diffusers stack on
ROCm; if any specific backend OOMs or runs out of disk, individual
flips back to arc-runner-set are revertable per-entry.

Backends migrated: all 15 -gpu-rocm-hipblas-* entries previously on
arc-runner-set (vllm/vllm-omni/sglang/transformers/diffusers/
ace-step/kokoro/vibevoice/qwen-asr/nemo/qwen-tts/fish-speech/
voxcpm/pocket-tts/neutts).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 6 CUDA Python backends to free tier (Phase 5.3)

vLLM/sglang stacks on CUDA 12 and CUDA 13 are the heaviest
backends in the matrix — flash-attn intermediate layers can spike
disk usage during build. setup-build-disk's /mnt relocation gives
~100 GB working space which fits the documented peak.

Highest-risk batch of the arc-runner-set retirement; if any
backend fails to build on free tier, the per-entry runs-on flip
is the unit of revert.

Backends migrated: -gpu-nvidia-cuda-{12,13}-{vllm,vllm-omni,sglang}.

After this commit, .github/backend-matrix.yml has zero references
to arc-runner-set or bigger-runner. The migration is complete.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: disable provenance on multi-registry digest pushes

Root-caused on master via PR #9727's pilot: when docker/build-push-action@v7
pushes a single build to TWO registries simultaneously with
push-by-digest=true, buildx generates a per-registry provenance
attestation manifest (because mode=max — the default for push:true —
includes the runner ID). That makes the resulting manifest-list digest
diverge across registries:

  arm64 -cpu-faster-whisper build:
    image manifest:        sha256:d3bdd34b... (identical, content-only)
    quay manifest list:    sha256:66b4cfc8... (with quay attestation)
    dockerhub manifest list: sha256:e0733c3b... (with dockerhub attestation)

steps.build.outputs.digest returns only one of the list digests
(empirically the dockerhub one). The merge job then asks
"quay.io/...@sha256:e0733c3b..." which doesn't exist on quay — that
list has digest 66b4cfc8 there. Result: imagetools create fails with
"not found" and the merge job fails (run 25581983094, job 75110021491).

Setting provenance: false drops the per-registry attestation; the
manifest-list digest becomes pure content, identical across both
registries, and steps.build.outputs.digest works on either lookup.

Applied to backend_build.yml and image_build.yml — both refactored
to use the same multi-registry digest-push pattern in the prior PRs.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 09:37:00 +02:00
Ettore Di Giacinto
fe7b27eb66 test(ci): trigger faster-whisper rebuild to observe per-arch+merge
The PR that introduced the per-arch + manifest-merge pilot (#9727)
only touched CI infrastructure files, so the path filter correctly
skipped backend builds on its merge commit. To observe the new
backend-merge-jobs flow assemble a real manifest list, this commit
touches faster-whisper's Makefile so its two new per-arch entries
schedule and the merge job runs.

The trailing comment is the smallest possible diff and is harmless
to the build.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 22:09:46 +00:00
LocalAI [bot]
14a3275329 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 98950267c67fd95937a54ebd6e3c66cf2679b710 (#9725)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 00:06:05 +02:00
LocalAI [bot]
cb68cd1cf4 ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization (#9727)
ci: pilot per-arch split for faster-whisper and llama-cpp-quantization

Convert two backends from QEMU-emulated multi-arch (linux/amd64,linux/arm64
on a single ubuntu-latest) to native per-arch + manifest-list merge:
- amd64 leg on ubuntu-latest
- arm64 leg on ubuntu-24.04-arm (native, ~5-10x faster than emulated)
- merge job assembles both digests under the final tag via
  docker buildx imagetools create

Backends piloted:
- -cpu-faster-whisper (small Python, fast baseline)
- -cpu-llama-cpp-quantization (heavier compile path, stress test)

Infrastructure changes that the rest of Phase 2 (Tasks 2.5+) will reuse:
- .github/backend-matrix.yml entries gain a `platform-tag` field
  ('amd64'/'arm64') for matrix entries that participate in the split.
  Other entries omit it; backend_build.yml already defaults missing
  values to '' (empty cache key suffix preserved as cache<suffix>-).
- backend.yml + backend_pr.yml forward `platform-tag` from matrix to
  the reusable backend_build.yml.
- scripts/changed-backends.js groups filtered entries by tag-suffix
  and emits a `merge-matrix` (plus `has-merges`) for groups of size>=2.
  Singletons aren't merged.
- backend.yml + backend_pr.yml gain a `backend-merge-jobs` job that
  consumes merge-matrix and calls backend_merge.yml after backend-jobs.
  PR variant is also event-gated so the no-op-on-PR merge job doesn't
  even start.

The other 34 multi-arch entries are unchanged in this PR -- Task 2.5
fans out the same shape to them once the pilot is observed green.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 00:04:42 +02:00
LocalAI [bot]
624fa946f8 feat(swagger): update swagger (#9723)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 23:44:55 +02:00
LocalAI [bot]
1f313cfdb0 ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief) (#9726)
* ci: extract free-disk-space composite action

Consolidate the apt-clean + dotnet/android/ghc/boost removal blocks from
backend_build.yml, image_build.yml, and test.yml into a single composite
action. The three callers had slightly different inline blocks; the
composite uses the more aggressive backend_build/image_build variant for
all three callers — test.yml jobs now also purge snapd, edge/firefox/
powershell/r-base-core, and sweep /opt/ghc + /usr/local/share/boost +
$AGENT_TOOLSDIRECTORY. Idempotent and skipped on self-hosted runners.

In test.yml, actions/checkout now runs before the composite action call
because the composite lives at ./.github/actions/free-disk-space and
requires a checked-out repo. The original ordering relied on
jlumbroso/free-disk-space@main being a remote action; this is the
minimum-invasive change to support a local composite.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: path-filter backend.yml master push

Run scripts/changed-backends.js on master pushes too (not just PRs) so
unrelated commits don't rebuild all ~210 backend container images. Tag
pushes still build the full matrix via FORCE_ALL.

Push events use the GitHub Compare API to diff event.before..event.after.
Edge cases (first push with zero base, API truncation beyond 300 files,
missing fields, network failure) fall back to "run everything" — better
safe than silently miss a backend.

The matrix literal moves from .github/workflows/backend.yml into a new
data-only file at .github/backend-matrix.yml (outside workflows/ so
actionlint doesn't try to parse it as a workflow). Both backend.yml and
backend_pr.yml now consume the dynamic matrix output uniformly via
fromJson(needs.generate-matrix.outputs.matrix); the script reads the
matrix from the new location.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: bound max-parallel on backend-jobs matrices

Cap to 8 concurrent jobs to avoid queue starvation on the shared GHA free
pool while migration is in flight. Lift after Phases 4-5 retire the
self-hosted runners. Also drops a leftover commented-out max-parallel
line that lived in backend.yml since the previous matrix shape.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: scope backend cache per arch, push by digest

Prepare backend_build.yml for the multi-arch split. The reusable
workflow now accepts a `platform-tag` input ("amd64" / "arm64") that
scopes the registry cache to cache<suffix>-<platform-tag> and (on push
events) pushes the resulting image by canonical digest only. Digests
are uploaded as artifacts named digests<suffix>-<platform-tag> for the
merge job (Task 2.2) to consume.

`platform-tag` is optional with empty default during the migration —
existing callers continue to work unchanged (their cache key just
becomes `cache<suffix>-`, an orphaned but valid key). Tasks 2.3+ will
update callers to pass an explicit "amd64" / "arm64" value. Phase 6
flips the input to required: true once every caller is wired.

PR builds keep their existing tag-based push to ci-tests but pick up
the per-arch cache key. Multi-arch PR builds remain emulated in this
commit; they migrate when the matrix entries split (Tasks 2.3+).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend_merge.yml reusable workflow

Joins per-arch digest artifacts (uploaded by backend_build.yml when
called with platform-tag) into a single tagged multi-arch manifest list
via `docker buildx imagetools create`. Called once per backend by
backend.yml after both per-arch build jobs succeed.

The workflow generates final tags identically to the previous monolithic
build job (same docker/metadata-action invocation), so consumers of
quay.io/go-skynet/local-ai-backends and localai/localai-backends see no
tag-shape change. Two imagetools calls (one per registry) reference the
same per-arch digests under different image names.

Not yet wired into backend.yml — Tasks 2.3+ rewrite individual matrix
entries to expand into per-arch + merge jobs that call this workflow.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: relocate Docker data-root to /mnt on hosted runners

GHA hosted ubuntu-latest runners ship a ~75 GB /mnt drive that's unused
by default. Stopping Docker, rsync'ing /var/lib/docker to /mnt, and
restarting with data-root pointing there yields ~100 GB of working
space (combined with the apt-clean from Task 1.1) — enough for ROCm
dev image + vLLM torch install + flash-attn intermediate layers.

This is the structural change that lets Phases 4 and 5 of the migration
plan move the bigger-runner and arc-runner-set jobs onto ubuntu-latest.

The composite action is no-op on self-hosted runners (where /mnt isn't
expected) and on non-X64 runners (Task 3.2 verifies the arm64 hosted
pool's /mnt shape separately before enabling). Wired into both
backend_build.yml and image_build.yml between free-disk-space and the
first Docker operation.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(setup-build-disk): chmod 1777 /mnt/docker-tmp

buildx CLI runs as the unprivileged 'runner' user and creates config
dirs under TMPDIR before binding them into the buildkit container.
/mnt is root-owned by default, so the original mkdir produced a
permission-denied when buildx tried to write there:

  ERROR: mkdir /mnt/docker-tmp/buildkitd-config2740457204: permission denied

Mirror /tmp's permission mode (1777 — world-writable with sticky bit)
on /mnt/docker-tmp so non-root processes can stage their config.

Caught by the first PR run (image-build hipblas job) on PR #9726.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: weekly full-matrix rebuild via cron

Path-filtering backend.yml master push (the previous commit's main
optimization) skips backends whose source didn't change. That broke
the DEPS_REFRESH cache-buster's coverage: the build-arg keyed on
%Y-W%V busts the install layer's cache on a new ISO week, but only
when the build actually runs. Untouched Python backends (torch,
transformers, vllm with no version pin) would otherwise ship stale
wheels indefinitely.

Add a Sunday 06:00 UTC cron that fires the full matrix. Schedule
events have no event.ref / event.before, so the script's changedFiles
== null fallback (scripts/changed-backends.js) emits the full matrix
automatically — no script change needed.

C++/Go backends with pinned deps cache-hit and complete fast, so the
weekly cost is dominated by Python re-resolves which is exactly what
we want.

workflow_dispatch added so a maintainer can trigger an ad-hoc
full-matrix rebuild without faking a tag push.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 23:43:41 +02:00
LocalAI [bot]
0c1f1e6cbd chore(model gallery): 🤖 add 1 new models via gallery agent (#9720)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 16:26:03 +02:00
Richard Palethorpe
670259ce43 chore: Security hardening (#9719)
* fix(http): close 0.0.0.0/[::] SSRF bypass in /api/cors-proxy

The CORS proxy carried its own private-network blocklist (RFC 1918 + a
handful of IPv6 ranges) instead of using the same classification as
pkg/utils/urlfetch.go. The hand-rolled list missed 0.0.0.0/8 and ::/128,
both of which Linux routes to localhost — so any user with FeatureMCP
(default-on for new users) could reach LocalAI's own listener and any
other service bound to 0.0.0.0:port via:

  GET /api/cors-proxy?url=http://0.0.0.0:8080/...
  GET /api/cors-proxy?url=http://[::]:8080/...

Replace the custom check with utils.IsPublicIP (Go stdlib IsLoopback /
IsLinkLocalUnicast / IsPrivate / IsUnspecified, plus IPv4-mapped IPv6
unmasking) and add an upfront hostname rejection for localhost, *.local,
and the cloud metadata aliases so split-horizon DNS can't paper over the
IP check.

The IP-pinning DialContext is unchanged: the validated IP from the
single resolution is reused for the connection, so DNS rebinding still
cannot swap a public answer for a private one between validate and dial.

Regression tests cover 0.0.0.0, 0.0.0.0:PORT, [::], ::ffff:127.0.0.1,
::ffff:10.0.0.1, file://, gopher://, ftp://, localhost, 127.0.0.1,
10.0.0.1, 169.254.169.254, metadata.google.internal.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(downloader): verify SHA before promoting temp file to final path

DownloadFileWithContext renamed the .partial file to its final name
*before* checking the streamed SHA, so a hash mismatch returned an
error but left the tampered file at filePath. Subsequent code that
operated on filePath (a backend launcher, a YAML loader, a re-download
that finds the file already present and skips) would consume the
attacker-supplied bytes.

Reorder: verify the streamed hash first, remove the .partial on
mismatch, then rename. The streamed hash is computed during io.Copy
so no second read is needed.

While here, raise the empty-SHA case from a Debug log to a Warn so
"this download had no integrity check" is visible at the default log
level. Backend installs currently pass through with no digest; the
warning makes that footprint observable without changing behaviour.

Regression test asserts os.IsNotExist on the destination after a
deliberate SHA mismatch.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(auth): require email_verified for OIDC admin promotion

extractOIDCUserInfo read the ID token's "email" claim but never
inspected "email_verified". With LOCALAI_ADMIN_EMAIL set, an attacker
who could register on the configured OIDC IdP under that email (some
IdPs accept self-supplied unverified emails) inherited admin role:

  - first login:  AssignRole(tx, email, adminEmail) → RoleAdmin
  - re-login:     MaybePromote(db, user, adminEmail) → flip to RoleAdmin

Add EmailVerified to oauthUserInfo, parse email_verified from the OIDC
claims (default false on absence so an IdP that omits the claim cannot
short-circuit the gate), and substitute "" for the role-decision email
when verified=false via emailForRoleDecision. The user record still
stores the unverified email for display.

GitHub's path defaults EmailVerified=true: GitHub only returns a public
profile email after verification, and fetchGitHubPrimaryEmail explicitly
filters to Verified=true.

Regression tests cover both the helper contract and integration with
AssignRole, including the bootstrap "first user" branch that would
otherwise mask the gate.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(cli): refuse public bind when no auth backend is configured

When neither an auth DB nor a static API key is set, the auth
middleware passes every request through. That is fine for a developer
laptop, a home LAN, or a Tailnet — the network itself is the trust
boundary. It is not fine on a public IP, where every model install,
settings change, and admin endpoint becomes reachable from the
internet.

Refuse to start in that exact configuration. Loopback, RFC 1918,
RFC 4193 ULA, link-local, and RFC 6598 CGNAT (Tailscale's default
range) all count as trusted; wildcard binds (`:port`, `0.0.0.0`,
`[::]`) are accepted only when every host interface is in one of those
ranges. Hostnames are resolved and treated as trusted only when every
answer is.

A new --allow-insecure-public-bind / LOCALAI_ALLOW_INSECURE_PUBLIC_BIND
flag opts out for deployments that gate access externally (a reverse
proxy enforcing auth, a mesh ACL, etc.). The error message lists this
plus the three constructive alternatives (bind a private interface,
enable --auth, set --api-keys).

The interface enumeration goes through a package-level interfaceAddrsFn
var so tests can simulate cloud-VM, home-LAN, Tailscale-only, and
enumeration-failure topologies without poking at the real network
stack.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): regression-test the localai_assistant admin gate

ChatEndpoint already rejects metadata.localai_assistant=true from a
non-admin caller, but the gate was open-coded inline with no direct
test coverage. The chat route is FeatureChat-gated (default-on), and
the assistant's in-process MCP server can install/delete models and
edit configs — the wrong handler change would silently turn the LLM
into a confused deputy.

Extract the gate into requireAssistantAccess(c, authEnabled) and pin
its behaviour: auth disabled is a no-op, unauthenticated is 403,
RoleUser is 403, RoleAdmin and the synthetic legacy-key admin are
admitted.

No behaviour change in the production path.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): assert every API route is auth-classified

The auth middleware classifies path prefixes (/api/, /v1/, /models/,
etc.) as protected and treats anything else as a static-asset
passthrough. A new endpoint shipped under a brand-new prefix — or a
new path that simply isn't on the prefix allowlist — would be
reachable anonymously.

Walk every route registered by API() with auth enabled and a fresh
in-memory database (no users, no keys), and assert each API-prefixed
route returns 401 / 404 / 405 to an anonymous request. Public surfaces
(/api/auth/*, /api/branding, /api/node/* token-authenticated routes,
/healthz, branding asset server, generated-content server, static
assets) are explicit allowlist entries with comments justifying them.

Build-tagged 'auth' so it runs against the SQLite-backed auth DB
(matches the existing auth suite).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): pin agent endpoint per-user isolation contract

agents.go's getUserID / effectiveUserID / canImpersonateUser /
wantsAllUsers helpers are the single trust boundary for cross-user
access on agent, agent-jobs, collections, and skills routes. A
regression there is the difference between "regular user reads their
own data" and "regular user reads anyone's data via ?user_id=victim".

Lock in the contract:
  - effectiveUserID ignores ?user_id= for unauthenticated and RoleUser
  - effectiveUserID honours it for RoleAdmin and ProviderAgentWorker
  - wantsAllUsers requires admin AND the literal "true" string
  - canImpersonateUser is admin OR agent-worker, never plain RoleUser

No production change — this commit only adds tests.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(downloader): drop redundant stat in removePartialFile

The stat-then-remove pattern is a TOCTOU window and a wasted syscall —
os.Remove already returns ErrNotExist for the missing-file case, so trust
that and treat it as a no-op.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): redact secrets from trace buffer and distribution-token logs

The /api/traces buffer captured Authorization, Cookie, Set-Cookie, and
API-key headers verbatim from every request when tracing was enabled. The
endpoint is admin-only but the buffer is reachable via any heap-style
introspection and the captured tokens otherwise outlive the request.
Strip those header values at capture time. Body redaction is left to a
follow-up — the prompts are usually the operator's own and JSON-walking
is invasive.

Distribution tokens were also logged in plaintext from
core/explorer/discovery.go; logs forward to syslog/journald and outlive
the token. Redact those to a short prefix/suffix instead.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): rate-limit OAuth callbacks separately from password endpoints

The shared 5/min/IP limit on auth endpoints is right for password-style
flows but too tight for OAuth callbacks: corporate SSO funnels many real
users through one outbound IP and would trip the limit. Add a separate
60/min/IP limiter for /api/auth/{github,oidc}/callback so callbacks are
bounded against floods without breaking shared-IP deployments.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(gallery): verify backend tarball sha256 when set in gallery entry

GalleryBackend gained an optional sha256 field; the install path now
threads it through to the existing downloader hash-verify (which already
streams, verifies, and rolls back on mismatch). Galleries without sha256
keep working; the empty-SHA path still emits the existing
"downloading without integrity check" warning.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): pin CSRF coverage on multipart endpoints

The CSRF middleware in app.go is global (e.Use) so it covers every
multipart upload route — branding assets, fine-tune datasets, audio
transforms, agent collections. Pin that contract: cross-site multipart
POSTs are rejected; same-origin / same-site / API-key clients are not.
Also pins the SameSite=Lax fallback path the skipper relies on when
Sec-Fetch-Site is absent.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(http): XSS hardening — CSP headers, safe href, base-href escape, SVG sandbox

Several closely related XSS-prevention changes spanning the SPA shell, the
React UI, and the branding asset server:

- New SecurityHeaders middleware sets CSP, X-Content-Type-Options,
  X-Frame-Options, and Referrer-Policy on every response. The CSP keeps
  script-src permissive because the Vite bundle relies on inline + eval'd
  scripts; tightening that requires moving to a nonce-based policy.

- The <base href> injection in the SPA shell escaped attacker-controllable
  Host / X-Forwarded-Host headers — a single quote in the host header
  broke out of the attribute. Pass through SecureBaseHref (html.EscapeString).

- Three React sinks rendering untrusted content via dangerouslySetInnerHTML
  switch to text-node rendering with whiteSpace: pre-wrap: user message
  bodies in Chat.jsx and AgentChat.jsx, and the agent activity log in
  AgentChat.jsx. The hand-rolled escape on the agent user-message variant
  is replaced by the same plain-text path.

- New safeHref util collapses non-allowlisted URI schemes (most
  importantly javascript:) to '#'. Applied to gallery `<a href={url}>`
  links in Models / Backends / Manage and to canvas artifact links —
  these come from gallery JSON or assistant tool calls and must be treated
  as untrusted.

- The branding asset server attaches a sandbox CSP plus same-origin CORP
  to .svg responses. The React UI loads logos via <img>, but the same URL
  is also reachable via direct navigation; this prevents script
  execution if a hostile SVG slipped past upload validation.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(http): bound HTTP server with read-header and idle timeouts

A net/http server with no timeouts is trivially Slowloris-able and leaks
idle keep-alive connections. Set ReadHeaderTimeout (30s) to plug the
slow-headers attack and IdleTimeout (120s) to cap keep-alive sockets.

ReadTimeout and WriteTimeout stay at 0 because request bodies can be
multi-GB model uploads and SSE / chat completions stream for many
minutes; operators who need tighter per-request bounds should terminate
slow clients at a reverse proxy.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(auth): pin PUT /api/auth/profile field-tampering contract

The handler uses an explicit local body struct (only name and avatar_url)
plus a gorm Updates(map) with a column allowlist, so an attacker posting
{"role":"admin","email":"...","password_hash":"..."} can't mass-assign
those fields. Lock that down with a regression test so a future
"let's just c.Bind(&user)" refactor breaks loudly.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(services): strip directory components from multipart upload filenames

UploadDataset and UploadToCollectionForUser took the raw multipart
file.Filename and joined it into a destination path. The fine-tune
upload was incidentally safe because of a UUID prefix that fused any
leading '..' to a literal segment, but the protection is fragile.
UploadToCollectionForUser handed the filename to a vendored backend
without sanitising at all.

Strip to filepath.Base at both boundaries and reject the trivial
unsafe values ("", ".", "..", "/").

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): validate persisted MCP server entries on load

localStorage is shared across same-origin pages; an XSS that lands once
can poison persisted MCP server config to attempt header injection or
to feed a non-http URL into the fetch path on subsequent loads.
Validate every entry: types must match, URL must parse with http(s)
scheme, header keys/values must be control-char-free. Drop anything
that doesn't fit.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): close X-Forwarded-Prefix open redirect

The reverse-proxy support concatenated X-Forwarded-Prefix into the
redirect target without validation, so a forged header value of
"//evil.com" turned the SPA-shell redirect helper at /, /browse, and
/browse/* into a 301 to //evil.com/app. The path-strip middleware had
the same shape on its prefix-trailing-slash redirect.

Add SafeForwardedPrefix at the middleware boundary: must start with
a single '/', no protocol-relative '//' opener, no scheme, no
backslash, no control characters. Apply at both consumers; misconfig
trips the validator and the header is dropped.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): refuse wildcard CORS when LOCALAI_CORS=true with empty allowlist

When LOCALAI_CORS=true but LOCALAI_CORS_ALLOW_ORIGINS was empty, Echo's
CORSWithConfig saw an empty allow-list and fell back to its default
AllowOrigins=["*"]. An operator who flipped the strict-CORS feature
flag without populating the list got the opposite of what they asked
for. Echo never sets Allow-Credentials: true so this isn't directly
exploitable (cookies aren't sent under wildcard CORS), but the
misconfiguration trap is worth closing. Skip the registration and warn.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): zxcvbn password strength check with user-acknowledged override

The previous policy was len < 8, which let through "Password1" and the
rest of the credential-stuffing corpus. LocalAI has no second factor
yet, so the bar needs to sit higher.

Add ValidatePasswordStrength using github.com/timbutler/zxcvbn (an
actively-maintained fork of the trustelem port; v1.0.4, April 2024):
- min 12 chars, max 72 (bcrypt's truncation point)
- reject NUL bytes (some bcrypt callers truncate at the first NUL)
- require zxcvbn score >= 3 ("safely unguessable, ~10^8 guesses to
  break"); the hint list ["localai", "local-ai", "admin"] penalises
  passwords built from the app's own branding

zxcvbn produces false positives sometimes (a strong-looking password
that happens to match a dictionary word) and operators occasionally
need to set a known-weak password (kiosk demos, CI rigs). Add an
acknowledgement path: PasswordPolicy{AllowWeak: true} skips the
entropy check while still enforcing the hard rules. The structured
PasswordErrorResponse marks weak-password rejections as Overridable
so the UI can surface a "use this anyway" checkbox.

Wired through register, self-service password change, and admin
password reset on both the server and the React UI.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): drop HTML5 minLength on new-password inputs

minLength={12} on the new-password input let the browser block the
form submit silently before any JS or network call ran. The browser
focused the field, showed a brief native tooltip, and that was that —
no toast, no fetch, no clue. Reproducible by typing fewer than 12
chars on the second password change of a session.

The JS-level length check in handleSubmit already shows a toast and
the server rejects with a structured error, so the HTML5 attribute
was redundant defence anyway. Drop it.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): bundle Geist fonts locally instead of fetching from Google

The new CSP correctly refused to apply styles from
fonts.googleapis.com because style-src is locked to 'self' and
'unsafe-inline'. Loosening the CSP would defeat its purpose; the
right fix is to stop reaching out to a third-party CDN for fonts on
every page load.

Add @fontsource-variable/geist and @fontsource-variable/geist-mono as
npm deps and import them once at boot. Drop the <link rel="preconnect">
and external stylesheet from index.html.

Side benefit: no third-party tracking via Referer / IP on every UI
load, no failure mode when offline / behind a captive portal.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): refresh i18n strings to reflect 12-char password minimum

The translations still said "at least 8 characters" everywhere — the
client-side toast on a too-short password change told the user the
wrong floor. Update tooShort and newPasswordPlaceholder /
newPasswordDescription across all five locales (en, es, it, de,
zh-CN) to match the real ValidatePasswordStrength rule.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): make password length-floor overridable like the entropy check

The 12-char minimum was a policy choice, not a technical invariant —
only "non-empty", "<= 72 bytes", and "no NUL bytes" are real bcrypt
constraints. Treating length-12 as a hard rule was inconsistent with
the entropy check (already overridable) and friction for use cases
where the account is just a name on a session, not a security
boundary (single-user kiosk, CI rig, lab demo).

Restructure ValidatePasswordStrength:
- Hard rules (always enforced): non-empty, <= MaxPasswordLength, no NUL byte
- Policy rules (skipped when AllowWeak=true): length >= 12, zxcvbn score >= 3

PasswordError now marks password_too_short as Overridable too. The
React forms generalised from `error_code === 'password_too_weak'` to
`overridable === true`, and the JS-side preflight length checks were
removed (server is source of truth, returns the same checkbox flow).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-08 16:25:45 +02:00
LocalAI [bot]
e5d7b84216 fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717)
* feat(messaging): add backend.upgrade NATS subject + payload types

Splits the slow force-reinstall path off backend.install so it can run on
its own subscription goroutine, eliminating head-of-line blocking between
routine model loads and full gallery upgrades.

Wire-level Force flag on BackendInstallRequest is kept for one release as
the rolling-update fallback target; doc note marks it deprecated.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed/worker): add per-backend mutex helper to backendSupervisor

Different backend names lock independently; same backend serializes. This
is the synchronization primitive used by the upcoming concurrent install
handler — without it, wrapping the NATS callback in a goroutine would
race the gallery directory when two requests target the same backend.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed/worker): run backend.install handler in a goroutine

NATS subscriptions deliver messages serially on a single per-subscription
goroutine. With a synchronous install handler, a multi-minute gallery
download would head-of-line-block every other install request to the
same worker — manifesting upstream as a 5-minute "nats: timeout" on
unrelated routine model loads.

The body now runs in its own goroutine, with a per-backend mutex
(lockBackend) protecting the gallery directory from concurrent operations
on the same backend. Different backend names install in parallel.

Backward-compat: req.Force=true is still honored here, so an older master
that hasn't been updated to send on backend.upgrade keeps working.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed/worker): subscribe to backend.upgrade as a separate path

Slow force-reinstall now lives on its own NATS subscription, so a
multi-minute gallery pull cannot head-of-line-block the routine
backend.install handler on the same worker. Same per-backend mutex
guards both — concurrent install + upgrade for the same backend
serialize at the gallery directory; different backends are independent.

upgradeBackend stops every live process for the backend, force-installs
from gallery, and re-registers. It does not start a new process — the
next backend.install will spawn one with the freshly-pulled binary.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): add UpgradeBackend on NodeCommandSender; drop Force from InstallBackend

Master now sends to backend.upgrade for force-reinstall, with a
nats.ErrNoResponders fallback to the legacy backend.install Force=true
path so a rolling update with a new master + an old worker still
converges. The Force parameter leaves the public Go API surface
entirely — only the internal fallback sets it on the wire.

InstallBackend timeout drops 5min -> 3min (most replies are sub-second
since the worker short-circuits on already-running or already-installed).
UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi
gallery pulls.

Updates the admin install HTTP endpoint
(core/http/endpoints/localai/nodes.go) to the new signature too.

router_test.go's fakeUnloader does not yet implement the new interface
shape; Task 3.2 will catch it up before the next package-level test run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): update fakeUnloader for new NodeCommandSender shape

InstallBackend lost its force bool param (Force is not part of the public
Go API anymore — only the internal upgrade-fallback path sets it on the
wire). UpgradeBackend gained a method. Fake records both call slices and
provides an installHook concurrency seam for upcoming singleflight tests.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): cover UpgradeBackend's new subject + rolling-update fallback

Task 3.1 changed the master to publish UpgradeBackend on the new
backend.upgrade subject; the existing UpgradeBackend tests scripted the
old install subject and so all 3 began failing as expected. Updates them
to script SubjectNodeBackendUpgrade with BackendUpgradeReply.

Adds two new specs for the rolling-update fallback:
  - ErrNoResponders on backend.upgrade triggers a backend.install
    Force=true retry on the same node.
  - Non-NoResponders errors propagate to the caller unchanged.

scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and
scriptReplyMatching (predicate-matched canned reply, used to assert that
the fallback path actually sets Force=true on the install retry).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): coalesce concurrent identical backend.install via singleflight

Six simultaneous chat completions for the same not-yet-loaded model were
observed firing six independent NATS install requests, each serializing
through the worker's per-subscription goroutine and amplifying queue
depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group
keyed by (nodeID, backend, modelID, replica): N concurrent identical
loads share one round-trip and one reply.

Distinct (modelID, replica) keys still fire independent calls, so
multi-replica scaling and multi-model fan-out are unaffected.

fakeUnloader gains a sync.Mutex around its recording slices to keep
concurrent test goroutines race-clean.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e/distributed): drop force arg from InstallBackend test calls

Two e2e test call sites still passed the trailing force bool that was
removed from RemoteUnloaderAdapter.InstallBackend in 9bde76d7. Caught
by golangci-lint typecheck on the upgrade-split branch (master CI was
already green because these tests don't run in the standard test path).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract worker business logic to core/services/worker

core/cli/worker.go grew to 1212 lines after the backend.upgrade split.
The CLI package was carrying backendSupervisor, NATS lifecycle handlers,
gallery install/upgrade orchestration, S3 file staging, and registration
helpers — all distributed-worker business logic that doesn't belong in
the cobra surface.

Move it to a new core/services/worker package, mirroring the existing
core/services/{nodes,messaging,galleryop} pattern. core/cli/worker.go
shrinks to ~19 lines: a kong-tagged shim that embeds worker.Config and
delegates Run.

No behavior change. All symbols stay unexported except Config and Run.
The three worker-specific tests (addr/replica/concurrency) move with
the code via git mv so history follows them.

Files split as:
  worker.go        - Run entry point
  config.go        - Config struct (kong tags retained, kong not imported)
  supervisor.go    - backendProcess, backendSupervisor, process lifecycle
  install.go       - installBackend, upgradeBackend, findBackend, lockBackend
  lifecycle.go     - subscribeLifecycleEvents (verbatim, decomposition is
                     a follow-up commit)
  file_staging.go  - subscribeFileStaging, isPathAllowed
  registration.go  - advertiseAddr, registrationBody, heartbeatBody, etc.
  reply.go         - replyJSON
  process_helpers.go - readLastLinesFromFile

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed/worker): decompose subscribeLifecycleEvents into per-event handlers

The 226-line subscribeLifecycleEvents method packed eight NATS subscriptions
inline. Each grew context-shaped doc comments mixed with subscription
plumbing, making it hard to read any one handler without scrolling past the
others. Extract each handler into its own method on *backendSupervisor; the
subscriber becomes a thin 8-line dispatcher.

No behavior change: each method body is byte-equivalent to its corresponding
inline goroutine + handler. Doc comments that were attached to the inline
SubscribeReply calls migrate to the new method godocs.

Adding the next NATS subject is now a 2-line patch to the dispatcher plus
one new method, instead of grafting onto a monolith.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 16:24:54 +02:00
LocalAI [bot]
6070aafc69 chore(deps): bump LocalAGI for collection rehydrate-on-init-failure fix (#9721)
Picks up mudler/LocalAGI#? (commit 941ac52, merged into main):
in-process collections backend now registers a placeholder for every
on-disk collection at startup — even when the engine wrapper fails to
construct (typically because the embedding model is briefly
unreachable) — and rehydrates lazily on first access. Previously a
transient outage at LocalAI boot silently dropped every existing
collection from the agent pool's in-memory map; users saw "collection
not found" indefinitely until LocalAI was restarted, even after the
embedding service recovered. With this bump the next request to the
collection rehydrates it transparently.

Does not address the deeper LocalRecall issue where
NewPersistentPostgresCollection probes the embedding model at engine
construction even for read-only paths — that needs a separate fix in
mudler/localrecall.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 16:24:42 +02:00
LocalAI [bot]
2be07f61da feat(whisper): honor client cancellation via ggml abort_callback (#9710)
* refactor(transcription): propagate request ctx through ModelTranscription*

Replaces context.Background() with the HTTP request ctx so client
disconnects start cancelling the gRPC call. No backend-side abort wiring
yet — that comes in a later commit. Pure plumbing.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(cli): pass ctx to backend.ModelTranscription

Follow-up to e65d3e1f which threaded ctx through ModelTranscription
but missed the CLI caller. CLI commands have no request-scoped ctx,
so context.Background() is correct here.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(audio): propagate request ctx into TTS, sound-gen, audio-transform

Same ctx-plumbing pattern applied to the rest of the audio path. CLI
callers use context.Background() since there is no request scope; HTTP
callers use c.Request().Context().

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(backend): propagate request ctx into biometric, detection, rerank, diarization paths

Replaces remaining context.Background() sites in core/backend with the
caller's ctx. After this commit, every core/backend/*.go entry point
threads the request ctx end-to-end to the gRPC client.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(grpc): plumb ctx through AIModel.AudioTranscription{,Stream}

Adds context.Context as first parameter to the AIModel interface methods
that wrap whisper-style transcription. Server-side gRPC handler now
forwards the per-RPC ctx (server-streaming uses stream.Context()).
Whisper, Voxtral, vibevoice-cpp, and sherpa-onnx accept the parameter;
none uses it yet — the actual cancellation primitive lands in the next
commit so this is pure plumbing.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): add abort_callback hook in the C++ bridge

Installs a std::atomic<int> flag, wires it into
whisper_full_params.abort_callback, and exposes a set_abort(int) C
symbol so Go can flip the flag from a goroutine watching the request
context. transcribe() now distinguishes abort (return 2) from real
whisper_full failure (return 1).

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): register set_abort symbol in the purego loader

Adds the Go-side binding for the new C export so the next commit can
call CppSetAbort(1) from a watcher goroutine on ctx.Done().

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): honor ctx cancellation and return codes.Canceled

A watcher goroutine watches ctx.Done() during AudioTranscription and
calls CppSetAbort(1) on cancel. whisper_full sees abort_callback return
true at the next compute graph step, returns non-zero, and the bridge
returns 2 -> AudioTranscription maps that to codes.Canceled.

Adds an opt-in test (gated on WHISPER_MODEL_PATH / WHISPER_AUDIO_PATH)
that asserts cancellation latency under 5s and proves the abort flag
resets cleanly so the next transcription succeeds.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper): join the cancel watcher goroutine before returning

Follow-up to 85edf9d2. The previous commit used `defer close(done)` and
called the watcher "joined synchronously" — but close() only signals,
it does not block until the goroutine exits. That left a window where
a late CppSetAbort(1) from a cancelled call could land on the next
call, after its C-side g_abort reset but before whisper_full() began
polling the abort callback, corrupting the second transcription.

Switch to a sync.WaitGroup join so wg.Wait() blocks until the watcher
has actually returned from its select.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper): short-circuit pre-cancelled ctx in AudioTranscription

If ctx is already Done() at entry, return codes.Canceled immediately
instead of running the full transcription. The C-side g_abort reset
happens at the start of transcribe() and would otherwise overwrite a
watcher-set abort flag from an already-cancelled ctx, producing a
spurious successful transcription on a request the client has already
abandoned.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(tests/distributed): update testLLM mock for new AudioTranscription signature

Phase B (93c48e19) added context.Context to AIModel.AudioTranscription
but missed the testLLM mock in tests/e2e/distributed. CI golangci-lint
caught it: *testLLM did not implement grpc.AIModel because the method
signature lacked the ctx parameter, which broke the distributed test
suite compilation and cascaded through every backend-build job that
runs `go build ./...`.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(whisper): port cancellation test to Ginkgo/Gomega

Project policy (.agents/coding-style.md, enforced by golangci-lint
forbidigo) is that all Go tests must use Ginkgo v2 + Gomega — no
stdlib testing patterns (t.Skip, t.Fatalf, etc.). Convert the
cancellation test to a Describe/It block with Skip(...) for env
gating and Expect/HaveOccurred for assertions.

Same coverage: cancel mid-flight returns codes.Canceled within 5s and
a follow-up transcription succeeds, proving the C-side g_abort flag
resets cleanly.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 01:44:47 +02:00
LocalAI [bot]
806130bbc0 chore: ⬆️ Update ggml-org/whisper.cpp to c81b2dabbc45484dee2ca6658cfe39c841df5c70 (#9712)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:32 +02:00
LocalAI [bot]
3b84582567 chore: ⬆️ Update ggml-org/llama.cpp to 05ff59cb57860cc992fc6dcede32c696efea711c (#9714)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:17 +02:00
LocalAI [bot]
907929ce60 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9a26522af234f8db079ae3735f35ab6c20fe2c66 (#9713)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:43:44 +02:00