LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-05-16 20:52:08 -04:00

Author	SHA1	Message	Date
dependabot[bot]	9be5310394	chore(deps): bump actions/upload-artifact from 4 to 7 (#9770) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](https://github.com/actions/upload-artifact/compare/v4...v7) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-12 09:20:03 +02:00
LocalAI [bot]	621c612b2d	ci(bump-deps): register ds4 + move version pin into the Makefile (#9761 ) * ci(bump-deps): register ds4 + move version pin into the Makefile The initial ds4 PR (#9758) put the upstream commit pin in backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at .github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION was invisible to it - other backends (llama-cpp, ik-llama-cpp, turboquant, voxtral, etc.) all pin in their Makefile. This change: - Moves DS4_VERSION?= and DS4_REPO?= to the top of backend/cpp/ds4/Makefile. - Inlines the git init/fetch/checkout recipe into the 'ds4:' target (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts as the target so make only re-clones when missing. - Deletes the now-redundant prepare.sh. - Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to the .github/workflows/bump_deps.yaml matrix so the daily bot opens PRs against this pin. - Updates .agents/ds4-backend.md to point at the Makefile. Verified: $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0 $ make -C backend/cpp/ds4 ds4 # clones into ds4/ at the pin $ make -C backend/cpp/ds4 ds4 # no-op on second invocation make: 'ds4' is up to date. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: route backend/cpp/ds4/ changes through changed-backends.js scripts/changed-backends.js:inferBackendPath has an explicit branch per cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a matching branch the function returns null, the backend never lands in the path map, and PR change-detection cannot map "backend/cpp/ds4/X changed" -> "rebuild ds4 image". This is why PR #9761 produced zero ds4 jobs even though it directly edits backend/cpp/ds4/Makefile. 
Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed before the llama-cpp branch (since both share the .cpp ancestry but ds4 is more specific - same ordering rule documented in .agents/adding-backends.md). Verified with a local Node simulation of the script against this PR's diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a 'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend in the rebuild set. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(adding-backends): harden the two gotchas that bit ds4 Both omissions are silent at the time you ADD a backend - the failure mode only appears later (the bump bot stays silent forever, or the path filter shows up on the next PR that touches your backend with zero CI jobs and looks broken for unrelated reasons). Expanding the `scripts/changed-backends.js` paragraph from a one-liner to a fully worked example, and adding a new sibling paragraph for the `bump_deps.yaml` + Makefile-pin contract. Both call out the specific mistakes from the ds4 timeline (#9758 → #9761) so future contributors can pattern-match on the cause. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-11 22:46:02 +02:00
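The Makefile-pin contract from the commit above (the bump bot greps for `^$VAR?=` in a Makefile) can be illustrated with a small Python sketch. This is not the code in .github/bump_deps.sh; the helper name `find_makefile_pin` and the shell-style assignment are assumptions made for illustration, and only the `DS4_VERSION?=` line is taken from the commit message.

```python
import re
from typing import Optional


def find_makefile_pin(makefile_text: str, var: str) -> Optional[str]:
    """Return the value of a `VAR?=value` line, or None when no such line exists.

    Hypothetical helper: it mimics a grep for `^VAR?=` and nothing more.
    """
    pattern = re.compile(r"^" + re.escape(var) + r"\?=(.*)$", re.MULTILINE)
    match = pattern.search(makefile_text)
    return match.group(1).strip() if match else None


# Pin line as quoted in the commit message.
makefile = "DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\n"

# Assumed shape of a shell-variable pin (as in a prepare.sh): there is no `?=`,
# so a `^VAR?=` search cannot see it.
shell_script = 'DS4_VERSION="ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0"\n'

print(find_makefile_pin(makefile, "DS4_VERSION"))
print(find_makefile_pin(shell_script, "DS4_VERSION"))
```

The second lookup returns `None`, which matches the failure mode described in the commit: a pin kept as a shell variable is invisible to a search keyed on the Makefile `?=` form.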
LocalAI [bot]	d892e4af80	feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758) * test(e2e-backends): allow BACKEND_BINARY for native-built backends Adds an escape hatch for hardware-gated backends (e.g. ds4) where the model is too large for Docker build context. When BACKEND_BINARY points at a run.sh produced by 'make -C backend/cpp/<name> package', the suite skips docker image extraction and drives the binary directly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(e2e-backends): validate BACKEND_BINARY basename + log actual source Two follow-ups from the `cbcf5148` code review: - BACKEND_BINARY now requires a path whose basename is `run.sh`. Without this check, `filepath.Dir(binary)` silently discarded the filename, so pointing the env var at an arbitrary binary failed later with a confusing assertion that named a path the user never typed. - The "Testing image=..." debug line printed an empty string when the binary path was used, hiding the actual source in CI logs. The line now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect as `src=...`. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): scaffold ds4 backend dir Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the implementation arrive in follow-up commits. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add backend Makefile Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then invokes CMake on our wrapper. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add CMakeLists for grpc-server Generates protoc stubs from backend.proto, links grpc-server.cpp + dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): grpc-server skeleton + module stubs The minimum that links: Backend service with Health + Free; other RPCs default to UNIMPLEMENTED. Stub headers/sources for dsml_parser, dsml_renderer, and kv_cache are in place so CMake links cleanly even before those modules ship. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement LoadModel Opens engine + creates session sized to ContextSize (default 32768). Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else CUDA. MTP/speculative options are accepted via ModelOptions.Options[] (mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into g_kv_cache_dir for the cache module (Task 19 wires it in). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement TokenizeString Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement Predict (plain text) Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement PredictStream (plain text) ChatDelta + reasoning/tool_calls split arrives in Task 14. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement Status RPC Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add DSML streaming parser Classifies raw model-emitted token text into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the literal DSML strings rendered by ds4_server.c's prompt template (<｜DSML｜tool_calls>, <｜DSML｜invoke name=...>, <think>, etc.) - these are plain text the model emits, not special tokens. Partial markers split across token chunks are buffered until a full marker or a definitively-not-a-marker '<' is observed. RandomToolId() generates the API-side tool call id (call_xxx) that exact-replay would key on. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape producing byte 0xCD, eating the 'D'. The markers were never actually matching the DSML text the model emits. Split each escape with adjacent string literal concatenation so the byte sequence is exactly EF BD 9C 44 (｜D) at runtime. Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively expose std::strlen / std::snprintf via <string>). The local plan file (uncommitted) was also updated with the same fixes so Task 16's dsml_renderer.cpp does not re-introduce the bug. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta) Non-streaming Predict now emits one ChatDelta carrying content, reasoning_content, and tool_calls[] parsed from the model's DSML output. Reply.message still carries the raw model bytes for backends that prefer the regex fallback path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire DsmlParser into PredictStream Per-token ChatDelta writes: content/reasoning_content go incrementally, tool_calls emit TOOL_START as one delta (id + name) followed by TOOL_ARGS deltas with incremental JSON. The Go-side aggregator (pkg/functions/chat_deltas.go) reassembles them. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): chat template + reasoning_effort mapping UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append / assistant_prefix. PredictOptions.Metadata['enable_thinking'] and ['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default; 'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE). Tool-call rendering for assistant turns with tool_calls JSON arrives in the next commit (dsml_renderer). 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML Closes the round-trip: when an OpenAI client sends a multi-turn chat where prior turns contain tool_calls or role=tool messages, build_prompt serializes them back to the DSML shape the model was trained on. Mirrors ds4_server.c's prompt renderer; uses nlohmann::json for parsing the OpenAI tool_calls payload. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): disk KV cache module Dir-based cache keyed by SHA1(rendered prompt prefix). File format: 'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes + ds4_session_save_payload output. NOT bit-compatible with ds4-server's KVC files - that interop is a follow-up plan. LoadLongestPrefix walks the dir picking the longest stored prefix that prefixes the incoming prompt. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for the request, tries LoadLongestPrefix to recover state, then Saves the new state after generation. ds4_session_sync handles the live-cache fast path internally, so the disk cache only matters for cold-starts and cross-session reuse. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add package.sh Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into package/lib so the FROM scratch image boots without a host libc. Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp ds4.h defines 'typedef enum {...} ds4_backend' which collides with our C++ 'namespace ds4_backend' anywhere a TU includes both. 
kv_cache.h includes ds4.h directly and surfaces the conflict immediately; other TUs would hit it once gRPC dev headers are available. Renames the C++ namespace to ds4cpp across all wrapper files and the plan, leaving the upstream ds4 typedef untouched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): add Dockerfile.ds4 Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu) -> FROM scratch with packaged grpc-server + bundled runtime libs. nlohmann-json3-dev is required for dsml_renderer's JSON handling. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4 in docker-build-backends + .NOTPARALLEL guards. Also adds the backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh (landed in Task 24). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch) Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13. Darwin Metal is handled outside this matrix by backend_build_darwin.yml. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/index): add ds4 meta + image entries cpu + cuda13 x latest + master. Darwin Metal builds publish under ds4-darwin via the existing llama-cpp-darwin OCI pipeline. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(scripts/build): add ds4-darwin.sh Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh: make grpc-server -> otool -L for dylib bundling -> OCI tar that 'local-ai backends install' consumes via the backends/ds4-darwin Makefile target. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(darwin): build ds4-darwin in backend_build_darwin Adds a 'Build ds4 backend (Darwin Metal)' step that runs the backends/ds4-darwin Makefile target on the macOS runner. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(import): auto-detect ds4 weights via DS4Importer Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf repo URI and the DeepSeek-V4-Flash-.gguf filename pattern. Registered before LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling through to llama-cpp. Also lists ds4 in /backends/known so the /import-model UI surfaces it as a manual choice for users who want to force the backend on a non-canonical URI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(gallery): add deepseek-v4-flash-q2 (ds4 backend) One-click install of the q2 weights with backend: ds4. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(.agents): add ds4-backend.md Documents the backend shape, DSML state machine, thinking-mode mapping, disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY hardware-validation path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to 2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported into the environment so it can pick the right cuda-keyring / cudss / nvpl debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern. 
Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13' failed at: /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/index): add Metal image entries for ds4 Adds metal-ds4 + metal-ds4-development image entries pointing at quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4 (built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the 'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and ds4-development variant. Closes a gap from the initial Task 23 landing - the Darwin Metal build script and CI workflow step were already wired (Tasks 24-25), but the gallery had no image entry for users to install the Metal variant. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04' which clashes with install-base-deps.sh's cuda-keyring step: E: Conflicting values set for option Signed-By regarding source https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/ The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain 'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA from scratch via its own keyring setup. Adopting that here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): drop install-base-deps.sh dependency The .docker/install-base-deps.sh pipeline is built around the llama-cpp needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at /opt/grpc. For ds4 we don't need any of that: - CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda ready to go; install-base-deps's keyring step then conflicts with the pre-installed Signed-By. - gRPC: ds4's grpc-server.cpp only links against grpc++; system libgrpc++-dev (apt) is sufficient, no source build needed. 
Replaced the install-base-deps invocation in Dockerfile.ds4 with a direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries back to nvidia/cuda base + skip-drivers=true so install-base-deps would no-op even if some downstream tooling calls it. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus Two compile bugs caught by the docker build: 1. proto::Message uses snake_case accessors. The build_prompt loop called m.toolcalls() / m.toolcallid() - the protoc-generated names are m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the wrapper. 2. The Status RPC method shadowed the 'using grpc::Status' alias, so any later method declaration using Status as a return type failed to parse ('Status does not name a type' starting at LoadModel). Solution: alias grpc::Status as GStatus instead, with no 'using' clause that would conflict. All RPC method declarations and return-statement constructions now use GStatus. Pre-existing code reviewer flagged the Status-shadow concern as 'minor' in the original Task 10 commit; it turned out to be a real compile blocker under libstdc++ 13 once the surrounding methods were filled in. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush When the model emitted a parameter value that arrived in the same buffer as the surrounding tool_call markers (e.g. the buffered tail after a literal '</think>' opened the model output), the parser deferred all buffered bytes to Flush() because looks_like_prefix() always returns true while buf starts with '<'. Flush() then drained the buffer as plain CONTENT/REASONING regardless of parser state, so the bytes between the parameter open and close markers were classified as CONTENT instead of TOOL_ARGS. 
Symptom: the model emitted <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter> and the assembled tool_call arguments came out as {"location":""} - the opener and closer were emitted into the args stream but the "Paris, France" content went to the assistant message instead. Fix: 1. Flush() now uses the same state-aware emit logic as DrainPlain: PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string), THINK bytes become REASONING, TEXT bytes become CONTENT, and INVOKE / TOOL_CALLS structural whitespace is discarded. 2. looks_like_prefix() restricts its leading-'<' fallback to buffers that have not yet seen a '>'. Without that change, char-by-char feeds would discard the '<' of '<|DSML|invoke name="..."' once the marker prefix length was reached but the closing quote/'>' were still in flight. Verified with a standalone harness that runs the failing input three ways (single Feed, split-after-'>', and char-by-char) and aggregates TOOL_ARGS for tool index 0: all three now produce {"location":"Paris, France"}. Assisted-by: Claude:opus-4.7 [Read,Edit,Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence ds4_engine_generate_argmax() is a self-contained helper that doesn't take or update a ds4_session - it manages its own internal state. Our Predict and PredictStream methods created g_session via ds4_session_create() but then called ds4_engine_generate_argmax(), so g_session's KV state never advanced. ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save correctly rejected with 'session has no valid checkpoint to save'. Switch both RPCs to the proper session API: ds4_session_sync(g_session, &prompt, ...) loop: int token = ds4_session_argmax(g_session) if token == eos: break emit(token) ds4_session_eval(g_session, token, ...)
After the loop the session has a real checkpoint and ds4_session_save_payload writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three .kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets kv_cache_dir, and the e2e tool-call assertion still passes. Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save path + payload_bytes + result) so future failures are visible instead of silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream when the cache is enabled - and skipped entirely when the option is unset. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded Wires MTP (Multi-Token Prediction) speculative decoding into the manual generation loop in both Predict and PredictStream. When the upstream MTP weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal, ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per verifier step. When MTP is not loaded (no option, CPU backend, or weights absent), we fall through to the simple ds4_session_argmax + ds4_session_eval path with no behavior change. Validated on a DGX Spark GB10 with the optional MTP GGUF (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs 'ds4: MTP support model loaded ... (draft=2)' on stderr. Caveat per upstream README: 'currently provides at most a slight speedup, not a meaningful generation-speed win'. Wired now mainly to track the upstream API; bigger speedups arrive when ds4 improves the speculative path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI gRPC side. 
The generation loop now consults compute_sample_params() per token to pick the effective (temperature, top_k, top_p, min_p), based on: 1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp 2. Thinking-mode override: when enable_thinking != false, force T=1.0, top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and the trailing content) 3. DSML structural override: when DsmlParser::IsInDsmlStructural() returns true (we are between tool-call markers but NOT in a param value payload), force T=0.0 so protocol bytes parse cleanly When the effective temperature is 0, we keep using ds4_session_argmax + MTP speculative path (matches ds4-server's gate that only enables MTP for greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with a per-thread RNG seeded from system_clock and fall back to single-token ds4_session_eval. New public method on DsmlParser: IsInDsmlStructural() encodes which states need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user sampling); TEXT and THINK are excluded (no tool-call context to protect). Verified on the DGX Spark GB10: the e2e suite still passes with all 5 specs including tools, and the Predict output now varies between runs (creative sampling active) while the tool-call args remain a clean '{"location":"Paris, France"}' because the parser-state check forces greedy on the structural bytes. UX note: thinking mode is ON by default (matching ds4-server). Users who want deterministic output should set Metadata.enable_thinking = false. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add sha256 to deepseek-v4-flash-q2 entry Per HF LFS metadata for antirez/deepseek-v4-gguf: size: 86720111200 bytes (~80.76 GiB) sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c LocalAI's downloader verifies sha256 when present, so users who install deepseek-v4-flash-q2 from the gallery get integrity-checked weights and the partial-download issue (an 81 GB file is easy to truncate) becomes recoverable instead of silently producing a broken backend. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-11 22:15:47 +02:00
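The three-level sampling policy described in the "honor PredictOptions sampling with DSML-aware override" commit above can be sketched in Python. This is an illustration of the precedence rules as the commit message states them, not the C++ in grpc-server.cpp; the `SampleParams` type and the argument names are assumptions, and `compute_sample_params` only borrows its name from the commit text.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SampleParams:
    temperature: float
    top_k: int
    top_p: float
    min_p: float


def compute_sample_params(request: SampleParams,
                          enable_thinking: Optional[bool],
                          in_dsml_structural: bool) -> SampleParams:
    """Pick the effective sampling parameters for the next token."""
    # 1. Start from the request defaults.
    params = request
    # 2. Thinking mode is on unless explicitly disabled (None means "unset").
    if enable_thinking is not False:
        params = SampleParams(temperature=1.0, top_k=0, top_p=1.0, min_p=0.0)
    # 3. Between tool-call markers (but not in a parameter value), force greedy.
    if in_dsml_structural:
        params = replace(params, temperature=0.0)
    return params


request = SampleParams(temperature=0.7, top_k=40, top_p=0.9, min_p=0.05)
print(compute_sample_params(request, None, False))   # thinking override
print(compute_sample_params(request, False, False))  # request defaults
print(compute_sample_params(request, None, True))    # greedy structural bytes
```

Because the structural override is applied last, it wins over both the request defaults and the thinking-mode override, which is what keeps the protocol bytes deterministic while parameter payloads and prose remain sampled.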
Ettore Di Giacinto	ea00199554	ci: tag every backend digest, including singletons backend_build.yml pushes by canonical digest only (push-by-digest=true, no tags applied at build time). User-facing tagging happens in backend_merge.yml's `imagetools create` step. Before this commit, scripts/changed-backends.js emitted a merge entry only for tag-suffixes with 2+ legs, so every single-arch backend (CUDA/ROCm/Intel Python images, vLLM, sglang, transformers, diffusers, ...) pushed its digest untagged and stayed that way until quay's GC reaped it. Symptom: tag releases shipped multi-arch backends tagged correctly, but no v<X>-gpu-nvidia-cuda-12-vllm (or any singleton variant) ever appeared in the registry. Changes: - scripts/changed-backends.js drops the `group.length < 2` skip and emits two merge matrices, one per arch class, so each downstream merge job can `needs:` only its corresponding build matrix. - backend.yml splits backend-merge-jobs into multiarch and singlearch variants. The split preserves PR #9746's fix: slow singlearch CUDA builds (~6h) must not gate multiarch merges, or quay's GC reaps the multiarch per-arch digests before they're tagged. - backend_pr.yml mirrors the split. - backend_build.yml renames the digest artifact from `digests<suffix>-<platform-tag>` to `digests<suffix>--<platform-tag-or-"single">`. The `--` separator prevents the merge-side glob from over-matching sibling backends whose tag-suffix is a prefix of ours (e.g. -cpu-vllm vs -cpu-vllm-omni, -cpu-mlx vs -cpu-mlx-audio); the `single` placeholder keeps the name well-formed when platform-tag is empty. - backend_merge.yml updates the download pattern to match. Verified locally: a tag-push event now expands to 36 multiarch merge entries (= 72 builds / 2 legs) and 199 singlearch merge entries (one per singleton, including -gpu-nvidia-cuda-12-vllm at index 24). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-11 13:22:00 +00:00
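The glob over-match that the `--` separator in the commit above is meant to prevent can be reproduced with Python's `fnmatch`. This is only a sketch: `fnmatch` stands in for whatever pattern matching the artifact download step actually uses, and the platform tag `amd64` and the exact artifact names are assumptions built from the naming scheme quoted in the commit message.

```python
from fnmatch import fnmatchcase


def matching_artifacts(names, pattern):
    """Return the artifact names that a shell-style pattern selects."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Old scheme: digests<suffix>-<platform-tag>
old_names = ["digests-cpu-vllm-amd64", "digests-cpu-vllm-omni-amd64"]

# New scheme: digests<suffix>--<platform-tag-or-"single">
new_names = [
    "digests-cpu-vllm--amd64",
    "digests-cpu-vllm-omni--amd64",
    "digests-gpu-nvidia-cuda-12-vllm--single",
]

# Merging the -cpu-vllm backend under each scheme.
old_hits = matching_artifacts(old_names, "digests-cpu-vllm-*")
new_hits = matching_artifacts(new_names, "digests-cpu-vllm--*")

print(old_hits)  # also picks up the -cpu-vllm-omni sibling
print(new_hits)  # only the -cpu-vllm artifact
```

With a single `-`, the pattern for `-cpu-vllm` also selects the `-cpu-vllm-omni` artifact because one suffix is a prefix of the other; the doubled separator cannot occur inside a tag-suffix of this shape, so the prefix ambiguity disappears.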
Ettore Di Giacinto	059c493641	ci(darwin): brew reinstall ccache to handle transitive dep drift Symptom (PR #9752, run 25638825961, job 75256261163): dyld[11144]: Library not loaded: /opt/homebrew/opt/fmt/lib/libfmt.12.dylib Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache Abort trap: 6 Previous fix (commit `3f6e4934`) added blake3, hiredis, xxhash, zstd as explicit installs + cache paths because ccache's runtime dep closure wasn't in the brew cache. But ccache 4.13 also depends on fmt — which I missed. This is going to keep happening as upstream ccache adds or shuffles deps over time. Durable fix: `brew reinstall ccache` after the install step forces brew to re-resolve and install ccache's full transitive dep closure every run, immune to future formula changes. The brew downloads cache makes the reinstall cheap (~5s on a cache hit). Also adds fmt to the explicit install/link/Cellar-cache lists for the fresh-runner path. The reinstall covers the cache-hit path; the explicit install covers the brand-new-runner path where neither the downloads cache nor the Cellar cache has been populated yet. Caught by PR #9752's CI; would also have caught any future LLAMA_VERSION bump triggering the Darwin matrix. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-10 21:17:30 +00:00
LocalAI [bot]	19d59102d5	feat(whisper-cpp): implement streaming transcription (#9751 ) * test(whisper): wire e2e streaming transcription target Adds test-extra-backend-whisper-transcription, mirroring the existing llama-cpp / sherpa-onnx / vibevoice-cpp targets. The generic AudioTranscriptionStream spec at tests/e2e-backends/backend_test.go:644 fails today because backend/go/whisper has no streaming impl - this target is the failing TDD gate that the next phase makes pass. Confirmed RED locally: 3 Passed (health, load, offline transcription), 1 Failed (streaming spec hits its 300s context deadline because the base implementation returns 'unimplemented' but doesn't close the result channel, leaving the gRPC stream open until the client times out). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper-cpp): expose new_segment_callback to the Go side Adds set_new_segment_callback() and a C-side trampoline that whisper.cpp invokes once per new text segment during whisper_full(). The trampoline dispatches (idx_first, n_new, user_data) to a Go function pointer registered via purego.NewCallback - text and timings are pulled by Go through the existing get_segment_text/get_segment_t0/get_segment_t1 getters. Wires the hook only when streaming is actually requested, to avoid a per-segment function-pointer dispatch on the offline path. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper-cpp): implement AudioTranscriptionStream Wires whisper.cpp's new_segment_callback through purego back to Go so the streaming transcription RPC produces real, time-correlated deltas while whisper_full() is still decoding. Each segment becomes one TranscriptStreamResponse{Delta}; whisper_full's return is the TranscriptStreamResponse{FinalResult} carrying the full segment list, language, and duration. 
Per-call state is tracked in a sync.Map keyed by an atomic counter; the Go callback registered via purego.NewCallback is a singleton, dispatched through user_data. SingleThread today means only one entry is ever live, but the map shape matches the sherpa-onnx TTS callback pattern. The streaming path's final.Text is the literal concat of every emitted delta (a strings.Builder accumulated by onNewSegment) so the e2e invariant `final.Text == concat(deltas)` holds exactly. The first delta has no leading space; subsequent deltas are space-prefixed. The offline AudioTranscription path is unchanged. Closes the gap with sherpa-onnx, vibevoice-cpp, llama-cpp, and tinygrad, which already implement AudioTranscriptionStream. Verified GREEN locally: make test-extra-backend-whisper-transcription passes 4/4 specs (3 Passed initially under RED, +1 streaming spec now). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(whisper-cpp): assert progressive multi-segment streaming Drives AudioTranscriptionStream against a real long-audio fixture and asserts len(deltas) >= 2. The generic e2e spec at tests/e2e-backends/backend_test.go:644 only checks len(deltas) >= 1 which is satisfied by both real and faked streaming - this spec is the guardrail that a future "fake" impl can't sneak past. Skipped by default (env-gated, like the cancellation spec); set WHISPER_LIBRARY, WHISPER_MODEL_PATH, and WHISPER_AUDIO_PATH to a 30+ second clip to run. Verified locally with a 55s 5x-JFK concat against ggml-base.en.bin: 1 Passed in 7.3s, deltas >= 2, finalSegmentCount >= 2, concat(deltas) == final.Text. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(whisper-cpp): add transcription gRPC e2e job Mirrors tests-sherpa-onnx-grpc-transcription / tests-llama-cpp-grpc-transcription. 
Runs make test-extra-backend-whisper-transcription whenever the whisper backend or the run-all switch fires, so a pin-bump or refactor that breaks streaming transcription gets caught before merge. The whisper output on detect-changes is already emitted by scripts/changed-backends.js (it iterates allBackendPaths); this PR just exposes it as a workflow output and consumes it. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(whisper-cpp): silence errcheck on AudioTranscriptionStream defers golangci-lint runs with new-from-merge-base=origin/master, so the identical defer patterns in the existing offline AudioTranscription path are grandfathered while the new ones in AudioTranscriptionStream trip errcheck. Wrap both defers in `func() { _ = ... }()` to match what errcheck wants without altering behavior. The errors from os.RemoveAll and *os.File.Close are not actionable inside a defer here (we're already returning), matching the offline path's contract. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-10 23:11:46 +02:00
Ettore Di Giacinto	3f6e493439	ci(darwin): install ccache's runtime dylib deps (blake3, hiredis, xxhash, zstd) Symptom (run 25634195866, job 75244019809): the Configure ccache step on the Darwin llama-cpp build aborted with: dyld[5647]: Library not loaded: /opt/homebrew/opt/blake3/lib/libblake3.0.dylib Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache Abort trap: 6 The previous Darwin fix (`acc5588d`) addressed missing /opt/homebrew/bin symlinks after a brew cache restore by force-linking. This is a different layer: ccache's Cellar dir IS restored from cache and IS linked, but ccache 4.13 dynamically links against blake3 / hiredis / xxhash / zstd at runtime, and those dependencies are NOT in the restored Cellar paths. brew install ccache sees the ccache Cellar present and skips the install, including the installation of those transitive deps. Two-part fix: - Add /opt/homebrew/Cellar/{blake3,hiredis,xxhash,zstd} to the brew cache restore/save paths so future cache-hit runs restore them. - Explicitly install + link them in the Dependencies step so even a fresh runner (cache miss on a new key) gets them, and brew has them on hand for ccache to load. Caught by run 25634195866. Pre-existing condition on Darwin runners; surfaced because Darwin builds run more often after the llama-cpp-darwin consolidation in #9731. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-10 17:09:01 +00:00
LocalAI [bot]	35f6db8c76	ci: split backend-jobs into single-arch and multi-arch matrices (#9746) Symptom (run 25612992409): backend-merge-jobs failed with "quay.io/go-skynet/local-ai-backends@sha256:fdbd93ca...: not found" even though the per-arch build for -cpu-llama-cpp pushed that exact digest 14h31m earlier. Root cause: backend-merge-jobs was gated on the WHOLE backend-jobs matrix (`needs: backend-jobs`). The multi-arch -cpu-llama-cpp legs finished within 30 min, but a single-arch CUDA-12-llama-cpp slot in the same matrix queued for ~8h (max-parallel: 8 throttle) and then took ~6h to build cold. By the time it freed the merge to run, quay's GC had reaped the per-arch digests pushed by the fast multi-arch legs the day before. Fix: split the linux backend matrix in two. backend-jobs-multiarch - entries with `platform-tag` set (paired per-arch legs that feed backend-merge-jobs). backend-jobs-singlearch - entries without `platform-tag` (heavy standalone builds: CUDA, ROCm, Intel oneAPI, vLLM, sglang, etc.). backend-merge-jobs now `needs:` only backend-jobs-multiarch. The multi-arch matrix completes in ~2-3h, well inside quay's GC window. Heavy single-arch entries keep running independently with no merge dependency. scripts/changed-backends.js gains a splitByArch() helper that partitions filtered entries by whether `platform-tag` is set, and emits matrix-singlearch + matrix-multiarch + has-backends-singlearch + has-backends-multiarch outputs (replacing the previous combined matrix / has-backends pair). Applied in both the full-matrix and filtered-matrix code paths. Smoke test: 199 single-arch + 72 multi-arch + 35 darwin = 271 total entries; 36 merge-matrix entries (one per multi-arch backend pair). Matches expectation. Local `make backends/<name>` is unaffected: the script's outputs only feed CI workflow matrices. 
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-10 18:15:53 +02:00
Ettore Di Giacinto	7fff858408	ci(base-images): also trigger rebuild on .docker/install-base-deps.sh changes base-images.yml's master-push trigger had a path filter listing only backend/Dockerfile.base-grpc-builder and .github/workflows/base-images.yml. That misses .docker/install-base-deps.sh — which is the actual source of truth for what goes into each base image (apt deps, gRPC, conditional CUDA/ROCm/Vulkan installs). The script is bind-mounted into the base Dockerfile at build time; changes to it would change the produced images, but without this path filter, the workflow wouldn't auto-rebuild on those changes. Stale bases would persist until Saturday's cron or a manual workflow_dispatch. Same applies to .docker/apt-mirror.sh, also bind-mounted by the base Dockerfile. Add both to the trigger paths so consumer-affecting changes to either file rebuild the bases automatically. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 22:30:46 +00:00
LocalAI [bot]	593f3a8648	ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738 ) * ci(backend_build): plumb builder-base-image and BUILDER_TARGET build-args Adds an optional builder-base-image input. When set, BUILDER_BASE_IMAGE is forwarded as a build-arg AND BUILDER_TARGET=builder-prebuilt is set to select the variant Dockerfile's prebuilt-base stage. When empty, BUILDER_TARGET=builder-fromsource (the default) keeps the existing from-source build path. This makes the prebuilt-base optimization opt-in per matrix entry without breaking local `make backends/<name>` invocations or backends whose Dockerfile doesn't have a prebuilt path. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(llama-cpp,ik-llama-cpp,turboquant): multi-target Dockerfiles for prebuilt + from-source Restructure the three llama.cpp-derived Dockerfiles so each supports two builder paths in a single file, selected via the BUILDER_TARGET build-arg: BUILDER_TARGET=builder-fromsource (default) - Standalone build: gRPC stage + apt installs + (conditionally) CUDA/ROCm/Vulkan + compile. - Used by `make backends/llama-cpp` locally and any caller that doesn't supply a prebuilt base. BUILDER_TARGET=builder-prebuilt - FROM \${BUILDER_BASE_IMAGE} (one of quay.io/go-skynet/ci-cache: base-grpc-* shipped in PR #9737). - Skips ~25-35 min of gRPC compile + ~5-10 min of toolchain installs. - Used by CI when the matrix entry sets builder-base-image. Final FROM scratch resolves BUILDER_TARGET via an aliasing FROM stage (BuildKit doesn't support variable expansion directly in COPY --from), then COPY --from=builder pulls package output from the chosen path. BuildKit prunes the unreferenced builder, so each build only does the work for the chosen path. The compile RUN is identical between both builder stages, so it's factored into .docker/<name>-compile.sh and bind-mounted into both. 
ccache mount + cache-id stay per-arch / per-build-type. Local DX preserved: `make backends/llama-cpp` (no extra args) defaults to BUILDER_TARGET=builder-fromsource and works exactly as before. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(backend.yml,backend_pr.yml): forward builder-base-image from matrix Plumbs the new optional builder-base-image input from matrix into backend_build.yml. backend_build.yml derives BUILDER_TARGET from whether builder-base-image is set, so matrix entries that map to a prebuilt base get the prebuilt path; entries that don't (python/go/ rust backends) fall through to the default builder-fromsource (which their own Dockerfiles don't reference, so it's a no-op for them). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(backend-matrix): wire builder-base-image to llama-cpp variants For every entry whose Dockerfile is llama-cpp/ik-llama-cpp/turboquant, add a builder-base-image field pointing at the appropriate prebuilt quay.io/go-skynet/ci-cache:base-grpc-* tag. backend_build.yml derives BUILDER_TARGET from this field's presence: non-empty -> builder-prebuilt; empty -> builder-fromsource. So this commit alone activates the prebuilt-base path for these 23 backends in CI, while local `make backends/<name>` (no extra args) keeps the from-source path. Mapping by (build-type, arch): - '' / amd64 -> base-grpc-amd64 - '' / arm64 -> base-grpc-arm64 - cublas-12 / amd64 -> base-grpc-cuda-12-amd64 - cublas-13 / amd64 -> base-grpc-cuda-13-amd64 - cublas-13 / arm64 -> base-grpc-cuda-13-arm64 - hipblas / amd64 -> base-grpc-rocm-amd64 - vulkan / amd64 -> base-grpc-vulkan-amd64 - vulkan / arm64 -> base-grpc-vulkan-arm64 - sycl_* / amd64 -> base-grpc-intel-amd64 - cublas-12 + JetPack r36.4.0 / arm64 -> base-grpc-l4t-cuda-12-arm64 Cold-build savings expected: ~25-35 min per variant (skips the gRPC compile + toolchain install that's now in the base). 
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: add base-grpc-l4t-cuda-12-arm64 variant for legacy JetPack entries Two matrix entries (-nvidia-l4t-arm64-llama-cpp, -nvidia-l4t-arm64- turboquant) build against nvcr.io/nvidia/l4t-jetpack:r36.4.0 + CUDA 12 ARM64. They're distinct from -nvidia-l4t-cuda-13-arm64-* which use Ubuntu 24.04 + CUDA 13 sbsa. Add the missing JetPack-based variant to base-images.yml so those two entries' builder-base-image mapping in the previous commit resolves. Bootstrap order before merging this PR (re-run base-images.yml on this branch — 9 existing variants hit BuildKit cache, only the new l4t-cuda-12-arm64 builds cold): gh workflow run base-images.yml --ref ci/base-images-consumers Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: extract base-builder install logic into .docker/install-base-deps.sh Pre-extraction, the apt + protoc + cmake + conditional CUDA/ROCm/Vulkan + gRPC install logic was duplicated across four files: - backend/Dockerfile.base-grpc-builder (CI prebuilt-base source of truth) - backend/Dockerfile.llama-cpp (builder-fromsource stage) - backend/Dockerfile.ik-llama-cpp (builder-fromsource stage) - backend/Dockerfile.turboquant (builder-fromsource stage) A bump to e.g. CUDA toolkit packages had to be made in 4 places, and drift between the prebuilt base and the variant-Dockerfile from-source path was a real concern (ik-llama-cpp's hipblas branch was already missing the rocBLAS Kernels echo that llama-cpp / turboquant / base-grpc-builder all had). Factor the install logic into a single .docker/install-base-deps.sh that reads its inputs from env vars and runs conditionally on BUILD_TYPE / CUDA__VERSION / TARGETARCH. Each Dockerfile now bind- mounts the script alongside .docker/apt-mirror.sh and invokes it from a single RUN step. 
The variant Dockerfiles' grpc-source stage is removed entirely: the script handles gRPC compile + install at /opt/grpc, and the builder-fromsource stage mirrors builder-prebuilt by copying /opt/grpc/. to /usr/local/. Result: - install-base-deps.sh: 244 lines (one source of truth) - Dockerfile.base-grpc-builder: 268 -> 98 lines - Dockerfile.llama-cpp: 361 -> 157 lines - Dockerfile.ik-llama-cpp: 348 -> 151 lines - Dockerfile.turboquant: 355 -> 154 lines - Total Dockerfile lines: 1332 -> 560 (58% reduction) Bit-equivalence between prebuilt and from-source paths is now enforced by construction: both invoke the same script with the same inputs. A side-effect is that ik-llama-cpp now also gets the rocBLAS Kernels echo + clblas block parity it was previously missing. Includes the BUILD_TYPE=clblas branch (libclblast-dev) for parity even though no current CI matrix entry uses it. After this commit's force-push, base-images.yml needs to be redispatched on this branch: the Dockerfile.base-grpc-builder content shifts so the existing cache won't apply for the install layer (gRPC layer also rebuilds since it's now in the same RUN step). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> ci(base-images): skip-drivers on JetPack l4t variant cuda-nvcc-12-0 isn't installable via apt on the JetPack r36.4.0 base image: JetPack ships CUDA preinstalled at /usr/local/cuda and its apt feed doesn't carry the cuda-nvcc-* packages from the public repositories. The original matrix entry for -nvidia-l4t-arm64-llama-cpp on master sets skip-drivers: 'true' for exactly this reason; the new base-grpc-l4t-cuda-12-arm64 base needs to match. Also forwards SKIP_DRIVERS as a build-arg from matrix into the build (was missing entirely before this commit). 
Caught by run 25612030775 — l4t-cuda-12-arm64 failed at: E: Package 'cuda-nvcc-12-0' has no installation candidate Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-10 00:03:52 +02:00
Ettore Di Giacinto	acc5588d2c	ci(darwin): force-link brew formulas after cache restore Symptom: `ccache: command not found` in the Configure ccache step on runs that hit the brew cache. Root cause: actions/cache restores /opt/homebrew/Cellar/<formula> but NOT the bin symlinks at /opt/homebrew/bin/*. The subsequent `brew install` sees the Cellar entries present and decides "already installed" without re-running the link step. So on cache-hit runs none of the cached formulas are actually on PATH. Fix: explicit `brew link --overwrite` for every formula we install, right after `brew install`. --overwrite tolerates leftover symlinks from a partial earlier install. The 2>/dev/null + || true keeps the step from failing if a formula is already correctly linked. Pre-existing flake; surfaces more often as Darwin matrix coverage grows after the llama-cpp-darwin consolidation in #9731. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 20:41:03 +00:00
LocalAI [bot]	28e29625a2	ci: add pre-built base-grpc-builder image infrastructure (PR 1/2) (#9737 ) Introduces a parameterized Dockerfile.base-grpc-builder that produces a fully-prepped builder base image (apt deps + protoc + cmake + gRPC at /opt/grpc + conditional CUDA/ROCm/Vulkan toolchains) and a base-images.yml workflow that builds + pushes 9 variants to quay.io/go-skynet/ci-cache:base-grpc-*: base-grpc-amd64 (Ubuntu 24.04, CPU-only) base-grpc-arm64 (Ubuntu 24.04, CPU-only) base-grpc-cuda-12-amd64 (Ubuntu 24.04 + CUDA 12.8) base-grpc-cuda-13-amd64 (Ubuntu 22.04 + CUDA 13.0) base-grpc-cuda-13-arm64 (Ubuntu 24.04 + CUDA 13.0 sbsa) base-grpc-rocm-amd64 (rocm/dev-ubuntu-24.04:7.2.1 + hipblas) base-grpc-vulkan-amd64 (Ubuntu 24.04 + Vulkan SDK 1.4.335) base-grpc-vulkan-arm64 (Ubuntu 24.04 + Vulkan SDK ARM 1.4.335) base-grpc-intel-amd64 (intel/oneapi-basekit:2025.3.2) The variant Dockerfiles (Dockerfile.llama-cpp, ik-llama-cpp, turboquant) are NOT touched in this PR. PR 2 will refactor them to FROM these prebuilt bases. This PR is intentionally inert - landing it changes no existing CI behavior. The base images don't exist on quay until someone manually triggers the workflow. Bootstrap after merge: gh workflow run base-images.yml --ref master Wait ~30 min for all 9 variants to push, then merge PR 2 (the consumer-side refactor that uses BUILDER_BASE_IMAGE build-arg to FROM these tags). Triggers afterwards: - Saturdays 05:00 UTC (cron) - picks up upstream security updates, runs ~24h before the backend.yml Sunday cron so bases are fresh. - workflow_dispatch - manual ad-hoc rebuild. - master push touching Dockerfile.base-grpc-builder or this workflow. Why split into two PRs: the variant Dockerfiles in PR 2 will FROM the prebuilt bases and have no from-source fallback. Their CI builds fail if the bases don't exist on quay yet. Landing infrastructure first + manual bootstrap + then consumer refactor avoids a broken-master window. 
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 18:44:42 +02:00
Ettore Di Giacinto	6d2b7d893a	ci: drop paths-ignore from test.yml and tests-e2e.yml These workflows are configured as required status checks in branch protection. With paths-ignore matching the PR diff, the workflow doesn't trigger and no status is reported — branch protection then blocks the PR with "Expected — Waiting for status to be reported" indefinitely. Especially common for backend-only PRs since the ignore list included backend/**. Run the full test suite on every PR. Cost is ~5 min per PR for tests-linux + ~similar for tests-apple + the e2e backend smoke; small trade for unblocking PR merges. Workflows affected: - tests-linux (1.26.x), tests-apple (1.26.x) in test.yml - tests-e2e-backend (1.25.x) in tests-e2e.yml Other workflows that still have paths-ignore (none currently in the required-checks list) are left as-is — adding them to required later would re-introduce the same problem. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 09:23:51 +00:00
Ettore Di Giacinto	5a12392570	ci(concurrency): make cancel-in-progress event-aware, group by sha on push Yesterday two PRs (#9724 llama.cpp bump, #9731 llama-cpp-darwin consolidation) merged 11 seconds apart. Both shared the same backend.yml concurrency group (ci-backends-refs/heads/master-...) due to "${{ github.head_ref || github.ref }}": the empty head_ref on push events falls through to the static refs/heads/master. With cancel-in-progress: true that meant the second merge cancelled the first's in-flight backend builds. The first PR's CI never finished; the second PR only touched CI files so its run was a no-op. Two changes per workflow: - group: replace "${{ github.head_ref || github.ref }}" with "${{ github.event.pull_request.number || github.sha }}". On PRs this groups by PR number (same as before, just keyed on number not branch name); on push events it groups per-commit, so two master pushes never share a group. - cancel-in-progress: gate on github.event_name == 'pull_request' so rapid pushes to a PR still cancel old runs (newer push wins) but master pushes never cancel each other. Trade-off vs alternatives: - Merge queue would also solve this and additionally test the merged commit before it lands. Heavier process change; out of scope here. - Allowing per-commit master concurrency means two simultaneous master runs may overlap and race on tag pushes, but each commit's manifest digest is unique and the registry is last-writer-wins on tags, so the newer commit's tag overwrites the older. Applied to 11 workflows that share the same concurrency pattern: backend.yml, backend_pr.yml, image.yml, image-pr.yml, lint.yml, test.yml, test-extra.yml, tests-e2e.yml, tests-aio.yml, tests-ui-e2e.yml, generate_intel_image.yaml. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 08:30:55 +00:00
Ettore Di Giacinto	05d6383393	Change vibevoice.cpp repository reference Updated repository reference for vibevoice.cpp in bump_deps.yaml. Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-05-09 10:30:11 +02:00
Ettore Di Giacinto	733c254b32	ci: consolidate llama-cpp-darwin into the matrix-driven Darwin flow (#9731 ) The bespoke llama-cpp-darwin + llama-cpp-darwin-publish top-level jobs in backend.yml ran unconditionally on every backend.yml trigger (push/cron), bypassing the path filter that all 34 other Darwin backends already honor via backend-jobs-darwin -> backend_build_darwin.yml. Move llama-cpp into the includeDarwin matrix: - New entry in .github/backend-matrix.yml (lang=go, no build-type). - backend_build_darwin.yml gains an `if: inputs.backend == 'llama-cpp'` build step that drives `make backends/llama-cpp-darwin`. The bespoke script (scripts/build/llama-cpp-darwin.sh) compiles three CMake variants from backend/cpp/llama-cpp and bundles dylibs via otool, so it doesn't fit the build-darwin-go-backend mold; the existing llama-cpp-aware ccache setup blocks already in this workflow are what motivated the consolidation in the first place. - scripts/changed-backends.js's inferBackendPathDarwin gains a special case so llama-cpp on Darwin maps to backend/cpp/llama-cpp/ (the C++ source tree) rather than the non-existent backend/go/llama-cpp/. - Bumps Darwin go-version from 1.24.x -> 1.25.x in backend.yml and backend_pr.yml so llama-cpp keeps the Go toolchain it had under the bespoke job; the other 34 Darwin backends pick this up too with no known reason to pin 1.24. - Removes ~80 lines of bespoke YAML from backend.yml. The publish path is unchanged in shape - every Darwin backend now uses the same crane-push leg from ubuntu-latest in backend_build_darwin.yml; only the build target differs per backend. After this commit, llama-cpp-darwin only rebuilds when backend/cpp/llama-cpp/ is touched (verified locally) - same behavior as every other Darwin backend. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 10:18:17 +02:00
LocalAI [bot]	f0374aa0e8	ci: finish GHA free-tier migration (per-arch fan-out, image splits, retire self-hosted, fix provenance) (#9730 ) * ci: add per-arch + manifest-merge support for LocalAI server image Mirror the backend_build.yml + backend_merge.yml pattern shipped in PR #9726 for the LocalAI server image: - image_build.yml accepts optional platform-tag (default ''), scopes registry cache to cache-localai<suffix>-<platform-tag>, and pushes by canonical digest only on push events. Digests upload as artifacts named digests-localai<suffix>-<platform-tag>, with a "-core" placeholder when tag-suffix is empty so the merge job's download pattern doesn't over-match across multiple suffixes. - image_merge.yml is a new reusable workflow that downloads matching digest artifacts and assembles the final tagged manifest list via docker buildx imagetools create. Image names differ from backend_.yml: the LocalAI server is published under quay.io/go-skynet/local-ai and localai/localai (not -backends). Not yet wired into image.yml / image-pr.yml — Commit C does that. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> ci: fan out per-arch split to remaining 34 backends Convert all remaining linux/amd64,linux/arm64 entries in backend-matrix.yml to per-arch + manifest-merge form. Each was a single matrix entry running both arches on x86 under QEMU emulation; each becomes two entries — amd64 on ubuntu-latest, arm64 on ubuntu-24.04-arm (native). Four backends that were on bigger-runner (-cpu-llama-cpp, -cpu-turboquant, -gpu-vulkan-llama-cpp, -gpu-vulkan-turboquant) have both legs moved to free tier as part of the same change. They are compile-only (no torch/CUDA install) and fit comfortably with the setup-build-disk /mnt relocation. Phase 4 (next commit) retires the remaining 5 single-arch bigger-runner entries. 
After this commit: - 271 total matrix entries (was 237) - 0 multi-arch entries left - 36 per-arch pairs (34 new + 2 pilots from PR #9727) - 5 bigger-runner entries remaining (single-arch, Phase 4 target) Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: split LocalAI image multi-arch entries per arch + merge Mirror the backend per-arch split for the main LocalAI image: - image.yml's core-image-build matrix: split the core ('') and -gpu-vulkan entries into amd64 + arm64 legs each. amd64 on ubuntu-latest, arm64 on ubuntu-24.04-arm (native). - New top-level core-image-merge and gpu-vulkan-image-merge jobs call image_merge.yml after core-image-build completes. - image-pr.yml's image-build matrix: split the -vulkan-core entry. No merge job added on the PR side — image_build.yml's digest-push is push-only-event-gated, so a PR-side merge would have nothing to download. After this commit, no workflow file references linux/amd64,linux/arm64 in a single matrix slot. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: retire bigger-runner from backend matrix (Phase 4) Migrate the remaining 5 single-arch bigger-runner entries to ubuntu-latest. Combined with the Phase 3 setup-build-disk /mnt relocation (PR #9726), free-tier ubuntu-latest now has ~100 GB of working space — enough for ROCm dev image (~16 GB), CUDA toolkit (~5 GB), and the per-backend compile/install steps these entries do. Backends migrated: - -gpu-nvidia-cuda-12-llama-cpp - -gpu-nvidia-cuda-12-turboquant - -gpu-rocm-hipblas-faster-whisper - -gpu-rocm-hipblas-coqui - -cpu-ik-llama-cpp After this commit, .github/backend-matrix.yml has zero bigger-runner references. The bigger-runner used in tests-vibevoice-cpp-grpc- transcription (test-extra.yml) is a separate concern handled in a follow-up. 
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: migrate 9 Intel oneAPI backends to free tier (Phase 5.1) Intel oneAPI base image is ~6 GB; each backend's wheel install stays well within the ~100 GB working space provided by Phase 3's setup-build-disk /mnt relocation. Lowest-risk batch of the arc-runner-set retirement. Backends migrated: vllm, sglang, vibevoice, qwen-asr, nemo, qwen-tts, fish-speech, voxcpm, pocket-tts (all -gpu-intel-* variants). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: migrate 15 ROCm Python backends to free tier (Phase 5.2) ROCm dev image (~16 GB) plus per-backend torch/wheels install fits on ubuntu-latest with the /mnt-relocated Docker root. These entries include the heavier vLLM/sglang/transformers/diffusers stack on ROCm; if any specific backend OOMs or runs out of disk, individual flips back to arc-runner-set are revertable per-entry. Backends migrated: all 15 -gpu-rocm-hipblas-* entries previously on arc-runner-set (vllm/vllm-omni/sglang/transformers/diffusers/ ace-step/kokoro/vibevoice/qwen-asr/nemo/qwen-tts/fish-speech/ voxcpm/pocket-tts/neutts). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: migrate 6 CUDA Python backends to free tier (Phase 5.3) vLLM/sglang stacks on CUDA 12 and CUDA 13 are the heaviest backends in the matrix — flash-attn intermediate layers can spike disk usage during build. setup-build-disk's /mnt relocation gives ~100 GB working space which fits the documented peak. Highest-risk batch of the arc-runner-set retirement; if any backend fails to build on free tier, the per-entry runs-on flip is the unit of revert. Backends migrated: -gpu-nvidia-cuda-{12,13}-{vllm,vllm-omni,sglang}. After this commit, .github/backend-matrix.yml has zero references to arc-runner-set or bigger-runner. The migration is complete. 
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: disable provenance on multi-registry digest pushes Root-caused on master via PR #9727's pilot: when docker/build-push-action@v7 pushes a single build to TWO registries simultaneously with push-by-digest=true, buildx generates a per-registry provenance attestation manifest (because mode=max — the default for push:true — includes the runner ID). That makes the resulting manifest-list digest diverge across registries: arm64 -cpu-faster-whisper build: image manifest: sha256:d3bdd34b... (identical, content-only) quay manifest list: sha256:66b4cfc8... (with quay attestation) dockerhub manifest list: sha256:e0733c3b... (with dockerhub attestation) steps.build.outputs.digest returns only one of the list digests (empirically the dockerhub one). The merge job then asks "quay.io/...@sha256:e0733c3b..." which doesn't exist on quay — that list has digest 66b4cfc8 there. Result: imagetools create fails with "not found" and the merge job fails (run 25581983094, job 75110021491). Setting provenance: false drops the per-registry attestation; the manifest-list digest becomes pure content, identical across both registries, and steps.build.outputs.digest works on either lookup. Applied to backend_build.yml and image_build.yml — both refactored to use the same multi-registry digest-push pattern in the prior PRs. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 09:37:00 +02:00
LocalAI [bot]	cb68cd1cf4	ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization (#9727 ) ci: pilot per-arch split for faster-whisper and llama-cpp-quantization Convert two backends from QEMU-emulated multi-arch (linux/amd64,linux/arm64 on a single ubuntu-latest) to native per-arch + manifest-list merge: - amd64 leg on ubuntu-latest - arm64 leg on ubuntu-24.04-arm (native, ~5-10x faster than emulated) - merge job assembles both digests under the final tag via docker buildx imagetools create Backends piloted: - -cpu-faster-whisper (small Python, fast baseline) - -cpu-llama-cpp-quantization (heavier compile path, stress test) Infrastructure changes that the rest of Phase 2 (Tasks 2.5+) will reuse: - .github/backend-matrix.yml entries gain a `platform-tag` field ('amd64'/'arm64') for matrix entries that participate in the split. Other entries omit it; backend_build.yml already defaults missing values to '' (empty cache key suffix preserved as cache<suffix>-). - backend.yml + backend_pr.yml forward `platform-tag` from matrix to the reusable backend_build.yml. - scripts/changed-backends.js groups filtered entries by tag-suffix and emits a `merge-matrix` (plus `has-merges`) for groups of size>=2. Singletons aren't merged. - backend.yml + backend_pr.yml gain a `backend-merge-jobs` job that consumes merge-matrix and calls backend_merge.yml after backend-jobs. PR variant is also event-gated so the no-op-on-PR merge job doesn't even start. The other 34 multi-arch entries are unchanged in this PR -- Task 2.5 fans out the same shape to them once the pilot is observed green. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 00:04:42 +02:00
LocalAI [bot]	1f313cfdb0	ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief) (#9726 ) * ci: extract free-disk-space composite action Consolidate the apt-clean + dotnet/android/ghc/boost removal blocks from backend_build.yml, image_build.yml, and test.yml into a single composite action. The three callers had slightly different inline blocks; the composite uses the more aggressive backend_build/image_build variant for all three callers — test.yml jobs now also purge snapd, edge/firefox/ powershell/r-base-core, and sweep /opt/ghc + /usr/local/share/boost + $AGENT_TOOLSDIRECTORY. Idempotent and skipped on self-hosted runners. In test.yml, actions/checkout now runs before the composite action call because the composite lives at ./.github/actions/free-disk-space and requires a checked-out repo. The original ordering relied on jlumbroso/free-disk-space@main being a remote action; this is the minimum-invasive change to support a local composite. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: path-filter backend.yml master push Run scripts/changed-backends.js on master pushes too (not just PRs) so unrelated commits don't rebuild all ~210 backend container images. Tag pushes still build the full matrix via FORCE_ALL. Push events use the GitHub Compare API to diff event.before..event.after. Edge cases (first push with zero base, API truncation beyond 300 files, missing fields, network failure) fall back to "run everything" — better safe than silently miss a backend. The matrix literal moves from .github/workflows/backend.yml into a new data-only file at .github/backend-matrix.yml (outside workflows/ so actionlint doesn't try to parse it as a workflow). Both backend.yml and backend_pr.yml now consume the dynamic matrix output uniformly via fromJson(needs.generate-matrix.outputs.matrix); the script reads the matrix from the new location. 
Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: bound max-parallel on backend-jobs matrices Cap to 8 concurrent jobs to avoid queue starvation on the shared GHA free pool while migration is in flight. Lift after Phases 4-5 retire the self-hosted runners. Also drops a leftover commented-out max-parallel line that lived in backend.yml since the previous matrix shape. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: scope backend cache per arch, push by digest Prepare backend_build.yml for the multi-arch split. The reusable workflow now accepts a `platform-tag` input ("amd64" / "arm64") that scopes the registry cache to cache<suffix>-<platform-tag> and (on push events) pushes the resulting image by canonical digest only. Digests are uploaded as artifacts named digests<suffix>-<platform-tag> for the merge job (Task 2.2) to consume. `platform-tag` is optional with empty default during the migration — existing callers continue to work unchanged (their cache key just becomes `cache<suffix>-`, an orphaned but valid key). Tasks 2.3+ will update callers to pass an explicit "amd64" / "arm64" value. Phase 6 flips the input to required: true once every caller is wired. PR builds keep their existing tag-based push to ci-tests but pick up the per-arch cache key. Multi-arch PR builds remain emulated in this commit; they migrate when the matrix entries split (Tasks 2.3+). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: add backend_merge.yml reusable workflow Joins per-arch digest artifacts (uploaded by backend_build.yml when called with platform-tag) into a single tagged multi-arch manifest list via `docker buildx imagetools create`. Called once per backend by backend.yml after both per-arch build jobs succeed. 
The workflow generates final tags identically to the previous monolithic build job (same docker/metadata-action invocation), so consumers of quay.io/go-skynet/local-ai-backends and localai/localai-backends see no tag-shape change. Two imagetools calls (one per registry) reference the same per-arch digests under different image names. Not yet wired into backend.yml — Tasks 2.3+ rewrite individual matrix entries to expand into per-arch + merge jobs that call this workflow. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: relocate Docker data-root to /mnt on hosted runners GHA hosted ubuntu-latest runners ship a ~75 GB /mnt drive that's unused by default. Stopping Docker, rsync'ing /var/lib/docker to /mnt, and restarting with data-root pointing there yields ~100 GB of working space (combined with the apt-clean from Task 1.1) — enough for ROCm dev image + vLLM torch install + flash-attn intermediate layers. This is the structural change that lets Phases 4 and 5 of the migration plan move the bigger-runner and arc-runner-set jobs onto ubuntu-latest. The composite action is no-op on self-hosted runners (where /mnt isn't expected) and on non-X64 runners (Task 3.2 verifies the arm64 hosted pool's /mnt shape separately before enabling). Wired into both backend_build.yml and image_build.yml between free-disk-space and the first Docker operation. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(setup-build-disk): chmod 1777 /mnt/docker-tmp buildx CLI runs as the unprivileged 'runner' user and creates config dirs under TMPDIR before binding them into the buildkit container. 
/mnt is root-owned by default, so the original mkdir produced a permission-denied when buildx tried to write there: ERROR: mkdir /mnt/docker-tmp/buildkitd-config2740457204: permission denied Mirror /tmp's permission mode (1777 — world-writable with sticky bit) on /mnt/docker-tmp so non-root processes can stage their config. Caught by the first PR run (image-build hipblas job) on PR #9726. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: weekly full-matrix rebuild via cron Path-filtering backend.yml master push (the previous commit's main optimization) skips backends whose source didn't change. That broke the DEPS_REFRESH cache-buster's coverage: the build-arg keyed on %Y-W%V busts the install layer's cache on a new ISO week, but only when the build actually runs. Untouched Python backends (torch, transformers, vllm with no version pin) would otherwise ship stale wheels indefinitely. Add a Sunday 06:00 UTC cron that fires the full matrix. Schedule events have no event.ref / event.before, so the script's changedFiles == null fallback (scripts/changed-backends.js) emits the full matrix automatically — no script change needed. C++/Go backends with pinned deps cache-hit and complete fast, so the weekly cost is dominated by Python re-resolves which is exactly what we want. workflow_dispatch added so a maintainer can trigger an ad-hoc full-matrix rebuild without faking a tag push. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 23:43:41 +02:00
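To illustrate why a cache-buster keyed on `%Y-W%V` only changes once per ISO week, here is a minimal Python sketch. It assumes the value is formatted like `date +%Y-W%V`; how the workflow actually computes DEPS_REFRESH is not shown in the commit message, and `deps_refresh_key` is a hypothetical name.

```python
from datetime import date


def deps_refresh_key(day: date) -> str:
    """Return a string shaped like `date +%Y-W%V` for the given day."""
    iso_week = day.isocalendar()[1]  # ISO 8601 week number, 1..53
    return f"{day.year}-W{iso_week:02d}"


monday = deps_refresh_key(date(2026, 5, 4))
sunday = deps_refresh_key(date(2026, 5, 10))
next_monday = deps_refresh_key(date(2026, 5, 11))
```

Every build from Monday through Sunday of one week sees the same key, so the install layer stays cached until a build runs in a new week. That is the reason a backend that is never rebuilt never picks up the new key.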
Richard Palethorpe	c894d9c826	feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686 ) Bring the sglang Python backend up to feature parity with vllm by adding the same engine_args:-map plumbing the vLLM backend already has. Any ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model YAML, including the speculative-decoding flags needed for Multi-Token Prediction. Validation matches the vllm backend's: keys are checked against dataclasses.fields(ServerArgs), unknown keys raise ValueError with a difflib close-match suggestion at LoadModel time, and the typed ModelOptions fields keep their existing meaning with engine_args overriding them. Backend code: * backend/python/sglang/backend.py: add _apply_engine_args, import dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed -> sampling_seed (sglang 0.5.11 renamed the SamplingParams field). * backend/python/sglang/test.py + test.sh + Makefile: six unit tests exercising the helper directly (no engine load required). Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class): * backend/python/sglang/install.sh: add --prerelease=allow because sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels; add --index-strategy=unsafe-best-match for cublas12 so the cu128 torch index wins over default-PyPI's cu130; new pyproject.toml-driven l4t13 install path so [tool.uv.sources] can pin torch/torchvision/ torchaudio/sglang to the jetson-ai-lab index without forcing every transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the equivalent fix in backend/python/vllm/install.sh). * backend/python/sglang/pyproject.toml (new): L4T project spec with explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt for the l4t13 BUILD_PROFILE; other profiles still go through the requirements-.txt pipeline via libbackend.sh's installRequirements. backend/python/sglang/requirements-l4t13.txt: removed; superseded by pyproject.toml. 
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13 (new files) and cu128 torch index for cublas12 (default PyPI now ships cu130 torch wheels by default and breaks cu12 hosts). * backend/index.yaml: add cuda13-sglang and cuda13-sglang-development capability mappings + image entries pointing at quay.io/.../-gpu-nvidia-cuda-13-sglang. * .github/workflows/backend.yml: new cublas13 sglang matrix entry, mirroring vllm's cuda13 build. Model gallery + docs: * gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml. * gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands. * gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads + online fp8 weight quantization, verified end-to-end on a 16 GB RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the MTP draft worker's vocab embedding is loaded unquantised and OOMs the static reservation at sglang's 0.85 default. * gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp, gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang). * docs/content/features/text-generation.md: new SGLang section with setup, engine_args reference, MTP demos, version requirements. * .agents/sglang-backend.md (new): agent one-pager covering the flat ServerArgs structure, the typed-vs-engine_args precedence, the speculative-decoding cheatsheet, and the mem_fraction_static gotcha documented above. * AGENTS.md: index entry for the new agent doc. Known limitation: the two Gemma 4 MTP gallery entries ship a recipe that doesn't yet run on stock libraries. The drafter checkpoints (google/gemma-4-{E2B,E4B}-it-assistant) declare model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which neither transformers (<=5.6.0, including the SGLang cookbook's pinned commit 91b1ab1f... and main HEAD) nor sglang's own model registry (<=0.5.11) registers as of 2026-05-06. 
They will start working when HF or sglang upstream registers the architecture -- no LocalAI changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work today on this build (verified on RTX 5070 Ti, 16 GB). Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-07 17:27:29 +02:00
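The engine_args validation described in the commit above (keys checked against `dataclasses.fields(ServerArgs)`, unknown keys raising ValueError with a difflib suggestion) might look roughly like this. It is a sketch under assumptions: `ExampleServerArgs` is a hypothetical three-field stand-in for sglang's real `ServerArgs`, and `apply_engine_args` is not the backend's actual `_apply_engine_args`.

```python
import dataclasses
import difflib


@dataclasses.dataclass
class ExampleServerArgs:
    """Hypothetical stand-in for sglang's ServerArgs (not the real class)."""
    model_path: str = ""
    mem_fraction_static: float = 0.85
    speculative_algorithm: str = ""


def apply_engine_args(args, engine_args):
    """Overlay engine_args onto args, rejecting unknown keys with a hint."""
    valid = {f.name for f in dataclasses.fields(args)}
    for key in engine_args:
        if key not in valid:
            close = difflib.get_close_matches(key, sorted(valid), n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            raise ValueError(f"unknown engine_args key {key!r}{hint}")
    # engine_args wins over whatever the typed options already set.
    return dataclasses.replace(args, **engine_args)


tuned = apply_engine_args(ExampleServerArgs(), {"mem_fraction_static": 0.7})

try:
    apply_engine_args(ExampleServerArgs(), {"mem_fraction_statik": 0.7})
except ValueError as exc:
    error_message = str(exc)
```

Failing at load time with a close-match hint means a typo in a model YAML surfaces immediately instead of being silently ignored.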
LocalAI [bot]	a8d7d37a3c	fix: unbreak master CI (docs, kokoros, vibevoice-cpp ABI) (#9682 ) * fix(docs): correct broken Hugo relrefs The Hugo build has been failing on master since the relevant pages landed: - text-generation.md:720 referenced `/docs/features/distributed-mode`, but Hugo `relref` paths are relative to the content root, not the rendered URL. Drop the `/docs/` prefix so the lookup matches the existing `features/...` form used elsewhere in the file. - audio-transform.md:144 referenced `tts.md`; the actual page is `text-to-audio.md`. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(kokoros): stub Diarize and AudioTransform Backend trait methods The recent backend.proto additions (Diarize, AudioTransform, AudioTransformStream) extended the gRPC Backend trait, breaking kokoros-grpc compilation with E0046 because the Rust implementation hadn't picked up the new methods. Add Unimplemented stubs matching the existing pattern for non-applicable RPCs in this TTS-only backend. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(vibevoice-cpp): track upstream ABI + wire 1.5B voice cloning Two recent commits in mudler/vibevoice.cpp reshaped the vv_capi_tts signature without a corresponding bump on the LocalAI side: 3bd759c "1.5b: unify into a single tts entry point" inserted a ref_audio_path parameter between voice_path and dst_wav_path. ad856bd "1.5b: multi-speaker dialog support" promoted that to a (const char* const* ref_audio_paths, int n_ref_audio_paths) pair for per-speaker conditioning. Because purego resolves symbols by name and not by signature, the build kept linking; at runtime the misaligned arguments turned the TTS->ASR closed-loop test into a SIGSEGV inside cgo. Track HEAD explicitly and bring the bridge in line with it: * Update the CppTTS purego binding to the 9-arg form. 
purego marshals []byte as a char by handing the C side the underlying array address; nil/empty maps to NULL, which matches the C contract for "no reference audio" on the realtime-0.5B path. Add a `ref_audio` gallery option (comma-separated, repeatable) that the 1.5B path consumes for runtime voice cloning. Multiple entries are interpreted as one WAV per speaker (Speaker 0..n-1). * TTSRequest.Voice now routes by extension/shape: `.wav` or a comma-separated list goes to ref_audio_paths; anything else stays on voice_path (realtime-0.5B's pre-baked voice gguf). * Pin VIBEVOICE_CPP_VERSION to ad856bd and wire the Makefile into the existing bump_deps matrix so future upstream rolls land as reviewable PRs instead of a silent CI break. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(vibevoice-cpp): use ModelOptions.AudioPath for 1.5B ref audio Use the existing audio_path field from ModelOptions (already plumbed through config_file's `audio_path:` YAML and consumed by other audio backends like kokoros) instead of inventing a custom `ref_audio:` Options[] string. Multi-speaker setups stay on a single comma- separated value. No behavior change beyond the gallery key name; per-call routing via TTSRequest.Voice is unchanged. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 10:36:59 +02:00
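The per-call routing rule for `TTSRequest.Voice` described above can be sketched as follows. This is a Python illustration of the rule, not the Go backend code; the function name `route_voice`, the case-insensitive extension check, and the whitespace trimming are assumptions.

```python
def route_voice(voice: str):
    """Return (voice_path, ref_audio_paths) for a request's voice string."""
    # A .wav file or a comma-separated list is treated as reference audio.
    if "," in voice or voice.lower().endswith(".wav"):
        paths = [p.strip() for p in voice.split(",") if p.strip()]
        return "", paths
    # Anything else stays on voice_path (e.g. a pre-baked voice gguf).
    return voice, []


single = route_voice("speaker.wav")
dialog = route_voice("alice.wav,bob.wav")
prebaked = route_voice("carter.gguf")
```

With multiple entries, list position corresponds to speaker index (Speaker 0..n-1), as the commit message describes.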
dependabot[bot]	1caab1de10	chore(deps): bump actions/checkout from 4 to 6 (#9663 ) Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v4...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-05 15:37:05 +02:00
Richard Palethorpe	bb033b16a9	feat: add LocalVQE backend and audio transformations UI (#9640 ) feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI Introduce a generic "audio transform" capability for any audio-in / audio-out operation (echo cancellation, noise suppression, dereverberation, voice conversion, etc.) and ship LocalVQE as the first backend implementation. Backend protocol: - Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and bidirectional AudioTransformStream for low-latency frame-by-frame use. This is the first bidi stream in the proto; per-frame unary at LocalVQE's 16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server, embed,interface,base} with paired-channel ergonomics. LocalVQE backend (backend/go/localvqe/): - Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream shared lib + its libggml-cpu-.so runtime variants directly — no MODULE wrapper needed because LocalVQE handles CPU feature selection internally via GGML_BACKEND_DL. - Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it LocalVQE runs single-threaded at ~1× realtime instead of the documented ~9.6×. - Reference-length policy: zero-pad short refs, truncate long ones (the trailing portion can't have leaked into a mic that wasn't recording). - Ginkgo test suite (9 always-on specs + 2 model-gated). HTTP layer: - POST /audio/transformations (alias /audio/transform): multipart batch endpoint, accepts audio + optional reference + params[]=v form fields. Persists inputs alongside the output in GeneratedContentDir/audio so the React UI history can replay past (audio, reference, output) triples. - GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames (interleaved stereo mic+ref in, mono out). JSON session.update envelope for config; constants hoisted in core/schema/audio_transform.go. 
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing utils.AudioToWav (with passthrough fast-path), so the user can upload any format / rate without seeing the model's strict 16 kHz constraint. - BackendTraceAudioTransform integration so /api/backend-traces and the Traces UI light up with audio_snippet base64 and timing. - Routes registered under routes/localai.go (LocalAI extension; OpenAI has no /audio/transformations endpoint), traced via TraceMiddleware. Auth + capability + importer: - FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on, in APIFeatures), three RouteFeatureRegistry rows. - localvqe added to knownPrefOnlyBackends with modality "audio-transform". - Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on huggingface.co/LocalAI-io/LocalVQE). React UI: - New /app/transform page surfaced via a dedicated "Enhance" sidebar section (sibling of Tools / Biometrics) — the page is enhancement, not generation, so it lives outside Studio. Two AudioInput components (Upload + Record tabs, drag-drop, mic capture). - Echo-test button: records mic while playing the loaded reference through the speakers — the mic naturally picks up speaker bleed, giving a real (mic, ref) pair for AEC testing without leaving the UI. - Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls) and useAudioPeaks hook (shared module-scoped AudioContext to avoid hitting browser context limits with three players on one page); migrated TTS, Sound, Traces audio blocks to use it. - Past runs saved in localStorage via useMediaHistory('audio-transform') — the history entry stores all three URLs so clicking re-renders the full triple, not just the output. Build + e2e: - 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm, SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those two and let GPU-class hardware route through Vulkan in the gallery capabilities map. 
- tests-localvqe-grpc-transform job in test-extra.yml (gated on detect-changes.outputs.localvqe). - New audio_transform capability + 4 specs in tests/e2e-backends. - Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js (8 specs covering tabs, file upload, multipart shape, history, errors). Docs: - New docs/content/features/audio-transform.md covering the (audio, reference) mental model, batch + WebSocket wire formats, LocalVQE param keys, and a YAML config example. Cross-links from text-to-audio and audio-to-text feature pages. Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-04 22:07:11 +02:00
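The reference-length policy from the LocalVQE backend notes above (zero-pad short references, truncate long ones) can be illustrated with a small Python sketch. It assumes the target length is the microphone buffer's sample count, which the commit message implies but does not state. `fit_reference` is a hypothetical name and the real implementation is in Go.

```python
def fit_reference(ref, mic_len):
    """Zero-pad or truncate `ref` so it has exactly `mic_len` samples."""
    if len(ref) >= mic_len:
        # Anything past the end of the mic recording cannot have leaked in.
        return list(ref[:mic_len])
    return list(ref) + [0] * (mic_len - len(ref))


short_ref = fit_reference([1, 2], 4)
long_ref = fit_reference([1, 2, 3, 4, 5], 3)
```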
Ettore Di Giacinto	28b4857bd6	fix(ci): leave ports.ubuntu.com upstream on self-hosted runners mirrors.edge.kernel.org carries /ubuntu/ (amd64 archive) but does NOT carry /ubuntu-ports/. With the previous default both archive and ports pointed at kernel.org, so multi-arch builds (linux/amd64,linux/arm64) on bigger-runner / arc-runner-set 404'd on the arm64 leg: Err:5 http://mirrors.edge.kernel.org/ubuntu-ports noble Release 404 Not Found [IP: 213.196.21.55 80] The original outage was on archive.ubuntu.com, not ports.ubuntu.com, so default the self-hosted-ports-mirror to '' (= keep ports.ubuntu.com upstream). apt-mirror.sh and the runner-side rewrite both already no-op when the env var is empty. Self-hosted amd64 still uses kernel.org for the main archive, which worked fine in this run before the arm64 leg failed. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-04 07:28:43 +00:00
Ettore Di Giacinto	5503be1fb3	fix(ci): use http for the kernel.org mirror — bare ubuntu image has no CA bundle The Docker build runs on the minimal ubuntu:24.04 base image, which ships without ca-certificates. The very first apt-get update over HTTPS therefore fails the TLS handshake ("No system certificates available. Try installing ca-certificates."), and apt can't reach ca-certificates itself to fix the situation — chicken and egg. Apt validates package integrity via GPG-signed Release files, so plain HTTP is safe for the archive. archive.ubuntu.com / azure.archive are already accessed over HTTP for the same reason. Switch the kernel.org defaults from https://mirrors.edge.kernel.org to http://mirrors.edge.kernel.org so the in-Dockerfile rewrite works on self-hosted runners too. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-03 23:29:53 +00:00
Ettore Di Giacinto	50580a84ae	fix(ci): switch apt mirror per runner — azure on github-hosted, kernel.org on self-hosted Self-hosted runners (arc-runner-set, bigger-runner) cannot reach azure.archive.ubuntu.com — they live in different networks (e.g. our arc-runner-set Kubernetes cluster) where Azure's mirror IP is not routable. Symptom: "Connection failed [IP: 51.11.236.225 80]" with each Ign:/Err: cycle taking 60s, hanging the build for ~16 minutes before exit 100. Pick the mirror based on `runner.environment`: * github-hosted (ubuntu-latest, ubuntu-24.04-arm) → Azure (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com) — same VPC as the runner. * self-hosted (arc-runner-set, bigger-runner) → kernel.org (https://mirrors.edge.kernel.org for both archive and ports) — publicly reachable from any network. The choice now lives in one place: the .github/actions/configure-apt-mirror composite action exposes `effective-mirror` / `effective-ports-mirror` outputs so the reusable workflows can forward the same value as Docker build-args without duplicating the per-runner-environment branch. The now-redundant `apt-mirror` / `apt-ports-mirror` workflow inputs on image_build.yml and backend_build.yml are dropped — defaults live in the composite action and are visible there. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-03 22:59:26 +00:00
Ettore Di Giacinto	8edac61e57	feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650 ) * feat(ci): allow routing apt traffic through an alternate Ubuntu mirror Adds opt-in APT_MIRROR / APT_PORTS_MIRROR knobs to all Dockerfiles, the Makefile, and CI workflows so we can fail over to a non-canonical Ubuntu mirror when archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com are degraded (recently observed: multi-day DDoS against the default pool). Defaults are empty everywhere — behavior is unchanged unless a mirror is configured. To enable in CI, set the repo-level GitHub Actions variables APT_MIRROR (and APT_PORTS_MIRROR for arm64 builds). Locally: make docker APT_MIRROR=http://azure.archive.ubuntu.com A small POSIX-sh helper in .docker/apt-mirror.sh rewrites both DEB822 (/etc/apt/sources.list.d/ubuntu.sources, Ubuntu 24.04+) and the legacy /etc/apt/sources.list before the first apt-get update. Dockerfile stages load it via RUN --mount=type=bind, so there is no extra layer and no cache invalidation when the script is unchanged. Reusable workflows also rewrite the runner's own /etc/apt sources before any sudo apt-get call. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(apt-mirror): default to the Azure mirror, visible in the workflow source Bakes Azure (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com) in as the default for both Docker builds and runner-side apt — rather than hiding the URL behind a GitHub Actions repo variable that's not visible from the source tree. A new composite action at .github/actions/configure-apt-mirror is the single source of truth for runner-side rewrites. Five standalone workflows (build-test, release, tests-e2e, tests-ui-e2e, update_swagger) just `uses: ./.github/actions/configure-apt-mirror`. 
Three workflows (image_build, backend_build, checksum_checker) keep an inline bash rewrite, because they install/upgrade git via apt before the checkout step (so the local composite action isn't loadable yet). The Azure URL is visible in those files too. The `apt-mirror` / `apt-ports-mirror` inputs of the reusable workflows keep their now-Azure defaults — they still feed the Docker build-args block in addition to the inline runner-side rewrite. Callers (image.yml, image-pr.yml, backend.yml, backend_pr.yml) drop the previous `vars.APT_MIRROR` plumbing and rely on those defaults. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(apt-mirror): drop Force Install GIT, consolidate on the composite action The PPA git upgrade ran add-apt-repository ppa:git-core/ppa, which talks to api.launchpad.net — also part of Canonical's infrastructure and currently returning HTTP 504. The Azure mirror only covers archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com, not PPAs. The system git that ubuntu-latest already ships is sufficient for actions/checkout and the build pipeline, so just drop the upgrade. With that gone, the apt-before-checkout constraint disappears too — all three holdouts (image_build, backend_build, checksum_checker) can now switch to ./.github/actions/configure-apt-mirror like the other five. Net: 0 inline apt-mirror blocks, all 8 workflows route through the composite action. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-03 23:50:13 +02:00
Russell Sim	18e039f305	fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626 ) * fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds `399c1dec` wired amdgpu-targets through the backend_build workflow_call interface, intending the input's default value to cover matrix entries that don't specify targets. However, GitHub Actions only applies a workflow_call input default when the caller omits the input entirely. When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}` and the matrix entry has no amdgpu-targets key, the expression evaluates to an empty string, which is treated as an explicit value, bypassing the default. The result is Docker receiving AMDGPU_TARGETS="" which in turn causes Make's ?= default to be skipped (since the variable is already set in the environment, even to empty), and cmake gets -DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an indeterminate target rather than the intended GPU list. Fix this at three levels: 1. backend.yml: use a || fallback in the expression so that an undefined matrix.amdgpu-targets never reaches the reusable workflow as an empty string. The target list is the canonical default and lives here. 2. backend_build.yml: remove the now-misleading default value from the input declaration. The default never fired due to the above bug, so keeping it implied a guarantee that didn't exist. 3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard after the ?= assignment so that if AMDGPU_TARGETS is empty (whether from environment or any future CI wiring mistake) the build fails immediately with a clear message rather than silently producing a binary compiled for an unknown GPU target. 
Assisted-by: Claude Code:claude-sonnet-4-6 Signed-off-by: Russell Sim <rsl@simopolis.xyz> * fix(build): plumb AMDGPU_TARGETS through to Docker builds The docker-build-backend Makefile macro and Dockerfile.golang did not pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds always used the backend Makefile's hardcoded default GPU targets regardless of what was specified via environment or CI inputs. Signed-off-by: Russell Sim <rsl@simopolis.xyz> --------- Signed-off-by: Russell Sim <rsl@simopolis.xyz>	2026-05-02 15:53:14 +02:00
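The empty-string bypass described in the commit above has a close analogue in plain dictionary lookups, sketched here in Python. This is an analogy only, not how GitHub Actions or Make are implemented. `DEFAULT_TARGETS` is a hypothetical placeholder rather than the project's real GPU list, and both function names are made up for the example.

```python
DEFAULT_TARGETS = "gfx-example-a,gfx-example-b"  # hypothetical placeholder


def resolve_with_declared_default(inputs):
    """Mimic a default that applies only when the key is omitted."""
    return inputs.get("amdgpu-targets", DEFAULT_TARGETS)


def resolve_with_fallback(inputs):
    """Mimic `value || default`: an empty string also falls back."""
    return inputs.get("amdgpu-targets") or DEFAULT_TARGETS


omitted = resolve_with_declared_default({})
bypassed = resolve_with_declared_default({"amdgpu-targets": ""})
fixed = resolve_with_fallback({"amdgpu-targets": ""})
```

`bypassed` is the empty string because the key was explicitly present, which mirrors a matrix expression that evaluates to `""`. `fixed` recovers the default, which is the effect the `||` fallback in backend.yml is meant to have.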
Ettore Di Giacinto	fe6eb57082	feat(vibevoice-cpp): add purego TTS+ASR backend (#9610 ) * feat(vibevoice-cpp): add purego TTS+ASR backend Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new purego-based Go backend that serves both Backend.TTS and Backend.AudioTranscription from a single gRPC binary. Mirrors the qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix (cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the e2e-backends gRPC harness reuse existing infrastructure. - backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test - backend/index.yaml - &vibevoicecpp meta + 18 image entries - Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring, test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers - .github/workflows/backend.yml - matrix entries for all variants - .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs * feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries Refactor backend Load() to follow the standard Options[] convention used by sherpa-onnx and the rest of the multi-role backends: ModelFile is the primary gguf, supplementary paths come through opts.Options[] as key=value (or key:value for Make-target compat), resolved against opts.ModelPath. type=asr/tts decides the role of ModelFile when neither tts_model nor asr_model is set explicitly. Add gallery/index.yaml entries: - vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice - vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer Both pull from huggingface://mudler/vibevoice.cpp-models with sha256 verification. parameters.model + Options[] paths are siblings under {models_dir} per the qwen3-tts-cpp convention. Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon style, and tighten the per-backend Go closed-loop test to use the explicit Options API. 
* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive libvibevoice is a STATIC archive linked into the MODULE library. Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on MSVC), the linker garbage-collects symbols not referenced from this translation unit - which means dlopen+RegisterLibFunc panics with 'undefined symbol: vv_capi_load' at backend startup, since purego looks them up by name and our cpp/govibevoicecpp.cpp doesn't call them directly. * test(vibevoice-cpp): rewrite suite with Ginkgo v2 Match the convention used by backend/go/sherpa-onnx/backend_test.go. The suite now covers backend semantics that don't need purego (Locking, empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR). Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so `go test ./backend/go/vibevoice-cpp/` is green on a clean checkout and runs the heavyweight closed-loop spec when test.sh has staged the bundle. * fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream The gRPC server's stream handlers (pkg/grpc/server.go) spawn a goroutine that ranges over a chan; the only thing closing that chan is the backend's own Stream method. With the default Base stub returning 'unimplemented' and never touching the chan, the server goroutine hangs forever and the client hits DeadlineExceeded - which is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts matrix run. TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can start playback before the full PCM lands) followed by the PCM body in 64 KB slices. The header + >=2 PCM frames satisfy the harness's 'expected >=2 chunks' assertion and give a real progressive stream. 
AudioTranscriptionStream runs the offline transcription, emits each segment as a delta, and closes with a final_result whose Text equals the concatenated deltas (the harness asserts those match). Two new Ginkgo specs guard the close-channel-on-error path so the deadline-exceeded regression can't come back silently. fix(vibevoice-cpp): silence errcheck on cleanup paths Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure for defers that take args) - matches what the rest of the LocalAI backend/go/* tree already does for these callsites. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced: 1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left v.ttsModel empty, because the default-fill block only ran when BOTH slots were empty. vv_capi_load then got tts="" + a voice and the C side rejected it with rc=-3 'TTS model required to load a voice'. Fix: ModelFile fills the primary role-slot (decided by 'type=' in Options, defaulting to tts) independently of the secondary, so ModelFile + asr_model resolves to both. 2. resolvePath stat'd CWD before falling back to relTo. With LocalAI launched from a directory that happens to contain a same-named file, supplementary Options[] paths could leak away from the models dir. Drop the CWD probe entirely - relative paths now always join onto opts.ModelPath (the gallery convention). New Ginkgo coverage: * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr, explicit tts_model override, key:value variant. * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough, empty input, empty relTo, and the CWD-trap regression test. * 'Load resolves relative Options paths against opts.ModelPath' - end- to-end gallery layout round-trip. 
Verified locally: 19/19 specs pass (with model bundle, including the closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(vibevoice-cpp): use gallery convention in closed-loop spec The 'loads the realtime TTS model' / closed-loop specs were passing already-prefixed paths into Options[]: Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')] Combined with no ModelPath set on the request, the backend's modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then resolvePath joined the prefixed Options path on top of it - producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'. The fix is to mirror the gallery contract LocalAI core actually sends in production: ModelPath is the models root (absolute), ModelFile is a name under it, every Options[] path is relative to ModelPath. Uses filepath.Base() to get bare filenames. Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs) and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that broke CI). Both: 19/19 specs pass, ~55-60s. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner image, the docker build cache, and the test artifacts on a free ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription' was getting SIGTERM'd at 90 min before the model could finish loading. Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for: * the e2e harness Make target * the gallery 'vibevoice-cpp-asr' entry (parameters + files block) * the per-backend test.sh auto-download list Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from 90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs runway above the previous 90 min cap. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on disk) a single 30 s transcription saturates the per-test 30 min timeout in the e2e-backends harness on a 4-core ubuntu-latest, and the 10 GB download + Docker layer + working space leaves no headroom on the runner's free disk. Two attempts in CI got SIGTERM'd at the LoadModel boundary - the bottleneck isn't tunable from the workflow side without a paid-tier runner. The per-backend tests-vibevoice-cpp job already runs the same AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same gRPC contract, same model, single process - so the standalone tests-vibevoice-cpp-grpc-transcription job was redundant on top of the disk/CPU pressure. The Makefile target test-extra-backend-vibevoice-cpp-transcription stays for local invocation on workstations that can afford it - useful when developing the streaming codepaths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to the self-hosted 'bigger-runner' label that GPU image builds in backend.yml use, plus the documented Free-disk-space prep step (purge dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang entries in this file describe. That gives the 7B-param Q4_K ASR model the disk + CPU runway it needs. Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK decode plus 10 GB download has to fit comfortably. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e bigger-runner is a self-hosted bare runner without the standard ubuntu image's preinstalled build tools, so the previous job died at the very first command with 'make: command not found' (exit 127). 
Add the Dependencies step that the disabled vllm/sglang entries in this file already document - apt-get installs make + build-essential + curl + unzip + ca-certificates + git + tar before the make target runs. Mirrors how every other 'runs-on: bigger-runner' entry in backend.yml prepares the runner. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-29 22:22:14 +02:00
Richard Palethorpe	4916f8c880	feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563 ) * feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map LocalAI's vLLM backend wraps a small typed subset of vLLM's AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.). Anything outside that subset -- pipeline/data/expert parallelism, speculative_config, kv_transfer_config, all2all_backend, prefix caching, chunked prefill, etc. -- requires a new protobuf field, a Go struct field, an options.go line, and a backend.py mapping per feature. That cadence is the bottleneck on shipping vLLM's production feature set. Add a generic `engine_args:` map on the model YAML that is JSON-serialised into a new ModelOptions.EngineArgs proto field and applied verbatim to AsyncEngineArgs at LoadModel time. Validation is done by the Python backend via dataclasses.fields(); unknown keys fail with the closest valid name as a hint. dataclasses.replace() is used so vLLM's __post_init__ re-runs and auto-converts dict values into nested config dataclasses (CompilationConfig, AttentionConfig, ...). speculative_config and kv_transfer_config flow through as dicts; vLLM converts them at engine init. Operators can now write: engine_args: data_parallel_size: 8 enable_expert_parallel: true all2all_backend: deepep_low_latency speculative_config: method: deepseek_mtp num_speculative_tokens: 3 kv_cache_dtype: fp8 without further proto/Go/Python plumbing per field. Production defaults seeded by hooks_vllm.go: enable_prefix_caching and enable_chunked_prefill default to true unless explicitly set. Existing typed YAML fields (gpu_memory_utilization, tensor_parallel_size, etc.) remain for back-compat; engine_args overrides them when both are set. 
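The validate-then-replace flow described above can be sketched as follows. FakeEngineArgs is a hypothetical stand-in for vLLM's AsyncEngineArgs so the example runs without vLLM installed, and apply_engine_args is an illustrative name. Using difflib for the closest-name hint is also an assumption, since the commit message does not say how the hint is computed.

```python
import dataclasses
import difflib
import json
from dataclasses import dataclass


@dataclass
class FakeEngineArgs:
    """Hypothetical stand-in for vLLM's AsyncEngineArgs (illustration only)."""
    data_parallel_size: int = 1
    enable_expert_parallel: bool = False
    kv_cache_dtype: str = "auto"
    enable_prefix_caching: bool = False


def apply_engine_args(base, engine_args_json: str):
    """Apply a JSON-encoded engine_args map on top of an existing dataclass."""
    overrides = json.loads(engine_args_json) if engine_args_json else {}
    valid = sorted(f.name for f in dataclasses.fields(base))
    for key in overrides:
        if key not in valid:
            # Assumed hint mechanism: nearest field name by string similarity.
            hint = difflib.get_close_matches(key, valid, n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            raise ValueError(f"unknown engine_args key {key!r}{suffix}")
    # replace() constructs a new instance, so __post_init__ runs again.
    return dataclasses.replace(base, **overrides)
```

A call such as `apply_engine_args(FakeEngineArgs(), '{"data_parallel_size": 8}')` returns a new instance with the override applied, while a misspelled key raises a ValueError naming the nearest valid field.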
Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130 simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and includes the DFlash speculative-decoding method that landed in 0.20.0. cublas13 install gets --index-strategy=unsafe-best-match so uv consults both the cu130 index and PyPI when resolving — PyPI also publishes vllm==0.20.0, but with cu12 binaries that error at import time. Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat completions on RTX 5070 Ti (sm_120, cu130). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(vllm): bot job to bump cublas13 vLLM wheel pin vLLM's cu130 wheel index URL is itself version-locked (wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM bump means rewriting two values atomically — the URL segment and the version constraint. bump_deps.sh handles git-sha-in-Makefile only; add a sibling bump_vllm_wheel.sh and a matching workflow job that mirrors the existing matrix's PR-creation pattern. The bumper queries /releases/latest (which excludes prereleases), strips the leading 'v', and seds both lines unconditionally. When the file is already on the latest tag the rewrite is a no-op and peter-evans/create-pull-request opens no PR. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * docs(vllm): document engine_args and speculative decoding The new engine_args: map plumbs arbitrary AsyncEngineArgs through to vLLM, but the public docs only covered the basic typed fields. 
Add a short subsection in the vLLM section explaining the typed/generic split and showing a worked DFlash speculative-decoding config, with pointers to vLLM's SpeculativeConfig reference and z-lab's drafter collection. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-29 00:49:28 +02:00
Richard Palethorpe	4443250756	chore: add golangci-lint with new-from-merge-base baseline (#9603 ) * chore: add golangci-lint with new-from-merge-base baseline Configure golangci-lint v2 with the standard linter set (errcheck, govet, ineffassign, unused) plus forbidigo, which enforces the Ginkgo/Gomega-only test convention from .agents/coding-style.md by rejecting stdlib testing calls (t.Errorf, t.Fatalf, t.Run, ...). staticcheck is disabled — the codebase has many pre-existing QF-style suggestions not worth gating on. issues.new-from-merge-base = master makes the lint job a gate for new issues only; the ~1300 pre-existing baseline stays visible via 'make lint-all' for incremental cleanup. CI runs 'make lint'. Backends needing C/C++ headers we don't install in the lint runner are excluded via a deny list in the Makefile (backend/go/{piper,silero-vad,llm}, cmd/launcher). Discovery still flows through 'go list ./...', so new packages are scanned automatically. To make backend/go/{sam3-cpp,stablediffusion-ggml,whisper} typecheckable, move their .cpp/.h sources into cpp/ subdirs (matching qwen3-tts-cpp / acestep-cpp). Without this 'go list' rejects the package because Go does not allow .cpp alongside .go without cgo. Fix two real bugs found by lint in tests/integration/ (run only via 'make test-stores', not default CI): a stale zerolog reference left over from the slog migration (`c37785b7`) and an unused 'os' import. Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(lint): generate proto sources and fetch full history The lint job was failing for two reasons: - pkg/grpc/proto/.go is generated, not checked in. Several packages import it, so without 'make protogen-go' typecheck fails project-wide with "no required module provides package github.com/mudler/LocalAI/pkg/grpc/proto". 
- golangci-lint's new-from-merge-base needs to git-merge-base the PR against master, but actions/checkout's default shallow clone doesn't fetch master. fetch-depth: 0 brings full history; the config now references origin/master (the remote-tracking branch that survives the shallow checkout) instead of bare master (which doesn't exist locally after checkout). Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> ci(lint): stub react-ui/dist for go:embed glob core/http/app.go has //go:embed react-ui/dist/*. The glob must match at least one non-hidden entry or typecheck fails the whole core/http package. We don't need the real React bundle to lint Go code, so just touch an empty index.html to satisfy the embed. Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-28 22:07:44 +02:00
Ettore Di Giacinto	a0317d9926	refactor(tests): split app_test.go, move real-backend coverage to e2e-backends core/http/app_test.go had grown to 1495 lines exercising three concerns at once: HTTP-layer integration, real-backend inference (llama-gguf, tts, stablediffusion, transformers embeddings, whisper), and service logic that already has unit-level coverage. Each PR paid for 6 backend builds plus real-model downloads to satisfy a single suite. Reorg per layer: - app_test.go (1495 -> 1003 lines) drives the mock-backend binary only. Kept: auth, routing, gallery API, file:// import, /system, agent-jobs HTTP plumbing, config-file model loading. Deleted real-inference specs (llama-gguf chat, ggml completions/streaming, logprobs, logit_bias, transcription, embeddings, External-gRPC, Stores duplicate, Model gallery Context). Lifted Agent Jobs out of the deleted Stores Context. - tests/e2e-backends/backend_test.go gains logprobs, logit_bias, and no-first-token-dup specs (the latter folded into PredictStream). Two new caps gate them so non-LLM backends opt out. - tests/e2e-aio/e2e_test.go gains a streaming smoke under Context("text") to catch container-level streaming regressions. - tests/models_fixtures/ removed; all fixtures referenced testmodel.ggml. app_test.go now writes per-Context inline mock-model YAMLs. CI: - test.yml + tests-e2e.yml gain paths-ignore (docs/, examples/, *.md, backend/) so docs and backend-only PRs skip them. test.yml drops the 6-backend Build step plus TRANSFORMER_BACKEND/GO_TAGS=tts; tests-apple drops the llama-cpp-darwin build. - New tests-aio.yml runs the AIO container nightly + on workflow_dispatch + master/tags. The tests-e2e-container job moved out of test.yml so PRs no longer pay AIO cost. - New tests-llama-cpp-smoke job in test-extra.yml runs on every PR with no detect-changes gate; pulls quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp (no build on PR) and exercises predict/stream/logprobs/logit_bias against Qwen3-0.6B. 
This is the PR-acceptance real-backend gate after AIO moved to nightly. The path-gated heavy test-extra-backend-llama-cpp wrapper appends the same caps so it exercises the moved specs when the backend actually changes. Makefile: - Deleted test-models/testmodel.ggml (the wget chain), test-llama-gguf, test-tts, test-stablediffusion, test-realtime-models. test target drops --label-filter, HUGGINGFACE_GRPC, TRANSFORMER_BACKEND, TEST_DIR, FIXTURES, CONFIG_FILE, MODELS_PATH, BACKENDS_PATH; depends on build-mock-backend. test-stores keeps a focused entry point and depends on backends/local-store. clean-tests also clears the mock-backend binary. Net per typical Go-side PR: ~25min (6 backend builds + tests + AIO) + ~8min e2e drops to ~5min mock-backend test + ~8min e2e + ~5-10min llama-cpp-smoke (image pulled). Docs and backend-only PRs skip the always-on workflows entirely. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]	2026-04-27 23:09:20 +00:00
Ettore Di Giacinto	9a7f5e68bd	ci(darwin): add native caches to backend_build_darwin macOS runners can't use the registry-backed BuildKit cache (no Docker daemon), so every darwin matrix run was paying full cost for brew installs, Go module downloads, llama.cpp recompiles and Python wheel resolution. Wires actions/cache@v4 into the reusable workflow for four caches: - Go modules + build cache (setup-go cache: true), shared across matrix - Homebrew downloads + selected /opt/homebrew/Cellar entries, with HOMEBREW_NO_AUTO_UPDATE so restored Cellar paths stay stable - ccache for the llama-cpp CMake variants, keyed on the pinned LLAMA_VERSION; CMAKE_*_COMPILER_LAUNCHER is exported via GITHUB_ENV so backend/cpp/llama-cpp/Makefile picks it up without script changes - Python uv + pip wheel cache, keyed by backend + ISO week — same one-cold-rebuild-per-week cadence as the Linux DEPS_REFRESH Read/write semantics match the existing BuildKit policy: every run restores, only master/tag pushes save, so PRs can't pollute master's warm cache. Documents the new caches and the macOS-specific constraints in .agents/ci-caching.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]	2026-04-27 20:17:36 +00:00
Ettore Di Giacinto	f4036fa83f	ci(python-backends): add weekly DEPS_REFRESH cache-buster The shared backend/Dockerfile.python ends in: RUN cd /${BACKEND} && PORTABLE_PYTHON=true make which `pip install`s each backend's requirements*.txt. A scan of all 34 Python backends shows every single one ships at least some unpinned deps (torch, transformers, vllm, diffusers, ...). With the registry cache now enabled, that `make` layer's BuildKit hash depends only on Dockerfile instructions + COPYed source — not on what pip resolves at runtime — so a warm cache would freeze upstream versions indefinitely. DEPS_REFRESH is an ARG declared right before that RUN. backend_build.yml computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) and passes it as a build-arg, so the install layer invalidates at most once per week and re-resolves PyPI / nightly indexes. Within a week, builds stay warm. Only Dockerfile.python is affected: Go (go.sum) and Rust (Cargo.lock) already lock their deps, and the C++ backends pull gRPC at a pinned tag and llama.cpp at a pinned commit. Add .agents/ci-caching.md documenting the cache layout (quay.io/go-skynet/ci-cache:cache<tag-suffix>), read/write semantics (master writes, PRs read-only), DEPS_REFRESH semantics, and how to manually evict tags. Index it from AGENTS.md (CLAUDE.md is a symlink). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7-1m	2026-04-27 14:21:11 +00:00
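The weekly cache-buster value described above can be reproduced outside the workflow with a few lines of Python. This is a sketch: deps_refresh_key is a hypothetical helper name, and it mirrors `date -u +%Y-W%V` by pairing the calendar year with the ISO week number, which is how that format string behaves.

```python
from datetime import datetime, timezone
from typing import Optional


def deps_refresh_key(now: Optional[datetime] = None) -> str:
    """Return a key shaped like `date -u +%Y-W%V`, e.g. 2026-W17.

    The key changes once per ISO week, so a Docker layer keyed on it is
    rebuilt at most weekly.
    """
    now = now or datetime.now(timezone.utc)
    _, iso_week, _ = now.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return f"{now.year}-W{iso_week:02d}"
```

Passing the result as the DEPS_REFRESH build-arg changes the input to the install layer, so BuildKit treats that layer as new once the week rolls over and reuses the cached layer for every build inside the same week.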
Ettore Di Giacinto	bdfa5e934a	ci: switch image/backend build cache to a dedicated registry image - Switch cache-from/cache-to in backend_build.yml and image_build.yml from the unused gha cache to type=registry pointing at quay.io/go-skynet/ci-cache:cache<tag-suffix>, mode=max with ignore-error=true. Master/tag builds populate their own per-matrix-entry cache; PR builds read-only. - Drop the broken generate_grpc_cache.yaml cron. It targeted a `grpc` Dockerfile stage that was removed by `b1fc5acd` in July 2025, has been failing every night since, and never populated the gha cache. The new registry-cache scheme is self-warming, so no separate populator is needed. - Remove the dead GRPC_VERSION / GRPC_BASE_IMAGE / GRPC_MAKEFLAGS build-args from image_build.yml and the orphan ARG GRPC_BASE_IMAGE in the root Dockerfile (the root Dockerfile no longer compiles gRPC; the source build now lives in backend/Dockerfile.{llama-cpp, ik-llama-cpp, turboquant} only and uses its own ARG defaults). - Drop the unused grpc-base-image input from image_build.yml plus the matrix passthroughs in image.yml / image-pr.yml. - Drop the unused GRPC_VERSION env in test.yml. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7-1m	2026-04-27 13:13:04 +00:00
Alex Brick	41ed8ced70	[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (in more places this time) (#9578 ) Update additional intel base images	2026-04-27 09:18:57 +02:00
Ettore Di Giacinto	e16e758dff	ci(backends): build cpu-whisperx and cpu-faster-whisper for linux/arm64 (#9573 ) Extend the existing CPU build matrix entries to produce a multi-arch manifest (linux/amd64,linux/arm64) at the same image tags. arm64 Linux hosts without an NVIDIA GPU report the "default" capability, which already maps to cpu-whisperx / cpu-faster-whisper in backend/index.yaml -- so the manifest list lets Docker pull the right variant without any gallery changes. Both stacks install cleanly under aarch64: torch (2.4.1/2.8.0), faster-whisper, ctranslate2, whisperx, opencv-python and the remaining deps all ship manylinux2014_aarch64 wheels, so no source builds run under QEMU emulation. Follows the same pattern already used by cpu-llama-cpp-quantization. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-26 08:30:03 +02:00
Ettore Di Giacinto	703b4fcae8	Change cron schedule to run every 12 hours Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-25 18:38:28 +02:00
Ettore Di Giacinto	24505e57f5	feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang (#9553 ) * feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang Adds new build profiles mirroring the diffusers/ace-step pattern so vLLM serving (and SGLang on arm64) can be deployed on CUDA 13 hosts and JetPack 7 boards: - vllm: cublas13 (PyPI cu130 channel) + l4t13 (jetson-ai-lab SBSA cu130 prebuilt vllm + flash-attn). - vllm-omni: cublas13 + l4t13. Floats vllm version on cu13 since vllm 0.19+ ships cu130 wheels by default and vllm-omni tracks vllm master; cu12 path keeps the 0.14.0 pin to avoid disturbing existing images. - sglang: l4t13 arm64 only — uses the prebuilt sglang wheel from the jetson-ai-lab SBSA cu130 index, so no source build is needed. Cublas13 sglang on x86_64 is intentionally deferred. CI matrix gains five new images (-gpu-nvidia-cuda-13-vllm{,-omni}, -nvidia-l4t-cuda-13-arm64-{vllm,vllm-omni,sglang}); backend/index.yaml gains the matching capability keys (nvidia-cuda-13, nvidia-l4t-cuda-13) and latest/development merge entries. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] * fix(backends): use unsafe-best-match index strategy on l4t13 builds The jetson-ai-lab SBSA cu130 index lists transitive deps (decord, etc.) at limited versions / older Python ABIs. uv defaults to the first index that contains a package and refuses to fall through to PyPI, so sglang l4t13 build fails resolving decord. Mirror the existing cpu sglang profile by setting --index-strategy=unsafe-best-match on l4t13 across the three backends, and apply it to the explicit vllm install line in vllm-omni's install.sh (which doesn't honor EXTRA_PIP_INSTALL_FLAGS). 
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] * fix(sglang): drop [all] extras on l4t13, floor version at 0.5.0 The [all] extra brings in outlines→decord, and decord has no aarch64 cp312 wheel on PyPI nor the jetson-ai-lab index (only legacy cp35-cp37 tags). With unsafe-best-match enabled, uv backtracked through sglang versions trying to satisfy decord and silently landed on sglang==0.1.16, an ancient version with an entirely different dep tree (cloudpickle/outlines 0.0.44, etc.). Drop [all] so decord is no longer required, and floor sglang at 0.5.0 to prevent any future resolver misfire from degrading the version again. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-25 12:26:29 +02:00
Alex Brick	e5337039b0	[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543 ) * Use latest oneapi-basekit image for Intel images The current `localai/localai:master-gpu-intel` images don't work with the intel arc pro b70. Updating the base_image to 2025.3.2 fixes it. Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com> * Update github workflow base image --------- Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>	2026-04-24 18:29:10 +02:00
Richard Palethorpe	13734ae9fa	feat: Add Sherpa ONNX backend for ASR and TTS (#8523 ) feat(backend): Add Sherpa ONNX backend and Omnilingual ASR Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same approach as opus/stablediffusion-ggml/whisper — a thin C shim (csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego can't reach directly: nested struct config writes, result-struct field reads, and the streaming TTS callback trampoline. The Go side uses opaque uintptr handles and purego.NewCallback for the TTS callback. Supports: - VAD via sherpa-onnx's Silero VAD - Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC - Online/streaming ASR: zipformer transducer with endpoint detection (AudioTranscriptionStream emits delta events during decode) - Offline TTS: VITS (LJS, etc.) - Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel, prefixed by a streaming WAV header Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR), silero-vad-sherpa, vits-ljs-sherpa. E2E coverage: tests/e2e-backends for offline + streaming ASR, tests/e2e for the full realtime pipeline (VAD + STT + TTS). Assisted-by: claude-opus-4-7-1M [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-24 14:40:06 +02:00
Ettore Di Giacinto	181ebb6df4	feat: voice recognition (#9500 ) * feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend Audio analog to face recognition. Adds three gRPC RPCs (VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python backend scaffold under backend/python/speaker-recognition/ wrapping SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for WeSpeaker / 3D-Speaker ONNX exports. The kokoros Rust backend gets matching unimplemented trait stubs — tonic's async_trait has no defaults, so adding an RPC without Rust stubs breaks the build (same regression fixed by `eb01c772` for face). Swagger, /api/instructions, and the auth RouteFeatureRegistry / APIFeatures list are updated so the endpoints surface everywhere a client or admin UI looks. Assisted-by: Claude:claude-opus-4-7 * feat(voice-recognition): add 1:N identify + register/forget endpoints Mirrors the face-recognition register/identify/forget surface. New package core/services/voicerecognition/ carries a Registry interface and a local-store-backed implementation (same in-memory vector-store plumbing facerecognition uses, separate instance so the embedding spaces stay isolated). Handlers under /v1/voice/{register,identify,forget} reuse backend.VoiceEmbed to compute the probe vector, then delegate the nearest-neighbour search to the registry. Default cosine-distance threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%). As with the face registry, the current backing is in-memory only — a pgvector implementation is a future constructor-level swap. Assisted-by: Claude:claude-opus-4-7 * feat(voice-recognition): gallery, docs, CI and e2e coverage - backend/index.yaml: speaker-recognition backend entry + CPU and CUDA-12 image variants (plus matching development variants). - gallery/index.yaml: speechbrain-ecapa-tdnn (default) and wespeaker-resnet34 model entries. 
The WeSpeaker SHA-256 is a deliberate placeholder — the HF URI must be curl'd and its hash filled in before the entry installs. - docs/content/features/voice-recognition.md: API reference + quickstart, mirrors the face-recognition docs. - React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's precedent — no dedicated tab yet). - tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs. Helper resolveFaceFixture is reused as-is — the only thing face/voice share is "download a file into workDir", so no need for a new helper. - Makefile: docker-build-speaker-recognition + test-extra-backend-speaker-recognition-{ecapa,all} targets. Audio fixtures default to VCTK p225/p226 samples from HuggingFace. - CI: test-extra.yml grows a tests-speaker-recognition-grpc job mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image build entries — scripts/changed-backends.js auto-picks these up. Assisted-by: Claude:claude-opus-4-7 * feat(voice-recognition): wire a working /v1/voice/analyze head Adds AnalysisHead: a lazy-loading age / gender / emotion inference wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine. Defaults to two open-licence HuggingFace checkpoints: - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) — age regression + 3-way gender (female / male / child). - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion. Both are optional and degrade gracefully when transformers or the model can't be loaded — the engine raises NotImplementedError so the gRPC layer returns 501 instead of a generic 500. Emotion classes pass through from the model (neutral/happy/angry/sad on the default checkpoint); the e2e test now accepts any non-empty dominant gender so custom age_gender_model overrides don't fail it. Adds transformers to the backend's CPU and CUDA-12 requirements. 
Assisted-by: Claude:claude-opus-4-7 * fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256 Replaces the placeholder hash in gallery/index.yaml with the actual SHA-256 (7bb2f06e…) of the upstream Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai models install wespeaker-resnet34` now succeeds. Assisted-by: Claude:claude-opus-4-7 * fix(voice-recognition): soundfile loader + honest analyze default Two issues surfaced on first end-to-end smoke with the actual backend image: 1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package for audio decoding. Switch SpeechBrainEngine._load_waveform to the already-present soundfile (listed in requirements.txt) plus a numpy linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the codepath we never exercise (torchaudio's ffmpeg backend). 2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-24-ft-age-gender, but AutoModelForAudioClassification silently mangles that checkpoint — it reports the age head weights as UNEXPECTED and re-initialises the classifier head with random values, so the "gender" output is noise and there is no age output at all. Make age/gender opt-in instead (empty default; users wire a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via age_gender_model: option). Emotion keeps its working Superb default. Also broaden _infer_age_gender's tensor-shape handling and catch runtime exceptions so a dodgy age/gender head never takes down the whole analyze call. Docs and README updated to match the new policy. 
Verified with the branch-scoped gallery on localhost: - voice/embed → 192-d ECAPA-TDNN vector - voice/verify → same-clip dist≈6e-08 verified=true; cross-speaker dist 0.76–0.99 verified=false (as expected) - voice/register/identify/forget → round-trip works, 404 on unknown id - voice/analyze → emotion populated, age/gender omitted (opt-in) Assisted-by: Claude:claude-opus-4-7 * fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec Two issues surfaced after CI actually ran the speaker-recognition e2e target (I'd curl-tested against a running server but hadn't run the make target locally): 1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404 (the dataset is gated). Swap them for the speechbrain test samples served from github.com/speechbrain/speechbrain/raw/develop/ — public, no auth, correct 16kHz mono format. 2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming file1/file2 were same-speaker. The speechbrain samples are three different speakers (example1/2/5), and there is no easy un-gated source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech are all license- or size-gated for CI use). Replace the ceiling check with a relative-ordering assertion: d(pair) > d(same-clip) for both file2 and file3 — that's enough to prove the embeddings encode speaker info, and it works with any three non-identical clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not asserted. Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed, VoiceVerify) on the built backend image. 12 non-voice specs skipped as expected. Assisted-by: Claude:claude-opus-4-7 * fix(ci): checkout with submodules in the reusable backend_build workflow The kokoros Rust backend build fails with failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file because the reusable backend_build.yml workflow's actions/checkout step was missing `submodules: true`. 
Dockerfile.rust does `COPY . /LocalAI`, and without the submodule files the subsequent `cargo build` can't find the vendored Kokoros crate. The bug pre-dates this PR — scripts/changed-backends.js only triggers the kokoros image job when something under backend/rust/kokoros or the shared proto changes, so master had been coasting past it. The voice-recognition proto addition re-broke it. Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml (insightface, kokoros, speaker-recognition) already pass `submodules: true`; this brings the shared backend image builder in line. Assisted-by: Claude:claude-opus-4-7	2026-04-23 12:07:14 +02:00
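The relative-ordering assertion that replaced the fixed 0.4 ceiling in the VoiceVerify spec above can be illustrated with a small sketch. The real spec is a Ginkgo test in Go. The function names here are hypothetical, the toy three-dimensional vectors stand in for real speaker embeddings, and treating the reported distance as cosine distance is an assumption.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def embeddings_encode_speaker(e1, e2, e3) -> bool:
    """Fixture-agnostic check: both cross-clip distances must exceed the
    same-clip distance. No absolute threshold is involved, so it holds for
    any three non-identical clips."""
    same_clip = cosine_distance(e1, e1)
    return cosine_distance(e1, e2) > same_clip and cosine_distance(e1, e3) > same_clip
```

The check makes no claim about which of the other two clips is closer, which matches the commit's note that the d(1,2) versus d(1,3) ordering is logged but not asserted.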
Ettore Di Giacinto	20baec77ab	feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480 ) * feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis Adds face recognition as a new first-class capability in LocalAI via the `insightface` Python backend, with a pluggable two-engine design so non-commercial (insightface model packs) and commercial-safe (OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface. New gRPC RPCs (backend/backend.proto): * FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse * FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse Existing Embedding and Detect RPCs are reused (face image in PredictOptions.Images / DetectOptions.src) for face embedding and face detection respectively. New HTTP endpoints under /v1/face/: * verify — 1:1 image pair same-person decision * analyze — per-face age + gender (emotion/race reserved) * register — 1:N enrollment; stores embedding in vector store * identify — 1:N recognition; detect → embed → StoresFind * forget — remove a registered face by opaque ID Service layer (core/services/facerecognition/) introduces a `Registry` interface with one in-memory `storeRegistry` impl backed by LocalAI's existing local-store gRPC vector backend. HTTP handlers depend on the interface, not on StoresSet/StoresFind directly, so a persistent PostgreSQL/pgvector implementation can be slotted in via a single constructor change in core/application (TODO marker in the package doc). New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired into FLAG_DETECTION so /v1/detection works for face bounding boxes. 
Gallery (backend/index.yaml) ships three entries: * insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage (~326MB pre-baked; non-commercial research use only) * insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0) * insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial) Python backend (backend/python/insightface/): * engines.py — FaceEngine protocol with InsightFaceEngine and OnnxDirectEngine; resolves model paths relative to the backend directory so the same gallery config works in docker-scratch and in the e2e-backends rootfs-extraction harness. * backend.py — gRPC servicer implementing Health, LoadModel, Status, Embedding, Detect, FaceVerify, FaceAnalyze. * install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the backend directory so first-run is offline-clean (the final scratch image only preserves files under /<backend>/). * test.py — parametrized unit tests over both engines. Tests: * Registry unit tests (go test -race ./core/services/facerecognition/...) — in-memory fake grpc.Backend, table-driven, covers register/ identify/forget/error paths + concurrent access. * tests/e2e-backends/backend_test.go extended with face caps (face_detect, face_embed, face_verify, face_analyze); relative ordering + configurable verifyCeiling per engine. * Makefile targets: test-extra-backend-insightface-buffalo-l, -opencv, and the -all aggregate. * CI: .github/workflows/test-extra.yml gains tests-insightface-grpc, auto-triggered by changes under backend/python/insightface/. Docs: * docs/content/features/face-recognition.md — feature page with license table, quickstart (defaults to the commercial-safe model), models matrix, API reference, 1:N workflow, storage caveats. * Cross-refs in object-detection.md, stores.md, embeddings.md, and whats-new.md. * Contributor README at backend/python/insightface/README.md. Verified end-to-end: * buffalo_l: 6/6 specs (health, load, face_detect, face_embed, face_verify, face_analyze). 
* opencv: 5/5 specs (same minus face_analyze — SFace has no demographic head; correctly skipped via BACKEND_TEST_CAPS). Assisted-by: Claude:claude-opus-4-7 * fix(face-recognition): move engine selection to model gallery, collapse backend entries The previous commit put engine/model_pack options on backend gallery entries (`backend/index.yaml`). That was wrong — `GalleryBackend` (core/gallery/backend_types.go:32) has no `options` field, so the YAML decoder silently dropped those keys and all three "different insightface-" backend entries resolved to the same container image with no distinguishing configuration. Correct split: `backend/index.yaml` now has ONE `insightface` backend entry shipping the CPU + CUDA 12 container images. The Python backend bundles both the non-commercial insightface model packs (buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo weights (YuNet + SFace); the active engine is selected at LoadModel time via `options: ["engine:..."]`. * `gallery/index.yaml` gains three model entries — `insightface-buffalo-l`, `insightface-opencv`, `insightface-buffalo-s` — each setting the appropriate `overrides.backend` + `overrides.options` so installing one actually gives the user the intended engine. This matches how `rfdetr-base` lives in the model gallery against the `rfdetr` backend. The earlier e2e tests passed despite this bug because the Makefile targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC, bypassing any gallery resolution entirely. No code changes needed. Assisted-by: Claude:claude-opus-4-7 * feat(face-recognition): cover all supported models in the gallery + drop weight baking Follows up on the model-gallery split: adds entries for every model configuration either engine actually supports, and switches weight delivery from image-baked to LocalAI's standard gallery mechanism. 
Gallery now has seven `insightface-` model entries (gallery/index.yaml): insightface (family) — non-commercial research use • buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default • buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage • buffalo-s (159MB) — SCRFD-500MF + MBF + genderage • buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only (no landmarks, no demographics — analyze returns empty attributes) • antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage OpenCV Zoo family — Apache 2.0 commercial-safe • opencv — YuNet + SFace fp32 (~40MB) • opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU) Model weights are no longer baked into the backend image. The image now ships only the Python runtime + libraries (~275MB content size, ~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through LocalAI's gallery mechanism: OpenCV variants list `files:` with ONNX URIs + SHA-256, so `local-ai models install insightface-opencv` pulls them into the models directory exactly like any other gallery-managed model. * insightface packs (upstream distributes .zip archives only, not individual ONNX files) auto-download on first LoadModel via FaceAnalysis' built-in machinery, rooted at the LocalAI models directory so they live alongside everything else — same pattern `rfdetr` uses with `inference.get_model()`. Backend changes (backend/python/insightface/): * backend.py — LoadModel propagates `ModelOptions.ModelPath` (the LocalAI models directory) to engines via a `_model_dir` hint. This replaces the earlier ModelFile-dirname approach; ModelPath is the canonical "models directory" variable set by the Go loader (pkg/model/initializers.go:144) and is always populated. * engines.py::_resolve_model_path — picks up `model_dir` and searches it (plus basename-in-model-dir) before falling back to the dev script-dir. This is how OnnxDirectEngine finds gallery-downloaded YuNet/SFace files by filename only. 
* engines.py::_flatten_insightface_pack — new helper that works around an upstream packaging inconsistency: buffalo_l/s/sc zips expand flat, but buffalo_m and antelopev2 zips wrap their ONNX files in a redundant `<name>/` directory. insightface's own loader looks one level too shallow and fails. We call `ensure_available()` explicitly, flatten if nested, then hand to FaceAnalysis. * engines.py::InsightFaceEngine.prepare — root-resolution order now includes the `_model_dir` hint so packs download into the LocalAI models directory by default. * install.sh — no longer pre-downloads any weights. Everything is gallery-managed now. * smoke.py (new) — parametrized smoke test that iterates over every gallery configuration, simulating the LocalAI install flow (creates a models dir, fetches OpenCV files with checksum verification, lets insightface auto-download its packs), then runs detect + embed + verify (+ analyze where supported) through the in-process BackendServicer. * test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/` paths; downloads ONNX files to a temp dir at setUpClass time and passes ModelPath accordingly. Registry change (core/services/facerecognition/store_registry.go): * `dim=0` in NewStoreRegistry now means "accept whatever dimension arrives" — needed because the backend supports 512-d ArcFace/MBF and 128-d SFace via the same Registry. A non-zero dim still fails fast with ErrDimensionMismatch. * core/application plumbs `faceEmbeddingDim = 0`, explaining the rationale in the comment. Backend gallery description updated to reflect that the image carries no weights — it's just Python + engines. 
Smoke-tested all 7 configurations against the rebuilt image (with the flatten fix applied), exit 0: PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000 PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000 PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000 PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000 PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000 PASS: insightface-opencv faces=6 dim=128 same-dist=0.000 PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000 7/7 passed Assisted-by: Claude:claude-opus-4-7 * fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim CI regression from the previous commit: I moved OpenCV Zoo weight delivery to LocalAI's gallery `files:` mechanism, but the test-extra-backend-insightface-opencv target was still passing relative paths `detector_onnx:models/opencv/yunet.onnx` in BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over gRPC without going through the gallery, so those relative paths resolved to nothing and OpenCV's ONNXImporter failed: LoadModel failed: Failed to load face engine: OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx Fix: add an `insightface-opencv-models` prerequisite target that fetches the two ONNX files (YuNet + SFace) to a deterministic host cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256, and skips the download on re-runs. The opencv test target depends on it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend finds the files via its normal absolute-path resolution branch. Also refresh the buffalo_l comment: it no longer says "pre-baked" (nothing is — the pack auto-downloads from upstream's GitHub release on first LoadModel, same as in CI). Locally verified: `make test-extra-backend-insightface-opencv` passes 5/5 specs (health, load, face_detect, face_embed, face_verify). 
Assisted-by: Claude:claude-opus-4-7 * feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs The docs promised that /v1/embeddings returns face vectors when you send an image data-URI. That was never true: /v1/embeddings is OpenAI-compatible and text-only by contract — its handler goes through `core/backend/embeddings.go::ModelEmbedding`, which sets `predictOptions.Embeddings = s` (a string of TEXT to embed) and never populates `predictOptions.Images[]`. The Python backend's Embedding gRPC method does handle Images[] (that's how /v1/face/register reaches it internally via `backend.FaceEmbed`), but the HTTP embeddings endpoint wasn't wired to populate it. Rather than overload /v1/embeddings with image-vs-text detection — messy, and the endpoint is OpenAI-compatible by design — add a dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed` (already used internally by /v1/face/register and /v1/face/identify). Matches LocalAI's convention of a dedicated path per non-standard flow (/v1/rerank, /v1/detection, /v1/face/verify etc.). Response: { "embedding": [<dim> floats, L2-normed], "dim": int, // 512 for ArcFace R50 / MBF, 128 for SFace "model": "<name>" } Live-tested on the opencv engine: returns a 128-d L2-normalized vector (sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings is text-only and point image users at /v1/face/embed instead. Assisted-by: Claude:claude-opus-4-7 * fix(http): map malformed image input + gRPC status codes to proper 4xx Image-input failures on LocalAI's single-image endpoints (/v1/detection, /v1/face/{verify,analyze,embed,register,identify}) have historically returned 500 — even when the client was the one who sent garbage. Classic example: you POST an "image" that isn't a URL, isn't a data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim that's its fault. 
Two helpers land in core/http/endpoints/localai/images.go and every single-image handler is switched over: * decodeImageInput(s) Wraps utils.GetContentURIAsBase64 and turns any failure (invalid URL, not a data-URI, download error, etc.) into echo.NewHTTPError(400, "invalid image input: ..."). * mapBackendError(err) Inspects the gRPC status on a backend call error and maps: INVALID_ARGUMENT → 400 Bad Request NOT_FOUND → 404 Not Found FAILED_PRECONDITION → 412 Precondition Failed Unimplemented → 501 Not Implemented All other codes fall through unchanged (still 500). Before, my 1×1 PNG error-path test returned: HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images" After: HTTP 400 "failed to decode one or both images" Scope-limited to the LocalAI single-image endpoints. The multi-modal paths (middleware/request.go, openresponses/responses.go, openai/realtime.go) intentionally log-and-skip individual media parts when decoding fails — different design intent (graceful degradation of a multi-part message), not a 400-worthy failure. Left untouched. Live-verified: every error case in /tmp/face_errors.py now returns 4xx with a meaningful message; the "image with no face (1x1 PNG)" case specifically went from 500 → 400. Assisted-by: Claude:claude-opus-4-7 * refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis Follows up on the discovery that LocalAI's gallery `files:` mechanism handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy piper voices use exactly this pattern. Insightface packs are zip archives, so we can now deliver them the same way every other gallery-managed model gets delivered: declaratively, checksum-verified, through LocalAI's standard download+extract pipeline. Two changes: 1. Gallery (gallery/index.yaml) — every insightface-* entry gains a `files:` list with the pack zip's URI + SHA-256. 
`local-ai models install insightface-buffalo-l` now fetches the zip, verifies the hash, and extracts it into the models directory. No more reliance on insightface's library-internal `ensure_available()` auto-download or its hardcoded `BASE_REPO_URL`. 2. InsightFaceEngine (backend/python/insightface/engines.py) — drops the FaceAnalysis wrapper and drives insightface's `model_zoo` directly. The ~50 lines FaceAnalysis provides — glob ONNX files, route each through `model_zoo.get_model()`, build a `{taskname: model}` dict, loop per-face at inference — are reimplemented in `InsightFaceEngine`. The actual inference classes (RetinaFace, ArcFaceONNX, Attribute, Landmark) are still insightface's — we only replicate the glue, so drift risk against upstream is minimal. Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/.onnx` layout that doesn't match what LocalAI's zip extraction produces. LocalAI unpacks archives flat into `<models_dir>`. Upstream packs are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands at `<models_dir>/.onnx`), buffalo_m/antelopev2 wrap in a redundant `<name>/` dir (lands at `<models_dir>/<name>/.onnx`). The new `_locate_insightface_pack` helper searches both locations plus legacy paths and returns whichever has ONNX files. Replaces the earlier `_flatten_insightface_pack` helper (which tried to fight FaceAnalysis's layout expectations; now we just find the files wherever they are). Net effect for users: install once via LocalAI's managed flow, weights live alongside every other model, progress shows in the jobs endpoint, no first-load network call. Same API surface, cleaner plumbing. Assisted-by: Claude:claude-opus-4-7 fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched The e2e suite drives LoadModel over gRPC without going through LocalAI's gallery flow, so the engine's `_model_dir` option (normally populated from ModelPath) is empty. 
Previously the insightface target relied on FaceAnalysis auto-download to paper over this, but we dropped FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l target started failing at LoadModel with "no insightface pack found". Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip (same SHA as the gallery entry), extract it on the host, and pass `root:<dir>` so the engine locates the pack without needing ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI fast; it covers the same insightface engine code path as buffalo_l. Face analyze cap dropped since buffalo_sc has no age/gender head. Assisted-by: Claude:claude-opus-4-7[1m] * feat(face-recognition): surface face-recognition in advertised feature maps The six /v1/face/* endpoints were missing from every place LocalAI advertises its feature surface to clients: * api_instructions — the machine-readable capability index at GET /api/instructions. Added `face-recognition` as a dedicated instruction area with an intro that calls out the in-memory registry caveat and the /v1/face/embed vs /v1/embeddings split. * auth/permissions — added FeatureFaceRecognition constant, routed all six face endpoints through it so admins can gate them per-user like any other API feature. Default ON (matches the other API features). * React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a follow-up (noted in the plan). Instruction count bumped 9 → 10; test updated. Assisted-by: Claude:claude-opus-4-7[1m] * docs(agents): capture advertising-surface steps in the endpoint guide Before this change, adding a new /v1/* endpoint reliably missed one or more of: the swagger @Tags annotation, the /api/instructions registry, the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The endpoint would work but be invisible to API consumers, admins, and the UI — and nothing in the existing docs said to look in those places. 
Extend .agents/api-endpoints-and-auth.md with a new "Advertising surfaces" section covering all four surfaces (swagger tags, /api/ instructions, capabilities.js, docs/), and expand the closing checklist so it's impossible to ship a feature without visiting each one. Hoist a one-liner reminder into AGENTS.md's Quick Reference so agents skim it before diving in. Assisted-by: Claude:claude-opus-4-7[1m]	2026-04-22 21:55:41 +02:00
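The gRPC-status-to-HTTP mapping listed in the `fix(http)` part of the commit above can be sketched as a small lookup table. This is a hedged Python illustration only: the real helper is the Go function `mapBackendError` in core/http/endpoints/localai/images.go, and the name `map_backend_error`, the string-based status codes, and the normalization step below are assumptions made to keep the example self-contained.

```python
# Status names are stored without underscores so that both spellings seen in
# the commit message ("INVALID_ARGUMENT" and "InvalidArgument") resolve.
GRPC_TO_HTTP = {
    "INVALIDARGUMENT": 400,     # Bad Request
    "NOTFOUND": 404,            # Not Found
    "FAILEDPRECONDITION": 412,  # Precondition Failed
    "UNIMPLEMENTED": 501,       # Not Implemented
}


def map_backend_error(grpc_code: str) -> int:
    """Return the HTTP status for a gRPC status name.

    Codes outside the table fall through to 500, as the commit describes.
    """
    normalized = grpc_code.replace("_", "").upper()
    return GRPC_TO_HTTP.get(normalized, 500)


print(map_backend_error("InvalidArgument"))
print(map_backend_error("INTERNAL"))
```

Keeping the table explicit means an unlisted status still surfaces as a server error instead of being silently reclassified as a client error.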
Richard Palethorpe	d16f19f1eb	fix(kokoros): Build and publish the backend images from CI/CD (#9487 ) * fix(kokoros): Build and publish the backend images from CI/CD Signed-off-by: Richard Palethorpe <io@richiejp.com> * Delete .claude/agents Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * Delete .claude/commands Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * Delete .claude/settings.json Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * Delete .claude/skills Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-22 13:19:55 +02:00
Ettore Di Giacinto	39573ecd2a	chore(whisperx): drop ROCm/hipblas build target (#9474) whisperx has no upstream AMD GPU support and its core transcription path (faster-whisper -> ctranslate2) falls back to CPU on AMD since the PyPI ctranslate2 is CUDA-only. The torch rocm wheels would accelerate only the alignment/diarization stages, producing a misleadingly half-working image. Drop the hipblas variant rather than shipping a partially accelerated build users can't distinguish from the real thing. AMD hosts now fall through the capability map to cpu-whisperx / cpu-whisperx-development. Also removes the now-dangling rocm-whisperx assertion from pkg/system/capabilities_test.go and the ROCm mention from the whisperx row in docs/content/reference/compatibility-table.md. Assisted-by: Claude Code:claude-opus-4-7	2026-04-21 21:50:18 +02:00
Ettore Di Giacinto	a7dbb2a83d	fix(gallery-agent): process blacklist command on recently-closed PRs (#9473) The command-processing step only walked open PRs, so when a maintainer wrote `/gallery-agent blacklist` and immediately closed the PR, the next scheduled run missed the command, the `gallery-agent/blacklisted` label was never applied, and the skip-URL step (which only pulls URLs from closed PRs carrying that label) re-proposed the model on the next cron. Also scan closed gallery-agent PRs from the last 14 days that don't already carry the blacklist label, and apply the label retroactively when the command is present. Close/recreate actions still only run on open PRs. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-21 16:29:13 +02:00
Russell Sim	c66c41e8d7	fix(ci): wire AMDGPU_TARGETS through backend build workflow (#9445) Commit `8839a71c` exposed AMDGPU_TARGETS as an ARG/ENV in Dockerfile.llama-cpp so GPU targets could be overridden, but never wired the value through the CI workflow inputs. Without it, Docker receives AMDGPU_TARGETS="" which overrides the Makefile's ?= default, causing all hipblas builds to compile only for gfx906 regardless of the target list in the Makefile. Add amdgpu-targets as a workflow_call input with the same default list as the Makefile, and pass it as AMDGPU_TARGETS in the build-args of both the push and PR build steps. Assisted-by: Claude Code:claude-sonnet-4-6 Signed-off-by: Russell Sim <rsl@simopolis.xyz>	2026-04-20 23:41:19 +02:00
Ettore Di Giacinto	a90a8cf1d0	fix(ci): switch gallery-agent to sigs.k8s.io/yaml (#9397 ) The gallery-agent lives under .github/, which Go tooling treats as a hidden directory and excludes from './...' expansion. That means 'go mod tidy' (run on every dependabot dependency bump) repeatedly strips github.com/ghodss/yaml from go.mod/go.sum, breaking 'go run ./.github/gallery-agent' with a missing go.sum entry error. Switch to sigs.k8s.io/yaml — API-compatible with ghodss/yaml and already pulled in as a transitive dependency via non-hidden packages, so tidy can no longer remove it.	2026-04-17 10:10:42 +02:00
Ettore Di Giacinto	b4e30692a2	feat(backends): add sglang (#9359 ) * feat(backends): add sglang Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally; -march=native fails on CI runners without AVX-512 in /proc/cpuinfo. Force -march=sapphirerapids so the build always succeeds, matching sglang upstream's docker/xeon.Dockerfile recipe. The resulting binary still requires an AVX-512 capable CPU at runtime, so disable tests-sglang-grpc in test-extra.yml for the same reason tests-vllm-grpc is disabled. Local runs with make test-extra-backend-sglang still work on hosts with the right SIMD baseline. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512 CXXFLAGS with -march=sapphirerapids was being overridden by add_compile_options(-march=native) in sglang's CPU CMakeLists.txt, since CMake appends those flags after CXXFLAGS. Sed-patch the CMakeLists.txt directly after cloning to replace -march=native. --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-16 22:40:56 +02:00
Ettore Di Giacinto	6f0051301b	feat(backend): add tinygrad multimodal backend (experimental) (#9364 ) * feat(backend): add tinygrad multimodal backend Wire tinygrad as a new Python backend covering LLM text generation with native tool-call extraction, embeddings, Stable Diffusion 1.x image generation, and Whisper speech-to-text from a single self-contained container. Backend (`backend/python/tinygrad/`): - `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects Llama / Qwen2 / Mistral architecture from `config.json`, supports safetensors and GGUF), Embedding via mean-pooled last hidden state, GenerateImage via the vendored SD1.x pipeline, AudioTranscription + AudioTranscriptionStream via the vendored Whisper inference loop, plus Tokenize / ModelMetadata / Status / Free. - Vendored upstream model code under `vendor/` (MIT, headers preserved): llama.py with an added `qkv_bias` flag for Qwen2-family bias support and an `embed()` method that returns the last hidden state, plus clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf training branch that pulls `mlperf.initializers`), audio_helpers.py and whisper.py (trimmed to drop the pyaudio listener). - Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 / Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral (Mistral / Mixtral). Auto-selected from model architecture or `Options`. - `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the default portable python is 3.10). - `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so tinygrad's CPU device uses the in-process libLLVM JIT instead of shelling out to the missing `clang` binary. - Local unit tests for Health and the four parsers in `test.py`. 
Build wiring: - Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, `BACKEND_TINYGRAD = tinygrad\|python\|.\|false\|true`, docker-build-target eval, and `docker-build-backends` aggregator. - `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix entries (mirrors the transformers backend placement). - `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image entries (latest + development). E2E test wiring: - `tests/e2e-backends/backend_test.go` gains an `image` capability that exercises GenerateImage and asserts a non-empty PNG is written to `dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS` knobs. - Five new make targets next to `test-extra-backend-vllm`: - `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes, mirrors the vllm target 1:1 (5/9 specs in ~57s). - `test-extra-backend-tinygrad-embeddings` — same model, embeddings via LLM hidden state (3/9 in ~10s). - `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror, health/load/image (3/9 in ~10min, 4 diffusion steps on CPU). - `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en against jfk.wav from whisper.cpp samples (4/9 in ~49s). - `test-extra-backend-tinygrad-all` aggregate. All four targets land green on the first MVP pass: 15 specs total, 0 failures across LLM+tools, embeddings, image generation, and speech transcription. * refactor(tinygrad): collapse to a single backend image tinygrad generates its own GPU kernels (PTX renderer for CUDA, the autogen ctypes wrappers for HIP / Metal / WebGPU) and never links against cuDNN, cuBLAS, or any toolkit-version-tied library. The only runtime dependency that varies across hosts is the driver's libcuda.so.1 / libamdhip64.so, which are injected into the container at run time by the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based backends, there is no reason to ship per-CUDA-version images. 
- Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries from .github/workflows/backend.yml. The sole remaining entry is renamed to -tinygrad (from -cpu-tinygrad) since it is no longer CPU-only. - Collapse backend/index.yaml to a single meta + development pair. The meta anchor carries the latest uri directly; the development entry points at the master tag. - run.sh picks the tinygrad device at launch time by probing /usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX renderer (avoids any nvrtc/toolkit dependency); otherwise we fall back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process libLLVM JIT for the CLANG path. - backend.py's _select_tinygrad_device() is trimmed to a CLANG-only fallback since production device selection happens in run.sh. Re-ran test-extra-backend-tinygrad after the change: Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed	2026-04-15 19:48:23 +02:00
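The launch-time device probing described for run.sh in the commit above can be sketched as follows. The actual logic lives in a shell script that is not shown here, so this Python version is an illustration under assumptions: the function name `select_tinygrad_env` is hypothetical, the library list is passed in rather than read from /usr/lib/..., and only the environment variables the commit message names (CUDA, CUDA_PTX, CPU_LLVM) are set. The variables used on the HIP path and the value of LLVM_PATH are not stated in the source and are left out.

```python
def select_tinygrad_env(visible_libs):
    """Pick a tinygrad device from the driver libraries visible at launch.

    Returns (device_label, env) where env holds only the variables the
    commit message names; anything it does not specify is omitted.
    """
    libs = set(visible_libs)
    if "libcuda.so.1" in libs:
        # CUDA_PTX=1 selects tinygrad's own PTX renderer, avoiding any
        # nvrtc/toolkit dependency.
        return "CUDA", {"CUDA": "1", "CUDA_PTX": "1"}
    if "libamdhip64.so" in libs:
        # The source does not say which variables run.sh sets for HIP.
        return "HIP", {}
    # CPU fallback: CPU_LLVM=1 keeps the in-process libLLVM JIT
    # (LLVM_PATH is also set by run.sh; its value is not given).
    return "CLANG", {"CPU_LLVM": "1"}


device, env = select_tinygrad_env(["libcuda.so.1"])
print(device, env)
```

Checking for libcuda first mirrors the order in the commit message; the order only matters on a host that exposes both driver libraries.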


648 Commits