LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	733c254b32	ci: consolidate llama-cpp-darwin into the matrix-driven Darwin flow (#9731 ) The bespoke llama-cpp-darwin + llama-cpp-darwin-publish top-level jobs in backend.yml ran unconditionally on every backend.yml trigger (push/cron), bypassing the path filter that all 34 other Darwin backends already honor via backend-jobs-darwin -> backend_build_darwin.yml. Move llama-cpp into the includeDarwin matrix: - New entry in .github/backend-matrix.yml (lang=go, no build-type). - backend_build_darwin.yml gains an `if: inputs.backend == 'llama-cpp'` build step that drives `make backends/llama-cpp-darwin`. The bespoke script (scripts/build/llama-cpp-darwin.sh) compiles three CMake variants from backend/cpp/llama-cpp and bundles dylibs via otool, so it doesn't fit the build-darwin-go-backend mold; the existing llama-cpp-aware ccache setup blocks already in this workflow are what motivated the consolidation in the first place. - scripts/changed-backends.js's inferBackendPathDarwin gains a special case so llama-cpp on Darwin maps to backend/cpp/llama-cpp/ (the C++ source tree) rather than the non-existent backend/go/llama-cpp/. - Bumps Darwin go-version from 1.24.x -> 1.25.x in backend.yml and backend_pr.yml so llama-cpp keeps the Go toolchain it had under the bespoke job; the other 34 Darwin backends pick this up too with no known reason to pin 1.24. - Removes ~80 lines of bespoke YAML from backend.yml. The publish path is unchanged in shape - every Darwin backend now uses the same crane-push leg from ubuntu-latest in backend_build_darwin.yml; only the build target differs per backend. After this commit, llama-cpp-darwin only rebuilds when backend/cpp/llama-cpp/ is touched (verified locally) - same behavior as every other Darwin backend. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 10:18:17 +02:00
LocalAI [bot]	4542833cb4	chore: ⬆️ Update ggml-org/llama.cpp to `9f5f0e689c9e977e5f23a27e344aa36082f44738` (#9724 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-09 10:18:05 +02:00
LocalAI [bot]	f0374aa0e8	ci: finish GHA free-tier migration (per-arch fan-out, image splits, retire self-hosted, fix provenance) (#9730 ) * ci: add per-arch + manifest-merge support for LocalAI server image Mirror the backend_build.yml + backend_merge.yml pattern shipped in PR #9726 for the LocalAI server image: - image_build.yml accepts optional platform-tag (default ''), scopes registry cache to cache-localai<suffix>-<platform-tag>, and pushes by canonical digest only on push events. Digests upload as artifacts named digests-localai<suffix>-<platform-tag>, with a "-core" placeholder when tag-suffix is empty so the merge job's download pattern doesn't over-match across multiple suffixes. - image_merge.yml is a new reusable workflow that downloads matching digest artifacts and assembles the final tagged manifest list via docker buildx imagetools create. Image names differ from backend_.yml: the LocalAI server is published under quay.io/go-skynet/local-ai and localai/localai (not -backends). Not yet wired into image.yml / image-pr.yml — Commit C does that. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> ci: fan out per-arch split to remaining 34 backends Convert all remaining linux/amd64,linux/arm64 entries in backend-matrix.yml to per-arch + manifest-merge form. Each was a single matrix entry running both arches on x86 under QEMU emulation; each becomes two entries — amd64 on ubuntu-latest, arm64 on ubuntu-24.04-arm (native). Four backends that were on bigger-runner (-cpu-llama-cpp, -cpu-turboquant, -gpu-vulkan-llama-cpp, -gpu-vulkan-turboquant) have both legs moved to free tier as part of the same change. They are compile-only (no torch/CUDA install) and fit comfortably with the setup-build-disk /mnt relocation. Phase 4 (next commit) retires the remaining 5 single-arch bigger-runner entries. After this commit: - 271 total matrix entries (was 237) - 0 multi-arch entries left - 36 per-arch pairs (34 new + 2 pilots from PR #9727) - 5 bigger-runner entries remaining (single-arch, Phase 4 target) Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: split LocalAI image multi-arch entries per arch + merge Mirror the backend per-arch split for the main LocalAI image: - image.yml's core-image-build matrix: split the core ('') and -gpu-vulkan entries into amd64 + arm64 legs each. amd64 on ubuntu-latest, arm64 on ubuntu-24.04-arm (native). - New top-level core-image-merge and gpu-vulkan-image-merge jobs call image_merge.yml after core-image-build completes. - image-pr.yml's image-build matrix: split the -vulkan-core entry. No merge job added on the PR side — image_build.yml's digest-push is push-only-event-gated, so a PR-side merge would have nothing to download. After this commit, no workflow file references linux/amd64,linux/arm64 in a single matrix slot. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: retire bigger-runner from backend matrix (Phase 4) Migrate the remaining 5 single-arch bigger-runner entries to ubuntu-latest. Combined with the Phase 3 setup-build-disk /mnt relocation (PR #9726), free-tier ubuntu-latest now has ~100 GB of working space — enough for ROCm dev image (~16 GB), CUDA toolkit (~5 GB), and the per-backend compile/install steps these entries do. Backends migrated: - -gpu-nvidia-cuda-12-llama-cpp - -gpu-nvidia-cuda-12-turboquant - -gpu-rocm-hipblas-faster-whisper - -gpu-rocm-hipblas-coqui - -cpu-ik-llama-cpp After this commit, .github/backend-matrix.yml has zero bigger-runner references. The bigger-runner used in tests-vibevoice-cpp-grpc- transcription (test-extra.yml) is a separate concern handled in a follow-up. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: migrate 9 Intel oneAPI backends to free tier (Phase 5.1) Intel oneAPI base image is ~6 GB; each backend's wheel install stays well within the ~100 GB working space provided by Phase 3's setup-build-disk /mnt relocation. Lowest-risk batch of the arc-runner-set retirement. Backends migrated: vllm, sglang, vibevoice, qwen-asr, nemo, qwen-tts, fish-speech, voxcpm, pocket-tts (all -gpu-intel-* variants). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: migrate 15 ROCm Python backends to free tier (Phase 5.2) ROCm dev image (~16 GB) plus per-backend torch/wheels install fits on ubuntu-latest with the /mnt-relocated Docker root. These entries include the heavier vLLM/sglang/transformers/diffusers stack on ROCm; if any specific backend OOMs or runs out of disk, individual flips back to arc-runner-set are revertable per-entry. Backends migrated: all 15 -gpu-rocm-hipblas-* entries previously on arc-runner-set (vllm/vllm-omni/sglang/transformers/diffusers/ ace-step/kokoro/vibevoice/qwen-asr/nemo/qwen-tts/fish-speech/ voxcpm/pocket-tts/neutts). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: migrate 6 CUDA Python backends to free tier (Phase 5.3) vLLM/sglang stacks on CUDA 12 and CUDA 13 are the heaviest backends in the matrix — flash-attn intermediate layers can spike disk usage during build. setup-build-disk's /mnt relocation gives ~100 GB working space which fits the documented peak. Highest-risk batch of the arc-runner-set retirement; if any backend fails to build on free tier, the per-entry runs-on flip is the unit of revert. Backends migrated: -gpu-nvidia-cuda-{12,13}-{vllm,vllm-omni,sglang}. After this commit, .github/backend-matrix.yml has zero references to arc-runner-set or bigger-runner. The migration is complete. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: disable provenance on multi-registry digest pushes Root-caused on master via PR #9727's pilot: when docker/build-push-action@v7 pushes a single build to TWO registries simultaneously with push-by-digest=true, buildx generates a per-registry provenance attestation manifest (because mode=max — the default for push:true — includes the runner ID). That makes the resulting manifest-list digest diverge across registries: arm64 -cpu-faster-whisper build: image manifest: sha256:d3bdd34b... (identical, content-only) quay manifest list: sha256:66b4cfc8... (with quay attestation) dockerhub manifest list: sha256:e0733c3b... (with dockerhub attestation) steps.build.outputs.digest returns only one of the list digests (empirically the dockerhub one). The merge job then asks "quay.io/...@sha256:e0733c3b..." which doesn't exist on quay — that list has digest 66b4cfc8 there. Result: imagetools create fails with "not found" and the merge job fails (run 25581983094, job 75110021491). Setting provenance: false drops the per-registry attestation; the manifest-list digest becomes pure content, identical across both registries, and steps.build.outputs.digest works on either lookup. Applied to backend_build.yml and image_build.yml — both refactored to use the same multi-registry digest-push pattern in the prior PRs. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 09:37:00 +02:00
Ettore Di Giacinto	fe7b27eb66	test(ci): trigger faster-whisper rebuild to observe per-arch+merge The PR that introduced the per-arch + manifest-merge pilot (#9727) only touched CI infrastructure files, so the path filter correctly skipped backend builds on its merge commit. To observe the new backend-merge-jobs flow assemble a real manifest list, this commit touches faster-whisper's Makefile so its two new per-arch entries schedule and the merge job runs. The trailing comment is the smallest possible diff and is harmless to the build. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 22:09:46 +00:00
LocalAI [bot]	14a3275329	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `98950267c67fd95937a54ebd6e3c66cf2679b710` (#9725 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-09 00:06:05 +02:00
LocalAI [bot]	cb68cd1cf4	ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization (#9727 ) ci: pilot per-arch split for faster-whisper and llama-cpp-quantization Convert two backends from QEMU-emulated multi-arch (linux/amd64,linux/arm64 on a single ubuntu-latest) to native per-arch + manifest-list merge: - amd64 leg on ubuntu-latest - arm64 leg on ubuntu-24.04-arm (native, ~5-10x faster than emulated) - merge job assembles both digests under the final tag via docker buildx imagetools create Backends piloted: - -cpu-faster-whisper (small Python, fast baseline) - -cpu-llama-cpp-quantization (heavier compile path, stress test) Infrastructure changes that the rest of Phase 2 (Tasks 2.5+) will reuse: - .github/backend-matrix.yml entries gain a `platform-tag` field ('amd64'/'arm64') for matrix entries that participate in the split. Other entries omit it; backend_build.yml already defaults missing values to '' (empty cache key suffix preserved as cache<suffix>-). - backend.yml + backend_pr.yml forward `platform-tag` from matrix to the reusable backend_build.yml. - scripts/changed-backends.js groups filtered entries by tag-suffix and emits a `merge-matrix` (plus `has-merges`) for groups of size>=2. Singletons aren't merged. - backend.yml + backend_pr.yml gain a `backend-merge-jobs` job that consumes merge-matrix and calls backend_merge.yml after backend-jobs. PR variant is also event-gated so the no-op-on-PR merge job doesn't even start. The other 34 multi-arch entries are unchanged in this PR -- Task 2.5 fans out the same shape to them once the pilot is observed green. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-09 00:04:42 +02:00
LocalAI [bot]	624fa946f8	feat(swagger): update swagger (#9723 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-08 23:44:55 +02:00
LocalAI [bot]	1f313cfdb0	ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief) (#9726 ) * ci: extract free-disk-space composite action Consolidate the apt-clean + dotnet/android/ghc/boost removal blocks from backend_build.yml, image_build.yml, and test.yml into a single composite action. The three callers had slightly different inline blocks; the composite uses the more aggressive backend_build/image_build variant for all three callers — test.yml jobs now also purge snapd, edge/firefox/ powershell/r-base-core, and sweep /opt/ghc + /usr/local/share/boost + $AGENT_TOOLSDIRECTORY. Idempotent and skipped on self-hosted runners. In test.yml, actions/checkout now runs before the composite action call because the composite lives at ./.github/actions/free-disk-space and requires a checked-out repo. The original ordering relied on jlumbroso/free-disk-space@main being a remote action; this is the minimum-invasive change to support a local composite. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: path-filter backend.yml master push Run scripts/changed-backends.js on master pushes too (not just PRs) so unrelated commits don't rebuild all ~210 backend container images. Tag pushes still build the full matrix via FORCE_ALL. Push events use the GitHub Compare API to diff event.before..event.after. Edge cases (first push with zero base, API truncation beyond 300 files, missing fields, network failure) fall back to "run everything" — better safe than silently miss a backend. The matrix literal moves from .github/workflows/backend.yml into a new data-only file at .github/backend-matrix.yml (outside workflows/ so actionlint doesn't try to parse it as a workflow). Both backend.yml and backend_pr.yml now consume the dynamic matrix output uniformly via fromJson(needs.generate-matrix.outputs.matrix); the script reads the matrix from the new location. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: bound max-parallel on backend-jobs matrices Cap to 8 concurrent jobs to avoid queue starvation on the shared GHA free pool while migration is in flight. Lift after Phases 4-5 retire the self-hosted runners. Also drops a leftover commented-out max-parallel line that lived in backend.yml since the previous matrix shape. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: scope backend cache per arch, push by digest Prepare backend_build.yml for the multi-arch split. The reusable workflow now accepts a `platform-tag` input ("amd64" / "arm64") that scopes the registry cache to cache<suffix>-<platform-tag> and (on push events) pushes the resulting image by canonical digest only. Digests are uploaded as artifacts named digests<suffix>-<platform-tag> for the merge job (Task 2.2) to consume. `platform-tag` is optional with empty default during the migration — existing callers continue to work unchanged (their cache key just becomes `cache<suffix>-`, an orphaned but valid key). Tasks 2.3+ will update callers to pass an explicit "amd64" / "arm64" value. Phase 6 flips the input to required: true once every caller is wired. PR builds keep their existing tag-based push to ci-tests but pick up the per-arch cache key. Multi-arch PR builds remain emulated in this commit; they migrate when the matrix entries split (Tasks 2.3+). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: add backend_merge.yml reusable workflow Joins per-arch digest artifacts (uploaded by backend_build.yml when called with platform-tag) into a single tagged multi-arch manifest list via `docker buildx imagetools create`. Called once per backend by backend.yml after both per-arch build jobs succeed. The workflow generates final tags identically to the previous monolithic build job (same docker/metadata-action invocation), so consumers of quay.io/go-skynet/local-ai-backends and localai/localai-backends see no tag-shape change. Two imagetools calls (one per registry) reference the same per-arch digests under different image names. Not yet wired into backend.yml — Tasks 2.3+ rewrite individual matrix entries to expand into per-arch + merge jobs that call this workflow. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: relocate Docker data-root to /mnt on hosted runners GHA hosted ubuntu-latest runners ship a ~75 GB /mnt drive that's unused by default. Stopping Docker, rsync'ing /var/lib/docker to /mnt, and restarting with data-root pointing there yields ~100 GB of working space (combined with the apt-clean from Task 1.1) — enough for ROCm dev image + vLLM torch install + flash-attn intermediate layers. This is the structural change that lets Phases 4 and 5 of the migration plan move the bigger-runner and arc-runner-set jobs onto ubuntu-latest. The composite action is no-op on self-hosted runners (where /mnt isn't expected) and on non-X64 runners (Task 3.2 verifies the arm64 hosted pool's /mnt shape separately before enabling). Wired into both backend_build.yml and image_build.yml between free-disk-space and the first Docker operation. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(setup-build-disk): chmod 1777 /mnt/docker-tmp buildx CLI runs as the unprivileged 'runner' user and creates config dirs under TMPDIR before binding them into the buildkit container. /mnt is root-owned by default, so the original mkdir produced a permission-denied when buildx tried to write there: ERROR: mkdir /mnt/docker-tmp/buildkitd-config2740457204: permission denied Mirror /tmp's permission mode (1777 — world-writable with sticky bit) on /mnt/docker-tmp so non-root processes can stage their config. Caught by the first PR run (image-build hipblas job) on PR #9726. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: weekly full-matrix rebuild via cron Path-filtering backend.yml master push (the previous commit's main optimization) skips backends whose source didn't change. That broke the DEPS_REFRESH cache-buster's coverage: the build-arg keyed on %Y-W%V busts the install layer's cache on a new ISO week, but only when the build actually runs. Untouched Python backends (torch, transformers, vllm with no version pin) would otherwise ship stale wheels indefinitely. Add a Sunday 06:00 UTC cron that fires the full matrix. Schedule events have no event.ref / event.before, so the script's changedFiles == null fallback (scripts/changed-backends.js) emits the full matrix automatically — no script change needed. C++/Go backends with pinned deps cache-hit and complete fast, so the weekly cost is dominated by Python re-resolves which is exactly what we want. workflow_dispatch added so a maintainer can trigger an ad-hoc full-matrix rebuild without faking a tag push. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 23:43:41 +02:00
LocalAI [bot]	0c1f1e6cbd	chore(model gallery): 🤖 add 1 new models via gallery agent (#9720 ) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-08 16:26:03 +02:00
Richard Palethorpe	670259ce43	chore: Security hardening (#9719 ) * fix(http): close 0.0.0.0/[::] SSRF bypass in /api/cors-proxy The CORS proxy carried its own private-network blocklist (RFC 1918 + a handful of IPv6 ranges) instead of using the same classification as pkg/utils/urlfetch.go. The hand-rolled list missed 0.0.0.0/8 and ::/128, both of which Linux routes to localhost — so any user with FeatureMCP (default-on for new users) could reach LocalAI's own listener and any other service bound to 0.0.0.0:port via: GET /api/cors-proxy?url=http://0.0.0.0:8080/... GET /api/cors-proxy?url=http://[::]:8080/... Replace the custom check with utils.IsPublicIP (Go stdlib IsLoopback / IsLinkLocalUnicast / IsPrivate / IsUnspecified, plus IPv4-mapped IPv6 unmasking) and add an upfront hostname rejection for localhost, .local, and the cloud metadata aliases so split-horizon DNS can't paper over the IP check. The IP-pinning DialContext is unchanged: the validated IP from the single resolution is reused for the connection, so DNS rebinding still cannot swap a public answer for a private one between validate and dial. Regression tests cover 0.0.0.0, 0.0.0.0:PORT, [::], ::ffff:127.0.0.1, ::ffff:10.0.0.1, file://, gopher://, ftp://, localhost, 127.0.0.1, 10.0.0.1, 169.254.169.254, metadata.google.internal. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> fix(downloader): verify SHA before promoting temp file to final path DownloadFileWithContext renamed the .partial file to its final name before checking the streamed SHA, so a hash mismatch returned an error but left the tampered file at filePath. Subsequent code that operated on filePath (a backend launcher, a YAML loader, a re-download that finds the file already present and skips) would consume the attacker-supplied bytes. Reorder: verify the streamed hash first, remove the .partial on mismatch, then rename. The streamed hash is computed during io.Copy so no second read is needed. While here, raise the empty-SHA case from a Debug log to a Warn so "this download had no integrity check" is visible at the default log level. Backend installs currently pass through with no digest; the warning makes that footprint observable without changing behaviour. Regression test asserts os.IsNotExist on the destination after a deliberate SHA mismatch. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(auth): require email_verified for OIDC admin promotion extractOIDCUserInfo read the ID token's "email" claim but never inspected "email_verified". With LOCALAI_ADMIN_EMAIL set, an attacker who could register on the configured OIDC IdP under that email (some IdPs accept self-supplied unverified emails) inherited admin role: - first login: AssignRole(tx, email, adminEmail) → RoleAdmin - re-login: MaybePromote(db, user, adminEmail) → flip to RoleAdmin Add EmailVerified to oauthUserInfo, parse email_verified from the OIDC claims (default false on absence so an IdP that omits the claim cannot short-circuit the gate), and substitute "" for the role-decision email when verified=false via emailForRoleDecision. The user record still stores the unverified email for display. GitHub's path defaults EmailVerified=true: GitHub only returns a public profile email after verification, and fetchGitHubPrimaryEmail explicitly filters to Verified=true. Regression tests cover both the helper contract and integration with AssignRole, including the bootstrap "first user" branch that would otherwise mask the gate. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(cli): refuse public bind when no auth backend is configured When neither an auth DB nor a static API key is set, the auth middleware passes every request through. That is fine for a developer laptop, a home LAN, or a Tailnet — the network itself is the trust boundary. It is not fine on a public IP, where every model install, settings change, and admin endpoint becomes reachable from the internet. Refuse to start in that exact configuration. Loopback, RFC 1918, RFC 4193 ULA, link-local, and RFC 6598 CGNAT (Tailscale's default range) all count as trusted; wildcard binds (`:port`, `0.0.0.0`, `[::]`) are accepted only when every host interface is in one of those ranges. Hostnames are resolved and treated as trusted only when every answer is. A new --allow-insecure-public-bind / LOCALAI_ALLOW_INSECURE_PUBLIC_BIND flag opts out for deployments that gate access externally (a reverse proxy enforcing auth, a mesh ACL, etc.). The error message lists this plus the three constructive alternatives (bind a private interface, enable --auth, set --api-keys). The interface enumeration goes through a package-level interfaceAddrsFn var so tests can simulate cloud-VM, home-LAN, Tailscale-only, and enumeration-failure topologies without poking at the real network stack. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): regression-test the localai_assistant admin gate ChatEndpoint already rejects metadata.localai_assistant=true from a non-admin caller, but the gate was open-coded inline with no direct test coverage. The chat route is FeatureChat-gated (default-on), and the assistant's in-process MCP server can install/delete models and edit configs — the wrong handler change would silently turn the LLM into a confused deputy. Extract the gate into requireAssistantAccess(c, authEnabled) and pin its behaviour: auth disabled is a no-op, unauthenticated is 403, RoleUser is 403, RoleAdmin and the synthetic legacy-key admin are admitted. No behaviour change in the production path. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): assert every API route is auth-classified The auth middleware classifies path prefixes (/api/, /v1/, /models/, etc.) as protected and treats anything else as a static-asset passthrough. A new endpoint shipped under a brand-new prefix — or a new path that simply isn't on the prefix allowlist — would be reachable anonymously. Walk every route registered by API() with auth enabled and a fresh in-memory database (no users, no keys), and assert each API-prefixed route returns 401 / 404 / 405 to an anonymous request. Public surfaces (/api/auth/, /api/branding, /api/node/ token-authenticated routes, /healthz, branding asset server, generated-content server, static assets) are explicit allowlist entries with comments justifying them. Build-tagged 'auth' so it runs against the SQLite-backed auth DB (matches the existing auth suite). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): pin agent endpoint per-user isolation contract agents.go's getUserID / effectiveUserID / canImpersonateUser / wantsAllUsers helpers are the single trust boundary for cross-user access on agent, agent-jobs, collections, and skills routes. A regression there is the difference between "regular user reads their own data" and "regular user reads anyone's data via ?user_id=victim". Lock in the contract: - effectiveUserID ignores ?user_id= for unauthenticated and RoleUser - effectiveUserID honours it for RoleAdmin and ProviderAgentWorker - wantsAllUsers requires admin AND the literal "true" string - canImpersonateUser is admin OR agent-worker, never plain RoleUser No production change — this commit only adds tests. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(downloader): drop redundant stat in removePartialFile The stat-then-remove pattern is a TOCTOU window and a wasted syscall — os.Remove already returns ErrNotExist for the missing-file case, so trust that and treat it as a no-op. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(http): redact secrets from trace buffer and distribution-token logs The /api/traces buffer captured Authorization, Cookie, Set-Cookie, and API-key headers verbatim from every request when tracing was enabled. The endpoint is admin-only but the buffer is reachable via any heap-style introspection and the captured tokens otherwise outlive the request. Strip those header values at capture time. Body redaction is left to a follow-up — the prompts are usually the operator's own and JSON-walking is invasive. Distribution tokens were also logged in plaintext from core/explorer/discovery.go; logs forward to syslog/journald and outlive the token. Redact those to a short prefix/suffix instead. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(auth): rate-limit OAuth callbacks separately from password endpoints The shared 5/min/IP limit on auth endpoints is right for password-style flows but too tight for OAuth callbacks: corporate SSO funnels many real users through one outbound IP and would trip the limit. Add a separate 60/min/IP limiter for /api/auth/{github,oidc}/callback so callbacks are bounded against floods without breaking shared-IP deployments. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(gallery): verify backend tarball sha256 when set in gallery entry GalleryBackend gained an optional sha256 field; the install path now threads it through to the existing downloader hash-verify (which already streams, verifies, and rolls back on mismatch). Galleries without sha256 keep working; the empty-SHA path still emits the existing "downloading without integrity check" warning. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): pin CSRF coverage on multipart endpoints The CSRF middleware in app.go is global (e.Use) so it covers every multipart upload route — branding assets, fine-tune datasets, audio transforms, agent collections. Pin that contract: cross-site multipart POSTs are rejected; same-origin / same-site / API-key clients are not. Also pins the SameSite=Lax fallback path the skipper relies on when Sec-Fetch-Site is absent. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(http): XSS hardening — CSP headers, safe href, base-href escape, SVG sandbox Several closely related XSS-prevention changes spanning the SPA shell, the React UI, and the branding asset server: - New SecurityHeaders middleware sets CSP, X-Content-Type-Options, X-Frame-Options, and Referrer-Policy on every response. The CSP keeps script-src permissive because the Vite bundle relies on inline + eval'd scripts; tightening that requires moving to a nonce-based policy. - The <base href> injection in the SPA shell escaped attacker-controllable Host / X-Forwarded-Host headers — a single quote in the host header broke out of the attribute. Pass through SecureBaseHref (html.EscapeString). - Three React sinks rendering untrusted content via dangerouslySetInnerHTML switch to text-node rendering with whiteSpace: pre-wrap: user message bodies in Chat.jsx and AgentChat.jsx, and the agent activity log in AgentChat.jsx. The hand-rolled escape on the agent user-message variant is replaced by the same plain-text path. - New safeHref util collapses non-allowlisted URI schemes (most importantly javascript:) to '#'. Applied to gallery `<a href={url}>` links in Models / Backends / Manage and to canvas artifact links — these come from gallery JSON or assistant tool calls and must be treated as untrusted. - The branding asset server attaches a sandbox CSP plus same-origin CORP to .svg responses. The React UI loads logos via <img>, but the same URL is also reachable via direct navigation; this prevents script execution if a hostile SVG slipped past upload validation. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(http): bound HTTP server with read-header and idle timeouts A net/http server with no timeouts is trivially Slowloris-able and leaks idle keep-alive connections. Set ReadHeaderTimeout (30s) to plug the slow-headers attack and IdleTimeout (120s) to cap keep-alive sockets. ReadTimeout and WriteTimeout stay at 0 because request bodies can be multi-GB model uploads and SSE / chat completions stream for many minutes; operators who need tighter per-request bounds should terminate slow clients at a reverse proxy. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(auth): pin PUT /api/auth/profile field-tampering contract The handler uses an explicit local body struct (only name and avatar_url) plus a gorm Updates(map) with a column allowlist, so an attacker posting {"role":"admin","email":"...","password_hash":"..."} can't mass-assign those fields. Lock that down with a regression test so a future "let's just c.Bind(&user)" refactor breaks loudly. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(services): strip directory components from multipart upload filenames UploadDataset and UploadToCollectionForUser took the raw multipart file.Filename and joined it into a destination path. The fine-tune upload was incidentally safe because of a UUID prefix that fused any leading '..' to a literal segment, but the protection is fragile. UploadToCollectionForUser handed the filename to a vendored backend without sanitising at all. Strip to filepath.Base at both boundaries and reject the trivial unsafe values ("", ".", "..", "/"). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): validate persisted MCP server entries on load localStorage is shared across same-origin pages; an XSS that lands once can poison persisted MCP server config to attempt header injection or to feed a non-http URL into the fetch path on subsequent loads. Validate every entry: types must match, URL must parse with http(s) scheme, header keys/values must be control-char-free. Drop anything that doesn't fit. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(http): close X-Forwarded-Prefix open redirect The reverse-proxy support concatenated X-Forwarded-Prefix into the redirect target without validation, so a forged header value of "//evil.com" turned the SPA-shell redirect helper at /, /browse, and /browse/* into a 301 to //evil.com/app. The path-strip middleware had the same shape on its prefix-trailing-slash redirect. Add SafeForwardedPrefix at the middleware boundary: must start with a single '/', no protocol-relative '//' opener, no scheme, no backslash, no control characters. Apply at both consumers; misconfig trips the validator and the header is dropped. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(http): refuse wildcard CORS when LOCALAI_CORS=true with empty allowlist When LOCALAI_CORS=true but LOCALAI_CORS_ALLOW_ORIGINS was empty, Echo's CORSWithConfig saw an empty allow-list and fell back to its default AllowOrigins=[""]. An operator who flipped the strict-CORS feature flag without populating the list got the opposite of what they asked for. Echo never sets Allow-Credentials: true so this isn't directly exploitable (cookies aren't sent under wildcard CORS), but the misconfiguration trap is worth closing. Skip the registration and warn. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> feat(auth): zxcvbn password strength check with user-acknowledged override The previous policy was len < 8, which let through "Password1" and the rest of the credential-stuffing corpus. LocalAI has no second factor yet, so the bar needs to sit higher. Add ValidatePasswordStrength using github.com/timbutler/zxcvbn (an actively-maintained fork of the trustelem port; v1.0.4, April 2024): - min 12 chars, max 72 (bcrypt's truncation point) - reject NUL bytes (some bcrypt callers truncate at the first NUL) - require zxcvbn score >= 3 ("safely unguessable, ~10^8 guesses to break"); the hint list ["localai", "local-ai", "admin"] penalises passwords built from the app's own branding zxcvbn produces false positives sometimes (a strong-looking password that happens to match a dictionary word) and operators occasionally need to set a known-weak password (kiosk demos, CI rigs). Add an acknowledgement path: PasswordPolicy{AllowWeak: true} skips the entropy check while still enforcing the hard rules. The structured PasswordErrorResponse marks weak-password rejections as Overridable so the UI can surface a "use this anyway" checkbox. Wired through register, self-service password change, and admin password reset on both the server and the React UI. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): drop HTML5 minLength on new-password inputs minLength={12} on the new-password input let the browser block the form submit silently before any JS or network call ran. The browser focused the field, showed a brief native tooltip, and that was that — no toast, no fetch, no clue. Reproducible by typing fewer than 12 chars on the second password change of a session. The JS-level length check in handleSubmit already shows a toast and the server rejects with a structured error, so the HTML5 attribute was redundant defence anyway. Drop it. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): bundle Geist fonts locally instead of fetching from Google The new CSP correctly refused to apply styles from fonts.googleapis.com because style-src is locked to 'self' and 'unsafe-inline'. Loosening the CSP would defeat its purpose; the right fix is to stop reaching out to a third-party CDN for fonts on every page load. Add @fontsource-variable/geist and @fontsource-variable/geist-mono as npm deps and import them once at boot. Drop the <link rel="preconnect"> and external stylesheet from index.html. Side benefit: no third-party tracking via Referer / IP on every UI load, no failure mode when offline / behind a captive portal. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): refresh i18n strings to reflect 12-char password minimum The translations still said "at least 8 characters" everywhere — the client-side toast on a too-short password change told the user the wrong floor. Update tooShort and newPasswordPlaceholder / newPasswordDescription across all five locales (en, es, it, de, zh-CN) to match the real ValidatePasswordStrength rule. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(auth): make password length-floor overridable like the entropy check The 12-char minimum was a policy choice, not a technical invariant — only "non-empty", "<= 72 bytes", and "no NUL bytes" are real bcrypt constraints. Treating length-12 as a hard rule was inconsistent with the entropy check (already overridable) and friction for use cases where the account is just a name on a session, not a security boundary (single-user kiosk, CI rig, lab demo). Restructure ValidatePasswordStrength: - Hard rules (always enforced): non-empty, <= MaxPasswordLength, no NUL byte - Policy rules (skipped when AllowWeak=true): length >= 12, zxcvbn score >= 3 PasswordError now marks password_too_short as Overridable too. The React forms generalised from `error_code === 'password_too_weak'` to `overridable === true`, and the JS-side preflight length checks were removed (server is source of truth, returns the same checkbox flow). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-08 16:25:45 +02:00
LocalAI [bot]	e5d7b84216	fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717 ) * feat(messaging): add backend.upgrade NATS subject + payload types Splits the slow force-reinstall path off backend.install so it can run on its own subscription goroutine, eliminating head-of-line blocking between routine model loads and full gallery upgrades. Wire-level Force flag on BackendInstallRequest is kept for one release as the rolling-update fallback target; doc note marks it deprecated. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed/worker): add per-backend mutex helper to backendSupervisor Different backend names lock independently; same backend serializes. This is the synchronization primitive used by the upcoming concurrent install handler — without it, wrapping the NATS callback in a goroutine would race the gallery directory when two requests target the same backend. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed/worker): run backend.install handler in a goroutine NATS subscriptions deliver messages serially on a single per-subscription goroutine. With a synchronous install handler, a multi-minute gallery download would head-of-line-block every other install request to the same worker — manifesting upstream as a 5-minute "nats: timeout" on unrelated routine model loads. The body now runs in its own goroutine, with a per-backend mutex (lockBackend) protecting the gallery directory from concurrent operations on the same backend. Different backend names install in parallel. Backward-compat: req.Force=true is still honored here, so an older master that hasn't been updated to send on backend.upgrade keeps working. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed/worker): subscribe to backend.upgrade as a separate path Slow force-reinstall now lives on its own NATS subscription, so a multi-minute gallery pull cannot head-of-line-block the routine backend.install handler on the same worker. Same per-backend mutex guards both — concurrent install + upgrade for the same backend serialize at the gallery directory; different backends are independent. upgradeBackend stops every live process for the backend, force-installs from gallery, and re-registers. It does not start a new process — the next backend.install will spawn one with the freshly-pulled binary. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): add UpgradeBackend on NodeCommandSender; drop Force from InstallBackend Master now sends to backend.upgrade for force-reinstall, with a nats.ErrNoResponders fallback to the legacy backend.install Force=true path so a rolling update with a new master + an old worker still converges. The Force parameter leaves the public Go API surface entirely — only the internal fallback sets it on the wire. InstallBackend timeout drops 5min -> 3min (most replies are sub-second since the worker short-circuits on already-running or already-installed). UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi gallery pulls. Updates the admin install HTTP endpoint (core/http/endpoints/localai/nodes.go) to the new signature too. router_test.go's fakeUnloader does not yet implement the new interface shape; Task 3.2 will catch it up before the next package-level test run. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): update fakeUnloader for new NodeCommandSender shape InstallBackend lost its force bool param (Force is not part of the public Go API anymore — only the internal upgrade-fallback path sets it on the wire). UpgradeBackend gained a method. Fake records both call slices and provides an installHook concurrency seam for upcoming singleflight tests. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): cover UpgradeBackend's new subject + rolling-update fallback Task 3.1 changed the master to publish UpgradeBackend on the new backend.upgrade subject; the existing UpgradeBackend tests scripted the old install subject and so all 3 began failing as expected. Updates them to script SubjectNodeBackendUpgrade with BackendUpgradeReply. Adds two new specs for the rolling-update fallback: - ErrNoResponders on backend.upgrade triggers a backend.install Force=true retry on the same node. - Non-NoResponders errors propagate to the caller unchanged. scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and scriptReplyMatching (predicate-matched canned reply, used to assert that the fallback path actually sets Force=true on the install retry). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): coalesce concurrent identical backend.install via singleflight Six simultaneous chat completions for the same not-yet-loaded model were observed firing six independent NATS install requests, each serializing through the worker's per-subscription goroutine and amplifying queue depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group keyed by (nodeID, backend, modelID, replica): N concurrent identical loads share one round-trip and one reply. Distinct (modelID, replica) keys still fire independent calls, so multi-replica scaling and multi-model fan-out are unaffected. fakeUnloader gains a sync.Mutex around its recording slices to keep concurrent test goroutines race-clean. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(e2e/distributed): drop force arg from InstallBackend test calls Two e2e test call sites still passed the trailing force bool that was removed from RemoteUnloaderAdapter.InstallBackend in `9bde76d7`. Caught by golangci-lint typecheck on the upgrade-split branch (master CI was already green because these tests don't run in the standard test path). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): extract worker business logic to core/services/worker core/cli/worker.go grew to 1212 lines after the backend.upgrade split. The CLI package was carrying backendSupervisor, NATS lifecycle handlers, gallery install/upgrade orchestration, S3 file staging, and registration helpers — all distributed-worker business logic that doesn't belong in the cobra surface. Move it to a new core/services/worker package, mirroring the existing core/services/{nodes,messaging,galleryop} pattern. core/cli/worker.go shrinks to ~19 lines: a kong-tagged shim that embeds worker.Config and delegates Run. No behavior change. All symbols stay unexported except Config and Run. The three worker-specific tests (addr/replica/concurrency) move with the code via git mv so history follows them. Files split as: worker.go - Run entry point config.go - Config struct (kong tags retained, kong not imported) supervisor.go - backendProcess, backendSupervisor, process lifecycle install.go - installBackend, upgradeBackend, findBackend, lockBackend lifecycle.go - subscribeLifecycleEvents (verbatim, decomposition is a follow-up commit) file_staging.go - subscribeFileStaging, isPathAllowed registration.go - advertiseAddr, registrationBody, heartbeatBody, etc. reply.go - replyJSON process_helpers.go - readLastLinesFromFile Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed/worker): decompose subscribeLifecycleEvents into per-event handlers The 226-line subscribeLifecycleEvents method packed eight NATS subscriptions inline. Each grew context-shaped doc comments mixed with subscription plumbing, making it hard to read any one handler without scrolling past the others. Extract each handler into its own method on *backendSupervisor; the subscriber becomes a thin 8-line dispatcher. No behavior change: each method body is byte-equivalent to its corresponding inline goroutine + handler. Doc comments that were attached to the inline SubscribeReply calls migrate to the new method godocs. Adding the next NATS subject is now a 2-line patch to the dispatcher plus one new method, instead of grafting onto a monolith. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 16:24:54 +02:00
LocalAI [bot]	6070aafc69	chore(deps): bump LocalAGI for collection rehydrate-on-init-failure fix (#9721 ) Picks up mudler/LocalAGI#? (commit 941ac52, merged into main): in-process collections backend now registers a placeholder for every on-disk collection at startup — even when the engine wrapper fails to construct (typically because the embedding model is briefly unreachable) — and rehydrates lazily on first access. Previously a transient outage at LocalAI boot silently dropped every existing collection from the agent pool's in-memory map; users saw "collection not found" indefinitely until LocalAI was restarted, even after the embedding service recovered. With this bump the next request to the collection rehydrates it transparently. Does not address the deeper LocalRecall issue where NewPersistentPostgresCollection probes the embedding model at engine construction even for read-only paths — that needs a separate fix in mudler/localrecall. Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 16:24:42 +02:00
LocalAI [bot]	2be07f61da	feat(whisper): honor client cancellation via ggml abort_callback (#9710 ) * refactor(transcription): propagate request ctx through ModelTranscription* Replaces context.Background() with the HTTP request ctx so client disconnects start cancelling the gRPC call. No backend-side abort wiring yet — that comes in a later commit. Pure plumbing. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cli): pass ctx to backend.ModelTranscription Follow-up to `e65d3e1f` which threaded ctx through ModelTranscription but missed the CLI caller. CLI commands have no request-scoped ctx, so context.Background() is correct here. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(audio): propagate request ctx into TTS, sound-gen, audio-transform Same ctx-plumbing pattern applied to the rest of the audio path. CLI callers use context.Background() since there is no request scope; HTTP callers use c.Request().Context(). Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(backend): propagate request ctx into biometric, detection, rerank, diarization paths Replaces remaining context.Background() sites in core/backend with the caller's ctx. After this commit, every core/backend/.go entry point threads the request ctx end-to-end to the gRPC client. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> refactor(grpc): plumb ctx through AIModel.AudioTranscription{,Stream} Adds context.Context as first parameter to the AIModel interface methods that wrap whisper-style transcription. Server-side gRPC handler now forwards the per-RPC ctx (server-streaming uses stream.Context()). Whisper, Voxtral, vibevoice-cpp, and sherpa-onnx accept the parameter; none uses it yet — the actual cancellation primitive lands in the next commit so this is pure plumbing. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper): add abort_callback hook in the C++ bridge Installs a std::atomic<int> flag, wires it into whisper_full_params.abort_callback, and exposes a set_abort(int) C symbol so Go can flip the flag from a goroutine watching the request context. transcribe() now distinguishes abort (return 2) from real whisper_full failure (return 1). Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper): register set_abort symbol in the purego loader Adds the Go-side binding for the new C export so the next commit can call CppSetAbort(1) from a watcher goroutine on ctx.Done(). Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper): honor ctx cancellation and return codes.Canceled A watcher goroutine watches ctx.Done() during AudioTranscription and calls CppSetAbort(1) on cancel. whisper_full sees abort_callback return true at the next compute graph step, returns non-zero, and the bridge returns 2 -> AudioTranscription maps that to codes.Canceled. Adds an opt-in test (gated on WHISPER_MODEL_PATH / WHISPER_AUDIO_PATH) that asserts cancellation latency under 5s and proves the abort flag resets cleanly so the next transcription succeeds. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(whisper): join the cancel watcher goroutine before returning Follow-up to `85edf9d2`. The previous commit used `defer close(done)` and called the watcher "joined synchronously" — but close() only signals, it does not block until the goroutine exits. That left a window where a late CppSetAbort(1) from a cancelled call could land on the next call, after its C-side g_abort reset but before whisper_full() began polling the abort callback, corrupting the second transcription. Switch to a sync.WaitGroup join so wg.Wait() blocks until the watcher has actually returned from its select. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(whisper): short-circuit pre-cancelled ctx in AudioTranscription If ctx is already Done() at entry, return codes.Canceled immediately instead of running the full transcription. The C-side g_abort reset happens at the start of transcribe() and would otherwise overwrite a watcher-set abort flag from an already-cancelled ctx, producing a spurious successful transcription on a request the client has already abandoned. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests/distributed): update testLLM mock for new AudioTranscription signature Phase B (`93c48e19`) added context.Context to AIModel.AudioTranscription but missed the testLLM mock in tests/e2e/distributed. CI golangci-lint caught it: testLLM did not implement grpc.AIModel because the method signature lacked the ctx parameter, which broke the distributed test suite compilation and cascaded through every backend-build job that runs `go build ./...`. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> test(whisper): port cancellation test to Ginkgo/Gomega Project policy (.agents/coding-style.md, enforced by golangci-lint forbidigo) is that all Go tests must use Ginkgo v2 + Gomega — no stdlib testing patterns (t.Skip, t.Fatalf, etc.). Convert the cancellation test to a Describe/It block with Skip(...) for env gating and Expect/HaveOccurred for assertions. Same coverage: cancel mid-flight returns codes.Canceled within 5s and a follow-up transcription succeeds, proving the C-side g_abort flag resets cleanly. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 01:44:47 +02:00
LocalAI [bot]	806130bbc0	chore: ⬆️ Update ggml-org/whisper.cpp to `c81b2dabbc45484dee2ca6658cfe39c841df5c70` (#9712 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-08 01:44:32 +02:00
LocalAI [bot]	3b84582567	chore: ⬆️ Update ggml-org/llama.cpp to `05ff59cb57860cc992fc6dcede32c696efea711c` (#9714 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-08 01:44:17 +02:00
LocalAI [bot]	907929ce60	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `9a26522af234f8db079ae3735f35ab6c20fe2c66` (#9713 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-08 01:43:44 +02:00
Ettore Di Giacinto	e6916ae9b1	fix(gallery/flux.2): enable VAE encoder so image edits actually work stable-diffusion.cpp loads the VAE encoder weights only when the ctx is created with vae_decode_only=false. Our gosd wrapper defaults that flag to true, and the upstream flux2/flux2-klein code paths don't auto-flip it (sd_version_is_unet_edit / sd_version_is_control both return false for VERSION_FLUX2 and VERSION_FLUX2_KLEIN). The CLI compensates by flipping the flag whenever -r/--ref-image is passed, but on the server side we don't know that at load time. Result: requests to /v1/images/generations with `ref_images` against flux.2-dev / flux.2-klein-{4b,9b} would silently skip the encode step (first_stage_model.encoder + tae.encoder are dropped at load via ignore_tensors), and the output had no relation to the reference image — image-editing was effectively unsupported even though stable-diffusion.cpp itself supports it. Add `vae_decode_only:false` to the options of all three flux.2 gallery entries so the encoder is loaded on first launch. Verified on a live instance: a 64x64 striped reference image produces an output that preserves the 3-band layout, confirming the VAE encoder is now wired through. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]	2026-05-07 23:18:08 +00:00
Ettore Di Giacinto	3bc5ae8da6	fix(tests/e2e-backends): bump ctx_size for llama-cpp transcription Qwen3-ASR-0.6B encodes the jfk.wav fixture into 777 audio tokens via its mmproj, but the test harness defaulted BACKEND_TEST_CTX_SIZE to 512, so llama.cpp server rejected every transcription request with "request (777 tokens) exceeds the available context size (512 tokens)". Set BACKEND_TEST_CTX_SIZE=2048 on the llama-cpp transcription target only — sherpa-onnx and vibevoice transcription targets don't go through llama.cpp's slot/n_ctx and weren't failing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]	2026-05-07 22:31:08 +00:00
dependabot[bot]	3234e6d6ba	chore(deps): bump the go_modules group across 1 directory with 8 updates (#9705 ) Bumps the go_modules group with 8 updates in the / directory: \| Package \| From \| To \| \| --- \| --- \| --- \| \| [github.com/buger/jsonparser](https://github.com/buger/jsonparser) \| `1.1.1` \| `1.1.2` \| \| [github.com/antchfx/xpath](https://github.com/antchfx/xpath) \| `1.3.4` \| `1.3.6` \| \| [github.com/cloudflare/circl](https://github.com/cloudflare/circl) \| `1.6.1` \| `1.6.3` \| \| [github.com/go-git/go-git/v5](https://github.com/go-git/go-git) \| `5.16.4` \| `5.18.0` \| \| [github.com/gofiber/fiber/v2](https://github.com/gofiber/fiber) \| `2.52.11` \| `2.52.13` \| \| [github.com/jackc/pgx/v5](https://github.com/jackc/pgx) \| `5.8.0` \| `5.9.2` \| \| [golang.org/x/image](https://github.com/golang/image) \| `0.25.0` \| `0.38.0` \| \| [github.com/ipld/go-ipld-prime](https://github.com/ipld/go-ipld-prime) \| `0.21.0` \| `0.23.0` \| Updates `github.com/buger/jsonparser` from 1.1.1 to 1.1.2 - [Release notes](https://github.com/buger/jsonparser/releases) - [Commits](https://github.com/buger/jsonparser/compare/v1.1.1...v1.1.2) Updates `github.com/antchfx/xpath` from 1.3.4 to 1.3.6 - [Release notes](https://github.com/antchfx/xpath/releases) - [Commits](https://github.com/antchfx/xpath/compare/v1.3.4...v1.3.6) Updates `github.com/cloudflare/circl` from 1.6.1 to 1.6.3 - [Release notes](https://github.com/cloudflare/circl/releases) - [Commits](https://github.com/cloudflare/circl/compare/v1.6.1...v1.6.3) Updates `github.com/go-git/go-git/v5` from 5.16.4 to 5.18.0 - [Release notes](https://github.com/go-git/go-git/releases) - [Changelog](https://github.com/go-git/go-git/blob/main/HISTORY.md) - [Commits](https://github.com/go-git/go-git/compare/v5.16.4...v5.18.0) Updates `github.com/gofiber/fiber/v2` from 2.52.11 to 2.52.13 - [Release notes](https://github.com/gofiber/fiber/releases) - [Commits](https://github.com/gofiber/fiber/compare/v2.52.11...v2.52.13) Updates `github.com/jackc/pgx/v5` from 5.8.0 to 5.9.2 - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](https://github.com/jackc/pgx/compare/v5.8.0...v5.9.2) Updates `golang.org/x/image` from 0.25.0 to 0.38.0 - [Commits](https://github.com/golang/image/compare/v0.25.0...v0.38.0) Updates `github.com/ipld/go-ipld-prime` from 0.21.0 to 0.23.0 - [Release notes](https://github.com/ipld/go-ipld-prime/releases) - [Changelog](https://github.com/ipld/go-ipld-prime/blob/master/CHANGELOG.md) - [Commits](https://github.com/ipld/go-ipld-prime/compare/v0.21.0...v0.23.0) --- updated-dependencies: - dependency-name: github.com/antchfx/xpath dependency-version: 1.3.6 dependency-type: indirect - dependency-name: github.com/buger/jsonparser dependency-version: 1.1.2 dependency-type: indirect - dependency-name: github.com/cloudflare/circl dependency-version: 1.6.3 dependency-type: indirect - dependency-name: github.com/go-git/go-git/v5 dependency-version: 5.18.0 dependency-type: indirect - dependency-name: github.com/gofiber/fiber/v2 dependency-version: 2.52.13 dependency-type: indirect - dependency-name: github.com/ipld/go-ipld-prime dependency-version: 0.23.0 dependency-type: indirect - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.2 dependency-type: indirect - dependency-name: golang.org/x/image dependency-version: 0.38.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-07 17:29:04 +02:00
LocalAI [bot]	595b6fd22d	feat(api/transcription): include segments + duration + language on stream done event (#9709 ) streamTranscription previously emitted a done event with just `text`, matching the OpenAI streaming spec exactly. Streaming clients that need per-utterance timings or audio duration had to fall back to the non-streaming JSON path — and that path is exactly the one that trips on ResponseHeaderTimeout when whisper requests queue behind each other on a SingleThread backend. Extend the done event to additively carry `language`, `duration`, and a `segments` array (id, start, end, text — start/end as float seconds, matching TranscriptionSegmentSeconds). Empty / zero values are still omitted; spec-compliant clients ignore the new fields. This unblocks notary's streaming Transcribe (companion change in the notary repo) so it produces the same TranscriptionResult shape as the JSON path while sidestepping the queue-induced header timeouts. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-07 17:28:26 +02:00
LocalAI [bot]	447c186089	fix(distributed): make backend upgrade actually re-install on workers (#9708 ) * fix(distributed): make backend upgrade actually re-install on workers UpgradeBackend dispatched a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path was skipped, no artifact was re-downloaded, no metadata was written. The frontend's drift detection then re-flagged the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" landed in the logs at the same time. The user-visible symptom: clicking "Upgrade All" silently does nothing and the same N backends sit on the upgrade list forever. Two coupled fixes, one PR: 1. Force flag on backend.install. Add `Force bool` to BackendInstallRequest and thread it through NodeCommandSender -> RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey), skips the findBackend short-circuit, and passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten. After the gallery install completes, startBackend brings up a fresh process at the same processKey on a new port. 2. Liveness check on the fast path. installBackend's "already running" branch read getAddr without verifying the process was alive, so a gRPC backend that died without the supervisor noticing left a stale (key, addr) entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install — and the supervisor said "already running addr=…" again. Loop forever, exactly what we observed on a node whose llama-cpp process had died but whose supervisor record persisted. Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install. Backwards-compatible: the new Force field is omitempty, older workers ignore it (their default behavior matches force=false). The signature change on NodeCommandSender.InstallBackend is internal-only. Verified: unit tests in core/services/nodes pass (108s suite). The pre-existing core/backend build break (proto regen pending for word-level timestamps) blocks core/cli and core/http/endpoints/localai package tests but is unrelated to this change. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * test(e2e/distributed): pass force=false to adapter.InstallBackend NodeCommandSender.InstallBackend gained a final force bool in the upgrade-force commit; the e2e distributed lifecycle tests still called the old 8-arg signature and broke compilation. These tests exercise the routine install path (single replica, default behavior), so force=false preserves their existing semantics. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-07 17:28:14 +02:00
LocalAI [bot]	cec5c4fdfc	fix(http): make handler-error status visible in access log + transcription errors (#9707 ) * fix(http): log accurate status code when handler returns error The custom xlog access-log middleware in API() reads res.Status before Echo's central HTTPErrorHandler runs, so when a handler returns an error without writing a response (e.g. TranscriptEndpoint's `return err` on backend failure) the status field stays at its default 200. The logged line then claims status=200 while the client receives 500 — silently hiding every 500/503/etc. that bubbles up through Echo's error handler. Mirror echo.DefaultHTTPErrorHandler's status derivation when err != nil and the response hasn't been committed: default to 500, upgrade to echo.HTTPError.Code if applicable. The logged status now matches what the client actually sees, so failed transcription requests stop appearing as 200 in the access log. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] fix(transcription): log underlying error before returning 500 to client ModelTranscriptionWithOptions surfaces real failures — gRPC errors from a remote node, model load problems, ffmpeg conversion crashes — but TranscriptEndpoint just did `return err`, so Echo turned it into a 500 with a generic body and the original error was lost. Operators chasing transcription failures across distributed mode were left with "upstream returned 500" on the client and zero context anywhere in the frontend's logs. Add an xlog.Error before returning, recording model name, the staged audio path, and the underlying error. Combined with the access-log status fix, a failing transcription now leaves an audit trail (real status code in the access line, real cause in an Error line) instead of vanishing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-07 17:27:45 +02:00
Richard Palethorpe	c894d9c826	feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686 ) Bring the sglang Python backend up to feature parity with vllm by adding the same engine_args:-map plumbing the vLLM backend already has. Any ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model YAML, including the speculative-decoding flags needed for Multi-Token Prediction. Validation matches the vllm backend's: keys are checked against dataclasses.fields(ServerArgs), unknown keys raise ValueError with a difflib close-match suggestion at LoadModel time, and the typed ModelOptions fields keep their existing meaning with engine_args overriding them. Backend code: * backend/python/sglang/backend.py: add _apply_engine_args, import dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed -> sampling_seed (sglang 0.5.11 renamed the SamplingParams field). * backend/python/sglang/test.py + test.sh + Makefile: six unit tests exercising the helper directly (no engine load required). Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class): * backend/python/sglang/install.sh: add --prerelease=allow because sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels; add --index-strategy=unsafe-best-match for cublas12 so the cu128 torch index wins over default-PyPI's cu130; new pyproject.toml-driven l4t13 install path so [tool.uv.sources] can pin torch/torchvision/ torchaudio/sglang to the jetson-ai-lab index without forcing every transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the equivalent fix in backend/python/vllm/install.sh). * backend/python/sglang/pyproject.toml (new): L4T project spec with explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt for the l4t13 BUILD_PROFILE; other profiles still go through the requirements-.txt pipeline via libbackend.sh's installRequirements. backend/python/sglang/requirements-l4t13.txt: removed; superseded by pyproject.toml. * backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13 (new files) and cu128 torch index for cublas12 (default PyPI now ships cu130 torch wheels by default and breaks cu12 hosts). * backend/index.yaml: add cuda13-sglang and cuda13-sglang-development capability mappings + image entries pointing at quay.io/.../-gpu-nvidia-cuda-13-sglang. * .github/workflows/backend.yml: new cublas13 sglang matrix entry, mirroring vllm's cuda13 build. Model gallery + docs: * gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml. * gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands. * gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads + online fp8 weight quantization, verified end-to-end on a 16 GB RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the MTP draft worker's vocab embedding is loaded unquantised and OOMs the static reservation at sglang's 0.85 default. * gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp, gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang). * docs/content/features/text-generation.md: new SGLang section with setup, engine_args reference, MTP demos, version requirements. * .agents/sglang-backend.md (new): agent one-pager covering the flat ServerArgs structure, the typed-vs-engine_args precedence, the speculative-decoding cheatsheet, and the mem_fraction_static gotcha documented above. * AGENTS.md: index entry for the new agent doc. Known limitation: the two Gemma 4 MTP gallery entries ship a recipe that doesn't yet run on stock libraries. The drafter checkpoints (google/gemma-4-{E2B,E4B}-it-assistant) declare model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which neither transformers (<=5.6.0, including the SGLang cookbook's pinned commit 91b1ab1f... and main HEAD) nor sglang's own model registry (<=0.5.11) registers as of 2026-05-06. They will start working when HF or sglang upstream registers the architecture -- no LocalAI changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work today on this build (verified on RTX 5070 Ti, 16 GB). Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-07 17:27:29 +02:00
Ettore Di Giacinto	048daa0cdc	fix(chatterbox): install chatterbox-tts with --no-deps and pin runtime deps The previous omegaconf pin only addressed one symptom of a deeper problem: chatterbox-tts upstream depends on `russian-text-stresser` (unpinned git URL), which transitively pins `spacy==3.6.` and other ancient packages. That cascade forces pip to backtrack through Jinja2/MarkupSafe/omegaconf into Python-2-era sdists that no longer build (e.g. ruamel.yaml<0.15, Jinja2 2.6 importing the long-removed `setuptools.Feature`). Install chatterbox-tts itself with --no-deps in install.sh and list its real runtime deps explicitly in each requirements-.txt, dropping the optional russian-text-stresser. This unblocks the darwin (and other) builds without playing whack-a-mole on each newly-discovered transitive pin. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]	2026-05-07 09:03:40 +00:00
Ettore Di Giacinto	88f5029a5b	chore(deps): bump LocalAGI/LocalRecall — go-pdfium WASM PDF extractor LocalRecall switched its PDF extractor from gen2brain/go-fitz (cgo + static libmupdf) to klippa-app/go-pdfium with the WebAssembly backend (no cgo, no static libs, no glibc symbol issues). This bump pulls in the new shape via LocalAGI@main → LocalRecall@a7724fe. Why this matters: the previous go-fitz approach broke aarch64 LocalAI builds with `__isoc23_strtol undefined reference` link errors — go-fitz's bundled libmupdf static library is compiled against glibc ≥ 2.38 (C2x strtol intrinsics) but the LocalAI builder uses older glibc. The WASM backend has no native dependencies; same Go binary works on every architecture. Indirect graph: drops gen2brain/go-fitz, ebitengine/purego, jupiterrider/ffi; adds klippa-app/go-pdfium + tetratelabs/wazero. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 07:48:14 +00:00
Ettore Di Giacinto	7c77d3506a	fix(chatterbox): pin omegaconf in every profile requirements file The previous pin in requirements.txt was ineffective: installRequirements runs a separate `pip install --requirement` per file, so resolution does not carry over to the per-profile file where chatterbox-tts is declared. With chatterbox-tts's unpinned `omegaconf` dep, pip backtracked through 1.x sdists into ruamel.yaml<0.15, whose Python-2-era setup.py fails on Python 3.10+. Pin omegaconf==2.3.0 next to chatterbox-tts in every profile file (matches what upstream chatterbox uses). Drop the dead pin from requirements.txt. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]	2026-05-07 07:44:37 +00:00
dependabot[bot]	c96ce99742	chore(deps): bump openssl from 0.10.76 to 0.10.79 in /backend/rust/kokoros in the cargo group across 1 directory (#9694 ) chore(deps): bump openssl Bumps the cargo group with 1 update in the /backend/rust/kokoros directory: [openssl](https://github.com/rust-openssl/rust-openssl). Updates `openssl` from 0.10.76 to 0.10.79 - [Release notes](https://github.com/rust-openssl/rust-openssl/releases) - [Commits](https://github.com/rust-openssl/rust-openssl/compare/openssl-v0.10.76...openssl-v0.10.79) --- updated-dependencies: - dependency-name: openssl dependency-version: 0.10.79 dependency-type: indirect dependency-group: cargo ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-07 08:30:18 +02:00
LocalAI [bot]	840db2fde3	chore(model gallery): 🤖 add 1 new models via gallery agent (#9703 ) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-07 08:29:56 +02:00
LocalAI [bot]	0b9344ef3d	chore: ⬆️ Update leejet/stable-diffusion.cpp to `90e87bc846f17059771efb8aaa31e9ef0cab6f78` (#9701 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-07 08:29:41 +02:00
LocalAI [bot]	151d6c9cf0	chore: ⬆️ Update ggml-org/llama.cpp to `2496f9c14965c39589f53eea31bdb6d762b1d360` (#9698 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-07 08:29:27 +02:00
LocalAI [bot]	659939db9b	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `b93721902b4662f9b973b1c412006081c958d085` (#9697 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-07 08:29:12 +02:00
LocalAI [bot]	392fc9ce3d	fix(auth): cascade user deletion across all owned data on PostgreSQL (#9702 ) * fix(auth): cascade user deletion across all owned data on PostgreSQL Deleting a user from the admin UI in distributed mode (PostgreSQL auth DB) returned "user not found" even when the user clearly existed. The old handler ignored result.Error and only checked RowsAffected, so a foreign-key constraint violation surfaced as a misleading 404. Two issues drove this: 1. invite_codes.created_by / used_by reference users(id) but the InviteCode model declared the FKs without ON DELETE CASCADE. On PostgreSQL the engine therefore rejected the user delete with NO ACTION whenever the user had ever issued or consumed an invite. On SQLite (default in single-node mode) FKs are not enforced, so the bug never appeared there. 2. Several owned tables were never cleaned up regardless of dialect: user_permissions and quota_rules relied on CASCADE that does not fire under SQLite, and usage_records have no FK at all and were left orphaned in every dialect. Introduce auth.DeleteUserCascade which runs the full cleanup in a single transaction: drop invites authored by the user, NULL used_by on invites they consumed (preserves the audit trail), and explicitly wipe sessions, API keys, permissions, quota rules, and usage metrics before deleting the user. The in-memory quota cache is invalidated after commit so a recreated user with the same id never sees stale entries. The HTTP handler now maps the helper's errors to proper status codes — real failures surface as 500 with the cause instead of being swallowed as "not found". Add Ginkgo regression coverage in core/http/auth/users_test.go and core/http/routes/auth_test.go covering invite cleanup, used_by null-out, full data wipe, and the FK-enforced original failure mode (via PRAGMA foreign_keys=ON to mirror PostgreSQL behavior on SQLite). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * chore(deps): bump LocalAGI/LocalRecall — pull in go-fitz PDF extraction Pulls LocalAGI@main (facd888) and LocalRecall@v0.6.0. The latter swaps PDF text extraction from dslipak/pdf to gen2brain/go-fitz (libmupdf bindings) and wraps it in a 60s goroutine timeout — previously certain PDFs (broken xref tables, encrypted, image-only without OCR) would hang indefinitely inside r.GetPlainText() and poison the upload queue. Pure dep bump, no LocalAI source changes. Indirect graph picks up go-fitz + purego + ffi; drops dslipak/pdf. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 08:28:58 +02:00
LocalAI [bot]	31368092a8	chore(model-gallery): ⬆️ update checksum (#9700 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-07 08:28:18 +02:00
LocalAI [bot]	890c070e55	feat(swagger): update swagger (#9699 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-07 08:28:04 +02:00
Tai An	0497bb6595	fix(downloader): list supported URL schemes in DownloadFile error (#9689 ) * fix(downloader): list supported URL schemes when input is unrecognized The error message previously read "does not look like an HTTP URL", but the downloader actually supports file://, huggingface://, hf://, ollama://, oci://, and github:// in addition to http(s)://. Users who type a bare filename or a typo'd scheme (e.g. fle:// instead of file://) get the misleading impression that only HTTP is accepted. Reference the existing prefix constants directly via strings.Join so the scheme list cannot drift when new prefixes are added. Refs #9683. Signed-off-by: Tai An <antai12232931@outlook.com> * fix(downloader): normalize uri.go to LF line endings Resolves the noisy diff and golangci-lint errcheck warnings on lines I did not actually modify. * fix(downloader): preserve trailing newline at end of file --------- Signed-off-by: Tai An <antai12232931@outlook.com>	2026-05-06 21:59:09 +02:00
Ettore Di Giacinto	b2be9729ef	fix(chatterbox): pin omegaconf>=2.0 to prevent resolver backtracking Without an upper-floor pin, pip's resolver backtracks through omegaconf 1.x sdists when installing chatterbox-tts. Old 1.x setups depend on ruamel.yaml<0.15, whose setup.py uses Python-2-era names (Str, Bytes) and fails to build on Python 3.10+, breaking the darwin python backend build. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]	2026-05-06 18:07:32 +00:00
LocalAI [bot]	22ff86d64f	fix(distributed): round-robin replicas of the same model (#9695 ) FindAndLockNodeWithModel previously ordered candidate replicas by in_flight ASC, available_vram DESC. The primary key is correct, but the tiebreaker meant that whenever in_flight tied — the common case at low to moderate concurrency where requests don't overlap — the node with the largest available VRAM won every pick. With autoscaling placing replicas of the same model on multiple nodes, the fattest GPU node ended up taking nearly all the load while the others sat idle. Insert last_used ASC between the two existing tiers. last_used is already refreshed inside the same transaction that increments in_flight (and by TouchNodeModel on cache hits in the router), so the "oldest-used" replica naturally rotates through the candidate set — strict round-robin without a schema change. available_vram DESC is demoted to a final tiebreaker for cold starts where last_used is identical across replicas. Placement queries (FindNodeWithVRAM, FindLeastLoadedNode, and the *FromSet variants) have the same fattest-GPU bias on tiebreakers but are higher-cost to fix consistently. Deferred to a follow-up so the routing fix can land first — for the user-observed symptom routing was the dominant cause anyway. Test: registry_test.go adds a focused spec that loads three replicas on three nodes with 24/16/8 GB VRAM and asserts each is picked at least twice across 9 in_flight-tied calls. Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash] [Grep] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 19:40:54 +02:00
LocalAI [bot]	4e154b59e5	fix(ci): unbreak rerankers (torch bump) and vllm-omni on aarch64 (#9688 ) Two unrelated CI breakages bundled together since both are one-liners: - rerankers: bump torch 2.4.1 -> 2.7.1 on cpu/cublas12. The unpinned transformers resolves to 5.x, whose moe.py registers a custom_op with string-typed `'torch.Tensor'` annotations that torch 2.4.1's infer_schema rejects, blocking the gRPC server from starting and failing all 5 backend tests with "Connection refused" on :50051. Matches the version used by the transformers backend. - vllm-omni: strip fa3-fwd from the upstream requirements/cuda.txt before resolving on aarch64. fa3-fwd 0.0.3 ships only an x86_64 wheel and has no sdist, making the cuda profile unsatisfiable on Jetson/SBSA. fa3-fwd is a soft runtime dep — vllm-omni's attention backends fall back to FA2 then SDPA when it's missing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 17:07:24 +02:00
Richard Palethorpe	969005b2a1	feat(gallery): Speed up load times and clean gallery entries (#9211 ) * feat: Rework VRAM estimation and use known_usecases in gallery Signed-off-by: Richard Palethorpe <io@richiejp.com> Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] * chore(gallery): regenerate gallery index and add known_usecases to model entries Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-06 14:51:38 +02:00
LocalAI [bot]	6d56bf98fe	feat(importers): add vibevoice-cpp importer for GGUF bundles (#9685 ) Routes mudler/vibevoice.cpp-models and similar repos to the vibevoice-cpp backend. Detects via repo name ("vibevoice.cpp"/"vibevoice-cpp"), file listing (vibevoice-*.gguf + tokenizer.gguf), or preferences.backend override. Defaults to the realtime TTS model; preferences.usecase=asr selects the ASR/diarization variant. Bundles the required tokenizer.gguf and (for TTS) a voice prompt, emitting the Options[] entries the backend expects. Registered ahead of VibeVoiceImporter so the C++ bundles aren't swallowed by the older Python-backend substring match. Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 13:33:10 +02:00
LocalAI [bot]	a8d7d37a3c	fix: unbreak master CI (docs, kokoros, vibevoice-cpp ABI) (#9682 ) * fix(docs): correct broken Hugo relrefs The Hugo build has been failing on master since the relevant pages landed: - text-generation.md:720 referenced `/docs/features/distributed-mode`, but Hugo `relref` paths are relative to the content root, not the rendered URL. Drop the `/docs/` prefix so the lookup matches the existing `features/...` form used elsewhere in the file. - audio-transform.md:144 referenced `tts.md`; the actual page is `text-to-audio.md`. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(kokoros): stub Diarize and AudioTransform Backend trait methods The recent backend.proto additions (Diarize, AudioTransform, AudioTransformStream) extended the gRPC Backend trait, breaking kokoros-grpc compilation with E0046 because the Rust implementation hadn't picked up the new methods. Add Unimplemented stubs matching the existing pattern for non-applicable RPCs in this TTS-only backend. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(vibevoice-cpp): track upstream ABI + wire 1.5B voice cloning Two recent commits in mudler/vibevoice.cpp reshaped the vv_capi_tts signature without a corresponding bump on the LocalAI side: 3bd759c "1.5b: unify into a single tts entry point" inserted a ref_audio_path parameter between voice_path and dst_wav_path. ad856bd "1.5b: multi-speaker dialog support" promoted that to a (const char* const* ref_audio_paths, int n_ref_audio_paths) pair for per-speaker conditioning. Because purego resolves symbols by name and not by signature, the build kept linking; at runtime the misaligned arguments turned the TTS->ASR closed-loop test into a SIGSEGV inside cgo. Track HEAD explicitly and bring the bridge in line with it: * Update the CppTTS purego binding to the 9-arg form. purego marshals []byte as a char by handing the C side the underlying array address; nil/empty maps to NULL, which matches the C contract for "no reference audio" on the realtime-0.5B path. Add a `ref_audio` gallery option (comma-separated, repeatable) that the 1.5B path consumes for runtime voice cloning. Multiple entries are interpreted as one WAV per speaker (Speaker 0..n-1). * TTSRequest.Voice now routes by extension/shape: `.wav` or a comma-separated list goes to ref_audio_paths; anything else stays on voice_path (realtime-0.5B's pre-baked voice gguf). * Pin VIBEVOICE_CPP_VERSION to ad856bd and wire the Makefile into the existing bump_deps matrix so future upstream rolls land as reviewable PRs instead of a silent CI break. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(vibevoice-cpp): use ModelOptions.AudioPath for 1.5B ref audio Use the existing audio_path field from ModelOptions (already plumbed through config_file's `audio_path:` YAML and consumed by other audio backends like kokoros) instead of inventing a custom `ref_audio:` Options[] string. Multi-speaker setups stay on a single comma- separated value. No behavior change beyond the gallery key name; per-call routing via TTSRequest.Voice is unchanged. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 10:36:59 +02:00
LocalAI [bot]	06a1524155	chore(model gallery): 🤖 add 1 new models via gallery agent (#9681 ) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-06 08:47:40 +02:00
LocalAI [bot]	70cf8ac546	fix(backend): resolve relative draft_model paths against the models dir (#9680 ) * fix(backend): resolve relative draft_model paths against the models dir The main model file and mmproj are joined with the configured models directory before reaching the backend, but draft_model was sent verbatim. With a relative draft_model in the YAML config, llama.cpp opens the path from the backend process's CWD and fails with "No such file or directory", forcing users to hard-code an absolute path. Mirror the existing mmproj resolution: if draft_model is relative, join it with modelPath. Absolute paths are passed through unchanged. Adds an e2e regression test against the mock backend that asserts the main model file, mmproj, and draft_model all arrive at the backend resolved to absolute paths. Closes #9675 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] * fix(backend): always join draft_model with models dir (drop IsAbs shortcut) The previous commit kept absolute draft_model paths intact via an IsAbs check. That left a path-traversal vector open: a user-supplied YAML config could set draft_model to /etc/passwd (or any other host file the backend process can read) and the path would be sent through unchanged. filepath.Join cleans the leading slash from absolute components, so joining unconditionally — the way mmproj already does — keeps the result rooted at the configured models directory regardless of input. Adds a second e2e spec that feeds an absolute draft_model into the mock backend and asserts the path is clamped under modelsPath. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 00:58:38 +02:00
LocalAI [bot]	7fab5e3d21	chore: ⬆️ Update ggml-org/whisper.cpp to `4bf733672b2871d4153158af4f621a6dd9104f4a` (#9636 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-06 00:34:16 +02:00
Andreas Egli	af83518532	feat: support word-level timestamps for faster-whisper (#9621 ) Signed-off-by: Andreas Egli <github@kharan.ch> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-05-06 00:32:52 +02:00
LocalAI [bot]	a315c321c1	chore: ⬆️ Update TheTom/llama-cpp-turboquant to `69d8e4be47243e83b3d0d71e932bc7aa61c644dc` (#9638 ) ⬆️ Update TheTom/llama-cpp-turboquant Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-06 00:29:05 +02:00
Ettore Di Giacinto	75fba9e03f	fix(distributed): scope Upgrade All to nodes that have the backend installed (#9678 ) In distributed mode the React UI's "Upgrade All" button fanned every detected outdated backend out to every healthy backend node, including nodes that never had that backend installed. On heterogeneous clusters this surfaced as platform errors (e.g. mac-mini-m4 asked to upgrade cpu-insightface-development, which has no darwin/arm64 variant) and left forever-retrying pending_backend_ops rows. DistributedBackendManager.UpgradeBackend now queries ListBackends() first, builds the target node-ID set from SystemBackend.Nodes, and only fans out to those nodes — every per-node primitive (adapter.InstallBackend, the pending-ops queue, BackendOpResult) is unchanged. enqueueAndDrainBackendOp gains an optional targetNodeIDs allowlist; Install/Delete keep their fan-to-everyone semantics by passing nil. If no node reports the backend installed, UpgradeBackend now returns a clear "not installed on any node" error instead of producing a stuck queue. Adds Ginkgo coverage for the smart fan-out: backend on a subset of nodes goes only to those nodes; backend on no node returns the new error and never sends a NATS install request. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-06 00:28:41 +02:00
Richard Palethorpe	16b2d4c807	fix(python-backend): make JIT subprocesses work on hosts of any size (#9679 ) Two related runtime fixes for Python backends that JIT-compile CUDA kernels at first model load (FlashInfer, PyTorch inductor, triton): 1. libbackend.sh: replace `source ${EDIR}/venv/bin/activate` with a minimal manual setup (_activateVenv: export VIRTUAL_ENV, prepend PATH, unset PYTHONHOME) computed from $EDIR at runtime. `uv venv` and `python -m venv` both bake the create-time absolute path into bin/activate (e.g. VIRTUAL_ENV='/vllm/venv' from the Docker build stage), so sourcing activate on a relocated venv — copied out of the build container and unpacked at an arbitrary backend dir — prepends a stale, non-existent path to $PATH. Pip-installed CLI tools (e.g. ninja, used by FlashInfer's NVFP4 GEMM JIT) are then never found and the load aborts with FileNotFoundError. Doing the env setup ourselves matches what `uv run` does internally and sidesteps the relocation problem entirely. Generic — every Python backend benefits. 2. vllm/run.sh: replace ninja's default -j$(nproc)+2 with an adaptive MAX_JOBS = min(nproc, (MemAvailable-4)/4). Each concurrent nvcc/cudafe++ peaks at multiple GiB; the default OOM-kills on memory-tight hosts (e.g. a 16 GiB desktop loading a 27B NVFP4 model) but underutilises 100-core / 1 TB boxes. User-set MAX_JOBS still wins. Also pin NVCC_THREADS=2 unless overridden. Refs: https://github.com/vllm-project/vllm/issues/20079 Assisted-by: Claude:claude-opus-4-7 [Edit] [Bash]	2026-05-06 00:28:01 +02:00
Richard Palethorpe	8e43842175	feat(vllm, distributed): tensor parallel distributed workers (#9612 ) * feat(vllm): build vllm from source for Intel XPU Upstream publishes no XPU wheels for vllm. The Intel profile was silently picking up a non-XPU wheel that imported but errored at engine init, and several runtime deps (pillow, charset-normalizer, chardet) were missing on Intel -- backend.py crashed at import time before the gRPC server came up. Switch the Intel profile to upstream's documented from-source procedure (docs/getting_started/installation/gpu.xpu.inc.md in vllm-project/vllm): - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a cp312 wheel. - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees the dpcpp/sycl compiler from the oneapi-basekit base image. - Hide requirements-intel-after.txt during installRequirements (it used to 'pip install vllm'); install vllm's deps from a fresh git clone of vllm via 'uv pip install -r requirements/xpu.txt', swap stock triton for triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .'. - requirements-intel.txt trimmed to LocalAI's direct deps (accelerate / transformers / bitsandbytes); torch-xpu, vllm, vllm_xpu_kernels and the rest come from upstream's xpu.txt during the source build. - requirements.txt: add pillow + charset-normalizer + chardet -- used by backend.py and missing on the Intel install profile. - run.sh: 'set -x' so backend startup is visible in container logs (the gRPC startup error path was previously opaque). Also adds a one-line docs example for engine_args.attention_backend under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770) need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels. Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct via LocalAI's /v1/chat/completions. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): add multi-node data-parallel follower worker vLLM v1's multi-node story is one process per node sharing a DP coordinator over ZMQ -- the head runs the API server with data_parallel_size > 1 and followers run `vllm serve --headless ...` with matching topology. Today LocalAI can already configure DP on the head via the engine_args YAML map, but there's no way to bring up the follower nodes -- so the head sits waiting for ranks that never handshake. Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural precedent (operator-launched, static config, no NATS placement). The worker: - Optionally self-registers with the frontend as an agent-type node tagged `node.role=vllm-follower` so it's visible in the admin UI and operators can scope ordinary models away via inverse selectors. - Resolves the platform-specific vllm backend via the gallery's "vllm" meta-entry (cuda, intel-vllm, rocm-vllm, ...). - Runs vLLM as a child process so the heartbeat goroutine survives until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its ZMQ sockets before we tear down. - Validates --headless + --start-rank 0 is rejected (rank 0 is the head and must serve the API). Backend run.sh dispatches `serve` as the first arg to vllm's own CLI instead of LocalAI's backend.py gRPC server -- the follower speaks ZMQ directly to the head, there is no LocalAI gRPC on the follower side. Single-node usage is unchanged. Generalises the gallery resolution helper into findBackendPath() shared by MLX and vLLM workers; extracts ParseNodeLabels for the comma-separated label parsing both use. Ships with two compose recipes (`docker-compose.vllm-multinode.yaml` for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is not -- PyTorch's process group requires every rank to use the same collective backend, and NCCL/xccl/gloo don't interoperate. Out of scope (deferred): SmartRouter-driven placement of follower ranks via NATS backend.install events, follower log streaming through /api/backend-logs, tensor-parallel across nodes, disaggregated prefill via KVTransferConfig. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> test(vllm): CPU-only end-to-end test for multi-node DP Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite that brings up a head + headless follower from the locally-built local-ai:tests image, bind-mounts the cpu-vllm backend extracted by make extract-backend-vllm so it's seen as a system backend (no gallery fetch, no registry server), and asserts a chat completion across both DP ranks. New `make test-e2e-vllm-multinode` target wires the docker build, backend extract, and ginkgo run together; BuildKit caches both images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode") so the existing distributed suite isn't pulled along. Two pre-existing bugs surfaced by the test: 1. extract-backend-% (Makefile) failed for every backend, because all backend images end with `FROM scratch` and `docker create` rejects an image with no CMD/ENTRYPOINT. Fixed by passing --entrypoint=/run.sh -- the container is never started, only docker-cp'd, so the path doesn't have to exist; we just need anything that satisfies the daemon's create-time validation. 2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no longer resolves once the backend is relocated to BackendsPath. _makeVenvPortable's shebang rewriter only matches paths that already point at ${EDIR}, so the original shebang slips through unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script as an argument -- Python ignores the script's shebang in that case. The test fixture caps memory aggressively (max_model_len=512, VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for cpu-vllm: torch._inductor's CPU-ISA probe runs even with enforce_eager=True and needs g++ on PATH, which the LocalAI runtime image doesn't ship -- to be addressed in a follow-up that bundles a toolchain in the cpu-vllm backend. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up model for the compilation"`) shells out to `g++` at vllm engine startup, regardless of `enforce_eager=True` -- the eager flag only disables CUDA graphs, not inductor's first-batch warmup. The LocalAI CPU runtime image (Dockerfile, unconditional apt list) does not ship build-essential, and the cpu-vllm backend image is `FROM scratch`, so any non-trivial inference on cpu-vllm crashes with: torch._inductor.exc.InductorError: InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++') Bundling the toolchain in the CPU runtime image would bloat every non-vllm-CPU deployment and force a single GCC version on backends that may want clang or a different version. So this lives in the backend, gated to BUILD_TYPE=='' (the CPU profile). `package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6 (runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/ libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The unversioned binaries on Debian/Ubuntu are symlink chains pointing into multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`, the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves both the version and the arch-triplet variant. Symlinks /lib -> usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain root because Ubuntu's UsrMerge keeps them at /, and ld scripts (`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot re-roots into the toolchain. The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper shell scripts that resolve their own location at runtime and pass `--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/` to the underlying versioned binary. That's how torch's bare `g++ foo.cpp -o foo` invocation finds cc1plus (-B), system headers (--sysroot), and the bundled libstdc++ (--sysroot, --sysroot is recursive into linker). `run.sh` adds the toolchain bin dir to PATH and the toolchain's shared-lib dir to LD_LIBRARY_PATH -- everything else (header search, linker search, executable search) is encapsulated in the wrappers. No-op for non-CPU builds, the dir doesn't exist there. The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm is already a niche profile (few users compared to GPU vllm) and the alternative is a backend that crashes at first inference unless the operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables all torch.compile optimizations. Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the smoke now exercises the real compile path through the bundled toolchain. Test runtime is +20s for the warmup compile, still <90s end to end. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io, which hosts the L4T-specific torch / vllm / flash-attn wheels but also transparently proxies the rest of PyPI through `/+f/<sha>/<filename>` URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match` uv would pick those proxy URLs for ordinary PyPI packages — anthropic/openai/propcache/annotated-types — and fail when the proxy 503s. Master is hitting the same bug on its own l4t-vllm matrix entry. Switch the l4t13 install path to a pyproject.toml that marks the jetson-ai-lab index `explicit = true` and pins only torch, torchvision, torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't consult the L4T mirror for anything else, so transitive deps fall back to PyPI as the default index — no exposure to the proxy 503s. `uv pip install -r requirements.txt` ignores [tool.uv.sources], so the l4t13 branch in install.sh now invokes `uv pip install --requirement pyproject.toml` directly, replacing the old requirements-l4t13*.txt files. Other BUILD_PROFILEs continue using libbackend.sh's installRequirements and never read pyproject.toml. Local resolution test (x86_64, dry-run) confirms uv hits the L4T index for torch and falls through to PyPI for everything else. Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-06 00:22:50 +02:00
Arkadiusz Tymiński	503904d311	fix(faster-whisper): cast segment timestamps to int after multiplication (#9674 ) `int(x) * 1e9` returns a float because `1e9` is a float literal, but TranscriptSegment.start/end are integer protobuf fields. This caused every transcription request to fail with: TypeError: 'float' object cannot be interpreted as an integer Multiply first, then cast — `int(x * 1e9)` — to get an int as required.	2026-05-05 23:46:39 +02:00

1 2 3 4 5 ...

6285 Commits