Mirror of https://github.com/mudler/LocalAI.git (synced 2026-05-17 04:56:52 -04:00)
v4.2.0 · 644 Commits

059c493641
ci(darwin): brew reinstall ccache to handle transitive dep drift
Symptom (PR #9752, run 25638825961, job 75256261163):
dyld[11144]: Library not loaded: /opt/homebrew/opt/fmt/lib/libfmt.12.dylib
Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache
Abort trap: 6
Previous fix (commit

19d59102d5
feat(whisper-cpp): implement streaming transcription (#9751)
* test(whisper): wire e2e streaming transcription target

Adds test-extra-backend-whisper-transcription, mirroring the existing
llama-cpp / sherpa-onnx / vibevoice-cpp targets. The generic
AudioTranscriptionStream spec at tests/e2e-backends/backend_test.go:644 fails
today because backend/go/whisper has no streaming impl - this target is the
failing TDD gate that the next phase makes pass. Confirmed RED locally:
3 Passed (health, load, offline transcription), 1 Failed (streaming spec hits
its 300s context deadline because the base implementation returns
'unimplemented' but doesn't close the result channel, leaving the gRPC stream
open until the client times out).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): expose new_segment_callback to the Go side

Adds set_new_segment_callback() and a C-side trampoline that whisper.cpp
invokes once per new text segment during whisper_full(). The trampoline
dispatches (idx_first, n_new, user_data) to a Go function pointer registered
via purego.NewCallback - text and timings are pulled by Go through the
existing get_segment_text/get_segment_t0/get_segment_t1 getters. Wires the
hook only when streaming is actually requested, to avoid a per-segment
function-pointer dispatch on the offline path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): implement AudioTranscriptionStream

Wires whisper.cpp's new_segment_callback through purego back to Go so the
streaming transcription RPC produces real, time-correlated deltas while
whisper_full() is still decoding. Each segment becomes one
TranscriptStreamResponse{Delta}; whisper_full's return is the
TranscriptStreamResponse{FinalResult} carrying the full segment list,
language, and duration. Per-call state is tracked in a sync.Map keyed by an
atomic counter; the Go callback registered via purego.NewCallback is a
singleton, dispatched through user_data. SingleThread today means only one
entry is ever live, but the map shape matches the sherpa-onnx TTS callback
pattern. The streaming path's final.Text is the literal concat of every
emitted delta (a strings.Builder accumulated by onNewSegment) so the e2e
invariant `final.Text == concat(deltas)` holds exactly. The first delta has
no leading space; subsequent deltas are space-prefixed. The offline
AudioTranscription path is unchanged. Closes the gap with sherpa-onnx,
vibevoice-cpp, llama-cpp, and tinygrad, which already implement
AudioTranscriptionStream. Verified GREEN locally:
make test-extra-backend-whisper-transcription passes 4/4 specs (3 Passed
initially under RED, +1 streaming spec now).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(whisper-cpp): assert progressive multi-segment streaming

Drives AudioTranscriptionStream against a real long-audio fixture and asserts
len(deltas) >= 2. The generic e2e spec at tests/e2e-backends/backend_test.go:644
only checks len(deltas) >= 1 which is satisfied by both real and faked
streaming - this spec is the guardrail that a future "fake" impl can't sneak
past. Skipped by default (env-gated, like the cancellation spec); set
WHISPER_LIBRARY, WHISPER_MODEL_PATH, and WHISPER_AUDIO_PATH to a 30+ second
clip to run. Verified locally with a 55s 5x-JFK concat against
ggml-base.en.bin: 1 Passed in 7.3s, deltas >= 2, finalSegmentCount >= 2,
concat(deltas) == final.Text.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(whisper-cpp): add transcription gRPC e2e job

Mirrors tests-sherpa-onnx-grpc-transcription / tests-llama-cpp-grpc-transcription.
Runs make test-extra-backend-whisper-transcription whenever the whisper
backend or the run-all switch fires, so a pin-bump or refactor that breaks
streaming transcription gets caught before merge. The whisper output on
detect-changes is already emitted by scripts/changed-backends.js (it iterates
allBackendPaths); this PR just exposes it as a workflow output and consumes it.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper-cpp): silence errcheck on AudioTranscriptionStream defers

golangci-lint runs with new-from-merge-base=origin/master, so the identical
defer patterns in the existing offline AudioTranscription path are
grandfathered while the new ones in AudioTranscriptionStream trip errcheck.
Wrap both defers in `func() { _ = ... }()` to match what errcheck wants
without altering behavior. The errors from os.RemoveAll and *os.File.Close
are not actionable inside a defer here (we're already returning), matching
the offline path's contract.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
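The per-call dispatch described above is easiest to see in miniature. A minimal Go sketch, assuming hypothetical stand-ins (getSegmentText, setNewSegmentCallback, whisperFull) for the purego bindings in backend/go/whisper; the real code also carries segment timings and builds the FinalResult message:

```go
// Sketch of the streaming dispatch: per-call state in a sync.Map keyed by an
// atomic counter, a singleton Go callback dispatched through user_data, and a
// strings.Builder whose final value equals the concatenation of every delta.
package whisperstream

import (
	"strings"
	"sync"
	"sync/atomic"
)

var (
	states sync.Map       // call id -> *streamState
	nextID atomic.Uintptr // per-call key handed to C as user_data
)

type streamState struct {
	deltas chan string
	text   strings.Builder
}

// onNewSegment mirrors the C trampoline's (idx_first, n_new, user_data) shape.
func onNewSegment(idxFirst, nNew int32, userData uintptr) {
	v, ok := states.Load(userData)
	if !ok {
		return
	}
	st := v.(*streamState)
	for i := idxFirst; i < idxFirst+nNew; i++ {
		delta := getSegmentText(i)
		if st.text.Len() > 0 {
			delta = " " + delta // first delta unprefixed, later ones space-prefixed
		}
		st.text.WriteString(delta)
		st.deltas <- delta
	}
}

// transcribeStream returns a delta channel plus a getter for the final text,
// so the e2e invariant final.Text == concat(deltas) holds by construction.
func transcribeStream(samples []float32) (<-chan string, func() string) {
	id := nextID.Add(1)
	st := &streamState{deltas: make(chan string, 64)}
	states.Store(id, st)
	go func() {
		defer close(st.deltas) // always close, or the gRPC stream hangs until the client times out
		defer states.Delete(id)
		setNewSegmentCallback(onNewSegment, id)
		whisperFull(samples) // blocks; onNewSegment fires per segment meanwhile
	}()
	return st.deltas, func() string { return st.text.String() }
}

// Stubs standing in for the real purego bindings (names are illustrative).
func getSegmentText(i int32) string { return "" }
func setNewSegmentCallback(cb func(idxFirst, nNew int32, userData uintptr), userData uintptr) {
	// the real binding registers cb once via purego.NewCallback and passes
	// userData to whisper.cpp so the C trampoline can route back here
}
func whisperFull(samples []float32) {}
```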

3f6e493439
ci(darwin): install ccache's runtime dylib deps (blake3, hiredis, xxhash, zstd)
Symptom (run 25634195866, job 75244019809): the Configure ccache step
on the Darwin llama-cpp build aborted with:
dyld[5647]: Library not loaded: /opt/homebrew/opt/blake3/lib/libblake3.0.dylib
Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache
Abort trap: 6
The previous Darwin fix (

35f6db8c76
ci: split backend-jobs into single-arch and multi-arch matrices (#9746)
Symptom (run 25612992409): backend-merge-jobs failed with
"quay.io/go-skynet/local-ai-backends@sha256:fdbd93ca...: not found"
even though the per-arch build for -cpu-llama-cpp pushed that exact
digest 14h31m earlier.
Root cause: backend-merge-jobs was gated on the WHOLE backend-jobs
matrix (`needs: backend-jobs`). The multi-arch -cpu-llama-cpp legs
finished within 30 min, but a single-arch CUDA-12-llama-cpp slot in
the same matrix queued for ~8h (max-parallel: 8 throttle) and then
took ~6h to build cold. By the time it freed the merge to run, quay's
GC had reaped the per-arch digests pushed by the fast multi-arch legs
the day before.
Fix: split the linux backend matrix in two.
backend-jobs-multiarch - entries with `platform-tag` set (paired
per-arch legs that feed backend-merge-jobs).
backend-jobs-singlearch - entries without `platform-tag` (heavy
standalone builds: CUDA, ROCm, Intel oneAPI, vLLM, sglang, etc.).
backend-merge-jobs now `needs:` only backend-jobs-multiarch. The
multi-arch matrix completes in ~2-3h, well inside quay's GC window.
Heavy single-arch entries keep running independently with no merge
dependency.
scripts/changed-backends.js gains a splitByArch() helper that
partitions filtered entries by whether `platform-tag` is set, and
emits matrix-singlearch + matrix-multiarch + has-backends-singlearch
+ has-backends-multiarch outputs (replacing the previous combined
matrix / has-backends pair). Applied in both the full-matrix and
filtered-matrix code paths. Smoke test: 199 single-arch + 72 multi-
arch + 35 darwin = 271 total entries; 36 merge-matrix entries
(one per multi-arch backend pair). Matches expectation.
Local `make backends/<name>` is unaffected — the script's outputs
only feed CI workflow matrices.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
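The partition rule itself is tiny. The real helper is JavaScript in scripts/changed-backends.js; the sketch below renders the same rule in Go purely for illustration:

```go
// Go rendering of the splitByArch() rule described above: entries carrying a
// platform-tag are paired per-arch legs that must feed backend-merge-jobs;
// the rest are heavy standalone builds with no merge dependency.
package matrixsplit

type Entry map[string]any

func splitByArch(entries []Entry) (singleArch, multiArch []Entry) {
	for _, e := range entries {
		if tag, ok := e["platform-tag"].(string); ok && tag != "" {
			multiArch = append(multiArch, e) // paired per-arch legs
		} else {
			singleArch = append(singleArch, e) // CUDA, ROCm, oneAPI, vLLM, sglang, ...
		}
	}
	return singleArch, multiArch
}
```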

7fff858408
ci(base-images): also trigger rebuild on .docker/install-base-deps.sh changes
base-images.yml's master-push trigger had a path filter listing only
backend/Dockerfile.base-grpc-builder and .github/workflows/base-images.yml.
That misses .docker/install-base-deps.sh — which is the actual source of
truth for what goes into each base image (apt deps, gRPC, conditional
CUDA/ROCm/Vulkan installs). The script is bind-mounted into the base
Dockerfile at build time; changes to it would change the produced images, but
without this path filter, the workflow wouldn't auto-rebuild on those changes.
Stale bases would persist until Saturday's cron or a manual workflow_dispatch.

Same applies to .docker/apt-mirror.sh, also bind-mounted by the base
Dockerfile. Add both to the trigger paths so consumer-affecting changes to
either file rebuild the bases automatically.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

593f3a8648
ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738)
* ci(backend_build): plumb builder-base-image and BUILDER_TARGET build-args

Adds an optional builder-base-image input. When set, BUILDER_BASE_IMAGE is
forwarded as a build-arg AND BUILDER_TARGET=builder-prebuilt is set to select
the variant Dockerfile's prebuilt-base stage. When empty,
BUILDER_TARGET=builder-fromsource (the default) keeps the existing
from-source build path. This makes the prebuilt-base optimization opt-in per
matrix entry without breaking local `make backends/<name>` invocations or
backends whose Dockerfile doesn't have a prebuilt path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(llama-cpp,ik-llama-cpp,turboquant): multi-target Dockerfiles for prebuilt + from-source

Restructure the three llama.cpp-derived Dockerfiles so each supports two
builder paths in a single file, selected via the BUILDER_TARGET build-arg:

BUILDER_TARGET=builder-fromsource (default)
- Standalone build: gRPC stage + apt installs + (conditionally)
  CUDA/ROCm/Vulkan + compile.
- Used by `make backends/llama-cpp` locally and any caller that doesn't
  supply a prebuilt base.

BUILDER_TARGET=builder-prebuilt
- FROM ${BUILDER_BASE_IMAGE} (one of quay.io/go-skynet/ci-cache:base-grpc-*
  shipped in PR #9737).
- Skips ~25-35 min of gRPC compile + ~5-10 min of toolchain installs.
- Used by CI when the matrix entry sets builder-base-image.

Final FROM scratch resolves BUILDER_TARGET via an aliasing FROM stage
(BuildKit doesn't support variable expansion directly in COPY --from), then
COPY --from=builder pulls package output from the chosen path. BuildKit
prunes the unreferenced builder, so each build only does the work for the
chosen path. The compile RUN is identical between both builder stages, so
it's factored into .docker/<name>-compile.sh and bind-mounted into both.
ccache mount + cache-id stay per-arch / per-build-type.

Local DX preserved: `make backends/llama-cpp` (no extra args) defaults to
BUILDER_TARGET=builder-fromsource and works exactly as before.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend.yml,backend_pr.yml): forward builder-base-image from matrix

Plumbs the new optional builder-base-image input from matrix into
backend_build.yml. backend_build.yml derives BUILDER_TARGET from whether
builder-base-image is set, so matrix entries that map to a prebuilt base get
the prebuilt path; entries that don't (python/go/rust backends) fall through
to the default builder-fromsource (which their own Dockerfiles don't
reference, so it's a no-op for them).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend-matrix): wire builder-base-image to llama-cpp variants

For every entry whose Dockerfile is llama-cpp/ik-llama-cpp/turboquant, add a
builder-base-image field pointing at the appropriate prebuilt
quay.io/go-skynet/ci-cache:base-grpc-* tag. backend_build.yml derives
BUILDER_TARGET from this field's presence: non-empty -> builder-prebuilt;
empty -> builder-fromsource. So this commit alone activates the prebuilt-base
path for these 23 backends in CI, while local `make backends/<name>` (no
extra args) keeps the from-source path.

Mapping by (build-type, arch):
- '' / amd64 -> base-grpc-amd64
- '' / arm64 -> base-grpc-arm64
- cublas-12 / amd64 -> base-grpc-cuda-12-amd64
- cublas-13 / amd64 -> base-grpc-cuda-13-amd64
- cublas-13 / arm64 -> base-grpc-cuda-13-arm64
- hipblas / amd64 -> base-grpc-rocm-amd64
- vulkan / amd64 -> base-grpc-vulkan-amd64
- vulkan / arm64 -> base-grpc-vulkan-arm64
- sycl_* / amd64 -> base-grpc-intel-amd64
- cublas-12 + JetPack r36.4.0 / arm64 -> base-grpc-l4t-cuda-12-arm64

Cold-build savings expected: ~25-35 min per variant (skips the gRPC compile +
toolchain install that's now in the base).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add base-grpc-l4t-cuda-12-arm64 variant for legacy JetPack entries

Two matrix entries (-nvidia-l4t-arm64-llama-cpp, -nvidia-l4t-arm64-turboquant)
build against nvcr.io/nvidia/l4t-jetpack:r36.4.0 + CUDA 12 ARM64. They're
distinct from -nvidia-l4t-cuda-13-arm64-* which use Ubuntu 24.04 + CUDA 13
sbsa. Add the missing JetPack-based variant to base-images.yml so those two
entries' builder-base-image mapping in the previous commit resolves.

Bootstrap order before merging this PR (re-run base-images.yml on this branch
— 9 existing variants hit BuildKit cache, only the new l4t-cuda-12-arm64
builds cold):
gh workflow run base-images.yml --ref ci/base-images-consumers

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract base-builder install logic into .docker/install-base-deps.sh

Pre-extraction, the apt + protoc + cmake + conditional CUDA/ROCm/Vulkan +
gRPC install logic was duplicated across four files:
- backend/Dockerfile.base-grpc-builder (CI prebuilt-base source of truth)
- backend/Dockerfile.llama-cpp (builder-fromsource stage)
- backend/Dockerfile.ik-llama-cpp (builder-fromsource stage)
- backend/Dockerfile.turboquant (builder-fromsource stage)

A bump to e.g. CUDA toolkit packages had to be made in 4 places, and drift
between the prebuilt base and the variant-Dockerfile from-source path was a
real concern (ik-llama-cpp's hipblas branch was already missing the rocBLAS
Kernels echo that llama-cpp / turboquant / base-grpc-builder all had).

Factor the install logic into a single .docker/install-base-deps.sh that
reads its inputs from env vars and runs conditionally on BUILD_TYPE /
CUDA_*_VERSION / TARGETARCH. Each Dockerfile now bind-mounts the script
alongside .docker/apt-mirror.sh and invokes it from a single RUN step. The
variant Dockerfiles' grpc-source stage is removed entirely — the script
handles gRPC compile + install at /opt/grpc, and the builder-fromsource stage
mirrors builder-prebuilt by copying /opt/grpc/. to /usr/local/.

Result:
- install-base-deps.sh: 244 lines (one source of truth)
- Dockerfile.base-grpc-builder: 268 -> 98 lines
- Dockerfile.llama-cpp: 361 -> 157 lines
- Dockerfile.ik-llama-cpp: 348 -> 151 lines
- Dockerfile.turboquant: 355 -> 154 lines
- Total Dockerfile lines: 1332 -> 560 (58% reduction)

Bit-equivalence between prebuilt and from-source paths is now enforced by
construction: both invoke the same script with the same inputs. A side-effect
is that ik-llama-cpp now also gets the rocBLAS Kernels echo + clblas block
parity it was previously missing. Includes the BUILD_TYPE=clblas branch
(libclblast-dev) for parity even though no current CI matrix entry uses it.

After this commit's force-push, base-images.yml needs to be redispatched on
this branch — the Dockerfile.base-grpc-builder content shifts so the existing
cache won't apply for the install layer (gRPC layer also rebuilds since it's
now in the same RUN step).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(base-images): skip-drivers on JetPack l4t variant

cuda-nvcc-12-0 isn't installable via apt on the JetPack r36.4.0 base image —
JetPack ships CUDA preinstalled at /usr/local/cuda and its apt feed doesn't
carry the cuda-nvcc-* packages from the public repositories. The original
matrix entry for -nvidia-l4t-arm64-llama-cpp on master sets
skip-drivers: 'true' for exactly this reason; the new
base-grpc-l4t-cuda-12-arm64 base needs to match.

Also forwards SKIP_DRIVERS as a build-arg from matrix into the build (was
missing entirely before this commit).

Caught by run 25612030775 — l4t-cuda-12-arm64 failed at:
E: Package 'cuda-nvcc-12-0' has no installation candidate

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
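The builder-base-image values follow the (build-type, arch) mapping listed above. In CI the mapping is written out by hand per matrix entry; the Go sketch below just restates it as a lookup for illustration (the JetPack r36.4.0 entries need the extra base-image dimension and are not captured by a two-argument lookup):

```go
// Illustration of the builder-base-image mapping described above. An empty
// result corresponds to BUILDER_TARGET=builder-fromsource.
package baseimage

import "strings"

const registry = "quay.io/go-skynet/ci-cache:base-grpc-"

func builderBaseImage(buildType, arch string) string {
	switch {
	case buildType == "":
		return registry + arch // base-grpc-amd64 / base-grpc-arm64
	case buildType == "cublas-12" && arch == "amd64":
		return registry + "cuda-12-amd64"
	case buildType == "cublas-13":
		return registry + "cuda-13-" + arch
	case buildType == "hipblas" && arch == "amd64":
		return registry + "rocm-amd64"
	case buildType == "vulkan":
		return registry + "vulkan-" + arch
	case strings.HasPrefix(buildType, "sycl_") && arch == "amd64":
		return registry + "intel-amd64"
	default:
		return "" // no prebuilt base -> from-source path
	}
}
```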

acc5588d2c
ci(darwin): force-link brew formulas after cache restore
Symptom: `ccache: command not found` in the Configure ccache step on runs
that hit the brew cache.

Root cause: actions/cache restores /opt/homebrew/Cellar/<formula> but NOT the
bin symlinks at /opt/homebrew/bin/*. The subsequent `brew install` sees the
Cellar entries present and decides "already installed" — without re-running
the link step. So on cache-hit runs none of the cached formulas are actually
on PATH.

Fix: explicit `brew link --overwrite` for every formula we install, right
after `brew install`. --overwrite tolerates leftover symlinks from a partial
earlier install. The 2>/dev/null + || true keeps the step from failing if a
formula is already correctly linked.

Pre-existing flake; surfaces more often as Darwin matrix coverage grows after
the llama-cpp-darwin consolidation in #9731.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

28e29625a2
ci: add pre-built base-grpc-builder image infrastructure (PR 1/2) (#9737)
Introduces a parameterized Dockerfile.base-grpc-builder that produces
a fully-prepped builder base image (apt deps + protoc + cmake + gRPC
at /opt/grpc + conditional CUDA/ROCm/Vulkan toolchains) and a
base-images.yml workflow that builds + pushes 9 variants to
quay.io/go-skynet/ci-cache:base-grpc-*:
base-grpc-amd64 (Ubuntu 24.04, CPU-only)
base-grpc-arm64 (Ubuntu 24.04, CPU-only)
base-grpc-cuda-12-amd64 (Ubuntu 24.04 + CUDA 12.8)
base-grpc-cuda-13-amd64 (Ubuntu 22.04 + CUDA 13.0)
base-grpc-cuda-13-arm64 (Ubuntu 24.04 + CUDA 13.0 sbsa)
base-grpc-rocm-amd64 (rocm/dev-ubuntu-24.04:7.2.1 + hipblas)
base-grpc-vulkan-amd64 (Ubuntu 24.04 + Vulkan SDK 1.4.335)
base-grpc-vulkan-arm64 (Ubuntu 24.04 + Vulkan SDK ARM 1.4.335)
base-grpc-intel-amd64 (intel/oneapi-basekit:2025.3.2)
The variant Dockerfiles (Dockerfile.llama-cpp, ik-llama-cpp, turboquant)
are NOT touched in this PR. PR 2 will refactor them to FROM these
prebuilt bases. This PR is intentionally inert - landing it changes no
existing CI behavior. The base images don't exist on quay until
someone manually triggers the workflow.
Bootstrap after merge:
gh workflow run base-images.yml --ref master
Wait ~30 min for all 9 variants to push, then merge PR 2 (the
consumer-side refactor that uses BUILDER_BASE_IMAGE build-arg to
FROM these tags).
Triggers afterwards:
- Saturdays 05:00 UTC (cron) - picks up upstream security updates,
runs ~24h before the backend.yml Sunday cron so bases are fresh.
- workflow_dispatch - manual ad-hoc rebuild.
- master push touching Dockerfile.base-grpc-builder or this workflow.
Why split into two PRs: the variant Dockerfiles in PR 2 will FROM the
prebuilt bases and have no from-source fallback. Their CI builds fail
if the bases don't exist on quay yet. Landing infrastructure first +
manual bootstrap + then consumer refactor avoids a broken-master window.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

6d2b7d893a
ci: drop paths-ignore from test.yml and tests-e2e.yml
These workflows are configured as required status checks in branch
protection. With paths-ignore matching the PR diff, the workflow doesn't
trigger and no status is reported — branch protection then blocks the PR with
"Expected — Waiting for status to be reported" indefinitely. Especially
common for backend-only PRs since the ignore list included backend/**.

Run the full test suite on every PR. Cost is ~5 min per PR for tests-linux +
~similar for tests-apple + the e2e backend smoke; small trade for unblocking
PR merges.

Workflows affected:
- tests-linux (1.26.x), tests-apple (1.26.x) in test.yml
- tests-e2e-backend (1.25.x) in tests-e2e.yml

Other workflows that still have paths-ignore (none currently in the
required-checks list) are left as-is — adding them to required later would
re-introduce the same problem.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

5a12392570
ci(concurrency): make cancel-in-progress event-aware, group by sha on push
Yesterday two PRs (#9724 llama.cpp bump, #9731 llama-cpp-darwin
consolidation) merged 11 seconds apart. Both shared the same backend.yml
concurrency group (ci-backends-refs/heads/master-...) due to
"${{ github.head_ref || github.ref }}" — empty head_ref on push events falls
through to the static refs/heads/master. With cancel-in-progress: true that
meant the second merge cancelled the first's in-flight backend builds. The
first PR's CI never finished; the second PR only touched CI files so its run
was a no-op.

Two changes per workflow:
- group: replace "${{ github.head_ref || github.ref }}" with
  "${{ github.event.pull_request.number || github.sha }}". On PRs this groups
  by PR number (same as before, just keyed on number not branch name); on
  push events it groups per-commit, so two master pushes never share a group.
- cancel-in-progress: gate on github.event_name == 'pull_request' so rapid
  pushes to a PR still cancel old runs (newer push wins) but master pushes
  never cancel each other.

Trade-off vs alternatives:
- Merge queue would also solve this and additionally test the merged commit
  before it lands. Heavier process change; out of scope here.
- Allowing per-commit master concurrency means two simultaneous master runs
  may overlap and race on tag pushes, but each commit's manifest digest is
  unique and the registry is last-writer-wins on tags — newer commit's tag
  overwrites older.

Applied to 11 workflows that share the same concurrency pattern: backend.yml,
backend_pr.yml, image.yml, image-pr.yml, lint.yml, test.yml, test-extra.yml,
tests-e2e.yml, tests-aio.yml, tests-ui-e2e.yml, generate_intel_image.yaml.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
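The two expression changes reduce to a small decision table. A Go sketch of the equivalent logic for illustration (the real change is the YAML `concurrency:` block repeated across the 11 workflows):

```go
// Group by PR number on pull_request events and by commit SHA on push events;
// only pull_request runs may cancel an in-flight run in the same group.
package concurrency

import "fmt"

func concurrencyKey(workflow, eventName, prNumber, sha string) (group string, cancelInProgress bool) {
	key := prNumber
	if key == "" { // push events carry no PR number -> per-commit grouping
		key = sha
	}
	return fmt.Sprintf("%s-%s", workflow, key), eventName == "pull_request"
}
```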

05d6383393
Change vibevoice.cpp repository reference
Updated repository reference for vibevoice.cpp in bump_deps.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

733c254b32
ci: consolidate llama-cpp-darwin into the matrix-driven Darwin flow (#9731)
The bespoke llama-cpp-darwin + llama-cpp-darwin-publish top-level jobs in
backend.yml ran unconditionally on every backend.yml trigger (push/cron),
bypassing the path filter that all 34 other Darwin backends already honor via
backend-jobs-darwin -> backend_build_darwin.yml.

Move llama-cpp into the includeDarwin matrix:
- New entry in .github/backend-matrix.yml (lang=go, no build-type).
- backend_build_darwin.yml gains an `if: inputs.backend == 'llama-cpp'` build
  step that drives `make backends/llama-cpp-darwin`. The bespoke script
  (scripts/build/llama-cpp-darwin.sh) compiles three CMake variants from
  backend/cpp/llama-cpp and bundles dylibs via otool, so it doesn't fit the
  build-darwin-go-backend mold; the existing llama-cpp-aware ccache setup
  blocks already in this workflow are what motivated the consolidation in the
  first place.
- scripts/changed-backends.js's inferBackendPathDarwin gains a special case
  so llama-cpp on Darwin maps to backend/cpp/llama-cpp/ (the C++ source tree)
  rather than the non-existent backend/go/llama-cpp/.
- Bumps Darwin go-version from 1.24.x -> 1.25.x in backend.yml and
  backend_pr.yml so llama-cpp keeps the Go toolchain it had under the bespoke
  job; the other 34 Darwin backends pick this up too with no known reason to
  pin 1.24.
- Removes ~80 lines of bespoke YAML from backend.yml.

The publish path is unchanged in shape - every Darwin backend now uses the
same crane-push leg from ubuntu-latest in backend_build_darwin.yml; only the
build target differs per backend.

After this commit, llama-cpp-darwin only rebuilds when backend/cpp/llama-cpp/
is touched (verified locally) - same behavior as every other Darwin backend.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
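The path-inference special case is small. The real function is inferBackendPathDarwin in scripts/changed-backends.js (JavaScript); the Go sketch below restates the rule under the assumption that the default Darwin path is backend/go/<name>/:

```go
// llama-cpp's Darwin build is driven from the C++ source tree, so it cannot
// use the default Go-backend path inference.
package backendpaths

func inferBackendPathDarwin(backend string) string {
	if backend == "llama-cpp" {
		return "backend/cpp/llama-cpp/" // bespoke CMake build, no backend/go/llama-cpp/
	}
	return "backend/go/" + backend + "/"
}
```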

f0374aa0e8
ci: finish GHA free-tier migration (per-arch fan-out, image splits, retire self-hosted, fix provenance) (#9730)
* ci: add per-arch + manifest-merge support for LocalAI server image

Mirror the backend_build.yml + backend_merge.yml pattern shipped in PR #9726
for the LocalAI server image:
- image_build.yml accepts optional platform-tag (default ''), scopes registry
  cache to cache-localai<suffix>-<platform-tag>, and pushes by canonical
  digest only on push events. Digests upload as artifacts named
  digests-localai<suffix>-<platform-tag>, with a "-core" placeholder when
  tag-suffix is empty so the merge job's download pattern doesn't over-match
  across multiple suffixes.
- image_merge.yml is a new reusable workflow that downloads matching digest
  artifacts and assembles the final tagged manifest list via
  docker buildx imagetools create. Image names differ from backend_*.yml: the
  LocalAI server is published under quay.io/go-skynet/local-ai and
  localai/localai (not -backends).

Not yet wired into image.yml / image-pr.yml — Commit C does that.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: fan out per-arch split to remaining 34 backends

Convert all remaining linux/amd64,linux/arm64 entries in backend-matrix.yml
to per-arch + manifest-merge form. Each was a single matrix entry running
both arches on x86 under QEMU emulation; each becomes two entries — amd64 on
ubuntu-latest, arm64 on ubuntu-24.04-arm (native).

Four backends that were on bigger-runner (-cpu-llama-cpp, -cpu-turboquant,
-gpu-vulkan-llama-cpp, -gpu-vulkan-turboquant) have both legs moved to free
tier as part of the same change. They are compile-only (no torch/CUDA
install) and fit comfortably with the setup-build-disk /mnt relocation.
Phase 4 (next commit) retires the remaining 5 single-arch bigger-runner
entries.

After this commit:
- 271 total matrix entries (was 237)
- 0 multi-arch entries left
- 36 per-arch pairs (34 new + 2 pilots from PR #9727)
- 5 bigger-runner entries remaining (single-arch, Phase 4 target)

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: split LocalAI image multi-arch entries per arch + merge

Mirror the backend per-arch split for the main LocalAI image:
- image.yml's core-image-build matrix: split the core ('') and -gpu-vulkan
  entries into amd64 + arm64 legs each. amd64 on ubuntu-latest, arm64 on
  ubuntu-24.04-arm (native).
- New top-level core-image-merge and gpu-vulkan-image-merge jobs call
  image_merge.yml after core-image-build completes.
- image-pr.yml's image-build matrix: split the -vulkan-core entry. No merge
  job added on the PR side — image_build.yml's digest-push is
  push-only-event-gated, so a PR-side merge would have nothing to download.

After this commit, no workflow file references linux/amd64,linux/arm64 in a
single matrix slot.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: retire bigger-runner from backend matrix (Phase 4)

Migrate the remaining 5 single-arch bigger-runner entries to ubuntu-latest.
Combined with the Phase 3 setup-build-disk /mnt relocation (PR #9726),
free-tier ubuntu-latest now has ~100 GB of working space — enough for ROCm
dev image (~16 GB), CUDA toolkit (~5 GB), and the per-backend compile/install
steps these entries do.

Backends migrated:
- -gpu-nvidia-cuda-12-llama-cpp
- -gpu-nvidia-cuda-12-turboquant
- -gpu-rocm-hipblas-faster-whisper
- -gpu-rocm-hipblas-coqui
- -cpu-ik-llama-cpp

After this commit, .github/backend-matrix.yml has zero bigger-runner
references. The bigger-runner used in tests-vibevoice-cpp-grpc-transcription
(test-extra.yml) is a separate concern handled in a follow-up.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 9 Intel oneAPI backends to free tier (Phase 5.1)

Intel oneAPI base image is ~6 GB; each backend's wheel install stays well
within the ~100 GB working space provided by Phase 3's setup-build-disk /mnt
relocation. Lowest-risk batch of the arc-runner-set retirement.

Backends migrated: vllm, sglang, vibevoice, qwen-asr, nemo, qwen-tts,
fish-speech, voxcpm, pocket-tts (all -gpu-intel-* variants).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 15 ROCm Python backends to free tier (Phase 5.2)

ROCm dev image (~16 GB) plus per-backend torch/wheels install fits on
ubuntu-latest with the /mnt-relocated Docker root. These entries include the
heavier vLLM/sglang/transformers/diffusers stack on ROCm; if any specific
backend OOMs or runs out of disk, individual flips back to arc-runner-set are
revertable per-entry.

Backends migrated: all 15 -gpu-rocm-hipblas-* entries previously on
arc-runner-set (vllm/vllm-omni/sglang/transformers/diffusers/ace-step/kokoro/
vibevoice/qwen-asr/nemo/qwen-tts/fish-speech/voxcpm/pocket-tts/neutts).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 6 CUDA Python backends to free tier (Phase 5.3)

vLLM/sglang stacks on CUDA 12 and CUDA 13 are the heaviest backends in the
matrix — flash-attn intermediate layers can spike disk usage during build.
setup-build-disk's /mnt relocation gives ~100 GB working space which fits the
documented peak. Highest-risk batch of the arc-runner-set retirement; if any
backend fails to build on free tier, the per-entry runs-on flip is the unit
of revert.

Backends migrated: -gpu-nvidia-cuda-{12,13}-{vllm,vllm-omni,sglang}.

After this commit, .github/backend-matrix.yml has zero references to
arc-runner-set or bigger-runner. The migration is complete.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: disable provenance on multi-registry digest pushes

Root-caused on master via PR #9727's pilot: when docker/build-push-action@v7
pushes a single build to TWO registries simultaneously with
push-by-digest=true, buildx generates a per-registry provenance attestation
manifest (because mode=max — the default for push:true — includes the runner
ID). That makes the resulting manifest-list digest diverge across registries:

arm64 -cpu-faster-whisper build:
image manifest: sha256:d3bdd34b... (identical, content-only)
quay manifest list: sha256:66b4cfc8... (with quay attestation)
dockerhub manifest list: sha256:e0733c3b... (with dockerhub attestation)

steps.build.outputs.digest returns only one of the list digests (empirically
the dockerhub one). The merge job then asks "quay.io/...@sha256:e0733c3b..."
which doesn't exist on quay — that list has digest 66b4cfc8 there. Result:
imagetools create fails with "not found" and the merge job fails
(run 25581983094, job 75110021491).

Setting provenance: false drops the per-registry attestation; the
manifest-list digest becomes pure content, identical across both registries,
and steps.build.outputs.digest works on either lookup. Applied to
backend_build.yml and image_build.yml — both refactored to use the same
multi-registry digest-push pattern in the prior PRs.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>

cb68cd1cf4
ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization (#9727)
ci: pilot per-arch split for faster-whisper and llama-cpp-quantization
Convert two backends from QEMU-emulated multi-arch (linux/amd64,linux/arm64
on a single ubuntu-latest) to native per-arch + manifest-list merge:
- amd64 leg on ubuntu-latest
- arm64 leg on ubuntu-24.04-arm (native, ~5-10x faster than emulated)
- merge job assembles both digests under the final tag via
docker buildx imagetools create
Backends piloted:
- -cpu-faster-whisper (small Python, fast baseline)
- -cpu-llama-cpp-quantization (heavier compile path, stress test)
Infrastructure changes that the rest of Phase 2 (Tasks 2.5+) will reuse:
- .github/backend-matrix.yml entries gain a `platform-tag` field
('amd64'/'arm64') for matrix entries that participate in the split.
Other entries omit it; backend_build.yml already defaults missing
values to '' (empty cache key suffix preserved as cache<suffix>-).
- backend.yml + backend_pr.yml forward `platform-tag` from matrix to
the reusable backend_build.yml.
- scripts/changed-backends.js groups filtered entries by tag-suffix
and emits a `merge-matrix` (plus `has-merges`) for groups of size>=2.
Singletons aren't merged.
- backend.yml + backend_pr.yml gain a `backend-merge-jobs` job that
consumes merge-matrix and calls backend_merge.yml after backend-jobs.
PR variant is also event-gated so the no-op-on-PR merge job doesn't
even start.
The other 34 multi-arch entries are unchanged in this PR -- Task 2.5
fans out the same shape to them once the pilot is observed green.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
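The merge-matrix derivation is the only new logic in scripts/changed-backends.js here; the Go sketch below restates the rule (group by tag-suffix, merge only groups with two or more per-arch legs) for illustration, with a simplified Entry shape that is not the script's actual data model:

```go
// Singletons are never merged: only suffixes with both per-arch legs present
// produce a merge-matrix entry (one merge job per backend).
package mergematrix

type Entry struct {
	TagSuffix   string
	PlatformTag string // "amd64" / "arm64"; empty for non-split entries
}

func mergeMatrix(filtered []Entry) (merges []string, hasMerges bool) {
	bySuffix := map[string]int{}
	for _, e := range filtered {
		if e.PlatformTag != "" {
			bySuffix[e.TagSuffix]++
		}
	}
	for suffix, legs := range bySuffix {
		if legs >= 2 {
			merges = append(merges, suffix)
		}
	}
	return merges, len(merges) > 0
}
```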

1f313cfdb0
ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief) (#9726)
* ci: extract free-disk-space composite action

Consolidate the apt-clean + dotnet/android/ghc/boost removal blocks from
backend_build.yml, image_build.yml, and test.yml into a single composite
action. The three callers had slightly different inline blocks; the composite
uses the more aggressive backend_build/image_build variant for all three
callers — test.yml jobs now also purge snapd, edge/firefox/powershell/
r-base-core, and sweep /opt/ghc + /usr/local/share/boost +
$AGENT_TOOLSDIRECTORY. Idempotent and skipped on self-hosted runners.

In test.yml, actions/checkout now runs before the composite action call
because the composite lives at ./.github/actions/free-disk-space and requires
a checked-out repo. The original ordering relied on
jlumbroso/free-disk-space@main being a remote action; this is the
minimum-invasive change to support a local composite.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: path-filter backend.yml master push

Run scripts/changed-backends.js on master pushes too (not just PRs) so
unrelated commits don't rebuild all ~210 backend container images. Tag pushes
still build the full matrix via FORCE_ALL. Push events use the GitHub Compare
API to diff event.before..event.after. Edge cases (first push with zero base,
API truncation beyond 300 files, missing fields, network failure) fall back
to "run everything" — better safe than silently miss a backend.

The matrix literal moves from .github/workflows/backend.yml into a new
data-only file at .github/backend-matrix.yml (outside workflows/ so
actionlint doesn't try to parse it as a workflow). Both backend.yml and
backend_pr.yml now consume the dynamic matrix output uniformly via
fromJson(needs.generate-matrix.outputs.matrix); the script reads the matrix
from the new location.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: bound max-parallel on backend-jobs matrices

Cap to 8 concurrent jobs to avoid queue starvation on the shared GHA free
pool while migration is in flight. Lift after Phases 4-5 retire the
self-hosted runners. Also drops a leftover commented-out max-parallel line
that lived in backend.yml since the previous matrix shape.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: scope backend cache per arch, push by digest

Prepare backend_build.yml for the multi-arch split. The reusable workflow now
accepts a `platform-tag` input ("amd64" / "arm64") that scopes the registry
cache to cache<suffix>-<platform-tag> and (on push events) pushes the
resulting image by canonical digest only. Digests are uploaded as artifacts
named digests<suffix>-<platform-tag> for the merge job (Task 2.2) to consume.

`platform-tag` is optional with empty default during the migration — existing
callers continue to work unchanged (their cache key just becomes
`cache<suffix>-`, an orphaned but valid key). Tasks 2.3+ will update callers
to pass an explicit "amd64" / "arm64" value. Phase 6 flips the input to
required: true once every caller is wired.

PR builds keep their existing tag-based push to ci-tests but pick up the
per-arch cache key. Multi-arch PR builds remain emulated in this commit; they
migrate when the matrix entries split (Tasks 2.3+).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend_merge.yml reusable workflow

Joins per-arch digest artifacts (uploaded by backend_build.yml when called
with platform-tag) into a single tagged multi-arch manifest list via
`docker buildx imagetools create`. Called once per backend by backend.yml
after both per-arch build jobs succeed.

The workflow generates final tags identically to the previous monolithic
build job (same docker/metadata-action invocation), so consumers of
quay.io/go-skynet/local-ai-backends and localai/localai-backends see no
tag-shape change. Two imagetools calls (one per registry) reference the same
per-arch digests under different image names.

Not yet wired into backend.yml — Tasks 2.3+ rewrite individual matrix entries
to expand into per-arch + merge jobs that call this workflow.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: relocate Docker data-root to /mnt on hosted runners

GHA hosted ubuntu-latest runners ship a ~75 GB /mnt drive that's unused by
default. Stopping Docker, rsync'ing /var/lib/docker to /mnt, and restarting
with data-root pointing there yields ~100 GB of working space (combined with
the apt-clean from Task 1.1) — enough for ROCm dev image + vLLM torch install
+ flash-attn intermediate layers. This is the structural change that lets
Phases 4 and 5 of the migration plan move the bigger-runner and
arc-runner-set jobs onto ubuntu-latest.

The composite action is no-op on self-hosted runners (where /mnt isn't
expected) and on non-X64 runners (Task 3.2 verifies the arm64 hosted pool's
/mnt shape separately before enabling). Wired into both backend_build.yml and
image_build.yml between free-disk-space and the first Docker operation.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(setup-build-disk): chmod 1777 /mnt/docker-tmp

buildx CLI runs as the unprivileged 'runner' user and creates config dirs
under TMPDIR before binding them into the buildkit container. /mnt is
root-owned by default, so the original mkdir produced a permission-denied
when buildx tried to write there:
ERROR: mkdir /mnt/docker-tmp/buildkitd-config2740457204: permission denied

Mirror /tmp's permission mode (1777 — world-writable with sticky bit) on
/mnt/docker-tmp so non-root processes can stage their config. Caught by the
first PR run (image-build hipblas job) on PR #9726.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: weekly full-matrix rebuild via cron

Path-filtering backend.yml master push (the previous commit's main
optimization) skips backends whose source didn't change. That broke the
DEPS_REFRESH cache-buster's coverage: the build-arg keyed on %Y-W%V busts the
install layer's cache on a new ISO week, but only when the build actually
runs. Untouched Python backends (torch, transformers, vllm with no version
pin) would otherwise ship stale wheels indefinitely.

Add a Sunday 06:00 UTC cron that fires the full matrix. Schedule events have
no event.ref / event.before, so the script's changedFiles == null fallback
(scripts/changed-backends.js) emits the full matrix automatically — no script
change needed. C++/Go backends with pinned deps cache-hit and complete fast,
so the weekly cost is dominated by Python re-resolves which is exactly what
we want.

workflow_dispatch added so a maintainer can trigger an ad-hoc full-matrix
rebuild without faking a tag push.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
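The DEPS_REFRESH cache-buster mentioned above is just a date string keyed on the ISO week (%Y-W%V). A small Go sketch of the equivalent value, assuming the real build-arg comes from `date` in the workflow:

```go
// depsRefreshKey changes once per ISO week, busting the install layer's cache.
// Note: time.ISOWeek returns the ISO year, which can differ from %Y right at
// year boundaries; the principle (one new key per week) is the same.
package depsrefresh

import (
	"fmt"
	"time"
)

func depsRefreshKey(now time.Time) string {
	year, week := now.ISOWeek()
	return fmt.Sprintf("%d-W%02d", year, week) // e.g. "2026-W20"
}
```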

c894d9c826
feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
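The validation contract (unknown keys fail loudly at load time, with a closest-match hint) is implemented in Python against dataclasses.fields(ServerArgs) and difflib; the Go sketch below only illustrates that contract, with a plain edit-distance helper standing in for difflib.get_close_matches:

```go
// Illustrative only: reject unknown engine_args keys at load time and suggest
// the nearest valid field name instead of silently dropping the key.
package engineargs

import "fmt"

func applyEngineArgs(validFields map[string]struct{}, engineArgs map[string]any) error {
	for key := range engineArgs {
		if _, ok := validFields[key]; ok {
			continue
		}
		if hint := closest(key, validFields); hint != "" {
			return fmt.Errorf("unknown engine_args key %q (did you mean %q?)", key, hint)
		}
		return fmt.Errorf("unknown engine_args key %q", key)
	}
	return nil
}

func closest(key string, valid map[string]struct{}) string {
	best, bestDist := "", len(key)+1
	for candidate := range valid {
		if d := editDistance(key, candidate); d < bestDist {
			best, bestDist = candidate, d
		}
	}
	if bestDist > len(key)/2 { // too far away to be a useful suggestion
		return ""
	}
	return best
}

// editDistance is a standard Levenshtein distance over bytes.
func editDistance(a, b string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			cur[j] = minInt(prev[j]+1, minInt(cur[j-1]+1, prev[j-1]+cost))
		}
		prev = cur
	}
	return prev[len(b)]
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}
```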
Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
exercising the helper directly (no engine load required).
Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
add --index-strategy=unsafe-best-match for cublas12 so the cu128
torch index wins over default-PyPI's cu130; new pyproject.toml-driven
l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
torchaudio/sglang to the jetson-ai-lab index without forcing every
transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
for the l4t13 BUILD_PROFILE; other profiles still go through the
requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
(new files) and cu128 torch index for cublas12 (default PyPI now
ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
capability mappings + image entries pointing at
quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
mirroring vllm's cuda13 build.
Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
+ online fp8 weight quantization, verified end-to-end on a 16 GB
RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
MTP draft worker's vocab embedding is loaded unquantised and OOMs
the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
ServerArgs structure, the typed-vs-engine_args precedence, the
speculative-decoding cheatsheet, and the mem_fraction_static gotcha
documented above.
* AGENTS.md: index entry for the new agent doc.
Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

a8d7d37a3c
fix: unbreak master CI (docs, kokoros, vibevoice-cpp ABI) (#9682)
* fix(docs): correct broken Hugo relrefs

The Hugo build has been failing on master since the relevant pages landed:
- text-generation.md:720 referenced `/docs/features/distributed-mode`, but
  Hugo `relref` paths are relative to the content root, not the rendered URL.
  Drop the `/docs/` prefix so the lookup matches the existing `features/...`
  form used elsewhere in the file.
- audio-transform.md:144 referenced `tts.md`; the actual page is
  `text-to-audio.md`.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(kokoros): stub Diarize and AudioTransform Backend trait methods

The recent backend.proto additions (Diarize, AudioTransform,
AudioTransformStream) extended the gRPC Backend trait, breaking kokoros-grpc
compilation with E0046 because the Rust implementation hadn't picked up the
new methods. Add Unimplemented stubs matching the existing pattern for
non-applicable RPCs in this TTS-only backend.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(vibevoice-cpp): track upstream ABI + wire 1.5B voice cloning

Two recent commits in mudler/vibevoice.cpp reshaped the vv_capi_tts signature
without a corresponding bump on the LocalAI side:
3bd759c "1.5b: unify into a single tts entry point" inserted a ref_audio_path
parameter between voice_path and dst_wav_path.
ad856bd "1.5b: multi-speaker dialog support" promoted that to a
(const char* const* ref_audio_paths, int n_ref_audio_paths) pair for
per-speaker conditioning.

Because purego resolves symbols by name and not by signature, the build kept
linking; at runtime the misaligned arguments turned the TTS->ASR closed-loop
test into a SIGSEGV inside cgo. Track HEAD explicitly and bring the bridge in
line with it:
* Update the CppTTS purego binding to the 9-arg form. purego marshals []*byte
  as a **char by handing the C side the underlying array address; nil/empty
  maps to NULL, which matches the C contract for "no reference audio" on the
  realtime-0.5B path.
* Add a `ref_audio` gallery option (comma-separated, repeatable) that the
  1.5B path consumes for runtime voice cloning. Multiple entries are
  interpreted as one WAV per speaker (Speaker 0..n-1).
* TTSRequest.Voice now routes by extension/shape: `.wav` or a comma-separated
  list goes to ref_audio_paths; anything else stays on voice_path
  (realtime-0.5B's pre-baked voice gguf).
* Pin VIBEVOICE_CPP_VERSION to ad856bd and wire the Makefile into the
  existing bump_deps matrix so future upstream rolls land as reviewable PRs
  instead of a silent CI break.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(vibevoice-cpp): use ModelOptions.AudioPath for 1.5B ref audio

Use the existing audio_path field from ModelOptions (already plumbed through
config_file's `audio_path:` YAML and consumed by other audio backends like
kokoros) instead of inventing a custom `ref_audio:` Options[] string.
Multi-speaker setups stay on a single comma-separated value. No behavior
change beyond the gallery key name; per-call routing via TTSRequest.Voice is
unchanged.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
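The []*byte-as-**char trick mentioned above is the core of the 9-arg binding fix. A minimal sketch, assuming the helper name is hypothetical and not the exact code in the vibevoice-cpp bridge: purego passes a []*byte by the address of its backing array, and a nil slice maps to NULL, which is the C contract for "no reference audio".

```go
// cStrings converts Go strings into NUL-terminated byte buffers whose element
// addresses purego can hand to C as the char* entries of a char** array.
package vibevoice

func cStrings(paths []string) []*byte {
	if len(paths) == 0 {
		return nil // nil -> NULL ref_audio_paths (realtime-0.5B path)
	}
	out := make([]*byte, 0, len(paths))
	for _, p := range paths {
		buf := append([]byte(p), 0) // NUL terminator for the C side
		out = append(out, &buf[0])
	}
	return out
}
```

In use, a comma-separated TTSRequest.Voice would be split into one path per speaker before being handed to the binding; the slice length supplies n_ref_audio_paths.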

1caab1de10
chore(deps): bump actions/checkout from 4 to 6 (#9663)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

bb033b16a9
feat: add LocalVQE backend and audio transformations UI (#9640)
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI
Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.
Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
bidirectional AudioTransformStream for low-latency frame-by-frame use.
This is the first bidi stream in the proto; per-frame unary at LocalVQE's
16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
embed,interface,base} with paired-channel ergonomics.
LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
wrapper needed because LocalVQE handles CPU feature selection internally
via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
LocalVQE runs single-threaded at ~1× realtime instead of the documented
~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
trailing portion can't have leaked into a mic that wasn't recording);
see the sketch after this list.
- Ginkgo test suite (9 always-on specs + 2 model-gated).
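A minimal sketch of the reference-length policy called out in the list above, assuming 16 kHz mono PCM float samples (fitRef is an illustrative name, not the wrapper's actual function):

```go
// fitRef forces the reference signal to the mic length: truncate when the ref
// is longer (the trailing portion can't have leaked into a mic that wasn't
// recording) and zero-pad when it is shorter.
package localvqe

func fitRef(ref []float32, micLen int) []float32 {
	if len(ref) >= micLen {
		return ref[:micLen]
	}
	out := make([]float32, micLen) // zero-padded tail
	copy(out, ref)
	return out
}
```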
HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
endpoint, accepts audio + optional reference + params[*]=v form fields.
Persists inputs alongside the output in GeneratedContentDir/audio so the
React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
(interleaved stereo mic+ref in, mono out). JSON session.update envelope
for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
utils.AudioToWav (with passthrough fast-path), so the user can upload any
format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
no /audio/transformations endpoint), traced via TraceMiddleware.
Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
huggingface.co/LocalAI-io/LocalVQE).
React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
section (sibling of Tools / Biometrics) — the page is enhancement, not
generation, so it lives outside Studio. Two AudioInput components
(Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
the speakers — the mic naturally picks up speaker bleed, giving a real
(mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
and useAudioPeaks hook (shared module-scoped AudioContext to avoid
hitting browser context limits with three players on one page); migrated
TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
the history entry stores all three URLs so clicking re-renders the full
triple, not just the output.
Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
two and let GPU-class hardware route through Vulkan in the gallery
capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
(8 specs covering tabs, file upload, multipart shape, history, errors).
Docs:
- New docs/content/features/audio-transform.md covering the (audio,
reference) mental model, batch + WebSocket wire formats, LocalVQE param
keys, and a YAML config example. Cross-links from text-to-audio and
audio-to-text feature pages.
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

28b4857bd6
fix(ci): leave ports.ubuntu.com upstream on self-hosted runners
mirrors.edge.kernel.org carries /ubuntu/ (amd64 archive) but does NOT carry
/ubuntu-ports/. With the previous default both archive and ports pointed at
kernel.org, so multi-arch builds (linux/amd64,linux/arm64) on bigger-runner /
arc-runner-set 404'd on the arm64 leg:
Err:5 http://mirrors.edge.kernel.org/ubuntu-ports noble Release
404 Not Found [IP: 213.196.21.55 80]

The original outage was on archive.ubuntu.com, not ports.ubuntu.com, so
default the self-hosted-ports-mirror to '' (= keep ports.ubuntu.com
upstream). apt-mirror.sh and the runner-side rewrite both already no-op when
the env var is empty. Self-hosted amd64 still uses kernel.org for the main
archive, which worked fine in this run before the arm64 leg failed.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

5503be1fb3
fix(ci): use http for the kernel.org mirror — bare ubuntu image has no CA bundle
The Docker build runs on the minimal ubuntu:24.04 base image, which
ships *without* ca-certificates. The very first apt-get update over
HTTPS therefore fails the TLS handshake ("No system certificates
available. Try installing ca-certificates."), and apt can't reach
ca-certificates itself to fix the situation — chicken and egg.
Apt validates package integrity via GPG-signed Release files, so plain
HTTP is safe for the archive. archive.ubuntu.com / azure.archive are
already accessed over HTTP for the same reason. Switch the kernel.org
defaults from https://mirrors.edge.kernel.org to
http://mirrors.edge.kernel.org so the in-Dockerfile rewrite works on
self-hosted runners too.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

50580a84ae
fix(ci): switch apt mirror per runner — azure on github-hosted, kernel.org on self-hosted
Self-hosted runners (arc-runner-set, bigger-runner) cannot reach
azure.archive.ubuntu.com — they live in different networks (e.g. our
arc-runner-set Kubernetes cluster) where Azure's mirror IP is not
routable. Symptom: "Connection failed [IP: 51.11.236.225 80]" with each
Ign:/Err: cycle taking 60s, hanging the build for ~16 minutes before
exit 100.
Pick the mirror based on `runner.environment`:
* github-hosted (ubuntu-latest, ubuntu-24.04-arm) → Azure
(http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com)
— same VPC as the runner.
* self-hosted (arc-runner-set, bigger-runner) → kernel.org
(https://mirrors.edge.kernel.org for both archive and ports)
— publicly reachable from any network.
The choice now lives in one place: the .github/actions/configure-apt-mirror
composite action exposes `effective-mirror` / `effective-ports-mirror`
outputs so the reusable workflows can forward the same value as Docker
build-args without duplicating the per-runner-environment branch.
The now-redundant `apt-mirror` / `apt-ports-mirror` workflow inputs on
image_build.yml and backend_build.yml are dropped — defaults live in the
composite action and are visible there.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
8edac61e57 |
feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650)
* feat(ci): allow routing apt traffic through an alternate Ubuntu mirror
Adds opt-in APT_MIRROR / APT_PORTS_MIRROR knobs to all Dockerfiles, the
Makefile, and CI workflows so we can fail over to a non-canonical Ubuntu
mirror when archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com
are degraded (recently observed: multi-day DDoS against the default pool).
Defaults are empty everywhere — behavior is unchanged unless a mirror is
configured. To enable in CI, set the repo-level GitHub Actions variables
APT_MIRROR (and APT_PORTS_MIRROR for arm64 builds). Locally:
make docker APT_MIRROR=http://azure.archive.ubuntu.com
A small POSIX-sh helper in .docker/apt-mirror.sh rewrites both DEB822
(/etc/apt/sources.list.d/ubuntu.sources, Ubuntu 24.04+) and the legacy
/etc/apt/sources.list before the first apt-get update. Dockerfile stages
load it via RUN --mount=type=bind, so there is no extra layer and no
cache invalidation when the script is unchanged. Reusable workflows also
rewrite the runner's own /etc/apt sources before any sudo apt-get call.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(apt-mirror): default to the Azure mirror, visible in the workflow source
Bakes Azure (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com)
in as the default for both Docker builds and runner-side apt — rather than
hiding the URL behind a GitHub Actions repo variable that's not visible
from the source tree.
A new composite action at .github/actions/configure-apt-mirror is the
single source of truth for runner-side rewrites. Five standalone
workflows (build-test, release, tests-e2e, tests-ui-e2e, update_swagger)
just `uses: ./.github/actions/configure-apt-mirror`.
Three workflows (image_build, backend_build, checksum_checker) keep an
inline bash rewrite, because they install/upgrade git via apt *before*
the checkout step (so the local composite action isn't loadable yet).
The Azure URL is visible in those files too.
The `apt-mirror` / `apt-ports-mirror` inputs of the reusable workflows
keep their now-Azure defaults — they still feed the Docker build-args
block in addition to the inline runner-side rewrite. Callers (image.yml,
image-pr.yml, backend.yml, backend_pr.yml) drop the previous
`vars.APT_MIRROR` plumbing and rely on those defaults.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(apt-mirror): drop Force Install GIT, consolidate on the composite action
The PPA git upgrade ran add-apt-repository ppa:git-core/ppa, which talks
to api.launchpad.net — also part of Canonical's infrastructure and
currently returning HTTP 504. The Azure mirror only covers
archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com, not PPAs.
The system git that ubuntu-latest already ships is sufficient for
actions/checkout and the build pipeline, so just drop the upgrade. With
that gone, the apt-before-checkout constraint disappears too — all three
holdouts (image_build, backend_build, checksum_checker) can now switch
to ./.github/actions/configure-apt-mirror like the other five.
Net: 0 inline apt-mirror blocks, all 8 workflows route through the
composite action.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
18e039f305 |
fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626)
* fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds
|
||
|
|
fe6eb57082 |
feat(vibevoice-cpp): add purego TTS+ASR backend (#9610)
* feat(vibevoice-cpp): add purego TTS+ASR backend
Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.
- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs
* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries
Refactor backend Load() to follow the standard Options[] convention
used by sherpa-onnx and the rest of the multi-role backends:
ModelFile is the primary gguf, supplementary paths come through
opts.Options[] as key=value (or key:value for Make-target compat),
resolved against opts.ModelPath. type=asr/tts decides the role of
ModelFile when neither tts_model nor asr_model is set explicitly.
Add gallery/index.yaml entries:
- vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer
Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.
Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.
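As a purely illustrative Go sketch of the key=value / key:value Options[]
convention described above (the helper name and shape are assumptions, not
the backend source):
  package main
  import "strings"
  // parseOption splits one Options[] entry, accepting both "key=value" and
  // the Make-target-friendly "key:value" form; "=" is tried first.
  func parseOption(entry string) (key, value string, ok bool) {
    for _, sep := range []string{"=", ":"} {
      if i := strings.Index(entry, sep); i > 0 {
        return entry[:i], entry[i+1:], true
      }
    }
    return "", "", false
  }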
* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive
libvibevoice is a STATIC archive linked into the MODULE library.
Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on
MSVC), the linker garbage-collects symbols not referenced from this
translation unit - which means dlopen+RegisterLibFunc panics with
'undefined symbol: vv_capi_load' at backend startup, since purego
looks them up by name and our cpp/govibevoicecpp.cpp doesn't call
them directly.
* test(vibevoice-cpp): rewrite suite with Ginkgo v2
Match the convention used by backend/go/sherpa-onnx/backend_test.go.
The suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout
and runs the heavyweight closed-loop spec when test.sh has staged
the bundle.
* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream
The gRPC server's stream handlers (pkg/grpc/server.go) spawn a
goroutine that ranges over a chan; the only thing closing that chan
is the backend's own *Stream method. With the default Base stub
returning 'unimplemented' and never touching the chan, the server
goroutine hangs forever and the client hits DeadlineExceeded - which
is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts
matrix run.
TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can
start playback before the full PCM lands) followed by the PCM body
in 64 KB slices. The header + >=2 PCM frames satisfy the harness's
'expected >=2 chunks' assertion and give a real progressive stream.
AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals
the concatenated deltas (the harness asserts those match).
Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.
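For illustration, a hedged Go sketch of the streaming-WAV shape described in
this change: RIFF/data sizes set to 0xFFFFFFFF so playback can start early,
followed by 64 KB PCM slices. Names and layout are assumptions, not the
actual backend code.
  package main
  import (
    "bytes"
    "encoding/binary"
  )
  // streamingWAVHeader builds a 44-byte PCM WAV header whose RIFF and data
  // sizes are 0xFFFFFFFF ("length unknown"), so a client can start playback
  // before the full PCM body exists.
  func streamingWAVHeader(sampleRate uint32, channels, bitsPerSample uint16) []byte {
    var b bytes.Buffer
    put := func(v interface{}) { _ = binary.Write(&b, binary.LittleEndian, v) }
    blockAlign := channels * bitsPerSample / 8
    put([]byte("RIFF"))
    put(uint32(0xFFFFFFFF)) // total RIFF size unknown while streaming
    put([]byte("WAVE"))
    put([]byte("fmt "))
    put(uint32(16)) // PCM fmt chunk size
    put(uint16(1))  // audio format: PCM
    put(channels)
    put(sampleRate)
    put(sampleRate * uint32(blockAlign)) // byte rate
    put(blockAlign)
    put(bitsPerSample)
    put([]byte("data"))
    put(uint32(0xFFFFFFFF)) // PCM length unknown while streaming
    return b.Bytes()
  }
  // chunkPCM cuts the PCM body into 64 KB slices, one per stream message.
  func chunkPCM(pcm []byte) [][]byte {
    const chunk = 64 * 1024
    var out [][]byte
    for len(pcm) > 0 {
      n := chunk
      if len(pcm) < n {
        n = len(pcm)
      }
      out = append(out, pcm[:n])
      pcm = pcm[n:]
    }
    return out
  }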
* fix(vibevoice-cpp): silence errcheck on cleanup paths
Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure
for defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution
Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:
1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left
v.ttsModel empty, because the default-fill block only ran when BOTH
slots were empty. vv_capi_load then got tts="" + a voice and the
C side rejected it with rc=-3 'TTS model required to load a voice'.
Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
Options, defaulting to tts) independently of the secondary, so
ModelFile + asr_model resolves to both.
2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
launched from a directory that happens to contain a same-named
file, supplementary Options[] paths could leak away from the
models dir. Drop the CWD probe entirely - relative paths now
*always* join onto opts.ModelPath (the gallery convention).
New Ginkgo coverage:
* 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr,
explicit tts_model override, key:value variant.
* 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough,
empty input, empty relTo, and the CWD-trap regression test.
* 'Load resolves relative Options paths against opts.ModelPath' - end-
to-end gallery layout round-trip.
Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
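A rough Go sketch of the resolution rule after this fix (hypothetical helper,
simplified from the description above):
  package main
  import "path/filepath"
  // resolvePath joins relative supplementary paths onto the model root
  // (opts.ModelPath under the gallery convention); absolute paths, empty
  // inputs, and an empty root pass through unchanged. No CWD probing.
  func resolvePath(p, modelRoot string) string {
    if p == "" || modelRoot == "" || filepath.IsAbs(p) {
      return p
    }
    return filepath.Join(modelRoot, p)
  }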
* test(vibevoice-cpp): use gallery convention in closed-loop spec
The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:
Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]
Combined with no ModelPath set on the request, the backend's
modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then
resolvePath joined the prefixed Options path on top of it -
producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when
the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.
The fix is to mirror the gallery contract LocalAI core actually
sends in production: ModelPath is the models root (absolute),
ModelFile is a name *under* it, every Options[] path is relative
to ModelPath. Uses filepath.Base() to get bare filenames.
Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs)
and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that
broke CI). Both: 19/19 specs pass, ~55-60s.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout
The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner
image, the docker build cache, and the test artifacts on a free
ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription'
was getting SIGTERM'd at 90 min before the model could finish loading.
Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
* the e2e harness Make target
* the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
* the per-backend test.sh auto-download list
Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from
90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs
runway above the previous 90 min cap.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners
The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on
disk) a single 30 s transcription saturates the per-test 30 min
timeout in the e2e-backends harness on a 4-core ubuntu-latest, and
the 10 GB download + Docker layer + working space leaves no headroom
on the runner's free disk. Two attempts in CI got SIGTERM'd at the
LoadModel boundary - the bottleneck isn't tunable from the workflow
side without a paid-tier runner.
The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of
the disk/CPU pressure.
The Makefile target test-extra-backend-vibevoice-cpp-transcription
stays for local invocation on workstations that can afford it -
useful when developing the streaming codepaths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner
Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to
the self-hosted 'bigger-runner' label that GPU image builds in
backend.yml use, plus the documented Free-disk-space prep step (purge
dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang
entries in this file describe. That gives the 7B-param Q4_K ASR
model the disk + CPU runway it needs.
Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK
decode plus 10 GB download has to fit comfortably.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e
bigger-runner is a self-hosted bare runner without the standard
ubuntu image's preinstalled build tools, so the previous job died at
the very first command with 'make: command not found' (exit 127).
Add the Dependencies step that the disabled vllm/sglang entries in
this file already document - apt-get installs make + build-essential
+ curl + unzip + ca-certificates + git + tar before the make target
runs. Mirrors how every other 'runs-on: bigger-runner' entry in
backend.yml prepares the runner.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
4916f8c880 |
feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map
LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.
Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.
Operators can now write:
engine_args:
  data_parallel_size: 8
  enable_expert_parallel: true
  all2all_backend: deepep_low_latency
  speculative_config:
    method: deepseek_mtp
    num_speculative_tokens: 3
  kv_cache_dtype: fp8
without further proto/Go/Python plumbing per field.
Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.
Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
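Purely for illustration, a Go sketch of how such a free-form engine_args map
could be JSON-serialised into a string proto field; struct and field names
here are assumptions, not the actual LocalAI types:
  package main
  import "encoding/json"
  // ModelConfigSketch stands in for the model YAML: one typed legacy field
  // plus the free-form engine_args map forwarded verbatim to vLLM.
  type ModelConfigSketch struct {
    TensorParallelSize int                    `yaml:"tensor_parallel_size"`
    EngineArgs         map[string]interface{} `yaml:"engine_args"`
  }
  // encodeEngineArgs JSON-serialises the map into the single string a proto
  // field (e.g. ModelOptions.EngineArgs) would carry to the Python backend.
  func encodeEngineArgs(cfg ModelConfigSketch) (string, error) {
    if len(cfg.EngineArgs) == 0 {
      return "", nil
    }
    raw, err := json.Marshal(cfg.EngineArgs)
    if err != nil {
      return "", err
    }
    return string(raw), nil
  }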
* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel
vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.
cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.
Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(vllm): bot job to bump cublas13 vLLM wheel pin
vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.
The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* docs(vllm): document engine_args and speculative decoding
The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
|
||
|
|
4443250756 |
chore: add golangci-lint with new-from-merge-base baseline (#9603)
* chore: add golangci-lint with new-from-merge-base baseline
Configure golangci-lint v2 with the standard linter set (errcheck, govet,
ineffassign, unused) plus forbidigo, which enforces the Ginkgo/Gomega-only
test convention from .agents/coding-style.md by rejecting stdlib testing
calls (t.Errorf, t.Fatalf, t.Run, ...). staticcheck is disabled — the
codebase has many pre-existing QF-style suggestions not worth gating on.
issues.new-from-merge-base = master makes the lint job a gate for new
issues only; the ~1300 pre-existing baseline stays visible via
'make lint-all' for incremental cleanup. CI runs 'make lint'.
Backends needing C/C++ headers we don't install in the lint runner are
excluded via a deny list in the Makefile (backend/go/{piper,silero-vad,
llm}, cmd/launcher). Discovery still flows through 'go list ./...', so
new packages are scanned automatically.
To make backend/go/{sam3-cpp,stablediffusion-ggml,whisper} typecheckable,
move their .cpp/.h sources into cpp/ subdirs (matching qwen3-tts-cpp /
acestep-cpp). Without this 'go list' rejects the package because Go does
not allow .cpp alongside .go without cgo.
Fix two real bugs found by lint in tests/integration/ (run only via
'make test-stores', not default CI): a stale zerolog reference left over
from the slog migration (
|
||
|
|
a0317d9926 |
refactor(tests): split app_test.go, move real-backend coverage to e2e-backends
core/http/app_test.go had grown to 1495 lines exercising three concerns at
once: HTTP-layer integration, real-backend inference (llama-gguf, tts,
stablediffusion, transformers embeddings, whisper), and service logic that
already has unit-level coverage. Each PR paid for 6 backend builds plus
real-model downloads to satisfy a single suite.
Reorg per layer:
- app_test.go (1495 -> 1003 lines) drives the mock-backend binary only.
Kept: auth, routing, gallery API, file:// import, /system, agent-jobs
HTTP plumbing, config-file model loading. Deleted real-inference specs
(llama-gguf chat, ggml completions/streaming, logprobs, logit_bias,
transcription, embeddings, External-gRPC, Stores duplicate, Model gallery
Context). Lifted Agent Jobs out of the deleted Stores Context.
- tests/e2e-backends/backend_test.go gains logprobs, logit_bias, and
no-first-token-dup specs (the latter folded into PredictStream). Two
new caps gate them so non-LLM backends opt out.
- tests/e2e-aio/e2e_test.go gains a streaming smoke under Context("text")
to catch container-level streaming regressions.
- tests/models_fixtures/ removed; all fixtures referenced testmodel.ggml.
app_test.go now writes per-Context inline mock-model YAMLs.
CI:
- test.yml + tests-e2e.yml gain paths-ignore (docs/, examples/, *.md,
backend/) so docs and backend-only PRs skip them. test.yml drops the
6-backend Build step plus TRANSFORMER_BACKEND/GO_TAGS=tts; tests-apple
drops the llama-cpp-darwin build.
- New tests-aio.yml runs the AIO container nightly + on workflow_dispatch
+ master/tags. The tests-e2e-container job moved out of test.yml so PRs
no longer pay AIO cost.
- New tests-llama-cpp-smoke job in test-extra.yml runs on every PR with
no detect-changes gate; pulls quay.io/go-skynet/local-ai-backends:
master-cpu-llama-cpp (no build on PR) and exercises predict/stream/
logprobs/logit_bias against Qwen3-0.6B. This is the PR-acceptance
real-backend gate after AIO moved to nightly. The path-gated heavy
test-extra-backend-llama-cpp wrapper appends the same caps so it
exercises the moved specs when the backend actually changes.
Makefile:
- Deleted test-models/testmodel.ggml (the wget chain), test-llama-gguf,
test-tts, test-stablediffusion, test-realtime-models. test target
drops --label-filter, HUGGINGFACE_GRPC, TRANSFORMER_BACKEND, TEST_DIR,
FIXTURES, CONFIG_FILE, MODELS_PATH, BACKENDS_PATH; depends on
build-mock-backend. test-stores keeps a focused entry point and depends
on backends/local-store. clean-tests also clears the mock-backend
binary.
Net per typical Go-side PR: ~25min (6 backend builds + tests + AIO) +
~8min e2e drops to ~5min mock-backend test + ~8min e2e + ~5-10min
llama-cpp-smoke (image pulled). Docs and backend-only PRs skip the
always-on workflows entirely.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]
|
||
|
|
9a7f5e68bd |
ci(darwin): add native caches to backend_build_darwin
macOS runners can't use the registry-backed BuildKit cache (no Docker daemon),
so every darwin matrix run was paying full cost for brew installs, Go module
downloads, llama.cpp recompiles and Python wheel resolution.
Wires actions/cache@v4 into the reusable workflow for four caches:
- Go modules + build cache (setup-go cache: true), shared across matrix
- Homebrew downloads + selected /opt/homebrew/Cellar entries, with
  HOMEBREW_NO_AUTO_UPDATE so restored Cellar paths stay stable
- ccache for the llama-cpp CMake variants, keyed on the pinned LLAMA_VERSION;
  CMAKE_*_COMPILER_LAUNCHER is exported via GITHUB_ENV so
  backend/cpp/llama-cpp/Makefile picks it up without script changes
- Python uv + pip wheel cache, keyed by backend + ISO week — same
  one-cold-rebuild-per-week cadence as the Linux DEPS_REFRESH
Read/write semantics match the existing BuildKit policy: every run restores,
only master/tag pushes save, so PRs can't pollute master's warm cache.
Documents the new caches and the macOS-specific constraints in
.agents/ci-caching.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
|
||
|
|
f4036fa83f |
ci(python-backends): add weekly DEPS_REFRESH cache-buster
The shared backend/Dockerfile.python ends in:
RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
which `pip install`s each backend's requirements*.txt. A scan of all 34
Python backends shows every single one ships at least some unpinned deps
(torch, transformers, vllm, diffusers, ...). With the registry cache now
enabled, that `make` layer's BuildKit hash depends only on Dockerfile
instructions + COPYed source — not on what pip resolves at runtime — so
a warm cache would freeze upstream versions indefinitely.
DEPS_REFRESH is an ARG declared right before that RUN. backend_build.yml
computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) and passes it as
a build-arg, so the install layer invalidates at most once per week and
re-resolves PyPI / nightly indexes. Within a week, builds stay warm.
Only Dockerfile.python is affected: Go (go.sum) and Rust (Cargo.lock)
already lock their deps, and the C++ backends pull gRPC at a pinned tag
and llama.cpp at a pinned commit.
Add .agents/ci-caching.md documenting the cache layout
(quay.io/go-skynet/ci-cache:cache<tag-suffix>), read/write semantics
(master writes, PRs read-only), DEPS_REFRESH semantics, and how to
manually evict tags. Index it from AGENTS.md (CLAUDE.md is a symlink).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m
|
||
|
|
bdfa5e934a |
ci: switch image/backend build cache to a dedicated registry image
- Switch cache-from/cache-to in backend_build.yml and image_build.yml
from the unused gha cache to type=registry pointing at
quay.io/go-skynet/ci-cache:cache<tag-suffix>, mode=max with
ignore-error=true. Master/tag builds populate their own
per-matrix-entry cache; PR builds read-only.
- Drop the broken generate_grpc_cache.yaml cron. It targeted a `grpc`
Dockerfile stage that was removed by
|
||
|
|
41ed8ced70 |
[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (in more places this time) (#9578)
Update additional intel base images |
||
|
|
e16e758dff |
ci(backends): build cpu-whisperx and cpu-faster-whisper for linux/arm64 (#9573)
Extend the existing CPU build matrix entries to produce a multi-arch manifest
(linux/amd64,linux/arm64) at the same image tags. arm64 Linux hosts without
an NVIDIA GPU report the "default" capability, which already maps to
cpu-whisperx / cpu-faster-whisper in backend/index.yaml -- so the manifest
list lets Docker pull the right variant without any gallery changes.
Both stacks install cleanly under aarch64: torch (2.4.1/2.8.0),
faster-whisper, ctranslate2, whisperx, opencv-python and the remaining deps
all ship manylinux2014_aarch64 wheels, so no source builds run under QEMU
emulation. Follows the same pattern already used by cpu-llama-cpp-quantization.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
703b4fcae8 |
Change cron schedule to run every 12 hours
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> |
||
|
|
24505e57f5 |
feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang (#9553)
* feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang
Adds new build profiles mirroring the diffusers/ace-step pattern so vLLM
serving (and SGLang on arm64) can be deployed on CUDA 13 hosts and
JetPack 7 boards:
- vllm: cublas13 (PyPI cu130 channel) + l4t13 (jetson-ai-lab SBSA cu130
prebuilt vllm + flash-attn).
- vllm-omni: cublas13 + l4t13. Floats vllm version on cu13 since vllm
0.19+ ships cu130 wheels by default and vllm-omni tracks vllm master;
cu12 path keeps the 0.14.0 pin to avoid disturbing existing images.
- sglang: l4t13 arm64 only — uses the prebuilt sglang wheel from the
jetson-ai-lab SBSA cu130 index, so no source build is needed.
Cublas13 sglang on x86_64 is intentionally deferred.
CI matrix gains five new images (-gpu-nvidia-cuda-13-vllm{,-omni},
-nvidia-l4t-cuda-13-arm64-{vllm,vllm-omni,sglang}); backend/index.yaml
gains the matching capability keys (nvidia-cuda-13, nvidia-l4t-cuda-13)
and latest/development merge entries.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
* fix(backends): use unsafe-best-match index strategy on l4t13 builds
The jetson-ai-lab SBSA cu130 index lists transitive deps (decord, etc.)
at limited versions / older Python ABIs. uv defaults to the first index
that contains a package and refuses to fall through to PyPI, so sglang
l4t13 build fails resolving decord. Mirror the existing cpu sglang
profile by setting --index-strategy=unsafe-best-match on l4t13 across
the three backends, and apply it to the explicit vllm install line in
vllm-omni's install.sh (which doesn't honor EXTRA_PIP_INSTALL_FLAGS).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
* fix(sglang): drop [all] extras on l4t13, floor version at 0.5.0
The [all] extra brings in outlines→decord, and decord has no aarch64
cp312 wheel on PyPI nor the jetson-ai-lab index (only legacy cp35-cp37
tags). With unsafe-best-match enabled, uv backtracked through sglang
versions trying to satisfy decord and silently landed on
sglang==0.1.16, an ancient version with an entirely different dep
tree (cloudpickle/outlines 0.0.44, etc.).
Drop [all] so decord is no longer required, and floor sglang at 0.5.0
to prevent any future resolver misfire from degrading the version
again.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
e5337039b0 |
[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543)
* Use latest oneapi-basekit image for Intel images
The current `localai/localai:master-gpu-intel` images don't work with the
intel arc pro b70. Updating the base_image to 2025.3.2 fixes it.
Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>
* Update github workflow base image
---------
Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>
|
||
|
|
13734ae9fa |
feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR
Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same approach
as opus/stablediffusion-ggml/whisper — a thin C shim (csrc/shim.c + shim.h →
libsherpa-shim.so) wraps the bits purego can't reach directly: nested struct
config writes, result-struct field reads, and the streaming TTS callback
trampoline. The Go side uses opaque uintptr handles and purego.NewCallback
for the TTS callback.
Supports:
- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection
  (AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel,
  prefixed by a streaming WAV header
Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline ASR),
streaming-zipformer-en-sherpa (low-latency streaming ASR), silero-vad-sherpa,
vits-ljs-sherpa.
E2E coverage: tests/e2e-backends for offline + streaming ASR, tests/e2e for
the full realtime pipeline (VAD + STT + TTS).
Assisted-by: claude-opus-4-7-1M [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
|
||
|
|
181ebb6df4 |
feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend
Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.
The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by
|
||
|
|
20baec77ab |
feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis
Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.
New gRPC RPCs (backend/backend.proto):
* FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
* FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse
Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.
New HTTP endpoints under /v1/face/:
* verify — 1:1 image pair same-person decision
* analyze — per-face age + gender (emotion/race reserved)
* register — 1:N enrollment; stores embedding in vector store
* identify — 1:N recognition; detect → embed → StoresFind
* forget — remove a registered face by opaque ID
Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).
New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.
Gallery (backend/index.yaml) ships three entries:
* insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage
(~326MB pre-baked; non-commercial research use only)
* insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0)
* insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial)
Python backend (backend/python/insightface/):
* engines.py — FaceEngine protocol with InsightFaceEngine and
OnnxDirectEngine; resolves model paths relative to the backend
directory so the same gallery config works in docker-scratch and
in the e2e-backends rootfs-extraction harness.
* backend.py — gRPC servicer implementing Health, LoadModel, Status,
Embedding, Detect, FaceVerify, FaceAnalyze.
* install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
backend directory so first-run is offline-clean (the final scratch
image only preserves files under /<backend>/).
* test.py — parametrized unit tests over both engines.
Tests:
* Registry unit tests (go test -race ./core/services/facerecognition/...)
— in-memory fake grpc.Backend, table-driven, covers register/
identify/forget/error paths + concurrent access.
* tests/e2e-backends/backend_test.go extended with face caps
(face_detect, face_embed, face_verify, face_analyze); relative
ordering + configurable verifyCeiling per engine.
* Makefile targets: test-extra-backend-insightface-buffalo-l,
-opencv, and the -all aggregate.
* CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
auto-triggered by changes under backend/python/insightface/.
Docs:
* docs/content/features/face-recognition.md — feature page with
license table, quickstart (defaults to the commercial-safe model),
models matrix, API reference, 1:N workflow, storage caveats.
* Cross-refs in object-detection.md, stores.md, embeddings.md, and
whats-new.md.
* Contributor README at backend/python/insightface/README.md.
Verified end-to-end:
* buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
face_verify, face_analyze).
* opencv: 5/5 specs (same minus face_analyze — SFace has no
demographic head; correctly skipped via BACKEND_TEST_CAPS).
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): move engine selection to model gallery, collapse backend entries
The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.
Correct split:
* `backend/index.yaml` now has ONE `insightface` backend entry
shipping the CPU + CUDA 12 container images. The Python backend
bundles both the non-commercial insightface model packs
(buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
weights (YuNet + SFace); the active engine is selected at
LoadModel time via `options: ["engine:..."]`.
* `gallery/index.yaml` gains three model entries —
`insightface-buffalo-l`, `insightface-opencv`,
`insightface-buffalo-s` — each setting the appropriate
`overrides.backend` + `overrides.options` so installing one
actually gives the user the intended engine. This matches how
`rfdetr-base` lives in the model gallery against the `rfdetr`
backend.
The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): cover all supported models in the gallery + drop weight baking
Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.
Gallery now has seven `insightface-*` model entries (gallery/index.yaml):
insightface (family) — non-commercial research use
• buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default
• buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage
• buffalo-s (159MB) — SCRFD-500MF + MBF + genderage
• buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only
(no landmarks, no demographics — analyze
returns empty attributes)
• antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage
OpenCV Zoo family — Apache 2.0 commercial-safe
• opencv — YuNet + SFace fp32 (~40MB)
• opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)
Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:
* OpenCV variants list `files:` with ONNX URIs + SHA-256, so
`local-ai models install insightface-opencv` pulls them into the
models directory exactly like any other gallery-managed model.
* insightface packs (upstream distributes .zip archives only, not
individual ONNX files) auto-download on first LoadModel via
FaceAnalysis' built-in machinery, rooted at the LocalAI models
directory so they live alongside everything else — same pattern
`rfdetr` uses with `inference.get_model()`.
Backend changes (backend/python/insightface/):
* backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
LocalAI models directory) to engines via a `_model_dir` hint.
This replaces the earlier ModelFile-dirname approach; ModelPath
is the canonical "models directory" variable set by the Go loader
(pkg/model/initializers.go:144) and is always populated.
* engines.py::_resolve_model_path — picks up `model_dir` and searches
it (plus basename-in-model-dir) before falling back to the dev
script-dir. This is how OnnxDirectEngine finds gallery-downloaded
YuNet/SFace files by filename only.
* engines.py::_flatten_insightface_pack — new helper that works
around an upstream packaging inconsistency: buffalo_l/s/sc zips
expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
files in a redundant `<name>/` directory. insightface's own
loader looks one level too shallow and fails. We call
`ensure_available()` explicitly, flatten if nested, then hand to
FaceAnalysis.
* engines.py::InsightFaceEngine.prepare — root-resolution order now
includes the `_model_dir` hint so packs download into the LocalAI
models directory by default.
* install.sh — no longer pre-downloads any weights. Everything is
gallery-managed now.
* smoke.py (new) — parametrized smoke test that iterates over every
gallery configuration, simulating the LocalAI install flow
(creates a models dir, fetches OpenCV files with checksum
verification, lets insightface auto-download its packs), then
runs detect + embed + verify (+ analyze where supported) through
the in-process BackendServicer.
* test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
paths; downloads ONNX files to a temp dir at setUpClass time and
passes ModelPath accordingly.
Registry change (core/services/facerecognition/store_registry.go):
* `dim=0` in NewStoreRegistry now means "accept whatever dimension
arrives" — needed because the backend supports 512-d ArcFace/MBF
and 128-d SFace via the same Registry. A non-zero dim still fails
fast with ErrDimensionMismatch.
* core/application plumbs `faceEmbeddingDim = 0`, explaining the
rationale in the comment.
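A simplified sketch of that dimension rule (assumed code, not the real
store_registry.go):
  package facerecognition
  import "errors"
  var ErrDimensionMismatch = errors.New("embedding dimension mismatch")
  // checkDim enforces the registry's embedding dimension; dim == 0 means
  // "accept whatever dimension arrives", so 512-d ArcFace/MBF and 128-d
  // SFace vectors can share one registry.
  func checkDim(dim int, embedding []float32) error {
    if dim != 0 && len(embedding) != dim {
      return ErrDimensionMismatch
    }
    return nil
  }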
Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.
Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:
PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000
PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000
PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000
PASS: insightface-opencv faces=6 dim=128 same-dist=0.000
PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000
7/7 passed
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim
CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:
LoadModel failed: Failed to load face engine:
OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx
Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.
Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).
Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).
Assisted-by: Claude:claude-opus-4-7
* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs
The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.
Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).
Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).
Response:
{
  "embedding": [<dim> floats, L2-normed],
  "dim": int,     // 512 for ArcFace R50 / MBF, 128 for SFace
  "model": "<name>"
}
Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.
Assisted-by: Claude:claude-opus-4-7
* fix(http): map malformed image input + gRPC status codes to proper 4xx
Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.
Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:
* decodeImageInput(s)
Wraps utils.GetContentURIAsBase64 and turns any failure
(invalid URL, not a data-URI, download error, etc.) into
echo.NewHTTPError(400, "invalid image input: ...").
* mapBackendError(err)
Inspects the gRPC status on a backend call error and maps:
INVALID_ARGUMENT → 400 Bad Request
NOT_FOUND → 404 Not Found
FAILED_PRECONDITION → 412 Precondition Failed
Unimplemented → 501 Not Implemented
All other codes fall through unchanged (still 500).
Before, my 1×1 PNG error-path test returned:
HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
HTTP 400 "failed to decode one or both images"
Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.
Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.
Assisted-by: Claude:claude-opus-4-7
* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis
Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.
Two changes:
1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
`files:` list with the pack zip's URI + SHA-256. `local-ai models
install insightface-buffalo-l` now fetches the zip, verifies the
hash, and extracts it into the models directory. No more reliance
on insightface's library-internal `ensure_available()` auto-download
or its hardcoded `BASE_REPO_URL`.
2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
the FaceAnalysis wrapper and drives insightface's `model_zoo`
directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
route each through `model_zoo.get_model()`, build a
`{taskname: model}` dict, loop per-face at inference — are
reimplemented in `InsightFaceEngine`. The actual inference classes
(RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
insightface's — we only replicate the glue, so drift risk against
upstream is minimal.
Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
layout that doesn't match what LocalAI's zip extraction produces.
LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
`<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
`_locate_insightface_pack` helper searches both locations plus
legacy paths and returns whichever has ONNX files. Replaces the
earlier `_flatten_insightface_pack` helper (which tried to fight
FaceAnalysis's layout expectations; now we just find the files
wherever they are).
Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.
Assisted-by: Claude:claude-opus-4-7
* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched
The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".
Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.
Face analyze cap dropped since buffalo_sc has no age/gender head.
Assisted-by: Claude:claude-opus-4-7[1m]
* feat(face-recognition): surface face-recognition in advertised feature maps
The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:
* api_instructions — the machine-readable capability index at
GET /api/instructions. Added `face-recognition` as a dedicated
instruction area with an intro that calls out the in-memory
registry caveat and the /v1/face/embed vs /v1/embeddings split.
* auth/permissions — added FeatureFaceRecognition constant, routed
all six face endpoints through it so admins can gate them per-user
like any other API feature. Default ON (matches the other API
features).
* React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
follow-up (noted in the plan).
Instruction count bumped 9 → 10; test updated.
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(agents): capture advertising-surface steps in the endpoint guide
Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.
Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.
Assisted-by: Claude:claude-opus-4-7[1m]
|
||
|
|
d16f19f1eb |
fix(kokoros): Build and publish the backend images from CI/CD (#9487)
* fix(kokoros): Build and publish the backend images from CI/CD
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* Delete .claude/agents
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* Delete .claude/commands
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* Delete .claude/settings.json
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
* Delete .claude/skills
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
|
||
|
|
39573ecd2a |
chore(whisperx): drop ROCm/hipblas build target (#9474)
whisperx has no upstream AMD GPU support and its core transcription path
(faster-whisper -> ctranslate2) falls back to CPU on AMD since the PyPI
ctranslate2 is CUDA-only. The torch rocm wheels would accelerate only the
alignment/diarization stages, producing a misleadingly half-working image.
Drop the hipblas variant rather than shipping a partially accelerated build
users can't distinguish from the real thing. AMD hosts now fall through the
capability map to cpu-whisperx / cpu-whisperx-development.
Also removes the now-dangling rocm-whisperx assertion from
pkg/system/capabilities_test.go and the ROCm mention from the whisperx row
in docs/content/reference/compatibility-table.md.
Assisted-by: Claude Code:claude-opus-4-7
|
||
|
|
a7dbb2a83d |
fix(gallery-agent): process blacklist command on recently-closed PRs (#9473)
The command-processing step only walked open PRs, so when a maintainer wrote
`/gallery-agent blacklist` and immediately closed the PR, the next scheduled
run missed the command, the `gallery-agent/blacklisted` label was never
applied, and the skip-URL step (which only pulls URLs from closed PRs
carrying that label) re-proposed the model on the next cron.
Also scan closed gallery-agent PRs from the last 14 days that don't already
carry the blacklist label, and apply the label retroactively when the command
is present. Close/recreate actions still only run on open PRs.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
c66c41e8d7 |
fix(ci): wire AMDGPU_TARGETS through backend build workflow (#9445)
Commit
|
||
|
|
a90a8cf1d0 |
fix(ci): switch gallery-agent to sigs.k8s.io/yaml (#9397)
The gallery-agent lives under .github/, which Go tooling treats as a hidden
directory and excludes from './...' expansion. That means 'go mod tidy' (run
on every dependabot dependency bump) repeatedly strips github.com/ghodss/yaml
from go.mod/go.sum, breaking 'go run ./.github/gallery-agent' with a missing
go.sum entry error.
Switch to sigs.k8s.io/yaml — API-compatible with ghodss/yaml and already
pulled in as a transitive dependency via non-hidden packages, so tidy can no
longer remove it.
|
||
|
|
b4e30692a2 |
feat(backends): add sglang (#9359)
* feat(backends): add sglang
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job
sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally;
-march=native fails on CI runners without AVX-512 in /proc/cpuinfo. Force
-march=sapphirerapids so the build always succeeds, matching sglang
upstream's docker/xeon.Dockerfile recipe. The resulting binary still requires
an AVX-512 capable CPU at runtime, so disable tests-sglang-grpc in
test-extra.yml for the same reason tests-vllm-grpc is disabled. Local runs
with make test-extra-backend-sglang still work on hosts with the right SIMD
baseline.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512
CXXFLAGS with -march=sapphirerapids was being overridden by
add_compile_options(-march=native) in sglang's CPU CMakeLists.txt, since
CMake appends those flags after CXXFLAGS. Sed-patch the CMakeLists.txt
directly after cloning to replace -march=native.
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
|
||
|
|
6f0051301b |
feat(backend): add tinygrad multimodal backend (experimental) (#9364)
* feat(backend): add tinygrad multimodal backend
Wire tinygrad as a new Python backend covering LLM text generation with
native tool-call extraction, embeddings, Stable Diffusion 1.x image
generation, and Whisper speech-to-text from a single self-contained
container.
Backend (`backend/python/tinygrad/`):
- `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects
Llama / Qwen2 / Mistral architecture from `config.json`, supports
safetensors and GGUF), Embedding via mean-pooled last hidden state,
GenerateImage via the vendored SD1.x pipeline, AudioTranscription +
AudioTranscriptionStream via the vendored Whisper inference loop, plus
Tokenize / ModelMetadata / Status / Free.
- Vendored upstream model code under `vendor/` (MIT, headers preserved):
llama.py with an added `qkv_bias` flag for Qwen2-family bias support
and an `embed()` method that returns the last hidden state, plus
clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf
training branch that pulls `mlperf.initializers`), audio_helpers.py
and whisper.py (trimmed to drop the pyaudio listener).
- Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 /
Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral
(Mistral / Mixtral). Auto-selected from model architecture or `Options`.
- `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the
default portable python is 3.10).
- `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile
into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so
tinygrad's CPU device uses the in-process libLLVM JIT instead of
shelling out to the missing `clang` binary.
- Local unit tests for Health and the four parsers in `test.py`.
Build wiring:
- Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`,
`BACKEND_TINYGRAD = tinygrad|python|.|false|true`,
docker-build-target eval, and `docker-build-backends` aggregator.
- `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix
entries (mirrors the transformers backend placement).
- `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image
entries (latest + development).
E2E test wiring:
- `tests/e2e-backends/backend_test.go` gains an `image` capability that
exercises GenerateImage and asserts a non-empty PNG is written to
`dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS`
knobs (a sketch of the check follows below).
- Five new make targets next to `test-extra-backend-vllm`:
- `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes,
mirrors the vllm target 1:1 (5/9 specs in ~57s).
- `test-extra-backend-tinygrad-embeddings` — same model, embeddings
via LLM hidden state (3/9 in ~10s).
- `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror,
health/load/image (3/9 in ~10min, 4 diffusion steps on CPU).
- `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en
against jfk.wav from whisper.cpp samples (4/9 in ~49s).
- `test-extra-backend-tinygrad-all` aggregate.
All four individual targets land green on the first MVP pass: 15 specs total, 0
failures across LLM+tools, embeddings, image generation, and speech
transcription.
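A hedged sketch of the check the new image capability performs. Only the BACKEND_TEST_IMAGE_PROMPT / BACKEND_TEST_IMAGE_STEPS knob names come from the commit; the request field names (PositivePrompt, Step, Dst), the defaults, and the helper itself are illustrative assumptions.

```go
package e2e

import (
	"bytes"
	"context"
	"fmt"
	"os"
	"strconv"

	pb "github.com/mudler/LocalAI/pkg/grpc/proto" // assumed import path
)

// generateAndCheckImage calls GenerateImage and requires a non-empty PNG at dst.
func generateAndCheckImage(ctx context.Context, client pb.BackendClient, dst string) error {
	prompt := os.Getenv("BACKEND_TEST_IMAGE_PROMPT")
	if prompt == "" {
		prompt = "a photo of a cat" // illustrative default
	}
	steps := 4 // illustrative default: few diffusion steps keep the CPU run short
	if s := os.Getenv("BACKEND_TEST_IMAGE_STEPS"); s != "" {
		if n, err := strconv.Atoi(s); err == nil {
			steps = n
		}
	}

	// Request field names are assumptions based on LocalAI's backend proto.
	if _, err := client.GenerateImage(ctx, &pb.GenerateImageRequest{
		PositivePrompt: prompt,
		Step:           int32(steps),
		Dst:            dst,
	}); err != nil {
		return err
	}

	// A file that starts with the PNG magic bytes counts as a non-empty PNG.
	b, err := os.ReadFile(dst)
	if err != nil {
		return err
	}
	if !bytes.HasPrefix(b, []byte("\x89PNG\r\n\x1a\n")) {
		return fmt.Errorf("expected a PNG at %s, got %d bytes", dst, len(b))
	}
	return nil
}
```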
* refactor(tinygrad): collapse to a single backend image
tinygrad generates its own GPU kernels (PTX renderer for CUDA, the
autogen ctypes wrappers for HIP / Metal / WebGPU) and never links
against cuDNN, cuBLAS, or any toolkit-version-tied library. The only
runtime dependency that varies across hosts is the driver's libcuda.so.1
/ libamdhip64.so, which are injected into the container at run time by
the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based
backends, there is no reason to ship per-CUDA-version images.
- Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries
from .github/workflows/backend.yml. The sole remaining entry is
renamed to -tinygrad (from -cpu-tinygrad) since it is no longer
CPU-only.
- Collapse backend/index.yaml to a single meta + development pair.
The meta anchor carries the latest uri directly; the development
entry points at the master tag.
- run.sh picks the tinygrad device at launch time by probing
/usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is
visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX
renderer (avoids any nvrtc/toolkit dependency); otherwise we fall
back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process
libLLVM JIT for the CLANG path.
- backend.py's _select_tinygrad_device() is trimmed to a CLANG-only
fallback since production device selection happens in run.sh.
Re-ran test-extra-backend-tinygrad after the change:
Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed
|
||
|
|
95efb8a562 |
feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend
turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.
scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.
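A minimal sketch of the env-var wiring described above, assuming the e2e harness builds a pb.ModelOptions before LoadModel. The CacheTypeKey/CacheTypeValue field names follow the ModelOptions.CacheTypeKey/Value mentioned in the commit; the helper name and import path are illustrative.

```go
package e2e

import (
	"os"

	pb "github.com/mudler/LocalAI/pkg/grpc/proto" // assumed import path
)

// applyCacheTypeEnv copies the optional BACKEND_TEST_CACHE_TYPE_K/V knobs onto
// the model options, so any llama.cpp-family backend can be exercised with a
// specific KV-cache quantization (q8_0, or turbo2/turbo3/turbo4 on turboquant).
func applyCacheTypeEnv(opts *pb.ModelOptions) {
	if k := os.Getenv("BACKEND_TEST_CACHE_TYPE_K"); k != "" {
		opts.CacheTypeKey = k
	}
	if v := os.Getenv("BACKEND_TEST_CACHE_TYPE_V"); v != "" {
		opts.CacheTypeValue = v
	}
}
```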
* feat(turboquant): carry upstream patches against fork API drift
turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.
Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).
Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
* docs: add turboquant backend section + clarify cache_type_k/v
Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.
Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.
* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion
The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.
* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e
The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.
Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.
Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.
Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.
* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3
The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which the ubuntu base image already provides (via mawk) and is
available everywhere the rest of the wrapper Makefile runs.
* Apply suggestion from @mudler
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
|
||
|
|
87e6de1989 |
feat: wire transcription for llama.cpp, add streaming support (#9353)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
1e4c4577bb |
fix(ci): small fixups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |
||
|
|
7ce675af21 |
chore(gallery-agent): extract readme
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> |