* test(whisper): wire e2e streaming transcription target
Adds test-extra-backend-whisper-transcription, mirroring the existing
llama-cpp / sherpa-onnx / vibevoice-cpp targets. The generic
AudioTranscriptionStream spec at tests/e2e-backends/backend_test.go:644
fails today because backend/go/whisper has no streaming impl - this
target is the failing TDD gate that the next phase makes pass.
Confirmed RED locally: 3 Passed (health, load, offline transcription),
1 Failed (streaming spec hits its 300s context deadline because the
base implementation returns 'unimplemented' but doesn't close the
result channel, leaving the gRPC stream open until the client times
out).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(whisper-cpp): expose new_segment_callback to the Go side
Adds set_new_segment_callback() and a C-side trampoline that whisper.cpp
invokes once per new text segment during whisper_full(). The trampoline
dispatches (idx_first, n_new, user_data) to a Go function pointer
registered via purego.NewCallback - text and timings are pulled by Go
through the existing get_segment_text/get_segment_t0/get_segment_t1
getters.
Wires the hook only when streaming is actually requested, to avoid a
per-segment function-pointer dispatch on the offline path.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(whisper-cpp): implement AudioTranscriptionStream
Wires whisper.cpp's new_segment_callback through purego back to Go so
the streaming transcription RPC produces real, time-correlated deltas
while whisper_full() is still decoding. Each segment becomes one
TranscriptStreamResponse{Delta}; whisper_full's return is the
TranscriptStreamResponse{FinalResult} carrying the full segment list,
language, and duration.
Per-call state is tracked in a sync.Map keyed by an atomic counter; the
Go callback registered via purego.NewCallback is a singleton, dispatched
through user_data. SingleThread today means only one entry is ever live,
but the map shape matches the sherpa-onnx TTS callback pattern.
The streaming path's final.Text is the literal concat of every emitted
delta (a strings.Builder accumulated by onNewSegment) so the e2e
invariant `final.Text == concat(deltas)` holds exactly. The first delta
has no leading space; subsequent deltas are space-prefixed. The offline
AudioTranscription path is unchanged.
Closes the gap with sherpa-onnx, vibevoice-cpp, llama-cpp, and tinygrad,
which already implement AudioTranscriptionStream.
Verified GREEN locally: make test-extra-backend-whisper-transcription
passes 4/4 specs (3 Passed initially under RED, +1 streaming spec now).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(whisper-cpp): assert progressive multi-segment streaming
Drives AudioTranscriptionStream against a real long-audio fixture and
asserts len(deltas) >= 2. The generic e2e spec at
tests/e2e-backends/backend_test.go:644 only checks len(deltas) >= 1
which is satisfied by both real and faked streaming - this spec is the
guardrail that a future "fake" impl can't sneak past.
Skipped by default (env-gated, like the cancellation spec); set
WHISPER_LIBRARY, WHISPER_MODEL_PATH, and WHISPER_AUDIO_PATH to a 30+
second clip to run.
Verified locally with a 55s 5x-JFK concat against ggml-base.en.bin:
1 Passed in 7.3s, deltas >= 2, finalSegmentCount >= 2,
concat(deltas) == final.Text.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(whisper-cpp): add transcription gRPC e2e job
Mirrors tests-sherpa-onnx-grpc-transcription /
tests-llama-cpp-grpc-transcription. Runs make
test-extra-backend-whisper-transcription whenever the whisper backend
or the run-all switch fires, so a pin-bump or refactor that breaks
streaming transcription gets caught before merge.
The whisper output on detect-changes is already emitted by
scripts/changed-backends.js (it iterates allBackendPaths); this PR
just exposes it as a workflow output and consumes it.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(whisper-cpp): silence errcheck on AudioTranscriptionStream defers
golangci-lint runs with new-from-merge-base=origin/master, so the
identical defer patterns in the existing offline AudioTranscription
path are grandfathered while the new ones in AudioTranscriptionStream
trip errcheck. Wrap both defers in `func() { _ = ... }()` to match what
errcheck wants without altering behavior. The errors from os.RemoveAll
and *os.File.Close are not actionable inside a defer here (we're
already returning), matching the offline path's contract.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ci(backend_build): plumb builder-base-image and BUILDER_TARGET build-args
Adds an optional builder-base-image input. When set, BUILDER_BASE_IMAGE
is forwarded as a build-arg AND BUILDER_TARGET=builder-prebuilt is set
to select the variant Dockerfile's prebuilt-base stage. When empty,
BUILDER_TARGET=builder-fromsource (the default) keeps the existing
from-source build path.
This makes the prebuilt-base optimization opt-in per matrix entry
without breaking local `make backends/<name>` invocations or backends
whose Dockerfile doesn't have a prebuilt path.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(llama-cpp,ik-llama-cpp,turboquant): multi-target Dockerfiles for prebuilt + from-source
Restructure the three llama.cpp-derived Dockerfiles so each supports
two builder paths in a single file, selected via the BUILDER_TARGET
build-arg:
BUILDER_TARGET=builder-fromsource (default)
- Standalone build: gRPC stage + apt installs + (conditionally)
CUDA/ROCm/Vulkan + compile.
- Used by `make backends/llama-cpp` locally and any caller that
doesn't supply a prebuilt base.
BUILDER_TARGET=builder-prebuilt
- FROM \${BUILDER_BASE_IMAGE} (one of quay.io/go-skynet/ci-cache:
base-grpc-* shipped in PR #9737).
- Skips ~25-35 min of gRPC compile + ~5-10 min of toolchain installs.
- Used by CI when the matrix entry sets builder-base-image.
Final FROM scratch resolves BUILDER_TARGET via an aliasing FROM stage
(BuildKit doesn't support variable expansion directly in COPY --from),
then COPY --from=builder pulls package output from the chosen path.
BuildKit prunes the unreferenced builder, so each build only does the
work for the chosen path.
The compile RUN is identical between both builder stages, so it's
factored into .docker/<name>-compile.sh and bind-mounted into both.
ccache mount + cache-id stay per-arch / per-build-type.
Local DX preserved: `make backends/llama-cpp` (no extra args) defaults
to BUILDER_TARGET=builder-fromsource and works exactly as before.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(backend.yml,backend_pr.yml): forward builder-base-image from matrix
Plumbs the new optional builder-base-image input from matrix into
backend_build.yml. backend_build.yml derives BUILDER_TARGET from
whether builder-base-image is set, so matrix entries that map to a
prebuilt base get the prebuilt path; entries that don't (python/go/
rust backends) fall through to the default builder-fromsource (which
their own Dockerfiles don't reference, so it's a no-op for them).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(backend-matrix): wire builder-base-image to llama-cpp variants
For every entry whose Dockerfile is llama-cpp/ik-llama-cpp/turboquant,
add a builder-base-image field pointing at the appropriate prebuilt
quay.io/go-skynet/ci-cache:base-grpc-* tag.
backend_build.yml derives BUILDER_TARGET from this field's presence:
non-empty -> builder-prebuilt; empty -> builder-fromsource. So this
commit alone activates the prebuilt-base path for these 23 backends
in CI, while local `make backends/<name>` (no extra args) keeps the
from-source path.
Mapping by (build-type, arch):
- '' / amd64 -> base-grpc-amd64
- '' / arm64 -> base-grpc-arm64
- cublas-12 / amd64 -> base-grpc-cuda-12-amd64
- cublas-13 / amd64 -> base-grpc-cuda-13-amd64
- cublas-13 / arm64 -> base-grpc-cuda-13-arm64
- hipblas / amd64 -> base-grpc-rocm-amd64
- vulkan / amd64 -> base-grpc-vulkan-amd64
- vulkan / arm64 -> base-grpc-vulkan-arm64
- sycl_* / amd64 -> base-grpc-intel-amd64
- cublas-12 + JetPack r36.4.0 / arm64 -> base-grpc-l4t-cuda-12-arm64
Cold-build savings expected: ~25-35 min per variant (skips the gRPC
compile + toolchain install that's now in the base).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: add base-grpc-l4t-cuda-12-arm64 variant for legacy JetPack entries
Two matrix entries (-nvidia-l4t-arm64-llama-cpp, -nvidia-l4t-arm64-
turboquant) build against nvcr.io/nvidia/l4t-jetpack:r36.4.0 + CUDA
12 ARM64. They're distinct from -nvidia-l4t-cuda-13-arm64-* which use
Ubuntu 24.04 + CUDA 13 sbsa. Add the missing JetPack-based variant
to base-images.yml so those two entries' builder-base-image mapping
in the previous commit resolves.
Bootstrap order before merging this PR (re-run base-images.yml on
this branch — 9 existing variants hit BuildKit cache, only the new
l4t-cuda-12-arm64 builds cold):
gh workflow run base-images.yml --ref ci/base-images-consumers
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: extract base-builder install logic into .docker/install-base-deps.sh
Pre-extraction, the apt + protoc + cmake + conditional CUDA/ROCm/Vulkan
+ gRPC install logic was duplicated across four files:
- backend/Dockerfile.base-grpc-builder (CI prebuilt-base source of truth)
- backend/Dockerfile.llama-cpp (builder-fromsource stage)
- backend/Dockerfile.ik-llama-cpp (builder-fromsource stage)
- backend/Dockerfile.turboquant (builder-fromsource stage)
A bump to e.g. CUDA toolkit packages had to be made in 4 places, and
drift between the prebuilt base and the variant-Dockerfile from-source
path was a real concern (ik-llama-cpp's hipblas branch was already
missing the rocBLAS Kernels echo that llama-cpp / turboquant /
base-grpc-builder all had).
Factor the install logic into a single .docker/install-base-deps.sh
that reads its inputs from env vars and runs conditionally on
BUILD_TYPE / CUDA_*_VERSION / TARGETARCH. Each Dockerfile now bind-
mounts the script alongside .docker/apt-mirror.sh and invokes it from
a single RUN step.
The variant Dockerfiles' grpc-source stage is removed entirely — the
script handles gRPC compile + install at /opt/grpc, and the
builder-fromsource stage mirrors builder-prebuilt by copying
/opt/grpc/. to /usr/local/.
Result:
- install-base-deps.sh: 244 lines (one source of truth)
- Dockerfile.base-grpc-builder: 268 -> 98 lines
- Dockerfile.llama-cpp: 361 -> 157 lines
- Dockerfile.ik-llama-cpp: 348 -> 151 lines
- Dockerfile.turboquant: 355 -> 154 lines
- Total Dockerfile bytes: 1332 -> 560 lines (58% reduction)
Bit-equivalence between prebuilt and from-source paths is now enforced
by construction: both invoke the same script with the same inputs.
A side-effect is that ik-llama-cpp now also gets the rocBLAS Kernels
echo + clblas block parity it was previously missing.
Includes the BUILD_TYPE=clblas branch (libclblast-dev) for parity even
though no current CI matrix entry uses it.
After this commit's force-push, base-images.yml needs to be redispatched
on this branch — the Dockerfile.base-grpc-builder content shifts so the
existing cache won't apply for the install layer (gRPC layer also
rebuilds since it's now in the same RUN step).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(base-images): skip-drivers on JetPack l4t variant
cuda-nvcc-12-0 isn't installable via apt on the JetPack r36.4.0 base
image — JetPack ships CUDA preinstalled at /usr/local/cuda and its
apt feed doesn't carry the cuda-nvcc-* packages from the public
repositories. The original matrix entry for -nvidia-l4t-arm64-llama-cpp
on master sets skip-drivers: 'true' for exactly this reason; the
new base-grpc-l4t-cuda-12-arm64 base needs to match.
Also forwards SKIP_DRIVERS as a build-arg from matrix into the build
(was missing entirely before this commit).
Caught by run 25612030775 — l4t-cuda-12-arm64 failed at:
E: Package 'cuda-nvcc-12-0' has no installation candidate
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Introduces a parameterized Dockerfile.base-grpc-builder that produces
a fully-prepped builder base image (apt deps + protoc + cmake + gRPC
at /opt/grpc + conditional CUDA/ROCm/Vulkan toolchains) and a
base-images.yml workflow that builds + pushes 9 variants to
quay.io/go-skynet/ci-cache:base-grpc-*:
base-grpc-amd64 (Ubuntu 24.04, CPU-only)
base-grpc-arm64 (Ubuntu 24.04, CPU-only)
base-grpc-cuda-12-amd64 (Ubuntu 24.04 + CUDA 12.8)
base-grpc-cuda-13-amd64 (Ubuntu 22.04 + CUDA 13.0)
base-grpc-cuda-13-arm64 (Ubuntu 24.04 + CUDA 13.0 sbsa)
base-grpc-rocm-amd64 (rocm/dev-ubuntu-24.04:7.2.1 + hipblas)
base-grpc-vulkan-amd64 (Ubuntu 24.04 + Vulkan SDK 1.4.335)
base-grpc-vulkan-arm64 (Ubuntu 24.04 + Vulkan SDK ARM 1.4.335)
base-grpc-intel-amd64 (intel/oneapi-basekit:2025.3.2)
The variant Dockerfiles (Dockerfile.llama-cpp, ik-llama-cpp, turboquant)
are NOT touched in this PR. PR 2 will refactor them to FROM these
prebuilt bases. This PR is intentionally inert - landing it changes no
existing CI behavior. The base images don't exist on quay until
someone manually triggers the workflow.
Bootstrap after merge:
gh workflow run base-images.yml --ref master
Wait ~30 min for all 9 variants to push, then merge PR 2 (the
consumer-side refactor that uses BUILDER_BASE_IMAGE build-arg to
FROM these tags).
Triggers afterwards:
- Saturdays 05:00 UTC (cron) - picks up upstream security updates,
runs ~24h before the backend.yml Sunday cron so bases are fresh.
- workflow_dispatch - manual ad-hoc rebuild.
- master push touching Dockerfile.base-grpc-builder or this workflow.
Why split into two PRs: the variant Dockerfiles in PR 2 will FROM the
prebuilt bases and have no from-source fallback. Their CI builds fail
if the bases don't exist on quay yet. Landing infrastructure first +
manual bootstrap + then consumer refactor avoids a broken-master window.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Mirror the ccache mount added to Dockerfile.llama-cpp in 9228e5b4 for
the other two llama.cpp-derived backends. Same shape, distinct mount
ids so each backend's cache is independent:
ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE}
turboquant-ccache-${TARGETARCH}-${BUILD_TYPE}
ik_llama.cpp is a different upstream fork; no source overlap with
llama-cpp, separate cache makes sense.
turboquant is a llama.cpp fork that reuses backend/cpp/llama-cpp
source via a thin wrapper Makefile — most TUs would in principle hit
llama-cpp's ccache too. Keeping them separate for now to avoid one
fork's regressions poisoning the other; revisit sharing after we have
hit-rate numbers.
Same registry-export behavior as llama-cpp: the cache mount rides on
backend_build.yml's existing cache-to: type=registry,mode=max.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The big RUN at line 268 of Dockerfile.llama-cpp re-runs from scratch on
every LLAMA_VERSION bump (or any LocalAI source change due to
COPY . /LocalAI just before). For CUDA-13 specifically that compile
recently hit the GHA 6h hard limit and failed:
https://github.com/mudler/LocalAI/actions/runs/25598418931/job/75148244557
Add a BuildKit cache mount on /root/.ccache and thread ccache through
CMake (CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER) so most translation units
hit cache when their preprocessed source is byte-identical to the
previous build.
The cache mount is exported to the registry as part of the existing
cache-to: type=registry,mode=max in backend_build.yml, so it persists
across runs. mount id is keyed on TARGETARCH + BUILD_TYPE so different
variants don't thrash the same cache slot; sharing=locked serializes
concurrent writes.
Cold-build effect (first run after enable, or on LLAMA_VERSION bump
that touches every TU): unchanged. Hot-build effect (subsequent runs
with the same source, or LLAMA_VERSION bumps that touch a handful of
files): ~5-15 min for the llama.cpp compile vs the previous 1-3h cold.
For CUDA-13 specifically this should bring rebuilds well under the 6h
GHA limit.
Does NOT help the *first* post-bump build — that's still cold. For
that, follow-up work would be: (a) trim CUDA_DOCKER_ARCH to modern
GPUs only, (b) audit which CMake variants the published images
actually need, (c) pre-built CUDA+gRPC base image.
ccache package is already installed in the builder stage (line 90).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The PR that introduced the per-arch + manifest-merge pilot (#9727)
only touched CI infrastructure files, so the path filter correctly
skipped backend builds on its merge commit. To observe the new
backend-merge-jobs flow assemble a real manifest list, this commit
touches faster-whisper's Makefile so its two new per-arch entries
schedule and the merge job runs.
The trailing comment is the smallest possible diff and is harmless
to the build.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(transcription): propagate request ctx through ModelTranscription*
Replaces context.Background() with the HTTP request ctx so client
disconnects start cancelling the gRPC call. No backend-side abort wiring
yet — that comes in a later commit. Pure plumbing.
Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(cli): pass ctx to backend.ModelTranscription
Follow-up to e65d3e1f which threaded ctx through ModelTranscription
but missed the CLI caller. CLI commands have no request-scoped ctx,
so context.Background() is correct here.
Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(audio): propagate request ctx into TTS, sound-gen, audio-transform
Same ctx-plumbing pattern applied to the rest of the audio path. CLI
callers use context.Background() since there is no request scope; HTTP
callers use c.Request().Context().
Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(backend): propagate request ctx into biometric, detection, rerank, diarization paths
Replaces remaining context.Background() sites in core/backend with the
caller's ctx. After this commit, every core/backend/*.go entry point
threads the request ctx end-to-end to the gRPC client.
Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(grpc): plumb ctx through AIModel.AudioTranscription{,Stream}
Adds context.Context as first parameter to the AIModel interface methods
that wrap whisper-style transcription. Server-side gRPC handler now
forwards the per-RPC ctx (server-streaming uses stream.Context()).
Whisper, Voxtral, vibevoice-cpp, and sherpa-onnx accept the parameter;
none uses it yet — the actual cancellation primitive lands in the next
commit so this is pure plumbing.
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(whisper): add abort_callback hook in the C++ bridge
Installs a std::atomic<int> flag, wires it into
whisper_full_params.abort_callback, and exposes a set_abort(int) C
symbol so Go can flip the flag from a goroutine watching the request
context. transcribe() now distinguishes abort (return 2) from real
whisper_full failure (return 1).
Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(whisper): register set_abort symbol in the purego loader
Adds the Go-side binding for the new C export so the next commit can
call CppSetAbort(1) from a watcher goroutine on ctx.Done().
Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(whisper): honor ctx cancellation and return codes.Canceled
A watcher goroutine watches ctx.Done() during AudioTranscription and
calls CppSetAbort(1) on cancel. whisper_full sees abort_callback return
true at the next compute graph step, returns non-zero, and the bridge
returns 2 -> AudioTranscription maps that to codes.Canceled.
Adds an opt-in test (gated on WHISPER_MODEL_PATH / WHISPER_AUDIO_PATH)
that asserts cancellation latency under 5s and proves the abort flag
resets cleanly so the next transcription succeeds.
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(whisper): join the cancel watcher goroutine before returning
Follow-up to 85edf9d2. The previous commit used `defer close(done)` and
called the watcher "joined synchronously" — but close() only signals,
it does not block until the goroutine exits. That left a window where
a late CppSetAbort(1) from a cancelled call could land on the next
call, after its C-side g_abort reset but before whisper_full() began
polling the abort callback, corrupting the second transcription.
Switch to a sync.WaitGroup join so wg.Wait() blocks until the watcher
has actually returned from its select.
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(whisper): short-circuit pre-cancelled ctx in AudioTranscription
If ctx is already Done() at entry, return codes.Canceled immediately
instead of running the full transcription. The C-side g_abort reset
happens at the start of transcribe() and would otherwise overwrite a
watcher-set abort flag from an already-cancelled ctx, producing a
spurious successful transcription on a request the client has already
abandoned.
Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(tests/distributed): update testLLM mock for new AudioTranscription signature
Phase B (93c48e19) added context.Context to AIModel.AudioTranscription
but missed the testLLM mock in tests/e2e/distributed. CI golangci-lint
caught it: *testLLM did not implement grpc.AIModel because the method
signature lacked the ctx parameter, which broke the distributed test
suite compilation and cascaded through every backend-build job that
runs `go build ./...`.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(whisper): port cancellation test to Ginkgo/Gomega
Project policy (.agents/coding-style.md, enforced by golangci-lint
forbidigo) is that all Go tests must use Ginkgo v2 + Gomega — no
stdlib testing patterns (t.Skip, t.Fatalf, etc.). Convert the
cancellation test to a Describe/It block with Skip(...) for env
gating and Expect/HaveOccurred for assertions.
Same coverage: cancel mid-flight returns codes.Canceled within 5s and
a follow-up transcription succeeds, proving the C-side g_abort flag
resets cleanly.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.
Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
exercising the helper directly (no engine load required).
Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
add --index-strategy=unsafe-best-match for cublas12 so the cu128
torch index wins over default-PyPI's cu130; new pyproject.toml-driven
l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
torchaudio/sglang to the jetson-ai-lab index without forcing every
transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
for the l4t13 BUILD_PROFILE; other profiles still go through the
requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
(new files) and cu128 torch index for cublas12 (default PyPI now
ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
capability mappings + image entries pointing at
quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
mirroring vllm's cuda13 build.
Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
+ online fp8 weight quantization, verified end-to-end on a 16 GB
RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
MTP draft worker's vocab embedding is loaded unquantised and OOMs
the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
ServerArgs structure, the typed-vs-engine_args precedence, the
speculative-decoding cheatsheet, and the mem_fraction_static gotcha
documented above.
* AGENTS.md: index entry for the new agent doc.
Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The previous omegaconf pin only addressed one symptom of a deeper problem:
chatterbox-tts upstream depends on `russian-text-stresser` (unpinned git URL),
which transitively pins `spacy==3.6.*` and other ancient packages. That cascade
forces pip to backtrack through Jinja2/MarkupSafe/omegaconf into Python-2-era
sdists that no longer build (e.g. ruamel.yaml<0.15, Jinja2 2.6 importing the
long-removed `setuptools.Feature`).
Install chatterbox-tts itself with --no-deps in install.sh and list its real
runtime deps explicitly in each requirements-*.txt, dropping the optional
russian-text-stresser. This unblocks the darwin (and other) builds without
playing whack-a-mole on each newly-discovered transitive pin.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
The previous pin in requirements.txt was ineffective: installRequirements
runs a separate `pip install --requirement` per file, so resolution does
not carry over to the per-profile file where chatterbox-tts is declared.
With chatterbox-tts's unpinned `omegaconf` dep, pip backtracked through
1.x sdists into ruamel.yaml<0.15, whose Python-2-era setup.py fails on
Python 3.10+.
Pin omegaconf==2.3.0 next to chatterbox-tts in every profile file
(matches what upstream chatterbox uses). Drop the dead pin from
requirements.txt.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Without an upper-floor pin, pip's resolver backtracks through omegaconf 1.x
sdists when installing chatterbox-tts. Old 1.x setups depend on
ruamel.yaml<0.15, whose setup.py uses Python-2-era names (Str, Bytes) and
fails to build on Python 3.10+, breaking the darwin python backend build.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Two unrelated CI breakages bundled together since both are one-liners:
- rerankers: bump torch 2.4.1 -> 2.7.1 on cpu/cublas12. The unpinned
transformers resolves to 5.x, whose moe.py registers a custom_op with
string-typed `'torch.Tensor'` annotations that torch 2.4.1's
infer_schema rejects, blocking the gRPC server from starting and
failing all 5 backend tests with "Connection refused" on :50051.
Matches the version used by the transformers backend.
- vllm-omni: strip fa3-fwd from the upstream requirements/cuda.txt
before resolving on aarch64. fa3-fwd 0.0.3 ships only an
x86_64 wheel and has no sdist, making the cuda profile unsatisfiable
on Jetson/SBSA. fa3-fwd is a soft runtime dep — vllm-omni's
attention backends fall back to FA2 then SDPA when it's missing.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(docs): correct broken Hugo relrefs
The Hugo build has been failing on master since the relevant pages
landed:
- text-generation.md:720 referenced `/docs/features/distributed-mode`,
but Hugo `relref` paths are relative to the content root, not the
rendered URL. Drop the `/docs/` prefix so the lookup matches the
existing `features/...` form used elsewhere in the file.
- audio-transform.md:144 referenced `tts.md`; the actual page is
`text-to-audio.md`.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(kokoros): stub Diarize and AudioTransform Backend trait methods
The recent backend.proto additions (Diarize, AudioTransform,
AudioTransformStream) extended the gRPC Backend trait, breaking
kokoros-grpc compilation with E0046 because the Rust implementation
hadn't picked up the new methods. Add Unimplemented stubs matching the
existing pattern for non-applicable RPCs in this TTS-only backend.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(vibevoice-cpp): track upstream ABI + wire 1.5B voice cloning
Two recent commits in mudler/vibevoice.cpp reshaped the vv_capi_tts
signature without a corresponding bump on the LocalAI side:
3bd759c "1.5b: unify into a single tts entry point" inserted a
ref_audio_path parameter between voice_path and dst_wav_path.
ad856bd "1.5b: multi-speaker dialog support" promoted that to a
(const char* const* ref_audio_paths, int n_ref_audio_paths)
pair for per-speaker conditioning.
Because purego resolves symbols by name and not by signature, the
build kept linking; at runtime the misaligned arguments turned the
TTS->ASR closed-loop test into a SIGSEGV inside cgo. Track HEAD
explicitly and bring the bridge in line with it:
* Update the CppTTS purego binding to the 9-arg form. purego
marshals []*byte as a **char by handing the C side the underlying
array address; nil/empty maps to NULL, which matches the C
contract for "no reference audio" on the realtime-0.5B path.
* Add a `ref_audio` gallery option (comma-separated, repeatable)
that the 1.5B path consumes for runtime voice cloning. Multiple
entries are interpreted as one WAV per speaker (Speaker 0..n-1).
* TTSRequest.Voice now routes by extension/shape: `.wav` or a
comma-separated list goes to ref_audio_paths; anything else stays
on voice_path (realtime-0.5B's pre-baked voice gguf).
* Pin VIBEVOICE_CPP_VERSION to ad856bd and wire the Makefile into
the existing bump_deps matrix so future upstream rolls land as
reviewable PRs instead of a silent CI break.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(vibevoice-cpp): use ModelOptions.AudioPath for 1.5B ref audio
Use the existing audio_path field from ModelOptions (already plumbed
through config_file's `audio_path:` YAML and consumed by other audio
backends like kokoros) instead of inventing a custom `ref_audio:`
Options[] string. Multi-speaker setups stay on a single comma-
separated value.
No behavior change beyond the gallery key name; per-call routing via
TTSRequest.Voice is unchanged.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Two related runtime fixes for Python backends that JIT-compile CUDA
kernels at first model load (FlashInfer, PyTorch inductor, triton):
1. libbackend.sh: replace `source ${EDIR}/venv/bin/activate` with a
minimal manual setup (_activateVenv: export VIRTUAL_ENV, prepend
PATH, unset PYTHONHOME) computed from $EDIR at runtime. `uv venv`
and `python -m venv` both bake the create-time absolute path into
bin/activate (e.g. VIRTUAL_ENV='/vllm/venv' from the Docker build
stage), so sourcing activate on a relocated venv — copied out of
the build container and unpacked at an arbitrary backend dir —
prepends a stale, non-existent path to $PATH. Pip-installed CLI
tools (e.g. ninja, used by FlashInfer's NVFP4 GEMM JIT) are then
never found and the load aborts with FileNotFoundError. Doing the
env setup ourselves matches what `uv run` does internally and
sidesteps the relocation problem entirely. Generic — every Python
backend benefits.
2. vllm/run.sh: replace ninja's default -j$(nproc)+2 with an adaptive
MAX_JOBS = min(nproc, (MemAvailable-4)/4). Each concurrent
nvcc/cudafe++ peaks at multiple GiB; the default OOM-kills on
memory-tight hosts (e.g. a 16 GiB desktop loading a 27B NVFP4
model) but underutilises 100-core / 1 TB boxes. User-set MAX_JOBS
still wins. Also pin NVCC_THREADS=2 unless overridden.
Refs: https://github.com/vllm-project/vllm/issues/20079
Assisted-by: Claude:claude-opus-4-7 [Edit] [Bash]
* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .'.
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Validates --headless + --start-rank 0 is rejected (rank 0 is the
head and must serve the API).
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot, --sysroot is recursive into linker).
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
`int(x) * 1e9` returns a float because `1e9` is a float literal, but
TranscriptSegment.start/end are integer protobuf fields. This caused
every transcription request to fail with:
TypeError: 'float' object cannot be interpreted as an integer
Multiply first, then cast — `int(x * 1e9)` — to get an int as required.
* feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp
Closes#1648.
OpenAI-style multipart endpoint that returns "who spoke when". Single
endpoint instead of the issue's three-endpoint sketch (refactor /vad,
/vad/embedding, /diarization) — the typical client wants one call, and
embeddings can land later as a sibling without breaking this surface.
Response shape borrows from Pyannote/Deepgram: segments carry a
normalised SPEAKER_NN id (zero-padded, stable across the response) plus
the raw backend label, optional per-segment text when the backend bundles
ASR, and a speakers summary in verbose_json. response_format also accepts
rttm so consumers can pipe straight into pyannote.metrics / dscore.
Backends:
* vibevoice-cpp — Diarize() reuses the existing vv_capi_asr pass.
vibevoice's ASR prompt asks the model to emit
[{Start,End,Speaker,Content}] natively, so diarization is a by-product
of the same pass; include_text=true preserves the transcript per
segment, otherwise we drop it.
* sherpa-onnx — wraps the upstream SherpaOnnxOfflineSpeakerDiarization
C API (pyannote segmentation + speaker-embedding extractor + fast
clustering). libsherpa-shim grew config builders, a SetClustering
wrapper for per-call num_clusters/threshold overrides, and a
segment_at accessor (purego can't read field arrays out of
SherpaOnnxOfflineSpeakerDiarizationSegment[] directly).
Plumbing: new Diarize gRPC RPC + DiarizeRequest / DiarizeSegment /
DiarizeResponse messages, threaded through interface.go, base, server,
client, embed. Default Base impl returns unimplemented.
Capability surfaces all updated: FLAG_DIARIZATION usecase,
FeatureAudioDiarization permission (default-on), RouteFeatureRegistry
entries for /v1/audio/diarization and /audio/diarization, audio
instruction-def description widened, CAP_DIARIZATION JS symbol,
swagger regenerated, /api/instructions discovery map updated.
Tests:
* core/backend: speaker-label normalisation (first-seen → SPEAKER_NN,
per-speaker totals, nil-safety, fallback to backend NumSpeakers when
no segments).
* core/http/endpoints/openai: RTTM rendering (file-id basename, negative
duration clamping, fallback id).
* tests/e2e: mock-backend grew a deterministic Diarize that emits
raw labels "5","2","5" so the e2e suite verifies SPEAKER_NN
remapping, verbose_json speakers summary + transcript pass-through
(gated by include_text), RTTM bytes content-type, and rejection of
unknown response_format. mock-diarize model config registered with
known_usecases=[FLAG_DIARIZATION] to bypass the backend-name guard.
Docs: new features/audio-diarization.md (request/response, RTTM example,
sherpa-onnx + vibevoice setup), cross-link from audio-to-text.md, entry
in whats-new.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(diarization): correct sherpa-onnx symbol name + lint cleanup
CI failures on #9654:
* sherpa-onnx-grpc-{tts,transcription} and sherpa-onnx-realtime panicked
at backend startup with `undefined symbol: SherpaOnnxDestroyOfflineSpeakerDiarizationResult`.
Upstream's actual symbol is SherpaOnnxOfflineSpeakerDiarizationDestroyResult
(Destroy in the middle, not the prefix); the rest of the diarization
surface follows the same naming pattern. The mismatched name made
purego.RegisterLibFunc fail at dlopen time and crashed the gRPC server
before the BeforeAll could probe Health, taking down every sherpa-onnx
test job — not just the diarization-related ones.
* golangci-lint flagged 5 errcheck violations on new defer cleanups
(os.RemoveAll / Close / conn.Close); wrap each in a `defer func() { _ = X() }()`
closure (matches the pattern other LocalAI files use for new code, since
pre-existing bare defers are grandfathered in via new-from-merge-base).
* golangci-lint also flagged forbidigo violations: the new
diarization_test.go files used testing.T-style `t.Errorf` / `t.Fatalf`,
which are forbidden by the project's coding-style policy
(.agents/coding-style.md). Convert both files to Ginkgo/Gomega
Describe/It with Expect(...) — they get picked up by the existing
TestBackend / TestOpenAI suites, no new suite plumbing needed.
* modernize linter: tightened the diarization segment loop to
`for i := range int(numSegments)` (Go 1.22+ idiom).
Verified locally: golangci-lint with new-from-merge-base=origin/master
reports 0 issues across all touched packages, and the four mocked
diarization e2e specs in tests/e2e/mock_backend_test.go still pass.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): convert non-WAV input via ffmpeg + raise ASR token budget
Confirmed end-to-end against a real LocalAI instance with vibevoice-asr-q4_k
loaded and the multi-speaker MP3 sample at vibevoice.cpp/samples/2p_argument.mp3:
both /v1/audio/transcriptions and /v1/audio/diarization now succeed and
return correctly attributed speaker turns for the full clip.
Two latent issues surfaced once the diarization endpoint actually exercised
the backend with a non-trivial input:
1. vv_capi_asr only accepts WAV via load_wav_24k_mono. The previous code
passed the uploaded path straight through, so anything that wasn't
already a 24 kHz mono s16le WAV failed at the C side with rc=-8 and
the very unhelpful "vv_capi_asr failed". prepareWavInput shells out
to ffmpeg ("-ar 24000 -ac 1 -acodec pcm_s16le") in a per-call temp
dir, matching the rate the model was trained on; both AudioTranscription
and Diarize now route through it. This is the same shape sherpa-onnx
uses (utils.AudioToWav), but vibevoice needs 24 kHz rather than 16 kHz
so we don't reuse that helper.
2. The C ABI's max_new_tokens defaults to 256 when 0 is passed. That's
fine for a five-second clip but not for anything past ~10 s — vibevoice
stops mid-JSON, the parse fails, and the caller sees a hard error.
Pass a much larger budget (16 384 ≈ ~9 minutes of speech at the
model's ~30 tok/s rate); generation stops at EOS so this is a cap
rather than a target.
3. As a defensive belt-and-braces, mirror AudioTranscription's existing
"fall back to a single segment if the model emits non-JSON text"
pattern in Diarize, so partial / unusual model output never produces
a 500. This kept the endpoint usable while diagnosing (1) and (2),
and is the right behaviour to keep.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(vibevoice-cpp): pass valid WAVs through directly so ffmpeg is not required at runtime
Spotted by tests-e2e-backend (1.25.x): the previous fix forced every
incoming audio file through `ffmpeg -ar 24000 ...`, which meant the
backend container — which does not ship ffmpeg — failed even for the
existing happy path where the caller already uploads a WAV. The
container-side error was:
rpc error: code = Unknown desc = vibevoice-cpp: ffmpeg convert to
24k mono wav: exec: "ffmpeg": executable file not found in $PATH
Reading vibevoice.cpp's audio_io.cpp, `load_wav_24k_mono` uses drwav and
already accepts any PCM/IEEE-float WAV at any sample rate, downmixes
multi-channel input to mono, and resamples to 24 kHz internally. So the
only inputs that genuinely need an external converter are non-WAV
formats (MP3, OGG, FLAC, ...).
Detect WAVs by RIFF/WAVE magic at bytes 0..3 / 8..11 and pass them
straight through with a no-op cleanup; everything else still goes
through ffmpeg with the same 24 kHz mono s16le target. The result:
* Container builds without ffmpeg keep working for WAV uploads
(the e2e-backends fixture is jfk.wav at 16 kHz mono s16le).
* MP3 and other non-WAV inputs still get the new ffmpeg conversion
path so the diarization endpoint stays useful.
* If the caller uploads a non-WAV but ffmpeg isn't on PATH, the
surfaced error is still descriptive enough to act on.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(ci): make gcc-14 install in Dockerfile.golang best-effort for jammy bases
The LocalVQE PR (bb033b16) made `gcc-14 g++-14` an unconditional apt
install in backend/Dockerfile.golang and pointed update-alternatives at
them. That works on the default `BASE_IMAGE=ubuntu:24.04` (noble has
gcc-14 in main), but every Go backend that builds on
`nvcr.io/nvidia/l4t-jetpack:r36.4.0` — jammy under the hood — now fails
at the apt step:
E: Unable to locate package gcc-14
This blocked unrelated jobs:
backend-jobs(*-nvidia-l4t-arm64-{stablediffusion-ggml, sam3-cpp, whisper,
acestep-cpp, qwen3-tts-cpp, vibevoice-cpp}). LocalVQE itself is only
matrix-built on ubuntu:24.04 (CPU + Vulkan), so it doesn't actually
need gcc-14 anywhere else.
Make the gcc-14 install conditional on the package being available in
the configured apt repos. On noble: identical behaviour to today (gcc-14
installed, update-alternatives points at it). On jammy: skip the
gcc-14 stanza entirely and let build-essential's default gcc take over,
which is what the other Go backends compile with anyway.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI
Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.
Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
bidirectional AudioTransformStream for low-latency frame-by-frame use.
This is the first bidi stream in the proto; per-frame unary at LocalVQE's
16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
embed,interface,base} with paired-channel ergonomics.
LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
wrapper needed because LocalVQE handles CPU feature selection internally
via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
LocalVQE runs single-threaded at ~1× realtime instead of the documented
~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).
HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
endpoint, accepts audio + optional reference + params[*]=v form fields.
Persists inputs alongside the output in GeneratedContentDir/audio so the
React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
(interleaved stereo mic+ref in, mono out). JSON session.update envelope
for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
utils.AudioToWav (with passthrough fast-path), so the user can upload any
format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
no /audio/transformations endpoint), traced via TraceMiddleware.
Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
huggingface.co/LocalAI-io/LocalVQE).
React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
section (sibling of Tools / Biometrics) — the page is enhancement, not
generation, so it lives outside Studio. Two AudioInput components
(Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
the speakers — the mic naturally picks up speaker bleed, giving a real
(mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
and useAudioPeaks hook (shared module-scoped AudioContext to avoid
hitting browser context limits with three players on one page); migrated
TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
the history entry stores all three URLs so clicking re-renders the full
triple, not just the output.
Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
two and let GPU-class hardware route through Vulkan in the gallery
capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
(8 specs covering tabs, file upload, multipart shape, history, errors).
Docs:
- New docs/content/features/audio-transform.md covering the (audio,
reference) mental model, batch + WebSocket wire formats, LocalVQE param
keys, and a YAML config example. Cross-links from text-to-audio and
audio-to-text feature pages.
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(ci): allow routing apt traffic through an alternate Ubuntu mirror
Adds opt-in APT_MIRROR / APT_PORTS_MIRROR knobs to all Dockerfiles, the
Makefile, and CI workflows so we can fail over to a non-canonical Ubuntu
mirror when archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com
are degraded (recently observed: multi-day DDoS against the default pool).
Defaults are empty everywhere — behavior is unchanged unless a mirror is
configured. To enable in CI, set the repo-level GitHub Actions variables
APT_MIRROR (and APT_PORTS_MIRROR for arm64 builds). Locally:
make docker APT_MIRROR=http://azure.archive.ubuntu.com
A small POSIX-sh helper in .docker/apt-mirror.sh rewrites both DEB822
(/etc/apt/sources.list.d/ubuntu.sources, Ubuntu 24.04+) and the legacy
/etc/apt/sources.list before the first apt-get update. Dockerfile stages
load it via RUN --mount=type=bind, so there is no extra layer and no
cache invalidation when the script is unchanged. Reusable workflows also
rewrite the runner's own /etc/apt sources before any sudo apt-get call.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(apt-mirror): default to the Azure mirror, visible in the workflow source
Bakes Azure (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com)
in as the default for both Docker builds and runner-side apt — rather than
hiding the URL behind a GitHub Actions repo variable that's not visible
from the source tree.
A new composite action at .github/actions/configure-apt-mirror is the
single source of truth for runner-side rewrites. Five standalone
workflows (build-test, release, tests-e2e, tests-ui-e2e, update_swagger)
just `uses: ./.github/actions/configure-apt-mirror`.
Three workflows (image_build, backend_build, checksum_checker) keep an
inline bash rewrite, because they install/upgrade git via apt *before*
the checkout step (so the local composite action isn't loadable yet).
The Azure URL is visible in those files too.
The `apt-mirror` / `apt-ports-mirror` inputs of the reusable workflows
keep their now-Azure defaults — they still feed the Docker build-args
block in addition to the inline runner-side rewrite. Callers (image.yml,
image-pr.yml, backend.yml, backend_pr.yml) drop the previous
`vars.APT_MIRROR` plumbing and rely on those defaults.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(apt-mirror): drop Force Install GIT, consolidate on the composite action
The PPA git upgrade ran add-apt-repository ppa:git-core/ppa, which talks
to api.launchpad.net — also part of Canonical's infrastructure and
currently returning HTTP 504. The Azure mirror only covers
archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com, not PPAs.
The system git that ubuntu-latest already ships is sufficient for
actions/checkout and the build pipeline, so just drop the upgrade. With
that gone, the apt-before-checkout constraint disappears too — all three
holdouts (image_build, backend_build, checksum_checker) can now switch
to ./.github/actions/configure-apt-mirror like the other five.
Net: 0 inline apt-mirror blocks, all 8 workflows route through the
composite action.
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds
399c1dec wired amdgpu-targets through the backend_build workflow_call
interface, intending the input's default value to cover matrix entries
that don't specify targets. However, GitHub Actions only applies a
workflow_call input default when the caller omits the input entirely.
When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}`
and the matrix entry has no amdgpu-targets key, the expression evaluates
to an empty string, which is treated as an explicit value — bypassing
the default. The result is Docker receiving AMDGPU_TARGETS="" which in
turn causes Make's ?= default to be skipped (since the variable is
already set in the environment, even to empty), and cmake gets
-DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an
indeterminate target rather than the intended GPU list.
Fix this at two levels:
1. backend.yml: use a || fallback in the expression so that an undefined
matrix.amdgpu-targets never reaches the reusable workflow as an empty
string. The target list is the canonical default and lives here.
2. backend_build.yml: remove the now-misleading default value from the
input declaration. The default never fired due to the above bug, so
keeping it implied a guarantee that didn't exist.
3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard
after the ?= assignment so that if AMDGPU_TARGETS is empty (whether
from environment or any future CI wiring mistake) the build fails
immediately with a clear message rather than silently producing a
binary compiled for an unknown GPU target.
Assisted-by: Claude Code:claude-sonnet-4-6
Signed-off-by: Russell Sim <rsl@simopolis.xyz>
* fix(build): plumb AMDGPU_TARGETS through to Docker builds
The docker-build-backend Makefile macro and Dockerfile.golang did not
pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds
always used the backend Makefile's hardcoded default GPU targets
regardless of what was specified via environment or CI inputs.
Signed-off-by: Russell Sim <rsl@simopolis.xyz>
---------
Signed-off-by: Russell Sim <rsl@simopolis.xyz>
compel 2.3.1 (latest, Nov 2025) declares transformers~=4.25 in its
metadata, i.e. >=4.25,<5.0. After transformers 5.0 (2026-01-26) and
huggingface-hub 1.0 (2025-10-27) shipped, the weekly DEPS_REFRESH
cache rotation in CI started seeing the new majors and pip's resolver
went into multi-hour backtracking storms walking every transformers
4.x candidate against every accelerate/hf-hub/tokenizers combination
to find a set compel would accept. The 2026-04-29 backend-build for
the diffusers backend (darwin-mps + l4t + cublas13-turboquant matrix
cells) hit the GitHub Actions 6h job timeout still inside pip
install — the build itself never started.
compel is the only hard upper bound on transformers in this stack
(diffusers, accelerate, peft, optimum-quanto are all flexible), and
upstream support for transformers 5 is still in flight: damian0815/
compel#129 ("Modernize Compel for Transformers 5") and #128 ("Bump
transformers version to >5.0") are both open as of today.
backend.py only constructs Compel() when COMPEL=1 is set in the env
(default off), so make compel a true optional extra:
- Wrap the top-level `from compel import ...` in try/except
ImportError, mirroring the existing sd_embed pattern.
- Auto-disable COMPEL with a warning when the module isn't
installed, instead of crashing on module load.
- Drop compel from all eight requirements-*.txt variants so the
resolver no longer has to satisfy its transformers cap.
- Leave a TODO in backend.py and in each requirements file
pointing at the upstream PR/issue, so the dependency can be
reinstated once compel supports transformers >= 5.
Users who rely on weighted-prompt embeddings can opt in with a
manual `pip install compel` alongside COMPEL=1; the warning emitted
on startup tells them how.
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Closes#9601
Makes the temporary scratch paths in vllm, vllm-omni, tinygrad, and pocket-tts
backends configurable via the standard TMPDIR env var, instead of always writing
to /tmp. This is a one-line change per call site that calls tempfile.gettempdir()
for the directory and keeps the same filename suffix.
Users who run on systems with a small root partition (or want to relocate scratch
files to a larger volume) can now redirect these by setting TMPDIR
(e.g. TMPDIR=/data/tmp), without affecting the existing LOCALAI_GENERATED_CONTENT_PATH
or LOCALAI_UPLOAD_PATH options that already cover other temp paths.
Files touched:
- backend/python/vllm/backend.py (1 site: video base64 scratch)
- backend/python/tinygrad/backend.py (1 site: image fallback dst)
- backend/python/pocket-tts/backend.py (1 site: tts wav fallback dst)
- backend/python/vllm-omni/backend.py (2 sites: video + audio scratch)
Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to
d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits
common_params_speculative into nested draft / ngram_simple / ngram_mod
sub-structs. Renames every grpc-server.cpp reference to match:
speculative.mparams_dft.path -> speculative.draft.mparams.path
speculative.{n_max,n_min} -> speculative.draft.{n_max,n_min}
speculative.{p_min,p_split} -> speculative.draft.{p_min,p_split}
speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx}
speculative.ngram_size_n -> speculative.ngram_simple.size_n
speculative.ngram_size_m -> speculative.ngram_simple.size_m
speculative.ngram_min_hits -> speculative.ngram_simple.min_hits
The "speculative.n_max" JSON key sent to the upstream server stays
unchanged — server-task.cpp still reads it and routes the value into
draft.n_max internally.
The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched
before #22397 and still exposes the flat layout. Since turboquant
reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend
patch-grpc-server.sh with an idempotent sed block that reverts the
ten field references back to the legacy flat names on the build copy
only — the original under backend/cpp/llama-cpp/ stays compiling
against vanilla upstream. Drop the block once the fork rebases.
ik-llama-cpp has its own grpc-server.cpp with no speculative refs
(0/2661 lines), so it is unaffected.
Validated locally with `make docker-build-llama-cpp` (avx, avx2,
avx512, fallback, grpc + rpc-server all built; image exported).
Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>