* feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try)
The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls
ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a
plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via
find_package(... CONFIG). It just had no darwin build path - the existing
package.sh and run.sh are Linux-only and there was no make target / workflow step.
Adds the bespoke darwin path, modeled on the ds4 one:
- scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib
bundling, create-oci-image (no Linux package.sh).
- Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter.
- scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp.
- .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp).
- backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries.
- backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(privacy-filter): macOS proto include + bundle ggml dylibs
Validated natively on an M4 (the build/package/load chain now works with Metal):
- CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but
only linked the binary dir, so on macOS it could not find protobuf's headers
(runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include.
Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op
on Linux (apt headers are already on the default search path).
- privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links
(libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk
absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH.
M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml
dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage.
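To illustrate why the otool -L walk missed the ggml libs, here is a minimal
Python sketch (not the packaging script itself) that splits otool -L style
output into absolute on-disk deps and @rpath deps. The helper name split_deps
and the sample output, including the Homebrew protobuf path, are assumptions
made for the example.

```python
def split_deps(otool_output):
    """Split `otool -L` style output into (absolute, rpath) dependency lists."""
    absolute, rpath = [], []
    # The first line names the inspected binary; dependencies follow, indented.
    for line in otool_output.splitlines()[1:]:
        line = line.strip()
        if not line:
            continue
        # Each entry looks like "<install name> (compatibility version ...)".
        name = line.split(" (", 1)[0]
        if name.startswith("@rpath/"):
            rpath.append(name)      # not a path on disk: a plain copy walk skips it
        elif name.startswith("/"):
            absolute.append(name)   # resolvable as-is on the build host
    return absolute, rpath


# Illustrative sample only; real output lists more libraries.
SAMPLE = """grpc-server:
\t@rpath/libggml-metal.0.dylib (compatibility version 0.0.0, current version 0.0.0)
\t/opt/homebrew/opt/protobuf/lib/libprotobuf.dylib (compatibility version 0.0.0, current version 0.0.0)
\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1.0.0)
"""

absolute, rpath = split_deps(SAMPLE)
```

Anything that lands in the rpath list has to be bundled explicitly, which is
what the script now does for the ggml dylibs.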
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces
The backend launcher scripts derived their own directory with
CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as
$CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:...,
exec $CURDIR/<binary> "$@"). When a backend is installed under a path that
contains a space - notably macOS's ~/Library/Application Support/... - bash
word-splits the unquoted $CURDIR, so the test builtin fails with
"binary operator expected" and exec tries to run ".../Library/Application",
yielding "No such file or directory". The backend never starts, surfacing as
a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the
realpath "$0") in every affected run.sh; no logic changes.
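The failure mode can be modeled in a few lines of Python. This is only a
sketch of shell field splitting (it ignores globbing and custom IFS values),
and the install path and helper names are hypothetical.

```python
def expand_unquoted(word):
    """Model of an unquoted shell expansion: the result is split on whitespace."""
    return word.split()


def expand_quoted(word):
    """Model of a double-quoted expansion: the result stays a single field."""
    return [word]


# Hypothetical install location containing a space.
curdir = "/Users/me/Library/Application Support/localai/backends/demo"

# exec $CURDIR/grpc-server   -> two fields, the first is not a real binary
unquoted = expand_unquoted(curdir + "/grpc-server")
# exec "$CURDIR/grpc-server" -> one field, the intended path
quoted = expand_quoted(curdir + "/grpc-server")
```

In the unquoted case the first field ends at ".../Library/Application", which
matches the path exec tried to run.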
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
feat(backends): darwin build for the localvqe backend
LocalVQE (acoustic echo cancellation / noise suppression / dereverberation)
already builds on Darwin - its Makefile takes the OS=Darwin branch with
GGML_METAL=OFF (upstream is CPU + Vulkan only), producing a native arm64 CPU
image. It was just never wired into CI.
- .github/backend-matrix.yml: add localvqe to includeDarwin (build-type metal,
lang go) - the darwin/arm64 build profile; the backend itself stays CPU.
- backend/index.yaml: metal: capability + concrete metal-localvqe(-development)
entries pointing at the -metal-darwin-arm64-localvqe images.
- backend/go/localvqe/Makefile: note on the existing Darwin branch (also the
per-backend change the CI path filter needs to build it here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
feat(backends): darwin/Metal builds for the vision C++/ggml backends
depth-anything-cpp, locate-anything-cpp, rfdetr-cpp and sam3-cpp already carry
a Darwin/Metal path in their Makefiles (GGML_METAL=ON when build-type=metal),
but were never wired into CI, so no Metal image was published and Apple Silicon
could not install them.
- .github/backend-matrix.yml: add the four to includeDarwin (build-type metal,
lang go), matching the other go+ggml *-cpp Metal entries.
- backend/index.yaml: add metal: to each backend's capabilities map (main and
-development) plus concrete metal-<backend>(-development) entries pointing at
the latest/master -metal-darwin-arm64-<backend> images.
- backend/go/*/Makefile: a one-line note on the existing Darwin branch (also
the per-backend change the CI path filter needs to actually build them here).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.
Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
ggml/llama become shared objects. SHARED_LIBS is now a make variable
(default OFF) so the override survives the recursive sub-make into the
VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
backends are runtime-dlopened, not link deps, so they only compile via
ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
otherwise become a DSO referencing hidden-visibility symbols in the
static libprotobuf.a, which fails to link ("hidden symbol ... is
referenced by DSO"). Keeping it static links gRPC/protobuf into the
executable while only ggml/llama stay shared, so no PIC or base-image
change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
them by scanning the bundled ld.so directory (/proc/self/exe), which
run.sh launches from.
Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.
Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant
- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
(only hipblas keeps the fallback build). ggml's arm64 variant table
(armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
flags and --target ggml through, then collects the .so set. run.sh and
package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
build, which emits dylibs rather than .so).
ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.
Scope still excludes the darwin packaging wiring (separate change).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging
- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
gcc-14 (installed in the compile step). The host only selects a variant it
actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)
ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.
Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback
The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing up the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.
Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all
arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only);
only GPU images ship fallback-only. Fix the stale comment to match.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(vllm): macOS/Metal support via vllm-metal (MLX)
Add an additive Apple-Silicon path to the existing vllm Python backend so
vLLM runs on macOS via vllm-metal (github.com/vllm-project/vllm-metal).
Spike outcome (proven on a real M4 / macOS 26.5, Qwen3-0.6B):
- vllm-metal registers through vLLM's platform-plugin entry point
(metal -> vllm_metal:register); MetalPlatform activates and runs on the
GPU through MLX.
- LocalAI's backend.py is UNCHANGED: AsyncEngineArgs(...) ->
AsyncLLMEngine.from_engine_args transparently resolves to vLLM 0.23's v1
AsyncLLM MLX engine, and async generate produced correct output.
- Teardown needs no backend.py change either: its only empty_cache() call is
CUDA-only (guarded by torch.cuda.is_available()), so the benign shutdown-only
"Allocator for mps is not a DeviceAllocator" noise comes from vLLM's
internal EngineCore teardown, not from our code.
Changes (all gated behind a darwin condition; Linux/CUDA/ROCm/Intel paths
are byte-for-byte unchanged):
- install.sh: darwin branch forces PYTHON_VERSION=3.12 (vllm-metal
requirement), creates/activates LocalAI's managed venv via ensureVenv,
then reproduces vllm-metal's installer INTO that venv (build vLLM 0.23.0
from the release source tarball against requirements/cpu.txt, then install
the prebuilt vllm-metal wheel from its latest GitHub release), and runs
runProtogen. installRequirements is skipped on darwin.
- backend-matrix.yml: add a vllm includeDarwin entry (mps, python).
- index.yaml: add metal capability + concrete metal-vllm /
metal-vllm-development child entries mirroring the metal-kitten-tts
template.
Version coupling: vllm-metal pins vLLM 0.23.0, equal to LocalAI's current
vllm pin. Bumping vllm must be coordinated with a supporting vllm-metal
release; documented in install.sh and requirements-cublas13-after.txt.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* chore(vllm): track the darwin vllm-metal pin via the autobumper
The Apple Silicon build pinned vLLM 0.23.0 as a hidden string in install.sh
while floating the vllm-metal wheel on releases/latest - the two could drift
apart silently. Make both a tracked, reproducible pair (VLLM_METAL_VERSION +
VLLM_VERSION), fetch the wheel by tag, and add .github/bump_vllm_metal.sh wired
into bump_deps.yaml. It tracks vllm-project/vllm-metal (not vllm/vllm latest),
reading the coupled vLLM source version from vllm-metal's own installer, and
opens a bump PR - mirroring the existing bump_vllm_wheel.sh for the cu130 wheel.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* chore(vllm): derive the darwin vLLM version, drop the second pin
Follow-up: VLLM_VERSION was still a hardcoded string duplicating what
VLLM_METAL_VERSION already determines. Derive it at install time from
vllm-metal's own installer (vllm_v=) at the pinned tag - one source of truth,
no second value to drift. The bumper now touches only VLLM_METAL_VERSION;
the derivation is immutable per tag, so builds stay reproducible.
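A minimal Python sketch of the derivation idea, assuming the installer assigns
the version on a line of the form vllm_v=<version>. The real install.sh does
this in shell; the function name and the sample installer text below are
hypothetical.

```python
import re


def derive_vllm_version(installer_text):
    """Return the version assigned to vllm_v in the given installer text."""
    # Accept both vllm_v=0.23.0 and vllm_v="0.23.0" (optional quote after '=').
    match = re.search(r"^\s*vllm_v=\W?([0-9][0-9A-Za-z.]*)", installer_text, re.MULTILINE)
    if match is None:
        raise ValueError("no vllm_v= assignment found in installer text")
    return match.group(1)


# Hypothetical installer excerpt, for illustration only.
SAMPLE_INSTALLER = 'set -e\nvllm_v="0.23.0"\necho "installing vLLM ${vllm_v}"\n'

version = derive_vllm_version(SAMPLE_INSTALLER)
```

Because the installer is read at a pinned tag, the same input always yields
the same version.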
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(vllm): fetch the vllm-metal wheel without the GitHub API
The darwin build resolved the wheel URL via api.github.com, whose
unauthenticated rate limit (60/hr per IP) 403s on shared macOS runners
(observed after the 9-min vLLM source build). Construct the release-asset
download URL deterministically from the pinned tag and the cp312/arm64 wheel
name instead - no API call, no rate limit. Verified the URL resolves (200).
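As a sketch of the URL construction in Python (install.sh does it in shell):
GitHub serves release assets at /<owner>/<repo>/releases/download/<tag>/<asset>,
so the URL follows from the tag and the wheel file name. The tag, the
distribution name and the macosx_11_0 platform tag below are placeholders, not
the actual vllm-metal values.

```python
def wheel_name(dist, version, py_tag="cp312", abi_tag="cp312",
               platform_tag="macosx_11_0_arm64"):
    """Build a wheel file name: dist-version-python-abi-platform.whl."""
    return f"{dist}-{version}-{py_tag}-{abi_tag}-{platform_tag}.whl"


def release_asset_url(repo, tag, asset):
    """Build the direct download URL of a GitHub release asset (no API call)."""
    return f"https://github.com/{repo}/releases/download/{tag}/{asset}"


# Placeholder tag/version; the real values come from the pinned VLLM_METAL_VERSION.
url = release_asset_url(
    "vllm-project/vllm-metal",
    "v0.0.0",
    wheel_name("vllm_metal", "0.0.0"),
)
```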
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(vllm): fail Score cleanly when the engine returns no prompt_logprobs
Audit of the Score path against vllm-metal (MLX on macOS): the engine accepts
SamplingParams(prompt_logprobs=1) but returns an all-None prompt_logprobs list
rather than computing it, so scoring is not supported there. The old guard
treated the truthy [None] list as valid and silently scored every candidate as
0. Detect the all-None case and return UNIMPLEMENTED instead. No-op on
Linux/CUDA, which populate real entries.
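A minimal sketch of the guard, assuming prompt_logprobs arrives as a list
whose entries are either None or a per-token mapping. The helper names are
hypothetical and NotImplementedError stands in for the gRPC UNIMPLEMENTED
status.

```python
def has_real_prompt_logprobs(prompt_logprobs):
    """True only if at least one entry carries actual logprob data."""
    if not prompt_logprobs:
        return False
    return any(entry is not None for entry in prompt_logprobs)


def check_scoring_supported(prompt_logprobs):
    """Fail loudly instead of scoring every candidate as 0."""
    if not has_real_prompt_logprobs(prompt_logprobs):
        raise NotImplementedError("engine returned no prompt_logprobs")


# The old truthiness check accepts a list that contains only None.
old_guard_passes = bool([None])
# Illustrative populated shape: an empty first position, then token id -> logprob.
populated = [None, {42: -0.5}]
```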
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): add main-model cpu_moe/n_cpu_moe options
Mirror the existing draft_cpu_moe/draft_n_cpu_moe siblings for the main
model, matching upstream --cpu-moe / --n-cpu-moe (common/arg.cpp). Lets
users keep MoE expert weights on CPU to manage VRAM on large MoE models.
Closes part of #10483
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): forward unknown '-' options to upstream arg parser
Any options: entry starting with '-' is collected and passed verbatim to
llama.cpp's own common_params_parse (LLAMA_EXAMPLE_SERVER) at the end of
params_parse, so every upstream llama-server flag works without a new
hand-wired branch. Passthrough runs last and wins on overlap; n_parallel is
snapshotted to survive parser_init's SERVER reset, and help/usage/completion
flags are skipped to avoid exiting the backend.
Closes #10483
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(llama-cpp): document cpu_moe/n_cpu_moe and option passthrough
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): terminate tensor/kv override vectors after passthrough
The tensor_buft_overrides padding and the kv/draft override terminators
ran before the generic option passthrough, so a passthrough flag
(--cpu-moe, --override-tensor, --override-kv, ...) appended a real entry
after the null sentinel - tripping the model loader's
back().pattern == nullptr assertion (crash) or being silently dropped.
Move all three termination/padding blocks to the end of params_parse,
after both the named-option loop and common_params_parse have pushed
their real entries. Also widen the exit()-flag skip list so --version,
--license, --list-devices and --cache-list cannot terminate the backend.
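The ordering problem can be modeled in Python as a list that must end with
exactly one None sentinel. This is a toy model of the invariant, not the C++
code; the helper names and entry strings are made up for the example.

```python
def terminate(overrides):
    """Append the null sentinel that marks the end of the override list."""
    overrides.append(None)
    return overrides


def is_well_formed(overrides):
    """The sentinel must be the last element and appear nowhere else."""
    return bool(overrides) and overrides[-1] is None and None not in overrides[:-1]


# Old order: terminate first, then the passthrough pushes a real entry.
old_order = terminate(["named-override"])
old_order.append("passthrough-override")

# New order: every real entry is pushed before the single termination step.
new_order = ["named-override"]
new_order.append("passthrough-override")
new_order = terminate(new_order)
```

In the old order the last element is a real entry, which corresponds to the
failed back().pattern == nullptr check described above.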
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backends): add darwin/metal (MPS) build for trl
Authors backend/python/trl/requirements-mps.txt and wires trl into the
darwin CI matrix and gallery so the MPS training path can be built and
validated on Apple Silicon. The MPS variant installs plain PyPI torch
wheels (MPS-capable on macOS arm64) and the trl training stack; bitsandbytes
is omitted as it is a CUDA-only dependency with poor Apple Silicon support.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(trl): guard uv-only --index-strategy for the pip/darwin path
The darwin/MPS build installs with pip (USE_PIP=true), which rejects the
uv-only --index-strategy flag and failed the darwin backend build. Add it
only on the uv path; Linux/CUDA resolution is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): darwin/metal support (libparakeet.dylib + DYLD path)
The parakeet-cpp backend had no macOS support and panicked at startup on
Apple/Metal nodes when purego.Dlopen could not find "libparakeet.so".
Fix it across the same four layers the sibling voxtral backend already
handles correctly:
- main.go: default the dlopen target to libparakeet.dylib on darwin
(runtime.GOOS), libparakeet.so elsewhere; PARAKEET_LIBRARY still wins.
- Makefile: also stage the built libparakeet.dylib next to the Go sources.
- package.sh: accept either the Linux .so[.X.Y] or the macOS .dylib when
bundling instead of hard-failing when no .so is present (the macOS case);
note that on Darwin only system frameworks are linked.
- run.sh: on Darwin set DYLD_LIBRARY_PATH and PARAKEET_LIBRARY to the
packaged .dylib; keep LD_LIBRARY_PATH + .so on Linux.
Mirrors backend/go/voxtral.
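The default-selection rule, sketched in Python for illustration (the backend
itself is Go and branches on runtime.GOOS). The function name and its
parameters are hypothetical; platform and env are passed in so the rule can be
exercised without touching the real environment.

```python
def default_library(base, platform, env, env_var):
    """Pick the dlopen target: env override first, then a per-OS default."""
    override = env.get(env_var)
    if override:
        return override  # an explicit <BACKEND>_LIBRARY always wins
    ext = ".dylib" if platform == "darwin" else ".so"
    return f"lib{base}{ext}"


on_darwin = default_library("parakeet", "darwin", {}, "PARAKEET_LIBRARY")
on_linux = default_library("parakeet", "linux", {}, "PARAKEET_LIBRARY")
overridden = default_library(
    "parakeet", "darwin", {"PARAKEET_LIBRARY": "/opt/custom/libparakeet.dylib"},
    "PARAKEET_LIBRARY",
)
```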
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(backends): darwin/metal support across purego Go backends
The parakeet-cpp fix in the previous commit was an instance of a bug
shared by nearly every purego/dlopen Go backend: the dlopen target was
hardcoded to a .so name and run.sh exported only LD_LIBRARY_PATH, so the
backend panicked at startup on macOS/Apple-Metal nodes (dyld needs the
.dylib name and DYLD_LIBRARY_PATH). voxtral was the only backend handling
this correctly.
Apply the same four-layer fix (mirroring backend/go/voxtral) to the
remaining affected backends:
whisper, sherpa-onnx, ced, stablediffusion-ggml, vibevoice-cpp,
qwen3-tts-cpp, omnivoice-cpp, crispasr, acestep-cpp, locate-anything-cpp,
depth-anything-cpp, rfdetr-cpp, sam3-cpp, localvqe
Per backend:
- main.go (sherpa-onnx: backend.go, two libraries): default the dlopen
target to the .dylib on darwin (runtime.GOOS), .so elsewhere; the
existing <BACKEND>_LIBRARY env override still wins.
- run.sh: on Darwin set DYLD_LIBRARY_PATH and point <BACKEND>_LIBRARY at
the packaged .dylib; keep LD_LIBRARY_PATH + the Linux CPU-variant
(avx/avx2/avx512) selection unchanged in the else branch.
- package.sh: also bundle the .dylib and stop hard-failing when no .so is
present (the macOS case).
- Makefile: also stage the built .dylib.
Notes:
- stablediffusion-ggml and acestep-cpp build their lib as a CMake MODULE,
which emits .so (not .dylib) on macOS; run.sh prefers .dylib and falls
back to .so so both layouts work.
- sherpa-onnx was already partly darwin-aware (Makefile/package.sh); only
run.sh and the two dlopen defaults needed fixing.
Linux behavior is unchanged. Verified gofmt-clean and
`CGO_ENABLED=0 go build` for every backend.
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backends): add darwin/metal build for liquid-audio
Wire the already-MPS-ready liquid-audio backend (it ships
requirements-mps.txt) into the darwin CI matrix and the gallery so
metal-darwin-arm64 images are built and selectable.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* ci(liquid-audio): trigger darwin build via requirements-mps note
The changed-backends path filter only builds a backend when a file under
its directory changes. The metal wiring lived in index.yaml + the matrix,
so the darwin job was skipped. Add a documenting comment to the MPS
requirements so CI actually exercises the darwin build.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(liquid-audio): guard uv-only --index-strategy for the pip/darwin path
Same fix as trl: the darwin/MPS build installs with pip (USE_PIP=true), which
rejects the uv-only --index-strategy flag and failed the darwin backend build.
Add it only on the uv path; Linux/CUDA resolution is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The supertonic Go TTS backend dlopens ONNX Runtime, but its runtime and
packaging scripts were Linux-only: run.sh exported LD_LIBRARY_PATH, pointed
ONNXRUNTIME_LIB_PATH at libonnxruntime.so, and always tried the ld.so exec
path, while package.sh hard-failed on any non-Linux host. macOS has no ld.so
loader, dyld uses DYLD_LIBRARY_PATH, and ONNX Runtime ships as a .dylib.
This applies the same purego .dylib/DYLD_LIBRARY_PATH fix that PR #10481
landed for 15 other ONNX/purego backends (sherpa-onnx, silero-vad, etc.),
which omitted supertonic:
- run.sh: on darwin export DYLD_LIBRARY_PATH and point ONNXRUNTIME_LIB_PATH
at libonnxruntime.dylib; guard the ld.so exec path to Linux only.
- package.sh: recognize Darwin instead of erroring out; the bundled .dylib is
resolved via DYLD_LIBRARY_PATH, no glibc/ld.so to bundle.
- helper.go: platform-native default library extension (dylib on darwin) for
the last-resort dlopen fallback.
It also wires the darwin CI build and gallery entries, resolving the
inconsistency where backend/index.yaml advertised metal for supertonic but no
includeDarwin matrix entry built the image:
- .github/backend-matrix.yml: add the -metal-darwin-arm64-supertonic Go entry.
- backend/index.yaml: declare metal capabilities and add the concrete
metal-supertonic / metal-supertonic-development child entries.
The Makefile already detects Darwin/osx/arm64 and stages the per-OS ONNX
Runtime tarball, mirroring sherpa-onnx, so no Makefile change is required.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(diffusers): pin diffusers and transformers to a known-good pair
The diffusers backend tracked git+https://github.com/huggingface/diffusers
(main) with an unpinned transformers. transformers v5 restructured
CLIPTextModel and removed the .text_model attribute that diffusers'
single-file loader reads, so loading any single-file Stable Diffusion checkpoint
fails:
create_diffusers_clip_model_from_ldm (single_file_utils.py)
position_embedding_dim = model.text_model.embeddings.position_embedding...
AttributeError: 'CLIPTextModel' object has no attribute 'text_model'
No released diffusers (<=0.38.0) supports transformers v5 - only unreleased
diffusers main does. Because the requirements tracked main plus an unpinned
transformers, every backend image froze whichever pair existed at build
time, and images built once transformers v5 shipped but before diffusers
main caught up are permanently broken.
Pin the last known-good released pair across all requirements files:
diffusers==0.38.0 and transformers==4.57.6. 0.38.0 still exposes every
pipeline backend.py imports (Flux, Wan, Sana, LTX2, Qwen, GGUF), so no
functionality is lost, and builds become reproducible instead of drifting
into the broken window.
Fixes #9979
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
* feat(ced): sketch sound-classification backend (CED audio tagger)
Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry,
footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.
SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist
in DESIGN.md):
- backend/backend.proto: new SoundDetection rpc + SoundClass messages
(run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h),
goced.go (Ced gRPC backend: Load + SoundDetection), Makefile
(clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh,
package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability
registration checklist), gallery/index + CI registration, and a scoping
note for the realtime/websocket live-recognition path (sliding-window
classify over the existing ws transport + voicegate; the ced C-API
per-PCM entry point is already window-friendly).
Backend code does not compile until protogen-go regenerates the pb types
and a libced.so is built (Makefile clones+builds it).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): REST /v1/audio/classification endpoint + capability registration
Wires the ced sound-event classification backend (AudioSet audio tagger)
end to end through the REST surface, mirroring the transcription path.
- Handler: core/http/endpoints/openai/sound_classification.go parses the
multipart audio upload, temp-files it, resolves the model config and
calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection)
loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface,
Backend client, Client, embed, server, base default) so the loader-typed
client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with
the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler;
FLAG_SOUND_CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap +
GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase
option; /api/instructions audio area updated; auth RouteFeatureRegistry +
FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI
usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter
+ i18n; docs page features/audio-classification.md + whats-new + crosslink.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): realtime sound-event detection over the websocket API
When a realtime pipeline configures a sound-classification model, each
VAD-committed utterance (the same window the transcription path produces)
is also run through the CED sound-event classifier and the scored AudioSet
tags are emitted as a new server event. No new backend rpc is needed: the
SoundDetection gRPC method already exists on this branch.
- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty)
beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the
ModelInterface; implement it on wrappedModel and transcriptOnlyModel by
calling backend.ModelSoundDetection with the session's sound-classification
model config (mirrors how Transcribe dispatches). Load the optional config
in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index,
detections[]{label,score,index}) with type conversation.item.sound_detection,
its ServerEventType constant and MarshalJSON, mirroring the transcription
completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window,
build the event, t.SendEvent) and wire it at the utterance-commit hook right
after emitTranscription; gated on session.SoundDetectionEnabled (resolved
from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0).
Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections,
classifier error) plus a SoundDetection method on the fakeModel double.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): implement SoundDetection in nodes backend test doubles
The SoundDetection method added to the grpc backend interface left two
test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so
core/services/nodes failed to compile under `go vet`/`go test` (go build
missed it: the doubles live in _test.go). Add the method to both,
mirroring their existing Detect mock. Repairs CI for the nodes package.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)
Sound-event detection must activate on sounds, not speech, so it no longer
runs through the voice VAD/transcription path. A sound-detection-only
pipeline (sound_detection set, no transcription/LLM) now:
- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline
stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS
loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription
stage, so the client drives windowing via input_audio_buffer.commit
(option A: client-side sliding window). The per-PCM C-API already supports
arbitrary windows.
commitUtterance gains a sound-only branch: it emits the
conversation.item.sound_detection event (scored AudioSet tags) and stops -
no transcription, no LLM response. generateResponse is now guarded on a
transcription stage being present, so a sound-only turn never invokes the LLM.
Existing transcription/VAD sessions are unchanged (additive). Added a
commitUtterance sound-only Ginkgo spec asserting it emits the sound event and
neither transcribes nor generates a response. go vet + golangci-lint
(new-from-merge-base) clean; openai suite green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): register sound-classification backend in gallery + CI
Mechanical backend-image registration for the ced sound-event classifier,
mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.
- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies
of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64,
l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan
amd64/arm64, rocm hipblas, and the metal darwin entry), changing only
backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform)
plus ced-development and the per-arch image entries, each uri/mirror
tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is
intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in
inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as
the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in
backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so
the existing /v1/audio/classification annotations land in the generated spec.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): server-side windowing for realtime sound detection (option B)
Adds an optional server-driven sliding-window classifier so a sound-only
realtime client only has to stream audio (no input_audio_buffer.commit):
- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs.
When both > 0 on a sound-only session, the server classifies the last
window of streamed audio every hop and emits a
conversation.item.sound_detection event; the input buffer is trimmed to one
window so a long
stream stays bounded. When unset, the session stays client-driven
(option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so
it is unit-testable) + writeWindowWAV, which declares the true
InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples
correctly. Goroutine is started after toggleVAD and torn down with the
session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta
registry; the earlier realtime commit added pipeline.sound_detection
without a registry entry, failing TestAllFieldsHaveRegistryEntries. This
fixes that and covers the two new knobs.
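A minimal sketch of one classifier tick as described above, in Python for brevity (the actual handler is Go and is not reproduced here). The function names, the event payload shape, and the `classify` callback are hypothetical; the ticker that calls this once per hop is left out.

```python
from typing import Callable, List, Optional


def window_samples(sample_rate: int, window_ms: int) -> int:
    """Number of PCM samples that make up one window."""
    return sample_rate * window_ms // 1000


def classify_sound_window(
    buffer: List[int],
    sample_rate: int,
    window_ms: int,
    classify: Callable[[List[int]], list],
) -> Optional[dict]:
    """One tick: classify the newest window and trim the buffer in place.

    Returns None (a no-op) when less than one full window is buffered.
    """
    need = window_samples(sample_rate, window_ms)
    if need <= 0 or len(buffer) < need:
        return None
    window = buffer[-need:]
    # Keep only the newest window so a long stream stays bounded.
    del buffer[:-need]
    # Hypothetical payload: the real event fields are not shown in this log.
    return {"type": "conversation.item.sound_detection", "tags": classify(window)}
```

Calling this every hop_ms with a window_ms-sized window gives overlapping windows whenever the hop is shorter than the window.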
Tests: classifySoundWindow emits an event + trims the buffer to one window,
no-ops on too-little audio; writeWindowWAV declares the given sample rate.
go build/vet + golangci-lint (new-from-merge-base) clean; config + openai
suites green.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)
The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0,
converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced +
known_usecases: sound_classification) and two gallery/index.yaml entries
(ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and
removes the now-resolved TODO from backend/index.yaml.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ced): add tiny/mini/small GGUF model gallery entries
Publishes the rest of the CED family (same architecture, metadata-driven port
verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds
their f16 + q8_0 gallery entries:
ced-tiny (5.5M, edge/Pi-class) f16 11MB / q8_0 6MB
ced-mini (9.6M) f16 19MB / q8_0 11MB
ced-small (22M) f16 42MB / q8_0 23MB
All sha256-pinned. ced-base remains the accuracy default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo
All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single
HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8
gallery model entries' urls + file uris accordingly. sha256 and filenames are
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): bump CED_VERSION to the short-clip fix
Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip
shorter than target_length (~10.11s): time_pos_embed was added at its full
63-frame grid instead of being sliced to the clip's actual time grid, tripping
ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s
windows) and gated with a short-clip parity test upstream.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive
- README.md: add ced.cpp to the "native C/C++/GGML engines developed and
maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend
category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two
verification-checklist items requiring new backends to be documented in the
backends.md category list, and in-house native engines to be added to the
README maintained-engines table. This directive was missing.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(ced): repin CED_VERSION to the v0.1.0 release commit
ced.cpp history was squashed into a single release commit (tagged v0.1.0), so
the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the
v0.1.0 release commit, so the backend builds against a commit that exists.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths
- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of
the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer.
#nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since
the uintptr is a C-owned malloc'd buffer, not Go-GC memory.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(vllm): don't stream raw tool-call markup as content when a tool parser is active
When a tool_parser is configured and the request carries tools, the streaming
loop emitted every text delta as delta.content — including the model's raw
tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs
on the full output after the stream. Clients streaming a tool call therefore
saw the unparsed tool-call syntax as assistant content.
Buffer the text while a tool parser is active for the request; the existing
end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned
content), which the Go side converts to SSE deltas. Non-tool-parser streaming
is unchanged.
Add a server-less regression test covering both the tool-call case (no raw
markup leaked as content) and the plain-text case (content delivered exactly
once — guards against double-emitting the buffered content).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add expectedFailure test for progressive streaming with tool parser (Case 3, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* test(vllm): add Cases 4+5 — marker split across chunks + false-positive prefix (TDD, Option B state machine, #582)
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* feat(vllm): progressive streaming via parser.extract_tool_calls_streaming
When a tool parser is active for a tool-enabled streaming request,
#10346 buffers the entire generation and surfaces it on the final
chunk to prevent raw tool-call markup from leaking as delta.content.
This is correct but makes the request effectively non-streaming for
plain-text responses: the client sees nothing until the model stops.
Every concrete tool parser shipped with vLLM 0.23+ already implements
extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba,
Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate
the parser before the streaming loop and call its streaming method per
delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…])
when the parser is ready.
Falls back to the existing #10346 buffer path when:
- the parser does not have extract_tool_calls_streaming, OR
- extract_tool_calls_streaming raises mid-stream (logged, the
rest of the request finishes via post-loop extract_tool_calls).
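The control flow above can be sketched roughly as follows. This is an illustration under assumptions, not the backend code: `parse_delta` is a simplified stand-in for a parser's streaming method (the real vLLM method takes more arguments), the message dicts are hypothetical, and handling of content already streamed before a mid-stream failure is left out.

```python
from typing import Callable, Iterable, Iterator, Optional


def stream_with_tool_parser(
    deltas: Iterable[str],
    parse_delta: Optional[Callable[[str, str, str], Optional[dict]]],
) -> Iterator[dict]:
    """Yield client-visible messages for a tool-enabled streaming request.

    parse_delta(previous_text, current_text, delta_text) returns a message
    when the parser is ready to emit one, or None to keep waiting.
    Passing parse_delta=None models a parser without streaming support.
    """
    text = ""
    native = parse_delta is not None
    for delta in deltas:
        previous, text = text, text + delta
        if not native:
            continue  # buffer path: hold everything until the stream ends
        try:
            msg = parse_delta(previous, text, delta)
        except Exception:
            # Fall back to buffering for the rest of the request.
            native = False
            continue
        if msg is not None:
            yield msg
    if not native:
        # The caller runs full-output tool-call extraction on this text.
        yield {"final_text": text}
```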
Tests (TestStreamingToolParser):
1. Buffer path: no markup leaked, no content duplication
2. Native streaming: plain-text response streams progressively
3. Native streaming: tool_call structured, no markup leaked
4. Native streaming exception → graceful fallback, no markup, no crash
5. No tool parser → unchanged per-delta content stream
E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13).
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* docs(vllm): add server-side TTFT benchmark for the streaming tool-parser path
Self-contained stdlib-only script that measures time-to-first-token (TTFT)
for the vLLM backend's two streaming scenarios:
- tool_call: request mentions a tool; model is expected to call it
- plain_text: request offers a tool but explicitly asks for prose
Use this to compare:
- the buffer-all path (#10346) → plain_text TTFT ≈ total response time
- the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time
python examples/vllm-bench/ttft_streaming_tool_parser.py \
--url http://localhost:8080 --model my-coder --runs 3
Lives under examples/ so it does not interfere with the test suite.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* examples/vllm-bench: add long-text scenario (8 paragraphs, 1500 tokens)
The long-text scenario shows the buffering vs streaming difference most
dramatically: with the buffer-all path, the client receives nothing for
20+ seconds and then the entire 1500-token response at once. With native
streaming, the first token arrives in tens of milliseconds and the
response flows progressively.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
---------
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
Co-authored-by: Philipp Wacker <philipp.wacker@ibf-solutions.com>
* feat(nemo): enable word-level timestamps for ASR models
The nemo backend ignored timestamp_granularities and always returned a
single segment with start=0 end=0, making word-level timestamps
impossible to obtain even though the NeMo models (parakeet-tdt, etc.)
fully support them.
Changes:
- Add _get_stride_seconds() to compute frame duration from the model's
preprocessor window_stride and encoder subsampling_factor.
- Add _build_segments_with_words() that extracts word offsets from the
NeMo Hypothesis.timestamp dict and converts frame indices to
nanosecond timestamps.
- Support 'word' granularity (one segment per word) and 'segment'
granularity (merge at time-gap boundaries using a dynamic threshold).
- Populate TranscriptSegment.words with TranscriptWord entries so
callers get both segment-level and word-level timing.
- Only request timestamps from NeMo when the caller actually asks for
them (timestamp_granularities is non-empty), keeping the fast path
unchanged for callers that don't need timestamps.
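A rough sketch of the frame-to-time conversion described above. The helper names are hypothetical, the word dict keys (`word`, `start_offset`, `end_offset`) are assumed to match what NeMo returns, and the fixed `max_gap_ns` is a stand-in for the dynamic threshold, whose formula is not given here.

```python
from typing import Dict, List

NANOS_PER_SECOND = 1_000_000_000


def stride_seconds(window_stride: float, subsampling_factor: int) -> float:
    """Duration of one encoder output frame in seconds."""
    return window_stride * subsampling_factor


def words_to_nanos(word_offsets: List[Dict], stride: float) -> List[Dict]:
    """Convert frame-index offsets into nanosecond start/end times."""
    return [
        {
            "text": w["word"],
            "start": round(w["start_offset"] * stride * NANOS_PER_SECOND),
            "end": round(w["end_offset"] * stride * NANOS_PER_SECOND),
        }
        for w in word_offsets
    ]


def group_by_gap(words: List[Dict], max_gap_ns: int) -> List[List[Dict]]:
    """Start a new segment whenever the silence between words exceeds max_gap_ns."""
    segments: List[List[Dict]] = []
    for w in words:
        if segments and w["start"] - segments[-1][-1]["end"] <= max_gap_ns:
            segments[-1].append(w)
        else:
            segments.append([w])
    return segments
```

With a 0.01 s window_stride and a subsampling factor of 8, for example, one frame covers 0.08 s, so a word ending at frame 5 ends at 0.4 s.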
Tested with nvidia/parakeet-tdt-0.6b-v3 on the JFK "ask not" clip:
curl -X POST /v1/audio/transcriptions \
-F file=@jfk.wav -F model=nemo-parakeet-tdt-0.6b \
-F 'timestamp_granularities[]=word' -F response_format=verbose_json
→ each word has correct start/end times in seconds.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(nemo): address Copilot review feedback
- Narrow exception handling in _get_stride_seconds to catch only
AttributeError, KeyError, TypeError instead of bare Exception, and
emit a warning when falling back to the hardcoded stride.
- Remove explicit return_hypotheses=False when timestamps are requested;
timestamps=True already forces NeMo to return Hypothesis objects.
- Add a warning when NeMo does not return Hypothesis objects despite
timestamps being requested.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
The parakeet-specific word accessors can return stale initialisation
data (model name, binary blobs) for segments with no real speech.
Add isValidWord() to filter out words that have:
- empty or whitespace-only text
- U+FFFD replacement characters (from binary data scrubbing)
- negative timestamps
- zero duration (end <= start)
Also skip empty segments entirely when they have no recognisable
content (empty text AND no valid words), preventing spurious subtitle
entries like '00:45:33,592 --> 00:45:33,592 parakeet@rH\u000b\ufffdI'.
Applies to both AudioTranscription and AudioTranscriptionStream.
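The filter rules above, sketched in Python for illustration only (the backend itself is Go, and isValidWord is not reproduced here). The function signatures and the tuple layout for words are assumptions.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start, end), a hypothetical layout


def is_valid_word(text: str, start: float, end: float) -> bool:
    """Reject words that look like stale initialisation data."""
    if not text.strip():
        return False  # empty or whitespace-only text
    if "\ufffd" in text:
        return False  # U+FFFD left behind by binary data scrubbing
    if start < 0 or end < 0:
        return False  # negative timestamps
    if end <= start:
        return False  # zero (or negative) duration
    return True


def keep_segment(text: str, words: List[Word]) -> bool:
    """Drop a segment only when it has empty text AND no valid words."""
    has_valid_word = any(is_valid_word(*w) for w in words)
    return bool(text.strip()) or has_valid_word
```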
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
fix(vllm): structured outputs silently ignored on vLLM >= 0.23
vLLM >= 0.23 removed GuidedDecodingParams (now StructuredOutputsParams) and
renamed the SamplingParams field guided_decoding -> structured_outputs. The
import failed, HAS_GUIDED_DECODING became False, and the whole guided-decoding
block was skipped, so response_format / grammar constraints were silently
ignored. Adapt the existing request.Grammar path to the new class/field.
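One way to keep this from failing silently again is sketched below. It is an assumption-laden illustration, not the change itself: the helper name is hypothetical, supporting both the old and new names is a choice of this sketch, and it assumes both classes accept a `grammar` keyword. It takes the module as an argument so it can be exercised without vLLM installed.

```python
from types import SimpleNamespace


def structured_output_kwargs(sampling_params_module, grammar: str) -> dict:
    """Build the SamplingParams keyword argument for a grammar constraint.

    Prefers the newer class/field names, falls back to the older ones, and
    raises instead of silently dropping the constraint when neither exists.
    """
    new_cls = getattr(sampling_params_module, "StructuredOutputsParams", None)
    if new_cls is not None:
        return {"structured_outputs": new_cls(grammar=grammar)}
    old_cls = getattr(sampling_params_module, "GuidedDecodingParams", None)
    if old_cls is not None:
        return {"guided_decoding": old_cls(grammar=grammar)}
    raise RuntimeError("no structured-output support found; refusing to ignore the grammar")


# Stand-in for a newer vLLM module, used only to exercise the helper.
fake_new = SimpleNamespace(StructuredOutputsParams=lambda grammar: ("new", grammar))
print(structured_output_kwargs(fake_new, "root ::= 'a'"))
```

In real use the module argument would presumably be vllm.sampling_params, with the returned dict splatted into the SamplingParams constructor.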
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* feat(gallery): add Depth Anything V2 models + bump native version
Add Depth Anything V2 (DA2) support to the depth-anything backend. DA2 is
depth-only (no camera pose, no confidence) and ships both relative
(relative inverse depth) and metric (depth in metres) variants. The Go
backend is model-agnostic, so no backend code changes are required — only
a native version bump and new gallery entries.
- backend/go/depth-anything-cpp/Makefile: pin DEPTHANYTHING_VERSION to the
depth-anything.cpp commit that adds the DA2 engine + C-API routing
(e3dec57f13a52366bbc4f279ef44804915960a6b, kept alive by the upstream tag
da2-support so it survives a squash-merge).
- gallery/index.yaml: add 12 DA2 entries (4 base quants, small, large, plus
Hypersim indoor and VKITTI outdoor metric models in S/B/L). Metric models
carry the metric-depth tag; none carry camera-pose.
Assisted-by: Claude:claude-opus-4-8
* chore(depth-anything-cpp): pin to merged DA2 master commit
PR #1 (mudler/depth-anything.cpp) merged to master as f4e17de (squash); repoint
the pin from the pre-merge commit to the canonical master commit.
Assisted-by: Claude:claude-opus-4-8
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): adapt grpc-server to upstream server-schema split
Upstream llama.cpp (e475fa2) extracted the JSON request-schema evaluation
out of the static server_task::params_from_json_cmpl into the new
server_schema::eval_llama_cmpl_schema (tools/server/server-schema.cpp).
The grpc-server unity build still called the old static member, breaking
every llama-cpp backend build with "no member named 'params_from_json_cmpl'
in 'server_task'".
Pull server-schema.cpp into the translation unit and call the new function,
keeping both guarded by __has_include so forks that predate the split (e.g.
llama-cpp-turboquant, which still exposes params_from_json_cmpl) keep
compiling against the old static member.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): add word-level timestamp support
Add word-level timestamp extraction to the crispasr backend by calling
the CrispASR C library's word accessor functions that are already
exported by libgocrispasr but were not previously bound by the Go
wrapper.
Two families of word functions are supported:
1. Session-based (get_word_count/text/t0/t1) — works per-segment for
whisper-like backends.
2. Parakeet-specific (get_parakeet_word_count/text/t0/t1) — returns a
global word list for TDT/CTC/RNNT parakeet models where the session
API does not expose per-segment word data.
The Go code tries session-based first and falls back to parakeet-specific
when the session word count is zero.
Depends on #10402 (grpc server Words forwarding) for the words to reach
the HTTP response.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
* fix(crispasr): use portable sed -i.bak for macOS compatibility
BSD sed requires -i '' for in-place editing, while GNU sed accepts a bare -i.
Use -i.bak instead, which works with both, then remove the backup file.
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
---------
Signed-off-by: fqscfqj <fqscfqj@outlook.com>
Vulkan backends bundled their own loader and ICD manifests but neither the
Mesa driver the manifests point at nor a way to make the loader find them,
so on a runtime base image without Mesa the loader enumerated zero devices
and the GPU silently fell back to CPU (only NVIDIA worked, since its ICD is
injected by the container toolkit).
- scripts/build/package-gpu-libs.sh: for each installed ICD manifest, bundle
the driver .so its library_path names — no hard-coded, platform-dependent
soname list — plus that driver's ldd dependencies, skipping manifests whose
driver isn't installed. Rewrite each library_path to a bare soname so the
bundled driver resolves via the LD_LIBRARY_PATH run.sh already sets.
- .docker/install-base-deps.sh, backend/Dockerfile.golang,
backend/Dockerfile.python: install mesa-vulkan-drivers in every Vulkan
builder so the driver + manifests exist to be packaged (the LunarG SDK
ships only the loader and shader tooling).
- pkg/model/process.go: when a backend ships vulkan/icd.d/, point the loader
at it via VK_DRIVER_FILES/VK_ICD_FILENAMES at launch (no-op otherwise).
Covered by pkg/model/process_vulkan_test.go.
- backend/go/parakeet-cpp/package.sh: complete the L0 stub (was missing the
libc-family ldd walk + GPU-lib packaging) by mirroring whisper, so the
vulkan-parakeet image actually bundles its GPU runtime.
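The library_path rewrite can be illustrated with a short Python sketch (the packaging script itself is shell and is not reproduced here). The function name is hypothetical; the manifest layout follows the standard Vulkan ICD manifest format, with the driver path under ICD.library_path.

```python
import json
import os


def rewrite_icd_manifest(manifest_text: str) -> str:
    """Rewrite ICD.library_path to a bare soname.

    A bare soname lets the dynamic linker resolve the bundled driver through
    the library search path instead of an absolute build-host path.
    """
    manifest = json.loads(manifest_text)
    icd = manifest.get("ICD", {})
    path = icd.get("library_path")
    if path:
        icd["library_path"] = os.path.basename(path)
    return json.dumps(manifest, indent=4)
```

All other manifest fields, such as api_version, pass through unchanged.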
Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Richard Palethorpe <io@richiejp.com>