LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-02 20:37:03 -04:00

Author	SHA1	Message	Date
Richard Palethorpe	0245b33eab	feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801) * feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model: single engine handles VAD, transcription, LLM, and TTS in one bidirectional stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline. Backend - backend/python/liquid-audio/ — new Python gRPC backend wrapping the `liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets, Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch on `liquid_audio.utils.snapshot_download` so absolute local paths from LocalAI's gallery resolve without a HF round-trip. soundfile in place of torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle). - backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream (config/frame/control oneof in; typed event+pcm+meta out). - core/services/nodes/{health_mock,inflight}_test.go — add stubs for the new RPC to the test fakes. Config + capabilities - core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudioToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row. - core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups membership in both speech-input and audio-output groups so a lone flag still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases branch. Realtime endpoint - core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig() so the gate is unit-testable; accept realtime_audio models and self-fill empty pipeline slots with the model's own name (user-pinned slots win). 
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil cfg, empty pipeline, legacy pipeline, self-contained realtime_audio, user-pinned VAD slot, and partial legacy pipeline. UI + endpoints - core/http/routes/ui.go — /api/pipeline-models accepts either a legacy VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a self_contained flag so the Talk page can collapse the four cards. - core/http/routes/ui_api.go — realtime_audio in usecaseFilters. - core/http/routes/ui_pipeline_models_test.go — covers both code paths. - core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of the four-slot grid; rename Edit Pipeline → Edit Model Config; less pipeline-specific wording. - core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new realtime_audio filter button + i18n. - core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO. - core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset fields, surfaced when backend === liquid-audio, plumbed via extra_options on submit/export/import. Gallery + importer - gallery/liquid-audio.yaml — config template with known_usecases: [realtime_audio, chat, tts, transcript, vad]. - gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by mode option. Fixed pre-existing `transcribe` typo on the asr entry (loader silently dropped the unknown string → entry never surfaced as a transcript model). - gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format `<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching common_chat_params_init_lfm2 in vendored llama.cpp. - core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice preferences plumbed through to options. - core/gallery/importers/importers.go — register LiquidAudioImporter before LlamaCPPImporter. 
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument regex pair on the LFM2 pythonic format. Build matrix - .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13, l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12 is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's 3.12 floor). - backend/index.yaml — anchor + 13 image entries. - Makefile — .NOTPARALLEL, prepare-test-extra, test-extra, docker-build-liquid-audio. Docs - .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands) remain. - .agents/api-endpoints-and-auth.md — expand the capability-surface checklist with every place a new FLAG_* needs to be registered. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): function calling + history cap for any-to-any models Three pieces, all on the realtime_audio path that just landed: 1. liquid-audio backend (backend/python/liquid-audio/backend.py): - _build_chat_state grows a `tools_prelude` arg. - new _render_tools_prelude parses request.Tools (the OpenAI Chat Completions function array realtime.go already serialises) and emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn ahead of the user history. Mirrors gallery/lfm.yaml's `function:` template so the model sees the same prompt shape whether served via llama-cpp or here. Without this the backend silently dropped tools — function calling was wired end-to-end on the Go side but the model never saw a tool list. 2. Realtime history cap (core/http/endpoints/openai/realtime.go): - Session grows MaxHistoryItems int; default picked by new defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5 1.5B degrades quickly past a handful of turns), 0/unlimited for legacy pipelines composing larger LLMs. 
- triggerResponse runs conv.Items through trimRealtimeItems before building conversationHistory. Helper walks the cut left if it would orphan a function_call_output, so tool result + call pairs stay intact. - realtime_gate_test.go: specs for defaultMaxHistoryItems and trimRealtimeItems (zero cap, under cap, over cap, tool-call pair preservation). 3. Talk page (core/http/react-ui/src/pages/Talk.jsx): - Reuses the chat page's MCP plumbing — useMCPClient hook, ClientMCPDropdown component, same auto-connect/disconnect effect pattern. No bespoke tool registry, no new REST endpoints; tools come from whichever MCP servers the user toggles on, exactly as on the chat page. - sendSessionUpdate now passes session.tools=getToolsForLLM(); the update re-fires when the active server set changes mid-session. - New response.function_call_arguments.done handler executes via the hook's executeTool (which round-trips through the MCP client SDK), then replies with conversation.item.create {type:function_call_output} + response.create so the model completes its turn with the tool output. Mirrors chat's client-side agentic loop, translated to the realtime wire shape. UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes react-ui/dist into the runtime image). Backend.py changes can be swapped live in /backends/<id>/backend.py + /backend/shutdown. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page Mirrors the chat-page metadata.localai_assistant flow so users can ask the realtime model what's loaded / installed / configured. Tools are run server-side via the same in-process MCP holder that powers the chat modality — no transport switch, no proxy, no new wire protocol. 
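A minimal Go sketch of the history trim described above for trimRealtimeItems. The Item type, its field names, and the function name trimItems are hypothetical stand-ins, not the types in realtime.go; only the behaviour (a zero cap means unlimited, and the cut walks left rather than orphaning a function_call_output) is taken from the commit message.

```go
package main

import "fmt"

// Item is a hypothetical stand-in for a realtime conversation item.
type Item struct {
	Type string // "message", "function_call" or "function_call_output"
	Text string
}

// trimItems keeps roughly the last max items; max <= 0 means unlimited.
// If the cut would land on a function_call_output, it walks left so the
// preceding function_call stays paired with its result.
func trimItems(items []Item, max int) []Item {
	if max <= 0 || len(items) <= max {
		return items
	}
	cut := len(items) - max
	for cut > 0 && items[cut].Type == "function_call_output" {
		cut--
	}
	return items[cut:]
}

func main() {
	history := []Item{
		{"message", "hi"},
		{"function_call", "get_weather"},
		{"function_call_output", "sunny"},
		{"message", "It is sunny."},
	}
	// A cap of 2 would start on the tool result, so the call is kept too.
	for _, it := range trimItems(history, 2) {
		fmt.Println(it.Type, it.Text)
	}
}
```

Preserving a pair can leave the result slightly above the cap; this sketch accepts that rather than dropping a tool result or its call.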
Wire: - core/http/endpoints/openai/realtime.go: - RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin helper mirrors chat.go's requireAssistantAccess (no-op when auth disabled, else requires auth.RoleAdmin). - Session grows AssistantExecutor mcpTools.ToolExecutor. - runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin, fail closed if DisableLocalAIAssistant or the holder has no tools, DiscoverTools and inject into session.Tools, prepend holder.SystemPrompt() to instructions. - Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip the function_call_arguments client emit (the client can't execute these — it doesn't know about them). After the loop, if any assistant tool ran, trigger another response so the model speaks the result. Mirrors chat's agentic loop, driven server-side rather than via client round-trip. - core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest gains `localai_assistant` (JSON omitempty). Handshake calls isCurrentUserAdmin and builds RealtimeSessionOptions. - core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode" checkbox under the Tools dropdown; passes localai_assistant: true to realtimeApi.call's body, captured in the connect callback's deps. Mirroring chat's pattern means the in-process MCP tools surface "just works" for the Talk page without exposing a Streamable-HTTP MCP endpoint (which was the alternative). Clients with their own MCP servers can still use the existing ClientMCPDropdown path in parallel; the realtime handler distinguishes them by AssistantExecutor.IsTool() at dispatch time. 
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): render Manage Mode tool calls in the Talk transcript Previously the realtime endpoint only emitted response.output_item.added for the FunctionCall item, and Talk.jsx's switch ignored the event — so server-side tool runs were invisible in the UI. The model would speak the result but the user had no way to see what tool was actually called. realtime.go: after executing an assistant tool inproc, emit a second output_item.added/.done pair for the FunctionCallOutput item. Mirrors the way the chat page displays tool_call + tool_result blocks. Talk.jsx: handle both response.output_item.added and .done. Render FunctionCall (with arguments) and FunctionCallOutput (pretty-printed JSON when possible) as two transcript entries — `tool_call` with the wrench icon, `tool_result` with the clipboard icon, both in mono-space secondary-colour. Resets streamingRef after the result so the next assistant text delta starts a fresh transcript entry instead of appending to the previous turn. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools Fallout from a review pass on the Manage Mode patches: - Bound the server-side agentic loop. triggerResponse used to recurse on executedAssistantTool with no cap — a model that kept calling tools would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors useChat.js's maxToolTurns). Public triggerResponse is now a thin shim over triggerResponseAtTurn(toolTurn int); recursion increments the counter and stops at the cap with an xlog.Warn. - Preserve Manage Mode tools across client session.update. The handler used to blindly overwrite session.Tools, so toggling a client MCP server mid-session silently wiped the in-process admin tools. 
Session now caches the original AssistantTools slice at session creation and the session.update handler merges them back in (client names win on collision — the client is explicit). - strconv.ParseBool for the localai_assistant query param instead of hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata. - Talk.jsx: render both tool_call and tool_result on response.output_item.done instead of splitting them across .added and .done. The server's event pairing (added → done) stays correct; the UI just doesn't need to inspect both phases of the same item. One switch case instead of two, no behavioural change. Out of scope (noted for follow-ups): extract a shared assistant-tools helper between chat.go and realtime.go (duplication is small enough that two parallel implementations stay readable for now), and an i18n key for the Manage Mode helper text (Talk.jsx doesn't use i18n anywhere else yet). Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(test-extra): wire liquid-audio backend smoke test The backend ships test.py + a `make test` target and is listed in backend-matrix.yml, so scripts/changed-backends.js already writes a `liquid-audio=true|false` output when files under backend/python/liquid-audio/ change. The workflow just wasn't reading it. - Expose the `liquid-audio` output on the detect-changes job - Add a tests-liquid-audio job that runs `make` + `make test` in backend/python/liquid-audio, gated on the per-backend detect flag The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode short-circuits before any HuggingFace download (backend.py:192), so the job needs neither weights nor a GPU. The full-inference path remains gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set. 
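The session.update merge described above (cached assistant tools merged back in, client names winning on collision) could look roughly like the following Go sketch. The Tool type, the mergeTools name, and the tool names in main are hypothetical; the real handler works on the session's own tool types.

```go
package main

import "fmt"

// Tool is a hypothetical stand-in for a session tool definition.
type Tool struct {
	Name   string
	Source string // "client" or "assistant", for illustration only
}

// mergeTools returns the client tools followed by any cached assistant
// tools whose names the client did not define. On a name collision the
// client's definition wins.
func mergeTools(client, assistant []Tool) []Tool {
	seen := make(map[string]bool, len(client))
	merged := make([]Tool, 0, len(client)+len(assistant))
	for _, t := range client {
		seen[t.Name] = true
		merged = append(merged, t)
	}
	for _, t := range assistant {
		if !seen[t.Name] {
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	// Tool names below are made up for the example.
	client := []Tool{{"search", "client"}, {"list_models", "client"}}
	assistant := []Tool{{"list_models", "assistant"}, {"install_model", "assistant"}}
	for _, t := range mergeTools(client, assistant) {
		fmt.Println(t.Name, t.Source)
	}
}
```

Because the assistant slice is cached at session creation and never mutated here, repeated session.update calls can re-run the merge without losing the admin tools.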
The four new Go test files (core/gallery/importers/liquid-audio_test.go, core/http/endpoints/openai/realtime_gate_test.go, core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go) are already picked up by the existing test.yml workflow via `make test` → `ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-13 21:57:27 +02:00
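The turn cap from the refactor(realtime) entry in the commit above can be illustrated with a small Go sketch. The constant value 10 and the triggerResponse / triggerResponseAtTurn names come from the commit message; the step callback, the return value, and the exact comparison at the cap are assumptions made for this example.

```go
package main

import "fmt"

// maxAssistantToolTurns uses the value quoted in the commit message.
const maxAssistantToolTurns = 10

// step is a hypothetical model turn: it reports whether the model
// executed an assistant tool and therefore needs another response.
type step func(turn int) bool

// triggerResponse is the thin public shim that starts at turn 0.
func triggerResponse(s step) int { return triggerResponseAtTurn(s, 0) }

// triggerResponseAtTurn recurses while tools keep running, but stops
// once the counter reaches the cap. It returns the number of tool turns.
func triggerResponseAtTurn(s step, toolTurn int) int {
	if toolTurn >= maxAssistantToolTurns {
		fmt.Println("warn: assistant tool-turn cap reached")
		return toolTurn
	}
	if s(toolTurn) {
		return triggerResponseAtTurn(s, toolTurn+1)
	}
	return toolTurn
}

func main() {
	// A model that never stops calling tools is cut off at the cap.
	alwaysCalls := func(int) bool { return true }
	fmt.Println("tool turns:", triggerResponse(alwaysCalls))
}
```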
LocalAI [bot]	d892e4af80	feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758) * test(e2e-backends): allow BACKEND_BINARY for native-built backends Adds an escape hatch for hardware-gated backends (e.g. ds4) where the model is too large for Docker build context. When BACKEND_BINARY points at a run.sh produced by 'make -C backend/cpp/<name> package', the suite skips docker image extraction and drives the binary directly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(e2e-backends): validate BACKEND_BINARY basename + log actual source Two follow-ups from the `cbcf5148` code review: - BACKEND_BINARY now requires a path whose basename is `run.sh`. Without this check, `filepath.Dir(binary)` silently discarded the filename, so pointing the env var at an arbitrary binary failed later with a confusing assertion that named a path the user never typed. - The "Testing image=..." debug line printed an empty string when the binary path was used, hiding the actual source in CI logs. The line now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect as `src=...`. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): scaffold ds4 backend dir Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the implementation arrive in follow-up commits. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add backend Makefile Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then invokes CMake on our wrapper. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add CMakeLists for grpc-server Generates protoc stubs from backend.proto, links grpc-server.cpp + dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): grpc-server skeleton + module stubs The minimum that links: Backend service with Health + Free; other RPCs default to UNIMPLEMENTED. Stub headers/sources for dsml_parser, dsml_renderer, and kv_cache are in place so CMake links cleanly even before those modules ship. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement LoadModel Opens engine + creates session sized to ContextSize (default 32768). Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else CUDA. MTP/speculative options are accepted via ModelOptions.Options[] (mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into g_kv_cache_dir for the cache module (Task 19 wires it in). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement TokenizeString Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement Predict (plain text) Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement PredictStream (plain text) ChatDelta + reasoning/tool_calls split arrives in Task 14. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement Status RPC Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add DSML streaming parser Classifies raw model-emitted token text into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the literal DSML strings rendered by ds4_server.c's prompt template (<｜DSML｜tool_calls>, <｜DSML｜invoke name=...>, <think>, etc.) - these are plain text the model emits, not special tokens. Partial markers split across token chunks are buffered until a full marker or a definitively-not-a-marker '<' is observed. RandomToolId() generates the API-side tool call id (call_xxx) that exact-replay would key on. 
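The buffering rule described above for the DSML streaming parser (hold a partial marker until it either completes or can no longer be a marker) can be sketched in Go. This is a simplified illustration, not the C++ DsmlParser: the Scanner and Event types are hypothetical, only three of the markers quoted in the commit message are listed, and there is no Flush or tool-call state.

```go
package main

import (
	"fmt"
	"strings"
)

// Markers quoted in the commit message; the real parser watches more.
var markers = []string{"<think>", "</think>", "<｜DSML｜tool_calls>"}

// Event is a hypothetical output record: Kind is "CONTENT" or "MARKER".
type Event struct{ Kind, Text string }

// Scanner buffers text that might still turn into a marker.
type Scanner struct {
	buf string
	Out []Event
}

// isMarkerPrefix reports whether s could still grow into a marker.
func isMarkerPrefix(s string) bool {
	for _, m := range markers {
		if strings.HasPrefix(m, s) {
			return true
		}
	}
	return false
}

// Feed consumes one chunk and emits every event that is already certain.
func (s *Scanner) Feed(chunk string) {
	s.buf += chunk
scan:
	for len(s.buf) > 0 {
		i := strings.IndexByte(s.buf, '<')
		if i < 0 { // no '<' at all: everything is plain content
			s.Out = append(s.Out, Event{"CONTENT", s.buf})
			s.buf = ""
			return
		}
		if i > 0 { // emit the text before the '<'
			s.Out = append(s.Out, Event{"CONTENT", s.buf[:i]})
			s.buf = s.buf[i:]
		}
		for _, m := range markers {
			if strings.HasPrefix(s.buf, m) { // a full marker arrived
				s.Out = append(s.Out, Event{"MARKER", m})
				s.buf = s.buf[len(m):]
				continue scan
			}
		}
		if isMarkerPrefix(s.buf) {
			return // partial marker: wait for the next chunk
		}
		// Definitely not a marker: the '<' is ordinary content.
		s.Out = append(s.Out, Event{"CONTENT", "<"})
		s.buf = s.buf[1:]
	}
}

func main() {
	var s Scanner
	for _, chunk := range []string{"a<thi", "nk>b", "1<2"} {
		s.Feed(chunk)
	}
	for _, e := range s.Out {
		fmt.Printf("%s %q\n", e.Kind, e.Text)
	}
}
```

A production parser also needs a flush step for whatever remains buffered when the stream ends, which is the area the later dsml_parser Flush fix in this commit addresses.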
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape producing byte 0xCD, eating the 'D'. The markers were never actually matching the DSML text the model emits. Split each escape with adjacent string literal concatenation so the byte sequence is exactly EF BD 9C 44 (｜D) at runtime. Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively expose std::strlen / std::snprintf via <string>). The local plan file (uncommitted) was also updated with the same fixes so Task 16's dsml_renderer.cpp does not re-introduce the bug. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta) Non-streaming Predict now emits one ChatDelta carrying content, reasoning_content, and tool_calls[] parsed from the model's DSML output. Reply.message still carries the raw model bytes for backends that prefer the regex fallback path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire DsmlParser into PredictStream Per-token ChatDelta writes: content/reasoning_content go incrementally, tool_calls emit TOOL_START as one delta (id + name) followed by TOOL_ARGS deltas with incremental JSON. The Go-side aggregator (pkg/functions/chat_deltas.go) reassembles them. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): chat template + reasoning_effort mapping UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append / assistant_prefix. PredictOptions.Metadata['enable_thinking'] and ['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default; 'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE). Tool-call rendering for assistant turns with tool_calls JSON arrives in the next commit (dsml_renderer). 
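The reasoning_effort mapping described above can be written out as a small Go sketch. The ThinkMode constants mirror the DS4_THINK_* names from the commit message; treating metadata as a map of strings, and the literal value "false" as the disabled case, are assumptions made for this example.

```go
package main

import "fmt"

// ThinkMode mirrors the three ds4_think_mode values named in the commit.
type ThinkMode int

const (
	ThinkNone ThinkMode = iota // DS4_THINK_NONE
	ThinkHigh                  // DS4_THINK_HIGH (default)
	ThinkMax                   // DS4_THINK_MAX
)

// thinkMode maps request metadata to a think mode. Assumption: metadata
// values are strings and enable_thinking == "false" disables thinking.
func thinkMode(metadata map[string]string) ThinkMode {
	if metadata["enable_thinking"] == "false" {
		return ThinkNone
	}
	switch metadata["reasoning_effort"] {
	case "max", "xhigh":
		return ThinkMax
	default:
		return ThinkHigh
	}
}

func main() {
	fmt.Println(thinkMode(nil) == ThinkHigh)
	fmt.Println(thinkMode(map[string]string{"reasoning_effort": "xhigh"}) == ThinkMax)
	fmt.Println(thinkMode(map[string]string{"enable_thinking": "false"}) == ThinkNone)
}
```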
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML Closes the round-trip: when an OpenAI client sends a multi-turn chat where prior turns contain tool_calls or role=tool messages, build_prompt serializes them back to the DSML shape the model was trained on. Mirrors ds4_server.c's prompt renderer; uses nlohmann::json for parsing the OpenAI tool_calls payload. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): disk KV cache module Dir-based cache keyed by SHA1(rendered prompt prefix). File format: 'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes + ds4_session_save_payload output. NOT bit-compatible with ds4-server's KVC files - that interop is a follow-up plan. LoadLongestPrefix walks the dir picking the longest stored prefix that prefixes the incoming prompt. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for the request, tries LoadLongestPrefix to recover state, then Saves the new state after generation. ds4_session_sync handles the live-cache fast path internally, so the disk cache only matters for cold-starts and cross-session reuse. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add package.sh Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into package/lib so the FROM scratch image boots without a host libc. Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp ds4.h defines 'typedef enum {...} ds4_backend' which collides with our C++ 'namespace ds4_backend' anywhere a TU includes both. 
kv_cache.h includes ds4.h directly and surfaces the conflict immediately; other TUs would hit it once gRPC dev headers are available. Renames the C++ namespace to ds4cpp across all wrapper files and the plan, leaving the upstream ds4 typedef untouched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): add Dockerfile.ds4 Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu) -> FROM scratch with packaged grpc-server + bundled runtime libs. nlohmann-json3-dev is required for dsml_renderer's JSON handling. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4 in docker-build-backends + .NOTPARALLEL guards. Also adds the backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh (landed in Task 24). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch) Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13. Darwin Metal is handled outside this matrix by backend_build_darwin.yml. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/index): add ds4 meta + image entries cpu + cuda13 x latest + master. Darwin Metal builds publish under ds4-darwin via the existing llama-cpp-darwin OCI pipeline. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(scripts/build): add ds4-darwin.sh Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh: make grpc-server -> otool -L for dylib bundling -> OCI tar that 'local-ai backends install' consumes via the backends/ds4-darwin Makefile target. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(darwin): build ds4-darwin in backend_build_darwin Adds a 'Build ds4 backend (Darwin Metal)' step that runs the backends/ds4-darwin Makefile target on the macOS runner. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(import): auto-detect ds4 weights via DS4Importer Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf repo URI and the DeepSeek-V4-Flash-.gguf filename pattern. Registered before LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling through to llama-cpp. Also lists ds4 in /backends/known so the /import-model UI surfaces it as a manual choice for users who want to force the backend on a non-canonical URI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(gallery): add deepseek-v4-flash-q2 (ds4 backend) One-click install of the q2 weights with backend: ds4. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(.agents): add ds4-backend.md Documents the backend shape, DSML state machine, thinking-mode mapping, disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY hardware-validation path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to 2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported into the environment so it can pick the right cuda-keyring / cudss / nvpl debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern. 
Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13' failed at: /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/index): add Metal image entries for ds4 Adds metal-ds4 + metal-ds4-development image entries pointing at quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4 (built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the 'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and ds4-development variant. Closes a gap from the initial Task 23 landing - the Darwin Metal build script and CI workflow step were already wired (Tasks 24-25), but the gallery had no image entry for users to install the Metal variant. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04' which clashes with install-base-deps.sh's cuda-keyring step: E: Conflicting values set for option Signed-By regarding source https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/ The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain 'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA from scratch via its own keyring setup. Adopting that here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): drop install-base-deps.sh dependency The .docker/install-base-deps.sh pipeline is built around the llama-cpp needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at /opt/grpc. For ds4 we don't need any of that: - CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda ready to go; install-base-deps's keyring step then conflicts with the pre-installed Signed-By. - gRPC: ds4's grpc-server.cpp only links against grpc++; system libgrpc++-dev (apt) is sufficient, no source build needed. 
Replaced the install-base-deps invocation in Dockerfile.ds4 with a direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries back to nvidia/cuda base + skip-drivers=true so install-base-deps would no-op even if some downstream tooling calls it. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus Two compile bugs caught by the docker build: 1. proto::Message uses snake_case accessors. The build_prompt loop called m.toolcalls() / m.toolcallid() - the protoc-generated names are m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the wrapper. 2. The Status RPC method shadowed the 'using grpc::Status' alias, so any later method declaration using Status as a return type failed to parse ('Status does not name a type' starting at LoadModel). Solution: alias grpc::Status as GStatus instead, with no 'using' clause that would conflict. All RPC method declarations and return-statement constructions now use GStatus. Pre-existing code reviewer flagged the Status-shadow concern as 'minor' in the original Task 10 commit; it turned out to be a real compile blocker under libstdc++ 13 once the surrounding methods were filled in. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush When the model emitted a parameter value that arrived in the same buffer as the surrounding tool_call markers (e.g. the buffered tail after a literal '</think>' opened the model output), the parser deferred all buffered bytes to Flush() because looks_like_prefix() always returns true while buf starts with '<'. Flush() then drained the buffer as plain CONTENT/REASONING regardless of parser state, so the bytes between the parameter open and close markers were classified as CONTENT instead of TOOL_ARGS. 
Symptom: the model emitted <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter> and the assembled tool_call arguments came out as {"location":""} - the opener and closer were emitted into the args stream but the "Paris, France" content went to the assistant message instead. Fix: 1. Flush() now uses the same state-aware emit logic as DrainPlain: PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string), THINK bytes become REASONING, TEXT bytes become CONTENT, and INVOKE / TOOL_CALLS structural whitespace is discarded. 2. looks_like_prefix() restricts its leading-'<' fallback to buffers that have not yet seen a '>'. Without that change, char-by-char feeds would discard the '<' of '<|DSML|invoke name="..."' once the marker prefix length was reached but the closing quote/'>' were still in flight. Verified with a standalone harness that runs the failing input three ways (single Feed, split-after-'>', and char-by-char) and aggregates TOOL_ARGS for tool index 0: all three now produce {"location":"Paris, France"}. Assisted-by: Claude:opus-4.7 [Read,Edit,Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence ds4_engine_generate_argmax() is a self-contained helper that doesn't take or update a ds4_session - it manages its own internal state. Our Predict and PredictStream methods created g_session via ds4_session_create() but then called ds4_engine_generate_argmax(), so g_session's KV state never advanced. ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save correctly rejected with 'session has no valid checkpoint to save'. Switch both RPCs to the proper session API: ds4_session_sync(g_session, &prompt, ...); loop: int token = ds4_session_argmax(g_session); if token == eos: break; emit(token); ds4_session_eval(g_session, token, ...) 
After the loop the session has a real checkpoint and ds4_session_save_payload writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three .kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets kv_cache_dir, and the e2e tool-call assertion still passes. Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save path + payload_bytes + result) so future failures are visible instead of silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream when the cache is enabled - and skipped entirely when the option is unset. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded Wires MTP (Multi-Token Prediction) speculative decoding into the manual generation loop in both Predict and PredictStream. When the upstream MTP weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal, ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per verifier step. When MTP is not loaded (no option, CPU backend, or weights absent), we fall through to the simple ds4_session_argmax + ds4_session_eval path with no behavior change. Validated on a DGX Spark GB10 with the optional MTP GGUF (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs 'ds4: MTP support model loaded ... (draft=2)' on stderr. Caveat per upstream README: 'currently provides at most a slight speedup, not a meaningful generation-speed win'. Wired now mainly to track the upstream API; bigger speedups arrive when ds4 improves the speculative path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI gRPC side. 
The generation loop now consults compute_sample_params() per token to pick the effective (temperature, top_k, top_p, min_p), based on: 1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp 2. Thinking-mode override: when enable_thinking != false, force T=1.0, top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and the trailing content) 3. DSML structural override: when DsmlParser::IsInDsmlStructural() returns true (we are between tool-call markers but NOT in a param value payload), force T=0.0 so protocol bytes parse cleanly When the effective temperature is 0, we keep using ds4_session_argmax + MTP speculative path (matches ds4-server's gate that only enables MTP for greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with a per-thread RNG seeded from system_clock and fall back to single-token ds4_session_eval. New public method on DsmlParser: IsInDsmlStructural() encodes which states need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user sampling); TEXT and THINK are excluded (no tool-call context to protect). Verified on the DGX Spark GB10: the e2e suite still passes with all 5 specs including tools, and the Predict output now varies between runs (creative sampling active) while the tool-call args remain a clean '{"location":"Paris, France"}' because the parser-state check forces greedy on the structural bytes. UX note: thinking mode is ON by default (matching ds4-server). Users who want deterministic output should set Metadata.enable_thinking = false. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add sha256 to deepseek-v4-flash-q2 entry Per HF LFS metadata for antirez/deepseek-v4-gguf: size: 86720111200 bytes (~80.76 GiB) sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c LocalAI's downloader verifies sha256 when present, so users who install deepseek-v4-flash-q2 from the gallery get integrity-checked weights and the partial-download issue (an 81 GB file is easy to truncate) becomes recoverable instead of silently producing a broken backend. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-11 22:15:47 +02:00
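The sampling precedence described in the ds4 commit above (request defaults, then the thinking-mode override, then the DSML structural override) can be sketched as follows. This is a minimal Python illustration, not the backend's C++ code: `SampleParams`, the argument names, and the choice to leave top_k/top_p/min_p untouched in the structural case are assumptions; only the forced values come from the commit message.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SampleParams:
    temperature: float
    top_k: int
    top_p: float
    min_p: float


# Values the commit message says are forced while thinking mode is on.
THINKING_PARAMS = SampleParams(temperature=1.0, top_k=0, top_p=1.0, min_p=0.0)


def compute_sample_params(request, enable_thinking=True, in_dsml_structural=False):
    """Pick the effective sampling parameters for the next token.

    Hypothetical signature: the real helper reads these inputs from
    PredictOptions and DsmlParser::IsInDsmlStructural().
    """
    # 1. Request defaults, 2. replaced wholesale when thinking mode is on.
    params = THINKING_PARAMS if enable_thinking else request
    # 3. The structural override wins last: protocol bytes are decoded greedily.
    if in_dsml_structural:
        params = replace(params, temperature=0.0)
    return params


def use_greedy_path(params):
    """Temperature 0 keeps the argmax (and MTP speculative) path."""
    return params.temperature == 0.0
```

With thinking mode left at its default, a request temperature of 0.7 is ignored; passing `enable_thinking=False` restores the request's own values, and the structural override applies in both cases.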
LocalAI [bot]	19d59102d5	feat(whisper-cpp): implement streaming transcription (#9751 ) * test(whisper): wire e2e streaming transcription target Adds test-extra-backend-whisper-transcription, mirroring the existing llama-cpp / sherpa-onnx / vibevoice-cpp targets. The generic AudioTranscriptionStream spec at tests/e2e-backends/backend_test.go:644 fails today because backend/go/whisper has no streaming impl - this target is the failing TDD gate that the next phase makes pass. Confirmed RED locally: 3 Passed (health, load, offline transcription), 1 Failed (streaming spec hits its 300s context deadline because the base implementation returns 'unimplemented' but doesn't close the result channel, leaving the gRPC stream open until the client times out). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper-cpp): expose new_segment_callback to the Go side Adds set_new_segment_callback() and a C-side trampoline that whisper.cpp invokes once per new text segment during whisper_full(). The trampoline dispatches (idx_first, n_new, user_data) to a Go function pointer registered via purego.NewCallback - text and timings are pulled by Go through the existing get_segment_text/get_segment_t0/get_segment_t1 getters. Wires the hook only when streaming is actually requested, to avoid a per-segment function-pointer dispatch on the offline path. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper-cpp): implement AudioTranscriptionStream Wires whisper.cpp's new_segment_callback through purego back to Go so the streaming transcription RPC produces real, time-correlated deltas while whisper_full() is still decoding. Each segment becomes one TranscriptStreamResponse{Delta}; whisper_full's return is the TranscriptStreamResponse{FinalResult} carrying the full segment list, language, and duration. 
Per-call state is tracked in a sync.Map keyed by an atomic counter; the Go callback registered via purego.NewCallback is a singleton, dispatched through user_data. SingleThread today means only one entry is ever live, but the map shape matches the sherpa-onnx TTS callback pattern. The streaming path's final.Text is the literal concat of every emitted delta (a strings.Builder accumulated by onNewSegment) so the e2e invariant `final.Text == concat(deltas)` holds exactly. The first delta has no leading space; subsequent deltas are space-prefixed. The offline AudioTranscription path is unchanged. Closes the gap with sherpa-onnx, vibevoice-cpp, llama-cpp, and tinygrad, which already implement AudioTranscriptionStream. Verified GREEN locally: make test-extra-backend-whisper-transcription passes 4/4 specs (3 Passed initially under RED, +1 streaming spec now). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(whisper-cpp): assert progressive multi-segment streaming Drives AudioTranscriptionStream against a real long-audio fixture and asserts len(deltas) >= 2. The generic e2e spec at tests/e2e-backends/backend_test.go:644 only checks len(deltas) >= 1 which is satisfied by both real and faked streaming - this spec is the guardrail that a future "fake" impl can't sneak past. Skipped by default (env-gated, like the cancellation spec); set WHISPER_LIBRARY, WHISPER_MODEL_PATH, and WHISPER_AUDIO_PATH to a 30+ second clip to run. Verified locally with a 55s 5x-JFK concat against ggml-base.en.bin: 1 Passed in 7.3s, deltas >= 2, finalSegmentCount >= 2, concat(deltas) == final.Text. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(whisper-cpp): add transcription gRPC e2e job Mirrors tests-sherpa-onnx-grpc-transcription / tests-llama-cpp-grpc-transcription. 
Runs make test-extra-backend-whisper-transcription whenever the whisper backend or the run-all switch fires, so a pin-bump or refactor that breaks streaming transcription gets caught before merge. The whisper output on detect-changes is already emitted by scripts/changed-backends.js (it iterates allBackendPaths); this PR just exposes it as a workflow output and consumes it. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(whisper-cpp): silence errcheck on AudioTranscriptionStream defers golangci-lint runs with new-from-merge-base=origin/master, so the identical defer patterns in the existing offline AudioTranscription path are grandfathered while the new ones in AudioTranscriptionStream trip errcheck. Wrap both defers in `func() { _ = ... }()` to match what errcheck wants without altering behavior. The errors from os.RemoveAll and *os.File.Close are not actionable inside a defer here (we're already returning), matching the offline path's contract. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-10 23:11:46 +02:00
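The streaming invariant from the whisper-cpp commits above (`final.Text == concat(deltas)`, first delta without a leading space, later deltas space-prefixed) can be sketched like this. It is an illustrative Python model of the accumulation logic only; the class and method names are hypothetical, and stripping each segment before prefixing is an assumption rather than something the commit states.

```python
class DeltaAccumulator:
    """Collects per-segment deltas and the text they concatenate to."""

    def __init__(self):
        self.deltas = []

    def on_new_segment(self, segment_text):
        # Assumption: segments are trimmed before the spacing rule is applied.
        text = segment_text.strip()
        # First delta has no leading space; every later one is space-prefixed.
        delta = text if not self.deltas else " " + text
        self.deltas.append(delta)
        return delta

    def final_text(self):
        # The final result is the literal concatenation of emitted deltas,
        # so the invariant holds by construction.
        return "".join(self.deltas)
```

Building the final text from the emitted deltas, rather than from a separate pass over the segments, is what makes the equality exact instead of approximately equal.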
Ettore Di Giacinto	3bc5ae8da6	fix(tests/e2e-backends): bump ctx_size for llama-cpp transcription Qwen3-ASR-0.6B encodes the jfk.wav fixture into 777 audio tokens via its mmproj, but the test harness defaulted BACKEND_TEST_CTX_SIZE to 512, so llama.cpp server rejected every transcription request with "request (777 tokens) exceeds the available context size (512 tokens)". Set BACKEND_TEST_CTX_SIZE=2048 on the llama-cpp transcription target only — sherpa-onnx and vibevoice transcription targets don't go through llama.cpp's slot/n_ctx and weren't failing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code]	2026-05-07 22:31:08 +00:00
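The failure in the commit above comes down to a size comparison: 777 audio tokens against a 512-token context. A minimal sketch of that check, assuming a strict "greater than" comparison (the real check lives in llama.cpp's server and may differ in detail; the function name is hypothetical):

```python
def check_request_fits(n_tokens, ctx_size):
    """Raise if the request cannot fit in the context window."""
    if n_tokens > ctx_size:
        raise ValueError(
            f"request ({n_tokens} tokens) exceeds the available "
            f"context size ({ctx_size} tokens)"
        )
    return ctx_size - n_tokens  # tokens left over for generation
```

With the old default of 512 the 777-token fixture is rejected; with `BACKEND_TEST_CTX_SIZE=2048` it fits with room to spare.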
Richard Palethorpe	8e43842175	feat(vllm, distributed): tensor parallel distributed workers (#9612 ) * feat(vllm): build vllm from source for Intel XPU Upstream publishes no XPU wheels for vllm. The Intel profile was silently picking up a non-XPU wheel that imported but errored at engine init, and several runtime deps (pillow, charset-normalizer, chardet) were missing on Intel -- backend.py crashed at import time before the gRPC server came up. Switch the Intel profile to upstream's documented from-source procedure (docs/getting_started/installation/gpu.xpu.inc.md in vllm-project/vllm): - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a cp312 wheel. - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees the dpcpp/sycl compiler from the oneapi-basekit base image. - Hide requirements-intel-after.txt during installRequirements (it used to 'pip install vllm'); install vllm's deps from a fresh git clone of vllm via 'uv pip install -r requirements/xpu.txt', swap stock triton for triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .'. - requirements-intel.txt trimmed to LocalAI's direct deps (accelerate / transformers / bitsandbytes); torch-xpu, vllm, vllm_xpu_kernels and the rest come from upstream's xpu.txt during the source build. - requirements.txt: add pillow + charset-normalizer + chardet -- used by backend.py and missing on the Intel install profile. - run.sh: 'set -x' so backend startup is visible in container logs (the gRPC startup error path was previously opaque). Also adds a one-line docs example for engine_args.attention_backend under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770) need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels. Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct via LocalAI's /v1/chat/completions. 
Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): add multi-node data-parallel follower worker vLLM v1's multi-node story is one process per node sharing a DP coordinator over ZMQ -- the head runs the API server with data_parallel_size > 1 and followers run `vllm serve --headless ...` with matching topology. Today LocalAI can already configure DP on the head via the engine_args YAML map, but there's no way to bring up the follower nodes -- so the head sits waiting for ranks that never handshake. Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural precedent (operator-launched, static config, no NATS placement). The worker: - Optionally self-registers with the frontend as an agent-type node tagged `node.role=vllm-follower` so it's visible in the admin UI and operators can scope ordinary models away via inverse selectors. - Resolves the platform-specific vllm backend via the gallery's "vllm" meta-entry (cuda, intel-vllm, rocm-vllm, ...). - Runs vLLM as a child process so the heartbeat goroutine survives until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its ZMQ sockets before we tear down. - Validates --headless + --start-rank 0 is rejected (rank 0 is the head and must serve the API). Backend run.sh dispatches `serve` as the first arg to vllm's own CLI instead of LocalAI's backend.py gRPC server -- the follower speaks ZMQ directly to the head, there is no LocalAI gRPC on the follower side. Single-node usage is unchanged. Generalises the gallery resolution helper into findBackendPath() shared by MLX and vLLM workers; extracts ParseNodeLabels for the comma-separated label parsing both use. Ships with two compose recipes (`docker-compose.vllm-multinode.yaml` for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. 
Both vendors are supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is not -- PyTorch's process group requires every rank to use the same collective backend, and NCCL/xccl/gloo don't interoperate. Out of scope (deferred): SmartRouter-driven placement of follower ranks via NATS backend.install events, follower log streaming through /api/backend-logs, tensor-parallel across nodes, disaggregated prefill via KVTransferConfig. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> test(vllm): CPU-only end-to-end test for multi-node DP Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite that brings up a head + headless follower from the locally-built local-ai:tests image, bind-mounts the cpu-vllm backend extracted by make extract-backend-vllm so it's seen as a system backend (no gallery fetch, no registry server), and asserts a chat completion across both DP ranks. New `make test-e2e-vllm-multinode` target wires the docker build, backend extract, and ginkgo run together; BuildKit caches both images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode") so the existing distributed suite isn't pulled along. Two pre-existing bugs surfaced by the test: 1. extract-backend-% (Makefile) failed for every backend, because all backend images end with `FROM scratch` and `docker create` rejects an image with no CMD/ENTRYPOINT. Fixed by passing --entrypoint=/run.sh -- the container is never started, only docker-cp'd, so the path doesn't have to exist; we just need anything that satisfies the daemon's create-time validation. 2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no longer resolves once the backend is relocated to BackendsPath. 
_makeVenvPortable's shebang rewriter only matches paths that already point at ${EDIR}, so the original shebang slips through unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script as an argument -- Python ignores the script's shebang in that case. The test fixture caps memory aggressively (max_model_len=512, VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for cpu-vllm: torch._inductor's CPU-ISA probe runs even with enforce_eager=True and needs g++ on PATH, which the LocalAI runtime image doesn't ship -- to be addressed in a follow-up that bundles a toolchain in the cpu-vllm backend. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up model for the compilation"`) shells out to `g++` at vllm engine startup, regardless of `enforce_eager=True` -- the eager flag only disables CUDA graphs, not inductor's first-batch warmup. The LocalAI CPU runtime image (Dockerfile, unconditional apt list) does not ship build-essential, and the cpu-vllm backend image is `FROM scratch`, so any non-trivial inference on cpu-vllm crashes with: torch._inductor.exc.InductorError: InvalidCxxCompiler: No working C++ compiler found in torch._inductor.config.cpp.cxx: (None, 'g++') Bundling the toolchain in the CPU runtime image would bloat every non-vllm-CPU deployment and force a single GCC version on backends that may want clang or a different version. So this lives in the backend, gated to BUILD_TYPE=='' (the CPU profile). `package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6 (runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/ libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. 
The unversioned binaries on Debian/Ubuntu are symlink chains pointing into multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`, the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves both the version and the arch-triplet variant. Symlinks /lib -> usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain root because Ubuntu's UsrMerge keeps them at /, and ld scripts (`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot re-roots into the toolchain. The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper shell scripts that resolve their own location at runtime and pass `--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/` to the underlying versioned binary. That's how torch's bare `g++ foo.cpp -o foo` invocation finds cc1plus (-B), system headers (--sysroot), and the bundled libstdc++ (--sysroot, --sysroot is recursive into linker). `run.sh` adds the toolchain bin dir to PATH and the toolchain's shared-lib dir to LD_LIBRARY_PATH -- everything else (header search, linker search, executable search) is encapsulated in the wrappers. No-op for non-CPU builds, the dir doesn't exist there. The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm is already a niche profile (few users compared to GPU vllm) and the alternative is a backend that crashes at first inference unless the operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables all torch.compile optimizations. Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the smoke now exercises the real compile path through the bundled toolchain. Test runtime is +20s for the warmup compile, still <90s end to end. 
Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io, which hosts the L4T-specific torch / vllm / flash-attn wheels but also transparently proxies the rest of PyPI through `/+f/<sha>/<filename>` URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match` uv would pick those proxy URLs for ordinary PyPI packages — anthropic/openai/propcache/annotated-types — and fail when the proxy 503s. Master is hitting the same bug on its own l4t-vllm matrix entry. Switch the l4t13 install path to a pyproject.toml that marks the jetson-ai-lab index `explicit = true` and pins only torch, torchvision, torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't consult the L4T mirror for anything else, so transitive deps fall back to PyPI as the default index — no exposure to the proxy 503s. `uv pip install -r requirements.txt` ignores [tool.uv.sources], so the l4t13 branch in install.sh now invokes `uv pip install --requirement pyproject.toml` directly, replacing the old requirements-l4t13*.txt files. Other BUILD_PROFILEs continue using libbackend.sh's installRequirements and never read pyproject.toml. Local resolution test (x86_64, dry-run) confirms uv hits the L4T index for torch and falls through to PyPI for everything else. Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-06 00:22:50 +02:00
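The shebang fix in the vLLM commits above relies on the fact that an interpreter invoked explicitly never resolves the script's `#!` line. The sketch below demonstrates this with a throwaway script whose shebang points at a path that does not exist; the file names and the helper are hypothetical and unrelated to the actual run.sh.

```python
import os
import subprocess
import sys
import tempfile


def run_via_interpreter(script_path, *args):
    """Run a script by passing it to the interpreter as an argument.

    The kernel never looks at the script's shebang in this case, so a stale
    build-time path in the first line is harmless.
    """
    result = subprocess.run(
        [sys.executable, script_path, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "entrypoint")
        with open(script, "w") as f:
            # Deliberately broken shebang, like a venv relocated after build.
            f.write("#!/nonexistent/build/venv/bin/python3\n")
            f.write("print('ok')\n")
        return run_via_interpreter(script)
```

Executing the same file directly would fail with "bad interpreter", which is the failure mode the commit describes once the backend is relocated.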
Richard Palethorpe	bb033b16a9	feat: add LocalVQE backend and audio transformations UI (#9640 ) feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI Introduce a generic "audio transform" capability for any audio-in / audio-out operation (echo cancellation, noise suppression, dereverberation, voice conversion, etc.) and ship LocalVQE as the first backend implementation. Backend protocol: - Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and bidirectional AudioTransformStream for low-latency frame-by-frame use. This is the first bidi stream in the proto; per-frame unary at LocalVQE's 16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server, embed,interface,base} with paired-channel ergonomics. LocalVQE backend (backend/go/localvqe/): - Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream shared lib + its libggml-cpu-.so runtime variants directly — no MODULE wrapper needed because LocalVQE handles CPU feature selection internally via GGML_BACKEND_DL. - Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it LocalVQE runs single-threaded at ~1× realtime instead of the documented ~9.6×. - Reference-length policy: zero-pad short refs, truncate long ones (the trailing portion can't have leaked into a mic that wasn't recording). - Ginkgo test suite (9 always-on specs + 2 model-gated). HTTP layer: - POST /audio/transformations (alias /audio/transform): multipart batch endpoint, accepts audio + optional reference + params[]=v form fields. Persists inputs alongside the output in GeneratedContentDir/audio so the React UI history can replay past (audio, reference, output) triples. - GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames (interleaved stereo mic+ref in, mono out). JSON session.update envelope for config; constants hoisted in core/schema/audio_transform.go. 
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing utils.AudioToWav (with passthrough fast-path), so the user can upload any format / rate without seeing the model's strict 16 kHz constraint. - BackendTraceAudioTransform integration so /api/backend-traces and the Traces UI light up with audio_snippet base64 and timing. - Routes registered under routes/localai.go (LocalAI extension; OpenAI has no /audio/transformations endpoint), traced via TraceMiddleware. Auth + capability + importer: - FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on, in APIFeatures), three RouteFeatureRegistry rows. - localvqe added to knownPrefOnlyBackends with modality "audio-transform". - Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on huggingface.co/LocalAI-io/LocalVQE). React UI: - New /app/transform page surfaced via a dedicated "Enhance" sidebar section (sibling of Tools / Biometrics) — the page is enhancement, not generation, so it lives outside Studio. Two AudioInput components (Upload + Record tabs, drag-drop, mic capture). - Echo-test button: records mic while playing the loaded reference through the speakers — the mic naturally picks up speaker bleed, giving a real (mic, ref) pair for AEC testing without leaving the UI. - Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls) and useAudioPeaks hook (shared module-scoped AudioContext to avoid hitting browser context limits with three players on one page); migrated TTS, Sound, Traces audio blocks to use it. - Past runs saved in localStorage via useMediaHistory('audio-transform') — the history entry stores all three URLs so clicking re-renders the full triple, not just the output. Build + e2e: - 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm, SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those two and let GPU-class hardware route through Vulkan in the gallery capabilities map. 
- tests-localvqe-grpc-transform job in test-extra.yml (gated on detect-changes.outputs.localvqe). - New audio_transform capability + 4 specs in tests/e2e-backends. - Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js (8 specs covering tabs, file upload, multipart shape, history, errors). Docs: - New docs/content/features/audio-transform.md covering the (audio, reference) mental model, batch + WebSocket wire formats, LocalVQE param keys, and a YAML config example. Cross-links from text-to-audio and audio-to-text feature pages. Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-04 22:07:11 +02:00
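The reference-length policy from the LocalVQE commit above (zero-pad short references, truncate long ones) and the frame arithmetic implied by a 16 ms hop at 16 kHz can be sketched as follows. This is an illustration in Python with hypothetical names; padding at the end of the reference is an assumption, since the commit only says short references are zero-padded.

```python
SAMPLE_RATE_HZ = 16_000
HOP_MS = 16
# 16 kHz * 16 ms = 256 samples per mono frame.
SAMPLES_PER_HOP = SAMPLE_RATE_HZ * HOP_MS // 1000


def fit_reference(ref, n_samples):
    """Make the reference signal exactly as long as the mic signal.

    Long references are truncated: the trailing part cannot have leaked
    into a mic that was no longer recording. Short ones are zero-padded.
    """
    if len(ref) >= n_samples:
        return list(ref[:n_samples])
    return list(ref) + [0] * (n_samples - len(ref))
```

For example, `fit_reference([1, 2, 3], 5)` pads to five samples and `fit_reference([1, 2, 3, 4], 2)` keeps only the first two.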
Ettore Di Giacinto	8edac61e57	feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650 ) * feat(ci): allow routing apt traffic through an alternate Ubuntu mirror Adds opt-in APT_MIRROR / APT_PORTS_MIRROR knobs to all Dockerfiles, the Makefile, and CI workflows so we can fail over to a non-canonical Ubuntu mirror when archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com are degraded (recently observed: multi-day DDoS against the default pool). Defaults are empty everywhere — behavior is unchanged unless a mirror is configured. To enable in CI, set the repo-level GitHub Actions variables APT_MIRROR (and APT_PORTS_MIRROR for arm64 builds). Locally: make docker APT_MIRROR=http://azure.archive.ubuntu.com A small POSIX-sh helper in .docker/apt-mirror.sh rewrites both DEB822 (/etc/apt/sources.list.d/ubuntu.sources, Ubuntu 24.04+) and the legacy /etc/apt/sources.list before the first apt-get update. Dockerfile stages load it via RUN --mount=type=bind, so there is no extra layer and no cache invalidation when the script is unchanged. Reusable workflows also rewrite the runner's own /etc/apt sources before any sudo apt-get call. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(apt-mirror): default to the Azure mirror, visible in the workflow source Bakes Azure (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com) in as the default for both Docker builds and runner-side apt — rather than hiding the URL behind a GitHub Actions repo variable that's not visible from the source tree. A new composite action at .github/actions/configure-apt-mirror is the single source of truth for runner-side rewrites. Five standalone workflows (build-test, release, tests-e2e, tests-ui-e2e, update_swagger) just `uses: ./.github/actions/configure-apt-mirror`. 
Three workflows (image_build, backend_build, checksum_checker) keep an inline bash rewrite, because they install/upgrade git via apt before the checkout step (so the local composite action isn't loadable yet). The Azure URL is visible in those files too. The `apt-mirror` / `apt-ports-mirror` inputs of the reusable workflows keep their now-Azure defaults — they still feed the Docker build-args block in addition to the inline runner-side rewrite. Callers (image.yml, image-pr.yml, backend.yml, backend_pr.yml) drop the previous `vars.APT_MIRROR` plumbing and rely on those defaults. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(apt-mirror): drop Force Install GIT, consolidate on the composite action The PPA git upgrade ran add-apt-repository ppa:git-core/ppa, which talks to api.launchpad.net — also part of Canonical's infrastructure and currently returning HTTP 504. The Azure mirror only covers archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com, not PPAs. The system git that ubuntu-latest already ships is sufficient for actions/checkout and the build pipeline, so just drop the upgrade. With that gone, the apt-before-checkout constraint disappears too — all three holdouts (image_build, backend_build, checksum_checker) can now switch to ./.github/actions/configure-apt-mirror like the other five. Net: 0 inline apt-mirror blocks, all 8 workflows route through the composite action. Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-03 23:50:13 +02:00
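The mirror rewrite described in the apt-mirror commits above amounts to a host substitution that is a no-op when no mirror is configured. The actual helper is a POSIX sh script; the Python sketch below only models the substitution logic, and the function name, the `http://` scheme of the default hosts, and the exact matching rule are assumptions.

```python
# Default Ubuntu hosts named in the commit message.
ARCHIVE_HOSTS = ("http://archive.ubuntu.com", "http://security.ubuntu.com")
PORTS_HOST = "http://ports.ubuntu.com"


def rewrite_sources(text, apt_mirror="", apt_ports_mirror=""):
    """Point apt sources at an alternate mirror.

    Works on the text of either a DEB822 ubuntu.sources file or a legacy
    sources.list, since both contain plain URLs. Empty settings leave the
    text unchanged.
    """
    if apt_mirror:
        for host in ARCHIVE_HOSTS:
            text = text.replace(host, apt_mirror.rstrip("/"))
    if apt_ports_mirror:
        text = text.replace(PORTS_HOST, apt_ports_mirror.rstrip("/"))
    return text
```

Because the mirror URL does not itself contain one of the default host strings, applying the rewrite twice yields the same result as applying it once.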
Russell Sim	18e039f305	fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626 ) * fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds `399c1dec` wired amdgpu-targets through the backend_build workflow_call interface, intending the input's default value to cover matrix entries that don't specify targets. However, GitHub Actions only applies a workflow_call input default when the caller omits the input entirely. When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}` and the matrix entry has no amdgpu-targets key, the expression evaluates to an empty string, which is treated as an explicit value, bypassing the default. The result is Docker receiving AMDGPU_TARGETS="" which in turn causes Make's ?= default to be skipped (since the variable is already set in the environment, even to empty), and cmake gets -DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an indeterminate target rather than the intended GPU list. Fix this at three levels: 1. backend.yml: use a \|\| fallback in the expression so that an undefined matrix.amdgpu-targets never reaches the reusable workflow as an empty string. The target list is the canonical default and lives here. 2. backend_build.yml: remove the now-misleading default value from the input declaration. The default never fired due to the above bug, so keeping it implied a guarantee that didn't exist. 3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard after the ?= assignment so that if AMDGPU_TARGETS is empty (whether from environment or any future CI wiring mistake) the build fails immediately with a clear message rather than silently producing a binary compiled for an unknown GPU target. 
Assisted-by: Claude Code:claude-sonnet-4-6 Signed-off-by: Russell Sim <rsl@simopolis.xyz> * fix(build): plumb AMDGPU_TARGETS through to Docker builds The docker-build-backend Makefile macro and Dockerfile.golang did not pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds always used the backend Makefile's hardcoded default GPU targets regardless of what was specified via environment or CI inputs. Signed-off-by: Russell Sim <rsl@simopolis.xyz> --------- Signed-off-by: Russell Sim <rsl@simopolis.xyz>	2026-05-02 15:53:14 +02:00
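The empty-string bypass in the AMDGPU_TARGETS commits above can be modelled in a few lines. The sketch below imitates, in Python, the three behaviours involved: the `||` fallback in a GitHub Actions expression, Make's `?=` (which assigns only when the variable is undefined, so an empty-but-set environment variable skips the default), and the fail-fast guard. All function names and the target strings are hypothetical placeholders, not the real target list.

```python
def workflow_input(matrix_entry, key, fallback):
    """Model of `${{ matrix.<key> || '<fallback>' }}`.

    A missing matrix key evaluates to an empty string, which `||` treats
    as falsy, so the fallback is used.
    """
    return matrix_entry.get(key, "") or fallback


def make_conditional_default(env, name, default):
    """Model of `NAME ?= default` in a Makefile.

    The default applies only when NAME is undefined. A variable set to the
    empty string in the environment counts as defined and keeps its value.
    """
    return env[name] if name in env else default


def require_non_empty(name, value):
    """Model of the $(error ...) guard: fail fast on an empty value."""
    if not value.strip():
        raise ValueError(f"{name} is empty; refusing to build")
    return value
```

Passing `{"AMDGPU_TARGETS": ""}` to `make_conditional_default` reproduces the bug: the empty value survives and the default is never applied, which is why the fallback has to happen on the caller side and the guard backs it up.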
Ettore Di Giacinto	fe6eb57082	feat(vibevoice-cpp): add purego TTS+ASR backend (#9610 ) * feat(vibevoice-cpp): add purego TTS+ASR backend Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new purego-based Go backend that serves both Backend.TTS and Backend.AudioTranscription from a single gRPC binary. Mirrors the qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix (cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the e2e-backends gRPC harness reuse existing infrastructure. - backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test - backend/index.yaml - &vibevoicecpp meta + 18 image entries - Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring, test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers - .github/workflows/backend.yml - matrix entries for all variants - .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs * feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries Refactor backend Load() to follow the standard Options[] convention used by sherpa-onnx and the rest of the multi-role backends: ModelFile is the primary gguf, supplementary paths come through opts.Options[] as key=value (or key:value for Make-target compat), resolved against opts.ModelPath. type=asr/tts decides the role of ModelFile when neither tts_model nor asr_model is set explicitly. Add gallery/index.yaml entries: - vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice - vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer Both pull from huggingface://mudler/vibevoice.cpp-models with sha256 verification. parameters.model + Options[] paths are siblings under {models_dir} per the qwen3-tts-cpp convention. Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon style, and tighten the per-backend Go closed-loop test to use the explicit Options API. 
* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive libvibevoice is a STATIC archive linked into the MODULE library. Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on MSVC), the linker garbage-collects symbols not referenced from this translation unit - which means dlopen+RegisterLibFunc panics with 'undefined symbol: vv_capi_load' at backend startup, since purego looks them up by name and our cpp/govibevoicecpp.cpp doesn't call them directly. * test(vibevoice-cpp): rewrite suite with Ginkgo v2 Match the convention used by backend/go/sherpa-onnx/backend_test.go. The suite now covers backend semantics that don't need purego (Locking, empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR). Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so `go test ./backend/go/vibevoice-cpp/` is green on a clean checkout and runs the heavyweight closed-loop spec when test.sh has staged the bundle. * fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream The gRPC server's stream handlers (pkg/grpc/server.go) spawn a goroutine that ranges over a chan; the only thing closing that chan is the backend's own Stream method. With the default Base stub returning 'unimplemented' and never touching the chan, the server goroutine hangs forever and the client hits DeadlineExceeded - which is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts matrix run. TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can start playback before the full PCM lands) followed by the PCM body in 64 KB slices. The header + >=2 PCM frames satisfy the harness's 'expected >=2 chunks' assertion and give a real progressive stream. 
AudioTranscriptionStream runs the offline transcription, emits each segment as a delta, and closes with a final_result whose Text equals the concatenated deltas (the harness asserts those match). Two new Ginkgo specs guard the close-channel-on-error path so the deadline-exceeded regression can't come back silently. fix(vibevoice-cpp): silence errcheck on cleanup paths Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure for defers that take args) - matches what the rest of the LocalAI backend/go/* tree already does for these callsites. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced: 1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left v.ttsModel empty, because the default-fill block only ran when BOTH slots were empty. vv_capi_load then got tts="" + a voice and the C side rejected it with rc=-3 'TTS model required to load a voice'. Fix: ModelFile fills the primary role-slot (decided by 'type=' in Options, defaulting to tts) independently of the secondary, so ModelFile + asr_model resolves to both. 2. resolvePath stat'd CWD before falling back to relTo. With LocalAI launched from a directory that happens to contain a same-named file, supplementary Options[] paths could leak away from the models dir. Drop the CWD probe entirely - relative paths now always join onto opts.ModelPath (the gallery convention). New Ginkgo coverage: * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr, explicit tts_model override, key:value variant. * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough, empty input, empty relTo, and the CWD-trap regression test. * 'Load resolves relative Options paths against opts.ModelPath' - end- to-end gallery layout round-trip. 
Verified locally: 19/19 specs pass (with model bundle, including the closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(vibevoice-cpp): use gallery convention in closed-loop spec The 'loads the realtime TTS model' / closed-loop specs were passing already-prefixed paths into Options[]: Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')] Combined with no ModelPath set on the request, the backend's modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then resolvePath joined the prefixed Options path on top of it - producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'. The fix is to mirror the gallery contract LocalAI core actually sends in production: ModelPath is the models root (absolute), ModelFile is a name under it, every Options[] path is relative to ModelPath. Uses filepath.Base() to get bare filenames. Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs) and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that broke CI). Both: 19/19 specs pass, ~55-60s. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner image, the docker build cache, and the test artifacts on a free ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription' was getting SIGTERM'd at 90 min before the model could finish loading. Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for: * the e2e harness Make target * the gallery 'vibevoice-cpp-asr' entry (parameters + files block) * the per-backend test.sh auto-download list Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from 90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs runway above the previous 90 min cap. 
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on disk) a single 30 s transcription saturates the per-test 30 min timeout in the e2e-backends harness on a 4-core ubuntu-latest, and the 10 GB download + Docker layer + working space leaves no headroom on the runner's free disk. Two attempts in CI got SIGTERM'd at the LoadModel boundary - the bottleneck isn't tunable from the workflow side without a paid-tier runner. The per-backend tests-vibevoice-cpp job already runs the same AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same gRPC contract, same model, single process - so the standalone tests-vibevoice-cpp-grpc-transcription job was redundant on top of the disk/CPU pressure. The Makefile target test-extra-backend-vibevoice-cpp-transcription stays for local invocation on workstations that can afford it - useful when developing the streaming codepaths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to the self-hosted 'bigger-runner' label that GPU image builds in backend.yml use, plus the documented Free-disk-space prep step (purge dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang entries in this file describe. That gives the 7B-param Q4_K ASR model the disk + CPU runway it needs. Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK decode plus 10 GB download has to fit comfortably. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e bigger-runner is a self-hosted bare runner without the standard ubuntu image's preinstalled build tools, so the previous job died at the very first command with 'make: command not found' (exit 127). 
Add the Dependencies step that the disabled vllm/sglang entries in this file already document - apt-get installs make + build-essential + curl + unzip + ca-certificates + git + tar before the make target runs. Mirrors how every other 'runs-on: bigger-runner' entry in backend.yml prepares the runner. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-29 22:22:14 +02:00
Richard Palethorpe	4443250756	chore: add golangci-lint with new-from-merge-base baseline (#9603 ) * chore: add golangci-lint with new-from-merge-base baseline Configure golangci-lint v2 with the standard linter set (errcheck, govet, ineffassign, unused) plus forbidigo, which enforces the Ginkgo/Gomega-only test convention from .agents/coding-style.md by rejecting stdlib testing calls (t.Errorf, t.Fatalf, t.Run, ...). staticcheck is disabled — the codebase has many pre-existing QF-style suggestions not worth gating on. issues.new-from-merge-base = master makes the lint job a gate for new issues only; the ~1300 pre-existing baseline stays visible via 'make lint-all' for incremental cleanup. CI runs 'make lint'. Backends needing C/C++ headers we don't install in the lint runner are excluded via a deny list in the Makefile (backend/go/{piper,silero-vad, llm}, cmd/launcher). Discovery still flows through 'go list ./...', so new packages are scanned automatically. To make backend/go/{sam3-cpp,stablediffusion-ggml,whisper} typecheckable, move their .cpp/.h sources into cpp/ subdirs (matching qwen3-tts-cpp / acestep-cpp). Without this 'go list' rejects the package because Go does not allow .cpp alongside .go without cgo. Fix two real bugs found by lint in tests/integration/ (run only via 'make test-stores', not default CI): a stale zerolog reference left over from the slog migration (`c37785b7`) and an unused 'os' import. Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(lint): generate proto sources and fetch full history The lint job was failing for two reasons: - pkg/grpc/proto/.go is generated, not checked in. Several packages import it, so without 'make protogen-go' typecheck fails project-wide with "no required module provides package github.com/mudler/LocalAI/ pkg/grpc/proto". 
- golangci-lint's new-from-merge-base needs to git-merge-base the PR against master, but actions/checkout's default shallow clone doesn't fetch master. fetch-depth: 0 brings full history; the config now references origin/master (the remote-tracking branch that survives the shallow checkout) instead of bare master (which doesn't exist locally after checkout). Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> ci(lint): stub react-ui/dist for go:embed glob core/http/app.go has //go:embed react-ui/dist/*. The glob must match at least one non-hidden entry or typecheck fails the whole core/http package. We don't need the real React bundle to lint Go code, so just touch an empty index.html to satisfy the embed. Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-28 22:07:44 +02:00
Ettore Di Giacinto	a0317d9926	refactor(tests): split app_test.go, move real-backend coverage to e2e-backends core/http/app_test.go had grown to 1495 lines exercising three concerns at once: HTTP-layer integration, real-backend inference (llama-gguf, tts, stablediffusion, transformers embeddings, whisper), and service logic that already has unit-level coverage. Each PR paid for 6 backend builds plus real-model downloads to satisfy a single suite. Reorg per layer: - app_test.go (1495 -> 1003 lines) drives the mock-backend binary only. Kept: auth, routing, gallery API, file:// import, /system, agent-jobs HTTP plumbing, config-file model loading. Deleted real-inference specs (llama-gguf chat, ggml completions/streaming, logprobs, logit_bias, transcription, embeddings, External-gRPC, Stores duplicate, Model gallery Context). Lifted Agent Jobs out of the deleted Stores Context. - tests/e2e-backends/backend_test.go gains logprobs, logit_bias, and no-first-token-dup specs (the latter folded into PredictStream). Two new caps gate them so non-LLM backends opt out. - tests/e2e-aio/e2e_test.go gains a streaming smoke under Context("text") to catch container-level streaming regressions. - tests/models_fixtures/ removed; all fixtures referenced testmodel.ggml. app_test.go now writes per-Context inline mock-model YAMLs. CI: - test.yml + tests-e2e.yml gain paths-ignore (docs/, examples/, *.md, backend/) so docs and backend-only PRs skip them. test.yml drops the 6-backend Build step plus TRANSFORMER_BACKEND/GO_TAGS=tts; tests-apple drops the llama-cpp-darwin build. - New tests-aio.yml runs the AIO container nightly + on workflow_dispatch + master/tags. The tests-e2e-container job moved out of test.yml so PRs no longer pay AIO cost. - New tests-llama-cpp-smoke job in test-extra.yml runs on every PR with no detect-changes gate; pulls quay.io/go-skynet/local-ai-backends: master-cpu-llama-cpp (no build on PR) and exercises predict/stream/ logprobs/logit_bias against Qwen3-0.6B. 
This is the PR-acceptance real-backend gate after AIO moved to nightly. The path-gated heavy test-extra-backend-llama-cpp wrapper appends the same caps so it exercises the moved specs when the backend actually changes. Makefile: - Deleted test-models/testmodel.ggml (the wget chain), test-llama-gguf, test-tts, test-stablediffusion, test-realtime-models. test target drops --label-filter, HUGGINGFACE_GRPC, TRANSFORMER_BACKEND, TEST_DIR, FIXTURES, CONFIG_FILE, MODELS_PATH, BACKENDS_PATH; depends on build-mock-backend. test-stores keeps a focused entry point and depends on backends/local-store. clean-tests also clears the mock-backend binary. Net per typical Go-side PR: ~25min (6 backend builds + tests + AIO) + ~8min e2e drops to ~5min mock-backend test + ~8min e2e + ~5-10min llama-cpp-smoke (image pulled). Docs and backend-only PRs skip the always-on workflows entirely. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]	2026-04-27 23:09:20 +00:00
Alex Brick	e5337039b0	[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543) * Use latest oneapi-basekit image for Intel images The current `localai/localai:master-gpu-intel` images don't work with the intel arc pro b70. Updating the base_image to 2025.3.2 fixes it. Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com> * Update github workflow base image --------- Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>	2026-04-24 18:29:10 +02:00
Richard Palethorpe	13734ae9fa	feat: Add Sherpa ONNX backend for ASR and TTS (#8523 ) feat(backend): Add Sherpa ONNX backend and Omnilingual ASR Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same approach as opus/stablediffusion-ggml/whisper — a thin C shim (csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego can't reach directly: nested struct config writes, result-struct field reads, and the streaming TTS callback trampoline. The Go side uses opaque uintptr handles and purego.NewCallback for the TTS callback. Supports: - VAD via sherpa-onnx's Silero VAD - Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC - Online/streaming ASR: zipformer transducer with endpoint detection (AudioTranscriptionStream emits delta events during decode) - Offline TTS: VITS (LJS, etc.) - Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel, prefixed by a streaming WAV header Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR), silero-vad-sherpa, vits-ljs-sherpa. E2E coverage: tests/e2e-backends for offline + streaming ASR, tests/e2e for the full realtime pipeline (VAD + STT + TTS). Assisted-by: claude-opus-4-7-1M [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-24 14:40:06 +02:00
Ettore Di Giacinto	4906cbad04	feat: add biometrics UI (#9524 ) * feat(react-ui): add Face & Voice Recognition pages Expose the face and voice biometrics endpoints (/v1/face/, /v1/voice/) through the React UI. Each page has four tabs driving the six endpoints per modality: Analyze (demographics with bounding boxes / waveform segments), Compare (verify with a match gauge and live threshold slider), Enrollment (register / identify / forget with a top-K matches view), Embedding (raw vector inspector with sparkline + copy). MediaInput supports file upload plus live capture: webcam snap-to-canvas for face, MediaRecorder -> AudioContext -> 16-bit PCM mono WAV transcode for voice (libsndfile on the backend only handles WAV/FLAC/OGG natively). Sidebar gets a new Biometrics section feature-gated on face_recognition / voice_recognition; routes are wrapped in <RequireFeature>. No new dependencies -- Font Awesome icons picked from the Free set. Assisted-by: Claude:Opus 4.7 * fix(localai): accept data URI prefixes with codec/charset params Browser MediaRecorder produces data URIs like data:audio/webm;codecs=opus;base64,... so the pre-';base64,' section can carry multiple parameter segments. The `^data:([^;]+);base64,` regex in pkg/utils/base64.go and core/http/endpoints/localai/audio.go only matched exactly one segment, so recordings straight from the React UI's live-capture tab failed the strip and then tripped the base64 decoder on the leading 'data:' literal, surfacing as "invalid audio base64: illegal base64 data at input byte 4" Widened both regexes to `^data:[^,]+?;base64,` so any number of ';param=value' segments between the mime type and ';base64,' are tolerated. Added a regression test covering the MediaRecorder shape. 
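The difference between the two data-URI patterns quoted above can be checked with a short sketch. Both regular expressions are taken from the commit message; the variable names and the `stripDataURIPrefix` helper are illustrative assumptions, not the functions in pkg/utils/base64.go.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// narrowPrefix is the original pattern: exactly one segment before ";base64,".
	narrowPrefix = regexp.MustCompile(`^data:([^;]+);base64,`)
	// widePrefix is the widened pattern: any number of ";param=value" segments.
	widePrefix = regexp.MustCompile(`^data:[^,]+?;base64,`)
)

// stripDataURIPrefix removes a leading data-URI header, if present, and
// returns the bare base64 payload. (Hypothetical helper name.)
func stripDataURIPrefix(s string) string {
	return widePrefix.ReplaceAllString(s, "")
}

func main() {
	recorded := "data:audio/webm;codecs=opus;base64,AAAA"
	fmt.Println(narrowPrefix.MatchString(recorded)) // narrow pattern misses the codec parameter
	fmt.Println(stripDataURIPrefix(recorded))
}
```

Because `[^,]` cannot cross the comma that ends the header, the widened pattern never consumes any of the payload.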
Assisted-by: Claude:Opus 4.7 * fix(insightface): scope pack ONNX loading to known manifests LocalAI's gallery extracts buffalo_* zips flat into the models directory, which inevitably mixes with ONNX files from other backends (opencv face engine, MiniFASNet antispoof, WeSpeaker voice embedding) and older buffalo pack installs. Feeding those foreign files into insightface's model_zoo.get_model() blows up inside the router -- it assumes a 4-D NCHW input and indexes `input_shape[2]` on tensors that aren't shaped like a face model, raising IndexError mid-load and leaving the backend unusable. The router's dispatch isn't amenable to per-file try/except alone (first-file-wins picks det_10g.onnx from buffalo_l even when the user asked for buffalo_sc -- alphabetical order happens to favour the wrong pack). Instead, ship an explicit manifest of the upstream v0.7 pack contents and scope the glob to that when the requested pack is known. The manifest is small and stable; future packs can be added alongside or fall through to the tolerance loop, which also swallows any remaining IndexError / ValueError from foreign files with a clear `[insightface] skipped` stderr line for diagnostics. Assisted-by: Claude:Opus 4.7 * fix(speaker-recognition): extract FBank features for rank-3 ONNX encoders Pre-exported speaker-encoder ONNX graphs come in two shapes: rank-2 [batch, samples] -- some 3D-Speaker exports, take raw waveform directly. rank-3 [batch, frames, n_mels] -- WeSpeaker and most Kaldi- lineage encoders, expect pre-computed Kaldi FBank. OnnxDirectEngine unconditionally fed `audio.reshape(1, -1)` -- correct for rank-2, IndexError-on-input_shape[3] on rank-3, which surfaced to the UI as "Invalid rank for input: feats Got: 2 Expected: 3" Detect the input rank at session init and run Kaldi FBank (80-dim, 25ms/10ms frames, dither=0.0, per-utterance CMN) before the forward pass when rank>=3. All knobs are configurable via backend options for encoders that deviate from defaults. 
torchaudio.compliance.kaldi is already in the backend's requirements (SpeechBrain pulls torchaudio in), so no new dependency. Assisted-by: Claude:Opus 4.7 * fix(biometrics): isolate face and voice vector stores Face (ArcFace, 512-D) and voice (ECAPA-TDNN 192-D / WeSpeaker 256-D) biometric embeddings were colliding inside a single in-memory local-store instance. Enrolling one after the other failed with "Try to add key with length N when existing length is M" because local-store correctly refuses to mix dimensions in one keyspace. The registries were constructed with `storeName=""`, which in StoreBackend() is just a WithModel() call. But ModelLoader's cache is keyed on `modelID`, not `model` -- so both registries collapsed to the same `modelID=""` slot and reused the same backend process despite looking isolated on paper. Three complementary fixes: 1. application.go -- give each registry a distinct default namespace ("localai-face-biometrics" / "localai-voice-biometrics"). The comment claimed isolation, now it's actually enforced. 2. stores.go -- pass the storeName as both WithModelID and WithModel so the ModelLoader cache key separates namespaces and the loader spawns distinct processes. 3. local-store/store.go -- drop the Load() `opts.Model != ""` guard. It was there to prevent generic model-loading loops from picking up local-store by accident, but that auto-load path is being retired; the guard now just blocks legitimate namespace isolation. opts.Model is treated as a tag; the per-tuple process isolation upstream handles discrimination. Assisted-by: Claude:Opus 4.7 * fix(gallery): stale-file cleanup and upgrade-tmp directory safety Two related robustness fixes for backend install/upgrade: pkg/downloader/uri.go OCI downloads passed through if filepath.Ext(filePath) != "" ... filePath = filepath.Dir(filePath) which was intended to redirect file-shaped download targets into their parent directory for OCI extraction. 
The heuristic misfires on directory-shaped paths with a dot-suffix -- gallery.UpgradeBackend uses tmpPath = "<backendsPath>/<name>.upgrade-tmp" and Go's filepath.Ext treats ".upgrade-tmp" as an extension. The rewrite landed the extraction at "<backendsPath>/", which then overwrote the real install (backends/<name>/) with a flat-layout file and left a stray run.sh at the top level. The tmp dir itself stayed empty, so the validation step that checked "<tmpPath>/run.sh" predictably failed with "upgrade validation failed: run.sh not found in new backend" Every manual upgrade silently corrupted the backends tree this way. Guard the rewrite behind "target isn't already an existing directory" -- InstallBackend / UpgradeBackend both pre-create the target as a directory, so they get the correct behaviour; existing file-path callers with a genuine dot-extension still get the parent redirect. core/gallery/backends.go InstallBackend's MkdirAll returned ENOTDIR when something at the target path was already a file (legacy dev builds dropped golang backend binaries directly at `<backendsPath>/<name>` instead of nesting them under their own subdir). That permanently blocked reinstall and upgrade for anyone carrying that state, since every retry hit the same error. Detect a pre-existing non-directory, warn, and remove it before the MkdirAll so the fresh install can write the correct nested layout with metadata.json + run.sh. Assisted-by: Claude:Opus 4.7 * fix(galleryop): refresh upgrade cache after backend ops UpgradeChecker caches the last upgrade-check result and only refreshes on the 6-hour tick or after an auto-upgrade cycle. Manual upgrades (POST /api/backends/upgrade/:name) go through the async galleryop worker, which completes the upgrade correctly but never tells UpgradeChecker to re-check -- so /api/backends/upgrades continued to list a just-upgraded backend as upgradeable, indistinguishable from a failed upgrade, for up to six hours. 
Add an optional `OnBackendOpCompleted func()` hook on GalleryService that fires after every successful install / upgrade / delete on the backend channel (async, so a slow callback doesn't stall the queue). startup.go wires it to UpgradeChecker.TriggerCheck after both services exist. Result: the upgrade banner clears within milliseconds of the worker finishing. Assisted-by: Claude:Opus 4.7 * build: prepend GOPATH/bin to PATH for protogen-go install-go-tools runs `go install` for protoc-gen-go and protoc-gen-go-grpc, which writes them into `go env GOPATH`/bin. That directory isn't on every dev's PATH, and protoc resolves its code-gen plugins via PATH, so the immediately-following protoc invocation fails with "protoc-gen-go: program not found" which in turn blocks `make build` and any `make backends/%` target that depends on build. Prepend `go env GOPATH`/bin to PATH for the protoc invocation so the freshly-installed plugins are found without requiring a shell-profile change. Assisted-by: Claude:Opus 4.7 * refactor(ui-api): non-blocking backend upgrade handler with opcache POST /api/backends/upgrade/:name used to send the ManagementOp directly onto the unbuffered BackendGalleryChannel, which blocked the HTTP request whenever the galleryop worker was busy with a prior operation. The op also didn't show up in /api/operations, so the Backends UI couldn't reflect upgrade progress on the affected row. Register the op in opcache immediately, wrap it in a cancellable context, store the cancellation function on the GalleryService, and push onto the channel from a goroutine so the handler returns right away. Response gains a `jobID` field and a `message` string so clients have a consistent handle regardless of whether the op is queued or running. Pairs with the OnBackendOpCompleted hook added in the galleryop commit — together the UI sees the upgrade start, watches progress via /api/operations, and drops the "upgradeable" flag the moment the worker finishes. 
Assisted-by: Claude:Opus 4.7	2026-04-24 08:50:34 +02:00
Ettore Di Giacinto	f5eb13d3c2	feat(insightface): add antispoofing (liveness) detection (#9515 ) * feat(insightface): add antispoofing (liveness) detection Light up the anti_spoofing flag that was parked during the first pass. Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 + MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is set. Failed liveness on either image vetoes FaceVerify regardless of embedding similarity. Every insightface* gallery entry now ships the MiniFASNet ONNX weights so existing packs light up after reinstall. Setting the flag against a model without the MiniFASNet files returns FAILED_PRECONDITION (HTTP 412) with a clear install message — no silent is_real=false. FaceVerifyResponse gained per-image img{1,2}_is_real and img{1,2}_antispoof_score (proto 9-12); FaceAnalysis's existing is_real/antispoof_score fields are now populated. Schema fields are pointers so they are fully absent from the JSON response when anti_spoofing was not requested — avoids collapsing "not checked" with "checked and fake" under Go's omitempty on bool. Validated end-to-end over HTTP against a local install: - verify + anti_spoofing, both real -> verified=true, score ~0.76 - verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false - analyze + anti_spoofing -> is_real and score per face - flag against model without MiniFASNet -> HTTP 412 fail-loud Assisted-by: Claude:claude-opus-4-7 go vet * test(insightface): wire test target into test-extra The root Makefile's `test-extra` already runs `$(MAKE) -C backend/python/insightface test`, but the backend's Makefile never defined the target — so the command silently errored and the suite was never executed in CI. Adding the two-line target (matching ace-step/Makefile) hooks `test.sh` → `runUnittests` → `python -m unittest test.py`, which discovers both the pre-existing engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the new AntispoofingTest. 
Each class skips gracefully when its weights can't be downloaded from a network-restricted runner. Assisted-by: Claude:claude-opus-4-7 * test(insightface): exercise antispoofing in e2e-backends (both paths) Add a `face_antispoof` capability to the Ginkgo e2e suite and extend the existing FaceVerify + FaceAnalyze specs with liveness assertions covering BOTH paths: real fixture -> is_real=true, score>0, verified stays true spoof fixture -> is_real=false, verified vetoed to false The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to classify as is_real=false with score ~0.013. That makes the assertion deterministic across CI runs; synthetic/derived spoofs fool the model unpredictably and would be flaky. Makefile wires it up end-to-end: - New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with pinned SHAs, matching the gallery entries. - insightface-antispoof-models target shared by both backend configs. - FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL. - Both e2e targets (buffalo-sc + opencv) now: * depend on insightface-antispoof-models * pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS * include face_antispoof in BACKEND_TEST_CAPS backend_test.go adds the new capability constant and a faceSpoofFile fixture resolved the same way as faceFile1/2/3. Spoof assertions are gated on both capFaceAntispoof AND faceSpoofFile being set, so a test config that omits the spoof fixture degrades gracefully to "real path only" instead of failing. Assisted-by: Claude:claude-opus-4-7 go vet	2026-04-23 18:28:15 +02:00
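The reason this commit gives for making the liveness schema fields pointers can be demonstrated with `encoding/json`. The struct and JSON field names below are assumptions modelled on the `img1_is_real` naming in the commit message, not the actual LocalAI schema types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatResponse uses a plain bool: omitempty drops it whenever it is false,
// so "not checked" and "checked and fake" serialise identically.
type flatResponse struct {
	Verified   bool `json:"verified"`
	Img1IsReal bool `json:"img1_is_real,omitempty"`
}

// pointerResponse uses *bool: nil means "not checked" and is omitted, while a
// pointer to false is still emitted.
type pointerResponse struct {
	Verified   bool  `json:"verified"`
	Img1IsReal *bool `json:"img1_is_real,omitempty"`
}

// mustJSON marshals v and panics on error (acceptable in a small sketch).
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	spoof := false
	fmt.Println(mustJSON(flatResponse{Img1IsReal: false}))     // field vanishes
	fmt.Println(mustJSON(pointerResponse{}))                   // not checked: field absent
	fmt.Println(mustJSON(pointerResponse{Img1IsReal: &spoof})) // checked and fake: field present
}
```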
Ettore Di Giacinto	181ebb6df4	feat: voice recognition (#9500 ) * feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend Audio analog to face recognition. Adds three gRPC RPCs (VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python backend scaffold under backend/python/speaker-recognition/ wrapping SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for WeSpeaker / 3D-Speaker ONNX exports. The kokoros Rust backend gets matching unimplemented trait stubs — tonic's async_trait has no defaults, so adding an RPC without Rust stubs breaks the build (same regression fixed by `eb01c772` for face). Swagger, /api/instructions, and the auth RouteFeatureRegistry / APIFeatures list are updated so the endpoints surface everywhere a client or admin UI looks. Assisted-by: Claude:claude-opus-4-7 * feat(voice-recognition): add 1:N identify + register/forget endpoints Mirrors the face-recognition register/identify/forget surface. New package core/services/voicerecognition/ carries a Registry interface and a local-store-backed implementation (same in-memory vector-store plumbing facerecognition uses, separate instance so the embedding spaces stay isolated). Handlers under /v1/voice/{register,identify,forget} reuse backend.VoiceEmbed to compute the probe vector, then delegate the nearest-neighbour search to the registry. Default cosine-distance threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%). As with the face registry, the current backing is in-memory only — a pgvector implementation is a future constructor-level swap. Assisted-by: Claude:claude-opus-4-7 * feat(voice-recognition): gallery, docs, CI and e2e coverage - backend/index.yaml: speaker-recognition backend entry + CPU and CUDA-12 image variants (plus matching development variants). - gallery/index.yaml: speechbrain-ecapa-tdnn (default) and wespeaker-resnet34 model entries. 
The WeSpeaker SHA-256 is a deliberate placeholder — the HF URI must be curl'd and its hash filled in before the entry installs. - docs/content/features/voice-recognition.md: API reference + quickstart, mirrors the face-recognition docs. - React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's precedent — no dedicated tab yet). - tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs. Helper resolveFaceFixture is reused as-is — the only thing face/voice share is "download a file into workDir", so no need for a new helper. - Makefile: docker-build-speaker-recognition + test-extra-backend- speaker-recognition-{ecapa,all} targets. Audio fixtures default to VCTK p225/p226 samples from HuggingFace. - CI: test-extra.yml grows a tests-speaker-recognition-grpc job mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image build entries — scripts/changed-backends.js auto-picks these up. Assisted-by: Claude:claude-opus-4-7 * feat(voice-recognition): wire a working /v1/voice/analyze head Adds AnalysisHead: a lazy-loading age / gender / emotion inference wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine. Defaults to two open-licence HuggingFace checkpoints: - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) — age regression + 3-way gender (female / male / child). - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion. Both are optional and degrade gracefully when transformers or the model can't be loaded — the engine raises NotImplementedError so the gRPC layer returns 501 instead of a generic 500. Emotion classes pass through from the model (neutral/happy/angry/sad on the default checkpoint); the e2e test now accepts any non-empty dominant gender so custom age_gender_model overrides don't fail it. Adds transformers to the backend's CPU and CUDA-12 requirements. 
Assisted-by: Claude:claude-opus-4-7 * fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256 Replaces the placeholder hash in gallery/index.yaml with the actual SHA-256 (7bb2f06e…) of the upstream Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai models install wespeaker-resnet34` now succeeds. Assisted-by: Claude:claude-opus-4-7 * fix(voice-recognition): soundfile loader + honest analyze default Two issues surfaced on first end-to-end smoke with the actual backend image: 1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package for audio decoding. Switch SpeechBrainEngine._load_waveform to the already-present soundfile (listed in requirements.txt) plus a numpy linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the codepath we never exercise (torchaudio's ffmpeg backend). 2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust- 24-ft-age-gender, but AutoModelForAudioClassification silently mangles that checkpoint — it reports the age head weights as UNEXPECTED and re-initialises the classifier head with random values, so the "gender" output is noise and there is no age output at all. Make age/gender opt-in instead (empty default; users wire a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via age_gender_model: option). Emotion keeps its working Superb default. Also broaden _infer_age_gender's tensor-shape handling and catch runtime exceptions so a dodgy age/gender head never takes down the whole analyze call. Docs and README updated to match the new policy. 
Verified with the branch-scoped gallery on localhost: - voice/embed → 192-d ECAPA-TDNN vector - voice/verify → same-clip dist≈6e-08 verified=true; cross-speaker dist 0.76–0.99 verified=false (as expected) - voice/register/identify/forget → round-trip works, 404 on unknown id - voice/analyze → emotion populated, age/gender omitted (opt-in) Assisted-by: Claude:claude-opus-4-7 * fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec Two issues surfaced after CI actually ran the speaker-recognition e2e target (I'd curl-tested against a running server but hadn't run the make target locally): 1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404 (the dataset is gated). Swap them for the speechbrain test samples served from github.com/speechbrain/speechbrain/raw/develop/ — public, no auth, correct 16kHz mono format. 2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming file1/file2 were same-speaker. The speechbrain samples are three different speakers (example1/2/5), and there is no easy un-gated source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech are all license- or size-gated for CI use). Replace the ceiling check with a relative-ordering assertion: d(pair) > d(same-clip) for both file2 and file3 — that's enough to prove the embeddings encode speaker info, and it works with any three non-identical clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not asserted. Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed, VoiceVerify) on the built backend image. 12 non-voice specs skipped as expected. Assisted-by: Claude:claude-opus-4-7 * fix(ci): checkout with submodules in the reusable backend_build workflow The kokoros Rust backend build fails with failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file because the reusable backend_build.yml workflow's actions/checkout step was missing `submodules: true`. 
Dockerfile.rust does `COPY . /LocalAI`, and without the submodule files the subsequent `cargo build` can't find the vendored Kokoros crate. The bug pre-dates this PR — scripts/changed-backends.js only triggers the kokoros image job when something under backend/rust/kokoros or the shared proto changes, so master had been coasting past it. The voice-recognition proto addition re-broke it. Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml (insightface, kokoros, speaker-recognition) already pass `submodules: true`; this brings the shared backend image builder in line. Assisted-by: Claude:claude-opus-4-7	2026-04-23 12:07:14 +02:00
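The relative-ordering assertion described in the VoiceVerify fixture fix above can be sketched as follows. This is a minimal Python illustration, not the actual spec in tests/e2e-backends: the `cosine_distance` helper, the choice of cosine distance as the metric, and the toy three-dimensional vectors are all assumptions made for the example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assert_relative_ordering(emb1, emb2, emb3):
    """Check d(pair) > d(same-clip) for both other clips.

    The ordering between d(1,2) and d(1,3) is returned for logging
    but deliberately not asserted, as in the commit message.
    """
    d_same = cosine_distance(emb1, emb1)
    d12 = cosine_distance(emb1, emb2)
    d13 = cosine_distance(emb1, emb3)
    assert d12 > d_same, "file2 should be farther than the same clip"
    assert d13 > d_same, "file3 should be farther than the same clip"
    return d_same, d12, d13


# Toy stand-ins for real speaker embeddings (hypothetical values).
clip1 = [1.0, 0.0, 0.0]
clip2 = [0.0, 1.0, 0.0]
clip3 = [1.0, 1.0, 0.0]
distances = assert_relative_ordering(clip1, clip2, clip3)
```

Because the check only compares against the same-clip distance, it holds for any three non-identical clips and needs no per-model threshold.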
Ettore Di Giacinto	20baec77ab	feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480 ) * feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis Adds face recognition as a new first-class capability in LocalAI via the `insightface` Python backend, with a pluggable two-engine design so non-commercial (insightface model packs) and commercial-safe (OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface. New gRPC RPCs (backend/backend.proto): * FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse * FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse Existing Embedding and Detect RPCs are reused (face image in PredictOptions.Images / DetectOptions.src) for face embedding and face detection respectively. New HTTP endpoints under /v1/face/: * verify — 1:1 image pair same-person decision * analyze — per-face age + gender (emotion/race reserved) * register — 1:N enrollment; stores embedding in vector store * identify — 1:N recognition; detect → embed → StoresFind * forget — remove a registered face by opaque ID Service layer (core/services/facerecognition/) introduces a `Registry` interface with one in-memory `storeRegistry` impl backed by LocalAI's existing local-store gRPC vector backend. HTTP handlers depend on the interface, not on StoresSet/StoresFind directly, so a persistent PostgreSQL/pgvector implementation can be slotted in via a single constructor change in core/application (TODO marker in the package doc). New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired into FLAG_DETECTION so /v1/detection works for face bounding boxes. 
Gallery (backend/index.yaml) ships three entries: * insightface-buffalo-l — SCRFD-10GF + ArcFace R50 + genderage (~326MB pre-baked; non-commercial research use only) * insightface-opencv — YuNet + SFace (~40MB pre-baked; Apache 2.0) * insightface-buffalo-s — SCRFD-500MF + MBF (runtime download; non-commercial) Python backend (backend/python/insightface/): * engines.py — FaceEngine protocol with InsightFaceEngine and OnnxDirectEngine; resolves model paths relative to the backend directory so the same gallery config works in docker-scratch and in the e2e-backends rootfs-extraction harness. * backend.py — gRPC servicer implementing Health, LoadModel, Status, Embedding, Detect, FaceVerify, FaceAnalyze. * install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the backend directory so first-run is offline-clean (the final scratch image only preserves files under /<backend>/). * test.py — parametrized unit tests over both engines. Tests: * Registry unit tests (go test -race ./core/services/facerecognition/...) — in-memory fake grpc.Backend, table-driven, covers register/ identify/forget/error paths + concurrent access. * tests/e2e-backends/backend_test.go extended with face caps (face_detect, face_embed, face_verify, face_analyze); relative ordering + configurable verifyCeiling per engine. * Makefile targets: test-extra-backend-insightface-buffalo-l, -opencv, and the -all aggregate. * CI: .github/workflows/test-extra.yml gains tests-insightface-grpc, auto-triggered by changes under backend/python/insightface/. Docs: * docs/content/features/face-recognition.md — feature page with license table, quickstart (defaults to the commercial-safe model), models matrix, API reference, 1:N workflow, storage caveats. * Cross-refs in object-detection.md, stores.md, embeddings.md, and whats-new.md. * Contributor README at backend/python/insightface/README.md. Verified end-to-end: * buffalo_l: 6/6 specs (health, load, face_detect, face_embed, face_verify, face_analyze). 
* opencv: 5/5 specs (same minus face_analyze — SFace has no demographic head; correctly skipped via BACKEND_TEST_CAPS). Assisted-by: Claude:claude-opus-4-7 * fix(face-recognition): move engine selection to model gallery, collapse backend entries The previous commit put engine/model_pack options on backend gallery entries (`backend/index.yaml`). That was wrong — `GalleryBackend` (core/gallery/backend_types.go:32) has no `options` field, so the YAML decoder silently dropped those keys and all three "different insightface-" backend entries resolved to the same container image with no distinguishing configuration. Correct split: `backend/index.yaml` now has ONE `insightface` backend entry shipping the CPU + CUDA 12 container images. The Python backend bundles both the non-commercial insightface model packs (buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo weights (YuNet + SFace); the active engine is selected at LoadModel time via `options: ["engine:..."]`. * `gallery/index.yaml` gains three model entries — `insightface-buffalo-l`, `insightface-opencv`, `insightface-buffalo-s` — each setting the appropriate `overrides.backend` + `overrides.options` so installing one actually gives the user the intended engine. This matches how `rfdetr-base` lives in the model gallery against the `rfdetr` backend. The earlier e2e tests passed despite this bug because the Makefile targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC, bypassing any gallery resolution entirely. No code changes needed. Assisted-by: Claude:claude-opus-4-7 * feat(face-recognition): cover all supported models in the gallery + drop weight baking Follows up on the model-gallery split: adds entries for every model configuration either engine actually supports, and switches weight delivery from image-baked to LocalAI's standard gallery mechanism. 
Gallery now has seven `insightface-` model entries (gallery/index.yaml): insightface (family) — non-commercial research use • buffalo-l (326MB) — SCRFD-10GF + ResNet50 + genderage, default • buffalo-m (313MB) — SCRFD-2.5GF + ResNet50 + genderage • buffalo-s (159MB) — SCRFD-500MF + MBF + genderage • buffalo-sc (16MB) — SCRFD-500MF + MBF, recognition only (no landmarks, no demographics — analyze returns empty attributes) • antelopev2 (407MB) — SCRFD-10GF + ResNet100@Glint360K + genderage OpenCV Zoo family — Apache 2.0 commercial-safe • opencv — YuNet + SFace fp32 (~40MB) • opencv-int8 — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU) Model weights are no longer baked into the backend image. The image now ships only the Python runtime + libraries (~275MB content size, ~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through LocalAI's gallery mechanism: OpenCV variants list `files:` with ONNX URIs + SHA-256, so `local-ai models install insightface-opencv` pulls them into the models directory exactly like any other gallery-managed model. * insightface packs (upstream distributes .zip archives only, not individual ONNX files) auto-download on first LoadModel via FaceAnalysis' built-in machinery, rooted at the LocalAI models directory so they live alongside everything else — same pattern `rfdetr` uses with `inference.get_model()`. Backend changes (backend/python/insightface/): * backend.py — LoadModel propagates `ModelOptions.ModelPath` (the LocalAI models directory) to engines via a `_model_dir` hint. This replaces the earlier ModelFile-dirname approach; ModelPath is the canonical "models directory" variable set by the Go loader (pkg/model/initializers.go:144) and is always populated. * engines.py::_resolve_model_path — picks up `model_dir` and searches it (plus basename-in-model-dir) before falling back to the dev script-dir. This is how OnnxDirectEngine finds gallery-downloaded YuNet/SFace files by filename only. 
* engines.py::_flatten_insightface_pack — new helper that works around an upstream packaging inconsistency: buffalo_l/s/sc zips expand flat, but buffalo_m and antelopev2 zips wrap their ONNX files in a redundant `<name>/` directory. insightface's own loader looks one level too shallow and fails. We call `ensure_available()` explicitly, flatten if nested, then hand to FaceAnalysis. * engines.py::InsightFaceEngine.prepare — root-resolution order now includes the `_model_dir` hint so packs download into the LocalAI models directory by default. * install.sh — no longer pre-downloads any weights. Everything is gallery-managed now. * smoke.py (new) — parametrized smoke test that iterates over every gallery configuration, simulating the LocalAI install flow (creates a models dir, fetches OpenCV files with checksum verification, lets insightface auto-download its packs), then runs detect + embed + verify (+ analyze where supported) through the in-process BackendServicer. * test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/` paths; downloads ONNX files to a temp dir at setUpClass time and passes ModelPath accordingly. Registry change (core/services/facerecognition/store_registry.go): * `dim=0` in NewStoreRegistry now means "accept whatever dimension arrives" — needed because the backend supports 512-d ArcFace/MBF and 128-d SFace via the same Registry. A non-zero dim still fails fast with ErrDimensionMismatch. * core/application plumbs `faceEmbeddingDim = 0`, explaining the rationale in the comment. Backend gallery description updated to reflect that the image carries no weights — it's just Python + engines. 
Smoke-tested all 7 configurations against the rebuilt image (with the flatten fix applied), exit 0: PASS: insightface-buffalo-l faces=6 dim=512 same-dist=0.000 PASS: insightface-buffalo-sc faces=6 dim=512 same-dist=0.000 PASS: insightface-buffalo-s faces=6 dim=512 same-dist=0.000 PASS: insightface-buffalo-m faces=6 dim=512 same-dist=0.000 PASS: insightface-antelopev2 faces=6 dim=512 same-dist=0.000 PASS: insightface-opencv faces=6 dim=128 same-dist=0.000 PASS: insightface-opencv-int8 faces=6 dim=128 same-dist=0.000 7/7 passed Assisted-by: Claude:claude-opus-4-7 * fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim CI regression from the previous commit: I moved OpenCV Zoo weight delivery to LocalAI's gallery `files:` mechanism, but the test-extra-backend-insightface-opencv target was still passing relative paths `detector_onnx:models/opencv/yunet.onnx` in BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over gRPC without going through the gallery, so those relative paths resolved to nothing and OpenCV's ONNXImporter failed: LoadModel failed: Failed to load face engine: OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx Fix: add an `insightface-opencv-models` prerequisite target that fetches the two ONNX files (YuNet + SFace) to a deterministic host cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256, and skips the download on re-runs. The opencv test target depends on it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend finds the files via its normal absolute-path resolution branch. Also refresh the buffalo_l comment: it no longer says "pre-baked" (nothing is — the pack auto-downloads from upstream's GitHub release on first LoadModel, same as in CI). Locally verified: `make test-extra-backend-insightface-opencv` passes 5/5 specs (health, load, face_detect, face_embed, face_verify). 
Assisted-by: Claude:claude-opus-4-7 * feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs The docs promised that /v1/embeddings returns face vectors when you send an image data-URI. That was never true: /v1/embeddings is OpenAI-compatible and text-only by contract — its handler goes through `core/backend/embeddings.go::ModelEmbedding`, which sets `predictOptions.Embeddings = s` (a string of TEXT to embed) and never populates `predictOptions.Images[]`. The Python backend's Embedding gRPC method does handle Images[] (that's how /v1/face/register reaches it internally via `backend.FaceEmbed`), but the HTTP embeddings endpoint wasn't wired to populate it. Rather than overload /v1/embeddings with image-vs-text detection — messy, and the endpoint is OpenAI-compatible by design — add a dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed` (already used internally by /v1/face/register and /v1/face/identify). Matches LocalAI's convention of a dedicated path per non-standard flow (/v1/rerank, /v1/detection, /v1/face/verify etc.). Response: { "embedding": [<dim> floats, L2-normed], "dim": int, // 512 for ArcFace R50 / MBF, 128 for SFace "model": "<name>" } Live-tested on the opencv engine: returns a 128-d L2-normalized vector (sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings is text-only and point image users at /v1/face/embed instead. Assisted-by: Claude:claude-opus-4-7 * fix(http): map malformed image input + gRPC status codes to proper 4xx Image-input failures on LocalAI's single-image endpoints (/v1/detection, /v1/face/{verify,analyze,embed,register,identify}) have historically returned 500 — even when the client was the one who sent garbage. Classic example: you POST an "image" that isn't a URL, isn't a data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim that's its fault. 
Two helpers land in core/http/endpoints/localai/images.go and every single-image handler is switched over: * decodeImageInput(s) Wraps utils.GetContentURIAsBase64 and turns any failure (invalid URL, not a data-URI, download error, etc.) into echo.NewHTTPError(400, "invalid image input: ..."). * mapBackendError(err) Inspects the gRPC status on a backend call error and maps: INVALID_ARGUMENT → 400 Bad Request NOT_FOUND → 404 Not Found FAILED_PRECONDITION → 412 Precondition Failed Unimplemented → 501 Not Implemented All other codes fall through unchanged (still 500). Before, my 1×1 PNG error-path test returned: HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images" After: HTTP 400 "failed to decode one or both images" Scope-limited to the LocalAI single-image endpoints. The multi-modal paths (middleware/request.go, openresponses/responses.go, openai/realtime.go) intentionally log-and-skip individual media parts when decoding fails — different design intent (graceful degradation of a multi-part message), not a 400-worthy failure. Left untouched. Live-verified: every error case in /tmp/face_errors.py now returns 4xx with a meaningful message; the "image with no face (1x1 PNG)" case specifically went from 500 → 400. Assisted-by: Claude:claude-opus-4-7 * refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis Follows up on the discovery that LocalAI's gallery `files:` mechanism handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy piper voices use exactly this pattern. Insightface packs are zip archives, so we can now deliver them the same way every other gallery-managed model gets delivered: declaratively, checksum-verified, through LocalAI's standard download+extract pipeline. Two changes: 1. Gallery (gallery/index.yaml) — every insightface-* entry gains a `files:` list with the pack zip's URI + SHA-256. 
`local-ai models install insightface-buffalo-l` now fetches the zip, verifies the hash, and extracts it into the models directory. No more reliance on insightface's library-internal `ensure_available()` auto-download or its hardcoded `BASE_REPO_URL`. 2. InsightFaceEngine (backend/python/insightface/engines.py): drops the FaceAnalysis wrapper and drives insightface's `model_zoo` directly. The ~50 lines FaceAnalysis provides (glob ONNX files, route each through `model_zoo.get_model()`, build a `{taskname: model}` dict, loop per-face at inference) are reimplemented in `InsightFaceEngine`. The actual inference classes (RetinaFace, ArcFaceONNX, Attribute, Landmark) are still insightface's; we only replicate the glue, so drift risk against upstream is minimal. Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx` layout that doesn't match what LocalAI's zip extraction produces. LocalAI unpacks archives flat into `<models_dir>`. Upstream packs are inconsistent: buffalo_l/s/sc ship ONNX at the zip root (lands at `<models_dir>/*.onnx`), while buffalo_m/antelopev2 wrap in a redundant `<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new `_locate_insightface_pack` helper searches both locations plus legacy paths and returns whichever has ONNX files. Replaces the earlier `_flatten_insightface_pack` helper (which tried to fight FaceAnalysis's layout expectations; now we just find the files wherever they are). Net effect for users: install once via LocalAI's managed flow, weights live alongside every other model, progress shows in the jobs endpoint, no first-load network call. Same API surface, cleaner plumbing. Assisted-by: Claude:claude-opus-4-7 * fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched The e2e suite drives LoadModel over gRPC without going through LocalAI's gallery flow, so the engine's `_model_dir` option (normally populated from ModelPath) is empty.
Previously the insightface target relied on FaceAnalysis auto-download to paper over this, but we dropped FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l target started failing at LoadModel with "no insightface pack found". Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip (same SHA as the gallery entry), extract it on the host, and pass `root:<dir>` so the engine locates the pack without needing ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI fast; it covers the same insightface engine code path as buffalo_l. Face analyze cap dropped since buffalo_sc has no age/gender head. Assisted-by: Claude:claude-opus-4-7[1m] * feat(face-recognition): surface face-recognition in advertised feature maps The six /v1/face/* endpoints were missing from every place LocalAI advertises its feature surface to clients: * api_instructions — the machine-readable capability index at GET /api/instructions. Added `face-recognition` as a dedicated instruction area with an intro that calls out the in-memory registry caveat and the /v1/face/embed vs /v1/embeddings split. * auth/permissions — added FeatureFaceRecognition constant, routed all six face endpoints through it so admins can gate them per-user like any other API feature. Default ON (matches the other API features). * React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a follow-up (noted in the plan). Instruction count bumped 9 → 10; test updated. Assisted-by: Claude:claude-opus-4-7[1m] * docs(agents): capture advertising-surface steps in the endpoint guide Before this change, adding a new /v1/* endpoint reliably missed one or more of: the swagger @Tags annotation, the /api/instructions registry, the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The endpoint would work but be invisible to API consumers, admins, and the UI — and nothing in the existing docs said to look in those places. 
Extend .agents/api-endpoints-and-auth.md with a new "Advertising surfaces" section covering all four surfaces (swagger tags, /api/ instructions, capabilities.js, docs/), and expand the closing checklist so it's impossible to ship a feature without visiting each one. Hoist a one-liner reminder into AGENTS.md's Quick Reference so agents skim it before diving in. Assisted-by: Claude:claude-opus-4-7[1m]	2026-04-22 21:55:41 +02:00
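The gRPC-to-HTTP status mapping described for `mapBackendError` in the commit above can be illustrated with a small lookup. This is a hedged Python sketch of the mapping table only; the real helper lives in Go in core/http/endpoints/localai/images.go, and the `http_status_for` name and the string-based code representation are hypothetical.

```python
# Mapping as listed in the commit message; everything else stays a 500.
GRPC_TO_HTTP = {
    "INVALID_ARGUMENT": 400,     # Bad Request
    "NOT_FOUND": 404,            # Not Found
    "FAILED_PRECONDITION": 412,  # Precondition Failed
    "UNIMPLEMENTED": 501,        # Not Implemented
}


def http_status_for(grpc_code: str) -> int:
    """Return the HTTP status for a gRPC status name (hypothetical helper).

    Unknown or unmapped codes fall through to 500, matching the
    "all other codes fall through unchanged" behaviour described above.
    """
    return GRPC_TO_HTTP.get(grpc_code.upper(), 500)


status_bad_image = http_status_for("INVALID_ARGUMENT")
status_internal = http_status_for("INTERNAL")
```

Keeping the table explicit makes it easy to see that only client-attributable failures are moved out of the 5xx range.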
Ettore Di Giacinto	a0cbc46be9	refactor(tinygrad): reuse tinygrad.apps.llm instead of vendored Transformer (#9380 ) Drop the 295-line vendor/llama.py fork in favor of `tinygrad.apps.llm`, which now provides the Transformer blocks, GGUF loader (incl. Q4/Q6/Q8 quantization), KV-cache and generate loop we were maintaining ourselves. What changed: - New vendor/appsllm_adapter.py (~90 LOC) — HF -> GGUF-native state-dict keymap, Transformer kwargs builder, `_embed_hidden` helper, and a hard rejection of qkv_bias models (Qwen2 / 2.5 are no longer supported; the apps.llm Transformer ties `bias=False` on Q/K/V projections). - backend.py routes both safetensors and GGUF paths through apps.llm.Transformer. Generation now delegates to its (greedy-only) `generate()`; Temperature / TopK / TopP / RepetitionPenalty are still accepted on the wire but ignored — documented in the module docstring. - Jinja chat render now passes `enable_thinking=False` so Qwen3's reasoning preamble doesn't eat the tool-call token budget on small models. - Embedding path uses `_embed_hidden` (block stack + output_norm) rather than the custom `embed()` method we were carrying on the vendored Transformer. - test.py gains TestAppsLLMAdapter covering the keymap rename, tied embedding fallback, unknown-key skipping, and qkv_bias rejection. - Makefile fixtures move from Qwen/Qwen2.5-0.5B-Instruct to Qwen/Qwen3-0.6B (apps.llm-compatible) and tool_parser from qwen3_xml to hermes (the HF chat template emits hermes-style JSON tool calls). Verified with the docker-backed targets: test-extra-backend-tinygrad 5/5 PASS test-extra-backend-tinygrad-embeddings 3/3 PASS test-extra-backend-tinygrad-whisper 4/4 PASS test-extra-backend-tinygrad-sd 3/3 PASS	2026-04-16 22:41:18 +02:00
Ettore Di Giacinto	b4e30692a2	feat(backends): add sglang (#9359) * feat(backends): add sglang Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): force AVX-512 CXXFLAGS and disable CI e2e job sgl-kernel's shm.cpp uses __m512 AVX-512 intrinsics unconditionally; -march=native fails on CI runners without AVX-512 in /proc/cpuinfo. Force -march=sapphirerapids so the build always succeeds, matching sglang upstream's docker/xeon.Dockerfile recipe. The resulting binary still requires an AVX-512 capable CPU at runtime, so disable tests-sglang-grpc in test-extra.yml for the same reason tests-vllm-grpc is disabled. Local runs with make test-extra-backend-sglang still work on hosts with the right SIMD baseline. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang): patch CMakeLists.txt instead of CXXFLAGS for AVX-512 CXXFLAGS with -march=sapphirerapids was being overridden by add_compile_options(-march=native) in sglang's CPU CMakeLists.txt, since CMake appends those flags after CXXFLAGS. Sed-patch the CMakeLists.txt directly after cloning to replace -march=native. --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-16 22:40:56 +02:00
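The effect of the CMakeLists.txt patch in the sglang commit above can be shown with a plain text substitution. This is a rough Python sketch of the idea, not the sed command the build actually runs; the `patch_march` name and the one-line CMake excerpt are hypothetical stand-ins.

```python
def patch_march(cmake_text: str, target: str = "sapphirerapids") -> str:
    """Replace every -march=native with -march=<target> in the given text."""
    return cmake_text.replace("-march=native", f"-march={target}")


# Hypothetical one-line excerpt standing in for sglang's CPU CMakeLists.txt.
original = "add_compile_options(-march=native)\n"
patched = patch_march(original)
```

Editing the file itself sidesteps the ordering problem: a flag passed via CXXFLAGS comes earlier on the command line than the one CMake appends, so the later -march=native would otherwise win.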
Ettore Di Giacinto	6f0051301b	feat(backend): add tinygrad multimodal backend (experimental) (#9364 ) * feat(backend): add tinygrad multimodal backend Wire tinygrad as a new Python backend covering LLM text generation with native tool-call extraction, embeddings, Stable Diffusion 1.x image generation, and Whisper speech-to-text from a single self-contained container. Backend (`backend/python/tinygrad/`): - `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects Llama / Qwen2 / Mistral architecture from `config.json`, supports safetensors and GGUF), Embedding via mean-pooled last hidden state, GenerateImage via the vendored SD1.x pipeline, AudioTranscription + AudioTranscriptionStream via the vendored Whisper inference loop, plus Tokenize / ModelMetadata / Status / Free. - Vendored upstream model code under `vendor/` (MIT, headers preserved): llama.py with an added `qkv_bias` flag for Qwen2-family bias support and an `embed()` method that returns the last hidden state, plus clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf training branch that pulls `mlperf.initializers`), audio_helpers.py and whisper.py (trimmed to drop the pyaudio listener). - Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 / Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral (Mistral / Mixtral). Auto-selected from model architecture or `Options`. - `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the default portable python is 3.10). - `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so tinygrad's CPU device uses the in-process libLLVM JIT instead of shelling out to the missing `clang` binary. - Local unit tests for Health and the four parsers in `test.py`. 
Build wiring: - Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, `BACKEND_TINYGRAD = tinygrad\|python\|.\|false\|true`, docker-build-target eval, and `docker-build-backends` aggregator. - `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix entries (mirrors the transformers backend placement). - `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image entries (latest + development). E2E test wiring: - `tests/e2e-backends/backend_test.go` gains an `image` capability that exercises GenerateImage and asserts a non-empty PNG is written to `dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS` knobs. - Five new make targets next to `test-extra-backend-vllm`: - `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes, mirrors the vllm target 1:1 (5/9 specs in ~57s). - `test-extra-backend-tinygrad-embeddings` — same model, embeddings via LLM hidden state (3/9 in ~10s). - `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror, health/load/image (3/9 in ~10min, 4 diffusion steps on CPU). - `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en against jfk.wav from whisper.cpp samples (4/9 in ~49s). - `test-extra-backend-tinygrad-all` aggregate. All four targets land green on the first MVP pass: 15 specs total, 0 failures across LLM+tools, embeddings, image generation, and speech transcription. * refactor(tinygrad): collapse to a single backend image tinygrad generates its own GPU kernels (PTX renderer for CUDA, the autogen ctypes wrappers for HIP / Metal / WebGPU) and never links against cuDNN, cuBLAS, or any toolkit-version-tied library. The only runtime dependency that varies across hosts is the driver's libcuda.so.1 / libamdhip64.so, which are injected into the container at run time by the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based backends, there is no reason to ship per-CUDA-version images. 
- Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries from .github/workflows/backend.yml. The sole remaining entry is renamed to -tinygrad (from -cpu-tinygrad) since it is no longer CPU-only. - Collapse backend/index.yaml to a single meta + development pair. The meta anchor carries the latest uri directly; the development entry points at the master tag. - run.sh picks the tinygrad device at launch time by probing /usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX renderer (avoids any nvrtc/toolkit dependency); otherwise we fall back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process libLLVM JIT for the CLANG path. - backend.py's _select_tinygrad_device() is trimmed to a CLANG-only fallback since production device selection happens in run.sh. Re-ran test-extra-backend-tinygrad after the change: Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed	2026-04-15 19:48:23 +02:00
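The launch-time device selection described for run.sh in the commit above can be sketched as a small decision function. This is a Python illustration under stated assumptions, not the shell script itself: `select_device_env` is a hypothetical name, the library probing is reduced to two booleans (the probed paths are elided in the message), and the `HIP` and `CLANG` variable names are assumed; only `CUDA=1`, `CUDA_PTX=1`, and `CPU_LLVM=1` are stated in the source.

```python
def select_device_env(has_libcuda: bool, has_libamdhip: bool) -> dict:
    """Return the environment variables to export for tinygrad.

    Priority follows the commit message: CUDA first, then HIP,
    then the CPU path backed by the in-process libLLVM JIT.
    """
    if has_libcuda:
        # CUDA_PTX=1 selects tinygrad's own PTX renderer, avoiding nvrtc.
        return {"CUDA": "1", "CUDA_PTX": "1"}
    if has_libamdhip:
        # Assumed variable name for the HIP fallback.
        return {"HIP": "1"}
    # Assumed variable name for the CPU fallback; LLVM_PATH is also set
    # in the real script but its value is not given in the message.
    return {"CLANG": "1", "CPU_LLVM": "1"}


nvidia_host = select_device_env(has_libcuda=True, has_libamdhip=False)
cpu_only_host = select_device_env(has_libcuda=False, has_libamdhip=False)
```

Deciding at launch rather than at build time is what lets a single image serve every host type.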
Ettore Di Giacinto	95efb8a562	feat(backend): add turboquant llama.cpp-fork backend (#9355 ) * feat(backend): add turboquant llama.cpp-fork backend turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme. It ships as a first-class backend reusing backend/cpp/llama-cpp sources via a thin wrapper Makefile: each variant target copies ../llama-cpp into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No duplication of grpc-server.cpp — upstream fixes flow through automatically. Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml, adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0 to exercise the KV-cache config path (backend_test.go gains dedicated env vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement usable by any llama.cpp-family backend), and registers a nightly auto-bump PR in bump_deps.yaml tracking feature/turboquant-kv-cache. scripts/changed-backends.js gets a special-case so edits to backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since the wrapper reuses those sources. * feat(turboquant): carry upstream patches against fork API drift turboquant branched from llama.cpp before upstream commit 66060008 ("server: respect the ignore eos flag", #21203) which added the `logit_bias_eog` field to `server_context_meta` and a matching parameter to `server_task::params_from_json_cmpl`. The shared backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so building it against the fork unmodified fails. Cherry-pick that commit as a patch file under backend/cpp/turboquant/patches/ and apply it to the cloned fork sources via a new apply-patches.sh hook called from the wrapper Makefile. 
Simplifies the build flow too: instead of hopping through llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now drives the copied Makefile directly (clone -> patch -> build). Drop the corresponding patch whenever the fork catches up with upstream — the build fails fast if a patch stops applying, which is the signal to retire it. * docs: add turboquant backend section + clarify cache_type_k/v Document the new turboquant (llama.cpp fork with TurboQuant KV-cache) backend alongside the existing llama-cpp / ik-llama-cpp sections in features/text-generation.md: when to pick it, how to install it from the gallery, and a YAML example showing backend: turboquant together with cache_type_k / cache_type_v. Also expand the cache_type_k / cache_type_v table rows in advanced/model-configuration.md to spell out the accepted llama.cpp quantization values and note that these fields apply to all llama.cpp-family backends, not just vLLM. * feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but ggml/include/ggml-rpc.h static-asserts it equals 96, breaking the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server). Carry a one-line patch that updates the expected count so the assertion holds. Drop this patch whenever the fork fixes it upstream. * feat(turboquant): allow turbo* KV-cache types and exercise them in e2e The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own allow-list of accepted KV-cache types (kv_cache_types[]) and rejects anything outside it before the value reaches llama.cpp's parser. That list only contains the standard llama.cpp types — turbo2/turbo3/turbo4 would throw "Unsupported cache type" at LoadModel time, meaning nothing the LocalAI gRPC layer accepted was actually fork-specific. 
Add a build-time augmentation step (patch-grpc-server.sh, called from the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0 into the allow-list of the copied grpc-server.cpp under turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/ is never touched, so the stock llama-cpp build keeps compiling against vanilla upstream which has no notion of those enum values. Switch test-extra-backend-turboquant to set BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite actually runs the fork's TurboQuant KV-cache code paths (turbo3 also auto-enables flash_attention in the fork). Picking q8_0 here would only re-test the standard llama.cpp path that the upstream llama-cpp backend already covers. Refresh the docs (text-generation.md + model-configuration.md) to list turbo2/turbo3/turbo4 explicitly and call out that you only get the TurboQuant code path with this backend + a turbo* cache type. * fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3 The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant) does not install python3, so the python-based augmentation step errored with `python3: command not found` at make time. Switch to awk, which ships with the Ubuntu base image (as mawk, not as part of coreutils) and is already available everywhere the rest of the wrapper Makefile runs. * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-15 01:25:04 +02:00
Ettore Di Giacinto	87e6de1989	feat: wire transcription for llama.cpp, add streaming support (#9353 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 16:13:40 +02:00
Ettore Di Giacinto	016da02845	feat: refactor shared helpers and enhance MLX backend functionality (#9335 ) * refactor(backends): extract python_utils + add mlx_utils shared helpers Move parse_options() and messages_to_dicts() out of vllm_utils.py into a new framework-agnostic python_utils.py, and re-export them from vllm_utils so existing vllm / vllm-omni imports keep working. Add mlx_utils.py with split_reasoning() and parse_tool_calls() — ported from mlx_vlm/server.py's process_tool_calls. These work with any mlx-lm / mlx-vlm tool module (anything exposing tool_call_start, tool_call_end, parse_tool_call). Used by the mlx and mlx-vlm backends in later commits to emit structured ChatDelta.tool_calls without reimplementing per-model parsing. Shared smoke tests confirm: - parse_options round-trips bool/int/float/string - vllm_utils re-exports are identity-equal to python_utils originals - mlx_utils parse_tool_calls handles <tool_call>...</tool_call> with a shim module and produces a correctly-indexed list with JSON arguments - mlx_utils split_reasoning extracts <think> blocks and leaves clean content * feat(mlx): wire native tool parsers + ChatDelta + token usage + logprobs Bring the MLX backend up to the same structured-output contract as vLLM and llama.cpp: emit Reply.chat_deltas so the OpenAI HTTP layer sees tool_calls and reasoning_content, not just raw text. Key insight: mlx_lm.load() returns a TokenizerWrapper that already auto- detects the right tool parser from the model's chat template (_infer_tool_parser in mlx_lm/tokenizer_utils.py). The wrapper exposes has_tool_calling, has_thinking, tool_parser, tool_call_start, tool_call_end, think_start, think_end — no user configuration needed, unlike vLLM. Changes in backend/python/mlx/backend.py: - Imports: replace inline parse_options / messages_to_dicts with the shared helpers from python_utils. Pull split_reasoning / parse_tool_calls from the new mlx_utils shared module. 
- LoadModel: log the auto-detected has_tool_calling / has_thinking / tool_parser_type for observability. Drop the local is_float / is_int duplicates. - _prepare_prompt: run request.Messages through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive the conversion, and pass tools=json.loads(request.Tools) + enable_thinking=True (when request.Metadata says so) to apply_chat_template. Falls back on TypeError for tokenizers whose template doesn't accept those kwargs. - _build_generation_params: return an additional (logits_params, stop_words) pair. Maps RepetitionPenalty / PresencePenalty / FrequencyPenalty to mlx_lm.sample_utils.make_logits_processors and threads StopPrompts through to post-decode truncation. - New _tool_module_from_tokenizer / _finalize_output / _truncate_at_stop helpers. _finalize_output runs split_reasoning when has_thinking is true and parse_tool_calls (using a SimpleNamespace shim around the wrapper's tool_parser callable) when has_tool_calling is true, then extracts prompt_tokens, generation_tokens and (best-effort) logprobs from the last GenerationResponse chunk. - Predict: use make_logits_processors, accumulate text + last_response, finalize into a structured Reply carrying chat_deltas, prompt_tokens, tokens, logprobs. Early-stops on user stop sequences. - PredictStream: per-chunk Reply still carries raw message bytes for back-compat but now also emits chat_deltas=[ChatDelta(content=delta)]. On loop exit, emit a terminal Reply with structured reasoning_content / tool_calls / token counts / logprobs — so the Go side sees tool calls without needing the regex fallback. - TokenizeString RPC: uses the TokenizerWrapper's encode(); returns length + tokens or FAILED_PRECONDITION if the model isn't loaded. - Free RPC: drops model / tokenizer / lru_cache, runs gc.collect(), calls mx.metal.clear_cache() when available, and best-effort clears torch.cuda as a belt-and-suspenders. 
* feat(mlx-vlm): mirror MLX parity (tool parsers + ChatDelta + samplers) Same treatment as the MLX backend: emit structured Reply.chat_deltas, tool_calls, reasoning_content, token counts and logprobs, and extend sampling parameter coverage beyond the temp/top_p pair the backend used to handle. - Imports: drop the inline is_float/is_int helpers, pull parse_options / messages_to_dicts from python_utils and split_reasoning / parse_tool_calls from mlx_utils. Also import make_sampler and make_logits_processors from mlx_lm.sample_utils — mlx-vlm re-uses them. - LoadModel: use parse_options; call mlx_vlm.tool_parsers._infer_tool_parser / load_tool_module to auto-detect a tool module from the processor's chat_template. Stash think_start / think_end / has_thinking so later finalisation can split reasoning blocks without duck-typing on each call. Logs the detected parser type. - _prepare_prompt: convert proto Messages via messages_to_dicts (so tool_call_id / tool_calls survive), pass tools=json.loads(request.Tools) and enable_thinking=True to apply_chat_template when present, fall back on TypeError for older mlx-vlm versions. Also handle the prompt-only + media and empty-prompt + media paths consistently. - _build_generation_params: return (max_tokens, sampler_params, logits_params, stop_words). Maps repetition_penalty / presence_penalty / frequency_penalty and passes them through make_logits_processors. - _finalize_output / _truncate_at_stop: common helper used by Predict and PredictStream to split reasoning, run parse_tool_calls against the auto-detected tool module, build ToolCallDelta list, and extract token counts + logprobs from the last GenerationResult. 
- Predict / PredictStream: switch from mlx_vlm.generate to mlx_vlm.stream_generate in both paths, accumulate text + last_response, pass sampler and logits_processors through, emit content-only ChatDelta per streaming chunk followed by a terminal Reply carrying reasoning_content, tool_calls, prompt_tokens, tokens and logprobs. Non-streaming Predict returns the same structured Reply shape. - New helper _collect_media extracted from the duplicated base64 image / audio decode loop. - New TokenizeString RPC using the processor's tokenizer.encode and Free RPC that drops model/processor/config, runs gc + Metal cache clear + best-effort torch.cuda cache clear. * feat(importer/mlx): auto-set tool_parser/reasoning_parser on import Mirror what core/gallery/importers/vllm.go does: after applying the shared inference defaults, look up the model URI in parser_defaults.json and append matching tool_parser:/reasoning_parser: entries to Options. The MLX backends auto-detect tool parsers from the chat template at runtime so they don't actually consume these options — but surfacing them in the generated YAML: - keeps the import experience consistent with vllm - gives users a single visible place to override - documents the intended parser for a given model family * test(mlx): add helper unit tests + TokenizeString/Free + e2e make targets - backend/python/mlx/test.py: add TestSharedHelpers with server-less unit tests for parse_options, messages_to_dicts, split_reasoning and parse_tool_calls (using a SimpleNamespace shim to fake a tool module without requiring a model). Plus test_tokenize_string and test_free RPC tests that load a tiny MLX-quantized Llama and exercise the new RPCs end-to-end. - backend/python/mlx-vlm/test.py: same helper unit tests + cleanup of the duplicated import block at the top of the file. - Makefile: register BACKEND_MLX and BACKEND_MLX_VLM (they were missing from the docker-build-target eval list — only mlx-distributed had a generated target before). 
Add test-extra-backend-mlx and test-extra-backend-mlx-vlm convenience targets that build the respective image and run tests/e2e-backends with the tools capability against mlx-community/Qwen2.5-0.5B-Instruct-4bit. The MLX backend auto-detects the tool parser from the chat template so no BACKEND_TEST_OPTIONS is needed (unlike vllm). * fix(libbackend): don't pass --copies to venv unless PORTABLE_PYTHON=true backend/python/common/libbackend.sh:ensureVenv() always invoked 'python -m venv --copies', but macOS system python (and some other builds) refuses with: Error: This build of python cannot create venvs without using symlinks --copies only matters when _makeVenvPortable later relocates the venv, which only happens when PORTABLE_PYTHON=true. Make --copies conditional on that flag and fall back to default (symlinked) venv otherwise. Caught while bringing up the mlx backend on Apple Silicon — the same build path is used by every Python backend with USE_PIP=true. * fix(mlx): support mlx-lm 0.29.x tool calling + drop deprecated clear_cache The released mlx-lm 0.29.x ships a much simpler tool-calling API than HEAD: TokenizerWrapper detects the <tool_call>...</tool_call> markers from the tokenizer vocab and exposes has_tool_calling / tool_call_start / tool_call_end, but does NOT expose a tool_parser callable on the wrapper and does NOT ship a mlx_lm.tool_parsers subpackage at all (those only exist on main). Caught while running the smoke test on Apple Silicon with the released mlx-lm 0.29.1: tokenizer.tool_parser raised AttributeError (falling through to the underlying HF tokenizer), so _tool_module_from_tokenizer always returned None and tool calls slipped through as raw <tool_call>...</tool_call> text in Reply.message instead of being parsed into ChatDelta.tool_calls. 
Fix: when has_tool_calling is True but tokenizer.tool_parser is missing, default the parse_tool_call callable to json.loads(body.strip()) — that's exactly what mlx_lm.tool_parsers.json_tools.parse_tool_call does on HEAD and covers the only format 0.29 detects (<tool_call>JSON</tool_call>). Future mlx-lm releases that ship more parsers will be picked up automatically via the tokenizer.tool_parser attribute when present. Also tighten the LoadModel logging — the old log line read init_kwargs.get('tool_parser_type') which doesn't exist on 0.29 and showed None even when has_tool_calling was True. Log the actual tool_call_start / tool_call_end markers instead. While here, switch Free()'s Metal cache clear from the deprecated mx.metal.clear_cache to mx.clear_cache (mlx >= 0.30), with a fallback for older releases. Mirrored to the mlx-vlm backend. * feat(mlx-distributed): mirror MLX parity (tool calls + ChatDelta + sampler) Same treatment as the mlx and mlx-vlm backends: emit Reply.chat_deltas with structured tool_calls / reasoning_content / token counts / logprobs, expand sampling parameter coverage beyond temp+top_p, and add the missing TokenizeString and Free RPCs. Notes specific to mlx-distributed: - Rank 0 is the only rank that owns a sampler — workers participate in the pipeline-parallel forward pass via mx.distributed and don't re-implement sampling. So the new logits_params (repetition_penalty, presence_penalty, frequency_penalty) and stop_words apply on rank 0 only; we don't need to extend coordinator.broadcast_generation_params, which still ships only max_tokens / temperature / top_p to workers (everything else is a rank-0 concern). - Free() now broadcasts CMD_SHUTDOWN to workers when a coordinator is active, so they release the model on their end too. The constant is already defined and handled by the existing worker loop in backend.py:633 (CMD_SHUTDOWN = -1). 
- Drop the locally-defined is_float / is_int / parse_options trio in favor of python_utils.parse_options, re-exported under the module name for back-compat with anything that imported it directly. - _prepare_prompt: route through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive, pass tools=json.loads( request.Tools) and enable_thinking=True to apply_chat_template, fall back on TypeError for templates that don't accept those kwargs. - New _tool_module_from_tokenizer (with the json.loads fallback for mlx-lm 0.29.x), _finalize_output, _truncate_at_stop helpers — same contract as the mlx backend. - LoadModel logs the auto-detected has_tool_calling / has_thinking / tool_call_start / tool_call_end so users can see what the wrapper picked up for the loaded model. - backend/python/mlx-distributed/test.py: add the same TestSharedHelpers unit tests (parse_options, messages_to_dicts, split_reasoning, parse_tool_calls) that exist for mlx and mlx-vlm.	2026-04-13 18:44:03 +02:00
Ettore Di Giacinto	d67623230f	feat(vllm): parity with llama.cpp backend (#9328 ) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. 
* feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). 
Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. 
* test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. 
GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. 
FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. 
libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.	2026-04-13 11:00:29 +02:00
Ettore Di Giacinto	9ca03cf9cc	feat(backends): add ik-llama-cpp (#9326 ) * feat(backends): add ik-llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: add grpc e2e suite, hook to CI, update README Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-12 13:51:28 +02:00
Ettore Di Giacinto	7a0e6ae6d2	feat(qwen3tts.cpp): add new backend (#9316 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-11 23:14:26 +02:00
Ettore Di Giacinto	706cf5d43c	feat(sam.cpp): add sam.cpp detection backend (#9288 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-09 21:49:11 +02:00
Ettore Di Giacinto	e00ce981f0	fix: try to add whisperx and faster-whisper for more variants (#9278 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-08 21:23:38 +02:00
Richard Palethorpe	ea6e850809	feat: Add Kokoros backend (#9212 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-08 19:23:16 +02:00
Ettore Di Giacinto	0e9d1a6588	chore(ci): drop unnecessary test Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-08 12:19:54 +00:00
Ettore Di Giacinto	031a36c995	feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092 ) * feat: wire min_p Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: inferencing defaults Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(refactor): re-use iterative parser Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: generate automatically inference defaults from unsloth Instead of trying to re-invent the wheel and maintain here the inference defaults, prefer to consume unsloth ones, and contribute there as necessary. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: apply defaults also to models installed via gallery Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: be consistent and apply fallback to all endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-22 00:57:15 +01:00
Ettore Di Giacinto	f7e8d9e791	feat(quantization): add quantization backend (#9096 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-22 00:56:34 +01:00
Ettore Di Giacinto	d9c1db2b87	feat: add (experimental) fine-tuning support with TRL (#9088 ) * feat: add fine-tuning endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(experimental): add fine-tuning endpoint and TRL support This changeset defines new gRPC signatures for fine-tuning backends and adds the TRL backend as the initial fine-tuning engine. This implementation also supports exporting to GGUF and automatically importing the result into LocalAI after fine-tuning. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * commit TRL backend, stop by killing process Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * move fine-tune to generic features Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * add evals, reorder menu Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fix tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-21 02:08:02 +01:00
Richard Palethorpe	3d9ccd1ddc	fix(ui): Add tracing inline settings back and create UI tests (#9027 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-03-16 17:51:06 +01:00
Ettore Di Giacinto	5affb747a9	chore: drop AIO images (#9004 ) AIO images are behind and take effort to maintain. The wizard and model installation have been simplified massively, so AIO images have lost their purpose. This lets us stay laser-focused on the main images and relieves stress on CI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-14 17:49:36 +01:00
Richard Palethorpe	f9a850c02a	feat(realtime): WebRTC support (#8790 ) * feat(realtime): WebRTC support Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(tracing): Show full LLM opts and deltas Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-03-13 21:37:15 +01:00
Ettore Di Giacinto	a738f8b0e4	feat(backends): add ace-step.cpp (#8965 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-12 18:56:26 +01:00
Ettore Di Giacinto	7dc691c171	feat: add fish-speech backend (#8962 ) * feat: add fish-speech backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * drop portaudio Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-12 07:48:23 +01:00
Ettore Di Giacinto	a026277ab9	feat(mlx-distributed): add new MLX-distributed backend (#8801 ) * feat(mlx-distributed): add new MLX-distributed backend Add a new MLX distributed backend with support for both TCP and RDMA for model sharding. This implementation ties into the discovery implementation already in place, and re-uses the same P2P mechanism for TCP MLX-distributed inference. The auto-parallel implementation is inspired by Exo's (who have been added to the acknowledgements for the great work!) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * expose a CLI to facilitate backend starting Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: make manual rank0 configurable via model configs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add missing features from mlx backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-03-09 17:29:32 +01:00
Ettore Di Giacinto	09ddaf94b2	feat(ui): move to React for frontend (#8772 ) * feat(ui): move to React Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add import model Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * syntax highlight Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Minor fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-05 21:47:12 +01:00
LocalAI [bot]	dfc6efb88d	feat(backends): add faster-qwen3-tts (#8664 ) * feat(backends): add faster-qwen3-tts Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: this backend is CUDA only Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: add requirements-install.txt with setuptools for build isolation The faster-qwen3-tts backend requires setuptools to build packages like sox that have setuptools as a build dependency. This ensures the build completes successfully in CI. Signed-off-by: LocalAI Bot <localai-bot@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: LocalAI Bot <localai-bot@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-02-27 08:16:51 +01:00
Ettore Di Giacinto	bf5a1dd840	feat(voxtral): add voxtral backend (#8451 ) * feat(voxtral): add voxtral backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * simplify Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-02-09 09:12:05 +01:00
Ettore Di Giacinto	3370d807c2	feat(nemo): add Nemo (only asr for now) backend (#8436 ) * feat(nemo): add Nemo (only asr for now) backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nemo): add Nemo backend without Python version pins (#8438) * Initial plan * Remove Python version pins from nemo backend install.sh Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> * Pin pyarrow to 20.0.0 in nemo requirements Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-02-07 08:19:37 +01:00
Ettore Di Giacinto	53276d28e7	feat(musicgen): add ace-step and UI interface (#8396 ) * feat(musicgen): add ace-step and UI interface Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Correctly handle model dir Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop auto-download Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Add to models, fixup UIs icons Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Update docs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * l4t13 is incompatible Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * avoid pinning version for cuda12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop l4t12 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-02-05 12:04:53 +01:00
Ettore Di Giacinto	e7fc604dbc	feat(metal): try to extend support to remaining backends (#8374 ) * feat(metal): try to extend support to remaining backends Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * neutts doesn't work Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * split outetts out of transformers Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Remove torch pin to whisperx Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-02-03 21:57:50 +01:00
Dream	10a1e6c74d	feat(whisperx): add whisperx backend for transcription with speaker diarization (#8299 ) * feat(proto): add speaker field to TranscriptSegment for diarization Add speaker field to the gRPC TranscriptSegment message and map it through the Go schema, enabling backends to return speaker labels. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx backend for transcription with diarization Add Python gRPC backend using WhisperX for speech-to-text with word-level timestamps, forced alignment, and speaker diarization via pyannote-audio when HF_TOKEN is provided. Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): register whisperx backend in Makefile Signed-off-by: eureka928 <meobius123@gmail.com> * feat(whisperx): add whisperx meta and image entries to index.yaml Signed-off-by: eureka928 <meobius123@gmail.com> * ci(whisperx): add build matrix entries for CPU, CUDA 12/13, and ROCm Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): unpin torch versions and use CPU index for cpu requirements Address review feedback: - Use --extra-index-url for CPU torch wheels to reduce size - Remove torch version pins, let uv resolve compatible versions Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch ROCm variant to fix CI build failure Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): pin torch CPU variant to fix uv resolution failure Pin torch==2.8.0+cpu so uv resolves the CPU wheel from the extra index instead of picking torch==2.8.0+cu128 from PyPI, which pulls unresolvable CUDA dependencies. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): use unsafe-best-match index strategy to fix uv resolution failure uv's default first-match strategy finds torch on PyPI before checking the extra index, causing it to pick torch==2.8.0+cu128 instead of the CPU variant. This makes whisperx's transitive torch dependency unresolvable. 
Using unsafe-best-match lets uv consider all indexes. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(whisperx): drop +cpu local version suffix to fix uv resolution failure PEP 440 ==2.8.0 matches 2.8.0+cpu from the extra index, avoiding the issue where uv cannot locate an explicit +cpu local version specifier. This aligns with the pattern used by all other CPU backends. Signed-off-by: eureka928 <meobius123@gmail.com> * fix(backends): drop +rocm local version suffixes from hipblas requirements to fix uv resolution uv cannot resolve PEP 440 local version specifiers (e.g. +rocm6.4, +rocm6.3) in pinned requirements. The --extra-index-url already points to the correct ROCm wheel index and --index-strategy unsafe-best-match (set in libbackend.sh) ensures the ROCm variant is preferred. Applies the same fix as `7f5d72e8` (which resolved this for +cpu) across all 14 hipblas requirements files. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> * revert: scope hipblas suffix fix to whisperx only Reverts changes to non-whisperx hipblas requirements files per maintainer review — other backends are building fine with the +rocm local version suffix. Signed-off-by: eureka928 <meobius123@gmail.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: eureka928 <meobius123@gmail.com> --------- Signed-off-by: eureka928 <meobius123@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 16:33:12 +01:00
Ettore Di Giacinto	4ca5b737bf	chore(cuda): target 12.8 for 12 to increase compatibility (#8297 ) Some datacenter setups might be stuck with the 5.x kernel, which doesn't play well with CUDA >=12.9. To increase compatibility with the CUDA 12.x branch, downgrade to 12.8. For newer systems, it is still suggested to use CUDA 13.x wherever compatible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-30 12:58:44 +01:00
Ettore Di Giacinto	4077aaf978	chore: re-enable e2e tests, fixups anthropic API tools support (#8296 ) * chore(tests): add mock backend e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Fixup anthropic tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * prepare e2e tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop repetitive tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Drop specific CI workflow Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fixup anthropic issues, move all e2e tests to use mocked backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-30 12:41:50 +01:00
Ettore Di Giacinto	1e08e02598	feat(qwen-asr): add support to qwen-asr (#8281 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-29 21:50:35 +01:00
Ettore Di Giacinto	9b973b79f6	feat: add VoxCPM tts backend (#8109 ) * feat: add VoxCPM tts backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Disable voxcpm on arm64 cpu Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-01-28 14:44:04 +01:00
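
Note on the whisperx commit above (10a1e6c74d): its fix relies on a PEP 440 rule. An `==` pin without a local version label ignores the local label of candidates, so `==2.8.0` accepts `2.8.0+cpu`, while a pin that carries a label must match that label exactly. A minimal sketch of the rule follows. The helper names `split_local` and `pin_matches` are hypothetical, the code skips PEP 440 normalization (zero padding, case, separators), and it does not claim to reflect how uv or pip resolve versions internally.

```python
from typing import Optional, Tuple


def split_local(version: str) -> Tuple[str, Optional[str]]:
    """Split 'public+local' into (public, local); local is None if absent."""
    public, sep, local = version.partition("+")
    return public, (local if sep else None)


def pin_matches(pinned: str, candidate: str) -> bool:
    """Simplified PEP 440 '==' check (hypothetical helper, no normalization).

    - The public parts must be identical strings.
    - If the pin has no local label, the candidate's label is ignored.
    - If the pin has a local label, the candidate must carry the same one.
    """
    pin_public, pin_local = split_local(pinned)
    cand_public, cand_local = split_local(candidate)
    if pin_public != cand_public:
        return False
    if pin_local is None:
        return True
    return pin_local == cand_local


if __name__ == "__main__":
    # A pin without a label accepts both the CPU and the CUDA wheel,
    # so which wheel gets installed depends on the index that is consulted.
    print(pin_matches("2.8.0", "2.8.0+cpu"))
    print(pin_matches("2.8.0", "2.8.0+cu128"))
    # A pin with a label is strict about the label.
    print(pin_matches("2.8.0+cpu", "2.8.0+cu128"))
```

Because a label-free pin matches every local variant, the choice between `+cpu` and `+cu128` falls to index selection, which is why the commit pairs the unsuffixed pin with `--extra-index-url` and the `unsafe-best-match` index strategy.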

1211 Commits