LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-02 04:16:56 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	39a93e91cf	chore: ⬆️ Update vllm-metal (darwin) to `v0.3.0.dev20260701132215` (#10633 ) ⬆️ Update vllm-project/vllm-metal (darwin) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:48:08 +02:00
LocalAI [bot]	26e0c98967	chore: ⬆️ Update leejet/stable-diffusion.cpp to `3590aa8d626e671a1b1dc84506ea2932a243a480` (#10631 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:54 +02:00
LocalAI [bot]	9acca54b25	chore: ⬆️ Update mudler/parakeet.cpp to `e8acc6172a94e20a952cf1843decace5d771a94b` (#10629 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:41 +02:00
LocalAI [bot]	2728e6000e	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `068b173649f2fd8dc96b35ada5a0b76d8985105d` (#10632 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:28 +02:00
LocalAI [bot]	006310d746	chore: ⬆️ Update ggml-org/llama.cpp to `4fc4ec5541b243957ae5099edb67372f8f3b550e` (#10630 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:15 +02:00
LocalAI [bot]	05acdb1778	chore: ⬆️ Update ggml-org/whisper.cpp to `6fc7c33b4c3a2cec83e4b65abd5e96a890480375` (#10635 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:47:01 +02:00
LocalAI [bot]	5e68b5700c	chore(model-gallery): ⬆️ update checksum (#10637 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-02 09:26:32 +02:00
pos-ei-don	7910018249	fix(vllm): non-streaming tool-call regression after #10351 (#10638 ) fix(vllm): non-streaming tool-call regression after #10351 (native_streaming is a capability flag, not a state flag) #10351 introduced native streaming via `parser.extract_tool_calls_streaming` and gated the post-loop `extract_tool_calls` block on `native_streaming and not native_streaming_error`. That works for streaming requests, but for non-streaming requests the same flag is still True (it only means "the parser can stream", not "we actually streamed"), so the block was skipped and the `elif` cleared `content = ""` — the tool call was silently lost. Symptom: non-streaming chat.completions with `tools=[...]` returns `finish_reason: "stop"` with `content: ""` and no `tool_calls`. Streaming requests are unaffected. Fix: gate both branches on `streaming` too, so the extract_tool_calls block runs for non-streaming requests (and for streaming requests that fell back to the buffered path). Reproduction (vLLM 0.24, Qwen3-Coder-Next-NVFP4, qwen3_coder parser): curl -s -X POST http://localhost:8080/v1/chat/completions \ -H 'Content-Type: application/json' \ -d '{"model":"coder","stream":false, "messages":[{"role":"user","content":"7*8 via calc"}], "tools":[{"type":"function","function":{"name":"calc", "parameters":{"type":"object", "properties":{"expression":{"type":"string"}}}}}]}' Before: finish_reason: "stop", content: "", tool_calls: [] After: finish_reason: "tool_calls", tool_calls[0].function.name: "calc" Streaming path re-verified in the same setup: delta.tool_calls arrives token-by-token, finish_reason: "tool_calls", no raw XML in content. Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>	2026-07-02 09:26:14 +02:00
LocalAI [bot]	1a03712a6f	fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table (#10627 ) * fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table ROCm's bundled libdrm_amdgpu looks up the GPU ASIC ID table at a hardcoded fallback path, /opt/amdgpu/share/libdrm/amdgpu.ids, which is only populated by AMD's full amdgpu-install (graphics/DKMS) stack. The hipblas image is compute-only and doesn't have it, so every model load logs "No such file or directory" and the GPU can't be identified. Symlink it to the equivalent file already shipped by Ubuntu's libdrm-amdgpu1 package. Fixes #10624 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(hipblas): correct amdgpu.ids source package name in comment Verified against the real rocm/dev-ubuntu-24.04:7.2.1 image with hipblas-dev/hipblaslt-dev/rocblas-dev installed: /usr/share/libdrm/amdgpu.ids is owned by libdrm-common, not libdrm-amdgpu1 as the comment said. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 09:25:14 +02:00
LocalAI [bot]	703ea32de6	chore: ⬆️ Update vllm-metal (darwin) to `v0.3.0.dev20260630095652` (#10616 ) ⬆️ Update vllm-project/vllm-metal (darwin) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 21:56:59 +02:00
LocalAI [bot]	751db06e35	chore: ⬆️ Update CrispStrobe/CrispASR to `8fd9db8fec8cb5e929d23d3267ed5817794feb1a` (#10615 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 21:56:41 +02:00
LocalAI [bot]	f46c0e9c83	docs: ⬆️ update docs version mudler/LocalAI (#10614 ) ⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 21:56:21 +02:00
LocalAI [bot]	0d8adfc59a	chore: ⬆️ Update ggml-org/llama.cpp to `0eca4d490e591d4e93058d07540cf47278a72577` (#10617 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 09:31:50 +02:00
LocalAI [bot]	43f2615e19	chore: ⬆️ Update vllm-project/vllm cu130 wheel to `0.24.0` (#10618 ) ⬆️ Update vllm-project/vllm cu130 wheel Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 08:53:03 +02:00
LocalAI [bot]	875c539ad5	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `29431b31c89e79c10f8736e8f2742485ba1713d6` (#10620 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 08:52:36 +02:00
LocalAI [bot]	d641ded194	chore: ⬆️ Update ggml-org/whisper.cpp to `0874de3e8e8e48361dba85c7fe6d176f008bf158` (#10621 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-07-01 08:43:40 +02:00
LocalAI [bot]	40445fff05	chore: ⬆️ Update leejet/stable-diffusion.cpp to `484baa41e5e006c52dcd4addc38c830b9489745f` (#10619 ) * ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * fix(stablediffusion-ggml): adapt to new generate_image() out-param signature leejet/stable-diffusion.cpp@484baa4 changed generate_image() from returning sd_image_t* to returning bool with images_out/num_images_out out-parameters (same pattern already used by generate_video()). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-01 08:32:57 +02:00
Tai An	057dee956a	fix(launcher): keep data/config under ~/.localai (#10610 ) (#10613 ) The launcher starts the server with run --models-path/--backends-path but leaves --data-path and the dynamic config dir unset, so the server falls back to its /data and /configuration defaults. is kong.ExpandPath("."), i.e. the launcher process CWD (commonly the user's home root), producing ~/data and ~/configuration outside ~/.localai and an agent-pool stateDir under ~/data. Pass --data-path and --localai-config-dir explicitly, rooted at the launcher's own data directory (GetDataPath() -> ~/.localai), so data and config stay consistent with --models-path/--backends-path.	2026-06-30 22:14:59 +02:00
Adira	4ec39bb776	fix(watchdog): don't log optional Free() as an error when backend returns Unimplemented (#10602 ) (#10607 ) * fix(watchdog): don't log optional Free() as an error when backend returns Unimplemented (#10602) When the watchdog evicts a model, deleteProcess calls the backend's gRPC Free() to release VRAM before stopping the process. Free is optional: backends that don't override it -- the generated UnimplementedBackendServer stub, many Python/external backends, or a federation proxy in distributed mode -- return gRPC Unimplemented. That is expected, not a failure: VRAM is reclaimed when the local process is stopped, or by the remote unloader for remote backends. Logging it as "WARN Error freeing GPU resources" made a benign, optional RPC look like a fault (the alarming line in #10602, seen in distributed mode where the model is remote and Free hits a stub). Treat gRPC Unimplemented from Free() as a no-op logged at Debug; genuine failures still Warn. Free() is still attempted for every backend, so any backend that does implement it is unaffected. Add a reusable grpcerrors.IsUnimplemented helper following the package's existing code-based detection idiom (prefer the typed status code, fall back to the message across non-gRPC boundaries), with table tests. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com> * fix(watchdog): log a non-Unimplemented Free() failure at error level Per review: now that the expected gRPC Unimplemented case is split out and logged at Debug, any remaining Free() error is a genuine failure to release VRAM, so surface it at error level instead of warn. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com> --------- Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>	2026-06-30 22:14:01 +02:00
Ettore Di Giacinto	25ecb9f015	fix(gallery): use Q8_0 for lfm2.5-8b-a1b to fix poor tool-call quality The Q4_K_M quant degraded tool-call reliability for LFM2.5-8B-A1B. Switch the gallery entry to the Q8_0 GGUF (sha256 verified via HF x-linked-etag) while keeping the native jinja tool-parsing config. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]	2026-06-30 17:46:20 +00:00
LocalAI [bot]	2be495f9c0	fix(kokoros): implement AudioTranscriptionLive trait stub (#10612 ) The backend.proto AudioTranscriptionLive bidirectional streaming RPC added new required trait items (AudioTranscriptionLiveStream + audio_transcription_live) on the generated Backend trait. The kokoros (TTS) backend did not implement them, breaking its release build with E0046 (missing trait items). kokoros is text-to-speech and has no live-ASR support, so stub the method to return UNIMPLEMENTED, mirroring the existing audio_transcription_stream stub. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 19:38:41 +02:00
LocalAI [bot]	02b007a31e	feat(config): default swa_full:true for sliding-window-attention models (#10611 ) LocalAI enables a cross-request prompt-prefix cache (cache_reuse, see core/config/serving_defaults.go) so repeated prefixes — system prompts, RAG context, agent scaffolds, multi-turn chat — are not reprocessed every turn. For sliding-window-attention (SWA) models (Gemma 2/3, Cohere2, Llama 4, ...) this silently does nothing: llama.cpp defaults to a reduced SWA KV cache sized to the sliding window, and that reduced cache cannot preserve a prompt prefix across requests, so every turn reprocesses the whole prompt anyway. llama.cpp's --swa-full (params.swa_full, already wired through the LocalAI llama.cpp backend's `swa_full` option) keeps the full KV cache so the shared prefix is reused. Enable it automatically, but only for models that are actually SWA: detection reads the gguf-parser-normalized `<arch>.attention.sliding_window` metadata (which also applies llama.cpp's family rules, e.g. Phi-3 → not SWA), right where the GGUF is already parsed for defaults. It is never applied to dense models (pure memory waste) and never overrides an explicit user `swa_full`/`n_swa` choice. Tradeoff: the full SWA cache scales with context_size, so it costs more memory at large contexts — hence the SWA gating and the documented `swa_full:false` opt-out. Assisted-by: Claude:claude-opus-4-8 [Claude Code] golangci-lint Co-authored-by: Ettore Di Giacinto <mudler@localai.io> v4.5.6	2026-06-30 17:58:17 +02:00
LocalAI [bot]	fd8cebd0b3	fix(watchdog): persist UI-saved Check Interval across restarts (#10601 ) (#10605 ) fix(watchdog): persist a UI-saved Check Interval across restarts (#10601) The watchdog Check Interval saved via /api/settings reverted to 500ms on every restart, while the idle/busy timeouts persisted correctly. Root cause: NewApplicationConfig baseline-defaulted WatchDogInterval to 500ms, whereas the idle/busy timeouts default to 0. The startup loader (loadRuntimeSettingsFromFile) applies a persisted runtime_settings.json value only when the field is still at its zero default - its heuristic for "this wasn't set by an env var". Because the interval was always 500ms at that point, the loader never read the persisted value back, so the saved interval was silently discarded on each boot. Fix: drop the non-zero baseline default so the interval behaves like the sibling timeouts (0 = unset). The effective 500ms default is now supplied at the watchdog layer: WithWatchdogInterval ignores a non-positive value so DefaultWatchDogOptions' 500ms is preserved (and a 0 interval can never turn the watchdog loop into a busy spin). Also mirror the interval in the live config file watcher alongside idle/busy, and report the real 500ms default (not the stale "2s") from ToRuntimeSettings. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 17:48:14 +02:00
LocalAI [bot]	dd625921ff	fix(macos): staple the notarization ticket to the .app, not just the dmg (#10606 ) Stapling only the dmg leaves the LocalAI.app bundle with no embedded notarization ticket. Gatekeeper then falls back to an online notarization check on first launch, so the app fails to open on a Mac that is offline or behind a firewall, or once it has been copied out of the dmg — while it keeps working on the (online) build host, which masks the problem. Notarize and staple the .app before packaging it into the dmg so the bundle verifies offline. Adds a `notarize-app` subcommand to contrib/macos/sign-and-notarize.sh (zips the bundle for notarytool, then staples + validates) and invokes it from dmg-launcher-darwin. Stays a no-op when notary secrets are unset, so unsigned local/fork builds are unaffected. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: mudler <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 17:38:47 +02:00
LocalAI [bot]	d74f88357e	fix(tests): align openresponses test model name with GGUF-derived naming (#10589 ) (#10609 ) PR #10589 changed repo-root HuggingFace URI imports to name the model after the selected GGUF file rather than the repository. The Open Responses API integration test still requested the old repo-derived name ("Qwen3-VL-2B-Instruct-GGUF"), so every request 404'd on an unknown model and the suite has failed on master since `1a4f68ed4`. Update testModel to the name the importer now registers for the default q4_k_m quant ("Qwen3-VL-2B-Instruct-Q4_K_M") so the specs resolve the model again. The #10589 behaviour change is intentional; only the stale test needed updating. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 15:41:44 +02:00
Adira	dfaec3bd51	fix(import): strip file:// scheme from model path for local imports (#10599 ) Importing a model from a local directory (e.g. a HuggingFace checkout or an LM Studio store) via a file:// URI produced a config whose model field kept the scheme verbatim, e.g. model: file:///Users/u/.../Qwen3-4bit. The mlx and vllm backends treat that field as a HuggingFace repo id or local path and reject the file:// form with "Repo id must be in the form 'repo_name' or 'namespace/repo_name'", so the model imported fine but failed to load (issue #7461). Add a shared LocalModelPath helper that reduces a file:// URI to the bare filesystem path it points at and leaves HuggingFace/HTTP URIs untouched, and route the mlx, vllm, transformers and diffusers importers (all of which pass details.URI straight into the model field for from_pretrained-style loading) through it. Cover the helper directly plus end-to-end file:// import specs for the mlx and vllm importers. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>	2026-06-30 10:21:08 +02:00
LocalAI [bot]	0e381897b5	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `f74a6fb87b315b2c3154166e075360e15021a61d` (#10598 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-30 09:17:48 +02:00
LocalAI [bot]	b1af37257d	chore: ⬆️ Update CrispStrobe/CrispASR to `3b93758f9725d400eca82976f895e4cec3f31260` (#10597 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-30 09:17:11 +02:00
LocalAI [bot]	ebefa6dcca	chore: ⬆️ Update localai-org/privacy-filter.cpp to `595f59630c69d361b5196f2aba2c71c873d0c13c` (#10596 ) ⬆️ Update localai-org/privacy-filter.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-30 09:16:52 +02:00
LocalAI [bot]	605348925d	chore: ⬆️ Update ggml-org/llama.cpp to `6f4f53f2b7da54fcdbbecaaa734337c337ad6176` (#10595 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-30 09:16:37 +02:00
LocalAI [bot]	686ce10b54	chore: ⬆️ Update leejet/stable-diffusion.cpp to `3b6c9ca97cfcda8e68e719e6670d06379fcbe943` (#10594 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-30 09:16:21 +02:00
pos-ei-don	2cee318fad	fix(functions): avoid quadratic-time debug logging in CleanupLLMResult / ParseFunctionCall (#10592 ) fix(functions): avoid quadratic-time debug logging in CleanupLLMResult/ParseFunctionCall The streaming chat path (core/http/endpoints/openai/chat_stream_workers.go) calls CleanupLLMResult / ParseFunctionCall once per delta chunk with the full accumulated LLM result so far. Both functions xlog.Debug the entire argument on entry and exit, so a single N-chunk stream emits roughly chunk_size * N^2 bytes of debug output. Under LOG_LEVEL=debug this was observed in a recent SGLang-via-LocalAI session on a DGX Spark host (about 50K tokens, long streaming generation) to drive container logs to ~96 GiB, which interacted with the streaming hot loop on the same filesystem and contributed to a host-wide hard hang once disk pressure built up. Workaround was setting LOG_LEVEL=info, but the quadratic shape remains a foot-gun for anyone intentionally enabling debug. Replace the four result-content debug arguments with len(...) plus a fixed-size head (200 bytes via a new truncForLog helper), bounding per- call output to a constant. The debug signal stays useful: the first 200 chars are enough to identify which generation is in flight, and the length lets you observe growth without paying for the payload itself. No API change. No behaviour change for LOG_LEVEL != debug. Signed-off-by: Poseidon <philipp.wacker@ibf-solutions.com> Co-authored-by: Poseidon <philipp.wacker@ibf-solutions.com>	2026-06-30 09:16:03 +02:00
Adira	1a4f68ed4a	fix(import): derive model name from selected GGUF for repo-root URIs (#10589 ) When importing a HuggingFace GGUF model from a repository-root URI (no file component, e.g. hf://owner/repo) with the Model Name field left blank, the importer named the model after the repository (filepath.Base(details.URI)) instead of the GGUF file it actually selected from the repo listing (issue #10587). Track whether the user supplied an explicit name; the URI base is now only a fallback. In the HuggingFace branch, once the model group is picked, re-derive the name from the selected GGUF via a new modelNameFromShardGroup helper that uses ShardGroup.Base minus the .gguf extension. For sharded models this yields a clean logical name (e.g. Qwen3-30B-A3B-Q4_K_M) rather than a shard filename like ...-00001-of-00002. An explicit name preference still always wins, and the .gguf/URL/OCI paths are unchanged. Add network-free unit specs covering name-from-GGUF, clean-name-from-shard-base, and explicit-name precedence, and update the live integration specs that had encoded the previous repo-name behaviour. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>	2026-06-30 09:03:27 +02:00
Adira	28d7397743	fix(openai): stop max_tokens streaming retry loop on reasoning models (#9716 ) (#10448 ) fix(openai): stop max_tokens streaming retry loop on reasoning models When a thinking model spends its entire max_tokens budget on the reasoning block, the C++ autoparser clears the raw Response and delivers reasoning-only ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry then fires and regenerates from scratch up to maxRetries times, each re-consuming the whole budget, instead of terminating with finish_reason "length" (issue #9716). Add a reachedTokenBudget helper and suppress both the built-in and caller-driven retries when the completion count has reached the configured max_tokens ceiling. Report finish_reason "length" instead of "stop" in the streaming and non-streaming chat paths when the budget was exhausted. Adds a deterministic regression test that counts backend invocations (previously 6, now 1) plus boundary tests for the helper. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Dennisadira <dennisadira@gmail.com>	2026-06-30 09:01:53 +02:00
Richard Palethorpe	5d0c43ec6e	feat(realtime): Semantic VAD EOU token (#10444 ) * feat(realtime): EOU-driven semantic_vad turn detection Add a `semantic_vad` turn-detection mode to the realtime API that feeds the transcription model live and decides "the user finished speaking" from the `<EOU>` end-of-utterance token rather than from silence alone. When EOU fires the turn commits immediately (~0.3s); otherwise it falls back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s). Plumbing, bottom to top: - proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof, mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus `TranscriptResult.eou` for the unary retranscribe gate. - pkg/grpc: client/server/base/embed scaffolding for the bidi stream, modeled on AudioTransformStream; release stream conns on terminal Recv. - parakeet-cpp: live transcription RPC with per-C-call engine locking (one live stream per turn, finalize+free at commit); bump parakeet.cpp to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel recompute that delayed EOU on long turns) and the <EOU>/<EOB> split; strip the literal <EOU>/<EOB> from offline text and set Eou. - core/backend: LiveTranscriptionSession wrapper + pipeline `turn_detection:` config block (type/eagerness/retranscribe). - realtime: semantic_vad integration — live input captions streamed as transcription deltas while the user speaks, EOU-immediate commit with eagerness fallback, optional retranscribe gate (batch re-decode must also end in <EOU> to confirm), clause synthesis off the LLM token callback, and per-turn live-transcription / model_load telemetry. - UI: show the realtime pipeline components as a vertical list. Docs and tests included; opt-in via the pipeline YAML or per-session `session.update`. Non-streaming STT backends degrade to silence-only. 
Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash] Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): explicit formally-verified state machines + parakeet streaming driver The realtime API had several implicit state machines whose state was inferred from scattered booleans, channels, and five separate mutexes, leaving illegal/inconsistent states reachable. Make them explicit and keep the implementation in step with a formal design; rework the parakeet streaming backend along the same lines. Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect with a total, pure Next(state,event)->(state,[]effect) behind a single-writer Coordinator: M1 conncoord connection lifecycle: VAD toggle + once-only teardown (replaces vadServerStarted + a `done` channel closed from two sites). M2 turncoord turn detection: collapses speechStarted and the live-stream "turn open" flag into one state, so discardTurn can no longer desync them and suppress the next onset. M3 respcoord response coordination: serializes the dual-writer start/cancel so at most one response is live; one response.done per response.create. M4 compactcoord conversation compaction: single-flight (replaces the `compacting atomic.Bool` CAS). M5 ttscoord TTS pipeline: open->closing->closed, idempotent wait(), rejects enqueue-after-close (was a silent drop). The Coordinator/Sink/Next plumbing — only the sealed types and Next differed per machine — is extracted once into core/http/endpoints/openai/coordinator as a generic Coordinator[S,E,F]; each machine keeps its public API via type aliases, so no sink, call-site, or test moved. Hierarchy. session_lifecycle.fizz models M1 as the parent region with its children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn torn => all children terminal, none start after teardown). 
respcoord and compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's teardown drives the children terminal. This closes a compaction teardown gap: a fire-and-forget compaction could outlive a torn session — compactionSink now takes a session-scoped cancellable context + WaitGroup and joins the in-flight summarize+evict on shutdown. Formal verification. formal-verification/ holds one authoritative FizzBee spec per machine plus the composition spec, each with an always-assertion and a documented one-line edit that makes the checker fail (verified non-vacuous). scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under -race AND a model-check of every .fizz spec; a missing FizzBee is a hard error (only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI). FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into .tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow, and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the repo's forbidigo lint): transition tables + fixed-seed property walks + concurrent/-race specs, no rapid dependency. Design map: docs/design/realtime-state-machines.md. Parakeet streaming backend. The same treatment applied to the parakeet-cpp streaming paths: - AudioTranscriptionStream returns codes.Unimplemented for non-streaming models instead of decoding offline and emitting it as one delta + final. A client that asked for streaming learns the model cannot stream rather than receiving a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it as an SSE error event. Mirrors AudioTranscriptionLive, which already did this. - utteranceBoundary (boundary.go): a single definition of the end-of-utterance latch, replacing three open-coded finalEou toggles. Modelled as a two-valued type so illegal states are unrepresentable. 
- Shared decode driver (driver.go): streamFeedResult (one per-feed event) + feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail. The feed loop is written once. - AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed {delta,eou,eob,words} the realtime turn detector consumes and a terminal FinalResult carrying only Text. Segments/duration/eou are offline-only and no longer produced (nor read) on the live path; liveTraceState drops the terminal eou and keeps the per-feed eou_events count. - AudioTranscriptionStream + streamJSON merge into one driver-based function; streamSegmenter is generalized to the unified event with a text-only fallback that preserves the legacy (no-words) library's per-utterance segmentation. Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and parakeet packages under -race, the fail-closed conformance gate green, and make test-realtime (12 e2e WS+WebRTC). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-30 09:01:22 +02:00
pos-ei-don	6ab29ec8b9	fix(sglang): parse tool_call function arguments before applying the chat template (#10558 ) OpenAI wire format carries `function.arguments` as a JSON-encoded string, but chat templates (e.g. Qwen3-Coder) iterate over it as a mapping. The vllm backend already parses arguments before applying the chat template (PR #10256); this mirrors that fix in the sglang backend. Without this fix the second turn of any tool-using session (assistant returns tool_calls, user posts `role:"tool"` result, model is invoked with arguments still as a string) crashes inside transformers' Jinja chat-template rendering with: TypeError: Can only get item pairs from a mapping. File ".../transformers/utils/chat_template_utils.py", in render_jinja_template File ".../jinja2/filters.py", in do_items raise TypeError("Can only get item pairs from a mapping.") Reproduced on `lmsysorg/sglang:v0.5.14` via LocalAI v4.5.4 with `saricles/Qwen3-Coder-Next-NVFP4-GB10` (W4A4 NVFP4 / compressed-tensors) on NVIDIA DGX Spark (GB10, sm_121). After the patch, a tool-call roundtrip (assistant tool_calls -> tool result -> assistant final answer) returns http=200 with the expected follow-up content; no behaviour change on requests that don't carry tool_calls. Signed-off-by: Poseidon <philipp.wacker@ibf-solutions.com> Co-authored-by: Poseidon <philipp.wacker@ibf-solutions.com>	2026-06-30 09:00:51 +02:00
dependabot[bot]	036f950b1b	chore(deps): bump actions/cache from 4 to 6 (#10593 ) Bumps [actions/cache](https://github.com/actions/cache) from 4 to 6. - [Release notes](https://github.com/actions/cache/releases) - [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md) - [Commits](https://github.com/actions/cache/compare/v4...v6) --- updated-dependencies: - dependency-name: actions/cache dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-29 22:31:10 +02:00
LocalAI [bot]	5b7b914b4f	chore(recon): re-pin voice/face-detect to squashed release commits (+ graph-cache fix) (#10591 ) chore(recon): re-pin voice/face-detect to squashed release commits The voice-detect.cpp and face-detect.cpp engine repos were squashed to a single release commit, which orphaned the previous pins (voice 3d51077, face 06914b0). Re-pin to the new single-commit SHAs (voice 1db1759, face e22260d). These also fold in a real correctness fix: the persistent graph-cache fingerprint now includes op_params, so two structurally identical GGML_OP_CUSTOM graphs (a blocked 3x3 vs a blocked 1x1 strided conv) can no longer false-hit the cache and replay the wrong kernel. voice CI was failing test_blocked/conv1x1_s2 with an out-of-bounds write on the GGML_NATIVE=OFF build; both engine repos are now green and WeSpeaker embed parity is 1.0 vs golden. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 18:48:47 +02:00
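The graph-cache fix described above can be illustrated with a toy fingerprint: when two custom-op nodes differ only in their parameters, a hash that skips those parameters collides. This is a sketch under stated assumptions; the `(op, shape, op_params)` node tuples and the `graph_fingerprint` function are hypothetical and do not mirror ggml's data structures.

```python
import hashlib


def graph_fingerprint(nodes, include_op_params=True):
    """Hash a list of (op, shape, op_params) tuples into a hex digest.

    Toy model only: the node layout is an assumption for illustration.
    """
    digest = hashlib.sha256()
    for op, shape, op_params in nodes:
        digest.update(repr((op, tuple(shape))).encode())
        if include_op_params:
            # Without this, nodes that differ only in parameters collide.
            digest.update(repr(tuple(op_params)).encode())
    return digest.hexdigest()


# Two "custom" nodes with the same op and shape but different parameters.
conv3x3 = [("CUSTOM", (1, 64, 32, 32), (3, 3, 2))]
conv1x1 = [("CUSTOM", (1, 64, 32, 32), (1, 1, 2))]
```

With `include_op_params=False` the two graphs share a fingerprint, which is the false cache hit the commit describes; with the parameters included they no longer do.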
LocalAI [bot]	d1cee4c52a	chore: ⬆️ Update vllm-metal (darwin) to `v0.3.0.dev20260628073537` (#10562 ) ⬆️ Update vllm-project/vllm-metal (darwin) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-29 09:13:22 +02:00
LocalAI [bot]	baaa0fe94f	chore: ⬆️ Update mudler/face-detect.cpp to `06914b077d52f90d5421299138e7be6bdd06b5e8` (#10580 ) ⬆️ Update mudler/face-detect.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-29 08:04:22 +02:00
LocalAI [bot]	c3b5c7c3fa	chore: ⬆️ Update mudler/voice-detect.cpp to `3d510772357538c5182808ac7de2278b84824e24` (#10581 ) ⬆️ Update mudler/voice-detect.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-29 08:03:43 +02:00
LocalAI [bot]	bd1ec8f2c2	chore: ⬆️ Update ggml-org/llama.cpp to `dbdaece23de9ac63f2e7ca9e6bfcdc4fc156a3fa` (#10582 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-29 08:03:20 +02:00
LocalAI [bot]	135debf9af	chore: ⬆️ Update CrispStrobe/CrispASR to `6b50f76e59700665358a1aabf5295597fa318e06` (#10583 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-29 08:03:06 +02:00
LocalAI [bot]	e8c18ae28e	chore: ⬆️ Update leejet/stable-diffusion.cpp to `c1790754d31bec0731ed5fddc9d5b9ff22ee19cd` (#10584 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-29 08:02:52 +02:00
LocalAI [bot]	c4d302e1ab	chore(model-gallery): ⬆️ update checksum (#10585 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-28 23:26:28 +02:00
LocalAI [bot]	323b57a4bc	fix(oci): retry layer downloads on transient network errors (#10579 ) Installing large backend images (e.g. vLLM/vLLM-omni, several GiB) over the Web UI could fail with "failed to download layer 0: unexpected EOF" when a single connection to the registry dropped mid-stream. The whole install then failed with no recovery, and since the download is not resumable, retrying from the UI restarted from zero and usually hit the same blip again - so users saw it as a consistent, size-correlated failure (issue #10577). The registry transport already retries manifest/digest fetches via defaultRetryPredicate (GetImage/GetImageDigest), but the per-layer data stream in DownloadOCIImageTar bypassed it entirely: layer.Compressed() + xio.Copy ran exactly once. Extract the per-layer copy into downloadLayerToFile, which retries on the same transient errors (unexpected EOF, EOF, EPIPE, ECONNRESET, connection refused) with exponential backoff, truncating any partial data before each retry. Non-retryable errors and context cancellation still fail fast. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 21:21:08 +02:00
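The retry behaviour described above (retry on transient errors with exponential backoff, truncate partial data before each attempt, fail fast otherwise) can be sketched as follows. The real change lives in Go (`downloadLayerToFile`); this Python version is only an illustration, and `download_with_retry`, `is_transient`, the attempt limit, and the base delay are all assumed names and values.

```python
import errno
import time

_TRANSIENT_ERRNOS = (errno.EPIPE, errno.ECONNRESET, errno.ECONNREFUSED)


def is_transient(err):
    """Rough stand-in for a retry predicate (assumption, not LocalAI's)."""
    if isinstance(err, (EOFError, ConnectionResetError,
                        BrokenPipeError, ConnectionRefusedError)):
        return True
    return isinstance(err, OSError) and err.errno in _TRANSIENT_ERRNOS


def download_with_retry(open_stream, dest_path, max_attempts=5,
                        base_delay=0.5, sleep=time.sleep):
    """Copy chunks from open_stream() to dest_path, retrying transient errors.

    open_stream is a callable returning a fresh iterable of bytes chunks.
    Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # Mode "wb" truncates the file, discarding any partial data.
            with open(dest_path, "wb") as out:
                for chunk in open_stream():
                    out.write(chunk)
            return attempt
        except Exception as err:
            if not is_transient(err) or attempt == max_attempts:
                raise  # non-retryable, or out of attempts
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
```

Passing `sleep` as a parameter keeps the backoff testable without real delays.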
LocalAI [bot]	3d2f639213	fix(fish-speech): allow invalid_reference_casting so tokenizers builds on darwin (#10573 ) On darwin arm64 the fish-speech editable install (pip install --no-build-isolation -e) compiles the transitive `tokenizers` Python package's Rust extension from source, because there is no prebuilt manylinux wheel for that platform (Linux builds never compile it, so this only breaks on macOS). The pinned tokenizers crate fish-speech's stack resolves to contains a `&T` -> `&mut T` cast that the macOS CI runner's newer Rust toolchain rejects via the now-deny-by-default `invalid_reference_casting` lint: error: casting `&T` to `&mut T` is undefined behavior ... error: could not compile `tokenizers` (lib) due to 1 previous error ERROR: Failed building wheel for tokenizers This failed the fish-speech darwin/metal (mps) backend image build in the v4.5.5 release CI while all Linux variants built fine. Fix: export RUSTFLAGS with `-A invalid_reference_casting` (appended to any existing value, not clobbering) before installRequirements so the unchanged third-party crate compiles as it did under the older toolchain. Version-agnostic and harmless on Linux, where no Rust compile happens. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 19:10:27 +02:00
Nicholas Ciechanowski	be1ae9338b	fix(distributed): missing agent NATS permissions (#10571 ) Signed-off-by: Nicholas Ciechanowski <nicholas@ciech.anow.ski>	2026-06-28 12:58:13 +02:00
LocalAI [bot]	923c47020d	fix(launcher): robust binary download/upgrade (resume, rate-limit, UX) (#10575 ) * fix(launcher): resume flaky downloads, drop redundant percent, fit dialogs The binary upgrade/download flow had three rough edges: - The status label printed "Downloading... N%" right next to a progress bar already showing the percent. Replace it with a human-readable byte readout ("Downloading... 12.3 MB / 45.6 MB"). - A failed download (GitHub releases are flaky) had no recourse and always restarted from byte 0. Stream to "<dest>.part" and resume via a "Range: bytes=N-" request (handling 206/200/416), renaming to the final path only after checksum verification; on checksum failure the file is discarded so the next attempt starts clean. Add a Retry button that appears on failure and resumes from the partial file. - Progress/install dialogs were hardcoded to oversized dimensions, leaving a blank gap below "View Release Notes". Size each window to its content with a sane minimum width. Also unify the three near-identical download-progress popups into one Launcher.showDownloadProgressWindow helper (and delete a dead unused copy in ui.go) so the behaviour stays consistent across every entry point. The progress callback now reports (downloaded, total) byte counts instead of a single fraction. Resume/retry behaviour is covered by httptest-backed unit tests in release_manager_test.go. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(launcher): resolve latest version via redirect to dodge GitHub API 403 On a fresh Linux start with no LocalAI installed, the download failed with "failed to fetch latest release: status 403". The cause is the unauthenticated api.github.com rate limit (60 requests/hour, per IP): on shared/NAT/CGNAT/cloud addresses it is exhausted almost immediately and every request 403s. Resolve the latest version by following the github.com "releases/latest" redirect instead, reading the tag from the final ".../releases/tag/<tag>" URL. 
That endpoint is not subject to the API rate limit. Only the version is ever consumed by callers, so the tag is sufficient. The JSON API is kept as a fallback, now honoring GITHUB_TOKEN and reporting rate-limit 403/429 clearly instead of an opaque status code. Covered by an httptest-backed unit test that asserts the redirect path is used. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 12:57:32 +02:00
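The redirect-based version lookup described above reduces to reading the tag out of the final ".../releases/tag/<tag>" URL. A minimal sketch of that parsing step, assuming a hypothetical helper `tag_from_release_url` (the launcher itself is written in Go, and this is not its code):

```python
from urllib.parse import urlparse


def tag_from_release_url(final_url):
    """Extract <tag> from a URL whose path ends in /releases/tag/<tag>.

    Hypothetical helper for illustration only.
    """
    parts = [p for p in urlparse(final_url).path.split("/") if p]
    if len(parts) >= 3 and parts[-3:-1] == ["releases", "tag"]:
        return parts[-1]
    raise ValueError(f"not a release tag URL: {final_url!r}")
```

In practice the final URL would come from an HTTP client that follows redirects; in Python, `urllib.request.urlopen(url).geturl()` returns the URL reached after redirection.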
LocalAI [bot]	b7a1dec773	fix(kokoro): add explicit click dep so spacy CLI works on intel build (#10572 ) The kokoro install.sh ends with `python -m spacy download en_core_web_sm`. spaCy's CLI imports typer -> click, so click must be present at that point. On the intel build profile, install.sh adds `--upgrade --index-strategy=unsafe-first-match` against the Intel pip index. With that resolution strategy, click is not resolved/installed, so the spacy CLI import fails with: ModuleNotFoundError: No module named 'click' make: *** [Makefile:3: kokoro] Error 1 Other profiles (cpu/cublas) pull click in transitively and build fine; only the intel profile breaks. This surfaced in the v4.5.5 release CI as the gpu-intel-kokoro backend image build failure. Make click an explicit dependency in the base requirements.txt (installed for every profile) so it is always present before `python -m spacy download` runs, regardless of index resolution. Unpinned: spacy constrains the version. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 11:29:17 +02:00


6920 Commits