fix(vllm): non-streaming tool-call regression after #10351 (native_streaming is a capability flag, not a state flag)
#10351 introduced native streaming via `parser.extract_tool_calls_streaming`
and gated the post-loop `extract_tool_calls` block on `native_streaming and
not native_streaming_error`. That works for streaming requests, but for
non-streaming requests the same flag is still True (it only means "the
parser can stream", not "we actually streamed"), so the block was skipped
and the `elif` cleared `content = ""` — the tool call was silently lost.
Symptom: non-streaming chat.completions with `tools=[...]` returns
`finish_reason: "stop"` with `content: ""` and no `tool_calls`. Streaming
requests are unaffected.
Fix: gate both branches on `streaming` too, so the extract_tool_calls
block runs for non-streaming requests (and for streaming requests that
fell back to the buffered path).
Reproduction (vLLM 0.24, Qwen3-Coder-Next-NVFP4, qwen3_coder parser):
    curl -s -X POST http://localhost:8080/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"coder","stream":false,
           "messages":[{"role":"user","content":"7*8 via calc"}],
           "tools":[{"type":"function","function":{"name":"calc",
             "parameters":{"type":"object",
               "properties":{"expression":{"type":"string"}}}}}]}'
Before: finish_reason: "stop", content: "", tool_calls: []
After: finish_reason: "tool_calls", tool_calls[0].function.name: "calc"
Streaming path re-verified in the same setup: delta.tool_calls arrives
token-by-token, finish_reason: "tool_calls", no raw XML in content.
Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com>
* fix(hipblas): symlink amdgpu.ids so ROCm backends find the ASIC ID table
ROCm's bundled libdrm_amdgpu looks up the GPU ASIC ID table at a
hardcoded fallback path, /opt/amdgpu/share/libdrm/amdgpu.ids, which is
only populated by AMD's full amdgpu-install (graphics/DKMS) stack. The
hipblas image is compute-only and doesn't have it, so every model load
logs "No such file or directory" and the GPU can't be identified.
Symlink it to the equivalent file already shipped by Ubuntu's
libdrm-amdgpu1 package.
Fixes #10624
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(hipblas): correct amdgpu.ids source package name in comment
Verified against the real rocm/dev-ubuntu-24.04:7.2.1 image with
hipblas-dev/hipblaslt-dev/rocblas-dev installed: /usr/share/libdrm/amdgpu.ids
is owned by libdrm-common, not libdrm-amdgpu1 as the comment said.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The launcher starts the server with run --models-path/--backends-path but
leaves --data-path and the dynamic config dir unset, so the server falls
back to its /data and /configuration defaults.
The base path for those defaults is kong.ExpandPath("."), i.e. the launcher process CWD
(commonly the user's home root), producing ~/data and ~/configuration
outside ~/.localai and an agent-pool stateDir under ~/data.
Pass --data-path and --localai-config-dir explicitly, rooted at the
launcher's own data directory (GetDataPath() -> ~/.localai), so data and
config stay consistent with --models-path/--backends-path.
* fix(watchdog): don't log optional Free() as an error when backend returns Unimplemented (#10602)
When the watchdog evicts a model, deleteProcess calls the backend's gRPC
Free() to release VRAM before stopping the process. Free is optional:
backends that don't override it -- the generated UnimplementedBackendServer
stub, many Python/external backends, or a federation proxy in distributed
mode -- return gRPC Unimplemented. That is expected, not a failure: VRAM is
reclaimed when the local process is stopped, or by the remote unloader for
remote backends. Logging it as "WARN Error freeing GPU resources" made a
benign, optional RPC look like a fault (the alarming line in #10602, seen
in distributed mode where the model is remote and Free hits a stub).
Treat gRPC Unimplemented from Free() as a no-op logged at Debug; genuine
failures still Warn. Free() is still attempted for every backend, so any
backend that does implement it is unaffected.
Add a reusable grpcerrors.IsUnimplemented helper following the package's
existing code-based detection idiom (prefer the typed status code, fall
back to the message across non-gRPC boundaries), with table tests.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
* fix(watchdog): log a non-Unimplemented Free() failure at error level
Per review: now that the expected gRPC Unimplemented case is split out and
logged at Debug, any remaining Free() error is a genuine failure to release
VRAM, so surface it at error level instead of warn.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
---------
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
The Q4_K_M quant degraded tool-call reliability for LFM2.5-8B-A1B.
Switch the gallery entry to the Q8_0 GGUF (sha256 verified via HF
x-linked-etag) while keeping the native jinja tool-parsing config.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
The backend.proto AudioTranscriptionLive bidirectional streaming RPC added
new required trait items (AudioTranscriptionLiveStream + audio_transcription_live)
on the generated Backend trait. The kokoros (TTS) backend did not implement
them, breaking its release build with E0046 (missing trait items).
kokoros is text-to-speech and has no live-ASR support, so stub the method to
return UNIMPLEMENTED, mirroring the existing audio_transcription_stream stub.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI enables a cross-request prompt-prefix cache (cache_reuse, see
core/config/serving_defaults.go) so repeated prefixes — system prompts,
RAG context, agent scaffolds, multi-turn chat — are not reprocessed every
turn. For sliding-window-attention (SWA) models (Gemma 2/3, Cohere2,
Llama 4, ...) this silently does nothing: llama.cpp defaults to a reduced
SWA KV cache sized to the sliding window, and that reduced cache cannot
preserve a prompt prefix across requests, so every turn reprocesses the
whole prompt anyway.
llama.cpp's --swa-full (params.swa_full, already wired through the
LocalAI llama.cpp backend's `swa_full` option) keeps the full KV cache so
the shared prefix is reused. Enable it automatically, but only for models
that are actually SWA: detection reads the gguf-parser-normalized
`<arch>.attention.sliding_window` metadata (which also applies llama.cpp's
family rules, e.g. Phi-3 → not SWA), right where the GGUF is already
parsed for defaults. It is never applied to dense models (pure memory
waste) and never overrides an explicit user `swa_full`/`n_swa` choice.
Tradeoff: the full SWA cache scales with context_size, so it costs more
memory at large contexts — hence the SWA gating and the documented
`swa_full:false` opt-out.
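The gating rules above can be sketched as a small pure function. The struct, its field names, and the convention that a zero sliding window means "not SWA" are assumptions for illustration, not the actual config code.

```go
package main

import "fmt"

// swaInputs collects what the decision depends on. All names are hypothetical.
type swaInputs struct {
	SlidingWindow  uint64 // normalized <arch>.attention.sliding_window; 0 is assumed to mean "not SWA"
	UserSetSWAFull bool   // the user wrote swa_full explicitly (true or false)
	UserSetNSWA    bool   // the user wrote n_swa explicitly
}

// autoEnableSWAFull reports whether swa_full should be turned on automatically.
func autoEnableSWAFull(in swaInputs) bool {
	// An explicit user choice always wins, including swa_full:false.
	if in.UserSetSWAFull || in.UserSetNSWA {
		return false
	}
	// Dense models gain nothing from the full cache, so only SWA models qualify.
	return in.SlidingWindow > 0
}

func main() {
	fmt.Println(autoEnableSWAFull(swaInputs{SlidingWindow: 4096}))                       // true
	fmt.Println(autoEnableSWAFull(swaInputs{SlidingWindow: 0}))                          // false
	fmt.Println(autoEnableSWAFull(swaInputs{SlidingWindow: 4096, UserSetSWAFull: true})) // false
}
```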
Assisted-by: Claude:claude-opus-4-8 [Claude Code] golangci-lint
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(watchdog): persist a UI-saved Check Interval across restarts (#10601)
The watchdog Check Interval saved via /api/settings reverted to 500ms on
every restart, while the idle/busy timeouts persisted correctly.
Root cause: NewApplicationConfig baseline-defaulted WatchDogInterval to
500ms, whereas the idle/busy timeouts default to 0. The startup loader
(loadRuntimeSettingsFromFile) applies a persisted runtime_settings.json
value only when the field is still at its zero default - its heuristic
for "this wasn't set by an env var". Because the interval was always
500ms at that point, the loader never read the persisted value back, so
the saved interval was silently discarded on each boot.
Fix: drop the non-zero baseline default so the interval behaves like the
sibling timeouts (0 = unset). The effective 500ms default is now supplied
at the watchdog layer: WithWatchdogInterval ignores a non-positive value
so DefaultWatchDogOptions' 500ms is preserved (and a 0 interval can never
turn the watchdog loop into a busy spin). Also mirror the interval in the
live config file watcher alongside idle/busy, and report the real 500ms
default (not the stale "2s") from ToRuntimeSettings.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Stapling only the dmg leaves the LocalAI.app bundle with no embedded
notarization ticket. Gatekeeper then falls back to an online notarization
check on first launch, so the app fails to open on a Mac that is offline or
behind a firewall, or once it has been copied out of the dmg — while it keeps
working on the (online) build host, which masks the problem.
Notarize and staple the .app before packaging it into the dmg so the bundle
verifies offline. Adds a `notarize-app` subcommand to
contrib/macos/sign-and-notarize.sh (zips the bundle for notarytool, then
staples + validates) and invokes it from dmg-launcher-darwin. Stays a no-op
when notary secrets are unset, so unsigned local/fork builds are unaffected.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
PR #10589 changed repo-root HuggingFace URI imports to name the model after
the selected GGUF file rather than the repository. The Open Responses API
integration test still requested the old repo-derived name
("Qwen3-VL-2B-Instruct-GGUF"), so every request 404'd on an unknown model and
the suite has failed on master since 1a4f68ed4.
Update testModel to the name the importer now registers for the default
q4_k_m quant ("Qwen3-VL-2B-Instruct-Q4_K_M") so the specs resolve the model
again. The #10589 behaviour change is intentional; only the stale test needed
updating.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Importing a model from a local directory (e.g. a HuggingFace checkout or an
LM Studio store) via a file:// URI produced a config whose model field kept
the scheme verbatim, e.g. model: file:///Users/u/.../Qwen3-4bit. The mlx and
vllm backends treat that field as a HuggingFace repo id or local path and
reject the file:// form with "Repo id must be in the form 'repo_name' or
'namespace/repo_name'", so the model imported fine but failed to load (issue
#7461).
Add a shared LocalModelPath helper that reduces a file:// URI to the bare
filesystem path it points at and leaves HuggingFace/HTTP URIs untouched, and
route the mlx, vllm, transformers and diffusers importers (all of which pass
details.URI straight into the model field for from_pretrained-style loading)
through it.
Cover the helper directly plus end-to-end file:// import specs for the mlx and
vllm importers.
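A rough sketch of what such a helper might look like. The lower-cased name, the signature, and the fallback behaviour are assumptions; the example path is made up.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// localModelPath reduces a file:// URI to the filesystem path it points at
// and returns every other URI unchanged. Hypothetical stand-in for the
// shared helper described above.
func localModelPath(uri string) string {
	if !strings.HasPrefix(uri, "file://") {
		return uri
	}
	u, err := url.Parse(uri)
	if err != nil || u.Path == "" {
		// Fall back to plain prefix stripping if the URI does not parse.
		return strings.TrimPrefix(uri, "file://")
	}
	return u.Path // percent-escapes are already decoded by url.Parse
}

func main() {
	fmt.Println(localModelPath("file:///Users/u/models/Qwen3-4bit")) // /Users/u/models/Qwen3-4bit
	fmt.Println(localModelPath("hf://owner/repo"))                   // hf://owner/repo
}
```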
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
fix(functions): avoid quadratic-time debug logging in CleanupLLMResult/ParseFunctionCall
The streaming chat path (core/http/endpoints/openai/chat_stream_workers.go)
calls CleanupLLMResult / ParseFunctionCall once per delta chunk with the
*full accumulated* LLM result so far. Both functions xlog.Debug the entire
argument on entry and exit, so a single N-chunk stream emits roughly
chunk_size * N^2 bytes of debug output.
Under LOG_LEVEL=debug this was observed in a recent SGLang-via-LocalAI
session on a DGX Spark host (about 50K tokens, long streaming generation)
to drive container logs to ~96 GiB, which interacted with the streaming
hot loop on the same filesystem and contributed to a host-wide hard hang
once disk pressure built up. Workaround was setting LOG_LEVEL=info, but
the quadratic shape remains a foot-gun for anyone intentionally enabling
debug.
Replace the four result-content debug arguments with len(...) plus a
fixed-size head (200 bytes via a new truncForLog helper), bounding per-
call output to a constant. The debug signal stays useful: the first 200
chars are enough to identify which generation is in flight, and the
length lets you observe growth without paying for the payload itself.
No API change. No behaviour change for LOG_LEVEL != debug.
Signed-off-by: Poseidon <philipp.wacker@ibf-solutions.com>
Co-authored-by: Poseidon <philipp.wacker@ibf-solutions.com>
When importing a HuggingFace GGUF model from a repository-root URI (no file
component, e.g. hf://owner/repo) with the Model Name field left blank, the
importer named the model after the repository (filepath.Base(details.URI))
instead of the GGUF file it actually selected from the repo listing (issue
#10587).
Track whether the user supplied an explicit name; the URI base is now only a
fallback. In the HuggingFace branch, once the model group is picked, re-derive
the name from the selected GGUF via a new modelNameFromShardGroup helper that
uses ShardGroup.Base minus the .gguf extension. For sharded models this yields
a clean logical name (e.g. Qwen3-30B-A3B-Q4_K_M) rather than a shard filename
like ...-00001-of-00002. An explicit name preference still always wins, and the
.gguf/URL/OCI paths are unchanged.
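The precedence described above can be sketched as follows. `shardGroup` is a hypothetical stand-in with only a `Base` field, and `resolveModelName` is an illustrative wrapper, not a function from the importer.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shardGroup is a stand-in for the real ShardGroup; only Base is modelled.
type shardGroup struct {
	Base string // logical file name without the shard suffix, e.g. "model-Q4_K_M.gguf"
}

// modelNameFromShardGroup derives the model name from the selected GGUF.
func modelNameFromShardGroup(g shardGroup) string {
	return strings.TrimSuffix(g.Base, ".gguf")
}

// resolveModelName applies the precedence: explicit name, then the selected
// GGUF, then the URI base as a last-resort fallback.
func resolveModelName(explicit string, selected *shardGroup, uri string) string {
	if explicit != "" {
		return explicit
	}
	if selected != nil && selected.Base != "" {
		return modelNameFromShardGroup(*selected)
	}
	return filepath.Base(uri)
}

func main() {
	g := &shardGroup{Base: "Qwen3-30B-A3B-Q4_K_M.gguf"}
	fmt.Println(resolveModelName("", g, "hf://owner/repo"))        // Qwen3-30B-A3B-Q4_K_M
	fmt.Println(resolveModelName("my-model", g, "hf://owner/repo")) // my-model
}
```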
Add network-free unit specs covering name-from-GGUF, clean-name-from-shard-base,
and explicit-name precedence, and update the live integration specs that had
encoded the previous repo-name behaviour.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
fix(openai): stop max_tokens streaming retry loop on reasoning models
When a thinking model spends its entire max_tokens budget on the reasoning
block, the C++ autoparser clears the raw Response and delivers reasoning-only
ChatDeltas (no content, no tool calls). ComputeChoices' empty-response retry
then fires and regenerates from scratch up to maxRetries times, each
re-consuming the whole budget, instead of terminating with finish_reason
"length" (issue #9716).
Add a reachedTokenBudget helper and suppress both the built-in and
caller-driven retries when the completion count has reached the configured
max_tokens ceiling. Report finish_reason "length" instead of "stop" in the
streaming and non-streaming chat paths when the budget was exhausted.
Adds a deterministic regression test that counts backend invocations
(previously 6, now 1) plus boundary tests for the helper.
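A sketch of the budget check and how it might gate the retry and the finish reason. The signatures, the `shouldRetryEmpty` and `finishReason` wrappers, and the treatment of a non-positive max_tokens as "no limit" are assumptions for illustration.

```go
package main

import "fmt"

// reachedTokenBudget reports whether the completion used up max_tokens.
// A non-positive maxTokens is assumed to mean "no limit configured".
func reachedTokenBudget(completionTokens, maxTokens int) bool {
	return maxTokens > 0 && completionTokens >= maxTokens
}

// shouldRetryEmpty decides whether an empty response is worth regenerating.
// Retrying after the budget ran out would only burn the same budget again.
func shouldRetryEmpty(contentEmpty bool, completionTokens, maxTokens int) bool {
	return contentEmpty && !reachedTokenBudget(completionTokens, maxTokens)
}

// finishReason picks "length" when the budget was exhausted, else "stop".
func finishReason(completionTokens, maxTokens int) string {
	if reachedTokenBudget(completionTokens, maxTokens) {
		return "length"
	}
	return "stop"
}

func main() {
	// Reasoning consumed all 256 tokens and produced no content.
	fmt.Println(shouldRetryEmpty(true, 256, 256)) // false
	fmt.Println(finishReason(256, 256))           // length
	// Genuinely empty response well under budget: retry is still allowed.
	fmt.Println(shouldRetryEmpty(true, 3, 256)) // true
}
```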
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Dennisadira <dennisadira@gmail.com>
* feat(realtime): EOU-driven semantic_vad turn detection
Add a `semantic_vad` turn-detection mode to the realtime API that feeds
the transcription model live and decides "the user finished speaking"
from the `<EOU>` end-of-utterance token rather than from silence alone.
When EOU fires the turn commits immediately (~0.3s); otherwise it falls
back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s).
Plumbing, bottom to top:
- proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof,
  mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus
  `TranscriptResult.eou` for the unary retranscribe gate.
- pkg/grpc: client/server/base/embed scaffolding for the bidi stream,
  modeled on AudioTransformStream; release stream conns on terminal Recv.
- parakeet-cpp: live transcription RPC with per-C-call engine locking
  (one live stream per turn, finalize+free at commit); bump parakeet.cpp
  to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel
  recompute that delayed EOU on long turns) and the <EOU>/<EOB> split;
  strip the literal <EOU>/<EOB> from offline text and set Eou.
- core/backend: LiveTranscriptionSession wrapper + pipeline
  `turn_detection:` config block (type/eagerness/retranscribe).
- realtime: semantic_vad integration — live input captions streamed as
  transcription deltas while the user speaks, EOU-immediate commit with
  eagerness fallback, optional retranscribe gate (batch re-decode must
  also end in <EOU> to confirm), clause synthesis off the LLM token
  callback, and per-turn live-transcription / model_load telemetry.
- UI: show the realtime pipeline components as a vertical list.
Docs and tests included; opt-in via the pipeline YAML or per-session
`session.update`. Non-streaming STT backends degrade to silence-only.
Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): explicit formally-verified state machines + parakeet streaming driver
The realtime API had several implicit state machines whose state was inferred
from scattered booleans, channels, and five separate mutexes, leaving
illegal/inconsistent states reachable. Make them explicit and keep the
implementation in step with a formal design; rework the parakeet streaming
backend along the same lines.
Realtime state machines (M1-M5). Each is a sealed sum-type State/Event/Effect
with a total, pure Next(state,event)->(state,[]effect) behind a single-writer
Coordinator:
  M1 conncoord     connection lifecycle: VAD toggle + once-only teardown
                   (replaces vadServerStarted + a `done` channel closed from
                   two sites).
  M2 turncoord     turn detection: collapses speechStarted and the live-stream
                   "turn open" flag into one state, so discardTurn can no longer
                   desync them and suppress the next onset.
  M3 respcoord     response coordination: serializes the dual-writer
                   start/cancel so at most one response is live; one
                   response.done per response.create.
  M4 compactcoord  conversation compaction: single-flight (replaces the
                   `compacting atomic.Bool` CAS).
  M5 ttscoord      TTS pipeline: open->closing->closed, idempotent wait(),
                   rejects enqueue-after-close (was a silent drop).
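To illustrate the shape of a total, pure Next function, here is a sketch loosely modelled on the ttscoord row (open->closing->closed, rejecting enqueue after close). The event and effect names are invented, and plain integer enums stand in for the sealed sum types of the real machines.

```go
package main

import "fmt"

type state int

const (
	stOpen state = iota
	stClosing
	stClosed
)

type event int

const (
	evEnqueue event = iota // a clause arrives for synthesis
	evClose                // the producer is done
	evDrained              // the queue has been fully played out
)

type effect int

const (
	fxSynthesize effect = iota
	fxRejectEnqueue
	fxSignalClosed
)

// next is total: every (state, event) pair yields a result, and pairs with
// no defined transition leave the state unchanged and emit no effects.
func next(s state, e event) (state, []effect) {
	switch s {
	case stOpen:
		switch e {
		case evEnqueue:
			return stOpen, []effect{fxSynthesize}
		case evClose:
			return stClosing, nil
		}
	case stClosing:
		switch e {
		case evEnqueue:
			return stClosing, []effect{fxRejectEnqueue}
		case evDrained:
			return stClosed, []effect{fxSignalClosed}
		}
	case stClosed:
		if e == evEnqueue {
			return stClosed, []effect{fxRejectEnqueue}
		}
	}
	return s, nil
}

func main() {
	s := stOpen
	for _, e := range []event{evEnqueue, evClose, evEnqueue, evDrained, evClose} {
		var fx []effect
		s, fx = next(s, e)
		fmt.Println(s, fx)
	}
}
```

Because next has no side effects, a single-writer coordinator can own the state, call next, and hand the returned effects to a sink, which also makes transition tables straightforward to test.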
The Coordinator/Sink/Next plumbing — only the sealed types and Next differed
per machine — is extracted once into core/http/endpoints/openai/coordinator as
a generic Coordinator[S,E,F]; each machine keeps its public API via type
aliases, so no sink, call-site, or test moved.
Hierarchy. session_lifecycle.fizz models M1 as the parent region with its
children (M2/M3/M4) as one statechart and asserts ChildrenDieWithParent (conn
torn => all children terminal, none start after teardown). respcoord and
compactcoord gain an absorbing Terminated state + Shutdown event; conncoord's
teardown drives the children terminal. This closes a compaction teardown gap: a
fire-and-forget compaction could outlive a torn session — compactionSink now
takes a session-scoped cancellable context + WaitGroup and joins the in-flight
summarize+evict on shutdown.
Formal verification. formal-verification/ holds one authoritative FizzBee spec
per machine plus the composition spec, each with an always-assertion and a
documented one-line edit that makes the checker fail (verified non-vacuous).
scripts/realtime-conformance.sh is fail-closed: all Go conformance suites under
-race AND a model-check of every .fizz spec; a missing FizzBee is a hard error
(only the loud REALTIME_CONFORMANCE_SKIP_FIZZBEE=1 bypasses it, never in CI).
FizzBee is pinned by sha256 and installed via scripts/install-fizzbee.sh into
.tools/ (gitignored). Wired as make test-realtime-conformance, a CI workflow,
and a pre-commit path filter. Go conformance tests are Ginkgo/Gomega (per the
repo's forbidigo lint): transition tables + fixed-seed property walks +
concurrent/-race specs, no rapid dependency. Design map:
docs/design/realtime-state-machines.md.
Parakeet streaming backend. The same treatment applied to the parakeet-cpp
streaming paths:
- AudioTranscriptionStream returns codes.Unimplemented for non-streaming models
  instead of decoding offline and emitting it as one delta + final. A client
  that asked for streaming learns the model cannot stream rather than receiving
  a batch result shaped like a stream. New grpcerrors.StreamTranscriptionUnsupported
  carries that signal; the HTTP /v1/audio/transcriptions stream path surfaces it
  as an SSE error event. Mirrors AudioTranscriptionLive, which already did this.
- utteranceBoundary (boundary.go): a single definition of the end-of-utterance
  latch, replacing three open-coded finalEou toggles. Modelled as a two-valued
  type so illegal states are unrepresentable.
- Shared decode driver (driver.go): streamFeedResult (one per-feed event) +
  feedChunk (hides the ABI v4 JSON vs text-only split) + feedSlices + flushTail.
  The feed loop is written once.
- AudioTranscriptionLive becomes a bidi adapter: it streams the per-feed
  {delta,eou,eob,words} the realtime turn detector consumes and a terminal
  FinalResult carrying only Text. Segments/duration/eou are offline-only and no
  longer produced (nor read) on the live path; liveTraceState drops the terminal
  eou and keeps the per-feed eou_events count.
- AudioTranscriptionStream + streamJSON merge into one driver-based function;
  streamSegmenter is generalized to the unified event with a text-only fallback
  that preserves the legacy (no-words) library's per-utterance segmentation.
Verified: build/vet/gofumpt clean, golangci-lint 0 issues, all coordinator and
parakeet packages under -race, the fail-closed conformance gate green, and
make test-realtime (12 e2e WS+WebRTC).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
OpenAI wire format carries `function.arguments` as a JSON-encoded string,
but chat templates (e.g. Qwen3-Coder) iterate over it as a mapping. The
vllm backend already parses arguments before applying the chat template
(PR #10256); this mirrors that fix in the sglang backend.
Without this fix the second turn of any tool-using session (assistant
returns tool_calls, user posts `role:"tool"` result, model is invoked
with arguments still as a string) crashes inside transformers' Jinja
chat-template rendering with:
    TypeError: Can only get item pairs from a mapping.
      File ".../transformers/utils/chat_template_utils.py", in render_jinja_template
      File ".../jinja2/filters.py", in do_items
        raise TypeError("Can only get item pairs from a mapping.")
Reproduced on `lmsysorg/sglang:v0.5.14` via LocalAI v4.5.4 with
`saricles/Qwen3-Coder-Next-NVFP4-GB10` (W4A4 NVFP4 / compressed-tensors)
on NVIDIA DGX Spark (GB10, sm_121).
After the patch, a tool-call roundtrip (assistant tool_calls -> tool
result -> assistant final answer) returns http=200 with the expected
follow-up content; no behaviour change on requests that don't carry
tool_calls.
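The transformation itself is small. The sglang and vllm backends are Python, so the Go sketch below only illustrates the idea; `parseToolArguments` is a hypothetical name and not part of either backend.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseToolArguments decodes the wire-format arguments string into a mapping
// that a chat template can iterate over. It returns ok=false when the string
// is not a JSON object, so the caller can leave the original value untouched.
func parseToolArguments(raw string) (map[string]any, bool) {
	var args map[string]any
	if err := json.Unmarshal([]byte(raw), &args); err != nil || args == nil {
		return nil, false
	}
	return args, true
}

func main() {
	// OpenAI wire format: arguments arrive as a JSON-encoded string.
	args, ok := parseToolArguments(`{"expression":"7*8"}`)
	fmt.Println(args, ok)

	// Anything that is not a JSON object is reported as not parseable.
	_, ok = parseToolArguments(`not json`)
	fmt.Println(ok)
}
```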
Signed-off-by: Poseidon <philipp.wacker@ibf-solutions.com>
Co-authored-by: Poseidon <philipp.wacker@ibf-solutions.com>
chore(recon): re-pin voice/face-detect to squashed release commits
The voice-detect.cpp and face-detect.cpp engine repos were squashed to a single
release commit, which orphaned the previous pins (voice 3d51077, face 06914b0).
Re-pin to the new single-commit SHAs (voice 1db1759, face e22260d).
These also fold in a real correctness fix: the persistent graph-cache fingerprint
now includes op_params, so two structurally identical GGML_OP_CUSTOM graphs (a
blocked 3x3 vs a blocked 1x1 strided conv) can no longer false-hit the cache and
replay the wrong kernel. voice CI was failing test_blocked/conv1x1_s2 with an
out-of-bounds write on the GGML_NATIVE=OFF build; both engine repos are now green
and WeSpeaker embed parity is 1.0 vs golden.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Installing large backend images (e.g. vLLM/vLLM-omni, several GiB) over
the Web UI could fail with "failed to download layer 0: unexpected EOF"
when a single connection to the registry dropped mid-stream. The whole
install then failed with no recovery, and since the download is not
resumable, retrying from the UI restarted from zero and usually hit the
same blip again - so users saw it as a consistent, size-correlated
failure (issue #10577).
The registry transport already retries manifest/digest fetches via
defaultRetryPredicate (GetImage/GetImageDigest), but the per-layer data
stream in DownloadOCIImageTar bypassed it entirely: layer.Compressed()
+ xio.Copy ran exactly once.
Extract the per-layer copy into downloadLayerToFile, which retries on the
same transient errors (unexpected EOF, EOF, EPIPE, ECONNRESET, connection
refused) with exponential backoff, truncating any partial data before
each retry. Non-retryable errors and context cancellation still fail
fast.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
On darwin arm64 the fish-speech editable install (pip install
--no-build-isolation -e) compiles the transitive `tokenizers` Python
package's Rust extension from source, because there is no prebuilt
manylinux wheel for that platform (Linux builds never compile it, so this
only breaks on macOS). The pinned tokenizers crate fish-speech's stack
resolves to contains a `&T` -> `&mut T` cast that the macOS CI runner's
newer Rust toolchain rejects via the now-deny-by-default
`invalid_reference_casting` lint:
    error: casting `&T` to `&mut T` is undefined behavior ...
    error: could not compile `tokenizers` (lib) due to 1 previous error
    ERROR: Failed building wheel for tokenizers
This failed the fish-speech darwin/metal (mps) backend image build in the
v4.5.5 release CI while all Linux variants built fine.
Fix: export RUSTFLAGS with `-A invalid_reference_casting` (appended to any
existing value, not clobbering) before installRequirements so the
unchanged third-party crate compiles as it did under the older toolchain.
Version-agnostic and harmless on Linux, where no Rust compile happens.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(launcher): resume flaky downloads, drop redundant percent, fit dialogs
The binary upgrade/download flow had three rough edges:
- The status label printed "Downloading... N%" right next to a progress
  bar already showing the percent. Replace it with a human-readable byte
  readout ("Downloading... 12.3 MB / 45.6 MB").
- A failed download (GitHub releases are flaky) had no recourse and always
  restarted from byte 0. Stream to "<dest>.part" and resume via a
  "Range: bytes=N-" request (handling 206/200/416), renaming to the final
  path only after checksum verification; on checksum failure the file is
  discarded so the next attempt starts clean. Add a Retry button that
  appears on failure and resumes from the partial file.
- Progress/install dialogs were hardcoded to oversized dimensions, leaving
  a blank gap below "View Release Notes". Size each window to its content
  with a sane minimum width.
Also unify the three near-identical download-progress popups into one
Launcher.showDownloadProgressWindow helper (and delete a dead unused copy
in ui.go) so the behaviour stays consistent across every entry point.
The progress callback now reports (downloaded, total) byte counts instead
of a single fraction. Resume/retry behaviour is covered by httptest-backed
unit tests in release_manager_test.go.
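A compact sketch of the resume step, with an httptest server standing in for the release host. `resumeDownload` is a hypothetical name, and its handling of 416 (treat the partial file as already complete and let the later checksum decide) is an assumption; checksum verification and the final rename are omitted.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"time"
)

// resumeDownload appends to partPath, asking the server for the missing tail.
func resumeDownload(url, partPath string) error {
	f, err := os.OpenFile(partPath, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	offset, err := f.Seek(0, io.SeekEnd) // bytes already on disk
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if offset > 0 {
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	switch resp.StatusCode {
	case http.StatusPartialContent:
		// 206: the body is the missing tail, keep writing at the current offset.
	case http.StatusOK:
		// 200: the server ignored the Range header, so start over.
		if err := f.Truncate(0); err != nil {
			return err
		}
		if _, err := f.Seek(0, io.SeekStart); err != nil {
			return err
		}
	case http.StatusRequestedRangeNotSatisfiable:
		// 416: assumed to mean nothing is left to fetch.
		return nil
	default:
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	_, err = io.Copy(f, resp.Body)
	return err
}

// demoResume pre-seeds a partial file and completes it over HTTP.
func demoResume() (string, error) {
	data := []byte("hello world")
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.ServeContent(w, r, "file.bin", time.Time{}, bytes.NewReader(data))
	}))
	defer srv.Close()
	dir, err := os.MkdirTemp("", "resume-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	part := filepath.Join(dir, "file.bin.part")
	if err := os.WriteFile(part, data[:6], 0o644); err != nil {
		return "", err
	}
	if err := resumeDownload(srv.URL, part); err != nil {
		return "", err
	}
	got, err := os.ReadFile(part)
	return string(got), err
}

func main() {
	fmt.Println(demoResume())
}
```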
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(launcher): resolve latest version via redirect to dodge GitHub API 403
On a fresh Linux start with no LocalAI installed, the download failed with
"failed to fetch latest release: status 403". The cause is the unauthenticated
api.github.com rate limit (60 requests/hour, per IP): on shared/NAT/CGNAT/cloud
addresses it is exhausted almost immediately and every request 403s.
Resolve the latest version by following the github.com "releases/latest"
redirect instead, reading the tag from the final ".../releases/tag/<tag>" URL.
That endpoint is not subject to the API rate limit. Only the version is ever
consumed by callers, so the tag is sufficient. The JSON API is kept as a
fallback, now honoring GITHUB_TOKEN and reporting rate-limit 403/429 clearly
instead of an opaque status code.
Covered by an httptest-backed unit test that asserts the redirect path is used.
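Extracting the tag from the post-redirect URL can be sketched as below. Both function names are hypothetical, and the example URLs use a placeholder owner and repo.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// tagFromReleaseURL extracts <tag> from a ".../releases/tag/<tag>" URL.
func tagFromReleaseURL(finalURL string) (string, error) {
	u, err := url.Parse(finalURL)
	if err != nil {
		return "", err
	}
	const marker = "/releases/tag/"
	i := strings.LastIndex(u.Path, marker)
	if i < 0 {
		return "", fmt.Errorf("no release tag in %q", finalURL)
	}
	tag := u.Path[i+len(marker):]
	if tag == "" {
		return "", fmt.Errorf("empty release tag in %q", finalURL)
	}
	return tag, nil
}

// latestTag follows the "releases/latest" redirect and reads the tag from
// the final URL. The Go HTTP client follows redirects by default, and
// resp.Request is the request that produced the final response.
func latestTag(client *http.Client, latestURL string) (string, error) {
	resp, err := client.Get(latestURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return tagFromReleaseURL(resp.Request.URL.String())
}

func main() {
	tag, err := tagFromReleaseURL("https://github.com/owner/repo/releases/tag/v1.2.3")
	fmt.Println(tag, err)
}
```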
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The kokoro install.sh ends with `python -m spacy download en_core_web_sm`.
spaCy's CLI imports typer -> click, so click must be present at that point.
On the intel build profile, install.sh adds `--upgrade --index-strategy=unsafe-first-match`
against the Intel pip index. With that resolution strategy, click is not
resolved/installed, so the spacy CLI import fails with:
    ModuleNotFoundError: No module named 'click'
    make: *** [Makefile:3: kokoro] Error 1
Other profiles (cpu/cublas) pull click in transitively and build fine; only
the intel profile breaks. This surfaced in the v4.5.5 release CI as the
gpu-intel-kokoro backend image build failure.
Make click an explicit dependency in the base requirements.txt (installed for
every profile) so it is always present before `python -m spacy download` runs,
regardless of index resolution. Unpinned: spacy constrains the version.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>