Commit Graph

857 Commits

Author SHA1 Message Date
Ettore Di Giacinto
76f3e032ae feat(ui): Home loaded-models skeleton list, button hierarchy, EmptyState wizard
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
c7d9dbda6b feat(ui): Home editorial header + status line (north-star redesign)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
0838d7d3e6 feat(ui): StatusPill primitive + time-of-day greeting helper
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
abd8162aad feat(ui): PageHeader + SectionHeading editorial primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
b73b93850c feat(ui): Skeleton shimmer primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
f31f311e38 feat(ui): EmptyState primitive with serif title
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
24623d52e8 refactor(ui): route boot fallback through the LoadingSpinner primitive
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
6ef88c78f2 refactor(ui): fix dead token refs + dedupe toggle to one primitive
Migrate all .toggle-slider consumers (Users, Chat, AgentChat) to the
canonical BEM toggle primitive and delete the legacy duplicate CSS block.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
7d501f1ba1 feat(ui): orchestrated page reveal + stagger motion primitives
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
53af1807b6 feat(ui): un-overload accent — nav rail, stronger focus ring, neutral hover
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
b5d8992531 feat(ui): serif display tier + section-heading typography scale
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Ettore Di Giacinto
9110002592 feat(ui): add Fraunces variable serif + --font-serif token
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 17:40:37 +00:00
Richard Palethorpe
3fa7b2955c feat(pii): NER tier engine — privacy-filter.cpp backend + NER-centric PII filter (#10360)
Squashed feat/pii-ner-tier-engine rebased onto master (was 45 commits; see
backup/pii-ner-tier-engine-prerebase). Net change:

- privacy-filter.cpp: standalone GGML engine for the openai-privacy-filter
  PII/NER token classifier, wired as a LocalAI gRPC backend (CPU/CUDA/Vulkan).
  TokenClassify moves off the patched llama.cpp path onto this backend.
- PII filter reworked to be NER-centric (encoder/NER detection tier scanning
  whole conversations as one document), with a recreated bounded restricted-
  regex secret-matching pattern detector tier alongside it (per-model
  pii_detection.builtins / .patterns + core/services/routing/piipattern).
- Detection labelled by source (ner vs pattern); backend trace / confidence /
  debug observability; analyze/redact exposed as a synchronous API.
- Instance-wide default detector policy + per-usecase default-on; request
  filtering extended to completions, embeddings, edits & Ollama.
- React UI: NER-centric PII editor, detector-models table, pattern/builtins
  editor, middleware default-policy UI.
- Gallery: privacy-filter-multilingual token-classify model + NER install
  filter; token_classify known_usecase; batch sized to context for NER models.
  privacy-filter backend registered in the backend gallery (cpu/vulkan/cuda-13
  meta + image entries with a capabilities map) matching its CI matrix jobs,
  and an /import-model auto-detect importer (PrivacyFilterImporter, narrow
  privacy-filter GGUF detection) replacing the prior pref-only registration.

Reconciled against master's independent evolution:

- Dropped master's PIIPatternOverrides feature (global-pattern runtime
  overrides + /api/pii/patterns API + runtime_settings.json persistence). The
  per-model NER + pattern-detector design supersedes it; it was built on the
  global redactor pattern set this branch replaced.
- Reverted the llama.cpp Score carry-patch (0006-server-task-type-score):
  removed the patch and restored master's grpc-server.cpp Score RPC (direct
  llama_decode, slot-loop bypass) and LLAMA_VERSION pin, plus master's
  model_config validation forbidding score + chat/completion/embeddings on
  llama-cpp. token_classify is unaffected (it runs on the privacy-filter
  backend, not llama-cpp).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-18 11:45:22 +01:00
LocalAI [bot]
88726f2da4 fix(react-ui): restore sidebar collapse in dev + stop Talk page auto-scroll (#10383)
The sidebar collapse toggle silently no-op'd in dev builds. toggleCollapse
ran its side effects (localStorage write + sidebar-collapse dispatch) inside
the setCollapsed updater. StrictMode double-invokes updaters in dev to surface
impurity, and the synchronous dispatch re-entered setState from the
App/Sidebar listeners mid-update, so the toggle never committed. Production
builds don't double-invoke, which is why only the dev server was affected.
Compute next from current state and move the persist + broadcast into the
handler body so the updater is pure.

Also fix the Talk page anchoring to the transcript box on load. The transcript
is its own overflow container, but scrollIntoView bubbles to every scrollable
ancestor including the window, yanking the whole page down on mount. Scroll
the transcript container directly instead.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 00:48:56 +02:00
LocalAI [bot]
5ac864dbed feat(ui): console-based navigation + drop-in API endpoint section (#10377)
* feat(ui): restructure sidebar into Create/Recognition/Build tiers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): preserve exact sidebar gating for agent items and fine-tune/quantize

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* i18n(ui): add nav tier + console keys to all locales

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add grouped admin console via pathless layout route

Wrap the existing admin pages in a pathless AdminConsoleLayout route so
they keep their exact flat URLs while gaining a grouped left rail
(Inference / Cluster / Observability / Access / System). Rail item gating
mirrors the sidebar (adminOnly / authOnly / feature + /api/features). The
layout forwards the App-level outlet context (addToast) to the wrapped
pages, which read it via useOutletContext().

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): fold Audio Transform into Studio as a tab

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(ui): update e2e specs for tiered nav + admin console

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): gate embedded Studio transform view on audio_transform feature

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): visual polish + console-ize Build/Recognition tiers

Generalize the one-off admin console into a reusable ConsoleLayout driven by
a shared consoleConfig (single source of truth for the rail, its gating, and
the sidebar entry that opens it — removes the prior rail/sidebar drift).

- Promote Install Models to the top menu next to Home.
- Build and Operate are now console tiers (secondary rail); Create stays inline.
- Fold Recognition (Faces/Voices) into the Build console as a group alongside
  Automation and Training so it no longer feels split off.
- Style the console rail as a panel (header, grouped dividers, rounded active
  pills) with a hover nudge; sidebar items become inset rounded pills. The rail
  slide-in plays only when entering a console, not on item-to-item sub-nav
  (which remounts the layout), so switching no longer flashes the menu. All
  token-based (light + dark), respects reduced-motion.
- Add a delayed RouteFallback loader so lazy routes no longer flash blank;
  scoped inside ConsoleLayout so the rail stays put while the body loads.
- Update e2e specs for the new structure (.console-* classes, console entries).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): persist console layout across sub-nav + add drop-in endpoint section

- Keep the page-transition key stable within a console (derived from the
  shared console config) so the ConsoleLayout and its rail persist across
  item-to-item navigation instead of remounting — fixes the submenu flash.
  Cache /api/features across mounts and play the rail entrance animation only
  when actually entering a console.
- Add a "One endpoint, every API" section to Home: leads with LocalAI's own
  native API (images, video, realtime voice over WebRTC/WS, depth, object
  detection, rerank, audio/TTS, face & voice recognition) plus a Full API
  reference link, then the drop-in compatibility layer (OpenAI, Anthropic,
  Ollama, OpenAI Responses) with the live copyable base URL. All 7 locales.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): revert Middleware nav label rename (keep Middleware in all locales)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-18 00:09:17 +02:00
LocalAI [bot]
294170d3ed feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery (#10352)
* feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery

Mirrors the locate-anything-cpp backend to register a new depth-anything
backend that wraps the Depth Anything 3 ggml port (depth-anything.cpp) via
purego (cgo-less, no Python at inference).

- backend/go/depth-anything-cpp/: gRPC backend (Load + Predict + GenerateImage),
  purego binding to the da_capi_* C ABI, CMake/Makefile/run/package/test scripts
  building depth-anything.cpp's DA_SHARED static .so per CPU variant.
- backend/index.yaml: depth-anything backend meta + all hardware-variant
  capability entries (cpu/cuda12/cuda13/intel-sycl-f32+f16/vulkan/nvidia-l4t).
- gallery/index.yaml: 8 Depth Anything 3 GGUF models (base q4_k/q8_0/f16/f32,
  small, large, giant, mono-large).
- .github/backend-matrix.yml: one build entry per hardware variant.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(depth): typed Depth RPC + REST endpoint exposing full DA3 data

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): pin depth-anything.cpp to e0b6814 (ABI 3 dense C-API)

The Depth RPC handler calls da_capi_depth_dense / da_capi_points (C-API ABI 3);
pin the native build to the commit that exports them.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): pin depth-anything.cpp to v0.1.0 release (b515c31)

Repoint the native version from the now-orphaned e0b6814 to the
b515c31 release commit, kept alive by the upstream v0.1.0 tag.
C-API is unchanged (da_capi_abi_version == 3).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): wire depth-anything-cpp into build, CI bump, and importer

The backend dir, gallery index, and CI build-matrix were present but the
backend was never wired into the integration points that adding-backends.md
requires:

- root Makefile: add to .NOTPARALLEL, the test-extra chain, a BACKEND_*
  definition, the docker-build target eval, and docker-build-backends
  (mirrors parakeet-cpp; the backend's own Makefile already documented that
  its `test` target is driven by test-extra).
- bump_deps.yaml: register the DEPTHANYTHING_VERSION pin so the daily
  auto-bump bot tracks mudler/depth-anything.cpp master (it cannot see an
  unregistered Makefile pin).
- import form: add a preference-only KnownBackend entry so depth-anything is
  selectable at /import-model (mirrors sam3-cpp; no reliable GGUF auto-detect
  signal, so pref-only per the doc's default).

changed-backends.js needs no entry: the generic golang suffix branch already
resolves backend/go/depth-anything-cpp/.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(depth): auto-detect importer for depth-anything GGUFs

Replace the preference-only entry with a real auto-detect importer
(mirrors parakeet-cpp / locate-anything):

- DepthAnythingImporter matches a .gguf whose name carries a
  depth-anything token (depth-anything-<size>-<quant>.gguf), so
  /import-model recognises mudler/depth-anything.cpp-gguf repos and direct
  GGUF URLs without an explicit backend preference. preferences.backend=
  "depth-anything" still forces it.
- Registered before LlamaCPPImporter so its GGUF bundles aren't claimed by
  the generic .gguf importer; the narrow name match means it cannot claim
  arbitrary llama GGUFs or the upstream safetensors PyTorch repos.
- Multi-quant repos pick the smallest quant by default (q4_k -> ... -> f32,
  depth stays >0.998 corr even at q4_k); quantizations preference overrides.
- Drops the now-redundant knownPrefOnlyBackends entry (importer-backed
  backends are not listed there, matching parakeet-cpp).
- Table-driven Ginkgo test covers detection, negative cases (llama GGUF,
  upstream safetensors), default/override/fallback quant pick, and direct
  URL import. 10/10 specs pass.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): check conn.Close error in grpc Depth client (errcheck)

The new Depth() client method used a bare `defer conn.Close()`. golangci-lint
runs with new-from-merge-base, so although the 39 sibling methods use the same
bare form (grandfathered), the newly added line trips errcheck. Drop the result
explicitly to satisfy the linter.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): bump depth-anything.cpp to v0.1.1 (embeddable CMake)

v0.1.0 (b515c31) used ${CMAKE_SOURCE_DIR} for its include dirs, which
points at the parent project when built via add_subdirectory() as this
backend does, so the container build failed with missing stb_image.h /
da_gguf_keys.h. v0.1.1 (2d42897) switches to project-relative paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): resolve gosec findings in the backend wrapper

The code-scanning gate flagged three new failure-level alerts in
godepthanythingcpp.go (gosec runs with -no-fail; GitHub gates on new alerts):

- G301: export dirs were created with 0o755. Tighten to 0o750 (no world
  access needed for backend-written export output).
- G304: writeDepthPNG creates req.GetDst(). That path is chosen by the
  LocalAI core as the intended output destination (same pattern every
  image backend uses), not attacker input, so annotate with #nosec G304
  and document why.

The remaining G103 "audit unsafe" notes on the unsafe.Slice C-buffer copies
are warning-level (the same purego interop whisper/parakeet use) and do not
gate the check, per the supertonic exclusion precedent in secscan.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): bump depth-anything.cpp to v0.1.2 (CUDA cross-build arch)

v0.1.1 forced CMAKE_CUDA_ARCHITECTURES=native, which breaks the GPU-less
l4t/cublas CI builds (nvcc "Unsupported gpu architecture 'compute_'" on
CMake 3.22). v0.1.2 (442eea4) drops the override and lets ggml pick its
default cross-build arch list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 16:28:28 +02:00
LocalAI [bot]
1ab61a0875 feat: generic chat_template_kwargs (model config + per-request metadata) (#10359)
* feat(config): add chat_template_kwargs model field + resolver

Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier,
plus ResolveChatTemplateKwargs which layers the config map under coerced
request metadata. Foundation for generic jinja chat-template kwargs (issue #10329).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): forward resolved chat_template_kwargs blob to backends

gRPCPredictOpts now merges per-request client metadata over the server-derived
enable_thinking/reasoning_effort (reaching all backends via the standalone keys)
and serialises the resolved chat_template_kwargs map into a JSON blob for
llama.cpp, written last so a client cannot clobber it. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): wire request metadata to config.RequestMetadata

The OpenAI request metadata field was parsed but unused; stamp it onto the
per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs
overrides. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): generic chat_template_kwargs merge (drop per-key blocks)

Replace the per-key enable_thinking/reasoning_effort handling in both the
streaming and non-streaming chat paths with a single block that parses the
chat_template_kwargs JSON blob resolved by the Go layer and merges every key
into body_json. New jinja template levers (e.g. preserve_thinking) now need
no C++ change. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: document custom chat_template_kwargs (model + per-request)

Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(backend): pin reasoning_effort as a string in the chat_template_kwargs blob

Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(http): e2e guard pinning chat_template_kwargs forwarded to gRPC

Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the
received PredictOptions.Metadata, and an app_test.go spec that drives a real
/v1/chat/completions request (model chat_template_kwargs + per-request metadata
override) and asserts the exact metadata + chat_template_kwargs blob the REST
layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(config): grandfather chat_template_kwargs in registry coverage

chat_template_kwargs is a free-form map[string]any (like engine_args, already
on the list), not a scalar the config UI registry can surface, so it is exempt
from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries
failure introduced by the new field. Issue #10329.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 12:16:34 +02:00
github-actions[bot]
416f871bea chore: bump inference defaults from unsloth (#10358)
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-16 09:59:36 +02:00
Dedy F. Setyawan
9ba8521e7e feat(react-ui): localize models and fix 'Import' typo (#10341)
* feat(react-ui): localize SearchableSelect component

Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>

* feat(react-ui): localize ModelSelector component

Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>

* fix(react-ui): dynamically localize back navigation caption to match page title

Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>

* feat(react-ui): localize back navigation state on Models page

Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>

* feat(react-ui): localize ModelEditor page

Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>

* fix(react-ui): fix Indonesian typo 'Import' to 'Impor' in importModel locale

Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>

---------

Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-06-15 18:26:27 +02:00
LocalAI [bot]
2df2876db2 feat(supertonic): add Supertonic ONNX TTS backend (CPU) (#10342)
* feat(supertonic): vendor upstream Go TTS pipeline (helper.go)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(supertonic): add gRPC backend (Load/TTS/TTSStream, CPU)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(supertonic): satisfy unused linter (use onnxProvider; exclude vendored helper.go)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(supertonic): unit tests for resolvers + gated end-to-end synthesis

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(supertonic): gofmt backend.go comment block

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(supertonic): add Makefile, run.sh, package.sh (CPU build)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(supertonic): wire backend into root Makefile

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(supertonic): check ort.DestroyEnvironment return (errcheck)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(supertonic): resolve voice_styles as sibling of onnx dir; guard trim; test voice

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(supertonic): add CPU build matrix + gallery index entries

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(supertonic): expose as pref-only importable backend

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(supertonic): add Supertonic/supertonic-3 TTS model to the gallery

16 files (4 onnx + tts.json + unicode_indexer.json + 10 voice styles)
from HF Supertone/supertonic-3, served via the supertonic backend.
Defaults to voice F1; onnx/ + sibling voice_styles/ layout matches the
backend's resolveVoicesDir.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(meta): register pipeline.max_history_items config field

Pre-existing on master: the field was added without a registry entry,
failing TestAllFieldsHaveRegistryEntries (core/config/meta). Add the
entry so it renders properly in the model-config UI.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(secscan): exclude vendored supertonic backend from gosec

helper.go is vendored from supertone-inc/supertonic; its G304/G404/G104
findings are inherent to upstream and the math/rand use is correct for
flow-matching noise (crypto/rand would be wrong).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-15 16:54:11 +02:00
LocalAI [bot]
7d2a762b53 feat(realtime): configurable pipeline.max_history_items (#10331)
Composed realtime pipelines (VAD+STT+LLM+TTS) defaulted to unlimited history,
so a long-running session grew every turn and fed the whole conversation to the
LLM until its context window filled. Add an optional pipeline.max_history_items
to cap the trailing items per turn; explicit value (including 0=unlimited) wins
over the per-model-type default. Self-contained any-to-any models keep their
6-item default.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 18:13:09 +02:00
LocalAI [bot]
fdc352a618 fix(settings): start watchdog on cold-enable from the React UI (#9125) (#10287)
fix(watchdog): start the live watchdog on a cold enable from Settings (#9125)

The React Settings "Enable Watchdog" master toggle only ever writes the
idle/busy flags; watchdog_enabled is vestigial in that UI. The live
start/stop decision in UpdateSettingsEndpoint keyed off the raw, stale
watchdog_enabled request field, so a cold enable (idle/busy=true,
watchdog_enabled=false) called StopWatchdog() and the watchdog stayed
stopped until the next restart - at which point startup re-derived it
from the idle flag. Net: enabling the watchdog appeared to do nothing.

Derive the run-state from idle||busy as the single source of truth,
mirroring the startup invariant:

- ApplyRuntimeSettings now sets WatchDog = idle||busy whenever either
  field is present (so a full disable also brings it down), while an API
  client posting only watchdog_enabled keeps its explicit value.
- Add ApplicationConfig.WatchdogShouldRun() mirroring startWatchdog's
  gating (idle/busy, LRU eviction, memory reclaimer); the /api/settings
  handler uses it to decide start vs stop.
- Belt-and-suspenders: the Settings.jsx master toggle also writes
  watchdog_enabled = idle||busy.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-14 16:46:14 +02:00
LocalAI [bot]
e5c95e0449 fix(distributed): stage backend companion assets to remote nodes (#10330)
A model whose ModelFile is a single file (e.g. sherpa-onnx VITS/piper: the
.onnx) failed to load on remote worker nodes because the sibling assets the
backend resolves from the model dir — tokens.txt, lexicon.txt, the
espeak-ng-data / dict directories, Kokoro's voices.bin — were never staged.
Only the declared ModelFile was shipped, so the worker hit "failed to create
sherpa-onnx TTS engine" and TTS produced no audio.

Lean on the existing option-path staging instead of hardcoding filenames:

- stageGenericOptions now also resolves an option value relative to the model's
  own directory (not just the frontend models dir), so a shared config can
  declare companions with bare names regardless of whether Model includes a
  subdirectory; and it expands directory-valued options (e.g. espeak-ng-data)
  file-by-file rather than handing a directory fd to the stager.
- gallery/sherpa-onnx-tts.yaml declares the companion assets as option paths
  (tokens, lexicon, espeak-ng-data, voices.bin, dict, per-lang lexicons). The
  backend ignores these keys and keeps resolving siblings from the model dir;
  they exist only so distributed staging ships them. Absent files are skipped.

Adds router_optionstage_test.go covering file + directory companion staging via
the model-dir fallback.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-14 16:42:59 +02:00
LocalAI [bot]
4ec6e3221e feat(realtime): gate realtime pipeline voice models behind voice recognition (#10319)
* feat(realtime): add pipeline voice_recognition gate config schema

Add the PipelineVoiceRecognition config block that gates a realtime
pipeline behind speaker verification (identify against the voice
registry, or verify against reference audios), with Normalize defaults
and Validate enum/shape checks. Register the new fields in the config
meta registry so the UI renders them with proper labels/components
(required by the registry-coverage gate).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(realtime): range-check voice gate threshold and floor UI min

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add cosineDistance helper for voice gate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add voiceGate identify-mode authorization

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* test(realtime): cover voice gate fail-closed error paths

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add voiceGate verify-mode authorization

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add voiceGate decide policy helper

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): add newVoiceGate constructor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* feat(realtime): gate pipeline responses behind voice recognition

Run speaker verification concurrently with transcription and join on a
hard barrier before generateResponse, so unauthorized utterances never
reach the LLM, tools, or TTS. Supports identify (registry) and verify
(reference) modes with multiple authorized speakers, per-utterance or
first-utterance checking, and drop-with-event or silent-drop on reject.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(realtime): harden voice gate goroutine lifecycle

Only launch the verification goroutine on the transcription path and
drain it before the temp WAV is removed on the transcription-error
return, so an in-flight backend read never races the deferred cleanup.
Drop the write-only voiceMatched field; log the matched speaker instead.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* docs(realtime): document the voice_recognition pipeline gate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* fix(realtime): fail closed on an incomplete voice_recognition block

A present voice_recognition block with no model previously disabled the
gate silently, authorizing every speaker. Treat block presence as the
intent signal and reject an empty model in Validate, so the session is
refused instead of running unprotected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

* test(realtime): integration-test the voice gate through commitUtterance

Drive the real commitUtterance path (gate goroutine, hard join before the
LLM, reject event, when:first session trust) with the existing
transport/model doubles: authorized speakers reach a full response,
unauthorized ones are dropped before the LLM with a speaker_not_authorized
event, backend errors fail closed, drop_silent stays quiet, and when:first
trusts the session after one match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 23:38:08 +02:00
LocalAI [bot]
4bb592cf91 feat(qwen3-tts-cpp): migrate to ServeurpersoCom/qwentts.cpp (streaming, speakers, voice design) (#10316)
* feat(qwen3-tts-cpp): repoint upstream to ServeurpersoCom/qwentts.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): flatten qt_* ABI into qt3_* purego shim

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): build shim against upstream qwen-core static lib

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): add option/language/voice/sampling parsing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): add 24kHz WAV encode/decode/stream-header helpers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): purego backend with streaming, speakers, voice design

Map TTSRequest onto qwentts.cpp: instructions->instruct, voice->named
speaker or clone-reference path, params map->ref_text + sampling. Add
TTSStream over the qt chunk callback.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(qwen3-tts-cpp): unit specs + build-gated TTS/TTSStream e2e

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(qwen3-tts-cpp): close defensive PCM-free gap on zero-sample result

Register CppPCMFree before the n<=0 guard so a non-null buffer with zero
samples cannot leak (the C contract returns NULL on failure, so this is
defensive). Raised in code review.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): advertise TTSStream capability

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(qwen3-tts-cpp): update backend index metadata for qwentts.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): qwentts.cpp models - base/customvoice/voicedesign, Q8_0 & Q4_K_M

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(qwen3-tts-cpp): release note for qwentts.cpp migration

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(qwen3-tts-cpp): cover audio_path voice-cloning fallback

Add resolveRequest unit specs (config audio_path used as the clone
reference when Voice is empty; per-request audio Voice overrides it; a
named-speaker Voice does not trigger cloning) plus a real-inference e2e
that clones from audio_path (confirmed ref_spk_emb=yes in the pipeline).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(qwen3-tts-cpp): drop the release-note doc

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 23:09:59 +02:00
moduvoice
36b4a81d1e feat(i18n): add Korean (ko) translation (#10312)
Add a full Korean locale (core/http/react-ui/public/locales/ko/, 13 namespaces,
840 keys, full parity with en/) and register ko in SUPPORTED_LANGUAGES
(core/http/react-ui/src/i18n/index.js). All i18next {{interpolation}} and
_one/_other plural keys preserved; brand/model names kept untranslated.

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: moduvoice <moduvoicr77@gmail.com>
2026-06-13 21:58:50 +02:00
LocalAI [bot]
0854932a25 feat(omnivoice-cpp): add OmniVoice TTS backend (file + streaming, voice cloning + voice design) (#10310)
* feat(omnivoice-cpp): add C wrapper + CMake/Makefile build over OmniVoice ov_* ABI

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): add option/language parsing + WAV framing helpers with tests

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): wire purego binding with TTS + streaming TTSStream

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(omnivoice-cpp): wire backend into root Makefile

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(omnivoice-cpp): add build matrix entries + dep-bump registration

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): register backend meta + image entries

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): expose as preference-only importable backend

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add omnivoice-cpp TTS models (Q8_0 default + BF16 HQ)

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(omnivoice-cpp): document the OmniVoice TTS backend

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(omnivoice-cpp): add env-gated e2e for TTS + streaming

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(omnivoice-cpp): honor tts.audio_path/tts.voice config as default cloning reference

The model config tts.audio_path (ModelOptions.AudioPath) and tts.voice now
provide a default voice-cloning reference used when a request omits Voice, so a
cloned voice can be pinned in the model YAML instead of passed per request. A
per-request voice still overrides. Paths resolve relative to the model dir.

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(omnivoice-cpp): add missing omnivoice-cpp-development backend meta

Mirrors the whisper/vibevoice convention: a -development meta aggregating the
master-tagged image variants (the production meta and per-variant prod+dev image
entries already existed; only the development meta aggregator was missing).

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 21:28:46 +02:00
LocalAI [bot]
7637f8cf1b feat(distributed): declarative per-model scheduling via env/args (#10308)
* feat(distributed): add SpreadAll column and authoritative scheduling seeding

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): parse declarative model scheduling config (env/file)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): reconcile spread_all to one replica per matching node

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): wire LOCALAI_MODEL_SCHEDULING env/args and startup seeding

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): expose spread_all on the scheduling API endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): add spread-to-all-nodes mode to the scheduling UI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_MODEL_SCHEDULING env/args

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): clarify replica modes and all-nodes spread in scheduling config

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 18:31:06 +02:00
LocalAI [bot]
e1556aa1dc fix(react-ui): make agent chat timestamps format-agnostic (#9867) (#10290)
fix(agents): make React agent chat timestamps format-agnostic

The agent SSE bridge emits the json_message timestamp in three different
encodings depending on deploy mode: an RFC3339 string (standalone agent
pool), Unix milliseconds (local dispatcher), and Unix nanoseconds (the
older NATS path). The React AgentChat handler passed data.timestamp
straight through, so the standalone string and any numeric value outside
the millisecond range rendered as "Invalid Timestamp" or a constant
epoch-ish time.

Add a small pure helper, normalizeTimestampMs, that accepts an RFC3339
string or a numeric epoch in s/ms/us/ns and returns JS milliseconds,
falling back to Date.now() on null/empty/unparseable input. Use it in
the json_message handler so the rendered time is correct regardless of
which backend path produced it.

Fixes #9867


Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 11:05:21 +02:00
LocalAI [bot]
99c8205740 fix(react-ui): stop Talk pipeline overflow and center collapsed-rail avatar (#10305)
Two small visual fixes in the React UI:

- Talk page pipeline summary: the four-column grid used
  `repeat(4, 1fr)`, which resolves to `minmax(auto, 1fr)` so each track
  refuses to shrink below the min-content width of its `nowrap` model
  name. Long names (e.g. a verbose GGUF LLM id) blew the grid out past
  the container despite the per-cell ellipsis styling. Switching to
  `minmax(0, 1fr)` lets the tracks shrink and the ellipsis take effect.

- Sidebar user avatar: the desktop collapsed look centers the avatar via
  `.sidebar.collapsed .sidebar-user{-link}` rules, but the tablet
  icon-rail (640-1023px) collapses visually through `.sidebar:not(.open)`
  without necessarily carrying the `.collapsed` class, so the avatar kept
  its left-aligned negative margins and looked misaligned. Mirror the
  centering rules under `.sidebar:not(.open)`.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 11:02:48 +02:00
LocalAI [bot]
a906438a69 fix(config): backend-gate the top_k=40 sampler default (#6632) (#10285)
fix(config): gate top_k=40 default on backend family (#6632)

SetDefaults injected top_k=40 (llama.cpp's sampling default) for every
model config regardless of backend. That value is wrong for backends
whose native default differs: mlx_lm's intended default is top_k=0
(disabled) and mlx does not remap 0->40, so a client that omits top_k
silently got 40 shipped to mlx, changing sampling. The mlx backend's own
getattr(request,'TopK',0) fallback is dead because proto3 int32 is always
present.

Gate the injection on backend family via UsesLlamaSamplerDefaults: keep
top_k=40 for the llama.cpp family and for the empty/auto backend (the GGUF
auto-detect path resolves to llama.cpp, so existing behavior is preserved),
but leave TopK nil for the known non-llama backends (mlx, mlx-vlm,
mlx-distributed). gRPCPredictOpts now sends 0 when TopK is nil, which is
the value mlx actually wants.

Only TopK is gated - the confirmed bug. The sibling sampler defaults
(top_p, temperature, min_p) are left global to avoid widening scope and
introducing nil-deref risk; revisit per-backend if needed.

Assisted-by: claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-13 09:04:25 +02:00
LocalAI [bot]
edeacf22c4 fix(realtime): keep transcription model on a language-only session.update (#10295)
A transcription session.update that carries only a language (no model) —
e.g. a client forcing the STT input language — has an empty
Transcription.Model. updateSession unconditionally copied that into
session.ModelConfig.Pipeline.Transcription, blanking the pipeline's
configured transcription backend. The next utterance then transcribed
against an empty model and the backend RPC failed with "unimplemented"
(surfaced to the client as transcription_failed), so transcription
silently stopped whenever a language was selected.

Only adopt the incoming transcription model when it is non-empty, and
preserve the existing model otherwise (mirroring updateTransSession).

Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-13 01:01:36 +02:00
Aniruddh Jha
51f4f67c47 fix(agents): emit chat event timestamps in milliseconds (#9867) (#10243)
Agent chat replies rendered a broken timestamp in the web UI
("Invalid Timestamp" / "12:00 AM", identical for every reply) because
the SSE timestamp unit was inconsistent across producers.

EventBridge.PublishEvent emitted Unix nanoseconds while the local
dispatcher (dispatcher.go) already emitted Unix milliseconds, and the
React UI fed the value straight into `new Date(ts)` after dividing by
1e6. Nanoseconds also overflow JS's safe-integer range (~1.7e18).

Standardize on Unix milliseconds: switch PublishEvent to UnixMilli and
drop the /1e6 conversion in AgentChat.jsx so both SSE paths agree and
match the React UI's expectation. Add a regression test asserting the
published timestamp is in milliseconds.
2026-06-12 23:18:44 +02:00
LocalAI [bot]
a7a7bd646b fix(mlx): route vision-language models to the mlx-vlm backend (#10274)
Vision-language checkpoints such as mlx-community/gemma-4-E4B-it-qat-4bit
declare the "image-text-to-text" pipeline tag on HuggingFace. The mlx
importer hardcoded backend "mlx" for every mlx-community model, so these
VLMs were served by the text-only mlx-lm backend whose tokenizer does not
carry the processor chat template. The template was never applied and the
model produced degenerate, looping output that echoed the prompt.

Detect the "image-text-to-text" pipeline tag in the importer and route those
models to mlx-vlm, which applies the processor-aware chat template. An
explicit backend preference still wins.

As a defensive backstop, the mlx backend now warns loudly when the loaded
model has no chat template, so a misrouted VLM surfaces the problem instead
of silently looping.

Fixes #10269


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 23:12:42 +02:00
Richard Palethorpe
085fc53bbc fix(router): production-ready request router + auto-size batch for embedding/rerank (#10104)
* fix(router): score classifier production-readiness

Conversation trimming runs through the classifier model's chat template
and trims by exact token count, sized to the model's n_batch which is
now scaled to context so long probes can't crash the backend. Missing
chat_message templates are a hard error at router build time. Router-
facing factories (Embedder/Scorer/Reranker/TokenCounter) re-resolve
ModelConfig per call so a model installed post-startup doesn't bind a
stub Backend="" config and silently fall into the loader's auto-
iterate path.

New 'vector_store' backend trace recorded inside localVectorStore on
every Search/Insert — including the backend-load-failure path that
previously vanished into an xlog.Warn — with outcome tagging
(hit/miss/empty_store/backend_load_error/find_error/insert_error/ok).
Companion cleanup drops misleading similarity:0 and input_tokens_count:0
from non-hit and text-mode traces.

Gallery local-store-development aliases to 'local-store' so the master
image satisfies pkg/model.LocalStoreBackend lookups from the embedding
cache.

Misc: llama-cpp TokenizeString reads the correct 'prompt' JSON key
(the original bug); ModelTokenize nil-guard; non-fatal mitm proxy
startup; PII 'route_local' renamed to 'allow' with docs/UI in sync;
model-editor footer no longer eats the edit area on small screens;
several config-editor template/dropdown/section fixes.

Tests: e2e router specs (casual/code-hint + long-conversation trim),
vector_store trace specs, lazy-factory specs, gallery dev-alias
resolution, Playwright trace badge + scroll regression.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(backend): auto-size batch to context for embedding and rerank models

Embedding and rerank models pool over the whole input in a single physical batch (n_ubatch). With batch left at the 512 default, the backend rejects longer inputs with "input is too large to process", silently capping a large-context embedder (e.g. 8k/32k) at 512 tokens. Size n_batch to the context for these single-pass usecases, mirroring the existing FLAG_SCORE behaviour; an explicit batch: still wins.

Extracts EffectiveContextSize/EffectiveBatchSize from grpcModelOpts so the effective decode window has one home for other callers to reuse.

Adds an e2e-aio regression test that embeds a >512-token input. The AIO embedding model is switched to nomic-embed-text-v1.5 (2048 context) because the previous granite model was capped at 512 tokens and could not exercise the larger batch.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(gallery): raise arch-router scoring output cap via parallel:64

Scoring decodes the whole prompt+candidate in a single llama_decode and
reads one logit row per candidate token. The vendored llama.cpp server
caps causal output rows at n_parallel, so the default of 1 aborts with
GGML_ASSERT(n_outputs_max <= cparams.n_outputs_max) on multi-token route
labels. Set options: [parallel:64] on both arch-router quant entries to
lift the cap; kv_unified (the grpc-server default) keeps the full context
per sequence, so this does not split the KV cache.

Assisted-by: claude-code:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-12 16:21:15 +02:00
LocalAI [bot]
56cc4f63fc feat(backend): locate-anything-cpp (open-vocabulary object detection via ggml) (#10264)
* feat(backend): add locate-anything-cpp backend (open-vocab detection via la_capi)

A Go/purego backend wrapping locate-anything.cpp's la_capi C ABI, implementing
the gRPC Detect RPC: image + open-vocabulary text prompt -> labeled boxes.
Mirrors backend/go/rfdetr-cpp; static-links ggml into a per-CPU-variant .so.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend): register locate-anything-cpp in build matrix

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): locate-anything gallery entry + model importer

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(backend): locate-anything-cpp Load+Detect wire test

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add locate-anything-3b model to the gallery index

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend): register locate-anything.cpp in bump_deps auto-bump

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>

* ci(test): e2e smoke for locate-anything-cpp in test-extra (loads the 3B + image, runs Detect)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: mudler <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Co-authored-by: mudler <mudler@localai.io>
2026-06-12 14:59:07 +02:00
Dedy F. Setyawan
1cea96f09f feat(react-ui): add Indonesian language support (#10266)
Signed-off-by: Dedy F. Setyawan <dedyfajars@gmail.com>
2026-06-12 10:08:58 +02:00
LocalAI [bot]
892fc49949 feat(realtime): stream the LLM / TTS / transcription pipeline stages (#10176)
* feat(realtime): pipeline streaming + disable_thinking config

Add a nested pipeline.streaming.{llm,tts,transcription} block plus
pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/
ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path;
existing configs are unaffected. Wiring into the realtime handler follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): sentence segmenter for streamed LLM->TTS pipelining

streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming TTS/transcription methods on Model interface

Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): emitSpeech with flag-gated streaming TTS

emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.

Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): route response audio through emitSpeech (streaming TTS)

Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): streaming transcription text deltas

Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): pipeline disable_thinking maps to enable_thinking off

applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): speechStreamer for token-streamed LLM->TTS

emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): wire streamLLMResponse for token-streamed replies

triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(realtime): document pipeline streaming + disable_thinking

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): register pipeline streaming/thinking config fields

TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.

Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): always strip reasoning from spoken output

disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).

Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(realtime): clean TTS temp path before read (gosec G304)

emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.

Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(realtime): buffer whole message for TTS, drop sentence segmenter

Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.

Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): stream tool-call turns via tokenizer-template autoparser

Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): script-aware clause chunking + streamed-reply fixes

Opt-in pipeline.streaming.clause_chunking splits the streamed LLM reply
into speakable clauses and synthesizes each as soon as it completes,
lowering time-to-first-audio instead of buffering the whole message. The
splitter is script-aware (rivo/uniseg, pure Go): UAX#29 sentence
segmentation handles CJK 。!? with no whitespace, CJK clause
punctuation (,、;:) and Thai/Lao spaces give finer cuts, and a UAX#14
line-break cap bounds an over-long punctuation-less run. Unlike the old
ASCII .!?/newline segmenter (dropped in 076dcdbe) it does not degrade to
whole-message buffering for CJK/Thai; scripts needing a dictionary
(Khmer/Burmese) stay buffered until a space or end-of-message. Clauses
are synthesized synchronously in the token callback (the LLM keeps
generating into the gRPC stream meanwhile), so audio still starts
mid-generation. Off by default — the whole-message path is unchanged.

Also fix the streamed-reply path and the Talk page:

- Don't swallow streamed autoparser content as reasoning: the
  tokenizer-template path already delivers reasoning-free content via
  ChatDeltas, so prefilling the thinking start token re-tagged it as an
  unclosed reasoning block, leaving no spoken reply. Disable the prefill
  on that path; closed tag pairs are still stripped (#9985).

- Generate collision-free realtime IDs (16 random bytes) instead of a
  constant, so per-item bookkeeping (cancel, conversation.item.retrieve)
  works.

- Key the Talk transcript by the server item_id and upsert entries.
  Realtime events arrive over a WebRTC data channel — outside React's
  event system — so React defers the setTranscript updaters while
  synchronous ref writes in handler bodies run first; the old
  index-tracking ref rendered a duplicate assistant bubble on
  completion. Upserts by item_id are idempotent and order-independent.

- Drop the partial assistant bubble on a cancelled response (barge-in):
  the server discards the interrupted item and sends response.done with
  status "cancelled"; mirror that in the UI so the regenerated reply
  isn't rendered as a second assistant message.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Assisted-by: Claude:claude-fable-5 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Richard Palethorpe <io@richiejp.com>
2026-06-11 08:43:12 +01:00
LocalAI [bot]
fba8c9c498 fix(distributed): track in-flight for non-LLM inference methods (VAD, diarize, voice, ...) (#10238)
fix(distributed): track in-flight for non-LLM inference methods

InFlightTrackingClient only wrapped a subset of the grpc.Backend
inference methods (Predict, Embeddings, TTS, AudioTranscription, Detect,
Rerank, ...). Methods like VAD were left as embedded passthrough, so
track() never ran for them.

In distributed mode every model is loaded with in_flight=1 as a
reservation; that reservation is only released by the OnFirstComplete
callback, which fires after the first *tracked* inference call completes.
A VAD-only model (e.g. silero-vad) never calls a tracked method, so the
reservation is never released and in-flight stays pinned at 1 forever -
which also blocks the router's idle-eviction logic.

Wrap the remaining unary inference methods (VAD, Diarize, Face*, Voice*,
TokenClassify, Score, AudioEncode, AudioDecode, AudioTransform) with the
same track()/reconcile() pattern. The three bidi-stream constructors
(AudioTransformStream, AudioToAudioStream, Forward) are deliberately left
as passthrough - their inference spans the stream lifetime, not the
constructor call, so track() there would fire onFirstComplete before any
data flows.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-10 16:29:50 +02:00
LocalAI [bot]
b203b32e57 feat(realtime): make WebRTC ICE candidates configurable (#10231)
The /v1/realtime WebRTC handler created the peer connection with a bare
webrtc.Configuration and no SettingEngine, so pion gathered a host ICE
candidate for every local interface. Under Docker host networking that
includes bridge addresses (docker0/veth, 172.x) a remote browser cannot
route to; the call establishes on a good pair and then drops once ICE
consent freshness checks fail on the unreachable candidates.

Add two opt-in knobs, applied via a pion SettingEngine:
- LOCALAI_WEBRTC_NAT_1TO1_IPS: advertise these IPs as the host candidates
  (e.g. the host LAN IP)
- LOCALAI_WEBRTC_ICE_INTERFACES: restrict ICE gathering to these interfaces

Defaults are unchanged (empty => current all-interface behavior).

Assisted-by: Claude:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-09 22:28:03 +02:00
Ching
48a8ce98aa fix(cli): handle chat output errors (#10229)
Propagate terminal write errors from the chat prompt and explicitly ignore stream close errors during cleanup.

Update chat tests to assert response writer errors so errcheck passes without hiding failed writes.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 19:10:24 +02:00
Ching
8344d1c865 feat(cli): add interactive chat mode (#10226)
Add an opt-in `local-ai chat` command for testing chat models directly from the terminal without manually sending curl requests.

The command connects to a running LocalAI server, lists available models through the existing OpenAI-compatible API, streams chat completions, and supports interactive commands such as `/models`, `/model`, `/clear`, and `/exit`.

Keep `local-ai run` focused on the server lifecycle so the web UI, API clients, and multiple chat terminals can coexist against the same server.

Document the new command and terminal workflow in the README and CLI docs.

Tests:
- go test -count=1 ./core/cli/chat
- go test -count=1 ./core/cli

Assisted-by: Codex:GPT-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 14:58:44 +00:00
Pete
d2e6b93369 feat(agents): surface KB source citations in RAG responses (#10228)
* dev knowledge.go structure

Signed-off-by: Pete Chen <petechentw@gmail.com>

* feat(agents): append KB source citations to responses

Render structured KB citations as a Sources block after agent responses, linking each source to the existing raw collection entry endpoint.

Keep long-term memory writes on the original model response so citation blocks do not get stored back into the knowledge base.

Tested with: go test ./core/services/agents

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

* Collect KB citations from tool searches

Signed-off-by: Pete Chen <petechentw@gmail.com>

* fix(agents): append KB sources in local chats

Apply the shared KB citation post-processing to standalone LocalAGI chat responses so the React agent chat receives the same clickable Sources block as the native executor path. Also fix the run target to use the current cmd/local-ai entrypoint.

Assisted-by: Codex:gpt-5
Signed-off-by: Pete Chen <petechentw@gmail.com>

---------

Signed-off-by: Pete Chen <petechentw@gmail.com>
Co-authored-by: shihyunhuang <shihyunhuang88@gmail.com>
Co-authored-by: TLoE419 <tloemizuchizu@gmail.com>
Co-authored-by: Ching Kao <0980124jim@gmail.com>
2026-06-09 16:32:56 +02:00
LocalAI [bot]
e1ec03d33f fix(reasoning): stop prefilled <think> from swallowing tag-less answers (#10225)
* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 09:02:04 +02:00
LocalAI [bot]
9323f4b5ca feat(llama-cpp): video input support (mtmd #24269) (#10216)
* chore(llama-cpp): bump to 8f83d6c for mtmd video input support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): forward video input to mtmd (template + non-template paths)

Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add video attachment support to the chat UI

Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(llama-cpp): patch mtmd video stdin double-close (heap crash)

Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.

Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): re-pin to upstream #24316, drop vendored stdin patch

Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.

Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").

NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 23:17:50 +02:00
LocalAI [bot]
92dea961c2 fix: distributed backend reinstall/upgrade UI stuck on 'reinstalling' (#10214)
* fix(galleryop): self-evict terminal ops from OpCache.GetStatus

The processingBackends map (the UI 'reinstalling' spinner source) only cleared
an op when a client polled /api/backends/job/:uid. The Manage-page Reinstall and
Upgrade buttons never poll, so completed installs leaked into processingBackends
forever and the backend card spun 'reinstalling' even though the install had
finished. Evict terminal ops on the list read instead; DeleteUUID already
broadcasts the eviction so peer replicas converge.

Reproduced on a live 5-node distributed cluster: 5 backends sat in
processingBackends with underlying jobs reporting completed:true,progress:100.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): clear pending backend ops behind offline/draining nodes

ListDuePendingBackendOps filters status=healthy, so a backend op queued against
a node that went offline (stale heartbeat) or draining (admin action) was never
retried, aged out, or deleted - it leaked forever and kept the UI operation
spinning. Add DeleteStalePendingBackendOps and run it each reconcile pass:
draining nodes are cleared immediately (model rows already purged), offline
nodes once their heartbeat is older than a grace window (blip protection).

Reproduced on a live cluster: orphaned llama-cpp install rows targeting an
offline (nvidia-thor) and a draining (mac-mini-m4) node sat at attempts=0
indefinitely.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): stream per-node progress during backend upgrade

The install dispatch subscribed to a per-op progress subject and streamed
per-node download ticks; the upgrade dispatch did a bare 15-minute blocking
NATS round-trip with no subscription, so the UI showed progress:0 the whole
time (the 'reinstalling but nothing happens' report on a slow node).

Thread the op ID through BackendManager.UpgradeBackend -> the distributed
manager -> the adapter, and have the adapter subscribe to the per-op progress
subject before the request (extracted into a shared subscribeProgress helper
reused by install/upgrade/force-fallback). The worker's upgradeBackend now
creates the same DebouncedInstallProgressPublisher installBackend uses. An
upgrade is a force-reinstall, so it reuses SubjectNodeBackendInstallProgress
rather than minting a new subject - no new NATS permission, no new
rolling-update compat surface. Reconciler-driven retries pass empty
opID/onProgress and stay on the silent path.

Reproduced on a live cluster: upgrade of llama-cpp-development on agx-orin-slow
sat at progress:0 for 4+ minutes with no per-node feedback.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): persist cancellation + periodically reap orphaned ops

Two distributed gaps surfaced when a replica was killed mid-upgrade on a live
cluster, leaving the backend stuck 'processing' in the UI forever:

1. CancelOperation flipped the in-memory status to cancelled and broadcast a
   NATS event but never persisted the terminal status. On the next replica
   restart the still-active row re-hydrated straight back into
   processingBackends and the UI spun again. It now calls store.Cancel(id) so
   the cancel survives a restart.

2. CleanStale (which marks abandoned active ops failed) only ran once on
   startup, so an op orphaned AFTER startup - its owning replica's foreground
   handler goroutine gone - was never reaped until the next restart. Add
   GalleryService.ReapStaleOperations and run it on a 15m ticker (CleanStale
   now returns the reaped count for observability).

Neither is covered by the OpCache self-evict fix: an orphaned op never reaches
Processed, so it would never self-evict.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review): address self-review findings on the distributed install fixes

Three findings from an adversarial review of this branch:

1. CRITICAL - OpCache.GetStatus crashed under concurrent load. m.Map() returns
   the live internal map by reference, so deleting from it on the read path was
   an unsynchronized write to a map four HTTP handlers poll every ~1s -> a
   'concurrent map writes' fatal. Rewritten to iterate a Keys() snapshot, build
   a fresh result map, and apply evictions via the locked DeleteUUID after the
   loop. Added a -race concurrency regression guard.

2. HIGH - GetStatus evicted failed ops too, hiding them from /api/operations
   and breaking the dismiss-failed-op flow (the panel keeps Error != nil ops so
   the admin can read the error and click Dismiss). Eviction now fires only for
   terminal ops with Error == nil (success/cancelled); failures are retained.

3. MEDIUM - DeleteStalePendingBackendOps missed StatusUnhealthy nodes. A node
   marked unhealthy on a NATS ErrNoResponders never transitions to offline
   (health.go skips re-marking it), so its pending ops leaked exactly like the
   offline case. Unhealthy is now reaped via the same stale-heartbeat grace path
   (a fresh-heartbeat node is recovering and keeps its op).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(review-2): don't evict the still-installing soft-path; don't spin on failed ops

Second review pass found two issues:

1. MEDIUM (Go) - OpCache.GetStatus evicted the ErrWorkerStillInstalling
   soft-path op. That op is deliberately Processed=true with no error to show a
   yellow in-progress state when a worker timed out the NATS round-trip but is
   still installing in the background; the reconciler confirms the real outcome
   later. Evicting it (and broadcasting OpEnd + marking the DB completed) hid an
   install that may still fail. Eviction is now scoped to a clean success
   (progress 100 + 'completed', matching the job-poll's historical condition) or
   a cancellation - the soft-path (progress != 100) and failures are kept.

2. MEDIUM (React) - the Backends gallery card rendered ANY operation as an
   'Installing...' spinner, so a failed op (now intentionally kept in the list
   for the OperationsBar error + Dismiss) spun forever. Exclude errored ops from
   the card spinner, mirroring Models.jsx (isInstalling already excludes
   op.error). The error + Dismiss still surface in the global OperationsBar.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): refresh Manage backends table when an operation settles

The Manage backends table fetched installed backends only on mount/after delete
and checked upgrades only on tab activation. After a reinstall/upgrade completed
neither re-ran, so the installed-version cell and the 'update available' badge
stayed stale until the user switched tabs - the op looked like it 'did nothing'.

Watch the operations list (via useOperations) and re-fetch installed backends +
available upgrades whenever the count settles, mirroring the operations.length
watch Backends.jsx already uses. Consolidates the prior tab-activation upgrades
check into the same effect.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-08 10:03:02 +02:00
Adira
2c804bef5a fix(config): skip vocab arrays and mmap GGUF headers to speed up startup (#10213)
When the models directory holds many GGUF files, startup parsed every
model's full GGUF — including the tokenizer vocab arrays
(tokenizer.ggml.tokens/scores/merges, often >100k entries) — once per
model while guessing defaults. On slow storage (e.g. a models directory
on a Docker volume) those hundreds of thousands of tiny reads dominate
boot time before the HTTP server comes up.

The default-guessing path and the VRAM metadata reader only consume
scalar metadata and array lengths, never the array contents. Parse with
SkipLargeMetadata (seek past large arrays) and UseMMap (fault in a few
header pages instead of issuing per-element read() syscalls). For a
256k-token vocab this cuts the parse from ~524k read() syscalls to 8.
The mapping is released when ParseGGUFFile returns.

Fixes #9790

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-06-07 23:33:52 +02:00
LocalAI [bot]
67f80a152b fix(mtp): don't auto-enable self-spec MTP for draft-only assistant GGUFs (#10208)
Gemma4 MTP (ggml-org/llama.cpp#23398) registers the prediction head as a
separate `gemma4-assistant` architecture. That assistant GGUF still carries
`<arch>.nextn_predict_layers`, so the architecture-agnostic detection in
HasEmbeddedMTPHead matched it and appended the `spec_type:draft-mtp` defaults.

Unlike the DeepSeek/Qwen embedded-head models, an assistant checkpoint cannot
self-speculate: it is a draft model that requires a paired target context
(`ctx_other`) and throws if loaded alone. Auto-applying the self-spec defaults
to a standalone assistant import therefore produces a broken config.

Guard the detection against draft-only assistant architectures (the `-assistant`
suffix is upstream's naming convention) so importing one no longer yields a
self-speculation config. Two-model target+draft pairing remains expressible
manually via `draft_model:` and is left to a follow-up.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-07 22:09:02 +02:00
LocalAI [bot]
e837921c2c feat: forward reasoning_effort to the backend so jinja models honor it (#10184)
* feat: forward reasoning_effort to the backend so jinja models honor it

reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.

Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): set the pipeline LLM's reasoning_effort

Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 13:45:43 +00:00
Richard Palethorpe
73385713ca feat(distributed): enforce registration token for worker file transfer (#10183)
The worker HTTP file-transfer server is authenticated by the registration
token via checkBearerToken, which fails open on an empty token: every
/v1/files, /v1/files-list and /v1/backend-logs request is then served
unauthenticated, granting read/write to the worker's models/staging/data
directories. The fail-open was also silent (the only auth log sat on the
unreachable reject branch), and the worker process never runs
DistributedConfig.Validate(), so the existing frontend warning did not
cover the component that exposes the server.

Mirror the NatsRequireAuth pattern: keep anonymous as the default but make
it loud and opt-in enforceable.

- Log a prominent warning when the file-transfer server starts tokenless.
- Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors
  on an empty token (frontend) and the worker refuses to start (fail-fast,
  before registration), so production can fail closed. Also satisfies the
  F-003 suggestion to fail Validate() on distributed + empty token.
- Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both
  RegistrationRequireAuth and NatsRequireAuth — one production knob locking
  down the registration/file-transfer layer and the NATS bus together; the
  granular flags remain available as single-layer overrides. Wired into the
  frontend, supervisor worker, and agent worker (vLLM worker has neither a
  NATS connection nor a file-transfer server, so it is left untouched).
- Document in distributed-mode.md (warning callout + flag tables).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-05 14:34:28 +02:00