Compare commits

..

12 Commits

Author SHA1 Message Date
LocalAI [bot]
bfd6c09d88 chore(model gallery): 🤖 add 1 new models via gallery agent (#10663)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 18:02:09 +02:00
Richard Palethorpe
eb32cd9073 feat(realtime): eager blocking pipeline warm-up + /backend/load API (#10662)
Realtime sessions previously lazy-loaded each pipeline sub-model (VAD,
transcription, LLM, TTS) on first use, so every cold session paid a
per-request model-load stall and load errors only surfaced mid-stream.

Warm the whole pipeline eagerly and blockingly at session start
(including the voice-gate speaker-recognition model, which an enforced
gate blocks each utterance on; compaction's summary_model stays lazy
since it only runs off the response path):
- Add backend.PreloadModel / PreloadModelByName as the single load path
  for every modality (no transcription special-case; backend-omitted
  configs are deprecated).
- The realtime session blocks on Model.Warmup and returns a
  model_load_error to the client if any stage fails to load;
  updateSession warms in the background. Opt out per pipeline with
  pipeline.disable_warmup, exposed as a UI toggle via the
  config-metadata registry.

Add a LocalAI-native POST /backend/load (and /v1/backend/load) that
pre-loads a model -- expanding realtime pipelines into their sub-models
-- as the inverse of /backend/shutdown. There is one preload engine
(backend.PreloadStages): the realtime Warmup methods, /backend/load and
the --load-to-memory startup flag all use it, so --load-to-memory now
also expands pipeline models and records load-failure traces. Pipeline
sub-model alias resolution is likewise shared
(ModelConfigLoader.LoadResolvedModelConfig). Surface the endpoint
everywhere an admin manages models:
- MCP admin tool load_model (httpapi + inproc clients, safety/catalog
  prompts, catalog/dispatch tests).
- "Load into memory" action in the React models UI.
- Swagger regenerated; docs moved to the general backend-monitor page
  since it is not realtime-specific.
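
The preload behaviour described above (expand a pipeline into its sub-models, load them together, report both what loaded and what failed) can be sketched as follows. This is an illustrative Python analogue, not the Go code behind `backend.PreloadStages`: the names `expand_stages`, `preload_stages` and `fake_load`, the config key names, and the model names are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_stages(name, config):
    """Return the models to load: a pipeline's sub-models, or the model itself.

    The "pipeline" mapping layout used here is hypothetical.
    """
    pipeline = config.get("pipeline")
    if not pipeline:
        return [name]
    return [model for model in pipeline.values() if model]


def preload_stages(stages, load):
    """Load every stage concurrently; return (loaded_names, joined_error_or_None)."""

    def attempt(stage):
        try:
            load(stage)
            return stage, None
        except Exception as exc:  # collect the failure instead of aborting the batch
            return stage, exc

    with ThreadPoolExecutor(max_workers=max(1, len(stages))) as pool:
        results = list(pool.map(attempt, stages))  # map() preserves input order

    loaded = [stage for stage, exc in results if exc is None]
    failures = [f"{stage}: {exc}" for stage, exc in results if exc is not None]
    return loaded, ("; ".join(failures) or None)


def fake_load(model):
    """Stand-in loader (hypothetical): one model always fails."""
    if model == "broken-tts":
        raise RuntimeError("backend failed to start")


config = {
    "pipeline": {
        "vad": "silero-vad",
        "transcription": "whisper-base",
        "llm": "qwen-llm",
        "tts": "broken-tts",
    }
}
loaded, error = preload_stages(expand_stages("realtime-model", config), fake_load)
print(loaded, error)
```

A partial failure keeps the successfully loaded names, so a caller can report exactly which stages are resident and which are not.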

Fix a Traces UI crash ("json: unsupported value: -Inf"): audio-snippet
RMS/peak now floor at a finite dBFS, and backend-trace data is sanitized
to drop non-finite floats before marshaling. The sanitizer is
copy-on-write -- it runs on every RecordBackendTrace, so containers are
only re-allocated on the paths that actually changed.
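
A rough Python sketch of the two ideas in this fix: clamp dBFS values to a finite floor, and sanitize trace data copy-on-write so that unchanged containers are returned as-is. The real code is Go; the names `to_dbfs` and `sanitize`, the `-120.0` floor, and the trace field names are assumptions for illustration only.

```python
import itertools
import json
import math

DBFS_FLOOR = -120.0  # assumed floor; the commit does not state the real value
_DROP = object()     # sentinel meaning "remove this value from its container"


def to_dbfs(amplitude, floor=DBFS_FLOOR):
    """Convert a linear amplitude to dBFS, never returning -inf."""
    if amplitude <= 0.0:
        return floor
    return max(20.0 * math.log10(amplitude), floor)


def sanitize(value):
    """Drop non-finite floats; return the original object when nothing changed."""
    if isinstance(value, float):
        return value if math.isfinite(value) else _DROP
    if isinstance(value, dict):
        out = None  # allocated only once a change is seen
        for index, (key, item) in enumerate(value.items()):
            clean = sanitize(item)
            if clean is not item and out is None:
                # First change: copy the untouched entries seen so far.
                out = dict(itertools.islice(value.items(), index))
            if out is not None and clean is not _DROP:
                out[key] = clean
        return value if out is None else out
    if isinstance(value, list):
        out = None
        for index, item in enumerate(value):
            clean = sanitize(item)
            if clean is not item and out is None:
                out = value[:index]
            if out is not None and clean is not _DROP:
                out.append(clean)
        return value if out is None else out
    return value


trace = {"rms_dbfs": float("-inf"), "peak_dbfs": to_dbfs(0.5), "meta": {"model": "vad"}}
clean = sanitize(trace)
# allow_nan=False makes json.dumps reject non-finite floats, so this
# call would raise ValueError on the unsanitized trace.
encoded = json.dumps(clean, allow_nan=False)
print(encoded)
```

Only the top-level dict is rebuilt here; the nested `meta` dict is reused because nothing inside it changed.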

Migrate core/http/openresponses_test.go onto the prebuilt mock-backend
the rest of the http suite already uses -- it was the last spec still
pointing at a real HuggingFace model, so it 404'd wherever no vision
backend was built -- and fix its item_reference specs to send the
spec's "id" field instead of "item_id", which the handler never
accepted.

Assisted-by: Claude:claude-opus-4-8 Claude Code

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-07-03 18:00:37 +02:00
alaningtrump
80ec22945a refactor: use the built-in max/min to simplify the code (#10657)
Signed-off-by: alaningtrump <alaningtrump@outlook.com>
2026-07-03 17:59:26 +02:00
LocalAI [bot]
7a3583b52c fix(python-backends): parse tool-call arguments for chat templates and split implicit reasoning blocks (#10658)
Two bugs broke OpenAI-style tool calling on the MLX backend (and any
Python backend sharing backend/python/common), reproduced end-to-end on
LocalAI v4.5.5 with the metal-mlx backend and
mlx-community/Qwen3.5-2B-MLX-8bit.

messages_to_dicts left each tool call's function.arguments as the raw
OpenAI-wire JSON string. HuggingFace chat templates (e.g. Qwen3.5)
iterate arguments as a mapping (.items()), so any request whose history
contained a prior assistant tool_calls message failed with HTTP 500
"Generation failed: Can only get item pairs from a mapping." — breaking
every agent loop on its second turn. Decode the string back into a dict
so the template sees a mapping.
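
A minimal standalone sketch of the decoding step, assuming a hypothetical helper name `decode_tool_call_arguments` (in this change set the logic lives inside `messages_to_dicts`):

```python
import json


def decode_tool_call_arguments(tool_calls_json):
    """Parse a tool_calls JSON string and turn each function.arguments
    string into a dict, so a chat template can call .items() on it."""
    tool_calls = json.loads(tool_calls_json)
    for tc in tool_calls:
        fn = tc.get("function") if isinstance(tc, dict) else None
        if isinstance(fn, dict) and isinstance(fn.get("arguments"), str):
            try:
                fn["arguments"] = json.loads(fn["arguments"])
            except json.JSONDecodeError:
                pass  # malformed arguments stay as the original string
    return tool_calls


wire = json.dumps(
    [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"location": "Rome"}'},
        }
    ]
)
decoded = decode_tool_call_arguments(wire)
args = decoded[0]["function"]["arguments"]
print(sorted(args.items()))
```

Arguments that are already a mapping, or that are not valid JSON, pass through unchanged, which keeps the step safe to apply to any history.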

split_reasoning returned ("", text) whenever the opening think tag was
absent. Models like Qwen3.5 open the assistant turn already inside
thinking, so the generated text carries only the closing </think>; the
whole chain-of-thought leaked into content. When the opener is missing
but the closer is present, treat everything before the closer as
reasoning.
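
The implicit-opener rule can be sketched on its own as below. The function name `split_reasoning_sketch` is hypothetical, and the explicit-opener branch is simplified to plain string partitioning rather than the regex the real helper uses.

```python
def split_reasoning_sketch(text, think_start="<think>", think_end="</think>"):
    """Return (reasoning, content) for a model response."""
    if not text or not think_start:
        return "", text
    if think_start not in text:
        # Implicit opener: the turn started inside thinking, so only the
        # closing tag is present. Everything before it is reasoning.
        if think_end and think_end in text:
            head, _, tail = text.partition(think_end)
            return head.strip(), tail.strip()
        return "", text
    # Explicit opener: take the first opener..closer span as reasoning.
    before, _, rest = text.partition(think_start)
    if think_end:
        reasoning, _, after = rest.partition(think_end)
    else:
        reasoning, after = rest, ""
    return reasoning.strip(), (before + after).strip()


implicit = split_reasoning_sketch(
    "The user is asking about the weather.\n</think>\n\nThe weather in Rome is sunny."
)
print(implicit)
```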

Adds platform-independent unit tests under backend/python/common
(stdlib-only, no MLX/venv required, following parent_watch_test.py).

Assisted-by: Claude Code:claude-opus-4-8

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 12:13:37 +02:00
LocalAI [bot]
715d4ed8e5 chore: ⬆️ Update ggml-org/llama.cpp to fdb1db877c526ec90f668eca1b858da5dba85560 (#10647)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:56 +02:00
LocalAI [bot]
9fcc9c0d43 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 87fc8701ff4da81a7d2a91ec0695f95eb3066a47 (#10649)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:41 +02:00
LocalAI [bot]
3c67b5b746 chore: ⬆️ Update CrispStrobe/CrispASR to 9a26976a8c8cf5af0afcdd04463cf8ba91e96a54 (#10648)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:25 +02:00
LocalAI [bot]
bea66fd84e chore: ⬆️ Update leejet/stable-diffusion.cpp to 2574f5936571645f784b77623e1f09bad97d948a (#10650)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:46:10 +02:00
LocalAI [bot]
f7a5dfd5ae chore: ⬆️ Update vllm-metal (darwin) to v0.3.0.dev20260701212152 (#10646)
⬆️ Update vllm-project/vllm-metal (darwin)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:45:36 +02:00
LocalAI [bot]
6bcaf30c14 chore: ⬆️ Update localai-org/privacy-filter.cpp to 735a6c28607ee82afc3a670383f41b55266a3b9a (#10628)
⬆️ Update localai-org/privacy-filter.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-07-03 00:45:17 +02:00
LocalAI [bot]
ef15b4bfda fix(vllm): install ROCm vLLM from the AMD wheel index on Python 3.12 (#10651)
* fix(vllm): install ROCm vLLM from the AMD wheel index on Python 3.12

The rocm-vllm backend crashed at load with "No module named 'vllm'".
requirements-hipblas-after.txt requested a bare `vllm`, which resolves to
the CUDA-only PyPI wheel; that wheel is unusable on an AMD GPU. vLLM's
prebuilt ROCm wheels live on a dedicated index (https://wheels.vllm.ai/rocm/)
and are published only for CPython 3.12, so on the backend's default 3.10
the installer silently falls back to the CUDA wheel.

Add a hipblas branch to backend/python/vllm/install.sh that pins Python to
3.12 and installs vllm from the ROCm wheel index, hiding the bare-`vllm`
after-file so installRequirements installs only the base ROCm
torch/transformers first and does not pull the CUDA wheel.

Fixes #10642

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(vllm): drop the dead hipblas-after requirement and its hide dance

requirements-hipblas-after.txt (a bare `vllm`) is never installed for
hipblas: installRequirements only adds requirements-${BUILD_PROFILE}-after.txt
when BUILD_TYPE != BUILD_PROFILE, and for hipblas they are equal. So the file
was dead and the install.sh hide/restore of it was a no-op. Remove both. The
hipblas branch already installs vllm explicitly from the ROCm wheel index, so
deleting the bare-`vllm` file also removes a latent CUDA-wheel trap should the
installRequirements gap ever be closed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 00:44:55 +02:00
LocalAI [bot]
237bce48e8 feat(ui): forking chat - retry any answer, copy, duplicate, branch (#10645) (#10654)
* feat(ui): clone a chat into a new conversation (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): retry any assistant answer, not just the last (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): copy an entire chat to the clipboard (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): branch a new chat from any assistant answer (#10645)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): send truncated history on mid-conversation retry (#10645)

Mid-conversation retry regenerated an answer with the downstream turns
still in the model's context. handleRegenerate truncated the DOM history
via updateChatSettings (a scheduled state update), but the synchronous
sendMessage that followed read the stale, pre-truncation history from its
closure to build the outbound API payload. Thread the intended base
history explicitly through sendMessage's options.baseHistory so the
request body matches the truncated view. Backward compatible: the normal
send path (no baseHistory) is unchanged.
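
The stale-history problem is not specific to React. The Python analogue below is only an illustration of the pattern (a deferred state update followed by a synchronous read), not the Chat.jsx code; `ChatSession` and its methods are hypothetical names.

```python
class ChatSession:
    """Toy model of UI state whose updates are applied later, not immediately."""

    def __init__(self, history):
        self.history = list(history)
        self._pending = []

    def schedule_history_update(self, new_history):
        # Like a scheduled state update: queued now, visible after flush().
        self._pending.append(list(new_history))

    def flush(self):
        for update in self._pending:
            self.history = update
        self._pending.clear()

    def build_payload(self, base_history=None):
        # An explicit base_history wins over the (possibly stale) current state.
        source = self.history if base_history is None else base_history
        return list(source)


session = ChatSession(["user 1", "assistant 1", "user 2", "assistant 2"])

# Retry "assistant 1": keep only the turns before it.
truncated = session.history[:1]
session.schedule_history_update(truncated)

stale_payload = session.build_payload()                        # reads pre-truncation state
fixed_payload = session.build_payload(base_history=truncated)  # matches the truncated view
print(len(stale_payload), len(fixed_payload))
```

Passing the intended history explicitly removes the dependency on when the queued update is applied, and leaves callers that pass nothing unaffected.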

Also guard two minor issues in Chat.jsx: the "Branch from here" button now
renders under !isStreaming to match the retry button, and the duplicate
toast only fires when forkChat returns a chat (not on a null result).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 00:04:44 +02:00
71 changed files with 1926 additions and 234 deletions

View File

@@ -1,5 +1,5 @@
-IK_LLAMA_VERSION?=068b173649f2fd8dc96b35ada5a0b76d8985105d
+IK_LLAMA_VERSION?=87fc8701ff4da81a7d2a91ec0695f95eb3066a47
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
 CMAKE_ARGS?=

View File

@@ -1,5 +1,5 @@
-LLAMA_VERSION?=4fc4ec5541b243957ae5099edb67372f8f3b550e
+LLAMA_VERSION?=fdb1db877c526ec90f668eca1b858da5dba85560
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
 CMAKE_ARGS?=

View File

@@ -8,7 +8,7 @@
 # Local development: point at a working checkout instead of cloning, e.g.
 # make PRIVACY_FILTER_SRC=$HOME/c/privacy-filter.cpp grpc-server
-PRIVACY_FILTER_VERSION?=595f59630c69d361b5196f2aba2c71c873d0c13c
+PRIVACY_FILTER_VERSION?=735a6c28607ee82afc3a670383f41b55266a3b9a
 PRIVACY_FILTER_REPO?=https://github.com/localai-org/privacy-filter.cpp
 PRIVACY_FILTER_SRC?=

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # CrispASR version (release tag)
 CRISPASR_REPO?=https://github.com/CrispStrobe/CrispASR
-CRISPASR_VERSION?=fcbc8718e654995e3bd2d0c98bcb8e55e297d23c
+CRISPASR_VERSION?=9a26976a8c8cf5af0afcdd04463cf8ba91e96a54
 SO_TARGET?=libgocrispasr.so
 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=3590aa8d626e671a1b1dc84506ea2932a243a480
+STABLEDIFFUSION_GGML_VERSION?=2574f5936571645f784b77623e1f09bad97d948a
 CMAKE_ARGS+=-DGGML_MAX_NAME=128

View File

@@ -20,7 +20,15 @@ def split_reasoning(text, think_start, think_end):
     Returns ``(reasoning_content, remaining_text)``. When ``think_start`` is
     empty or not found, returns ``("", text)`` unchanged.
     """
-    if not think_start or not text or think_start not in text:
+    if not think_start or not text:
+        return "", text
+    if think_start not in text:
+        # Models like Qwen3.5 open assistant turns already INSIDE thinking, so
+        # the generated text carries only the closing tag. Everything before it
+        # is reasoning that would otherwise leak into the content.
+        if think_end and think_end in text:
+            head, _, tail = text.partition(think_end)
+            return head.strip(), tail.strip()
         return "", text
     pattern = re.compile(
         re.escape(think_start) + r"(.*?)" + re.escape(think_end or ""),

View File

@@ -0,0 +1,75 @@
+"""Unit tests for the mlx/mlx-vlm shared helpers (mlx_utils.py).
+
+Run standalone (Python standard library only, no backend venv needed):
+    python3 -m unittest mlx_utils_test
+
+These mirror the server-less helper tests in backend/python/mlx/test.py
+(TestSharedHelpers), but live here so they run on any platform: the mlx
+test module imports grpc/backend_pb2 at import time and needs the MLX venv,
+whereas mlx_utils only needs the standard library.
+"""
+
+import types
+import unittest
+
+from mlx_utils import parse_tool_calls, split_reasoning
+
+
+class TestSplitReasoning(unittest.TestCase):
+    def test_both_tags(self):
+        r, c = split_reasoning(
+            "<think>step 1\nstep 2</think>The answer is 42.", "<think>", "</think>"
+        )
+        self.assertEqual(r, "step 1\nstep 2")
+        self.assertEqual(c, "The answer is 42.")
+
+    def test_implicit_opener_only_closing_tag(self):
+        # Qwen3.5 opens the assistant turn already inside thinking, so the
+        # output carries only the closing tag; everything before it is reasoning.
+        r, c = split_reasoning(
+            "The user is asking about the weather.\n</think>\n\nThe weather in Rome is sunny.",
+            "<think>",
+            "</think>",
+        )
+        self.assertEqual(r, "The user is asking about the weather.")
+        self.assertEqual(c, "The weather in Rome is sunny.")
+
+    def test_no_tags_at_all(self):
+        r, c = split_reasoning("just text", "<think>", "</think>")
+        self.assertEqual(r, "")
+        self.assertEqual(c, "just text")
+
+    def test_empty_think_end_and_no_opener_match(self):
+        # No think_end to anchor on, and the opener is absent → return unchanged.
+        r, c = split_reasoning("no opener here", "<think>", "")
+        self.assertEqual(r, "")
+        self.assertEqual(c, "no opener here")
+
+    def test_empty_text(self):
+        r, c = split_reasoning("", "<think>", "</think>")
+        self.assertEqual(r, "")
+        self.assertEqual(c, "")
+
+
+class TestParseToolCalls(unittest.TestCase):
+    def test_with_shim(self):
+        tm = types.SimpleNamespace(
+            tool_call_start="<tool_call>",
+            tool_call_end="</tool_call>",
+            parse_tool_call=lambda body, tools: {
+                "name": "get_weather",
+                "arguments": {"location": body.strip()},
+            },
+        )
+        calls, remaining = parse_tool_calls(
+            "Sure: <tool_call>Paris</tool_call>", tm, tools=None
+        )
+        self.assertEqual(len(calls), 1)
+        self.assertEqual(calls[0]["name"], "get_weather")
+        self.assertEqual(calls[0]["arguments"], '{"location": "Paris"}')
+        self.assertEqual(calls[0]["index"], 0)
+        self.assertNotIn("<tool_call>", remaining)
+
+
+if __name__ == "__main__":
+    unittest.main()

View File

@@ -58,7 +58,18 @@ def messages_to_dicts(proto_messages):
             d["reasoning_content"] = msg.reasoning_content
         if msg.tool_calls:
             try:
-                d["tool_calls"] = json.loads(msg.tool_calls)
+                tool_calls = json.loads(msg.tool_calls)
+                # Chat templates (e.g. Qwen) iterate function.arguments as a
+                # mapping, but the OpenAI wire format carries it as a JSON
+                # string — decode it back so the template's .items() works.
+                for tc in tool_calls:
+                    fn = tc.get("function") if isinstance(tc, dict) else None
+                    if isinstance(fn, dict) and isinstance(fn.get("arguments"), str):
+                        try:
+                            fn["arguments"] = json.loads(fn["arguments"])
+                        except json.JSONDecodeError:
+                            pass
+                d["tool_calls"] = tool_calls
             except json.JSONDecodeError:
                 pass
         result.append(d)

View File

@@ -0,0 +1,122 @@
+"""Unit tests for the shared python backend helpers (python_utils.py).
+
+Run standalone (Python standard library only, no backend venv needed):
+    python3 -m unittest python_utils_test
+
+These mirror the server-less helper tests in backend/python/mlx/test.py
+(TestSharedHelpers), but live here so they run on any platform: the mlx
+test module imports grpc/backend_pb2 at import time and needs the MLX venv,
+whereas python_utils has no third-party dependency. Proto Message objects
+are faked with types.SimpleNamespace (real proto fields default to "").
+"""
+
+import json
+import types
+import unittest
+
+from python_utils import messages_to_dicts, parse_options
+
+
+def _msg(**fields):
+    """Fake a proto Message: every unset field is the empty string, as protobuf."""
+    defaults = {
+        "role": "",
+        "content": "",
+        "name": "",
+        "tool_call_id": "",
+        "reasoning_content": "",
+        "tool_calls": "",
+    }
+    defaults.update(fields)
+    return types.SimpleNamespace(**defaults)
+
+
+class TestParseOptions(unittest.TestCase):
+    def test_type_inference(self):
+        opts = parse_options(
+            ["temperature:0.7", "max_tokens:128", "trust:true", "name:hello", "no_colon_skipped"]
+        )
+        self.assertEqual(opts["temperature"], 0.7)
+        self.assertEqual(opts["max_tokens"], 128)
+        self.assertIs(opts["trust"], True)
+        self.assertEqual(opts["name"], "hello")
+        self.assertNotIn("no_colon_skipped", opts)
+
+
+class TestMessagesToDicts(unittest.TestCase):
+    def test_basic_fields(self):
+        out = messages_to_dicts(
+            [
+                _msg(role="user", content="hi"),
+                _msg(role="tool", content="42", tool_call_id="call_1", name="f"),
+            ]
+        )
+        self.assertEqual(out[0], {"role": "user", "content": "hi"})
+        self.assertEqual(out[1]["tool_call_id"], "call_1")
+        self.assertEqual(out[1]["name"], "f")
+
+    def test_tool_call_arguments_string_decoded_to_mapping(self):
+        # OpenAI wire format ships function.arguments as a JSON *string*; chat
+        # templates iterate it as a mapping, so it must come back as a dict.
+        out = messages_to_dicts(
+            [
+                _msg(
+                    role="assistant",
+                    tool_calls=json.dumps(
+                        [
+                            {
+                                "id": "call_1",
+                                "type": "function",
+                                "function": {
+                                    "name": "get_weather",
+                                    "arguments": '{"location": "Rome"}',
+                                },
+                            }
+                        ]
+                    ),
+                )
+            ]
+        )
+        args = out[0]["tool_calls"][0]["function"]["arguments"]
+        self.assertEqual(args, {"location": "Rome"})
+        self.assertEqual(dict(args.items()), {"location": "Rome"})
+
+    def test_tool_call_arguments_already_mapping_is_idempotent(self):
+        out = messages_to_dicts(
+            [
+                _msg(
+                    role="assistant",
+                    tool_calls=json.dumps(
+                        [{"function": {"name": "f", "arguments": {"a": 1}}}]
+                    ),
+                )
+            ]
+        )
+        self.assertEqual(out[0]["tool_calls"][0]["function"]["arguments"], {"a": 1})
+
+    def test_tool_call_arguments_invalid_json_left_as_string(self):
+        out = messages_to_dicts(
+            [
+                _msg(
+                    role="assistant",
+                    tool_calls=json.dumps(
+                        [{"function": {"name": "f", "arguments": "not-json"}}]
+                    ),
+                )
+            ]
+        )
+        self.assertEqual(out[0]["tool_calls"][0]["function"]["arguments"], "not-json")
+
+    def test_tool_call_without_function_key(self):
+        out = messages_to_dicts(
+            [_msg(role="assistant", tool_calls=json.dumps([{"id": "call_1"}]))]
+        )
+        self.assertEqual(out[0]["tool_calls"], [{"id": "call_1"}])
+
+    def test_tool_calls_invalid_json_dropped(self):
+        out = messages_to_dicts([_msg(role="assistant", tool_calls="{not json")])
+        self.assertNotIn("tool_calls", out[0])
+
+
+if __name__ == "__main__":
+    unittest.main()

View File

@@ -35,6 +35,21 @@ if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
     EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
 fi
 
+# AMD ROCm: vLLM ships prebuilt ROCm wheels, but on a DEDICATED index
+# (https://wheels.vllm.ai/rocm/), NOT PyPI, and ONLY for CPython 3.12. On any
+# other Python the installer silently falls back to the CUDA-only PyPI wheel,
+# which is unusable on an AMD GPU (import fails, so the backend never finds the
+# vllm module). Force Python 3.12 before the venv is created (matches the
+# intel/l4t13 cp312 bump); the hipblas branch below pulls vllm from the ROCm
+# wheel index. unsafe-best-match lets uv consult that index and PyPI together.
+# https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm
+if [ "x${BUILD_TYPE}" == "xhipblas" ]; then
+    PYTHON_VERSION="3.12"
+    PYTHON_PATCH="12"
+    PY_STANDALONE_TAG="20251120"
+    EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
+fi
+
 # cublas13 pulls the vLLM wheel from a per-tag cu130 index (PyPI's vllm wheel
 # is built against CUDA 12 and won't load on cu130). uv's default per-package
 # first-match strategy would still pick the PyPI wheel, so allow it to consult
@@ -104,7 +119,7 @@ if [ "$(uname -s)" = "Darwin" ]; then
     # can rewrite it. Darwin therefore follows vllm-metal and can lag the Linux
     # vllm pin (requirements-cublas13-after.txt, bumped independently against
     # vllm/vllm) until vllm-metal supports a newer vLLM.
-    VLLM_METAL_VERSION="v0.3.0.dev20260701132215"
+    VLLM_METAL_VERSION="v0.3.0.dev20260701212152"
     # The coupled vLLM source version is whatever this vllm-metal release builds
     # against -- it declares it in its own installer as `vllm_v=`. Derive it from
@@ -194,6 +209,22 @@ elif [ "x${BUILD_TYPE}" == "xintel" ]; then
     export CMAKE_PREFIX_PATH="$(python -c 'import site; print(site.getsitepackages()[0])'):${CMAKE_PREFIX_PATH:-}"
     VLLM_TARGET_DEVICE=xpu uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} --no-deps .
     popd
+# AMD ROCm: install vllm from its dedicated ROCm wheel index instead of the
+# CUDA-only PyPI wheel. installRequirements brings the base ROCm
+# torch/transformers (requirements-hipblas.txt), then we pull vllm (plus the
+# matching ROCm torch, via --upgrade) from wheels.vllm.ai/rocm. This is the
+# method upstream prescribes for AMD; the Python-3.12 pin is set above.
+# There is intentionally no requirements-hipblas-after.txt: a bare `vllm`
+# there would resolve to the CUDA wheel, and installRequirements never loads
+# a ${BUILD_TYPE}-after file for hipblas anyway (BUILD_TYPE == BUILD_PROFILE).
+# https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm
+elif [ "x${BUILD_TYPE}" == "xhipblas" ]; then
+    installRequirements
+    # --upgrade reconciles the base ROCm torch to whatever the vllm ROCm wheel
+    # pins; --extra-index-url adds the ROCm wheel repository on top of PyPI.
+    uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} \
+        --extra-index-url https://wheels.vllm.ai/rocm/ --upgrade vllm
 # FROM_SOURCE=true on a CPU build skips the prebuilt vllm wheel in
 # requirements-cpu-after.txt and compiles vllm locally against the host's
 # actual CPU. Not used by default because it takes ~30-40 minutes, but

View File

@@ -1 +0,0 @@
-vllm

View File

@@ -473,20 +473,13 @@ func New(opts ...config.AppOption) (*Application, error) {
 	if options.LoadToMemory != nil && !options.SingleBackend {
 		for _, m := range options.LoadToMemory {
-			cfg, err := application.ModelConfigLoader().LoadModelConfigFileByNameDefaultOptions(m, options)
-			if err != nil {
+			xlog.Debug("Auto loading model into memory from file", "model", m)
+			// Same path as POST /backend/load: a realtime pipeline model expands
+			// to its sub-models, and load failures are recorded as model_load
+			// traces.
+			if _, err := backend.PreloadModelByName(options.Context, application.ModelConfigLoader(), application.ModelLoader(), options, m); err != nil {
 				return nil, err
 			}
-			xlog.Debug("Auto loading model into memory from file", "model", m, "file", cfg.Model)
-			o := backend.ModelOptions(*cfg, options)
-			var backendErr error
-			_, backendErr = application.ModelLoader().Load(o...)
-			if backendErr != nil {
-				return nil, backendErr
-			}
 		}
 	}

View File

@@ -157,33 +157,6 @@ var _ = Describe("X-LocalAI-Node ctx propagation contract", func() {
 		stampViaRouterCtx()
 	})
-	// Regression for #10636: a canceled request context must NOT cancel the
-	// model LOAD. The heavy image/audio backends bind the load to the request
-	// context so the routing holder reaches the SmartRouter; but a large
-	// diffusers/LLM model on a slow (e.g. shared-memory iGPU) host can take
-	// far longer to load than the client stays connected. If the request's
-	// cancellation propagates to the load, the LoadModel RPC is aborted, the
-	// backend process is torn down, and every retry restarts from scratch and
-	// never converges. The load must instead run to completion and cache while
-	// still carrying the request's routing holder value.
-	It("ImageGeneration does not propagate request cancellation to the model load", func() {
-		canceledCtx, cancel := context.WithCancel(reqCtx)
-		cancel() // client disconnected while the (slow) load was still running
-		_, err := backend.ImageGeneration(canceledCtx, 64, 64, 1, 0, "p", "", "", "/tmp/out.png", loader, modelCfg, appCfg, nil)
-		// The load reached the router (short-circuit sentinel), i.e. it was
-		// NOT aborted early by the already-canceled request context.
-		Expect(err).To(HaveOccurred())
-		Expect(err.Error()).To(ContainSubstring("router short-circuit (test)"))
-		routerCtx := routerCtxOf()
-		Expect(routerCtx).ToNot(BeNil(), "router callback must have been invoked")
-		Expect(routerCtx.Err()).To(BeNil(),
-			"a canceled request must not cancel the model load")
-		// The routing holder value still propagates despite the decoupling.
-		stampViaRouterCtx()
-	})
 	It("does NOT leak the holder when the app context is used instead", func() {
 		// Sanity: the bug being fixed manifests as the router getting
 		// appCfg.Context (no holder) instead of reqCtx (holder). A direct

View File

@@ -40,14 +40,10 @@ func (e *modelEmbedder) Embed(ctx context.Context, text string) ([]float32, erro
 func ModelEmbedding(ctx context.Context, s string, tokens []int, loader *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (func() ([]float32, error), error) {
-	// model.WithContext carries the request context into the load so distributed
-	// routing decisions reach the request's X-LocalAI-Node holder via
-	// distributedhdr.Stamp. context.WithoutCancel keeps those values but drops
-	// the request's cancellation, so a slow first load still completes and
-	// caches if the client disconnects instead of aborting the LoadModel RPC and
-	// tearing down the backend process (issue #10636). Inference below keeps the
-	// cancellable ctx, so a disconnect still stops generation.
-	opts := ModelOptions(modelConfig, appConfig, model.WithContext(context.WithoutCancel(ctx)))
+	// model.WithContext(ctx) overrides the app-context default set in
+	// ModelOptions so distributed routing decisions reach the request's
+	// X-LocalAI-Node holder via distributedhdr.Stamp.
+	opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
 	inferenceModel, err := loader.Load(opts...)
 	if err != nil {

View File

@@ -13,14 +13,10 @@ import (
 func ImageGeneration(ctx context.Context, height, width, step, seed int, positive_prompt, negative_prompt, src, dst string, loader *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig, refImages []string) (func() error, error) {
-	// model.WithContext carries the request context into the load so distributed
-	// routing decisions reach the request's X-LocalAI-Node holder via
-	// distributedhdr.Stamp. context.WithoutCancel keeps those values but drops
-	// the request's cancellation, so a slow first load still completes and
-	// caches if the client disconnects instead of aborting the LoadModel RPC and
-	// tearing down the backend process (issue #10636). Inference below keeps the
-	// cancellable ctx, so a disconnect still stops generation.
-	opts := ModelOptions(modelConfig, appConfig, model.WithContext(context.WithoutCancel(ctx)))
+	// model.WithContext(ctx) overrides the app-context default set in
+	// ModelOptions so distributed routing decisions reach the request's
+	// X-LocalAI-Node holder via distributedhdr.Stamp.
+	opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
 	inferenceModel, err := loader.Load(
 		opts...,
 	)

View File

@@ -111,12 +111,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
 	}
 	ctx = distributedhdr.MaybeWithPrefixChain(ctx, c.ModelID(), chainSource)
-	// context.WithoutCancel decouples the model load from the request's
-	// cancellation while preserving its routing values, so a slow load still
-	// completes and caches if the client disconnects instead of aborting the
-	// LoadModel RPC mid-load (issue #10636). Inference below keeps the
-	// cancellable ctx, so a disconnect still stops generation.
-	opts := ModelOptions(*c, o, model.WithContext(context.WithoutCancel(ctx)))
+	opts := ModelOptions(*c, o, model.WithContext(ctx))
 	inferenceModel, err := loader.Load(opts...)
 	if err != nil {
 		recordModelLoadFailure(o, c.Name, c.Backend, err, map[string]any{"model_file": modelFile})

View File

@@ -52,6 +52,22 @@ func ModelLoadTraceObserver(appConfig *config.ApplicationConfig) func(model.Back
 	}
 }
 
+// PreloadModel warms a model into memory without running any inference, so the
+// first real request doesn't pay the backend's cold-start load cost. It uses
+// the same ModelOptions + ml.Load path the modality functions use, so a
+// subsequent inference call hits the loader cache instead of reloading. Load
+// failures are recorded and returned; callers that warm models opportunistically
+// (e.g. realtime session warm-up) typically log and continue, since the lazy
+// path will retry on first use.
+func PreloadModel(ctx context.Context, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) error {
+	opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
+	if _, err := ml.Load(opts...); err != nil {
+		recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
+		return err
+	}
+	return nil
+}
+
 // recordModelLoadFailure records a backend trace when model loading fails.
 func recordModelLoadFailure(appConfig *config.ApplicationConfig, modelName, backend string, err error, data map[string]any) {
 	if !appConfig.EnableTracing {

core/backend/preload.go (new file, 122 lines)
View File

@@ -0,0 +1,122 @@
package backend
import (
"context"
"errors"
"fmt"
"sync"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// PreloadModelByName loads the named model into memory so the first request
// that uses it pays no cold-start load cost — the inverse of shutting a model
// down. If the model is a realtime pipeline (its config declares a `pipeline:`
// block), each configured sub-model (VAD, transcription, LLM, TTS,
// sound_detection, voice_recognition) is loaded concurrently instead of the
// pipeline stub, which has no backend of its own. It returns the model names
// actually loaded and a joined error naming each sub-model that failed (nil on
// full success); a partial pipeline load reports both the loaded names and the
// failures so the caller can surface exactly what is and isn't resident.
// Compaction's summary_model is deliberately left cold: it is only invoked off
// the response path, so it can stay lazy.
func PreloadModelByName(ctx context.Context, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, name string) ([]string, error) {
cfg, err := cl.LoadModelConfigFileByNameDefaultOptions(name, appConfig)
if err != nil {
return nil, err
}
stages, err := pipelineStages(cl, &cfg.Pipeline, ml.ModelPath)
if err != nil {
return nil, err
}
if len(stages) == 0 {
// Not a pipeline: load the model's own backend directly.
if err := PreloadModel(ctx, ml, *cfg, appConfig); err != nil {
return nil, err
}
return []string{cfg.Name}, nil
}
return PreloadStages(ctx, ml, appConfig, stages)
}
// PreloadStage names one pipeline sub-model to preload and the resolved config
// to load it from (nil = stage absent, skipped). Role labels the pipeline slot
// in errors and logs.
type PreloadStage struct {
Role string
Cfg *config.ModelConfig
}
// loadStage is PreloadModel behind a seam so PreloadStages can be unit-tested
// without spawning real backends.
var loadStage = PreloadModel
// pipelineStages resolves each populated pipeline stage to its concrete model
// config, following a single alias hop — the same resolution the realtime
// pipeline itself uses. A stage that fails to resolve is a misconfiguration,
// so it fails fast rather than being deferred to load. A pipeline with no
// stages set returns nil, which callers treat as "not a pipeline".
func pipelineStages(cl *config.ModelConfigLoader, p *config.Pipeline, modelPath string) ([]PreloadStage, error) {
voiceRec := ""
if p.VoiceRecognition != nil {
voiceRec = p.VoiceRecognition.Model
}
var stages []PreloadStage
for _, s := range []struct{ role, name string }{
{"vad", p.VAD},
{"transcription", p.Transcription},
{"llm", p.LLM},
{"tts", p.TTS},
{"sound_detection", p.SoundDetection},
{"voice_recognition", voiceRec},
} {
if s.name == "" {
continue
}
cfg, err := cl.LoadResolvedModelConfig(s.name, modelPath)
if err != nil {
return nil, fmt.Errorf("%s (%s): %w", s.role, s.name, err)
}
stages = append(stages, PreloadStage{Role: s.role, Cfg: cfg})
}
return stages, nil
}
// PreloadStages loads every present stage at once and waits for all of them, so
// a pipeline warms in the time of its slowest stage rather than the sum. Absent
// (nil-config) stages are skipped. A failed stage does not cancel the others —
// they all run to completion so the joined error names every broken stage at
// once, alongside the names that did load.
func PreloadStages(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, stages []PreloadStage) ([]string, error) {
var (
wg sync.WaitGroup
mu sync.Mutex
loaded []string
errs []error
)
for _, s := range stages {
if s.Cfg == nil {
continue
}
wg.Add(1)
go func(s PreloadStage) {
defer wg.Done()
if err := loadStage(ctx, ml, *s.Cfg, appConfig); err != nil {
xlog.Warn("preload: failed to load pipeline sub-model", "stage", s.Role, "model", s.Cfg.Name, "error", err)
mu.Lock()
errs = append(errs, fmt.Errorf("%s (%s): %w", s.Role, s.Cfg.Name, err))
mu.Unlock()
return
}
xlog.Debug("preload: loaded pipeline sub-model", "stage", s.Role, "model", s.Cfg.Name)
mu.Lock()
loaded = append(loaded, s.Cfg.Name)
mu.Unlock()
}(s)
}
wg.Wait()
return loaded, errors.Join(errs...)
}

View File

@@ -0,0 +1,146 @@
package backend
import (
"context"
"errors"
"os"
"path/filepath"
"sync"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("pipelineStages", func() {
seed := func(dir string, names ...string) *config.ModelConfigLoader {
for _, n := range names {
yaml := "name: " + n + "\nbackend: fake-backend\n"
Expect(os.WriteFile(filepath.Join(dir, n+".yaml"), []byte(yaml), 0o644)).To(Succeed())
}
cl := config.NewModelConfigLoader(dir)
Expect(cl.LoadModelConfigsFromPath(dir)).To(Succeed())
return cl
}
It("resolves only the populated stages, in load order", func() {
dir := GinkgoT().TempDir()
cl := seed(dir, "vad-m", "stt-m", "llm-m", "tts-m")
stages, err := pipelineStages(cl, &config.Pipeline{
VAD: "vad-m",
Transcription: "stt-m",
LLM: "llm-m",
TTS: "tts-m",
}, dir)
Expect(err).ToNot(HaveOccurred())
roles := make([]string, len(stages))
names := make([]string, len(stages))
for i, s := range stages {
roles[i] = s.Role
names[i] = s.Cfg.Name
}
Expect(roles).To(Equal([]string{"vad", "transcription", "llm", "tts"}))
Expect(names).To(Equal([]string{"vad-m", "stt-m", "llm-m", "tts-m"}))
})
It("skips unset stages and includes sound_detection and voice_recognition when set", func() {
dir := GinkgoT().TempDir()
cl := seed(dir, "stt-m", "ced", "spk")
stages, err := pipelineStages(cl, &config.Pipeline{
Transcription: "stt-m",
SoundDetection: "ced",
VoiceRecognition: &config.PipelineVoiceRecognition{Model: "spk"},
}, dir)
Expect(err).ToNot(HaveOccurred())
roles := make([]string, len(stages))
for i, s := range stages {
roles[i] = s.Role
}
Expect(roles).To(ConsistOf("transcription", "sound_detection", "voice_recognition"))
})
It("returns nil for a pipeline with no stages (not a pipeline)", func() {
dir := GinkgoT().TempDir()
cl := seed(dir)
stages, err := pipelineStages(cl, &config.Pipeline{}, dir)
Expect(err).ToNot(HaveOccurred())
Expect(stages).To(BeNil())
})
})
var _ = Describe("PreloadStages", func() {
var (
mu sync.Mutex
seen []string
)
// stubLoader swaps the loadStage seam for a recorder so no real backends
// are spawned; errFor injects per-model failures.
stubLoader := func(errFor map[string]error) {
loadStage = func(_ context.Context, _ *model.ModelLoader, cfg config.ModelConfig, _ *config.ApplicationConfig) error {
mu.Lock()
seen = append(seen, cfg.Name)
mu.Unlock()
return errFor[cfg.Name]
}
}
BeforeEach(func() {
seen = nil
})
AfterEach(func() {
loadStage = PreloadModel
})
mkStage := func(role, name string) PreloadStage {
return PreloadStage{Role: role, Cfg: &config.ModelConfig{Name: name}}
}
It("loads every present stage, skips absent (nil-config) ones, and returns the loaded names", func() {
stubLoader(nil)
loaded, err := PreloadStages(context.Background(), nil, nil, []PreloadStage{
mkStage("vad", "vad-m"),
{Role: "transcription"}, // absent stage
mkStage("llm", "llm-m"),
})
Expect(err).ToNot(HaveOccurred())
Expect(loaded).To(ConsistOf("vad-m", "llm-m"))
// Barrier: every stage has run by the time PreloadStages returns, so
// reading seen without the lock here is safe.
Expect(seen).To(ConsistOf("vad-m", "llm-m"))
})
It("reports a joined error naming each failed stage while still loading the rest", func() {
stubLoader(map[string]error{
"vad-m": errors.New("vad boom"),
"tts-m": errors.New("tts boom"),
})
loaded, err := PreloadStages(context.Background(), nil, nil, []PreloadStage{
mkStage("vad", "vad-m"),
mkStage("llm", "llm-m"),
mkStage("tts", "tts-m"),
})
// Every stage ran (a failure does not cancel the others)...
Expect(seen).To(ConsistOf("vad-m", "llm-m", "tts-m"))
// ...the stage that loaded fine is reported as loaded...
Expect(loaded).To(ConsistOf("llm-m"))
// ...and the joined error names every broken stage and its cause.
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("vad (vad-m)"))
Expect(err.Error()).To(ContainSubstring("vad boom"))
Expect(err.Error()).To(ContainSubstring("tts (tts-m)"))
Expect(err.Error()).To(ContainSubstring("tts boom"))
Expect(err.Error()).ToNot(ContainSubstring("llm"))
})
})

View File

@@ -57,14 +57,10 @@ func (r *modelReranker) Rerank(ctx context.Context, query string, documents []st
 }
 func Rerank(ctx context.Context, request *proto.RerankRequest, loader *model.ModelLoader, appConfig *config.ApplicationConfig, modelConfig config.ModelConfig) (*proto.RerankResult, error) {
-// model.WithContext carries the request context into the load so distributed
-// routing decisions reach the request's X-LocalAI-Node holder via
-// distributedhdr.Stamp. context.WithoutCancel keeps those values but drops
-// the request's cancellation, so a slow first load still completes and
-// caches if the client disconnects instead of aborting the LoadModel RPC and
-// tearing down the backend process (issue #10636). Inference below keeps the
-// cancellable ctx, so a disconnect still stops generation.
-opts := ModelOptions(modelConfig, appConfig, model.WithContext(context.WithoutCancel(ctx)))
+// model.WithContext(ctx) overrides the app-context default set in
+// ModelOptions so distributed routing decisions reach the request's
+// X-LocalAI-Node holder via distributedhdr.Stamp.
+opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
 rerankModel, err := loader.Load(opts...)
 if err != nil {
 recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)

View File

@@ -45,14 +45,10 @@ func loadTranscriptionModel(ctx context.Context, ml *model.ModelLoader, modelCon
 if modelConfig.Backend == "" {
 modelConfig.Backend = model.WhisperBackend
 }
-// model.WithContext carries the request context into the load so distributed
-// routing decisions reach the request's X-LocalAI-Node holder via
-// distributedhdr.Stamp. context.WithoutCancel keeps those values but drops
-// the request's cancellation, so a slow first load still completes and
-// caches if the client disconnects instead of aborting the LoadModel RPC and
-// tearing down the backend process (issue #10636). Inference below keeps the
-// cancellable ctx, so a disconnect still stops generation.
-opts := ModelOptions(modelConfig, appConfig, model.WithContext(context.WithoutCancel(ctx)))
+// model.WithContext(ctx) overrides the app-context default set in
+// ModelOptions so distributed routing decisions reach the request's
+// X-LocalAI-Node holder via distributedhdr.Stamp.
+opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
 transcriptionModel, err := ml.Load(opts...)
 if err != nil {
 recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)

View File

@@ -50,14 +50,10 @@ func ModelTTS(
 appConfig *config.ApplicationConfig,
 modelConfig config.ModelConfig,
 ) (string, *proto.Result, error) {
-// model.WithContext carries the request context into the load so distributed
-// routing decisions reach the request's X-LocalAI-Node holder via
-// distributedhdr.Stamp. context.WithoutCancel keeps those values but drops
-// the request's cancellation, so a slow first load still completes and
-// caches if the client disconnects instead of aborting the LoadModel RPC and
-// tearing down the backend process (issue #10636). Inference below keeps the
-// cancellable ctx, so a disconnect still stops generation.
-opts := ModelOptions(modelConfig, appConfig, model.WithContext(context.WithoutCancel(ctx)))
+// model.WithContext(ctx) overrides the app-context default set in
+// ModelOptions so distributed routing decisions reach the request's
+// X-LocalAI-Node holder via distributedhdr.Stamp.
+opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
 ttsModel, err := loader.Load(opts...)
 if err != nil {
 recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
@@ -157,9 +153,7 @@ func ModelTTSStream(
 modelConfig config.ModelConfig,
 audioCallback func([]byte) error,
 ) error {
-// See ModelTTS above: WithoutCancel decouples the load from request
-// cancellation while preserving routing values (issue #10636).
-opts := ModelOptions(modelConfig, appConfig, model.WithContext(context.WithoutCancel(ctx)))
+opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
 ttsModel, err := loader.Load(opts...)
 if err != nil {
 recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)

View File

@@ -14,14 +14,10 @@ func VAD(request *schema.VADRequest,
 ml *model.ModelLoader,
 appConfig *config.ApplicationConfig,
 modelConfig config.ModelConfig) (*schema.VADResponse, error) {
-// model.WithContext carries the request context into the load so distributed
-// routing decisions reach the request's X-LocalAI-Node holder via
-// distributedhdr.Stamp. context.WithoutCancel keeps those values but drops
-// the request's cancellation, so a slow first load still completes and
-// caches if the client disconnects instead of aborting the LoadModel RPC and
-// tearing down the backend process (issue #10636). Inference below keeps the
-// cancellable ctx, so a disconnect still stops generation.
-opts := ModelOptions(modelConfig, appConfig, model.WithContext(context.WithoutCancel(ctx)))
+// model.WithContext(ctx) overrides the app-context default set in
+// ModelOptions so distributed routing decisions reach the request's
+// X-LocalAI-Node holder via distributedhdr.Stamp.
+opts := ModelOptions(modelConfig, appConfig, model.WithContext(ctx))
 vadModel, err := ml.Load(opts...)
 if err != nil {
 recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)

View File

@@ -599,6 +599,13 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 Component: "toggle",
 Order: 89,
 },
+"pipeline.disable_warmup": {
+Section: "pipeline",
+Label: "Disable Warmup",
+Description: "Turn off eager pre-loading of the pipeline's sub-models at realtime session start. By default LocalAI loads every configured sub-model backend (VAD, transcription, LLM, TTS, sound detection, voice recognition) before the session starts and blocks until they are ready, so the first turn pays no cold-start cost and a model that fails to load is reported at session start instead of mid-call. Enable this to restore the lazy 'load on first use' behavior — session start no longer waits on loading and load errors surface on the first turn instead. Useful to keep idle sessions from holding model memory they may never use.",
+Component: "toggle",
+Order: 90,
+},
 // --- Functions ---
 "function.grammar.parallel_calls": {

View File

@@ -656,6 +656,18 @@ type Pipeline struct {
 // to benefit. A client session.update still overrides type and eagerness
 // per session; retranscribe is server-side only. Unset keeps server_vad.
 TurnDetection PipelineTurnDetection `yaml:"turn_detection,omitempty" json:"turn_detection,omitempty"`
+// DisableWarmup turns off eager pre-loading of the pipeline's sub-models at
+// realtime session start. By default (false) LocalAI loads every configured
+// sub-model backend (VAD, transcription, LLM, TTS, sound detection, voice
+// recognition) into memory (concurrently) before the
+// session is announced and blocks until they are ready, so the first turn
+// pays no cold-start cost and a model that fails to load surfaces as an error
+// at session start rather than mid-call. Set true to restore the lazy "load
+// on first use" behavior — session start no longer blocks on loading and
+// load errors surface on first use instead (e.g. to keep idle sessions from
+// holding model memory they may never use).
+DisableWarmup bool `yaml:"disable_warmup,omitempty" json:"disable_warmup,omitempty"`
 }
 // PipelineCompaction configures summarize-then-drop for a realtime pipeline.

View File

@@ -155,6 +155,25 @@ func (bcl *ModelConfigLoader) LoadModelConfigFileByNameDefaultOptions(modelName
 ModelPath(appConfig.SystemState.Model.ModelsPath))
 }
+// LoadResolvedModelConfig loads a model config by name and follows a single
+// alias hop, so a caller that references an alias (e.g. a pipeline with
+// `llm: default`) gets the alias target's full config (Backend, Model, ...)
+// rather than the alias stub with an empty Backend. Without this the alias
+// survives unresolved into model loading and fails downstream — notably in
+// distributed mode with "backend name is empty". Mirrors the top-level alias
+// resolution in core/http/middleware/request.go.
+func (bcl *ModelConfigLoader) LoadResolvedModelConfig(modelName, modelPath string) (*ModelConfig, error) {
+cfg, err := bcl.LoadModelConfigFileByName(modelName, modelPath)
+if err != nil {
+return nil, err
+}
+resolved, _, err := bcl.ResolveAlias(cfg)
+if err != nil {
+return nil, err
+}
+return resolved, nil
+}
 // This format is currently only used when reading a single file at startup, passed in via ApplicationConfig.ConfigFile
 func (bcl *ModelConfigLoader) LoadMultipleModelConfigsSingleFile(file string, opts ...ConfigLoaderOption) error {
 bcl.Lock()
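
The single alias hop that LoadResolvedModelConfig describes can be illustrated with an in-memory map. This is a sketch only: `modelConfig`, its `Alias` field and `resolveOneHop` are hypothetical simplifications, and the real resolution goes through ResolveAlias, whose behaviour for chained aliases is not shown in this diff.

```go
package main

import (
	"fmt"
)

// modelConfig is a hypothetical, trimmed-down stand-in for config.ModelConfig.
type modelConfig struct {
	Name    string
	Backend string
	Alias   string // non-empty when this entry only points at another model
}

// resolveOneHop looks up name and, if the entry is an alias, returns the
// alias target instead of the stub. Exactly one hop is followed.
func resolveOneHop(configs map[string]modelConfig, name string) (modelConfig, error) {
	cfg, ok := configs[name]
	if !ok {
		return modelConfig{}, fmt.Errorf("model %q not found", name)
	}
	if cfg.Alias == "" {
		return cfg, nil // not an alias: return the config unchanged
	}
	target, ok := configs[cfg.Alias]
	if !ok {
		return modelConfig{}, fmt.Errorf("alias %q points at unknown model %q", name, cfg.Alias)
	}
	return target, nil
}

func main() {
	configs := map[string]modelConfig{
		"real-llm": {Name: "real-llm", Backend: "llama-cpp"},
		"default":  {Name: "default", Alias: "real-llm"},
	}
	resolved, err := resolveOneHop(configs, "default")
	fmt.Println(resolved.Name, resolved.Backend, err)
}
```

The point of resolving before loading is that the caller receives a config with a non-empty backend, which is what the "backend name is empty" failure in the comment above stems from.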

View File

@@ -1,4 +1,4 @@
-package openai
+package config_test
 import (
 "os"
@@ -10,14 +10,14 @@ import (
 "github.com/mudler/LocalAI/core/config"
 )
-// loadPipelineSubModel must resolve a pipeline sub-model that references an
-// alias (e.g. `llm: default`) one hop to the alias target's full config — so
-// the effective backend is the target's backend, not the empty backend of the
-// alias stub. This mirrors the top-level alias resolution done in
-// core/http/middleware/request.go, which the realtime pipeline previously
+// LoadResolvedModelConfig must resolve a model that references an alias
+// (e.g. a pipeline with `llm: default`) one hop to the alias target's full
+// config — so the effective backend is the target's backend, not the empty
+// backend of the alias stub. This mirrors the top-level alias resolution done
+// in core/http/middleware/request.go, which the realtime pipeline previously
 // skipped (failing in distributed mode with "backend name is empty").
-var _ = Describe("loadPipelineSubModel", func() {
-It("resolves a sub-model alias one hop to the target's config", func() {
+var _ = Describe("LoadResolvedModelConfig", func() {
+It("resolves an alias one hop to the target's config", func() {
 tmpDir := GinkgoT().TempDir()
 // A real model config with a concrete backend.
@@ -38,13 +38,13 @@ alias: real-llm
 Expect(cl.LoadModelConfigsFromPath(tmpDir)).To(Succeed())
 // Resolving the alias must follow the hop to the target's full config.
-resolved, err := loadPipelineSubModel(cl, "default", tmpDir)
+resolved, err := cl.LoadResolvedModelConfig("default", tmpDir)
 Expect(err).NotTo(HaveOccurred())
 Expect(resolved.IsAlias()).To(BeFalse())
 Expect(resolved.Backend).To(Equal("llama-cpp"))
 // A non-alias name must load unchanged.
-direct, err := loadPipelineSubModel(cl, "real-llm", tmpDir)
+direct, err := cl.LoadResolvedModelConfig("real-llm", tmpDir)
 Expect(err).NotTo(HaveOccurred())
 Expect(direct.Backend).To(Equal("llama-cpp"))
 Expect(direct.Name).To(Equal("real-llm"))

View File

@@ -0,0 +1,54 @@
package localai
import (
"net/http"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// LoadModelEndpoint pre-loads a model into memory by name — the inverse of
// /backend/shutdown. For a realtime pipeline model every configured sub-model
// (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded; for a regular
// model its own backend is loaded. The call blocks until loading finishes so
// clients can drive warm-up explicitly and learn up front whether a model
// fails to load.
// @Summary Pre-load a model into memory
// @Description Loads the named model (or, for a realtime pipeline, all of its sub-models) into memory so subsequent requests pay no cold-start cost. The inverse of /backend/shutdown.
// @Tags monitoring
// @Accept json
// @Produce json
// @Param request body schema.ModelLoadRequest true "Model to load"
// @Success 200 {object} schema.ModelLoadResponse "Model loaded"
// @Failure 400 {object} schema.ModelLoadResponse "Missing model name"
// @Failure 500 {object} schema.ModelLoadResponse "Load failed (Loaded lists any sub-models that did load)"
// @Router /backend/load [post]
func LoadModelEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input := new(schema.ModelLoadRequest)
if err := c.Bind(input); err != nil {
return err
}
if input.Model == "" {
return c.JSON(http.StatusBadRequest, schema.ModelLoadResponse{Message: "model is required"})
}
loaded, err := backend.PreloadModelByName(c.Request().Context(), cl, ml, appConfig, input.Model)
if err != nil {
xlog.Error("failed to pre-load model", "model", input.Model, "loaded", loaded, "error", err)
return c.JSON(http.StatusInternalServerError, schema.ModelLoadResponse{
Loaded: loaded,
Message: "failed to load model: " + err.Error(),
})
}
return c.JSON(http.StatusOK, schema.ModelLoadResponse{
Loaded: loaded,
Message: "model loaded",
})
}
}
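
A client-side view of the endpoint may help when reading the handler above. The sketch below posts `{"model": ...}` to `/backend/load` and decodes the reply. It is an illustration under assumptions: the JSON field names `loaded` and `message` are guessed from the Go field names (the tags of schema.ModelLoadResponse are not shown in this diff), and `fakeServer` is a hypothetical in-process stand-in so the example runs without a LocalAI instance.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// loadResponse mirrors schema.ModelLoadResponse; the JSON names are assumed.
type loadResponse struct {
	Loaded  []string `json:"loaded"`
	Message string   `json:"message"`
}

// preload asks the server at baseURL to load model and returns the HTTP
// status together with the decoded response body.
func preload(baseURL, model string) (int, loadResponse, error) {
	body, err := json.Marshal(map[string]string{"model": model})
	if err != nil {
		return 0, loadResponse{}, err
	}
	resp, err := http.Post(baseURL+"/backend/load", "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, loadResponse{}, err
	}
	defer resp.Body.Close()

	var out loadResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return resp.StatusCode, loadResponse{}, err
	}
	return resp.StatusCode, out, nil
}

// fakeServer imitates the handler's success and missing-name branches only.
func fakeServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Model string `json:"model"`
		}
		_ = json.NewDecoder(r.Body).Decode(&req)
		w.Header().Set("Content-Type", "application/json")
		if req.Model == "" {
			w.WriteHeader(http.StatusBadRequest)
			_ = json.NewEncoder(w).Encode(loadResponse{Message: "model is required"})
			return
		}
		_ = json.NewEncoder(w).Encode(loadResponse{Loaded: []string{req.Model}, Message: "model loaded"})
	}))
}

func main() {
	srv := fakeServer()
	defer srv.Close()

	status, out, err := preload(srv.URL, "voicebot")
	fmt.Println(status, out.Loaded, out.Message, err)
}
```

Against a real server the interesting case is a 500 reply: per the handler above, the body then still lists whichever sub-models did load, so a caller should read `Loaded` even on failure.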

View File

@@ -0,0 +1,102 @@
package localai_test
import (
"bytes"
"encoding/json"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/config"
. "github.com/mudler/LocalAI/core/http/endpoints/localai"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
var _ = Describe("LoadModelEndpoint (/backend/load)", func() {
var (
app *echo.Echo
tempDir string
configLoader *config.ModelConfigLoader
modelLoader *model.ModelLoader
appConfig *config.ApplicationConfig
)
post := func(body string) *httptest.ResponseRecorder {
req := httptest.NewRequest(http.MethodPost, "/backend/load", bytes.NewBufferString(body))
req.Header.Set(echo.HeaderContentType, echo.MIMEApplicationJSON)
rec := httptest.NewRecorder()
app.ServeHTTP(rec, req)
return rec
}
decode := func(rec *httptest.ResponseRecorder) schema.ModelLoadResponse {
var resp schema.ModelLoadResponse
Expect(json.Unmarshal(rec.Body.Bytes(), &resp)).To(Succeed())
return resp
}
writeConfig := func(name, contents string) {
Expect(os.WriteFile(filepath.Join(tempDir, name+".yaml"), []byte(contents), 0o600)).To(Succeed())
}
BeforeEach(func() {
var err error
tempDir, err = os.MkdirTemp("", "backend-load-test-*")
Expect(err).NotTo(HaveOccurred())
systemState, err := system.GetSystemState(system.WithModelPath(tempDir))
Expect(err).NotTo(HaveOccurred())
appConfig = config.NewApplicationConfig(config.WithSystemState(systemState))
configLoader = config.NewModelConfigLoader(tempDir)
modelLoader = model.NewModelLoader(systemState) // no backends installed
app = echo.New()
app.POST("/backend/load", LoadModelEndpoint(configLoader, modelLoader, appConfig))
})
AfterEach(func() {
_ = os.RemoveAll(tempDir)
})
It("rejects a request with no model name", func() {
rec := post(`{}`)
Expect(rec.Code).To(Equal(http.StatusBadRequest))
Expect(decode(rec).Message).To(ContainSubstring("model is required"))
})
It("reports a load failure for a regular model with nothing loaded", func() {
writeConfig("solo", "name: solo\n")
rec := post(`{"model":"solo"}`)
Expect(rec.Code).To(Equal(http.StatusInternalServerError))
resp := decode(rec)
Expect(resp.Loaded).To(BeEmpty())
Expect(resp.Message).To(ContainSubstring("failed to load model"))
})
It("expands a pipeline model and reports each sub-model that failed to load", func() {
writeConfig("voicebot", "name: voicebot\npipeline:\n vad: vad-m\n transcription: stt-m\n llm: llm-m\n tts: tts-m\n")
writeConfig("vad-m", "name: vad-m\n")
writeConfig("stt-m", "name: stt-m\n")
writeConfig("llm-m", "name: llm-m\n")
writeConfig("tts-m", "name: tts-m\n")
rec := post(`{"model":"voicebot"}`)
Expect(rec.Code).To(Equal(http.StatusInternalServerError))
resp := decode(rec)
Expect(resp.Message).To(ContainSubstring("failed to load model"))
// The pipeline stub itself is never loaded; its sub-models are what the
// endpoint tries, so the error names them rather than "voicebot".
Expect(resp.Message).To(ContainSubstring("vad-m"))
Expect(resp.Message).ToNot(ContainSubstring("voicebot"))
})
})

View File

@@ -51,6 +51,9 @@ func (stubClient) EditModelConfig(_ context.Context, _ string, _ map[string]any)
 return nil
 }
 func (stubClient) ReloadModels(_ context.Context) error { return nil }
+func (stubClient) LoadModel(_ context.Context, model string) ([]string, error) {
+return []string{model}, nil
+}
 func (stubClient) SetAlias(_ context.Context, _, _ string) error {
 return nil
 }

View File

@@ -7,6 +7,7 @@ import (
 "encoding/binary"
 "encoding/hex"
 "encoding/json"
+"errors"
 "fmt"
 "math"
 "os"
@@ -266,6 +267,12 @@ type Model interface {
 // grpcerrors.IsLiveTranscriptionUnsupported.
 TranscribeLive(ctx context.Context, language string, onEvent func(backend.LiveTranscriptionEvent)) (backend.LiveTranscriptionSession, error)
 PredictConfig() *config.ModelConfig
+// Warmup eagerly loads the pipeline's sub-model backends into memory so the
+// first realtime turn doesn't pay each backend's cold-start load cost. Loads
+// run concurrently; Warmup blocks until they all finish and returns a joined
+// error naming every stage that failed to load (nil if all succeeded), so a
+// caller can surface model-load failures at session start instead of mid-call.
+Warmup(ctx context.Context) error
 }
 var upgrader = websocket.Upgrader{
@@ -583,18 +590,8 @@ func runRealtimeSession(application *application.Application, t Transport, model
 }
 session.ModelInterface = m
-if session.SummaryModel != "" {
-summaryModelName := session.SummaryModel
-sid := sessionID
-session.summarizerFactory = func() (Model, error) {
-summaryCfg, lerr := application.ModelConfigLoader().LoadModelConfigFileByNameDefaultOptions(summaryModelName, application.ApplicationConfig())
-if lerr != nil {
-return nil, fmt.Errorf("load summary model config %q: %w", summaryModelName, lerr)
-}
-return newModel(&summaryCfg.Pipeline, application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig(), evaluator, buildRealtimeRoutingContext(application, sid))
-}
-}
+// The voice gate is built before the warm-up below so its
+// speaker-recognition model can warm alongside the pipeline stages.
 if cfg.Pipeline.VoiceGateEnabled() {
 gate, gerr := newVoiceGate(
 *cfg.Pipeline.VoiceRecognition,
@@ -612,6 +609,47 @@ func runRealtimeSession(application *application.Application, t Transport, model
xlog.Info("realtime voice recognition gate enabled", "mode", gate.cfg.Mode, "when", gate.cfg.When) xlog.Info("realtime voice recognition gate enabled", "mode", gate.cfg.Mode, "when", gate.cfg.When)
} }
// Warm the pipeline's sub-model backends before announcing the session.
// Loads run concurrently but we block here until they all finish, so a model
// that fails to load (missing weights, bad backend, OOM) surfaces as an error
// at session start rather than stalling — or failing — mid-call on the first
// turn (VAD on the first audio chunk, STT at end-of-speech, LLM on the first
// reply, TTS on the first spoken output). On success the backends are already
// resident, so the first turn pays no cold-start cost. Opt out per pipeline
// with `pipeline.disable_warmup: true` to restore lazy load-on-first-use
// (errors then surface on first use instead of at session start).
if !cfg.Pipeline.DisableWarmup {
warmErr := make(chan error, 1)
go func() { warmErr <- m.Warmup(context.Background()) }()
// The voice-gate model warms concurrently with the pipeline stages: an
// enforced gate blocks each utterance on speaker resolution, so its
// cold-start would otherwise land on the first turn too. (Compaction's
// summary_model stays lazy — it only runs off the response path.)
var gateErr error
if session.voiceGate != nil {
_, gateErr = backend.PreloadStages(context.Background(), application.ModelLoader(), application.ApplicationConfig(), []backend.PreloadStage{
{Role: "voice_recognition", Cfg: session.voiceGate.recCfg},
})
}
if err := errors.Join(<-warmErr, gateErr); err != nil {
xlog.Error("realtime warmup failed", "model", model, "error", err)
sendError(t, "model_load_error", "Failed to load pipeline models: "+err.Error(), "", "")
return
}
}
if session.SummaryModel != "" {
summaryModelName := session.SummaryModel
sid := sessionID
session.summarizerFactory = func() (Model, error) {
summaryCfg, lerr := application.ModelConfigLoader().LoadModelConfigFileByNameDefaultOptions(summaryModelName, application.ApplicationConfig())
if lerr != nil {
return nil, fmt.Errorf("load summary model config %q: %w", summaryModelName, lerr)
}
return newModel(&summaryCfg.Pipeline, application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig(), evaluator, buildRealtimeRoutingContext(application, sid))
}
}
// Store the session and notify the transport (for WebRTC audio track handling)
sessionLock.Lock()
sessions[sessionID] = session
@@ -1125,6 +1163,21 @@ func updateSession(session *Session, update *types.SessionUnion, cl *config.Mode
return err
}
session.ModelInterface = m
// A session.update that swaps the model/voice rebuilds the pipeline, so
// warm the new backends too (unless opted out) — otherwise the next turn
// pays the cold-start load the original session warm-up already avoided.
// Unlike session start this stays non-blocking: updateSession runs under
// the global sessionLock, so blocking on a multi-second load here would
// stall every other session. Load errors are logged (and still surface on
// first use); per-stage failures are already warned inside
// backend.PreloadStages.
if !session.ModelConfig.Pipeline.DisableWarmup {
go func() {
if err := m.Warmup(context.Background()); err != nil {
xlog.Error("realtime warmup failed after session.update", "error", err)
}
}()
}
}
if rt.Audio != nil && rt.Audio.Input != nil && rt.Audio.Input.TurnDetectionSet {


@@ -174,6 +174,8 @@ func (m *fakeModel) TranscribeLive(_ context.Context, _ string, onEvent func(bac
func (m *fakeModel) PredictConfig() *config.ModelConfig { return m.cfg }
func (m *fakeModel) Warmup(ctx context.Context) error { return nil }
// fakeLiveSession records what semantic_vad fed and closed; closeEvents are
// replayed through onEvent during Close, mimicking the backend's finalize
// flush (trailing delta + Final) landing before Close returns.


@@ -110,6 +110,15 @@ func (m *transcriptOnlyModel) PredictConfig() *config.ModelConfig {
return nil
}
func (m *transcriptOnlyModel) Warmup(ctx context.Context) error {
_, err := backend.PreloadStages(ctx, m.modelLoader, m.appConfig, []backend.PreloadStage{
{Role: "vad", Cfg: m.VADConfig},
{Role: "transcription", Cfg: m.TranscriptionConfig},
{Role: "sound_detection", Cfg: m.SoundDetectionConfig},
})
return err
}
func (m *wrappedModel) VAD(ctx context.Context, request *schema.VADRequest) (*schema.VADResponse, error) {
return backend.VAD(request, ctx, m.modelLoader, m.appConfig, *m.VADConfig)
}
@@ -360,6 +369,17 @@ func (m *wrappedModel) PredictConfig() *config.ModelConfig {
return m.LLMConfig
}
func (m *wrappedModel) Warmup(ctx context.Context) error {
_, err := backend.PreloadStages(ctx, m.modelLoader, m.appConfig, []backend.PreloadStage{
{Role: "vad", Cfg: m.VADConfig},
{Role: "transcription", Cfg: m.TranscriptionConfig},
{Role: "llm", Cfg: m.LLMConfig},
{Role: "tts", Cfg: m.TTSConfig},
{Role: "sound_detection", Cfg: m.SoundDetectionConfig},
})
return err
}
// wavStreamHeaderBytes is the size of the WAV header that backend.ModelTTSStream
// emits as its first audio callback; the sample rate lives at byte offset 24.
const wavStreamHeaderBytes = 44
@@ -440,7 +460,7 @@ func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigL
if pipeline.SoundDetection == "" {
return nil, nil
}
-cfg, err := loadPipelineSubModel(cl, pipeline.SoundDetection, ml.ModelPath)
+cfg, err := cl.LoadResolvedModelConfig(pipeline.SoundDetection, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load sound detection config: %w", err)
}
@@ -451,7 +471,7 @@ func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigL
}
func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
-cfgVAD, err := loadPipelineSubModel(cl, pipeline.VAD, ml.ModelPath)
+cfgVAD, err := cl.LoadResolvedModelConfig(pipeline.VAD, ml.ModelPath)
if err != nil {
return nil, nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -461,7 +481,7 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
return nil, nil, fmt.Errorf("failed to validate config: %w", err)
}
-cfgSST, err := loadPipelineSubModel(cl, pipeline.Transcription, ml.ModelPath)
+cfgSST, err := cl.LoadResolvedModelConfig(pipeline.Transcription, ml.ModelPath)
if err != nil {
return nil, nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -550,30 +570,11 @@ func buildRealtimeRoutingContext(a *application.Application, sessionID string) *
}
}
// loadPipelineSubModel loads a pipeline sub-model config by name and follows a
// single alias hop, so a pipeline that references an alias (e.g. `llm: default`)
// gets the alias target's full config (Backend, Model, ...) rather than the
// alias stub with an empty Backend. Without this the alias survives unresolved
// into model loading and fails downstream — notably in distributed mode with
// "backend name is empty". Mirrors the top-level alias resolution in
// core/http/middleware/request.go.
func loadPipelineSubModel(cl *config.ModelConfigLoader, name, modelPath string) (*config.ModelConfig, error) {
cfg, err := cl.LoadModelConfigFileByName(name, modelPath)
if err != nil {
return nil, err
}
resolved, _, err := cl.ResolveAlias(cfg)
if err != nil {
return nil, err
}
return resolved, nil
}
// returns and loads either a wrapped model or a model that support audio-to-audio
func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig, evaluator *templates.Evaluator, routing *RealtimeRoutingContext) (Model, error) {
xlog.Debug("Creating new model pipeline model", "pipeline", pipeline)
-cfgVAD, err := loadPipelineSubModel(cl, pipeline.VAD, ml.ModelPath)
+cfgVAD, err := cl.LoadResolvedModelConfig(pipeline.VAD, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -584,7 +585,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
}
// TODO: Do we always need a transcription model? It can be disabled. Note that any-to-any instruction following models don't transcribe as such, so if transcription is required it is a separate process
-cfgSST, err := loadPipelineSubModel(cl, pipeline.Transcription, ml.ModelPath)
+cfgSST, err := cl.LoadResolvedModelConfig(pipeline.Transcription, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -616,7 +617,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
xlog.Debug("Loading a wrapped model")
// Otherwise we want to return a wrapped model, which is a "virtual" model that re-uses other models to perform operations
-cfgLLM, err := loadPipelineSubModel(cl, pipeline.LLM, ml.ModelPath)
+cfgLLM, err := cl.LoadResolvedModelConfig(pipeline.LLM, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)
@@ -631,7 +632,7 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
applyPipelineReasoning(cfgLLM, *pipeline)
applyPipelineThinking(cfgLLM, *pipeline)
-cfgTTS, err := loadPipelineSubModel(cl, pipeline.TTS, ml.ModelPath)
+cfgTTS, err := cl.LoadResolvedModelConfig(pipeline.TTS, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load backend config: %w", err)
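
The single-alias-hop resolution that the removed `loadPipelineSubModel` comment describes (and that `LoadResolvedModelConfig` now provides) can be sketched roughly as below. Everything here is hypothetical: `modelConfig`, its `Alias` field, the in-memory map and the backend name are illustrative assumptions, not the real `config.ModelConfig` or loader API.

```go
package main

import "fmt"

// modelConfig is a hypothetical, trimmed-down config. An alias stub has Alias
// set and an empty Backend.
type modelConfig struct {
	Name    string
	Alias   string
	Backend string
}

// loadResolved looks up name and follows at most one alias hop, returning the
// target's full config instead of the alias stub. A second hop is not followed.
func loadResolved(configs map[string]modelConfig, name string) (modelConfig, error) {
	cfg, ok := configs[name]
	if !ok {
		return modelConfig{}, fmt.Errorf("model config %q not found", name)
	}
	if cfg.Alias == "" {
		return cfg, nil // not an alias: use the config as loaded
	}
	target, ok := configs[cfg.Alias]
	if !ok {
		return modelConfig{}, fmt.Errorf("alias %q points to unknown model %q", name, cfg.Alias)
	}
	return target, nil
}

func main() {
	configs := map[string]modelConfig{
		"default": {Name: "default", Alias: "qwen"},         // alias stub, empty Backend
		"qwen":    {Name: "qwen", Backend: "example-backend"}, // alias target
	}
	cfg, err := loadResolved(configs, "default")
	fmt.Println(cfg.Name, cfg.Backend, err)
}
```

Resolving before model loading means the loader never sees a stub with an empty backend, which is the failure the original comment attributes to unresolved aliases.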


@@ -21,6 +21,7 @@ type namedEmbedding struct {
// drive the realtime pipeline.
type voiceGate struct {
cfg config.PipelineVoiceRecognition // normalized
recCfg *config.ModelConfig // resolved speaker-recognition model, for warm-up
registry voicerecognition.Registry // identify mode (nil otherwise)
refEmbeds []namedEmbedding // verify mode, pre-embedded refs
refAudios []config.VoiceReference // verify + anti-spoofing: ref paths
@@ -72,7 +73,9 @@ func newVoiceGate(
return nil, err
}
-recCfg, err := cl.LoadModelConfigFileByName(cfg.Model, ml.ModelPath)
+// Resolved like every other pipeline sub-model (one alias hop), so an
+// aliased voice_recognition model gets its target's backend.
+recCfg, err := cl.LoadResolvedModelConfig(cfg.Model, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("voice_recognition: failed to load model %q: %w", cfg.Model, err)
}
@@ -82,6 +85,7 @@ func newVoiceGate(
g := &voiceGate{
cfg: cfg,
+recCfg: recCfg,
registry: registry,
embedFn: func(ctx context.Context, wavPath string) ([]float32, error) {
res, err := backend.VoiceEmbed(ctx, wavPath, ml, appConfig, *recCfg)


@@ -0,0 +1,64 @@
package openai
import (
"context"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
)
// Warmup delegates to backend.PreloadStages (its concurrency, nil-skipping and
// error-joining semantics are pinned in core/backend). These specs pin the
// wiring instead: each realtime model type must warm exactly its configured
// stages under the right pipeline-role labels. No backends are installed, so
// every attempted stage fails to load — the joined error is the proof of which
// stages were attempted and how they were labeled.
var _ = Describe("realtime model Warmup wiring", func() {
newLoader := func() (*model.ModelLoader, *config.ApplicationConfig) {
systemState, err := system.GetSystemState(system.WithModelPath(GinkgoT().TempDir()))
Expect(err).ToNot(HaveOccurred())
appConfig := config.NewApplicationConfig(config.WithSystemState(systemState))
return model.NewModelLoader(systemState), appConfig
}
It("wrappedModel warms every configured stage under its pipeline role", func() {
ml, appConfig := newLoader()
m := &wrappedModel{
VADConfig: &config.ModelConfig{Name: "vad-m"},
TranscriptionConfig: &config.ModelConfig{Name: "stt-m"},
LLMConfig: &config.ModelConfig{Name: "llm-m"},
TTSConfig: &config.ModelConfig{Name: "tts-m"},
SoundDetectionConfig: &config.ModelConfig{Name: "ced-m"},
modelLoader: ml,
appConfig: appConfig,
}
err := m.Warmup(context.Background())
Expect(err).To(HaveOccurred())
for _, stage := range []string{"vad (vad-m)", "transcription (stt-m)", "llm (llm-m)", "tts (tts-m)", "sound_detection (ced-m)"} {
Expect(err.Error()).To(ContainSubstring(stage))
}
})
It("transcriptOnlyModel warms its stages and skips absent ones", func() {
ml, appConfig := newLoader()
m := &transcriptOnlyModel{
VADConfig: &config.ModelConfig{Name: "vad-m"},
TranscriptionConfig: &config.ModelConfig{Name: "stt-m"},
// SoundDetectionConfig nil: an absent stage must be skipped, not
// fail the warm-up.
modelLoader: ml,
appConfig: appConfig,
}
err := m.Warmup(context.Background())
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("vad (vad-m)"))
Expect(err.Error()).To(ContainSubstring("transcription (stt-m)"))
Expect(err.Error()).ToNot(ContainSubstring("sound_detection"))
})
})


@@ -7,6 +7,7 @@ import (
"io"
"net/http"
"os"
"path/filepath"
"strings"
"time"
@@ -29,6 +30,8 @@ const testModel = "Qwen3-VL-2B-Instruct-Q4_K_M"
var _ = Describe("Open Responses API", func() {
var app *echo.Echo
+var localApp *application.Application
+var localModelDir string
var c context.Context
var cancel context.CancelFunc
@@ -38,28 +41,47 @@ var _ = Describe("Open Responses API", func() {
Context("API with ephemeral models", func() {
BeforeEach(func(sc SpecContext) {
-var err error
+// This suite exercises the /v1/responses HTTP/protocol contract
+// (Content-Type, SSE framing, response envelope, error shapes),
+// not real inference — so it runs against the same prebuilt
+// mock-backend the rest of the http suite uses instead of
+// downloading a real model. Skip cleanly when it isn't built.
+if mockBackendPath == "" {
+Skip("mock-backend binary not built; run 'make build-mock-backend'")
+}
-backendPath := os.Getenv("BACKENDS_PATH")
+var err error
c, cancel = context.WithCancel(context.Background())
// Isolated model dir carrying a single config named after testModel
// but served by the mock backend, so the responses endpoint can
// resolve and load the model without any real backend build.
localModelDir, err = os.MkdirTemp("", "openresponses-models-")
Expect(err).ToNot(HaveOccurred())
mockModelYAML := "name: " + testModel + "\n" +
"backend: mock-backend\n" +
"parameters:\n" +
" model: mock-model.bin\n"
Expect(os.WriteFile(filepath.Join(localModelDir, testModel+".yaml"), []byte(mockModelYAML), 0644)).To(Succeed())
systemState, err := system.GetSystemState(
-system.WithBackendPath(backendPath),
-system.WithModelPath(modelDir),
+system.WithBackendPath(backendDir),
+system.WithModelPath(localModelDir),
)
Expect(err).ToNot(HaveOccurred())
-application, err := application.New(
+localApp, err = application.New(
append(commonOpts,
config.WithContext(c),
config.WithSystemState(systemState),
config.WithApiKeys([]string{apiKey}),
-config.WithModelsURL("https://huggingface.co/unsloth/Qwen3-VL-2B-Instruct-GGUF"),
)...)
Expect(err).ToNot(HaveOccurred())
+localApp.ModelLoader().SetExternalBackend("mock-backend", mockBackendPath)
-app, err = API(application)
+app, err = API(localApp)
Expect(err).ToNot(HaveOccurred())
go func() {
@@ -80,14 +102,24 @@ var _ = Describe("Open Responses API", func() {
})
AfterEach(func(sc SpecContext) {
// Synchronous app shutdown first — context-cancel cleanup is async
// and races test-binary exit, orphaning mock-backend children.
if localApp != nil {
_ = localApp.Shutdown()
localApp = nil
}
cancel()
if app != nil {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
err := app.Shutdown(ctx)
Expect(err).ToNot(HaveOccurred())
app = nil
}
if localModelDir != "" {
_ = os.RemoveAll(localModelDir)
localModelDir = ""
}
})
Context("HTTP Protocol Compliance", func() {
@@ -969,13 +1001,16 @@ var _ = Describe("Open Responses API", func() {
Expect(ok).To(BeTrue())
Expect(itemID).ToNot(BeEmpty())
-// Now create a new response with item_reference
+// Now create a new response with item_reference. Per the OpenAI
+// Responses spec (and this server's parser in
+// endpoints/openresponses/responses.go) an item_reference carries
+// the referenced item in the "id" field, not "item_id".
reqBody2 := map[string]any{
"model": testModel,
"input": []any{
map[string]any{
"type": "item_reference",
-"item_id": itemID,
+"id": itemID,
},
map[string]any{
"type": "message",
@@ -1005,8 +1040,8 @@ var _ = Describe("Open Responses API", func() {
"model": testModel,
"input": []any{
map[string]any{
"type": "item_reference",
-"item_id": "nonexistent_item_id",
+"id": "nonexistent_item_id",
},
},
}
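
As a small illustration of the corrected payload shape, the sketch below builds a request whose `item_reference` entry carries the referenced item under `id`. The helper names, model name and item ID are made up for the example; only the `type`/`id` field names come from the hunk above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// itemReference builds an item_reference input entry. The referenced item
// goes in "id" (not "item_id"), matching the fix in the test above.
func itemReference(itemID string) map[string]any {
	return map[string]any{"type": "item_reference", "id": itemID}
}

// buildRequest marshals a minimal request body containing one item_reference.
func buildRequest(model, itemID string) ([]byte, error) {
	return json.Marshal(map[string]any{
		"model": model,
		"input": []any{itemReference(itemID)},
	})
}

func main() {
	body, err := buildRequest("example-model", "item_123")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```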


@@ -0,0 +1,133 @@
import { test, expect } from './coverage-fixtures.js'
// Seeds two-message chat into localStorage so we don't need a live model.
async function seedChat(page, history) {
await page.addInitScript((h) => {
const chat = {
id: 'seed1', name: 'Seeded Chat', model: 'test-model',
history: h, systemPrompt: '', mcpMode: false, mcpServers: [],
clientMCPServers: [], temperature: null, topP: null, topK: null,
tokenUsage: { prompt: 0, completion: 0, total: 0 },
contextSize: null, createdAt: Date.now(), updatedAt: Date.now(),
}
localStorage.setItem('localai_chats_data', JSON.stringify({
chats: [chat], activeChatId: 'seed1', lastSaved: Date.now(),
}))
}, history)
}
async function mockModels(page) {
await page.route('**/api/models/capabilities', (route) => route.fulfill({
contentType: 'application/json',
body: JSON.stringify({ data: [{ id: 'test-model', capabilities: ['FLAG_CHAT'] }] }),
}))
await page.route('**/api/operations', (route) => route.fulfill({
contentType: 'application/json', body: JSON.stringify({ operations: [] }),
}))
}
const TWO_TURNS = [
{ role: 'user', content: 'first question' },
{ role: 'assistant', content: 'first answer' },
{ role: 'user', content: 'second question' },
{ role: 'assistant', content: 'second answer' },
]
test('duplicate creates an independent copy and switches to it', async ({ page }) => {
await mockModels(page)
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
// Open the chats menu (Ctrl/Cmd+K) and duplicate the seeded chat.
// Wait for the menu trigger to mount so its global keydown listener is armed
// before we dispatch the shortcut.
await page.getByTitle('Conversations (Ctrl/Cmd+K)').waitFor()
await page.keyboard.press('Control+k')
await page.getByTitle('Duplicate chat').first().click()
// A new active chat named "Seeded Chat (fork)" with the same 4 messages.
await expect(page.locator('.chat-header-title')).toHaveText('Seeded Chat (fork)')
await expect(page.locator('.chat-message-user')).toHaveCount(2)
await expect(page.locator('.chat-message-assistant')).toHaveCount(2)
})
async function mockCompletion(page, replyText) {
await page.route('**/v1/chat/completions', (route) => {
const sse =
`data: ${JSON.stringify({ choices: [{ delta: { content: replyText } }] })}\n\n` +
`data: ${JSON.stringify({ choices: [{ delta: {}, finish_reason: 'stop' }], usage: { prompt_tokens: 1, completion_tokens: 1, total_tokens: 2 } })}\n\n` +
`data: [DONE]\n\n`
route.fulfill({ status: 200, contentType: 'text/event-stream', body: sse })
})
}
test('retry regenerates the first answer and drops the later turn', async ({ page }) => {
await mockModels(page)
// Capture the outbound request body so we can assert the model receives the
// truncated history (not the stale downstream turns).
let sentMessages = null
await page.route('**/v1/chat/completions', (route) => {
sentMessages = route.request().postDataJSON()?.messages || []
const sse =
`data: ${JSON.stringify({ choices: [{ delta: { content: 'REGENERATED first answer' } }] })}\n\n` +
`data: ${JSON.stringify({ choices: [{ delta: {}, finish_reason: 'stop' }], usage: { prompt_tokens: 1, completion_tokens: 1, total_tokens: 2 } })}\n\n` +
`data: [DONE]\n\n`
route.fulfill({ status: 200, contentType: 'text/event-stream', body: sse })
})
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
// Hover the FIRST assistant message and click its retry button.
const firstAssistant = page.locator('.chat-message-assistant').first()
await firstAssistant.hover()
await firstAssistant.getByTitle('Regenerate').click()
// History is truncated to the first user turn, then the new answer streams in;
// the second Q/A turn is gone.
await expect(page.locator('.chat-message-assistant')).toContainText(['REGENERATED first answer'])
await expect(page.locator('.chat-message-user')).toHaveCount(1)
await expect(page.locator('.chat-message-assistant')).toHaveCount(1)
// The OUTBOUND payload must also be truncated: the resent user turn is present,
// but the downstream turn and the stale first answer must be gone.
const contents = (sentMessages || []).map(m =>
typeof m.content === 'string' ? m.content : JSON.stringify(m.content)
)
expect(contents.join('\n')).toContain('first question')
expect(contents.join('\n')).not.toContain('second question')
expect(contents.join('\n')).not.toContain('first answer')
})
test('copy chat puts the whole conversation on the clipboard', async ({ page, context }) => {
await context.grantPermissions(['clipboard-read', 'clipboard-write'])
await mockModels(page)
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
// Wait for the menu trigger to mount so its global keydown listener is armed
// before we dispatch the shortcut (same mount-race guard as the duplicate test).
await page.getByTitle('Conversations (Ctrl/Cmd+K)').waitFor()
await page.keyboard.press('Control+k')
await page.getByTitle('Copy chat').first().click()
const clip = await page.evaluate(() => navigator.clipboard.readText())
expect(clip).toContain('# Seeded Chat')
expect(clip).toContain('first answer')
expect(clip).toContain('second answer')
})
test('branch from the first answer forks history up to that point', async ({ page }) => {
await mockModels(page)
await seedChat(page, TWO_TURNS)
await page.goto('/app/chat')
const firstAssistant = page.locator('.chat-message-assistant').first()
await firstAssistant.hover()
await firstAssistant.getByTitle('Branch from here').click()
// New active chat "Seeded Chat (fork)" contains only the first Q/A turn.
await expect(page.locator('.chat-header-title')).toHaveText('Seeded Chat (fork)')
await expect(page.locator('.chat-message-user')).toHaveCount(1)
await expect(page.locator('.chat-message-assistant')).toHaveCount(1)
await expect(page.locator('.chat-message-assistant')).toContainText(['first answer'])
})


@@ -72,6 +72,7 @@
"actions": {
"copy": "Copy",
"regenerate": "Regenerate",
"branch": "Branch from here",
"jumpToLatest": "Jump to latest"
},
"streaming": {
@@ -100,7 +101,9 @@
"toasts": {
"selectModel": "Please select a model",
"copied": "Copied to clipboard",
-"copyFailed": "Could not copy to clipboard"
+"copyFailed": "Could not copy to clipboard",
+"chatCopied": "Chat copied to clipboard",
+"forked": "Created a new chat"
},
"menu": {
"trigger": "Chats",
@@ -110,6 +113,8 @@
"noMatch": "No conversations match your search",
"noConversations": "No conversations yet",
"rename": "Rename",
"duplicate": "Duplicate chat",
"copyChat": "Copy chat",
"exportMarkdown": "Export as Markdown",
"deleteChat": "Delete chat",
"newChat": "New chat",


@@ -24,6 +24,8 @@ const ChatsMenu = forwardRef(function ChatsMenu({
onDeleteAll,
onRename,
onExport,
onCopyChat,
onDuplicate,
}, ref) {
const { t } = useTranslation('chat')
const [open, setOpen] = useState(false)
@@ -230,6 +232,24 @@ const ChatsMenu = forwardRef(function ChatsMenu({
>
<i className="fas fa-pen" />
</button>
{onDuplicate && (
<button
type="button"
onClick={(e) => { e.stopPropagation(); onDuplicate(chat); setOpen(false) }}
title={t('menu.duplicate')}
>
<i className="fas fa-clone" />
</button>
)}
{(chat.history?.length || 0) > 0 && onCopyChat && (
<button
type="button"
onClick={(e) => { e.stopPropagation(); onCopyChat(chat) }}
title={t('menu.copyChat')}
>
<i className="fas fa-clipboard" />
</button>
)}
{(chat.history?.length || 0) > 0 && onExport && (
<button
type="button"


@@ -141,6 +141,24 @@ export function useChat(initialModel = '') {
return chat
}, [])
const forkChat = useCallback((chatId, uptoIndex) => {
const src = chats.find(c => c.id === chatId)
if (!src) return null
const end = typeof uptoIndex === 'number' ? uptoIndex : src.history.length
const forked = {
...src,
id: generateId(),
name: `${src.name} (fork)`,
history: structuredClone(src.history.slice(0, end)),
tokenUsage: { prompt: 0, completion: 0, total: 0 },
createdAt: Date.now(),
updatedAt: Date.now(),
}
setChats(prev => [forked, ...prev])
setActiveChatId(forked.id)
return forked
}, [chats])
const switchChat = useCallback((chatId) => {
setActiveChatId(chatId)
setStreamingContent('')
@@ -260,8 +278,12 @@ export function useChat(initialModel = '') {
if (chat?.systemPrompt) {
messages.push({ role: 'system', content: chat.systemPrompt })
}
-// Filter out thinking/reasoning/tool_call/tool_result messages
-const historyForApi = (chat?.history || []).filter(m =>
+// Filter out thinking/reasoning/tool_call/tool_result messages.
+// options.baseHistory lets callers (e.g. mid-conversation retry) pass the
+// intended truncated history synchronously; the closure `chat` still holds
+// the stale pre-truncation state because setChats only schedules an update.
+const baseHistory = options.baseHistory || chat?.history || []
+const historyForApi = baseHistory.filter(m =>
m.role !== 'thinking' && m.role !== 'reasoning' && m.role !== 'tool_call' && m.role !== 'tool_result'
)
messages.push(...historyForApi, { role: 'user', content: messageContent })
@@ -793,6 +815,7 @@ export function useChat(initialModel = '') {
tokensPerSecond,
maxTokensPerSecond,
addChat,
+forkChat,
switchChat,
deleteChat,
deleteAllChats,


@@ -33,7 +33,7 @@ function getLastMessagePreview(chat) {
return ''
}
-function exportChatAsMarkdown(chat) {
+function serializeChatAsMarkdown(chat) {
let md = `# ${chat.name}\n\n`
md += `Model: ${chat.model || 'Unknown'}\n`
md += `Date: ${new Date(chat.createdAt).toLocaleString()}\n\n---\n\n`
@@ -47,7 +47,11 @@ function exportChatAsMarkdown(chat) {
md += `<details><summary>Thinking</summary>\n\n${msg.content}\n\n</details>\n\n`
}
}
-const blob = new Blob([md], { type: 'text/markdown' })
+return md
+}
+function downloadChatAsMarkdown(chat) {
+const blob = new Blob([serializeChatAsMarkdown(chat)], { type: 'text/markdown' })
const url = URL.createObjectURL(blob)
const a = document.createElement('a')
a.href = url
@@ -294,7 +298,7 @@ export default function Chat() {
const {
chats, activeChat, activeChatId, isStreaming, streamingChatId, streamingContent,
streamingReasoning, streamingToolCalls, tokensPerSecond, maxTokensPerSecond,
-addChat, switchChat, deleteChat, deleteAllChats, renameChat, updateChatSettings,
+addChat, forkChat, switchChat, deleteChat, deleteAllChats, renameChat, updateChatSettings,
sendMessage, stopGeneration, clearHistory, getContextUsagePercent, addMessage,
} = useChat(urlModel || '')
@@ -795,34 +799,27 @@ export default function Chat() {
await sendMessage(msg, files, mcpOptions)
}, [input, files, activeChat, sendMessage, addToast, getToolsForLLM, isClientTool, executeTool, hasAppUI, getAppResource, getToolDefinition])
const handleRegenerate = useCallback(async () => { const handleRegenerate = useCallback(async (targetIndex) => {
if (!activeChat || isStreaming) return if (!activeChat || isStreaming) return
const history = activeChat.history const history = activeChat.history
let lastUserMsg = null const end = typeof targetIndex === 'number' ? targetIndex : history.length
let lastUserFiles = null // Nearest user message at or before the target answer.
for (let i = history.length - 1; i >= 0; i--) { let userIdx = -1
if (history[i].role === 'user') { for (let i = Math.min(end, history.length) - 1; i >= 0; i--) {
lastUserMsg = typeof history[i].content === 'string' ? history[i].content : history[i].content?.[0]?.text || '' if (history[i].role === 'user') { userIdx = i; break }
lastUserFiles = history[i].files || []
break
}
} }
if (!lastUserMsg) return if (userIdx === -1) return
const userMsg = typeof history[userIdx].content === 'string'
// Remove everything after and including the last user message ? history[userIdx].content
const newHistory = [] : history[userIdx].content?.[0]?.text || ''
let foundLastUser = false const userFiles = history[userIdx].files || []
for (let i = history.length - 1; i >= 0; i--) { // Drop the user turn and everything after it; sendMessage re-appends it.
if (!foundLastUser && history[i].role === 'user') { // Thread the truncated history through explicitly: updateChatSettings only
foundLastUser = true // schedules a state update, so sendMessage's closure would otherwise read
continue // the stale pre-truncation history for the outbound API payload.
} const baseHistory = history.slice(0, userIdx)
if (foundLastUser) { updateChatSettings(activeChat.id, { history: baseHistory })
newHistory.unshift(history[i]) await sendMessage(userMsg, userFiles, { baseHistory })
}
}
updateChatSettings(activeChat.id, { history: newHistory })
await sendMessage(lastUserMsg, lastUserFiles)
}, [activeChat, isStreaming, sendMessage, updateChatSettings]) }, [activeChat, isStreaming, sendMessage, updateChatSettings])
const handleKeyDown = (e) => { const handleKeyDown = (e) => {
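
The index search and truncation in the new `handleRegenerate` can be sketched outside React as a small pure function. This is an illustrative Go sketch, not code from the PR: `message`, `regenerateFrom`, and the use of a negative `target` to mean "regenerate the last answer" are assumptions made for the example (the original uses a non-number `targetIndex` for that case).

```go
package main

import "fmt"

// message is a hypothetical stand-in for one chat history entry.
type message struct {
	Role    string
	Content string
}

// regenerateFrom finds the nearest user message strictly before index target
// and returns the history that precedes that user turn, plus the user message
// to resend. A negative or out-of-range target searches from the end.
func regenerateFrom(history []message, target int) (base []message, user message, ok bool) {
	end := len(history)
	if target >= 0 && target < end {
		end = target
	}
	for i := end - 1; i >= 0; i-- {
		if history[i].Role == "user" {
			// Drop the user turn and everything after it; the caller
			// re-appends the user message when it resends.
			return history[:i], history[i], true
		}
	}
	return nil, message{}, false
}

func main() {
	history := []message{
		{"user", "q1"}, {"assistant", "a1"},
		{"user", "q2"}, {"assistant", "a2"},
	}
	base, user, ok := regenerateFrom(history, 1) // regenerate the first answer
	fmt.Println(len(base), user.Content, ok)
}
```

Returning the truncated history as a value, rather than reading it back from shared state, matches the reason the hunk gives for passing `baseHistory` to `sendMessage` explicitly.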
@@ -852,6 +849,11 @@ export default function Chat() {
} }
} }
const copyChatAsMarkdown = async (chat) => {
const ok = await copyToClipboard(serializeChatAsMarkdown(chat))
addToast(ok ? t('toasts.chatCopied') : t('toasts.copyFailed'), ok ? 'success' : 'error', ok ? 2000 : 3000)
}
const contextPercent = getContextUsagePercent() const contextPercent = getContextUsagePercent()
// Recent chats for the empty state — exclude the current chat and any // Recent chats for the empty state — exclude the current chat and any
@@ -892,7 +894,9 @@ export default function Chat() {
           onDelete={deleteChat}
           onDeleteAll={promptDeleteAll}
           onRename={renameChat}
-          onExport={(chat) => exportChatAsMarkdown(chat)}
+          onExport={(chat) => downloadChatAsMarkdown(chat)}
+          onCopyChat={(chat) => copyChatAsMarkdown(chat)}
+          onDuplicate={(chat) => { if (forkChat(chat.id)) addToast(t('toasts.forked'), 'success', 2000) }}
         />
         {activeChat.localaiAssistant && (
           <span
@@ -1184,11 +1188,19 @@ export default function Chat() {
           <button onClick={() => copyMessage(msg.content)} title={t('actions.copy')}>
             <i className="fas fa-copy" />
           </button>
-          {msg.role === 'assistant' && i === activeChat.history.length - 1 && !isStreaming && (
-            <button onClick={handleRegenerate} title={t('actions.regenerate')}>
+          {msg.role === 'assistant' && !isStreaming && (
+            <button onClick={() => handleRegenerate(i)} title={t('actions.regenerate')}>
               <i className="fas fa-rotate" />
             </button>
           )}
+          {msg.role === 'assistant' && !isStreaming && (
+            <button
+              onClick={() => { forkChat(activeChat.id, i + 1); addToast(t('toasts.forked'), 'success', 2000) }}
+              title={t('actions.branch')}
+            >
+              <i className="fas fa-code-branch" />
+            </button>
+          )}
         </div>
       </div>
     </div>

@@ -146,6 +146,7 @@ export default function Manage() {
const [distributedMode, setDistributedMode] = useState(false) const [distributedMode, setDistributedMode] = useState(false)
const [togglingModels, setTogglingModels] = useState(new Set()) const [togglingModels, setTogglingModels] = useState(new Set())
const [pinningModels, setPinningModels] = useState(new Set()) const [pinningModels, setPinningModels] = useState(new Set())
const [loadingModels, setLoadingModels] = useState(new Set())
// Expanded row state — keyed by `${tab}:${id}` so switching tabs doesn't // Expanded row state — keyed by `${tab}:${id}` so switching tabs doesn't
// collide and a single row is open at a time per tab. // collide and a single row is open at a time per tab.
const [expandedKey, setExpandedKey] = useState(null) const [expandedKey, setExpandedKey] = useState(null)
@@ -313,6 +314,26 @@ export default function Manage() {
}) })
} }
// Pre-load a model (or all of a realtime pipeline's sub-models) into memory.
// The /backend/load call blocks until loading finishes, so the menu item shows
// a loading state while in flight and reports the outcome on completion.
const handleLoadModel = async (modelName) => {
setLoadingModels(prev => new Set(prev).add(modelName))
try {
await backendControlApi.load({ model: modelName })
addToast(`Loaded ${modelName}`, 'success')
setTimeout(fetchLoadedModels, 500)
} catch (err) {
addToast(`Failed to load: ${err.message}`, 'error')
} finally {
setLoadingModels(prev => {
const next = new Set(prev)
next.delete(modelName)
return next
})
}
}
const handleDeleteModel = (modelName) => { const handleDeleteModel = (modelName) => {
setConfirmDialog({ setConfirmDialog({
title: 'Delete Model', title: 'Delete Model',
@@ -687,6 +708,11 @@ export default function Manage() {
label: model.disabled ? 'Enable model' : 'Disable model', label: model.disabled ? 'Enable model' : 'Disable model',
onClick: () => handleToggleModel(model.id, model.disabled), onClick: () => handleToggleModel(model.id, model.disabled),
disabled: togglingModels.has(model.id) }, disabled: togglingModels.has(model.id) },
{ key: 'load', icon: 'fa-bolt',
label: loadingModels.has(model.id) ? 'Loading…' : 'Load into memory',
onClick: () => handleLoadModel(model.id),
hidden: isRunning || !!model.disabled,
disabled: loadingModels.has(model.id) },
{ key: 'stop', icon: 'fa-stop', label: 'Stop model', { key: 'stop', icon: 'fa-stop', label: 'Stop model',
onClick: () => handleStopModel(model.id), hidden: !isRunning }, onClick: () => handleStopModel(model.id), hidden: !isRunning },
{ key: 'pin', icon: 'fa-thumbtack', { key: 'pin', icon: 'fa-thumbtack',

@@ -352,6 +352,9 @@ export const realtimeApi = {
// Backend control // Backend control
export const backendControlApi = { export const backendControlApi = {
shutdown: (body) => postJSON(API_CONFIG.endpoints.backendShutdown, body), shutdown: (body) => postJSON(API_CONFIG.endpoints.backendShutdown, body),
// Pre-load a model (or all of a realtime pipeline's sub-models) into memory.
// body: { model: "<name>" }. Inverse of shutdown.
load: (body) => postJSON(API_CONFIG.endpoints.backendLoad, body),
} }
// System info // System info

@@ -106,6 +106,7 @@ export const API_CONFIG = {
video: '/video', video: '/video',
backendMonitor: '/backend/monitor', backendMonitor: '/backend/monitor',
backendShutdown: '/backend/shutdown', backendShutdown: '/backend/shutdown',
backendLoad: '/backend/load',
modelsApply: '/models/apply', modelsApply: '/models/apply',
modelsDelete: (name) => `/models/delete/${name}`, modelsDelete: (name) => `/models/delete/${name}`,
modelsAvailable: '/models/available', modelsAvailable: '/models/available',

@@ -207,9 +207,14 @@ func RegisterLocalAIRoutes(router *echo.Echo,
backendMonitorService := monitoring.NewBackendMonitorService(ml, cl, appConfig) // Split out for now backendMonitorService := monitoring.NewBackendMonitorService(ml, cl, appConfig) // Split out for now
router.GET("/backend/monitor", localai.BackendMonitorEndpoint(backendMonitorService), adminMiddleware) router.GET("/backend/monitor", localai.BackendMonitorEndpoint(backendMonitorService), adminMiddleware)
router.POST("/backend/shutdown", localai.BackendShutdownEndpoint(backendMonitorService), adminMiddleware) router.POST("/backend/shutdown", localai.BackendShutdownEndpoint(backendMonitorService), adminMiddleware)
// /backend/load is the inverse of /backend/shutdown: pre-load a model (or all
// of a realtime pipeline's sub-models) into memory so clients can drive
// warm-up explicitly instead of paying the cold-start cost on first use.
router.POST("/backend/load", localai.LoadModelEndpoint(cl, ml, appConfig), adminMiddleware)
// The v1/* urls are exactly the same as above - makes local e2e testing easier if they are registered. // The v1/* urls are exactly the same as above - makes local e2e testing easier if they are registered.
router.GET("/v1/backend/monitor", localai.BackendMonitorEndpoint(backendMonitorService), adminMiddleware) router.GET("/v1/backend/monitor", localai.BackendMonitorEndpoint(backendMonitorService), adminMiddleware)
router.POST("/v1/backend/shutdown", localai.BackendShutdownEndpoint(backendMonitorService), adminMiddleware) router.POST("/v1/backend/shutdown", localai.BackendShutdownEndpoint(backendMonitorService), adminMiddleware)
router.POST("/v1/backend/load", localai.LoadModelEndpoint(cl, ml, appConfig), adminMiddleware)
// Traces and backend logs (monitoring) // Traces and backend logs (monitoring)
router.GET("/api/traces", localai.GetAPITracesEndpoint(), adminMiddleware) router.GET("/api/traces", localai.GetAPITracesEndpoint(), adminMiddleware)
@@ -245,6 +250,7 @@ func RegisterLocalAIRoutes(router *echo.Echo,
"metrics": "/metrics", "metrics": "/metrics",
"backend_monitor": "/backend/monitor", "backend_monitor": "/backend/monitor",
"backend_shutdown": "/backend/shutdown", "backend_shutdown": "/backend/shutdown",
"backend_load": "/backend/load",
"system": "/system", "system": "/system",
"version": "/version", "version": "/version",
"traces": "/api/traces", "traces": "/api/traces",

@@ -11,6 +11,24 @@ type BackendMonitorRequest struct {
BasicModelRequest BasicModelRequest
} }
// ModelLoadRequest asks LocalAI to pre-load a model into memory by name, so the
// first request that uses it pays no cold-start load cost. For a realtime
// pipeline model, every configured sub-model (VAD, transcription, LLM, TTS,
// sound_detection, voice_recognition) is loaded instead of the pipeline stub.
// It is the inverse of the /backend/shutdown request.
type ModelLoadRequest struct {
BasicModelRequest
}
// ModelLoadResponse reports the outcome of a /backend/load call.
type ModelLoadResponse struct {
// Loaded lists the model names actually resident in memory after the call.
// For a pipeline model these are its sub-models, not the pipeline name.
Loaded []string `json:"loaded"`
// Message is a short human-readable status ("model loaded", or an error).
Message string `json:"message"`
}
type TokenMetricsRequest struct { type TokenMetricsRequest struct {
BasicModelRequest BasicModelRequest
} }

@@ -14,6 +14,16 @@ import (
// MaxSnippetSeconds is the maximum number of seconds of audio captured per trace. // MaxSnippetSeconds is the maximum number of seconds of audio captured per trace.
const MaxSnippetSeconds = 30 const MaxSnippetSeconds = 30
// silenceFloorDBFS is the dBFS value reported for digital silence (RMS or peak
// of zero). The true level is -∞ dBFS; reporting a finite floor keeps the
// metric present and meaningful in the Traces UI (a scrubbed nil would read as
// "missing" rather than "silent"). -120 dBFS sits well below 16-bit PCM's
// ~-90 dBFS least-significant-bit floor, so it reads unambiguously as
// "effectively silent". JSON-marshal safety for any non-finite float that does
// reach a trace is owned centrally by RecordBackendTrace's sanitizer — this
// floor is about presentation, not transport.
const silenceFloorDBFS = -120.0
// AudioSnippet captures the first MaxSnippetSeconds of a WAV file and computes // AudioSnippet captures the first MaxSnippetSeconds of a WAV file and computes
// quality metrics. The result is a map suitable for merging into a BackendTrace // quality metrics. The result is a map suitable for merging into a BackendTrace
// Data field. maxBytes caps the embedded base64 waveform so a single TTS or // Data field. maxBytes caps the embedded base64 waveform so a single TTS or
@@ -63,7 +73,7 @@ func AudioSnippetFromPCM(pcm []byte, sampleRate, totalPCMBytes, maxBytes int) ma
 	snippetDuration := float64(len(samples)) / float64(sampleRate)
 	rms := sound.CalculateRMS16(samples)
-	rmsDBFS := -math.Inf(1)
+	rmsDBFS := silenceFloorDBFS
 	if rms > 0 {
 		rmsDBFS = 20 * math.Log10(rms/32768.0)
 	}
@@ -78,7 +88,7 @@ func AudioSnippetFromPCM(pcm []byte, sampleRate, totalPCMBytes, maxBytes int) ma
 		}
 		dcSum += int64(s)
 	}
-	peakDBFS := -math.Inf(1)
+	peakDBFS := silenceFloorDBFS
 	if peak > 0 {
 		peakDBFS = 20 * math.Log10(float64(peak)/32768.0)
 	}

@@ -1,6 +1,9 @@
package trace_test package trace_test
import ( import (
"encoding/json"
"math"
. "github.com/onsi/ginkgo/v2" . "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega" . "github.com/onsi/gomega"
@@ -47,3 +50,32 @@ var _ = Describe("AudioSnippetFromPCM byte cap", func() {
Expect(out).To(HaveKey("audio_wav_base64")) Expect(out).To(HaveKey("audio_wav_base64"))
}) })
}) })
// Silent audio (RMS/peak of zero) has a true level of -∞ dBFS, but emitting
// -Inf made the whole /api/backend-traces response fail to JSON-marshal and
// blanked the Traces UI. The metrics must instead be finite and serializable.
var _ = Describe("AudioSnippetFromPCM silent audio dBFS", func() {
pcm := makePCM(snippetSeconds, snippetSampleRate) // all zeros == digital silence
totalPCM := len(pcm)
It("reports finite dBFS for silence instead of -Inf", func() {
out := trace.AudioSnippetFromPCM(pcm, snippetSampleRate, totalPCM, 0)
rms, ok := out["audio_rms_dbfs"].(float64)
Expect(ok).To(BeTrue())
Expect(math.IsInf(rms, 0)).To(BeFalse(), "silent RMS must not be ±Inf")
Expect(math.IsNaN(rms)).To(BeFalse())
peak, ok := out["audio_peak_dbfs"].(float64)
Expect(ok).To(BeTrue())
Expect(math.IsInf(peak, 0)).To(BeFalse(), "silent peak must not be ±Inf")
Expect(math.IsNaN(peak)).To(BeFalse())
})
It("produces a snippet that round-trips through encoding/json", func() {
out := trace.AudioSnippetFromPCM(pcm, snippetSampleRate, totalPCM, 0)
_, err := json.Marshal(out)
Expect(err).ToNot(HaveOccurred(), "silent-audio metrics must be JSON-marshalable")
})
})

@@ -3,6 +3,8 @@ package trace
import ( import (
"encoding/json" "encoding/json"
"fmt" "fmt"
"maps"
"math"
"slices" "slices"
"sync" "sync"
"time" "time"
@@ -116,8 +118,13 @@ func RecordBackendTrace(t BackendTrace) {
 	backendMu.Lock()
 	maxBody := backendMaxBodyBytes
 	backendMu.Unlock()
-	if t.Data != nil && maxBody > 0 {
-		t.Data = capDataStrings(t.Data, maxBody)
+	// Always walk Data, even with no body cap configured: besides capping
+	// oversized strings (maxBody > 0), the walk replaces non-finite floats
+	// (Inf/NaN) that encoding/json cannot marshal. A single such value — e.g. a
+	// -Inf dBFS audio metric from a silent clip — would otherwise fail the whole
+	// /api/backend-traces response and blank the Traces UI.
+	if t.Data != nil {
+		t.Data = sanitizeData(t.Data, maxBody)
 	}
 	select {
 	case backendLogChan <- &t:
@@ -126,32 +133,90 @@ func RecordBackendTrace(t BackendTrace) {
 	}
 }

-// capDataStrings walks a trace Data map and replaces any string value (at any
-// depth) that exceeds maxBytes with a fixed-size marker that names the
-// original byte count. The replacement is intentionally short and not valid
-// base64/JSON: the goal is to flag "this was dropped" cheaply, not to keep a
-// partial value that the UI might try to render. Non-string scalars and
-// non-map containers pass through untouched so structural fields like
-// total_deltas or audio_sample_rate remain useful.
-func capDataStrings(data map[string]any, maxBytes int) map[string]any {
-	out := make(map[string]any, len(data))
-	for k, v := range data {
-		out[k] = capValue(v, maxBytes)
-	}
+// sanitizeData walks a trace Data map (recursing into nested maps and slices)
+// and makes every value safe for the /api/backend-traces JSON response:
+//
+//   - When maxBytes > 0, any string longer than maxBytes is replaced with a
+//     fixed-size marker that names the original byte count. The replacement is
+//     intentionally short and not valid base64/JSON: it flags "this was dropped"
+//     cheaply rather than keeping a partial value the UI might try to render.
+//   - Non-finite floats (Inf/NaN) are replaced with nil regardless of maxBytes,
+//     because encoding/json refuses to marshal them and one bad value would fail
+//     the entire response.
+//
+// Other scalars (ints, bools, finite floats) pass through untouched so
+// structural fields like total_deltas or audio_sample_rate remain useful.
+//
+// The walk is copy-on-write: it runs on every RecordBackendTrace call, and in
+// the common case nothing needs rewriting, so containers are only re-allocated
+// on the paths that actually changed and untouched values keep their original
+// interface boxes instead of paying a per-value re-boxing allocation.
+func sanitizeData(data map[string]any, maxBytes int) map[string]any {
+	out, _ := sanitizeMap(data, maxBytes)
 	return out
 }

-func capValue(v any, maxBytes int) any {
+func sanitizeMap(m map[string]any, maxBytes int) (map[string]any, bool) {
+	var out map[string]any
+	for k, v := range m {
+		nv, changed := sanitizeValue(v, maxBytes)
+		if changed && out == nil {
+			// First change: fork the map. Entries already visited were
+			// unchanged, so a full copy then overwriting as we go is exact.
+			out = make(map[string]any, len(m))
+			maps.Copy(out, m)
+		}
+		if out != nil {
+			out[k] = nv
+		}
+	}
+	if out == nil {
+		return m, false
+	}
+	return out, true
+}
+
+func sanitizeSlice(s []any, maxBytes int) ([]any, bool) {
+	var out []any
+	for i, v := range s {
+		nv, changed := sanitizeValue(v, maxBytes)
+		if changed && out == nil {
+			out = make([]any, len(s))
+			copy(out, s)
+		}
+		if out != nil {
+			out[i] = nv
+		}
+	}
+	if out == nil {
+		return s, false
+	}
+	return out, true
+}
+
+func sanitizeValue(v any, maxBytes int) (any, bool) {
 	switch val := v.(type) {
 	case string:
-		if len(val) > maxBytes {
-			return fmt.Sprintf("<truncated: %d bytes>", len(val))
+		if maxBytes > 0 && len(val) > maxBytes {
+			return fmt.Sprintf("<truncated: %d bytes>", len(val)), true
 		}
-		return val
+		return v, false
+	case float64:
+		if math.IsInf(val, 0) || math.IsNaN(val) {
+			return nil, true
+		}
+		return v, false
+	case float32:
+		if f := float64(val); math.IsInf(f, 0) || math.IsNaN(f) {
+			return nil, true
+		}
+		return v, false
 	case map[string]any:
-		return capDataStrings(val, maxBytes)
+		return sanitizeMap(val, maxBytes)
+	case []any:
+		return sanitizeSlice(val, maxBytes)
 	default:
-		return v
+		return v, false
 	}
 }

@@ -0,0 +1,80 @@
package trace_test
import (
"encoding/json"
"math"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/trace"
)
// encoding/json cannot marshal ±Inf or NaN. The /api/backend-traces endpoint
// serializes the whole buffer with one json call, so a single non-finite float
// in any trace's Data map (e.g. a -Inf dBFS audio metric from a silent clip)
// would fail the entire response and blank the Traces UI. RecordBackendTrace
// must scrub those values regardless of whether a body cap is configured.
var _ = Describe("RecordBackendTrace non-finite float sanitization", func() {
BeforeEach(func() {
// maxBodyBytes 0 == no body cap: float sanitization must still run.
trace.InitBackendTracingIfEnabled(64, 0)
trace.ClearBackendTraces()
})
It("replaces ±Inf and NaN with nil so the response stays JSON-marshalable", func() {
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: time.Now(),
Type: trace.BackendTraceTranscription,
ModelName: "m",
Data: map[string]any{
"audio_rms_dbfs": math.Inf(-1),
"audio_peak_dbfs": math.Inf(1),
"weird": math.NaN(),
"audio_duration_s": 1.5, // finite siblings must survive
},
})
Eventually(trace.GetBackendTraces).Should(HaveLen(1))
got := trace.GetBackendTraces()[0]
Expect(got.Data["audio_rms_dbfs"]).To(BeNil())
Expect(got.Data["audio_peak_dbfs"]).To(BeNil())
Expect(got.Data["weird"]).To(BeNil())
Expect(got.Data["audio_duration_s"]).To(Equal(1.5), "finite floats must pass through untouched")
_, err := json.Marshal(trace.GetBackendTraces())
Expect(err).ToNot(HaveOccurred(), "the whole trace buffer must marshal even with non-finite inputs")
})
It("scrubs non-finite floats nested in maps and slices", func() {
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: time.Now(),
Type: trace.BackendTraceLLM,
ModelName: "m",
Data: map[string]any{
"nested": map[string]any{
"logprob": math.Inf(-1),
"ok": 0.25,
},
"scores": []any{1.0, math.Inf(1), math.NaN()},
},
})
Eventually(trace.GetBackendTraces).Should(HaveLen(1))
got := trace.GetBackendTraces()[0]
nested := got.Data["nested"].(map[string]any)
Expect(nested["logprob"]).To(BeNil())
Expect(nested["ok"]).To(Equal(0.25))
scores := got.Data["scores"].([]any)
Expect(scores[0]).To(Equal(1.0))
Expect(scores[1]).To(BeNil())
Expect(scores[2]).To(BeNil())
_, err := json.Marshal(trace.GetBackendTraces())
Expect(err).ToNot(HaveOccurred())
})
})

@@ -381,6 +381,8 @@ curl -X POST http://localhost:8080/backend/shutdown \
To stop all models, you'll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once. To stop all models, you'll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
Conversely, you can pre-load a model into memory ahead of its first request with `POST /backend/load` (the inverse of shutdown) — see [Backend Monitor]({{%relref "features/backend-monitor" %}}).
### Best Practices ### Best Practices
1. **Monitor VRAM usage**: Use `nvidia-smi` (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage 1. **Monitor VRAM usage**: Use `nvidia-smi` (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage

@@ -166,7 +166,7 @@ When authentication is enabled, the following endpoints require admin role:
 - `GET /api/backend-traces`, `POST /api/backend-traces/clear`
 - `GET /api/backend-logs/*`, `POST /api/backend-logs/*/clear`
 - `GET /api/resources`, `GET /api/settings`, `POST /api/settings`
-- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`
+- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`, `POST /backend/load`

 **P2P:**
 - `GET /api/p2p/*`

@@ -5,7 +5,9 @@ weight = 20
 url = "/features/backend-monitor/"
 +++

-LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, and `/backend/shutdown` allows stopping a model's backend process.
+LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, `/backend/load` pre-loads a model into memory, and `/backend/shutdown` allows stopping a model's backend process.
+
+All three are admin-only.

 ## Monitor API
@@ -62,6 +64,42 @@ curl "http://localhost:8080/backend/monitor?model=my-model"
} }
``` ```
## Load API
Pre-loads a model into memory ahead of its first request, so that request pays no cold-start load cost. It is the inverse of the Shutdown API and works for any model, not just realtime pipelines.
- **Method:** `POST`
- **Endpoints:** `/backend/load`, `/v1/backend/load`
### Request
| Parameter | Type | Required | Description |
|-----------|----------|----------|------------------------------|
| `model` | `string` | Yes | Name of the model to load |
### Behavior
- For a regular model, its own backend is loaded.
- For a **realtime pipeline** model (a config with a `pipeline:` block), every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded concurrently instead of the pipeline stub, which has no backend of its own.
The call blocks until loading finishes and reports which model names became resident, so partial failures are visible.
### Usage
```bash
curl -X POST http://localhost:8080/backend/load \
-H "Content-Type: application/json" \
-d '{"model": "my-model"}'
```
### Example response
```json
{ "loaded": ["my-model"], "message": "model loaded" }
```
On failure the call returns `500` with `loaded` listing whichever sub-models did load and `message` naming the failures.
## Shutdown API ## Shutdown API
- **Method:** `POST` - **Method:** `POST`

@@ -56,6 +56,39 @@ pipeline:
All streaming flags are off by default, so existing pipelines are unaffected. All streaming flags are off by default, so existing pipelines are unaffected.
### Model warm-up (cold start)
Without warm-up the pipeline's models are loaded into memory only on first use *within* a session: the VAD on the first audio chunk, transcription at the first end-of-speech, the LLM on the first reply, and TTS on the first spoken output. On a cold session this staggers a load delay across those first few interactions — and a model that fails to load (missing weights, wrong backend, out of memory) only fails part-way through the first turn.
To avoid that, LocalAI **warms the pipeline by default**: it loads the VAD, transcription, LLM and TTS backends into memory *before* the session is announced, and the session start **blocks until they are all ready**. The loads run concurrently, so the wait is the slowest single model, not the sum. This means:
- The first turn pays no cold-start cost — every backend is already resident.
- **Model-load errors surface at session start.** If any stage fails to load, the session is not started and the client receives a `model_load_error` instead of `session.created`, so a broken pipeline fails fast and visibly rather than mid-call.
Set `disable_warmup: true` to restore the lazy "load on first use" behavior — session start no longer waits on loading and load errors surface on the first turn instead. Useful if you want idle sessions to avoid holding model memory they may never use:
```yaml
name: gpt-realtime
pipeline:
vad: silero-vad-ggml
transcription: whisper-large-turbo
llm: qwen3-4b
tts: tts-1
disable_warmup: true # lazily load each model on first use instead of at session start
```
#### Pre-loading a pipeline on demand
Warm-up only fires when a realtime session opens. To load a pipeline into memory ahead of time — e.g. to warm it right after boot, or when running with `disable_warmup: true` — POST the model name to the admin-only `/backend/load` endpoint. For a pipeline model it loads every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) concurrently:
```bash
curl -X POST http://localhost:8080/backend/load \
-H "Content-Type: application/json" \
-d '{"model": "gpt-realtime"}'
```
The endpoint is not realtime-specific — it pre-loads any model. See [Backend Monitor]({{%relref "features/backend-monitor" %}}) for the full request/response reference (it is the inverse of `/backend/shutdown`).
### Turn detection ### Turn detection
Turn detection decides when the user has finished speaking and the pipeline should respond. Two modes are supported, matching the OpenAI session schema: Turn detection decides when the user has finished speaking and the pipeline should respond. Two modes are supported, matching the OpenAI session schema:

@@ -1,4 +1,56 @@
--- ---
- name: "qwopus3.6-35b-a3b-coder-mtp"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF
description: |
# 🌟 Qwopus3.6-35B-A3B-v1
## 💡 Base Model Overview
**Qwen3.6-35B-A3B** is an advanced hybrid sparse MoE (Mixture-of-Experts) model developed by Alibaba Cloud. It features 35B total parameters with only 3B active parameters per token, ensuring high inference efficiency. Architecturally, it combines Gated DeltaNet linear attention with standard gated attention layers, routing tokens across **256 experts**. It natively supports a massive **262k context window** and is specifically designed for high-performance agentic coding, deep reasoning, and multimodal tasks.
## 🚀 Model Refinement & Logic Tuning Qwopus3.6-35B-A3B-v1
🪐**Qwopus3.6-35B-A3B-v1** is a reasoning-enhanced MoE (Mixture of Experts) model fine-tuned on top of **Qwen3.6-35B-A3B**.
### 🛠 Training Strategy
The fine-tuning process for this model is structured into **three distinct stages of distributed SFT (Supervised Fine-Tuning)**, progressively scaling reasoning complexity and data diversity. This systematic approach ensures the model inherits the base MoE capabilities while sharpening its logic-handling depth.
...
license: "apache-2.0"
tags:
- llm
- gguf
- vision
- multimodal
icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/ztbyGV_zGhzcLuTCSVyq3.png
overrides:
backend: llama-cpp
function:
automatic_tool_parsing_fallback: true
grammar:
disable: true
known_usecases:
- chat
mmproj: llama-cpp/mmproj/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/mmproj-F32.gguf
options:
- use_jinja:true
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
parameters:
model: llama-cpp/models/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M.gguf
template:
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M.gguf
sha256: c283cd2321a3cb4c6e7faf9481ac7d946913e4f02e20172eb2872112f567d8d4
uri: https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF/resolve/main/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M.gguf
- filename: llama-cpp/mmproj/Qwopus3.6-35B-A3B-Coder-MTP-Q4_K_M/mmproj-F32.gguf
sha256: 5c82c8095717b39f29c88ebfec3607a10307785b1e14a87744603d6c582cd497
uri: https://huggingface.co/Jackrong/Qwopus3.6-35B-A3B-Coder-MTP-GGUF/resolve/main/mmproj-F32.gguf
- name: "ornith-1.0-9b-mtp" - name: "ornith-1.0-9b-mtp"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master" url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls: urls:

@@ -461,10 +461,7 @@ func (p *RuleParser) parse(arena *Arena, ctx *ParseContext, start int) ParseResu
 	if result.Type != Fail {
 		text := ""
 		if result.Start < len(ctx.Input) {
-			end := result.End
-			if end > len(ctx.Input) {
-				end = len(ctx.Input)
-			}
+			end := min(result.End, len(ctx.Input))
 			text = ctx.Input[result.Start:end]
 		}
@@ -514,10 +511,7 @@ func (p *TagParser) parse(arena *Arena, ctx *ParseContext, start int) ParseResul
 	if result.Type != Fail {
 		text := ""
 		if result.Start < len(ctx.Input) {
-			end := result.End
-			if end > len(ctx.Input) {
-				end = len(ctx.Input)
-			}
+			end := min(result.End, len(ctx.Input))
 			text = ctx.Input[result.Start:end]
 		}

@@ -36,6 +36,10 @@ type LocalAIClient interface {
DeleteModel(ctx context.Context, name string) error
EditModelConfig(ctx context.Context, name string, patch map[string]any) error
ReloadModels(ctx context.Context) error
// LoadModel pre-loads a model into memory by name (the inverse of shutting
// it down). For a realtime pipeline model every configured sub-model is
// loaded; it returns the model names that became resident.
LoadModel(ctx context.Context, model string) ([]string, error)
ImportModelURI(ctx context.Context, req ImportModelURIRequest) (*ImportModelURIResponse, error)
// ---- Model aliases ----


@@ -49,6 +49,7 @@ var toolToHTTPRoute = map[string]string{
ToolDeleteModel: "POST /models/delete/:name",
ToolEditModelConfig: "PATCH /api/models/config-json/:name",
ToolReloadModels: "POST /models/reload",
ToolLoadModel: "POST /backend/load",
ToolInstallBackend: "POST /backends/apply",
ToolUpgradeBackend: "POST /backends/upgrade/:name",
ToolToggleModelState: "PUT /models/toggle-state/:name/:action",


@@ -35,6 +35,7 @@ type fakeClient struct {
setAlias func(string, string) error
listAliases func() ([]AliasInfo, error)
reloadModels func() error
loadModel func(string) ([]string, error)
listBackends func() ([]Backend, error)
listKnownBackends func() ([]schema.KnownBackend, error)
installBackend func(InstallBackendRequest) (string, error)
@@ -169,6 +170,14 @@ func (f *fakeClient) ReloadModels(_ context.Context) error {
return nil
}
func (f *fakeClient) LoadModel(_ context.Context, model string) ([]string, error) {
f.record("LoadModel", model)
if f.loadModel != nil {
return f.loadModel(model)
}
return []string{model}, nil
}
func (f *fakeClient) ListBackends(_ context.Context) ([]Backend, error) {
f.record("ListBackends", nil)
if f.listBackends != nil {


@@ -338,6 +338,16 @@ func (c *Client) ReloadModels(ctx context.Context) error {
return c.do(ctx, http.MethodPost, routeModelsReload, nil, nil)
}
func (c *Client) LoadModel(ctx context.Context, model string) ([]string, error) {
// On a load failure the endpoint returns a non-2xx whose body (carrying the
// per-sub-model failure detail) is folded into the HTTPError by c.do.
var resp schema.ModelLoadResponse
if err := c.do(ctx, http.MethodPost, routeBackendLoad, map[string]string{"model": model}, &resp); err != nil {
return nil, err
}
return resp.Loaded, nil
}
// ---- Model aliases ----
// SetAlias is swap-first: it PATCHes the alias config (a deep-merge that
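
The HTTP client's `LoadModel` above POSTs `{"model": ...}` to `/backend/load` and reads the `loaded` list from the response. The sketch below shows that round trip against a stub server, under several assumptions: `modelLoadResponse` is a local mirror of the two fields documented for `schema.ModelLoadResponse`, `loadModel` and `newStubServer` are hypothetical helpers, and the stub only imitates the documented 200 and 400 responses rather than LocalAI's real handler.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// modelLoadResponse is a local stand-in for schema.ModelLoadResponse.
type modelLoadResponse struct {
	Loaded  []string `json:"loaded"`
	Message string   `json:"message"`
}

// loadModel POSTs the model name to baseURL+"/backend/load" and returns the
// names the server reports as loaded. Non-2xx responses become errors that
// carry the server's message.
func loadModel(baseURL, name string) ([]string, error) {
	body, err := json.Marshal(map[string]string{"model": name})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/backend/load", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out modelLoadResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return nil, fmt.Errorf("load failed (%d): %s", resp.StatusCode, out.Message)
	}
	return out.Loaded, nil
}

// newStubServer returns a test server that imitates the documented contract:
// 400 when the model name is missing, 200 with the name echoed otherwise.
func newStubServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/backend/load", func(w http.ResponseWriter, r *http.Request) {
		var req map[string]string
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req["model"] == "" {
			w.WriteHeader(http.StatusBadRequest)
			json.NewEncoder(w).Encode(modelLoadResponse{Message: "model is required"})
			return
		}
		json.NewEncoder(w).Encode(modelLoadResponse{
			Loaded:  []string{req["model"]},
			Message: "model loaded",
		})
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	loaded, err := loadModel(srv.URL, "test-model")
	fmt.Println(loaded, err)
}
```

For a real pipeline model the `loaded` list would name the sub-models rather than the pipeline itself, as the schema description later in this diff states.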


@@ -19,6 +19,7 @@ const (
routeModelImport = "/models/import"
routeAliases = "/api/aliases"
routeModelsReload = "/models/reload"
routeBackendLoad = "/backend/load"
routeBackends = "/backends"
routeBackendsKnown = "/backends/known"
routeBackendsApply = "/backends/apply"


@@ -13,6 +13,7 @@ import (
"path/filepath"
"github.com/google/uuid"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/gallery/importers"
@@ -302,6 +303,16 @@ func (c *Client) ReloadModels(_ context.Context) error {
return c.ConfigLoader.LoadModelConfigsFromPath(c.SystemState.Model.ModelsPath)
}
func (c *Client) LoadModel(ctx context.Context, model string) ([]string, error) {
if c.ConfigLoader == nil || c.ModelLoader == nil {
return nil, errors.New("model loader not available")
}
// Reuse the same preload path the REST /backend/load endpoint uses, so a
// pipeline model loads all its sub-models and the behaviour stays identical
// across the in-process and HTTP clients.
return backend.PreloadModelByName(ctx, c.ConfigLoader, c.ModelLoader, c.AppConfig, model)
}
// ---- Model aliases ----
// SetAlias is swap-first to match the httpapi client: PatchConfig swaps an


@@ -0,0 +1,71 @@
package inproc
import (
"context"
"os"
"path/filepath"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
)
var _ = Describe("inproc.Client LoadModel", func() {
var (
ctx context.Context
tempDir string
cl *config.ModelConfigLoader
ml *model.ModelLoader
c *Client
seedModel func(name, body string)
)
BeforeEach(func() {
ctx = context.Background()
tempDir = GinkgoT().TempDir()
systemState, err := system.GetSystemState(system.WithModelPath(tempDir))
Expect(err).ToNot(HaveOccurred())
appConfig := config.NewApplicationConfig(config.WithSystemState(systemState))
cl = config.NewModelConfigLoader(tempDir)
ml = model.NewModelLoader(systemState) // no backends installed
c = New(appConfig, systemState, cl, ml, nil)
seedModel = func(name, body string) {
Expect(os.WriteFile(filepath.Join(tempDir, name+".yaml"), []byte(body), 0o644)).To(Succeed())
Expect(cl.LoadModelConfigsFromPath(tempDir)).To(Succeed())
}
})
It("errors when the model loader is unavailable", func() {
noLoader := New(c.AppConfig, c.SystemState, cl, nil, nil)
_, err := noLoader.LoadModel(ctx, "anything")
Expect(err).To(MatchError(ContainSubstring("model loader not available")))
})
It("loads a regular model through the model loader", func() {
seedModel("solo", "name: solo\n")
// No backend is installed in the test env, so the load itself fails — but
// the call must exercise the single-model path and surface that error
// rather than panicking or silently succeeding.
loaded, err := c.LoadModel(ctx, "solo")
Expect(err).To(HaveOccurred())
Expect(loaded).To(BeEmpty())
})
It("expands a pipeline model into its sub-models", func() {
seedModel("voicebot", "name: voicebot\npipeline:\n vad: vad-m\n llm: llm-m\n")
seedModel("vad-m", "name: vad-m\n")
seedModel("llm-m", "name: llm-m\n")
loaded, err := c.LoadModel(ctx, "voicebot")
// Sub-models can't load without backends, so the joined error names them
// — proving the pipeline stub was expanded rather than loaded directly.
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("vad-m"))
Expect(err.Error()).ToNot(ContainSubstring("voicebot"))
Expect(loaded).To(BeEmpty())
})
})


@@ -2,7 +2,7 @@
These rules are non-negotiable. The user trusts you to operate their server without unintended changes.
-1. **Confirm before mutating.** Before calling any of these tools — `install_model`, `import_model_uri`, `delete_model`, `install_backend`, `upgrade_backend`, `edit_model_config`, `reload_models`, `toggle_model_state`, `toggle_model_pinned` — first state in plain language what you are about to do (which tool, which target, which arguments) and wait for the user's explicit confirmation in the next turn. "Yes", "do it", "go ahead", "proceed" all count as confirmation. Anything else does not.
+1. **Confirm before mutating.** Before calling any of these tools — `install_model`, `import_model_uri`, `delete_model`, `install_backend`, `upgrade_backend`, `edit_model_config`, `reload_models`, `load_model`, `toggle_model_state`, `toggle_model_pinned` — first state in plain language what you are about to do (which tool, which target, which arguments) and wait for the user's explicit confirmation in the next turn. "Yes", "do it", "go ahead", "proceed" all count as confirmation. Anything else does not.
2. **Disambiguate before mutating.** If the user's request is ambiguous (several gallery candidates match, the model name has multiple installed versions, the backend has variants), present the candidates as a numbered list and ask the user to pick before calling any mutating tool.


@@ -24,5 +24,6 @@ The MCP `tools/list` endpoint also exposes the full input schema for each of the
- `upgrade_backend` — Upgrade an installed backend by name.
- `edit_model_config` — Patch (deep-merge) JSON into an installed model's config.
- `reload_models` — Reload all model configs from disk.
- `load_model` — Pre-load a model into memory so the first request pays no cold-start cost. For a realtime pipeline model, every sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded. Inverse of stopping a model.
- `toggle_model_state` — Enable or disable a model (`action`: `enable` or `disable`).
- `toggle_model_pinned` — Pin or unpin a model (`action`: `pin` or `unpin`).


@@ -92,6 +92,7 @@ var expectedFullCatalog = sortedStrings(
ToolListInstalledModels,
ToolListKnownBackends,
ToolListNodes,
ToolLoadModel,
ToolReloadModels,
ToolSetAlias,
ToolSetBranding,
@@ -166,6 +167,7 @@ var _ = Describe("Tool dispatch", func() {
{ToolUpgradeBackend, map[string]any{"name": "llama-cpp"}, "UpgradeBackend"},
{ToolEditModelConfig, map[string]any{"name": "foo", "patch": map[string]any{"context_size": 4096}}, "EditModelConfig"},
{ToolReloadModels, struct{}{}, "ReloadModels"},
{ToolLoadModel, map[string]any{"model": "test-model"}, "LoadModel"},
{ToolToggleModelState, map[string]any{"name": "foo", "action": "enable"}, "ToggleModelState"},
{ToolToggleModelPinned, map[string]any{"name": "foo", "action": "pin"}, "ToggleModelPinned"},
{ToolSetAlias, map[string]any{"name": "gpt-4", "target": "real"}, "SetAlias"},


@@ -31,6 +31,7 @@ const (
ToolDeleteModel = "delete_model"
ToolEditModelConfig = "edit_model_config"
ToolReloadModels = "reload_models"
ToolLoadModel = "load_model"
ToolInstallBackend = "install_backend"
ToolUpgradeBackend = "upgrade_backend"
ToolToggleModelState = "toggle_model_state"


@@ -65,6 +65,22 @@ func registerModelTools(s *mcp.Server, client LocalAIClient, opts Options) {
return
}
mcp.AddTool(s, &mcp.Tool{
Name: ToolLoadModel,
Description: "Pre-load a model into memory by name so the first request pays no cold-start cost (the inverse of shutting a model down). For a realtime pipeline model every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded. Returns the model names that became resident. Requires user confirmation per safety rule 1.",
}, func(ctx context.Context, _ *mcp.CallToolRequest, args struct {
Model string `json:"model" jsonschema:"The installed model name to load into memory."`
}) (*mcp.CallToolResult, any, error) {
if args.Model == "" {
return errorResultf("model is required"), nil, nil
}
loaded, err := client.LoadModel(ctx, args.Model)
if err != nil {
return errorResult(err), nil, nil
}
return jsonResult(map[string]any{"loaded": loaded}), nil, nil
})
mcp.AddTool(s, &mcp.Tool{
Name: ToolInstallModel,
Description: "Install a model from a gallery. Requires explicit user confirmation per safety rule 1. Returns a job id; poll with get_job_status.",


@@ -1443,6 +1443,52 @@ const docTemplate = `{
"responses": {}
}
},
"/backend/load": {
"post": {
"description": "Loads the named model (or, for a realtime pipeline, all of its sub-models) into memory so subsequent requests pay no cold-start cost. The inverse of /backend/shutdown.",
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"monitoring"
],
"summary": "Pre-load a model into memory",
"parameters": [
{
"description": "Model to load",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.ModelLoadRequest"
}
}
],
"responses": {
"200": {
"description": "Model loaded",
"schema": {
"$ref": "#/definitions/schema.ModelLoadResponse"
}
},
"400": {
"description": "Missing model name",
"schema": {
"$ref": "#/definitions/schema.ModelLoadResponse"
}
},
"500": {
"description": "Load failed (Loaded lists any sub-models that did load)",
"schema": {
"$ref": "#/definitions/schema.ModelLoadResponse"
}
}
}
}
},
"/backend/monitor": {
"get": {
"tags": [
@@ -5136,6 +5182,30 @@ const docTemplate = `{
}
}
},
"schema.ModelLoadRequest": {
"type": "object",
"properties": {
"model": {
"type": "string"
}
}
},
"schema.ModelLoadResponse": {
"type": "object",
"properties": {
"loaded": {
"description": "Loaded lists the model names actually resident in memory after the call.\nFor a pipeline model these are its sub-models, not the pipeline name.",
"type": "array",
"items": {
"type": "string"
}
},
"message": {
"description": "Message is a short human-readable status (\"model loaded\", or an error).",
"type": "string"
}
}
},
"schema.ModelsDataResponse": {
"type": "object",
"properties": {


@@ -1440,6 +1440,52 @@
"responses": {}
}
},
"/backend/load": {
"post": {
"description": "Loads the named model (or, for a realtime pipeline, all of its sub-models) into memory so subsequent requests pay no cold-start cost. The inverse of /backend/shutdown.",
"consumes": [
"application/json"
],
"produces": [
"application/json"
],
"tags": [
"monitoring"
],
"summary": "Pre-load a model into memory",
"parameters": [
{
"description": "Model to load",
"name": "request",
"in": "body",
"required": true,
"schema": {
"$ref": "#/definitions/schema.ModelLoadRequest"
}
}
],
"responses": {
"200": {
"description": "Model loaded",
"schema": {
"$ref": "#/definitions/schema.ModelLoadResponse"
}
},
"400": {
"description": "Missing model name",
"schema": {
"$ref": "#/definitions/schema.ModelLoadResponse"
}
},
"500": {
"description": "Load failed (Loaded lists any sub-models that did load)",
"schema": {
"$ref": "#/definitions/schema.ModelLoadResponse"
}
}
}
}
},
"/backend/monitor": {
"get": {
"tags": [
@@ -5133,6 +5179,30 @@
}
}
},
"schema.ModelLoadRequest": {
"type": "object",
"properties": {
"model": {
"type": "string"
}
}
},
"schema.ModelLoadResponse": {
"type": "object",
"properties": {
"loaded": {
"description": "Loaded lists the model names actually resident in memory after the call.\nFor a pipeline model these are its sub-models, not the pipeline name.",
"type": "array",
"items": {
"type": "string"
}
},
"message": {
"description": "Message is a short human-readable status (\"model loaded\", or an error).",
"type": "string"
}
}
},
"schema.ModelsDataResponse": {
"type": "object",
"properties": {


@@ -1362,6 +1362,25 @@ definitions:
$ref: '#/definitions/schema.ToolCall'
type: array
type: object
schema.ModelLoadRequest:
properties:
model:
type: string
type: object
schema.ModelLoadResponse:
properties:
loaded:
description: |-
Loaded lists the model names actually resident in memory after the call.
For a pipeline model these are its sub-models, not the pipeline name.
items:
type: string
type: array
message:
description: Message is a short human-readable status ("model loaded", or
an error).
type: string
type: object
schema.ModelsDataResponse:
properties:
data:
@@ -3510,6 +3529,38 @@ paths:
summary: Bidirectional realtime audio transform over WebSocket.
tags:
- audio
/backend/load:
post:
consumes:
- application/json
description: Loads the named model (or, for a realtime pipeline, all of its
sub-models) into memory so subsequent requests pay no cold-start cost. The
inverse of /backend/shutdown.
parameters:
- description: Model to load
in: body
name: request
required: true
schema:
$ref: '#/definitions/schema.ModelLoadRequest'
produces:
- application/json
responses:
"200":
description: Model loaded
schema:
$ref: '#/definitions/schema.ModelLoadResponse'
"400":
description: Missing model name
schema:
$ref: '#/definitions/schema.ModelLoadResponse'
"500":
description: Load failed (Loaded lists any sub-models that did load)
schema:
$ref: '#/definitions/schema.ModelLoadResponse'
summary: Pre-load a model into memory
tags:
- monitoring
/backend/monitor:
get:
parameters: