feat(realtime): eager blocking pipeline warm-up + /backend/load API (#10662)

Realtime sessions previously lazy-loaded each pipeline sub-model (VAD, transcription, LLM, TTS) on first use, so every cold session paid a per-request model-load stall and load errors only surfaced mid-stream.

Warm the whole pipeline eagerly and blockingly at session start (including the voice-gate speaker-recognition model, which an enforced gate blocks each utterance on; compaction's summary_model stays lazy since it only runs off the response path):

- Add backend.PreloadModel / PreloadModelByName as the single load path for every modality (no transcription special-case; backend-omitted configs are deprecated).
- The realtime session blocks on Model.Warmup and returns a model_load_error to the client if any stage fails to load; updateSession warms in the background. Opt out per pipeline with pipeline.disable_warmup, exposed as a UI toggle via the config-metadata registry.

Add a LocalAI-native POST /backend/load (and /v1/backend/load) that pre-loads a model -- expanding realtime pipelines into their sub-models -- as the inverse of /backend/shutdown. There is one preload engine (backend.PreloadStages): the realtime Warmup methods, /backend/load and the --load-to-memory startup flag all use it, so --load-to-memory now also expands pipeline models and records load-failure traces. Pipeline sub-model alias resolution is likewise shared (ModelConfigLoader.LoadResolvedModelConfig).

Surface the endpoint everywhere an admin manages models:

- MCP admin tool load_model (httpapi + inproc clients, safety/catalog prompts, catalog/dispatch tests).
- "Load into memory" action in the React models UI.
- Swagger regenerated; docs moved to the general backend-monitor page since it is not realtime-specific.

Fix a Traces UI crash ("json: unsupported value: -Inf"): audio-snippet RMS/peak now floor at a finite dBFS, and backend-trace data is sanitized to drop non-finite floats before marshaling. The sanitizer is copy-on-write -- it runs on every RecordBackendTrace, so containers are only re-allocated on the paths that actually changed.

Migrate core/http/openresponses_test.go onto the prebuilt mock-backend the rest of the http suite already uses -- it was the last spec still pointing at a real HuggingFace model, so it 404'd wherever no vision backend was built -- and fix its item_reference specs to send the spec's "id" field instead of "item_id", which the handler never accepted.

Assisted-by: Claude:claude-opus-4-8 Claude Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-07-03 21:07:33 -04:00 · 2026-07-03 17:00:37 +01:00
parent 80ec22945a
commit eb32cd9073
45 changed files with 1364 additions and 99 deletions
--- a/docs/content/advanced/vram-management.md
+++ b/docs/content/advanced/vram-management.md
@@ -381,6 +381,8 @@ curl -X POST http://localhost:8080/backend/shutdown \

 To stop all models, you'll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.

+Conversely, you can pre-load a model into memory ahead of its first request with `POST /backend/load` (the inverse of shutdown) — see [Backend Monitor]({{%relref "features/backend-monitor" %}}).
+
 ### Best Practices

 1. **Monitor VRAM usage**: Use `nvidia-smi` (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage
--- a/docs/content/features/authentication.md
+++ b/docs/content/features/authentication.md
@@ -166,7 +166,7 @@ When authentication is enabled, the following endpoints require admin role:
 - `GET /api/backend-traces`, `POST /api/backend-traces/clear`
 - `GET /api/backend-logs/*`, `POST /api/backend-logs/*/clear`
 - `GET /api/resources`, `GET /api/settings`, `POST /api/settings`
- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`
+- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`, `POST /backend/load`

 **P2P:**
 - `GET /api/p2p/*`
--- a/docs/content/features/backend-monitor.md
+++ b/docs/content/features/backend-monitor.md
@@ -5,7 +5,9 @@ weight = 20
 url = "/features/backend-monitor/"
 +++

-LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, and `/backend/shutdown` allows stopping a model's backend process.
+LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, `/backend/load` pre-loads a model into memory, and `/backend/shutdown` allows stopping a model's backend process.
+
+All three are admin-only.

 ## Monitor API

@@ -62,6 +64,42 @@ curl "http://localhost:8080/backend/monitor?model=my-model"
 }
 ```

+## Load API
+
+Pre-loads a model into memory ahead of its first request, so that request pays no cold-start load cost. It is the inverse of the Shutdown API and works for any model, not just realtime pipelines.
+
+- **Method:** `POST`
+- **Endpoints:** `/backend/load`, `/v1/backend/load`
+
+### Request
+
+| Parameter | Type     | Required | Description                  |
+|-----------|----------|----------|------------------------------|
+| `model`   | `string` | Yes      | Name of the model to load    |
+
+### Behavior
+
+- For a regular model, its own backend is loaded.
+- For a **realtime pipeline** model (a config with a `pipeline:` block), every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded concurrently instead of the pipeline stub, which has no backend of its own.
+
+The call blocks until loading finishes and reports which model names became resident, so partial failures are visible.
+
+### Usage
+
+```bash
+curl -X POST http://localhost:8080/backend/load \
+  -H "Content-Type: application/json" \
+  -d '{"model": "my-model"}'
+```
+
+### Example response
+
+```json
+{ "loaded": ["my-model"], "message": "model loaded" }
+```
+
+On failure the call returns `500` with `loaded` listing whichever sub-models did load and `message` naming the failures.
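
As an illustration of the Load API documented in this hunk (not part of the patch), a minimal Go client might look like the sketch below. The request and response field names (`model`, `loaded`, `message`) come from the docs above; `loadResponse`, `loadModel`, and `newStubServer` are hypothetical names, and the stub server only stands in for LocalAI so the example runs offline. Authentication is omitted; when it is enabled the endpoint requires an admin credential.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// loadResponse mirrors the documented body: {"loaded": [...], "message": "..."}.
type loadResponse struct {
	Loaded  []string `json:"loaded"`
	Message string   `json:"message"`
}

// loadModel POSTs {"model": name} to baseURL + "/backend/load".
// On a non-200 status it returns an error together with the decoded body, so
// the caller can still see which sub-models did load.
func loadModel(client *http.Client, baseURL, name string) (loadResponse, error) {
	body, err := json.Marshal(map[string]string{"model": name})
	if err != nil {
		return loadResponse{}, err
	}
	resp, err := client.Post(baseURL+"/backend/load", "application/json", bytes.NewReader(body))
	if err != nil {
		return loadResponse{}, err
	}
	defer resp.Body.Close()

	var out loadResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return loadResponse{}, fmt.Errorf("decode response (status %d): %w", resp.StatusCode, err)
	}
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("load %q: status %d: %s", name, resp.StatusCode, out.Message)
	}
	return out, nil
}

// newStubServer is a stand-in for LocalAI used only to exercise the client.
// It fails the hypothetical model name "broken" and accepts everything else.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Model string `json:"model"`
		}
		if r.Method != http.MethodPost || r.URL.Path != "/backend/load" ||
			json.NewDecoder(r.Body).Decode(&req) != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		if req.Model == "broken" {
			w.WriteHeader(http.StatusInternalServerError)
			json.NewEncoder(w).Encode(loadResponse{Loaded: []string{}, Message: "failed to load broken"})
			return
		}
		json.NewEncoder(w).Encode(loadResponse{Loaded: []string{req.Model}, Message: "model loaded"})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	res, err := loadModel(srv.Client(), srv.URL, "my-model")
	fmt.Println(res.Loaded, res.Message, err)
}
```

Against a real instance, replace `srv.URL` with the server address (for example `http://localhost:8080`) and pass `http.DefaultClient` or a client with a timeout suited to large models, since the call blocks until loading finishes.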
+
 ## Shutdown API

 - **Method:** `POST`
--- a/docs/content/features/openai-realtime.md
+++ b/docs/content/features/openai-realtime.md
@@ -56,6 +56,39 @@ pipeline:

 All streaming flags are off by default, so existing pipelines are unaffected.

+### Model warm-up (cold start)
+
+Without warm-up the pipeline's models are loaded into memory only on first use *within* a session: the VAD on the first audio chunk, transcription at the first end-of-speech, the LLM on the first reply, and TTS on the first spoken output. On a cold session this staggers a load delay across those first few interactions — and a model that fails to load (missing weights, wrong backend, out of memory) only fails part-way through the first turn.
+
+To avoid that, LocalAI **warms the pipeline by default**: it loads the VAD, transcription, LLM and TTS backends into memory *before* the session is announced, and the session start **blocks until they are all ready**. The loads run concurrently, so the wait is the slowest single model, not the sum. This means:
+
+- The first turn pays no cold-start cost — every backend is already resident.
+- **Model-load errors surface at session start.** If any stage fails to load, the session is not started and the client receives a `model_load_error` instead of `session.created`, so a broken pipeline fails fast and visibly rather than mid-call.
+
+Set `disable_warmup: true` to restore the lazy "load on first use" behavior — session start no longer waits on loading and load errors surface on the first turn instead. Useful if you want idle sessions to avoid holding model memory they may never use:
+
+```yaml
+name: gpt-realtime
+pipeline:
+  vad: silero-vad-ggml
+  transcription: whisper-large-turbo
+  llm: qwen3-4b
+  tts: tts-1
+  disable_warmup: true   # lazily load each model on first use instead of at session start
+```
+
+#### Pre-loading a pipeline on demand
+
+Warm-up only fires when a realtime session opens. To load a pipeline into memory ahead of time — e.g. to warm it right after boot, or when running with `disable_warmup: true` — POST the model name to the admin-only `/backend/load` endpoint. For a pipeline model it loads every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) concurrently:
+
+```bash
+curl -X POST http://localhost:8080/backend/load \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt-realtime"}'
+```
+
+The endpoint is not realtime-specific — it pre-loads any model. See [Backend Monitor]({{%relref "features/backend-monitor" %}}) for the full request/response reference (it is the inverse of `/backend/shutdown`).
+
 ### Turn detection

 Turn detection decides when the user has finished speaking and the pipeline should respond. Two modes are supported, matching the OpenAI session schema: