feat(realtime): eager blocking pipeline warm-up + /backend/load API (#10662)

Realtime sessions previously lazy-loaded each pipeline sub-model (VAD,
transcription, LLM, TTS) on first use, so every cold session paid a
per-request model-load stall and load errors only surfaced mid-stream.

Warm the whole pipeline eagerly at session start and block until it is
ready. This includes the voice-gate speaker-recognition model, because
an enforced gate blocks each utterance on it; compaction's summary_model
stays lazy since it only runs off the response path:
- Add backend.PreloadModel / PreloadModelByName as the single load path
  for every modality (no transcription special-case; backend-omitted
  configs are deprecated).
- The realtime session blocks on Model.Warmup and returns a
  model_load_error to the client if any stage fails to load;
  updateSession warms in the background. Opt out per pipeline with
  pipeline.disable_warmup, exposed as a UI toggle via the
  config-metadata registry.

Add a LocalAI-native POST /backend/load (and /v1/backend/load) that
pre-loads a model -- expanding realtime pipelines into their sub-models
-- as the inverse of /backend/shutdown. There is one preload engine
(backend.PreloadStages): the realtime Warmup methods, /backend/load and
the --load-to-memory startup flag all use it, so --load-to-memory now
also expands pipeline models and records load-failure traces. Pipeline
sub-model alias resolution is likewise shared
(ModelConfigLoader.LoadResolvedModelConfig). Surface the endpoint
everywhere an admin manages models:
- MCP admin tool load_model (httpapi + inproc clients, safety/catalog
  prompts, catalog/dispatch tests).
- "Load into memory" action in the React models UI.
- Swagger regenerated; docs moved to the general backend-monitor page
  since it is not realtime-specific.

Fix a Traces UI crash ("json: unsupported value: -Inf"): audio-snippet
RMS/peak now floor at a finite dBFS, and backend-trace data is sanitized
to drop non-finite floats before marshaling. Because the sanitizer runs
on every RecordBackendTrace, it is copy-on-write: containers are
re-allocated only along the paths that actually changed.
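
A minimal sketch of the two fixes above, written in Python rather than the project's Go. The names `to_dbfs`, `sanitize` and `FLOOR_DBFS`, the -120 dBFS floor, and the choice to drop offending entries entirely are illustrative assumptions, not taken from this commit:

```python
import math

FLOOR_DBFS = -120.0  # hypothetical floor; the commit does not state the value


def to_dbfs(amplitude: float) -> float:
    """Convert a linear amplitude (1.0 = full scale) to dBFS with a finite floor."""
    if amplitude <= 0.0:
        return FLOOR_DBFS  # silence would otherwise map to -inf
    return max(20.0 * math.log10(amplitude), FLOOR_DBFS)


def _is_bad(value) -> bool:
    """True for NaN and +/-Inf floats."""
    return isinstance(value, float) and not math.isfinite(value)


def sanitize(value):
    """Drop non-finite floats from nested dicts/lists.

    Copy-on-write: the input is never mutated, and the original object is
    returned unchanged (no allocation) when nothing needed dropping.
    A bare non-finite scalar at the top level is returned as-is.
    """
    if isinstance(value, dict):
        out = None  # allocated only once the first change is seen
        for key, item in value.items():
            if _is_bad(item):
                out = dict(value) if out is None else out
                del out[key]
                continue
            cleaned = sanitize(item)
            if cleaned is not item:
                out = dict(value) if out is None else out
                out[key] = cleaned
        return value if out is None else out
    if isinstance(value, list):
        out = None
        for index, item in enumerate(value):
            bad = _is_bad(item)
            cleaned = item if bad else sanitize(item)
            if out is None:
                if not bad and cleaned is item:
                    continue  # unchanged so far, nothing to copy
                out = value[:index]  # first change: copy the untouched prefix
            if not bad:
                out.append(cleaned)
        return value if out is None else out
    return value
```

Returning the original object for clean input keeps the common case free of allocations, which matters for a function called on every recorded trace.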

Migrate core/http/openresponses_test.go onto the prebuilt mock-backend
that the rest of the http suite already uses. It was the last spec still
pointing at a real HuggingFace model, so it 404'd wherever no vision
backend was built. Also fix its item_reference specs to send the spec's
"id" field instead of "item_id", which the handler never accepted.

Assisted-by: Claude:claude-opus-4-8 Claude Code

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Richard Palethorpe
2026-07-03 17:00:37 +01:00
committed by GitHub
parent 80ec22945a
commit eb32cd9073
45 changed files with 1364 additions and 99 deletions


@@ -381,6 +381,8 @@ curl -X POST http://localhost:8080/backend/shutdown \
To stop all models, you'll need to call the endpoint for each loaded model individually, or use the web UI to stop all models at once.
Conversely, you can pre-load a model into memory ahead of its first request with `POST /backend/load` (the inverse of shutdown) — see [Backend Monitor]({{%relref "features/backend-monitor" %}}).
### Best Practices
1. **Monitor VRAM usage**: Use `nvidia-smi` (for NVIDIA GPUs) or similar tools to monitor actual VRAM usage


@@ -166,7 +166,7 @@ When authentication is enabled, the following endpoints require admin role:
- `GET /api/backend-traces`, `POST /api/backend-traces/clear`
- `GET /api/backend-logs/*`, `POST /api/backend-logs/*/clear`
- `GET /api/resources`, `GET /api/settings`, `POST /api/settings`
- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`
- `GET /system`, `GET /backend/monitor`, `POST /backend/shutdown`, `POST /backend/load`
**P2P:**
- `GET /api/p2p/*`


@@ -5,7 +5,9 @@ weight = 20
url = "/features/backend-monitor/"
+++
LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, and `/backend/shutdown` allows stopping a model's backend process.
LocalAI provides endpoints to monitor and manage running backends. The `/backend/monitor` endpoint reports the status and resource usage of loaded models, `/backend/load` pre-loads a model into memory, and `/backend/shutdown` allows stopping a model's backend process.
All three are admin-only.
## Monitor API
@@ -62,6 +64,42 @@ curl "http://localhost:8080/backend/monitor?model=my-model"
}
```
## Load API
Pre-loads a model into memory ahead of its first request, so that request pays no cold-start load cost. It is the inverse of the Shutdown API and works for any model, not just realtime pipelines.
- **Method:** `POST`
- **Endpoints:** `/backend/load`, `/v1/backend/load`
### Request
| Parameter | Type | Required | Description |
|-----------|----------|----------|------------------------------|
| `model` | `string` | Yes | Name of the model to load |
### Behavior
- For a regular model, its own backend is loaded.
- For a **realtime pipeline** model (a config with a `pipeline:` block), every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) is loaded concurrently instead of the pipeline stub, which has no backend of its own.
The call blocks until loading finishes and reports which model names became resident, so partial failures are visible.
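
The expansion rule above can be sketched as follows. This is an illustration under assumptions, not LocalAI's implementation: the config is modeled as a plain dict, `models_to_load` is a hypothetical helper, and the `sound_detection` and `voice_recognition` keys are assumed to be spelled the same way as the stage names listed above.

```python
# Stage keys of a `pipeline:` block, in the order the docs list them.
PIPELINE_STAGES = (
    "vad",
    "transcription",
    "llm",
    "tts",
    "sound_detection",
    "voice_recognition",
)


def models_to_load(config: dict) -> list:
    """Return the model names a load request for `config` would target."""
    pipeline = config.get("pipeline")
    if not pipeline:
        # Regular model: load its own backend.
        return [config["name"]]
    # Pipeline model: the stub has no backend, so load each configured stage.
    return [pipeline[stage] for stage in PIPELINE_STAGES if pipeline.get(stage)]
```

Stages that are not configured are skipped, and non-stage keys such as `disable_warmup` are ignored.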
### Usage
```bash
curl -X POST http://localhost:8080/backend/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model"}'
```
### Example response
```json
{ "loaded": ["my-model"], "message": "model loaded" }
```
On failure the call returns `500` with `loaded` listing whichever sub-models did load and `message` naming the failures.
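
A client can use the `loaded` and `message` fields to tell a full load from a partial one. The sketch below uses only the Python standard library. The helper names are hypothetical, authentication headers are left out, and it assumes that any 2xx status means success and that an error body may not be JSON at all:

```python
import json
import urllib.error
import urllib.request


def parse_load_response(status: int, body: bytes):
    """Return (ok, loaded, message) for a /backend/load response."""
    try:
        data = json.loads(body)
    except ValueError:
        # Non-JSON body (e.g. a plain-text error): keep the text as the message.
        data = {"message": body.decode(errors="replace")}
    if not isinstance(data, dict):
        data = {"message": str(data)}
    ok = 200 <= status < 300  # assumption: any 2xx means every model loaded
    return ok, list(data.get("loaded") or []), str(data.get("message", ""))


def load_model(name: str, base_url: str = "http://localhost:8080"):
    """POST the model name to /backend/load and block until the server answers."""
    request = urllib.request.Request(
        f"{base_url}/backend/load",
        data=json.dumps({"model": name}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return parse_load_response(response.status, response.read())
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the body still carries `loaded` and `message`.
        return parse_load_response(err.code, err.read())
```

On a `500`, `loaded` still names the sub-models that became resident, so a caller can retry only the stages that `message` reports as failed.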
## Shutdown API
- **Method:** `POST`


@@ -56,6 +56,39 @@ pipeline:
All streaming flags are off by default, so existing pipelines are unaffected.
### Model warm-up (cold start)
Without warm-up the pipeline's models are loaded into memory only on first use *within* a session: the VAD on the first audio chunk, transcription at the first end-of-speech, the LLM on the first reply, and TTS on the first spoken output. On a cold session this staggers a load delay across those first few interactions — and a model that fails to load (missing weights, wrong backend, out of memory) only fails part-way through the first turn.
To avoid that, LocalAI **warms the pipeline by default**: it loads the VAD, transcription, LLM and TTS backends into memory *before* the session is announced, and the session start **blocks until they are all ready**. The loads run concurrently, so the wait is the slowest single model, not the sum. This means:
- The first turn pays no cold-start cost — every backend is already resident.
- **Model-load errors surface at session start.** If any stage fails to load, the session is not started and the client receives a `model_load_error` instead of `session.created`, so a broken pipeline fails fast and visibly rather than mid-call.
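
The blocking, concurrent warm-up described above can be sketched with a thread pool. This is not LocalAI's Go implementation: `warm_pipeline` and `fake_load` are hypothetical, and the shape of the returned events is an assumption (only the names `session.created` and `model_load_error` come from the text).

```python
from concurrent.futures import ThreadPoolExecutor


def warm_pipeline(stages: dict, load) -> dict:
    """Load every stage concurrently and block until all loads have finished.

    `stages` maps a stage name to a model name; `load` is a callable that
    loads one model and raises on failure.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max(len(stages), 1)) as pool:
        futures = {stage: pool.submit(load, model) for stage, model in stages.items()}
        for stage, future in futures.items():
            error = future.exception()  # blocks until this stage is done
            if error is not None:
                failures[stage] = str(error)
    if failures:
        # Hypothetical event shape: the session is never announced.
        return {"type": "model_load_error", "failures": failures}
    return {"type": "session.created"}


def fake_load(model: str) -> None:
    """Illustrative stand-in for a real backend load."""
    if model == "missing-tts":
        raise RuntimeError("weights not found")
```

Because all loads are submitted before any result is awaited, the total wait tracks the slowest stage, and one failing stage does not stop the others from being reported.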
Set `disable_warmup: true` to restore the lazy "load on first use" behavior: session start no longer waits on loading, and load errors surface on the first turn instead. This is useful if you want idle sessions to avoid holding model memory they may never use:
```yaml
name: gpt-realtime
pipeline:
  vad: silero-vad-ggml
  transcription: whisper-large-turbo
  llm: qwen3-4b
  tts: tts-1
  disable_warmup: true  # lazily load each model on first use instead of at session start
```
#### Pre-loading a pipeline on demand
Warm-up only fires when a realtime session opens. To load a pipeline into memory ahead of time — e.g. to warm it right after boot, or when running with `disable_warmup: true` — POST the model name to the admin-only `/backend/load` endpoint. For a pipeline model it loads every configured sub-model (VAD, transcription, LLM, TTS, sound_detection, voice_recognition) concurrently:
```bash
curl -X POST http://localhost:8080/backend/load \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-realtime"}'
```
The endpoint is not realtime-specific — it pre-loads any model. See [Backend Monitor]({{%relref "features/backend-monitor" %}}) for the full request/response reference (it is the inverse of `/backend/shutdown`).
### Turn detection
Turn detection decides when the user has finished speaking and the pipeline should respond. Two modes are supported, matching the OpenAI session schema: