* feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase
Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model:
single engine handles VAD, transcription, LLM, and TTS in one bidirectional
stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline.
Backend
- backend/python/liquid-audio/ — new Python gRPC backend wrapping the
`liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets,
Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/
Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch
on `liquid_audio.utils.snapshot_download` so absolute local paths from
LocalAI's gallery resolve without an HF round-trip. Uses soundfile in place of
torchaudio.load/save (torchcodec drags in NVIDIA NPP, which we don't bundle).
- backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,
interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream
(config/frame/control oneof in; typed event+pcm+meta out; shape sketched
after this group).
- core/services/nodes/{health_mock,inflight}_test.go — add stubs for the
new RPC to the test fakes.
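For orientation, a minimal sketch of the message shapes the new stream carries;
the Go type and field names below are assumptions for illustration, not the
repo's generated proto types:

// Inbound: exactly one field set per message (the proto oneof).
type StreamConfig struct{ Model, Mode, Voice string } // illustrative
type StreamControl struct{ Commit, Cancel bool }      // illustrative

type AudioToAudioRequest struct {
    Config  *StreamConfig  // first message: session setup
    Frame   []byte         // subsequent messages: PCM audio
    Control *StreamControl // commit / cancel markers
}

// Outbound: typed event plus audio and metadata.
type AudioToAudioResponse struct {
    Event string            // e.g. transcript delta or tool-call notice
    PCM   []byte            // synthesized speech
    Meta  map[string]string // side-channel metadata
}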
Config + capabilities
- core/config/backend_capabilities.go — UsecaseRealtimeAudio,
MethodAudioToAudioStream, UsecaseInfoMap entry, liquid-audio
BackendCapability row.
- core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups
membership in both speech-input and audio-output groups so a lone flag
still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases
branch.
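For context, the bitmask pattern the new flag slots into, as a minimal sketch
assuming the conventional 1 << iota layout; the neighbouring flags shown,
their values, and the Has helper are illustrative rather than the actual
model_config.go contents:

type ModelConfigUsecase int

const (
    FLAG_CHAT           ModelConfigUsecase = 1 << iota // 1
    FLAG_TTS                                           // 2
    FLAG_TRANSCRIPT                                    // 4
    FLAG_REALTIME_AUDIO                                // 8, the new flag
)

// Has reports whether the bitmask contains the given usecase flag.
func (u ModelConfigUsecase) Has(f ModelConfigUsecase) bool { return u&f != 0 }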
Realtime endpoint
- core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig()
so the gate is unit-testable; accept realtime_audio models and self-fill
empty pipeline slots with the model's own name (user-pinned slots win;
sketched after this group).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil
cfg, empty pipeline, legacy pipeline, self-contained realtime_audio,
user-pinned VAD slot, and partial legacy pipeline.
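And the self-fill rule in sketch form, assuming one string field per pipeline
stage; the struct and helper names are invented, only the behaviour is from
the commit:

type Pipeline struct {
    VAD, STT, LLM, TTS string
}

// fillSelfContained points every empty slot at the realtime_audio model's
// own name; slots the user pinned explicitly keep their value.
func fillSelfContained(p *Pipeline, model string) {
    for _, slot := range []*string{&p.VAD, &p.STT, &p.LLM, &p.TTS} {
        if *slot == "" {
            *slot = model
        }
    }
}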
UI + endpoints
- core/http/routes/ui.go — /api/pipeline-models accepts either a legacy
VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a
self_contained flag so the Talk page can collapse the four cards.
- core/http/routes/ui_api.go — realtime_audio in usecaseFilters.
- core/http/routes/ui_pipeline_models_test.go — covers both code paths.
- core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of
the four-slot grid; rename Edit Pipeline → Edit Model Config; less
pipeline-specific wording.
- core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new
realtime_audio filter button + i18n.
- core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO.
- core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset
fields, surfaced when backend === liquid-audio, plumbed via
extra_options on submit/export/import.
Gallery + importer
- gallery/liquid-audio.yaml — config template with known_usecases:
[realtime_audio, chat, tts, transcript, vad].
- gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by
mode option. Fixed pre-existing `transcribe` typo on the asr entry
(loader silently dropped the unknown string → entry never surfaced as a
transcript model).
- gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format
`<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching
common_chat_params_init_lfm2 in vendored llama.cpp. An example exchange
follows this group.
- core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector
matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice
preferences plumbed through to options.
- core/gallery/importers/importers.go — register LiquidAudioImporter
before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument
regex pair on the LFM2 pythonic format.
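To make the tool-call format concrete, an example exchange; the tool name,
its signature rendering, and the arguments are invented, only the sentinel
tokens come from the template. The first line is the system-turn tool list,
the second the model's Pythonic call:

<|tool_list_start|>[get_weather(city: str)]<|tool_list_end|>
<|tool_call_start|>[get_weather(city="Paris")]<|tool_call_end|>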
Build matrix
- .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13,
l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12
is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's
3.12 floor).
- backend/index.yaml — anchor + 13 image entries.
- Makefile — .NOTPARALLEL, prepare-test-extra, test-extra,
docker-build-liquid-audio.
Docs
- .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real
any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call
detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands)
remain.
- .agents/api-endpoints-and-auth.md — expand the capability-surface
checklist with every place a new FLAG_* needs to be registered.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): function calling + history cap for any-to-any models
Three pieces, all on the realtime_audio path that just landed:
1. liquid-audio backend (backend/python/liquid-audio/backend.py):
- _build_chat_state grows a `tools_prelude` arg.
- new _render_tools_prelude parses request.Tools (the OpenAI Chat
Completions function array realtime.go already serialises) and
emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn
ahead of the user history. Mirrors gallery/lfm.yaml's `function:`
template so the model sees the same prompt shape whether served
via llama-cpp or here. Without this the backend silently dropped
tools — function calling was wired end-to-end on the Go side but
the model never saw a tool list.
2. Realtime history cap (core/http/endpoints/openai/realtime.go):
- Session grows MaxHistoryItems int; default picked by new
defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5
1.5B degrades quickly past a handful of turns), 0/unlimited for
legacy pipelines composing larger LLMs.
- triggerResponse runs conv.Items through trimRealtimeItems before
building conversationHistory. Helper walks the cut left if it
would orphan a function_call_output, so tool result + call pairs
stay intact (sketched after this list).
- realtime_gate_test.go: specs for defaultMaxHistoryItems and
trimRealtimeItems (zero cap, under cap, over cap, tool-call pair
preservation).
3. Talk page (core/http/react-ui/src/pages/Talk.jsx):
- Reuses the chat page's MCP plumbing — useMCPClient hook,
ClientMCPDropdown component, same auto-connect/disconnect effect
pattern. No bespoke tool registry, no new REST endpoints; tools
come from whichever MCP servers the user toggles on, exactly as
on the chat page.
- sendSessionUpdate now passes session.tools=getToolsForLLM(); the
update re-fires when the active server set changes mid-session.
- New response.function_call_arguments.done handler executes via
the hook's executeTool (which round-trips through the MCP client
SDK), then replies with conversation.item.create
{type:function_call_output} + response.create so the model
completes its turn with the tool output. Mirrors chat's
client-side agentic loop, translated to the realtime wire shape.
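The trim rule from item 2 in sketch form, assuming conversation items carry a
Type string; only the walk-left behaviour is taken from the commit:

type Item struct{ Type string }

// trimRealtimeItems keeps the last max items; max <= 0 means unlimited.
// If the window would open on a function_call_output (orphaning it from
// its function_call), walk the cut left until the pair stays intact.
func trimRealtimeItems(items []Item, max int) []Item {
    if max <= 0 || len(items) <= max {
        return items
    }
    cut := len(items) - max
    for cut > 0 && items[cut].Type == "function_call_output" {
        cut--
    }
    return items[cut:]
}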
UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes
react-ui/dist into the runtime image). Backend.py changes can be
swapped live in /backends/<id>/backend.py + /backend/shutdown.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page
Mirrors the chat-page metadata.localai_assistant flow so users can ask the
realtime model what's loaded / installed / configured. Tools are run
server-side via the same in-process MCP holder that powers the chat
modality — no transport switch, no proxy, no new wire protocol.
Wire:
- core/http/endpoints/openai/realtime.go:
- RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin
helper mirrors chat.go's requireAssistantAccess (no-op when auth
disabled, else requires auth.RoleAdmin).
- Session grows AssistantExecutor mcpTools.ToolExecutor.
- runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin,
fail closed if DisableLocalAIAssistant or the holder has no tools,
DiscoverTools and inject into session.Tools, prepend
holder.SystemPrompt() to instructions.
- Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run
ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip
the function_call_arguments client emit (the client can't execute
these — it doesn't know about them). After the loop, if any
assistant tool ran, trigger another response so the model speaks the
result. Mirrors chat's agentic loop, driven server-side rather than
via client round-trip (sketched after this list).
- core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest
gains `localai_assistant` (JSON omitempty). Handshake calls
isCurrentUserAdmin and builds RealtimeSessionOptions.
- core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode"
checkbox under the Tools dropdown; passes localai_assistant: true to
realtimeApi.call's body, captured in the connect callback's deps.
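The dispatch split in sketch form; IsTool/ExecuteTool and the
skip-the-client-emit rule are from the bullets above, everything else is
illustrative glue:

import "context"

type toolExecutor interface {
    IsTool(name string) bool
    ExecuteTool(ctx context.Context, name, argsJSON string) (string, error)
}

// dispatchToolCall routes one model tool call: Manage Mode tools run
// in-process and never reach the client; anything else is emitted to the
// client for the existing MCP round-trip.
func dispatchToolCall(ctx context.Context, exec toolExecutor, name, args string,
    appendOutput func(string), emitToClient func()) (ranInproc bool, err error) {
    if exec != nil && exec.IsTool(name) {
        out, err := exec.ExecuteTool(ctx, name, args)
        if err != nil {
            return true, err
        }
        appendOutput(out) // lands in conv.Items as a FunctionCallOutput
        return true, nil  // skip the client emit; the client can't run these
    }
    emitToClient()
    return false, nil
}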
Mirroring chat's pattern means the in-process MCP tools surface "just
works" for the Talk page without exposing a Streamable-HTTP MCP endpoint
(which was the alternative). Clients with their own MCP servers can
still use the existing ClientMCPDropdown path in parallel; the realtime
handler distinguishes them by AssistantExecutor.IsTool() at dispatch
time.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): render Manage Mode tool calls in the Talk transcript
Previously the realtime endpoint only emitted response.output_item.added
for the FunctionCall item, and Talk.jsx's switch ignored the event — so
server-side tool runs were invisible in the UI. The model would speak
the result but the user had no way to see what tool was actually
called.
realtime.go: after executing an assistant tool inproc, emit a second
output_item.added/.done pair for the FunctionCallOutput item. Mirrors
the way the chat page displays tool_call + tool_result blocks.
Talk.jsx: handle both response.output_item.added and .done. Render
FunctionCall (with arguments) and FunctionCallOutput (pretty-printed
JSON when possible) as two transcript entries — `tool_call` with the
wrench icon, `tool_result` with the clipboard icon, both in monospace
secondary-colour text. Resets streamingRef after the result so the next
assistant text delta starts a fresh transcript entry instead of
appending to the previous turn.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools
Fallout from a review pass on the Manage Mode patches:
- Bound the server-side agentic loop. triggerResponse used to recurse on
executedAssistantTool with no cap — a model that kept calling tools
would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors
useChat.js's maxToolTurns). Public triggerResponse is now a thin shim
over triggerResponseAtTurn(toolTurn int); recursion increments the
counter and stops at the cap with an xlog.Warn (sketched after this list).
- Preserve Manage Mode tools across client session.update. The handler
used to blindly overwrite session.Tools, so toggling a client MCP
server mid-session silently wiped the in-process admin tools. Session
now caches the original AssistantTools slice at session creation and
the session.update handler merges them back in (client names win on
collision — the client is explicit).
- strconv.ParseBool for the localai_assistant query param instead of
hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata.
- Talk.jsx: render both tool_call and tool_result on
response.output_item.done instead of splitting them across .added and
.done. The server's event pairing (added → done) stays correct; the
UI just doesn't need to inspect both phases of the same item. One
switch case instead of two, no behavioural change.
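The loop bound in sketch form; Session, runModelTurn, and the
standard-library logging are stand-ins (the real code logs via xlog.Warn):

import (
    "context"
    "log"
)

const maxAssistantToolTurns = 10 // mirrors useChat.js's maxToolTurns

type Session struct{}

// runModelTurn stands in for one response generation; it reports whether
// an assistant tool executed during the turn.
func (s *Session) runModelTurn(ctx context.Context) bool { return false }

// triggerResponse is now a thin shim over the counted variant.
func (s *Session) triggerResponse(ctx context.Context) {
    s.triggerResponseAtTurn(ctx, 0)
}

func (s *Session) triggerResponseAtTurn(ctx context.Context, toolTurn int) {
    if toolTurn >= maxAssistantToolTurns {
        log.Printf("assistant tool loop stopped at turn cap %d", toolTurn)
        return
    }
    if s.runModelTurn(ctx) {
        s.triggerResponseAtTurn(ctx, toolTurn+1)
    }
}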
Out of scope (noted for follow-ups): extract a shared assistant-tools
helper between chat.go and realtime.go (duplication is small enough
that two parallel implementations stay readable for now), and an i18n
key for the Manage Mode helper text (Talk.jsx doesn't use i18n
anywhere else yet).
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(test-extra): wire liquid-audio backend smoke test
The backend ships test.py + a `make test` target and is listed in
backend-matrix.yml, so scripts/changed-backends.js already writes a
`liquid-audio=true|false` output when files under backend/python/liquid-audio/
change. The workflow just wasn't reading it.
- Expose the `liquid-audio` output on the detect-changes job
- Add a tests-liquid-audio job that runs `make` + `make test` in
backend/python/liquid-audio, gated on the per-backend detect flag
The smoke test covers Health() and LoadModel(mode:finetune); fine-tune mode
short-circuits before any HuggingFace download (backend.py:192), so the
job needs neither weights nor a GPU. The full-inference path remains
gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set.
The four new Go test files (core/gallery/importers/liquid-audio_test.go,
core/http/endpoints/openai/realtime_gate_test.go,
core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go)
are already picked up by the existing test.yml workflow via `make test` →
`ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
core/config/backend_capabilities.go (497 lines, 19 KiB, Go)
package config

import (
    "slices"
    "strings"
)

// Usecase name constants — the canonical string values used in gallery entries,
// model configs (known_usecases), and UsecaseInfoMap keys.
const (
    UsecaseChat            = "chat"
    UsecaseCompletion      = "completion"
    UsecaseEdit            = "edit"
    UsecaseVision          = "vision"
    UsecaseEmbeddings      = "embeddings"
    UsecaseTokenize        = "tokenize"
    UsecaseImage           = "image"
    UsecaseVideo           = "video"
    UsecaseTranscript      = "transcript"
    UsecaseTTS             = "tts"
    UsecaseSoundGeneration = "sound_generation"
    UsecaseRerank          = "rerank"
    UsecaseDetection       = "detection"
    UsecaseVAD             = "vad"
    UsecaseAudioTransform  = "audio_transform"
    UsecaseDiarization     = "diarization"
    UsecaseRealtimeAudio   = "realtime_audio"
)

// GRPCMethod identifies a Backend service RPC from backend.proto.
type GRPCMethod string

const (
    MethodPredict            GRPCMethod = "Predict"
    MethodPredictStream      GRPCMethod = "PredictStream"
    MethodEmbedding          GRPCMethod = "Embedding"
    MethodGenerateImage      GRPCMethod = "GenerateImage"
    MethodGenerateVideo      GRPCMethod = "GenerateVideo"
    MethodAudioTranscription GRPCMethod = "AudioTranscription"
    MethodTTS                GRPCMethod = "TTS"
    MethodTTSStream          GRPCMethod = "TTSStream"
    MethodSoundGeneration    GRPCMethod = "SoundGeneration"
    MethodTokenizeString     GRPCMethod = "TokenizeString"
    MethodDetect             GRPCMethod = "Detect"
    MethodRerank             GRPCMethod = "Rerank"
    MethodVAD                GRPCMethod = "VAD"
    MethodAudioTransform     GRPCMethod = "AudioTransform"
    MethodDiarize            GRPCMethod = "Diarize"
    MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
)

// UsecaseInfo describes a single known_usecase value and how it maps
// to the gRPC backend API.
type UsecaseInfo struct {
    // Flag is the ModelConfigUsecase bitmask value.
    Flag ModelConfigUsecase
    // GRPCMethod is the primary Backend service RPC this usecase maps to.
    GRPCMethod GRPCMethod
    // IsModifier is true when this usecase doesn't map to its own gRPC RPC
    // but modifies how another RPC behaves (e.g., vision uses Predict with images).
    IsModifier bool
    // DependsOn names the usecase(s) this modifier requires (e.g., "chat").
    DependsOn string
    // Description is a human/LLM-readable explanation of what this usecase means.
    Description string
}

// UsecaseInfoMap maps each known_usecase string to its gRPC and semantic info.
var UsecaseInfoMap = map[string]UsecaseInfo{
    UsecaseChat: {
        Flag:        FLAG_CHAT,
        GRPCMethod:  MethodPredict,
        Description: "Conversational/instruction-following via the Predict RPC with chat templates.",
    },
    UsecaseCompletion: {
        Flag:        FLAG_COMPLETION,
        GRPCMethod:  MethodPredict,
        Description: "Text completion via the Predict RPC with a completion template.",
    },
    UsecaseEdit: {
        Flag:        FLAG_EDIT,
        GRPCMethod:  MethodPredict,
        Description: "Text editing via the Predict RPC with an edit template.",
    },
    UsecaseVision: {
        Flag:        FLAG_VISION,
        GRPCMethod:  MethodPredict,
        IsModifier:  true,
        DependsOn:   UsecaseChat,
        Description: "The model accepts images alongside text in the Predict RPC. For llama-cpp this requires an mmproj file.",
    },
    UsecaseEmbeddings: {
        Flag:        FLAG_EMBEDDINGS,
        GRPCMethod:  MethodEmbedding,
        Description: "Vector embedding generation via the Embedding RPC.",
    },
    UsecaseTokenize: {
        Flag:        FLAG_TOKENIZE,
        GRPCMethod:  MethodTokenizeString,
        Description: "Tokenization via the TokenizeString RPC without running inference.",
    },
    UsecaseImage: {
        Flag:        FLAG_IMAGE,
        GRPCMethod:  MethodGenerateImage,
        Description: "Image generation via the GenerateImage RPC (Stable Diffusion, Flux, etc.).",
    },
    UsecaseVideo: {
        Flag:        FLAG_VIDEO,
        GRPCMethod:  MethodGenerateVideo,
        Description: "Video generation via the GenerateVideo RPC.",
    },
    UsecaseTranscript: {
        Flag:        FLAG_TRANSCRIPT,
        GRPCMethod:  MethodAudioTranscription,
        Description: "Speech-to-text via the AudioTranscription RPC.",
    },
    UsecaseTTS: {
        Flag:        FLAG_TTS,
        GRPCMethod:  MethodTTS,
        Description: "Text-to-speech via the TTS RPC.",
    },
    UsecaseSoundGeneration: {
        Flag:        FLAG_SOUND_GENERATION,
        GRPCMethod:  MethodSoundGeneration,
        Description: "Music/sound generation via the SoundGeneration RPC (not speech).",
    },
    UsecaseRerank: {
        Flag:        FLAG_RERANK,
        GRPCMethod:  MethodRerank,
        Description: "Document reranking via the Rerank RPC.",
    },
    UsecaseDetection: {
        Flag:        FLAG_DETECTION,
        GRPCMethod:  MethodDetect,
        Description: "Object detection via the Detect RPC with bounding boxes.",
    },
    UsecaseVAD: {
        Flag:        FLAG_VAD,
        GRPCMethod:  MethodVAD,
        Description: "Voice activity detection via the VAD RPC.",
    },
    UsecaseAudioTransform: {
        Flag:        FLAG_AUDIO_TRANSFORM,
        GRPCMethod:  MethodAudioTransform,
        Description: "Audio-in / audio-out transformations (echo cancellation, noise suppression, dereverberation, voice conversion) via the AudioTransform RPC.",
    },
    UsecaseDiarization: {
        Flag:        FLAG_DIARIZATION,
        GRPCMethod:  MethodDiarize,
        Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
    },
    UsecaseRealtimeAudio: {
        Flag:        FLAG_REALTIME_AUDIO,
        GRPCMethod:  MethodAudioToAudioStream,
        Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
    },
}

// BackendCapability describes which gRPC methods and usecases a backend supports.
// Derived from reviewing actual implementations in backend/go/ and backend/python/.
type BackendCapability struct {
    // GRPCMethods lists the Backend service RPCs this backend implements.
    GRPCMethods []GRPCMethod
    // PossibleUsecases lists all usecase strings this backend can support.
    PossibleUsecases []string
    // DefaultUsecases lists the conservative safe defaults.
    DefaultUsecases []string
    // AcceptsImages indicates multimodal image input in Predict.
    AcceptsImages bool
    // AcceptsVideos indicates multimodal video input in Predict.
    AcceptsVideos bool
    // AcceptsAudios indicates multimodal audio input in Predict.
    AcceptsAudios bool
    // Description is a human-readable summary of the backend.
    Description string
}

// BackendCapabilities maps each backend name (as used in model configs and gallery
// entries) to its verified capabilities. This is the single source of truth for
// what each backend supports.
//
// Backend names use hyphens (e.g., "llama-cpp") matching the gallery convention.
// Use NormalizeBackendName() for names with dots (e.g., "llama.cpp").
var BackendCapabilities = map[string]BackendCapability{
    // --- LLM / text generation backends ---
    "llama-cpp": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTokenizeString},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEdit, UsecaseEmbeddings, UsecaseTokenize, UsecaseVision},
        DefaultUsecases:  []string{UsecaseChat},
        AcceptsImages:    true, // requires mmproj
        Description:      "llama.cpp GGUF models — LLM inference with optional vision via mmproj",
    },
    "vllm": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
        DefaultUsecases:  []string{UsecaseChat},
        AcceptsImages:    true,
        AcceptsVideos:    true,
        Description:      "vLLM engine — high-throughput LLM serving with optional multimodal",
    },
    "vllm-omni": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodGenerateImage, MethodGenerateVideo, MethodTTS},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseImage, UsecaseVideo, UsecaseTTS, UsecaseVision},
        DefaultUsecases:  []string{UsecaseChat},
        AcceptsImages:    true,
        AcceptsVideos:    true,
        AcceptsAudios:    true,
        Description:      "vLLM omni-modal — supports text, image, video generation and TTS",
    },
    "transformers": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTTS, MethodSoundGeneration},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseTTS, UsecaseSoundGeneration},
        DefaultUsecases:  []string{UsecaseChat},
        Description:      "HuggingFace transformers — general-purpose Python inference",
    },
    "mlx": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
        DefaultUsecases:  []string{UsecaseChat},
        Description:      "Apple MLX framework — optimized for Apple Silicon",
    },
    "mlx-distributed": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
        DefaultUsecases:  []string{UsecaseChat},
        Description:      "MLX distributed inference across multiple Apple Silicon devices",
    },
    "mlx-vlm": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
        DefaultUsecases:  []string{UsecaseChat, UsecaseVision},
        AcceptsImages:    true,
        AcceptsAudios:    true,
        Description:      "MLX vision-language models with multimodal input",
    },
    "mlx-audio": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodTTS},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTTS},
        DefaultUsecases:  []string{UsecaseChat},
        Description:      "MLX audio models — text generation and TTS",
    },

    // --- Image/video generation backends ---
    "diffusers": {
        GRPCMethods:      []GRPCMethod{MethodGenerateImage, MethodGenerateVideo},
        PossibleUsecases: []string{UsecaseImage, UsecaseVideo},
        DefaultUsecases:  []string{UsecaseImage},
        Description:      "HuggingFace diffusers — Stable Diffusion, Flux, video generation",
    },
    "stablediffusion": {
        GRPCMethods:      []GRPCMethod{MethodGenerateImage},
        PossibleUsecases: []string{UsecaseImage},
        DefaultUsecases:  []string{UsecaseImage},
        Description:      "Stable Diffusion native backend",
    },
    "stablediffusion-ggml": {
        GRPCMethods:      []GRPCMethod{MethodGenerateImage},
        PossibleUsecases: []string{UsecaseImage},
        DefaultUsecases:  []string{UsecaseImage},
        Description:      "Stable Diffusion via GGML quantized models",
    },

    // --- Speech-to-text backends ---
    "whisper": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription, MethodVAD},
        PossibleUsecases: []string{UsecaseTranscript, UsecaseVAD},
        DefaultUsecases:  []string{UsecaseTranscript},
        Description:      "OpenAI Whisper — speech recognition and voice activity detection",
    },
    "faster-whisper": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
        PossibleUsecases: []string{UsecaseTranscript},
        DefaultUsecases:  []string{UsecaseTranscript},
        Description:      "CTranslate2-accelerated Whisper for faster transcription",
    },
    "whisperx": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
        PossibleUsecases: []string{UsecaseTranscript},
        DefaultUsecases:  []string{UsecaseTranscript},
        Description:      "WhisperX — Whisper with word-level timestamps and speaker diarization",
    },
    "moonshine": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
        PossibleUsecases: []string{UsecaseTranscript},
        DefaultUsecases:  []string{UsecaseTranscript},
        Description:      "Moonshine speech recognition",
    },
    "nemo": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
        PossibleUsecases: []string{UsecaseTranscript},
        DefaultUsecases:  []string{UsecaseTranscript},
        Description:      "NVIDIA NeMo speech recognition",
    },
    "qwen-asr": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
        PossibleUsecases: []string{UsecaseTranscript},
        DefaultUsecases:  []string{UsecaseTranscript},
        Description:      "Qwen automatic speech recognition",
    },
    "voxtral": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
        PossibleUsecases: []string{UsecaseTranscript},
        DefaultUsecases:  []string{UsecaseTranscript},
        Description:      "Voxtral speech recognition",
    },
    "vibevoice": {
        GRPCMethods:      []GRPCMethod{MethodAudioTranscription, MethodTTS},
        PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTranscript, UsecaseTTS},
        Description:      "VibeVoice — bidirectional speech (transcription and synthesis)",
    },

    // --- TTS backends ---
    "piper": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Piper — fast neural TTS optimized for Raspberry Pi",
    },
    "kokoro": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Kokoro TTS",
    },
    "coqui": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Coqui TTS — multi-speaker neural synthesis",
    },
    "kitten-tts": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Kitten TTS",
    },
    "outetts": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "OuteTTS",
    },
    "pocket-tts": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Pocket TTS — lightweight text-to-speech",
    },
    "qwen-tts": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Qwen TTS",
    },
    "faster-qwen3-tts": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Faster Qwen3 TTS — accelerated Qwen TTS",
    },
    "fish-speech": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Fish Speech TTS",
    },
    "neutts": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "NeuTTS — neural text-to-speech",
    },
    "chatterbox": {
        GRPCMethods:      []GRPCMethod{MethodTTS},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "Chatterbox TTS",
    },
    "voxcpm": {
        GRPCMethods:      []GRPCMethod{MethodTTS, MethodTTSStream},
        PossibleUsecases: []string{UsecaseTTS},
        DefaultUsecases:  []string{UsecaseTTS},
        Description:      "VoxCPM TTS with streaming support",
    },

    // --- Sound generation backends ---
    "ace-step": {
        GRPCMethods:      []GRPCMethod{MethodTTS, MethodSoundGeneration},
        PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
        DefaultUsecases:  []string{UsecaseSoundGeneration},
        Description:      "ACE-Step — music and sound generation",
    },
    "acestep-cpp": {
        GRPCMethods:      []GRPCMethod{MethodSoundGeneration},
        PossibleUsecases: []string{UsecaseSoundGeneration},
        DefaultUsecases:  []string{UsecaseSoundGeneration},
        Description:      "ACE-Step C++ — native sound generation",
    },
    "transformers-musicgen": {
        GRPCMethods:      []GRPCMethod{MethodTTS, MethodSoundGeneration},
        PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
        DefaultUsecases:  []string{UsecaseSoundGeneration},
        Description:      "Meta MusicGen via transformers — music generation from text",
    },

    // --- Any-to-any audio backends ---
    "liquid-audio": {
        GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
        PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
        DefaultUsecases:  []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
        AcceptsAudios:    true,
        Description:      "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
    },

    // --- Audio transform backends ---
    "localvqe": {
        GRPCMethods:      []GRPCMethod{MethodAudioTransform},
        PossibleUsecases: []string{UsecaseAudioTransform},
        DefaultUsecases:  []string{UsecaseAudioTransform},
        Description:      "LocalVQE — joint AEC, noise suppression, and dereverberation for 16 kHz mono speech",
    },

    // --- Utility backends ---
    "rerankers": {
        GRPCMethods:      []GRPCMethod{MethodRerank},
        PossibleUsecases: []string{UsecaseRerank},
        DefaultUsecases:  []string{UsecaseRerank},
        Description:      "Cross-encoder reranking models",
    },
    "rfdetr": {
        GRPCMethods:      []GRPCMethod{MethodDetect},
        PossibleUsecases: []string{UsecaseDetection},
        DefaultUsecases:  []string{UsecaseDetection},
        Description:      "RF-DETR object detection",
    },
    "silero-vad": {
        GRPCMethods:      []GRPCMethod{MethodVAD},
        PossibleUsecases: []string{UsecaseVAD},
        DefaultUsecases:  []string{UsecaseVAD},
        Description:      "Silero VAD — voice activity detection",
    },
}

// NormalizeBackendName converts backend names to the canonical hyphenated form
// used in gallery entries (e.g., "llama.cpp" → "llama-cpp").
func NormalizeBackendName(backend string) string {
    return strings.ReplaceAll(backend, ".", "-")
}

// GetBackendCapability returns the capability info for a backend, or nil if unknown.
// Handles backend name normalization.
func GetBackendCapability(backend string) *BackendCapability {
    if cap, ok := BackendCapabilities[NormalizeBackendName(backend)]; ok {
        return &cap
    }
    return nil
}

// PossibleUsecasesForBackend returns all usecases a backend can support.
// Returns nil if the backend is unknown.
func PossibleUsecasesForBackend(backend string) []string {
    if cap := GetBackendCapability(backend); cap != nil {
        return cap.PossibleUsecases
    }
    return nil
}

// DefaultUsecasesForBackendCap returns the conservative default usecases.
// Returns nil if the backend is unknown.
func DefaultUsecasesForBackendCap(backend string) []string {
    if cap := GetBackendCapability(backend); cap != nil {
        return cap.DefaultUsecases
    }
    return nil
}

// IsValidUsecaseForBackend checks whether a usecase is in a backend's possible set.
// Returns true for unknown backends (permissive fallback).
func IsValidUsecaseForBackend(backend, usecase string) bool {
    cap := GetBackendCapability(backend)
    if cap == nil {
        return true // unknown backend — don't restrict
    }
    return slices.Contains(cap.PossibleUsecases, usecase)
}

// AllBackendNames returns a sorted list of all known backend names.
func AllBackendNames() []string {
    names := make([]string, 0, len(BackendCapabilities))
    for name := range BackendCapabilities {
        names = append(names, name)
    }
    slices.Sort(names)
    return names
}
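For reference, a short driver exercising the helpers above; the import path is
assumed from the repository's module name:

package main

import (
    "fmt"

    "github.com/mudler/LocalAI/core/config"
)

func main() {
    // Dotted names normalize before lookup.
    fmt.Println(config.IsValidUsecaseForBackend("llama.cpp", "chat")) // true
    // The new any-to-any entry.
    fmt.Println(config.PossibleUsecasesForBackend("liquid-audio"))
    // Unknown backends: permissive validity, nil defaults.
    fmt.Println(config.IsValidUsecaseForBackend("mystery", "vision")) // true
    fmt.Println(config.DefaultUsecasesForBackendCap("mystery"))       // []
}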