Files
LocalAI/core/config/backend_capabilities.go
LocalAI [bot] 294170d3ed feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery (#10352)
* feat(backend): add depth-anything (Depth Anything 3) C++/ggml backend + gallery

Mirrors the locate-anything-cpp backend to register a new depth-anything
backend that wraps the Depth Anything 3 ggml port (depth-anything.cpp) via
purego (cgo-less, no Python at inference).

- backend/go/depth-anything-cpp/: gRPC backend (Load + Predict + GenerateImage),
  purego binding to the da_capi_* C ABI, CMake/Makefile/run/package/test scripts
  building depth-anything.cpp's DA_SHARED static .so per CPU variant.
- backend/index.yaml: depth-anything backend meta + all hardware-variant
  capability entries (cpu/cuda12/cuda13/intel-sycl-f32+f16/vulkan/nvidia-l4t).
- gallery/index.yaml: 8 Depth Anything 3 GGUF models (base q4_k/q8_0/f16/f32,
  small, large, giant, mono-large).
- .github/backend-matrix.yml: one build entry per hardware variant.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(depth): typed Depth RPC + REST endpoint exposing full DA3 data

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): pin depth-anything.cpp to e0b6814 (ABI 3 dense C-API)

The Depth RPC handler calls da_capi_depth_dense / da_capi_points (C-API ABI 3);
pin the native build to the commit that exports them.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): pin depth-anything.cpp to v0.1.0 release (b515c31)

Repoint the native version from the now-orphaned e0b6814 to the
b515c31 release commit, kept alive by the upstream v0.1.0 tag.
C-API is unchanged (da_capi_abi_version == 3).

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): wire depth-anything-cpp into build, CI bump, and importer

The backend dir, gallery index, and CI build-matrix were present but the
backend was never wired into the integration points that adding-backends.md
requires:

- root Makefile: add to .NOTPARALLEL, the test-extra chain, a BACKEND_*
  definition, the docker-build target eval, and docker-build-backends
  (mirrors parakeet-cpp; the backend's own Makefile already documented that
  its `test` target is driven by test-extra).
- bump_deps.yaml: register the DEPTHANYTHING_VERSION pin so the daily
  auto-bump bot tracks mudler/depth-anything.cpp master (it cannot see an
  unregistered Makefile pin).
- import form: add a preference-only KnownBackend entry so depth-anything is
  selectable at /import-model (mirrors sam3-cpp; no reliable GGUF auto-detect
  signal, so pref-only per the doc's default).

changed-backends.js needs no entry: the generic golang suffix branch already
resolves backend/go/depth-anything-cpp/.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(depth): auto-detect importer for depth-anything GGUFs

Replace the preference-only entry with a real auto-detect importer
(mirrors parakeet-cpp / locate-anything):

- DepthAnythingImporter matches a .gguf whose name carries a
  depth-anything token (depth-anything-<size>-<quant>.gguf), so
  /import-model recognises mudler/depth-anything.cpp-gguf repos and direct
  GGUF URLs without an explicit backend preference. preferences.backend=
  "depth-anything" still forces it.
- Registered before LlamaCPPImporter so its GGUF bundles aren't claimed by
  the generic .gguf importer; the narrow name match means it cannot claim
  arbitrary llama GGUFs or the upstream safetensors PyTorch repos.
- Multi-quant repos pick the smallest quant by default (q4_k -> ... -> f32,
  depth stays >0.998 corr even at q4_k); quantizations preference overrides.
- Drops the now-redundant knownPrefOnlyBackends entry (importer-backed
  backends are not listed there, matching parakeet-cpp).
- Table-driven Ginkgo test covers detection, negative cases (llama GGUF,
  upstream safetensors), default/override/fallback quant pick, and direct
  URL import. 10/10 specs pass.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(depth): check conn.Close error in grpc Depth client (errcheck)

The new Depth() client method used a bare `defer conn.Close()`. golangci-lint
runs with new-from-merge-base, so although the 39 sibling methods use the same
bare form (grandfathered), the newly added line trips errcheck. Drop the result
explicitly to satisfy the linter.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): bump depth-anything.cpp to v0.1.1 (embeddable CMake)

v0.1.0 (b515c31) used ${CMAKE_SOURCE_DIR} for its include dirs, which
points at the parent project when built via add_subdirectory() as this
backend does, so the container build failed with missing stb_image.h /
da_gguf_keys.h. v0.1.1 (2d42897) switches to project-relative paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): resolve gosec findings in the backend wrapper

The code-scanning gate flagged three new failure-level alerts in
godepthanythingcpp.go (gosec runs with -no-fail; GitHub gates on new alerts):

- G301: export dirs were created with 0o755. Tighten to 0o750 (no world
  access needed for backend-written export output).
- G304: writeDepthPNG creates req.GetDst(). That path is chosen by the
  LocalAI core as the intended output destination (same pattern every
  image backend uses), not attacker input, so annotate with #nosec G304
  and document why.

The remaining G103 "audit unsafe" notes on the unsafe.Slice C-buffer copies
are warning-level (the same purego interop whisper/parakeet use) and do not
gate the check, per the supertonic exclusion precedent in secscan.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(depth): bump depth-anything.cpp to v0.1.2 (CUDA cross-build arch)

v0.1.1 forced CMAKE_CUDA_ARCHITECTURES=native, which breaks the GPU-less
l4t/cublas CI builds (nvcc "Unsupported gpu architecture 'compute_'" on
CMake 3.22). v0.1.2 (442eea4) drops the override and lets ggml pick its
default cross-build arch list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-16 16:28:28 +02:00

607 lines
24 KiB
Go

package config
import (
"slices"
"strings"
)
// Usecase name constants — the canonical string values used in gallery entries,
// model configs (known_usecases), and UsecaseInfoMap keys.
const (
UsecaseChat = "chat"
UsecaseCompletion = "completion"
UsecaseEdit = "edit"
UsecaseVision = "vision"
UsecaseEmbeddings = "embeddings"
UsecaseTokenize = "tokenize"
UsecaseImage = "image"
UsecaseVideo = "video"
UsecaseTranscript = "transcript"
UsecaseTTS = "tts"
UsecaseSoundGeneration = "sound_generation"
UsecaseRerank = "rerank"
UsecaseDetection = "detection"
UsecaseDepth = "depth"
UsecaseVAD = "vad"
UsecaseAudioTransform = "audio_transform"
UsecaseDiarization = "diarization"
UsecaseRealtimeAudio = "realtime_audio"
UsecaseFaceRecognition = "face_recognition"
UsecaseSpeakerRecognition = "speaker_recognition"
)
// GRPCMethod identifies a Backend service RPC from backend.proto.
type GRPCMethod string
const (
MethodPredict GRPCMethod = "Predict"
MethodPredictStream GRPCMethod = "PredictStream"
MethodEmbedding GRPCMethod = "Embedding"
MethodGenerateImage GRPCMethod = "GenerateImage"
MethodGenerateVideo GRPCMethod = "GenerateVideo"
MethodAudioTranscription GRPCMethod = "AudioTranscription"
MethodTTS GRPCMethod = "TTS"
MethodTTSStream GRPCMethod = "TTSStream"
MethodSoundGeneration GRPCMethod = "SoundGeneration"
MethodTokenizeString GRPCMethod = "TokenizeString"
MethodDetect GRPCMethod = "Detect"
MethodDepth GRPCMethod = "Depth"
MethodRerank GRPCMethod = "Rerank"
MethodVAD GRPCMethod = "VAD"
MethodAudioTransform GRPCMethod = "AudioTransform"
MethodDiarize GRPCMethod = "Diarize"
MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
MethodFaceVerify GRPCMethod = "FaceVerify"
MethodFaceAnalyze GRPCMethod = "FaceAnalyze"
MethodVoiceVerify GRPCMethod = "VoiceVerify"
MethodVoiceEmbed GRPCMethod = "VoiceEmbed"
MethodVoiceAnalyze GRPCMethod = "VoiceAnalyze"
)
// UsecaseInfo describes a single known_usecase value and how it maps
// to the gRPC backend API.
type UsecaseInfo struct {
// Flag is the ModelConfigUsecase bitmask value.
Flag ModelConfigUsecase
// GRPCMethod is the primary Backend service RPC this usecase maps to.
GRPCMethod GRPCMethod
// IsModifier is true when this usecase doesn't map to its own gRPC RPC
// but modifies how another RPC behaves (e.g., vision uses Predict with images).
IsModifier bool
// DependsOn names the usecase(s) this modifier requires (e.g., "chat").
DependsOn string
// Description is a human/LLM-readable explanation of what this usecase means.
Description string
}
// UsecaseInfoMap maps each known_usecase string to its gRPC and semantic info.
var UsecaseInfoMap = map[string]UsecaseInfo{
UsecaseChat: {
Flag: FLAG_CHAT,
GRPCMethod: MethodPredict,
Description: "Conversational/instruction-following via the Predict RPC with chat templates.",
},
UsecaseCompletion: {
Flag: FLAG_COMPLETION,
GRPCMethod: MethodPredict,
Description: "Text completion via the Predict RPC with a completion template.",
},
UsecaseEdit: {
Flag: FLAG_EDIT,
GRPCMethod: MethodPredict,
Description: "Text editing via the Predict RPC with an edit template.",
},
UsecaseVision: {
Flag: FLAG_VISION,
GRPCMethod: MethodPredict,
IsModifier: true,
DependsOn: UsecaseChat,
Description: "The model accepts images alongside text in the Predict RPC. For llama-cpp this requires an mmproj file.",
},
UsecaseEmbeddings: {
Flag: FLAG_EMBEDDINGS,
GRPCMethod: MethodEmbedding,
Description: "Vector embedding generation via the Embedding RPC.",
},
UsecaseTokenize: {
Flag: FLAG_TOKENIZE,
GRPCMethod: MethodTokenizeString,
Description: "Tokenization via the TokenizeString RPC without running inference.",
},
UsecaseImage: {
Flag: FLAG_IMAGE,
GRPCMethod: MethodGenerateImage,
Description: "Image generation via the GenerateImage RPC (Stable Diffusion, Flux, etc.).",
},
UsecaseVideo: {
Flag: FLAG_VIDEO,
GRPCMethod: MethodGenerateVideo,
Description: "Video generation via the GenerateVideo RPC.",
},
UsecaseTranscript: {
Flag: FLAG_TRANSCRIPT,
GRPCMethod: MethodAudioTranscription,
Description: "Speech-to-text via the AudioTranscription RPC.",
},
UsecaseTTS: {
Flag: FLAG_TTS,
GRPCMethod: MethodTTS,
Description: "Text-to-speech via the TTS RPC.",
},
UsecaseSoundGeneration: {
Flag: FLAG_SOUND_GENERATION,
GRPCMethod: MethodSoundGeneration,
Description: "Music/sound generation via the SoundGeneration RPC (not speech).",
},
UsecaseRerank: {
Flag: FLAG_RERANK,
GRPCMethod: MethodRerank,
Description: "Document reranking via the Rerank RPC.",
},
UsecaseDetection: {
Flag: FLAG_DETECTION,
GRPCMethod: MethodDetect,
Description: "Object detection via the Detect RPC with bounding boxes.",
},
UsecaseDepth: {
Flag: FLAG_DEPTH,
GRPCMethod: MethodDepth,
Description: "Per-pixel metric depth, camera pose and 3D point cloud via the Depth RPC (Depth Anything 3).",
},
UsecaseVAD: {
Flag: FLAG_VAD,
GRPCMethod: MethodVAD,
Description: "Voice activity detection via the VAD RPC.",
},
UsecaseAudioTransform: {
Flag: FLAG_AUDIO_TRANSFORM,
GRPCMethod: MethodAudioTransform,
Description: "Audio-in / audio-out transformations (echo cancellation, noise suppression, dereverberation, voice conversion) via the AudioTransform RPC.",
},
UsecaseDiarization: {
Flag: FLAG_DIARIZATION,
GRPCMethod: MethodDiarize,
Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
},
UsecaseRealtimeAudio: {
Flag: FLAG_REALTIME_AUDIO,
GRPCMethod: MethodAudioToAudioStream,
Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
},
UsecaseFaceRecognition: {
Flag: FLAG_FACE_RECOGNITION,
GRPCMethod: MethodFaceVerify,
Description: "Face recognition — verify identity, analyze attributes (age/gender/emotion) via FaceVerify and FaceAnalyze RPCs.",
},
UsecaseSpeakerRecognition: {
Flag: FLAG_SPEAKER_RECOGNITION,
GRPCMethod: MethodVoiceVerify,
Description: "Speaker recognition — verify identity, embed and analyze voice via VoiceVerify, VoiceEmbed and VoiceAnalyze RPCs.",
},
}
// BackendCapability describes which gRPC methods and usecases a backend supports.
// Derived from reviewing actual implementations in backend/go/ and backend/python/.
type BackendCapability struct {
// GRPCMethods lists the Backend service RPCs this backend implements.
GRPCMethods []GRPCMethod
// PossibleUsecases lists all usecase strings this backend can support.
PossibleUsecases []string
// DefaultUsecases lists the conservative safe defaults.
DefaultUsecases []string
// AcceptsImages indicates multimodal image input in Predict.
AcceptsImages bool
// AcceptsVideos indicates multimodal video input in Predict.
AcceptsVideos bool
// AcceptsAudios indicates multimodal audio input in Predict.
AcceptsAudios bool
// Description is a human-readable summary of the backend.
Description string
}
// BackendCapabilities maps each backend name (as used in model configs and gallery
// entries) to its verified capabilities. This is the single source of truth for
// what each backend supports.
//
// Backend names use hyphens (e.g., "llama-cpp") matching the gallery convention.
// Use NormalizeBackendName() for names with dots (e.g., "llama.cpp").
var BackendCapabilities = map[string]BackendCapability{
// --- LLM / text generation backends ---
"llama-cpp": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTokenizeString},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEdit, UsecaseEmbeddings, UsecaseTokenize, UsecaseVision},
DefaultUsecases: []string{UsecaseChat},
AcceptsImages: true, // requires mmproj
Description: "llama.cpp GGUF models — LLM inference with optional vision via mmproj",
},
"vllm": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
DefaultUsecases: []string{UsecaseChat},
AcceptsImages: true,
AcceptsVideos: true,
Description: "vLLM engine — high-throughput LLM serving with optional multimodal",
},
"sglang": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodTokenizeString},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTokenize, UsecaseVision},
DefaultUsecases: []string{UsecaseChat},
AcceptsImages: true,
Description: "SGLang — fast LLM inference with structured generation and optional vision",
},
"vllm-omni": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodGenerateImage, MethodGenerateVideo, MethodTTS},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseImage, UsecaseVideo, UsecaseTTS, UsecaseVision},
DefaultUsecases: []string{UsecaseChat},
AcceptsImages: true,
AcceptsVideos: true,
AcceptsAudios: true,
Description: "vLLM omni-modal — supports text, image, video generation and TTS",
},
"transformers": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTTS, MethodSoundGeneration},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseTTS, UsecaseSoundGeneration},
DefaultUsecases: []string{UsecaseChat},
Description: "HuggingFace transformers — general-purpose Python inference",
},
"mlx": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
DefaultUsecases: []string{UsecaseChat},
Description: "Apple MLX framework — optimized for Apple Silicon",
},
"mlx-distributed": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
DefaultUsecases: []string{UsecaseChat},
Description: "MLX distributed inference across multiple Apple Silicon devices",
},
"mlx-vlm": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
DefaultUsecases: []string{UsecaseChat, UsecaseVision},
AcceptsImages: true,
AcceptsAudios: true,
Description: "MLX vision-language models with multimodal input",
},
"mlx-audio": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodTTS},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTTS},
DefaultUsecases: []string{UsecaseChat},
Description: "MLX audio models — text generation and TTS",
},
// --- Image/video generation backends ---
"diffusers": {
GRPCMethods: []GRPCMethod{MethodGenerateImage, MethodGenerateVideo},
PossibleUsecases: []string{UsecaseImage, UsecaseVideo},
DefaultUsecases: []string{UsecaseImage},
Description: "HuggingFace diffusers — Stable Diffusion, Flux, video generation",
},
"stablediffusion": {
GRPCMethods: []GRPCMethod{MethodGenerateImage},
PossibleUsecases: []string{UsecaseImage},
DefaultUsecases: []string{UsecaseImage},
Description: "Stable Diffusion native backend",
},
"stablediffusion-ggml": {
GRPCMethods: []GRPCMethod{MethodGenerateImage},
PossibleUsecases: []string{UsecaseImage},
DefaultUsecases: []string{UsecaseImage},
Description: "Stable Diffusion via GGML quantized models",
},
// --- Speech-to-text backends ---
"whisper": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodVAD},
PossibleUsecases: []string{UsecaseTranscript, UsecaseVAD},
DefaultUsecases: []string{UsecaseTranscript},
Description: "OpenAI Whisper — speech recognition and voice activity detection",
},
"faster-whisper": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "CTranslate2-accelerated Whisper for faster transcription",
},
"whisperx": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "WhisperX — Whisper with word-level timestamps and speaker diarization",
},
"moonshine": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Moonshine speech recognition",
},
"nemo": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "NVIDIA NeMo speech recognition",
},
"parakeet-cpp": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "NVIDIA NeMo Parakeet ASR (parakeet.cpp)",
},
"qwen-asr": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Qwen automatic speech recognition",
},
"voxtral": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Voxtral speech recognition",
},
"vibevoice": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
Description: "VibeVoice — bidirectional speech (transcription and synthesis)",
},
"vibevoice-cpp": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
Description: "VibeVoice C++ — bidirectional speech, C++ backend with streaming TTS",
},
"sherpa-onnx": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream, MethodVAD},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS, UsecaseVAD},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Sherpa-ONNX — multi-model speech toolkit (ASR, TTS, VAD)",
},
// --- TTS backends ---
"piper": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Piper — fast neural TTS optimized for Raspberry Pi",
},
"kokoro": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Kokoro TTS",
},
"coqui": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Coqui TTS — multi-speaker neural synthesis",
},
"kitten-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Kitten TTS",
},
"outetts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "OuteTTS",
},
"pocket-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Pocket TTS — lightweight text-to-speech",
},
"qwen-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Qwen TTS",
},
"qwen3-tts-cpp": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodTTSStream},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Qwen3 TTS C++ - text-to-speech with streaming, named speakers, voice design and cloning (qwentts.cpp / GGML)",
},
"faster-qwen3-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Faster Qwen3 TTS — accelerated Qwen TTS",
},
"fish-speech": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Fish Speech TTS",
},
"neutts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "NeuTTS — neural text-to-speech",
},
"chatterbox": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Chatterbox TTS",
},
"voxcpm": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodTTSStream},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "VoxCPM TTS with streaming support",
},
// --- Sound generation backends ---
"ace-step": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodSoundGeneration},
PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
DefaultUsecases: []string{UsecaseSoundGeneration},
Description: "ACE-Step — music and sound generation",
},
"acestep-cpp": {
GRPCMethods: []GRPCMethod{MethodSoundGeneration},
PossibleUsecases: []string{UsecaseSoundGeneration},
DefaultUsecases: []string{UsecaseSoundGeneration},
Description: "ACE-Step C++ — native sound generation",
},
"transformers-musicgen": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodSoundGeneration},
PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
DefaultUsecases: []string{UsecaseSoundGeneration},
Description: "Meta MusicGen via transformers — music generation from text",
},
// --- Any-to-any audio backends ---
"liquid-audio": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
DefaultUsecases: []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
AcceptsAudios: true,
Description: "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
},
// --- Audio transform backends ---
"localvqe": {
GRPCMethods: []GRPCMethod{MethodAudioTransform},
PossibleUsecases: []string{UsecaseAudioTransform},
DefaultUsecases: []string{UsecaseAudioTransform},
Description: "LocalVQE — joint AEC, noise suppression, and dereverberation for 16 kHz mono speech",
},
// --- Utility backends ---
"rerankers": {
GRPCMethods: []GRPCMethod{MethodRerank},
PossibleUsecases: []string{UsecaseRerank},
DefaultUsecases: []string{UsecaseRerank},
Description: "Cross-encoder reranking models",
},
"rfdetr": {
GRPCMethods: []GRPCMethod{MethodDetect},
PossibleUsecases: []string{UsecaseDetection},
DefaultUsecases: []string{UsecaseDetection},
Description: "RF-DETR object detection",
},
"rfdetr-cpp": {
GRPCMethods: []GRPCMethod{MethodDetect},
PossibleUsecases: []string{UsecaseDetection},
DefaultUsecases: []string{UsecaseDetection},
Description: "RF-DETR C++ object detection",
},
"depth-anything": {
GRPCMethods: []GRPCMethod{MethodDepth, MethodPredict, MethodGenerateImage},
PossibleUsecases: []string{UsecaseDepth},
DefaultUsecases: []string{UsecaseDepth},
AcceptsImages: true,
Description: "Depth Anything 3 C++ — per-pixel metric depth, camera pose and 3D point cloud",
},
// --- Face and speaker recognition backends ---
"insightface": {
GRPCMethods: []GRPCMethod{MethodEmbedding, MethodDetect, MethodFaceVerify, MethodFaceAnalyze},
PossibleUsecases: []string{UsecaseEmbeddings, UsecaseDetection, UsecaseFaceRecognition},
DefaultUsecases: []string{UsecaseFaceRecognition},
AcceptsImages: true,
Description: "InsightFace — face detection, embedding, verification and attribute analysis",
},
"speaker-recognition": {
GRPCMethods: []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
PossibleUsecases: []string{UsecaseSpeakerRecognition},
DefaultUsecases: []string{UsecaseSpeakerRecognition},
Description: "Speaker recognition — voice identity verification and analysis",
},
"silero-vad": {
GRPCMethods: []GRPCMethod{MethodVAD},
PossibleUsecases: []string{UsecaseVAD},
DefaultUsecases: []string{UsecaseVAD},
Description: "Silero VAD — voice activity detection",
},
}
// NormalizeBackendName converts backend names to the canonical hyphenated form
// used in gallery entries (e.g., "llama.cpp" → "llama-cpp").
func NormalizeBackendName(backend string) string {
return strings.ReplaceAll(backend, ".", "-")
}
// nonLlamaSamplerBackends lists backends whose native sampler defaults differ
// from llama.cpp's, so LocalAI must NOT inject llama.cpp's top_k=40 default for
// them (issue #6632). mlx_lm's intended default is top_k=0 (disabled) and mlx
// does not remap 0->40, so shipping 40 silently changes sampling for clients
// that omit top_k. Leaving TopK nil lets the wire value default to 0.
//
// This is intentionally a small allow-list of KNOWN non-llama backends: empty
// and unknown backends fall through to the llama.cpp default to preserve the
// GGUF auto-detect path's behavior.
var nonLlamaSamplerBackends = map[string]struct{}{
"mlx": {},
"mlx-vlm": {},
"mlx-distributed": {},
}
// UsesLlamaSamplerDefaults reports whether a backend should receive llama.cpp's
// sampler defaults (e.g. top_k=40). Empty/unknown backends return true so the
// GGUF auto-detect path (which resolves to llama.cpp) keeps today's behavior;
// only the known non-llama backends in nonLlamaSamplerBackends return false.
func UsesLlamaSamplerDefaults(backend string) bool {
if backend == "" {
return true
}
_, isNonLlama := nonLlamaSamplerBackends[NormalizeBackendName(backend)]
return !isNonLlama
}
// GetBackendCapability returns the capability info for a backend, or nil if unknown.
// Handles backend name normalization.
func GetBackendCapability(backend string) *BackendCapability {
if cap, ok := BackendCapabilities[NormalizeBackendName(backend)]; ok {
return &cap
}
return nil
}
// PossibleUsecasesForBackend returns all usecases a backend can support.
// Returns nil if the backend is unknown.
func PossibleUsecasesForBackend(backend string) []string {
if cap := GetBackendCapability(backend); cap != nil {
return cap.PossibleUsecases
}
return nil
}
// DefaultUsecasesForBackend returns the conservative default usecases.
// Returns nil if the backend is unknown.
func DefaultUsecasesForBackendCap(backend string) []string {
if cap := GetBackendCapability(backend); cap != nil {
return cap.DefaultUsecases
}
return nil
}
// IsValidUsecaseForBackend checks whether a usecase is in a backend's possible set.
// Returns true for unknown backends (permissive fallback).
func IsValidUsecaseForBackend(backend, usecase string) bool {
cap := GetBackendCapability(backend)
if cap == nil {
return true // unknown backend — don't restrict
}
return slices.Contains(cap.PossibleUsecases, usecase)
}
// AllBackendNames returns a sorted list of all known backend names.
func AllBackendNames() []string {
names := make([]string, 0, len(BackendCapabilities))
for name := range BackendCapabilities {
names = append(names, name)
}
slices.Sort(names)
return names
}