mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-28 10:27:30 -04:00
* feat(voice-detect): add Go purego backend for voice-detect.cpp

Add backend/go/voice-detect implementing the Backend gRPC voice subset (VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego, mirroring the parakeet-cpp / omnivoice-cpp backends. The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and float-vector returns are owned by Go and released through the matching capi free functions, with the per-ctx last error surfaced into Go errors. Calls are serialized via base.SingleThread since the C context is not reentrant.

Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the verify_threshold option, default 0.25) -> verify_paths -> verified/distance/threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion document maps to a single VoiceAnalysis segment (start/end 0; gender "label" -> dominant_gender with the remaining float scores as the gender map; emotion label/scores -> dominant_emotion/emotion).

The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(voice-detect): wire backend into index, gallery and build

Register the voice-detect.cpp speaker-recognition + voice-analysis backend (added in Voice-INT-A) into LocalAI's distribution surfaces, mirroring the ced backend (the closest mudler C++/ggml audio analogue):
- backend/index.yaml: add the &voicedetect meta-backend (capabilities platform map, no top-level uri) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the -development variants). Referential integrity audited - every alias target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect - ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the wav2vec2 age/gender/emotion analyze model. The engine architecture is read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are not yet published: each files: entry points at the intended mudler/voice-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via VOICEDETECT_VERSION (pin 47546430, = 4754643).
- core/config/backend_capabilities.go: register voice-detect in the backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze -> speaker_recognition), mirroring speaker-recognition.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): add purego Go backend for face-detect.cpp

Add the LocalAI Go backend that dlopens libfacedetect.so (the flat facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect backend. Implements the Face subset of the Backend gRPC service:
- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path -> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes (class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof -> verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face age + gender ("M"/"F" normalized to "Man"/"Woman").

The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib with ggml + vendored libjpeg-turbo static (PIC), so the .so is ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_ symbols). Gated Ginkgo e2e mirrors voice-detect.

Note for the gallery-wiring task: backend registration (index.yaml, gallery, core/config/backend_capabilities.go) is intentionally not touched here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(voice-detect): replace em dashes in net-new descriptions

Project style forbids em/en dashes. Replace the three U+2014 chars introduced by the voice-detect gallery/index wiring with `-`/`:`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): wire backend into index, gallery and build

Register the face-detect.cpp face detection / embedding / verification / analysis backend (added in Face-INT-A) into LocalAI's distribution surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml recognition analogue):
- backend/index.yaml: add the &facedetect meta-backend (capabilities platform map, no top-level uri to avoid the meta-backend gotcha) plus the full set of concrete per-arch image entries (cpu/cuda12/cuda13/metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants), 22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect - face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL) and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the commercial-friendly alternative). The detector/embedder architecture is read from GGUF metadata (facedetect.arch) at load; only the real verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF artifacts are not yet published: each files: entry points at the intended mudler/face-detect-gguf location with a TODO to fill sha256 after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze -> face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via FACEDETECT_VERSION (pin 636a1963).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(recon): voice-detect metal build branch + face-detect gallery usecases

Add the missing metal BUILD_TYPE branch to the voice-detect Makefile forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the darwin metal CI artifact is built with the Metal backend instead of CPU-only.

Expand the 4 face-detect gallery models' known_usecases to [face_recognition, detection, embeddings] to match the backend capabilities map and the mirrored insightface-buffalo entries, so auto-selection for /v1/detect and /embeddings works.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(recon): document voice-detect and face-detect ggml backends

Document the new standalone C++/ggml biometric backends as the recommended/default option for face and voice recognition, keeping the existing Python insightface / speaker-recognition backends framed as the legacy path.
- features/face-recognition.md: add a face-detect (ggml) backend section with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend section with the gallery entries (ecapa-tdnn, wespeaker-resnet34, eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(gallery): publish recon backend GGUF uris + sha256

Fill in the published HuggingFace GGUF uris and verified sha256 for the 9 recon gallery entries (voice-detect-* and face-detect-*), and remove the TODO publish markers. Correct the eres2net, campplus, and emotion-wav2vec2 uris to the actual published filenames.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model

Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now embedded and re-uploaded under the same filenames/uris) and note the FaceVerify anti_spoof request flag in each description. Add a new voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion model.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): add face-detect-buffalo-sc and antelopev2 packs

Add gallery entries for two newly-published insightface face packs on the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k R100, 512-d). Both are non-commercial research-only.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): honor LocalAI per-model threads in voice/face-detect backends

LocalAI spawns one backend process per model and serves requests concurrently, so the engines' own min(hardware_concurrency, 8) default can oversubscribe cores. Forward the per-model Threads value from the gRPC LoadModel options into the engine via VOICEDETECT_THREADS / FACEDETECT_THREADS (read at backend construction) before the capi load. A non-positive Threads is treated as unset, leaving the engine default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to CPU-optimized engine commits

voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads). Brings the CPU optimizations into the LocalAI backend builds. GGUF format and parity unchanged, so the published HF GGUFs remain valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-2 CPU-optimized engines

voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context, wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp -> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-3 Winograd engines

voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3 convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held (cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete)

voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%); face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is now exhausted (parity held cosine=1.0; GGUF/parity unchanged, HF GGUFs valid).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d)

face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel (2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch (ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect ~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable single binary (function-multiversioned, no global -mavx512f).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae)

voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32 conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t, parity cosine=1.0.
Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker do not call conv1d_same so are provably unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0)

face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of MLAS @1t, no regression.

voice -> f7b9f89: runtime-CPUID-dispatched AVX-512 winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker 1.90x @1t.

Parity cosine=1.0 throughout; portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67)

Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s, within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback, fused bias/relu.

voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%) + ERes2Net, CAM++ pinned to Winograd.

face be22d67: shape-gated to the ArcFace recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression).

Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef)

WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs ~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t / +17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b)

voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0); ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned.

face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd (0 island invocations during detect).

Parity cosine=1.0 / detect <=1px throughout.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6)

Measured-gap-driven conv kernels: small-spatial (fill the register tile when output width <= tile width) + small-IC stem + strided-1x1/downsample recovery. ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker 0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a)

GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn -> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x), CAM++ -5.7ms; bit-identical parity.
Cache hardened multi-model-safe (invalidate-on-free keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit. Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24)

Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled). On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity), WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA backend build must enable the flag AND bundle libcudnn - deferred until a cuDNN-bundled GPU image; flag stays OFF here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends

The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN (default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity (SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10 (arm64, CUDA 13, sm_121a).

Enable it for the CUDA build, but only where cuDNN actually ships: the arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN, so flipping it on globally for BUILD_TYPE=cublas would be a link failure. The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the matrix/Docker build, uname -m fallback for local builds).
backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13 in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so the build-time link resolves.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd)

Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path (CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t, 2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs golden), so registered voices + verify/identify thresholds are unaffected. The prior default-OFF rested on a stale comment whose 23pct regression only held on the non-shipping GGML_NATIVE=ON build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(readme): announce native voice-detect + face-detect backends in Latest News

Add a Latest News entry for the new from-scratch C++/ggml biometric backends (voice-detect.cpp + face-detect.cpp) that replace the Python insightface and speaker-recognition backends: no Python/onnxruntime at inference, self-contained GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp / locate-anything.cpp native-backend news entries. Refs PR #10441.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): re-pin to the squashed engine release commits

The voice-detect.cpp and face-detect.cpp histories were squashed to a single release commit, which orphaned the previous pins (voice 30beecd, face 6107a24). Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is identical, so the backend build is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
645 lines
26 KiB
Go
package config

import (
	"slices"
	"strings"
)

// Usecase name constants — the canonical string values used in gallery entries,
// model configs (known_usecases), and UsecaseInfoMap keys.
const (
	UsecaseChat                = "chat"
	UsecaseCompletion          = "completion"
	UsecaseEdit                = "edit"
	UsecaseVision              = "vision"
	UsecaseEmbeddings          = "embeddings"
	UsecaseTokenize            = "tokenize"
	UsecaseImage               = "image"
	UsecaseVideo               = "video"
	UsecaseTranscript          = "transcript"
	UsecaseTTS                 = "tts"
	UsecaseSoundGeneration     = "sound_generation"
	UsecaseRerank              = "rerank"
	UsecaseDetection           = "detection"
	UsecaseDepth               = "depth"
	UsecaseVAD                 = "vad"
	UsecaseAudioTransform      = "audio_transform"
	UsecaseDiarization         = "diarization"
	UsecaseSoundClassification = "sound_classification"
	UsecaseRealtimeAudio       = "realtime_audio"
	UsecaseFaceRecognition     = "face_recognition"
	UsecaseSpeakerRecognition  = "speaker_recognition"
	UsecaseTokenClassify       = "token_classify"
)

// GRPCMethod identifies a Backend service RPC from backend.proto.
type GRPCMethod string

const (
	MethodPredict            GRPCMethod = "Predict"
	MethodPredictStream      GRPCMethod = "PredictStream"
	MethodEmbedding          GRPCMethod = "Embedding"
	MethodGenerateImage      GRPCMethod = "GenerateImage"
	MethodGenerateVideo      GRPCMethod = "GenerateVideo"
	MethodAudioTranscription GRPCMethod = "AudioTranscription"
	MethodTTS                GRPCMethod = "TTS"
	MethodTTSStream          GRPCMethod = "TTSStream"
	MethodSoundGeneration    GRPCMethod = "SoundGeneration"
	MethodTokenizeString     GRPCMethod = "TokenizeString"
	MethodDetect             GRPCMethod = "Detect"
	MethodDepth              GRPCMethod = "Depth"
	MethodRerank             GRPCMethod = "Rerank"
	MethodVAD                GRPCMethod = "VAD"
	MethodAudioTransform     GRPCMethod = "AudioTransform"
	MethodDiarize            GRPCMethod = "Diarize"
	MethodSoundDetection     GRPCMethod = "SoundDetection"
	MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
	MethodFaceVerify         GRPCMethod = "FaceVerify"
	MethodFaceAnalyze        GRPCMethod = "FaceAnalyze"
	MethodVoiceVerify        GRPCMethod = "VoiceVerify"
	MethodVoiceEmbed         GRPCMethod = "VoiceEmbed"
	MethodVoiceAnalyze       GRPCMethod = "VoiceAnalyze"
	MethodTokenClassify      GRPCMethod = "TokenClassify"
)

// UsecaseInfo describes a single known_usecase value and how it maps
// to the gRPC backend API.
type UsecaseInfo struct {
	// Flag is the ModelConfigUsecase bitmask value.
	Flag ModelConfigUsecase
	// GRPCMethod is the primary Backend service RPC this usecase maps to.
	GRPCMethod GRPCMethod
	// IsModifier is true when this usecase doesn't map to its own gRPC RPC
	// but modifies how another RPC behaves (e.g., vision uses Predict with images).
	IsModifier bool
	// DependsOn names the usecase(s) this modifier requires (e.g., "chat").
	DependsOn string
	// Description is a human/LLM-readable explanation of what this usecase means.
	Description string
}

// UsecaseInfoMap maps each known_usecase string to its gRPC and semantic info.
var UsecaseInfoMap = map[string]UsecaseInfo{
	UsecaseChat: {
		Flag:        FLAG_CHAT,
		GRPCMethod:  MethodPredict,
		Description: "Conversational/instruction-following via the Predict RPC with chat templates.",
	},
	UsecaseCompletion: {
		Flag:        FLAG_COMPLETION,
		GRPCMethod:  MethodPredict,
		Description: "Text completion via the Predict RPC with a completion template.",
	},
	UsecaseEdit: {
		Flag:        FLAG_EDIT,
		GRPCMethod:  MethodPredict,
		Description: "Text editing via the Predict RPC with an edit template.",
	},
	UsecaseVision: {
		Flag:        FLAG_VISION,
		GRPCMethod:  MethodPredict,
		IsModifier:  true,
		DependsOn:   UsecaseChat,
		Description: "The model accepts images alongside text in the Predict RPC. For llama-cpp this requires an mmproj file.",
	},
	UsecaseEmbeddings: {
		Flag:        FLAG_EMBEDDINGS,
		GRPCMethod:  MethodEmbedding,
		Description: "Vector embedding generation via the Embedding RPC.",
	},
	UsecaseTokenize: {
		Flag:        FLAG_TOKENIZE,
		GRPCMethod:  MethodTokenizeString,
		Description: "Tokenization via the TokenizeString RPC without running inference.",
	},
	UsecaseImage: {
		Flag:        FLAG_IMAGE,
		GRPCMethod:  MethodGenerateImage,
		Description: "Image generation via the GenerateImage RPC (Stable Diffusion, Flux, etc.).",
	},
	UsecaseVideo: {
		Flag:        FLAG_VIDEO,
		GRPCMethod:  MethodGenerateVideo,
		Description: "Video generation via the GenerateVideo RPC.",
	},
	UsecaseTranscript: {
		Flag:        FLAG_TRANSCRIPT,
		GRPCMethod:  MethodAudioTranscription,
		Description: "Speech-to-text via the AudioTranscription RPC.",
	},
	UsecaseTTS: {
		Flag:        FLAG_TTS,
		GRPCMethod:  MethodTTS,
		Description: "Text-to-speech via the TTS RPC.",
	},
	UsecaseSoundGeneration: {
		Flag:        FLAG_SOUND_GENERATION,
		GRPCMethod:  MethodSoundGeneration,
		Description: "Music/sound generation via the SoundGeneration RPC (not speech).",
	},
	UsecaseRerank: {
		Flag:        FLAG_RERANK,
		GRPCMethod:  MethodRerank,
		Description: "Document reranking via the Rerank RPC.",
	},
	UsecaseDetection: {
		Flag:        FLAG_DETECTION,
		GRPCMethod:  MethodDetect,
		Description: "Object detection via the Detect RPC with bounding boxes.",
	},
	UsecaseDepth: {
		Flag:        FLAG_DEPTH,
		GRPCMethod:  MethodDepth,
		Description: "Per-pixel metric depth, camera pose and 3D point cloud via the Depth RPC (Depth Anything 3).",
	},
	UsecaseVAD: {
		Flag:        FLAG_VAD,
		GRPCMethod:  MethodVAD,
		Description: "Voice activity detection via the VAD RPC.",
	},
	UsecaseAudioTransform: {
		Flag:        FLAG_AUDIO_TRANSFORM,
		GRPCMethod:  MethodAudioTransform,
		Description: "Audio-in / audio-out transformations (echo cancellation, noise suppression, dereverberation, voice conversion) via the AudioTransform RPC.",
	},
	UsecaseDiarization: {
		Flag:        FLAG_DIARIZATION,
		GRPCMethod:  MethodDiarize,
		Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
	},
	UsecaseSoundClassification: {
		Flag:        FLAG_SOUND_CLASSIFICATION,
		GRPCMethod:  MethodSoundDetection,
		Description: "Sound-event classification / audio tagging (scored AudioSet labels like baby cry, glass breaking, alarms) via the SoundDetection RPC.",
	},
	UsecaseRealtimeAudio: {
		Flag:        FLAG_REALTIME_AUDIO,
		GRPCMethod:  MethodAudioToAudioStream,
		Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
	},
	UsecaseFaceRecognition: {
		Flag:        FLAG_FACE_RECOGNITION,
		GRPCMethod:  MethodFaceVerify,
		Description: "Face recognition — verify identity, analyze attributes (age/gender/emotion) via FaceVerify and FaceAnalyze RPCs.",
	},
	UsecaseSpeakerRecognition: {
		Flag:        FLAG_SPEAKER_RECOGNITION,
		GRPCMethod:  MethodVoiceVerify,
		Description: "Speaker recognition — verify identity, embed and analyze voice via VoiceVerify, VoiceEmbed and VoiceAnalyze RPCs.",
	},
	UsecaseTokenClassify: {
		Flag:        FLAG_TOKEN_CLASSIFY,
		GRPCMethod:  MethodTokenClassify,
		Description: "Per-token classification (NER) via the TokenClassify RPC — the PII detector tier. Declared explicitly via known_usecases; never auto-guessed, since the token-classification head is not useful as general generation or embeddings.",
	},
}
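
The map above pairs each usecase string with its primary RPC, and modifier usecases such as vision additionally name the usecase they depend on. The following is a minimal, standalone sketch of how a caller might resolve that pairing. The `usecaseInfo` type, the `usecases` table and the `resolve` helper are hypothetical stand-ins written for illustration, not part of this package.

```go
package main

import "fmt"

// usecaseInfo is a trimmed, hypothetical stand-in for UsecaseInfo
// (the Flag and Description fields are omitted for brevity).
type usecaseInfo struct {
	GRPCMethod string
	IsModifier bool
	DependsOn  string
}

// usecases is a small illustrative subset of the real table.
var usecases = map[string]usecaseInfo{
	"chat":   {GRPCMethod: "Predict"},
	"vision": {GRPCMethod: "Predict", IsModifier: true, DependsOn: "chat"},
	"rerank": {GRPCMethod: "Rerank"},
}

// resolve returns the RPC a usecase maps to and, for modifier usecases,
// the usecase it depends on. ok is false for unknown usecase names.
func resolve(name string) (method, dependsOn string, ok bool) {
	info, found := usecases[name]
	if !found {
		return "", "", false
	}
	if info.IsModifier {
		return info.GRPCMethod, info.DependsOn, true
	}
	return info.GRPCMethod, "", true
}

func main() {
	for _, name := range []string{"chat", "vision", "unknown"} {
		method, dep, ok := resolve(name)
		fmt.Printf("%s: method=%q dependsOn=%q ok=%v\n", name, method, dep, ok)
	}
}
```

Returning the dependency alongside the method lets a caller check that, for example, a vision model is also declared as a chat model before routing images through Predict.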
// BackendCapability describes which gRPC methods and usecases a backend supports.
// Derived from reviewing actual implementations in backend/go/ and backend/python/.
type BackendCapability struct {
	// GRPCMethods lists the Backend service RPCs this backend implements.
	GRPCMethods []GRPCMethod
	// PossibleUsecases lists all usecase strings this backend can support.
	PossibleUsecases []string
	// DefaultUsecases lists the conservative safe defaults.
	DefaultUsecases []string
	// AcceptsImages indicates multimodal image input in Predict.
	AcceptsImages bool
	// AcceptsVideos indicates multimodal video input in Predict.
	AcceptsVideos bool
	// AcceptsAudios indicates multimodal audio input in Predict.
	AcceptsAudios bool
	// Description is a human-readable summary of the backend.
	Description string
}
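
The split between PossibleUsecases and DefaultUsecases suggests a simple selection rule: honor explicitly requested usecases the backend can serve, and fall back to the conservative defaults when nothing is requested. The sketch below illustrates that idea under stated assumptions. The `backendCapability` type and the `effectiveUsecases` helper are hypothetical and are not the selection logic this package actually uses.

```go
package main

import (
	"fmt"
	"slices"
)

// backendCapability is a trimmed, hypothetical stand-in for BackendCapability.
type backendCapability struct {
	PossibleUsecases []string
	DefaultUsecases  []string
}

// effectiveUsecases keeps only the requested usecases the backend can
// support. With no request it returns a copy of the backend's defaults.
func effectiveUsecases(c backendCapability, requested []string) []string {
	if len(requested) == 0 {
		return slices.Clone(c.DefaultUsecases)
	}
	var out []string
	for _, u := range requested {
		if slices.Contains(c.PossibleUsecases, u) {
			out = append(out, u)
		}
	}
	return out
}

func main() {
	llama := backendCapability{
		PossibleUsecases: []string{"chat", "completion", "embeddings"},
		DefaultUsecases:  []string{"chat"},
	}
	fmt.Println(effectiveUsecases(llama, nil))
	fmt.Println(effectiveUsecases(llama, []string{"embeddings", "tts"}))
}
```

Cloning the defaults keeps callers from mutating the shared capability table by accident.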
// BackendCapabilities maps each backend name (as used in model configs and gallery
// entries) to its verified capabilities. This is the single source of truth for
// what each backend supports.
//
// Backend names use hyphens (e.g., "llama-cpp") matching the gallery convention.
// Use NormalizeBackendName() for names with dots (e.g., "llama.cpp").
var BackendCapabilities = map[string]BackendCapability{
	// --- LLM / text generation backends ---
	"llama-cpp": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTokenizeString},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEdit, UsecaseEmbeddings, UsecaseTokenize, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true, // requires mmproj
		Description:      "llama.cpp GGUF models — LLM inference with optional vision via mmproj",
	},
	// privacy-filter is the standalone GGML engine (backend/cpp/privacy-filter,
	// wrapping privacy-filter.cpp) for the openai-privacy-filter PII/NER token
	// classifier — the dedicated TokenClassify path that replaces the
	// patched-llama.cpp route. Never auto-guessed; declared explicitly via
	// known_usecases: [token_classify].
	"privacy-filter": {
		GRPCMethods:      []GRPCMethod{MethodTokenClassify},
		PossibleUsecases: []string{UsecaseTokenClassify},
		DefaultUsecases:  []string{UsecaseTokenClassify},
		Description:      "privacy-filter.cpp — standalone GGML backend for openai-privacy-filter PII/NER token classification",
	},
	"vllm": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true,
		AcceptsVideos:    true,
		Description:      "vLLM engine — high-throughput LLM serving with optional multimodal",
	},
	"sglang": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodTokenizeString},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTokenize, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true,
		Description:      "SGLang — fast LLM inference with structured generation and optional vision",
	},
	"vllm-omni": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodGenerateImage, MethodGenerateVideo, MethodTTS},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseImage, UsecaseVideo, UsecaseTTS, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true,
		AcceptsVideos:    true,
		AcceptsAudios:    true,
		Description:      "vLLM omni-modal — supports text, image, video generation and TTS",
	},
	"transformers": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTTS, MethodSoundGeneration},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseTTS, UsecaseSoundGeneration},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "HuggingFace transformers — general-purpose Python inference",
	},
	"mlx": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "Apple MLX framework — optimized for Apple Silicon",
	},
	"mlx-distributed": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "MLX distributed inference across multiple Apple Silicon devices",
	},
	"mlx-vlm": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat, UsecaseVision},
		AcceptsImages:    true,
		AcceptsAudios:    true,
		Description:      "MLX vision-language models with multimodal input",
	},
	"mlx-audio": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodTTS},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTTS},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "MLX audio models — text generation and TTS",
	},

// --- Image/video generation backends ---
|
|
"diffusers": {
|
|
GRPCMethods: []GRPCMethod{MethodGenerateImage, MethodGenerateVideo},
|
|
PossibleUsecases: []string{UsecaseImage, UsecaseVideo},
|
|
DefaultUsecases: []string{UsecaseImage},
|
|
Description: "HuggingFace diffusers — Stable Diffusion, Flux, video generation",
|
|
},
	"stablediffusion": {
		GRPCMethods:      []GRPCMethod{MethodGenerateImage},
		PossibleUsecases: []string{UsecaseImage},
		DefaultUsecases:  []string{UsecaseImage},
		Description:      "Stable Diffusion native backend",
	},
	"stablediffusion-ggml": {
		GRPCMethods:      []GRPCMethod{MethodGenerateImage},
		PossibleUsecases: []string{UsecaseImage},
		DefaultUsecases:  []string{UsecaseImage},
		Description:      "Stable Diffusion via GGML quantized models",
	},

	// --- Speech-to-text backends ---
	"whisper": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription, MethodVAD},
		PossibleUsecases: []string{UsecaseTranscript, UsecaseVAD},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "OpenAI Whisper — speech recognition and voice activity detection",
	},
	"faster-whisper": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "CTranslate2-accelerated Whisper for faster transcription",
	},
	"whisperx": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "WhisperX — Whisper with word-level timestamps and speaker diarization",
	},
	"moonshine": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "Moonshine speech recognition",
	},
	"nemo": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "NVIDIA NeMo speech recognition",
	},
	"parakeet-cpp": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "NVIDIA NeMo Parakeet ASR (parakeet.cpp)",
	},
	"qwen-asr": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "Qwen automatic speech recognition",
	},
	"voxtral": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "Voxtral speech recognition",
	},
	"vibevoice": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription, MethodTTS},
		PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTranscript, UsecaseTTS},
		Description:      "VibeVoice — bidirectional speech (transcription and synthesis)",
	},
	"vibevoice-cpp": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream},
		PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTranscript, UsecaseTTS},
		Description:      "VibeVoice C++ — bidirectional speech, C++ backend with streaming TTS",
	},
	"sherpa-onnx": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream, MethodVAD},
		PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS, UsecaseVAD},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "Sherpa-ONNX — multi-model speech toolkit (ASR, TTS, VAD)",
	},

	// --- TTS backends ---
	"piper": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Piper — fast neural TTS optimized for Raspberry Pi",
	},
	"kokoro": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Kokoro TTS",
	},
	"coqui": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Coqui TTS — multi-speaker neural synthesis",
	},
	"kitten-tts": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Kitten TTS",
	},
	"outetts": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "OuteTTS",
	},
	"pocket-tts": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Pocket TTS — lightweight text-to-speech",
	},
	"qwen-tts": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Qwen TTS",
	},
	"qwen3-tts-cpp": {
		GRPCMethods:      []GRPCMethod{MethodTTS, MethodTTSStream},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Qwen3 TTS C++ - text-to-speech with streaming, named speakers, voice design and cloning (qwentts.cpp / GGML)",
	},
	"faster-qwen3-tts": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Faster Qwen3 TTS — accelerated Qwen TTS",
	},
	"fish-speech": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Fish Speech TTS",
	},
	"neutts": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "NeuTTS — neural text-to-speech",
	},
	"chatterbox": {
		GRPCMethods:      []GRPCMethod{MethodTTS},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "Chatterbox TTS",
	},
	"voxcpm": {
		GRPCMethods:      []GRPCMethod{MethodTTS, MethodTTSStream},
		PossibleUsecases: []string{UsecaseTTS},
		DefaultUsecases:  []string{UsecaseTTS},
		Description:      "VoxCPM TTS with streaming support",
	},

	// --- Sound generation backends ---
	"ace-step": {
		GRPCMethods:      []GRPCMethod{MethodTTS, MethodSoundGeneration},
		PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
		DefaultUsecases:  []string{UsecaseSoundGeneration},
		Description:      "ACE-Step — music and sound generation",
	},
	"acestep-cpp": {
		GRPCMethods:      []GRPCMethod{MethodSoundGeneration},
		PossibleUsecases: []string{UsecaseSoundGeneration},
		DefaultUsecases:  []string{UsecaseSoundGeneration},
		Description:      "ACE-Step C++ — native sound generation",
	},
	"transformers-musicgen": {
		GRPCMethods:      []GRPCMethod{MethodTTS, MethodSoundGeneration},
		PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
		DefaultUsecases:  []string{UsecaseSoundGeneration},
		Description:      "Meta MusicGen via transformers — music generation from text",
	},

	// --- Any-to-any audio backends ---
	"liquid-audio": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
		DefaultUsecases:  []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
		AcceptsAudios:    true,
		Description:      "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
	},

	// --- Audio transform backends ---
	"localvqe": {
		GRPCMethods:      []GRPCMethod{MethodAudioTransform},
		PossibleUsecases: []string{UsecaseAudioTransform},
		DefaultUsecases:  []string{UsecaseAudioTransform},
		Description:      "LocalVQE — joint AEC, noise suppression, and dereverberation for 16 kHz mono speech",
	},

	// --- Utility backends ---
	"rerankers": {
		GRPCMethods:      []GRPCMethod{MethodRerank},
		PossibleUsecases: []string{UsecaseRerank},
		DefaultUsecases:  []string{UsecaseRerank},
		Description:      "Cross-encoder reranking models",
	},
	"rfdetr": {
		GRPCMethods:      []GRPCMethod{MethodDetect},
		PossibleUsecases: []string{UsecaseDetection},
		DefaultUsecases:  []string{UsecaseDetection},
		Description:      "RF-DETR object detection",
	},
	"rfdetr-cpp": {
		GRPCMethods:      []GRPCMethod{MethodDetect},
		PossibleUsecases: []string{UsecaseDetection},
		DefaultUsecases:  []string{UsecaseDetection},
		Description:      "RF-DETR C++ object detection",
	},
	"depth-anything": {
		GRPCMethods:      []GRPCMethod{MethodDepth, MethodPredict, MethodGenerateImage},
		PossibleUsecases: []string{UsecaseDepth},
		DefaultUsecases:  []string{UsecaseDepth},
		AcceptsImages:    true,
		Description:      "Depth Anything 3 C++ — per-pixel metric depth, camera pose and 3D point cloud",
	},

	// --- Face and speaker recognition backends ---
	"insightface": {
		GRPCMethods:      []GRPCMethod{MethodEmbedding, MethodDetect, MethodFaceVerify, MethodFaceAnalyze},
		PossibleUsecases: []string{UsecaseEmbeddings, UsecaseDetection, UsecaseFaceRecognition},
		DefaultUsecases:  []string{UsecaseFaceRecognition},
		AcceptsImages:    true,
		Description:      "InsightFace — face detection, embedding, verification and attribute analysis",
	},
	"speaker-recognition": {
		GRPCMethods:      []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
		PossibleUsecases: []string{UsecaseSpeakerRecognition},
		DefaultUsecases:  []string{UsecaseSpeakerRecognition},
		Description:      "Speaker recognition — voice identity verification and analysis",
	},
	"voice-detect": {
		GRPCMethods:      []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
		PossibleUsecases: []string{UsecaseSpeakerRecognition},
		DefaultUsecases:  []string{UsecaseSpeakerRecognition},
		Description:      "voice-detect.cpp: C++/ggml speaker embedding, verification and voice analysis (age/gender/emotion)",
	},
	"face-detect": {
		GRPCMethods:      []GRPCMethod{MethodEmbedding, MethodDetect, MethodFaceVerify, MethodFaceAnalyze},
		PossibleUsecases: []string{UsecaseEmbeddings, UsecaseDetection, UsecaseFaceRecognition},
		DefaultUsecases:  []string{UsecaseFaceRecognition},
		AcceptsImages:    true,
		Description:      "face-detect.cpp: C++/ggml face detection, embedding, verification and attribute analysis",
	},
	"silero-vad": {
		GRPCMethods:      []GRPCMethod{MethodVAD},
		PossibleUsecases: []string{UsecaseVAD},
		DefaultUsecases:  []string{UsecaseVAD},
		Description:      "Silero VAD — voice activity detection",
	},
}

// NormalizeBackendName converts backend names to the canonical hyphenated form
// used in gallery entries (e.g., "llama.cpp" → "llama-cpp").
func NormalizeBackendName(backend string) string {
	return strings.ReplaceAll(backend, ".", "-")
}

// nonLlamaSamplerBackends lists backends whose native sampler defaults differ
// from llama.cpp's, so LocalAI must NOT inject llama.cpp's top_k=40 default for
// them (issue #6632). mlx_lm's intended default is top_k=0 (disabled) and mlx
// does not remap 0->40, so shipping 40 silently changes sampling for clients
// that omit top_k. Leaving TopK nil lets the wire value default to 0.
//
// This is intentionally a small allow-list of KNOWN non-llama backends: empty
// and unknown backends fall through to the llama.cpp default to preserve the
// GGUF auto-detect path's behavior.
var nonLlamaSamplerBackends = map[string]struct{}{
	"mlx":             {},
	"mlx-vlm":         {},
	"mlx-distributed": {},
}

// UsesLlamaSamplerDefaults reports whether a backend should receive llama.cpp's
// sampler defaults (e.g. top_k=40). Empty/unknown backends return true so the
// GGUF auto-detect path (which resolves to llama.cpp) keeps today's behavior;
// only the known non-llama backends in nonLlamaSamplerBackends return false.
func UsesLlamaSamplerDefaults(backend string) bool {
	if backend == "" {
		return true
	}
	_, isNonLlama := nonLlamaSamplerBackends[NormalizeBackendName(backend)]
	return !isNonLlama
}
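
// Illustrative sketch (not part of this file): how a caller might consume
// UsesLlamaSamplerDefaults when filling in a missing top_k. The helper
// defaultTopK is hypothetical and the lowercase functions are local copies,
// so the snippet runs on its own; the real call site may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeBackendName mirrors NormalizeBackendName: dots become hyphens.
func normalizeBackendName(backend string) string {
	return strings.ReplaceAll(backend, ".", "-")
}

// nonLlamaSamplerBackends is a local copy of the allow-list above.
var nonLlamaSamplerBackends = map[string]struct{}{
	"mlx":             {},
	"mlx-vlm":         {},
	"mlx-distributed": {},
}

// usesLlamaSamplerDefaults mirrors UsesLlamaSamplerDefaults.
func usesLlamaSamplerDefaults(backend string) bool {
	if backend == "" {
		return true
	}
	_, isNonLlama := nonLlamaSamplerBackends[normalizeBackendName(backend)]
	return !isNonLlama
}

// defaultTopK is a hypothetical caller. It returns a pointer to 40 when the
// llama.cpp default applies and nil otherwise, so the wire value stays 0.
func defaultTopK(backend string) *int {
	if !usesLlamaSamplerDefaults(backend) {
		return nil
	}
	k := 40
	return &k
}

func main() {
	for _, b := range []string{"", "llama.cpp", "mlx", "mlx-vlm", "some-unknown-backend"} {
		if k := defaultTopK(b); k != nil {
			fmt.Printf("%q: top_k=%d\n", b, *k)
		} else {
			fmt.Printf("%q: top_k left unset\n", b)
		}
	}
}
```

// Only the three allow-listed names come back unset; the empty string and any
// unrecognized name still receive 40, matching the fall-through described in
// the comments above.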

// GetBackendCapability returns the capability info for a backend, or nil if unknown.
// Handles backend name normalization.
func GetBackendCapability(backend string) *BackendCapability {
	if cap, ok := BackendCapabilities[NormalizeBackendName(backend)]; ok {
		return &cap
	}
	return nil
}

// PossibleUsecasesForBackend returns all usecases a backend can support.
// Returns nil if the backend is unknown.
func PossibleUsecasesForBackend(backend string) []string {
	if cap := GetBackendCapability(backend); cap != nil {
		return cap.PossibleUsecases
	}
	return nil
}

// DefaultUsecasesForBackendCap returns the conservative default usecases.
// Returns nil if the backend is unknown.
func DefaultUsecasesForBackendCap(backend string) []string {
	if cap := GetBackendCapability(backend); cap != nil {
		return cap.DefaultUsecases
	}
	return nil
}

// IsValidUsecaseForBackend checks whether a usecase is in a backend's possible set.
// Returns true for unknown backends (permissive fallback).
func IsValidUsecaseForBackend(backend, usecase string) bool {
	cap := GetBackendCapability(backend)
	if cap == nil {
		return true // unknown backend — don't restrict
	}
	return slices.Contains(cap.PossibleUsecases, usecase)
}

// AllBackendNames returns a sorted list of all known backend names.
func AllBackendNames() []string {
	names := make([]string, 0, len(BackendCapabilities))
	for name := range BackendCapabilities {
		names = append(names, name)
	}
	slices.Sort(names)
	return names
}