LocalAI/core/config/backend_capabilities.go
LocalAI [bot] de2ec2f136 feat(backends): add voice-detect + face-detect ggml backends (replace Python insightface/speaker-recognition) (#10441)
* feat(voice-detect): add Go purego backend for voice-detect.cpp

Add backend/go/voice-detect implementing the Backend gRPC voice subset
(VoiceEmbed/VoiceVerify/VoiceAnalyze) over libvoicedetect.so via purego,
mirroring the parakeet-cpp / omnivoice-cpp backends.

The flat voicedetect_capi C ABI is dlopen'd cgo-less; malloc'd string and
float-vector returns are owned by Go and released through the matching capi
free functions, with the per-ctx last error surfaced into Go errors. Calls are
serialized via base.SingleThread since the C context is not reentrant.

Proto field mapping:
- VoiceEmbed: VoiceEmbedRequest.audio (path) -> embed_path -> Embedding+Model.
- VoiceVerify: audio1/audio2 + threshold (<=0 falls back to the
  verify_threshold option, default 0.25) -> verify_paths -> verified/distance/
  threshold/confidence/model/processing_time_ms.
- VoiceAnalyze: audio (path) -> analyze_path_json; the JSON age/gender/emotion
  document maps to a single VoiceAnalysis segment (start/end 0; gender "label"
  -> dominant_gender with the remaining float scores as the gender map; emotion
  label/scores -> dominant_emotion/emotion).
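
The VoiceVerify threshold fallback above can be sketched as follows. This is
an illustration only, not the backend's code: effectiveThreshold and
defaultVerifyThreshold are hypothetical names, and the handling of a
non-positive option value is an assumption. Only the "<=0 falls back to
verify_threshold, default 0.25" rule comes from the mapping.

```go
package main

import "fmt"

// defaultVerifyThreshold mirrors the documented default of the
// verify_threshold option (0.25). The constant name is hypothetical.
const defaultVerifyThreshold float32 = 0.25

// effectiveThreshold returns the request threshold when it is positive,
// otherwise the configured verify_threshold option. Treating a non-positive
// option as "use the documented default" is an assumption of this sketch.
func effectiveThreshold(requested, option float32) float32 {
	if requested > 0 {
		return requested
	}
	if option > 0 {
		return option
	}
	return defaultVerifyThreshold
}

func main() {
	fmt.Println(effectiveThreshold(0.4, 0.3)) // request value is used
	fmt.Println(effectiveThreshold(0, 0.3))   // falls back to the option
	fmt.Println(effectiveThreshold(-1, 0))    // falls back to the default
}
```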

The Makefile pins voice-detect.cpp to 47546430, clones+builds libvoicedetect.so
with ggml static-linked (PIC, GGML_NATIVE off) so dlopen needs no external
libggml/libvoicedetect; ldd on the artifact shows only system libs. Ginkgo
tests cover option parsing and analyze-JSON mapping; embed/verify smoke specs
gate on VOICEDETECT_BACKEND_TEST_MODEL + VOICEDETECT_BACKEND_TEST_WAV.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(voice-detect): wire backend into index, gallery and build

Register the voice-detect.cpp speaker-recognition + voice-analysis
backend (added in Voice-INT-A) into LocalAI's distribution surfaces,
mirroring the ced backend (the closest mudler C++/ggml audio analogue):

- backend/index.yaml: add the &voicedetect meta-backend (capabilities
  platform map, no top-level uri) plus the full set of concrete per-arch
  image entries (cpu/cuda12/cuda13/metal/rocm/sycl/vulkan/l4t and the
  -development variants). Referential integrity audited - every alias
  target resolves.
- gallery/index.yaml: add 5 model entries on backend voice-detect -
  ECAPA-TDNN, WeSpeaker ResNet34, 3D-Speaker ERes2Net, CAM++ and the
  wav2vec2 age/gender/emotion analyze model. The engine architecture is
  read from GGUF metadata (voicedetect.arch) at load. GGUF artifacts are
  not yet published: each files: entry points at the intended
  mudler/voice-detect-gguf location with a TODO to fill sha256 after
  upload (no fabricated hashes).
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring ced.
- .github/workflows/bump_deps.yaml: track mudler/voice-detect.cpp via
  VOICEDETECT_VERSION (pin 47546430, short SHA 4754643).
- core/config/backend_capabilities.go: register voice-detect in the
  backend capability map (VoiceVerify/VoiceEmbed/VoiceAnalyze ->
  speaker_recognition), mirroring speaker-recognition.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): add purego Go backend for face-detect.cpp

Add the LocalAI Go backend that dlopens libfacedetect.so (the flat
facedetect_capi_* C-ABI) via purego, mirroring the sibling voice-detect
backend. Implements the Face subset of the Backend gRPC service:

- Embeddings(PredictOptions): Images[0] base64 -> temp file -> embed_path
  -> L2-normalized ArcFace embedding.
- Detect(DetectOptions): src -> detect_path_json -> Detection boxes
  (class_name "face", [x1,y1,x2,y2] -> x/y/w/h).
- FaceVerify(FaceVerifyRequest): two images + threshold + anti_spoof ->
  verify_paths; best-effort img areas via detect.
- FaceAnalyze(FaceAnalyzeRequest): img -> analyze_path_json -> per-face
  age + gender ("M"/"F" normalized to "Man"/"Woman").
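
Two of the conversions listed above, the corner-to-box mapping used by Detect
and the gender label normalization used by FaceAnalyze, can be sketched as
follows. The type and function names are hypothetical, and passing unknown
gender labels through unchanged is an assumption; the real backend works on
the gRPC Detection and FaceAnalysis messages.

```go
package main

import "fmt"

// box is a hypothetical x/y/width/height rectangle standing in for the
// fields of a Detection message.
type box struct {
	X, Y, W, H float32
}

// cornersToBox converts [x1,y1,x2,y2] corner coordinates into x/y/w/h form.
func cornersToBox(x1, y1, x2, y2 float32) box {
	return box{X: x1, Y: y1, W: x2 - x1, H: y2 - y1}
}

// normalizeGender maps the engine labels "M"/"F" to "Man"/"Woman".
// Any other label is returned unchanged (an assumption of this sketch).
func normalizeGender(label string) string {
	switch label {
	case "M":
		return "Man"
	case "F":
		return "Woman"
	default:
		return label
	}
}

func main() {
	fmt.Printf("%+v\n", cornersToBox(10, 20, 110, 70))
	fmt.Println(normalizeGender("M"), normalizeGender("F"))
}
```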

The Makefile pins face-detect.cpp to 636a1963 and builds the shared lib
with ggml + vendored libjpeg-turbo static (PIC), so the .so is
ldd-clean (no libggml) and exports only facedetect_capi_* (no jpeg_
symbols). Gated Ginkgo e2e mirrors voice-detect.

Note for the gallery-wiring task: backend registration (index.yaml,
gallery, core/config/backend_capabilities.go) is intentionally not
touched here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(voice-detect): replace em dashes in net-new descriptions

Project style forbids em/en dashes. Replace the three U+2014 chars
introduced by the voice-detect gallery/index wiring with `-`/`:`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(face-detect): wire backend into index, gallery and build

Register the face-detect.cpp face detection / embedding / verification /
analysis backend (added in Face-INT-A) into LocalAI's distribution
surfaces, mirroring the voice-detect wiring (the closest mudler C++/ggml
recognition analogue):

- backend/index.yaml: add the &facedetect meta-backend (capabilities
  platform map, no top-level uri to avoid the meta-backend gotcha) plus
  the full set of concrete per-arch image entries (cpu/cuda12/cuda13/
  metal/rocm/sycl-f16/sycl-f32/vulkan/l4t and the -development variants),
  22 entries. Referential integrity audited: every alias target resolves.
- gallery/index.yaml: add 4 model entries on backend face-detect -
  face-detect-buffalo-l/m/s (insightface SCRFD + ArcFace/MBF, NON-COMMERCIAL)
  and face-detect-yunet-sface (OpenCV-Zoo YuNet + SFace, APACHE-2.0, the
  commercial-friendly alternative). The detector/embedder architecture is
  read from GGUF metadata (facedetect.arch) at load; only the real
  verify_threshold option is set (0.35 buffalo, 0.363 sface). GGUF
  artifacts are not yet published: each files: entry points at the
  intended mudler/face-detect-gguf location with a TODO to fill sha256
  after upload (no fabricated hashes).
- core/config/backend_capabilities.go: register face-detect in the
  backend capability map (Embedding/Detect/FaceVerify/FaceAnalyze ->
  face_recognition), mirroring insightface.
- .github/backend-matrix.yml: add the linux build matrix block + the
  darwin metal entry mirroring voice-detect.
- .github/workflows/bump_deps.yaml: track mudler/face-detect.cpp via
  FACEDETECT_VERSION (pin 636a1963).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(recon): voice-detect metal build branch + face-detect gallery usecases

Add the missing metal BUILD_TYPE branch to the voice-detect Makefile
forwarding -DVOICEDETECT_GGML_METAL=ON, mirroring face-detect, so the
darwin metal CI artifact is built with the Metal backend instead of
CPU-only.

Expand the 4 face-detect gallery models' known_usecases to
[face_recognition, detection, embeddings] to match the backend
capabilities map and the mirrored insightface-buffalo entries, so
auto-selection for /v1/detect and /embeddings works.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(recon): document voice-detect and face-detect ggml backends

Document the new standalone C++/ggml biometric backends as the
recommended/default option for face and voice recognition, keeping the
existing Python insightface / speaker-recognition backends framed as the
legacy path.

- features/face-recognition.md: add a face-detect (ggml) backend section
  with the gallery entries (buffalo-l/m/s non-commercial, yunet-sface
  Apache-2.0), licensing, and verify/detect/analyze quickstart.
- features/voice-recognition.md: add a voice-detect (ggml) backend
  section with the gallery entries (ecapa-tdnn, wespeaker-resnet34,
  eres2net, campplus speaker recognizers; emotion-wav2vec2 non-commercial
  analyze head) and quickstart.
- reference/compatibility-table.md: add face-detect.cpp and
  voice-detect.cpp rows to the Vision, Detection & Recognition table.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(gallery): publish recon backend GGUF uris + sha256

Fill in the published HuggingFace GGUF uris and verified sha256 for the
9 recon gallery entries (voice-detect-* and face-detect-*), and remove
the TODO publish markers. Correct the eres2net, campplus, and
emotion-wav2vec2 uris to the actual published filenames.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): re-embed buffalo anti-spoof + add audeering age/gender voice model

Update the 3 buffalo face-detect GGUF sha256 (anti-spoof ensemble now
embedded and re-uploaded under the same filenames/uris) and note the
FaceVerify anti_spoof request flag in each description. Add a new
voice-detect-age-gender-wav2vec2 gallery entry mirroring the emotion
model.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): add face-detect-buffalo-sc and antelopev2 packs

Add gallery entries for two newly-published insightface face packs on
the face-detect backend: buffalo_sc (smallest pack, SCRFD-500M + small
ArcFace) and antelopev2 (higher-accuracy, SCRFD-10G + ArcFace glint360k
R100, 512-d). Both are non-commercial research-only.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): honor LocalAI per-model threads in voice/face-detect backends

LocalAI spawns one backend process per model and serves requests
concurrently, so the engines' own min(hardware_concurrency, 8) default
can oversubscribe cores. Forward the per-model Threads value from the
gRPC LoadModel options into the engine via VOICEDETECT_THREADS /
FACEDETECT_THREADS (read at backend construction) before the capi load.
A non-positive Threads is treated as unset, leaving the engine default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to CPU-optimized engine commits

voice-detect.cpp -> 0d9c1b3 (radix-2 FFT FBank, threads, flash attn + cached
pos-conv); face-detect.cpp -> 523aee1 (thread-gated direct conv, threads).
Brings the CPU optimizations into the LocalAI backend builds. GGUF format and
parity unchanged, so the published HF GGUFs remain valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-2 CPU-optimized engines

voice-detect.cpp -> fe7e6a3 (ERes2Net 1x1->mul_mat, CAM++ layout+context,
wav2vec2 conv-LN, ECAPA capture-drop, AVX512 dispatch opt-in); face-detect.cpp
-> 9c8adb7 (AVX2 Winograd F(2x2,3x3) for SCRFD/ArcFace 3x3 convs, ArcFace
BN-fold). Parity unchanged (cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-3 Winograd engines

voice-detect.cpp -> 45122ec (Winograd F(2x2,3x3) for WeSpeaker/ERes2Net 3x3
convs, -22%/-20% @8t); face-detect.cpp -> cd5c962 (Winograd F(4x4,3x3) for
SCRFD large maps, -22% @1t on top of F(2x2), more load-stable). Parity held
(cosine=1.0); GGUF format unchanged, HF GGUFs valid.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump backend pins to round-4 Winograd engines (CPU opt complete)

voice-detect.cpp -> d2839ca (CAM++ FCM 2D convs through Winograd, -15.5%/-10.3%);
face-detect.cpp -> c1db23d (AVX2-vectorized Winograd tile transforms, SCRFD
detect -14%/-9.6%). Final CPU optimization round; the conv-kernel lever class is
now exhausted (parity held, cosine=1.0; GGUF format unchanged, HF GGUFs valid).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump face-detect pin to deep-kernel engine (7ae5c4d)

face-detect.cpp -> 7ae5c4d: register-blocked winograd-domain GEMM microkernel
(2.8x isolated GFLOP/s), AVX-512 zmm evolution behind runtime CPUID dispatch
(ship-safe, AVX2 fallback bit-identical), bias/relu fused into the winograd
output transform, and SFace Conv+BN fold + bias/PReLU fusion. SCRFD detect
~1.4x faster end-to-end vs the round-4 baseline; parity bit-exact; portable
single binary (function-multiversioned, no global -mavx512f).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ECAPA operand-order win (e9c56ae)

voice-detect.cpp -> e9c56ae: weight-as-src0 mul_mat order in ECAPA's F32
conv1d_same (routes through tinyBLAS sgemm); ECAPA embed 1.67x @1t / ~1.3x @8t,
parity cosine=1.0. Isolated to encoder.cpp (ECAPA-only); ERes2Net/CAM++/WeSpeaker
do not call conv1d_same so are provably unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to FMA-throughput engines (voice f7b9f89, face 2d2d5f0)

face -> 2d2d5f0: route ArcFace 3x3 body convs through the AVX-512 winograd
microkernel (kWinoMinSize 80->14); ArcFace 1.62x @1t, SCRFD detect to 0.966 of
MLAS @1t, no regression. voice -> f7b9f89: runtime-CPUID-dispatched AVX-512
winograd-GEMM microkernel (ship-safe, AVX2 fallback bit-identical); WeSpeaker
1.90x @1t. Parity cosine=1.0 throughout; portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to MLAS-class direct-conv engines (voice 7ecfd07, face be22d67)

Hand-tuned nChw16c AVX-512 register-tiled direct-conv microkernel (~263 GFLOP/s,
within 6-7% of MLAS per-op efficiency), runtime-CPUID-dispatched + AVX2 fallback,
fused bias/relu. voice 7ecfd07: default 3x3-s1 kernel for WeSpeaker (+37%/+32%)
+ ERes2Net, CAM++ pinned to Winograd. face be22d67: shape-gated to the ArcFace
recognizer body (+25-27% @8t); SCRFD detector stays on Winograd (no regression).
Parity cosine=1.0 / detect <=1px on AVX-512 + AVX2 paths. Portable single binaries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice pin to Phase-A blocked backbone (f4e7eef)

WeSpeaker ResNet34 runs as one nChw16c blocked island (2 reorders/forward vs
~60) on AVX-512, default; per-conv directconv fallback on AVX2. +2.9% @1t /
+17-19% @8t vs per-conv directconv, parity cosine=1.0. The conv microkernel is
already FMA-bound near peak (~0.86-0.98x MLAS-implied); residual to MLAS is
sub-peak edge + non-conv tail, documented in docs/cpu-optimization.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to breadth blocked-backbone (voice 7f66871, face d80092b)

voice 7f66871: AVX2-vectorized (ymm) blocked island - AVX2-only hosts now run
the blocked backbone for WeSpeaker (2.3x over per-conv-AVX2, cosine=1.0);
ERes2Net stays per-conv (blocked regresses, opt-in only); CAM++ Winograd-pinned.
face d80092b: ArcFace recognizer blocked island, AVX-512 default (-13% @8t, ~0.90x
MLAS, the closest conv result), auto per-conv on AVX2; SCRFD untouched on Winograd
(0 island invocations during detect). Parity cosine=1.0 / detect <=1px throughout.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to small-spatial + stem conv kernels (voice 99b1804, face 47fdab6)

Measured-gap-driven conv kernels: small-spatial (fill the register tile when
output width <= tile width) + small-IC stem + strided-1x1/downsample recovery.
ArcFace recognizer 0.57 -> 0.70x MLAS @1t (the closest conv model), WeSpeaker
0.65 -> 0.79x @1t. Parity cosine=1.0 / detect <=1px. The OC-block-sharing lever
was a measured dead-end (deep stride-1 is L3-weight-bandwidth bound, not
read-port bound) and was NOT shipped. Kernel ceiling reached; further gap needs
an algorithm-class change (cache-blocked weight-stationary GEMM, or q8 weights).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to GPU persistent-graph + multi-model-safe cache (voice 45d2e6b, face 0a4799a)

GPU wins (CUDA/ggml backend, no CPU-path change): persistent per-shape graph+context
cache in Backend::compute() eliminates the per-call cudaGraph re-instantiation churn
-> wav2vec2 emotion+age-gender now AT GPU parity with torch-cuDNN on GB10 (0.97-0.98x),
CAM++ -5.7ms; bit-identical parity. Cache hardened multi-model-safe (invalidate-on-free
keyed by the ModelLoader weights buffer) so LocalAI multi-model hosting cannot stale-hit.
Conv models still trail cuDNN (im2col-materialization-bound) - cuDNN implicit-GEMM lever next.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump pins to cuDNN-conv-capable engines (voice b6e4356, face 6107a24)

Adds the opt-in cuDNN implicit-GEMM conv path (VOICEDETECT_GGML_CUDNN /
FACEDETECT_GGML_CUDNN, DEFAULT OFF -> zero build/runtime dep until enabled).
On GPU it kills the im2col-materialization bottleneck and reaches torch-cuDNN
parity on the spill-bound convs: SCRFD detect 14.8->6.4ms (2.3x, ~parity),
WeSpeaker ~parity, ERes2Net beats torch (1.10x); ArcFace/CAM++ neutral (no
spill). Parity exact (SCRFD <=1px, cosine=1.0). To USE it in LocalAI, the CUDA
backend build must enable the flag AND bundle libcudnn - deferred until a
cuDNN-bundled GPU image; flag stays OFF here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(recon): enable cuDNN conv path on arm64+CUDA13 recon backends

The voice-detect.cpp / face-detect.cpp engines have an opt-in cuDNN
implicit-GEMM conv path behind VOICEDETECT_GGML_CUDNN / FACEDETECT_GGML_CUDNN
(default OFF) that kills im2col on the GPU and reaches torch-cuDNN parity
(SCRFD 2.3x, WeSpeaker/ERes2Net parity), measured on the GB10
(arm64, CUDA 13, sm_121a).

Enable it for the CUDA build, but only where cuDNN actually ships: the
arm64 + CUDA 13 image (GB10/Jetson/L4T). x86 CUDA images carry no cuDNN,
so flipping it on globally for BUILD_TYPE=cublas would be a link failure.
The Makefiles gate on CUDA_MAJOR_VERSION=13 + arch (TARGETARCH from the
matrix/Docker build, uname -m fallback for local builds).
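
The gate itself lives in the Makefiles; the predicate it expresses can be
sketched in Go as below. cudnnEnabled is a hypothetical function, and
accepting both "arm64" (TARGETARCH) and "aarch64" (the usual uname -m output
on Linux arm64) is an assumption about how the two arch sources are matched.

```go
package main

import "fmt"

// cudnnEnabled reports whether the cuDNN conv path would be switched on:
// only for CUDA builds (BUILD_TYPE=cublas) with CUDA major version 13 on an
// arm64 host. arch may come from TARGETARCH ("arm64") or uname -m
// ("aarch64"); accepting both spellings is an assumption of this sketch.
func cudnnEnabled(buildType, cudaMajor, arch string) bool {
	if buildType != "cublas" || cudaMajor != "13" {
		return false
	}
	return arch == "arm64" || arch == "aarch64"
}

func main() {
	fmt.Println(cudnnEnabled("cublas", "13", "arm64")) // arm64 + CUDA 13
	fmt.Println(cudnnEnabled("cublas", "13", "amd64")) // x86: no cuDNN shipped
	fmt.Println(cudnnEnabled("cublas", "12", "arm64")) // wrong CUDA major
}
```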

backend/Dockerfile.golang already installs the runtime libcudnn9-cuda-13
in the arm64+CUDA13 apt block; add the matching libcudnn9-dev-cuda-13 so
the build-time link resolves.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): bump voice-detect pin to ERes2Net blocked-default (30beecd)

Defaults VD_ERES2NET_BLOCKED ON: routes the ERes2Net Res2Net body through the
blocked nChw16c AVX-512 directconv island instead of the 1x1 mul_mat fast path
(CONT-transpose + skinny low-K GEMM). On the shipped GGML_NATIVE=OFF build (ggml
mul_mat is AVX2-only) this wins ~2x at every thread count (2.07x@1t, 2.2x@4t,
2.05x@8t); pure-AVX2 fallback still 1.3-1.62x. Parity exact (cosine=1.000000 vs
golden), so registered voices + verify/identify thresholds are unaffected. The
prior default-OFF rested on a stale comment whose 23pct regression only held on
the non-shipping GGML_NATIVE=ON build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(readme): announce native voice-detect + face-detect backends in Latest News

Add a Latest News entry for the new from-scratch C++/ggml biometric backends
(voice-detect.cpp + face-detect.cpp) that replace the Python insightface and
speaker-recognition backends: no Python/onnxruntime at inference, self-contained
GGUF, bit-exact parity, GPU cuDNN parity. Mirrors the parakeet.cpp /
locate-anything.cpp native-backend news entries. Refs PR #10441.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(recon): re-pin to the squashed engine release commits

The voice-detect.cpp and face-detect.cpp histories were squashed to a single
release commit, which orphaned the previous pins (voice 30beecd, face 6107a24).
Re-pin to the new single-commit SHAs (voice 3d51077, face 06914b0); the tree is
identical, so the backend build is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 09:29:08 +02:00

645 lines
26 KiB
Go

package config

import (
	"slices"
	"strings"
)

// Usecase name constants — the canonical string values used in gallery entries,
// model configs (known_usecases), and UsecaseInfoMap keys.
const (
	UsecaseChat                = "chat"
	UsecaseCompletion          = "completion"
	UsecaseEdit                = "edit"
	UsecaseVision              = "vision"
	UsecaseEmbeddings          = "embeddings"
	UsecaseTokenize            = "tokenize"
	UsecaseImage               = "image"
	UsecaseVideo               = "video"
	UsecaseTranscript          = "transcript"
	UsecaseTTS                 = "tts"
	UsecaseSoundGeneration     = "sound_generation"
	UsecaseRerank              = "rerank"
	UsecaseDetection           = "detection"
	UsecaseDepth               = "depth"
	UsecaseVAD                 = "vad"
	UsecaseAudioTransform      = "audio_transform"
	UsecaseDiarization         = "diarization"
	UsecaseSoundClassification = "sound_classification"
	UsecaseRealtimeAudio       = "realtime_audio"
	UsecaseFaceRecognition     = "face_recognition"
	UsecaseSpeakerRecognition  = "speaker_recognition"
	UsecaseTokenClassify       = "token_classify"
)

// GRPCMethod identifies a Backend service RPC from backend.proto.
type GRPCMethod string

const (
	MethodPredict            GRPCMethod = "Predict"
	MethodPredictStream      GRPCMethod = "PredictStream"
	MethodEmbedding          GRPCMethod = "Embedding"
	MethodGenerateImage      GRPCMethod = "GenerateImage"
	MethodGenerateVideo      GRPCMethod = "GenerateVideo"
	MethodAudioTranscription GRPCMethod = "AudioTranscription"
	MethodTTS                GRPCMethod = "TTS"
	MethodTTSStream          GRPCMethod = "TTSStream"
	MethodSoundGeneration    GRPCMethod = "SoundGeneration"
	MethodTokenizeString     GRPCMethod = "TokenizeString"
	MethodDetect             GRPCMethod = "Detect"
	MethodDepth              GRPCMethod = "Depth"
	MethodRerank             GRPCMethod = "Rerank"
	MethodVAD                GRPCMethod = "VAD"
	MethodAudioTransform     GRPCMethod = "AudioTransform"
	MethodDiarize            GRPCMethod = "Diarize"
	MethodSoundDetection     GRPCMethod = "SoundDetection"
	MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
	MethodFaceVerify         GRPCMethod = "FaceVerify"
	MethodFaceAnalyze        GRPCMethod = "FaceAnalyze"
	MethodVoiceVerify        GRPCMethod = "VoiceVerify"
	MethodVoiceEmbed         GRPCMethod = "VoiceEmbed"
	MethodVoiceAnalyze       GRPCMethod = "VoiceAnalyze"
	MethodTokenClassify      GRPCMethod = "TokenClassify"
)

// UsecaseInfo describes a single known_usecase value and how it maps
// to the gRPC backend API.
type UsecaseInfo struct {
	// Flag is the ModelConfigUsecase bitmask value.
	Flag ModelConfigUsecase
	// GRPCMethod is the primary Backend service RPC this usecase maps to.
	GRPCMethod GRPCMethod
	// IsModifier is true when this usecase doesn't map to its own gRPC RPC
	// but modifies how another RPC behaves (e.g., vision uses Predict with images).
	IsModifier bool
	// DependsOn names the usecase(s) this modifier requires (e.g., "chat").
	DependsOn string
	// Description is a human/LLM-readable explanation of what this usecase means.
	Description string
}

// UsecaseInfoMap maps each known_usecase string to its gRPC and semantic info.
var UsecaseInfoMap = map[string]UsecaseInfo{
	UsecaseChat: {
		Flag:        FLAG_CHAT,
		GRPCMethod:  MethodPredict,
		Description: "Conversational/instruction-following via the Predict RPC with chat templates.",
	},
	UsecaseCompletion: {
		Flag:        FLAG_COMPLETION,
		GRPCMethod:  MethodPredict,
		Description: "Text completion via the Predict RPC with a completion template.",
	},
	UsecaseEdit: {
		Flag:        FLAG_EDIT,
		GRPCMethod:  MethodPredict,
		Description: "Text editing via the Predict RPC with an edit template.",
	},
	UsecaseVision: {
		Flag:        FLAG_VISION,
		GRPCMethod:  MethodPredict,
		IsModifier:  true,
		DependsOn:   UsecaseChat,
		Description: "The model accepts images alongside text in the Predict RPC. For llama-cpp this requires an mmproj file.",
	},
	UsecaseEmbeddings: {
		Flag:        FLAG_EMBEDDINGS,
		GRPCMethod:  MethodEmbedding,
		Description: "Vector embedding generation via the Embedding RPC.",
	},
	UsecaseTokenize: {
		Flag:        FLAG_TOKENIZE,
		GRPCMethod:  MethodTokenizeString,
		Description: "Tokenization via the TokenizeString RPC without running inference.",
	},
	UsecaseImage: {
		Flag:        FLAG_IMAGE,
		GRPCMethod:  MethodGenerateImage,
		Description: "Image generation via the GenerateImage RPC (Stable Diffusion, Flux, etc.).",
	},
	UsecaseVideo: {
		Flag:        FLAG_VIDEO,
		GRPCMethod:  MethodGenerateVideo,
		Description: "Video generation via the GenerateVideo RPC.",
	},
	UsecaseTranscript: {
		Flag:        FLAG_TRANSCRIPT,
		GRPCMethod:  MethodAudioTranscription,
		Description: "Speech-to-text via the AudioTranscription RPC.",
	},
	UsecaseTTS: {
		Flag:        FLAG_TTS,
		GRPCMethod:  MethodTTS,
		Description: "Text-to-speech via the TTS RPC.",
	},
	UsecaseSoundGeneration: {
		Flag:        FLAG_SOUND_GENERATION,
		GRPCMethod:  MethodSoundGeneration,
		Description: "Music/sound generation via the SoundGeneration RPC (not speech).",
	},
	UsecaseRerank: {
		Flag:        FLAG_RERANK,
		GRPCMethod:  MethodRerank,
		Description: "Document reranking via the Rerank RPC.",
	},
	UsecaseDetection: {
		Flag:        FLAG_DETECTION,
		GRPCMethod:  MethodDetect,
		Description: "Object detection via the Detect RPC with bounding boxes.",
	},
	UsecaseDepth: {
		Flag:        FLAG_DEPTH,
		GRPCMethod:  MethodDepth,
		Description: "Per-pixel metric depth, camera pose and 3D point cloud via the Depth RPC (Depth Anything 3).",
	},
	UsecaseVAD: {
		Flag:        FLAG_VAD,
		GRPCMethod:  MethodVAD,
		Description: "Voice activity detection via the VAD RPC.",
	},
	UsecaseAudioTransform: {
		Flag:        FLAG_AUDIO_TRANSFORM,
		GRPCMethod:  MethodAudioTransform,
		Description: "Audio-in / audio-out transformations (echo cancellation, noise suppression, dereverberation, voice conversion) via the AudioTransform RPC.",
	},
	UsecaseDiarization: {
		Flag:        FLAG_DIARIZATION,
		GRPCMethod:  MethodDiarize,
		Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
	},
	UsecaseSoundClassification: {
		Flag:        FLAG_SOUND_CLASSIFICATION,
		GRPCMethod:  MethodSoundDetection,
		Description: "Sound-event classification / audio tagging (scored AudioSet labels like baby cry, glass breaking, alarms) via the SoundDetection RPC.",
	},
	UsecaseRealtimeAudio: {
		Flag:        FLAG_REALTIME_AUDIO,
		GRPCMethod:  MethodAudioToAudioStream,
		Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
	},
	UsecaseFaceRecognition: {
		Flag:        FLAG_FACE_RECOGNITION,
		GRPCMethod:  MethodFaceVerify,
		Description: "Face recognition — verify identity, analyze attributes (age/gender/emotion) via FaceVerify and FaceAnalyze RPCs.",
	},
	UsecaseSpeakerRecognition: {
		Flag:        FLAG_SPEAKER_RECOGNITION,
		GRPCMethod:  MethodVoiceVerify,
		Description: "Speaker recognition — verify identity, embed and analyze voice via VoiceVerify, VoiceEmbed and VoiceAnalyze RPCs.",
	},
	UsecaseTokenClassify: {
		Flag:        FLAG_TOKEN_CLASSIFY,
		GRPCMethod:  MethodTokenClassify,
		Description: "Per-token classification (NER) via the TokenClassify RPC — the PII detector tier. Declared explicitly via known_usecases; never auto-guessed, since the token-classification head is not useful as general generation or embeddings.",
	},
}

// BackendCapability describes which gRPC methods and usecases a backend supports.
// Derived from reviewing actual implementations in backend/go/ and backend/python/.
type BackendCapability struct {
	// GRPCMethods lists the Backend service RPCs this backend implements.
	GRPCMethods []GRPCMethod
	// PossibleUsecases lists all usecase strings this backend can support.
	PossibleUsecases []string
	// DefaultUsecases lists the conservative safe defaults.
	DefaultUsecases []string
	// AcceptsImages indicates multimodal image input in Predict.
	AcceptsImages bool
	// AcceptsVideos indicates multimodal video input in Predict.
	AcceptsVideos bool
	// AcceptsAudios indicates multimodal audio input in Predict.
	AcceptsAudios bool
	// Description is a human-readable summary of the backend.
	Description string
}

// BackendCapabilities maps each backend name (as used in model configs and gallery
// entries) to its verified capabilities. This is the single source of truth for
// what each backend supports.
//
// Backend names use hyphens (e.g., "llama-cpp") matching the gallery convention.
// Use NormalizeBackendName() for names with dots (e.g., "llama.cpp").
var BackendCapabilities = map[string]BackendCapability{
	// --- LLM / text generation backends ---
	"llama-cpp": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTokenizeString},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEdit, UsecaseEmbeddings, UsecaseTokenize, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true, // requires mmproj
		Description:      "llama.cpp GGUF models — LLM inference with optional vision via mmproj",
	},
	// privacy-filter is the standalone GGML engine (backend/cpp/privacy-filter,
	// wrapping privacy-filter.cpp) for the openai-privacy-filter PII/NER token
	// classifier — the dedicated TokenClassify path that replaces the
	// patched-llama.cpp route. Never auto-guessed; declared explicitly via
	// known_usecases: [token_classify].
	"privacy-filter": {
		GRPCMethods:      []GRPCMethod{MethodTokenClassify},
		PossibleUsecases: []string{UsecaseTokenClassify},
		DefaultUsecases:  []string{UsecaseTokenClassify},
		Description:      "privacy-filter.cpp — standalone GGML backend for openai-privacy-filter PII/NER token classification",
	},
	"vllm": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true,
		AcceptsVideos:    true,
		Description:      "vLLM engine — high-throughput LLM serving with optional multimodal",
	},
	"sglang": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodTokenizeString},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTokenize, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true,
		Description:      "SGLang — fast LLM inference with structured generation and optional vision",
	},
	"vllm-omni": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodGenerateImage, MethodGenerateVideo, MethodTTS},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseImage, UsecaseVideo, UsecaseTTS, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat},
		AcceptsImages:    true,
		AcceptsVideos:    true,
		AcceptsAudios:    true,
		Description:      "vLLM omni-modal — supports text, image, video generation and TTS",
	},
	"transformers": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTTS, MethodSoundGeneration},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseTTS, UsecaseSoundGeneration},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "HuggingFace transformers — general-purpose Python inference",
	},
	"mlx": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "Apple MLX framework — optimized for Apple Silicon",
	},
	"mlx-distributed": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "MLX distributed inference across multiple Apple Silicon devices",
	},
	"mlx-vlm": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
		DefaultUsecases:  []string{UsecaseChat, UsecaseVision},
		AcceptsImages:    true,
		AcceptsAudios:    true,
		Description:      "MLX vision-language models with multimodal input",
	},
	"mlx-audio": {
		GRPCMethods:      []GRPCMethod{MethodPredict, MethodTTS},
		PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTTS},
		DefaultUsecases:  []string{UsecaseChat},
		Description:      "MLX audio models — text generation and TTS",
	},
	// --- Image/video generation backends ---
	"diffusers": {
		GRPCMethods:      []GRPCMethod{MethodGenerateImage, MethodGenerateVideo},
		PossibleUsecases: []string{UsecaseImage, UsecaseVideo},
		DefaultUsecases:  []string{UsecaseImage},
		Description:      "HuggingFace diffusers — Stable Diffusion, Flux, video generation",
	},
	"stablediffusion": {
		GRPCMethods:      []GRPCMethod{MethodGenerateImage},
		PossibleUsecases: []string{UsecaseImage},
		DefaultUsecases:  []string{UsecaseImage},
		Description:      "Stable Diffusion native backend",
	},
	"stablediffusion-ggml": {
		GRPCMethods:      []GRPCMethod{MethodGenerateImage},
		PossibleUsecases: []string{UsecaseImage},
		DefaultUsecases:  []string{UsecaseImage},
		Description:      "Stable Diffusion via GGML quantized models",
	},
	// --- Speech-to-text backends ---
	"whisper": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription, MethodVAD},
		PossibleUsecases: []string{UsecaseTranscript, UsecaseVAD},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "OpenAI Whisper — speech recognition and voice activity detection",
	},
	"faster-whisper": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
		PossibleUsecases: []string{UsecaseTranscript},
		DefaultUsecases:  []string{UsecaseTranscript},
		Description:      "CTranslate2-accelerated Whisper for faster transcription",
	},
	"whisperx": {
		GRPCMethods:      []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "WhisperX — Whisper with word-level timestamps and speaker diarization",
},
"moonshine": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Moonshine speech recognition",
},
"nemo": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "NVIDIA NeMo speech recognition",
},
"parakeet-cpp": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "NVIDIA NeMo Parakeet ASR (parakeet.cpp)",
},
"qwen-asr": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Qwen automatic speech recognition",
},
"voxtral": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
PossibleUsecases: []string{UsecaseTranscript},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Voxtral speech recognition",
},
"vibevoice": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
Description: "VibeVoice — bidirectional speech (transcription and synthesis)",
},
"vibevoice-cpp": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
Description: "VibeVoice C++ — bidirectional speech, C++ backend with streaming TTS",
},
"sherpa-onnx": {
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream, MethodVAD},
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS, UsecaseVAD},
DefaultUsecases: []string{UsecaseTranscript},
Description: "Sherpa-ONNX — multi-model speech toolkit (ASR, TTS, VAD)",
},
// --- TTS backends ---
"piper": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Piper — fast neural TTS optimized for Raspberry Pi",
},
"kokoro": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Kokoro TTS",
},
"coqui": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Coqui TTS — multi-speaker neural synthesis",
},
"kitten-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Kitten TTS",
},
"outetts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "OuteTTS",
},
"pocket-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Pocket TTS — lightweight text-to-speech",
},
"qwen-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Qwen TTS",
},
"qwen3-tts-cpp": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodTTSStream},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Qwen3 TTS C++ - text-to-speech with streaming, named speakers, voice design and cloning (qwentts.cpp / GGML)",
},
"faster-qwen3-tts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Faster Qwen3 TTS — accelerated Qwen TTS",
},
"fish-speech": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Fish Speech TTS",
},
"neutts": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "NeuTTS — neural text-to-speech",
},
"chatterbox": {
GRPCMethods: []GRPCMethod{MethodTTS},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "Chatterbox TTS",
},
"voxcpm": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodTTSStream},
PossibleUsecases: []string{UsecaseTTS},
DefaultUsecases: []string{UsecaseTTS},
Description: "VoxCPM TTS with streaming support",
},
// --- Sound generation backends ---
"ace-step": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodSoundGeneration},
PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
DefaultUsecases: []string{UsecaseSoundGeneration},
Description: "ACE-Step — music and sound generation",
},
"acestep-cpp": {
GRPCMethods: []GRPCMethod{MethodSoundGeneration},
PossibleUsecases: []string{UsecaseSoundGeneration},
DefaultUsecases: []string{UsecaseSoundGeneration},
Description: "ACE-Step C++ — native sound generation",
},
"transformers-musicgen": {
GRPCMethods: []GRPCMethod{MethodTTS, MethodSoundGeneration},
PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
DefaultUsecases: []string{UsecaseSoundGeneration},
Description: "Meta MusicGen via transformers — music generation from text",
},
// --- Any-to-any audio backends ---
"liquid-audio": {
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
DefaultUsecases: []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
AcceptsAudios: true,
Description: "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
},
// --- Audio transform backends ---
"localvqe": {
GRPCMethods: []GRPCMethod{MethodAudioTransform},
PossibleUsecases: []string{UsecaseAudioTransform},
DefaultUsecases: []string{UsecaseAudioTransform},
Description: "LocalVQE — joint AEC, noise suppression, and dereverberation for 16 kHz mono speech",
},
// --- Utility backends ---
"rerankers": {
GRPCMethods: []GRPCMethod{MethodRerank},
PossibleUsecases: []string{UsecaseRerank},
DefaultUsecases: []string{UsecaseRerank},
Description: "Cross-encoder reranking models",
},
"rfdetr": {
GRPCMethods: []GRPCMethod{MethodDetect},
PossibleUsecases: []string{UsecaseDetection},
DefaultUsecases: []string{UsecaseDetection},
Description: "RF-DETR object detection",
},
"rfdetr-cpp": {
GRPCMethods: []GRPCMethod{MethodDetect},
PossibleUsecases: []string{UsecaseDetection},
DefaultUsecases: []string{UsecaseDetection},
Description: "RF-DETR C++ object detection",
},
"depth-anything": {
GRPCMethods: []GRPCMethod{MethodDepth, MethodPredict, MethodGenerateImage},
PossibleUsecases: []string{UsecaseDepth},
DefaultUsecases: []string{UsecaseDepth},
AcceptsImages: true,
Description: "Depth Anything 3 C++ — per-pixel metric depth, camera pose and 3D point cloud",
},
// --- Face and speaker recognition backends ---
"insightface": {
GRPCMethods: []GRPCMethod{MethodEmbedding, MethodDetect, MethodFaceVerify, MethodFaceAnalyze},
PossibleUsecases: []string{UsecaseEmbeddings, UsecaseDetection, UsecaseFaceRecognition},
DefaultUsecases: []string{UsecaseFaceRecognition},
AcceptsImages: true,
Description: "InsightFace — face detection, embedding, verification and attribute analysis",
},
"speaker-recognition": {
GRPCMethods: []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
PossibleUsecases: []string{UsecaseSpeakerRecognition},
DefaultUsecases: []string{UsecaseSpeakerRecognition},
Description: "Speaker recognition — voice identity verification and analysis",
},
"voice-detect": {
GRPCMethods: []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
PossibleUsecases: []string{UsecaseSpeakerRecognition},
DefaultUsecases: []string{UsecaseSpeakerRecognition},
Description: "voice-detect.cpp: C++/ggml speaker embedding, verification and voice analysis (age/gender/emotion)",
},
"face-detect": {
GRPCMethods: []GRPCMethod{MethodEmbedding, MethodDetect, MethodFaceVerify, MethodFaceAnalyze},
PossibleUsecases: []string{UsecaseEmbeddings, UsecaseDetection, UsecaseFaceRecognition},
DefaultUsecases: []string{UsecaseFaceRecognition},
AcceptsImages: true,
Description: "face-detect.cpp: C++/ggml face detection, embedding, verification and attribute analysis",
},
"silero-vad": {
GRPCMethods: []GRPCMethod{MethodVAD},
PossibleUsecases: []string{UsecaseVAD},
DefaultUsecases: []string{UsecaseVAD},
Description: "Silero VAD — voice activity detection",
},
}
// NormalizeBackendName converts a backend name to the canonical hyphenated form
// used in gallery entries by replacing every "." with "-" (e.g., "llama.cpp" →
// "llama-cpp"). Case and all other characters are left unchanged.
func NormalizeBackendName(backend string) string {
return strings.ReplaceAll(backend, ".", "-")
}
// nonLlamaSamplerBackends lists backends whose native sampler defaults differ
// from llama.cpp's, so LocalAI must NOT inject llama.cpp's top_k=40 default for
// them (issue #6632). mlx_lm's intended default is top_k=0 (disabled) and mlx
// does not remap 0->40, so shipping 40 silently changes sampling for clients
// that omit top_k. Leaving TopK nil lets the wire value default to 0.
//
// This is intentionally a small allow-list of KNOWN non-llama backends: empty
// and unknown backends fall through to the llama.cpp default to preserve the
// GGUF auto-detect path's behavior.
var nonLlamaSamplerBackends = map[string]struct{}{
"mlx": {},
"mlx-vlm": {},
"mlx-distributed": {},
}
// UsesLlamaSamplerDefaults reports whether a backend should receive llama.cpp's
// sampler defaults (e.g. top_k=40). Empty/unknown backends return true so the
// GGUF auto-detect path (which resolves to llama.cpp) keeps today's behavior;
// only the known non-llama backends in nonLlamaSamplerBackends return false.
func UsesLlamaSamplerDefaults(backend string) bool {
if backend == "" {
return true
}
_, isNonLlama := nonLlamaSamplerBackends[NormalizeBackendName(backend)]
return !isNonLlama
}
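// GetBackendCapability returns the capability info for a backend, or nil if unknown.
// The name is passed through NormalizeBackendName first, so a dotted name such as
// "llama.cpp" is looked up as "llama-cpp". The result points to a copy of the
// table entry, but its slice fields still share backing arrays with
// BackendCapabilities, so callers should treat them as read-only.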
func GetBackendCapability(backend string) *BackendCapability {
if cap, ok := BackendCapabilities[NormalizeBackendName(backend)]; ok {
return &cap
}
return nil
}
// PossibleUsecasesForBackend returns all usecases a backend can support.
// Returns nil if the backend is unknown.
func PossibleUsecasesForBackend(backend string) []string {
if cap := GetBackendCapability(backend); cap != nil {
return cap.PossibleUsecases
}
return nil
}
// DefaultUsecasesForBackendCap returns the conservative default usecases
// declared for a backend in BackendCapabilities.
// Returns nil if the backend is unknown.
func DefaultUsecasesForBackendCap(backend string) []string {
if cap := GetBackendCapability(backend); cap != nil {
return cap.DefaultUsecases
}
return nil
}
// IsValidUsecaseForBackend checks whether a usecase is in a backend's possible set.
// Returns true for unknown backends (permissive fallback).
func IsValidUsecaseForBackend(backend, usecase string) bool {
cap := GetBackendCapability(backend)
if cap == nil {
return true // unknown backend — don't restrict
}
return slices.Contains(cap.PossibleUsecases, usecase)
}
// AllBackendNames returns a sorted list of all known backend names.
func AllBackendNames() []string {
names := make([]string, 0, len(BackendCapabilities))
for name := range BackendCapabilities {
names = append(names, name)
}
slices.Sort(names)
return names
}
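// Usage sketch (illustrative only, not part of this file's API). It assumes a
// caller in another package that imports this one under the name "config";
// rejectedUsecases is a hypothetical helper name. The lookup helpers above
// could be combined roughly like this to filter a requested usecase list:
//
//	func rejectedUsecases(backend string, requested []string) []string {
//		var rejected []string
//		for _, u := range requested {
//			// Unknown backends are never restricted, so nothing is
//			// rejected for a name missing from BackendCapabilities.
//			if !config.IsValidUsecaseForBackend(backend, u) {
//				rejected = append(rejected, u)
//			}
//		}
//		return rejected
//	}
//
// With the table as written, asking "whisper" for UsecaseTTS would be rejected,
// because its PossibleUsecases list only UsecaseTranscript and UsecaseVAD.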