mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-22 23:58:51 -04:00
* feat(ced): sketch sound-classification backend (CED audio tagger) Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry, footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend. SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist in DESIGN.md): - backend/backend.proto: new SoundDetection rpc + SoundClass messages (run `make protogen-go` to regenerate pkg/grpc/proto). - backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h), goced.go (Ced gRPC backend: Load + SoundDetection), Makefile (clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh, package.sh, .gitignore. - DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability registration checklist), gallery/index + CI registration, and a scoping note for the realtime/websocket live-recognition path (sliding-window classify over the existing ws transport + voicegate; the ced C-API per-PCM entry point is already window-friendly). Backend code does not compile until protogen-go regenerates the pb types and a libced.so is built (Makefile clones+builds it). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ced): REST /v1/audio/classification endpoint + capability registration Wires the ced sound-event classification backend (AudioSet audio tagger) end to end through the REST surface, mirroring the transcription path. - Handler: core/http/endpoints/openai/sound_classification.go parses the multipart audio upload, temp-files it, resolves the model config and calls the SoundDetection RPC; returns {model, detections[]} JSON. - Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection) loads the model and normalizes the proto response into schema types. - Schema: core/schema/sound_classification.go (SoundClassificationResult). - gRPC layer: SoundDetection wired through the LocalAI wrapper (interface, Backend client, Client, embed, server, base default) so the loader-typed client exposes the RPC; proto regenerated via make protogen-go. - Route: POST /v1/audio/classification (+ /audio/classification alias) with the audio/multipart default-model middleware in routes/openai.go. - Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_ CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap + GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase option; /api/instructions audio area updated; auth RouteFeatureRegistry + FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter + i18n; docs page features/audio-classification.md + whats-new + crosslink. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ced): realtime sound-event detection over the websocket API When a realtime pipeline configures a sound-classification model, each VAD-committed utterance (the same window the transcription path produces) is also run through the CED sound-event classifier and the scored AudioSet tags are emitted as a new server event. No new backend rpc is needed: the SoundDetection gRPC method already exists on this branch. - config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty) beside Transcription/VAD. - realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the ModelInterface; implement it on wrappedModel and transcriptOnlyModel by calling backend.ModelSoundDetection with the session's sound-classification model config (mirrors how Transcribe dispatches). Load the optional config in newModel / newTranscriptionOnlyModel; nil config keeps it additive. - types: add ConversationItemSoundDetectionEvent (item_id, content_index, detections[]{label,score,index}) with type conversation.item.sound_detection, its ServerEventType constant and MarshalJSON, mirroring the transcription completed event. - realtime: add emitSoundDetection (unary path: classify the committed window, build the event, t.SendEvent) and wire it at the utterance-commit hook right after emitTranscription; gated on session.SoundDetectionEnabled (resolved from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0). Its error is logged via xlog but never aborts the turn. - test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections, classifier error) plus a SoundDetection method on the fakeModel double. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ced): implement SoundDetection in nodes backend test doubles The SoundDetection method added to the grpc backend interface left two test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so core/services/nodes failed to compile under `go vet`/`go test` (go build missed it: the doubles live in _test.go). Add the method to both, mirroring their existing Detect mock. Repairs CI for the nodes package. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ced): decouple realtime sound detection from VAD (sound-only sessions) Sound-event detection must activate on sounds, not speech, so it no longer runs through the voice VAD/transcription path. A sound-detection-only pipeline (sound_detection set, no transcription/LLM) now: - is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline stage), - builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS loaded), and - defaults the session to turn_detection none (no VAD) with no transcription stage, so the client drives windowing via input_audio_buffer.commit (option A: client-side sliding window). The per-PCM C-API already supports arbitrary windows. commitUtterance gains a sound-only branch: it emits the conversation.item.sound_detection event (scored AudioSet tags) and stops - no transcription, no LLM response. generateResponse is now guarded on a transcription stage being present, so a sound-only turn never invokes the LLM. Existing transcription/VAD sessions are unchanged (additive). Added a commitUtterance sound-only Ginkgo spec asserting it emits the sound event and neither transcribes nor generates a response. go vet + golangci-lint (new-from-merge-base) clean; openai suite green. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ced): register sound-classification backend in gallery + CI Mechanical backend-image registration for the ced sound-event classifier, mirroring the parakeet-cpp Go/purego backend everywhere it is wired up. - .github/backend-matrix.yml: add the ced build matrix, field-for-field copies of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64, l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan amd64/arm64, rocm hipblas, and the metal darwin entry), changing only backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang. - backend/index.yaml: add the &ced meta anchor (capabilities map per platform) plus ced-development and the per-arch image entries, each uri/mirror tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is intentionally deferred pending the HuggingFace publish (TODO note inline). - scripts/changed-backends.js: add an explicit item.backend === "ced" branch in inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as the parakeet-cpp branch (before the generic golang fallthrough). - .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in backend/go/ced/Makefile so the daily bot bumps the pin. - swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so the existing /v1/audio/classification annotations land in the generated spec. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ced): server-side windowing for realtime sound detection (option B) Adds an optional server-driven sliding-window classifier so a sound-only realtime client only has to stream audio (no input_audio_buffer.commit): - Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs. When both > 0 on a sound-only session, the server classifies the last window of streamed audio every hop and emits a conversation.item.sound_ detection event; the input buffer is trimmed to one window so a long stream stays bounded. When unset, the session stays client-driven (option A). Runs independent of VAD (sound events are not speech). - handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so it is unit-testable) + writeWindowWAV, which declares the true InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples correctly. Goroutine is started after toggleVAD and torn down with the session (close + wg.Wait). - Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta registry; the earlier realtime commit added pipeline.sound_detection without a registry entry, failing TestAllFieldsHaveRegistryEntries. This fixes that and covers the two new knobs. Tests: classifySoundWindow emits an event + trims the buffer to one window, no-ops on too-little audio; writeWindowWAV declares the given sample rate. go build/vet + golangci-lint (new-from-merge-base) clean; config + openai suites green. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0) The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0, converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced + known_usecases: sound_classification) and two gallery/index.yaml entries (ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and removes the now-resolved TODO from backend/index.yaml. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ced): add tiny/mini/small GGUF model gallery entries Publishes the rest of the CED family (same architecture, metadata-driven port verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds their f16 + q8_0 gallery entries: ced-tiny (5.5M, edge/Pi-class) f16 11MB / q8_0 6MB ced-mini (9.6M) f16 19MB / q8_0 11MB ced-small (22M) f16 42MB / q8_0 23MB All sha256-pinned. ced-base remains the accuracy default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8 gallery model entries' urls + file uris accordingly. sha256 and filenames are unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(ced): bump CED_VERSION to the short-clip fix Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip shorter than target_length (~10.11s): time_pos_embed was added at its full 63-frame grid instead of being sliced to the clip's actual time grid, tripping ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s windows) and gated with a short-clip parity test upstream. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive - README.md: add ced.cpp to the "native C/C++/GGML engines developed and maintained by the LocalAI project" table. - docs/content/features/backends.md: add a Sound Classification backend category (sound-event classification / audio tagging) listing ced.cpp. - .agents/adding-backends.md: add a "Documenting the backend" section and two verification-checklist items requiring new backends to be documented in the backends.md category list, and in-house native engines to be added to the README maintained-engines table. This directive was missing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(ced): repin CED_VERSION to the v0.1.0 release commit ced.cpp history was squashed into a single release commit (tagged v0.1.0), so the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the v0.1.0 release commit, so the backend builds against a commit that exists. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths - sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler. - goced.go: reading a NUL-terminated C string from a libced-owned buffer. #nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since the uintptr is a C-owned malloc'd buffer, not Go-GC memory. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
632 lines
26 KiB
Go
632 lines
26 KiB
Go
package config
|
|
|
|
import (
|
|
"slices"
|
|
"strings"
|
|
)
|
|
|
|
// Usecase name constants — the canonical string values used in gallery entries,
|
|
// model configs (known_usecases), and UsecaseInfoMap keys.
|
|
const (
|
|
UsecaseChat = "chat"
|
|
UsecaseCompletion = "completion"
|
|
UsecaseEdit = "edit"
|
|
UsecaseVision = "vision"
|
|
UsecaseEmbeddings = "embeddings"
|
|
UsecaseTokenize = "tokenize"
|
|
UsecaseImage = "image"
|
|
UsecaseVideo = "video"
|
|
UsecaseTranscript = "transcript"
|
|
UsecaseTTS = "tts"
|
|
UsecaseSoundGeneration = "sound_generation"
|
|
UsecaseRerank = "rerank"
|
|
UsecaseDetection = "detection"
|
|
UsecaseDepth = "depth"
|
|
UsecaseVAD = "vad"
|
|
UsecaseAudioTransform = "audio_transform"
|
|
UsecaseDiarization = "diarization"
|
|
UsecaseSoundClassification = "sound_classification"
|
|
UsecaseRealtimeAudio = "realtime_audio"
|
|
UsecaseFaceRecognition = "face_recognition"
|
|
UsecaseSpeakerRecognition = "speaker_recognition"
|
|
UsecaseTokenClassify = "token_classify"
|
|
)
|
|
|
|
// GRPCMethod identifies a Backend service RPC from backend.proto.
|
|
type GRPCMethod string
|
|
|
|
const (
|
|
MethodPredict GRPCMethod = "Predict"
|
|
MethodPredictStream GRPCMethod = "PredictStream"
|
|
MethodEmbedding GRPCMethod = "Embedding"
|
|
MethodGenerateImage GRPCMethod = "GenerateImage"
|
|
MethodGenerateVideo GRPCMethod = "GenerateVideo"
|
|
MethodAudioTranscription GRPCMethod = "AudioTranscription"
|
|
MethodTTS GRPCMethod = "TTS"
|
|
MethodTTSStream GRPCMethod = "TTSStream"
|
|
MethodSoundGeneration GRPCMethod = "SoundGeneration"
|
|
MethodTokenizeString GRPCMethod = "TokenizeString"
|
|
MethodDetect GRPCMethod = "Detect"
|
|
MethodDepth GRPCMethod = "Depth"
|
|
MethodRerank GRPCMethod = "Rerank"
|
|
MethodVAD GRPCMethod = "VAD"
|
|
MethodAudioTransform GRPCMethod = "AudioTransform"
|
|
MethodDiarize GRPCMethod = "Diarize"
|
|
MethodSoundDetection GRPCMethod = "SoundDetection"
|
|
MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
|
|
MethodFaceVerify GRPCMethod = "FaceVerify"
|
|
MethodFaceAnalyze GRPCMethod = "FaceAnalyze"
|
|
MethodVoiceVerify GRPCMethod = "VoiceVerify"
|
|
MethodVoiceEmbed GRPCMethod = "VoiceEmbed"
|
|
MethodVoiceAnalyze GRPCMethod = "VoiceAnalyze"
|
|
MethodTokenClassify GRPCMethod = "TokenClassify"
|
|
)
|
|
|
|
// UsecaseInfo describes a single known_usecase value and how it maps
|
|
// to the gRPC backend API.
|
|
type UsecaseInfo struct {
|
|
// Flag is the ModelConfigUsecase bitmask value.
|
|
Flag ModelConfigUsecase
|
|
// GRPCMethod is the primary Backend service RPC this usecase maps to.
|
|
GRPCMethod GRPCMethod
|
|
// IsModifier is true when this usecase doesn't map to its own gRPC RPC
|
|
// but modifies how another RPC behaves (e.g., vision uses Predict with images).
|
|
IsModifier bool
|
|
// DependsOn names the usecase(s) this modifier requires (e.g., "chat").
|
|
DependsOn string
|
|
// Description is a human/LLM-readable explanation of what this usecase means.
|
|
Description string
|
|
}
|
|
|
|
// UsecaseInfoMap maps each known_usecase string to its gRPC and semantic info.
|
|
var UsecaseInfoMap = map[string]UsecaseInfo{
|
|
UsecaseChat: {
|
|
Flag: FLAG_CHAT,
|
|
GRPCMethod: MethodPredict,
|
|
Description: "Conversational/instruction-following via the Predict RPC with chat templates.",
|
|
},
|
|
UsecaseCompletion: {
|
|
Flag: FLAG_COMPLETION,
|
|
GRPCMethod: MethodPredict,
|
|
Description: "Text completion via the Predict RPC with a completion template.",
|
|
},
|
|
UsecaseEdit: {
|
|
Flag: FLAG_EDIT,
|
|
GRPCMethod: MethodPredict,
|
|
Description: "Text editing via the Predict RPC with an edit template.",
|
|
},
|
|
UsecaseVision: {
|
|
Flag: FLAG_VISION,
|
|
GRPCMethod: MethodPredict,
|
|
IsModifier: true,
|
|
DependsOn: UsecaseChat,
|
|
Description: "The model accepts images alongside text in the Predict RPC. For llama-cpp this requires an mmproj file.",
|
|
},
|
|
UsecaseEmbeddings: {
|
|
Flag: FLAG_EMBEDDINGS,
|
|
GRPCMethod: MethodEmbedding,
|
|
Description: "Vector embedding generation via the Embedding RPC.",
|
|
},
|
|
UsecaseTokenize: {
|
|
Flag: FLAG_TOKENIZE,
|
|
GRPCMethod: MethodTokenizeString,
|
|
Description: "Tokenization via the TokenizeString RPC without running inference.",
|
|
},
|
|
UsecaseImage: {
|
|
Flag: FLAG_IMAGE,
|
|
GRPCMethod: MethodGenerateImage,
|
|
Description: "Image generation via the GenerateImage RPC (Stable Diffusion, Flux, etc.).",
|
|
},
|
|
UsecaseVideo: {
|
|
Flag: FLAG_VIDEO,
|
|
GRPCMethod: MethodGenerateVideo,
|
|
Description: "Video generation via the GenerateVideo RPC.",
|
|
},
|
|
UsecaseTranscript: {
|
|
Flag: FLAG_TRANSCRIPT,
|
|
GRPCMethod: MethodAudioTranscription,
|
|
Description: "Speech-to-text via the AudioTranscription RPC.",
|
|
},
|
|
UsecaseTTS: {
|
|
Flag: FLAG_TTS,
|
|
GRPCMethod: MethodTTS,
|
|
Description: "Text-to-speech via the TTS RPC.",
|
|
},
|
|
UsecaseSoundGeneration: {
|
|
Flag: FLAG_SOUND_GENERATION,
|
|
GRPCMethod: MethodSoundGeneration,
|
|
Description: "Music/sound generation via the SoundGeneration RPC (not speech).",
|
|
},
|
|
UsecaseRerank: {
|
|
Flag: FLAG_RERANK,
|
|
GRPCMethod: MethodRerank,
|
|
Description: "Document reranking via the Rerank RPC.",
|
|
},
|
|
UsecaseDetection: {
|
|
Flag: FLAG_DETECTION,
|
|
GRPCMethod: MethodDetect,
|
|
Description: "Object detection via the Detect RPC with bounding boxes.",
|
|
},
|
|
UsecaseDepth: {
|
|
Flag: FLAG_DEPTH,
|
|
GRPCMethod: MethodDepth,
|
|
Description: "Per-pixel metric depth, camera pose and 3D point cloud via the Depth RPC (Depth Anything 3).",
|
|
},
|
|
UsecaseVAD: {
|
|
Flag: FLAG_VAD,
|
|
GRPCMethod: MethodVAD,
|
|
Description: "Voice activity detection via the VAD RPC.",
|
|
},
|
|
UsecaseAudioTransform: {
|
|
Flag: FLAG_AUDIO_TRANSFORM,
|
|
GRPCMethod: MethodAudioTransform,
|
|
Description: "Audio-in / audio-out transformations (echo cancellation, noise suppression, dereverberation, voice conversion) via the AudioTransform RPC.",
|
|
},
|
|
UsecaseDiarization: {
|
|
Flag: FLAG_DIARIZATION,
|
|
GRPCMethod: MethodDiarize,
|
|
Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
|
|
},
|
|
UsecaseSoundClassification: {
|
|
Flag: FLAG_SOUND_CLASSIFICATION,
|
|
GRPCMethod: MethodSoundDetection,
|
|
Description: "Sound-event classification / audio tagging (scored AudioSet labels like baby cry, glass breaking, alarms) via the SoundDetection RPC.",
|
|
},
|
|
UsecaseRealtimeAudio: {
|
|
Flag: FLAG_REALTIME_AUDIO,
|
|
GRPCMethod: MethodAudioToAudioStream,
|
|
Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
|
|
},
|
|
UsecaseFaceRecognition: {
|
|
Flag: FLAG_FACE_RECOGNITION,
|
|
GRPCMethod: MethodFaceVerify,
|
|
Description: "Face recognition — verify identity, analyze attributes (age/gender/emotion) via FaceVerify and FaceAnalyze RPCs.",
|
|
},
|
|
UsecaseSpeakerRecognition: {
|
|
Flag: FLAG_SPEAKER_RECOGNITION,
|
|
GRPCMethod: MethodVoiceVerify,
|
|
Description: "Speaker recognition — verify identity, embed and analyze voice via VoiceVerify, VoiceEmbed and VoiceAnalyze RPCs.",
|
|
},
|
|
UsecaseTokenClassify: {
|
|
Flag: FLAG_TOKEN_CLASSIFY,
|
|
GRPCMethod: MethodTokenClassify,
|
|
Description: "Per-token classification (NER) via the TokenClassify RPC — the PII detector tier. Declared explicitly via known_usecases; never auto-guessed, since the token-classification head is not useful as general generation or embeddings.",
|
|
},
|
|
}
|
|
|
|
// BackendCapability describes which gRPC methods and usecases a backend supports.
|
|
// Derived from reviewing actual implementations in backend/go/ and backend/python/.
|
|
type BackendCapability struct {
|
|
// GRPCMethods lists the Backend service RPCs this backend implements.
|
|
GRPCMethods []GRPCMethod
|
|
// PossibleUsecases lists all usecase strings this backend can support.
|
|
PossibleUsecases []string
|
|
// DefaultUsecases lists the conservative safe defaults.
|
|
DefaultUsecases []string
|
|
// AcceptsImages indicates multimodal image input in Predict.
|
|
AcceptsImages bool
|
|
// AcceptsVideos indicates multimodal video input in Predict.
|
|
AcceptsVideos bool
|
|
// AcceptsAudios indicates multimodal audio input in Predict.
|
|
AcceptsAudios bool
|
|
// Description is a human-readable summary of the backend.
|
|
Description string
|
|
}
|
|
|
|
// BackendCapabilities maps each backend name (as used in model configs and gallery
|
|
// entries) to its verified capabilities. This is the single source of truth for
|
|
// what each backend supports.
|
|
//
|
|
// Backend names use hyphens (e.g., "llama-cpp") matching the gallery convention.
|
|
// Use NormalizeBackendName() for names with dots (e.g., "llama.cpp").
|
|
var BackendCapabilities = map[string]BackendCapability{
|
|
// --- LLM / text generation backends ---
|
|
"llama-cpp": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTokenizeString},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEdit, UsecaseEmbeddings, UsecaseTokenize, UsecaseVision},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
AcceptsImages: true, // requires mmproj
|
|
Description: "llama.cpp GGUF models — LLM inference with optional vision via mmproj",
|
|
},
|
|
// privacy-filter is the standalone GGML engine (backend/cpp/privacy-filter,
|
|
// wrapping privacy-filter.cpp) for the openai-privacy-filter PII/NER token
|
|
// classifier — the dedicated TokenClassify path that replaces the
|
|
// patched-llama.cpp route. Never auto-guessed; declared explicitly via
|
|
// known_usecases: [token_classify].
|
|
"privacy-filter": {
|
|
GRPCMethods: []GRPCMethod{MethodTokenClassify},
|
|
PossibleUsecases: []string{UsecaseTokenClassify},
|
|
DefaultUsecases: []string{UsecaseTokenClassify},
|
|
Description: "privacy-filter.cpp — standalone GGML backend for openai-privacy-filter PII/NER token classification",
|
|
},
|
|
"vllm": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
AcceptsImages: true,
|
|
AcceptsVideos: true,
|
|
Description: "vLLM engine — high-throughput LLM serving with optional multimodal",
|
|
},
|
|
"sglang": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodTokenizeString},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTokenize, UsecaseVision},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
AcceptsImages: true,
|
|
Description: "SGLang — fast LLM inference with structured generation and optional vision",
|
|
},
|
|
"vllm-omni": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodGenerateImage, MethodGenerateVideo, MethodTTS},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseImage, UsecaseVideo, UsecaseTTS, UsecaseVision},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
AcceptsImages: true,
|
|
AcceptsVideos: true,
|
|
AcceptsAudios: true,
|
|
Description: "vLLM omni-modal — supports text, image, video generation and TTS",
|
|
},
|
|
"transformers": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding, MethodTTS, MethodSoundGeneration},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseTTS, UsecaseSoundGeneration},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
Description: "HuggingFace transformers — general-purpose Python inference",
|
|
},
|
|
"mlx": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
Description: "Apple MLX framework — optimized for Apple Silicon",
|
|
},
|
|
"mlx-distributed": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
Description: "MLX distributed inference across multiple Apple Silicon devices",
|
|
},
|
|
"mlx-vlm": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodEmbedding},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseEmbeddings, UsecaseVision},
|
|
DefaultUsecases: []string{UsecaseChat, UsecaseVision},
|
|
AcceptsImages: true,
|
|
AcceptsAudios: true,
|
|
Description: "MLX vision-language models with multimodal input",
|
|
},
|
|
"mlx-audio": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodTTS},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseChat},
|
|
Description: "MLX audio models — text generation and TTS",
|
|
},
|
|
|
|
// --- Image/video generation backends ---
|
|
"diffusers": {
|
|
GRPCMethods: []GRPCMethod{MethodGenerateImage, MethodGenerateVideo},
|
|
PossibleUsecases: []string{UsecaseImage, UsecaseVideo},
|
|
DefaultUsecases: []string{UsecaseImage},
|
|
Description: "HuggingFace diffusers — Stable Diffusion, Flux, video generation",
|
|
},
|
|
"stablediffusion": {
|
|
GRPCMethods: []GRPCMethod{MethodGenerateImage},
|
|
PossibleUsecases: []string{UsecaseImage},
|
|
DefaultUsecases: []string{UsecaseImage},
|
|
Description: "Stable Diffusion native backend",
|
|
},
|
|
"stablediffusion-ggml": {
|
|
GRPCMethods: []GRPCMethod{MethodGenerateImage},
|
|
PossibleUsecases: []string{UsecaseImage},
|
|
DefaultUsecases: []string{UsecaseImage},
|
|
Description: "Stable Diffusion via GGML quantized models",
|
|
},
|
|
|
|
// --- Speech-to-text backends ---
|
|
"whisper": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodVAD},
|
|
PossibleUsecases: []string{UsecaseTranscript, UsecaseVAD},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "OpenAI Whisper — speech recognition and voice activity detection",
|
|
},
|
|
"faster-whisper": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
|
|
PossibleUsecases: []string{UsecaseTranscript},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "CTranslate2-accelerated Whisper for faster transcription",
|
|
},
|
|
"whisperx": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
|
|
PossibleUsecases: []string{UsecaseTranscript},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "WhisperX — Whisper with word-level timestamps and speaker diarization",
|
|
},
|
|
"moonshine": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
|
|
PossibleUsecases: []string{UsecaseTranscript},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "Moonshine speech recognition",
|
|
},
|
|
"nemo": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
|
|
PossibleUsecases: []string{UsecaseTranscript},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "NVIDIA NeMo speech recognition",
|
|
},
|
|
"parakeet-cpp": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
|
|
PossibleUsecases: []string{UsecaseTranscript},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "NVIDIA NeMo Parakeet ASR (parakeet.cpp)",
|
|
},
|
|
"qwen-asr": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
|
|
PossibleUsecases: []string{UsecaseTranscript},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "Qwen automatic speech recognition",
|
|
},
|
|
"voxtral": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription},
|
|
PossibleUsecases: []string{UsecaseTranscript},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "Voxtral speech recognition",
|
|
},
|
|
"vibevoice": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
|
|
Description: "VibeVoice — bidirectional speech (transcription and synthesis)",
|
|
},
|
|
"vibevoice-cpp": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream},
|
|
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTranscript, UsecaseTTS},
|
|
Description: "VibeVoice C++ — bidirectional speech, C++ backend with streaming TTS",
|
|
},
|
|
"sherpa-onnx": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTranscription, MethodTTS, MethodTTSStream, MethodVAD},
|
|
PossibleUsecases: []string{UsecaseTranscript, UsecaseTTS, UsecaseVAD},
|
|
DefaultUsecases: []string{UsecaseTranscript},
|
|
Description: "Sherpa-ONNX — multi-model speech toolkit (ASR, TTS, VAD)",
|
|
},
|
|
|
|
// --- TTS backends ---
|
|
"piper": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Piper — fast neural TTS optimized for Raspberry Pi",
|
|
},
|
|
"kokoro": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Kokoro TTS",
|
|
},
|
|
"coqui": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Coqui TTS — multi-speaker neural synthesis",
|
|
},
|
|
"kitten-tts": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Kitten TTS",
|
|
},
|
|
"outetts": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "OuteTTS",
|
|
},
|
|
"pocket-tts": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Pocket TTS — lightweight text-to-speech",
|
|
},
|
|
"qwen-tts": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Qwen TTS",
|
|
},
|
|
"qwen3-tts-cpp": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS, MethodTTSStream},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Qwen3 TTS C++ - text-to-speech with streaming, named speakers, voice design and cloning (qwentts.cpp / GGML)",
|
|
},
|
|
"faster-qwen3-tts": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Faster Qwen3 TTS — accelerated Qwen TTS",
|
|
},
|
|
"fish-speech": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Fish Speech TTS",
|
|
},
|
|
"neutts": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "NeuTTS — neural text-to-speech",
|
|
},
|
|
"chatterbox": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "Chatterbox TTS",
|
|
},
|
|
"voxcpm": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS, MethodTTSStream},
|
|
PossibleUsecases: []string{UsecaseTTS},
|
|
DefaultUsecases: []string{UsecaseTTS},
|
|
Description: "VoxCPM TTS with streaming support",
|
|
},
|
|
|
|
// --- Sound generation backends ---
|
|
"ace-step": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS, MethodSoundGeneration},
|
|
PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
|
|
DefaultUsecases: []string{UsecaseSoundGeneration},
|
|
Description: "ACE-Step — music and sound generation",
|
|
},
|
|
"acestep-cpp": {
|
|
GRPCMethods: []GRPCMethod{MethodSoundGeneration},
|
|
PossibleUsecases: []string{UsecaseSoundGeneration},
|
|
DefaultUsecases: []string{UsecaseSoundGeneration},
|
|
Description: "ACE-Step C++ — native sound generation",
|
|
},
|
|
"transformers-musicgen": {
|
|
GRPCMethods: []GRPCMethod{MethodTTS, MethodSoundGeneration},
|
|
PossibleUsecases: []string{UsecaseTTS, UsecaseSoundGeneration},
|
|
DefaultUsecases: []string{UsecaseSoundGeneration},
|
|
Description: "Meta MusicGen via transformers — music generation from text",
|
|
},
|
|
|
|
// --- Any-to-any audio backends ---
|
|
"liquid-audio": {
|
|
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
|
|
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
|
|
DefaultUsecases: []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
|
|
AcceptsAudios: true,
|
|
Description: "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
|
|
},
|
|
|
|
// --- Audio transform backends ---
|
|
"localvqe": {
|
|
GRPCMethods: []GRPCMethod{MethodAudioTransform},
|
|
PossibleUsecases: []string{UsecaseAudioTransform},
|
|
DefaultUsecases: []string{UsecaseAudioTransform},
|
|
Description: "LocalVQE — joint AEC, noise suppression, and dereverberation for 16 kHz mono speech",
|
|
},
|
|
|
|
// --- Utility backends ---
|
|
"rerankers": {
|
|
GRPCMethods: []GRPCMethod{MethodRerank},
|
|
PossibleUsecases: []string{UsecaseRerank},
|
|
DefaultUsecases: []string{UsecaseRerank},
|
|
Description: "Cross-encoder reranking models",
|
|
},
|
|
"rfdetr": {
|
|
GRPCMethods: []GRPCMethod{MethodDetect},
|
|
PossibleUsecases: []string{UsecaseDetection},
|
|
DefaultUsecases: []string{UsecaseDetection},
|
|
Description: "RF-DETR object detection",
|
|
},
|
|
"rfdetr-cpp": {
|
|
GRPCMethods: []GRPCMethod{MethodDetect},
|
|
PossibleUsecases: []string{UsecaseDetection},
|
|
DefaultUsecases: []string{UsecaseDetection},
|
|
Description: "RF-DETR C++ object detection",
|
|
},
|
|
"depth-anything": {
|
|
GRPCMethods: []GRPCMethod{MethodDepth, MethodPredict, MethodGenerateImage},
|
|
PossibleUsecases: []string{UsecaseDepth},
|
|
DefaultUsecases: []string{UsecaseDepth},
|
|
AcceptsImages: true,
|
|
Description: "Depth Anything 3 C++ — per-pixel metric depth, camera pose and 3D point cloud",
|
|
},
|
|
|
|
// --- Face and speaker recognition backends ---
|
|
"insightface": {
|
|
GRPCMethods: []GRPCMethod{MethodEmbedding, MethodDetect, MethodFaceVerify, MethodFaceAnalyze},
|
|
PossibleUsecases: []string{UsecaseEmbeddings, UsecaseDetection, UsecaseFaceRecognition},
|
|
DefaultUsecases: []string{UsecaseFaceRecognition},
|
|
AcceptsImages: true,
|
|
Description: "InsightFace — face detection, embedding, verification and attribute analysis",
|
|
},
|
|
"speaker-recognition": {
|
|
GRPCMethods: []GRPCMethod{MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze},
|
|
PossibleUsecases: []string{UsecaseSpeakerRecognition},
|
|
DefaultUsecases: []string{UsecaseSpeakerRecognition},
|
|
Description: "Speaker recognition — voice identity verification and analysis",
|
|
},
|
|
"silero-vad": {
|
|
GRPCMethods: []GRPCMethod{MethodVAD},
|
|
PossibleUsecases: []string{UsecaseVAD},
|
|
DefaultUsecases: []string{UsecaseVAD},
|
|
Description: "Silero VAD — voice activity detection",
|
|
},
|
|
}
|
|
|
|
// NormalizeBackendName converts backend names to the canonical hyphenated form
|
|
// used in gallery entries (e.g., "llama.cpp" → "llama-cpp").
|
|
func NormalizeBackendName(backend string) string {
|
|
return strings.ReplaceAll(backend, ".", "-")
|
|
}
|
|
|
|
// nonLlamaSamplerBackends lists backends whose native sampler defaults differ
|
|
// from llama.cpp's, so LocalAI must NOT inject llama.cpp's top_k=40 default for
|
|
// them (issue #6632). mlx_lm's intended default is top_k=0 (disabled) and mlx
|
|
// does not remap 0->40, so shipping 40 silently changes sampling for clients
|
|
// that omit top_k. Leaving TopK nil lets the wire value default to 0.
|
|
//
|
|
// This is intentionally a small allow-list of KNOWN non-llama backends: empty
|
|
// and unknown backends fall through to the llama.cpp default to preserve the
|
|
// GGUF auto-detect path's behavior.
|
|
var nonLlamaSamplerBackends = map[string]struct{}{
|
|
"mlx": {},
|
|
"mlx-vlm": {},
|
|
"mlx-distributed": {},
|
|
}
|
|
|
|
// UsesLlamaSamplerDefaults reports whether a backend should receive llama.cpp's
|
|
// sampler defaults (e.g. top_k=40). Empty/unknown backends return true so the
|
|
// GGUF auto-detect path (which resolves to llama.cpp) keeps today's behavior;
|
|
// only the known non-llama backends in nonLlamaSamplerBackends return false.
|
|
func UsesLlamaSamplerDefaults(backend string) bool {
|
|
if backend == "" {
|
|
return true
|
|
}
|
|
_, isNonLlama := nonLlamaSamplerBackends[NormalizeBackendName(backend)]
|
|
return !isNonLlama
|
|
}
|
|
|
|
// GetBackendCapability returns the capability info for a backend, or nil if unknown.
|
|
// Handles backend name normalization.
|
|
func GetBackendCapability(backend string) *BackendCapability {
|
|
if cap, ok := BackendCapabilities[NormalizeBackendName(backend)]; ok {
|
|
return &cap
|
|
}
|
|
return nil
|
|
}
|
|
|
|
// PossibleUsecasesForBackend returns all usecases a backend can support.
|
|
// Returns nil if the backend is unknown.
|
|
func PossibleUsecasesForBackend(backend string) []string {
|
|
if cap := GetBackendCapability(backend); cap != nil {
|
|
return cap.PossibleUsecases
|
|
}
|
|
return nil
|
|
}
|
|
|
|
// DefaultUsecasesForBackend returns the conservative default usecases.
|
|
// Returns nil if the backend is unknown.
|
|
func DefaultUsecasesForBackendCap(backend string) []string {
|
|
if cap := GetBackendCapability(backend); cap != nil {
|
|
return cap.DefaultUsecases
|
|
}
|
|
return nil
|
|
}
|
|
|
|
// IsValidUsecaseForBackend checks whether a usecase is in a backend's possible set.
|
|
// Returns true for unknown backends (permissive fallback).
|
|
func IsValidUsecaseForBackend(backend, usecase string) bool {
|
|
cap := GetBackendCapability(backend)
|
|
if cap == nil {
|
|
return true // unknown backend — don't restrict
|
|
}
|
|
return slices.Contains(cap.PossibleUsecases, usecase)
|
|
}
|
|
|
|
// AllBackendNames returns a sorted list of all known backend names.
|
|
func AllBackendNames() []string {
|
|
names := make([]string, 0, len(BackendCapabilities))
|
|
for name := range BackendCapabilities {
|
|
names = append(names, name)
|
|
}
|
|
slices.Sort(names)
|
|
return names
|
|
}
|