mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-22 07:39:02 -04:00
feat(ced): sound-event classification backend (CED audio tagger) (#10425)
* feat(ced): sketch sound-classification backend (CED audio tagger)

Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry, footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.

SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist in DESIGN.md):

- backend/backend.proto: new SoundDetection rpc + SoundClass messages (run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h), goced.go (Ced gRPC backend: Load + SoundDetection), Makefile (clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh, package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability registration checklist), gallery/index + CI registration, and a scoping note for the realtime/websocket live-recognition path (sliding-window classify over the existing ws transport + voicegate; the ced C-API per-PCM entry point is already window-friendly).

Backend code does not compile until protogen-go regenerates the pb types and a libced.so is built (Makefile clones+builds it).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): REST /v1/audio/classification endpoint + capability registration

Wires the ced sound-event classification backend (AudioSet audio tagger) end to end through the REST surface, mirroring the transcription path.

- Handler: core/http/endpoints/openai/sound_classification.go parses the multipart audio upload, temp-files it, resolves the model config and calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection) loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface, Backend client, Client, embed, server, base default) so the loader-typed client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap + GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase option; /api/instructions audio area updated; auth RouteFeatureRegistry + FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter + i18n; docs page features/audio-classification.md + whats-new + crosslink.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): realtime sound-event detection over the websocket API

When a realtime pipeline configures a sound-classification model, each VAD-committed utterance (the same window the transcription path produces) is also run through the CED sound-event classifier and the scored AudioSet tags are emitted as a new server event.

No new backend rpc is needed: the SoundDetection gRPC method already exists on this branch.

- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty) beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the ModelInterface; implement it on wrappedModel and transcriptOnlyModel by calling backend.ModelSoundDetection with the session's sound-classification model config (mirrors how Transcribe dispatches). Load the optional config in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index, detections[]{label,score,index}) with type conversation.item.sound_detection, its ServerEventType constant and MarshalJSON, mirroring the transcription completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window, build the event, t.SendEvent) and wire it at the utterance-commit hook right after emitTranscription; gated on session.SoundDetectionEnabled (resolved from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0). Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections, classifier error) plus a SoundDetection method on the fakeModel double.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ced): implement SoundDetection in nodes backend test doubles

The SoundDetection method added to the grpc backend interface left two test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so core/services/nodes failed to compile under `go vet`/`go test` (go build missed it: the doubles live in _test.go). Add the method to both, mirroring their existing Detect mock. Repairs CI for the nodes package.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)

Sound-event detection must activate on sounds, not speech, so it no longer runs through the voice VAD/transcription path. A sound-detection-only pipeline (sound_detection set, no transcription/LLM) now:

- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription stage, so the client drives windowing via input_audio_buffer.commit (option A: client-side sliding window). The per-PCM C-API already supports arbitrary windows.

commitUtterance gains a sound-only branch: it emits the conversation.item.sound_detection event (scored AudioSet tags) and stops - no transcription, no LLM response. generateResponse is now guarded on a transcription stage being present, so a sound-only turn never invokes the LLM.

Existing transcription/VAD sessions are unchanged (additive). Added a commitUtterance sound-only Ginkgo spec asserting it emits the sound event and neither transcribes nor generates a response.

go vet + golangci-lint (new-from-merge-base) clean; openai suite green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): register sound-classification backend in gallery + CI

Mechanical backend-image registration for the ced sound-event classifier, mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.

- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64, l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan amd64/arm64, rocm hipblas, and the metal darwin entry), changing only backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform) plus ced-development and the per-arch image entries, each uri/mirror tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so the existing /v1/audio/classification annotations land in the generated spec.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): server-side windowing for realtime sound detection (option B)

Adds an optional server-driven sliding-window classifier so a sound-only realtime client only has to stream audio (no input_audio_buffer.commit):

- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs. When both > 0 on a sound-only session, the server classifies the last window of streamed audio every hop and emits a conversation.item.sound_detection event; the input buffer is trimmed to one window so a long stream stays bounded. When unset, the session stays client-driven (option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so it is unit-testable) + writeWindowWAV, which declares the true InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples correctly. Goroutine is started after toggleVAD and torn down with the session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta registry; the earlier realtime commit added pipeline.sound_detection without a registry entry, failing TestAllFieldsHaveRegistryEntries. This fixes that and covers the two new knobs.

Tests: classifySoundWindow emits an event + trims the buffer to one window, no-ops on too-little audio; writeWindowWAV declares the given sample rate.

go build/vet + golangci-lint (new-from-merge-base) clean; config + openai suites green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)

The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0, converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced + known_usecases: sound_classification) and two gallery/index.yaml entries (ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and removes the now-resolved TODO from backend/index.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): add tiny/mini/small GGUF model gallery entries

Publishes the rest of the CED family (same architecture, metadata-driven port verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds their f16 + q8_0 gallery entries:

  ced-tiny  (5.5M, edge/Pi-class)  f16 11MB / q8_0 6MB
  ced-mini  (9.6M)                 f16 19MB / q8_0 11MB
  ced-small (22M)                  f16 42MB / q8_0 23MB

All sha256-pinned. ced-base remains the accuracy default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo

All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8 gallery model entries' urls + file uris accordingly. sha256 and filenames are unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): bump CED_VERSION to the short-clip fix

Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip shorter than target_length (~10.11s): time_pos_embed was added at its full 63-frame grid instead of being sliced to the clip's actual time grid, tripping ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s windows) and gated with a short-clip parity test upstream.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive

- README.md: add ced.cpp to the "native C/C++/GGML engines developed and maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two verification-checklist items requiring new backends to be documented in the backends.md category list, and in-house native engines to be added to the README maintained-engines table. This directive was missing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): repin CED_VERSION to the v0.1.0 release commit

ced.cpp history was squashed into a single release commit (tagged v0.1.0), so the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the v0.1.0 release commit, so the backend builds against a commit that exists.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths

- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer. #nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since the uintptr is a C-owned malloc'd buffer, not Go-GC memory.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
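
The server-side windowing commit (option B) describes classifying the last window of streamed audio on every hop and trimming the input buffer to one window. The real `handleSoundWindow` / `classifySoundWindow` code is not part of this excerpt, so the following is only a minimal sketch of that bookkeeping. The names `windowSamples` and `trimToWindow`, the 16 kHz rate, the 16-bit PCM sample type, and the 1000 ms / 500 ms settings are all assumptions made for illustration.

```go
package main

import "fmt"

// windowSamples converts a duration in milliseconds into a sample count
// at the given sample rate.
func windowSamples(ms, sampleRate int) int {
	return ms * sampleRate / 1000
}

// trimToWindow returns a copy of the last n samples of buf. ok is false when
// buf does not yet hold a full window, in which case the tick is a no-op.
func trimToWindow(buf []int16, n int) (window []int16, ok bool) {
	if n <= 0 || len(buf) < n {
		return nil, false
	}
	window = make([]int16, n)
	copy(window, buf[len(buf)-n:])
	return window, true
}

func main() {
	const sampleRate = 16000          // assumed input sample rate
	const windowMs, hopMs = 1000, 500 // hypothetical window/hop settings

	n := windowSamples(windowMs, sampleRate)
	hop := windowSamples(hopMs, sampleRate)

	var buf []int16
	for tick := 1; tick <= 4; tick++ {
		// Simulate the audio streamed since the previous tick.
		buf = append(buf, make([]int16, hop)...)

		window, ok := trimToWindow(buf, n)
		if !ok {
			fmt.Printf("tick %d: %d samples buffered, waiting for a full window\n", tick, len(buf))
			continue
		}
		// Keep only one window so a long stream stays bounded.
		buf = window
		fmt.Printf("tick %d: classify %d samples, buffer trimmed to %d\n", tick, len(window), len(buf))
	}
}
```

Copying the tail into a fresh slice, rather than re-slicing `buf`, lets the old backing array be collected, which is one way to keep memory bounded on a long stream.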
@@ -198,6 +198,27 @@ docker-build-backends: ... docker-build-<backend-name>
- If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
- Check similar backends to determine the correct context

## Documenting the backend (README + docs)

A backend is not "added" until it is discoverable. Update the user-facing docs:

- **`docs/content/features/backends.md`** - add the backend to the right
  category in the "LocalAI supports various types of backends" list (and add a
  new category if it introduces a new modality, e.g. sound classification).
- If the backend introduces a **new API surface** (a new endpoint or a realtime
  capability), document it under `docs/content/` where its area lives (audio,
  vision, etc.) and follow the api-endpoints checklist in
  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).

**If the backend is a native C/C++/GGML engine created and maintained by the
LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
engines ... developed and maintained by the LocalAI project itself". Add a row
linking the upstream engine repo with a one-line description. This is the
project's showcase of its own engines; a new in-house backend that is missing
from it is a documentation bug.

## 5. Verification Checklist

After adding a new backend, verify:

@@ -211,6 +232,8 @@ After adding a new backend, verify:
- [ ] No YAML syntax errors (check with linter)
- [ ] No Makefile syntax errors (check with linter)
- [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`

## Bundling runtime shared libraries (`package.sh`)

152
.github/backend-matrix.yml
vendored
@@ -3575,6 +3575,154 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # ced
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-ced'
    base-image: "ubuntu:24.04"
    ubuntu-version: '2404'
    runs-on: 'ubuntu-24.04-arm'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-ced'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f32'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f32-ced'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'sycl_f16'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-sycl-f16-ced'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-ced'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'vulkan'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-vulkan-ced'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'false'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-arm64-ced'
    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
    runs-on: 'ubuntu-24.04-arm'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2204'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-rocm-hipblas-ced'
    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
    runs-on: 'ubuntu-latest'
    skip-drivers: 'false'
    backend: "ced"
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
  # acestep-cpp
  - build-type: ''
    cuda-major-version: ""
@@ -4754,6 +4902,10 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
    build-type: "metal"
    lang: "go"
  - backend: "ced"
    tag-suffix: "-metal-darwin-arm64-ced"
    build-type: "metal"
    lang: "go"
  - backend: "acestep-cpp"
    tag-suffix: "-metal-darwin-arm64-acestep-cpp"
    build-type: "metal"

4
.github/workflows/bump_deps.yaml
vendored
@@ -42,6 +42,10 @@ jobs:
    variable: "PARAKEET_VERSION"
    branch: "master"
    file: "backend/go/parakeet-cpp/Makefile"
  - repository: "mudler/ced.cpp"
    variable: "CED_VERSION"
    branch: "master"
    file: "backend/go/ced/Makefile"
  - repository: "mudler/depth-anything.cpp"
    variable: "DEPTHANYTHING_VERSION"
    branch: "master"

@@ -231,6 +231,7 @@ Most backends wrap a best-in-class upstream engine. A handful of them are native
| Backend | What it does |
|---------|-------------|
| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
| [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
| [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |

@@ -24,6 +24,9 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
  // SoundDetection runs an audio-tagging / sound-event-classification model
  // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels.
  rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {}
  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
@@ -671,6 +674,24 @@ message DetectResponse {
  repeated Detection Detections = 1;
}

// --- Sound-event classification / audio tagging messages (CED) ---

message SoundDetectionRequest {
  string src = 1;      // audio file path (LocalAI writes the upload to disk)
  int32 top_k = 2;     // number of top tags to return (0 = all classes)
  float threshold = 3; // optional: drop tags scoring below this
}

message SoundClass {
  string label = 1; // AudioSet class name, e.g. "Baby cry, infant cry"
  float score = 2;  // per-class probability (multi-label, independent)
  int32 index = 3;  // class index in the model ontology
}

message SoundDetectionResponse {
  repeated SoundClass detections = 1; // score-descending
}

// --- Depth estimation messages (Depth Anything 3) ---

message DepthRequest {

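
The field comments above describe three behaviours: `top_k` limits the number of tags (with 0 meaning all classes), `threshold` drops low-scoring tags, and `detections` is ordered score-descending. As a minimal sketch of those semantics, assuming a local `SoundClass` struct in place of the generated pb type and a hypothetical helper `selectTags`, the selection could look like this. The scores and class indices in `main` are made up for illustration. Note that the Go backend in this commit substitutes a default of 10 when `top_k <= 0`, so the sketch follows the proto comment rather than that code path.

```go
package main

import (
	"fmt"
	"sort"
)

// SoundClass is a local stand-in for the generated pb.SoundClass message.
type SoundClass struct {
	Label string
	Score float32
	Index int32
}

// selectTags drops tags scoring below threshold, sorts the rest by score
// (highest first) and keeps at most topK of them. topK <= 0 keeps all.
func selectTags(all []SoundClass, topK int, threshold float32) []SoundClass {
	kept := make([]SoundClass, 0, len(all))
	for _, c := range all {
		if c.Score >= threshold {
			kept = append(kept, c)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		return kept[i].Score > kept[j].Score
	})
	if topK > 0 && topK < len(kept) {
		kept = kept[:topK]
	}
	return kept
}

func main() {
	// Illustrative values only; not real model output.
	scores := []SoundClass{
		{Label: "Speech", Score: 0.12, Index: 0},
		{Label: "Baby cry, infant cry", Score: 0.91, Index: 1},
		{Label: "Dog", Score: 0.40, Index: 2},
	}
	for _, c := range selectTags(scores, 2, 0.2) {
		fmt.Printf("%-22s %.2f (class %d)\n", c.Label, c.Score, c.Index)
	}
}
```

Filtering before truncating gives the same result as truncating to the top K and then filtering, because both orders end up with the highest-scoring tags that clear the threshold.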
11
backend/go/ced/.gitignore
vendored
Normal file
@@ -0,0 +1,11 @@
.cache/
sources/
build/
package/
ced-grpc
# build artifacts staged in-tree by the Makefile (cp from sources/) or
# symlinked for local dev; the real sources live in ced.cpp upstream.
*.so
*.so.*
ced_capi.h
compile_commands.json
77
backend/go/ced/Makefile
Normal file
@@ -0,0 +1,77 @@
# ced sound-classification backend Makefile.
#
# Upstream pin lives below as CED_VERSION?=<sha> so .github/bump_deps.sh can find
# and update it (matches the parakeet-cpp / whisper.cpp convention).
#
# Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and
# skip the clone/cmake steps entirely:
#   ln -sf /path/to/ced.cpp/build-shared/libced.so .
#   ln -sf /path/to/ced.cpp/include/ced_capi.h .
#   go build -o ced-grpc .

CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7
CED_REPO?=https://github.com/mudler/ced.cpp

GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)

BUILD_TYPE?=
NATIVE?=false

# Static-link ggml into libced.so (PIC) so the shared lib is self-contained:
# dlopen needs no libggml*.so alongside it, only system libs the runtime image
# already provides.
CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON

ifeq ($(NATIVE),false)
	CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif

# ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL
# "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON.
ifeq ($(BUILD_TYPE),cublas)
	CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
else ifeq ($(BUILD_TYPE),openblas)
	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),hipblas)
	CMAKE_ARGS+=-DCED_GGML_HIP=ON
else ifeq ($(BUILD_TYPE),vulkan)
	CMAKE_ARGS+=-DCED_GGML_VULKAN=ON
endif

.PHONY: ced-grpc package build clean purge test all

all: ced-grpc

sources/ced.cpp:
	mkdir -p sources/ced.cpp
	cd sources/ced.cpp && \
	git init -q && \
	git remote add origin $(CED_REPO) && \
	git fetch --depth 1 origin $(CED_VERSION) && \
	git checkout FETCH_HEAD && \
	git submodule update --init --recursive --depth 1 --single-branch

libced.so: sources/ced.cpp
	cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS)
	cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS)
	cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true
	cp -fv sources/ced.cpp/include/ced_capi.h ./

ced-grpc: libced.so main.go goced.go
	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc .

package: ced-grpc
	bash package.sh

build: package

test:
	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1

clean: purge
	rm -rf libced.so* ced_capi.h package ced-grpc

purge:
	rm -rf sources/ced.cpp
130
backend/go/ced/goced.go
Normal file
@@ -0,0 +1,130 @@
package main

// Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC
// SoundDetection implementation.
//
// SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with
// `make protogen-go`). The C side is single-threaded per ctx, so we guard the
// engine with engineMu; LocalAI also serializes via base.SingleThread.
import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"sort"
	"sync"
	"unsafe"

	"github.com/mudler/LocalAI/pkg/grpc/base"
	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)

// purego-bound entry points from libced.so. Names match ced_capi.h exactly.
var (
	CppAbiVersion       func() int32
	CppLoad             func(ggufPath string) uintptr
	CppFree             func(ctx uintptr)
	CppLastError        func(ctx uintptr) string
	CppNumClasses       func(ctx uintptr) int32
	CppSampleRate       func(ctx uintptr) int32
	CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr
	CppClassifyPcmJSON  func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr
	CppFreeString       func(s uintptr)
)

// cstr copies a malloc'd C string (returned as uintptr) into a Go string and
// frees the original via ced_capi_free_string. Empty/0 -> "".
func cstr(p uintptr) string {
	if p == 0 {
		return ""
	}
	defer CppFreeString(p)
	var b []byte
	for i := 0; ; i++ {
		ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory)
		if ch == 0 {
			break
		}
		b = append(b, ch)
	}
	return string(b)
}

// Ced is the gRPC backend. One loaded CED model per instance.
type Ced struct {
	base.Base
	ctxPtr   uintptr
	engineMu sync.Mutex
}

// Load resolves the GGUF and opens the C-API context.
func (c *Ced) Load(opts *pb.ModelOptions) error {
	if opts.ModelFile == "" {
		return errors.New("ced: ModelFile is required")
	}
	ctx := CppLoad(opts.ModelFile)
	if ctx == 0 {
		return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0))
	}
	c.ctxPtr = ctx
	return nil
}

// jsonTag mirrors the ced_capi JSON tag objects.
type jsonTag struct {
	Index int     `json:"index"`
	Score float32 `json:"score"`
	Label string  `json:"label"`
}

// SoundDetection classifies the clip at req.Src and returns scored AudioSet tags.
func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
	if c.ctxPtr == 0 {
		return nil, errors.New("ced: model not loaded")
	}
	if req.GetSrc() == "" {
		return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required")
	}
	topK := req.GetTopK()
	if topK <= 0 {
		topK = 10 // sensible default for a tagging response
	}

	c.engineMu.Lock()
	out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK))
	lastErr := CppLastError(c.ctxPtr)
	c.engineMu.Unlock()

	if out == "" {
		return nil, fmt.Errorf("ced: classification failed: %s", lastErr)
	}
	var tags []jsonTag
	if err := json.Unmarshal([]byte(out), &tags); err != nil {
		return nil, fmt.Errorf("ced: bad classifier JSON: %w", err)
	}

	thr := req.GetThreshold()
	resp := &pb.SoundDetectionResponse{}
	for _, t := range tags {
		if t.Score < thr {
			continue
		}
		resp.Detections = append(resp.Detections, &pb.SoundClass{
			Label: t.Label, Score: t.Score, Index: int32(t.Index),
		})
	}
	sort.Slice(resp.Detections, func(i, j int) bool {
		return resp.Detections[i].Score > resp.Detections[j].Score
	})
	return resp, nil
}

func (c *Ced) Free() error {
	c.engineMu.Lock()
	defer c.engineMu.Unlock()
	if c.ctxPtr != 0 {
		CppFree(c.ctxPtr)
		c.ctxPtr = 0
	}
	return nil
}
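
The `cstr` helper above copies a NUL-terminated string one byte at a time. As a sketch of an alternative shape for the same copy, the length can be measured first and the bytes copied in one step with `unsafe.Add` and `unsafe.Slice` (both available since Go 1.17). This version takes an `unsafe.Pointer` rather than the `uintptr` that purego hands back, and it does not free anything, so it is not a drop-in replacement. `goString` and `fromBuffer` are hypothetical names, and `fromBuffer` uses a Go byte slice purely to stand in for a C-owned buffer so the example runs without libced.

```go
package main

import (
	"fmt"
	"unsafe"
)

// goString copies the NUL-terminated bytes starting at p into a Go string.
// The caller remains responsible for freeing the original buffer.
func goString(p unsafe.Pointer) string {
	if p == nil {
		return ""
	}
	// First pass: find the terminating NUL to learn the length.
	n := 0
	for *(*byte)(unsafe.Add(p, n)) != 0 {
		n++
	}
	// Second step: view the n bytes as a slice and copy them into a string.
	return string(unsafe.Slice((*byte)(p), n))
}

// fromBuffer treats buf as a stand-in for a C allocation. buf must contain
// at least one NUL byte, otherwise goString would read past its end.
func fromBuffer(buf []byte) string {
	if len(buf) == 0 {
		return ""
	}
	return goString(unsafe.Pointer(&buf[0]))
}

func main() {
	buf := []byte("Baby cry, infant cry\x00bytes after the terminator are ignored")
	fmt.Println(fromBuffer(buf))
}
```

Measuring first avoids the repeated slice growth of a byte-by-byte `append`, at the cost of walking the string twice; for short JSON payloads either approach is likely fine.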
59
backend/go/ced/main.go
Normal file
@@ -0,0 +1,59 @@
package main

// ced sound-classification backend. Started internally by LocalAI: one gRPC
// server per loaded model. Loads libced.so via purego and registers the flat
// C-API declared in ced_capi.h. The library name can be overridden with
// CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default looks
// for the .so next to this binary.
//
// SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
// addition, and a built libced.so (see Makefile). See DESIGN.md.
import (
	"flag"
	"fmt"
	"os"

	"github.com/ebitengine/purego"
	grpc "github.com/mudler/LocalAI/pkg/grpc"
)

var addr = flag.String("addr", "localhost:50051", "the address to connect to")

type libFunc struct {
	ptr  any
	name string
}

func main() {
	libName := os.Getenv("CED_LIBRARY")
	if libName == "" {
		libName = "libced.so"
	}
	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
	if err != nil {
		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
	}

	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
	// so we can free the same pointer with ced_capi_free_string after copying
	// (purego's string return would copy and leak the original).
	for _, lf := range []libFunc{
		{&CppAbiVersion, "ced_capi_abi_version"},
		{&CppLoad, "ced_capi_load"},
		{&CppFree, "ced_capi_free"},
		{&CppLastError, "ced_capi_last_error"},
		{&CppNumClasses, "ced_capi_num_classes"},
		{&CppSampleRate, "ced_capi_sample_rate"},
		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
		{&CppFreeString, "ced_capi_free_string"},
	} {
		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
	}

	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())
	flag.Parse()
	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
		panic(err)
	}
}
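The library-name override at the top of `main` is the one piece of main.go that can be checked without a built libced.so. A small sketch of that lookup follows; `resolveLibName` and the injected `getenv` parameter are hypothetical names introduced here for testability, while the `CED_LIBRARY` variable and the `libced.so` default come from the file above.

```go
package main

import (
	"fmt"
	"os"
)

// defaultLib mirrors the fallback used in main.go.
const defaultLib = "libced.so"

// resolveLibName returns the library name to dlopen: the value of CED_LIBRARY
// when it is set and non-empty, otherwise the default. getenv is injected so
// the lookup can be exercised without touching the process environment.
func resolveLibName(getenv func(string) string) string {
	if name := getenv("CED_LIBRARY"); name != "" {
		return name
	}
	return defaultLib
}

func main() {
	fmt.Println(resolveLibName(os.Getenv))
}
```

Running the backend with `CED_LIBRARY` pointing at an absolute path would therefore bypass the dynamic loader's search path; whether run.sh sets it is not shown in this diff.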
60
backend/go/ced/package.sh
Executable file
@@ -0,0 +1,60 @@
#!/bin/bash
#
# Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
# libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
# is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
# the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.

set -e

CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."

mkdir -p "$CURDIR/package/lib"

cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
cp -avf "$CURDIR/run.sh" "$CURDIR/package/"

cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || {
    echo "ERROR: libced.so not found in $CURDIR, run 'make' first" >&2
    exit 1
}

if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
elif [ "$(uname -s)" = "Darwin" ]; then
    echo "Detected Darwin"
else
    echo "Error: Could not detect architecture"
    exit 1
fi

GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
fi

echo "Packaging completed successfully"
ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
15
backend/go/ced/run.sh
Executable file
@@ -0,0 +1,15 @@
#!/bin/bash
set -e

CURDIR=$(dirname "$(realpath "$0")")

export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"

# If a self-contained ld.so was packaged, route through it so the packaged
# libc / libstdc++ are used instead of the host's (matches the sibling backends).
if [ -f "$CURDIR/lib/ld.so" ]; then
    echo "Using lib/ld.so"
    exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
fi

exec "$CURDIR/ced-grpc" "$@"
@@ -178,6 +178,37 @@
    nvidia-cuda-12: "cuda12-parakeet-cpp"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-parakeet-cpp"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-parakeet-cpp"
- &ced
  name: "ced"
  alias: "ced"
  license: mit
  icon: https://avatars.githubusercontent.com/u/95302084
  description: |
    CED sound-event classification / audio tagging (527-class AudioSet).
    ced.cpp is a C++/ggml port that performs audio tagging over the AudioSet
    taxonomy, exposed through the SoundDetection gRPC rpc and the
    /v1/audio/classification REST endpoint. It runs on CPU, NVIDIA CUDA,
    AMD ROCm/HIP, Intel SYCL, Vulkan and NVIDIA Jetson (L4T) targets.
  urls:
    - https://github.com/mudler/ced.cpp
  tags:
    - audio-classification
    - CPU
    - GPU
    - CUDA
    - HIP
  capabilities:
    default: "cpu-ced"
    nvidia: "cuda12-ced"
    intel: "intel-sycl-f16-ced"
    metal: "metal-ced"
    amd: "rocm-ced"
    vulkan: "vulkan-ced"
    nvidia-l4t: "nvidia-l4t-arm64-ced"
    nvidia-cuda-13: "cuda13-ced"
    nvidia-cuda-12: "cuda12-ced"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced"
- &voxtral
  name: "voxtral"
  alias: "voxtral"
@@ -2650,6 +2681,121 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp
## ced
- !!merge <<: *ced
  name: "ced-development"
  capabilities:
    default: "cpu-ced-development"
    nvidia: "cuda12-ced-development"
    intel: "intel-sycl-f16-ced-development"
    metal: "metal-ced-development"
    amd: "rocm-ced-development"
    vulkan: "vulkan-ced-development"
    nvidia-l4t: "nvidia-l4t-arm64-ced-development"
    nvidia-cuda-13: "cuda13-ced-development"
    nvidia-cuda-12: "cuda12-ced-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced-development"
- !!merge <<: *ced
  name: "nvidia-l4t-arm64-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-ced"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-arm64-ced
- !!merge <<: *ced
  name: "nvidia-l4t-arm64-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-ced"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-arm64-ced
- !!merge <<: *ced
  name: "cuda13-nvidia-l4t-arm64-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ced"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ced
- !!merge <<: *ced
  name: "cuda13-nvidia-l4t-arm64-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ced"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ced
- !!merge <<: *ced
  name: "cpu-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ced"
  mirrors:
    - localai/localai-backends:latest-cpu-ced
- !!merge <<: *ced
  name: "cpu-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ced"
  mirrors:
    - localai/localai-backends:master-cpu-ced
- !!merge <<: *ced
  name: "metal-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ced"
  mirrors:
    - localai/localai-backends:latest-metal-darwin-arm64-ced
- !!merge <<: *ced
  name: "metal-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ced"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-ced
- !!merge <<: *ced
  name: "cuda12-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-ced
- !!merge <<: *ced
  name: "cuda12-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-ced"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-ced
- !!merge <<: *ced
  name: "rocm-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-rocm-hipblas-ced
- !!merge <<: *ced
  name: "rocm-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-ced"
  mirrors:
    - localai/localai-backends:master-gpu-rocm-hipblas-ced
- !!merge <<: *ced
  name: "intel-sycl-f32-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f32-ced
- !!merge <<: *ced
  name: "intel-sycl-f32-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-ced"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f32-ced
- !!merge <<: *ced
  name: "intel-sycl-f16-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-sycl-f16-ced
- !!merge <<: *ced
  name: "intel-sycl-f16-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-ced"
  mirrors:
    - localai/localai-backends:master-gpu-intel-sycl-f16-ced
- !!merge <<: *ced
  name: "vulkan-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-vulkan-ced
- !!merge <<: *ced
  name: "vulkan-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-ced"
  mirrors:
    - localai/localai-backends:master-gpu-vulkan-ced
- !!merge <<: *ced
  name: "cuda13-ced"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ced"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-ced
- !!merge <<: *ced
  name: "cuda13-ced-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ced"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-ced
## stablediffusion-ggml
- !!merge <<: *stablediffusionggml
  name: "cpu-stablediffusion-ggml"
88
core/backend/sound_classification.go
Normal file
@@ -0,0 +1,88 @@
package backend

import (
	"context"
	"fmt"
	"sort"

	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/schema"

	grpcPkg "github.com/mudler/LocalAI/pkg/grpc"
	"github.com/mudler/LocalAI/pkg/grpc/proto"
	"github.com/mudler/LocalAI/pkg/model"
)

// SoundDetectionRequest carries the knobs the HTTP layer collects for an
// audio-tagging / sound-event-classification call. Audio is the path to the
// uploaded clip on disk; TopK and Threshold are optional (0 = backend default).
type SoundDetectionRequest struct {
	Audio     string
	TopK      int32
	Threshold float32
}

func (r *SoundDetectionRequest) toProto() *proto.SoundDetectionRequest {
	return &proto.SoundDetectionRequest{
		Src:       r.Audio,
		TopK:      r.TopK,
		Threshold: r.Threshold,
	}
}

func loadSoundDetectionModel(ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (grpcPkg.Backend, error) {
	if modelConfig.Backend == "" {
		return nil, fmt.Errorf("sound classification: model %q has no backend set; supported backends include ced", modelConfig.Name)
	}
	opts := ModelOptions(modelConfig, appConfig)
	m, err := ml.Load(opts...)
	if err != nil {
		recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
		return nil, err
	}
	if m == nil {
		return nil, fmt.Errorf("could not load sound classification model")
	}
	return m, nil
}

// ModelSoundDetection runs the SoundDetection RPC against the configured
// backend and returns a normalized schema.SoundClassificationResult.
func ModelSoundDetection(ctx context.Context, req SoundDetectionRequest, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (*schema.SoundClassificationResult, error) {
	m, err := loadSoundDetectionModel(ml, modelConfig, appConfig)
	if err != nil {
		return nil, err
	}

	r, err := m.SoundDetection(ctx, req.toProto())
	if err != nil {
		return nil, err
	}
	return soundClassificationResultFromProto(modelConfig.Name, r), nil
}

// soundClassificationResultFromProto maps the backend detections to the
// HTTP-facing schema, keeping the backend's score-descending order.
func soundClassificationResultFromProto(modelName string, r *proto.SoundDetectionResponse) *schema.SoundClassificationResult {
	out := &schema.SoundClassificationResult{
		Model:      modelName,
		Detections: []schema.SoundClassification{},
	}
	if r == nil {
		return out
	}
	for _, d := range r.Detections {
		if d == nil {
			continue
		}
		out.Detections = append(out.Detections, schema.SoundClassification{
			Index: int(d.Index),
			Label: d.Label,
			Score: d.Score,
		})
	}
	sort.SliceStable(out.Detections, func(i, j int) bool {
		return out.Detections[i].Score > out.Detections[j].Score
	})
	return out
}
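`soundClassificationResultFromProto` starts from an empty, non-nil `Detections` slice. With `encoding/json` that choice is visible on the wire: a nil slice encodes as `null`, an empty one as `[]`. The sketch below illustrates the difference; the `result` and `detection` types and their JSON tags are assumptions made for this example, since the real `schema.SoundClassificationResult` definition is not part of this hunk.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// detection and result are hypothetical mirrors of the schema types; the JSON
// field names are assumed, not taken from the LocalAI source.
type detection struct {
	Index int     `json:"index"`
	Label string  `json:"label"`
	Score float32 `json:"score"`
}

type result struct {
	Model      string      `json:"model"`
	Detections []detection `json:"detections"`
}

// encode marshals r and returns the JSON text, or an error description.
func encode(r result) string {
	b, err := json.Marshal(r)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Nil slice: the field is emitted as null.
	fmt.Println(encode(result{Model: "ced"}))
	// Empty, non-nil slice: the field is emitted as [].
	fmt.Println(encode(result{Model: "ced", Detections: []detection{}}))
}
```

Clients that iterate over `detections` without a null check benefit from the second form when nothing passes the threshold.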
@@ -8,27 +8,28 @@ import (
// Usecase name constants — the canonical string values used in gallery entries,
// model configs (known_usecases), and UsecaseInfoMap keys.
const (
	UsecaseChat               = "chat"
	UsecaseCompletion         = "completion"
	UsecaseEdit               = "edit"
	UsecaseVision             = "vision"
	UsecaseEmbeddings         = "embeddings"
	UsecaseTokenize           = "tokenize"
	UsecaseImage              = "image"
	UsecaseVideo              = "video"
	UsecaseTranscript         = "transcript"
	UsecaseTTS                = "tts"
	UsecaseSoundGeneration    = "sound_generation"
	UsecaseRerank             = "rerank"
	UsecaseDetection          = "detection"
	UsecaseDepth              = "depth"
	UsecaseVAD                = "vad"
	UsecaseAudioTransform     = "audio_transform"
	UsecaseDiarization        = "diarization"
	UsecaseRealtimeAudio      = "realtime_audio"
	UsecaseFaceRecognition    = "face_recognition"
	UsecaseSpeakerRecognition = "speaker_recognition"
	UsecaseTokenClassify      = "token_classify"
	UsecaseChat                = "chat"
	UsecaseCompletion          = "completion"
	UsecaseEdit                = "edit"
	UsecaseVision              = "vision"
	UsecaseEmbeddings          = "embeddings"
	UsecaseTokenize            = "tokenize"
	UsecaseImage               = "image"
	UsecaseVideo               = "video"
	UsecaseTranscript          = "transcript"
	UsecaseTTS                 = "tts"
	UsecaseSoundGeneration     = "sound_generation"
	UsecaseRerank              = "rerank"
	UsecaseDetection           = "detection"
	UsecaseDepth               = "depth"
	UsecaseVAD                 = "vad"
	UsecaseAudioTransform      = "audio_transform"
	UsecaseDiarization         = "diarization"
	UsecaseSoundClassification = "sound_classification"
	UsecaseRealtimeAudio       = "realtime_audio"
	UsecaseFaceRecognition     = "face_recognition"
	UsecaseSpeakerRecognition  = "speaker_recognition"
	UsecaseTokenClassify       = "token_classify"
)

// GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -51,6 +52,7 @@ const (
	MethodVAD                GRPCMethod = "VAD"
	MethodAudioTransform     GRPCMethod = "AudioTransform"
	MethodDiarize            GRPCMethod = "Diarize"
	MethodSoundDetection     GRPCMethod = "SoundDetection"
	MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
	MethodFaceVerify         GRPCMethod = "FaceVerify"
	MethodFaceAnalyze        GRPCMethod = "FaceAnalyze"
@@ -165,6 +167,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
		GRPCMethod:  MethodDiarize,
		Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
	},
	UsecaseSoundClassification: {
		Flag:        FLAG_SOUND_CLASSIFICATION,
		GRPCMethod:  MethodSoundDetection,
		Description: "Sound-event classification / audio tagging (scored AudioSet labels like baby cry, glass breaking, alarms) via the SoundDetection RPC.",
	},
	UsecaseRealtimeAudio: {
		Flag:        FLAG_REALTIME_AUDIO,
		GRPCMethod:  MethodAudioToAudioStream,
@@ -68,6 +68,7 @@ var UsecaseOptions = []FieldOption{
	{Value: "face_recognition", Label: "Face Recognition"},
	{Value: "transcript", Label: "Transcript"},
	{Value: "diarization", Label: "Diarization"},
	{Value: "sound_classification", Label: "Sound Classification"},
	{Value: "speaker_recognition", Label: "Speaker Recognition"},
	{Value: "tts", Label: "TTS"},
	{Value: "sound_generation", Label: "Sound Generation"},
@@ -328,6 +328,30 @@ func DefaultRegistry() map[string]FieldMetaOverride {
			AutocompleteProvider: ProviderModelsVAD,
			Order:                63,
		},
		"pipeline.sound_detection": {
			Section:              "pipeline",
			Label:                "Sound Detection Model",
			Description:          "Model to use for sound-event classification (audio tagging, e.g. ced) in the pipeline. When set, committed realtime audio is also classified and the scored AudioSet tags are emitted as a conversation.item.sound_detection event.",
			Component:            "model-select",
			AutocompleteProvider: ProviderModels,
			Order:                64,
		},
		"pipeline.sound_detection_window_ms": {
			Section:     "pipeline",
			Label:       "Sound Detection Window (ms)",
			Description: "Server-side windowing for a sound-only realtime session: length in ms of the audio window classified each hop. 0 = client-driven (the client commits windows).",
			Component:   "number",
			Min:         f64(0),
			Order:       65,
		},
		"pipeline.sound_detection_hop_ms": {
			Section:     "pipeline",
			Label:       "Sound Detection Hop (ms)",
			Description: "Server-side windowing hop in ms: how often the server classifies the last window. 0 = client-driven.",
			Component:   "number",
			Min:         f64(0),
			Order:       66,
		},
		"pipeline.reasoning_effort": {
			Section: "pipeline",
			Label:   "Reasoning Effort",
@@ -604,6 +604,20 @@ type Pipeline struct {
	LLM           string `yaml:"llm,omitempty" json:"llm,omitempty"`
	Transcription string `yaml:"transcription,omitempty" json:"transcription,omitempty"`
	VAD           string `yaml:"vad,omitempty" json:"vad,omitempty"`
	// SoundDetection names a sound-event-classification model (e.g. ced). When
	// set, each VAD-committed realtime utterance is also run through it and the
	// scored AudioSet tags are emitted as a conversation.item.sound_detection
	// server event, alongside (and independent of) transcription.
	SoundDetection string `yaml:"sound_detection,omitempty" json:"sound_detection,omitempty"`

	// SoundDetectionWindowMs / SoundDetectionHopMs enable server-side windowing
	// for a sound-detection-only realtime session: instead of the client
	// committing audio buffers, the server classifies the last WindowMs of
	// streamed audio every HopMs and emits a sound_detection event per hop. Both
	// must be > 0 to activate; otherwise the session stays client-driven (the
	// client commits windows via input_audio_buffer.commit).
	SoundDetectionWindowMs int `yaml:"sound_detection_window_ms,omitempty" json:"sound_detection_window_ms,omitempty"`
	SoundDetectionHopMs    int `yaml:"sound_detection_hop_ms,omitempty" json:"sound_detection_hop_ms,omitempty"`

	// ReasoningEffort sets the reasoning effort (none|minimal|low|medium|high) for
	// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
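The server-side windowing described in the `Pipeline` comments (classify the last `WindowMs` of audio every `HopMs`) comes down to a little sample arithmetic. The following sketch shows one way to compute the windows; it is not the realtime handler's implementation. The 16 kHz sample rate, the `slidingWindows` helper, and the choice to emit the first window only once a full window has been buffered are all assumptions made for illustration.

```go
package main

import "fmt"

// window is a half-open sample range [Start, End).
type window struct{ Start, End int }

// msToSamples converts a duration in milliseconds to a sample count.
func msToSamples(ms, sampleRate int) int {
	return ms * sampleRate / 1000
}

// slidingWindows lists the ranges that would be classified over a stream of
// total samples: every hop samples, the most recent win samples. It returns
// nil unless both win and hop are positive, mirroring the "both must be > 0"
// activation rule.
func slidingWindows(total, win, hop int) []window {
	if win <= 0 || hop <= 0 {
		return nil
	}
	var out []window
	for end := win; end <= total; end += hop {
		out = append(out, window{Start: end - win, End: end})
	}
	return out
}

func main() {
	const sampleRate = 16000 // assumed input rate
	win := msToSamples(1000, sampleRate)
	hop := msToSamples(500, sampleRate)
	// Two seconds of streamed audio.
	for _, w := range slidingWindows(2*sampleRate, win, hop) {
		fmt.Printf("[%d, %d)\n", w.Start, w.End)
	}
}
```

With a hop shorter than the window, consecutive windows overlap, so a short event such as a glass break is likely to fall fully inside at least one of them.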
@@ -1452,6 +1466,11 @@ const (
	// so it may combine freely with other usecases.
	FLAG_TOKEN_CLASSIFY ModelConfigUsecase = 0b1000000000000000000000

	// Marks a model as wired for the SoundDetection gRPC primitive
	// (audio tagging / sound-event classification — scored AudioSet
	// labels via the SoundDetection RPC, e.g. ced).
	FLAG_SOUND_CLASSIFICATION ModelConfigUsecase = 0b10000000000000000000000

	// Common Subsets
	FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
)
@@ -1460,12 +1479,12 @@ const (
// Flags within the same group are NOT orthogonal (e.g., chat and completion are
// both text/language). A model is multimodal when its usecases span 2+ groups.
var ModalityGroups = []ModelConfigUsecase{
	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,                // text/language
	FLAG_VISION | FLAG_DETECTION,                           // visual understanding
	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO,                  // speech input — realtime_audio is any-to-any, so it counts here too
	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
	FLAG_AUDIO_TRANSFORM,                                   // audio in/out transforms
	FLAG_IMAGE | FLAG_VIDEO,                                // visual generation
	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,                           // text/language
	FLAG_VISION | FLAG_DETECTION,                                      // visual understanding
	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO | FLAG_SOUND_CLASSIFICATION, // audio input — realtime_audio is any-to-any, so it counts here too
	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO,            // audio output — and here, so a lone realtime_audio flag still reads as multimodal
	FLAG_AUDIO_TRANSFORM,                                              // audio in/out transforms
	FLAG_IMAGE | FLAG_VIDEO,                                           // visual generation
}

// IsMultimodal returns true if the given usecases span two or more orthogonal
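The modality-group rule stated in the comments (a model is multimodal when its usecases span two or more groups) can be sketched with a handful of bit flags. This is an illustration only: the flag names, the reduced group list and the `isMultimodal` helper below are hypothetical, and the body of the real `IsMultimodal` is not shown in this diff.

```go
package main

import "fmt"

// usecase is a bit set, analogous in spirit to ModelConfigUsecase.
type usecase uint32

// Hypothetical flags for the sketch; the real values live in the config package.
const (
	flagChat usecase = 1 << iota
	flagTranscript
	flagSoundClassification
	flagTTS
)

// groups is a reduced, hypothetical version of ModalityGroups.
var groups = []usecase{
	flagChat, // text/language
	flagTranscript | flagSoundClassification, // audio input
	flagTTS, // audio output
}

// spannedGroups counts the groups that share at least one bit with u.
func spannedGroups(u usecase) int {
	n := 0
	for _, g := range groups {
		if u&g != 0 {
			n++
		}
	}
	return n
}

// isMultimodal reports whether u touches two or more groups.
func isMultimodal(u usecase) bool {
	return spannedGroups(u) >= 2
}

func main() {
	// Both flags sit in the audio-input group, so this is one modality.
	fmt.Println(isMultimodal(flagTranscript | flagSoundClassification))
	// Text plus audio input spans two groups.
	fmt.Println(isMultimodal(flagChat | flagSoundClassification))
}
```

Placing sound classification in the audio-input group means a model that both transcribes and tags audio is still counted as single-modality under this rule.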
@@ -1488,29 +1507,30 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
	return map[string]ModelConfigUsecase{
		// Note: FLAG_ANY is intentionally excluded from this map
		// because it's 0 and would always match in HasUsecases checks
		"FLAG_CHAT":                FLAG_CHAT,
		"FLAG_COMPLETION":          FLAG_COMPLETION,
		"FLAG_EDIT":                FLAG_EDIT,
		"FLAG_EMBEDDINGS":          FLAG_EMBEDDINGS,
		"FLAG_RERANK":              FLAG_RERANK,
		"FLAG_IMAGE":               FLAG_IMAGE,
		"FLAG_TRANSCRIPT":          FLAG_TRANSCRIPT,
		"FLAG_TTS":                 FLAG_TTS,
		"FLAG_SOUND_GENERATION":    FLAG_SOUND_GENERATION,
		"FLAG_TOKENIZE":            FLAG_TOKENIZE,
		"FLAG_VAD":                 FLAG_VAD,
		"FLAG_LLM":                 FLAG_LLM,
		"FLAG_VIDEO":               FLAG_VIDEO,
		"FLAG_DETECTION":           FLAG_DETECTION,
		"FLAG_VISION":              FLAG_VISION,
		"FLAG_FACE_RECOGNITION":    FLAG_FACE_RECOGNITION,
		"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
		"FLAG_AUDIO_TRANSFORM":     FLAG_AUDIO_TRANSFORM,
		"FLAG_DIARIZATION":         FLAG_DIARIZATION,
		"FLAG_REALTIME_AUDIO":      FLAG_REALTIME_AUDIO,
		"FLAG_SCORE":               FLAG_SCORE,
		"FLAG_DEPTH":               FLAG_DEPTH,
		"FLAG_TOKEN_CLASSIFY":      FLAG_TOKEN_CLASSIFY,
		"FLAG_CHAT":                 FLAG_CHAT,
		"FLAG_COMPLETION":           FLAG_COMPLETION,
		"FLAG_EDIT":                 FLAG_EDIT,
		"FLAG_EMBEDDINGS":           FLAG_EMBEDDINGS,
		"FLAG_RERANK":               FLAG_RERANK,
		"FLAG_IMAGE":                FLAG_IMAGE,
		"FLAG_TRANSCRIPT":           FLAG_TRANSCRIPT,
		"FLAG_TTS":                  FLAG_TTS,
		"FLAG_SOUND_GENERATION":     FLAG_SOUND_GENERATION,
		"FLAG_TOKENIZE":             FLAG_TOKENIZE,
		"FLAG_VAD":                  FLAG_VAD,
		"FLAG_LLM":                  FLAG_LLM,
		"FLAG_VIDEO":                FLAG_VIDEO,
		"FLAG_DETECTION":            FLAG_DETECTION,
		"FLAG_VISION":               FLAG_VISION,
		"FLAG_FACE_RECOGNITION":     FLAG_FACE_RECOGNITION,
		"FLAG_SPEAKER_RECOGNITION":  FLAG_SPEAKER_RECOGNITION,
		"FLAG_AUDIO_TRANSFORM":      FLAG_AUDIO_TRANSFORM,
		"FLAG_DIARIZATION":          FLAG_DIARIZATION,
		"FLAG_SOUND_CLASSIFICATION": FLAG_SOUND_CLASSIFICATION,
		"FLAG_REALTIME_AUDIO":       FLAG_REALTIME_AUDIO,
		"FLAG_SCORE":                FLAG_SCORE,
		"FLAG_DEPTH":                FLAG_DEPTH,
		"FLAG_TOKEN_CLASSIFY":       FLAG_TOKEN_CLASSIFY,
	}
}

@@ -1713,6 +1733,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
		}
	}

	if (u & FLAG_SOUND_CLASSIFICATION) == FLAG_SOUND_CLASSIFICATION {
		// ced is a sound-event tagger (AudioSet labels) surfaced via the
		// SoundDetection gRPC. Models without an explicit known_usecases
		// still surface when they run on one of these backends.
		soundClassificationBackends := []string{"ced"}
		if !slices.Contains(soundClassificationBackends, c.Backend) {
			return false
		}
	}

	if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
		// Backends that own a single any-to-any loop and implement
		// AudioToAudioStream — listed here so models without an explicit
@@ -48,6 +48,10 @@ var RouteFeatureRegistry = []RouteFeature{
	{"POST", "/v1/audio/diarization", FeatureAudioDiarization},
	{"POST", "/audio/diarization", FeatureAudioDiarization},

	// Audio classification (sound-event tagging)
	{"POST", "/v1/audio/classification", FeatureAudioClassification},
	{"POST", "/audio/classification", FeatureAudioClassification},

	// Audio speech / TTS
	{"POST", "/v1/audio/speech", FeatureAudioSpeech},
	{"POST", "/audio/speech", FeatureAudioSpeech},
@@ -172,6 +176,7 @@ func APIFeatureMetas() []FeatureMeta {
		{FeatureAudioSpeech, "Audio Speech / TTS", true},
		{FeatureAudioTranscription, "Audio Transcription", true},
		{FeatureAudioDiarization, "Audio Diarization", true},
		{FeatureAudioClassification, "Audio Classification", true},
		{FeatureVAD, "Voice Activity Detection", true},
		{FeatureDetection, "Detection", true},
		{FeatureVideo, "Video Generation", true},
@@ -38,24 +38,25 @@ const (
	FeatureQuantization = "quantization"

	// API features (default ON for new users)
	FeatureChat               = "chat"
	FeatureImages             = "images"
	FeatureAudioSpeech        = "audio_speech"
	FeatureAudioTranscription = "audio_transcription"
	FeatureAudioDiarization   = "audio_diarization"
	FeatureVAD                = "vad"
	FeatureDetection          = "detection"
	FeatureVideo              = "video"
	FeatureEmbeddings         = "embeddings"
	FeatureSound              = "sound"
	FeatureRealtime           = "realtime"
	FeatureRerank             = "rerank"
	FeatureTokenize           = "tokenize"
	FeatureMCP                = "mcp"
	FeatureStores             = "stores"
	FeatureFaceRecognition    = "face_recognition"
	FeatureVoiceRecognition   = "voice_recognition"
	FeatureAudioTransform     = "audio_transform"
	FeatureChat                = "chat"
	FeatureImages              = "images"
	FeatureAudioSpeech         = "audio_speech"
	FeatureAudioTranscription  = "audio_transcription"
	FeatureAudioDiarization    = "audio_diarization"
	FeatureAudioClassification = "audio_classification"
	FeatureVAD                 = "vad"
	FeatureDetection           = "detection"
	FeatureVideo               = "video"
	FeatureEmbeddings          = "embeddings"
	FeatureSound               = "sound"
	FeatureRealtime            = "realtime"
	FeatureRerank              = "rerank"
	FeatureTokenize            = "tokenize"
	FeatureMCP                 = "mcp"
	FeatureStores              = "stores"
	FeatureFaceRecognition     = "face_recognition"
	FeatureVoiceRecognition    = "voice_recognition"
	FeatureAudioTransform      = "audio_transform"
	// FeaturePIIFilter gates the synchronous PII analyze/redact service
	// (POST /api/pii/{analyze,redact}). Default ON like the other API
	// features; the admin-only events log is gated separately in-handler.
@@ -71,7 +72,7 @@ var GeneralFeatures = []string{FeatureFineTuning, FeatureQuantization}
// APIFeatures lists API endpoint features (default ON).
var APIFeatures = []string{
	FeatureChat, FeatureImages, FeatureAudioSpeech, FeatureAudioTranscription,
	FeatureAudioDiarization,
	FeatureAudioDiarization, FeatureAudioClassification,
	FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound,
	FeatureRealtime, FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores,
	FeatureFaceRecognition, FeatureVoiceRecognition, FeatureAudioTransform,
@@ -32,9 +32,9 @@ var instructionDefs = []instructionDef{
	},
	{
		Name:        "audio",
		Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, and sound generation",
		Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, sound classification, and sound generation",
		Tags:        []string{"audio"},
		Intro:       "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format).",
		Intro:       "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format). Sound classification (/v1/audio/classification) returns scored AudioSet sound-event tags (audio tagging via the ced backend); top_k and threshold control the returned set.",
	},
	{
		Name:        "images",
@@ -93,16 +93,31 @@ type Session struct {
|
||||
Voice string
|
||||
TurnDetection *types.TurnDetectionUnion // "server_vad", "semantic_vad" or "none"
|
||||
InputAudioTranscription *types.AudioTranscription
|
||||
Tools []types.ToolUnion
|
||||
ToolChoice *types.ToolChoiceUnion
|
||||
Conversations map[string]*Conversation
|
||||
InputAudioBuffer []byte
|
||||
AudioBufferLock sync.Mutex
|
||||
OpusFrames [][]byte
|
||||
OpusFramesLock sync.Mutex
|
||||
Instructions string
|
||||
DefaultConversationID string
|
||||
ModelInterface Model
|
||||
|
||||
// SoundDetectionEnabled is set when pipeline.sound_detection names a
|
||||
// sound-event-classification model. When true, each committed utterance is
|
||||
// also run through ModelInterface.SoundDetection and the scored tags are
|
||||
// emitted as a conversation.item.sound_detection event. SoundDetectionTopK
|
||||
// and SoundDetectionThreshold are the knobs passed to that call (defaults:
|
||||
// top_k=5, threshold=0).
|
||||
SoundDetectionEnabled bool
|
||||
SoundDetectionTopK int
|
||||
SoundDetectionThreshold float32
|
||||
// SoundDetectionWindowMs / SoundDetectionHopMs, when both > 0, enable
|
||||
// server-side windowing for a sound-only session: the server classifies the
|
||||
// last WindowMs of streamed audio every HopMs (no client commits needed).
|
||||
SoundDetectionWindowMs int
|
||||
SoundDetectionHopMs int
|
||||
Tools []types.ToolUnion
|
||||
ToolChoice *types.ToolChoiceUnion
|
||||
Conversations map[string]*Conversation
|
||||
InputAudioBuffer []byte
|
||||
AudioBufferLock sync.Mutex
|
||||
OpusFrames [][]byte
|
||||
OpusFramesLock sync.Mutex
|
||||
Instructions string
|
||||
DefaultConversationID string
|
||||
ModelInterface Model
|
||||
// The pipeline model config or the config for an any-to-any model
|
||||
ModelConfig *config.ModelConfig
|
||||
InputSampleRate int
|
||||
@@ -250,6 +265,10 @@ type Model interface {
|
||||
// TranscribeStream transcribes audio incrementally, invoking onDelta for each
|
||||
// transcript text fragment and returning the final aggregated result.
|
||||
TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
|
||||
// SoundDetection classifies a committed audio window into scored AudioSet
|
||||
// sound-event tags. topK caps the number of returned tags (0 = backend
|
||||
// default), threshold drops tags below the given score (0 = keep all).
|
||||
SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error)
|
||||
PredictConfig() *config.ModelConfig
|
||||
}
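
The `topK` and `threshold` contract described in the `SoundDetection` comment can be illustrated with a small standalone sketch. The `tag` type and `selectTags` helper below are hypothetical and are not part of LocalAI or the ced backend; in the real pipeline the selection happens inside the backend. One stated assumption: here `topK <= 0` means "no cap", whereas the interface comment says the backend applies its own default.

```go
package main

import (
	"fmt"
	"sort"
)

// tag is a hypothetical stand-in for a scored sound-event tag.
type tag struct {
	Label string
	Score float32
}

// selectTags drops tags scoring below threshold, sorts the rest by
// descending score, and keeps at most topK of them. topK <= 0 means
// "no cap" in this sketch (an assumption; the real backend uses a default).
func selectTags(tags []tag, topK int, threshold float32) []tag {
	kept := make([]tag, 0, len(tags))
	for _, t := range tags {
		if t.Score >= threshold {
			kept = append(kept, t)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	if topK > 0 && len(kept) > topK {
		kept = kept[:topK]
	}
	return kept
}

func main() {
	tags := []tag{{"Speech", 0.42}, {"Baby cry, infant cry", 0.91}, {"Dog", 0.05}}
	for _, t := range selectTags(tags, 2, 0.1) {
		fmt.Printf("%s %.2f\n", t.Label, t.Score)
	}
}
```

With a threshold of 0 every non-negative score passes the filter, which matches the "0 = keep all" wording in the comment.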
|
||||
|
||||
@@ -399,7 +418,7 @@ func prepareRealtimeConfig(cfg *config.ModelConfig) (errCode, errMsg string, ok
|
||||
return "", "", true
|
||||
}
|
||||
|
||||
if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" {
|
||||
if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" && cfg.Pipeline.SoundDetection == "" {
|
||||
return "invalid_model", "Model is not a pipeline model", false
|
||||
}
|
||||
return "", "", true
|
||||
@@ -469,6 +488,26 @@ func runRealtimeSession(application *application.Application, t Transport, model
|
||||
|
||||
sttModel := cfg.Pipeline.Transcription
|
||||
|
||||
// A sound-detection-only pipeline (sound_detection set, no transcription/LLM)
|
||||
// activates on sounds, not speech, so it runs WITHOUT the voice VAD: the
|
||||
// session defaults to turn_detection none and the client drives windowing via
|
||||
// input_audio_buffer.commit. There is no transcription stage in that case.
|
||||
soundOnly := cfg.Pipeline.SoundDetection != "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.LLM == ""
|
||||
|
||||
turnDetection := &types.TurnDetectionUnion{
|
||||
ServerVad: &types.ServerVad{
|
||||
Threshold: 0.5,
|
||||
PrefixPaddingMs: 300,
|
||||
SilenceDurationMs: 500,
|
||||
CreateResponse: true,
|
||||
},
|
||||
}
|
||||
inputAudioTranscription := &types.AudioTranscription{Model: sttModel}
|
||||
if soundOnly {
|
||||
turnDetection = nil // turn_detection none: no VAD
|
||||
inputAudioTranscription = nil // no transcription stage
|
||||
}
|
||||
|
||||
// Compose the system prompt: prepend the assistant prompt when we have
|
||||
// one (it teaches the model the safety rules and tool recipes), then the
|
||||
// session's default voice instructions. Order matches chat.go's
|
||||
@@ -480,30 +519,26 @@ func runRealtimeSession(application *application.Application, t Transport, model
|
||||
|
||||
sessionID := generateSessionID()
|
||||
session := &Session{
|
||||
ID: sessionID,
|
||||
TranscriptionOnly: false,
|
||||
Model: model,
|
||||
Voice: cfg.TTSConfig.Voice,
|
||||
Instructions: instructions,
|
||||
ModelConfig: cfg,
|
||||
Tools: assistantTools,
|
||||
AssistantTools: assistantTools,
|
||||
AssistantExecutor: assistantExecutor,
|
||||
TurnDetection: &types.TurnDetectionUnion{
|
||||
ServerVad: &types.ServerVad{
|
||||
Threshold: 0.5,
|
||||
PrefixPaddingMs: 300,
|
||||
SilenceDurationMs: 500,
|
||||
CreateResponse: true,
|
||||
},
|
||||
},
|
||||
InputAudioTranscription: &types.AudioTranscription{
|
||||
Model: sttModel,
|
||||
},
|
||||
Conversations: make(map[string]*Conversation),
|
||||
InputSampleRate: defaultRemoteSampleRate,
|
||||
OutputSampleRate: defaultRemoteSampleRate,
|
||||
MaxHistoryItems: resolveMaxHistoryItems(cfg),
|
||||
ID: sessionID,
|
||||
TranscriptionOnly: false,
|
||||
Model: model,
|
||||
Voice: cfg.TTSConfig.Voice,
|
||||
Instructions: instructions,
|
||||
ModelConfig: cfg,
|
||||
Tools: assistantTools,
|
||||
AssistantTools: assistantTools,
|
||||
AssistantExecutor: assistantExecutor,
|
||||
TurnDetection: turnDetection,
|
||||
InputAudioTranscription: inputAudioTranscription,
|
||||
Conversations: make(map[string]*Conversation),
|
||||
InputSampleRate: defaultRemoteSampleRate,
|
||||
OutputSampleRate: defaultRemoteSampleRate,
|
||||
MaxHistoryItems: resolveMaxHistoryItems(cfg),
|
||||
SoundDetectionEnabled: cfg.Pipeline.SoundDetection != "",
|
||||
SoundDetectionTopK: defaultSoundDetectionTopK,
|
||||
SoundDetectionThreshold: 0,
|
||||
SoundDetectionWindowMs: cfg.Pipeline.SoundDetectionWindowMs,
|
||||
SoundDetectionHopMs: cfg.Pipeline.SoundDetectionHopMs,
|
||||
}
|
||||
|
||||
// Create a default conversation
|
||||
@@ -517,14 +552,24 @@ func runRealtimeSession(application *application.Application, t Transport, model
|
||||
session.Conversations[conversationID] = conversation
|
||||
session.DefaultConversationID = conversationID
|
||||
|
||||
m, err := newModel(
|
||||
&cfg.Pipeline,
|
||||
application.ModelConfigLoader(),
|
||||
application.ModelLoader(),
|
||||
application.ApplicationConfig(),
|
||||
evaluator,
|
||||
buildRealtimeRoutingContext(application, sessionID),
|
||||
)
|
||||
var m Model
|
||||
if soundOnly {
|
||||
m, err = newSoundDetectionOnlyModel(
|
||||
&cfg.Pipeline,
|
||||
application.ModelConfigLoader(),
|
||||
application.ModelLoader(),
|
||||
application.ApplicationConfig(),
|
||||
)
|
||||
} else {
|
||||
m, err = newModel(
|
||||
&cfg.Pipeline,
|
||||
application.ModelConfigLoader(),
|
||||
application.ModelLoader(),
|
||||
application.ApplicationConfig(),
|
||||
evaluator,
|
||||
buildRealtimeRoutingContext(application, sessionID),
|
||||
)
|
||||
}
|
||||
if err != nil {
|
||||
xlog.Error("failed to load model", "error", err)
|
||||
sendError(t, "model_load_error", "Failed to load model", "", "")
|
||||
@@ -605,6 +650,20 @@ func runRealtimeSession(application *application.Application, t Transport, model
|
||||
|
||||
toggleVAD()
|
||||
|
||||
// Server-side sound-detection windowing (option B): for a sound-only session
|
||||
// with window/hop configured, the server classifies the last window of
|
||||
// streamed audio on a timer, so the client only has to stream (no commits).
|
||||
// This runs independent of VAD (sound events are not speech).
|
||||
var soundWindowDone chan struct{}
|
||||
if soundOnly && session.SoundDetectionWindowMs > 0 && session.SoundDetectionHopMs > 0 {
|
||||
soundWindowDone = make(chan struct{})
|
||||
wg.Go(func() {
|
||||
handleSoundWindow(session, t, soundWindowDone)
|
||||
})
|
||||
xlog.Debug("Starting server-side sound-detection windowing",
|
||||
"window_ms", session.SoundDetectionWindowMs, "hop_ms", session.SoundDetectionHopMs)
|
||||
}
|
||||
|
||||
for {
|
||||
msg, err = t.ReadEvent()
|
||||
if err != nil {
|
||||
@@ -880,6 +939,10 @@ func runRealtimeSession(application *application.Application, t Transport, model
|
||||
if vadServerStarted {
|
||||
close(done)
|
||||
}
|
||||
// Stop the server-side sound-detection windowing goroutine (if running).
|
||||
if soundWindowDone != nil {
|
||||
close(soundWindowDone)
|
||||
}
|
||||
wg.Wait()
|
||||
|
||||
// Remove the session from the sessions map
|
||||
@@ -971,6 +1034,10 @@ func updateTransSession(session *Session, update *types.SessionUnion, cl *config
|
||||
|
||||
session.ModelInterface = m
|
||||
session.ModelConfig = cfg
|
||||
session.SoundDetectionEnabled = cfg.Pipeline.SoundDetection != ""
|
||||
if session.SoundDetectionTopK <= 0 {
|
||||
session.SoundDetectionTopK = defaultSoundDetectionTopK
|
||||
}
|
||||
}
|
||||
|
||||
if trUpd != nil {
|
||||
@@ -1343,7 +1410,8 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
|
||||
|
||||
// TODO: If we have a real any-to-any model then transcription is optional
|
||||
var transcript string
|
||||
if session.InputAudioTranscription != nil {
|
||||
switch {
|
||||
case session.InputAudioTranscription != nil:
|
||||
// emitTranscription streams transcript deltas when
|
||||
// pipeline.streaming.transcription is set, otherwise emits a single
|
||||
// completed event; either way it returns the final transcript text.
|
||||
@@ -1358,13 +1426,27 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
|
||||
sendError(t, "transcription_failed", err.Error(), "", "event_TODO")
|
||||
return
|
||||
}
|
||||
} else {
|
||||
case session.SoundDetectionEnabled:
|
||||
// Sound-detection-only session: no transcription and no LLM. The
|
||||
// sound-detection emit below carries the result; there is no any-to-any
|
||||
// path to fall into. Windowing is client-driven (turn_detection none +
|
||||
// input_audio_buffer.commit), so this is not voice-gated.
|
||||
default:
|
||||
// The voice gate runs only on the transcription path above; if an
|
||||
// any-to-any model path is added here, join the gate before responding.
|
||||
sendNotImplemented(t, "any-to-any models")
|
||||
return
|
||||
}
|
||||
|
||||
// Sound-event detection is additive to transcription: classify the same
|
||||
// committed window and emit its scored AudioSet tags as a separate event.
|
||||
// A failure here is logged but must never abort the turn.
|
||||
if session.SoundDetectionEnabled {
|
||||
if sderr := emitSoundDetection(ctx, t, session, generateItemID(), f.Name()); sderr != nil {
|
||||
xlog.Error("sound detection failed", "error", sderr)
|
||||
}
|
||||
}
|
||||
|
||||
// Join on the resolution before any side-effecting step.
|
||||
var speaker *types.Speaker
|
||||
if runResolve {
|
||||
@@ -1415,11 +1497,94 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
|
||||
}
|
||||
}
|
||||
|
||||
if !session.TranscriptionOnly {
|
||||
// Generate an LLM response only when there is a transcript to feed it. A
|
||||
// sound-detection-only session (no transcription) has no LLM stage, so it
|
||||
// stops here after emitting the sound-detection event.
|
||||
if session.InputAudioTranscription != nil && !session.TranscriptionOnly {
|
||||
generateResponse(ctx, session, utt, transcript, speaker, conv, t)
|
||||
}
|
||||
}
|
||||
|
||||
// handleSoundWindow runs server-side windowed sound-event detection (option B):
|
||||
// every HopMs it classifies the last WindowMs of streamed audio and emits a
|
||||
// sound_detection event, so a sound-only client only has to stream audio (no
|
||||
// input_audio_buffer.commit). It keeps the input buffer trimmed to one window
|
||||
// so a long stream stays bounded. Runs until done is closed. This is
|
||||
// independent of VAD: sound events are not speech.
|
||||
func handleSoundWindow(session *Session, t Transport, done chan struct{}) {
|
||||
ticker := time.NewTicker(time.Duration(session.SoundDetectionHopMs) * time.Millisecond)
|
||||
defer ticker.Stop()
|
||||
|
||||
for {
|
||||
select {
|
||||
case <-done:
|
||||
return
|
||||
case <-ticker.C:
|
||||
classifySoundWindow(session, t)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// classifySoundWindow is one windowing tick: it snapshots the most recent
|
||||
// WindowMs of buffered audio (trimming the buffer so a long stream stays
|
||||
// bounded) and, when there is enough, classifies it and emits a sound_detection
|
||||
// event. Extracted from handleSoundWindow so it can be driven synchronously in
|
||||
// tests.
|
||||
func classifySoundWindow(session *Session, t Transport) {
|
||||
const bytesPerSample = 2 // 16-bit mono PCM
|
||||
sr := session.InputSampleRate
|
||||
windowBytes := session.SoundDetectionWindowMs * sr / 1000 * bytesPerSample
|
||||
minBytes := sr / 100 * bytesPerSample // ~10ms before classifying
|
||||
|
||||
session.AudioBufferLock.Lock()
|
||||
// Keep only the most recent window so a long stream stays bounded.
|
||||
if windowBytes > 0 && len(session.InputAudioBuffer) > windowBytes {
|
||||
trimmed := make([]byte, windowBytes)
|
||||
copy(trimmed, session.InputAudioBuffer[len(session.InputAudioBuffer)-windowBytes:])
|
||||
session.InputAudioBuffer = trimmed
|
||||
}
|
||||
window := make([]byte, len(session.InputAudioBuffer))
|
||||
copy(window, session.InputAudioBuffer)
|
||||
session.AudioBufferLock.Unlock()
|
||||
|
||||
if len(window) < minBytes {
|
||||
return // not enough audio buffered yet
|
||||
}
|
||||
path, err := writeWindowWAV(window, sr)
|
||||
if err != nil {
|
||||
xlog.Error("sound window: failed to write wav", "error", err)
|
||||
return
|
||||
}
|
||||
if sderr := emitSoundDetection(context.Background(), t, session, generateItemID(), path); sderr != nil {
|
||||
xlog.Error("sound window: detection failed", "error", sderr)
|
||||
}
|
||||
if rerr := os.Remove(path); rerr != nil {
|
||||
xlog.Debug("sound window: temp cleanup failed", "error", rerr)
|
||||
}
|
||||
}
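
The byte arithmetic in `classifySoundWindow` is easy to check by hand. The sketch below reproduces it under the same 16-bit mono PCM assumption; `windowBytes` and `lastWindow` are hypothetical helper names introduced only for illustration.

```go
package main

import "fmt"

const bytesPerSample = 2 // 16-bit mono PCM

// windowBytes returns the byte length of windowMs of 16-bit mono PCM.
func windowBytes(windowMs, sampleRate int) int {
	return windowMs * sampleRate / 1000 * bytesPerSample
}

// lastWindow returns a copy of the trailing n bytes of buf, or a copy of
// the whole buffer when it is shorter than n (or n is not positive).
func lastWindow(buf []byte, n int) []byte {
	if n > 0 && len(buf) > n {
		buf = buf[len(buf)-n:]
	}
	out := make([]byte, len(buf))
	copy(out, buf)
	return out
}

func main() {
	n := windowBytes(200, 16000)
	fmt.Println(n)                                       // 6400
	fmt.Println(len(lastWindow(make([]byte, 10000), n))) // 6400
	fmt.Println(16000 / 100 * bytesPerSample)            // 320 (the ~10ms minimum)
}
```

Multiplying by `bytesPerSample` last keeps the result a whole number of samples, so the trimmed buffer never starts in the middle of a 16-bit sample.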
|
||||
|
||||
// writeWindowWAV writes mono 16-bit PCM to a temp WAV at the given sample rate
|
||||
// (the ced classifier reads the declared rate and resamples). Returns the path;
|
||||
// the caller removes it.
|
||||
func writeWindowWAV(pcm []byte, sampleRate int) (string, error) {
|
||||
f, err := os.CreateTemp("", "realtime-sound-window-*.wav")
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
defer func() { _ = f.Close() }()
|
||||
hdr := laudio.NewWAVHeaderWithRate(uint32(len(pcm)), uint32(sampleRate))
|
||||
if err := hdr.Write(f); err != nil {
|
||||
_ = os.Remove(f.Name())
|
||||
return "", err
|
||||
}
|
||||
if _, err := f.Write(pcm); err != nil {
|
||||
_ = os.Remove(f.Name())
|
||||
return "", err
|
||||
}
|
||||
_ = f.Sync()
|
||||
return f.Name(), nil
|
||||
}
|
||||
|
||||
func runVAD(ctx context.Context, session *Session, adata []int16) ([]schema.VADSegment, error) {
|
||||
soundIntBuffer := &audio.IntBuffer{
|
||||
Format: &audio.Format{SampleRate: localSampleRate, NumChannels: 1},
|
||||
|
||||
@@ -75,6 +75,11 @@ type fakeModel struct {
|
||||
transcribeDeltas []string
|
||||
transcribeFinal *schema.TranscriptionResult
|
||||
|
||||
// soundDetectionResult/soundDetectionErr drive the SoundDetection double so
|
||||
// the sound-event path can be exercised deterministically.
|
||||
soundDetectionResult *schema.SoundClassificationResult
|
||||
soundDetectionErr error
|
||||
|
||||
// Predict streaming: predictTokens are replayed through the token callback
|
||||
// (simulating streamed LLM output); predictResp/predictErr are returned by
|
||||
// the deferred predict function. predictChunkDeltas, when set, are delivered
|
||||
@@ -95,6 +100,13 @@ func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, stri
|
||||
return m.transcribeFinal, nil
|
||||
}
|
||||
|
||||
func (m *fakeModel) SoundDetection(context.Context, string, int, float32) (*schema.SoundClassificationResult, error) {
|
||||
if m.soundDetectionErr != nil {
|
||||
return nil, m.soundDetectionErr
|
||||
}
|
||||
return m.soundDetectionResult, nil
|
||||
}
|
||||
|
||||
func (m *fakeModel) Predict(_ context.Context, msgs schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) {
|
||||
m.lastMessages = msgs
|
||||
if m.predictErr != nil {
|
||||
|
||||
@@ -31,10 +31,11 @@ var (
|
||||
// This means that we will fake an Any-to-Any model by overriding some of the gRPC client methods
|
||||
// which are for Any-To-Any models, but instead we will call a pipeline (e.g. STT->LLM->TTS)
|
||||
type wrappedModel struct {
|
||||
TTSConfig *config.ModelConfig
|
||||
TranscriptionConfig *config.ModelConfig
|
||||
LLMConfig *config.ModelConfig
|
||||
VADConfig *config.ModelConfig
|
||||
TTSConfig *config.ModelConfig
|
||||
TranscriptionConfig *config.ModelConfig
|
||||
LLMConfig *config.ModelConfig
|
||||
VADConfig *config.ModelConfig
|
||||
SoundDetectionConfig *config.ModelConfig
|
||||
|
||||
appConfig *config.ApplicationConfig
|
||||
modelLoader *model.ModelLoader
|
||||
@@ -64,8 +65,9 @@ type anyToAnyModel struct {
|
||||
}
|
||||
|
||||
type transcriptOnlyModel struct {
|
||||
TranscriptionConfig *config.ModelConfig
|
||||
VADConfig *config.ModelConfig
|
||||
TranscriptionConfig *config.ModelConfig
|
||||
VADConfig *config.ModelConfig
|
||||
SoundDetectionConfig *config.ModelConfig
|
||||
|
||||
appConfig *config.ApplicationConfig
|
||||
modelLoader *model.ModelLoader
|
||||
@@ -80,6 +82,10 @@ func (m *transcriptOnlyModel) Transcribe(ctx context.Context, audio, language st
|
||||
return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
|
||||
}
|
||||
|
||||
func (m *transcriptOnlyModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
|
||||
return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold)
|
||||
}
|
||||
|
||||
func (m *transcriptOnlyModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
|
||||
return nil, fmt.Errorf("predict operation not supported in transcript-only mode")
|
||||
}
|
||||
@@ -108,6 +114,10 @@ func (m *wrappedModel) Transcribe(ctx context.Context, audio, language string, t
|
||||
return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
|
||||
}
|
||||
|
||||
func (m *wrappedModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
|
||||
return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold)
|
||||
}
|
||||
|
||||
func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
|
||||
input := schema.OpenAIRequest{
|
||||
Messages: messages,
|
||||
@@ -399,6 +409,39 @@ func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionC
|
||||
return final, nil
|
||||
}
|
||||
|
||||
// modelSoundDetection runs sound-event classification against the session's
|
||||
// sound-classification model config, mirroring how Transcribe dispatches to
|
||||
// the transcription backend. Returns an error when no sound-detection model is
|
||||
// configured for the session.
|
||||
func modelSoundDetection(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, soundConfig *config.ModelConfig, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
|
||||
if soundConfig == nil {
|
||||
return nil, fmt.Errorf("sound detection is not configured for this session")
|
||||
}
|
||||
return backend.ModelSoundDetection(ctx, backend.SoundDetectionRequest{
|
||||
Audio: audio,
|
||||
TopK: int32(topK),
|
||||
Threshold: threshold,
|
||||
}, ml, *soundConfig, appConfig)
|
||||
}
|
||||
|
||||
// loadSoundDetectionConfig resolves the optional sound-classification model
|
||||
// config named by pipeline.sound_detection. Returns (nil, nil) when no model
|
||||
// is configured so sound detection stays additive and never blocks session
|
||||
// setup.
|
||||
func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader) (*config.ModelConfig, error) {
|
||||
if pipeline.SoundDetection == "" {
|
||||
return nil, nil
|
||||
}
|
||||
cfg, err := cl.LoadModelConfigFileByName(pipeline.SoundDetection, ml.ModelPath)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("failed to load sound detection config: %w", err)
|
||||
}
|
||||
if valid, _ := cfg.Validate(); !valid {
|
||||
return nil, fmt.Errorf("failed to validate sound detection config %q", pipeline.SoundDetection)
|
||||
}
|
||||
return cfg, nil
|
||||
}
|
||||
|
||||
func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
|
||||
cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
|
||||
if err != nil {
|
||||
@@ -420,9 +463,15 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
|
||||
return nil, nil, fmt.Errorf("failed to validate config: %w", err)
|
||||
}
|
||||
|
||||
cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
|
||||
if err != nil {
|
||||
return nil, nil, err
|
||||
}
|
||||
|
||||
return &transcriptOnlyModel{
|
||||
TranscriptionConfig: cfgSST,
|
||||
VADConfig: cfgVAD,
|
||||
TranscriptionConfig: cfgSST,
|
||||
VADConfig: cfgVAD,
|
||||
SoundDetectionConfig: cfgSound,
|
||||
|
||||
confLoader: cl,
|
||||
modelLoader: ml,
|
||||
@@ -430,6 +479,27 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
|
||||
}, cfgSST, nil
|
||||
}
|
||||
|
||||
// newSoundDetectionOnlyModel builds a realtime model that only does sound-event
|
||||
// classification: no VAD, transcription, LLM or TTS stages are loaded. Used for
|
||||
// a sound-detection-only realtime session, which activates on sounds (not
|
||||
// speech) and is driven by client-side windowing (turn_detection none +
|
||||
// input_audio_buffer.commit) rather than the voice VAD loop.
|
||||
func newSoundDetectionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, error) {
|
||||
cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
if cfgSound == nil {
|
||||
return nil, fmt.Errorf("a sound-only realtime session requires pipeline.sound_detection")
|
||||
}
|
||||
return &transcriptOnlyModel{
|
||||
SoundDetectionConfig: cfgSound,
|
||||
confLoader: cl,
|
||||
modelLoader: ml,
|
||||
appConfig: appConfig,
|
||||
}, nil
|
||||
}
|
||||
|
||||
// RealtimeRoutingContext is the bundle of routing dependencies the
|
||||
// realtime pipeline needs to consult router.Resolve per turn. nil-safe:
|
||||
// passing nil skips routing entirely and preserves the historical "one
|
||||
@@ -544,11 +614,17 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
|
||||
return nil, fmt.Errorf("failed to validate config: %w", err)
|
||||
}
|
||||
|
||||
cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
wm := &wrappedModel{
|
||||
TTSConfig: cfgTTS,
|
||||
TranscriptionConfig: cfgSST,
|
||||
LLMConfig: cfgLLM,
|
||||
VADConfig: cfgVAD,
|
||||
TTSConfig: cfgTTS,
|
||||
TranscriptionConfig: cfgSST,
|
||||
LLMConfig: cfgLLM,
|
||||
VADConfig: cfgVAD,
|
||||
SoundDetectionConfig: cfgSound,
|
||||
|
||||
confLoader: cl,
|
||||
modelLoader: ml,
|
||||
|
||||
48
core/http/endpoints/openai/realtime_sound_detection.go
Normal file
@@ -0,0 +1,48 @@
package openai

import (
	"context"

	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
)

// defaultSoundDetectionTopK is the number of scored tags requested per
// committed utterance when the session does not pin its own top_k.
const defaultSoundDetectionTopK = 5

// emitSoundDetection classifies a committed utterance into sound-event tags and
// emits a conversation.item.sound_detection event for it. It mirrors
// emitTranscription's unary path: it calls the session's sound-event
// classifier, maps the scored tags onto the server event, and sends it over
// the transport. Sound detection is additive to transcription: its result is
// emitted independently and a failure here is the caller's to log, never a
// reason to abort the turn.
func emitSoundDetection(ctx context.Context, t Transport, session *Session, itemID, audioPath string) error {
	topK := session.SoundDetectionTopK
	if topK <= 0 {
		topK = defaultSoundDetectionTopK
	}

	result, err := session.ModelInterface.SoundDetection(ctx, audioPath, topK, session.SoundDetectionThreshold)
	if err != nil {
		return err
	}

	detections := make([]types.SoundDetectionTag, 0)
	if result != nil {
		for _, d := range result.Detections {
			detections = append(detections, types.SoundDetectionTag{
				Label: d.Label,
				Score: d.Score,
				Index: d.Index,
			})
		}
	}

	return t.SendEvent(types.ConversationItemSoundDetectionEvent{
		ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
		ItemID:          itemID,
		ContentIndex:    0,
		Detections:      detections,
	})
}
170
core/http/endpoints/openai/realtime_sound_detection_test.go
Normal file
@@ -0,0 +1,170 @@
|
||||
package openai
|
||||
|
||||
import (
|
||||
"context"
|
||||
"encoding/binary"
|
||||
"errors"
|
||||
"os"
|
||||
|
||||
. "github.com/onsi/ginkgo/v2"
|
||||
. "github.com/onsi/gomega"
|
||||
|
||||
"github.com/mudler/LocalAI/core/config"
|
||||
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
|
||||
"github.com/mudler/LocalAI/core/schema"
|
||||
)
|
||||
|
||||
// emitSoundDetection classifies a committed utterance and emits a single
|
||||
// conversation.item.sound_detection event carrying the scored AudioSet tags.
|
||||
var _ = Describe("emitSoundDetection", func() {
|
||||
It("emits a sound_detection event with the classifier's scored tags", func() {
|
||||
session := &Session{
|
||||
SoundDetectionEnabled: true,
|
||||
SoundDetectionTopK: 5,
|
||||
ModelInterface: &fakeModel{
|
||||
soundDetectionResult: &schema.SoundClassificationResult{
|
||||
Model: "ced",
|
||||
Detections: []schema.SoundClassification{
|
||||
{Index: 3, Label: "Baby cry, infant cry", Score: 0.91},
|
||||
{Index: 7, Label: "Speech", Score: 0.42},
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
t := &fakeTransport{}
|
||||
|
||||
err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
|
||||
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
|
||||
|
||||
ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent)
|
||||
Expect(ok).To(BeTrue())
|
||||
Expect(ev.ItemID).To(Equal("item1"))
|
||||
Expect(ev.ContentIndex).To(Equal(0))
|
||||
Expect(ev.Detections).To(HaveLen(2))
|
||||
Expect(ev.Detections[0].Label).To(Equal("Baby cry, infant cry"))
|
||||
Expect(ev.Detections[0].Score).To(BeNumerically("~", 0.91, 1e-6))
|
||||
Expect(ev.Detections[0].Index).To(Equal(3))
|
||||
Expect(ev.Detections[1].Label).To(Equal("Speech"))
|
||||
})
|
||||
|
||||
It("emits an event with no detections when the classifier returns none", func() {
|
||||
session := &Session{
|
||||
SoundDetectionEnabled: true,
|
||||
ModelInterface: &fakeModel{
|
||||
soundDetectionResult: &schema.SoundClassificationResult{Model: "ced"},
|
||||
},
|
||||
}
|
||||
t := &fakeTransport{}
|
||||
|
||||
err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
|
||||
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
|
||||
ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent)
|
||||
Expect(ok).To(BeTrue())
|
||||
Expect(ev.Detections).To(BeEmpty())
|
||||
})
|
||||
|
||||
It("propagates the classifier error and emits no event", func() {
|
||||
session := &Session{
|
||||
SoundDetectionEnabled: true,
|
||||
ModelInterface: &fakeModel{soundDetectionErr: errors.New("boom")},
|
||||
}
|
||||
t := &fakeTransport{}
|
||||
|
||||
err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
|
||||
|
||||
Expect(err).To(HaveOccurred())
|
||||
Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0))
|
||||
})
|
||||
})
|
||||
|
||||
// A sound-detection-only session (no transcription, no LLM) runs through
|
||||
// commitUtterance WITHOUT the voice/transcription path: it emits the
|
||||
// sound_detection event and stops - no transcription event, no LLM response.
|
||||
var _ = Describe("commitUtterance (sound-detection-only session)", func() {
|
||||
It("emits sound detection and neither transcribes nor generates a response", func() {
|
||||
session := &Session{
|
||||
SoundDetectionEnabled: true,
|
||||
SoundDetectionTopK: 5,
|
||||
InputAudioTranscription: nil, // sound-only: no transcription stage
|
||||
ModelConfig: &config.ModelConfig{},
|
||||
ModelInterface: &fakeModel{
|
||||
soundDetectionResult: &schema.SoundClassificationResult{
|
||||
Model: "ced",
|
||||
Detections: []schema.SoundClassification{
|
||||
{Index: 23, Label: "Baby cry, infant cry", Score: 0.87},
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
tr := &fakeTransport{}
|
||||
utt := make([]byte, 32) // non-empty PCM so commitUtterance proceeds
|
||||
|
||||
commitUtterance(context.Background(), utt, session, &Conversation{}, tr)
|
||||
|
||||
Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
|
||||
// No transcription happened.
|
||||
Expect(tr.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(0))
|
||||
// No LLM response was generated (sound-only has no LLM stage).
|
||||
Expect(tr.countEvents(types.ServerEventTypeResponseDone)).To(Equal(0))
|
||||
})
|
||||
})

// Server-side windowing (option B): a sound-only session classifies the last
// WindowMs of streamed audio per tick, with no client commit, and keeps the
// input buffer trimmed to one window.
var _ = Describe("classifySoundWindow (server-side windowing)", func() {
	newSoundSession := func() (*Session, *fakeTransport) {
		return &Session{
			SoundDetectionEnabled:  true,
			SoundDetectionTopK:     5,
			SoundDetectionWindowMs: 200, // 200ms @ 16kHz mono16 = 6400 bytes
			SoundDetectionHopMs:    20,
			InputSampleRate:        16000,
			ModelInterface: &fakeModel{
				soundDetectionResult: &schema.SoundClassificationResult{
					Model:      "ced",
					Detections: []schema.SoundClassification{{Index: 23, Label: "Baby cry, infant cry", Score: 0.87}},
				},
			},
		}, &fakeTransport{}
	}

	It("emits a sound_detection event and trims the buffer to one window", func() {
		session, tr := newSoundSession()
		session.InputAudioBuffer = make([]byte, 10000) // > 6400-byte window

		classifySoundWindow(session, tr)

		Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
		// buffer trimmed to exactly one window (200ms @ 16kHz mono 16-bit)
		Expect(len(session.InputAudioBuffer)).To(Equal(6400))
	})

	It("does nothing when too little audio is buffered", func() {
		session, tr := newSoundSession()
		session.InputAudioBuffer = make([]byte, 100) // < ~10ms (320 bytes)

		classifySoundWindow(session, tr)

		Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0))
	})
})
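
Reviewer note: the byte counts in the windowing test comments (6400 bytes for 200 ms, 320 bytes for 10 ms) follow from mono 16-bit PCM at 16 kHz. The sketch below reproduces that arithmetic and the "keep only the newest window" trim. It is a minimal illustration, not the code in this PR: `windowBytes` and `trimToWindow` are hypothetical names, and the real `classifySoundWindow` may compute the window differently.

```go
package main

import "fmt"

// windowBytes returns the size in bytes of ms milliseconds of mono 16-bit
// PCM at sampleRate (2 bytes per sample). Hypothetical helper.
func windowBytes(ms, sampleRate int) int {
	return sampleRate * ms / 1000 * 2
}

// trimToWindow keeps only the newest n bytes of buf. Hypothetical helper.
func trimToWindow(buf []byte, n int) []byte {
	if n <= 0 || len(buf) <= n {
		return buf
	}
	return buf[len(buf)-n:]
}

func main() {
	n := windowBytes(200, 16000)
	buf := trimToWindow(make([]byte, 10000), n)
	fmt.Println(n, len(buf)) // prints "6400 6400"
}
```

Multiplying by 2 last keeps the result aligned to whole 16-bit samples even when the millisecond division truncates.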

var _ = Describe("writeWindowWAV", func() {
	It("writes a mono 16-bit WAV header declaring the given sample rate", func() {
		pcm := make([]byte, 640)
		path, err := writeWindowWAV(pcm, 24000)
		Expect(err).ToNot(HaveOccurred())
		defer func() { _ = os.Remove(path) }()

		data, err := os.ReadFile(path)
		Expect(err).ToNot(HaveOccurred())
		Expect(len(data)).To(BeNumerically(">=", 44+len(pcm)))
		// SampleRate is a little-endian uint32 at byte offset 24 of a WAV header.
		Expect(binary.LittleEndian.Uint32(data[24:28])).To(Equal(uint32(24000)))
	})
})
91
core/http/endpoints/openai/sound_classification.go
Normal file
@@ -0,0 +1,91 @@
package openai

import (
	"io"
	"net/http"
	"os"
	"path"
	"path/filepath"

	"github.com/labstack/echo/v4"
	"github.com/mudler/LocalAI/core/backend"
	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/http/middleware"
	"github.com/mudler/LocalAI/core/schema"
	model "github.com/mudler/LocalAI/pkg/model"

	"github.com/mudler/xlog"
)

// SoundClassificationEndpoint runs an audio-tagging / sound-event
// classification model (e.g. ced) over an uploaded clip and returns the
// scored AudioSet tags in score-descending order. It mirrors the
// transcription path: multipart audio upload -> temp file -> backend call.
//
// @Summary Classify sound events in audio (audio tagging).
// @Tags audio
// @accept multipart/form-data
// @Param model formData string true "model"
// @Param file formData file true "audio file"
// @Param top_k formData int false "number of top tags to return (0 = backend default)"
// @Param threshold formData number false "drop tags scoring below this value"
// @Success 200 {object} schema.SoundClassificationResult
// @Router /v1/audio/classification [post]
func SoundClassificationEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
	return func(c echo.Context) error {
		input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.OpenAIRequest)
		if !ok || input.Model == "" {
			return echo.ErrBadRequest
		}

		modelConfig, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
		if !ok || modelConfig == nil {
			return echo.ErrBadRequest
		}

		req := backend.SoundDetectionRequest{
			TopK:      int32(parseFormInt(c, "top_k", 0)),
			Threshold: float32(parseFormFloat(c, "threshold", 0)),
		}

		file, err := c.FormFile("file")
		if err != nil {
			return err
		}
		f, err := file.Open()
		if err != nil {
			return err
		}
		defer func() { _ = f.Close() }()

		dir, err := os.MkdirTemp("", "sound-classification")
		if err != nil {
			return err
		}
		defer func() { _ = os.RemoveAll(dir) }()

		dst := filepath.Join(dir, path.Base(file.Filename))
		dstFile, err := os.Create(dst) // #nosec G304 -- dst is a server-created temp dir joined with path.Base of the upload name (no traversal)
		if err != nil {
			return err
		}
		if _, err := io.Copy(dstFile, f); err != nil {
			xlog.Debug("Audio file copying error", "filename", file.Filename, "dst", dst, "error", err)
			_ = dstFile.Close()
			return err
		}
		_ = dstFile.Close()
		req.Audio = dst

		result, err := backend.ModelSoundDetection(c.Request().Context(), req, ml, *modelConfig, appConfig)
		if err != nil {
			xlog.Error("Sound classification failed",
				"model", modelConfig.Name,
				"audio", dst,
				"error", err)
			return err
		}

		return c.JSON(http.StatusOK, result)
	}
}
|
||||
@@ -18,6 +18,7 @@ const (
|
||||
ServerEventTypeConversationItemInputAudioTranscriptionDelta ServerEventType = "conversation.item.input_audio_transcription.delta"
|
||||
ServerEventTypeConversationItemInputAudioTranscriptionSegment ServerEventType = "conversation.item.input_audio_transcription.segment"
|
||||
ServerEventTypeConversationItemInputAudioTranscriptionFailed ServerEventType = "conversation.item.input_audio_transcription.failed"
|
||||
ServerEventTypeConversationItemSoundDetection ServerEventType = "conversation.item.sound_detection"
|
||||
ServerEventTypeConversationItemTruncated ServerEventType = "conversation.item.truncated"
|
||||
ServerEventTypeConversationItemDeleted ServerEventType = "conversation.item.deleted"
|
||||
// ServerEventTypeConversationItemSpeaker is a LocalAI extension: it reports
|
||||
@@ -473,6 +474,55 @@ func (m ConversationItemInputAudioTranscriptionCompletedEvent) MarshalJSON() ([]
|
||||
return json.Marshal(shadow)
|
||||
}
|
||||
|
||||
// SoundDetectionTag is one scored sound-event tag from the sound-event
// classifier. Label is the human-readable AudioSet class name, Score is the
// per-class probability (multi-label, independent), and Index is the class
// index in the model ontology.
type SoundDetectionTag struct {
	// The human-readable AudioSet class name (e.g. "Baby cry, infant cry").
	Label string `json:"label"`

	// The per-class probability for this tag.
	Score float32 `json:"score"`

	// The class index in the model ontology.
	Index int `json:"index"`
}

// Returned when a committed input audio window has been classified by a
// sound-event-detection model. This is a LocalAI extension to the OpenAI
// Realtime API: when a pipeline configures sound_detection, each VAD-committed
// utterance is run through the classifier and the scored AudioSet tags are
// emitted as this event, independent of (and alongside) transcription.
type ConversationItemSoundDetectionEvent struct {
	ServerEventBase
	// The ID of the item.
	ItemID string `json:"item_id"`

	// The index of the content part in the item's content array.
	ContentIndex int `json:"content_index"`

	// The scored sound-event tags, in score-descending order.
	Detections []SoundDetectionTag `json:"detections"`
}

func (m ConversationItemSoundDetectionEvent) ServerEventType() ServerEventType {
	return ServerEventTypeConversationItemSoundDetection
}

func (m ConversationItemSoundDetectionEvent) MarshalJSON() ([]byte, error) {
	type typeAlias ConversationItemSoundDetectionEvent
	type typeWrapper struct {
		typeAlias
		Type ServerEventType `json:"type"`
	}
	shadow := typeWrapper{
		typeAlias: typeAlias(m),
		Type:      m.ServerEventType(),
	}
	return json.Marshal(shadow)
}
|
||||
|
||||
// Returned when the text value of an input audio transcription content part is updated with incremental transcription results.
|
||||
//
|
||||
// See https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription/delta
|
||||
|
||||
@@ -23,6 +23,7 @@
|
||||
"tts": "TTS",
|
||||
"stt": "STT",
|
||||
"diarization": "Diarization",
|
||||
"soundClassification": "Sound Tagging",
|
||||
"soundGen": "Sound",
|
||||
"audioTransform": "Audio FX",
|
||||
"realtimeAudio": "Realtime Audio",
|
||||
|
||||
@@ -31,6 +31,7 @@ const FILTERS = [
|
||||
{ key: 'tts', labelKey: 'filters.tts', icon: 'fa-microphone' },
|
||||
{ key: 'transcript', labelKey: 'filters.stt', icon: 'fa-headphones' },
|
||||
{ key: 'diarization', labelKey: 'filters.diarization', icon: 'fa-users' },
|
||||
{ key: 'sound_classification', labelKey: 'filters.soundClassification', icon: 'fa-ear-listen' },
|
||||
{ key: 'sound_generation', labelKey: 'filters.soundGen', icon: 'fa-music' },
|
||||
{ key: 'audio_transform', labelKey: 'filters.audioTransform', icon: 'fa-sliders' },
|
||||
{ key: 'realtime_audio', labelKey: 'filters.realtimeAudio', icon: 'fa-tower-broadcast' },
|
||||
|
||||
1
core/http/react-ui/src/utils/capabilities.js
vendored
@@ -15,6 +15,7 @@ export const CAP_SOUND_GENERATION = 'FLAG_SOUND_GENERATION'
|
||||
export const CAP_TOKENIZE = 'FLAG_TOKENIZE'
|
||||
export const CAP_VAD = 'FLAG_VAD'
|
||||
export const CAP_DIARIZATION = 'FLAG_DIARIZATION'
|
||||
export const CAP_SOUND_CLASSIFICATION = 'FLAG_SOUND_CLASSIFICATION'
|
||||
export const CAP_VIDEO = 'FLAG_VIDEO'
|
||||
export const CAP_DETECTION = 'FLAG_DETECTION'
|
||||
export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION'
|
||||
|
||||
@@ -284,13 +284,14 @@ func RegisterLocalAIRoutes(router *echo.Echo,
|
||||
// Categorized endpoint groups for structured discovery
|
||||
"endpoint_groups": map[string]any{
|
||||
"openai_compatible": map[string]string{
|
||||
"models": "/v1/models",
|
||||
"chat_completions": "/v1/chat/completions",
|
||||
"completions": "/v1/completions",
|
||||
"embeddings": "/v1/embeddings",
|
||||
"transcription": "/v1/audio/transcriptions",
|
||||
"diarization": "/v1/audio/diarization",
|
||||
"image_generation": "/v1/images/generations",
|
||||
"models": "/v1/models",
|
||||
"chat_completions": "/v1/chat/completions",
|
||||
"completions": "/v1/completions",
|
||||
"embeddings": "/v1/embeddings",
|
||||
"transcription": "/v1/audio/transcriptions",
|
||||
"diarization": "/v1/audio/diarization",
|
||||
"sound_classification": "/v1/audio/classification",
|
||||
"image_generation": "/v1/images/generations",
|
||||
},
|
||||
"config_management": map[string]string{
|
||||
"config_metadata": "/api/models/config-metadata",
|
||||
@@ -342,7 +343,7 @@ func RegisterLocalAIRoutes(router *echo.Echo,
|
||||
"delete": "/stores/delete",
|
||||
},
|
||||
"docs": map[string]string{
|
||||
"swagger": "/swagger/index.html",
|
||||
"swagger": "/swagger/index.html",
|
||||
"instructions": "/api/instructions",
|
||||
},
|
||||
},
|
||||
|
||||
@@ -200,6 +200,23 @@ func RegisterOpenAIRoutes(app *echo.Echo,
|
||||
app.POST("/v1/audio/diarization", diarizationHandler, diarizationMiddleware...)
|
||||
app.POST("/audio/diarization", diarizationHandler, diarizationMiddleware...)
|
||||
|
||||
soundClassificationHandler := openai.SoundClassificationEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig())
|
||||
soundClassificationMiddleware := []echo.MiddlewareFunc{
|
||||
traceMiddleware,
|
||||
re.BuildFilteredFirstAvailableDefaultModel(config.BuildUsecaseFilterFn(config.FLAG_SOUND_CLASSIFICATION)),
|
||||
re.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.OpenAIRequest) }),
|
||||
func(next echo.HandlerFunc) echo.HandlerFunc {
|
||||
return func(c echo.Context) error {
|
||||
if err := re.SetOpenAIRequest(c); err != nil {
|
||||
return err
|
||||
}
|
||||
return next(c)
|
||||
}
|
||||
},
|
||||
}
|
||||
app.POST("/v1/audio/classification", soundClassificationHandler, soundClassificationMiddleware...)
|
||||
app.POST("/audio/classification", soundClassificationHandler, soundClassificationMiddleware...)
|
||||
|
||||
audioSpeechHandler := localai.TTSEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig())
|
||||
audioSpeechMiddleware := []echo.MiddlewareFunc{
|
||||
nodeHeaderMiddleware,
|
||||
|
||||
@@ -42,21 +42,22 @@ const (
|
||||
// usecaseFilters maps UI filter keys to ModelConfigUsecase flags for
|
||||
// capability-based gallery filtering.
|
||||
var usecaseFilters = map[string]config.ModelConfigUsecase{
|
||||
config.UsecaseChat: config.FLAG_CHAT,
|
||||
config.UsecaseImage: config.FLAG_IMAGE,
|
||||
config.UsecaseVideo: config.FLAG_VIDEO,
|
||||
config.UsecaseVision: config.FLAG_VISION,
|
||||
config.UsecaseTTS: config.FLAG_TTS,
|
||||
config.UsecaseTranscript: config.FLAG_TRANSCRIPT,
|
||||
config.UsecaseSoundGeneration: config.FLAG_SOUND_GENERATION,
|
||||
config.UsecaseEmbeddings: config.FLAG_EMBEDDINGS,
|
||||
config.UsecaseRerank: config.FLAG_RERANK,
|
||||
config.UsecaseDetection: config.FLAG_DETECTION,
|
||||
config.UsecaseVAD: config.FLAG_VAD,
|
||||
config.UsecaseAudioTransform: config.FLAG_AUDIO_TRANSFORM,
|
||||
config.UsecaseDiarization: config.FLAG_DIARIZATION,
|
||||
config.UsecaseRealtimeAudio: config.FLAG_REALTIME_AUDIO,
|
||||
config.UsecaseTokenClassify: config.FLAG_TOKEN_CLASSIFY,
|
||||
config.UsecaseChat: config.FLAG_CHAT,
|
||||
config.UsecaseImage: config.FLAG_IMAGE,
|
||||
config.UsecaseVideo: config.FLAG_VIDEO,
|
||||
config.UsecaseVision: config.FLAG_VISION,
|
||||
config.UsecaseTTS: config.FLAG_TTS,
|
||||
config.UsecaseTranscript: config.FLAG_TRANSCRIPT,
|
||||
config.UsecaseSoundGeneration: config.FLAG_SOUND_GENERATION,
|
||||
config.UsecaseEmbeddings: config.FLAG_EMBEDDINGS,
|
||||
config.UsecaseRerank: config.FLAG_RERANK,
|
||||
config.UsecaseDetection: config.FLAG_DETECTION,
|
||||
config.UsecaseVAD: config.FLAG_VAD,
|
||||
config.UsecaseAudioTransform: config.FLAG_AUDIO_TRANSFORM,
|
||||
config.UsecaseDiarization: config.FLAG_DIARIZATION,
|
||||
config.UsecaseSoundClassification: config.FLAG_SOUND_CLASSIFICATION,
|
||||
config.UsecaseRealtimeAudio: config.FLAG_REALTIME_AUDIO,
|
||||
config.UsecaseTokenClassify: config.FLAG_TOKEN_CLASSIFY,
|
||||
}
|
||||
|
||||
// extractHFRepo tries to find a HuggingFace repo ID from model overrides or URLs.
|
||||
|
||||
19
core/schema/sound_classification.go
Normal file
@@ -0,0 +1,19 @@
package schema

// SoundClassification is one scored sound-event tag. Score is the
// per-class probability (multi-label, independent), Index is the class
// index in the model ontology, and Label is the human-readable AudioSet
// class name (e.g. "Baby cry, infant cry").
type SoundClassification struct {
	Index int     `json:"index"`
	Label string  `json:"label"`
	Score float32 `json:"score"`
}

// SoundClassificationResult is the JSON response of the
// /v1/audio/classification endpoint: the model name and the scored tags
// in score-descending order.
type SoundClassificationResult struct {
	Model      string                `json:"model"`
	Detections []SoundClassification `json:"detections"`
}
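
Reviewer note: the schema promises tags "in score-descending order", and the endpoint takes `top_k` and `threshold`. The sketch below shows one way those three rules could combine. It is an assumption-laden illustration, not the ced backend's logic: `selectTags` and `detection` are hypothetical names, the scores are made up, and treating `topK <= 0` as "no limit" is a choice of this sketch (the source only says 0 means "backend default").

```go
package main

import (
	"fmt"
	"sort"
)

type detection struct {
	Index int
	Label string
	Score float32
}

// selectTags drops detections scoring below threshold, sorts the rest by
// descending score, and keeps at most topK of them. In this sketch
// topK <= 0 means "no limit"; the real backend default is not shown.
func selectTags(in []detection, topK int, threshold float32) []detection {
	out := make([]detection, 0, len(in))
	for _, d := range in {
		if d.Score >= threshold {
			out = append(out, d)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK > 0 && len(out) > topK {
		out = out[:topK]
	}
	return out
}

func main() {
	all := []detection{
		{Index: 22, Label: "Crying, sobbing", Score: 0.41},
		{Index: 23, Label: "Baby cry, infant cry", Score: 0.87},
		{Index: 0, Label: "Speech", Score: 0.05}, // illustrative low score
	}
	for _, d := range selectTags(all, 2, 0.1) {
		fmt.Println(d.Index, d.Label, d.Score)
	}
}
```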
|
||||
@@ -169,6 +169,9 @@ func (c *fakeBackendClient) SoundGeneration(_ context.Context, _ *pb.SoundGenera
|
||||
func (c *fakeBackendClient) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) {
|
||||
return nil, nil
|
||||
}
|
||||
func (c *fakeBackendClient) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) (*pb.SoundDetectionResponse, error) {
|
||||
return nil, nil
|
||||
}
|
||||
func (c *fakeBackendClient) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) {
|
||||
return nil, nil
|
||||
}
|
||||
|
||||
@@ -99,6 +99,9 @@ func (f *fakeGRPCBackend) SoundGeneration(_ context.Context, _ *pb.SoundGenerati
|
||||
func (f *fakeGRPCBackend) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) {
|
||||
return &pb.DetectResponse{}, nil
|
||||
}
|
||||
func (f *fakeGRPCBackend) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) (*pb.SoundDetectionResponse, error) {
|
||||
return &pb.SoundDetectionResponse{}, nil
|
||||
}
|
||||
|
||||
func (f *fakeGRPCBackend) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) {
|
||||
return &pb.DepthResponse{}, nil
|
||||
|
||||
55
docs/content/features/audio-classification.md
Normal file
@@ -0,0 +1,55 @@
+++
disableToc = false
title = "Sound Classification"
weight = 18
url = "/features/audio-classification/"
+++

Sound-event classification (audio tagging) answers the question **"what am I hearing?"** - given an audio clip, it returns a list of scored [AudioSet](https://research.google.com/audioset/) labels (e.g. *Baby cry, infant cry*, *Glass breaking*, *Dog bark*, *Alarm*).

LocalAI exposes this through the `/v1/audio/classification` endpoint, modelled after `/v1/audio/transcriptions`. The reference backend is **[ced.cpp](https://github.com/mudler/ced.cpp)** (CED, a 527-class AudioSet tagger), a small ViT over a log-mel spectrogram ported to ggml with full PyTorch parity. Apache-2.0 weights are redistributable as GGUF.

Because classification is exposed as a regular OpenAI-style endpoint, any HTTP client works - there is no Python dependency on the consumer side.

## Endpoint

```
POST /v1/audio/classification
Content-Type: multipart/form-data
```

| Field | Type | Description |
|-------|------|-------------|
| `file` | file (required) | audio file in any format `ffmpeg` accepts |
| `model` | string (required) | name of the sound-classification-capable model (e.g. `ced-base`) |
| `top_k` | int | number of top tags to return (0 = backend default) |
| `threshold` | float | drop tags scoring below this value |

### Response

```json
{
  "model": "ced-base",
  "detections": [
    {"index": 23, "label": "Baby cry, infant cry", "score": 0.87},
    {"index": 22, "label": "Crying, sobbing", "score": 0.41}
  ]
}
```

Detections are returned in score-descending order. Scores are per-class probabilities (multi-label, independent), so they do not sum to 1.
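
To see why independent per-class scores need not sum to 1, here is a minimal Go sketch. It assumes each class score comes from its own sigmoid over a logit, which is the usual multi-label setup but is an assumption here (this page does not state how CED produces its scores); the logits are made up.

```go
package main

import (
	"fmt"
	"math"
)

// sigmoid maps a logit to an independent probability in (0, 1).
func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

func main() {
	logits := []float64{1.9, -0.36, 2.2} // made-up logits for three classes
	sum := 0.0
	for _, z := range logits {
		p := sigmoid(z)
		sum += p
		fmt.Printf("%.2f ", p)
	}
	// Each score is computed on its own, so nothing forces the total to 1.
	fmt.Printf("| sum = %.2f\n", sum)
}
```

In practice this means several tags can score highly for the same clip, and `threshold` is applied to each tag on its own.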

## Example

```bash
curl http://localhost:8080/v1/audio/classification \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/clip.wav" \
  -F model="ced-base" \
  -F top_k=10
```
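
The same request can be built from Go with only the standard library. This is a minimal sketch under stated assumptions: `newClassificationRequest` is a hypothetical helper name, the base URL and model name are the placeholder values from the curl example, and the audio bytes are a stand-in for a real file.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"strconv"
)

// newClassificationRequest builds the multipart POST for
// /v1/audio/classification with the file, model and top_k fields.
func newClassificationRequest(baseURL, model, filename string, audio []byte, topK int) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	if err := w.WriteField("model", model); err != nil {
		return nil, err
	}
	if err := w.WriteField("top_k", strconv.Itoa(topK)); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/classification", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, err := newClassificationRequest("http://localhost:8080", "ced-base", "clip.wav", []byte("placeholder audio bytes"), 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending it is then a matter of `http.DefaultClient.Do(req)` and decoding the JSON body into a struct with `model` and `detections` fields, as in the response shown above.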

## See also

- [Audio to Text]({{% relref "audio-to-text" %}}) - speech transcription
- [Speaker Diarization]({{% relref "audio-diarization" %}}) - who spoke when
@@ -152,3 +152,7 @@ curl http://localhost:8080/v1/audio/diarization \
|
||||
- **Speaker identity across files**: speaker IDs (`SPEAKER_00`, `SPEAKER_01`, …) are local to each request. To track the same person across multiple recordings, combine `/v1/audio/diarization` with `/v1/voice/embed` (speaker embedding) and maintain your own embedding store.
|
||||
- **Hints vs. forces**: `num_speakers` overrides clustering when set; `min_speakers` / `max_speakers` are advisory and only honored by backends that expose a range hint. vibevoice.cpp ignores them — its model picks the count itself.
|
||||
- **Sample rate**: input is automatically converted to 16 kHz mono via ffmpeg before the backend sees it; sherpa-onnx pyannote-3.0 requires 16 kHz.
|
||||
|
||||
## See also
|
||||
|
||||
- [Sound Classification]({{% relref "audio-classification" %}}) - tag non-speech sound events (alarms, glass breaking, baby cry) in a clip.
|
||||
|
||||
@@ -128,6 +128,7 @@ LocalAI supports various types of backends:
|
||||
- **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
|
||||
- **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
|
||||
- **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)
|
||||
- **Sound Classification Backends**: For sound-event classification / audio tagging - identifying everyday sounds like baby cry, glass breaking, alarms (e.g., ced.cpp)
|
||||
- **Image & Video Generation Backends**: For diffusion models (e.g., stable-diffusion.cpp, diffusers)
|
||||
- **Vision & Detection Backends**: For object detection, segmentation, depth, and face/voice recognition (e.g., rf-detr.cpp, locate-anything.cpp, sam3.cpp, insightface)
|
||||
- **Audio Processing Backends**: For voice activity detection and audio enhancement (e.g., Silero VAD, LocalVQE)
|
||||
|
||||
@@ -15,6 +15,7 @@ You can see the release notes [here](https://github.com/mudler/LocalAI/releases)
|
||||
- **April 2026**: [Audio Transform](/features/audio-transform/) — generic audio-in / audio-out endpoint with optional reference signal. First implementation: [LocalVQE](https://github.com/localai-org/LocalVQE) C++ backend (joint AEC + noise suppression + dereverberation, DeepVQE-style). Both batch (`POST /audio/transformations`) and bidirectional WebSocket streaming (`/audio/transformations/stream`). Studio "Transform" tab with synchronized waveform players for input / reference / output.
|
||||
- **April 2026**: [Face recognition backend](/features/face-recognition/) — `insightface`-powered 1:1 verification, 1:N identification, face embedding, face detection, and demographic analysis. Ships both a non-commercial `buffalo_l` model and an Apache 2.0 OpenCV Zoo alternative.
|
||||
- **May 2026**: [Speaker diarization](/features/audio-diarization/) — new `/v1/audio/diarization` endpoint returning "who spoke when" segments. Backed by `sherpa-onnx` (pyannote-3.0 + speaker embeddings + clustering) for pure diarization, and `vibevoice-cpp` for diarization bundled with long-form ASR. Supports `json` / `verbose_json` / `rttm` response formats.
|
||||
- **June 2026**: [Sound classification](/features/audio-classification/) — new `/v1/audio/classification` endpoint for audio tagging / sound-event classification, returning scored [AudioSet](https://research.google.com/audioset/) labels (baby cry, glass breaking, alarms, ...). Backed by [ced.cpp](https://github.com/mudler/ced.cpp), a 527-class AudioSet tagger ported to ggml.
|
||||
- **June 2026**: [PII analyze / redact API](/features/middleware/#analyze--redact-api) — the PII detection pipeline (NER + restricted-regex pattern tiers) is now a standalone service: `POST /api/pii/analyze` returns detected entity spans and `POST /api/pii/redact` returns the sanitised text (or `400 pii_blocked`), without routing a chat request through the middleware. Events gain an `origin` (`middleware` / `proxy` / `pii_analyze` / `pii_redact`) so `/api/pii/events` can be filtered by source.
|
||||
- **June 2026**: Concurrent scoring and PII NER on llama.cpp — the `Score` (router classifier) and `TokenClassify` (PII NER) primitives now ride llama.cpp's server task queue instead of locking the context, so they run concurrently with chat/completion/embedding traffic and with each other. The `known_usecases` restriction that forced dedicated scorer/NER model configs on llama-cpp is lifted, repeated scoring calls reuse the prompt KV cache across candidates, and scoring inputs are no longer capped by the physical batch size.
|
||||
|
||||
|
||||
7
gallery/ced.yaml
Normal file
@@ -0,0 +1,7 @@
---
name: "ced-sound-classification"

config_file: |
  backend: ced
  known_usecases:
    - sound_classification
|
||||
@@ -3077,6 +3077,190 @@
|
||||
- transcript
|
||||
parameters:
|
||||
model: tiny
|
||||
- name: ced-base-f16
|
||||
url: github:mudler/LocalAI/gallery/ced.yaml@master
|
||||
urls:
|
||||
- https://huggingface.co/mudler/ced-gguf
|
||||
- https://huggingface.co/mispeech/ced-base
|
||||
description: |
|
||||
CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
|
||||
license: apache-2.0
|
||||
tags:
|
||||
- audio-classification
|
||||
- sound-event-detection
|
||||
- audio-tagging
|
||||
- audioset
|
||||
- ced
|
||||
- gguf
|
||||
- f16
|
||||
overrides:
|
||||
parameters:
|
||||
model: ced-base-f16.gguf
|
||||
files:
|
||||
- filename: ced-base-f16.gguf
|
||||
sha256: 5c058d9f7b737167195fa54eae4a2ae17658ac2c0a8073f7f116ba006b2ab32c
|
||||
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-f16.gguf
|
||||
- name: ced-base-q8
|
||||
url: github:mudler/LocalAI/gallery/ced.yaml@master
|
||||
urls:
|
||||
- https://huggingface.co/mudler/ced-gguf
|
||||
- https://huggingface.co/mispeech/ced-base
|
||||
description: |
|
||||
CED (Consistent Ensemble Distillation, Xiaomi) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). This is the q8_0 GGUF for the ced backend: smallest footprint (~88 MB, ~6.5x less memory than the PyTorch reference) and near-lossless (identical top-5 tags). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
|
||||
license: apache-2.0
|
||||
tags:
|
||||
- audio-classification
|
||||
- sound-event-detection
|
||||
- audio-tagging
|
||||
- audioset
|
||||
- ced
|
||||
- gguf
|
||||
- q8
|
||||
overrides:
|
||||
parameters:
|
||||
model: ced-base-q8_0.gguf
|
||||
files:
|
||||
- filename: ced-base-q8_0.gguf
|
||||
sha256: bd34a7710169f0047fea17267965d211f967828ab25ba6fb9d3768481393f6e2
|
||||
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-q8_0.gguf
|
||||
- name: ced-tiny-f16
|
||||
url: github:mudler/LocalAI/gallery/ced.yaml@master
|
||||
urls:
|
||||
- https://huggingface.co/mudler/ced-gguf
|
||||
- https://huggingface.co/mispeech/ced-tiny
|
||||
description: |
|
||||
CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
|
||||
license: apache-2.0
|
||||
tags:
|
||||
- audio-classification
|
||||
- sound-event-detection
|
||||
- audio-tagging
|
||||
- audioset
|
||||
- ced
|
||||
- gguf
|
||||
- f16
|
||||
overrides:
|
||||
parameters:
|
||||
model: ced-tiny-f16.gguf
|
||||
files:
|
||||
- filename: ced-tiny-f16.gguf
|
||||
sha256: af8b81c67bae50bfca4ea83dbba77b3bae4fa6180d36c17d6877f7700aeeb77b
|
||||
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-f16.gguf
|
||||
- name: ced-tiny-q8
|
||||
url: github:mudler/LocalAI/gallery/ced.yaml@master
|
||||
urls:
|
||||
- https://huggingface.co/mudler/ced-gguf
|
||||
- https://huggingface.co/mispeech/ced-tiny
|
||||
description: |
|
||||
CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
|
||||
license: apache-2.0
|
||||
tags:
|
||||
- audio-classification
|
||||
- sound-event-detection
|
||||
- audio-tagging
|
||||
- audioset
|
||||
- ced
|
||||
- gguf
|
||||
- q8
|
||||
overrides:
|
||||
parameters:
|
||||
model: ced-tiny-q8_0.gguf
|
||||
files:
|
||||
- filename: ced-tiny-q8_0.gguf
|
||||
sha256: 48bee4e2fc3cc85d7806e03471db24e77fda6c2a2e81ffe9ef67caebaf2bd674
|
||||
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-q8_0.gguf
|
||||
- name: ced-mini-f16
|
||||
url: github:mudler/LocalAI/gallery/ced.yaml@master
|
||||
urls:
|
||||
- https://huggingface.co/mudler/ced-gguf
|
||||
- https://huggingface.co/mispeech/ced-mini
|
||||
description: |
|
||||
CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
|
||||
license: apache-2.0
|
||||
tags:
|
||||
- audio-classification
|
||||
- sound-event-detection
|
||||
- audio-tagging
|
||||
- audioset
|
||||
- ced
|
||||
- gguf
|
||||
- f16
|
||||
overrides:
|
||||
parameters:
|
||||
model: ced-mini-f16.gguf
|
||||
files:
|
||||
- filename: ced-mini-f16.gguf
|
||||
sha256: 3c6a8936c77312f07a9ecb7b4bbbcb1f93ad137920ca6656bae9306571fb0c03
|
||||
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-f16.gguf
|
||||
- name: ced-mini-q8
|
||||
url: github:mudler/LocalAI/gallery/ced.yaml@master
|
||||
urls:
|
||||
- https://huggingface.co/mudler/ced-gguf
|
||||
- https://huggingface.co/mispeech/ced-mini
|
||||
description: |
|
||||
CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
|
||||
license: apache-2.0
|
||||
tags:
|
||||
- audio-classification
|
||||
- sound-event-detection
|
||||
- audio-tagging
|
||||
- audioset
|
||||
- ced
|
||||
- gguf
|
||||
- q8
|
||||
overrides:
|
||||
parameters:
|
||||
model: ced-mini-q8_0.gguf
|
||||
files:
|
||||
- filename: ced-mini-q8_0.gguf
|
||||
sha256: 7062cef9ca31459f339ce24a5914f3b65bde76ffd9ca4fc924a040327ff292bd
|
||||
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-q8_0.gguf
|
||||
- name: ced-small-f16
|
||||
url: github:mudler/LocalAI/gallery/ced.yaml@master
|
||||
urls:
|
||||
- https://huggingface.co/mudler/ced-gguf
|
||||
- https://huggingface.co/mispeech/ced-small
|
||||
description: |
|
||||
CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
|
||||
license: apache-2.0
|
||||
tags:
|
||||
- audio-classification
|
||||
- sound-event-detection
|
||||
- audio-tagging
|
||||
- audioset
|
||||
- ced
|
||||
- gguf
|
||||
- f16
|
||||
overrides:
|
||||
parameters:
|
||||
model: ced-small-f16.gguf
|
||||
files:
|
||||
- filename: ced-small-f16.gguf
|
||||
sha256: c391ed8697a1b08d7c1a463e4940a5c3a2f670e0544ab0d8ee23b544583602a8
|
||||
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-f16.gguf
|
||||
- name: ced-small-q8
  url: github:mudler/LocalAI/gallery/ced.yaml@master
  urls:
    - https://huggingface.co/mudler/ced-gguf
    - https://huggingface.co/mispeech/ced-small
  description: |
    CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
  license: apache-2.0
  tags:
    - audio-classification
    - sound-event-detection
    - audio-tagging
    - audioset
    - ced
    - gguf
    - q8
  overrides:
    parameters:
      model: ced-small-q8_0.gguf
  files:
    - filename: ced-small-q8_0.gguf
      sha256: 888275fe43491cf832fb7b8125eccba34d1120745166f40cc12e93b79dea8efe
      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-q8_0.gguf
- name: omnilingual-0.3b-ctc-q8-sherpa
  url: github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master
  urls:
@@ -82,6 +82,8 @@ type Backend interface {

	Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grpc.CallOption) (*pb.DiarizeResponse, error)

	SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error)

	AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error)
	AudioDecode(ctx context.Context, in *pb.AudioDecodeRequest, opts ...grpc.CallOption) (*pb.AudioDecodeResult, error)

@@ -110,6 +110,10 @@ func (llm *Base) Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error) {
	return pb.DiarizeResponse{}, fmt.Errorf("unimplemented")
}

func (llm *Base) SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
	return nil, fmt.Errorf("unimplemented")
}

func (llm *Base) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
	return pb.TokenizationResponse{}, fmt.Errorf("unimplemented")
}

@@ -616,6 +616,24 @@ func (c *Client) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grp
	return client.Diarize(ctx, in, opts...)
}

func (c *Client) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) {
	if !c.parallel {
		c.opMutex.Lock()
		defer c.opMutex.Unlock()
	}
	c.setBusy(true)
	defer c.setBusy(false)
	c.wdMark()
	defer c.wdUnMark()
	conn, err := c.dial()
	if err != nil {
		return nil, err
	}
	defer func() { _ = conn.Close() }()
	client := pb.NewBackendClient(conn)
	return client.SoundDetection(ctx, in, opts...)
}

func (c *Client) Detect(ctx context.Context, in *pb.DetectOptions, opts ...grpc.CallOption) (*pb.DetectResponse, error) {
	if !c.parallel {
		c.opMutex.Lock()

@@ -153,6 +153,10 @@ func (e *embedBackend) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts
	return e.s.Diarize(ctx, in)
}

func (e *embedBackend) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) {
	return e.s.SoundDetection(ctx, in)
}

func (e *embedBackend) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) {
	return e.s.AudioEncode(ctx, in)
}

@@ -40,6 +40,7 @@ type AIModel interface {

	VAD(*pb.VADRequest) (pb.VADResponse, error)
	Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error)
	SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error)

	AudioEncode(*pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error)
	AudioDecode(*pb.AudioDecodeRequest) (*pb.AudioDecodeResult, error)

@@ -435,6 +435,14 @@ func (s *server) Diarize(ctx context.Context, in *pb.DiarizeRequest) (*pb.Diariz
	return &res, nil
}

func (s *server) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
	if s.llm.Locking() {
		s.llm.Lock()
		defer s.llm.Unlock()
	}
	return s.llm.SoundDetection(ctx, in)
}

func (s *server) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error) {
	if s.llm.Locking() {
		s.llm.Lock()

@@ -26,6 +26,13 @@ function inferBackendPath(item) {
  if (item.backend === "parakeet-cpp") {
    return `backend/go/parakeet-cpp/`;
  }
  // ced is a Go backend (Dockerfile.golang) wrapping the ced.cpp ggml port via
  // purego, living in backend/go/ced/. Same explicit-branch rationale as
  // parakeet-cpp above: the generic golang fallthrough would also resolve it,
  // but this documents the mapping and guards a future dockerfile-suffix change.
  if (item.backend === "ced") {
    return `backend/go/ced/`;
  }
  if (item.dockerfile.endsWith("golang")) {
    return `backend/go/${item.backend}/`;
  }

@@ -1939,6 +1939,53 @@ const docTemplate = `{
                }
            }
        },
        "/v1/audio/classification": {
            "post": {
                "consumes": [
                    "multipart/form-data"
                ],
                "tags": [
                    "audio"
                ],
                "summary": "Classify sound events in audio (audio tagging).",
                "parameters": [
                    {
                        "type": "string",
                        "description": "model",
                        "name": "model",
                        "in": "formData",
                        "required": true
                    },
                    {
                        "type": "file",
                        "description": "audio file",
                        "name": "file",
                        "in": "formData",
                        "required": true
                    },
                    {
                        "type": "integer",
                        "description": "number of top tags to return (0 = backend default)",
                        "name": "top_k",
                        "in": "formData"
                    },
                    {
                        "type": "number",
                        "description": "drop tags scoring below this value",
                        "name": "threshold",
                        "in": "formData"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "schema": {
                            "$ref": "#/definitions/schema.SoundClassificationResult"
                        }
                    }
                }
            }
        },
        "/v1/audio/diarization": {
            "post": {
                "consumes": [
@@ -6084,6 +6131,34 @@ const docTemplate = `{
                }
            }
        },
        "schema.SoundClassification": {
            "type": "object",
            "properties": {
                "index": {
                    "type": "integer"
                },
                "label": {
                    "type": "string"
                },
                "score": {
                    "type": "number"
                }
            }
        },
        "schema.SoundClassificationResult": {
            "type": "object",
            "properties": {
                "detections": {
                    "type": "array",
                    "items": {
                        "$ref": "#/definitions/schema.SoundClassification"
                    }
                },
                "model": {
                    "type": "string"
                }
            }
        },
        "schema.StreamOptions": {
            "type": "object",
            "properties": {

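The two definitions above fix the JSON shape of a classification response: a `model` string plus a `detections` array of `index`/`label`/`score` objects. As a sketch of consuming that shape from a client, the structs below carry only the JSON field names taken from the schema; the Go type names and field types are assumptions and may differ from `core/schema/sound_classification.go`, and the payload is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SoundClassification matches the index/label/score properties of
// schema.SoundClassification. Field types are assumed.
type SoundClassification struct {
	Index int     `json:"index"`
	Label string  `json:"label"`
	Score float64 `json:"score"`
}

// SoundClassificationResult matches schema.SoundClassificationResult.
type SoundClassificationResult struct {
	Model      string                `json:"model"`
	Detections []SoundClassification `json:"detections"`
}

// decodeResult parses a response body into the result struct.
func decodeResult(body []byte) (SoundClassificationResult, error) {
	var res SoundClassificationResult
	err := json.Unmarshal(body, &res)
	return res, err
}

func main() {
	// Made-up payload shaped like the schema; not real model output.
	body := []byte(`{"model":"ced-mini-q8","detections":[{"index":0,"label":"example","score":0.5}]}`)
	res, err := decodeResult(body)
	if err != nil {
		panic(err)
	}
	for _, d := range res.Detections {
		fmt.Printf("%s: #%d %s (%.2f)\n", res.Model, d.Index, d.Label, d.Score)
	}
}
```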
@@ -1936,6 +1936,53 @@
                }
            }
        },
        "/v1/audio/classification": {
            "post": {
                "consumes": [
                    "multipart/form-data"
                ],
                "tags": [
                    "audio"
                ],
                "summary": "Classify sound events in audio (audio tagging).",
                "parameters": [
                    {
                        "type": "string",
                        "description": "model",
                        "name": "model",
                        "in": "formData",
                        "required": true
                    },
                    {
                        "type": "file",
                        "description": "audio file",
                        "name": "file",
                        "in": "formData",
                        "required": true
                    },
                    {
                        "type": "integer",
                        "description": "number of top tags to return (0 = backend default)",
                        "name": "top_k",
                        "in": "formData"
                    },
                    {
                        "type": "number",
                        "description": "drop tags scoring below this value",
                        "name": "threshold",
                        "in": "formData"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "schema": {
                            "$ref": "#/definitions/schema.SoundClassificationResult"
                        }
                    }
                }
            }
        },
        "/v1/audio/diarization": {
            "post": {
                "consumes": [
@@ -6081,6 +6128,34 @@
                }
            }
        },
        "schema.SoundClassification": {
            "type": "object",
            "properties": {
                "index": {
                    "type": "integer"
                },
                "label": {
                    "type": "string"
                },
                "score": {
                    "type": "number"
                }
            }
        },
        "schema.SoundClassificationResult": {
            "type": "object",
            "properties": {
                "detections": {
                    "type": "array",
                    "items": {
                        "$ref": "#/definitions/schema.SoundClassification"
                    }
                },
                "model": {
                    "type": "string"
                }
            }
        },
        "schema.StreamOptions": {
            "type": "object",
            "properties": {

@@ -2087,6 +2087,24 @@ definitions:
          classifier-side confidence signal).
        type: number
    type: object
  schema.SoundClassification:
    properties:
      index:
        type: integer
      label:
        type: string
      score:
        type: number
    type: object
  schema.SoundClassificationResult:
    properties:
      detections:
        items:
          $ref: '#/definitions/schema.SoundClassification'
        type: array
      model:
        type: string
    type: object
  schema.StreamOptions:
    properties:
      include_usage:
@@ -3770,6 +3788,37 @@ paths:
      summary: Generates audio from the input text.
      tags:
      - audio
  /v1/audio/classification:
    post:
      consumes:
      - multipart/form-data
      parameters:
      - description: model
        in: formData
        name: model
        required: true
        type: string
      - description: audio file
        in: formData
        name: file
        required: true
        type: file
      - description: number of top tags to return (0 = backend default)
        in: formData
        name: top_k
        type: integer
      - description: drop tags scoring below this value
        in: formData
        name: threshold
        type: number
      responses:
        "200":
          description: OK
          schema:
            $ref: '#/definitions/schema.SoundClassificationResult'
      summary: Classify sound events in audio (audio tagging).
      tags:
      - audio
  /v1/audio/diarization:
    post:
      consumes:
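The spec above describes `POST /v1/audio/classification` as a `multipart/form-data` request with required `model` and `file` fields and optional `top_k` and `threshold`. Below is a hedged sketch of building such a request from Go. `buildRequest`, `post`, and `stubServer` are illustrative helpers, not LocalAI code, and the stub server only echoes what it received so the sketch runs without a LocalAI instance or a real audio file.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// buildRequest assembles the multipart form described by the spec.
func buildRequest(baseURL, model, filename string, audio []byte, topK int, threshold float64) (*http.Request, error) {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	fields := [][2]string{
		{"model", model},
		{"top_k", strconv.Itoa(topK)},
		{"threshold", strconv.FormatFloat(threshold, 'f', -1, 64)},
	}
	for _, f := range fields {
		if err := w.WriteField(f[0], f[1]); err != nil {
			return nil, err
		}
	}
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; it must happen before sending.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/classification", &buf)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

// post sends req and returns the response body as a string.
func post(req *http.Request) (string, error) {
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// stubServer stands in for LocalAI: it echoes the path, the model and
// top_k fields, and the uploaded file size instead of classifying.
func stubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if err := r.ParseMultipartForm(1 << 20); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		f, _, err := r.FormFile("file")
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		defer f.Close()
		n, _ := io.Copy(io.Discard, f)
		fmt.Fprintf(w, "%s %s %s %d", r.URL.Path, r.FormValue("model"), r.FormValue("top_k"), n)
	}))
}

func main() {
	srv := stubServer()
	defer srv.Close()
	req, err := buildRequest(srv.URL, "ced-mini-q8", "clip.wav", []byte("not real audio"), 5, 0.1)
	if err != nil {
		panic(err)
	}
	out, err := post(req)
	fmt.Println(out, err)
}
```

Against a real server, `srv.URL` would be replaced by the LocalAI base URL and the response body decoded as JSON.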