From 600dafd20b624219e82c2b370fdaf5fa0ad11dcb Mon Sep 17 00:00:00 2001 From: "LocalAI [bot]" <139863280+localai-bot@users.noreply.github.com> Date: Mon, 22 Jun 2026 01:00:28 +0200 Subject: [PATCH] feat(ced): sound-event classification backend (CED audio tagger) (#10425) * feat(ced): sketch sound-classification backend (CED audio tagger) Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry, footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend. SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist in DESIGN.md): - backend/backend.proto: new SoundDetection rpc + SoundClass messages (run `make protogen-go` to regenerate pkg/grpc/proto). - backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h), goced.go (Ced gRPC backend: Load + SoundDetection), Makefile (clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh, package.sh, .gitignore. - DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability registration checklist), gallery/index + CI registration, and a scoping note for the realtime/websocket live-recognition path (sliding-window classify over the existing ws transport + voicegate; the ced C-API per-PCM entry point is already window-friendly). Backend code does not compile until protogen-go regenerates the pb types and a libced.so is built (Makefile clones+builds it). Signed-off-by: Ettore Di Giacinto * feat(ced): REST /v1/audio/classification endpoint + capability registration Wires the ced sound-event classification backend (AudioSet audio tagger) end to end through the REST surface, mirroring the transcription path. - Handler: core/http/endpoints/openai/sound_classification.go parses the multipart audio upload, temp-files it, resolves the model config and calls the SoundDetection RPC; returns {model, detections[]} JSON. - Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection) loads the model and normalizes the proto response into schema types. 
- Schema: core/schema/sound_classification.go (SoundClassificationResult). - gRPC layer: SoundDetection wired through the LocalAI wrapper (interface, Backend client, Client, embed, server, base default) so the loader-typed client exposes the RPC; proto regenerated via make protogen-go. - Route: POST /v1/audio/classification (+ /audio/classification alias) with the audio/multipart default-model middleware in routes/openai.go. - Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap + GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase option; /api/instructions audio area updated; auth RouteFeatureRegistry + FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter + i18n; docs page features/audio-classification.md + whats-new + crosslink. Signed-off-by: Ettore Di Giacinto * feat(ced): realtime sound-event detection over the websocket API When a realtime pipeline configures a sound-classification model, each VAD-committed utterance (the same window the transcription path produces) is also run through the CED sound-event classifier and the scored AudioSet tags are emitted as a new server event. No new backend rpc is needed: the SoundDetection gRPC method already exists on this branch. - config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty) beside Transcription/VAD. - realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the ModelInterface; implement it on wrappedModel and transcriptOnlyModel by calling backend.ModelSoundDetection with the session's sound-classification model config (mirrors how Transcribe dispatches). Load the optional config in newModel / newTranscriptionOnlyModel; nil config keeps it additive. 
- types: add ConversationItemSoundDetectionEvent (item_id, content_index, detections[]{label,score,index}) with type conversation.item.sound_detection, its ServerEventType constant and MarshalJSON, mirroring the transcription completed event. - realtime: add emitSoundDetection (unary path: classify the committed window, build the event, t.SendEvent) and wire it at the utterance-commit hook right after emitTranscription; gated on session.SoundDetectionEnabled (resolved from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0). Its error is logged via xlog but never aborts the turn. - test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections, classifier error) plus a SoundDetection method on the fakeModel double. Signed-off-by: Ettore Di Giacinto * fix(ced): implement SoundDetection in nodes backend test doubles The SoundDetection method added to the grpc backend interface left two test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so core/services/nodes failed to compile under `go vet`/`go test` (go build missed it: the doubles live in _test.go). Add the method to both, mirroring their existing Detect mock. Repairs CI for the nodes package. Signed-off-by: Ettore Di Giacinto * feat(ced): decouple realtime sound detection from VAD (sound-only sessions) Sound-event detection must activate on sounds, not speech, so it no longer runs through the voice VAD/transcription path. A sound-detection-only pipeline (sound_detection set, no transcription/LLM) now: - is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline stage), - builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS loaded), and - defaults the session to turn_detection none (no VAD) with no transcription stage, so the client drives windowing via input_audio_buffer.commit (option A: client-side sliding window). The per-PCM C-API already supports arbitrary windows. 
commitUtterance gains a sound-only branch: it emits the conversation.item.sound_detection event (scored AudioSet tags) and stops - no transcription, no LLM response. generateResponse is now guarded on a transcription stage being present, so a sound-only turn never invokes the LLM. Existing transcription/VAD sessions are unchanged (additive). Added a commitUtterance sound-only Ginkgo spec asserting it emits the sound event and neither transcribes nor generates a response. go vet + golangci-lint (new-from-merge-base) clean; openai suite green. Signed-off-by: Ettore Di Giacinto * feat(ced): register sound-classification backend in gallery + CI Mechanical backend-image registration for the ced sound-event classifier, mirroring the parakeet-cpp Go/purego backend everywhere it is wired up. - .github/backend-matrix.yml: add the ced build matrix, field-for-field copies of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64, l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan amd64/arm64, rocm hipblas, and the metal darwin entry), changing only backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang. - backend/index.yaml: add the &ced meta anchor (capabilities map per platform) plus ced-development and the per-arch image entries, each uri/mirror tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is intentionally deferred pending the HuggingFace publish (TODO note inline). - scripts/changed-backends.js: add an explicit item.backend === "ced" branch in inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as the parakeet-cpp branch (before the generic golang fallthrough). - .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in backend/go/ced/Makefile so the daily bot bumps the pin. - swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so the existing /v1/audio/classification annotations land in the generated spec. 
Signed-off-by: Ettore Di Giacinto * feat(ced): server-side windowing for realtime sound detection (option B) Adds an optional server-driven sliding-window classifier so a sound-only realtime client only has to stream audio (no input_audio_buffer.commit): - Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs. When both > 0 on a sound-only session, the server classifies the last window of streamed audio every hop and emits a conversation.item.sound_detection event; the input buffer is trimmed to one window so a long stream stays bounded. When unset, the session stays client-driven (option A). Runs independent of VAD (sound events are not speech). - handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so it is unit-testable) + writeWindowWAV, which declares the true InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples correctly. Goroutine is started after toggleVAD and torn down with the session (close + wg.Wait). - Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta registry; the earlier realtime commit added pipeline.sound_detection without a registry entry, failing TestAllFieldsHaveRegistryEntries. This fixes that and covers the two new knobs. Tests: classifySoundWindow emits an event + trims the buffer to one window, no-ops on too-little audio; writeWindowWAV declares the given sample rate. go build/vet + golangci-lint (new-from-merge-base) clean; config + openai suites green. Signed-off-by: Ettore Di Giacinto * feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0) The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0, converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced + known_usecases: sound_classification) and two gallery/index.yaml entries (ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and removes the now-resolved TODO from backend/index.yaml. 
Signed-off-by: Ettore Di Giacinto * feat(ced): add tiny/mini/small GGUF model gallery entries Publishes the rest of the CED family (same architecture, metadata-driven port verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds their f16 + q8_0 gallery entries: ced-tiny (5.5M, edge/Pi-class) f16 11MB / q8_0 6MB ced-mini (9.6M) f16 19MB / q8_0 11MB ced-small (22M) f16 42MB / q8_0 23MB All sha256-pinned. ced-base remains the accuracy default. Signed-off-by: Ettore Di Giacinto * chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8 gallery model entries' urls + file uris accordingly. sha256 and filenames are unchanged. Signed-off-by: Ettore Di Giacinto * chore(ced): bump CED_VERSION to the short-clip fix Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip shorter than target_length (~10.11s): time_pos_embed was added at its full 63-frame grid instead of being sliced to the clip's actual time grid, tripping ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s windows) and gated with a short-clip parity test upstream. Signed-off-by: Ettore Di Giacinto * docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive - README.md: add ced.cpp to the "native C/C++/GGML engines developed and maintained by the LocalAI project" table. - docs/content/features/backends.md: add a Sound Classification backend category (sound-event classification / audio tagging) listing ced.cpp. - .agents/adding-backends.md: add a "Documenting the backend" section and two verification-checklist items requiring new backends to be documented in the backends.md category list, and in-house native engines to be added to the README maintained-engines table. This directive was missing. 
Signed-off-by: Ettore Di Giacinto * chore(ced): repin CED_VERSION to the v0.1.0 release commit ced.cpp history was squashed into a single release commit (tagged v0.1.0), so the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the v0.1.0 release commit, so the backend builds against a commit that exists. Signed-off-by: Ettore Di Giacinto * fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths - sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler. - goced.go: reading a NUL-terminated C string from a libced-owned buffer. #nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since the uintptr is a C-owned malloc'd buffer, not Go-GC memory. Signed-off-by: Ettore Di Giacinto --------- Signed-off-by: Ettore Di Giacinto Co-authored-by: Ettore Di Giacinto --- .agents/adding-backends.md | 23 ++ .github/backend-matrix.yml | 152 +++++++++++ .github/workflows/bump_deps.yaml | 4 + README.md | 1 + backend/backend.proto | 21 ++ backend/go/ced/.gitignore | 11 + backend/go/ced/Makefile | 77 ++++++ backend/go/ced/goced.go | 130 +++++++++ backend/go/ced/main.go | 59 ++++ backend/go/ced/package.sh | 60 ++++ backend/go/ced/run.sh | 15 + backend/index.yaml | 146 ++++++++++ core/backend/sound_classification.go | 88 ++++++ core/config/backend_capabilities.go | 49 ++-- core/config/meta/constants.go | 1 + core/config/meta/registry.go | 24 ++ core/config/model_config.go | 88 ++++-- core/http/auth/features.go | 5 + core/http/auth/permissions.go | 39 +-- .../endpoints/localai/api_instructions.go | 4 +- core/http/endpoints/openai/realtime.go | 257 ++++++++++++++---- .../endpoints/openai/realtime_doubles_test.go | 12 + core/http/endpoints/openai/realtime_model.go | 100 ++++++- .../openai/realtime_sound_detection.go | 48 ++++ .../openai/realtime_sound_detection_test.go | 170 ++++++++++++ .../endpoints/openai/sound_classification.go | 91 +++++++ 
.../endpoints/openai/types/server_events.go | 50 ++++ .../react-ui/public/locales/en/models.json | 1 + core/http/react-ui/src/pages/Models.jsx | 1 + core/http/react-ui/src/utils/capabilities.js | 1 + core/http/routes/localai.go | 17 +- core/http/routes/openai.go | 17 ++ core/http/routes/ui_api.go | 31 ++- core/schema/sound_classification.go | 19 ++ core/services/nodes/health_mock_test.go | 3 + core/services/nodes/inflight_test.go | 3 + docs/content/features/audio-classification.md | 55 ++++ docs/content/features/audio-diarization.md | 4 + docs/content/features/backends.md | 1 + docs/content/whats-new.md | 1 + gallery/ced.yaml | 7 + gallery/index.yaml | 184 +++++++++++++ pkg/grpc/backend.go | 2 + pkg/grpc/base/base.go | 4 + pkg/grpc/client.go | 18 ++ pkg/grpc/embed.go | 4 + pkg/grpc/interface.go | 1 + pkg/grpc/server.go | 8 + scripts/changed-backends.js | 7 + swagger/docs.go | 75 +++++ swagger/swagger.json | 75 +++++ swagger/swagger.yaml | 49 ++++ 52 files changed, 2161 insertions(+), 152 deletions(-) create mode 100644 backend/go/ced/.gitignore create mode 100644 backend/go/ced/Makefile create mode 100644 backend/go/ced/goced.go create mode 100644 backend/go/ced/main.go create mode 100755 backend/go/ced/package.sh create mode 100755 backend/go/ced/run.sh create mode 100644 core/backend/sound_classification.go create mode 100644 core/http/endpoints/openai/realtime_sound_detection.go create mode 100644 core/http/endpoints/openai/realtime_sound_detection_test.go create mode 100644 core/http/endpoints/openai/sound_classification.go create mode 100644 core/schema/sound_classification.go create mode 100644 docs/content/features/audio-classification.md create mode 100644 gallery/ced.yaml diff --git a/.agents/adding-backends.md b/.agents/adding-backends.md index 4a37a298e..ab965f789 100644 --- a/.agents/adding-backends.md +++ b/.agents/adding-backends.md @@ -198,6 +198,27 @@ docker-build-backends: ... 
docker-build- - If the backend is in `backend/python//` but uses `.` as context in the workflow file, use `.` context - Check similar backends to determine the correct context +## Documenting the backend (README + docs) + +A backend is not "added" until it is discoverable. Update the user-facing docs: + +- **`docs/content/features/backends.md`** - add the backend to the right + category in the "LocalAI supports various types of backends" list (and add a + new category if it introduces a new modality, e.g. sound classification). +- If the backend introduces a **new API surface** (a new endpoint or a realtime + capability), document it under `docs/content/` where its area lives (audio, + vision, etc.) and follow the api-endpoints checklist in + [api-endpoints-and-auth.md](api-endpoints-and-auth.md). + +**If the backend is a native C/C++/GGML engine created and maintained by the +LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`, +`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it +ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML +engines ... developed and maintained by the LocalAI project itself". Add a row +linking the upstream engine repo with a one-line description. This is the +project's showcase of its own engines; a new in-house backend that is missing +from it is a documentation bug. + ## 5. 
Verification Checklist After adding a new backend, verify: @@ -211,6 +232,8 @@ After adding a new backend, verify: - [ ] No YAML syntax errors (check with linter) - [ ] No Makefile syntax errors (check with linter) - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern) +- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`) +- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md` ## Bundling runtime shared libraries (`package.sh`) diff --git a/.github/backend-matrix.yml b/.github/backend-matrix.yml index c2c6638ec..593e44cde 100644 --- a/.github/backend-matrix.yml +++ b/.github/backend-matrix.yml @@ -3575,6 +3575,154 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + # ced + - build-type: 'cublas' + cuda-major-version: "12" + cuda-minor-version: "8" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-12-ced' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-13-ced' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/arm64' + skip-drivers: 'false' + tag-latest: 'auto' + tag-suffix: '-nvidia-l4t-cuda-13-arm64-ced' + base-image: "ubuntu:24.04" + ubuntu-version: '2404' + runs-on: 'ubuntu-24.04-arm' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" 
+ context: "./" + - build-type: '' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + platform-tag: 'amd64' + tag-latest: 'auto' + tag-suffix: '-cpu-ced' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: '' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/arm64' + platform-tag: 'arm64' + tag-latest: 'auto' + tag-suffix: '-cpu-ced' + runs-on: 'ubuntu-24.04-arm' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'sycl_f32' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-intel-sycl-f32-ced' + runs-on: 'ubuntu-latest' + base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'sycl_f16' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-intel-sycl-f16-ced' + runs-on: 'ubuntu-latest' + base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'vulkan' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + platform-tag: 'amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-vulkan-ced' + runs-on: 'ubuntu-latest' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'vulkan' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/arm64' + platform-tag: 'arm64' + tag-latest: 'auto' + 
tag-suffix: '-gpu-vulkan-ced' + runs-on: 'ubuntu-24.04-arm' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "12" + cuda-minor-version: "0" + platforms: 'linux/arm64' + skip-drivers: 'false' + tag-latest: 'auto' + tag-suffix: '-nvidia-l4t-arm64-ced' + base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0" + runs-on: 'ubuntu-24.04-arm' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2204' + - build-type: 'hipblas' + cuda-major-version: "" + cuda-minor-version: "" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-rocm-hipblas-ced' + base-image: "rocm/dev-ubuntu-24.04:7.2.1" + runs-on: 'ubuntu-latest' + skip-drivers: 'false' + backend: "ced" + dockerfile: "./backend/Dockerfile.golang" + context: "./" + ubuntu-version: '2404' # acestep-cpp - build-type: '' cuda-major-version: "" @@ -4754,6 +4902,10 @@ includeDarwin: tag-suffix: "-metal-darwin-arm64-parakeet-cpp" build-type: "metal" lang: "go" + - backend: "ced" + tag-suffix: "-metal-darwin-arm64-ced" + build-type: "metal" + lang: "go" - backend: "acestep-cpp" tag-suffix: "-metal-darwin-arm64-acestep-cpp" build-type: "metal" diff --git a/.github/workflows/bump_deps.yaml b/.github/workflows/bump_deps.yaml index 6dbf8dcf2..481c9a609 100644 --- a/.github/workflows/bump_deps.yaml +++ b/.github/workflows/bump_deps.yaml @@ -42,6 +42,10 @@ jobs: variable: "PARAKEET_VERSION" branch: "master" file: "backend/go/parakeet-cpp/Makefile" + - repository: "mudler/ced.cpp" + variable: "CED_VERSION" + branch: "master" + file: "backend/go/ced/Makefile" - repository: "mudler/depth-anything.cpp" variable: "DEPTHANYTHING_VERSION" branch: "master" diff --git a/README.md b/README.md index 5fff7db69..f7843950d 100644 --- a/README.md +++ b/README.md @@ -231,6 +231,7 @@ Most backends wrap a best-in-class upstream engine. 
A handful of them are native | Backend | What it does | |---------|-------------| | [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription | +| [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition | | [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C | | [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization | | [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation | diff --git a/backend/backend.proto b/backend/backend.proto index 68db81e35..2a575426e 100644 --- a/backend/backend.proto +++ b/backend/backend.proto @@ -24,6 +24,9 @@ service Backend { rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {} rpc Status(HealthMessage) returns (StatusResponse) {} rpc Detect(DetectOptions) returns (DetectResponse) {} + // SoundDetection runs an audio-tagging / sound-event-classification model + // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels. 
+ rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {} rpc Depth(DepthRequest) returns (DepthResponse) {} rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {} rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {} @@ -671,6 +674,24 @@ message DetectResponse { repeated Detection Detections = 1; } +// --- Sound-event classification / audio tagging messages (CED) --- + +message SoundDetectionRequest { + string src = 1; // audio file path (LocalAI writes the upload to disk) + int32 top_k = 2; // number of top tags to return (0 = all classes) + float threshold = 3; // optional: drop tags scoring below this +} + +message SoundClass { + string label = 1; // AudioSet class name, e.g. "Baby cry, infant cry" + float score = 2; // per-class probability (multi-label, independent) + int32 index = 3; // class index in the model ontology +} + +message SoundDetectionResponse { + repeated SoundClass detections = 1; // score-descending +} + // --- Depth estimation messages (Depth Anything 3) --- message DepthRequest { diff --git a/backend/go/ced/.gitignore b/backend/go/ced/.gitignore new file mode 100644 index 000000000..5e47da6c5 --- /dev/null +++ b/backend/go/ced/.gitignore @@ -0,0 +1,11 @@ +.cache/ +sources/ +build/ +package/ +ced-grpc +# build artifacts staged in-tree by the Makefile (cp from sources/) or +# symlinked for local dev; the real sources live in ced.cpp upstream. +*.so +*.so.* +ced_capi.h +compile_commands.json diff --git a/backend/go/ced/Makefile b/backend/go/ced/Makefile new file mode 100644 index 000000000..632c0e255 --- /dev/null +++ b/backend/go/ced/Makefile @@ -0,0 +1,77 @@ +# ced sound-classification backend Makefile. +# +# Upstream pin lives below as CED_VERSION?= so .github/bump_deps.sh can find +# and update it (matches the parakeet-cpp / whisper.cpp convention). 
+# +# Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and +# skip the clone/cmake steps entirely: +# ln -sf /path/to/ced.cpp/build-shared/libced.so . +# ln -sf /path/to/ced.cpp/include/ced_capi.h . +# go build -o ced-grpc . + +CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7 +CED_REPO?=https://github.com/mudler/ced.cpp + +GOCMD?=go +GO_TAGS?= +JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4) + +BUILD_TYPE?= +NATIVE?=false + +# Static-link ggml into libced.so (PIC) so the shared lib is self-contained: +# dlopen needs no libggml*.so alongside it, only system libs the runtime image +# already provides. +CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON + +ifeq ($(NATIVE),false) + CMAKE_ARGS+=-DGGML_NATIVE=OFF +endif + +# ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL +# "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON. 
+ifeq ($(BUILD_TYPE),cublas) + CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON +else ifeq ($(BUILD_TYPE),openblas) + CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS +else ifeq ($(BUILD_TYPE),hipblas) + CMAKE_ARGS+=-DCED_GGML_HIP=ON +else ifeq ($(BUILD_TYPE),vulkan) + CMAKE_ARGS+=-DCED_GGML_VULKAN=ON +endif + +.PHONY: ced-grpc package build clean purge test all + +all: ced-grpc + +sources/ced.cpp: + mkdir -p sources/ced.cpp + cd sources/ced.cpp && \ + git init -q && \ + git remote add origin $(CED_REPO) && \ + git fetch --depth 1 origin $(CED_VERSION) && \ + git checkout FETCH_HEAD && \ + git submodule update --init --recursive --depth 1 --single-branch + +libced.so: sources/ced.cpp + cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS) + cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS) + cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true + cp -fv sources/ced.cpp/include/ced_capi.h ./ + +ced-grpc: libced.so main.go goced.go + CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc . + +package: ced-grpc + bash package.sh + +build: package + +test: + LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1 + +clean: purge + rm -rf libced.so* ced_capi.h package ced-grpc + +purge: + rm -rf sources/ced.cpp diff --git a/backend/go/ced/goced.go b/backend/go/ced/goced.go new file mode 100644 index 000000000..a405bf017 --- /dev/null +++ b/backend/go/ced/goced.go @@ -0,0 +1,130 @@ +package main + +// Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC +// SoundDetection implementation. +// +// SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with +// `make protogen-go`). The C side is single-threaded per ctx, so we guard the +// engine with engineMu; LocalAI also serializes via base.SingleThread. 
+import ( + "context" + "encoding/json" + "errors" + "fmt" + "sort" + "sync" + "unsafe" + + "github.com/mudler/LocalAI/pkg/grpc/base" + pb "github.com/mudler/LocalAI/pkg/grpc/proto" +) + +// purego-bound entry points from libced.so. Names match ced_capi.h exactly. +var ( + CppAbiVersion func() int32 + CppLoad func(ggufPath string) uintptr + CppFree func(ctx uintptr) + CppLastError func(ctx uintptr) string + CppNumClasses func(ctx uintptr) int32 + CppSampleRate func(ctx uintptr) int32 + CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr + CppClassifyPcmJSON func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr + CppFreeString func(s uintptr) +) + +// cstr copies a malloc'd C string (returned as uintptr) into a Go string and +// frees the original via ced_capi_free_string. Empty/0 -> "". +func cstr(p uintptr) string { + if p == 0 { + return "" + } + defer CppFreeString(p) + var b []byte + for i := 0; ; i++ { + ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory) + if ch == 0 { + break + } + b = append(b, ch) + } + return string(b) +} + +// Ced is the gRPC backend. One loaded CED model per instance. +type Ced struct { + base.Base + ctxPtr uintptr + engineMu sync.Mutex +} + +// Load resolves the GGUF and opens the C-API context. +func (c *Ced) Load(opts *pb.ModelOptions) error { + if opts.ModelFile == "" { + return errors.New("ced: ModelFile is required") + } + ctx := CppLoad(opts.ModelFile) + if ctx == 0 { + return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0)) + } + c.ctxPtr = ctx + return nil +} + +// jsonTag mirrors the ced_capi JSON tag objects. +type jsonTag struct { + Index int `json:"index"` + Score float32 `json:"score"` + Label string `json:"label"` +} + +// SoundDetection classifies the clip at req.Src and returns scored AudioSet tags. 
+func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) { + if c.ctxPtr == 0 { + return nil, errors.New("ced: model not loaded") + } + if req.GetSrc() == "" { + return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required") + } + topK := req.GetTopK() + if topK <= 0 { + topK = 10 // sensible default for a tagging response + } + + c.engineMu.Lock() + out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK)) + lastErr := CppLastError(c.ctxPtr) + c.engineMu.Unlock() + + if out == "" { + return nil, fmt.Errorf("ced: classification failed: %s", lastErr) + } + var tags []jsonTag + if err := json.Unmarshal([]byte(out), &tags); err != nil { + return nil, fmt.Errorf("ced: bad classifier JSON: %w", err) + } + + thr := req.GetThreshold() + resp := &pb.SoundDetectionResponse{} + for _, t := range tags { + if t.Score < thr { + continue + } + resp.Detections = append(resp.Detections, &pb.SoundClass{ + Label: t.Label, Score: t.Score, Index: int32(t.Index), + }) + } + sort.Slice(resp.Detections, func(i, j int) bool { + return resp.Detections[i].Score > resp.Detections[j].Score + }) + return resp, nil +} + +func (c *Ced) Free() error { + c.engineMu.Lock() + defer c.engineMu.Unlock() + if c.ctxPtr != 0 { + CppFree(c.ctxPtr) + c.ctxPtr = 0 + } + return nil +} diff --git a/backend/go/ced/main.go b/backend/go/ced/main.go new file mode 100644 index 000000000..ea8aa8549 --- /dev/null +++ b/backend/go/ced/main.go @@ -0,0 +1,59 @@ +package main + +// ced sound-classification backend. Started internally by LocalAI: one gRPC +// server per loaded model. Loads libced.so via purego and registers the flat +// C-API declared in ced_capi.h. The library name can be overridden with +// CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default looks +// for the .so next to this binary. 
+//
+// SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
+// addition, and a built libced.so (see Makefile). See DESIGN.md.
+import (
+	"flag"
+	"fmt"
+	"os"
+
+	"github.com/ebitengine/purego"
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+)
+
+var addr = flag.String("addr", "localhost:50051", "the address to connect to")
+
+type libFunc struct {
+	ptr  any
+	name string
+}
+
+func main() {
+	libName := os.Getenv("CED_LIBRARY")
+	if libName == "" {
+		libName = "libced.so"
+	}
+	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
+	if err != nil {
+		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
+	}
+
+	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
+	// so we can free the same pointer with ced_capi_free_string after copying
+	// (purego's string return would copy and leak the original).
+	for _, lf := range []libFunc{
+		{&CppAbiVersion, "ced_capi_abi_version"},
+		{&CppLoad, "ced_capi_load"},
+		{&CppFree, "ced_capi_free"},
+		{&CppLastError, "ced_capi_last_error"},
+		{&CppNumClasses, "ced_capi_num_classes"},
+		{&CppSampleRate, "ced_capi_sample_rate"},
+		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
+		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
+		{&CppFreeString, "ced_capi_free_string"},
+	} {
+		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
+	}
+
+	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())
+	flag.Parse()
+	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
+		panic(err)
+	}
+}
diff --git a/backend/go/ced/package.sh b/backend/go/ced/package.sh
new file mode 100755
index 000000000..bde0adad6
--- /dev/null
+++ b/backend/go/ced/package.sh
@@ -0,0 +1,60 @@
+#!/bin/bash
+#
+# Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
+# libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
+# is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
+# the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.
+
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+mkdir -p "$CURDIR/package/lib"
+
+cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
+cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
+
+cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || {
+    echo "ERROR: libced.so not found in $CURDIR, run 'make' first" >&2
+    exit 1
+}
+
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ "$(uname -s)" = "Darwin" ]; then
+    echo "Detected Darwin"
+else
+    echo "Error: Could not detect architecture"
+    exit 1
+fi
+
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
diff --git a/backend/go/ced/run.sh b/backend/go/ced/run.sh
new file mode 100755
index 000000000..bce6fec8e
--- /dev/null
+++ b/backend/go/ced/run.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+
+export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
+
+# If a self-contained ld.so was packaged, route through it so the packaged
+# libc / libstdc++ are used instead of the host's (matches the sibling backends).
+if [ -f "$CURDIR/lib/ld.so" ]; then
+    echo "Using lib/ld.so"
+    exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
+fi
+
+exec "$CURDIR/ced-grpc" "$@"
diff --git a/backend/index.yaml b/backend/index.yaml
index 97fd1eb28..3f61f7b4e 100644
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -178,6 +178,37 @@
     nvidia-cuda-12: "cuda12-parakeet-cpp"
     nvidia-l4t-cuda-12: "nvidia-l4t-arm64-parakeet-cpp"
     nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-parakeet-cpp"
+- &ced
+  name: "ced"
+  alias: "ced"
+  license: mit
+  icon: https://avatars.githubusercontent.com/u/95302084
+  description: |
+    CED sound-event classification / audio tagging (527-class AudioSet).
+    ced.cpp is a C++/ggml port that performs audio tagging over the AudioSet
+    taxonomy, exposed through the SoundDetection gRPC rpc and the
+    /v1/audio/classification REST endpoint. It runs on CPU, NVIDIA CUDA,
+    AMD ROCm/HIP, Intel SYCL, Vulkan and NVIDIA Jetson (L4T) targets.
+  urls:
+    - https://github.com/mudler/ced.cpp
+  tags:
+    - audio-classification
+    - CPU
+    - GPU
+    - CUDA
+    - HIP
+  capabilities:
+    default: "cpu-ced"
+    nvidia: "cuda12-ced"
+    intel: "intel-sycl-f16-ced"
+    metal: "metal-ced"
+    amd: "rocm-ced"
+    vulkan: "vulkan-ced"
+    nvidia-l4t: "nvidia-l4t-arm64-ced"
+    nvidia-cuda-13: "cuda13-ced"
+    nvidia-cuda-12: "cuda12-ced"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced"
 - &voxtral
   name: "voxtral"
   alias: "voxtral"
@@ -2650,6 +2681,121 @@
   uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp"
   mirrors:
     - localai/localai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp
+## ced
+- !!merge <<: *ced
+  name: "ced-development"
+  capabilities:
+    default: "cpu-ced-development"
+    nvidia: "cuda12-ced-development"
+    intel: "intel-sycl-f16-ced-development"
+    metal: "metal-ced-development"
+    amd: "rocm-ced-development"
+    vulkan: "vulkan-ced-development"
+    nvidia-l4t: "nvidia-l4t-arm64-ced-development"
+    nvidia-cuda-13: "cuda13-ced-development"
+    nvidia-cuda-12: "cuda12-ced-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced-development"
+- !!merge <<: *ced
+  name: "nvidia-l4t-arm64-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-ced"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-arm64-ced
+- !!merge <<: *ced
+  name: "nvidia-l4t-arm64-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-ced"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-arm64-ced
+- !!merge <<: *ced
+  name: "cuda13-nvidia-l4t-arm64-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ced"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ced
+- !!merge <<: *ced
+  name: "cuda13-nvidia-l4t-arm64-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ced"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ced
+- !!merge <<: *ced
+  name: "cpu-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ced"
+  mirrors:
+    - localai/localai-backends:latest-cpu-ced
+- !!merge <<: *ced
+  name: "cpu-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ced"
+  mirrors:
+    - localai/localai-backends:master-cpu-ced
+- !!merge <<: *ced
+  name: "metal-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ced"
+  mirrors:
+    - localai/localai-backends:latest-metal-darwin-arm64-ced
+- !!merge <<: *ced
+  name: "metal-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ced"
+  mirrors:
+    - localai/localai-backends:master-metal-darwin-arm64-ced
+- !!merge <<: *ced
+  name: "cuda12-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-ced
+- !!merge <<: *ced
+  name: "cuda12-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-ced
+- !!merge <<: *ced
+  name: "rocm-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-rocm-hipblas-ced
+- !!merge <<: *ced
+  name: "rocm-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-rocm-hipblas-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f32-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f32-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f32-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f32-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f16-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f16-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f16-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f16-ced
+- !!merge <<: *ced
+  name: "vulkan-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-vulkan-ced
+- !!merge <<: *ced
+  name: "vulkan-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-vulkan-ced
+- !!merge <<: *ced
+  name: "cuda13-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-ced
+- !!merge <<: *ced
+  name: "cuda13-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-ced
 ## stablediffusion-ggml
 - !!merge <<: *stablediffusionggml
   name: "cpu-stablediffusion-ggml"
diff --git a/core/backend/sound_classification.go b/core/backend/sound_classification.go
new file mode 100644
index 000000000..666c32321
--- /dev/null
+++ b/core/backend/sound_classification.go
@@ -0,0 +1,88 @@
+package backend
+
+import (
+	"context"
+	"fmt"
+	"sort"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/schema"
+
+	grpcPkg "github.com/mudler/LocalAI/pkg/grpc"
+	"github.com/mudler/LocalAI/pkg/grpc/proto"
+	"github.com/mudler/LocalAI/pkg/model"
+)
+
+// SoundDetectionRequest carries the knobs the HTTP layer collects for an
+// audio-tagging / sound-event-classification call. Audio is the path to the
+// uploaded clip on disk; TopK and Threshold are optional (0 = backend default).
+type SoundDetectionRequest struct {
+	Audio     string
+	TopK      int32
+	Threshold float32
+}
+
+func (r *SoundDetectionRequest) toProto() *proto.SoundDetectionRequest {
+	return &proto.SoundDetectionRequest{
+		Src:       r.Audio,
+		TopK:      r.TopK,
+		Threshold: r.Threshold,
+	}
+}
+
+func loadSoundDetectionModel(ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (grpcPkg.Backend, error) {
+	if modelConfig.Backend == "" {
+		return nil, fmt.Errorf("sound classification: model %q has no backend set; supported backends include ced", modelConfig.Name)
+	}
+	opts := ModelOptions(modelConfig, appConfig)
+	m, err := ml.Load(opts...)
+	if err != nil {
+		recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
+		return nil, err
+	}
+	if m == nil {
+		return nil, fmt.Errorf("could not load sound classification model")
+	}
+	return m, nil
+}
+
+// ModelSoundDetection runs the SoundDetection RPC against the configured
+// backend and returns a normalized schema.SoundClassificationResult.
+func ModelSoundDetection(ctx context.Context, req SoundDetectionRequest, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (*schema.SoundClassificationResult, error) {
+	m, err := loadSoundDetectionModel(ml, modelConfig, appConfig)
+	if err != nil {
+		return nil, err
+	}
+
+	r, err := m.SoundDetection(ctx, req.toProto())
+	if err != nil {
+		return nil, err
+	}
+	return soundClassificationResultFromProto(modelConfig.Name, r), nil
+}
+
+// soundClassificationResultFromProto maps the backend detections to the
+// HTTP-facing schema, keeping the backend's score-descending order.
+func soundClassificationResultFromProto(modelName string, r *proto.SoundDetectionResponse) *schema.SoundClassificationResult {
+	out := &schema.SoundClassificationResult{
+		Model:      modelName,
+		Detections: []schema.SoundClassification{},
+	}
+	if r == nil {
+		return out
+	}
+	for _, d := range r.Detections {
+		if d == nil {
+			continue
+		}
+		out.Detections = append(out.Detections, schema.SoundClassification{
+			Index: int(d.Index),
+			Label: d.Label,
+			Score: d.Score,
+		})
+	}
+	sort.SliceStable(out.Detections, func(i, j int) bool {
+		return out.Detections[i].Score > out.Detections[j].Score
+	})
+	return out
+}
diff --git a/core/config/backend_capabilities.go b/core/config/backend_capabilities.go
index eba8c3c37..cc9567887 100644
--- a/core/config/backend_capabilities.go
+++ b/core/config/backend_capabilities.go
@@ -8,27 +8,28 @@ import (
 // Usecase name constants — the canonical string values used in gallery entries,
 // model configs (known_usecases), and UsecaseInfoMap keys.
 const (
-	UsecaseChat               = "chat"
-	UsecaseCompletion         = "completion"
-	UsecaseEdit               = "edit"
-	UsecaseVision             = "vision"
-	UsecaseEmbeddings         = "embeddings"
-	UsecaseTokenize           = "tokenize"
-	UsecaseImage              = "image"
-	UsecaseVideo              = "video"
-	UsecaseTranscript         = "transcript"
-	UsecaseTTS                = "tts"
-	UsecaseSoundGeneration    = "sound_generation"
-	UsecaseRerank             = "rerank"
-	UsecaseDetection          = "detection"
-	UsecaseDepth              = "depth"
-	UsecaseVAD                = "vad"
-	UsecaseAudioTransform     = "audio_transform"
-	UsecaseDiarization        = "diarization"
-	UsecaseRealtimeAudio      = "realtime_audio"
-	UsecaseFaceRecognition    = "face_recognition"
-	UsecaseSpeakerRecognition = "speaker_recognition"
-	UsecaseTokenClassify      = "token_classify"
+	UsecaseChat                = "chat"
+	UsecaseCompletion          = "completion"
+	UsecaseEdit                = "edit"
+	UsecaseVision              = "vision"
+	UsecaseEmbeddings          = "embeddings"
+	UsecaseTokenize            = "tokenize"
+	UsecaseImage               = "image"
+	UsecaseVideo               = "video"
+	UsecaseTranscript          = "transcript"
+	UsecaseTTS                 = "tts"
+	UsecaseSoundGeneration     = "sound_generation"
+	UsecaseRerank              = "rerank"
+	UsecaseDetection           = "detection"
+	UsecaseDepth               = "depth"
+	UsecaseVAD                 = "vad"
+	UsecaseAudioTransform      = "audio_transform"
+	UsecaseDiarization         = "diarization"
+	UsecaseSoundClassification = "sound_classification"
+	UsecaseRealtimeAudio       = "realtime_audio"
+	UsecaseFaceRecognition     = "face_recognition"
+	UsecaseSpeakerRecognition  = "speaker_recognition"
+	UsecaseTokenClassify       = "token_classify"
 )

 // GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -51,6 +52,7 @@ const (
 	MethodVAD                GRPCMethod = "VAD"
 	MethodAudioTransform     GRPCMethod = "AudioTransform"
 	MethodDiarize            GRPCMethod = "Diarize"
+	MethodSoundDetection     GRPCMethod = "SoundDetection"
 	MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
 	MethodFaceVerify         GRPCMethod = "FaceVerify"
 	MethodFaceAnalyze        GRPCMethod = "FaceAnalyze"
@@ -165,6 +167,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
 		GRPCMethod:  MethodDiarize,
 		Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
 	},
+	UsecaseSoundClassification: {
+		Flag:        FLAG_SOUND_CLASSIFICATION,
+		GRPCMethod:  MethodSoundDetection,
+		Description: "Sound-event classification / audio tagging (scored AudioSet labels like baby cry, glass breaking, alarms) via the SoundDetection RPC.",
+	},
 	UsecaseRealtimeAudio: {
 		Flag:        FLAG_REALTIME_AUDIO,
 		GRPCMethod:  MethodAudioToAudioStream,
diff --git a/core/config/meta/constants.go b/core/config/meta/constants.go
index 72da2f99a..7fed6ba75 100644
--- a/core/config/meta/constants.go
+++ b/core/config/meta/constants.go
@@ -68,6 +68,7 @@ var UsecaseOptions = []FieldOption{
 	{Value: "face_recognition", Label: "Face Recognition"},
 	{Value: "transcript", Label: "Transcript"},
 	{Value: "diarization", Label: "Diarization"},
+	{Value: "sound_classification", Label: "Sound Classification"},
 	{Value: "speaker_recognition", Label: "Speaker Recognition"},
 	{Value: "tts", Label: "TTS"},
 	{Value: "sound_generation", Label: "Sound Generation"},
diff --git a/core/config/meta/registry.go b/core/config/meta/registry.go
index b7ffa9290..a1cfe4c9a 100644
--- a/core/config/meta/registry.go
+++ b/core/config/meta/registry.go
@@ -328,6 +328,30 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			AutocompleteProvider: ProviderModelsVAD,
 			Order:                63,
 		},
+		"pipeline.sound_detection": {
+			Section:              "pipeline",
+			Label:                "Sound Detection Model",
+			Description:          "Model to use for sound-event classification (audio tagging, e.g. ced) in the pipeline. When set, committed realtime audio is also classified and the scored AudioSet tags are emitted as a conversation.item.sound_detection event.",
+			Component:            "model-select",
+			AutocompleteProvider: ProviderModels,
+			Order:                64,
+		},
+		"pipeline.sound_detection_window_ms": {
+			Section:     "pipeline",
+			Label:       "Sound Detection Window (ms)",
+			Description: "Server-side windowing for a sound-only realtime session: length in ms of the audio window classified each hop. 0 = client-driven (the client commits windows).",
+			Component:   "number",
+			Min:         f64(0),
+			Order:       65,
+		},
+		"pipeline.sound_detection_hop_ms": {
+			Section:     "pipeline",
+			Label:       "Sound Detection Hop (ms)",
+			Description: "Server-side windowing hop in ms: how often the server classifies the last window. 0 = client-driven.",
+			Component:   "number",
+			Min:         f64(0),
+			Order:       66,
+		},
 		"pipeline.reasoning_effort": {
 			Section:     "pipeline",
 			Label:       "Reasoning Effort",
diff --git a/core/config/model_config.go b/core/config/model_config.go
index 5dbfd2026..cbb336838 100644
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -604,6 +604,20 @@ type Pipeline struct {
 	LLM           string `yaml:"llm,omitempty" json:"llm,omitempty"`
 	Transcription string `yaml:"transcription,omitempty" json:"transcription,omitempty"`
 	VAD           string `yaml:"vad,omitempty" json:"vad,omitempty"`
+	// SoundDetection names a sound-event-classification model (e.g. ced). When
+	// set, each VAD-committed realtime utterance is also run through it and the
+	// scored AudioSet tags are emitted as a conversation.item.sound_detection
+	// server event, alongside (and independent of) transcription.
+	SoundDetection string `yaml:"sound_detection,omitempty" json:"sound_detection,omitempty"`
+
+	// SoundDetectionWindowMs / SoundDetectionHopMs enable server-side windowing
+	// for a sound-detection-only realtime session: instead of the client
+	// committing audio buffers, the server classifies the last WindowMs of
+	// streamed audio every HopMs and emits a sound_detection event per hop. Both
+	// must be > 0 to activate; otherwise the session stays client-driven (the
+	// client commits windows via input_audio_buffer.commit).
+	SoundDetectionWindowMs int `yaml:"sound_detection_window_ms,omitempty" json:"sound_detection_window_ms,omitempty"`
+	SoundDetectionHopMs    int `yaml:"sound_detection_hop_ms,omitempty" json:"sound_detection_hop_ms,omitempty"`

 	// ReasoningEffort sets the reasoning effort (none|minimal|low|medium|high) for
 	// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
@@ -1452,6 +1466,11 @@ const (
 	// so it may combine freely with other usecases.
 	FLAG_TOKEN_CLASSIFY ModelConfigUsecase = 0b1000000000000000000000

+	// Marks a model as wired for the SoundDetection gRPC primitive
+	// (audio tagging / sound-event classification — scored AudioSet
+	// labels via the SoundDetection RPC, e.g. ced).
+	FLAG_SOUND_CLASSIFICATION ModelConfigUsecase = 0b10000000000000000000000
+
 	// Common Subsets
 	FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
 )
@@ -1460,12 +1479,12 @@ const (
 // Flags within the same group are NOT orthogonal (e.g., chat and completion are
 // both text/language). A model is multimodal when its usecases span 2+ groups.
 var ModalityGroups = []ModelConfigUsecase{
-	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,                // text/language
-	FLAG_VISION | FLAG_DETECTION,                           // visual understanding
-	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO,                  // speech input — realtime_audio is any-to-any, so it counts here too
-	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
-	FLAG_AUDIO_TRANSFORM,                                   // audio in/out transforms
-	FLAG_IMAGE | FLAG_VIDEO,                                // visual generation
+	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,                           // text/language
+	FLAG_VISION | FLAG_DETECTION,                                      // visual understanding
+	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO | FLAG_SOUND_CLASSIFICATION, // audio input — realtime_audio is any-to-any, so it counts here too
+	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO,            // audio output — and here, so a lone realtime_audio flag still reads as multimodal
+	FLAG_AUDIO_TRANSFORM,                                              // audio in/out transforms
+	FLAG_IMAGE | FLAG_VIDEO,                                           // visual generation
 }

 // IsMultimodal returns true if the given usecases span two or more orthogonal
@@ -1488,29 +1507,30 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
 	return map[string]ModelConfigUsecase{
 		// Note: FLAG_ANY is intentionally excluded from this map
 		// because it's 0 and would always match in HasUsecases checks
-		"FLAG_CHAT":                FLAG_CHAT,
-		"FLAG_COMPLETION":          FLAG_COMPLETION,
-		"FLAG_EDIT":                FLAG_EDIT,
-		"FLAG_EMBEDDINGS":          FLAG_EMBEDDINGS,
-		"FLAG_RERANK":              FLAG_RERANK,
-		"FLAG_IMAGE":               FLAG_IMAGE,
-		"FLAG_TRANSCRIPT":          FLAG_TRANSCRIPT,
-		"FLAG_TTS":                 FLAG_TTS,
-		"FLAG_SOUND_GENERATION":    FLAG_SOUND_GENERATION,
-		"FLAG_TOKENIZE":            FLAG_TOKENIZE,
-		"FLAG_VAD":                 FLAG_VAD,
-		"FLAG_LLM":                 FLAG_LLM,
-		"FLAG_VIDEO":               FLAG_VIDEO,
-		"FLAG_DETECTION":           FLAG_DETECTION,
-		"FLAG_VISION":              FLAG_VISION,
-		"FLAG_FACE_RECOGNITION":    FLAG_FACE_RECOGNITION,
-		"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
-		"FLAG_AUDIO_TRANSFORM":     FLAG_AUDIO_TRANSFORM,
-		"FLAG_DIARIZATION":         FLAG_DIARIZATION,
-		"FLAG_REALTIME_AUDIO":      FLAG_REALTIME_AUDIO,
-		"FLAG_SCORE":               FLAG_SCORE,
-		"FLAG_DEPTH":               FLAG_DEPTH,
-		"FLAG_TOKEN_CLASSIFY":      FLAG_TOKEN_CLASSIFY,
+		"FLAG_CHAT":                 FLAG_CHAT,
+		"FLAG_COMPLETION":           FLAG_COMPLETION,
+		"FLAG_EDIT":                 FLAG_EDIT,
+		"FLAG_EMBEDDINGS":           FLAG_EMBEDDINGS,
+		"FLAG_RERANK":               FLAG_RERANK,
+		"FLAG_IMAGE":                FLAG_IMAGE,
+		"FLAG_TRANSCRIPT":           FLAG_TRANSCRIPT,
+		"FLAG_TTS":                  FLAG_TTS,
+		"FLAG_SOUND_GENERATION":     FLAG_SOUND_GENERATION,
+		"FLAG_TOKENIZE":             FLAG_TOKENIZE,
+		"FLAG_VAD":                  FLAG_VAD,
+		"FLAG_LLM":                  FLAG_LLM,
+		"FLAG_VIDEO":                FLAG_VIDEO,
+		"FLAG_DETECTION":            FLAG_DETECTION,
+		"FLAG_VISION":               FLAG_VISION,
+		"FLAG_FACE_RECOGNITION":     FLAG_FACE_RECOGNITION,
+		"FLAG_SPEAKER_RECOGNITION":  FLAG_SPEAKER_RECOGNITION,
+		"FLAG_AUDIO_TRANSFORM":      FLAG_AUDIO_TRANSFORM,
+		"FLAG_DIARIZATION":          FLAG_DIARIZATION,
+		"FLAG_SOUND_CLASSIFICATION": FLAG_SOUND_CLASSIFICATION,
+		"FLAG_REALTIME_AUDIO":       FLAG_REALTIME_AUDIO,
+		"FLAG_SCORE":                FLAG_SCORE,
+		"FLAG_DEPTH":                FLAG_DEPTH,
+		"FLAG_TOKEN_CLASSIFY":       FLAG_TOKEN_CLASSIFY,
 	}
 }

@@ -1713,6 +1733,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 		}
 	}

+	if (u & FLAG_SOUND_CLASSIFICATION) == FLAG_SOUND_CLASSIFICATION {
+		// ced is a sound-event tagger (AudioSet labels) surfaced via the
+		// SoundDetection gRPC. Models without an explicit known_usecases
+		// still surface when they run on one of these backends.
+		soundClassificationBackends := []string{"ced"}
+		if !slices.Contains(soundClassificationBackends, c.Backend) {
+			return false
+		}
+	}
+
 	if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
 		// Backends that own a single any-to-any loop and implement
 		// AudioToAudioStream — listed here so models without an explicit
diff --git a/core/http/auth/features.go b/core/http/auth/features.go
index 615e82a49..8dbb32a03 100644
--- a/core/http/auth/features.go
+++ b/core/http/auth/features.go
@@ -48,6 +48,10 @@ var RouteFeatureRegistry = []RouteFeature{
 	{"POST", "/v1/audio/diarization", FeatureAudioDiarization},
 	{"POST", "/audio/diarization", FeatureAudioDiarization},

+	// Audio classification (sound-event tagging)
+	{"POST", "/v1/audio/classification", FeatureAudioClassification},
+	{"POST", "/audio/classification", FeatureAudioClassification},
+
 	// Audio speech / TTS
 	{"POST", "/v1/audio/speech", FeatureAudioSpeech},
 	{"POST", "/audio/speech", FeatureAudioSpeech},
@@ -172,6 +176,7 @@ func APIFeatureMetas() []FeatureMeta {
 		{FeatureAudioSpeech, "Audio Speech / TTS", true},
 		{FeatureAudioTranscription, "Audio Transcription", true},
 		{FeatureAudioDiarization, "Audio Diarization", true},
+		{FeatureAudioClassification, "Audio Classification", true},
 		{FeatureVAD, "Voice Activity Detection", true},
 		{FeatureDetection, "Detection", true},
 		{FeatureVideo, "Video Generation", true},
diff --git a/core/http/auth/permissions.go b/core/http/auth/permissions.go
index 47c4d64e1..1795792f9 100644
--- a/core/http/auth/permissions.go
+++ b/core/http/auth/permissions.go
@@ -38,24 +38,25 @@ const (
 	FeatureQuantization = "quantization"

 	// API features (default ON for new users)
-	FeatureChat               = "chat"
-	FeatureImages             = "images"
-	FeatureAudioSpeech        = "audio_speech"
-	FeatureAudioTranscription = "audio_transcription"
-	FeatureAudioDiarization   = "audio_diarization"
-	FeatureVAD                = "vad"
-	FeatureDetection          = "detection"
-	FeatureVideo              = "video"
-	FeatureEmbeddings         = "embeddings"
-	FeatureSound              = "sound"
-	FeatureRealtime           = "realtime"
-	FeatureRerank             = "rerank"
-	FeatureTokenize           = "tokenize"
-	FeatureMCP                = "mcp"
-	FeatureStores             = "stores"
-	FeatureFaceRecognition    = "face_recognition"
-	FeatureVoiceRecognition   = "voice_recognition"
-	FeatureAudioTransform     = "audio_transform"
+	FeatureChat                = "chat"
+	FeatureImages              = "images"
+	FeatureAudioSpeech         = "audio_speech"
+	FeatureAudioTranscription  = "audio_transcription"
+	FeatureAudioDiarization    = "audio_diarization"
+	FeatureAudioClassification = "audio_classification"
+	FeatureVAD                 = "vad"
+	FeatureDetection           = "detection"
+	FeatureVideo               = "video"
+	FeatureEmbeddings          = "embeddings"
+	FeatureSound               = "sound"
+	FeatureRealtime            = "realtime"
+	FeatureRerank              = "rerank"
+	FeatureTokenize            = "tokenize"
+	FeatureMCP                 = "mcp"
+	FeatureStores              = "stores"
+	FeatureFaceRecognition     = "face_recognition"
+	FeatureVoiceRecognition    = "voice_recognition"
+	FeatureAudioTransform      = "audio_transform"
 	// FeaturePIIFilter gates the synchronous PII analyze/redact service
 	// (POST /api/pii/{analyze,redact}). Default ON like the other API
 	// features; the admin-only events log is gated separately in-handler.
@@ -71,7 +72,7 @@ var GeneralFeatures = []string{FeatureFineTuning, FeatureQuantization}
 // APIFeatures lists API endpoint features (default ON).
 var APIFeatures = []string{
 	FeatureChat, FeatureImages, FeatureAudioSpeech, FeatureAudioTranscription,
-	FeatureAudioDiarization,
+	FeatureAudioDiarization, FeatureAudioClassification,
 	FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound, FeatureRealtime,
 	FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores, FeatureFaceRecognition,
 	FeatureVoiceRecognition, FeatureAudioTransform,
diff --git a/core/http/endpoints/localai/api_instructions.go b/core/http/endpoints/localai/api_instructions.go
index 2ca856a62..405921e5e 100644
--- a/core/http/endpoints/localai/api_instructions.go
+++ b/core/http/endpoints/localai/api_instructions.go
@@ -32,9 +32,9 @@ var instructionDefs = []instructionDef{
 	},
 	{
 		Name:        "audio",
-		Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, and sound generation",
+		Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, sound classification, and sound generation",
 		Tags:        []string{"audio"},
-		Intro:       "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format).",
+		Intro:       "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format). Sound classification (/v1/audio/classification) returns scored AudioSet sound-event tags (audio tagging via the ced backend); top_k and threshold control the returned set.",
 	},
 	{
 		Name:        "images",
diff --git a/core/http/endpoints/openai/realtime.go b/core/http/endpoints/openai/realtime.go
index 8de50e580..1af4c6b75 100644
--- a/core/http/endpoints/openai/realtime.go
+++ b/core/http/endpoints/openai/realtime.go
@@ -93,16 +93,31 @@ type Session struct {
 	Voice                   string
 	TurnDetection           *types.TurnDetectionUnion // "server_vad", "semantic_vad" or "none"
 	InputAudioTranscription *types.AudioTranscription
-	Tools                   []types.ToolUnion
-	ToolChoice              *types.ToolChoiceUnion
-	Conversations           map[string]*Conversation
-	InputAudioBuffer        []byte
-	AudioBufferLock         sync.Mutex
-	OpusFrames              [][]byte
-	OpusFramesLock          sync.Mutex
-	Instructions            string
-	DefaultConversationID   string
-	ModelInterface          Model
+
+	// SoundDetectionEnabled is set when pipeline.sound_detection names a
+	// sound-event-classification model. When true, each committed utterance is
+	// also run through ModelInterface.SoundDetection and the scored tags are
+	// emitted as a conversation.item.sound_detection event. SoundDetectionTopK
+	// and SoundDetectionThreshold are the knobs passed to that call (defaults:
+	// top_k=5, threshold=0).
+	SoundDetectionEnabled   bool
+	SoundDetectionTopK      int
+	SoundDetectionThreshold float32
+	// SoundDetectionWindowMs / SoundDetectionHopMs, when both > 0, enable
+	// server-side windowing for a sound-only session: the server classifies the
+	// last WindowMs of streamed audio every HopMs (no client commits needed).
+	SoundDetectionWindowMs int
+	SoundDetectionHopMs    int
+	Tools                  []types.ToolUnion
+	ToolChoice             *types.ToolChoiceUnion
+	Conversations          map[string]*Conversation
+	InputAudioBuffer       []byte
+	AudioBufferLock        sync.Mutex
+	OpusFrames             [][]byte
+	OpusFramesLock         sync.Mutex
+	Instructions           string
+	DefaultConversationID  string
+	ModelInterface         Model
 	// The pipeline model config or the config for an any-to-any model
 	ModelConfig     *config.ModelConfig
 	InputSampleRate int
@@ -250,6 +265,10 @@ type Model interface {
 	// TranscribeStream transcribes audio incrementally, invoking onDelta for each
 	// transcript text fragment and returning the final aggregated result.
 	TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
+	// SoundDetection classifies a committed audio window into scored AudioSet
+	// sound-event tags. topK caps the number of returned tags (0 = backend
+	// default), threshold drops tags below the given score (0 = keep all).
+ SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) PredictConfig() *config.ModelConfig } @@ -399,7 +418,7 @@ func prepareRealtimeConfig(cfg *config.ModelConfig) (errCode, errMsg string, ok return "", "", true } - if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" { + if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" && cfg.Pipeline.SoundDetection == "" { return "invalid_model", "Model is not a pipeline model", false } return "", "", true @@ -469,6 +488,26 @@ func runRealtimeSession(application *application.Application, t Transport, model sttModel := cfg.Pipeline.Transcription + // A sound-detection-only pipeline (sound_detection set, no transcription/LLM) + // activates on sounds, not speech, so it runs WITHOUT the voice VAD: the + // session defaults to turn_detection none and the client drives windowing via + // input_audio_buffer.commit. There is no transcription stage in that case. + soundOnly := cfg.Pipeline.SoundDetection != "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.LLM == "" + + turnDetection := &types.TurnDetectionUnion{ + ServerVad: &types.ServerVad{ + Threshold: 0.5, + PrefixPaddingMs: 300, + SilenceDurationMs: 500, + CreateResponse: true, + }, + } + inputAudioTranscription := &types.AudioTranscription{Model: sttModel} + if soundOnly { + turnDetection = nil // turn_detection none: no VAD + inputAudioTranscription = nil // no transcription stage + } + // Compose the system prompt: prepend the assistant prompt when we have // one (it teaches the model the safety rules and tool recipes), then the // session's default voice instructions. 
Order matches chat.go's @@ -480,30 +519,26 @@ func runRealtimeSession(application *application.Application, t Transport, model sessionID := generateSessionID() session := &Session{ - ID: sessionID, - TranscriptionOnly: false, - Model: model, - Voice: cfg.TTSConfig.Voice, - Instructions: instructions, - ModelConfig: cfg, - Tools: assistantTools, - AssistantTools: assistantTools, - AssistantExecutor: assistantExecutor, - TurnDetection: &types.TurnDetectionUnion{ - ServerVad: &types.ServerVad{ - Threshold: 0.5, - PrefixPaddingMs: 300, - SilenceDurationMs: 500, - CreateResponse: true, - }, - }, - InputAudioTranscription: &types.AudioTranscription{ - Model: sttModel, - }, - Conversations: make(map[string]*Conversation), - InputSampleRate: defaultRemoteSampleRate, - OutputSampleRate: defaultRemoteSampleRate, - MaxHistoryItems: resolveMaxHistoryItems(cfg), + ID: sessionID, + TranscriptionOnly: false, + Model: model, + Voice: cfg.TTSConfig.Voice, + Instructions: instructions, + ModelConfig: cfg, + Tools: assistantTools, + AssistantTools: assistantTools, + AssistantExecutor: assistantExecutor, + TurnDetection: turnDetection, + InputAudioTranscription: inputAudioTranscription, + Conversations: make(map[string]*Conversation), + InputSampleRate: defaultRemoteSampleRate, + OutputSampleRate: defaultRemoteSampleRate, + MaxHistoryItems: resolveMaxHistoryItems(cfg), + SoundDetectionEnabled: cfg.Pipeline.SoundDetection != "", + SoundDetectionTopK: defaultSoundDetectionTopK, + SoundDetectionThreshold: 0, + SoundDetectionWindowMs: cfg.Pipeline.SoundDetectionWindowMs, + SoundDetectionHopMs: cfg.Pipeline.SoundDetectionHopMs, } // Create a default conversation @@ -517,14 +552,24 @@ func runRealtimeSession(application *application.Application, t Transport, model session.Conversations[conversationID] = conversation session.DefaultConversationID = conversationID - m, err := newModel( - &cfg.Pipeline, - application.ModelConfigLoader(), - application.ModelLoader(), - 
application.ApplicationConfig(), - evaluator, - buildRealtimeRoutingContext(application, sessionID), - ) + var m Model + if soundOnly { + m, err = newSoundDetectionOnlyModel( + &cfg.Pipeline, + application.ModelConfigLoader(), + application.ModelLoader(), + application.ApplicationConfig(), + ) + } else { + m, err = newModel( + &cfg.Pipeline, + application.ModelConfigLoader(), + application.ModelLoader(), + application.ApplicationConfig(), + evaluator, + buildRealtimeRoutingContext(application, sessionID), + ) + } if err != nil { xlog.Error("failed to load model", "error", err) sendError(t, "model_load_error", "Failed to load model", "", "") @@ -605,6 +650,20 @@ func runRealtimeSession(application *application.Application, t Transport, model toggleVAD() + // Server-side sound-detection windowing (option B): for a sound-only session + // with window/hop configured, the server classifies the last window of + // streamed audio on a timer, so the client only has to stream (no commits). + // This runs independent of VAD (sound events are not speech). + var soundWindowDone chan struct{} + if soundOnly && session.SoundDetectionWindowMs > 0 && session.SoundDetectionHopMs > 0 { + soundWindowDone = make(chan struct{}) + wg.Go(func() { + handleSoundWindow(session, t, soundWindowDone) + }) + xlog.Debug("Starting server-side sound-detection windowing", + "window_ms", session.SoundDetectionWindowMs, "hop_ms", session.SoundDetectionHopMs) + } + for { msg, err = t.ReadEvent() if err != nil { @@ -880,6 +939,10 @@ func runRealtimeSession(application *application.Application, t Transport, model if vadServerStarted { close(done) } + // Stop the server-side sound-detection windowing goroutine (if running). 
+ if soundWindowDone != nil { + close(soundWindowDone) + } wg.Wait() // Remove the session from the sessions map @@ -971,6 +1034,10 @@ func updateTransSession(session *Session, update *types.SessionUnion, cl *config session.ModelInterface = m session.ModelConfig = cfg + session.SoundDetectionEnabled = cfg.Pipeline.SoundDetection != "" + if session.SoundDetectionTopK <= 0 { + session.SoundDetectionTopK = defaultSoundDetectionTopK + } } if trUpd != nil { @@ -1343,7 +1410,8 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co // TODO: If we have a real any-to-any model then transcription is optional var transcript string - if session.InputAudioTranscription != nil { + switch { + case session.InputAudioTranscription != nil: // emitTranscription streams transcript deltas when // pipeline.streaming.transcription is set, otherwise emits a single // completed event; either way it returns the final transcript text. @@ -1358,13 +1426,27 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co sendError(t, "transcription_failed", err.Error(), "", "event_TODO") return } - } else { + case session.SoundDetectionEnabled: + // Sound-detection-only session: no transcription and no LLM. The + // sound-detection emit below carries the result; there is no any-to-any + // path to fall into. Windowing is client-driven (turn_detection none + + // input_audio_buffer.commit), so this is not voice-gated. + default: // The voice gate runs only on the transcription path above; if an // any-to-any model path is added here, join the gate before responding. sendNotImplemented(t, "any-to-any models") return } + // Sound-event detection is additive to transcription: classify the same + // committed window and emit its scored AudioSet tags as a separate event. + // A failure here is logged but must never abort the turn. 
+ if session.SoundDetectionEnabled { + if sderr := emitSoundDetection(ctx, t, session, generateItemID(), f.Name()); sderr != nil { + xlog.Error("sound detection failed", "error", sderr) + } + } + // Join on the resolution before any side-effecting step. var speaker *types.Speaker if runResolve { @@ -1415,11 +1497,94 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co } } - if !session.TranscriptionOnly { + // Generate an LLM response only when there is a transcript to feed it. A + // sound-detection-only session (no transcription) has no LLM stage, so it + // stops here after emitting the sound-detection event. + if session.InputAudioTranscription != nil && !session.TranscriptionOnly { generateResponse(ctx, session, utt, transcript, speaker, conv, t) } } +// handleSoundWindow runs server-side windowed sound-event detection (option B): +// every HopMs it classifies the last WindowMs of streamed audio and emits a +// sound_detection event, so a sound-only client only has to stream audio (no +// input_audio_buffer.commit). It keeps the input buffer trimmed to one window +// so a long stream stays bounded. Runs until done is closed. This is +// independent of VAD: sound events are not speech. +func handleSoundWindow(session *Session, t Transport, done chan struct{}) { + ticker := time.NewTicker(time.Duration(session.SoundDetectionHopMs) * time.Millisecond) + defer ticker.Stop() + + for { + select { + case <-done: + return + case <-ticker.C: + classifySoundWindow(session, t) + } + } +} + +// classifySoundWindow is one windowing tick: it snapshots the most recent +// WindowMs of buffered audio (trimming the buffer so a long stream stays +// bounded) and, when there is enough, classifies it and emits a sound_detection +// event. Extracted from handleSoundWindow so it can be driven synchronously in +// tests. 
+func classifySoundWindow(session *Session, t Transport) { + const bytesPerSample = 2 // 16-bit mono PCM + sr := session.InputSampleRate + windowBytes := session.SoundDetectionWindowMs * sr / 1000 * bytesPerSample + minBytes := sr / 100 * bytesPerSample // ~10ms before classifying + + session.AudioBufferLock.Lock() + // Keep only the most recent window so a long stream stays bounded. + if windowBytes > 0 && len(session.InputAudioBuffer) > windowBytes { + trimmed := make([]byte, windowBytes) + copy(trimmed, session.InputAudioBuffer[len(session.InputAudioBuffer)-windowBytes:]) + session.InputAudioBuffer = trimmed + } + window := make([]byte, len(session.InputAudioBuffer)) + copy(window, session.InputAudioBuffer) + session.AudioBufferLock.Unlock() + + if len(window) < minBytes { + return // not enough audio buffered yet + } + path, err := writeWindowWAV(window, sr) + if err != nil { + xlog.Error("sound window: failed to write wav", "error", err) + return + } + if sderr := emitSoundDetection(context.Background(), t, session, generateItemID(), path); sderr != nil { + xlog.Error("sound window: detection failed", "error", sderr) + } + if rerr := os.Remove(path); rerr != nil { + xlog.Debug("sound window: temp cleanup failed", "error", rerr) + } +} + +// writeWindowWAV writes mono 16-bit PCM to a temp WAV at the given sample rate +// (the ced classifier reads the declared rate and resamples). Returns the path; +// the caller removes it. 
+func writeWindowWAV(pcm []byte, sampleRate int) (string, error) { + f, err := os.CreateTemp("", "realtime-sound-window-*.wav") + if err != nil { + return "", err + } + defer func() { _ = f.Close() }() + hdr := laudio.NewWAVHeaderWithRate(uint32(len(pcm)), uint32(sampleRate)) + if err := hdr.Write(f); err != nil { + _ = os.Remove(f.Name()) + return "", err + } + if _, err := f.Write(pcm); err != nil { + _ = os.Remove(f.Name()) + return "", err + } + _ = f.Sync() + return f.Name(), nil +} + func runVAD(ctx context.Context, session *Session, adata []int16) ([]schema.VADSegment, error) { soundIntBuffer := &audio.IntBuffer{ Format: &audio.Format{SampleRate: localSampleRate, NumChannels: 1}, diff --git a/core/http/endpoints/openai/realtime_doubles_test.go b/core/http/endpoints/openai/realtime_doubles_test.go index 727ce7dcc..10e608c17 100644 --- a/core/http/endpoints/openai/realtime_doubles_test.go +++ b/core/http/endpoints/openai/realtime_doubles_test.go @@ -75,6 +75,11 @@ type fakeModel struct { transcribeDeltas []string transcribeFinal *schema.TranscriptionResult + // soundDetectionResult/soundDetectionErr drive the SoundDetection double so + // the sound-event path can be exercised deterministically. + soundDetectionResult *schema.SoundClassificationResult + soundDetectionErr error + // Predict streaming: predictTokens are replayed through the token callback // (simulating streamed LLM output); predictResp/predictErr are returned by // the deferred predict function. 
predictChunkDeltas, when set, are delivered @@ -95,6 +100,13 @@ func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, stri return m.transcribeFinal, nil } +func (m *fakeModel) SoundDetection(context.Context, string, int, float32) (*schema.SoundClassificationResult, error) { + if m.soundDetectionErr != nil { + return nil, m.soundDetectionErr + } + return m.soundDetectionResult, nil +} + func (m *fakeModel) Predict(_ context.Context, msgs schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) { m.lastMessages = msgs if m.predictErr != nil { diff --git a/core/http/endpoints/openai/realtime_model.go b/core/http/endpoints/openai/realtime_model.go index 789ce0a0d..6843a521d 100644 --- a/core/http/endpoints/openai/realtime_model.go +++ b/core/http/endpoints/openai/realtime_model.go @@ -31,10 +31,11 @@ var ( // This means that we will fake an Any-to-Any model by overriding some of the gRPC client methods // which are for Any-To-Any models, but instead we will call a pipeline (for e.g STT->LLM->TTS) type wrappedModel struct { - TTSConfig *config.ModelConfig - TranscriptionConfig *config.ModelConfig - LLMConfig *config.ModelConfig - VADConfig *config.ModelConfig + TTSConfig *config.ModelConfig + TranscriptionConfig *config.ModelConfig + LLMConfig *config.ModelConfig + VADConfig *config.ModelConfig + SoundDetectionConfig *config.ModelConfig appConfig *config.ApplicationConfig modelLoader *model.ModelLoader @@ -64,8 +65,9 @@ type anyToAnyModel struct { } type transcriptOnlyModel struct { - TranscriptionConfig *config.ModelConfig - VADConfig *config.ModelConfig + TranscriptionConfig *config.ModelConfig + VADConfig *config.ModelConfig + SoundDetectionConfig *config.ModelConfig appConfig *config.ApplicationConfig modelLoader *model.ModelLoader @@ -80,6 +82,10 @@ func (m *transcriptOnlyModel) Transcribe(ctx 
context.Context, audio, language st return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig) } +func (m *transcriptOnlyModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) { + return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold) +} + func (m *transcriptOnlyModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) { return nil, fmt.Errorf("predict operation not supported in transcript-only mode") } @@ -108,6 +114,10 @@ func (m *wrappedModel) Transcribe(ctx context.Context, audio, language string, t return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig) } +func (m *wrappedModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) { + return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold) +} + func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) { input := schema.OpenAIRequest{ Messages: messages, @@ -399,6 +409,39 @@ func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionC return final, nil } +// modelSoundDetection runs sound-event classification against the session's +// sound-classification model config, 
mirroring how Transcribe dispatches to +// the transcription backend. Returns an error when no sound-detection model is +// configured for the session. +func modelSoundDetection(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, soundConfig *config.ModelConfig, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) { + if soundConfig == nil { + return nil, fmt.Errorf("sound detection is not configured for this session") + } + return backend.ModelSoundDetection(ctx, backend.SoundDetectionRequest{ + Audio: audio, + TopK: int32(topK), + Threshold: threshold, + }, ml, *soundConfig, appConfig) +} + +// loadSoundDetectionConfig resolves the optional sound-classification model +// config named by pipeline.sound_detection. Returns (nil, nil) when no model +// is configured so sound detection stays additive and never blocks session +// setup. +func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader) (*config.ModelConfig, error) { + if pipeline.SoundDetection == "" { + return nil, nil + } + cfg, err := cl.LoadModelConfigFileByName(pipeline.SoundDetection, ml.ModelPath) + if err != nil { + return nil, fmt.Errorf("failed to load sound detection config: %w", err) + } + if valid, _ := cfg.Validate(); !valid { + return nil, fmt.Errorf("failed to validate sound detection config %q", pipeline.SoundDetection) + } + return cfg, nil +} + func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) { cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath) if err != nil { @@ -420,9 +463,15 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig return nil, nil, fmt.Errorf("failed to validate config: %w", err) } + cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml) + if err != nil { + return nil, nil, 
err + } + return &transcriptOnlyModel{ - TranscriptionConfig: cfgSST, - VADConfig: cfgVAD, + TranscriptionConfig: cfgSST, + VADConfig: cfgVAD, + SoundDetectionConfig: cfgSound, confLoader: cl, modelLoader: ml, @@ -430,6 +479,27 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig }, cfgSST, nil } +// newSoundDetectionOnlyModel builds a realtime model that only does sound-event +// classification: no VAD, transcription, LLM or TTS stages are loaded. Used for +// a sound-detection-only realtime session, which activates on sounds (not +// speech) and is driven by client-side windowing (turn_detection none + +// input_audio_buffer.commit) rather than the voice VAD loop. +func newSoundDetectionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, error) { + cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml) + if err != nil { + return nil, err + } + if cfgSound == nil { + return nil, fmt.Errorf("a sound-only realtime session requires pipeline.sound_detection") + } + return &transcriptOnlyModel{ + SoundDetectionConfig: cfgSound, + confLoader: cl, + modelLoader: ml, + appConfig: appConfig, + }, nil +} + // RealtimeRoutingContext is the bundle of routing dependencies the // realtime pipeline needs to consult router.Resolve per turn. 
nil-safe: // passing nil skips routing entirely and preserves the historical "one @@ -544,11 +614,17 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model return nil, fmt.Errorf("failed to validate config: %w", err) } + cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml) + if err != nil { + return nil, err + } + wm := &wrappedModel{ - TTSConfig: cfgTTS, - TranscriptionConfig: cfgSST, - LLMConfig: cfgLLM, - VADConfig: cfgVAD, + TTSConfig: cfgTTS, + TranscriptionConfig: cfgSST, + LLMConfig: cfgLLM, + VADConfig: cfgVAD, + SoundDetectionConfig: cfgSound, confLoader: cl, modelLoader: ml, diff --git a/core/http/endpoints/openai/realtime_sound_detection.go b/core/http/endpoints/openai/realtime_sound_detection.go new file mode 100644 index 000000000..6bc4efb47 --- /dev/null +++ b/core/http/endpoints/openai/realtime_sound_detection.go @@ -0,0 +1,48 @@ +package openai + +import ( + "context" + + "github.com/mudler/LocalAI/core/http/endpoints/openai/types" +) + +// defaultSoundDetectionTopK is the number of scored tags requested per +// committed utterance when the session does not pin its own top_k. +const defaultSoundDetectionTopK = 5 + +// emitSoundDetection classifies a committed utterance into sound-event tags and +// emits a conversation.item.sound_detection event for it. It mirrors +// emitTranscription's unary path: it calls the session's sound-event +// classifier, maps the scored tags onto the server event, and sends it over +// the transport. Sound detection is additive to transcription: its result is +// emitted independently and a failure here is the caller's to log, never a +// reason to abort the turn. 
+func emitSoundDetection(ctx context.Context, t Transport, session *Session, itemID, audioPath string) error { + topK := session.SoundDetectionTopK + if topK <= 0 { + topK = defaultSoundDetectionTopK + } + + result, err := session.ModelInterface.SoundDetection(ctx, audioPath, topK, session.SoundDetectionThreshold) + if err != nil { + return err + } + + detections := make([]types.SoundDetectionTag, 0) + if result != nil { + for _, d := range result.Detections { + detections = append(detections, types.SoundDetectionTag{ + Label: d.Label, + Score: d.Score, + Index: d.Index, + }) + } + } + + return t.SendEvent(types.ConversationItemSoundDetectionEvent{ + ServerEventBase: types.ServerEventBase{EventID: "event_TODO"}, + ItemID: itemID, + ContentIndex: 0, + Detections: detections, + }) +} diff --git a/core/http/endpoints/openai/realtime_sound_detection_test.go b/core/http/endpoints/openai/realtime_sound_detection_test.go new file mode 100644 index 000000000..e440e80c3 --- /dev/null +++ b/core/http/endpoints/openai/realtime_sound_detection_test.go @@ -0,0 +1,170 @@ +package openai + +import ( + "context" + "encoding/binary" + "errors" + "os" + + . "github.com/onsi/ginkgo/v2" + . "github.com/onsi/gomega" + + "github.com/mudler/LocalAI/core/config" + "github.com/mudler/LocalAI/core/http/endpoints/openai/types" + "github.com/mudler/LocalAI/core/schema" +) + +// emitSoundDetection classifies a committed utterance and emits a single +// conversation.item.sound_detection event carrying the scored AudioSet tags. 
+var _ = Describe("emitSoundDetection", func() { + It("emits a sound_detection event with the classifier's scored tags", func() { + session := &Session{ + SoundDetectionEnabled: true, + SoundDetectionTopK: 5, + ModelInterface: &fakeModel{ + soundDetectionResult: &schema.SoundClassificationResult{ + Model: "ced", + Detections: []schema.SoundClassification{ + {Index: 3, Label: "Baby cry, infant cry", Score: 0.91}, + {Index: 7, Label: "Speech", Score: 0.42}, + }, + }, + }, + } + t := &fakeTransport{} + + err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav") + + Expect(err).ToNot(HaveOccurred()) + Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1)) + + ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent) + Expect(ok).To(BeTrue()) + Expect(ev.ItemID).To(Equal("item1")) + Expect(ev.ContentIndex).To(Equal(0)) + Expect(ev.Detections).To(HaveLen(2)) + Expect(ev.Detections[0].Label).To(Equal("Baby cry, infant cry")) + Expect(ev.Detections[0].Score).To(BeNumerically("~", 0.91, 1e-6)) + Expect(ev.Detections[0].Index).To(Equal(3)) + Expect(ev.Detections[1].Label).To(Equal("Speech")) + }) + + It("emits an event with no detections when the classifier returns none", func() { + session := &Session{ + SoundDetectionEnabled: true, + ModelInterface: &fakeModel{ + soundDetectionResult: &schema.SoundClassificationResult{Model: "ced"}, + }, + } + t := &fakeTransport{} + + err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav") + + Expect(err).ToNot(HaveOccurred()) + Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1)) + ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent) + Expect(ok).To(BeTrue()) + Expect(ev.Detections).To(BeEmpty()) + }) + + It("propagates the classifier error and emits no event", func() { + session := &Session{ + SoundDetectionEnabled: true, + ModelInterface: &fakeModel{soundDetectionErr: errors.New("boom")}, + } + t 
:= &fakeTransport{} + + err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav") + + Expect(err).To(HaveOccurred()) + Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0)) + }) +}) + +// A sound-detection-only session (no transcription, no LLM) runs through +// commitUtterance WITHOUT the voice/transcription path: it emits the +// sound_detection event and stops - no transcription event, no LLM response. +var _ = Describe("commitUtterance (sound-detection-only session)", func() { + It("emits sound detection and neither transcribes nor generates a response", func() { + session := &Session{ + SoundDetectionEnabled: true, + SoundDetectionTopK: 5, + InputAudioTranscription: nil, // sound-only: no transcription stage + ModelConfig: &config.ModelConfig{}, + ModelInterface: &fakeModel{ + soundDetectionResult: &schema.SoundClassificationResult{ + Model: "ced", + Detections: []schema.SoundClassification{ + {Index: 23, Label: "Baby cry, infant cry", Score: 0.87}, + }, + }, + }, + } + tr := &fakeTransport{} + utt := make([]byte, 32) // non-empty PCM so commitUtterance proceeds + + commitUtterance(context.Background(), utt, session, &Conversation{}, tr) + + Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1)) + // No transcription happened. + Expect(tr.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(0)) + // No LLM response was generated (sound-only has no LLM stage). + Expect(tr.countEvents(types.ServerEventTypeResponseDone)).To(Equal(0)) + }) +}) + +// Server-side windowing (option B): a sound-only session classifies the last +// WindowMs of streamed audio per tick, with no client commit, and keeps the +// input buffer trimmed to one window. 
+var _ = Describe("classifySoundWindow (server-side windowing)", func() { + newSoundSession := func() (*Session, *fakeTransport) { + return &Session{ + SoundDetectionEnabled: true, + SoundDetectionTopK: 5, + SoundDetectionWindowMs: 200, // 200ms @ 16kHz mono16 = 6400 bytes + SoundDetectionHopMs: 20, + InputSampleRate: 16000, + ModelInterface: &fakeModel{ + soundDetectionResult: &schema.SoundClassificationResult{ + Model: "ced", + Detections: []schema.SoundClassification{{Index: 23, Label: "Baby cry, infant cry", Score: 0.87}}, + }, + }, + }, &fakeTransport{} + } + + It("emits a sound_detection event and trims the buffer to one window", func() { + session, tr := newSoundSession() + session.InputAudioBuffer = make([]byte, 10000) // > 6400-byte window + + classifySoundWindow(session, tr) + + Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1)) + // buffer trimmed to exactly one window (200ms @ 16kHz mono 16-bit) + Expect(len(session.InputAudioBuffer)).To(Equal(6400)) + }) + + It("does nothing when too little audio is buffered", func() { + session, tr := newSoundSession() + session.InputAudioBuffer = make([]byte, 100) // < ~10ms (320 bytes) + + classifySoundWindow(session, tr) + + Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0)) + }) +}) + +var _ = Describe("writeWindowWAV", func() { + It("writes a mono 16-bit WAV header declaring the given sample rate", func() { + pcm := make([]byte, 640) + path, err := writeWindowWAV(pcm, 24000) + Expect(err).ToNot(HaveOccurred()) + defer func() { _ = os.Remove(path) }() + + data, err := os.ReadFile(path) + Expect(err).ToNot(HaveOccurred()) + Expect(len(data)).To(BeNumerically(">=", 44+len(pcm))) + // SampleRate is a little-endian uint32 at byte offset 24 of a WAV header. 
+ Expect(binary.LittleEndian.Uint32(data[24:28])).To(Equal(uint32(24000))) + }) +}) diff --git a/core/http/endpoints/openai/sound_classification.go b/core/http/endpoints/openai/sound_classification.go new file mode 100644 index 000000000..b7e23f1b1 --- /dev/null +++ b/core/http/endpoints/openai/sound_classification.go @@ -0,0 +1,91 @@ +package openai + +import ( + "io" + "net/http" + "os" + "path" + "path/filepath" + + "github.com/labstack/echo/v4" + "github.com/mudler/LocalAI/core/backend" + "github.com/mudler/LocalAI/core/config" + "github.com/mudler/LocalAI/core/http/middleware" + "github.com/mudler/LocalAI/core/schema" + model "github.com/mudler/LocalAI/pkg/model" + + "github.com/mudler/xlog" +) + +// SoundClassificationEndpoint runs an audio-tagging / sound-event +// classification model (e.g. ced) over an uploaded clip and returns the +// scored AudioSet tags in score-descending order. It mirrors the +// transcription path: multipart audio upload -> temp file -> backend call. +// +// @Summary Classify sound events in audio (audio tagging). 
+// @Tags audio +// @accept multipart/form-data +// @Param model formData string true "model" +// @Param file formData file true "audio file" +// @Param top_k formData int false "number of top tags to return (0 = backend default)" +// @Param threshold formData number false "drop tags scoring below this value" +// @Success 200 {object} schema.SoundClassificationResult +// @Router /v1/audio/classification [post] +func SoundClassificationEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc { + return func(c echo.Context) error { + input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.OpenAIRequest) + if !ok || input.Model == "" { + return echo.ErrBadRequest + } + + modelConfig, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig) + if !ok || modelConfig == nil { + return echo.ErrBadRequest + } + + req := backend.SoundDetectionRequest{ + TopK: int32(parseFormInt(c, "top_k", 0)), + Threshold: float32(parseFormFloat(c, "threshold", 0)), + } + + file, err := c.FormFile("file") + if err != nil { + return err + } + f, err := file.Open() + if err != nil { + return err + } + defer func() { _ = f.Close() }() + + dir, err := os.MkdirTemp("", "sound-classification") + if err != nil { + return err + } + defer func() { _ = os.RemoveAll(dir) }() + + dst := filepath.Join(dir, path.Base(file.Filename)) + dstFile, err := os.Create(dst) // #nosec G304 -- dst is a server-created temp dir joined with path.Base of the upload name (no traversal) + if err != nil { + return err + } + if _, err := io.Copy(dstFile, f); err != nil { + xlog.Debug("Audio file copying error", "filename", file.Filename, "dst", dst, "error", err) + _ = dstFile.Close() + return err + } + _ = dstFile.Close() + req.Audio = dst + + result, err := backend.ModelSoundDetection(c.Request().Context(), req, ml, *modelConfig, appConfig) + if err != nil { + xlog.Error("Sound classification failed", + "model", 
modelConfig.Name, + "audio", dst, + "error", err) + return err + } + + return c.JSON(http.StatusOK, result) + } +} diff --git a/core/http/endpoints/openai/types/server_events.go b/core/http/endpoints/openai/types/server_events.go index 8183a8b78..6b0a233ee 100644 --- a/core/http/endpoints/openai/types/server_events.go +++ b/core/http/endpoints/openai/types/server_events.go @@ -18,6 +18,7 @@ const ( ServerEventTypeConversationItemInputAudioTranscriptionDelta ServerEventType = "conversation.item.input_audio_transcription.delta" ServerEventTypeConversationItemInputAudioTranscriptionSegment ServerEventType = "conversation.item.input_audio_transcription.segment" ServerEventTypeConversationItemInputAudioTranscriptionFailed ServerEventType = "conversation.item.input_audio_transcription.failed" + ServerEventTypeConversationItemSoundDetection ServerEventType = "conversation.item.sound_detection" ServerEventTypeConversationItemTruncated ServerEventType = "conversation.item.truncated" ServerEventTypeConversationItemDeleted ServerEventType = "conversation.item.deleted" // ServerEventTypeConversationItemSpeaker is a LocalAI extension: it reports @@ -473,6 +474,55 @@ func (m ConversationItemInputAudioTranscriptionCompletedEvent) MarshalJSON() ([] return json.Marshal(shadow) } +// SoundDetectionTag is one scored sound-event tag from the sound-event +// classifier. Label is the human-readable AudioSet class name, Score is the +// per-class probability (multi-label, independent), and Index is the class +// index in the model ontology. +type SoundDetectionTag struct { + // The human-readable AudioSet class name (e.g. "Baby cry, infant cry"). + Label string `json:"label"` + + // The per-class probability for this tag. + Score float32 `json:"score"` + + // The class index in the model ontology. + Index int `json:"index"` +} + +// Returned when a committed input audio window has been classified by a +// sound-event-detection model. 
This is a LocalAI extension to the OpenAI +// Realtime API: when a pipeline configures sound_detection, each VAD-committed +// utterance is run through the classifier and the scored AudioSet tags are +// emitted as this event, independent of (and alongside) transcription. +type ConversationItemSoundDetectionEvent struct { + ServerEventBase + // The ID of the item. + ItemID string `json:"item_id"` + + // The index of the content part in the item's content array. + ContentIndex int `json:"content_index"` + + // The scored sound-event tags, in score-descending order. + Detections []SoundDetectionTag `json:"detections"` +} + +func (m ConversationItemSoundDetectionEvent) ServerEventType() ServerEventType { + return ServerEventTypeConversationItemSoundDetection +} + +func (m ConversationItemSoundDetectionEvent) MarshalJSON() ([]byte, error) { + type typeAlias ConversationItemSoundDetectionEvent + type typeWrapper struct { + typeAlias + Type ServerEventType `json:"type"` + } + shadow := typeWrapper{ + typeAlias: typeAlias(m), + Type: m.ServerEventType(), + } + return json.Marshal(shadow) +} + // Returned when the text value of an input audio transcription content part is updated with incremental transcription results. 
// // See https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription/delta diff --git a/core/http/react-ui/public/locales/en/models.json b/core/http/react-ui/public/locales/en/models.json index 9af2d77a9..2bf7b018d 100644 --- a/core/http/react-ui/public/locales/en/models.json +++ b/core/http/react-ui/public/locales/en/models.json @@ -23,6 +23,7 @@ "tts": "TTS", "stt": "STT", "diarization": "Diarization", + "soundClassification": "Sound Tagging", "soundGen": "Sound", "audioTransform": "Audio FX", "realtimeAudio": "Realtime Audio", diff --git a/core/http/react-ui/src/pages/Models.jsx b/core/http/react-ui/src/pages/Models.jsx index 3c40afc93..5f3a3908d 100644 --- a/core/http/react-ui/src/pages/Models.jsx +++ b/core/http/react-ui/src/pages/Models.jsx @@ -31,6 +31,7 @@ const FILTERS = [ { key: 'tts', labelKey: 'filters.tts', icon: 'fa-microphone' }, { key: 'transcript', labelKey: 'filters.stt', icon: 'fa-headphones' }, { key: 'diarization', labelKey: 'filters.diarization', icon: 'fa-users' }, + { key: 'sound_classification', labelKey: 'filters.soundClassification', icon: 'fa-ear-listen' }, { key: 'sound_generation', labelKey: 'filters.soundGen', icon: 'fa-music' }, { key: 'audio_transform', labelKey: 'filters.audioTransform', icon: 'fa-sliders' }, { key: 'realtime_audio', labelKey: 'filters.realtimeAudio', icon: 'fa-tower-broadcast' }, diff --git a/core/http/react-ui/src/utils/capabilities.js b/core/http/react-ui/src/utils/capabilities.js index 95dd4bb7a..5d30a472d 100644 --- a/core/http/react-ui/src/utils/capabilities.js +++ b/core/http/react-ui/src/utils/capabilities.js @@ -15,6 +15,7 @@ export const CAP_SOUND_GENERATION = 'FLAG_SOUND_GENERATION' export const CAP_TOKENIZE = 'FLAG_TOKENIZE' export const CAP_VAD = 'FLAG_VAD' export const CAP_DIARIZATION = 'FLAG_DIARIZATION' +export const CAP_SOUND_CLASSIFICATION = 'FLAG_SOUND_CLASSIFICATION' export const CAP_VIDEO = 'FLAG_VIDEO' export const CAP_DETECTION = 
'FLAG_DETECTION' export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION' diff --git a/core/http/routes/localai.go b/core/http/routes/localai.go index 1df1d5d8c..212f379f0 100644 --- a/core/http/routes/localai.go +++ b/core/http/routes/localai.go @@ -284,13 +284,14 @@ func RegisterLocalAIRoutes(router *echo.Echo, // Categorized endpoint groups for structured discovery "endpoint_groups": map[string]any{ "openai_compatible": map[string]string{ - "models": "/v1/models", - "chat_completions": "/v1/chat/completions", - "completions": "/v1/completions", - "embeddings": "/v1/embeddings", - "transcription": "/v1/audio/transcriptions", - "diarization": "/v1/audio/diarization", - "image_generation": "/v1/images/generations", + "models": "/v1/models", + "chat_completions": "/v1/chat/completions", + "completions": "/v1/completions", + "embeddings": "/v1/embeddings", + "transcription": "/v1/audio/transcriptions", + "diarization": "/v1/audio/diarization", + "sound_classification": "/v1/audio/classification", + "image_generation": "/v1/images/generations", }, "config_management": map[string]string{ "config_metadata": "/api/models/config-metadata", @@ -342,7 +343,7 @@ func RegisterLocalAIRoutes(router *echo.Echo, "delete": "/stores/delete", }, "docs": map[string]string{ - "swagger": "/swagger/index.html", + "swagger": "/swagger/index.html", "instructions": "/api/instructions", }, }, diff --git a/core/http/routes/openai.go b/core/http/routes/openai.go index 5252edfdd..32603f567 100644 --- a/core/http/routes/openai.go +++ b/core/http/routes/openai.go @@ -200,6 +200,23 @@ func RegisterOpenAIRoutes(app *echo.Echo, app.POST("/v1/audio/diarization", diarizationHandler, diarizationMiddleware...) app.POST("/audio/diarization", diarizationHandler, diarizationMiddleware...) 
+ soundClassificationHandler := openai.SoundClassificationEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig()) + soundClassificationMiddleware := []echo.MiddlewareFunc{ + traceMiddleware, + re.BuildFilteredFirstAvailableDefaultModel(config.BuildUsecaseFilterFn(config.FLAG_SOUND_CLASSIFICATION)), + re.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.OpenAIRequest) }), + func(next echo.HandlerFunc) echo.HandlerFunc { + return func(c echo.Context) error { + if err := re.SetOpenAIRequest(c); err != nil { + return err + } + return next(c) + } + }, + } + app.POST("/v1/audio/classification", soundClassificationHandler, soundClassificationMiddleware...) + app.POST("/audio/classification", soundClassificationHandler, soundClassificationMiddleware...) + audioSpeechHandler := localai.TTSEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig()) audioSpeechMiddleware := []echo.MiddlewareFunc{ nodeHeaderMiddleware, diff --git a/core/http/routes/ui_api.go b/core/http/routes/ui_api.go index f398d71cd..e26894273 100644 --- a/core/http/routes/ui_api.go +++ b/core/http/routes/ui_api.go @@ -42,21 +42,22 @@ const ( // usecaseFilters maps UI filter keys to ModelConfigUsecase flags for // capability-based gallery filtering. 
var usecaseFilters = map[string]config.ModelConfigUsecase{ - config.UsecaseChat: config.FLAG_CHAT, - config.UsecaseImage: config.FLAG_IMAGE, - config.UsecaseVideo: config.FLAG_VIDEO, - config.UsecaseVision: config.FLAG_VISION, - config.UsecaseTTS: config.FLAG_TTS, - config.UsecaseTranscript: config.FLAG_TRANSCRIPT, - config.UsecaseSoundGeneration: config.FLAG_SOUND_GENERATION, - config.UsecaseEmbeddings: config.FLAG_EMBEDDINGS, - config.UsecaseRerank: config.FLAG_RERANK, - config.UsecaseDetection: config.FLAG_DETECTION, - config.UsecaseVAD: config.FLAG_VAD, - config.UsecaseAudioTransform: config.FLAG_AUDIO_TRANSFORM, - config.UsecaseDiarization: config.FLAG_DIARIZATION, - config.UsecaseRealtimeAudio: config.FLAG_REALTIME_AUDIO, - config.UsecaseTokenClassify: config.FLAG_TOKEN_CLASSIFY, + config.UsecaseChat: config.FLAG_CHAT, + config.UsecaseImage: config.FLAG_IMAGE, + config.UsecaseVideo: config.FLAG_VIDEO, + config.UsecaseVision: config.FLAG_VISION, + config.UsecaseTTS: config.FLAG_TTS, + config.UsecaseTranscript: config.FLAG_TRANSCRIPT, + config.UsecaseSoundGeneration: config.FLAG_SOUND_GENERATION, + config.UsecaseEmbeddings: config.FLAG_EMBEDDINGS, + config.UsecaseRerank: config.FLAG_RERANK, + config.UsecaseDetection: config.FLAG_DETECTION, + config.UsecaseVAD: config.FLAG_VAD, + config.UsecaseAudioTransform: config.FLAG_AUDIO_TRANSFORM, + config.UsecaseDiarization: config.FLAG_DIARIZATION, + config.UsecaseSoundClassification: config.FLAG_SOUND_CLASSIFICATION, + config.UsecaseRealtimeAudio: config.FLAG_REALTIME_AUDIO, + config.UsecaseTokenClassify: config.FLAG_TOKEN_CLASSIFY, } // extractHFRepo tries to find a HuggingFace repo ID from model overrides or URLs. diff --git a/core/schema/sound_classification.go b/core/schema/sound_classification.go new file mode 100644 index 000000000..decd7c7e3 --- /dev/null +++ b/core/schema/sound_classification.go @@ -0,0 +1,19 @@ +package schema + +// SoundClassification is one scored sound-event tag. 
Score is the +// per-class probability (multi-label, independent), Index is the class +// index in the model ontology, and Label is the human-readable AudioSet +// class name (e.g. "Baby cry, infant cry"). +type SoundClassification struct { + Index int `json:"index"` + Label string `json:"label"` + Score float32 `json:"score"` +} + +// SoundClassificationResult is the JSON response of the +// /v1/audio/classification endpoint: the model name and the scored tags +// in score-descending order. +type SoundClassificationResult struct { + Model string `json:"model"` + Detections []SoundClassification `json:"detections"` +} diff --git a/core/services/nodes/health_mock_test.go b/core/services/nodes/health_mock_test.go index f14dd133d..86ac5cdcb 100644 --- a/core/services/nodes/health_mock_test.go +++ b/core/services/nodes/health_mock_test.go @@ -169,6 +169,9 @@ func (c *fakeBackendClient) SoundGeneration(_ context.Context, _ *pb.SoundGenera func (c *fakeBackendClient) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) { return nil, nil } +func (c *fakeBackendClient) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) (*pb.SoundDetectionResponse, error) { + return nil, nil +} func (c *fakeBackendClient) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) { return nil, nil } diff --git a/core/services/nodes/inflight_test.go b/core/services/nodes/inflight_test.go index 85de0ac8e..2eb90f9c6 100644 --- a/core/services/nodes/inflight_test.go +++ b/core/services/nodes/inflight_test.go @@ -99,6 +99,9 @@ func (f *fakeGRPCBackend) SoundGeneration(_ context.Context, _ *pb.SoundGenerati func (f *fakeGRPCBackend) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) { return &pb.DetectResponse{}, nil } +func (f *fakeGRPCBackend) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) 
(*pb.SoundDetectionResponse, error) { + return &pb.SoundDetectionResponse{}, nil +} func (f *fakeGRPCBackend) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) { return &pb.DepthResponse{}, nil diff --git a/docs/content/features/audio-classification.md b/docs/content/features/audio-classification.md new file mode 100644 index 000000000..f70674dc9 --- /dev/null +++ b/docs/content/features/audio-classification.md @@ -0,0 +1,55 @@ ++++ +disableToc = false +title = "Sound Classification" +weight = 18 +url = "/features/audio-classification/" ++++ + +Sound-event classification (audio tagging) answers the question **"what am I hearing?"** - given an audio clip, it returns a list of scored [AudioSet](https://research.google.com/audioset/) labels (e.g. *Baby cry, infant cry*, *Glass breaking*, *Dog bark*, *Alarm*). + +LocalAI exposes this through the `/v1/audio/classification` endpoint, modelled after `/v1/audio/transcriptions`. The reference backend is **[ced.cpp](https://github.com/mudler/ced.cpp)** (CED, a 527-class AudioSet tagger), a small ViT over a log-mel spectrogram ported to ggml with full PyTorch parity. Apache-2.0 weights are redistributable as GGUF. + +Because classification is exposed as a regular OpenAI-style endpoint, any HTTP client works - there is no Python dependency on the consumer side. + +## Endpoint + +``` +POST /v1/audio/classification +Content-Type: multipart/form-data +``` + +| Field | Type | Description | +|-------|------|-------------| +| `file` | file (required) | audio file in any format `ffmpeg` accepts | +| `model` | string (required) | name of the sound-classification-capable model (e.g. 
`ced-base`) | +| `top_k` | int | number of top tags to return (0 = backend default) | +| `threshold` | float | drop tags scoring below this value | + +### Response + +```json +{ + "model": "ced-base", + "detections": [ + {"index": 23, "label": "Baby cry, infant cry", "score": 0.87}, + {"index": 22, "label": "Crying, sobbing", "score": 0.41} + ] +} +``` + +Detections are returned in score-descending order. Scores are per-class probabilities (multi-label, independent), so they do not sum to 1. + +## Example + +```bash +curl http://localhost:8080/v1/audio/classification \ + -H "Content-Type: multipart/form-data" \ + -F file="@/path/to/clip.wav" \ + -F model="ced-base" \ + -F top_k=10 +``` + +## See also + +- [Audio to Text]({{% relref "audio-to-text" %}}) - speech transcription +- [Speaker Diarization]({{% relref "audio-diarization" %}}) - who spoke when diff --git a/docs/content/features/audio-diarization.md b/docs/content/features/audio-diarization.md index 36d9437dc..b2cfa32b0 100644 --- a/docs/content/features/audio-diarization.md +++ b/docs/content/features/audio-diarization.md @@ -152,3 +152,7 @@ curl http://localhost:8080/v1/audio/diarization \ - **Speaker identity across files**: speaker IDs (`SPEAKER_00`, `SPEAKER_01`, …) are local to each request. To track the same person across multiple recordings, combine `/v1/audio/diarization` with `/v1/voice/embed` (speaker embedding) and maintain your own embedding store. - **Hints vs. forces**: `num_speakers` overrides clustering when set; `min_speakers` / `max_speakers` are advisory and only honored by backends that expose a range hint. vibevoice.cpp ignores them — its model picks the count itself. - **Sample rate**: input is automatically converted to 16 kHz mono via ffmpeg before the backend sees it; sherpa-onnx pyannote-3.0 requires 16 kHz. + +## See also + +- [Sound Classification]({{% relref "audio-classification" %}}) - tag non-speech sound events (alarms, glass breaking, baby cry) in a clip. 
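Reviewer note: the request shape documented in `audio-classification.md` above can be sketched from Go as follows. This is a minimal illustration under stated assumptions, not code from this patch: the helper names `classify` and `newStubServer` are hypothetical, the struct types only mirror the JSON fields of `schema.SoundClassificationResult`, and the stub server stands in for a running LocalAI instance, returning the example response from the docs page rather than real classifier output.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// soundClassification mirrors the JSON shape of schema.SoundClassification.
type soundClassification struct {
	Index int     `json:"index"`
	Label string  `json:"label"`
	Score float32 `json:"score"`
}

// soundClassificationResult mirrors schema.SoundClassificationResult.
type soundClassificationResult struct {
	Model      string                `json:"model"`
	Detections []soundClassification `json:"detections"`
}

// classify (hypothetical helper) uploads audio as multipart/form-data with
// the documented fields: file, model and top_k.
func classify(baseURL, model, filename string, audio []byte, topK int) (*soundClassificationResult, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	if err := w.WriteField("model", model); err != nil {
		return nil, err
	}
	if err := w.WriteField("top_k", strconv.Itoa(topK)); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, err
	}

	resp, err := http.Post(baseURL+"/v1/audio/classification", w.FormDataContentType(), &body)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		msg, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("classification failed: %s: %s", resp.Status, msg)
	}
	var out soundClassificationResult
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}

// newStubServer (hypothetical) fakes the endpoint so the sketch runs offline.
// It echoes the model field and returns the canned detections from the docs.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/v1/audio/classification" {
			http.NotFound(rw, r)
			return
		}
		if _, _, err := r.FormFile("file"); err != nil {
			http.Error(rw, "missing file", http.StatusBadRequest)
			return
		}
		rw.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(rw).Encode(soundClassificationResult{
			Model: r.FormValue("model"),
			Detections: []soundClassification{
				{Index: 23, Label: "Baby cry, infant cry", Score: 0.87},
				{Index: 22, Label: "Crying, sobbing", Score: 0.41},
			},
		})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	res, err := classify(srv.URL, "ced-base", "clip.wav", []byte("not real audio"), 10)
	if err != nil {
		panic(err)
	}
	for _, d := range res.Detections {
		fmt.Printf("%3d  %-24s %.2f\n", d.Index, d.Label, d.Score)
	}
}
```

Against a real server, replacing `srv.URL` with the LocalAI base URL (for example `http://localhost:8080`, as in the curl example) should be the only change needed. Letting `multipart.Writer` produce the `Content-Type` header keeps the boundary parameter consistent with the body.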
diff --git a/docs/content/features/backends.md b/docs/content/features/backends.md index 1713fabfb..4b7445a98 100644 --- a/docs/content/features/backends.md +++ b/docs/content/features/backends.md @@ -128,6 +128,7 @@ LocalAI supports various types of backends: - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo) - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS) - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step) +- **Sound Classification Backends**: For sound-event classification / audio tagging - identifying everyday sounds like baby cry, glass breaking, alarms (e.g., ced.cpp) - **Image & Video Generation Backends**: For diffusion models (e.g., stable-diffusion.cpp, diffusers) - **Vision & Detection Backends**: For object detection, segmentation, depth, and face/voice recognition (e.g., rf-detr.cpp, locate-anything.cpp, sam3.cpp, insightface) - **Audio Processing Backends**: For voice activity detection and audio enhancement (e.g., Silero VAD, LocalVQE) diff --git a/docs/content/whats-new.md b/docs/content/whats-new.md index 170ccae98..6ff7979cc 100644 --- a/docs/content/whats-new.md +++ b/docs/content/whats-new.md @@ -15,6 +15,7 @@ You can see the release notes [here](https://github.com/mudler/LocalAI/releases) - **April 2026**: [Audio Transform](/features/audio-transform/) — generic audio-in / audio-out endpoint with optional reference signal. First implementation: [LocalVQE](https://github.com/localai-org/LocalVQE) C++ backend (joint AEC + noise suppression + dereverberation, DeepVQE-style). Both batch (`POST /audio/transformations`) and bidirectional WebSocket streaming (`/audio/transformations/stream`). Studio "Transform" tab with synchronized waveform players for input / reference / output. 
- **April 2026**: [Face recognition backend](/features/face-recognition/) — `insightface`-powered 1:1 verification, 1:N identification, face embedding, face detection, and demographic analysis. Ships both a non-commercial `buffalo_l` model and an Apache 2.0 OpenCV Zoo alternative. - **May 2026**: [Speaker diarization](/features/audio-diarization/) — new `/v1/audio/diarization` endpoint returning "who spoke when" segments. Backed by `sherpa-onnx` (pyannote-3.0 + speaker embeddings + clustering) for pure diarization, and `vibevoice-cpp` for diarization bundled with long-form ASR. Supports `json` / `verbose_json` / `rttm` response formats. +- **June 2026**: [Sound classification](/features/audio-classification/) — new `/v1/audio/classification` endpoint for audio tagging / sound-event classification, returning scored [AudioSet](https://research.google.com/audioset/) labels (baby cry, glass breaking, alarms, ...). Backed by [ced.cpp](https://github.com/mudler/ced.cpp), a 527-class AudioSet tagger ported to ggml. - **June 2026**: [PII analyze / redact API](/features/middleware/#analyze--redact-api) — the PII detection pipeline (NER + restricted-regex pattern tiers) is now a standalone service: `POST /api/pii/analyze` returns detected entity spans and `POST /api/pii/redact` returns the sanitised text (or `400 pii_blocked`), without routing a chat request through the middleware. Events gain an `origin` (`middleware` / `proxy` / `pii_analyze` / `pii_redact`) so `/api/pii/events` can be filtered by source. - **June 2026**: Concurrent scoring and PII NER on llama.cpp — the `Score` (router classifier) and `TokenClassify` (PII NER) primitives now ride llama.cpp's server task queue instead of locking the context, so they run concurrently with chat/completion/embedding traffic and with each other. 
The `known_usecases` restriction that forced dedicated scorer/NER model configs on llama-cpp is lifted, repeated scoring calls reuse the prompt KV cache across candidates, and scoring inputs are no longer capped by the physical batch size. diff --git a/gallery/ced.yaml b/gallery/ced.yaml new file mode 100644 index 000000000..171b0d0d8 --- /dev/null +++ b/gallery/ced.yaml @@ -0,0 +1,7 @@ +--- +name: "ced-sound-classification" + +config_file: | + backend: ced + known_usecases: + - sound_classification diff --git a/gallery/index.yaml b/gallery/index.yaml index fcf180e13..cde505d72 100644 --- a/gallery/index.yaml +++ b/gallery/index.yaml @@ -3077,6 +3077,190 @@ - transcript parameters: model: tiny +- name: ced-base-f16 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-base + description: | + CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition. 
+ license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - f16 + overrides: + parameters: + model: ced-base-f16.gguf + files: + - filename: ced-base-f16.gguf + sha256: 5c058d9f7b737167195fa54eae4a2ae17658ac2c0a8073f7f116ba006b2ab32c + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-f16.gguf +- name: ced-base-q8 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-base + description: | + CED (Consistent Ensemble Distillation, Xiaomi) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). This is the q8_0 GGUF for the ced backend: smallest footprint (~88 MB, ~6.5x less memory than the PyTorch reference) and near-lossless (identical top-5 tags). Use POST /v1/audio/classification, or the realtime websocket API for live recognition. + license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - q8 + overrides: + parameters: + model: ced-base-q8_0.gguf + files: + - filename: ced-base-q8_0.gguf + sha256: bd34a7710169f0047fea17267965d211f967828ab25ba6fb9d3768481393f6e2 + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-q8_0.gguf +- name: ced-tiny-f16 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-tiny + description: | + CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition. 
+ license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - f16 + overrides: + parameters: + model: ced-tiny-f16.gguf + files: + - filename: ced-tiny-f16.gguf + sha256: af8b81c67bae50bfca4ea83dbba77b3bae4fa6180d36c17d6877f7700aeeb77b + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-f16.gguf +- name: ced-tiny-q8 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-tiny + description: | + CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition. + license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - q8 + overrides: + parameters: + model: ced-tiny-q8_0.gguf + files: + - filename: ced-tiny-q8_0.gguf + sha256: 48bee4e2fc3cc85d7806e03471db24e77fda6c2a2e81ffe9ef67caebaf2bd674 + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-q8_0.gguf +- name: ced-mini-f16 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-mini + description: | + CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition. 
+ license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - f16 + overrides: + parameters: + model: ced-mini-f16.gguf + files: + - filename: ced-mini-f16.gguf + sha256: 3c6a8936c77312f07a9ecb7b4bbbcb1f93ad137920ca6656bae9306571fb0c03 + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-f16.gguf +- name: ced-mini-q8 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-mini + description: | + CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition. + license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - q8 + overrides: + parameters: + model: ced-mini-q8_0.gguf + files: + - filename: ced-mini-q8_0.gguf + sha256: 7062cef9ca31459f339ce24a5914f3b65bde76ffd9ca4fc924a040327ff292bd + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-q8_0.gguf +- name: ced-small-f16 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-small + description: | + CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition. 
+ license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - f16 + overrides: + parameters: + model: ced-small-f16.gguf + files: + - filename: ced-small-f16.gguf + sha256: c391ed8697a1b08d7c1a463e4940a5c3a2f670e0544ab0d8ee23b544583602a8 + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-f16.gguf +- name: ced-small-q8 + url: github:mudler/LocalAI/gallery/ced.yaml@master + urls: + - https://huggingface.co/mudler/ced-gguf + - https://huggingface.co/mispeech/ced-small + description: | + CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition. + license: apache-2.0 + tags: + - audio-classification + - sound-event-detection + - audio-tagging + - audioset + - ced + - gguf + - q8 + overrides: + parameters: + model: ced-small-q8_0.gguf + files: + - filename: ced-small-q8_0.gguf + sha256: 888275fe43491cf832fb7b8125eccba34d1120745166f40cc12e93b79dea8efe + uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-q8_0.gguf - name: omnilingual-0.3b-ctc-q8-sherpa url: github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master urls: diff --git a/pkg/grpc/backend.go b/pkg/grpc/backend.go index 44912c04b..f4cd511ac 100644 --- a/pkg/grpc/backend.go +++ b/pkg/grpc/backend.go @@ -82,6 +82,8 @@ type Backend interface { Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grpc.CallOption) (*pb.DiarizeResponse, error) + SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) + AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) AudioDecode(ctx context.Context, in *pb.AudioDecodeRequest, opts 
...grpc.CallOption) (*pb.AudioDecodeResult, error) diff --git a/pkg/grpc/base/base.go b/pkg/grpc/base/base.go index c67c832a7..55b0d96b6 100644 --- a/pkg/grpc/base/base.go +++ b/pkg/grpc/base/base.go @@ -110,6 +110,10 @@ func (llm *Base) Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error) { return pb.DiarizeResponse{}, fmt.Errorf("unimplemented") } +func (llm *Base) SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) { + return nil, fmt.Errorf("unimplemented") +} + func (llm *Base) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) { return pb.TokenizationResponse{}, fmt.Errorf("unimplemented") } diff --git a/pkg/grpc/client.go b/pkg/grpc/client.go index 8dd2b2c2e..b80c74bcd 100644 --- a/pkg/grpc/client.go +++ b/pkg/grpc/client.go @@ -616,6 +616,24 @@ func (c *Client) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grp return client.Diarize(ctx, in, opts...) } +func (c *Client) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) { + if !c.parallel { + c.opMutex.Lock() + defer c.opMutex.Unlock() + } + c.setBusy(true) + defer c.setBusy(false) + c.wdMark() + defer c.wdUnMark() + conn, err := c.dial() + if err != nil { + return nil, err + } + defer func() { _ = conn.Close() }() + client := pb.NewBackendClient(conn) + return client.SoundDetection(ctx, in, opts...) 
+} + func (c *Client) Detect(ctx context.Context, in *pb.DetectOptions, opts ...grpc.CallOption) (*pb.DetectResponse, error) { if !c.parallel { c.opMutex.Lock() diff --git a/pkg/grpc/embed.go b/pkg/grpc/embed.go index c7c6406ca..2251dc707 100644 --- a/pkg/grpc/embed.go +++ b/pkg/grpc/embed.go @@ -153,6 +153,10 @@ func (e *embedBackend) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts return e.s.Diarize(ctx, in) } +func (e *embedBackend) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) { + return e.s.SoundDetection(ctx, in) +} + func (e *embedBackend) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) { return e.s.AudioEncode(ctx, in) } diff --git a/pkg/grpc/interface.go b/pkg/grpc/interface.go index 888e36a0c..282735612 100644 --- a/pkg/grpc/interface.go +++ b/pkg/grpc/interface.go @@ -40,6 +40,7 @@ type AIModel interface { VAD(*pb.VADRequest) (pb.VADResponse, error) Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error) + SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) AudioEncode(*pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error) AudioDecode(*pb.AudioDecodeRequest) (*pb.AudioDecodeResult, error) diff --git a/pkg/grpc/server.go b/pkg/grpc/server.go index 35afb502c..53522f114 100644 --- a/pkg/grpc/server.go +++ b/pkg/grpc/server.go @@ -435,6 +435,14 @@ func (s *server) Diarize(ctx context.Context, in *pb.DiarizeRequest) (*pb.Diariz return &res, nil } +func (s *server) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) { + if s.llm.Locking() { + s.llm.Lock() + defer s.llm.Unlock() + } + return s.llm.SoundDetection(ctx, in) +} + func (s *server) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error) { if s.llm.Locking() { s.llm.Lock() diff --git a/scripts/changed-backends.js 
b/scripts/changed-backends.js index a2fe48e06..5690e00f5 100644 --- a/scripts/changed-backends.js +++ b/scripts/changed-backends.js @@ -26,6 +26,13 @@ function inferBackendPath(item) { if (item.backend === "parakeet-cpp") { return `backend/go/parakeet-cpp/`; } + // ced is a Go backend (Dockerfile.golang) wrapping the ced.cpp ggml port via + // purego, living in backend/go/ced/. Same explicit-branch rationale as + // parakeet-cpp above: the generic golang fallthrough would also resolve it, + // but this documents the mapping and guards a future dockerfile-suffix change. + if (item.backend === "ced") { + return `backend/go/ced/`; + } if (item.dockerfile.endsWith("golang")) { return `backend/go/${item.backend}/`; } diff --git a/swagger/docs.go b/swagger/docs.go index 20a1f5a3f..e01761643 100644 --- a/swagger/docs.go +++ b/swagger/docs.go @@ -1939,6 +1939,53 @@ const docTemplate = `{ } } }, + "/v1/audio/classification": { + "post": { + "consumes": [ + "multipart/form-data" + ], + "tags": [ + "audio" + ], + "summary": "Classify sound events in audio (audio tagging).", + "parameters": [ + { + "type": "string", + "description": "model", + "name": "model", + "in": "formData", + "required": true + }, + { + "type": "file", + "description": "audio file", + "name": "file", + "in": "formData", + "required": true + }, + { + "type": "integer", + "description": "number of top tags to return (0 = backend default)", + "name": "top_k", + "in": "formData" + }, + { + "type": "number", + "description": "drop tags scoring below this value", + "name": "threshold", + "in": "formData" + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "$ref": "#/definitions/schema.SoundClassificationResult" + } + } + } + } + }, "/v1/audio/diarization": { "post": { "consumes": [ @@ -6084,6 +6131,34 @@ const docTemplate = `{ } } }, + "schema.SoundClassification": { + "type": "object", + "properties": { + "index": { + "type": "integer" + }, + "label": { + "type": "string" + }, + 
"score": { + "type": "number" + } + } + }, + "schema.SoundClassificationResult": { + "type": "object", + "properties": { + "detections": { + "type": "array", + "items": { + "$ref": "#/definitions/schema.SoundClassification" + } + }, + "model": { + "type": "string" + } + } + }, "schema.StreamOptions": { "type": "object", "properties": { diff --git a/swagger/swagger.json b/swagger/swagger.json index 09e03581b..5fc4ac638 100644 --- a/swagger/swagger.json +++ b/swagger/swagger.json @@ -1936,6 +1936,53 @@ } } }, + "/v1/audio/classification": { + "post": { + "consumes": [ + "multipart/form-data" + ], + "tags": [ + "audio" + ], + "summary": "Classify sound events in audio (audio tagging).", + "parameters": [ + { + "type": "string", + "description": "model", + "name": "model", + "in": "formData", + "required": true + }, + { + "type": "file", + "description": "audio file", + "name": "file", + "in": "formData", + "required": true + }, + { + "type": "integer", + "description": "number of top tags to return (0 = backend default)", + "name": "top_k", + "in": "formData" + }, + { + "type": "number", + "description": "drop tags scoring below this value", + "name": "threshold", + "in": "formData" + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "$ref": "#/definitions/schema.SoundClassificationResult" + } + } + } + } + }, "/v1/audio/diarization": { "post": { "consumes": [ @@ -6081,6 +6128,34 @@ } } }, + "schema.SoundClassification": { + "type": "object", + "properties": { + "index": { + "type": "integer" + }, + "label": { + "type": "string" + }, + "score": { + "type": "number" + } + } + }, + "schema.SoundClassificationResult": { + "type": "object", + "properties": { + "detections": { + "type": "array", + "items": { + "$ref": "#/definitions/schema.SoundClassification" + } + }, + "model": { + "type": "string" + } + } + }, "schema.StreamOptions": { "type": "object", "properties": { diff --git a/swagger/swagger.yaml b/swagger/swagger.yaml index 
a25674539..f83ef14e8 100644 --- a/swagger/swagger.yaml +++ b/swagger/swagger.yaml @@ -2087,6 +2087,24 @@ definitions: classifier-side confidence signal). type: number type: object + schema.SoundClassification: + properties: + index: + type: integer + label: + type: string + score: + type: number + type: object + schema.SoundClassificationResult: + properties: + detections: + items: + $ref: '#/definitions/schema.SoundClassification' + type: array + model: + type: string + type: object schema.StreamOptions: properties: include_usage: @@ -3770,6 +3788,37 @@ paths: summary: Generates audio from the input text. tags: - audio + /v1/audio/classification: + post: + consumes: + - multipart/form-data + parameters: + - description: model + in: formData + name: model + required: true + type: string + - description: audio file + in: formData + name: file + required: true + type: file + - description: number of top tags to return (0 = backend default) + in: formData + name: top_k + type: integer + - description: drop tags scoring below this value + in: formData + name: threshold + type: number + responses: + "200": + description: OK + schema: + $ref: '#/definitions/schema.SoundClassificationResult' + summary: Classify sound events in audio (audio tagging). + tags: + - audio /v1/audio/diarization: post: consumes: