feat(ced): sound-event classification backend (CED audio tagger) (#10425)

* feat(ced): sketch sound-classification backend (CED audio tagger)

Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry, footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.

SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist in DESIGN.md):
- backend/backend.proto: new SoundDetection rpc + SoundClass messages (run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h), goced.go (Ced gRPC backend: Load + SoundDetection), Makefile (clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh, package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability registration checklist), gallery/index + CI registration, and a scoping note for the realtime/websocket live-recognition path (sliding-window classify over the existing ws transport + voicegate; the ced C-API per-PCM entry point is already window-friendly).

Backend code does not compile until protogen-go regenerates the pb types and a libced.so is built (Makefile clones+builds it).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): REST /v1/audio/classification endpoint + capability registration

Wires the ced sound-event classification backend (AudioSet audio tagger) end to end through the REST surface, mirroring the transcription path.
- Handler: core/http/endpoints/openai/sound_classification.go parses the multipart audio upload, temp-files it, resolves the model config and calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection) loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface, Backend client, Client, embed, server, base default) so the loader-typed client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap + GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase option; /api/instructions audio area updated; auth RouteFeatureRegistry + FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter + i18n; docs page features/audio-classification.md + whats-new + crosslink.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): realtime sound-event detection over the websocket API

When a realtime pipeline configures a sound-classification model, each VAD-committed utterance (the same window the transcription path produces) is also run through the CED sound-event classifier and the scored AudioSet tags are emitted as a new server event. No new backend rpc is needed: the SoundDetection gRPC method already exists on this branch.
- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty) beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the ModelInterface; implement it on wrappedModel and transcriptOnlyModel by calling backend.ModelSoundDetection with the session's sound-classification model config (mirrors how Transcribe dispatches). Load the optional config in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index, detections[]{label,score,index}) with type conversation.item.sound_detection, its ServerEventType constant and MarshalJSON, mirroring the transcription completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window, build the event, t.SendEvent) and wire it at the utterance-commit hook right after emitTranscription; gated on session.SoundDetectionEnabled (resolved from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0). Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections, classifier error) plus a SoundDetection method on the fakeModel double.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ced): implement SoundDetection in nodes backend test doubles

The SoundDetection method added to the grpc backend interface left two test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so core/services/nodes failed to compile under `go vet`/`go test` (go build missed it: the doubles live in _test.go). Add the method to both, mirroring their existing Detect mock. Repairs CI for the nodes package.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)

Sound-event detection must activate on sounds, not speech, so it no longer runs through the voice VAD/transcription path. A sound-detection-only pipeline (sound_detection set, no transcription/LLM) now:
- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription stage, so the client drives windowing via input_audio_buffer.commit (option A: client-side sliding window). The per-PCM C-API already supports arbitrary windows.

commitUtterance gains a sound-only branch: it emits the conversation.item.sound_detection event (scored AudioSet tags) and stops - no transcription, no LLM response. generateResponse is now guarded on a transcription stage being present, so a sound-only turn never invokes the LLM. Existing transcription/VAD sessions are unchanged (additive).

Added a commitUtterance sound-only Ginkgo spec asserting it emits the sound event and neither transcribes nor generates a response. go vet + golangci-lint (new-from-merge-base) clean; openai suite green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): register sound-classification backend in gallery + CI

Mechanical backend-image registration for the ced sound-event classifier, mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.
- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64, l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan amd64/arm64, rocm hipblas, and the metal darwin entry), changing only backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform) plus ced-development and the per-arch image entries, each uri/mirror tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so the existing /v1/audio/classification annotations land in the generated spec.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): server-side windowing for realtime sound detection (option B)

Adds an optional server-driven sliding-window classifier so a sound-only realtime client only has to stream audio (no input_audio_buffer.commit):
- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs. When both > 0 on a sound-only session, the server classifies the last window of streamed audio every hop and emits a conversation.item.sound_detection event; the input buffer is trimmed to one window so a long stream stays bounded. When unset, the session stays client-driven (option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so it is unit-testable) + writeWindowWAV, which declares the true InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples correctly. Goroutine is started after toggleVAD and torn down with the session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta registry; the earlier realtime commit added pipeline.sound_detection without a registry entry, failing TestAllFieldsHaveRegistryEntries. This fixes that and covers the two new knobs.

Tests: classifySoundWindow emits an event + trims the buffer to one window, no-ops on too-little audio; writeWindowWAV declares the given sample rate. go build/vet + golangci-lint (new-from-merge-base) clean; config + openai suites green.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)

The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0, converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced + known_usecases: sound_classification) and two gallery/index.yaml entries (ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and removes the now-resolved TODO from backend/index.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): add tiny/mini/small GGUF model gallery entries

Publishes the rest of the CED family (same architecture, metadata-driven port verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds their f16 + q8_0 gallery entries:

  ced-tiny  (5.5M, edge/Pi-class)  f16 11MB / q8_0 6MB
  ced-mini  (9.6M)                 f16 19MB / q8_0 11MB
  ced-small (22M)                  f16 42MB / q8_0 23MB

All sha256-pinned. ced-base remains the accuracy default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo

All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8 gallery model entries' urls + file uris accordingly. sha256 and filenames are unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): bump CED_VERSION to the short-clip fix

Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip shorter than target_length (~10.11s): time_pos_embed was added at its full 63-frame grid instead of being sliced to the clip's actual time grid, tripping ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s windows) and gated with a short-clip parity test upstream.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive

- README.md: add ced.cpp to the "native C/C++/GGML engines developed and maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two verification-checklist items requiring new backends to be documented in the backends.md category list, and in-house native engines to be added to the README maintained-engines table. This directive was missing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): repin CED_VERSION to the v0.1.0 release commit

ced.cpp history was squashed into a single release commit (tagged v0.1.0), so the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the v0.1.0 release commit, so the backend builds against a commit that exists.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths

- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer. #nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since the uintptr is a C-owned malloc'd buffer, not Go-GC memory.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-22 07:39:02 -04:00 · 2026-06-22 01:00:28 +02:00
parent ce8a3e9266
commit 600dafd20b
52 changed files with 2161 additions and 152 deletions
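
Reviewer sketch (not part of the commit): the server-side windowing commit describes classifying the last window of streamed audio on every hop and trimming the input buffer to one window. The Go below is a minimal illustration of that arithmetic only. It assumes 16-bit mono PCM, and the helper names `windowBytes` and `lastWindow` are hypothetical; the real `classifySoundWindow` may be structured differently.

```go
package main

import "fmt"

// windowBytes returns how many bytes of PCM cover windowMs milliseconds at
// sampleRate Hz. Assumption: 16-bit mono, i.e. 2 bytes per sample.
func windowBytes(sampleRate, windowMs int) int {
	return sampleRate * windowMs / 1000 * 2
}

// lastWindow returns the trailing window of buf and whether enough audio has
// been buffered. When ok is false the caller skips this tick.
func lastWindow(buf []byte, sampleRate, windowMs int) (win []byte, ok bool) {
	n := windowBytes(sampleRate, windowMs)
	if n <= 0 || len(buf) < n {
		return nil, false
	}
	return buf[len(buf)-n:], true
}

func main() {
	const rate, windowMs = 16000, 1000
	buf := make([]byte, 3*windowBytes(rate, windowMs)) // 3 s of streamed audio

	win, ok := lastWindow(buf, rate, windowMs)
	fmt.Println(len(win), ok)

	// Copy the window into a fresh slice and drop the rest, so a long
	// stream keeps the buffer bounded to one window.
	buf = append([]byte(nil), win...)
	fmt.Println(len(buf))
}
```

A ticker firing every `sound_detection_hop_ms` would call `lastWindow` once per tick; with a hop shorter than the window, consecutive windows overlap.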
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -198,6 +198,27 @@ docker-build-backends: ... docker-build-<backend-name>
 - If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
 - Check similar backends to determine the correct context

+## Documenting the backend (README + docs)
+
+A backend is not "added" until it is discoverable. Update the user-facing docs:
+
+- **`docs/content/features/backends.md`** - add the backend to the right
+  category in the "LocalAI supports various types of backends" list (and add a
+  new category if it introduces a new modality, e.g. sound classification).
+- If the backend introduces a **new API surface** (a new endpoint or a realtime
+  capability), document it under `docs/content/` where its area lives (audio,
+  vision, etc.) and follow the api-endpoints checklist in
+  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).
+
+**If the backend is a native C/C++/GGML engine created and maintained by the
+LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
+`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
+ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
+engines ... developed and maintained by the LocalAI project itself". Add a row
+linking the upstream engine repo with a one-line description. This is the
+project's showcase of its own engines; a new in-house backend that is missing
+from it is a documentation bug.
+
 ## 5. Verification Checklist

 After adding a new backend, verify:
@@ -211,6 +232,8 @@ After adding a new backend, verify:
 - [ ] No YAML syntax errors (check with linter)
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
+- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
+- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`

 ## Bundling runtime shared libraries (`package.sh`)

--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -3575,6 +3575,154 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  # ced
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "8"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-12-ced'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-ced'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-ced'
+    base-image: "ubuntu:24.04"
+    ubuntu-version: '2404'
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-ced'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-ced'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'sycl_f32'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f32-ced'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'sycl_f16'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f16-ced'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    platform-tag: 'amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-ced'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/arm64'
+    platform-tag: 'arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-ced'
+    runs-on: 'ubuntu-24.04-arm'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-arm64-ced'
+    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2204'
+  - build-type: 'hipblas'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-rocm-hipblas-ced'
+    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+    runs-on: 'ubuntu-latest'
+    skip-drivers: 'false'
+    backend: "ced"
+    dockerfile: "./backend/Dockerfile.golang"
+    context: "./"
+    ubuntu-version: '2404'
  # acestep-cpp
  - build-type: ''
    cuda-major-version: ""
@@ -4754,6 +4902,10 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
    build-type: "metal"
    lang: "go"
+  - backend: "ced"
+    tag-suffix: "-metal-darwin-arm64-ced"
+    build-type: "metal"
+    lang: "go"
  - backend: "acestep-cpp"
    tag-suffix: "-metal-darwin-arm64-acestep-cpp"
    build-type: "metal"
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -42,6 +42,10 @@ jobs:
            variable: "PARAKEET_VERSION"
            branch: "master"
            file: "backend/go/parakeet-cpp/Makefile"
+          - repository: "mudler/ced.cpp"
+            variable: "CED_VERSION"
+            branch: "master"
+            file: "backend/go/ced/Makefile"
          - repository: "mudler/depth-anything.cpp"
            variable: "DEPTHANYTHING_VERSION"
            branch: "master"
--- a/README.md
+++ b/README.md
@@ -231,6 +231,7 @@ Most backends wrap a best-in-class upstream engine. A handful of them are native
 | Backend | What it does |
 |---------|-------------|
 | [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
+| [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
 | [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
 | [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
 | [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,6 +24,9 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
+  // SoundDetection runs an audio-tagging / sound-event-classification model
+  // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels.
+  rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {}
  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
@@ -671,6 +674,24 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }

+// --- Sound-event classification / audio tagging messages (CED) ---
+
+message SoundDetectionRequest {
+  string src = 1;       // audio file path (LocalAI writes the upload to disk)
+  int32 top_k = 2;      // number of top tags to return (<= 0 = backend default; the ced backend uses 10)
+  float threshold = 3;  // optional: drop tags scoring below this
+}
+
+message SoundClass {
+  string label = 1;     // AudioSet class name, e.g. "Baby cry, infant cry"
+  float score = 2;      // per-class probability (multi-label, independent)
+  int32 index = 3;      // class index in the model ontology
+}
+
+message SoundDetectionResponse {
+  repeated SoundClass detections = 1;  // score-descending
+}
+
 // --- Depth estimation messages (Depth Anything 3) ---

 message DepthRequest {
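
Reviewer sketch (not part of the commit): the commit message says the REST handler returns `{model, detections[]}` and that each detection carries `label`, `score` and `index`, matching the `SoundClass` message above. A client could decode such a body roughly as follows. The JSON key names are inferred from that description rather than confirmed against `core/schema/sound_classification.go`, and the sample values (model name, score, index) are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SoundClass mirrors the proto message fields; JSON keys are an assumption.
type SoundClass struct {
	Label string  `json:"label"`
	Score float32 `json:"score"`
	Index int32   `json:"index"`
}

// ClassificationResult is a hypothetical client-side view of the response.
type ClassificationResult struct {
	Model      string       `json:"model"`
	Detections []SoundClass `json:"detections"`
}

// decodeResult parses a response body into ClassificationResult.
func decodeResult(body []byte) (ClassificationResult, error) {
	var r ClassificationResult
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	// Invented sample payload, shaped after the commit description.
	body := []byte(`{"model":"ced-base-f16","detections":[` +
		`{"label":"Baby cry, infant cry","score":0.87,"index":1}]}`)
	r, err := decodeResult(body)
	if err != nil {
		panic(err)
	}
	for _, d := range r.Detections {
		fmt.Printf("%s: %q score=%.2f index=%d\n", r.Model, d.Label, d.Score, d.Index)
	}
}
```

Because the scores are independent per-class probabilities (multi-label), a client should not expect them to sum to 1.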
--- a/backend/go/ced/.gitignore
+++ b/backend/go/ced/.gitignore
@@ -0,0 +1,11 @@
+.cache/
+sources/
+build/
+package/
+ced-grpc
+# build artifacts staged in-tree by the Makefile (cp from sources/) or
+# symlinked for local dev; the real sources live in ced.cpp upstream.
+*.so
+*.so.*
+ced_capi.h
+compile_commands.json
--- a/backend/go/ced/Makefile
+++ b/backend/go/ced/Makefile
@@ -0,0 +1,77 @@
+# ced sound-classification backend Makefile.
+#
+# Upstream pin lives below as CED_VERSION?=<sha> so .github/bump_deps.sh can find
+# and update it (matches the parakeet-cpp / whisper.cpp convention).
+#
+# Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and
+# skip the clone/cmake steps entirely:
+#   ln -sf /path/to/ced.cpp/build-shared/libced.so .
+#   ln -sf /path/to/ced.cpp/include/ced_capi.h .
+#   go build -o ced-grpc .
+
+CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7
+CED_REPO?=https://github.com/mudler/ced.cpp
+
+GOCMD?=go
+GO_TAGS?=
+JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
+
+BUILD_TYPE?=
+NATIVE?=false
+
+# Static-link ggml into libced.so (PIC) so the shared lib is self-contained:
+# dlopen needs no libggml*.so alongside it, only system libs the runtime image
+# already provides.
+CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON
+
+ifeq ($(NATIVE),false)
+	CMAKE_ARGS+=-DGGML_NATIVE=OFF
+endif
+
+# ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL
+# "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON.
+ifeq ($(BUILD_TYPE),cublas)
+	CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
+else ifeq ($(BUILD_TYPE),openblas)
+	CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
+else ifeq ($(BUILD_TYPE),hipblas)
+	CMAKE_ARGS+=-DCED_GGML_HIP=ON
+else ifeq ($(BUILD_TYPE),vulkan)
+	CMAKE_ARGS+=-DCED_GGML_VULKAN=ON
+endif
+
+.PHONY: ced-grpc package build clean purge test all
+
+all: ced-grpc
+
+sources/ced.cpp:
+	mkdir -p sources/ced.cpp
+	cd sources/ced.cpp && \
+	git init -q && \
+	git remote add origin $(CED_REPO) && \
+	git fetch --depth 1 origin $(CED_VERSION) && \
+	git checkout FETCH_HEAD && \
+	git submodule update --init --recursive --depth 1 --single-branch
+
+libced.so: sources/ced.cpp
+	cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS)
+	cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS)
+	cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true
+	cp -fv sources/ced.cpp/include/ced_capi.h ./
+
+ced-grpc: libced.so main.go goced.go
+	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc .
+
+package: ced-grpc
+	bash package.sh
+
+build: package
+
+test:
+	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1
+
+clean: purge
+	rm -rf libced.so* ced_capi.h package ced-grpc
+
+purge:
+	rm -rf sources/ced.cpp
--- a/backend/go/ced/goced.go
+++ b/backend/go/ced/goced.go
@@ -0,0 +1,130 @@
+package main
+
+// Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC
+// SoundDetection implementation.
+//
+// SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with
+// `make protogen-go`). The C side is single-threaded per ctx, so we guard the
+// engine with engineMu; LocalAI also serializes via base.SingleThread.
+import (
+	"context"
+	"encoding/json"
+	"errors"
+	"fmt"
+	"sort"
+	"sync"
+	"unsafe"
+
+	"github.com/mudler/LocalAI/pkg/grpc/base"
+	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
+)
+
+// purego-bound entry points from libced.so. Names match ced_capi.h exactly.
+var (
+	CppAbiVersion       func() int32
+	CppLoad             func(ggufPath string) uintptr
+	CppFree             func(ctx uintptr)
+	CppLastError        func(ctx uintptr) string
+	CppNumClasses       func(ctx uintptr) int32
+	CppSampleRate       func(ctx uintptr) int32
+	CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr
+	CppClassifyPcmJSON  func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr
+	CppFreeString       func(s uintptr)
+)
+
+// cstr copies a malloc'd C string (returned as uintptr) into a Go string and
+// frees the original via ced_capi_free_string. Empty/0 -> "".
+func cstr(p uintptr) string {
+	if p == 0 {
+		return ""
+	}
+	defer CppFreeString(p)
+	var b []byte
+	for i := 0; ; i++ {
+		ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory)
+		if ch == 0 {
+			break
+		}
+		b = append(b, ch)
+	}
+	return string(b)
+}
+
+// Ced is the gRPC backend. One loaded CED model per instance.
+type Ced struct {
+	base.Base
+	ctxPtr   uintptr
+	engineMu sync.Mutex
+}
+
+// Load resolves the GGUF and opens the C-API context.
+func (c *Ced) Load(opts *pb.ModelOptions) error {
+	if opts.ModelFile == "" {
+		return errors.New("ced: ModelFile is required")
+	}
+	ctx := CppLoad(opts.ModelFile)
+	if ctx == 0 {
+		return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0))
+	}
+	c.ctxPtr = ctx
+	return nil
+}
+
+// jsonTag mirrors the ced_capi JSON tag objects.
+type jsonTag struct {
+	Index int     `json:"index"`
+	Score float32 `json:"score"`
+	Label string  `json:"label"`
+}
+
+// SoundDetection classifies the clip at req.Src and returns scored AudioSet tags.
+func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
+	if c.ctxPtr == 0 {
+		return nil, errors.New("ced: model not loaded")
+	}
+	if req.GetSrc() == "" {
+		return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required")
+	}
+	topK := req.GetTopK()
+	if topK <= 0 {
+		topK = 10 // sensible default for a tagging response
+	}
+
+	c.engineMu.Lock()
+	out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK))
+	lastErr := CppLastError(c.ctxPtr)
+	c.engineMu.Unlock()
+
+	if out == "" {
+		return nil, fmt.Errorf("ced: classification failed: %s", lastErr)
+	}
+	var tags []jsonTag
+	if err := json.Unmarshal([]byte(out), &tags); err != nil {
+		return nil, fmt.Errorf("ced: bad classifier JSON: %w", err)
+	}
+
+	thr := req.GetThreshold()
+	resp := &pb.SoundDetectionResponse{}
+	for _, t := range tags {
+		if t.Score < thr {
+			continue
+		}
+		resp.Detections = append(resp.Detections, &pb.SoundClass{
+			Label: t.Label, Score: t.Score, Index: int32(t.Index),
+		})
+	}
+	sort.Slice(resp.Detections, func(i, j int) bool {
+		return resp.Detections[i].Score > resp.Detections[j].Score
+	})
+	return resp, nil
+}
+
+func (c *Ced) Free() error {
+	c.engineMu.Lock()
+	defer c.engineMu.Unlock()
+	if c.ctxPtr != 0 {
+		CppFree(c.ctxPtr)
+		c.ctxPtr = 0
+	}
+	return nil
+}
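
Reviewer sketch (not part of the commit): `CppClassifyPcmJSON` is bound above with a `[]float32` buffer, but `SoundDetection` only uses the path-based entry point. A caller holding raw 16-bit PCM (for example a realtime window) would first need float samples. The conversion below is a common one; whether libced expects samples normalized to [-1, 1) is an assumption not confirmed by this diff, and `pcm16ToFloat32` is a hypothetical helper.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// pcm16ToFloat32 converts little-endian signed 16-bit PCM bytes into float32
// samples in [-1, 1). A trailing odd byte is ignored.
func pcm16ToFloat32(b []byte) []float32 {
	out := make([]float32, len(b)/2)
	for i := range out {
		s := int16(binary.LittleEndian.Uint16(b[2*i:]))
		out[i] = float32(s) / 32768
	}
	return out
}

func main() {
	// Three samples: 0, 16384 and -32768 as little-endian int16.
	raw := []byte{0x00, 0x00, 0x00, 0x40, 0x00, 0x80}
	fmt.Println(pcm16ToFloat32(raw))
}
```

The resulting slice, its length and the stream's sample rate line up with the `pcm`, `nSamples` and `sampleRate` parameters of the bound function.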
--- a/backend/go/ced/main.go
+++ b/backend/go/ced/main.go
@@ -0,0 +1,59 @@
+package main
+
+// ced sound-classification backend. Started internally by LocalAI: one gRPC
+// server per loaded model. Loads libced.so via purego and registers the flat
+// C-API declared in ced_capi.h. The library name can be overridden with
+// CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default is the
+// bare name libced.so, which dlopen resolves via the loader's search path.
+//
+// SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
+// addition, and a built libced.so (see Makefile). See DESIGN.md.
+import (
+	"flag"
+	"fmt"
+	"os"
+
+	"github.com/ebitengine/purego"
+	grpc "github.com/mudler/LocalAI/pkg/grpc"
+)
+
+var addr = flag.String("addr", "localhost:50051", "the address to listen on")
+
+type libFunc struct {
+	ptr  any
+	name string
+}
+
+func main() {
+	libName := os.Getenv("CED_LIBRARY")
+	if libName == "" {
+		libName = "libced.so"
+	}
+	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
+	if err != nil {
+		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
+	}
+
+	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
+	// so we can free the same pointer with ced_capi_free_string after copying
+	// (purego's string return would copy and leak the original).
+	for _, lf := range []libFunc{
+		{&CppAbiVersion, "ced_capi_abi_version"},
+		{&CppLoad, "ced_capi_load"},
+		{&CppFree, "ced_capi_free"},
+		{&CppLastError, "ced_capi_last_error"},
+		{&CppNumClasses, "ced_capi_num_classes"},
+		{&CppSampleRate, "ced_capi_sample_rate"},
+		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
+		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
+		{&CppFreeString, "ced_capi_free_string"},
+	} {
+		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
+	}
+
+	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())
+	flag.Parse()
+	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
+		panic(err)
+	}
+}
--- a/backend/go/ced/package.sh
+++ b/backend/go/ced/package.sh
@@ -0,0 +1,60 @@
+#!/bin/bash
+#
+# Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
+# libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
+# is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
+# the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.
+
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+mkdir -p "$CURDIR/package/lib"
+
+cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
+cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
+
+cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || {
+	echo "ERROR: libced.so not found in $CURDIR, run 'make' first" >&2
+	exit 1
+}
+
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
+elif [ "$(uname -s)" = "Darwin" ]; then
+    echo "Detected Darwin"
+else
+    echo "Error: Could not detect architecture" >&2
+    exit 1
+fi
+
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"
--- a/backend/go/ced/run.sh
+++ b/backend/go/ced/run.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+
+export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
+
+# If a self-contained ld.so was packaged, route through it so the packaged
+# libc / libstdc++ are used instead of the host's (matches the sibling backends).
+if [ -f "$CURDIR/lib/ld.so" ]; then
+	echo "Using lib/ld.so"
+	exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
+fi
+
+exec "$CURDIR/ced-grpc" "$@"
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -178,6 +178,37 @@
    nvidia-cuda-12: "cuda12-parakeet-cpp"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-parakeet-cpp"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-parakeet-cpp"
+- &ced
+  name: "ced"
+  alias: "ced"
+  license: mit
+  icon: https://avatars.githubusercontent.com/u/95302084
+  description: |
+    CED sound-event classification / audio tagging (527-class AudioSet).
+    ced.cpp is a C++/ggml port that performs audio tagging over the AudioSet
+    taxonomy, exposed through the SoundDetection gRPC rpc and the
+    /v1/audio/classification REST endpoint. It runs on CPU, NVIDIA CUDA,
+    AMD ROCm/HIP, Intel SYCL, Vulkan, Apple Metal and NVIDIA Jetson (L4T) targets.
+  urls:
+    - https://github.com/mudler/ced.cpp
+  tags:
+    - audio-classification
+    - CPU
+    - GPU
+    - CUDA
+    - HIP
+  capabilities:
+    default: "cpu-ced"
+    nvidia: "cuda12-ced"
+    intel: "intel-sycl-f16-ced"
+    metal: "metal-ced"
+    amd: "rocm-ced"
+    vulkan: "vulkan-ced"
+    nvidia-l4t: "nvidia-l4t-arm64-ced"
+    nvidia-cuda-13: "cuda13-ced"
+    nvidia-cuda-12: "cuda12-ced"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced"
 - &voxtral
  name: "voxtral"
  alias: "voxtral"
@@ -2650,6 +2681,121 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp
+## ced
+- !!merge <<: *ced
+  name: "ced-development"
+  capabilities:
+    default: "cpu-ced-development"
+    nvidia: "cuda12-ced-development"
+    intel: "intel-sycl-f16-ced-development"
+    metal: "metal-ced-development"
+    amd: "rocm-ced-development"
+    vulkan: "vulkan-ced-development"
+    nvidia-l4t: "nvidia-l4t-arm64-ced-development"
+    nvidia-cuda-13: "cuda13-ced-development"
+    nvidia-cuda-12: "cuda12-ced-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced-development"
+- !!merge <<: *ced
+  name: "nvidia-l4t-arm64-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-ced"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-arm64-ced
+- !!merge <<: *ced
+  name: "nvidia-l4t-arm64-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-ced"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-arm64-ced
+- !!merge <<: *ced
+  name: "cuda13-nvidia-l4t-arm64-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ced"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ced
+- !!merge <<: *ced
+  name: "cuda13-nvidia-l4t-arm64-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ced"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ced
+- !!merge <<: *ced
+  name: "cpu-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ced"
+  mirrors:
+    - localai/localai-backends:latest-cpu-ced
+- !!merge <<: *ced
+  name: "cpu-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ced"
+  mirrors:
+    - localai/localai-backends:master-cpu-ced
+- !!merge <<: *ced
+  name: "metal-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ced"
+  mirrors:
+    - localai/localai-backends:latest-metal-darwin-arm64-ced
+- !!merge <<: *ced
+  name: "metal-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ced"
+  mirrors:
+    - localai/localai-backends:master-metal-darwin-arm64-ced
+- !!merge <<: *ced
+  name: "cuda12-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-ced
+- !!merge <<: *ced
+  name: "cuda12-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-ced
+- !!merge <<: *ced
+  name: "rocm-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-rocm-hipblas-ced
+- !!merge <<: *ced
+  name: "rocm-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-rocm-hipblas-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f32-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f32-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f32-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f32-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f16-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-intel-sycl-f16-ced
+- !!merge <<: *ced
+  name: "intel-sycl-f16-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-intel-sycl-f16-ced
+- !!merge <<: *ced
+  name: "vulkan-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-vulkan-ced
+- !!merge <<: *ced
+  name: "vulkan-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-vulkan-ced
+- !!merge <<: *ced
+  name: "cuda13-ced"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ced"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-ced
+- !!merge <<: *ced
+  name: "cuda13-ced-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ced"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-ced
 ## stablediffusion-ggml
 - !!merge <<: *stablediffusionggml
  name: "cpu-stablediffusion-ggml"
--- a/core/backend/sound_classification.go
+++ b/core/backend/sound_classification.go
@@ -0,0 +1,88 @@
+package backend
+
+import (
+	"context"
+	"fmt"
+	"sort"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/schema"
+
+	grpcPkg "github.com/mudler/LocalAI/pkg/grpc"
+	"github.com/mudler/LocalAI/pkg/grpc/proto"
+	"github.com/mudler/LocalAI/pkg/model"
+)
+
+// SoundDetectionRequest carries the knobs the HTTP layer collects for an
+// audio-tagging / sound-event-classification call. Audio is the path to the
+// uploaded clip on disk; TopK and Threshold are optional (0 = backend default).
+type SoundDetectionRequest struct {
+	Audio     string
+	TopK      int32
+	Threshold float32
+}
+
+func (r *SoundDetectionRequest) toProto() *proto.SoundDetectionRequest {
+	return &proto.SoundDetectionRequest{
+		Src:       r.Audio,
+		TopK:      r.TopK,
+		Threshold: r.Threshold,
+	}
+}
+
+func loadSoundDetectionModel(ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (grpcPkg.Backend, error) {
+	if modelConfig.Backend == "" {
+		return nil, fmt.Errorf("sound classification: model %q has no backend set; supported backends include ced", modelConfig.Name)
+	}
+	opts := ModelOptions(modelConfig, appConfig)
+	m, err := ml.Load(opts...)
+	if err != nil {
+		recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
+		return nil, err
+	}
+	if m == nil {
+		return nil, fmt.Errorf("sound classification: could not load model %q", modelConfig.Name)
+	}
+	return m, nil
+}
+
+// ModelSoundDetection runs the SoundDetection RPC against the configured
+// backend and returns a normalized schema.SoundClassificationResult.
+func ModelSoundDetection(ctx context.Context, req SoundDetectionRequest, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (*schema.SoundClassificationResult, error) {
+	m, err := loadSoundDetectionModel(ml, modelConfig, appConfig)
+	if err != nil {
+		return nil, err
+	}
+
+	r, err := m.SoundDetection(ctx, req.toProto())
+	if err != nil {
+		return nil, err
+	}
+	return soundClassificationResultFromProto(modelConfig.Name, r), nil
+}
+
+// soundClassificationResultFromProto maps the backend detections to the
+// HTTP-facing schema, stably sorted by score in descending order.
+func soundClassificationResultFromProto(modelName string, r *proto.SoundDetectionResponse) *schema.SoundClassificationResult {
+	out := &schema.SoundClassificationResult{
+		Model:      modelName,
+		Detections: []schema.SoundClassification{},
+	}
+	if r == nil {
+		return out
+	}
+	for _, d := range r.Detections {
+		if d == nil {
+			continue
+		}
+		out.Detections = append(out.Detections, schema.SoundClassification{
+			Index: int(d.Index),
+			Label: d.Label,
+			Score: d.Score,
+		})
+	}
+	sort.SliceStable(out.Detections, func(i, j int) bool {
+		return out.Detections[i].Score > out.Detections[j].Score
+	})
+	return out
+}
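Review note: the mapper above re-sorts detections itself rather than trusting the backend's order. A minimal standalone sketch of that stable, score-descending sort follows; `Detection` is a hypothetical stand-in for `schema.SoundClassification`, and the labels, indices and scores are illustrative values only.

```go
package main

import (
	"fmt"
	"sort"
)

// Detection is a hypothetical stand-in for schema.SoundClassification.
type Detection struct {
	Index int
	Label string
	Score float32
}

// sortByScoreDesc orders detections from highest to lowest score. The sort is
// stable, so detections with equal scores keep their incoming order.
func sortByScoreDesc(ds []Detection) {
	sort.SliceStable(ds, func(i, j int) bool {
		return ds[i].Score > ds[j].Score
	})
}

func main() {
	// Illustrative values only; not real model output.
	ds := []Detection{
		{Index: 10, Label: "first", Score: 0.20},
		{Index: 11, Label: "second", Score: 0.90},
		{Index: 12, Label: "third", Score: 0.20},
	}
	sortByScoreDesc(ds)
	for _, d := range ds {
		fmt.Printf("%d %s %.2f\n", d.Index, d.Label, d.Score)
	}
}
```

Because the sort is stable, two tags with the same score come out in whatever order the backend produced them, which keeps responses deterministic for a given backend output.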
--- a/core/config/backend_capabilities.go
+++ b/core/config/backend_capabilities.go
@@ -8,27 +8,28 @@ import (
 // Usecase name constants — the canonical string values used in gallery entries,
 // model configs (known_usecases), and UsecaseInfoMap keys.
 const (
-	UsecaseChat               = "chat"
-	UsecaseCompletion         = "completion"
-	UsecaseEdit               = "edit"
-	UsecaseVision             = "vision"
-	UsecaseEmbeddings         = "embeddings"
-	UsecaseTokenize           = "tokenize"
-	UsecaseImage              = "image"
-	UsecaseVideo              = "video"
-	UsecaseTranscript         = "transcript"
-	UsecaseTTS                = "tts"
-	UsecaseSoundGeneration    = "sound_generation"
-	UsecaseRerank             = "rerank"
-	UsecaseDetection          = "detection"
-	UsecaseDepth              = "depth"
-	UsecaseVAD                = "vad"
-	UsecaseAudioTransform     = "audio_transform"
-	UsecaseDiarization        = "diarization"
-	UsecaseRealtimeAudio      = "realtime_audio"
-	UsecaseFaceRecognition    = "face_recognition"
-	UsecaseSpeakerRecognition = "speaker_recognition"
-	UsecaseTokenClassify      = "token_classify"
+	UsecaseChat                = "chat"
+	UsecaseCompletion          = "completion"
+	UsecaseEdit                = "edit"
+	UsecaseVision              = "vision"
+	UsecaseEmbeddings          = "embeddings"
+	UsecaseTokenize            = "tokenize"
+	UsecaseImage               = "image"
+	UsecaseVideo               = "video"
+	UsecaseTranscript          = "transcript"
+	UsecaseTTS                 = "tts"
+	UsecaseSoundGeneration     = "sound_generation"
+	UsecaseRerank              = "rerank"
+	UsecaseDetection           = "detection"
+	UsecaseDepth               = "depth"
+	UsecaseVAD                 = "vad"
+	UsecaseAudioTransform      = "audio_transform"
+	UsecaseDiarization         = "diarization"
+	UsecaseSoundClassification = "sound_classification"
+	UsecaseRealtimeAudio       = "realtime_audio"
+	UsecaseFaceRecognition     = "face_recognition"
+	UsecaseSpeakerRecognition  = "speaker_recognition"
+	UsecaseTokenClassify       = "token_classify"
 )

 // GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -51,6 +52,7 @@ const (
 	MethodVAD                GRPCMethod = "VAD"
 	MethodAudioTransform     GRPCMethod = "AudioTransform"
 	MethodDiarize            GRPCMethod = "Diarize"
+	MethodSoundDetection     GRPCMethod = "SoundDetection"
 	MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
 	MethodFaceVerify         GRPCMethod = "FaceVerify"
 	MethodFaceAnalyze        GRPCMethod = "FaceAnalyze"
@@ -165,6 +167,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
 		GRPCMethod:  MethodDiarize,
 		Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
 	},
+	UsecaseSoundClassification: {
+		Flag:        FLAG_SOUND_CLASSIFICATION,
+		GRPCMethod:  MethodSoundDetection,
+		Description: "Sound-event classification / audio tagging (scored AudioSet labels like baby cry, glass breaking, alarms) via the SoundDetection RPC.",
+	},
 	UsecaseRealtimeAudio: {
 		Flag:        FLAG_REALTIME_AUDIO,
 		GRPCMethod:  MethodAudioToAudioStream,
--- a/core/config/meta/constants.go
+++ b/core/config/meta/constants.go
@@ -68,6 +68,7 @@ var UsecaseOptions = []FieldOption{
 	{Value: "face_recognition", Label: "Face Recognition"},
 	{Value: "transcript", Label: "Transcript"},
 	{Value: "diarization", Label: "Diarization"},
+	{Value: "sound_classification", Label: "Sound Classification"},
 	{Value: "speaker_recognition", Label: "Speaker Recognition"},
 	{Value: "tts", Label: "TTS"},
 	{Value: "sound_generation", Label: "Sound Generation"},
--- a/core/config/meta/registry.go
+++ b/core/config/meta/registry.go
@@ -328,6 +328,30 @@ func DefaultRegistry() map[string]FieldMetaOverride {
 			AutocompleteProvider: ProviderModelsVAD,
 			Order:                63,
 		},
+		"pipeline.sound_detection": {
+			Section:              "pipeline",
+			Label:                "Sound Detection Model",
+			Description:          "Model to use for sound-event classification (audio tagging, e.g. ced) in the pipeline. When set, committed realtime audio is also classified and the scored AudioSet tags are emitted as a conversation.item.sound_detection event.",
+			Component:            "model-select",
+			AutocompleteProvider: ProviderModels,
+			Order:                64,
+		},
+		"pipeline.sound_detection_window_ms": {
+			Section:     "pipeline",
+			Label:       "Sound Detection Window (ms)",
+			Description: "Server-side windowing for a sound-only realtime session: length in ms of the audio window classified each hop. 0 = client-driven (the client commits windows).",
+			Component:   "number",
+			Min:         f64(0),
+			Order:       65,
+		},
+		"pipeline.sound_detection_hop_ms": {
+			Section:     "pipeline",
+			Label:       "Sound Detection Hop (ms)",
+			Description: "Server-side windowing hop in ms: how often the server classifies the last window. 0 = client-driven.",
+			Component:   "number",
+			Min:         f64(0),
+			Order:       66,
+		},
 		"pipeline.reasoning_effort": {
 			Section:     "pipeline",
 			Label:       "Reasoning Effort",
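Review note: the window/hop fields above describe classifying the last window of audio once per hop. The sketch below shows one plausible reading of that arithmetic; it is not the PR's actual loop, which is not visible in this part of the diff. `windowBounds`, the 16 kHz sample rate, and the choice to emit a shorter first window before a full window has accumulated are all assumptions made for illustration.

```go
package main

import "fmt"

// windowBounds returns the [start, end) sample ranges that would be classified
// for a stream of `total` samples when the last windowMs of audio is taken
// every hopMs. It returns nil when either value is <= 0, which corresponds to
// the client-driven mode. Hypothetical helper, not part of the PR.
func windowBounds(total, sampleRate, windowMs, hopMs int) [][2]int {
	if windowMs <= 0 || hopMs <= 0 || sampleRate <= 0 {
		return nil
	}
	win := sampleRate * windowMs / 1000
	hop := sampleRate * hopMs / 1000
	if win == 0 || hop == 0 {
		return nil
	}
	var out [][2]int
	for end := hop; end <= total; end += hop {
		start := end - win
		if start < 0 {
			start = 0 // assumption: classify what is available so far
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	// 3 s of audio at an assumed 16 kHz, 2000 ms window, 1000 ms hop.
	for _, w := range windowBounds(48000, 16000, 2000, 1000) {
		fmt.Printf("classify samples [%d, %d)\n", w[0], w[1])
	}
}
```

A hop shorter than the window gives overlapping windows, so a short sound near a window edge is still seen whole in a neighbouring window, at the cost of more classifier calls.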
--- a/core/config/model_config.go
+++ b/core/config/model_config.go
@@ -604,6 +604,20 @@ type Pipeline struct {
 	LLM           string `yaml:"llm,omitempty" json:"llm,omitempty"`
 	Transcription string `yaml:"transcription,omitempty" json:"transcription,omitempty"`
 	VAD           string `yaml:"vad,omitempty" json:"vad,omitempty"`
+	// SoundDetection names a sound-event-classification model (e.g. ced). When
+	// set, each VAD-committed realtime utterance is also run through it and the
+	// scored AudioSet tags are emitted as a conversation.item.sound_detection
+	// server event, alongside (and independent of) transcription.
+	SoundDetection string `yaml:"sound_detection,omitempty" json:"sound_detection,omitempty"`
+
+	// SoundDetectionWindowMs / SoundDetectionHopMs enable server-side windowing
+	// for a sound-detection-only realtime session: instead of the client
+	// committing audio buffers, the server classifies the last WindowMs of
+	// streamed audio every HopMs and emits a sound_detection event per hop. Both
+	// must be > 0 to activate; otherwise the session stays client-driven (the
+	// client commits windows via input_audio_buffer.commit).
+	SoundDetectionWindowMs int `yaml:"sound_detection_window_ms,omitempty" json:"sound_detection_window_ms,omitempty"`
+	SoundDetectionHopMs    int `yaml:"sound_detection_hop_ms,omitempty" json:"sound_detection_hop_ms,omitempty"`

 	// ReasoningEffort sets the reasoning effort (none|minimal|low|medium|high) for
 	// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
@@ -1452,6 +1466,11 @@ const (
 	// so it may combine freely with other usecases.
 	FLAG_TOKEN_CLASSIFY ModelConfigUsecase = 0b1000000000000000000000

+	// Marks a model as wired for the SoundDetection gRPC primitive
+	// (audio tagging / sound-event classification — scored AudioSet
+	// labels via the SoundDetection RPC, e.g. ced).
+	FLAG_SOUND_CLASSIFICATION ModelConfigUsecase = 0b10000000000000000000000
+
 	// Common Subsets
 	FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
 )
@@ -1460,12 +1479,12 @@ const (
 // Flags within the same group are NOT orthogonal (e.g., chat and completion are
 // both text/language). A model is multimodal when its usecases span 2+ groups.
 var ModalityGroups = []ModelConfigUsecase{
-	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,                // text/language
-	FLAG_VISION | FLAG_DETECTION,                           // visual understanding
-	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO,                  // speech input — realtime_audio is any-to-any, so it counts here too
-	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
-	FLAG_AUDIO_TRANSFORM,                                   // audio in/out transforms
-	FLAG_IMAGE | FLAG_VIDEO,                                // visual generation
+	FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT,                           // text/language
+	FLAG_VISION | FLAG_DETECTION,                                      // visual understanding
+	FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO | FLAG_SOUND_CLASSIFICATION, // audio input — realtime_audio is any-to-any, so it counts here too
+	FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO,            // audio output — and here, so a lone realtime_audio flag still reads as multimodal
+	FLAG_AUDIO_TRANSFORM,                                              // audio in/out transforms
+	FLAG_IMAGE | FLAG_VIDEO,                                           // visual generation
 }

 // IsMultimodal returns true if the given usecases span two or more orthogonal
@@ -1488,29 +1507,30 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
 	return map[string]ModelConfigUsecase{
 		// Note: FLAG_ANY is intentionally excluded from this map
 		// because it's 0 and would always match in HasUsecases checks
-		"FLAG_CHAT":                FLAG_CHAT,
-		"FLAG_COMPLETION":          FLAG_COMPLETION,
-		"FLAG_EDIT":                FLAG_EDIT,
-		"FLAG_EMBEDDINGS":          FLAG_EMBEDDINGS,
-		"FLAG_RERANK":              FLAG_RERANK,
-		"FLAG_IMAGE":               FLAG_IMAGE,
-		"FLAG_TRANSCRIPT":          FLAG_TRANSCRIPT,
-		"FLAG_TTS":                 FLAG_TTS,
-		"FLAG_SOUND_GENERATION":    FLAG_SOUND_GENERATION,
-		"FLAG_TOKENIZE":            FLAG_TOKENIZE,
-		"FLAG_VAD":                 FLAG_VAD,
-		"FLAG_LLM":                 FLAG_LLM,
-		"FLAG_VIDEO":               FLAG_VIDEO,
-		"FLAG_DETECTION":           FLAG_DETECTION,
-		"FLAG_VISION":              FLAG_VISION,
-		"FLAG_FACE_RECOGNITION":    FLAG_FACE_RECOGNITION,
-		"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
-		"FLAG_AUDIO_TRANSFORM":     FLAG_AUDIO_TRANSFORM,
-		"FLAG_DIARIZATION":         FLAG_DIARIZATION,
-		"FLAG_REALTIME_AUDIO":      FLAG_REALTIME_AUDIO,
-		"FLAG_SCORE":               FLAG_SCORE,
-		"FLAG_DEPTH":               FLAG_DEPTH,
-		"FLAG_TOKEN_CLASSIFY":      FLAG_TOKEN_CLASSIFY,
+		"FLAG_CHAT":                 FLAG_CHAT,
+		"FLAG_COMPLETION":           FLAG_COMPLETION,
+		"FLAG_EDIT":                 FLAG_EDIT,
+		"FLAG_EMBEDDINGS":           FLAG_EMBEDDINGS,
+		"FLAG_RERANK":               FLAG_RERANK,
+		"FLAG_IMAGE":                FLAG_IMAGE,
+		"FLAG_TRANSCRIPT":           FLAG_TRANSCRIPT,
+		"FLAG_TTS":                  FLAG_TTS,
+		"FLAG_SOUND_GENERATION":     FLAG_SOUND_GENERATION,
+		"FLAG_TOKENIZE":             FLAG_TOKENIZE,
+		"FLAG_VAD":                  FLAG_VAD,
+		"FLAG_LLM":                  FLAG_LLM,
+		"FLAG_VIDEO":                FLAG_VIDEO,
+		"FLAG_DETECTION":            FLAG_DETECTION,
+		"FLAG_VISION":               FLAG_VISION,
+		"FLAG_FACE_RECOGNITION":     FLAG_FACE_RECOGNITION,
+		"FLAG_SPEAKER_RECOGNITION":  FLAG_SPEAKER_RECOGNITION,
+		"FLAG_AUDIO_TRANSFORM":      FLAG_AUDIO_TRANSFORM,
+		"FLAG_DIARIZATION":          FLAG_DIARIZATION,
+		"FLAG_SOUND_CLASSIFICATION": FLAG_SOUND_CLASSIFICATION,
+		"FLAG_REALTIME_AUDIO":       FLAG_REALTIME_AUDIO,
+		"FLAG_SCORE":                FLAG_SCORE,
+		"FLAG_DEPTH":                FLAG_DEPTH,
+		"FLAG_TOKEN_CLASSIFY":       FLAG_TOKEN_CLASSIFY,
 	}
 }

@@ -1713,6 +1733,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
 		}
 	}

+	if (u & FLAG_SOUND_CLASSIFICATION) == FLAG_SOUND_CLASSIFICATION {
+		// ced is a sound-event tagger (AudioSet labels) surfaced via the
+		// SoundDetection RPC. Models without an explicit known_usecases
+		// still surface when they run on one of these backends.
+		soundClassificationBackends := []string{"ced"}
+		if !slices.Contains(soundClassificationBackends, c.Backend) {
+			return false
+		}
+	}
+
 	if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
 		// Backends that own a single any-to-any loop and implement
 		// AudioToAudioStream — listed here so models without an explicit
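Review note: `FLAG_SOUND_CLASSIFICATION` joins the audio-input modality group, so a model tagged only with transcript and sound classification should not count as multimodal. The sketch below illustrates the "spans 2+ groups" rule with a reduced, hypothetical flag set; the real flag values and the body of `IsMultimodal` are not shown in this part of the diff, so everything here is an assumption modelled on the comments above.

```go
package main

import "fmt"

// Usecase is a hypothetical stand-in for config.ModelConfigUsecase.
type Usecase uint32

// Reduced, illustrative flag set; the real constants use different bit values.
const (
	FlagChat Usecase = 1 << iota
	FlagTranscript
	FlagSoundClassification
	FlagTTS
)

// modalityGroups mirrors the idea of ModalityGroups: flags inside one group
// are not orthogonal to each other.
var modalityGroups = []Usecase{
	FlagChat,                                 // text/language
	FlagTranscript | FlagSoundClassification, // audio input
	FlagTTS,                                  // audio output
}

// isMultimodal reports whether u touches two or more modality groups.
func isMultimodal(u Usecase) bool {
	groups := 0
	for _, g := range modalityGroups {
		if u&g != 0 {
			groups++
		}
	}
	return groups >= 2
}

func main() {
	fmt.Println(isMultimodal(FlagTranscript | FlagSoundClassification))
	fmt.Println(isMultimodal(FlagChat | FlagSoundClassification))
}
```

Under this reading, grouping sound classification with transcription means an audio tagger does not make a speech model look multimodal on its own.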
--- a/core/http/auth/features.go
+++ b/core/http/auth/features.go
@@ -48,6 +48,10 @@ var RouteFeatureRegistry = []RouteFeature{
 	{"POST", "/v1/audio/diarization", FeatureAudioDiarization},
 	{"POST", "/audio/diarization", FeatureAudioDiarization},

+	// Audio classification (sound-event tagging)
+	{"POST", "/v1/audio/classification", FeatureAudioClassification},
+	{"POST", "/audio/classification", FeatureAudioClassification},
+
 	// Audio speech / TTS
 	{"POST", "/v1/audio/speech", FeatureAudioSpeech},
 	{"POST", "/audio/speech", FeatureAudioSpeech},
@@ -172,6 +176,7 @@ func APIFeatureMetas() []FeatureMeta {
 		{FeatureAudioSpeech, "Audio Speech / TTS", true},
 		{FeatureAudioTranscription, "Audio Transcription", true},
 		{FeatureAudioDiarization, "Audio Diarization", true},
+		{FeatureAudioClassification, "Audio Classification", true},
 		{FeatureVAD, "Voice Activity Detection", true},
 		{FeatureDetection, "Detection", true},
 		{FeatureVideo, "Video Generation", true},
--- a/core/http/auth/permissions.go
+++ b/core/http/auth/permissions.go
@@ -38,24 +38,25 @@ const (
 	FeatureQuantization = "quantization"

 	// API features (default ON for new users)
-	FeatureChat               = "chat"
-	FeatureImages             = "images"
-	FeatureAudioSpeech        = "audio_speech"
-	FeatureAudioTranscription = "audio_transcription"
-	FeatureAudioDiarization   = "audio_diarization"
-	FeatureVAD                = "vad"
-	FeatureDetection          = "detection"
-	FeatureVideo              = "video"
-	FeatureEmbeddings         = "embeddings"
-	FeatureSound              = "sound"
-	FeatureRealtime           = "realtime"
-	FeatureRerank             = "rerank"
-	FeatureTokenize           = "tokenize"
-	FeatureMCP                = "mcp"
-	FeatureStores             = "stores"
-	FeatureFaceRecognition    = "face_recognition"
-	FeatureVoiceRecognition   = "voice_recognition"
-	FeatureAudioTransform     = "audio_transform"
+	FeatureChat                = "chat"
+	FeatureImages              = "images"
+	FeatureAudioSpeech         = "audio_speech"
+	FeatureAudioTranscription  = "audio_transcription"
+	FeatureAudioDiarization    = "audio_diarization"
+	FeatureAudioClassification = "audio_classification"
+	FeatureVAD                 = "vad"
+	FeatureDetection           = "detection"
+	FeatureVideo               = "video"
+	FeatureEmbeddings          = "embeddings"
+	FeatureSound               = "sound"
+	FeatureRealtime            = "realtime"
+	FeatureRerank              = "rerank"
+	FeatureTokenize            = "tokenize"
+	FeatureMCP                 = "mcp"
+	FeatureStores              = "stores"
+	FeatureFaceRecognition     = "face_recognition"
+	FeatureVoiceRecognition    = "voice_recognition"
+	FeatureAudioTransform      = "audio_transform"
 	// FeaturePIIFilter gates the synchronous PII analyze/redact service
 	// (POST /api/pii/{analyze,redact}). Default ON like the other API
 	// features; the admin-only events log is gated separately in-handler.
@@ -71,7 +72,7 @@ var GeneralFeatures = []string{FeatureFineTuning, FeatureQuantization}
 // APIFeatures lists API endpoint features (default ON).
 var APIFeatures = []string{
 	FeatureChat, FeatureImages, FeatureAudioSpeech, FeatureAudioTranscription,
-	FeatureAudioDiarization,
+	FeatureAudioDiarization, FeatureAudioClassification,
 	FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound,
 	FeatureRealtime, FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores,
 	FeatureFaceRecognition, FeatureVoiceRecognition, FeatureAudioTransform,
--- a/core/http/endpoints/localai/api_instructions.go
+++ b/core/http/endpoints/localai/api_instructions.go
@@ -32,9 +32,9 @@ var instructionDefs = []instructionDef{
 	},
 	{
 		Name:        "audio",
-		Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, and sound generation",
+		Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, sound classification, and sound generation",
 		Tags:        []string{"audio"},
-		Intro:       "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format).",
+		Intro:       "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format). Sound classification (/v1/audio/classification) returns scored AudioSet sound-event tags (audio tagging via the ced backend); top_k and threshold control the returned set.",
 	},
 	{
 		Name:        "images",
--- a/core/http/endpoints/openai/realtime.go
+++ b/core/http/endpoints/openai/realtime.go
@@ -93,16 +93,31 @@ type Session struct {
 	Voice                   string
 	TurnDetection           *types.TurnDetectionUnion // "server_vad", "semantic_vad" or "none"
 	InputAudioTranscription *types.AudioTranscription
-	Tools                   []types.ToolUnion
-	ToolChoice              *types.ToolChoiceUnion
-	Conversations           map[string]*Conversation
-	InputAudioBuffer        []byte
-	AudioBufferLock         sync.Mutex
-	OpusFrames              [][]byte
-	OpusFramesLock          sync.Mutex
-	Instructions            string
-	DefaultConversationID   string
-	ModelInterface          Model
+
+	// SoundDetectionEnabled is set when pipeline.sound_detection names a
+	// sound-event-classification model. When true, each committed utterance is
+	// also run through ModelInterface.SoundDetection and the scored tags are
+	// emitted as a conversation.item.sound_detection event. SoundDetectionTopK
+	// and SoundDetectionThreshold are the knobs passed to that call (defaults:
+	// top_k=5, threshold=0).
+	SoundDetectionEnabled   bool
+	SoundDetectionTopK      int
+	SoundDetectionThreshold float32
+	// SoundDetectionWindowMs / SoundDetectionHopMs, when both > 0, enable
+	// server-side windowing for a sound-only session: the server classifies the
+	// last WindowMs of streamed audio every HopMs (no client commits needed).
+	SoundDetectionWindowMs int
+	SoundDetectionHopMs    int
+	Tools                  []types.ToolUnion
+	ToolChoice             *types.ToolChoiceUnion
+	Conversations          map[string]*Conversation
+	InputAudioBuffer       []byte
+	AudioBufferLock        sync.Mutex
+	OpusFrames             [][]byte
+	OpusFramesLock         sync.Mutex
+	Instructions           string
+	DefaultConversationID  string
+	ModelInterface         Model
 	// The pipeline model config or the config for an any-to-any model
 	ModelConfig      *config.ModelConfig
 	InputSampleRate  int
@@ -250,6 +265,10 @@ type Model interface {
 	// TranscribeStream transcribes audio incrementally, invoking onDelta for each
 	// transcript text fragment and returning the final aggregated result.
 	TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
+	// SoundDetection classifies a committed audio window into scored AudioSet
+	// sound-event tags. topK caps the number of returned tags (0 = backend
+	// default), threshold drops tags below the given score (0 = keep all).
+	SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error)
 	PredictConfig() *config.ModelConfig
 }

@@ -399,7 +418,7 @@ func prepareRealtimeConfig(cfg *config.ModelConfig) (errCode, errMsg string, ok
 		return "", "", true
 	}

-	if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" {
+	if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" && cfg.Pipeline.SoundDetection == "" {
 		return "invalid_model", "Model is not a pipeline model", false
 	}
 	return "", "", true
@@ -469,6 +488,26 @@ func runRealtimeSession(application *application.Application, t Transport, model

 	sttModel := cfg.Pipeline.Transcription

+	// A sound-detection-only pipeline (sound_detection set, no transcription/LLM)
+	// activates on sounds, not speech, so it runs WITHOUT the voice VAD: the
+	// session defaults to turn_detection none and has no transcription stage.
+	// Windows come from client commits or from server-side window/hop ticks.
+	soundOnly := cfg.Pipeline.SoundDetection != "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.LLM == ""
+
+	turnDetection := &types.TurnDetectionUnion{
+		ServerVad: &types.ServerVad{
+			Threshold:         0.5,
+			PrefixPaddingMs:   300,
+			SilenceDurationMs: 500,
+			CreateResponse:    true,
+		},
+	}
+	inputAudioTranscription := &types.AudioTranscription{Model: sttModel}
+	if soundOnly {
+		turnDetection = nil           // turn_detection none: no VAD
+		inputAudioTranscription = nil // no transcription stage
+	}
+
 	// Compose the system prompt: prepend the assistant prompt when we have
 	// one (it teaches the model the safety rules and tool recipes), then the
 	// session's default voice instructions. Order matches chat.go's
@@ -480,30 +519,26 @@ func runRealtimeSession(application *application.Application, t Transport, model

 	sessionID := generateSessionID()
 	session := &Session{
-		ID:                sessionID,
-		TranscriptionOnly: false,
-		Model:             model,
-		Voice:             cfg.TTSConfig.Voice,
-		Instructions:      instructions,
-		ModelConfig:       cfg,
-		Tools:             assistantTools,
-		AssistantTools:    assistantTools,
-		AssistantExecutor: assistantExecutor,
-		TurnDetection: &types.TurnDetectionUnion{
-			ServerVad: &types.ServerVad{
-				Threshold:         0.5,
-				PrefixPaddingMs:   300,
-				SilenceDurationMs: 500,
-				CreateResponse:    true,
-			},
-		},
-		InputAudioTranscription: &types.AudioTranscription{
-			Model: sttModel,
-		},
-		Conversations:    make(map[string]*Conversation),
-		InputSampleRate:  defaultRemoteSampleRate,
-		OutputSampleRate: defaultRemoteSampleRate,
-		MaxHistoryItems:  resolveMaxHistoryItems(cfg),
+		ID:                      sessionID,
+		TranscriptionOnly:       false,
+		Model:                   model,
+		Voice:                   cfg.TTSConfig.Voice,
+		Instructions:            instructions,
+		ModelConfig:             cfg,
+		Tools:                   assistantTools,
+		AssistantTools:          assistantTools,
+		AssistantExecutor:       assistantExecutor,
+		TurnDetection:           turnDetection,
+		InputAudioTranscription: inputAudioTranscription,
+		Conversations:           make(map[string]*Conversation),
+		InputSampleRate:         defaultRemoteSampleRate,
+		OutputSampleRate:        defaultRemoteSampleRate,
+		MaxHistoryItems:         resolveMaxHistoryItems(cfg),
+		SoundDetectionEnabled:   cfg.Pipeline.SoundDetection != "",
+		SoundDetectionTopK:      defaultSoundDetectionTopK,
+		SoundDetectionThreshold: 0,
+		SoundDetectionWindowMs:  cfg.Pipeline.SoundDetectionWindowMs,
+		SoundDetectionHopMs:     cfg.Pipeline.SoundDetectionHopMs,
 	}

 	// Create a default conversation
@@ -517,14 +552,24 @@ func runRealtimeSession(application *application.Application, t Transport, model
 	session.Conversations[conversationID] = conversation
 	session.DefaultConversationID = conversationID

-	m, err := newModel(
-		&cfg.Pipeline,
-		application.ModelConfigLoader(),
-		application.ModelLoader(),
-		application.ApplicationConfig(),
-		evaluator,
-		buildRealtimeRoutingContext(application, sessionID),
-	)
+	var m Model
+	if soundOnly {
+		m, err = newSoundDetectionOnlyModel(
+			&cfg.Pipeline,
+			application.ModelConfigLoader(),
+			application.ModelLoader(),
+			application.ApplicationConfig(),
+		)
+	} else {
+		m, err = newModel(
+			&cfg.Pipeline,
+			application.ModelConfigLoader(),
+			application.ModelLoader(),
+			application.ApplicationConfig(),
+			evaluator,
+			buildRealtimeRoutingContext(application, sessionID),
+		)
+	}
 	if err != nil {
 		xlog.Error("failed to load model", "error", err)
 		sendError(t, "model_load_error", "Failed to load model", "", "")
@@ -605,6 +650,20 @@ func runRealtimeSession(application *application.Application, t Transport, model

 	toggleVAD()

+	// Server-side sound-detection windowing: for a sound-only session with both
+	// window and hop configured, the server classifies the last window of
+	// streamed audio on a timer, so the client only has to stream (no commits).
+	// This runs independently of VAD (sound events are not speech).
+	var soundWindowDone chan struct{}
+	if soundOnly && session.SoundDetectionWindowMs > 0 && session.SoundDetectionHopMs > 0 {
+		soundWindowDone = make(chan struct{})
+		wg.Go(func() {
+			handleSoundWindow(session, t, soundWindowDone)
+		})
+		xlog.Debug("Starting server-side sound-detection windowing",
+			"window_ms", session.SoundDetectionWindowMs, "hop_ms", session.SoundDetectionHopMs)
+	}
+
 	for {
 		msg, err = t.ReadEvent()
 		if err != nil {
@@ -880,6 +939,10 @@ func runRealtimeSession(application *application.Application, t Transport, model
 	if vadServerStarted {
 		close(done)
 	}
+	// Stop the server-side sound-detection windowing goroutine (if running).
+	if soundWindowDone != nil {
+		close(soundWindowDone)
+	}
 	wg.Wait()

 	// Remove the session from the sessions map
@@ -971,6 +1034,10 @@ func updateTransSession(session *Session, update *types.SessionUnion, cl *config

 		session.ModelInterface = m
 		session.ModelConfig = cfg
+		session.SoundDetectionEnabled = cfg.Pipeline.SoundDetection != ""
+		if session.SoundDetectionTopK <= 0 {
+			session.SoundDetectionTopK = defaultSoundDetectionTopK
+		}
 	}

 	if trUpd != nil {
@@ -1343,7 +1410,8 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co

 	// TODO: If we have a real any-to-any model then transcription is optional
 	var transcript string
-	if session.InputAudioTranscription != nil {
+	switch {
+	case session.InputAudioTranscription != nil:
 		// emitTranscription streams transcript deltas when
 		// pipeline.streaming.transcription is set, otherwise emits a single
 		// completed event; either way it returns the final transcript text.
@@ -1358,13 +1426,27 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
 			sendError(t, "transcription_failed", err.Error(), "", "event_TODO")
 			return
 		}
-	} else {
+	case session.SoundDetectionEnabled:
+		// Sound-detection-only session: no transcription and no LLM. The
+		// sound-detection emit below carries the result; there is no any-to-any
+		// path to fall into. Windowing is client-driven (turn_detection none +
+		// input_audio_buffer.commit), so this is not voice-gated.
+	default:
 		// The voice gate runs only on the transcription path above; if an
 		// any-to-any model path is added here, join the gate before responding.
 		sendNotImplemented(t, "any-to-any models")
 		return
 	}

+	// Sound-event detection is additive to transcription: classify the same
+	// committed window and emit its scored AudioSet tags as a separate event.
+	// A failure here is logged but must never abort the turn.
+	if session.SoundDetectionEnabled {
+		if sderr := emitSoundDetection(ctx, t, session, generateItemID(), f.Name()); sderr != nil {
+			xlog.Error("sound detection failed", "error", sderr)
+		}
+	}
+
 	// Join on the resolution before any side-effecting step.
 	var speaker *types.Speaker
 	if runResolve {
@@ -1415,11 +1497,94 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
 		}
 	}

-	if !session.TranscriptionOnly {
+	// Generate an LLM response only when there is a transcript to feed it. A
+	// sound-detection-only session (no transcription) has no LLM stage, so it
+	// stops here after emitting the sound-detection event.
+	if session.InputAudioTranscription != nil && !session.TranscriptionOnly {
 		generateResponse(ctx, session, utt, transcript, speaker, conv, t)
 	}
 }

+// handleSoundWindow runs server-side windowed sound-event detection: every
+// HopMs it classifies the last WindowMs of streamed audio and emits a
+// conversation.item.sound_detection event, so a sound-only client only has to
+// stream audio (no input_audio_buffer.commit). Each tick trims the input buffer
+// to one window so a long stream stays bounded. Runs until done is closed,
+// independently of VAD. HopMs must be > 0: time.NewTicker panics otherwise.
+func handleSoundWindow(session *Session, t Transport, done chan struct{}) {
+	ticker := time.NewTicker(time.Duration(session.SoundDetectionHopMs) * time.Millisecond)
+	defer ticker.Stop()
+
+	for {
+		select {
+		case <-done:
+			return
+		case <-ticker.C:
+			classifySoundWindow(session, t)
+		}
+	}
+}
+
+// classifySoundWindow is one windowing tick: it snapshots the most recent
+// WindowMs of buffered audio (trimming the buffer so a long stream stays
+// bounded) and, when there is enough, classifies it and emits a sound_detection
+// event. Extracted from handleSoundWindow so it can be driven synchronously in
+// tests.
+func classifySoundWindow(session *Session, t Transport) {
+	const bytesPerSample = 2 // 16-bit mono PCM
+	sr := session.InputSampleRate
+	windowBytes := session.SoundDetectionWindowMs * sr / 1000 * bytesPerSample
+	minBytes := sr / 100 * bytesPerSample // require at least ~10ms of audio before classifying
+
+	session.AudioBufferLock.Lock()
+	// Keep only the most recent window so a long stream stays bounded.
+	if windowBytes > 0 && len(session.InputAudioBuffer) > windowBytes {
+		trimmed := make([]byte, windowBytes)
+		copy(trimmed, session.InputAudioBuffer[len(session.InputAudioBuffer)-windowBytes:])
+		session.InputAudioBuffer = trimmed
+	}
+	window := make([]byte, len(session.InputAudioBuffer))
+	copy(window, session.InputAudioBuffer)
+	session.AudioBufferLock.Unlock()
+
+	if len(window) < minBytes {
+		return // not enough audio buffered yet
+	}
+	path, err := writeWindowWAV(window, sr)
+	if err != nil {
+		xlog.Error("sound window: failed to write wav", "error", err)
+		return
+	}
+	if sderr := emitSoundDetection(context.Background(), t, session, generateItemID(), path); sderr != nil {
+		xlog.Error("sound window: detection failed", "error", sderr)
+	}
+	if rerr := os.Remove(path); rerr != nil {
+		xlog.Debug("sound window: temp cleanup failed", "error", rerr)
+	}
+}
+
+// writeWindowWAV writes mono 16-bit PCM to a temp WAV at the given sample rate
+// (the ced classifier reads the declared rate and resamples). Returns the path;
+// the caller removes it.
+func writeWindowWAV(pcm []byte, sampleRate int) (string, error) {
+	f, err := os.CreateTemp("", "realtime-sound-window-*.wav")
+	if err != nil {
+		return "", err
+	}
+	defer func() { _ = f.Close() }()
+	hdr := laudio.NewWAVHeaderWithRate(uint32(len(pcm)), uint32(sampleRate))
+	if err := hdr.Write(f); err != nil {
+		_ = os.Remove(f.Name())
+		return "", err
+	}
+	if _, err := f.Write(pcm); err != nil {
+		_ = os.Remove(f.Name())
+		return "", err
+	}
+	_ = f.Sync()
+	return f.Name(), nil
+}
+
 func runVAD(ctx context.Context, session *Session, adata []int16) ([]schema.VADSegment, error) {
 	soundIntBuffer := &audio.IntBuffer{
 		Format:         &audio.Format{SampleRate: localSampleRate, NumChannels: 1},
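The window arithmetic in classifySoundWindow can be checked in isolation. The following is a minimal sketch, not the code from this change: windowBytes and lastWindow are hypothetical helper names introduced here for illustration, and the sketch assumes 16-bit mono PCM as the function above does.

```go
package main

import "fmt"

// bytesPerSample matches the 16-bit mono PCM assumption in classifySoundWindow.
const bytesPerSample = 2

// windowBytes converts a window length in milliseconds into a byte count.
// Dividing before multiplying by bytesPerSample keeps the result aligned to
// whole samples. (Hypothetical helper, for illustration only.)
func windowBytes(windowMs, sampleRate int) int {
	return windowMs * sampleRate / 1000 * bytesPerSample
}

// lastWindow returns a copy of at most the trailing n bytes of buf, which is
// the "keep only the most recent window" step. (Hypothetical helper.)
func lastWindow(buf []byte, n int) []byte {
	if n > 0 && len(buf) > n {
		buf = buf[len(buf)-n:]
	}
	out := make([]byte, len(buf))
	copy(out, buf)
	return out
}

func main() {
	n := windowBytes(200, 16000)
	fmt.Println(n)                                       // 6400
	fmt.Println(len(lastWindow(make([]byte, 10000), n))) // 6400
	fmt.Println(len(lastWindow(make([]byte, 100), n)))   // 100
}
```

The 6400-byte figure is the same one the windowing tests rely on for a 200 ms window at 16 kHz.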
--- a/core/http/endpoints/openai/realtime_doubles_test.go
+++ b/core/http/endpoints/openai/realtime_doubles_test.go
@@ -75,6 +75,11 @@ type fakeModel struct {
 	transcribeDeltas []string
 	transcribeFinal  *schema.TranscriptionResult

+	// soundDetectionResult/soundDetectionErr drive the SoundDetection double so
+	// the sound-event path can be exercised deterministically.
+	soundDetectionResult *schema.SoundClassificationResult
+	soundDetectionErr    error
+
 	// Predict streaming: predictTokens are replayed through the token callback
 	// (simulating streamed LLM output); predictResp/predictErr are returned by
 	// the deferred predict function. predictChunkDeltas, when set, are delivered
@@ -95,6 +100,13 @@ func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, stri
 	return m.transcribeFinal, nil
 }

+func (m *fakeModel) SoundDetection(context.Context, string, int, float32) (*schema.SoundClassificationResult, error) {
+	if m.soundDetectionErr != nil {
+		return nil, m.soundDetectionErr
+	}
+	return m.soundDetectionResult, nil
+}
+
 func (m *fakeModel) Predict(_ context.Context, msgs schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) {
 	m.lastMessages = msgs
 	if m.predictErr != nil {
--- a/core/http/endpoints/openai/realtime_model.go
+++ b/core/http/endpoints/openai/realtime_model.go
@@ -31,10 +31,11 @@ var (
 // This means that we will fake an Any-to-Any model by overriding some of the gRPC client methods
 // which are for Any-To-Any models, but instead we will call a pipeline (for e.g STT->LLM->TTS)
 type wrappedModel struct {
-	TTSConfig           *config.ModelConfig
-	TranscriptionConfig *config.ModelConfig
-	LLMConfig           *config.ModelConfig
-	VADConfig           *config.ModelConfig
+	TTSConfig            *config.ModelConfig
+	TranscriptionConfig  *config.ModelConfig
+	LLMConfig            *config.ModelConfig
+	VADConfig            *config.ModelConfig
+	SoundDetectionConfig *config.ModelConfig

 	appConfig   *config.ApplicationConfig
 	modelLoader *model.ModelLoader
@@ -64,8 +65,9 @@ type anyToAnyModel struct {
 }

 type transcriptOnlyModel struct {
-	TranscriptionConfig *config.ModelConfig
-	VADConfig           *config.ModelConfig
+	TranscriptionConfig  *config.ModelConfig
+	VADConfig            *config.ModelConfig
+	SoundDetectionConfig *config.ModelConfig

 	appConfig   *config.ApplicationConfig
 	modelLoader *model.ModelLoader
@@ -80,6 +82,10 @@ func (m *transcriptOnlyModel) Transcribe(ctx context.Context, audio, language st
 	return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
 }

+func (m *transcriptOnlyModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
+	return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold)
+}
+
 func (m *transcriptOnlyModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
 	return nil, fmt.Errorf("predict operation not supported in transcript-only mode")
 }
@@ -108,6 +114,10 @@ func (m *wrappedModel) Transcribe(ctx context.Context, audio, language string, t
 	return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
 }

+func (m *wrappedModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
+	return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold)
+}
+
 func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
 	input := schema.OpenAIRequest{
 		Messages: messages,
@@ -399,6 +409,39 @@ func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionC
 	return final, nil
 }

+// modelSoundDetection runs sound-event classification against the session's
+// sound-classification model config, mirroring how Transcribe dispatches to
+// the transcription backend. Returns an error when no sound-detection model is
+// configured for the session.
+func modelSoundDetection(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, soundConfig *config.ModelConfig, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
+	if soundConfig == nil {
+		return nil, fmt.Errorf("sound detection is not configured for this session")
+	}
+	return backend.ModelSoundDetection(ctx, backend.SoundDetectionRequest{
+		Audio:     audio,
+		TopK:      int32(topK),
+		Threshold: threshold,
+	}, ml, *soundConfig, appConfig)
+}
+
+// loadSoundDetectionConfig resolves the optional sound-classification model
+// config named by pipeline.sound_detection. Returns (nil, nil) when no model
+// is configured so sound detection stays additive and never blocks session
+// setup.
+func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader) (*config.ModelConfig, error) {
+	if pipeline.SoundDetection == "" {
+		return nil, nil
+	}
+	cfg, err := cl.LoadModelConfigFileByName(pipeline.SoundDetection, ml.ModelPath)
+	if err != nil {
+		return nil, fmt.Errorf("failed to load sound detection config: %w", err)
+	}
+	if valid, _ := cfg.Validate(); !valid {
+		return nil, fmt.Errorf("failed to validate sound detection config %q", pipeline.SoundDetection)
+	}
+	return cfg, nil
+}
+
 func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
 	cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
 	if err != nil {
@@ -420,9 +463,15 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
 		return nil, nil, fmt.Errorf("failed to validate config: %w", err)
 	}

+	cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
+	if err != nil {
+		return nil, nil, err
+	}
+
 	return &transcriptOnlyModel{
-		TranscriptionConfig: cfgSST,
-		VADConfig:           cfgVAD,
+		TranscriptionConfig:  cfgSST,
+		VADConfig:            cfgVAD,
+		SoundDetectionConfig: cfgSound,

 		confLoader:  cl,
 		modelLoader: ml,
@@ -430,6 +479,27 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
 	}, cfgSST, nil
 }

+// newSoundDetectionOnlyModel builds a realtime model that only does sound-event
+// classification: no VAD, transcription, LLM or TTS stages are loaded. It reuses
+// transcriptOnlyModel with nil TranscriptionConfig and VADConfig, so callers
+// must not invoke Transcribe on it (that method dereferences the config).
+// Windows come from client commits or server-side window/hop ticks, not VAD.
+func newSoundDetectionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, error) {
+	cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
+	if err != nil {
+		return nil, err
+	}
+	if cfgSound == nil {
+		return nil, fmt.Errorf("a sound-only realtime session requires pipeline.sound_detection")
+	}
+	return &transcriptOnlyModel{
+		SoundDetectionConfig: cfgSound,
+		confLoader:           cl,
+		modelLoader:          ml,
+		appConfig:            appConfig,
+	}, nil
+}
+
 // RealtimeRoutingContext is the bundle of routing dependencies the
 // realtime pipeline needs to consult router.Resolve per turn. nil-safe:
 // passing nil skips routing entirely and preserves the historical "one
@@ -544,11 +614,17 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
 		return nil, fmt.Errorf("failed to validate config: %w", err)
 	}

+	cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
+	if err != nil {
+		return nil, err
+	}
+
 	wm := &wrappedModel{
-		TTSConfig:           cfgTTS,
-		TranscriptionConfig: cfgSST,
-		LLMConfig:           cfgLLM,
-		VADConfig:           cfgVAD,
+		TTSConfig:            cfgTTS,
+		TranscriptionConfig:  cfgSST,
+		LLMConfig:            cfgLLM,
+		VADConfig:            cfgVAD,
+		SoundDetectionConfig: cfgSound,

 		confLoader:  cl,
 		modelLoader: ml,
--- a/core/http/endpoints/openai/realtime_sound_detection.go
+++ b/core/http/endpoints/openai/realtime_sound_detection.go
@@ -0,0 +1,48 @@
+package openai
+
+import (
+	"context"
+
+	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
+)
+
+// defaultSoundDetectionTopK is the number of scored tags requested per
+// committed utterance when the session does not pin its own top_k.
+const defaultSoundDetectionTopK = 5
+
+// emitSoundDetection classifies a committed utterance into sound-event tags and
+// emits a conversation.item.sound_detection event for it. It mirrors
+// emitTranscription's unary path: it calls the session's sound-event
+// classifier, maps the scored tags onto the server event, and sends it over
+// the transport. Sound detection is additive to transcription: its result is
+// emitted independently and a failure here is the caller's to log, never a
+// reason to abort the turn.
+func emitSoundDetection(ctx context.Context, t Transport, session *Session, itemID, audioPath string) error {
+	topK := session.SoundDetectionTopK
+	if topK <= 0 {
+		topK = defaultSoundDetectionTopK
+	}
+
+	result, err := session.ModelInterface.SoundDetection(ctx, audioPath, topK, session.SoundDetectionThreshold)
+	if err != nil {
+		return err
+	}
+
+	detections := make([]types.SoundDetectionTag, 0)
+	if result != nil {
+		for _, d := range result.Detections {
+			detections = append(detections, types.SoundDetectionTag{
+				Label: d.Label,
+				Score: d.Score,
+				Index: d.Index,
+			})
+		}
+	}
+
+	return t.SendEvent(types.ConversationItemSoundDetectionEvent{
+		ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
+		ItemID:          itemID,
+		ContentIndex:    0,
+		Detections:      detections,
+	})
+}
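The top_k and threshold knobs passed through emitSoundDetection are applied by the backend, which this diff does not show. As a rough sketch of the documented semantics only (threshold drops low scores, top_k caps the count, a non-positive top_k falls back to a default of 5), with tag and selectTags being hypothetical names that are not part of this change:

```go
package main

import (
	"fmt"
	"sort"
)

// tag is a stand-in for a scored sound-event tag (hypothetical type).
type tag struct {
	Label string
	Score float32
}

// defaultTopK mirrors the fallback value used when top_k is not set.
const defaultTopK = 5

// selectTags keeps tags scoring at least threshold, orders them by score
// descending, and caps the result at topK entries. This is an assumption about
// how the knobs combine, not the backend's actual implementation.
func selectTags(tags []tag, topK int, threshold float32) []tag {
	if topK <= 0 {
		topK = defaultTopK
	}
	kept := make([]tag, 0, len(tags))
	for _, t := range tags {
		if t.Score >= threshold {
			kept = append(kept, t)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	if len(kept) > topK {
		kept = kept[:topK]
	}
	return kept
}

func main() {
	tags := []tag{
		{Label: "Speech", Score: 0.42},
		{Label: "Baby cry, infant cry", Score: 0.91},
		{Label: "Dog", Score: 0.05},
	}
	for _, t := range selectTags(tags, 2, 0.3) {
		fmt.Println(t.Label, t.Score)
	}
}
```

With a threshold of 0 every tag passes the filter, which matches the "0 = keep all" wording in the interface comment.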
--- a/core/http/endpoints/openai/realtime_sound_detection_test.go
+++ b/core/http/endpoints/openai/realtime_sound_detection_test.go
@@ -0,0 +1,170 @@
+package openai
+
+import (
+	"context"
+	"encoding/binary"
+	"errors"
+	"os"
+
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
+	"github.com/mudler/LocalAI/core/schema"
+)
+
+// emitSoundDetection classifies a committed utterance and emits a single
+// conversation.item.sound_detection event carrying the scored AudioSet tags.
+var _ = Describe("emitSoundDetection", func() {
+	It("emits a sound_detection event with the classifier's scored tags", func() {
+		session := &Session{
+			SoundDetectionEnabled: true,
+			SoundDetectionTopK:    5,
+			ModelInterface: &fakeModel{
+				soundDetectionResult: &schema.SoundClassificationResult{
+					Model: "ced",
+					Detections: []schema.SoundClassification{
+						{Index: 3, Label: "Baby cry, infant cry", Score: 0.91},
+						{Index: 7, Label: "Speech", Score: 0.42},
+					},
+				},
+			},
+		}
+		t := &fakeTransport{}
+
+		err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
+
+		Expect(err).ToNot(HaveOccurred())
+		Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
+
+		ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent)
+		Expect(ok).To(BeTrue())
+		Expect(ev.ItemID).To(Equal("item1"))
+		Expect(ev.ContentIndex).To(Equal(0))
+		Expect(ev.Detections).To(HaveLen(2))
+		Expect(ev.Detections[0].Label).To(Equal("Baby cry, infant cry"))
+		Expect(ev.Detections[0].Score).To(BeNumerically("~", 0.91, 1e-6))
+		Expect(ev.Detections[0].Index).To(Equal(3))
+		Expect(ev.Detections[1].Label).To(Equal("Speech"))
+	})
+
+	It("emits an event with no detections when the classifier returns none", func() {
+		session := &Session{
+			SoundDetectionEnabled: true,
+			ModelInterface: &fakeModel{
+				soundDetectionResult: &schema.SoundClassificationResult{Model: "ced"},
+			},
+		}
+		t := &fakeTransport{}
+
+		err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
+
+		Expect(err).ToNot(HaveOccurred())
+		Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
+		ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent)
+		Expect(ok).To(BeTrue())
+		Expect(ev.Detections).To(BeEmpty())
+	})
+
+	It("propagates the classifier error and emits no event", func() {
+		session := &Session{
+			SoundDetectionEnabled: true,
+			ModelInterface:        &fakeModel{soundDetectionErr: errors.New("boom")},
+		}
+		t := &fakeTransport{}
+
+		err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
+
+		Expect(err).To(HaveOccurred())
+		Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0))
+	})
+})
+
+// A sound-detection-only session (no transcription, no LLM) runs through
+// commitUtterance WITHOUT the voice/transcription path: it emits the
+// sound_detection event and stops - no transcription event, no LLM response.
+var _ = Describe("commitUtterance (sound-detection-only session)", func() {
+	It("emits sound detection and neither transcribes nor generates a response", func() {
+		session := &Session{
+			SoundDetectionEnabled:   true,
+			SoundDetectionTopK:      5,
+			InputAudioTranscription: nil, // sound-only: no transcription stage
+			ModelConfig:             &config.ModelConfig{},
+			ModelInterface: &fakeModel{
+				soundDetectionResult: &schema.SoundClassificationResult{
+					Model: "ced",
+					Detections: []schema.SoundClassification{
+						{Index: 23, Label: "Baby cry, infant cry", Score: 0.87},
+					},
+				},
+			},
+		}
+		tr := &fakeTransport{}
+		utt := make([]byte, 32) // non-empty PCM so commitUtterance proceeds
+
+		commitUtterance(context.Background(), utt, session, &Conversation{}, tr)
+
+		Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
+		// No transcription happened.
+		Expect(tr.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(0))
+		// No LLM response was generated (sound-only has no LLM stage).
+		Expect(tr.countEvents(types.ServerEventTypeResponseDone)).To(Equal(0))
+	})
+})
+
+// Server-side windowing: a sound-only session classifies the last WindowMs of
+// streamed audio per tick, with no client commit, and keeps the input buffer
+// trimmed to one window.
+var _ = Describe("classifySoundWindow (server-side windowing)", func() {
+	newSoundSession := func() (*Session, *fakeTransport) {
+		return &Session{
+			SoundDetectionEnabled:  true,
+			SoundDetectionTopK:     5,
+			SoundDetectionWindowMs: 200, // 200ms @ 16kHz mono16 = 6400 bytes
+			SoundDetectionHopMs:    20,
+			InputSampleRate:        16000,
+			ModelInterface: &fakeModel{
+				soundDetectionResult: &schema.SoundClassificationResult{
+					Model:      "ced",
+					Detections: []schema.SoundClassification{{Index: 23, Label: "Baby cry, infant cry", Score: 0.87}},
+				},
+			},
+		}, &fakeTransport{}
+	}
+
+	It("emits a sound_detection event and trims the buffer to one window", func() {
+		session, tr := newSoundSession()
+		session.InputAudioBuffer = make([]byte, 10000) // > 6400-byte window
+
+		classifySoundWindow(session, tr)
+
+		Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
+		// buffer trimmed to exactly one window (200ms @ 16kHz mono 16-bit)
+		Expect(len(session.InputAudioBuffer)).To(Equal(6400))
+	})
+
+	It("does nothing when too little audio is buffered", func() {
+		session, tr := newSoundSession()
+		session.InputAudioBuffer = make([]byte, 100) // < ~10ms (320 bytes)
+
+		classifySoundWindow(session, tr)
+
+		Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0))
+	})
+})
+
+var _ = Describe("writeWindowWAV", func() {
+	It("writes a mono 16-bit WAV header declaring the given sample rate", func() {
+		pcm := make([]byte, 640)
+		path, err := writeWindowWAV(pcm, 24000)
+		Expect(err).ToNot(HaveOccurred())
+		defer func() { _ = os.Remove(path) }()
+
+		data, err := os.ReadFile(path)
+		Expect(err).ToNot(HaveOccurred())
+		Expect(len(data)).To(BeNumerically(">=", 44+len(pcm)))
+		// SampleRate is a little-endian uint32 at byte offset 24 of a WAV header.
+		Expect(binary.LittleEndian.Uint32(data[24:28])).To(Equal(uint32(24000)))
+	})
+})
--- a/core/http/endpoints/openai/sound_classification.go
+++ b/core/http/endpoints/openai/sound_classification.go
@@ -0,0 +1,91 @@
+package openai
+
+import (
+	"io"
+	"net/http"
+	"os"
+	"path"
+	"path/filepath"
+
+	"github.com/labstack/echo/v4"
+	"github.com/mudler/LocalAI/core/backend"
+	"github.com/mudler/LocalAI/core/config"
+	"github.com/mudler/LocalAI/core/http/middleware"
+	"github.com/mudler/LocalAI/core/schema"
+	model "github.com/mudler/LocalAI/pkg/model"
+
+	"github.com/mudler/xlog"
+)
+
+// SoundClassificationEndpoint runs an audio-tagging / sound-event
+// classification model (e.g. ced) over an uploaded clip and returns the
+// scored AudioSet tags in score-descending order. It mirrors the
+// transcription path: multipart audio upload -> temp file -> backend call.
+//
+// @Summary Classify sound events in audio (audio tagging).
+// @Tags audio
+// @accept multipart/form-data
+// @Param model formData string true "model"
+// @Param file formData file true "audio file"
+// @Param top_k formData int false "number of top tags to return (0 = backend default)"
+// @Param threshold formData number false "drop tags scoring below this value"
+// @Success 200 {object} schema.SoundClassificationResult
+// @Router /v1/audio/classification [post]
+func SoundClassificationEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
+	return func(c echo.Context) error {
+		input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.OpenAIRequest)
+		if !ok || input.Model == "" {
+			return echo.ErrBadRequest
+		}
+
+		modelConfig, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
+		if !ok || modelConfig == nil {
+			return echo.ErrBadRequest
+		}
+
+		req := backend.SoundDetectionRequest{
+			TopK:      int32(parseFormInt(c, "top_k", 0)),
+			Threshold: float32(parseFormFloat(c, "threshold", 0)),
+		}
+
+		file, err := c.FormFile("file")
+		if err != nil {
+			return err
+		}
+		f, err := file.Open()
+		if err != nil {
+			return err
+		}
+		defer func() { _ = f.Close() }()
+
+		dir, err := os.MkdirTemp("", "sound-classification")
+		if err != nil {
+			return err
+		}
+		defer func() { _ = os.RemoveAll(dir) }()
+
+		dst := filepath.Join(dir, path.Base(file.Filename))
+		dstFile, err := os.Create(dst) // #nosec G304 -- dst is a server-created temp dir joined with path.Base of the upload name (no traversal)
+		if err != nil {
+			return err
+		}
+		if _, err := io.Copy(dstFile, f); err != nil {
+			xlog.Debug("Audio file copying error", "filename", file.Filename, "dst", dst, "error", err)
+			_ = dstFile.Close()
+			return err
+		}
+		_ = dstFile.Close()
+		req.Audio = dst
+
+		result, err := backend.ModelSoundDetection(c.Request().Context(), req, ml, *modelConfig, appConfig)
+		if err != nil {
+			xlog.Error("Sound classification failed",
+				"model", modelConfig.Name,
+				"audio", dst,
+				"error", err)
+			return err
+		}
+
+		return c.JSON(http.StatusOK, result)
+	}
+}
--- a/core/http/endpoints/openai/types/server_events.go
+++ b/core/http/endpoints/openai/types/server_events.go
@@ -18,6 +18,7 @@ const (
 	ServerEventTypeConversationItemInputAudioTranscriptionDelta     ServerEventType = "conversation.item.input_audio_transcription.delta"
 	ServerEventTypeConversationItemInputAudioTranscriptionSegment   ServerEventType = "conversation.item.input_audio_transcription.segment"
 	ServerEventTypeConversationItemInputAudioTranscriptionFailed    ServerEventType = "conversation.item.input_audio_transcription.failed"
+	ServerEventTypeConversationItemSoundDetection                   ServerEventType = "conversation.item.sound_detection"
 	ServerEventTypeConversationItemTruncated                        ServerEventType = "conversation.item.truncated"
 	ServerEventTypeConversationItemDeleted                          ServerEventType = "conversation.item.deleted"
 	// ServerEventTypeConversationItemSpeaker is a LocalAI extension: it reports
@@ -473,6 +474,55 @@ func (m ConversationItemInputAudioTranscriptionCompletedEvent) MarshalJSON() ([]
 	return json.Marshal(shadow)
 }

+// SoundDetectionTag is one scored sound-event tag from the sound-event
+// classifier. Label is the human-readable AudioSet class name, Score is the
+// per-class probability (multi-label, independent), and Index is the class
+// index in the model ontology.
+type SoundDetectionTag struct {
+	// The human-readable AudioSet class name (e.g. "Baby cry, infant cry").
+	Label string `json:"label"`
+
+	// The per-class probability for this tag.
+	Score float32 `json:"score"`
+
+	// The class index in the model ontology.
+	Index int `json:"index"`
+}
+
+// Returned when a committed input audio window has been classified by a
+// sound-event-detection model. This is a LocalAI extension to the OpenAI
+// Realtime API: when a pipeline configures sound_detection, each VAD-committed
+// utterance is run through the classifier and the scored AudioSet tags are
+// emitted as this event, independent of (and alongside) transcription.
+type ConversationItemSoundDetectionEvent struct {
+	ServerEventBase
+	// The ID of the item.
+	ItemID string `json:"item_id"`
+
+	// The index of the content part in the item's content array.
+	ContentIndex int `json:"content_index"`
+
+	// The scored sound-event tags, in score-descending order.
+	Detections []SoundDetectionTag `json:"detections"`
+}
+
+func (m ConversationItemSoundDetectionEvent) ServerEventType() ServerEventType {
+	return ServerEventTypeConversationItemSoundDetection
+}
+
+func (m ConversationItemSoundDetectionEvent) MarshalJSON() ([]byte, error) {
+	type typeAlias ConversationItemSoundDetectionEvent
+	type typeWrapper struct {
+		typeAlias
+		Type ServerEventType `json:"type"`
+	}
+	shadow := typeWrapper{
+		typeAlias: typeAlias(m),
+		Type:      m.ServerEventType(),
+	}
+	return json.Marshal(shadow)
+}
+
 // Returned when the text value of an input audio transcription content part is updated with incremental transcription results.
 //
 // See https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription/delta
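
Reviewer note: a rough sketch of how a realtime client might pick the new `conversation.item.sound_detection` event out of the websocket stream. The local types only mirror the JSON tags added in this hunk; the fields of `ServerEventBase` are not shown in this diff and are left out, and the sample frame is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// soundDetectionTag mirrors the JSON shape of SoundDetectionTag above.
type soundDetectionTag struct {
	Label string  `json:"label"`
	Score float32 `json:"score"`
	Index int     `json:"index"`
}

// soundDetectionEvent mirrors the fields of ConversationItemSoundDetectionEvent
// that are visible in this diff, plus the "type" discriminator.
type soundDetectionEvent struct {
	Type         string              `json:"type"`
	ItemID       string              `json:"item_id"`
	ContentIndex int                 `json:"content_index"`
	Detections   []soundDetectionTag `json:"detections"`
}

// decodeSoundDetection (hypothetical helper) reports ok=false for any other
// event type, so it can sit in a loop that also receives transcription events.
func decodeSoundDetection(frame []byte) (soundDetectionEvent, bool, error) {
	var ev soundDetectionEvent
	if err := json.Unmarshal(frame, &ev); err != nil {
		return soundDetectionEvent{}, false, err
	}
	if ev.Type != "conversation.item.sound_detection" {
		return soundDetectionEvent{}, false, nil
	}
	return ev, true, nil
}

func main() {
	// Made-up sample frame; values are illustrative only.
	frame := []byte(`{"type":"conversation.item.sound_detection","item_id":"item_1","content_index":0,"detections":[{"label":"Baby cry, infant cry","score":0.87,"index":23}]}`)
	ev, ok, err := decodeSoundDetection(frame)
	if err != nil {
		panic(err)
	}
	if ok {
		fmt.Println(ev.ItemID, ev.Detections[0].Label)
	}
}
```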
--- a/core/http/react-ui/public/locales/en/models.json
+++ b/core/http/react-ui/public/locales/en/models.json
@@ -23,6 +23,7 @@
    "tts": "TTS",
    "stt": "STT",
    "diarization": "Diarization",
+    "soundClassification": "Sound Tagging",
    "soundGen": "Sound",
    "audioTransform": "Audio FX",
    "realtimeAudio": "Realtime Audio",
--- a/core/http/react-ui/src/pages/Models.jsx
+++ b/core/http/react-ui/src/pages/Models.jsx
@@ -31,6 +31,7 @@ const FILTERS = [
  { key: 'tts', labelKey: 'filters.tts', icon: 'fa-microphone' },
  { key: 'transcript', labelKey: 'filters.stt', icon: 'fa-headphones' },
  { key: 'diarization', labelKey: 'filters.diarization', icon: 'fa-users' },
+  { key: 'sound_classification', labelKey: 'filters.soundClassification', icon: 'fa-ear-listen' },
  { key: 'sound_generation', labelKey: 'filters.soundGen', icon: 'fa-music' },
  { key: 'audio_transform', labelKey: 'filters.audioTransform', icon: 'fa-sliders' },
  { key: 'realtime_audio', labelKey: 'filters.realtimeAudio', icon: 'fa-tower-broadcast' },
--- a/core/http/react-ui/src/utils/capabilities.js
+++ b/core/http/react-ui/src/utils/capabilities.js
@@ -15,6 +15,7 @@ export const CAP_SOUND_GENERATION = 'FLAG_SOUND_GENERATION'
 export const CAP_TOKENIZE = 'FLAG_TOKENIZE'
 export const CAP_VAD = 'FLAG_VAD'
 export const CAP_DIARIZATION = 'FLAG_DIARIZATION'
+export const CAP_SOUND_CLASSIFICATION = 'FLAG_SOUND_CLASSIFICATION'
 export const CAP_VIDEO = 'FLAG_VIDEO'
 export const CAP_DETECTION = 'FLAG_DETECTION'
 export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION'
--- a/core/http/routes/localai.go
+++ b/core/http/routes/localai.go
@@ -284,13 +284,14 @@ func RegisterLocalAIRoutes(router *echo.Echo,
 			// Categorized endpoint groups for structured discovery
 			"endpoint_groups": map[string]any{
 				"openai_compatible": map[string]string{
-					"models":           "/v1/models",
-					"chat_completions": "/v1/chat/completions",
-					"completions":      "/v1/completions",
-					"embeddings":       "/v1/embeddings",
-					"transcription":    "/v1/audio/transcriptions",
-					"diarization":      "/v1/audio/diarization",
-					"image_generation": "/v1/images/generations",
+					"models":               "/v1/models",
+					"chat_completions":     "/v1/chat/completions",
+					"completions":          "/v1/completions",
+					"embeddings":           "/v1/embeddings",
+					"transcription":        "/v1/audio/transcriptions",
+					"diarization":          "/v1/audio/diarization",
+					"sound_classification": "/v1/audio/classification",
+					"image_generation":     "/v1/images/generations",
 				},
 				"config_management": map[string]string{
 					"config_metadata": "/api/models/config-metadata",
@@ -342,7 +343,7 @@ func RegisterLocalAIRoutes(router *echo.Echo,
 					"delete": "/stores/delete",
 				},
 				"docs": map[string]string{
-					"swagger": "/swagger/index.html",
+					"swagger":      "/swagger/index.html",
 					"instructions": "/api/instructions",
 				},
 			},
--- a/core/http/routes/openai.go
+++ b/core/http/routes/openai.go
@@ -200,6 +200,23 @@ func RegisterOpenAIRoutes(app *echo.Echo,
 	app.POST("/v1/audio/diarization", diarizationHandler, diarizationMiddleware...)
 	app.POST("/audio/diarization", diarizationHandler, diarizationMiddleware...)

+	soundClassificationHandler := openai.SoundClassificationEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig())
+	soundClassificationMiddleware := []echo.MiddlewareFunc{
+		traceMiddleware,
+		re.BuildFilteredFirstAvailableDefaultModel(config.BuildUsecaseFilterFn(config.FLAG_SOUND_CLASSIFICATION)),
+		re.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.OpenAIRequest) }),
+		func(next echo.HandlerFunc) echo.HandlerFunc {
+			return func(c echo.Context) error {
+				if err := re.SetOpenAIRequest(c); err != nil {
+					return err
+				}
+				return next(c)
+			}
+		},
+	}
+	app.POST("/v1/audio/classification", soundClassificationHandler, soundClassificationMiddleware...)
+	app.POST("/audio/classification", soundClassificationHandler, soundClassificationMiddleware...)
+
 	audioSpeechHandler := localai.TTSEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig())
 	audioSpeechMiddleware := []echo.MiddlewareFunc{
 		nodeHeaderMiddleware,
--- a/core/http/routes/ui_api.go
+++ b/core/http/routes/ui_api.go
@@ -42,21 +42,22 @@ const (
 // usecaseFilters maps UI filter keys to ModelConfigUsecase flags for
 // capability-based gallery filtering.
 var usecaseFilters = map[string]config.ModelConfigUsecase{
-	config.UsecaseChat:            config.FLAG_CHAT,
-	config.UsecaseImage:           config.FLAG_IMAGE,
-	config.UsecaseVideo:           config.FLAG_VIDEO,
-	config.UsecaseVision:          config.FLAG_VISION,
-	config.UsecaseTTS:             config.FLAG_TTS,
-	config.UsecaseTranscript:      config.FLAG_TRANSCRIPT,
-	config.UsecaseSoundGeneration: config.FLAG_SOUND_GENERATION,
-	config.UsecaseEmbeddings:      config.FLAG_EMBEDDINGS,
-	config.UsecaseRerank:          config.FLAG_RERANK,
-	config.UsecaseDetection:       config.FLAG_DETECTION,
-	config.UsecaseVAD:             config.FLAG_VAD,
-	config.UsecaseAudioTransform:  config.FLAG_AUDIO_TRANSFORM,
-	config.UsecaseDiarization:     config.FLAG_DIARIZATION,
-	config.UsecaseRealtimeAudio:   config.FLAG_REALTIME_AUDIO,
-	config.UsecaseTokenClassify:   config.FLAG_TOKEN_CLASSIFY,
+	config.UsecaseChat:                config.FLAG_CHAT,
+	config.UsecaseImage:               config.FLAG_IMAGE,
+	config.UsecaseVideo:               config.FLAG_VIDEO,
+	config.UsecaseVision:              config.FLAG_VISION,
+	config.UsecaseTTS:                 config.FLAG_TTS,
+	config.UsecaseTranscript:          config.FLAG_TRANSCRIPT,
+	config.UsecaseSoundGeneration:     config.FLAG_SOUND_GENERATION,
+	config.UsecaseEmbeddings:          config.FLAG_EMBEDDINGS,
+	config.UsecaseRerank:              config.FLAG_RERANK,
+	config.UsecaseDetection:           config.FLAG_DETECTION,
+	config.UsecaseVAD:                 config.FLAG_VAD,
+	config.UsecaseAudioTransform:      config.FLAG_AUDIO_TRANSFORM,
+	config.UsecaseDiarization:         config.FLAG_DIARIZATION,
+	config.UsecaseSoundClassification: config.FLAG_SOUND_CLASSIFICATION,
+	config.UsecaseRealtimeAudio:       config.FLAG_REALTIME_AUDIO,
+	config.UsecaseTokenClassify:       config.FLAG_TOKEN_CLASSIFY,
 }

 // extractHFRepo tries to find a HuggingFace repo ID from model overrides or URLs.
--- a/core/schema/sound_classification.go
+++ b/core/schema/sound_classification.go
@@ -0,0 +1,19 @@
+package schema
+
+// SoundClassification is one scored sound-event tag. Score is the
+// per-class probability (multi-label, independent), Index is the class
+// index in the model ontology, and Label is the human-readable AudioSet
+// class name (e.g. "Baby cry, infant cry").
+type SoundClassification struct {
+	Index int     `json:"index"`
+	Label string  `json:"label"`
+	Score float32 `json:"score"`
+}
+
+// SoundClassificationResult is the JSON response of the
+// /v1/audio/classification endpoint: the model name and the scored tags
+// in score-descending order.
+type SoundClassificationResult struct {
+	Model      string                `json:"model"`
+	Detections []SoundClassification `json:"detections"`
+}
--- a/core/services/nodes/health_mock_test.go
+++ b/core/services/nodes/health_mock_test.go
@@ -169,6 +169,9 @@ func (c *fakeBackendClient) SoundGeneration(_ context.Context, _ *pb.SoundGenera
 func (c *fakeBackendClient) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) {
 	return nil, nil
 }
+func (c *fakeBackendClient) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) (*pb.SoundDetectionResponse, error) {
+	return nil, nil
+}
 func (c *fakeBackendClient) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) {
 	return nil, nil
 }
--- a/core/services/nodes/inflight_test.go
+++ b/core/services/nodes/inflight_test.go
@@ -99,6 +99,9 @@ func (f *fakeGRPCBackend) SoundGeneration(_ context.Context, _ *pb.SoundGenerati
 func (f *fakeGRPCBackend) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) {
 	return &pb.DetectResponse{}, nil
 }
+func (f *fakeGRPCBackend) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) (*pb.SoundDetectionResponse, error) {
+	return &pb.SoundDetectionResponse{}, nil
+}

 func (f *fakeGRPCBackend) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) {
 	return &pb.DepthResponse{}, nil
--- a/docs/content/features/audio-classification.md
+++ b/docs/content/features/audio-classification.md
@@ -0,0 +1,55 @@
+++
+disableToc = false
+title = "Sound Classification"
+weight = 18
+url = "/features/audio-classification/"
+++
+
+Sound-event classification (audio tagging) answers the question **"what am I hearing?"** - given an audio clip, it returns a list of scored [AudioSet](https://research.google.com/audioset/) labels (e.g. *Baby cry, infant cry*, *Glass breaking*, *Dog bark*, *Alarm*).
+
+LocalAI exposes this through the `/v1/audio/classification` endpoint, modelled after `/v1/audio/transcriptions`. The reference backend is **[ced.cpp](https://github.com/mudler/ced.cpp)** (CED, a 527-class AudioSet tagger), a small ViT over a log-mel spectrogram, ported to ggml with full PyTorch parity. The weights are Apache-2.0 licensed and redistributable as GGUF.
+
+Because classification is exposed as a regular OpenAI-style endpoint, any HTTP client works - there is no Python dependency on the consumer side.
+
+## Endpoint
+
+```
+POST /v1/audio/classification
+Content-Type: multipart/form-data
+```
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `file` | file (required) | audio file in any format `ffmpeg` accepts |
+| `model` | string (required) | name of the sound-classification-capable model (e.g. `ced-base`) |
+| `top_k` | int | number of top tags to return (0 = backend default) |
+| `threshold` | float | drop tags scoring below this value |
+
+### Response
+
+```json
+{
+  "model": "ced-base",
+  "detections": [
+    {"index": 23, "label": "Baby cry, infant cry", "score": 0.87},
+    {"index": 22, "label": "Crying, sobbing", "score": 0.41}
+  ]
+}
+```
+
+Detections are returned in score-descending order. Scores are per-class probabilities (multi-label, independent), so they do not sum to 1.
+
+## Example
+
+```bash
+curl http://localhost:8080/v1/audio/classification \
+  -H "Content-Type: multipart/form-data" \
+  -F file="@/path/to/clip.wav" \
+  -F model="ced-base" \
+  -F top_k=10
+```
+
+## See also
+
+- [Audio to Text]({{% relref "audio-to-text" %}}) - speech transcription
+- [Speaker Diarization]({{% relref "audio-diarization" %}}) - who spoke when
--- a/docs/content/features/audio-diarization.md
+++ b/docs/content/features/audio-diarization.md
@@ -152,3 +152,7 @@ curl http://localhost:8080/v1/audio/diarization \
 - **Speaker identity across files**: speaker IDs (`SPEAKER_00`, `SPEAKER_01`, …) are local to each request. To track the same person across multiple recordings, combine `/v1/audio/diarization` with `/v1/voice/embed` (speaker embedding) and maintain your own embedding store.
 - **Hints vs. forces**: `num_speakers` overrides clustering when set; `min_speakers` / `max_speakers` are advisory and only honored by backends that expose a range hint. vibevoice.cpp ignores them — its model picks the count itself.
 - **Sample rate**: input is automatically converted to 16 kHz mono via ffmpeg before the backend sees it; sherpa-onnx pyannote-3.0 requires 16 kHz.
+
+## See also
+
+- [Sound Classification]({{% relref "audio-classification" %}}) - tag non-speech sound events (alarms, glass breaking, baby cry) in a clip.
--- a/docs/content/features/backends.md
+++ b/docs/content/features/backends.md
@@ -128,6 +128,7 @@ LocalAI supports various types of backends:
 - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
 - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
 - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)
+- **Sound Classification Backends**: For sound-event classification / audio tagging - identifying everyday sounds like baby cry, glass breaking, alarms (e.g., ced.cpp)
 - **Image & Video Generation Backends**: For diffusion models (e.g., stable-diffusion.cpp, diffusers)
 - **Vision & Detection Backends**: For object detection, segmentation, depth, and face/voice recognition (e.g., rf-detr.cpp, locate-anything.cpp, sam3.cpp, insightface)
 - **Audio Processing Backends**: For voice activity detection and audio enhancement (e.g., Silero VAD, LocalVQE)
--- a/docs/content/whats-new.md
+++ b/docs/content/whats-new.md
@@ -15,6 +15,7 @@ You can see the release notes [here](https://github.com/mudler/LocalAI/releases)
 - **April 2026**: [Audio Transform](/features/audio-transform/) — generic audio-in / audio-out endpoint with optional reference signal. First implementation: [LocalVQE](https://github.com/localai-org/LocalVQE) C++ backend (joint AEC + noise suppression + dereverberation, DeepVQE-style). Both batch (`POST /audio/transformations`) and bidirectional WebSocket streaming (`/audio/transformations/stream`). Studio "Transform" tab with synchronized waveform players for input / reference / output.
 - **April 2026**: [Face recognition backend](/features/face-recognition/) — `insightface`-powered 1:1 verification, 1:N identification, face embedding, face detection, and demographic analysis. Ships both a non-commercial `buffalo_l` model and an Apache 2.0 OpenCV Zoo alternative.
 - **May 2026**: [Speaker diarization](/features/audio-diarization/) — new `/v1/audio/diarization` endpoint returning "who spoke when" segments. Backed by `sherpa-onnx` (pyannote-3.0 + speaker embeddings + clustering) for pure diarization, and `vibevoice-cpp` for diarization bundled with long-form ASR. Supports `json` / `verbose_json` / `rttm` response formats.
+- **June 2026**: [Sound classification](/features/audio-classification/) — new `/v1/audio/classification` endpoint for audio tagging / sound-event classification, returning scored [AudioSet](https://research.google.com/audioset/) labels (baby cry, glass breaking, alarms, ...). Backed by [ced.cpp](https://github.com/mudler/ced.cpp), a 527-class AudioSet tagger ported to ggml.
 - **June 2026**: [PII analyze / redact API](/features/middleware/#analyze--redact-api) — the PII detection pipeline (NER + restricted-regex pattern tiers) is now a standalone service: `POST /api/pii/analyze` returns detected entity spans and `POST /api/pii/redact` returns the sanitised text (or `400 pii_blocked`), without routing a chat request through the middleware. Events gain an `origin` (`middleware` / `proxy` / `pii_analyze` / `pii_redact`) so `/api/pii/events` can be filtered by source.
 - **June 2026**: Concurrent scoring and PII NER on llama.cpp — the `Score` (router classifier) and `TokenClassify` (PII NER) primitives now ride llama.cpp's server task queue instead of locking the context, so they run concurrently with chat/completion/embedding traffic and with each other. The `known_usecases` restriction that forced dedicated scorer/NER model configs on llama-cpp is lifted, repeated scoring calls reuse the prompt KV cache across candidates, and scoring inputs are no longer capped by the physical batch size.

--- a/gallery/ced.yaml
+++ b/gallery/ced.yaml
@@ -0,0 +1,7 @@
+---
+name: "ced-sound-classification"
+
+config_file: |
+  backend: ced
+  known_usecases:
+    - sound_classification
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -3077,6 +3077,190 @@
      - transcript
    parameters:
      model: tiny
+- name: ced-base-f16
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-base
+  description: |
+    CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - f16
+  overrides:
+    parameters:
+      model: ced-base-f16.gguf
+  files:
+    - filename: ced-base-f16.gguf
+      sha256: 5c058d9f7b737167195fa54eae4a2ae17658ac2c0a8073f7f116ba006b2ab32c
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-f16.gguf
+- name: ced-base-q8
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-base
+  description: |
+    CED (Consistent Ensemble Distillation, Xiaomi) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). This is the q8_0 GGUF for the ced backend: smallest footprint (~88 MB, ~6.5x less memory than the PyTorch reference) and near-lossless (identical top-5 tags). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - q8
+  overrides:
+    parameters:
+      model: ced-base-q8_0.gguf
+  files:
+    - filename: ced-base-q8_0.gguf
+      sha256: bd34a7710169f0047fea17267965d211f967828ab25ba6fb9d3768481393f6e2
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-q8_0.gguf
+- name: ced-tiny-f16
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-tiny
+  description: |
+    CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - f16
+  overrides:
+    parameters:
+      model: ced-tiny-f16.gguf
+  files:
+    - filename: ced-tiny-f16.gguf
+      sha256: af8b81c67bae50bfca4ea83dbba77b3bae4fa6180d36c17d6877f7700aeeb77b
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-f16.gguf
+- name: ced-tiny-q8
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-tiny
+  description: |
+    CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - q8
+  overrides:
+    parameters:
+      model: ced-tiny-q8_0.gguf
+  files:
+    - filename: ced-tiny-q8_0.gguf
+      sha256: 48bee4e2fc3cc85d7806e03471db24e77fda6c2a2e81ffe9ef67caebaf2bd674
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-q8_0.gguf
+- name: ced-mini-f16
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-mini
+  description: |
+    CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - f16
+  overrides:
+    parameters:
+      model: ced-mini-f16.gguf
+  files:
+    - filename: ced-mini-f16.gguf
+      sha256: 3c6a8936c77312f07a9ecb7b4bbbcb1f93ad137920ca6656bae9306571fb0c03
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-f16.gguf
+- name: ced-mini-q8
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-mini
+  description: |
+    CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - q8
+  overrides:
+    parameters:
+      model: ced-mini-q8_0.gguf
+  files:
+    - filename: ced-mini-q8_0.gguf
+      sha256: 7062cef9ca31459f339ce24a5914f3b65bde76ffd9ca4fc924a040327ff292bd
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-q8_0.gguf
+- name: ced-small-f16
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-small
+  description: |
+    CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - f16
+  overrides:
+    parameters:
+      model: ced-small-f16.gguf
+  files:
+    - filename: ced-small-f16.gguf
+      sha256: c391ed8697a1b08d7c1a463e4940a5c3a2f670e0544ab0d8ee23b544583602a8
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-f16.gguf
+- name: ced-small-q8
+  url: github:mudler/LocalAI/gallery/ced.yaml@master
+  urls:
+    - https://huggingface.co/mudler/ced-gguf
+    - https://huggingface.co/mispeech/ced-small
+  description: |
+    CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
+  license: apache-2.0
+  tags:
+    - audio-classification
+    - sound-event-detection
+    - audio-tagging
+    - audioset
+    - ced
+    - gguf
+    - q8
+  overrides:
+    parameters:
+      model: ced-small-q8_0.gguf
+  files:
+    - filename: ced-small-q8_0.gguf
+      sha256: 888275fe43491cf832fb7b8125eccba34d1120745166f40cc12e93b79dea8efe
+      uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-q8_0.gguf
 - name: omnilingual-0.3b-ctc-q8-sherpa
  url: github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master
  urls:
--- a/pkg/grpc/backend.go
+++ b/pkg/grpc/backend.go
@@ -82,6 +82,8 @@ type Backend interface {

 	Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grpc.CallOption) (*pb.DiarizeResponse, error)

+	SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error)
+
 	AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error)
 	AudioDecode(ctx context.Context, in *pb.AudioDecodeRequest, opts ...grpc.CallOption) (*pb.AudioDecodeResult, error)

--- a/pkg/grpc/base/base.go
+++ b/pkg/grpc/base/base.go
@@ -110,6 +110,10 @@ func (llm *Base) Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error) {
 	return pb.DiarizeResponse{}, fmt.Errorf("unimplemented")
 }

+func (llm *Base) SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
+	return nil, fmt.Errorf("unimplemented")
+}
+
 func (llm *Base) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
 	return pb.TokenizationResponse{}, fmt.Errorf("unimplemented")
 }
--- a/pkg/grpc/client.go
+++ b/pkg/grpc/client.go
@@ -616,6 +616,24 @@ func (c *Client) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grp
 	return client.Diarize(ctx, in, opts...)
 }

+func (c *Client) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) {
+	if !c.parallel {
+		c.opMutex.Lock()
+		defer c.opMutex.Unlock()
+	}
+	c.setBusy(true)
+	defer c.setBusy(false)
+	c.wdMark()
+	defer c.wdUnMark()
+	conn, err := c.dial()
+	if err != nil {
+		return nil, err
+	}
+	defer func() { _ = conn.Close() }()
+	client := pb.NewBackendClient(conn)
+	return client.SoundDetection(ctx, in, opts...)
+}
+
 func (c *Client) Detect(ctx context.Context, in *pb.DetectOptions, opts ...grpc.CallOption) (*pb.DetectResponse, error) {
 	if !c.parallel {
 		c.opMutex.Lock()
--- a/pkg/grpc/embed.go
+++ b/pkg/grpc/embed.go
@@ -153,6 +153,10 @@ func (e *embedBackend) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts
 	return e.s.Diarize(ctx, in)
 }

+func (e *embedBackend) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) {
+	return e.s.SoundDetection(ctx, in)
+}
+
 func (e *embedBackend) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) {
 	return e.s.AudioEncode(ctx, in)
 }
--- a/pkg/grpc/interface.go
+++ b/pkg/grpc/interface.go
@@ -40,6 +40,7 @@ type AIModel interface {

 	VAD(*pb.VADRequest) (pb.VADResponse, error)
 	Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error)
+	SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error)

 	AudioEncode(*pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error)
 	AudioDecode(*pb.AudioDecodeRequest) (*pb.AudioDecodeResult, error)
--- a/pkg/grpc/server.go
+++ b/pkg/grpc/server.go
@@ -435,6 +435,14 @@ func (s *server) Diarize(ctx context.Context, in *pb.DiarizeRequest) (*pb.Diariz
 	return &res, nil
 }

+func (s *server) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
+	if s.llm.Locking() {
+		s.llm.Lock()
+		defer s.llm.Unlock()
+	}
+	return s.llm.SoundDetection(ctx, in)
+}
+
 func (s *server) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error) {
 	if s.llm.Locking() {
 		s.llm.Lock()
--- a/scripts/changed-backends.js
+++ b/scripts/changed-backends.js
@@ -26,6 +26,13 @@ function inferBackendPath(item) {
  if (item.backend === "parakeet-cpp") {
    return `backend/go/parakeet-cpp/`;
  }
+  // ced is a Go backend (Dockerfile.golang) wrapping the ced.cpp ggml port via
+  // purego, living in backend/go/ced/. Same explicit-branch rationale as
+  // parakeet-cpp above: the generic golang fallthrough would also resolve it,
+  // but this documents the mapping and guards a future dockerfile-suffix change.
+  if (item.backend === "ced") {
+    return `backend/go/ced/`;
+  }
  if (item.dockerfile.endsWith("golang")) {
    return `backend/go/${item.backend}/`;
  }
--- a/swagger/docs.go
+++ b/swagger/docs.go
@@ -1939,6 +1939,53 @@ const docTemplate = `{
                }
            }
        },
+        "/v1/audio/classification": {
+            "post": {
+                "consumes": [
+                    "multipart/form-data"
+                ],
+                "tags": [
+                    "audio"
+                ],
+                "summary": "Classify sound events in audio (audio tagging).",
+                "parameters": [
+                    {
+                        "type": "string",
+                        "description": "model",
+                        "name": "model",
+                        "in": "formData",
+                        "required": true
+                    },
+                    {
+                        "type": "file",
+                        "description": "audio file",
+                        "name": "file",
+                        "in": "formData",
+                        "required": true
+                    },
+                    {
+                        "type": "integer",
+                        "description": "number of top tags to return (0 = backend default)",
+                        "name": "top_k",
+                        "in": "formData"
+                    },
+                    {
+                        "type": "number",
+                        "description": "drop tags scoring below this value",
+                        "name": "threshold",
+                        "in": "formData"
+                    }
+                ],
+                "responses": {
+                    "200": {
+                        "description": "OK",
+                        "schema": {
+                            "$ref": "#/definitions/schema.SoundClassificationResult"
+                        }
+                    }
+                }
+            }
+        },
        "/v1/audio/diarization": {
            "post": {
                "consumes": [
@@ -6084,6 +6131,34 @@ const docTemplate = `{
                }
            }
        },
+        "schema.SoundClassification": {
+            "type": "object",
+            "properties": {
+                "index": {
+                    "type": "integer"
+                },
+                "label": {
+                    "type": "string"
+                },
+                "score": {
+                    "type": "number"
+                }
+            }
+        },
+        "schema.SoundClassificationResult": {
+            "type": "object",
+            "properties": {
+                "detections": {
+                    "type": "array",
+                    "items": {
+                        "$ref": "#/definitions/schema.SoundClassification"
+                    }
+                },
+                "model": {
+                    "type": "string"
+                }
+            }
+        },
        "schema.StreamOptions": {
            "type": "object",
            "properties": {
--- a/swagger/swagger.json
+++ b/swagger/swagger.json
@@ -1936,6 +1936,53 @@
                }
            }
        },
+        "/v1/audio/classification": {
+            "post": {
+                "consumes": [
+                    "multipart/form-data"
+                ],
+                "tags": [
+                    "audio"
+                ],
+                "summary": "Classify sound events in audio (audio tagging).",
+                "parameters": [
+                    {
+                        "type": "string",
+                        "description": "model",
+                        "name": "model",
+                        "in": "formData",
+                        "required": true
+                    },
+                    {
+                        "type": "file",
+                        "description": "audio file",
+                        "name": "file",
+                        "in": "formData",
+                        "required": true
+                    },
+                    {
+                        "type": "integer",
+                        "description": "number of top tags to return (0 = backend default)",
+                        "name": "top_k",
+                        "in": "formData"
+                    },
+                    {
+                        "type": "number",
+                        "description": "drop tags scoring below this value",
+                        "name": "threshold",
+                        "in": "formData"
+                    }
+                ],
+                "responses": {
+                    "200": {
+                        "description": "OK",
+                        "schema": {
+                            "$ref": "#/definitions/schema.SoundClassificationResult"
+                        }
+                    }
+                }
+            }
+        },
        "/v1/audio/diarization": {
            "post": {
                "consumes": [
@@ -6081,6 +6128,34 @@
                }
            }
        },
+        "schema.SoundClassification": {
+            "type": "object",
+            "properties": {
+                "index": {
+                    "type": "integer"
+                },
+                "label": {
+                    "type": "string"
+                },
+                "score": {
+                    "type": "number"
+                }
+            }
+        },
+        "schema.SoundClassificationResult": {
+            "type": "object",
+            "properties": {
+                "detections": {
+                    "type": "array",
+                    "items": {
+                        "$ref": "#/definitions/schema.SoundClassification"
+                    }
+                },
+                "model": {
+                    "type": "string"
+                }
+            }
+        },
        "schema.StreamOptions": {
            "type": "object",
            "properties": {
--- a/swagger/swagger.yaml
+++ b/swagger/swagger.yaml
@@ -2087,6 +2087,24 @@ definitions:
          classifier-side confidence signal).
        type: number
    type: object
+  schema.SoundClassification:
+    properties:
+      index:
+        type: integer
+      label:
+        type: string
+      score:
+        type: number
+    type: object
+  schema.SoundClassificationResult:
+    properties:
+      detections:
+        items:
+          $ref: '#/definitions/schema.SoundClassification'
+        type: array
+      model:
+        type: string
+    type: object
  schema.StreamOptions:
    properties:
      include_usage:
@@ -3770,6 +3788,37 @@ paths:
      summary: Generates audio from the input text.
      tags:
      - audio
+  /v1/audio/classification:
+    post:
+      consumes:
+      - multipart/form-data
+      parameters:
+      - description: model
+        in: formData
+        name: model
+        required: true
+        type: string
+      - description: audio file
+        in: formData
+        name: file
+        required: true
+        type: file
+      - description: number of top tags to return (0 = backend default)
+        in: formData
+        name: top_k
+        type: integer
+      - description: drop tags scoring below this value
+        in: formData
+        name: threshold
+        type: number
+      responses:
+        "200":
+          description: OK
+          schema:
+            $ref: '#/definitions/schema.SoundClassificationResult'
+      summary: Classify sound events in audio (audio tagging).
+      tags:
+      - audio
  /v1/audio/diarization:
    post:
      consumes: