feat(ced): sound-event classification backend (CED audio tagger) (#10425)

* feat(ced): sketch sound-classification backend (CED audio tagger)

Wires ced.cpp (CED, 527-class AudioSet sound-event tagger; baby cry,
footsteps, glass, alarms, dog bark) into LocalAI as a Go/purego backend.

SKETCH (backend skeleton real; core REST wiring + CI/gallery is a checklist
in DESIGN.md):
- backend/backend.proto: new SoundDetection rpc + SoundClass messages
  (run `make protogen-go` to regenerate pkg/grpc/proto).
- backend/go/ced: main.go (purego dlopen libced.so + ced_capi.h),
  goced.go (Ced gRPC backend: Load + SoundDetection), Makefile
  (clone-at-pin CED_VERSION, ggml static-PIC shared build), run.sh,
  package.sh, .gitignore.
- DESIGN.md: REST /v1/audio/classification wiring (handler/route/capability
  registration checklist), gallery/index + CI registration, and a scoping
  note for the realtime/websocket live-recognition path (sliding-window
  classify over the existing ws transport + voicegate; the ced C-API
  per-PCM entry point is already window-friendly).

Backend code does not compile until protogen-go regenerates the pb types
and a libced.so is built (Makefile clones+builds it).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): REST /v1/audio/classification endpoint + capability registration

Wires the ced sound-event classification backend (AudioSet audio tagger)
end to end through the REST surface, mirroring the transcription path.

- Handler: core/http/endpoints/openai/sound_classification.go parses the
  multipart audio upload, temp-files it, resolves the model config and
  calls the SoundDetection RPC; returns {model, detections[]} JSON.
- Backend wrapper: core/backend/sound_classification.go (ModelSoundDetection)
  loads the model and normalizes the proto response into schema types.
- Schema: core/schema/sound_classification.go (SoundClassificationResult).
- gRPC layer: SoundDetection wired through the LocalAI wrapper (interface,
  Backend client, Client, embed, server, base default) so the loader-typed
  client exposes the RPC; proto regenerated via make protogen-go.
- Route: POST /v1/audio/classification (+ /audio/classification alias) with
  the audio/multipart default-model middleware in routes/openai.go.
- Capability surfaces: swagger @Tags/@Router on the handler; FLAG_SOUND_
  CLASSIFICATION usecase flag + UsecaseSoundClassification + UsecaseInfoMap +
  GuessUsecases + ModalityGroups + GetAllModelConfigUsecases; meta usecase
  option; /api/instructions audio area updated; auth RouteFeatureRegistry +
  FeatureAudioClassification (APIFeatures, default ON) + FeatureMetas; UI
  usecaseFilters, capabilities.js CAP_SOUND_CLASSIFICATION, Models.jsx filter
  + i18n; docs page features/audio-classification.md + whats-new + crosslink.
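
A minimal client-side sketch of the endpoint described above, assuming the multipart form uses `model` and `file` fields (as the transcription path does) and that each detection serializes as `label`/`score`/`index`. Those field names, the `localhost:8080` base URL, and the helper names are assumptions for illustration, not taken from the handler:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"mime/multipart"
	"net/http"
)

// Detection mirrors one entry of detections[]. The JSON names are assumed
// from the SoundClass proto fields (label, score, index).
type Detection struct {
	Label string  `json:"label"`
	Score float32 `json:"score"`
	Index int     `json:"index"`
}

// ClassificationResult mirrors the {model, detections[]} response body.
type ClassificationResult struct {
	Model      string      `json:"model"`
	Detections []Detection `json:"detections"`
}

// newClassificationRequest builds a multipart POST for /v1/audio/classification.
// The form field names "model" and "file" are assumptions.
func newClassificationRequest(baseURL, model, filename string, audio []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	if err := w.WriteField("model", model); err != nil {
		return nil, err
	}
	part, err := w.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/classification", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

// parseClassification decodes a response body into ClassificationResult.
func parseClassification(data []byte) (ClassificationResult, error) {
	var res ClassificationResult
	err := json.Unmarshal(data, &res)
	return res, err
}

func main() {
	req, err := newClassificationRequest("http://localhost:8080", "ced-base-f16", "clip.wav", []byte("RIFF"))
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending the request with `http.DefaultClient.Do(req)` and passing the body to `parseClassification` would complete the round trip against a running instance.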

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): realtime sound-event detection over the websocket API

When a realtime pipeline configures a sound-classification model, each
VAD-committed utterance (the same window the transcription path produces)
is also run through the CED sound-event classifier and the scored AudioSet
tags are emitted as a new server event. No new backend rpc is needed: the
SoundDetection gRPC method already exists on this branch.

- config: add Pipeline.SoundDetection (yaml/json sound_detection,omitempty)
  beside Transcription/VAD.
- realtime: add Model.SoundDetection(ctx, audio, topK, threshold) to the
  ModelInterface; implement it on wrappedModel and transcriptOnlyModel by
  calling backend.ModelSoundDetection with the session's sound-classification
  model config (mirrors how Transcribe dispatches). Load the optional config
  in newModel / newTranscriptionOnlyModel; nil config keeps it additive.
- types: add ConversationItemSoundDetectionEvent (item_id, content_index,
  detections[]{label,score,index}) with type conversation.item.sound_detection,
  its ServerEventType constant and MarshalJSON, mirroring the transcription
  completed event.
- realtime: add emitSoundDetection (unary path: classify the committed window,
  build the event, t.SendEvent) and wire it at the utterance-commit hook right
  after emitTranscription; gated on session.SoundDetectionEnabled (resolved
  from Pipeline.SoundDetection at session setup, defaults top_k=5, threshold=0).
  Its error is logged via xlog but never aborts the turn.
- test: Ginkgo specs for emitSoundDetection (tags emitted, empty detections,
  classifier error) plus a SoundDetection method on the fakeModel double.
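
A sketch of how a realtime client might decode the new server event, using only the fields listed above (`type`, `item_id`, `content_index`, `detections[]{label,score,index}`). The Go type and function names are hypothetical, and any further envelope fields the server sends are ignored here:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event type string as named in the commit message.
const soundDetectionEventType = "conversation.item.sound_detection"

// soundTag is one scored AudioSet tag.
type soundTag struct {
	Label string  `json:"label"`
	Score float32 `json:"score"`
	Index int     `json:"index"`
}

// soundDetectionEvent holds the fields the commit message lists for the event.
type soundDetectionEvent struct {
	Type         string     `json:"type"`
	ItemID       string     `json:"item_id"`
	ContentIndex int        `json:"content_index"`
	Detections   []soundTag `json:"detections"`
}

// decodeSoundDetection parses a raw websocket message. The bool result is
// false when the message is valid JSON but a different event type.
func decodeSoundDetection(raw []byte) (*soundDetectionEvent, bool, error) {
	var ev soundDetectionEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return nil, false, err
	}
	if ev.Type != soundDetectionEventType {
		return nil, false, nil
	}
	return &ev, true, nil
}

func main() {
	raw := []byte(`{"type":"conversation.item.sound_detection","item_id":"item_1","content_index":0,"detections":[{"label":"Dog","score":0.9,"index":74}]}`)
	ev, ok, err := decodeSoundDetection(raw)
	if err != nil {
		panic(err)
	}
	if ok {
		for _, d := range ev.Detections {
			fmt.Println(ev.ItemID, d.Label, d.Score)
		}
	}
}
```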

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ced): implement SoundDetection in nodes backend test doubles

The SoundDetection method added to the grpc backend interface left two
test doubles (fakeBackendClient, fakeGRPCBackend) incomplete, so
core/services/nodes failed to compile under `go vet`/`go test` (go build
missed it: the doubles live in _test.go). Add the method to both,
mirroring their existing Detect mock. Repairs CI for the nodes package.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): decouple realtime sound detection from VAD (sound-only sessions)

Sound-event detection must activate on sounds, not speech, so it no longer
runs through the voice VAD/transcription path. A sound-detection-only
pipeline (sound_detection set, no transcription/LLM) now:

- is accepted by prepareRealtimeConfig (sound_detection counts as a pipeline
  stage),
- builds a lightweight model via newSoundDetectionOnlyModel (no VAD/STT/LLM/TTS
  loaded), and
- defaults the session to turn_detection none (no VAD) with no transcription
  stage, so the client drives windowing via input_audio_buffer.commit
  (option A: client-side sliding window). The per-PCM C-API already supports
  arbitrary windows.

commitUtterance gains a sound-only branch: it emits the
conversation.item.sound_detection event (scored AudioSet tags) and stops -
no transcription, no LLM response. generateResponse is now guarded on a
transcription stage being present, so a sound-only turn never invokes the LLM.

Existing transcription/VAD sessions are unchanged (additive). Added a
commitUtterance sound-only Ginkgo spec asserting it emits the sound event and
neither transcribes nor generates a response. go vet + golangci-lint
(new-from-merge-base) clean; openai suite green.
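
The client-side sliding window (option A) can be sketched as follows. This is only the slicing step, assuming 16-bit PCM samples and a 16 kHz rate chosen for the example; a real client would send each window and then issue `input_audio_buffer.commit`, which is omitted here. The function name is hypothetical:

```go
package main

import "fmt"

// slidingWindows splits pcm into overlapping windows of `window` samples,
// advancing by `hop` samples. A trailing remainder shorter than a full
// window is dropped. The returned slices alias pcm (no copy).
func slidingWindows(pcm []int16, window, hop int) [][]int16 {
	if window <= 0 || hop <= 0 || len(pcm) < window {
		return nil
	}
	var out [][]int16
	for start := 0; start+window <= len(pcm); start += hop {
		out = append(out, pcm[start:start+window])
	}
	return out
}

func main() {
	const sampleRate = 16000 // assumed for the example
	pcm := make([]int16, 3*sampleRate)
	// 1 s windows with a 0.5 s hop over 3 s of audio.
	wins := slidingWindows(pcm, sampleRate, sampleRate/2)
	fmt.Println("windows to commit:", len(wins))
}
```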

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): register sound-classification backend in gallery + CI

Mechanical backend-image registration for the ced sound-event classifier,
mirroring the parakeet-cpp Go/purego backend everywhere it is wired up.

- .github/backend-matrix.yml: add the ced build matrix, field-for-field copies
  of the parakeet-cpp entries (cpu amd64/arm64, cublas cuda 12/13 amd64,
  l4t cuda-13 arm64, l4t-jetpack cuda-12 arm64, sycl f32/f16, vulkan
  amd64/arm64, rocm hipblas, and the metal darwin entry), changing only
  backend and tag-suffix. dockerfile stays ./backend/Dockerfile.golang.
- backend/index.yaml: add the &ced meta anchor (capabilities map per platform)
  plus ced-development and the per-arch image entries, each uri/mirror
  tag-suffix matching the matrix exactly. The model gallery (GGUF) entry is
  intentionally deferred pending the HuggingFace publish (TODO note inline).
- scripts/changed-backends.js: add an explicit item.backend === "ced" branch in
  inferBackendPath mapping to backend/go/ced/, same mechanism and ordering as
  the parakeet-cpp branch (before the generic golang fallthrough).
- .github/workflows/bump_deps.yaml: register mudler/ced.cpp -> CED_VERSION in
  backend/go/ced/Makefile so the daily bot bumps the pin.
- swagger/{docs.go,swagger.json,swagger.yaml}: regenerated via make swagger so
  the existing /v1/audio/classification annotations land in the generated spec.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): server-side windowing for realtime sound detection (option B)

Adds an optional server-driven sliding-window classifier so a sound-only
realtime client only has to stream audio (no input_audio_buffer.commit):

- Pipeline.sound_detection_window_ms / sound_detection_hop_ms config knobs.
  When both > 0 on a sound-only session, the server classifies the last
  window of streamed audio every hop and emits a conversation.item.sound_
  detection event; the input buffer is trimmed to one window so a long
  stream stays bounded. When unset, the session stays client-driven
  (option A). Runs independent of VAD (sound events are not speech).
- handleSoundWindow (ticker) + classifySoundWindow (one tick, extracted so
  it is unit-testable) + writeWindowWAV, which declares the true
  InputSampleRate (NewWAVHeaderWithRate) so the classifier resamples
  correctly. Goroutine is started after toggleVAD and torn down with the
  session (close + wg.Wait).
- Register pipeline.sound_detection (+window_ms/hop_ms) in the config meta
  registry; the earlier realtime commit added pipeline.sound_detection
  without a registry entry, failing TestAllFieldsHaveRegistryEntries. This
  fixes that and covers the two new knobs.

Tests: classifySoundWindow emits an event + trims the buffer to one window,
no-ops on too-little audio; writeWindowWAV declares the given sample rate.
go build/vet + golangci-lint (new-from-merge-base) clean; config + openai
suites green.
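
The per-tick behaviour described above (classify the last window, trim the buffer to one window, no-op on too little audio) can be sketched as below. This is an illustration under assumptions, not the `classifySoundWindow` implementation: the helper names and the `int16` sample type are hypothetical, and the classifier call itself is left out:

```go
package main

import "fmt"

// msToSamples converts a duration in milliseconds to a sample count.
func msToSamples(ms, sampleRate int) int {
	return ms * sampleRate / 1000
}

// lastWindow returns the most recent `window` samples and a trimmed copy of
// the buffer holding exactly one window, so a long stream stays bounded.
// ok is false (and buf is returned unchanged) when there is too little audio.
func lastWindow(buf []int16, window int) (win, trimmed []int16, ok bool) {
	if window <= 0 || len(buf) < window {
		return nil, buf, false
	}
	trimmed = make([]int16, window)
	copy(trimmed, buf[len(buf)-window:])
	// Cap the view so appends to trimmed can never be seen through win.
	win = trimmed[:window:window]
	return win, trimmed, true
}

func main() {
	const sampleRate = 16000 // assumed for the example
	window := msToSamples(1000, sampleRate)
	buf := make([]int16, 2*sampleRate+123)
	win, buf, ok := lastWindow(buf, window)
	fmt.Println(ok, len(win), len(buf))
}
```

Each hop, a ticker would call `lastWindow`, hand `win` to the classifier when `ok` is true, and keep `trimmed` as the new input buffer.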

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): add ced-base GGUF model gallery entries (f16 + q8_0)

The ced-base weights are now published at mudler/ced-base-gguf (Apache-2.0,
converted from mispeech/ced-base). Adds gallery/ced.yaml (backend: ced +
known_usecases: sound_classification) and two gallery/index.yaml entries
(ced-base-f16 default, ced-base-q8 smallest) with sha256-pinned files, and
removes the now-resolved TODO from backend/index.yaml.
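
As an illustration of what sha256 pinning protects against, here is a minimal digest check. It is a sketch, not LocalAI's gallery downloader; the function name is hypothetical and the digest in `main` is the well-known SHA-256 of the string "abc", not a model file hash:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// matchesSHA256 reports whether data hashes to the pinned hex digest.
// The comparison is case-insensitive.
func matchesSHA256(data []byte, wantHex string) bool {
	sum := sha256.Sum256(data)
	return strings.EqualFold(hex.EncodeToString(sum[:]), wantHex)
}

func main() {
	pinned := "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(matchesSHA256([]byte("abc"), pinned))
}
```

For multi-megabyte GGUF files, streaming the file through `sha256.New()` with `io.Copy` avoids holding the whole file in memory.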

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ced): add tiny/mini/small GGUF model gallery entries

Publishes the rest of the CED family (same architecture, metadata-driven port
verified end-to-end on ced-tiny) to mudler/ced-{tiny,mini,small}-gguf and adds
their f16 + q8_0 gallery entries:

  ced-tiny  (5.5M, edge/Pi-class)  f16 11MB / q8_0 6MB
  ced-mini  (9.6M)                 f16 19MB / q8_0 11MB
  ced-small (22M)                  f16 42MB / q8_0 23MB

All sha256-pinned. ced-base remains the accuracy default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): point gallery entries at the consolidated mudler/ced-gguf repo

All CED quantizations (tiny/mini/small/base, f16/q8_0) now live in a single
HuggingFace repo, mudler/ced-gguf, instead of per-model repos. Repoint the 8
gallery model entries' urls + file uris accordingly. sha256 and filenames are
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): bump CED_VERSION to the short-clip fix

Pin the ced backend to ced.cpp 99c6ed3, which fixes a crash on any clip
shorter than target_length (~10.11s): time_pos_embed was added at its full
63-frame grid instead of being sliced to the clip's actual time grid, tripping
ggml_can_repeat in ggml_add. Surfaced by the live realtime e2e (sub-10s
windows) and gated with a short-clip parity test upstream.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(ced): list ced.cpp as a LocalAI-team engine + backend-guide directive

- README.md: add ced.cpp to the "native C/C++/GGML engines developed and
  maintained by the LocalAI project" table.
- docs/content/features/backends.md: add a Sound Classification backend
  category (sound-event classification / audio tagging) listing ced.cpp.
- .agents/adding-backends.md: add a "Documenting the backend" section and two
  verification-checklist items requiring new backends to be documented in the
  backends.md category list, and in-house native engines to be added to the
  README maintained-engines table. This directive was missing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ced): repin CED_VERSION to the v0.1.0 release commit

ced.cpp history was squashed into a single release commit (tagged v0.1.0), so
the previous pin (99c6ed3) no longer exists upstream. Pin to c04ac14, the
v0.1.0 release commit, so the backend builds against a commit that exists.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ced): silence gosec G304/G103 + govet unsafeptr on audited paths

- sound_classification.go: os.Create(dst) where dst = temp dir + path.Base of
  the upload (no traversal). #nosec G304, matching the depth-anything-cpp handler.
- goced.go: reading a NUL-terminated C string from a libced-owned buffer.
  #nosec G103 (gosec) + //nolint:govet (golangci-lint's unsafeptr check), since
  the uintptr is a C-owned malloc'd buffer, not Go-GC memory.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI [bot]
2026-06-22 01:00:28 +02:00
committed by GitHub
parent ce8a3e9266
commit 600dafd20b
52 changed files with 2161 additions and 152 deletions


@@ -198,6 +198,27 @@ docker-build-backends: ... docker-build-<backend-name>
- If the backend is in `backend/python/<backend-name>/` but uses `.` as context in the workflow file, use `.` context
- Check similar backends to determine the correct context

## Documenting the backend (README + docs)

A backend is not "added" until it is discoverable. Update the user-facing docs:

- **`docs/content/features/backends.md`** - add the backend to the right
  category in the "LocalAI supports various types of backends" list (and add a
  new category if it introduces a new modality, e.g. sound classification).
- If the backend introduces a **new API surface** (a new endpoint or a realtime
  capability), document it under `docs/content/` where its area lives (audio,
  vision, etc.) and follow the api-endpoints checklist in
  [api-endpoints-and-auth.md](api-endpoints-and-auth.md).

**If the backend is a native C/C++/GGML engine created and maintained by the
LocalAI team** (a from-scratch port like `parakeet.cpp`, `ced.cpp`,
`vibevoice.cpp`, `rf-detr.cpp`, not a wrapper around a third-party runtime), it
ALSO belongs in the top-level **`README.md`** table under "native C/C++/GGML
engines ... developed and maintained by the LocalAI project itself". Add a row
linking the upstream engine repo with a one-line description. This is the
project's showcase of its own engines; a new in-house backend that is missing
from it is a documentation bug.

## 5. Verification Checklist
After adding a new backend, verify:
@@ -211,6 +232,8 @@ After adding a new backend, verify:
- [ ] No YAML syntax errors (check with linter)
- [ ] No Makefile syntax errors (check with linter)
- [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
- [ ] Documented: added to the category list in `docs/content/features/backends.md` (and any new endpoint/realtime capability documented under `docs/content/`)
- [ ] If it is an in-house native C/C++/GGML engine, added to the maintained-engines table in the top-level `README.md`
## Bundling runtime shared libraries (`package.sh`)


@@ -3575,6 +3575,154 @@ include:
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
# ced
- build-type: 'cublas'
  cuda-major-version: "12"
  cuda-minor-version: "8"
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-nvidia-cuda-12-ced'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "13"
  cuda-minor-version: "0"
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-nvidia-cuda-13-ced'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "13"
  cuda-minor-version: "0"
  platforms: 'linux/arm64'
  skip-drivers: 'false'
  tag-latest: 'auto'
  tag-suffix: '-nvidia-l4t-cuda-13-arm64-ced'
  base-image: "ubuntu:24.04"
  ubuntu-version: '2404'
  runs-on: 'ubuntu-24.04-arm'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
- build-type: ''
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  platform-tag: 'amd64'
  tag-latest: 'auto'
  tag-suffix: '-cpu-ced'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: ''
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/arm64'
  platform-tag: 'arm64'
  tag-latest: 'auto'
  tag-suffix: '-cpu-ced'
  runs-on: 'ubuntu-24.04-arm'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'sycl_f32'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-intel-sycl-f32-ced'
  runs-on: 'ubuntu-latest'
  base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'sycl_f16'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-intel-sycl-f16-ced'
  runs-on: 'ubuntu-latest'
  base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'vulkan'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  platform-tag: 'amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-vulkan-ced'
  runs-on: 'ubuntu-latest'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'vulkan'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/arm64'
  platform-tag: 'arm64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-vulkan-ced'
  runs-on: 'ubuntu-24.04-arm'
  base-image: "ubuntu:24.04"
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
- build-type: 'cublas'
  cuda-major-version: "12"
  cuda-minor-version: "0"
  platforms: 'linux/arm64'
  skip-drivers: 'false'
  tag-latest: 'auto'
  tag-suffix: '-nvidia-l4t-arm64-ced'
  base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
  runs-on: 'ubuntu-24.04-arm'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2204'
- build-type: 'hipblas'
  cuda-major-version: ""
  cuda-minor-version: ""
  platforms: 'linux/amd64'
  tag-latest: 'auto'
  tag-suffix: '-gpu-rocm-hipblas-ced'
  base-image: "rocm/dev-ubuntu-24.04:7.2.1"
  runs-on: 'ubuntu-latest'
  skip-drivers: 'false'
  backend: "ced"
  dockerfile: "./backend/Dockerfile.golang"
  context: "./"
  ubuntu-version: '2404'
# acestep-cpp
- build-type: ''
  cuda-major-version: ""
@@ -4754,6 +4902,10 @@ includeDarwin:
  tag-suffix: "-metal-darwin-arm64-parakeet-cpp"
  build-type: "metal"
  lang: "go"
- backend: "ced"
  tag-suffix: "-metal-darwin-arm64-ced"
  build-type: "metal"
  lang: "go"
- backend: "acestep-cpp"
  tag-suffix: "-metal-darwin-arm64-acestep-cpp"
  build-type: "metal"


@@ -42,6 +42,10 @@ jobs:
  variable: "PARAKEET_VERSION"
  branch: "master"
  file: "backend/go/parakeet-cpp/Makefile"
- repository: "mudler/ced.cpp"
  variable: "CED_VERSION"
  branch: "master"
  file: "backend/go/ced/Makefile"
- repository: "mudler/depth-anything.cpp"
  variable: "DEPTHANYTHING_VERSION"
  branch: "master"


@@ -231,6 +231,7 @@ Most backends wrap a best-in-class upstream engine. A handful of them are native
| Backend | What it does |
|---------|-------------|
| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
| [ced.cpp](https://github.com/mudler/ced.cpp) | C++/GGML port of the CED audio-tagging models: sound-event classification (527-class AudioSet) over REST and the realtime API for live recognition |
| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
| [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |


@@ -24,6 +24,9 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
  // SoundDetection runs an audio-tagging / sound-event-classification model
  // (e.g. CED over the AudioSet ontology) on a clip and returns scored labels.
  rpc SoundDetection(SoundDetectionRequest) returns (SoundDetectionResponse) {}
  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
@@ -671,6 +674,24 @@ message DetectResponse {
  repeated Detection Detections = 1;
}

// --- Sound-event classification / audio tagging messages (CED) ---
message SoundDetectionRequest {
  string src = 1;      // audio file path (LocalAI writes the upload to disk)
  int32 top_k = 2;     // number of top tags to return (0 = all classes)
  float threshold = 3; // optional: drop tags scoring below this
}

message SoundClass {
  string label = 1; // AudioSet class name, e.g. "Baby cry, infant cry"
  float score = 2;  // per-class probability (multi-label, independent)
  int32 index = 3;  // class index in the model ontology
}

message SoundDetectionResponse {
  repeated SoundClass detections = 1; // score-descending
}

// --- Depth estimation messages (Depth Anything 3) ---
message DepthRequest {

11
backend/go/ced/.gitignore vendored Normal file

@@ -0,0 +1,11 @@
.cache/
sources/
build/
package/
ced-grpc
# build artifacts staged in-tree by the Makefile (cp from sources/) or
# symlinked for local dev; the real sources live in ced.cpp upstream.
*.so
*.so.*
ced_capi.h
compile_commands.json

77
backend/go/ced/Makefile Normal file

@@ -0,0 +1,77 @@
# ced sound-classification backend Makefile.
#
# Upstream pin lives below as CED_VERSION?=<sha> so .github/bump_deps.sh can find
# and update it (matches the parakeet-cpp / whisper.cpp convention).
#
# Local dev shortcut: symlink an out-of-tree ced.cpp shared build + header and
# skip the clone/cmake steps entirely:
#   ln -sf /path/to/ced.cpp/build-shared/libced.so .
#   ln -sf /path/to/ced.cpp/include/ced_capi.h .
#   go build -o ced-grpc .

CED_VERSION?=c04ac14b7992d00584d9e812c9bb6268598a6ce7
CED_REPO?=https://github.com/mudler/ced.cpp

GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)

BUILD_TYPE?=
NATIVE?=false

# Static-link ggml into libced.so (PIC) so the shared lib is self-contained:
# dlopen needs no libggml*.so alongside it, only system libs the runtime image
# already provides.
CMAKE_ARGS?=-DCMAKE_BUILD_TYPE=Release -DCED_SHARED=ON -DCED_BUILD_CLI=OFF -DCED_BUILD_TESTS=OFF -DBUILD_SHARED_LIBS=OFF -DCMAKE_POSITION_INDEPENDENT_CODE=ON

ifeq ($(NATIVE),false)
  CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif

# ced.cpp gates its ggml backends behind CED_GGML_* options (set(... CACHE BOOL
# "" FORCE)), so forward those instead of a bare -DGGML_CUDA=ON.
ifeq ($(BUILD_TYPE),cublas)
  CMAKE_ARGS+=-DCED_GGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON
else ifeq ($(BUILD_TYPE),openblas)
  CMAKE_ARGS+=-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
else ifeq ($(BUILD_TYPE),hipblas)
  CMAKE_ARGS+=-DCED_GGML_HIP=ON
else ifeq ($(BUILD_TYPE),vulkan)
  CMAKE_ARGS+=-DCED_GGML_VULKAN=ON
endif

.PHONY: ced-grpc package build clean purge test all

all: ced-grpc

sources/ced.cpp:
	mkdir -p sources/ced.cpp
	cd sources/ced.cpp && \
	git init -q && \
	git remote add origin $(CED_REPO) && \
	git fetch --depth 1 origin $(CED_VERSION) && \
	git checkout FETCH_HEAD && \
	git submodule update --init --recursive --depth 1 --single-branch

libced.so: sources/ced.cpp
	cmake -B sources/ced.cpp/build-shared -S sources/ced.cpp $(CMAKE_ARGS)
	cmake --build sources/ced.cpp/build-shared --config Release -j$(JOBS)
	cp -fv sources/ced.cpp/build-shared/libced.so* ./ 2>/dev/null || true
	cp -fv sources/ced.cpp/include/ced_capi.h ./

ced-grpc: libced.so main.go goced.go
	CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o ced-grpc .

package: ced-grpc
	bash package.sh

build: package

test:
	LD_LIBRARY_PATH=$(CURDIR):$$LD_LIBRARY_PATH $(GOCMD) test ./... -count=1

clean: purge
	rm -rf libced.so* ced_capi.h package ced-grpc

purge:
	rm -rf sources/ced.cpp

130
backend/go/ced/goced.go Normal file

@@ -0,0 +1,130 @@
package main

// Go side of the ced backend: purego bindings over ced_capi.h plus the gRPC
// SoundDetection implementation.
//
// SKETCH: the pb.SoundDetection* types come from backend.proto (regenerate with
// `make protogen-go`). The C side is single-threaded per ctx, so we guard the
// engine with engineMu; LocalAI also serializes via base.SingleThread.

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"sort"
	"sync"
	"unsafe"

	"github.com/mudler/LocalAI/pkg/grpc/base"
	pb "github.com/mudler/LocalAI/pkg/grpc/proto"
)

// purego-bound entry points from libced.so. Names match ced_capi.h exactly.
var (
	CppAbiVersion       func() int32
	CppLoad             func(ggufPath string) uintptr
	CppFree             func(ctx uintptr)
	CppLastError        func(ctx uintptr) string
	CppNumClasses       func(ctx uintptr) int32
	CppSampleRate       func(ctx uintptr) int32
	CppClassifyPathJSON func(ctx uintptr, wavPath string, topK int32) uintptr
	CppClassifyPcmJSON  func(ctx uintptr, pcm []float32, nSamples int32, sampleRate int32, topK int32) uintptr
	CppFreeString       func(s uintptr)
)

// cstr copies a malloc'd C string (returned as uintptr) into a Go string and
// frees the original via ced_capi_free_string. Empty/0 -> "".
func cstr(p uintptr) string {
	if p == 0 {
		return ""
	}
	defer CppFreeString(p)
	var b []byte
	for i := 0; ; i++ {
		ch := *(*byte)(unsafe.Pointer(p + uintptr(i))) //nolint:govet // #nosec G103 -- C-owned NUL-terminated string from libced (not Go-GC memory)
		if ch == 0 {
			break
		}
		b = append(b, ch)
	}
	return string(b)
}

// Ced is the gRPC backend. One loaded CED model per instance.
type Ced struct {
	base.Base
	ctxPtr   uintptr
	engineMu sync.Mutex
}

// Load resolves the GGUF and opens the C-API context.
func (c *Ced) Load(opts *pb.ModelOptions) error {
	if opts.ModelFile == "" {
		return errors.New("ced: ModelFile is required")
	}
	ctx := CppLoad(opts.ModelFile)
	if ctx == 0 {
		return fmt.Errorf("ced: ced_capi_load failed for %q: %s", opts.ModelFile, CppLastError(0))
	}
	c.ctxPtr = ctx
	return nil
}

// jsonTag mirrors the ced_capi JSON tag objects.
type jsonTag struct {
	Index int     `json:"index"`
	Score float32 `json:"score"`
	Label string  `json:"label"`
}

// SoundDetection classifies the clip at req.Src and returns scored AudioSet tags.
func (c *Ced) SoundDetection(ctx context.Context, req *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
	if c.ctxPtr == 0 {
		return nil, errors.New("ced: model not loaded")
	}
	if req.GetSrc() == "" {
		return nil, errors.New("ced: SoundDetectionRequest.src (audio path) is required")
	}
	topK := req.GetTopK()
	if topK <= 0 {
		// Sensible default for a tagging response. Note that backend.proto
		// documents top_k = 0 as "all classes"; this backend substitutes 10.
		topK = 10
	}

	c.engineMu.Lock()
	out := cstr(CppClassifyPathJSON(c.ctxPtr, req.GetSrc(), topK))
	lastErr := CppLastError(c.ctxPtr)
	c.engineMu.Unlock()
	if out == "" {
		return nil, fmt.Errorf("ced: classification failed: %s", lastErr)
	}

	var tags []jsonTag
	if err := json.Unmarshal([]byte(out), &tags); err != nil {
		return nil, fmt.Errorf("ced: bad classifier JSON: %w", err)
	}

	// Drop tags below the requested threshold, then return them score-descending.
	thr := req.GetThreshold()
	resp := &pb.SoundDetectionResponse{}
	for _, t := range tags {
		if t.Score < thr {
			continue
		}
		resp.Detections = append(resp.Detections, &pb.SoundClass{
			Label: t.Label, Score: t.Score, Index: int32(t.Index),
		})
	}
	sort.Slice(resp.Detections, func(i, j int) bool {
		return resp.Detections[i].Score > resp.Detections[j].Score
	})
	return resp, nil
}

// Free releases the C-API context, if one is open.
func (c *Ced) Free() error {
	c.engineMu.Lock()
	defer c.engineMu.Unlock()
	if c.ctxPtr != 0 {
		CppFree(c.ctxPtr)
		c.ctxPtr = 0
	}
	return nil
}

59
backend/go/ced/main.go Normal file

@@ -0,0 +1,59 @@
package main

// ced sound-classification backend. Started internally by LocalAI: one gRPC
// server per loaded model. Loads libced.so via purego and registers the flat
// C-API declared in ced_capi.h. The library name can be overridden with
// CED_LIBRARY (mirrors PARAKEET_LIBRARY / WHISPER_LIBRARY); the default looks
// for the .so next to this binary.
//
// SKETCH: requires `make protogen-go` after the backend.proto SoundDetection
// addition, and a built libced.so (see Makefile). See DESIGN.md.

import (
	"flag"
	"fmt"
	"os"

	"github.com/ebitengine/purego"
	grpc "github.com/mudler/LocalAI/pkg/grpc"
)

var addr = flag.String("addr", "localhost:50051", "the address to connect to")

type libFunc struct {
	ptr  any
	name string
}

func main() {
	libName := os.Getenv("CED_LIBRARY")
	if libName == "" {
		libName = "libced.so"
	}
	lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
	if err != nil {
		panic(fmt.Errorf("ced: dlopen %q: %w", libName, err))
	}

	// Bound 1:1 to ced_capi.h. char*-returning functions are declared uintptr
	// so we can free the same pointer with ced_capi_free_string after copying
	// (purego's string return would copy and leak the original).
	for _, lf := range []libFunc{
		{&CppAbiVersion, "ced_capi_abi_version"},
		{&CppLoad, "ced_capi_load"},
		{&CppFree, "ced_capi_free"},
		{&CppLastError, "ced_capi_last_error"},
		{&CppNumClasses, "ced_capi_num_classes"},
		{&CppSampleRate, "ced_capi_sample_rate"},
		{&CppClassifyPathJSON, "ced_capi_classify_path_json"},
		{&CppClassifyPcmJSON, "ced_capi_classify_pcm_json"},
		{&CppFreeString, "ced_capi_free_string"},
	} {
		purego.RegisterLibFunc(lf.ptr, lib, lf.name)
	}

	fmt.Fprintf(os.Stderr, "[ced] ABI=%d\n", CppAbiVersion())

	flag.Parse()
	if err := grpc.StartServer(*addr, &Ced{}); err != nil {
		panic(err)
	}
}

60
backend/go/ced/package.sh Executable file

@@ -0,0 +1,60 @@
#!/bin/bash
#
# Bundle the ced-grpc binary, libced.so, the core runtime libs (libc/libstdc++/
# libgomp + ld.so) and the GPU runtime for the active BUILD_TYPE so the package
# is self-contained. Mirrors backend/go/parakeet-cpp/package.sh; run.sh routes
# the (CGO_ENABLED=0) binary through lib/ld.so so the packaged libc is used.
set -e

CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."

mkdir -p "$CURDIR/package/lib"

cp -avf "$CURDIR/ced-grpc" "$CURDIR/package/"
cp -avf "$CURDIR/run.sh" "$CURDIR/package/"
cp -avf "$CURDIR"/libced.so* "$CURDIR/package/lib/" 2>/dev/null || {
    echo "ERROR: libced.so not found in $CURDIR, run 'make' first" >&2
    exit 1
}

if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 "$CURDIR/package/lib/libc.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 "$CURDIR/package/lib/libgcc_s.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 "$CURDIR/package/lib/libstdc++.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 "$CURDIR/package/lib/libm.so.6"
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 "$CURDIR/package/lib/libgomp.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 "$CURDIR/package/lib/libdl.so.2"
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 "$CURDIR/package/lib/librt.so.1"
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 "$CURDIR/package/lib/libpthread.so.0"
elif [ "$(uname -s)" = "Darwin" ]; then
    echo "Detected Darwin"
else
    echo "Error: Could not detect architecture"
    exit 1
fi

GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
fi

echo "Packaging completed successfully"
ls -liah "$CURDIR/package/" "$CURDIR/package/lib/"

backend/go/ced/run.sh Executable file

@@ -0,0 +1,15 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath "$0")")
export LD_LIBRARY_PATH="$CURDIR/lib:$CURDIR:${LD_LIBRARY_PATH:-}"
# If a self-contained ld.so was packaged, route through it so the packaged
# libc / libstdc++ are used instead of the host's (matches the sibling backends).
if [ -f "$CURDIR/lib/ld.so" ]; then
echo "Using lib/ld.so"
exec "$CURDIR/lib/ld.so" "$CURDIR/ced-grpc" "$@"
fi
exec "$CURDIR/ced-grpc" "$@"


@@ -178,6 +178,37 @@
nvidia-cuda-12: "cuda12-parakeet-cpp"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-parakeet-cpp"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-parakeet-cpp"
- &ced
name: "ced"
alias: "ced"
license: mit
icon: https://avatars.githubusercontent.com/u/95302084
description: |
CED sound-event classification / audio tagging (527-class AudioSet).
ced.cpp is a C++/ggml port that performs audio tagging over the AudioSet
taxonomy, exposed through the SoundDetection gRPC rpc and the
/v1/audio/classification REST endpoint. It runs on CPU, NVIDIA CUDA,
AMD ROCm/HIP, Intel SYCL, Vulkan and NVIDIA Jetson (L4T) targets.
urls:
- https://github.com/mudler/ced.cpp
tags:
- audio-classification
- CPU
- GPU
- CUDA
- HIP
capabilities:
default: "cpu-ced"
nvidia: "cuda12-ced"
intel: "intel-sycl-f16-ced"
metal: "metal-ced"
amd: "rocm-ced"
vulkan: "vulkan-ced"
nvidia-l4t: "nvidia-l4t-arm64-ced"
nvidia-cuda-13: "cuda13-ced"
nvidia-cuda-12: "cuda12-ced"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced"
- &voxtral
name: "voxtral"
alias: "voxtral"
@@ -2650,6 +2681,121 @@
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-parakeet-cpp
## ced
- !!merge <<: *ced
name: "ced-development"
capabilities:
default: "cpu-ced-development"
nvidia: "cuda12-ced-development"
intel: "intel-sycl-f16-ced-development"
metal: "metal-ced-development"
amd: "rocm-ced-development"
vulkan: "vulkan-ced-development"
nvidia-l4t: "nvidia-l4t-arm64-ced-development"
nvidia-cuda-13: "cuda13-ced-development"
nvidia-cuda-12: "cuda12-ced-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-ced-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ced-development"
- !!merge <<: *ced
name: "nvidia-l4t-arm64-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-ced"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-ced
- !!merge <<: *ced
name: "nvidia-l4t-arm64-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-ced"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-ced
- !!merge <<: *ced
name: "cuda13-nvidia-l4t-arm64-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ced"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ced
- !!merge <<: *ced
name: "cuda13-nvidia-l4t-arm64-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ced"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ced
- !!merge <<: *ced
name: "cpu-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ced"
mirrors:
- localai/localai-backends:latest-cpu-ced
- !!merge <<: *ced
name: "cpu-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ced"
mirrors:
- localai/localai-backends:master-cpu-ced
- !!merge <<: *ced
name: "metal-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ced"
mirrors:
- localai/localai-backends:latest-metal-darwin-arm64-ced
- !!merge <<: *ced
name: "metal-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ced"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-ced
- !!merge <<: *ced
name: "cuda12-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-ced"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-ced
- !!merge <<: *ced
name: "cuda12-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-ced"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-ced
- !!merge <<: *ced
name: "rocm-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-ced"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-ced
- !!merge <<: *ced
name: "rocm-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-ced"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-ced
- !!merge <<: *ced
name: "intel-sycl-f32-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-ced"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-ced
- !!merge <<: *ced
name: "intel-sycl-f32-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-ced"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-ced
- !!merge <<: *ced
name: "intel-sycl-f16-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-ced"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-ced
- !!merge <<: *ced
name: "intel-sycl-f16-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-ced"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-ced
- !!merge <<: *ced
name: "vulkan-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-ced"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-ced
- !!merge <<: *ced
name: "vulkan-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-ced"
mirrors:
- localai/localai-backends:master-gpu-vulkan-ced
- !!merge <<: *ced
name: "cuda13-ced"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ced"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-ced
- !!merge <<: *ced
name: "cuda13-ced-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ced"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-ced
## stablediffusion-ggml
- !!merge <<: *stablediffusionggml
name: "cpu-stablediffusion-ggml"


@@ -0,0 +1,88 @@
package backend
import (
"context"
"fmt"
"sort"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/schema"
grpcPkg "github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
// SoundDetectionRequest carries the knobs the HTTP layer collects for an
// audio-tagging / sound-event-classification call. Audio is the path to the
// uploaded clip on disk; TopK and Threshold are optional (0 = backend default).
type SoundDetectionRequest struct {
Audio string
TopK int32
Threshold float32
}
func (r *SoundDetectionRequest) toProto() *proto.SoundDetectionRequest {
return &proto.SoundDetectionRequest{
Src: r.Audio,
TopK: r.TopK,
Threshold: r.Threshold,
}
}
func loadSoundDetectionModel(ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (grpcPkg.Backend, error) {
if modelConfig.Backend == "" {
return nil, fmt.Errorf("sound classification: model %q has no backend set; supported backends include ced", modelConfig.Name)
}
opts := ModelOptions(modelConfig, appConfig)
m, err := ml.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if m == nil {
return nil, fmt.Errorf("could not load sound classification model")
}
return m, nil
}
// ModelSoundDetection runs the SoundDetection RPC against the configured
// backend and returns a normalized schema.SoundClassificationResult.
func ModelSoundDetection(ctx context.Context, req SoundDetectionRequest, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (*schema.SoundClassificationResult, error) {
m, err := loadSoundDetectionModel(ml, modelConfig, appConfig)
if err != nil {
return nil, err
}
r, err := m.SoundDetection(ctx, req.toProto())
if err != nil {
return nil, err
}
return soundClassificationResultFromProto(modelConfig.Name, r), nil
}
// soundClassificationResultFromProto maps the backend detections to the
// HTTP-facing schema, stable-sorted by descending score (ties keep backend order).
func soundClassificationResultFromProto(modelName string, r *proto.SoundDetectionResponse) *schema.SoundClassificationResult {
out := &schema.SoundClassificationResult{
Model: modelName,
Detections: []schema.SoundClassification{},
}
if r == nil {
return out
}
for _, d := range r.Detections {
if d == nil {
continue
}
out.Detections = append(out.Detections, schema.SoundClassification{
Index: int(d.Index),
Label: d.Label,
Score: d.Score,
})
}
sort.SliceStable(out.Detections, func(i, j int) bool {
return out.Detections[i].Score > out.Detections[j].Score
})
return out
}
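
The normalization step above can be sketched in isolation. This is a minimal illustration, not the wrapper itself: the `detection` type is a hypothetical stand-in for `schema.SoundClassification`, and the indices, labels and scores are made up. It shows why `sort.SliceStable` matters: entries with equal scores keep the order the backend returned them in.

```go
package main

import (
	"fmt"
	"sort"
)

// detection is a hypothetical stand-in for schema.SoundClassification.
type detection struct {
	Index int
	Label string
	Score float32
}

// sortByScoreDesc orders detections from highest to lowest score.
// sort.SliceStable keeps equal-score entries in their incoming order.
func sortByScoreDesc(ds []detection) {
	sort.SliceStable(ds, func(i, j int) bool {
		return ds[i].Score > ds[j].Score
	})
}

func main() {
	// Illustrative values only; not real AudioSet indices or model output.
	ds := []detection{
		{Index: 1, Label: "footsteps", Score: 0.10},
		{Index: 2, Label: "baby cry", Score: 0.80},
		{Index: 3, Label: "dog bark", Score: 0.10},
	}
	sortByScoreDesc(ds)
	for _, d := range ds {
		fmt.Printf("%d %s %.2f\n", d.Index, d.Label, d.Score)
	}
}
```

With a plain `sort.Slice` the relative order of the two 0.10 entries would not be guaranteed, which could make responses differ between otherwise identical calls.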


@@ -8,27 +8,28 @@ import (
// Usecase name constants — the canonical string values used in gallery entries,
// model configs (known_usecases), and UsecaseInfoMap keys.
const (
UsecaseChat = "chat"
UsecaseCompletion = "completion"
UsecaseEdit = "edit"
UsecaseVision = "vision"
UsecaseEmbeddings = "embeddings"
UsecaseTokenize = "tokenize"
UsecaseImage = "image"
UsecaseVideo = "video"
UsecaseTranscript = "transcript"
UsecaseTTS = "tts"
UsecaseSoundGeneration = "sound_generation"
UsecaseRerank = "rerank"
UsecaseDetection = "detection"
UsecaseDepth = "depth"
UsecaseVAD = "vad"
UsecaseAudioTransform = "audio_transform"
UsecaseDiarization = "diarization"
UsecaseRealtimeAudio = "realtime_audio"
UsecaseFaceRecognition = "face_recognition"
UsecaseSpeakerRecognition = "speaker_recognition"
UsecaseTokenClassify = "token_classify"
UsecaseChat = "chat"
UsecaseCompletion = "completion"
UsecaseEdit = "edit"
UsecaseVision = "vision"
UsecaseEmbeddings = "embeddings"
UsecaseTokenize = "tokenize"
UsecaseImage = "image"
UsecaseVideo = "video"
UsecaseTranscript = "transcript"
UsecaseTTS = "tts"
UsecaseSoundGeneration = "sound_generation"
UsecaseRerank = "rerank"
UsecaseDetection = "detection"
UsecaseDepth = "depth"
UsecaseVAD = "vad"
UsecaseAudioTransform = "audio_transform"
UsecaseDiarization = "diarization"
UsecaseSoundClassification = "sound_classification"
UsecaseRealtimeAudio = "realtime_audio"
UsecaseFaceRecognition = "face_recognition"
UsecaseSpeakerRecognition = "speaker_recognition"
UsecaseTokenClassify = "token_classify"
)
// GRPCMethod identifies a Backend service RPC from backend.proto.
@@ -51,6 +52,7 @@ const (
MethodVAD GRPCMethod = "VAD"
MethodAudioTransform GRPCMethod = "AudioTransform"
MethodDiarize GRPCMethod = "Diarize"
MethodSoundDetection GRPCMethod = "SoundDetection"
MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
MethodFaceVerify GRPCMethod = "FaceVerify"
MethodFaceAnalyze GRPCMethod = "FaceAnalyze"
@@ -165,6 +167,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
GRPCMethod: MethodDiarize,
Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
},
UsecaseSoundClassification: {
Flag: FLAG_SOUND_CLASSIFICATION,
GRPCMethod: MethodSoundDetection,
Description: "Sound-event classification / audio tagging (scored AudioSet labels like baby cry, glass breaking, alarms) via the SoundDetection RPC.",
},
UsecaseRealtimeAudio: {
Flag: FLAG_REALTIME_AUDIO,
GRPCMethod: MethodAudioToAudioStream,


@@ -68,6 +68,7 @@ var UsecaseOptions = []FieldOption{
{Value: "face_recognition", Label: "Face Recognition"},
{Value: "transcript", Label: "Transcript"},
{Value: "diarization", Label: "Diarization"},
{Value: "sound_classification", Label: "Sound Classification"},
{Value: "speaker_recognition", Label: "Speaker Recognition"},
{Value: "tts", Label: "TTS"},
{Value: "sound_generation", Label: "Sound Generation"},


@@ -328,6 +328,30 @@ func DefaultRegistry() map[string]FieldMetaOverride {
AutocompleteProvider: ProviderModelsVAD,
Order: 63,
},
"pipeline.sound_detection": {
Section: "pipeline",
Label: "Sound Detection Model",
Description: "Model to use for sound-event classification (audio tagging, e.g. ced) in the pipeline. When set, committed realtime audio is also classified and the scored AudioSet tags are emitted as a conversation.item.sound_detection event.",
Component: "model-select",
AutocompleteProvider: ProviderModels,
Order: 64,
},
"pipeline.sound_detection_window_ms": {
Section: "pipeline",
Label: "Sound Detection Window (ms)",
Description: "Server-side windowing for a sound-only realtime session: length in ms of the audio window classified each hop. 0 = client-driven (the client commits windows).",
Component: "number",
Min: f64(0),
Order: 65,
},
"pipeline.sound_detection_hop_ms": {
Section: "pipeline",
Label: "Sound Detection Hop (ms)",
Description: "Server-side windowing hop in ms: how often the server classifies the last window. 0 = client-driven.",
Component: "number",
Min: f64(0),
Order: 66,
},
"pipeline.reasoning_effort": {
Section: "pipeline",
Label: "Reasoning Effort",
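
The window and hop settings described above can be illustrated with a small sketch. It assumes 16-bit mono PCM and a 16 kHz sample rate, neither of which is stated in this diff, and the helper names (`windowBytes`, `lastWindow`, `serverWindowing`) are hypothetical. It only shows the arithmetic: how many bytes one window covers, how a buffer is trimmed to the most recent window, and when server-side windowing is considered active.

```go
package main

import "fmt"

// windowBytes returns the byte length of windowMs of audio, assuming
// 16-bit mono PCM (2 bytes per sample). The format is an assumption.
func windowBytes(sampleRate, windowMs int) int {
	return sampleRate * windowMs / 1000 * 2
}

// lastWindow returns the trailing n bytes of buf, or all of buf when it
// is shorter than n (or when n is not positive).
func lastWindow(buf []byte, n int) []byte {
	if n <= 0 || len(buf) <= n {
		return buf
	}
	return buf[len(buf)-n:]
}

// serverWindowing reports whether server-side windowing is active:
// both values must be > 0, otherwise the session stays client-driven.
func serverWindowing(windowMs, hopMs int) bool {
	return windowMs > 0 && hopMs > 0
}

func main() {
	n := windowBytes(16000, 1000) // 1 s window at an assumed 16 kHz
	buf := make([]byte, 50000)    // pretend this is streamed audio
	fmt.Println(n, len(lastWindow(buf, n)), serverWindowing(1000, 0))
}
```

A hop shorter than the window means consecutive classifications overlap, which trades extra compute for a lower chance of a short event falling across a window boundary.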


@@ -604,6 +604,20 @@ type Pipeline struct {
LLM string `yaml:"llm,omitempty" json:"llm,omitempty"`
Transcription string `yaml:"transcription,omitempty" json:"transcription,omitempty"`
VAD string `yaml:"vad,omitempty" json:"vad,omitempty"`
// SoundDetection names a sound-event-classification model (e.g. ced). When
// set, each VAD-committed realtime utterance is also run through it and the
// scored AudioSet tags are emitted as a conversation.item.sound_detection
// server event, alongside (and independent of) transcription.
SoundDetection string `yaml:"sound_detection,omitempty" json:"sound_detection,omitempty"`
// SoundDetectionWindowMs / SoundDetectionHopMs enable server-side windowing
// for a sound-detection-only realtime session: instead of the client
// committing audio buffers, the server classifies the last WindowMs of
// streamed audio every HopMs and emits a sound_detection event per hop. Both
// must be > 0 to activate; otherwise the session stays client-driven (the
// client commits windows via input_audio_buffer.commit).
SoundDetectionWindowMs int `yaml:"sound_detection_window_ms,omitempty" json:"sound_detection_window_ms,omitempty"`
SoundDetectionHopMs int `yaml:"sound_detection_hop_ms,omitempty" json:"sound_detection_hop_ms,omitempty"`
// ReasoningEffort sets the reasoning effort (none|minimal|low|medium|high) for
// the pipeline's LLM without editing the LLM model config. Overrides the LLM's
@@ -1452,6 +1466,11 @@ const (
// so it may combine freely with other usecases.
FLAG_TOKEN_CLASSIFY ModelConfigUsecase = 0b1000000000000000000000
// Marks a model as wired for the SoundDetection gRPC primitive
// (audio tagging / sound-event classification — scored AudioSet
// labels via the SoundDetection RPC, e.g. ced).
FLAG_SOUND_CLASSIFICATION ModelConfigUsecase = 0b10000000000000000000000
// Common Subsets
FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
)
@@ -1460,12 +1479,12 @@ const (
// Flags within the same group are NOT orthogonal (e.g., chat and completion are
// both text/language). A model is multimodal when its usecases span 2+ groups.
var ModalityGroups = []ModelConfigUsecase{
FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language
FLAG_VISION | FLAG_DETECTION, // visual understanding
FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO, // speech input — realtime_audio is any-to-any, so it counts here too
FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
FLAG_AUDIO_TRANSFORM, // audio in/out transforms
FLAG_IMAGE | FLAG_VIDEO, // visual generation
FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language
FLAG_VISION | FLAG_DETECTION, // visual understanding
FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO | FLAG_SOUND_CLASSIFICATION, // audio input — realtime_audio is any-to-any, so it counts here too
FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
FLAG_AUDIO_TRANSFORM, // audio in/out transforms
FLAG_IMAGE | FLAG_VIDEO, // visual generation
}
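
The effect of placing FLAG_SOUND_CLASSIFICATION in the audio-input group can be sketched as follows. The flag values and the `spansMultipleGroups` helper are hypothetical; the real bit values and the actual IsMultimodal implementation are not shown in this hunk. The sketch only demonstrates the stated rule that a model counts as multimodal when its usecases touch two or more groups.

```go
package main

import "fmt"

// usecase is a hypothetical stand-in for ModelConfigUsecase.
type usecase uint32

// Illustrative flags only; the real bit values live in the config package.
const (
	flagChat usecase = 1 << iota
	flagTranscript
	flagSoundClassification
	flagTTS
)

// groups mirrors the idea of ModalityGroups: flags within one entry
// belong to the same modality and are not orthogonal to each other.
var groups = []usecase{
	flagChat,                                 // text/language
	flagTranscript | flagSoundClassification, // audio input
	flagTTS,                                  // audio output
}

// spansMultipleGroups reports whether u touches two or more groups.
func spansMultipleGroups(u usecase) bool {
	hit := 0
	for _, g := range groups {
		if u&g != 0 {
			hit++
		}
	}
	return hit >= 2
}

func main() {
	// Transcription plus sound classification stay inside one group.
	fmt.Println(spansMultipleGroups(flagTranscript | flagSoundClassification)) // false
	// Sound classification plus TTS crosses audio input and audio output.
	fmt.Println(spansMultipleGroups(flagSoundClassification | flagTTS)) // true
}
```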
// IsMultimodal returns true if the given usecases span two or more orthogonal
@@ -1488,29 +1507,30 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
return map[string]ModelConfigUsecase{
// Note: FLAG_ANY is intentionally excluded from this map
// because it's 0 and would always match in HasUsecases checks
"FLAG_CHAT": FLAG_CHAT,
"FLAG_COMPLETION": FLAG_COMPLETION,
"FLAG_EDIT": FLAG_EDIT,
"FLAG_EMBEDDINGS": FLAG_EMBEDDINGS,
"FLAG_RERANK": FLAG_RERANK,
"FLAG_IMAGE": FLAG_IMAGE,
"FLAG_TRANSCRIPT": FLAG_TRANSCRIPT,
"FLAG_TTS": FLAG_TTS,
"FLAG_SOUND_GENERATION": FLAG_SOUND_GENERATION,
"FLAG_TOKENIZE": FLAG_TOKENIZE,
"FLAG_VAD": FLAG_VAD,
"FLAG_LLM": FLAG_LLM,
"FLAG_VIDEO": FLAG_VIDEO,
"FLAG_DETECTION": FLAG_DETECTION,
"FLAG_VISION": FLAG_VISION,
"FLAG_FACE_RECOGNITION": FLAG_FACE_RECOGNITION,
"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
"FLAG_AUDIO_TRANSFORM": FLAG_AUDIO_TRANSFORM,
"FLAG_DIARIZATION": FLAG_DIARIZATION,
"FLAG_REALTIME_AUDIO": FLAG_REALTIME_AUDIO,
"FLAG_SCORE": FLAG_SCORE,
"FLAG_DEPTH": FLAG_DEPTH,
"FLAG_TOKEN_CLASSIFY": FLAG_TOKEN_CLASSIFY,
"FLAG_CHAT": FLAG_CHAT,
"FLAG_COMPLETION": FLAG_COMPLETION,
"FLAG_EDIT": FLAG_EDIT,
"FLAG_EMBEDDINGS": FLAG_EMBEDDINGS,
"FLAG_RERANK": FLAG_RERANK,
"FLAG_IMAGE": FLAG_IMAGE,
"FLAG_TRANSCRIPT": FLAG_TRANSCRIPT,
"FLAG_TTS": FLAG_TTS,
"FLAG_SOUND_GENERATION": FLAG_SOUND_GENERATION,
"FLAG_TOKENIZE": FLAG_TOKENIZE,
"FLAG_VAD": FLAG_VAD,
"FLAG_LLM": FLAG_LLM,
"FLAG_VIDEO": FLAG_VIDEO,
"FLAG_DETECTION": FLAG_DETECTION,
"FLAG_VISION": FLAG_VISION,
"FLAG_FACE_RECOGNITION": FLAG_FACE_RECOGNITION,
"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
"FLAG_AUDIO_TRANSFORM": FLAG_AUDIO_TRANSFORM,
"FLAG_DIARIZATION": FLAG_DIARIZATION,
"FLAG_SOUND_CLASSIFICATION": FLAG_SOUND_CLASSIFICATION,
"FLAG_REALTIME_AUDIO": FLAG_REALTIME_AUDIO,
"FLAG_SCORE": FLAG_SCORE,
"FLAG_DEPTH": FLAG_DEPTH,
"FLAG_TOKEN_CLASSIFY": FLAG_TOKEN_CLASSIFY,
}
}
@@ -1713,6 +1733,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
}
}
if (u & FLAG_SOUND_CLASSIFICATION) == FLAG_SOUND_CLASSIFICATION {
// ced is a sound-event tagger (AudioSet labels) surfaced via the
// SoundDetection RPC. Models without explicit known_usecases still
// surface when they run on one of these backends.
soundClassificationBackends := []string{"ced"}
if !slices.Contains(soundClassificationBackends, c.Backend) {
return false
}
}
if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
// Backends that own a single any-to-any loop and implement
// AudioToAudioStream — listed here so models without an explicit


@@ -48,6 +48,10 @@ var RouteFeatureRegistry = []RouteFeature{
{"POST", "/v1/audio/diarization", FeatureAudioDiarization},
{"POST", "/audio/diarization", FeatureAudioDiarization},
// Audio classification (sound-event tagging)
{"POST", "/v1/audio/classification", FeatureAudioClassification},
{"POST", "/audio/classification", FeatureAudioClassification},
// Audio speech / TTS
{"POST", "/v1/audio/speech", FeatureAudioSpeech},
{"POST", "/audio/speech", FeatureAudioSpeech},
@@ -172,6 +176,7 @@ func APIFeatureMetas() []FeatureMeta {
{FeatureAudioSpeech, "Audio Speech / TTS", true},
{FeatureAudioTranscription, "Audio Transcription", true},
{FeatureAudioDiarization, "Audio Diarization", true},
{FeatureAudioClassification, "Audio Classification", true},
{FeatureVAD, "Voice Activity Detection", true},
{FeatureDetection, "Detection", true},
{FeatureVideo, "Video Generation", true},


@@ -38,24 +38,25 @@ const (
FeatureQuantization = "quantization"
// API features (default ON for new users)
FeatureChat = "chat"
FeatureImages = "images"
FeatureAudioSpeech = "audio_speech"
FeatureAudioTranscription = "audio_transcription"
FeatureAudioDiarization = "audio_diarization"
FeatureVAD = "vad"
FeatureDetection = "detection"
FeatureVideo = "video"
FeatureEmbeddings = "embeddings"
FeatureSound = "sound"
FeatureRealtime = "realtime"
FeatureRerank = "rerank"
FeatureTokenize = "tokenize"
FeatureMCP = "mcp"
FeatureStores = "stores"
FeatureFaceRecognition = "face_recognition"
FeatureVoiceRecognition = "voice_recognition"
FeatureAudioTransform = "audio_transform"
FeatureChat = "chat"
FeatureImages = "images"
FeatureAudioSpeech = "audio_speech"
FeatureAudioTranscription = "audio_transcription"
FeatureAudioDiarization = "audio_diarization"
FeatureAudioClassification = "audio_classification"
FeatureVAD = "vad"
FeatureDetection = "detection"
FeatureVideo = "video"
FeatureEmbeddings = "embeddings"
FeatureSound = "sound"
FeatureRealtime = "realtime"
FeatureRerank = "rerank"
FeatureTokenize = "tokenize"
FeatureMCP = "mcp"
FeatureStores = "stores"
FeatureFaceRecognition = "face_recognition"
FeatureVoiceRecognition = "voice_recognition"
FeatureAudioTransform = "audio_transform"
// FeaturePIIFilter gates the synchronous PII analyze/redact service
// (POST /api/pii/{analyze,redact}). Default ON like the other API
// features; the admin-only events log is gated separately in-handler.
@@ -71,7 +72,7 @@ var GeneralFeatures = []string{FeatureFineTuning, FeatureQuantization}
// APIFeatures lists API endpoint features (default ON).
var APIFeatures = []string{
FeatureChat, FeatureImages, FeatureAudioSpeech, FeatureAudioTranscription,
FeatureAudioDiarization,
FeatureAudioDiarization, FeatureAudioClassification,
FeatureVAD, FeatureDetection, FeatureVideo, FeatureEmbeddings, FeatureSound,
FeatureRealtime, FeatureRerank, FeatureTokenize, FeatureMCP, FeatureStores,
FeatureFaceRecognition, FeatureVoiceRecognition, FeatureAudioTransform,


@@ -32,9 +32,9 @@ var instructionDefs = []instructionDef{
},
{
Name: "audio",
Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, and sound generation",
Description: "Text-to-speech, voice activity detection, transcription, speaker diarization, sound classification, and sound generation",
Tags: []string{"audio"},
Intro: "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format).",
Intro: "Diarization (/v1/audio/diarization) returns speaker-labelled time segments. Backends with native ASR-diarization (vibevoice-cpp) can also emit per-segment text via include_text=true; backends with a dedicated pipeline (sherpa-onnx + pyannote) emit segmentation only. Response formats: json (default), verbose_json (adds speakers summary + text), rttm (NIST format). Sound classification (/v1/audio/classification) returns scored AudioSet sound-event tags (audio tagging via the ced backend); top_k and threshold control the returned set.",
},
{
Name: "images",
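
A client call to the classification endpoint described above might be built as in the sketch below. This is an assumption-heavy illustration: the commit message says the handler parses a multipart audio upload and the intro text names top_k and threshold, but the exact form field names ("file", "model", "top_k", "threshold"), the base URL and the file name are guesses, not taken from the handler.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"strconv"
)

// newClassificationRequest builds a multipart POST for /v1/audio/classification.
// The form field names used here are assumptions, not confirmed by the diff.
func newClassificationRequest(baseURL, model string, audio []byte, topK int, threshold float64) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)

	// Audio clip as a file part (field name "file" is assumed).
	part, err := w.CreateFormFile("file", "clip.wav")
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(audio); err != nil {
		return nil, err
	}

	// Plain text fields, written in a fixed order.
	fields := [][2]string{
		{"model", model},
		{"top_k", strconv.Itoa(topK)},
		{"threshold", strconv.FormatFloat(threshold, 'f', -1, 64)},
	}
	for _, f := range fields {
		if err := w.WriteField(f[0], f[1]); err != nil {
			return nil, err
		}
	}
	if err := w.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/audio/classification", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	// Placeholder bytes and a placeholder host; no request is actually sent.
	req, err := newClassificationRequest("http://localhost:8080", "ced", []byte("RIFF"), 5, 0.2)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending the request with `http.DefaultClient.Do(req)` should, per the commit message, yield a JSON body carrying the model name and a detections array.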


@@ -93,16 +93,31 @@ type Session struct {
Voice string
TurnDetection *types.TurnDetectionUnion // "server_vad", "semantic_vad" or "none"
InputAudioTranscription *types.AudioTranscription
Tools []types.ToolUnion
ToolChoice *types.ToolChoiceUnion
Conversations map[string]*Conversation
InputAudioBuffer []byte
AudioBufferLock sync.Mutex
OpusFrames [][]byte
OpusFramesLock sync.Mutex
Instructions string
DefaultConversationID string
ModelInterface Model
// SoundDetectionEnabled is set when pipeline.sound_detection names a
// sound-event-classification model. When true, each committed utterance is
// also run through ModelInterface.SoundDetection and the scored tags are
// emitted as a conversation.item.sound_detection event. SoundDetectionTopK
// and SoundDetectionThreshold are the knobs passed to that call (defaults:
// top_k=5, threshold=0).
SoundDetectionEnabled bool
SoundDetectionTopK int
SoundDetectionThreshold float32
// SoundDetectionWindowMs / SoundDetectionHopMs, when both > 0, enable
// server-side windowing for a sound-only session: the server classifies the
// last WindowMs of streamed audio every HopMs (no client commits needed).
SoundDetectionWindowMs int
SoundDetectionHopMs int
Tools []types.ToolUnion
ToolChoice *types.ToolChoiceUnion
Conversations map[string]*Conversation
InputAudioBuffer []byte
AudioBufferLock sync.Mutex
OpusFrames [][]byte
OpusFramesLock sync.Mutex
Instructions string
DefaultConversationID string
ModelInterface Model
// The pipeline model config or the config for an any-to-any model
ModelConfig *config.ModelConfig
InputSampleRate int
@@ -250,6 +265,10 @@ type Model interface {
// TranscribeStream transcribes audio incrementally, invoking onDelta for each
// transcript text fragment and returning the final aggregated result.
TranscribeStream(ctx context.Context, audio, language string, translate, diarize bool, prompt string, onDelta func(text string)) (*schema.TranscriptionResult, error)
// SoundDetection classifies a committed audio window into scored AudioSet
// sound-event tags. topK caps the number of returned tags (0 = backend
// default), threshold drops tags below the given score (0 = keep all).
SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error)
PredictConfig() *config.ModelConfig
}
@@ -399,7 +418,7 @@ func prepareRealtimeConfig(cfg *config.ModelConfig) (errCode, errMsg string, ok
return "", "", true
}
if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" {
if cfg.Pipeline.VAD == "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.TTS == "" && cfg.Pipeline.LLM == "" && cfg.Pipeline.SoundDetection == "" {
return "invalid_model", "Model is not a pipeline model", false
}
return "", "", true
@@ -469,6 +488,26 @@ func runRealtimeSession(application *application.Application, t Transport, model
sttModel := cfg.Pipeline.Transcription
// A sound-detection-only pipeline (sound_detection set, no transcription/LLM)
// activates on sounds, not speech, so it runs WITHOUT the voice VAD: the
// session defaults to turn_detection none and the client drives windowing via
// input_audio_buffer.commit. There is no transcription stage in that case.
soundOnly := cfg.Pipeline.SoundDetection != "" && cfg.Pipeline.Transcription == "" && cfg.Pipeline.LLM == ""
turnDetection := &types.TurnDetectionUnion{
ServerVad: &types.ServerVad{
Threshold: 0.5,
PrefixPaddingMs: 300,
SilenceDurationMs: 500,
CreateResponse: true,
},
}
inputAudioTranscription := &types.AudioTranscription{Model: sttModel}
if soundOnly {
turnDetection = nil // turn_detection none: no VAD
inputAudioTranscription = nil // no transcription stage
}
// Compose the system prompt: prepend the assistant prompt when we have
// one (it teaches the model the safety rules and tool recipes), then the
// session's default voice instructions. Order matches chat.go's
@@ -480,30 +519,26 @@ func runRealtimeSession(application *application.Application, t Transport, model
sessionID := generateSessionID()
session := &Session{
ID: sessionID,
TranscriptionOnly: false,
Model: model,
Voice: cfg.TTSConfig.Voice,
Instructions: instructions,
ModelConfig: cfg,
Tools: assistantTools,
AssistantTools: assistantTools,
AssistantExecutor: assistantExecutor,
TurnDetection: &types.TurnDetectionUnion{
ServerVad: &types.ServerVad{
Threshold: 0.5,
PrefixPaddingMs: 300,
SilenceDurationMs: 500,
CreateResponse: true,
},
},
InputAudioTranscription: &types.AudioTranscription{
Model: sttModel,
},
Conversations: make(map[string]*Conversation),
InputSampleRate: defaultRemoteSampleRate,
OutputSampleRate: defaultRemoteSampleRate,
MaxHistoryItems: resolveMaxHistoryItems(cfg),
ID: sessionID,
TranscriptionOnly: false,
Model: model,
Voice: cfg.TTSConfig.Voice,
Instructions: instructions,
ModelConfig: cfg,
Tools: assistantTools,
AssistantTools: assistantTools,
AssistantExecutor: assistantExecutor,
TurnDetection: turnDetection,
InputAudioTranscription: inputAudioTranscription,
Conversations: make(map[string]*Conversation),
InputSampleRate: defaultRemoteSampleRate,
OutputSampleRate: defaultRemoteSampleRate,
MaxHistoryItems: resolveMaxHistoryItems(cfg),
SoundDetectionEnabled: cfg.Pipeline.SoundDetection != "",
SoundDetectionTopK: defaultSoundDetectionTopK,
SoundDetectionThreshold: 0,
SoundDetectionWindowMs: cfg.Pipeline.SoundDetectionWindowMs,
SoundDetectionHopMs: cfg.Pipeline.SoundDetectionHopMs,
}
// Create a default conversation
@@ -517,14 +552,24 @@ func runRealtimeSession(application *application.Application, t Transport, model
session.Conversations[conversationID] = conversation
session.DefaultConversationID = conversationID
m, err := newModel(
&cfg.Pipeline,
application.ModelConfigLoader(),
application.ModelLoader(),
application.ApplicationConfig(),
evaluator,
buildRealtimeRoutingContext(application, sessionID),
)
var m Model
if soundOnly {
m, err = newSoundDetectionOnlyModel(
&cfg.Pipeline,
application.ModelConfigLoader(),
application.ModelLoader(),
application.ApplicationConfig(),
)
} else {
m, err = newModel(
&cfg.Pipeline,
application.ModelConfigLoader(),
application.ModelLoader(),
application.ApplicationConfig(),
evaluator,
buildRealtimeRoutingContext(application, sessionID),
)
}
if err != nil {
xlog.Error("failed to load model", "error", err)
sendError(t, "model_load_error", "Failed to load model", "", "")
@@ -605,6 +650,20 @@ func runRealtimeSession(application *application.Application, t Transport, model
toggleVAD()
// Server-side sound-detection windowing (option B): for a sound-only session
// with window/hop configured, the server classifies the last window of
// streamed audio on a timer, so the client only has to stream (no commits).
// This runs independently of VAD (sound events are not speech).
var soundWindowDone chan struct{}
if soundOnly && session.SoundDetectionWindowMs > 0 && session.SoundDetectionHopMs > 0 {
soundWindowDone = make(chan struct{})
wg.Go(func() {
handleSoundWindow(session, t, soundWindowDone)
})
xlog.Debug("Starting server-side sound-detection windowing",
"window_ms", session.SoundDetectionWindowMs, "hop_ms", session.SoundDetectionHopMs)
}
for {
msg, err = t.ReadEvent()
if err != nil {
@@ -880,6 +939,10 @@ func runRealtimeSession(application *application.Application, t Transport, model
if vadServerStarted {
close(done)
}
// Stop the server-side sound-detection windowing goroutine (if running).
if soundWindowDone != nil {
close(soundWindowDone)
}
wg.Wait()
// Remove the session from the sessions map
@@ -971,6 +1034,10 @@ func updateTransSession(session *Session, update *types.SessionUnion, cl *config
session.ModelInterface = m
session.ModelConfig = cfg
session.SoundDetectionEnabled = cfg.Pipeline.SoundDetection != ""
if session.SoundDetectionTopK <= 0 {
session.SoundDetectionTopK = defaultSoundDetectionTopK
}
}
if trUpd != nil {
@@ -1343,7 +1410,8 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
// TODO: If we have a real any-to-any model then transcription is optional
var transcript string
if session.InputAudioTranscription != nil {
switch {
case session.InputAudioTranscription != nil:
// emitTranscription streams transcript deltas when
// pipeline.streaming.transcription is set, otherwise emits a single
// completed event; either way it returns the final transcript text.
@@ -1358,13 +1426,27 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
sendError(t, "transcription_failed", err.Error(), "", "event_TODO")
return
}
} else {
case session.SoundDetectionEnabled:
// Sound-detection-only session: no transcription and no LLM. The
// sound-detection emit below carries the result; there is no any-to-any
// path to fall into. Windowing is client-driven (turn_detection none +
// input_audio_buffer.commit), so this is not voice-gated.
default:
// The voice gate runs only on the transcription path above; if an
// any-to-any model path is added here, join the gate before responding.
sendNotImplemented(t, "any-to-any models")
return
}
// Sound-event detection is additive to transcription: classify the same
// committed window and emit its scored AudioSet tags as a separate event.
// A failure here is logged but must never abort the turn.
if session.SoundDetectionEnabled {
if sderr := emitSoundDetection(ctx, t, session, generateItemID(), f.Name()); sderr != nil {
xlog.Error("sound detection failed", "error", sderr)
}
}
// Join on the resolution before any side-effecting step.
var speaker *types.Speaker
if runResolve {
@@ -1415,11 +1497,94 @@ func commitUtterance(ctx context.Context, utt []byte, session *Session, conv *Co
}
}
if !session.TranscriptionOnly {
// Generate an LLM response only when there is a transcript to feed it. A
// sound-detection-only session (no transcription) has no LLM stage, so it
// stops here after emitting the sound-detection event.
if session.InputAudioTranscription != nil && !session.TranscriptionOnly {
generateResponse(ctx, session, utt, transcript, speaker, conv, t)
}
}
// handleSoundWindow runs server-side windowed sound-event detection (option B):
// every HopMs it classifies the last WindowMs of streamed audio and emits a
// sound_detection event, so a sound-only client only has to stream audio (no
// input_audio_buffer.commit). It keeps the input buffer trimmed to one window
// so a long stream stays bounded. Runs until done is closed. This is
// independent of VAD: sound events are not speech.
func handleSoundWindow(session *Session, t Transport, done chan struct{}) {
ticker := time.NewTicker(time.Duration(session.SoundDetectionHopMs) * time.Millisecond)
defer ticker.Stop()
for {
select {
case <-done:
return
case <-ticker.C:
classifySoundWindow(session, t)
}
}
}
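The ticker-plus-done-channel loop used by `handleSoundWindow` can be tried in isolation. The following is a minimal sketch, not the PR's code: `runEvery` and `countTicksMs` are hypothetical names, and a counter stands in for `classifySoundWindow`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runEvery calls tick once per hop until done is closed.
// Hypothetical helper: it mirrors the shape of handleSoundWindow only.
func runEvery(hop time.Duration, done <-chan struct{}, tick func()) {
	ticker := time.NewTicker(hop)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			tick()
		}
	}
}

// countTicksMs runs the loop for roughly runMs milliseconds with a hop of
// hopMs milliseconds and reports how many ticks fired before shutdown.
func countTicksMs(hopMs, runMs int) int64 {
	var n atomic.Int64
	done := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runEvery(time.Duration(hopMs)*time.Millisecond, done, func() { n.Add(1) })
	}()
	time.Sleep(time.Duration(runMs) * time.Millisecond)
	close(done) // same shutdown signal as close(soundWindowDone)
	wg.Wait()   // once Wait returns, no further tick can run
	return n.Load()
}

func main() {
	fmt.Println("ticks:", countTicksMs(5, 60))
}
```

Closing the channel and then waiting on the group is what guarantees that no classification is still in flight when the session is torn down; a tick that has already started is allowed to finish.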
// classifySoundWindow is one windowing tick: it snapshots the most recent
// WindowMs of buffered audio (trimming the buffer so a long stream stays
// bounded) and, when there is enough, classifies it and emits a sound_detection
// event. Extracted from handleSoundWindow so it can be driven synchronously in
// tests.
func classifySoundWindow(session *Session, t Transport) {
const bytesPerSample = 2 // 16-bit mono PCM
sr := session.InputSampleRate
windowBytes := session.SoundDetectionWindowMs * sr / 1000 * bytesPerSample
minBytes := sr / 100 * bytesPerSample // ~10ms before classifying
session.AudioBufferLock.Lock()
// Keep only the most recent window so a long stream stays bounded.
if windowBytes > 0 && len(session.InputAudioBuffer) > windowBytes {
trimmed := make([]byte, windowBytes)
copy(trimmed, session.InputAudioBuffer[len(session.InputAudioBuffer)-windowBytes:])
session.InputAudioBuffer = trimmed
}
window := make([]byte, len(session.InputAudioBuffer))
copy(window, session.InputAudioBuffer)
session.AudioBufferLock.Unlock()
if len(window) < minBytes {
return // not enough audio buffered yet
}
path, err := writeWindowWAV(window, sr)
if err != nil {
xlog.Error("sound window: failed to write wav", "error", err)
return
}
if sderr := emitSoundDetection(context.Background(), t, session, generateItemID(), path); sderr != nil {
xlog.Error("sound window: detection failed", "error", sderr)
}
if rerr := os.Remove(path); rerr != nil {
xlog.Debug("sound window: temp cleanup failed", "error", rerr)
}
}
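The byte arithmetic behind the window trim can be checked with a small standalone sketch. `windowBytes` and `lastWindow` are hypothetical helper names introduced here for illustration; the formula itself is the one used in `classifySoundWindow`.

```go
package main

import "fmt"

const bytesPerSample = 2 // 16-bit mono PCM, as in classifySoundWindow

// windowBytes converts a window length in milliseconds into a byte count.
func windowBytes(windowMs, sampleRate int) int {
	return windowMs * sampleRate / 1000 * bytesPerSample
}

// lastWindow returns a copy of the most recent n bytes of buf, or a copy of
// all of buf when it holds n bytes or fewer (or when n is not positive).
func lastWindow(buf []byte, n int) []byte {
	if n > 0 && len(buf) > n {
		buf = buf[len(buf)-n:]
	}
	out := make([]byte, len(buf))
	copy(out, buf)
	return out
}

func main() {
	n := windowBytes(200, 16000)
	fmt.Println(n)                                       // 6400
	fmt.Println(len(lastWindow(make([]byte, 10000), n))) // 6400
	fmt.Println(len(lastWindow(make([]byte, 100), n)))   // 100
}
```

Because the sample count is computed first and only then multiplied by the sample width, the result is always a whole number of 16-bit samples, so the trim never splits a sample.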
// writeWindowWAV writes mono 16-bit PCM to a temp WAV at the given sample rate
// (the ced classifier reads the declared rate and resamples). Returns the path;
// the caller removes it.
func writeWindowWAV(pcm []byte, sampleRate int) (string, error) {
f, err := os.CreateTemp("", "realtime-sound-window-*.wav")
if err != nil {
return "", err
}
defer func() { _ = f.Close() }()
hdr := laudio.NewWAVHeaderWithRate(uint32(len(pcm)), uint32(sampleRate))
if err := hdr.Write(f); err != nil {
_ = os.Remove(f.Name())
return "", err
}
if _, err := f.Write(pcm); err != nil {
_ = os.Remove(f.Name())
return "", err
}
_ = f.Sync()
return f.Name(), nil
}
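For readers unfamiliar with what the WAV header carries, here is a minimal sketch that builds the canonical 44-byte PCM header by hand using only the standard library. It is an illustration under stated assumptions, not the implementation of `laudio.NewWAVHeaderWithRate`; `wavHeader`, `encodeMono16WAV` and `declaredSampleRate` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// wavHeader follows the canonical 44-byte RIFF/WAVE PCM header layout.
type wavHeader struct {
	ChunkID       [4]byte // "RIFF"
	ChunkSize     uint32  // 36 + data length
	Format        [4]byte // "WAVE"
	Subchunk1ID   [4]byte // "fmt "
	Subchunk1Size uint32  // 16 for PCM
	AudioFormat   uint16  // 1 = PCM
	NumChannels   uint16
	SampleRate    uint32
	ByteRate      uint32
	BlockAlign    uint16
	BitsPerSample uint16
	Subchunk2ID   [4]byte // "data"
	Subchunk2Size uint32  // data length
}

// encodeMono16WAV prepends a mono 16-bit PCM header to pcm.
func encodeMono16WAV(pcm []byte, sampleRate uint32) ([]byte, error) {
	const channels, bitsPerSample = 1, 16
	h := wavHeader{
		ChunkID:       [4]byte{'R', 'I', 'F', 'F'},
		ChunkSize:     36 + uint32(len(pcm)),
		Format:        [4]byte{'W', 'A', 'V', 'E'},
		Subchunk1ID:   [4]byte{'f', 'm', 't', ' '},
		Subchunk1Size: 16,
		AudioFormat:   1,
		NumChannels:   channels,
		SampleRate:    sampleRate,
		ByteRate:      sampleRate * channels * bitsPerSample / 8,
		BlockAlign:    channels * bitsPerSample / 8,
		BitsPerSample: bitsPerSample,
		Subchunk2ID:   [4]byte{'d', 'a', 't', 'a'},
		Subchunk2Size: uint32(len(pcm)),
	}
	var buf bytes.Buffer
	// encoding/binary writes fixed-size fields back to back, with no padding.
	if err := binary.Write(&buf, binary.LittleEndian, h); err != nil {
		return nil, err
	}
	buf.Write(pcm)
	return buf.Bytes(), nil
}

// declaredSampleRate reads the little-endian sample rate at byte offset 24.
func declaredSampleRate(wav []byte) uint32 {
	return binary.LittleEndian.Uint32(wav[24:28])
}

func main() {
	wav, err := encodeMono16WAV(make([]byte, 640), 24000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(wav), declaredSampleRate(wav))
}
```

The sample rate sits at byte offset 24, which is the field the `writeWindowWAV` test in this PR inspects.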
func runVAD(ctx context.Context, session *Session, adata []int16) ([]schema.VADSegment, error) {
soundIntBuffer := &audio.IntBuffer{
Format: &audio.Format{SampleRate: localSampleRate, NumChannels: 1},

@@ -75,6 +75,11 @@ type fakeModel struct {
transcribeDeltas []string
transcribeFinal *schema.TranscriptionResult
// soundDetectionResult/soundDetectionErr drive the SoundDetection double so
// the sound-event path can be exercised deterministically.
soundDetectionResult *schema.SoundClassificationResult
soundDetectionErr error
// Predict streaming: predictTokens are replayed through the token callback
// (simulating streamed LLM output); predictResp/predictErr are returned by
// the deferred predict function. predictChunkDeltas, when set, are delivered
@@ -95,6 +100,13 @@ func (m *fakeModel) Transcribe(context.Context, string, string, bool, bool, stri
return m.transcribeFinal, nil
}
func (m *fakeModel) SoundDetection(context.Context, string, int, float32) (*schema.SoundClassificationResult, error) {
if m.soundDetectionErr != nil {
return nil, m.soundDetectionErr
}
return m.soundDetectionResult, nil
}
func (m *fakeModel) Predict(_ context.Context, msgs schema.Messages, _, _, _ []string, cb func(string, backend.TokenUsage) bool, _ []types.ToolUnion, _ *types.ToolChoiceUnion, _, _ *int, _ map[string]float64) (func() (backend.LLMResponse, error), error) {
m.lastMessages = msgs
if m.predictErr != nil {

@@ -31,10 +31,11 @@ var (
// This means that we fake an Any-to-Any model by overriding some of the gRPC client methods
// that are meant for Any-to-Any models, and instead call a pipeline (e.g. STT->LLM->TTS)
type wrappedModel struct {
TTSConfig *config.ModelConfig
TranscriptionConfig *config.ModelConfig
LLMConfig *config.ModelConfig
VADConfig *config.ModelConfig
TTSConfig *config.ModelConfig
TranscriptionConfig *config.ModelConfig
LLMConfig *config.ModelConfig
VADConfig *config.ModelConfig
SoundDetectionConfig *config.ModelConfig
appConfig *config.ApplicationConfig
modelLoader *model.ModelLoader
@@ -64,8 +65,9 @@ type anyToAnyModel struct {
}
type transcriptOnlyModel struct {
TranscriptionConfig *config.ModelConfig
VADConfig *config.ModelConfig
TranscriptionConfig *config.ModelConfig
VADConfig *config.ModelConfig
SoundDetectionConfig *config.ModelConfig
appConfig *config.ApplicationConfig
modelLoader *model.ModelLoader
@@ -80,6 +82,10 @@ func (m *transcriptOnlyModel) Transcribe(ctx context.Context, audio, language st
return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
}
func (m *transcriptOnlyModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold)
}
func (m *transcriptOnlyModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
return nil, fmt.Errorf("predict operation not supported in transcript-only mode")
}
@@ -108,6 +114,10 @@ func (m *wrappedModel) Transcribe(ctx context.Context, audio, language string, t
return backend.ModelTranscription(ctx, audio, language, translate, diarize, prompt, m.modelLoader, *m.TranscriptionConfig, m.appConfig)
}
func (m *wrappedModel) SoundDetection(ctx context.Context, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
return modelSoundDetection(ctx, m.modelLoader, m.appConfig, m.SoundDetectionConfig, audio, topK, threshold)
}
func (m *wrappedModel) Predict(ctx context.Context, messages schema.Messages, images, videos, audios []string, tokenCallback func(string, backend.TokenUsage) bool, tools []types.ToolUnion, toolChoice *types.ToolChoiceUnion, logprobs *int, topLogprobs *int, logitBias map[string]float64) (func() (backend.LLMResponse, error), error) {
input := schema.OpenAIRequest{
Messages: messages,
@@ -399,6 +409,39 @@ func transcribeStream(ctx context.Context, ml *model.ModelLoader, transcriptionC
return final, nil
}
// modelSoundDetection runs sound-event classification against the session's
// sound-classification model config, mirroring how Transcribe dispatches to
// the transcription backend. Returns an error when no sound-detection model is
// configured for the session.
func modelSoundDetection(ctx context.Context, ml *model.ModelLoader, appConfig *config.ApplicationConfig, soundConfig *config.ModelConfig, audio string, topK int, threshold float32) (*schema.SoundClassificationResult, error) {
if soundConfig == nil {
return nil, fmt.Errorf("sound detection is not configured for this session")
}
return backend.ModelSoundDetection(ctx, backend.SoundDetectionRequest{
Audio: audio,
TopK: int32(topK),
Threshold: threshold,
}, ml, *soundConfig, appConfig)
}
// loadSoundDetectionConfig resolves the optional sound-classification model
// config named by pipeline.sound_detection. Returns (nil, nil) when no model
// is configured so sound detection stays additive and never blocks session
// setup.
func loadSoundDetectionConfig(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader) (*config.ModelConfig, error) {
if pipeline.SoundDetection == "" {
return nil, nil
}
cfg, err := cl.LoadModelConfigFileByName(pipeline.SoundDetection, ml.ModelPath)
if err != nil {
return nil, fmt.Errorf("failed to load sound detection config: %w", err)
}
if valid, _ := cfg.Validate(); !valid {
return nil, fmt.Errorf("failed to validate sound detection config %q", pipeline.SoundDetection)
}
return cfg, nil
}
func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, *config.ModelConfig, error) {
cfgVAD, err := cl.LoadModelConfigFileByName(pipeline.VAD, ml.ModelPath)
if err != nil {
@@ -420,9 +463,15 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
return nil, nil, fmt.Errorf("failed to validate config: %w", err)
}
cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
if err != nil {
return nil, nil, err
}
return &transcriptOnlyModel{
TranscriptionConfig: cfgSST,
VADConfig: cfgVAD,
TranscriptionConfig: cfgSST,
VADConfig: cfgVAD,
SoundDetectionConfig: cfgSound,
confLoader: cl,
modelLoader: ml,
@@ -430,6 +479,27 @@ func newTranscriptionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfig
}, cfgSST, nil
}
// newSoundDetectionOnlyModel builds a realtime model that only does sound-event
// classification: no VAD, transcription, LLM or TTS stages are loaded. Used for
// a sound-detection-only realtime session, which activates on sounds (not
// speech) and is driven either by client-side windowing (turn_detection none +
// input_audio_buffer.commit) or by server-side windowing (handleSoundWindow),
// rather than by the voice VAD loop.
func newSoundDetectionOnlyModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) (Model, error) {
cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
if err != nil {
return nil, err
}
if cfgSound == nil {
return nil, fmt.Errorf("a sound-only realtime session requires pipeline.sound_detection")
}
return &transcriptOnlyModel{
SoundDetectionConfig: cfgSound,
confLoader: cl,
modelLoader: ml,
appConfig: appConfig,
}, nil
}
// RealtimeRoutingContext is the bundle of routing dependencies the
// realtime pipeline needs to consult router.Resolve per turn. nil-safe:
// passing nil skips routing entirely and preserves the historical "one
@@ -544,11 +614,17 @@ func newModel(pipeline *config.Pipeline, cl *config.ModelConfigLoader, ml *model
return nil, fmt.Errorf("failed to validate config: %w", err)
}
cfgSound, err := loadSoundDetectionConfig(pipeline, cl, ml)
if err != nil {
return nil, err
}
wm := &wrappedModel{
TTSConfig: cfgTTS,
TranscriptionConfig: cfgSST,
LLMConfig: cfgLLM,
VADConfig: cfgVAD,
TTSConfig: cfgTTS,
TranscriptionConfig: cfgSST,
LLMConfig: cfgLLM,
VADConfig: cfgVAD,
SoundDetectionConfig: cfgSound,
confLoader: cl,
modelLoader: ml,

@@ -0,0 +1,48 @@
package openai
import (
"context"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
)
// defaultSoundDetectionTopK is the number of scored tags requested per
// committed utterance when the session does not pin its own top_k.
const defaultSoundDetectionTopK = 5
// emitSoundDetection classifies a committed utterance into sound-event tags and
// emits a conversation.item.sound_detection event for it. It mirrors
// emitTranscription's unary path: it calls the session's sound-event
// classifier, maps the scored tags onto the server event, and sends it over
// the transport. Sound detection is additive to transcription: its result is
// emitted independently and a failure here is the caller's to log, never a
// reason to abort the turn.
func emitSoundDetection(ctx context.Context, t Transport, session *Session, itemID, audioPath string) error {
topK := session.SoundDetectionTopK
if topK <= 0 {
topK = defaultSoundDetectionTopK
}
result, err := session.ModelInterface.SoundDetection(ctx, audioPath, topK, session.SoundDetectionThreshold)
if err != nil {
return err
}
detections := make([]types.SoundDetectionTag, 0)
if result != nil {
for _, d := range result.Detections {
detections = append(detections, types.SoundDetectionTag{
Label: d.Label,
Score: d.Score,
Index: d.Index,
})
}
}
return t.SendEvent(types.ConversationItemSoundDetectionEvent{
ServerEventBase: types.ServerEventBase{EventID: "event_TODO"},
ItemID: itemID,
ContentIndex: 0,
Detections: detections,
})
}

@@ -0,0 +1,170 @@
package openai
import (
"context"
"encoding/binary"
"errors"
"os"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/endpoints/openai/types"
"github.com/mudler/LocalAI/core/schema"
)
// emitSoundDetection classifies a committed utterance and emits a single
// conversation.item.sound_detection event carrying the scored AudioSet tags.
var _ = Describe("emitSoundDetection", func() {
It("emits a sound_detection event with the classifier's scored tags", func() {
session := &Session{
SoundDetectionEnabled: true,
SoundDetectionTopK: 5,
ModelInterface: &fakeModel{
soundDetectionResult: &schema.SoundClassificationResult{
Model: "ced",
Detections: []schema.SoundClassification{
{Index: 3, Label: "Baby cry, infant cry", Score: 0.91},
{Index: 7, Label: "Speech", Score: 0.42},
},
},
},
}
t := &fakeTransport{}
err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent)
Expect(ok).To(BeTrue())
Expect(ev.ItemID).To(Equal("item1"))
Expect(ev.ContentIndex).To(Equal(0))
Expect(ev.Detections).To(HaveLen(2))
Expect(ev.Detections[0].Label).To(Equal("Baby cry, infant cry"))
Expect(ev.Detections[0].Score).To(BeNumerically("~", 0.91, 1e-6))
Expect(ev.Detections[0].Index).To(Equal(3))
Expect(ev.Detections[1].Label).To(Equal("Speech"))
})
It("emits an event with no detections when the classifier returns none", func() {
session := &Session{
SoundDetectionEnabled: true,
ModelInterface: &fakeModel{
soundDetectionResult: &schema.SoundClassificationResult{Model: "ced"},
},
}
t := &fakeTransport{}
err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).ToNot(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
ev, ok := t.events[0].(types.ConversationItemSoundDetectionEvent)
Expect(ok).To(BeTrue())
Expect(ev.Detections).To(BeEmpty())
})
It("propagates the classifier error and emits no event", func() {
session := &Session{
SoundDetectionEnabled: true,
ModelInterface: &fakeModel{soundDetectionErr: errors.New("boom")},
}
t := &fakeTransport{}
err := emitSoundDetection(context.Background(), t, session, "item1", "/tmp/x.wav")
Expect(err).To(HaveOccurred())
Expect(t.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0))
})
})
// A sound-detection-only session (no transcription, no LLM) runs through
// commitUtterance WITHOUT the voice/transcription path: it emits the
// sound_detection event and stops - no transcription event, no LLM response.
var _ = Describe("commitUtterance (sound-detection-only session)", func() {
It("emits sound detection and neither transcribes nor generates a response", func() {
session := &Session{
SoundDetectionEnabled: true,
SoundDetectionTopK: 5,
InputAudioTranscription: nil, // sound-only: no transcription stage
ModelConfig: &config.ModelConfig{},
ModelInterface: &fakeModel{
soundDetectionResult: &schema.SoundClassificationResult{
Model: "ced",
Detections: []schema.SoundClassification{
{Index: 23, Label: "Baby cry, infant cry", Score: 0.87},
},
},
},
}
tr := &fakeTransport{}
utt := make([]byte, 32) // non-empty PCM so commitUtterance proceeds
commitUtterance(context.Background(), utt, session, &Conversation{}, tr)
Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
// No transcription happened.
Expect(tr.countEvents(types.ServerEventTypeConversationItemInputAudioTranscriptionCompleted)).To(Equal(0))
// No LLM response was generated (sound-only has no LLM stage).
Expect(tr.countEvents(types.ServerEventTypeResponseDone)).To(Equal(0))
})
})
// Server-side windowing (option B): a sound-only session classifies the last
// WindowMs of streamed audio per tick, with no client commit, and keeps the
// input buffer trimmed to one window.
var _ = Describe("classifySoundWindow (server-side windowing)", func() {
newSoundSession := func() (*Session, *fakeTransport) {
return &Session{
SoundDetectionEnabled: true,
SoundDetectionTopK: 5,
SoundDetectionWindowMs: 200, // 200ms @ 16kHz mono16 = 6400 bytes
SoundDetectionHopMs: 20,
InputSampleRate: 16000,
ModelInterface: &fakeModel{
soundDetectionResult: &schema.SoundClassificationResult{
Model: "ced",
Detections: []schema.SoundClassification{{Index: 23, Label: "Baby cry, infant cry", Score: 0.87}},
},
},
}, &fakeTransport{}
}
It("emits a sound_detection event and trims the buffer to one window", func() {
session, tr := newSoundSession()
session.InputAudioBuffer = make([]byte, 10000) // > 6400-byte window
classifySoundWindow(session, tr)
Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(1))
// buffer trimmed to exactly one window (200ms @ 16kHz mono 16-bit)
Expect(len(session.InputAudioBuffer)).To(Equal(6400))
})
It("does nothing when too little audio is buffered", func() {
session, tr := newSoundSession()
session.InputAudioBuffer = make([]byte, 100) // < ~10ms (320 bytes)
classifySoundWindow(session, tr)
Expect(tr.countEvents(types.ServerEventTypeConversationItemSoundDetection)).To(Equal(0))
})
})
var _ = Describe("writeWindowWAV", func() {
It("writes a mono 16-bit WAV header declaring the given sample rate", func() {
pcm := make([]byte, 640)
path, err := writeWindowWAV(pcm, 24000)
Expect(err).ToNot(HaveOccurred())
defer func() { _ = os.Remove(path) }()
data, err := os.ReadFile(path)
Expect(err).ToNot(HaveOccurred())
Expect(len(data)).To(BeNumerically(">=", 44+len(pcm)))
// SampleRate is a little-endian uint32 at byte offset 24 of a WAV header.
Expect(binary.LittleEndian.Uint32(data[24:28])).To(Equal(uint32(24000)))
})
})

@@ -0,0 +1,91 @@
package openai
import (
"io"
"net/http"
"os"
"path"
"path/filepath"
"github.com/labstack/echo/v4"
"github.com/mudler/LocalAI/core/backend"
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/http/middleware"
"github.com/mudler/LocalAI/core/schema"
model "github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/xlog"
)
// SoundClassificationEndpoint runs an audio-tagging / sound-event
// classification model (e.g. ced) over an uploaded clip and returns the
// scored AudioSet tags in score-descending order. It mirrors the
// transcription path: multipart audio upload -> temp file -> backend call.
//
// @Summary Classify sound events in audio (audio tagging).
// @Tags audio
// @accept multipart/form-data
// @Param model formData string true "model"
// @Param file formData file true "audio file"
// @Param top_k formData int false "number of top tags to return (0 = backend default)"
// @Param threshold formData number false "drop tags scoring below this value"
// @Success 200 {object} schema.SoundClassificationResult
// @Router /v1/audio/classification [post]
func SoundClassificationEndpoint(cl *config.ModelConfigLoader, ml *model.ModelLoader, appConfig *config.ApplicationConfig) echo.HandlerFunc {
return func(c echo.Context) error {
input, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_LOCALAI_REQUEST).(*schema.OpenAIRequest)
if !ok || input.Model == "" {
return echo.ErrBadRequest
}
modelConfig, ok := c.Get(middleware.CONTEXT_LOCALS_KEY_MODEL_CONFIG).(*config.ModelConfig)
if !ok || modelConfig == nil {
return echo.ErrBadRequest
}
req := backend.SoundDetectionRequest{
TopK: int32(parseFormInt(c, "top_k", 0)),
Threshold: float32(parseFormFloat(c, "threshold", 0)),
}
file, err := c.FormFile("file")
if err != nil {
return err
}
f, err := file.Open()
if err != nil {
return err
}
defer func() { _ = f.Close() }()
dir, err := os.MkdirTemp("", "sound-classification")
if err != nil {
return err
}
defer func() { _ = os.RemoveAll(dir) }()
dst := filepath.Join(dir, path.Base(file.Filename))
dstFile, err := os.Create(dst) // #nosec G304 -- dst is a server-created temp dir joined with path.Base of the upload name (no traversal)
if err != nil {
return err
}
if _, err := io.Copy(dstFile, f); err != nil {
xlog.Debug("Audio file copying error", "filename", file.Filename, "dst", dst, "error", err)
_ = dstFile.Close()
return err
}
_ = dstFile.Close()
req.Audio = dst
result, err := backend.ModelSoundDetection(c.Request().Context(), req, ml, *modelConfig, appConfig)
if err != nil {
xlog.Error("Sound classification failed",
"model", modelConfig.Name,
"audio", dst,
"error", err)
return err
}
return c.JSON(http.StatusOK, result)
}
}

@@ -18,6 +18,7 @@ const (
ServerEventTypeConversationItemInputAudioTranscriptionDelta ServerEventType = "conversation.item.input_audio_transcription.delta"
ServerEventTypeConversationItemInputAudioTranscriptionSegment ServerEventType = "conversation.item.input_audio_transcription.segment"
ServerEventTypeConversationItemInputAudioTranscriptionFailed ServerEventType = "conversation.item.input_audio_transcription.failed"
ServerEventTypeConversationItemSoundDetection ServerEventType = "conversation.item.sound_detection"
ServerEventTypeConversationItemTruncated ServerEventType = "conversation.item.truncated"
ServerEventTypeConversationItemDeleted ServerEventType = "conversation.item.deleted"
// ServerEventTypeConversationItemSpeaker is a LocalAI extension: it reports
@@ -473,6 +474,55 @@ func (m ConversationItemInputAudioTranscriptionCompletedEvent) MarshalJSON() ([]
return json.Marshal(shadow)
}
// SoundDetectionTag is one scored sound-event tag from the sound-event
// classifier. Label is the human-readable AudioSet class name, Score is the
// per-class probability (multi-label, independent), and Index is the class
// index in the model ontology.
type SoundDetectionTag struct {
// The human-readable AudioSet class name (e.g. "Baby cry, infant cry").
Label string `json:"label"`
// The per-class probability for this tag.
Score float32 `json:"score"`
// The class index in the model ontology.
Index int `json:"index"`
}
// Returned when an input audio window has been classified by a
// sound-event-detection model. This is a LocalAI extension to the OpenAI
// Realtime API: when a pipeline configures sound_detection, each classified
// window (a VAD-committed utterance, a client commit, or a server-side
// window) is emitted as this event with its scored AudioSet tags,
// independent of (and alongside) transcription.
type ConversationItemSoundDetectionEvent struct {
ServerEventBase
// The ID of the item.
ItemID string `json:"item_id"`
// The index of the content part in the item's content array.
ContentIndex int `json:"content_index"`
// The scored sound-event tags, in score-descending order.
Detections []SoundDetectionTag `json:"detections"`
}
func (m ConversationItemSoundDetectionEvent) ServerEventType() ServerEventType {
return ServerEventTypeConversationItemSoundDetection
}
func (m ConversationItemSoundDetectionEvent) MarshalJSON() ([]byte, error) {
type typeAlias ConversationItemSoundDetectionEvent
type typeWrapper struct {
typeAlias
Type ServerEventType `json:"type"`
}
shadow := typeWrapper{
typeAlias: typeAlias(m),
Type: m.ServerEventType(),
}
return json.Marshal(shadow)
}
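The alias-and-wrapper pattern in `MarshalJSON` injects a `type` discriminator without recursing into the method itself. A minimal standalone sketch follows; `soundEvent`, `tag` and `encodedType` are simplified stand-ins, and the `event_id` JSON key is an assumption, since the tags on `ServerEventBase` are not shown in this diff.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type tag struct {
	Label string  `json:"label"`
	Score float32 `json:"score"`
	Index int     `json:"index"`
}

// soundEvent is a simplified stand-in for ConversationItemSoundDetectionEvent.
type soundEvent struct {
	EventID    string `json:"event_id"` // assumed key, for illustration only
	ItemID     string `json:"item_id"`
	Detections []tag  `json:"detections"`
}

// MarshalJSON adds a "type" field next to the event's own fields.
func (e soundEvent) MarshalJSON() ([]byte, error) {
	// alias has the same fields but no methods, so json.Marshal does not
	// call back into this MarshalJSON and recurse forever.
	type alias soundEvent
	return json.Marshal(struct {
		alias
		Type string `json:"type"`
	}{alias: alias(e), Type: "conversation.item.sound_detection"})
}

// encodedType marshals e and reads the "type" field back out.
func encodedType(e soundEvent) (string, error) {
	raw, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	var probe struct {
		Type string `json:"type"`
	}
	if err := json.Unmarshal(raw, &probe); err != nil {
		return "", err
	}
	return probe.Type, nil
}

func main() {
	raw, err := json.Marshal(soundEvent{ItemID: "item1", Detections: []tag{}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(raw))
}
```

A nil slice marshals as `null`, whereas the non-nil empty slice that `emitSoundDetection` allocates marshals as `[]`, which is usually easier for clients to consume.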
// Returned when the text value of an input audio transcription content part is updated with incremental transcription results.
//
// See https://platform.openai.com/docs/api-reference/realtime-server-events/conversation/item/input_audio_transcription/delta

@@ -23,6 +23,7 @@
"tts": "TTS",
"stt": "STT",
"diarization": "Diarization",
"soundClassification": "Sound Tagging",
"soundGen": "Sound",
"audioTransform": "Audio FX",
"realtimeAudio": "Realtime Audio",

@@ -31,6 +31,7 @@ const FILTERS = [
{ key: 'tts', labelKey: 'filters.tts', icon: 'fa-microphone' },
{ key: 'transcript', labelKey: 'filters.stt', icon: 'fa-headphones' },
{ key: 'diarization', labelKey: 'filters.diarization', icon: 'fa-users' },
{ key: 'sound_classification', labelKey: 'filters.soundClassification', icon: 'fa-ear-listen' },
{ key: 'sound_generation', labelKey: 'filters.soundGen', icon: 'fa-music' },
{ key: 'audio_transform', labelKey: 'filters.audioTransform', icon: 'fa-sliders' },
{ key: 'realtime_audio', labelKey: 'filters.realtimeAudio', icon: 'fa-tower-broadcast' },

@@ -15,6 +15,7 @@ export const CAP_SOUND_GENERATION = 'FLAG_SOUND_GENERATION'
export const CAP_TOKENIZE = 'FLAG_TOKENIZE'
export const CAP_VAD = 'FLAG_VAD'
export const CAP_DIARIZATION = 'FLAG_DIARIZATION'
export const CAP_SOUND_CLASSIFICATION = 'FLAG_SOUND_CLASSIFICATION'
export const CAP_VIDEO = 'FLAG_VIDEO'
export const CAP_DETECTION = 'FLAG_DETECTION'
export const CAP_FACE_RECOGNITION = 'FLAG_FACE_RECOGNITION'

@@ -284,13 +284,14 @@ func RegisterLocalAIRoutes(router *echo.Echo,
// Categorized endpoint groups for structured discovery
"endpoint_groups": map[string]any{
"openai_compatible": map[string]string{
"models": "/v1/models",
"chat_completions": "/v1/chat/completions",
"completions": "/v1/completions",
"embeddings": "/v1/embeddings",
"transcription": "/v1/audio/transcriptions",
"diarization": "/v1/audio/diarization",
"image_generation": "/v1/images/generations",
"models": "/v1/models",
"chat_completions": "/v1/chat/completions",
"completions": "/v1/completions",
"embeddings": "/v1/embeddings",
"transcription": "/v1/audio/transcriptions",
"diarization": "/v1/audio/diarization",
"sound_classification": "/v1/audio/classification",
"image_generation": "/v1/images/generations",
},
"config_management": map[string]string{
"config_metadata": "/api/models/config-metadata",
@@ -342,7 +343,7 @@ func RegisterLocalAIRoutes(router *echo.Echo,
"delete": "/stores/delete",
},
"docs": map[string]string{
"swagger": "/swagger/index.html",
"swagger": "/swagger/index.html",
"instructions": "/api/instructions",
},
},

@@ -200,6 +200,23 @@ func RegisterOpenAIRoutes(app *echo.Echo,
app.POST("/v1/audio/diarization", diarizationHandler, diarizationMiddleware...)
app.POST("/audio/diarization", diarizationHandler, diarizationMiddleware...)
soundClassificationHandler := openai.SoundClassificationEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig())
soundClassificationMiddleware := []echo.MiddlewareFunc{
traceMiddleware,
re.BuildFilteredFirstAvailableDefaultModel(config.BuildUsecaseFilterFn(config.FLAG_SOUND_CLASSIFICATION)),
re.SetModelAndConfig(func() schema.LocalAIRequest { return new(schema.OpenAIRequest) }),
func(next echo.HandlerFunc) echo.HandlerFunc {
return func(c echo.Context) error {
if err := re.SetOpenAIRequest(c); err != nil {
return err
}
return next(c)
}
},
}
app.POST("/v1/audio/classification", soundClassificationHandler, soundClassificationMiddleware...)
app.POST("/audio/classification", soundClassificationHandler, soundClassificationMiddleware...)
audioSpeechHandler := localai.TTSEndpoint(application.ModelConfigLoader(), application.ModelLoader(), application.ApplicationConfig())
audioSpeechMiddleware := []echo.MiddlewareFunc{
nodeHeaderMiddleware,

@@ -42,21 +42,22 @@ const (
// usecaseFilters maps UI filter keys to ModelConfigUsecase flags for
// capability-based gallery filtering.
var usecaseFilters = map[string]config.ModelConfigUsecase{
config.UsecaseChat: config.FLAG_CHAT,
config.UsecaseImage: config.FLAG_IMAGE,
config.UsecaseVideo: config.FLAG_VIDEO,
config.UsecaseVision: config.FLAG_VISION,
config.UsecaseTTS: config.FLAG_TTS,
config.UsecaseTranscript: config.FLAG_TRANSCRIPT,
config.UsecaseSoundGeneration: config.FLAG_SOUND_GENERATION,
config.UsecaseEmbeddings: config.FLAG_EMBEDDINGS,
config.UsecaseRerank: config.FLAG_RERANK,
config.UsecaseDetection: config.FLAG_DETECTION,
config.UsecaseVAD: config.FLAG_VAD,
config.UsecaseAudioTransform: config.FLAG_AUDIO_TRANSFORM,
config.UsecaseDiarization: config.FLAG_DIARIZATION,
config.UsecaseRealtimeAudio: config.FLAG_REALTIME_AUDIO,
config.UsecaseTokenClassify: config.FLAG_TOKEN_CLASSIFY,
config.UsecaseChat: config.FLAG_CHAT,
config.UsecaseImage: config.FLAG_IMAGE,
config.UsecaseVideo: config.FLAG_VIDEO,
config.UsecaseVision: config.FLAG_VISION,
config.UsecaseTTS: config.FLAG_TTS,
config.UsecaseTranscript: config.FLAG_TRANSCRIPT,
config.UsecaseSoundGeneration: config.FLAG_SOUND_GENERATION,
config.UsecaseEmbeddings: config.FLAG_EMBEDDINGS,
config.UsecaseRerank: config.FLAG_RERANK,
config.UsecaseDetection: config.FLAG_DETECTION,
config.UsecaseVAD: config.FLAG_VAD,
config.UsecaseAudioTransform: config.FLAG_AUDIO_TRANSFORM,
config.UsecaseDiarization: config.FLAG_DIARIZATION,
config.UsecaseSoundClassification: config.FLAG_SOUND_CLASSIFICATION,
config.UsecaseRealtimeAudio: config.FLAG_REALTIME_AUDIO,
config.UsecaseTokenClassify: config.FLAG_TOKEN_CLASSIFY,
}
// extractHFRepo tries to find a HuggingFace repo ID from model overrides or URLs.

@@ -0,0 +1,19 @@
package schema
// SoundClassification is one scored sound-event tag. Score is the
// per-class probability (multi-label, independent), Index is the class
// index in the model ontology, and Label is the human-readable AudioSet
// class name (e.g. "Baby cry, infant cry").
type SoundClassification struct {
Index int `json:"index"`
Label string `json:"label"`
Score float32 `json:"score"`
}
// SoundClassificationResult is the JSON response of the
// /v1/audio/classification endpoint: the model name and the scored tags
// in score-descending order.
type SoundClassificationResult struct {
Model string `json:"model"`
Detections []SoundClassification `json:"detections"`
}
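How `top_k` and `threshold` interact with the score-descending order can be illustrated with a short sketch. This is not the ced backend's selection logic, which lives outside this diff; `classification` and `selectTags` are hypothetical, and treating `topK <= 0` as "no limit" is an assumption (the endpoint documents 0 only as "backend default").

```go
package main

import (
	"fmt"
	"sort"
)

// classification mirrors the fields of schema.SoundClassification.
type classification struct {
	Index int
	Label string
	Score float32
}

// selectTags drops tags scoring below threshold, orders the rest by
// descending score, and keeps at most topK of them (topK <= 0 keeps all;
// that choice is an assumption made for this sketch).
func selectTags(all []classification, topK int, threshold float32) []classification {
	out := make([]classification, 0, len(all))
	for _, c := range all {
		if c.Score >= threshold {
			out = append(out, c)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK > 0 && len(out) > topK {
		out = out[:topK]
	}
	return out
}

func main() {
	tags := []classification{
		{Index: 7, Label: "Speech", Score: 0.42},
		{Index: 3, Label: "Baby cry, infant cry", Score: 0.91},
		{Index: 9, Label: "Dog", Score: 0.05},
	}
	for _, c := range selectTags(tags, 2, 0.1) {
		fmt.Println(c.Index, c.Label, c.Score)
	}
}
```

Because the scores are independent per-class probabilities (multi-label), they are not expected to sum to 1, so a threshold is applied per tag rather than relative to the top score.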

@@ -169,6 +169,9 @@ func (c *fakeBackendClient) SoundGeneration(_ context.Context, _ *pb.SoundGenera
func (c *fakeBackendClient) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) (*pb.SoundDetectionResponse, error) {
return nil, nil
}
func (c *fakeBackendClient) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) {
return nil, nil
}

@@ -99,6 +99,9 @@ func (f *fakeGRPCBackend) SoundGeneration(_ context.Context, _ *pb.SoundGenerati
func (f *fakeGRPCBackend) Detect(_ context.Context, _ *pb.DetectOptions, _ ...ggrpc.CallOption) (*pb.DetectResponse, error) {
return &pb.DetectResponse{}, nil
}
func (f *fakeGRPCBackend) SoundDetection(_ context.Context, _ *pb.SoundDetectionRequest, _ ...ggrpc.CallOption) (*pb.SoundDetectionResponse, error) {
return &pb.SoundDetectionResponse{}, nil
}
func (f *fakeGRPCBackend) Depth(_ context.Context, _ *pb.DepthRequest, _ ...ggrpc.CallOption) (*pb.DepthResponse, error) {
return &pb.DepthResponse{}, nil

@@ -0,0 +1,55 @@
+++
disableToc = false
title = "Sound Classification"
weight = 18
url = "/features/audio-classification/"
+++
Sound-event classification (audio tagging) answers the question **"what am I hearing?"** - given an audio clip, it returns a list of scored [AudioSet](https://research.google.com/audioset/) labels (e.g. *Baby cry, infant cry*, *Glass breaking*, *Dog bark*, *Alarm*).
LocalAI exposes this through the `/v1/audio/classification` endpoint, modelled after `/v1/audio/transcriptions`. The reference backend is **[ced.cpp](https://github.com/mudler/ced.cpp)**, a ggml port of CED (a 527-class AudioSet tagger built as a small ViT over a log-mel spectrogram) with full PyTorch parity. The weights are Apache-2.0 licensed and redistributable as GGUF.
Because classification is exposed as a regular OpenAI-style endpoint, any HTTP client works - there is no Python dependency on the consumer side.
## Endpoint
```
POST /v1/audio/classification
Content-Type: multipart/form-data
```
| Field | Type | Description |
|-------|------|-------------|
| `file` | file (required) | audio file in any format `ffmpeg` accepts |
| `model` | string (required) | name of the sound-classification-capable model (e.g. `ced-base`) |
| `top_k` | int (optional) | number of top tags to return (0 = backend default) |
| `threshold` | float (optional) | drop tags scoring below this value |
### Response
```json
{
"model": "ced-base",
"detections": [
{"index": 23, "label": "Baby cry, infant cry", "score": 0.87},
{"index": 22, "label": "Crying, sobbing", "score": 0.41}
]
}
```
Detections are returned in score-descending order. Scores are per-class probabilities (multi-label, independent), so they do not sum to 1.
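To make the `top_k` and `threshold` semantics concrete, here is a minimal client-side sketch in Go that applies the same kind of filtering to an already-decoded `detections` list. The `Detection` type and the `FilterDetections` helper are hypothetical names used only for illustration, and the order in which the backend itself applies the two parameters is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Detection mirrors one entry of the "detections" array in the response.
type Detection struct {
	Index int     `json:"index"`
	Label string  `json:"label"`
	Score float32 `json:"score"`
}

// FilterDetections is a hypothetical helper: it drops tags scoring below
// threshold, sorts the rest in score-descending order, and keeps at most
// topK of them (topK <= 0 keeps everything). The backend's real order of
// operations may differ; this only illustrates the documented semantics.
func FilterDetections(in []Detection, threshold float32, topK int) []Detection {
	out := make([]Detection, 0, len(in))
	for _, d := range in {
		if d.Score >= threshold {
			out = append(out, d)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if topK > 0 && len(out) > topK {
		out = out[:topK]
	}
	return out
}

func main() {
	// Scores are independent per class, so they need not sum to 1.
	tags := []Detection{
		{Index: 22, Label: "Crying, sobbing", Score: 0.41},
		{Index: 23, Label: "Baby cry, infant cry", Score: 0.87},
		{Index: 0, Label: "Speech", Score: 0.05},
	}
	for _, d := range FilterDetections(tags, 0.1, 10) {
		fmt.Printf("%-22s %.2f\n", d.Label, d.Score)
	}
}
```

Because the scores are independent, a single threshold can let several related tags through at once (here both crying tags), which is usually the desired behaviour for multi-label tagging.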
## Example
```bash
curl http://localhost:8080/v1/audio/classification \
-H "Content-Type: multipart/form-data" \
-F file="@/path/to/clip.wav" \
-F model="ced-base" \
-F top_k=10
```
## See also
- [Audio to Text]({{% relref "audio-to-text" %}}) - speech transcription
- [Speaker Diarization]({{% relref "audio-diarization" %}}) - who spoke when

@@ -152,3 +152,7 @@ curl http://localhost:8080/v1/audio/diarization \
- **Speaker identity across files**: speaker IDs (`SPEAKER_00`, `SPEAKER_01`, …) are local to each request. To track the same person across multiple recordings, combine `/v1/audio/diarization` with `/v1/voice/embed` (speaker embedding) and maintain your own embedding store.
- **Hints vs. forces**: `num_speakers` overrides clustering when set; `min_speakers` / `max_speakers` are advisory and only honored by backends that expose a range hint. vibevoice.cpp ignores them — its model picks the count itself.
- **Sample rate**: input is automatically converted to 16 kHz mono via ffmpeg before the backend sees it; sherpa-onnx pyannote-3.0 requires 16 kHz.
## See also
- [Sound Classification]({{% relref "audio-classification" %}}) - tag non-speech sound events (alarms, glass breaking, baby cry) in a clip.

@@ -128,6 +128,7 @@ LocalAI supports various types of backends:
- **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
- **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
- **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)
- **Sound Classification Backends**: For sound-event classification (audio tagging), identifying everyday sounds such as a baby crying, glass breaking, or alarms (e.g., ced.cpp)
- **Image & Video Generation Backends**: For diffusion models (e.g., stable-diffusion.cpp, diffusers)
- **Vision & Detection Backends**: For object detection, segmentation, depth, and face/voice recognition (e.g., rf-detr.cpp, locate-anything.cpp, sam3.cpp, insightface)
- **Audio Processing Backends**: For voice activity detection and audio enhancement (e.g., Silero VAD, LocalVQE)

@@ -15,6 +15,7 @@ You can see the release notes [here](https://github.com/mudler/LocalAI/releases)
- **April 2026**: [Audio Transform](/features/audio-transform/) — generic audio-in / audio-out endpoint with optional reference signal. First implementation: [LocalVQE](https://github.com/localai-org/LocalVQE) C++ backend (joint AEC + noise suppression + dereverberation, DeepVQE-style). Both batch (`POST /audio/transformations`) and bidirectional WebSocket streaming (`/audio/transformations/stream`). Studio "Transform" tab with synchronized waveform players for input / reference / output.
- **April 2026**: [Face recognition backend](/features/face-recognition/) — `insightface`-powered 1:1 verification, 1:N identification, face embedding, face detection, and demographic analysis. Ships both a non-commercial `buffalo_l` model and an Apache 2.0 OpenCV Zoo alternative.
- **May 2026**: [Speaker diarization](/features/audio-diarization/) — new `/v1/audio/diarization` endpoint returning "who spoke when" segments. Backed by `sherpa-onnx` (pyannote-3.0 + speaker embeddings + clustering) for pure diarization, and `vibevoice-cpp` for diarization bundled with long-form ASR. Supports `json` / `verbose_json` / `rttm` response formats.
- **June 2026**: [Sound classification](/features/audio-classification/) — new `/v1/audio/classification` endpoint for audio tagging / sound-event classification, returning scored [AudioSet](https://research.google.com/audioset/) labels (baby cry, glass breaking, alarms, ...). Backed by [ced.cpp](https://github.com/mudler/ced.cpp), a 527-class AudioSet tagger ported to ggml.
- **June 2026**: [PII analyze / redact API](/features/middleware/#analyze--redact-api) — the PII detection pipeline (NER + restricted-regex pattern tiers) is now a standalone service: `POST /api/pii/analyze` returns detected entity spans and `POST /api/pii/redact` returns the sanitised text (or `400 pii_blocked`), without routing a chat request through the middleware. Events gain an `origin` (`middleware` / `proxy` / `pii_analyze` / `pii_redact`) so `/api/pii/events` can be filtered by source.
- **June 2026**: Concurrent scoring and PII NER on llama.cpp — the `Score` (router classifier) and `TokenClassify` (PII NER) primitives now ride llama.cpp's server task queue instead of locking the context, so they run concurrently with chat/completion/embedding traffic and with each other. The `known_usecases` restriction that forced dedicated scorer/NER model configs on llama-cpp is lifted, repeated scoring calls reuse the prompt KV cache across candidates, and scoring inputs are no longer capped by the physical batch size.

gallery/ced.yaml Normal file

@@ -0,0 +1,7 @@
---
name: "ced-sound-classification"
config_file: |
backend: ced
known_usecases:
- sound_classification

@@ -3077,6 +3077,190 @@
- transcript
parameters:
model: tiny
- name: ced-base-f16
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-base
description: |
CED (Consistent Ensemble Distillation, Xiaomi) is a sound-event classifier that tags everyday sounds (baby cry, footsteps, glass breaking, alarms, dog bark, ...) into the 527-class AudioSet ontology. This is the f16 GGUF for the ced backend (a standalone C++/ggml port). Recommended default: fastest on CPU and near-lossless. Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- f16
overrides:
parameters:
model: ced-base-f16.gguf
files:
- filename: ced-base-f16.gguf
sha256: 5c058d9f7b737167195fa54eae4a2ae17658ac2c0a8073f7f116ba006b2ab32c
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-f16.gguf
- name: ced-base-q8
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-base
description: |
CED (Consistent Ensemble Distillation, Xiaomi) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). This is the q8_0 GGUF for the ced backend: smallest footprint (~88 MB, ~6.5x less memory than the PyTorch reference) and near-lossless (identical top-5 tags). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- q8
overrides:
parameters:
model: ced-base-q8_0.gguf
files:
- filename: ced-base-q8_0.gguf
sha256: bd34a7710169f0047fea17267965d211f967828ab25ba6fb9d3768481393f6e2
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-base-q8_0.gguf
- name: ced-tiny-f16
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-tiny
description: |
CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- f16
overrides:
parameters:
model: ced-tiny-f16.gguf
files:
- filename: ced-tiny-f16.gguf
sha256: af8b81c67bae50bfca4ea83dbba77b3bae4fa6180d36c17d6877f7700aeeb77b
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-f16.gguf
- name: ced-tiny-q8
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-tiny
description: |
CED-tiny (5.5M params, Pi-class / edge) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- q8
overrides:
parameters:
model: ced-tiny-q8_0.gguf
files:
- filename: ced-tiny-q8_0.gguf
sha256: 48bee4e2fc3cc85d7806e03471db24e77fda6c2a2e81ffe9ef67caebaf2bd674
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-tiny-q8_0.gguf
- name: ced-mini-f16
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-mini
description: |
CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- f16
overrides:
parameters:
model: ced-mini-f16.gguf
files:
- filename: ced-mini-f16.gguf
sha256: 3c6a8936c77312f07a9ecb7b4bbbcb1f93ad137920ca6656bae9306571fb0c03
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-f16.gguf
- name: ced-mini-q8
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-mini
description: |
CED-mini (9.6M params, low-power) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- q8
overrides:
parameters:
model: ced-mini-q8_0.gguf
files:
- filename: ced-mini-q8_0.gguf
sha256: 7062cef9ca31459f339ce24a5914f3b65bde76ffd9ca4fc924a040327ff292bd
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-mini-q8_0.gguf
- name: ced-small-f16
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-small
description: |
CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). f16 GGUF for the ced backend (recommended: fastest on CPU). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- f16
overrides:
parameters:
model: ced-small-f16.gguf
files:
- filename: ced-small-f16.gguf
sha256: c391ed8697a1b08d7c1a463e4940a5c3a2f670e0544ab0d8ee23b544583602a8
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-f16.gguf
- name: ced-small-q8
url: github:mudler/LocalAI/gallery/ced.yaml@master
urls:
- https://huggingface.co/mudler/ced-gguf
- https://huggingface.co/mispeech/ced-small
description: |
CED-small (22M params, balanced size/accuracy) sound-event classifier over the 527-class AudioSet ontology (baby cry, footsteps, glass breaking, alarms, dog bark, ...). q8_0 GGUF for the ced backend (smallest footprint, near-lossless). Use POST /v1/audio/classification, or the realtime websocket API for live recognition.
license: apache-2.0
tags:
- audio-classification
- sound-event-detection
- audio-tagging
- audioset
- ced
- gguf
- q8
overrides:
parameters:
model: ced-small-q8_0.gguf
files:
- filename: ced-small-q8_0.gguf
sha256: 888275fe43491cf832fb7b8125eccba34d1120745166f40cc12e93b79dea8efe
uri: https://huggingface.co/mudler/ced-gguf/resolve/main/ced-small-q8_0.gguf
- name: omnilingual-0.3b-ctc-q8-sherpa
url: github:mudler/LocalAI/gallery/sherpa-onnx-asr.yaml@master
urls:

@@ -82,6 +82,8 @@ type Backend interface {
Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grpc.CallOption) (*pb.DiarizeResponse, error)
SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error)
AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error)
AudioDecode(ctx context.Context, in *pb.AudioDecodeRequest, opts ...grpc.CallOption) (*pb.AudioDecodeResult, error)

@@ -110,6 +110,10 @@ func (llm *Base) Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error) {
return pb.DiarizeResponse{}, fmt.Errorf("unimplemented")
}
func (llm *Base) SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
return nil, fmt.Errorf("unimplemented")
}
func (llm *Base) TokenizeString(opts *pb.PredictOptions) (pb.TokenizationResponse, error) {
return pb.TokenizationResponse{}, fmt.Errorf("unimplemented")
}

@@ -616,6 +616,24 @@ func (c *Client) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts ...grp
return client.Diarize(ctx, in, opts...)
}
func (c *Client) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) {
if !c.parallel {
c.opMutex.Lock()
defer c.opMutex.Unlock()
}
c.setBusy(true)
defer c.setBusy(false)
c.wdMark()
defer c.wdUnMark()
conn, err := c.dial()
if err != nil {
return nil, err
}
defer func() { _ = conn.Close() }()
client := pb.NewBackendClient(conn)
return client.SoundDetection(ctx, in, opts...)
}
func (c *Client) Detect(ctx context.Context, in *pb.DetectOptions, opts ...grpc.CallOption) (*pb.DetectResponse, error) {
if !c.parallel {
c.opMutex.Lock()

@@ -153,6 +153,10 @@ func (e *embedBackend) Diarize(ctx context.Context, in *pb.DiarizeRequest, opts
return e.s.Diarize(ctx, in)
}
func (e *embedBackend) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest, opts ...grpc.CallOption) (*pb.SoundDetectionResponse, error) {
return e.s.SoundDetection(ctx, in)
}
func (e *embedBackend) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest, opts ...grpc.CallOption) (*pb.AudioEncodeResult, error) {
return e.s.AudioEncode(ctx, in)
}

@@ -40,6 +40,7 @@ type AIModel interface {
VAD(*pb.VADRequest) (pb.VADResponse, error)
Diarize(*pb.DiarizeRequest) (pb.DiarizeResponse, error)
SoundDetection(context.Context, *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error)
AudioEncode(*pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error)
AudioDecode(*pb.AudioDecodeRequest) (*pb.AudioDecodeResult, error)

@@ -435,6 +435,14 @@ func (s *server) Diarize(ctx context.Context, in *pb.DiarizeRequest) (*pb.Diariz
return &res, nil
}
func (s *server) SoundDetection(ctx context.Context, in *pb.SoundDetectionRequest) (*pb.SoundDetectionResponse, error) {
if s.llm.Locking() {
s.llm.Lock()
defer s.llm.Unlock()
}
return s.llm.SoundDetection(ctx, in)
}
func (s *server) AudioEncode(ctx context.Context, in *pb.AudioEncodeRequest) (*pb.AudioEncodeResult, error) {
if s.llm.Locking() {
s.llm.Lock()

@@ -26,6 +26,13 @@ function inferBackendPath(item) {
if (item.backend === "parakeet-cpp") {
return `backend/go/parakeet-cpp/`;
}
// ced is a Go backend (Dockerfile.golang) wrapping the ced.cpp ggml port via
// purego, living in backend/go/ced/. Same explicit-branch rationale as
// parakeet-cpp above: the generic golang fallthrough would also resolve it,
// but this documents the mapping and guards a future dockerfile-suffix change.
if (item.backend === "ced") {
return `backend/go/ced/`;
}
if (item.dockerfile.endsWith("golang")) {
return `backend/go/${item.backend}/`;
}

@@ -1939,6 +1939,53 @@ const docTemplate = `{
}
}
},
"/v1/audio/classification": {
"post": {
"consumes": [
"multipart/form-data"
],
"tags": [
"audio"
],
"summary": "Classify sound events in audio (audio tagging).",
"parameters": [
{
"type": "string",
"description": "model",
"name": "model",
"in": "formData",
"required": true
},
{
"type": "file",
"description": "audio file",
"name": "file",
"in": "formData",
"required": true
},
{
"type": "integer",
"description": "number of top tags to return (0 = backend default)",
"name": "top_k",
"in": "formData"
},
{
"type": "number",
"description": "drop tags scoring below this value",
"name": "threshold",
"in": "formData"
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/schema.SoundClassificationResult"
}
}
}
}
},
"/v1/audio/diarization": {
"post": {
"consumes": [
@@ -6084,6 +6131,34 @@ const docTemplate = `{
}
}
},
"schema.SoundClassification": {
"type": "object",
"properties": {
"index": {
"type": "integer"
},
"label": {
"type": "string"
},
"score": {
"type": "number"
}
}
},
"schema.SoundClassificationResult": {
"type": "object",
"properties": {
"detections": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.SoundClassification"
}
},
"model": {
"type": "string"
}
}
},
"schema.StreamOptions": {
"type": "object",
"properties": {

@@ -1936,6 +1936,53 @@
}
}
},
"/v1/audio/classification": {
"post": {
"consumes": [
"multipart/form-data"
],
"tags": [
"audio"
],
"summary": "Classify sound events in audio (audio tagging).",
"parameters": [
{
"type": "string",
"description": "model",
"name": "model",
"in": "formData",
"required": true
},
{
"type": "file",
"description": "audio file",
"name": "file",
"in": "formData",
"required": true
},
{
"type": "integer",
"description": "number of top tags to return (0 = backend default)",
"name": "top_k",
"in": "formData"
},
{
"type": "number",
"description": "drop tags scoring below this value",
"name": "threshold",
"in": "formData"
}
],
"responses": {
"200": {
"description": "OK",
"schema": {
"$ref": "#/definitions/schema.SoundClassificationResult"
}
}
}
}
},
"/v1/audio/diarization": {
"post": {
"consumes": [
@@ -6081,6 +6128,34 @@
}
}
},
"schema.SoundClassification": {
"type": "object",
"properties": {
"index": {
"type": "integer"
},
"label": {
"type": "string"
},
"score": {
"type": "number"
}
}
},
"schema.SoundClassificationResult": {
"type": "object",
"properties": {
"detections": {
"type": "array",
"items": {
"$ref": "#/definitions/schema.SoundClassification"
}
},
"model": {
"type": "string"
}
}
},
"schema.StreamOptions": {
"type": "object",
"properties": {

@@ -2087,6 +2087,24 @@ definitions:
classifier-side confidence signal).
type: number
type: object
schema.SoundClassification:
properties:
index:
type: integer
label:
type: string
score:
type: number
type: object
schema.SoundClassificationResult:
properties:
detections:
items:
$ref: '#/definitions/schema.SoundClassification'
type: array
model:
type: string
type: object
schema.StreamOptions:
properties:
include_usage:
@@ -3770,6 +3788,37 @@ paths:
summary: Generates audio from the input text.
tags:
- audio
/v1/audio/classification:
post:
consumes:
- multipart/form-data
parameters:
- description: model
in: formData
name: model
required: true
type: string
- description: audio file
in: formData
name: file
required: true
type: file
- description: number of top tags to return (0 = backend default)
in: formData
name: top_k
type: integer
- description: drop tags scoring below this value
in: formData
name: threshold
type: number
responses:
"200":
description: OK
schema:
$ref: '#/definitions/schema.SoundClassificationResult'
summary: Classify sound events in audio (audio tagging).
tags:
- audio
/v1/audio/diarization:
post:
consumes: