feat: add LocalVQE backend and audio transformations UI (#9640)

feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI

Introduce a generic "audio transform" capability for any audio-in /
audio-out operation (echo cancellation, noise suppression,
dereverberation, voice conversion, etc.) and ship LocalVQE as the first
backend implementation.

Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
  bidirectional AudioTransformStream for low-latency frame-by-frame use.
  This is the first bidi stream in the proto; per-frame unary at
  LocalVQE's 16 ms hop would be RTT-bound. Wire it through
  pkg/grpc/{client,server,embed,interface,base} with paired-channel
  ergonomics.

LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the
  upstream shared lib + its libggml-cpu-*.so runtime variants directly —
  no MODULE wrapper needed because LocalVQE handles CPU feature selection
  internally via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without
  it LocalVQE runs single-threaded at ~1× realtime instead of the
  documented ~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
  trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).

HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
  endpoint, accepts audio + optional reference + params[*]=v form fields.
  Persists inputs alongside the output in GeneratedContentDir/audio so
  the React UI history can replay past (audio, reference, output)
  triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
  (interleaved stereo mic+ref in, mono out). JSON session.update envelope
  for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the
  existing utils.AudioToWav (with passthrough fast-path), so the user can
  upload any format / rate without seeing the model's strict 16 kHz
  constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
  Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI
  has no /audio/transformations endpoint), traced via TraceMiddleware.

Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform
  (default-on, in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality
  "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
  huggingface.co/LocalAI-io/LocalVQE).

React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
  section (sibling of Tools / Biometrics) — the page is enhancement, not
  generation, so it lives outside Studio. Two AudioInput components
  (Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference
  through the speakers — the mic naturally picks up speaker bleed, giving
  a real (mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
  and useAudioPeaks hook (shared module-scoped AudioContext to avoid
  hitting browser context limits with three players on one page);
  migrated TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform')
  — the history entry stores all three URLs so clicking re-renders the
  full triple, not just the output.

Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA,
  ROCm, SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we
  ship those two and let GPU-class hardware route through Vulkan in the
  gallery capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
  detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
  (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:
- New docs/content/features/audio-transform.md covering the (audio,
  reference) mental model, batch + WebSocket wire formats, LocalVQE param
  keys, and a YAML config example. Cross-links from text-to-audio and
  audio-to-text feature pages.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
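Illustrative note (not part of the commit): the reference-length policy and the 16 ms frame layout described in the message can be sketched in Go roughly as follows. This is a minimal sketch, not the code in backend/go/localvqe/. The names `fitReference` and `deinterleave` are hypothetical, and both the s16 sample type and the mic-first channel order are assumptions that the message does not state for the streaming path.

```go
package main

import "fmt"

const (
	sampleRate   = 16000                         // Hz, the model's input rate per the commit message
	hopMillis    = 16                            // ms, LocalVQE's hop per the commit message
	frameSamples = sampleRate * hopMillis / 1000 // samples per channel in one frame
)

// fitReference (hypothetical helper) returns a copy of ref that is exactly
// micLen samples long: short references are zero-padded at the end, long
// references are truncated.
func fitReference(ref []int16, micLen int) []int16 {
	out := make([]int16, micLen) // zero-initialised, so padding is implicit
	copy(out, ref)               // copies min(len(out), len(ref)) samples
	return out
}

// deinterleave (hypothetical helper) splits an interleaved stereo frame into
// its two channels. ASSUMPTION: channel 0 is the mic and channel 1 is the
// reference; the commit message only says "interleaved stereo mic+ref".
func deinterleave(frame []int16) (mic, ref []int16) {
	n := len(frame) / 2
	mic = make([]int16, n)
	ref = make([]int16, n)
	for i := 0; i < n; i++ {
		mic[i] = frame[2*i]
		ref[i] = frame[2*i+1]
	}
	return mic, ref
}

func main() {
	fmt.Println(frameSamples)                            // 256
	fmt.Println(fitReference([]int16{1, 2}, 4))          // [1 2 0 0]
	fmt.Println(fitReference([]int16{1, 2, 3, 4, 5}, 3)) // [1 2 3]
	mic, ref := deinterleave([]int16{10, 20, 11, 21})
	fmt.Println(mic, ref) // [10 11] [20 21]
}
```

Under these assumptions one streaming input frame would carry 2 × 256 interleaved samples and one output frame 256 mono samples.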
2026-07-01 03:46:41 -04:00 · 2026-05-04 21:07:11 +01:00
parent de83b72bb7
commit bb033b16a9
59 changed files with 3923 additions and 86 deletions
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -2686,6 +2686,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.golang"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: ''
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64,linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-cpu-localvqe'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "localvqe"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'sycl_f32'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2725,6 +2738,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.golang"
            context: "./"
            ubuntu-version: '2404'
+          - build-type: 'vulkan'
+            cuda-major-version: ""
+            cuda-minor-version: ""
+            platforms: 'linux/amd64,linux/arm64'
+            tag-latest: 'auto'
+            tag-suffix: '-gpu-vulkan-localvqe'
+            runs-on: 'ubuntu-latest'
+            base-image: "ubuntu:24.04"
+            skip-drivers: 'false'
+            backend: "localvqe"
+            dockerfile: "./backend/Dockerfile.golang"
+            context: "./"
+            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "12"
            cuda-minor-version: "0"
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -37,6 +37,7 @@ jobs:
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
+      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
      kokoros: ${{ steps.detect.outputs.kokoros }}
      insightface: ${{ steps.detect.outputs.insightface }}
@@ -884,6 +885,26 @@ jobs:
      - name: Build vibevoice-cpp backend image and run ASR gRPC e2e tests
        run: |
          make test-extra-backend-vibevoice-cpp-transcription
+  # End-to-end audio transform via the e2e-backends gRPC harness. The
+  # LocalVQE GGUF is small (~5 MB) and the model is real-time on CPU, so
+  # the default ubuntu-latest pool is plenty.
+  tests-localvqe-grpc-transform:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.localvqe == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 60
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Build localvqe backend image and run audio_transform gRPC e2e tests
+        run: |
+          make test-extra-backend-localvqe-transform
  tests-voxtral:
    needs: detect-changes
    if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true'