feat(qwen3-tts-cpp): migrate to ServeurpersoCom/qwentts.cpp (streaming, speakers, voice design) (#10316)

* feat(qwen3-tts-cpp): repoint upstream to ServeurpersoCom/qwentts.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): flatten qt_* ABI into qt3_* purego shim

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): build shim against upstream qwen-core static lib

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): add option/language/voice/sampling parsing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): add 24kHz WAV encode/decode/stream-header helpers

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): purego backend with streaming, speakers, voice design

Map TTSRequest onto qwentts.cpp: instructions->instruct, voice->named
speaker or clone-reference path, params map->ref_text + sampling. Add
TTSStream over the qt chunk callback.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(qwen3-tts-cpp): unit specs + build-gated TTS/TTSStream e2e

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(qwen3-tts-cpp): close defensive PCM-free gap on zero-sample result

Register CppPCMFree before the n<=0 guard so a non-null buffer with zero
samples cannot leak (the C contract returns NULL on failure, so this is
defensive). Raised in code review.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(qwen3-tts-cpp): advertise TTSStream capability

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(qwen3-tts-cpp): update backend index metadata for qwentts.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(gallery): qwentts.cpp models - base/customvoice/voicedesign, Q8_0 & Q4_K_M

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* docs(qwen3-tts-cpp): release note for qwentts.cpp migration

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* test(qwen3-tts-cpp): cover audio_path voice-cloning fallback

Add resolveRequest unit specs (config audio_path used as the clone
reference when Voice is empty; per-request audio Voice overrides it; a
named-speaker Voice does not trigger cloning) plus a real-inference e2e
that clones from audio_path (confirmed ref_spk_emb=yes in the pipeline).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(qwen3-tts-cpp): drop the release-note doc

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
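
Note: the voice-resolution precedence covered by the resolveRequest specs (config audio_path as fallback clone reference, a per-request audio Voice overriding it, a named-speaker Voice not triggering cloning) can be sketched roughly as follows. This is an illustrative Python model, not the backend's Go code: `resolve_request`, `Resolved`, and the rule that a voice ending in `.wav` counts as an audio reference are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption for this sketch: a voice value is treated as an audio
# reference when it ends in one of these suffixes. The real backend
# may use a different test.
AUDIO_SUFFIXES = (".wav",)


@dataclass
class Resolved:
    """Hypothetical result type: what the request maps onto."""
    speaker: Optional[str] = None          # named speaker, e.g. "serena"
    clone_reference: Optional[str] = None  # path to reference audio
    instruct: Optional[str] = None         # free-text instruction


def looks_like_audio(voice: str) -> bool:
    return voice.lower().endswith(AUDIO_SUFFIXES)


def resolve_request(voice: str = "",
                    instructions: str = "",
                    config_audio_path: str = "") -> Resolved:
    """Apply the precedence described in the commit message."""
    resolved = Resolved(instruct=instructions or None)
    if voice and looks_like_audio(voice):
        # A per-request audio voice overrides the configured audio_path.
        resolved.clone_reference = voice
    elif voice:
        # A named speaker never triggers cloning.
        resolved.speaker = voice
    elif config_audio_path:
        # Empty voice: fall back to the configured clone reference.
        resolved.clone_reference = config_audio_path
    return resolved
```

For example, `resolve_request(voice="serena", config_audio_path="ref.wav")` yields a named speaker with no clone reference, while `resolve_request(config_audio_path="ref.wav")` clones from the configured file.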
2026-07-31 18:38:23 -04:00 · 2026-06-13 23:09:59 +02:00
parent 3e838c0cff
commit 4bb592cf91
16 changed files with 1264 additions and 558 deletions
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -3304,38 +3304,267 @@
    - filename: vibevoice-cpp-asr/tokenizer.gguf
      sha256: 37dc3b722d5677e37e29a57df55aa05c485116eeb5459e57ff8dde616b4986f6
      uri: huggingface://mudler/vibevoice.cpp-models/tokenizer.gguf
-- name: qwen3-tts-cpp
+- &qwenttscpp_gallery
+  name: qwen3-tts-cpp
  url: github:mudler/LocalAI/gallery/virtual.yaml@master
  urls:
-    - https://huggingface.co/endo5501/qwen3-tts.cpp
-    - https://github.com/predict-woo/qwen3-tts.cpp
+    - https://huggingface.co/Serveurperso/Qwen3-TTS-GGUF
+    - https://github.com/ServeurpersoCom/qwentts.cpp
  description: |
-    Qwen3-TTS 0.6B (C++ / GGML) — native C++ text-to-speech from text input.
-    Generates 24kHz mono audio. Supports 10 languages (en, zh, ja, ko, de, fr, es, it, pt, ru).
-    Uses F16 GGUF models (~2 GB total).
-  license: apache-2.0
+    Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp). Native C++ text-to-speech with
+    streaming output and zero-shot voice cloning (set `voice` to a 24kHz reference
+    .wav). 24kHz mono, 11 languages with Mandarin dialects. Q8_0 (~0.95 GB talker).
+  license: mit
  icon: https://huggingface.co/avatars/c299494fd1e72375832499c75b3425d6.svg
  tags:
    - tts
    - text-to-speech
+    - voice-cloning
+    - streaming
    - qwen3-tts
    - qwen3-tts-cpp
    - gguf
-  last_checked: "2026-04-30"
+  last_checked: "2026-06-13"
  overrides:
    backend: qwen3-tts-cpp
    known_usecases:
      - tts
    name: qwen3-tts-cpp
    parameters:
-      model: qwen3-tts-cpp
+      model: qwen3-tts-cpp/qwen-talker-0.6b-base-Q8_0.gguf
  files:
-    - filename: qwen3-tts-cpp/qwen3-tts-0.6b-f16.gguf
-      sha256: 0b89770118463af8f2467d824a8de57d96df6a09f927a9769a3f7b7fffa7087d
-      uri: huggingface://endo5501/qwen3-tts.cpp/qwen3-tts-0.6b-f16.gguf
-    - filename: qwen3-tts-cpp/qwen3-tts-tokenizer-f16.gguf
-      sha256: d1ad9660bd99343f4851d5a4b17e31f65648feb3559f6ea062ae6575e5cd9d90
-      uri: huggingface://endo5501/qwen3-tts.cpp/qwen3-tts-tokenizer-f16.gguf
+    - filename: qwen3-tts-cpp/qwen-talker-0.6b-base-Q8_0.gguf
+      sha256: d54dbaf10591421fa764ed630d764efa717ae40cd959bd48c66d4eb1af226426
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-0.6b-base-Q8_0.gguf
+    - filename: qwen3-tts-cpp/qwen-tokenizer-12hz-Q8_0.gguf
+      sha256: 1883beeed99348fc35e23dd225e9082f93f6f8c109330a33d935baa8acdbfd94
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q8_0.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-0.6b-base-q4
+  description: |
+    Qwen3-TTS 0.6B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~0.6 GB talker).
+    Streaming + voice cloning, 24kHz mono, 11 languages.
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-0.6b-base-q4
+    parameters:
+      model: qwen3-tts-cpp-0.6b-base-q4/qwen-talker-0.6b-base-Q4_K_M.gguf
+  files:
+    - filename: qwen3-tts-cpp-0.6b-base-q4/qwen-talker-0.6b-base-Q4_K_M.gguf
+      sha256: 4b468ec7b1f62b90ef4ca316c0aa57deadfd54b2cf9651703ea753cedaf04226
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-0.6b-base-Q4_K_M.gguf
+    - filename: qwen3-tts-cpp-0.6b-base-q4/qwen-tokenizer-12hz-Q4_K_M.gguf
+      sha256: cf3788b4d50aaa665fb6e57c170396aae03a3555fea52d2b5d0cda902d658039
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q4_K_M.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-1.7b-base
+  description: |
+    Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q8_0 (~2.0 GB talker).
+    Higher-quality streaming + voice cloning, 24kHz mono, 11 languages.
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-1.7b-base
+    parameters:
+      model: qwen3-tts-cpp-1.7b-base/qwen-talker-1.7b-base-Q8_0.gguf
+  files:
+    - filename: qwen3-tts-cpp-1.7b-base/qwen-talker-1.7b-base-Q8_0.gguf
+      sha256: 4b9a33a236908dd9435a42f7a396e38038329d053b704342a6413c08544c4fda
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-1.7b-base-Q8_0.gguf
+    - filename: qwen3-tts-cpp-1.7b-base/qwen-tokenizer-12hz-Q8_0.gguf
+      sha256: 1883beeed99348fc35e23dd225e9082f93f6f8c109330a33d935baa8acdbfd94
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q8_0.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-1.7b-base-q4
+  description: |
+    Qwen3-TTS 1.7B Base (C++ / GGML, qwentts.cpp), Q4_K_M (~1.2 GB talker).
+    Streaming + voice cloning, 24kHz mono, 11 languages.
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-1.7b-base-q4
+    parameters:
+      model: qwen3-tts-cpp-1.7b-base-q4/qwen-talker-1.7b-base-Q4_K_M.gguf
+  files:
+    - filename: qwen3-tts-cpp-1.7b-base-q4/qwen-talker-1.7b-base-Q4_K_M.gguf
+      sha256: ea393ebaf2167ea23ce9fc18b093822851358a950d7075cd47ab4f6ce23e887d
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-1.7b-base-Q4_K_M.gguf
+    - filename: qwen3-tts-cpp-1.7b-base-q4/qwen-tokenizer-12hz-Q4_K_M.gguf
+      sha256: cf3788b4d50aaa665fb6e57c170396aae03a3555fea52d2b5d0cda902d658039
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q4_K_M.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-customvoice
+  description: |
+    Qwen3-TTS 0.6B CustomVoice (C++ / GGML, qwentts.cpp), Q8_0. Named speakers
+    selected via the `voice` field: serena, vivian, uncle_fu, ryan, aiden,
+    ono_anna, sohee, eric (sichuan dialect), dylan (beijing dialect). Streaming,
+    24kHz mono, 11 languages.
+  tags:
+    - tts
+    - text-to-speech
+    - named-speakers
+    - streaming
+    - qwen3-tts
+    - qwen3-tts-cpp
+    - gguf
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-customvoice
+    parameters:
+      model: qwen3-tts-cpp-customvoice/qwen-talker-0.6b-customvoice-Q8_0.gguf
+  files:
+    - filename: qwen3-tts-cpp-customvoice/qwen-talker-0.6b-customvoice-Q8_0.gguf
+      sha256: 4eb38675c736ed6ac72012846ac8d6ef80e5af8bc05726870f0b3a6569588519
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-0.6b-customvoice-Q8_0.gguf
+    - filename: qwen3-tts-cpp-customvoice/qwen-tokenizer-12hz-Q8_0.gguf
+      sha256: 1883beeed99348fc35e23dd225e9082f93f6f8c109330a33d935baa8acdbfd94
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q8_0.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-customvoice-q4
+  description: |
+    Qwen3-TTS 0.6B CustomVoice (C++ / GGML, qwentts.cpp), Q4_K_M. Named speakers
+    via the `voice` field (serena, vivian, ryan, aiden, eric, dylan, ...).
+    Streaming, 24kHz mono, 11 languages.
+  tags:
+    - tts
+    - text-to-speech
+    - named-speakers
+    - streaming
+    - qwen3-tts
+    - qwen3-tts-cpp
+    - gguf
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-customvoice-q4
+    parameters:
+      model: qwen3-tts-cpp-customvoice-q4/qwen-talker-0.6b-customvoice-Q4_K_M.gguf
+  files:
+    - filename: qwen3-tts-cpp-customvoice-q4/qwen-talker-0.6b-customvoice-Q4_K_M.gguf
+      sha256: b3a7e6613d80f8a703c06267fc1e94d48ce91932ab82ab6e31c50f4ca4868e1e
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-0.6b-customvoice-Q4_K_M.gguf
+    - filename: qwen3-tts-cpp-customvoice-q4/qwen-tokenizer-12hz-Q4_K_M.gguf
+      sha256: cf3788b4d50aaa665fb6e57c170396aae03a3555fea52d2b5d0cda902d658039
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q4_K_M.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-1.7b-customvoice
+  description: |
+    Qwen3-TTS 1.7B CustomVoice (C++ / GGML, qwentts.cpp), Q8_0. Named speakers via
+    the `voice` field (serena, vivian, ryan, aiden, eric, dylan, ...). Streaming,
+    24kHz mono, 11 languages.
+  tags:
+    - tts
+    - text-to-speech
+    - named-speakers
+    - streaming
+    - qwen3-tts
+    - qwen3-tts-cpp
+    - gguf
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-1.7b-customvoice
+    parameters:
+      model: qwen3-tts-cpp-1.7b-customvoice/qwen-talker-1.7b-customvoice-Q8_0.gguf
+  files:
+    - filename: qwen3-tts-cpp-1.7b-customvoice/qwen-talker-1.7b-customvoice-Q8_0.gguf
+      sha256: cab2cff67a0a557310febe558dc83076b28ed790e491867eb2751759f4cd89fa
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-1.7b-customvoice-Q8_0.gguf
+    - filename: qwen3-tts-cpp-1.7b-customvoice/qwen-tokenizer-12hz-Q8_0.gguf
+      sha256: 1883beeed99348fc35e23dd225e9082f93f6f8c109330a33d935baa8acdbfd94
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q8_0.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-1.7b-customvoice-q4
+  description: |
+    Qwen3-TTS 1.7B CustomVoice (C++ / GGML, qwentts.cpp), Q4_K_M. Named speakers
+    via the `voice` field. Streaming, 24kHz mono, 11 languages.
+  tags:
+    - tts
+    - text-to-speech
+    - named-speakers
+    - streaming
+    - qwen3-tts
+    - qwen3-tts-cpp
+    - gguf
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-1.7b-customvoice-q4
+    parameters:
+      model: qwen3-tts-cpp-1.7b-customvoice-q4/qwen-talker-1.7b-customvoice-Q4_K_M.gguf
+  files:
+    - filename: qwen3-tts-cpp-1.7b-customvoice-q4/qwen-talker-1.7b-customvoice-Q4_K_M.gguf
+      sha256: cc328834a631bc08bf9f43e62fa23f8a1383d9b429864ce6690cfb172077fc4a
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-1.7b-customvoice-Q4_K_M.gguf
+    - filename: qwen3-tts-cpp-1.7b-customvoice-q4/qwen-tokenizer-12hz-Q4_K_M.gguf
+      sha256: cf3788b4d50aaa665fb6e57c170396aae03a3555fea52d2b5d0cda902d658039
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q4_K_M.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-1.7b-voicedesign
+  description: |
+    Qwen3-TTS 1.7B VoiceDesign (C++ / GGML, qwentts.cpp), Q8_0. Synthesises a
+    speaker from a free-text attribute instruction - REQUIRES the OpenAI
+    `instructions` field (e.g. "male, young adult, moderate pitch"); requests
+    without it are rejected. Streaming, 24kHz mono, 11 languages.
+  tags:
+    - tts
+    - text-to-speech
+    - voice-design
+    - streaming
+    - qwen3-tts
+    - qwen3-tts-cpp
+    - gguf
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-1.7b-voicedesign
+    parameters:
+      model: qwen3-tts-cpp-1.7b-voicedesign/qwen-talker-1.7b-voicedesign-Q8_0.gguf
+  files:
+    - filename: qwen3-tts-cpp-1.7b-voicedesign/qwen-talker-1.7b-voicedesign-Q8_0.gguf
+      sha256: 575610ab1ddcca4dca6bd9a64bcd859d93bbad8764f9cab24e1dbc0c51f62276
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-1.7b-voicedesign-Q8_0.gguf
+    - filename: qwen3-tts-cpp-1.7b-voicedesign/qwen-tokenizer-12hz-Q8_0.gguf
+      sha256: 1883beeed99348fc35e23dd225e9082f93f6f8c109330a33d935baa8acdbfd94
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q8_0.gguf
+- !!merge <<: *qwenttscpp_gallery
+  name: qwen3-tts-cpp-1.7b-voicedesign-q4
+  description: |
+    Qwen3-TTS 1.7B VoiceDesign (C++ / GGML, qwentts.cpp), Q4_K_M. Synthesises a
+    speaker from a free-text attribute instruction - REQUIRES the `instructions`
+    field. Streaming, 24kHz mono, 11 languages.
+  tags:
+    - tts
+    - text-to-speech
+    - voice-design
+    - streaming
+    - qwen3-tts
+    - qwen3-tts-cpp
+    - gguf
+  overrides:
+    backend: qwen3-tts-cpp
+    known_usecases:
+      - tts
+    name: qwen3-tts-cpp-1.7b-voicedesign-q4
+    parameters:
+      model: qwen3-tts-cpp-1.7b-voicedesign-q4/qwen-talker-1.7b-voicedesign-Q4_K_M.gguf
+  files:
+    - filename: qwen3-tts-cpp-1.7b-voicedesign-q4/qwen-talker-1.7b-voicedesign-Q4_K_M.gguf
+      sha256: 7605ed0cc5e72059f27468c27f70c070e05d1cc0c7b1c76bfb9cba717a59eee3
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-talker-1.7b-voicedesign-Q4_K_M.gguf
+    - filename: qwen3-tts-cpp-1.7b-voicedesign-q4/qwen-tokenizer-12hz-Q4_K_M.gguf
+      sha256: cf3788b4d50aaa665fb6e57c170396aae03a3555fea52d2b5d0cda902d658039
+      uri: huggingface://Serveurperso/Qwen3-TTS-GGUF/qwen-tokenizer-12hz-Q4_K_M.gguf
 - name: omnivoice-cpp
  url: github:mudler/LocalAI/gallery/virtual.yaml@master
  urls:
@@ -3402,39 +3631,6 @@
    - filename: omnivoice-cpp-hq/omnivoice-tokenizer-BF16.gguf
      sha256: c2179e4cf528b19fea22a5be94c34c083877bb5fc28ac0245d2b4299a262dcec
      uri: huggingface://Serveurperso/OmniVoice-GGUF/omnivoice-tokenizer-BF16.gguf
-- name: qwen3-tts-cpp-customvoice
-  url: github:mudler/LocalAI/gallery/virtual.yaml@master
-  urls:
-    - https://huggingface.co/endo5501/qwen3-tts.cpp
-    - https://github.com/predict-woo/qwen3-tts.cpp
-  description: |
-    Qwen3-TTS 0.6B Custom Voice (C++ / GGML) — text-to-speech with voice cloning support.
-    Generates 24kHz mono audio with optional reference audio for voice cloning via ECAPA-TDNN speaker embeddings.
-    Supports 10 languages (en, zh, ja, ko, de, fr, es, it, pt, ru).
-  license: apache-2.0
-  icon: https://huggingface.co/avatars/c299494fd1e72375832499c75b3425d6.svg
-  tags:
-    - tts
-    - text-to-speech
-    - voice-cloning
-    - qwen3-tts
-    - qwen3-tts-cpp
-    - gguf
-  last_checked: "2026-04-30"
-  overrides:
-    backend: qwen3-tts-cpp
-    known_usecases:
-      - tts
-    name: qwen3-tts-cpp-customvoice
-    parameters:
-      model: qwen3-tts-cpp-customvoice
-  files:
-    - filename: qwen3-tts-cpp-customvoice/qwen3-tts-0.6b-customvoice-f16.gguf
-      sha256: 40b985b71be0970d41eb042488766db556cf17290aa1cff631cabfa0bd3b0431
-      uri: huggingface://endo5501/qwen3-tts.cpp/qwen3-tts-0.6b-customvoice-f16.gguf
-    - filename: qwen3-tts-cpp-customvoice/qwen3-tts-tokenizer-f16.gguf
-      sha256: d1ad9660bd99343f4851d5a4b17e31f65648feb3559f6ea062ae6575e5cd9d90
-      uri: huggingface://endo5501/qwen3-tts.cpp/qwen3-tts-tokenizer-f16.gguf
 - name: qwen3-coder-next-mxfp4_moe
  url: github:mudler/LocalAI/gallery/virtual.yaml@master
  urls: