feat(musicgen): add ace-step and UI interface (#8396)

* feat(musicgen): add ace-step and UI interface

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Correctly handle model dir

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop auto-download

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add to models, fixup UIs icons

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* l4t13 is incompatible

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* avoid pinning version for cuda12

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop l4t12

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Commit 53276d28e7 by Ettore Di Giacinto, committed via GitHub on 2026-02-05 12:04:53 +01:00
Parent: 6dbcdb0b9e
38 changed files with 1661 additions and 23 deletions


@@ -7,9 +7,12 @@ url = "/features/audio-to-text/"
Audio to text models are models that can generate text from an audio file.
The transcription endpoint allows you to convert audio files to text. The endpoint supports multiple backends:

- **[whisper.cpp](https://github.com/ggerganov/whisper.cpp)**: a C++ library for audio transcription (default)
- **moonshine**: an ultra-fast transcription engine optimized for low-end devices
- **faster-whisper**: a fast Whisper implementation built on CTranslate2

The endpoint input supports all the audio formats supported by `ffmpeg`.
## Usage
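As an illustration of how a client might assemble a call to the OpenAI-compatible transcription endpoint, here is a minimal sketch. The helper function, the model name, and the file name are hypothetical and shown only to make the request shape concrete:

```python
def build_transcription_request(model: str, file_path: str,
                                base_url: str = "http://localhost:8080") -> dict:
    """Assemble the pieces of a multipart request for the
    /v1/audio/transcriptions endpoint (hypothetical helper, for illustration).

    The audio file is sent as the "file" form part; "model" selects which
    configured backend model (whisper.cpp, moonshine, or faster-whisper
    based) handles the transcription."""
    return {
        "url": f"{base_url}/v1/audio/transcriptions",
        "form": {"model": model},
        "file": file_path,  # any ffmpeg-supported audio format
    }

req = build_transcription_request("whisper-1", "sample.ogg")
```

The resulting pieces map directly onto a `curl -F` multipart invocation against the same URL.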


@@ -126,6 +126,64 @@ curl --request POST \
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

### ACE-Step

[ACE-Step 1.5](https://github.com/ACE-Step/ACE-Step-1.5) is a music generation model that can create music from text descriptions, lyrics, or audio samples. It supports both simple text-to-music generation and advanced generation with metadata such as BPM, key scale, and time signature.

#### Setup

Install the `ace-step-turbo` model from the model gallery, or run `local-ai models install ace-step-turbo`.

#### Usage

ACE-Step supports two modes: **Simple mode** (text description + vocal language) and **Advanced mode** (caption, lyrics, BPM, key, and more).

**Simple mode:**
```bash
curl http://localhost:8080/v1/audio/speech -H "Content-Type: application/json" -d '{
"model": "ace-step-turbo",
"input": "A soft Bengali love song for a quiet evening",
"vocal_language": "bn"
}' --output music.flac
```

**Advanced mode** (using the `/sound` endpoint):
```bash
curl http://localhost:8080/sound -H "Content-Type: application/json" -d '{
"model": "ace-step-turbo",
"caption": "A funky Japanese disco track",
"lyrics": "[Verse 1]\n...",
"bpm": 120,
"keyscale": "Ab major",
"language": "ja",
"duration_seconds": 225
}' --output music.flac
```
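The advanced-mode fields can also be assembled programmatically. The sketch below is a hypothetical client-side helper, assuming only the field names shown in the curl example above; it performs light sanity checks before serializing the request body:

```python
import json

def build_advanced_payload(model: str, caption: str, lyrics: str = "",
                           bpm: int = 120, keyscale: str = "C major",
                           language: str = "en",
                           duration_seconds: int = 60) -> str:
    """Hypothetical helper mirroring the /sound example's JSON fields."""
    if not 20 <= bpm <= 300:
        raise ValueError(f"implausible BPM: {bpm}")
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    payload = {
        "model": model,
        "caption": caption,
        "lyrics": lyrics,
        "bpm": bpm,
        "keyscale": keyscale,
        "language": language,
        "duration_seconds": duration_seconds,
    }
    return json.dumps(payload)

body = build_advanced_payload("ace-step-turbo", "A funky Japanese disco track",
                              lyrics="[Verse 1]\n...", bpm=120,
                              keyscale="Ab major", language="ja",
                              duration_seconds=225)
```

The returned string can be posted as-is with `curl -d "$body" http://localhost:8080/sound`.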
#### Configuration

You can configure ACE-Step models with various options:

```yaml
name: ace-step-turbo
backend: ace-step
parameters:
model: acestep-v15-turbo
known_usecases:
- sound_generation
- tts
options:
- "device:auto"
- "use_flash_attention:true"
- "init_lm:true" # Enable LLM for enhanced generation
- "lm_model_path:acestep-5Hz-lm-0.6B" # or acestep-5Hz-lm-4B
- "lm_backend:pt" # or vllm
- "temperature:0.85"
- "top_p:0.9"
- "inference_steps:8"
- "guidance_scale:7.0"
```
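The `options` entries above follow a `key:value` string convention. As an illustration only (this is a sketch, not LocalAI's actual option parser), such entries can be split into a dictionary like this:

```python
def parse_backend_options(options: list[str]) -> dict[str, str]:
    """Split 'key:value' option strings into a dict (illustrative sketch).
    Only the first colon separates key from value, so values that
    themselves contain ':' survive intact."""
    parsed = {}
    for entry in options:
        key, _, value = entry.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed

opts = parse_backend_options([
    "device:auto",
    "use_flash_attention:true",
    "inference_steps:8",
    "guidance_scale:7.0",
])
```

Note that all values arrive as strings; a real consumer would coerce booleans and numbers as needed.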
### VibeVoice

[VibeVoice-Realtime](https://github.com/microsoft/VibeVoice) is a real-time text-to-speech model that generates natural-sounding speech with voice-cloning capabilities.