mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-07 07:24:44 -04:00
feat(musicgen): add ace-step and UI interface (#8396)
* feat(musicgen): add ace-step and UI interface
* Correctly handle model dir
* Drop auto-download
* Fixups
* Add to models, fixup UIs icons
* fixups
* Update docs
* l4t13 is incompatible
* avoid pinning version for cuda12
* Drop l4t12

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
committed by GitHub
parent 6dbcdb0b9e
commit 53276d28e7
@@ -7,9 +7,12 @@ url = "/features/audio-to-text/"

Audio-to-text models generate text from an audio file.

The transcription endpoint converts audio files to text. The endpoint is based
on [whisper.cpp](https://github.com/ggerganov/whisper.cpp), a C++ library for audio transcription. The endpoint input
supports all the audio formats supported by `ffmpeg`.

The transcription endpoint converts audio files to text. The endpoint supports multiple backends:
- **[whisper.cpp](https://github.com/ggerganov/whisper.cpp)**: A C++ library for audio transcription (default)
- **moonshine**: Ultra-fast transcription engine optimized for low-end devices
- **faster-whisper**: Fast Whisper implementation with CTranslate2

The endpoint input supports all the audio formats supported by `ffmpeg`.
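
As a rough illustration of how a client might hand an audio file to the transcription endpoint, the sketch below builds a multipart upload with the Python standard library. The path `/v1/audio/transcriptions`, the form fields `model` and `file`, the address `localhost:8080`, and the model name `whisper-1` are assumptions that do not appear in this hunk; adjust them to match your installation. The function only constructs the request and does not send it.

```python
import uuid
import urllib.request


def build_transcription_request(audio_bytes, filename, model,
                                base_url="http://localhost:8080"):
    """Build (without sending) a multipart/form-data POST carrying an audio file."""
    boundary = uuid.uuid4().hex
    # Two form parts: the model name, then the raw audio bytes.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",  # assumed endpoint path
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

To send it against a running server, pass the returned object to `urllib.request.urlopen` and read the response body.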
## Usage
@@ -126,6 +126,64 @@ curl --request POST \

Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.

### ACE-Step

[ACE-Step 1.5](https://github.com/ACE-Step/ACE-Step-1.5) is a music generation model that can create music from text descriptions, lyrics, or audio samples. It supports both simple text-to-music and advanced music generation with metadata like BPM, key scale, and time signature.

#### Setup

Install the `ace-step-turbo` model from the Model gallery or run `local-ai run models install ace-step-turbo`.

#### Usage

ACE-Step supports two modes: **Simple mode** (text description + vocal language) and **Advanced mode** (caption, lyrics, BPM, key, and more).

**Simple mode:**
```bash
curl http://localhost:8080/v1/audio/speech -H "Content-Type: application/json" -d '{
  "model": "ace-step-turbo",
  "input": "A soft Bengali love song for a quiet evening",
  "vocal_language": "bn"
}' --output music.flac
```
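
The same simple-mode call can be sketched in Python with only the standard library. The URL, field names, and output file name mirror the curl example above; the helper names `build_simple_request` and `generate_to_file` are hypothetical, and the sketch assumes the server returns the audio bytes directly in the response body, as `--output music.flac` suggests. Building the request is kept separate from sending it, so the payload can be inspected without a running server; calling `generate_to_file(build_simple_request("A soft Bengali love song for a quiet evening", "bn"))` would perform the actual generation.

```python
import json
import urllib.request


def build_simple_request(description, vocal_language,
                         model="ace-step-turbo",
                         base_url="http://localhost:8080"):
    """Build (without sending) the JSON POST used by the simple-mode curl call."""
    payload = {
        "model": model,
        "input": description,
        "vocal_language": vocal_language,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_to_file(request, path="music.flac"):
    """Send the request and write the returned bytes to `path` (needs a running server)."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```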
**Advanced mode** (using the `/sound` endpoint):
```bash
curl http://localhost:8080/sound -H "Content-Type: application/json" -d '{
  "model": "ace-step-turbo",
  "caption": "A funky Japanese disco track",
  "lyrics": "[Verse 1]\n...",
  "bpm": 120,
  "keyscale": "Ab major",
  "language": "ja",
  "duration_seconds": 225
}' --output music.flac
```
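
Because advanced mode carries several metadata fields, a small typed container can help keep request bodies consistent. The sketch below is one possible shape, not part of LocalAI: the class name `SoundRequest` is hypothetical, the field names are copied from the curl example above, and treating everything except `model` and `caption` as optional is an assumption about what the server accepts.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SoundRequest:
    """Advanced-mode fields from the example; optionality here is an assumption."""
    model: str
    caption: str
    lyrics: Optional[str] = None
    bpm: Optional[int] = None
    keyscale: Optional[str] = None
    language: Optional[str] = None
    duration_seconds: Optional[float] = None

    def to_json(self) -> str:
        # Drop unset fields so the body contains only what was specified.
        body = {key: value for key, value in asdict(self).items() if value is not None}
        return json.dumps(body, ensure_ascii=False)
```

The string returned by `to_json()` can be sent as the body of a POST to `/sound` with the `Content-Type: application/json` header, in the same way as the curl call.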
#### Configuration

You can configure ACE-Step models with various options:

```yaml
name: ace-step-turbo
backend: ace-step
parameters:
  model: acestep-v15-turbo
known_usecases:
  - sound_generation
  - tts
options:
  - "device:auto"
  - "use_flash_attention:true"
  - "init_lm:true" # Enable LLM for enhanced generation
  - "lm_model_path:acestep-5Hz-lm-0.6B" # or acestep-5Hz-lm-4B
  - "lm_backend:pt" # or vllm
  - "temperature:0.85"
  - "top_p:0.9"
  - "inference_steps:8"
  - "guidance_scale:7.0"
```
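
Each entry under `options` is a single `key:value` string. How the backend itself parses these strings is not shown here; the sketch below is only an illustration of reading such a list into a dictionary, splitting on the first colon so that values containing further colons stay intact. The function name `parse_options` is hypothetical.

```python
def parse_options(options):
    """Turn ["key:value", ...] into a dict, splitting on the first colon only."""
    parsed = {}
    for entry in options:
        key, separator, value = entry.partition(":")
        if not separator:
            raise ValueError(f"expected 'key:value', got {entry!r}")
        # Values stay strings; any type conversion is left to the caller.
        parsed[key.strip()] = value.strip()
    return parsed
```

For instance, `parse_options(["device:auto", "inference_steps:8"])` yields a dictionary with the string values `"auto"` and `"8"`.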
### VibeVoice

[VibeVoice-Realtime](https://github.com/microsoft/VibeVoice) is a real-time text-to-speech model that generates natural-sounding speech with voice cloning capabilities.