mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-20 14:46:38 -04:00
Compare commits: v4.2.0...dependabot (74 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 97c31c524b | | |
| | cb502de309 | | |
| | 5d0b549049 | | |
| | 11cff1b309 | | |
| | 4ca3d2cdc0 | | |
| | 3cba35ed32 | | |
| | 265ae35231 | | |
| | 6a48157a80 | | |
| | 41c838b2df | | |
| | 21e793ad2a | | |
| | 7c190bb4b9 | | |
| | d77a9137d8 | | |
| | 661a0c3b9d | | |
| | 00b8989886 | | |
| | 43e0d397ca | | |
| | a1a7a219ed | | |
| | 3937ec6527 | | |
| | 1355b55794 | | |
| | 5a2626d465 | | |
| | a39591f144 | | |
| | 8c785dbe4a | | |
| | 4abf5befbb | | |
| | 195b910260 | | |
| | ba21bf667c | | |
| | 7bd1693ad0 | | |
| | b5ac3a7373 | | |
| | 53de474ef5 | | |
| | c33d36b870 | | |
| | 57fa178a64 | | |
| | 745473cbe6 | | |
| | 594c9fd92e | | |
| | 8af963bdd9 | | |
| | 6e1dbae256 | | |
| | 53bdb18d10 | | |
| | 42a8db3573 | | |
| | 0353d3bd77 | | |
| | ec49995190 | | |
| | 67c34bbb96 | | |
| | 4430fae779 | | |
| | ab01ed1a3e | | |
| | 6bfe7f8c05 | | |
| | 5a42dbf3ec | | |
| | c2fe0a6475 | | |
| | ddbbdf45b9 | | |
| | b4fdb41dcc | | |
| | 0245b33eab | | |
| | a2940e5d47 | | |
| | a645c1f4aa | | |
| | 957619af53 | | |
| | ad0ab37230 | | |
| | 0b81e36504 | | |
| | 602866a9d8 | | |
| | 8521af145f | | |
| | bc4cd3dd85 | | |
| | 86a7f6c9fa | | |
| | a57e73691d | | |
| | a689100d61 | | |
| | 03815e3b59 | | |
| | 37991c8a18 | | |
| | 61c9b187fa | | |
| | c66014312e | | |
| | abc2a51641 | | |
| | cd7d163178 | | |
| | 7aac599deb | | |
| | d75173dd2a | | |
| | 9be5310394 | | |
| | cdf50fd723 | | |
| | bc3fb16105 | | |
| | 78722caedc | | |
| | 621c612b2d | | |
| | e3f9de1026 | | |
| | d892e4af80 | | |
| | 5d0f732b16 | | |
| | ea00199554 | | |
@@ -34,7 +34,55 @@ The build matrix is data-only YAML at `.github/backend-matrix.yml` (not inside `

**Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.

If you add a new language bucket, `scripts/changed-backends.js` also needs a branch in `inferBackendPath` so PR change-detection routes file edits correctly.
**`scripts/changed-backends.js` registration — REQUIRED for any new dockerfile suffix.** This is the single most common omission, because it has no effect on the PR that adds the backend (when no prior path filter could catch it anyway) — it only breaks the *next* PR that touches your backend's directory, which then gets zero CI jobs and looks broken for unrelated reasons. Edit `scripts/changed-backends.js:inferBackendPath` and add a branch BEFORE the more-generic suffixes:

```js
if (item.dockerfile.endsWith("<your-dockerfile-suffix>")) {
  return `backend/cpp/<your-backend>/`; // or backend/python|go|rust/...
}
```

The `endsWith()` test is against the matrix entry's `dockerfile:` value (e.g. `./backend/Dockerfile.ds4` → `endsWith("ds4")`). Specificity order matters here just like it does for importers: more-specific suffixes go BEFORE more-generic ones (e.g. `ds4` before `llama-cpp` even though both end with letters, because some upstream might one day call itself `super-ds4-llama-cpp`). Verify locally before pushing:

```bash
# Confirm your dockerfile suffix is unique enough
node -e "
  const yaml = require('js-yaml'); const fs = require('fs');
  const m = yaml.load(fs.readFileSync('.github/backend-matrix.yml','utf8'));
  for (const e of m.include.filter(e => e.backend === '<your-backend>')) {
    console.log(e.dockerfile, '->', e.dockerfile.endsWith('<suffix>'));
  }"
```

A quick way to find the right insertion point: `grep -n 'item.dockerfile.endsWith' scripts/changed-backends.js`.
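
The first-match-wins ordering can be illustrated with a shell `case` statement, which evaluates its patterns top to bottom in the same way. This is only an analogue of the JavaScript in `scripts/changed-backends.js`, not a copy of it. The `ik-llama-cpp` suffix and the returned paths are assumptions, chosen because a name ending in `ik-llama-cpp` also ends in `llama-cpp`.

```bash
#!/usr/bin/env bash
# Illustrative analogue of suffix routing; not the real inferBackendPath.
# Suffixes and paths below are assumptions made for the example.
infer_backend_path() {
  case "$1" in
    *ik-llama-cpp) echo "backend/cpp/ik-llama-cpp/" ;; # more specific, so it must come first
    *llama-cpp)    echo "backend/cpp/llama-cpp/" ;;    # generic; also matches the name above
    *ds4)          echo "backend/cpp/ds4/" ;;
    *)             echo "unknown" ;;
  esac
}

infer_backend_path "./backend/Dockerfile.ik-llama-cpp"
infer_backend_path "./backend/Dockerfile.llama-cpp"
infer_backend_path "./backend/Dockerfile.ds4"
```

If the two `llama-cpp` patterns were swapped, the first call would be routed to `backend/cpp/llama-cpp/`, which is the silent misrouting the ordering rule is meant to prevent.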

**`bump_deps.yaml` registration — REQUIRED for any backend pinning an upstream commit.** If your backend's Makefile has a `*_VERSION?=<sha>` pin to a third-party repo, the daily auto-bump bot at `.github/workflows/bump_deps.yaml` won't notice it unless you register the backend in its matrix. The bot runs `.github/bump_deps.sh` which `grep`s for `^$VAR?=` in the Makefile you list — so the pin MUST live in the Makefile (not in a separate shell script). The bump for ds4 (#9761) had to walk this back because the original landed the pin in `prepare.sh`, which the bot can't see. Pattern (for `antirez/ds4`):

```yaml
# .github/workflows/bump_deps.yaml
matrix:
  include:
    - repository: "antirez/ds4"
      variable: "DS4_VERSION"
      branch: "main"
      file: "backend/cpp/ds4/Makefile"
```

And the corresponding Makefile shape (mirror `backend/cpp/llama-cpp/Makefile`):

```makefile
DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
DS4_REPO?=https://github.com/antirez/ds4
...
ds4:
	mkdir -p ds4
	cd ds4 && git init -q && \
	    git remote add origin $(DS4_REPO) && \
	    git fetch --depth 1 origin $(DS4_VERSION) && \
	    git checkout FETCH_HEAD
```

If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in the Makefile target so `make purge && make` works as a clean-and-rebuild and so the bump bot finds the pin.
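
A minimal sketch of why the pin has to live in the Makefile, assuming the bot's lookup amounts to an anchored grep for `VAR?=` as described above. `find_pin` is a hypothetical helper written for this example, not a function from `.github/bump_deps.sh`, and the two files are throwaway fixtures.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of a `VAR?=value` pin, or fail if absent.
find_pin() {
  local var="$1" file="$2" line
  # Anchor at line start and require the literal "?=" Makefile assignment.
  line="$(grep -m1 -E "^${var}\?=" "$file")" || return 1
  printf '%s\n' "${line#*=}"
}

workdir="$(mktemp -d)"
printf 'DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\n' > "$workdir/Makefile"
# A shell script would use plain "=", which the anchored pattern does not match.
printf 'DS4_VERSION=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0\n' > "$workdir/prepare.sh"

find_pin DS4_VERSION "$workdir/Makefile"
find_pin DS4_VERSION "$workdir/prepare.sh" || echo "no pin found in prepare.sh"
```

The second lookup fails because a shell assignment has no `?=`, which matches the failure mode reported for the original ds4 pin in `prepare.sh`.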

**Placement in file:**
- CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
@@ -64,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look

Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.

**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.

## 4. Update the Makefile

The Makefile needs to be updated in several places to support building and testing the new backend:

@@ -284,7 +284,17 @@ Also bump the expected-length count in `api_instructions_test.go` and add the na

### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)

If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), also declare the matching symbol in `core/http/react-ui/src/utils/capabilities.js`:
If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:

- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
- `GuessUsecases()` branch listing the backends that own this capability
- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
- `core/http/react-ui/src/utils/capabilities.js`:

```js
export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
120
.agents/backend-signing.md
Normal file
@@ -0,0 +1,120 @@
# Backend image signing & verification

LocalAI verifies backend OCI images against a per-gallery keyless-cosign
policy. This page documents the trust model, the producer side
(`.github/workflows/backend_merge.yml` in this repo), and the consumer
side (`pkg/oci/cosignverify` plus the gallery YAML).

## Trust model

- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
  manifest list with `cosign sign --recursive` in keyless mode after
  `docker buildx imagetools create`. The signing cert is issued by
  Fulcio bound to the workflow's OIDC identity. There is no long-lived
  signing key. `--recursive` signs both the manifest list and every
  per-arch entry — needed because our consumer resolves a tag to a
  per-arch manifest before checking signatures.
- **Storage:** Signatures are written as OCI 1.1 referrers
  (`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
  (`--new-bundle-format`). No `:sha256-<hex>.sig` tag clutter.
- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
  referrers API, hands it to `sigstore-go`, and verifies it against the
  policy declared in the gallery YAML (`Gallery.Verification`).
- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
  validity), so revocation is policy-side, not CA-side. The gallery's
  `verification.not_before` (RFC3339) is the kill-switch — advance it to
  invalidate every signature produced before a known compromise window.

## Producer setup

`backend_merge.yml` is the workflow that joins per-arch digests into the
multi-arch manifest list users actually pull, so it's also the right place
to sign. The job needs:

- `permissions: { id-token: write, contents: read }` at the job level so
  the runner can exchange its GitHub OIDC token for a Fulcio cert.
- `sigstore/cosign-installer@v3` step (cosign ≥ 2.2 for
  `--new-bundle-format`).
- After each `docker buildx imagetools create`, resolve the resulting
  list digest with `docker buildx imagetools inspect <tag> --format
  '{{.Manifest.Digest}}'` and sign:

  ```sh
  cosign sign --yes --recursive \
    --new-bundle-format \
    --registry-referrers-mode=oci-1-1 \
    "${REGISTRY_REPO}@${DIGEST}"
  ```

  Sign by digest, never by tag — signing by tag binds the signature to
  whatever the tag points at *now*, and a subsequent tag push orphans it.

`backend_build_darwin.yml` builds and pushes single-arch darwin images
that bypass the manifest-list merge. If/when those entries get a gallery
`verification:` policy, the equivalent cosign step has to land there
too.

## Consumer setup (in `mudler/LocalAI` gallery YAML)

Once CI is signing, add a `verification:` block to the backend gallery
entry (`backend/index.yaml`):

```yaml
- name: localai
  url: github:mudler/LocalAI/backend/index.yaml@master
  verification:
    issuer: "https://token.actions.githubusercontent.com"
    identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
    # Optional revocation cutoff; advance during incident response.
    # not_before: "2026-06-01T00:00:00Z"
```

Identity matching pins the OIDC subject Fulcio issued the signing cert
to. Without this, any image signed by *anyone* with a Fulcio cert would
pass — the regex is what makes a signature mean "produced by our CI".
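
The effect of the pinned identity can be checked in isolation with `grep -E`. This is a sketch only: the real match happens inside `pkg/oci/cosignverify` via `sigstore-go`, and it is an assumption that this simple pattern behaves the same under POSIX extended regex syntax. `matches_policy` and the rejected sample identities are hypothetical. The backslashes are single here because the doubled form in the gallery entry is YAML double-quote escaping.

```bash
#!/usr/bin/env bash
# Same pattern as the gallery entry, written for a single-quoted shell string.
identity_regex='^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@refs/heads/master$'

# Hypothetical helper: succeed only when the certificate identity matches the policy.
matches_policy() {
  printf '%s\n' "$1" | grep -Eq "$identity_regex"
}

matches_policy "https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master" \
  && echo "accepted: signed by the pinned workflow on master"
matches_policy "https://github.com/someone-else/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master" \
  || echo "rejected: different repository owner"
matches_policy "https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/feature" \
  || echo "rejected: not the master ref"
```

Anchoring with `^` and `$` is what keeps a fork or a non-master ref from satisfying the policy through a substring match.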

## Strict mode

Default behaviour: OCI backends without a `verification:` block install
with a warning (logs include `installing OCI backend without signature
verification`). Tarball/HTTP backends without a `sha256` field log a
similar warning.

For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
`--require-backend-integrity` to `local-ai run` / `local-ai backends
install` / `local-ai models install`). The warning becomes a hard error
and unverifiable backends refuse to install.

## Revocation playbook

If `backend_merge.yml` (or any workflow with `id-token: write`) is
compromised and we've shipped malicious signed images:

1. **Identify the compromise window.** Find the earliest IntegratedTime
   from the bad signatures (Rekor search by `subject` filter).
2. **Set `verification.not_before`** in `backend/index.yaml` to a
   timestamp just *after* that window's start.
3. **Push the YAML.** Deployed LocalAI instances pick it up on next
   gallery refresh (1-hour cache in `core/gallery/gallery.go`).
4. **Fix the underlying compromise** in the workflow and re-sign images
   with the new build, which will have IntegratedTime > `not_before`.
5. **Optional:** for absolute decisiveness, also rotate to a new
   workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
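
The cutoff logic in steps 2 and 4 can be sketched as a timestamp comparison. This assumes both values are UTC RFC3339 strings in the identical `YYYY-MM-DDTHH:MM:SSZ` shape, so that byte order equals chronological order. `signature_still_trusted` is a hypothetical helper, not the Go code in `pkg/oci/cosignverify`, and accepting a signature whose time equals the cutoff is an assumption made here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when integrated_time is at or after not_before.
signature_still_trusted() {
  local integrated="$1" not_before="$2" earliest
  # With identical UTC formats, the byte-wise smaller string is the earlier time.
  earliest="$(printf '%s\n%s\n' "$not_before" "$integrated" | LC_ALL=C sort | head -n 1)"
  [ "$earliest" = "$not_before" ]
}

not_before="2026-06-01T00:00:00Z"
signature_still_trusted "2026-06-02T10:00:00Z" "$not_before" \
  && echo "accepted: re-signed after the cutoff"
signature_still_trusted "2026-05-30T10:00:00Z" "$not_before" \
  || echo "rejected: signed before not_before"
```

Timestamps carrying numeric offsets or fractional seconds would need real date parsing; the string comparison is only valid under the stated format assumption.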

## Where the code lives

- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
- `pkg/downloader/uri.go` — `WithImageVerifier` option threaded through `DownloadFileWithContext`.
- `core/gallery/backends.go` — `backendDownloadOptions` builds the verifier from the gallery's policy.
- `core/config/gallery.go` — `Gallery.Verification` YAML schema.
- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go` — `--require-backend-integrity` flag propagation.
- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.

## Out of scope (follow-ups)

- **Signing the gallery YAML itself.** The index is fetched over HTTPS
  from GitHub; we trust the host. A cosign blob signature on the YAML
  would close that gap but adds key-management overhead. Revisit this
  page if/when added.
- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
  for now non-OCI backends keep using the `sha256:` field in YAML.
84
.agents/ds4-backend.md
Normal file
@@ -0,0 +1,84 @@
# Working on the ds4 Backend

`antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
`backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.

## Pin

`backend/cpp/ds4/Makefile` pins `DS4_VERSION?=<sha>` at the top. The `ds4`
target in the Makefile clones `antirez/ds4` at that commit (mirroring the
llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
(`.github/workflows/bump_deps.yaml`) finds this pin via grep and opens a
daily PR to update it. To bump manually: edit the `DS4_VERSION?=` line,
then `make purge && make` (or rely on CI's clean build).

## Wire shape

| RPC | Implementation |
|---|---|
| Health, Free, Status | Trivial; no engine dependency for Health |
| LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
| TokenizeString | `ds4_tokenize_text` |
| Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
| PredictStream | Same, per-token ChatDelta writes |

## DSML

ds4 emits tool calls as literal text markers (`<|DSML|tool_calls>` etc.) -
NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
OpenAI tool_calls + role=tool messages back into DSML for the next turn.

## Thinking modes

`PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
`["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
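
The `reasoning_effort` mapping can be written out as a small lookup. This is a shell sketch of the rule stated above, not the C++ in `backend/cpp/ds4/`; `select_think_mode` is a hypothetical name, and the sketch deliberately leaves out the `enable_thinking` gate because the value used when thinking is off is not given here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a reasoning_effort string to the ds4 thinking mode name.
select_think_mode() {
  case "$1" in
    max|xhigh) echo "DS4_THINK_MAX" ;;  # the only two values that select MAX
    *)         echo "DS4_THINK_HIGH" ;; # everything else, including an empty value
  esac
}

select_think_mode "xhigh"
select_think_mode "medium"
select_think_mode ""
```

Because the fallback branch catches every unrecognised string, a typo in `reasoning_effort` degrades to `DS4_THINK_HIGH` rather than failing the request.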

## Disk KV cache

`kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
`ds4_session_save_payload` / `ds4_session_load_payload` API. Enable per request
via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).

## Build matrix

| Build | Where | Notes |
|---|---|---|
| `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
| `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
| `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |

cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.

## Hardware-gated validation

`tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:

```
BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
  BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
  BACKEND_TEST_CAPS=health,load,predict,stream,tools \
  BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
  go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
```

CI does not load the model; the suite is opt-in via env vars.

## Importer

`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
matching the `antirez/deepseek-v4-gguf` repo URI or the
`DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
`LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
specific, and first-match-wins. The importer emits `backend: ds4`, uses
`ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
disables the Go-side automatic tool-parsing fallback (the C++ backend emits
ChatDelta.tool_calls natively via `DsmlParser`).

ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
slice so the `/import-model` UI surfaces it as a manual choice for users who
want to force the backend on a non-canonical URI.
@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
- `reasoning_format` - Reasoning format options
- Any new flags or parameters

### Speculative Decoding Types

The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.

`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.

### Implementation Guidelines

1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation

151
.github/backend-matrix.yml
vendored
@@ -278,6 +278,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "12"
    cuda-minor-version: "8"
@@ -389,7 +402,12 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
    runs-on: 'ubuntu-latest'
    # bigger-runner: cold builds for this entry consistently take 5h+ on
    # ubuntu-latest (observed 5h36m on v4.2.1). Move back to bigger-runner
    # so the build finishes well within GHA's 6h job timeout. Phase 5.3 of
    # the free-tier migration (PR #9730) flipped this to ubuntu-latest as
    # a 'highest-risk batch' with explicit per-entry revert.
    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp"
@@ -403,7 +421,9 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
    runs-on: 'ubuntu-latest'
    # bigger-runner: same rationale as -gpu-nvidia-cuda-12-llama-cpp above
    # (observed 6h5m wall-clock on v4.2.1, just past the 6h job timeout).
    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "turboquant"
@@ -801,6 +821,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -899,7 +932,9 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
    runs-on: 'ubuntu-latest'
    # bigger-runner: cold builds for this entry take 5h+ on ubuntu-latest
    # (observed 5h37m on v4.2.1). Same rationale as the cuda-12 variant.
    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "llama-cpp"
@@ -913,7 +948,8 @@ include:
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
    runs-on: 'ubuntu-latest'
    # bigger-runner: observed 6h5m wall-clock on v4.2.1 — at the GHA timeout.
    runs-on: 'bigger-runner'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "turboquant"
@@ -948,6 +984,32 @@ include:
    backend: "turboquant"
    dockerfile: "./backend/Dockerfile.turboquant"
    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-nvidia-cuda-13-ds4'
    runs-on: 'ubuntu-latest'
    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
    skip-drivers: 'true'
    backend: "ds4"
    dockerfile: "./backend/Dockerfile.ds4"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    skip-drivers: 'true'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-ds4'
    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
    runs-on: 'ubuntu-24.04-arm'
    ubuntu-version: '2404'
    backend: "ds4"
    dockerfile: "./backend/Dockerfile.ds4"
    context: "./"
  - build-type: 'cublas'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1052,6 +1114,19 @@ include:
    backend: "vibevoice"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
  - build-type: 'l4t'
    cuda-major-version: "13"
    cuda-minor-version: "0"
    platforms: 'linux/arm64'
    tag-latest: 'auto'
    tag-suffix: '-nvidia-l4t-cuda-13-arm64-liquid-audio'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    ubuntu-version: '2404'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
  - build-type: 'l4t'
    cuda-major-version: "13"
    cuda-minor-version: "0"
@@ -1693,6 +1768,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-rocm-hipblas-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'hipblas'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2141,6 +2229,19 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'intel'
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    tag-latest: 'auto'
    tag-suffix: '-gpu-intel-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: 'intel'
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -2321,6 +2422,34 @@ include:
    dockerfile: "./backend/Dockerfile.turboquant"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-ds4'
    runs-on: 'ubuntu-latest'
    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
    skip-drivers: 'true'
    backend: "ds4"
    dockerfile: "./backend/Dockerfile.ds4"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/arm64'
    platform-tag: 'arm64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-ds4'
    runs-on: 'ubuntu-24.04-arm'
    base-image: "nvidia/cuda:13.0.0-devel-ubuntu24.04"
    skip-drivers: 'true'
    backend: "ds4"
    dockerfile: "./backend/Dockerfile.ds4"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
@@ -3439,6 +3568,20 @@ include:
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""
    platforms: 'linux/amd64'
    platform-tag: 'amd64'
    tag-latest: 'auto'
    tag-suffix: '-cpu-liquid-audio'
    runs-on: 'ubuntu-latest'
    base-image: "ubuntu:24.04"
    skip-drivers: 'false'
    backend: "liquid-audio"
    dockerfile: "./backend/Dockerfile.python"
    context: "./"
    ubuntu-version: '2404'
  - build-type: ''
    cuda-major-version: ""
    cuda-minor-version: ""

46
.github/scripts/anchor-digest-in-cache.sh
vendored
Executable file
@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
# garbage collector won't reap the manifest before backend_merge.yml runs.
#
# Context: backend_build.yml pushes by canonical digest only
# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
# anchoring tag, the earliest digests are gone by the time `imagetools create`
# tries to read them, producing "manifest not found" merge failures.
#
# We tag the digest under our internal ci-cache image; quay does not GC tagged
# manifests. The user-facing manifest list still references the original
# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
#
# Required env:
#   GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
#   TAG_SUFFIX - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
#   PLATFORM_TAG - amd64 / arm64 / single (single = singleton matrix entry)
#   DIGEST - canonical content digest from build step (sha256:...)
#
# Optional env:
#   ANCHOR_IMAGE - target image (default: quay.io/go-skynet/ci-cache)
#   SOURCE_IMAGE - source image (default: quay.io/go-skynet/local-ai-backends)
#   GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
set -euo pipefail

: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${PLATFORM_TAG:?}"
: "${DIGEST:?}"

anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"

tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"

docker buildx imagetools create \
  -t "${anchor_image}:${tag}" \
  "${source_image}@${DIGEST}"

echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
  echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
fi
49
.github/scripts/cleanup-keepalive-tags.sh
vendored
Executable file
@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Best-effort cleanup of the keepalive anchor tags written by
# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
# user-facing manifest list has been published.
#
# Quay's docker registry v2 doesn't allow tag deletes, only digest deletes.
# The proper delete is the quay REST API, which requires an OAuth-scoped
# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
# token (typical for service accounts) the delete succeeds; otherwise this
# is a soft no-op and the tag persists until manually pruned.
#
# Cleanup failure MUST NOT fail the merge: the merge has already produced
# the user-facing manifest list at this point and the keepalive tags are
# pure overhead. We always exit 0.
#
# Required env:
#   GITHUB_RUN_ID  - current workflow run id (set automatically by GHA)
#   TAG_SUFFIX     - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
#   QUAY_TOKEN     - bearer token for quay's REST API
#
# Optional env:
#   QUAY_REPO      - target repo (default: go-skynet/ci-cache)
#   PLATFORM_TAGS  - space-separated list of platform-tag values to try
#                    (default: "amd64 arm64 single")
#                    We don't know which platform-tag(s) exist for this
#                    tag-suffix without an extra API call, so we just try
#                    all three and ignore 404s for the ones that don't.
set -uo pipefail

: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${QUAY_TOKEN:?}"

quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"

for plat in $platform_tags; do
  tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
  url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
  http=$(curl -sS -o /dev/null -w '%{http_code}' \
    -X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
  case "$http" in
    204|200) echo "deleted $tag" ;;
    404) echo "not present: $tag" ;;
    401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
    *) echo "unexpected http $http deleting $tag - skipping" ;;
  esac
done
exit 0
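The cleanup script treats every HTTP outcome as non-fatal and only varies the log message. The sketch below isolates that classification so it can be exercised without a network call or a token; `classify_delete_status` is a hypothetical function introduced here for illustration, and its short labels are assumptions rather than the script's actual log lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# classify_delete_status HTTP_CODE
# Hypothetical helper (not part of the repository): maps the status code of
# a tag DELETE request to the four outcomes the cleanup script distinguishes.
classify_delete_status() {
  case "$1" in
    200|204) echo "deleted" ;;          # tag removed
    404)     echo "not-present" ;;      # this platform tag was never written
    401|403) echo "auth-not-scoped" ;;  # credential lacks delete permission
    *)       echo "unexpected" ;;       # anything else, including curl failures
  esac
}

# None of these outcomes aborts the loop; each one is only reported.
for code in 204 404 403 000; do
  printf '%s -> %s\n' "$code" "$(classify_delete_status "$code")"
done
```

Keeping every branch non-fatal matches the stated requirement that a cleanup failure must not fail the merge.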
53
.github/workflows/backend.yml
vendored
@@ -35,11 +35,13 @@ jobs:
|
||||
matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
|
||||
matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
|
||||
matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
|
||||
merge-matrix: ${{ steps.set-matrix.outputs['merge-matrix'] }}
|
||||
merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
|
||||
merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
|
||||
has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
|
||||
has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
|
||||
has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
|
||||
has-merges: ${{ steps.set-matrix.outputs['has-merges'] }}
|
||||
has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
|
||||
has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v6
|
||||
@@ -138,15 +140,27 @@ jobs:
|
||||
max-parallel: 8
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
|
||||
|
||||
# Merge per-arch digests into manifest lists. Depends ONLY on
|
||||
# backend-jobs-multiarch — single-arch builds are independent and slow.
|
||||
# Without this split, a 6h CUDA-12 single-arch job would gate the merge,
|
||||
# leaving multi-arch digests untagged on quay long enough for quay's
|
||||
# garbage collector to reap them and the merge step to fail with
|
||||
# "manifest not found".
|
||||
backend-merge-jobs:
|
||||
# Apply tags to per-arch digests via `imagetools create`. Split into two
|
||||
# jobs that mirror the build split so each merge waits ONLY on its
|
||||
# corresponding build matrix:
|
||||
#
|
||||
# - backend-merge-jobs-multiarch needs backend-jobs-multiarch (~2-3h)
|
||||
# - backend-merge-jobs-singlearch needs backend-jobs-singlearch (up to ~6h)
|
||||
#
|
||||
# If a single shared merge job depended on both, slow CUDA singlearch
|
||||
# builds would block multiarch merges long enough for quay's GC to reap
|
||||
# the multiarch per-arch digests (the bug fixed by PR #9746). Singletons
|
||||
# also need a merge step because backend_build.yml pushes by canonical
|
||||
# digest only — no tags are applied at build time.
|
||||
backend-merge-jobs-multiarch:
|
||||
needs: [generate-matrix, backend-jobs-multiarch]
|
||||
if: needs.generate-matrix.outputs['has-merges'] == 'true'
|
||||
# !cancelled() lets the merge run even when a few build legs failed.
|
||||
# Without it, GHA's default `needs:` cascade skips the entire merge
|
||||
# matrix on a single failed/cancelled cell. We still want to publish
|
||||
# the manifest lists for tag-suffixes whose legs all succeeded.
|
||||
# Observed in v4.2.1: 2 singlearch build failures cascade-skipped all
|
||||
# ~199 singlearch merge entries.
|
||||
if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
|
||||
uses: ./.github/workflows/backend_merge.yml
|
||||
with:
|
||||
tag-latest: ${{ matrix.tag-latest }}
|
||||
@@ -158,7 +172,24 @@ jobs:
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix']) }}
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
|
||||
|
||||
backend-merge-jobs-singlearch:
|
||||
needs: [generate-matrix, backend-jobs-singlearch]
|
||||
# See note on backend-merge-jobs-multiarch above for !cancelled().
|
||||
if: ${{ !cancelled() && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
|
||||
uses: ./.github/workflows/backend_merge.yml
|
||||
with:
|
||||
tag-latest: ${{ matrix.tag-latest }}
|
||||
tag-suffix: ${{ matrix.tag-suffix }}
|
||||
secrets:
|
||||
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
|
||||
|
||||
backend-jobs-darwin:
|
||||
needs: generate-matrix
|
||||
|
||||
21
.github/workflows/backend_build.yml
vendored
@@ -228,11 +228,28 @@ jobs:
|
||||
digest="${{ steps.build.outputs.digest }}"
|
||||
touch "/tmp/digests/${digest#sha256:}"
|
||||
|
||||
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
|
||||
# and how it interacts with backend_merge.yml's cleanup step.
|
||||
- name: Anchor digest in ci-cache so quay GC won't reap before merge
|
||||
if: github.event_name != 'pull_request'
|
||||
env:
|
||||
TAG_SUFFIX: ${{ inputs.tag-suffix }}
|
||||
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
|
||||
DIGEST: ${{ steps.build.outputs.digest }}
|
||||
run: .github/scripts/anchor-digest-in-cache.sh
|
||||
|
||||
# Artifact name uses a `--` separator between tag-suffix and platform-tag
|
||||
# to avoid prefix collisions during the merge job's pattern-based download.
|
||||
# Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
|
||||
# prefix of -gpu-nvidia-cuda-12-vllm-omni); a single `-` separator plus the
|
||||
# merge-side `digests<tag-suffix>-*` glob would let one merge over-match
|
||||
# the other backend's artifacts. The `-single` placeholder for empty
|
||||
# platform-tag (single-arch entries) keeps the artifact name non-trailing.
|
||||
- name: Upload digest artifact
|
||||
if: github.event_name != 'pull_request'
|
||||
uses: actions/upload-artifact@v4
|
||||
uses: actions/upload-artifact@v7
|
||||
with:
|
||||
name: digests${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
|
||||
name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
|
||||
path: /tmp/digests/*
|
||||
if-no-files-found: error
|
||||
retention-days: 1
|
||||
|
||||
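The prefix-collision argument for the `--` separator can be checked with plain shell globbing. This is only an approximation: `actions/download-artifact` uses its own glob matcher, and the sketch assumes it behaves like a shell `*` wildcard for these names. `matches` is a hypothetical helper introduced for the illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# matches PATTERN NAME: prints "yes" if NAME matches the shell glob PATTERN.
# Hypothetical helper, used only to illustrate the collision.
matches() {
  local pattern="$1" name="$2"
  # shellcheck disable=SC2254  # the pattern is intentionally unquoted
  case "$name" in
    $pattern) echo yes ;;
    *) echo no ;;
  esac
}

suffix="-gpu-nvidia-cuda-12-vllm"        # the backend being merged
sibling="-gpu-nvidia-cuda-12-vllm-omni"  # a backend whose suffix extends it

# Old naming: single "-" separator, pattern "digests<tag-suffix>-*".
echo "old scheme, sibling artifact: $(matches "digests${suffix}-*" "digests${sibling}-amd64")"
# New naming: "--" separator, pattern "digests<tag-suffix>--*".
echo "new scheme, sibling artifact: $(matches "digests${suffix}--*" "digests${sibling}--amd64")"
echo "new scheme, own artifact:     $(matches "digests${suffix}--*" "digests${suffix}--amd64")"
```

Under the old scheme the sibling's artifact matches the merge pattern; with `--` it no longer does, while the backend's own artifact still matches.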
14
.github/workflows/backend_build_darwin.yml
vendored
@@ -116,6 +116,13 @@ jobs:
|
||||
# already), we don't have to chase missing dylibs one at a time.
|
||||
# The downloads cache makes the reinstall fast (~5s on a hit).
|
||||
brew reinstall ccache
|
||||
# Same pattern for grpc: its CMake config (used by the llama-cpp
|
||||
# `grpc-server` target) does find_package(absl). The cache restores
|
||||
# /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
|
||||
# abseil isn't in our Cellar cache list and never gets installed
|
||||
# alongside, leaving grpc's CMake unable to resolve it. Reinstalling
|
||||
# grpc re-validates and pulls abseil in, mirroring the ccache fix.
|
||||
brew reinstall grpc
|
||||
# The brew cache restores the Cellar dirs but NOT the bin symlinks
|
||||
# at /opt/homebrew/bin/*. brew install above sees the Cellar present
|
||||
# and decides "already installed" without re-linking, so on a cache-
|
||||
@@ -211,8 +218,13 @@ jobs:
|
||||
make protogen-go
|
||||
make backends/llama-cpp-darwin
|
||||
|
||||
- name: Build ds4 backend (Darwin Metal)
|
||||
if: inputs.backend == 'ds4'
|
||||
run: |
|
||||
make backends/ds4-darwin
|
||||
|
||||
- name: Build ${{ inputs.backend }}-darwin
|
||||
if: inputs.backend != 'llama-cpp'
|
||||
if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
|
||||
run: |
|
||||
make protogen-go
|
||||
BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
|
||||
|
||||
102
.github/workflows/backend_merge.yml
vendored
@@ -31,19 +31,48 @@ on:
|
||||
jobs:
|
||||
merge:
|
||||
runs-on: ubuntu-latest
|
||||
# id-token: write is required for keyless cosign — the workflow
|
||||
# exchanges the GitHub OIDC token for a short-lived Fulcio cert that
|
||||
# signs each pushed manifest. Without this permission the runner
|
||||
# cannot mint the token, and `cosign sign` fails with "no token".
|
||||
permissions:
|
||||
contents: read
|
||||
id-token: write
|
||||
env:
|
||||
quay_username: ${{ secrets.quayUsername }}
|
||||
steps:
|
||||
- name: Download digests
|
||||
uses: actions/download-artifact@v4
|
||||
# Sparse checkout: the merge job needs `.github/scripts/` (for the
|
||||
# keepalive cleanup script) but none of the source tree.
|
||||
- name: Checkout (.github/scripts only)
|
||||
uses: actions/checkout@v6
|
||||
with:
|
||||
pattern: digests${{ inputs.tag-suffix }}-*
|
||||
sparse-checkout: |
|
||||
.github/scripts
|
||||
sparse-checkout-cone-mode: false
|
||||
|
||||
# `--` separator anchors the glob so we don't over-match sibling
|
||||
# backends whose tag-suffix happens to be a prefix of ours
|
||||
# (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
|
||||
# upload-artifact name in backend_build.yml.
|
||||
- name: Download digests
|
||||
uses: actions/download-artifact@v8
|
||||
with:
|
||||
pattern: digests${{ inputs.tag-suffix }}--*
|
||||
merge-multiple: true
|
||||
path: /tmp/digests
|
||||
|
||||
- name: Set up Docker Buildx
|
||||
uses: docker/setup-buildx-action@master
|
||||
|
||||
# cosign signs each pushed manifest list with --recursive so the
|
||||
# index and every per-arch entry get an attached Sigstore bundle.
|
||||
# 2.2+ is required for --new-bundle-format.
|
||||
- name: Install cosign
|
||||
if: github.event_name != 'pull_request'
|
||||
uses: sigstore/cosign-installer@v3
|
||||
with:
|
||||
cosign-release: 'v2.4.1'
|
||||
|
||||
- name: Login to DockerHub
|
||||
if: github.event_name != 'pull_request'
|
||||
uses: docker/login-action@v4
|
||||
@@ -75,6 +104,25 @@ jobs:
|
||||
latest=${{ inputs.tag-latest }}
|
||||
suffix=${{ inputs.tag-suffix }},onlatest=true
|
||||
|
||||
# Source from ci-cache, not local-ai-backends.
|
||||
#
|
||||
# The build job pushes per-arch manifests to local-ai-backends with
|
||||
# push-by-digest=true (no tag), then anchors a tagged copy into
|
||||
# ci-cache so the manifest can be retrieved hours later when this
|
||||
# merge runs. Quay's manifest GC, however, is per-repository: the
|
||||
# anchor tag in ci-cache protects the manifest there, but the same
|
||||
# digest in local-ai-backends has no tag in *that* repo and gets
|
||||
# reaped independently. Sourcing local-ai-backends@<digest> here
|
||||
# then fails with "manifest not found" — exactly the regression
|
||||
# we hit on v4.2.2 (19/37 multiarch merges failed).
|
||||
#
|
||||
# ci-cache@<digest> resolves because we anchored it there. buildx
|
||||
# imagetools create copies the manifest into local-ai-backends
|
||||
# (cross-repo within the same registry, blobs already cross-mounted
|
||||
# from the original push so no transfer needed) and publishes the
|
||||
# manifest list with the user-facing tags. The resulting manifest
|
||||
# list is fully self-contained in local-ai-backends — child digests
|
||||
# only, no embedded references to ci-cache.
|
||||
- name: Create manifest list and push (quay)
|
||||
if: github.event_name != 'pull_request'
|
||||
working-directory: /tmp/digests
|
||||
@@ -88,11 +136,26 @@ jobs:
|
||||
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
|
||||
if [ -z "$tags" ]; then
|
||||
echo "No quay.io tags from docker/metadata-action; skipping quay merge"
|
||||
else
|
||||
# shellcheck disable=SC2086
|
||||
docker buildx imagetools create $tags \
|
||||
$(printf 'quay.io/go-skynet/local-ai-backends@sha256:%s ' *)
|
||||
exit 0
|
||||
fi
|
||||
# shellcheck disable=SC2086
|
||||
docker buildx imagetools create $tags \
|
||||
$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
|
||||
# Resolve the manifest-list digest (any tag points at it) so
|
||||
# cosign can sign by digest. Signing by tag would leave the
|
||||
# signature orphaned the next time the tag moves.
|
||||
first_tag=$(jq -cr '
|
||||
.tags | map(select(startswith("quay.io/"))) | .[0]
|
||||
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
|
||||
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
|
||||
# --recursive walks the list and signs every per-arch entry
|
||||
# too — clients that resolve a tag to a platform-specific
|
||||
# manifest before checking signatures need the per-arch
|
||||
# signatures, not just the list-level one.
|
||||
cosign sign --yes --recursive \
|
||||
--new-bundle-format \
|
||||
--registry-referrers-mode=oci-1-1 \
|
||||
"quay.io/go-skynet/local-ai-backends@${digest}"
|
||||
|
||||
- name: Create manifest list and push (dockerhub)
|
||||
if: github.event_name != 'pull_request'
|
||||
@@ -107,11 +170,19 @@ jobs:
|
||||
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
|
||||
if [ -z "$tags" ]; then
|
||||
echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
|
||||
else
|
||||
# shellcheck disable=SC2086
|
||||
docker buildx imagetools create $tags \
|
||||
$(printf 'localai/localai-backends@sha256:%s ' *)
|
||||
exit 0
|
||||
fi
|
||||
# shellcheck disable=SC2086
|
||||
docker buildx imagetools create $tags \
|
||||
$(printf 'localai/localai-backends@sha256:%s ' *)
|
||||
first_tag=$(jq -cr '
|
||||
.tags | map(select(startswith("localai/"))) | .[0]
|
||||
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
|
||||
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
|
||||
cosign sign --yes --recursive \
|
||||
--new-bundle-format \
|
||||
--registry-referrers-mode=oci-1-1 \
|
||||
"localai/localai-backends@${digest}"
|
||||
|
||||
- name: Inspect manifest
|
||||
if: github.event_name != 'pull_request'
|
||||
@@ -122,6 +193,15 @@ jobs:
|
||||
docker buildx imagetools inspect "$first_tag"
|
||||
fi
|
||||
|
||||
# See .github/scripts/cleanup-keepalive-tags.sh for why this is
|
||||
# best-effort and what the failure modes are.
|
||||
- name: Cleanup keepalive tags in ci-cache
|
||||
if: github.event_name != 'pull_request' && success()
|
||||
env:
|
||||
TAG_SUFFIX: ${{ inputs.tag-suffix }}
|
||||
QUAY_TOKEN: ${{ secrets.quayPassword }}
|
||||
run: .github/scripts/cleanup-keepalive-tags.sh
|
||||
|
||||
- name: Job summary
|
||||
if: github.event_name != 'pull_request'
|
||||
run: |
|
||||
|
||||
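The merge steps rely on a compact idiom: the build job leaves one empty file per digest in `/tmp/digests`, and `printf '<repo>@sha256:%s ' *` expands those file names into image references. A minimal sketch of that expansion follows, using a temporary directory and placeholder digest names; `digest_refs`, `example.invalid/repo`, and the short file names are all assumptions made for the illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for /tmp/digests: one empty file per digest, named after the
# hex part of the digest. "aaaa" and "bbbb" are placeholders, not real digests.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
touch "$workdir/aaaa" "$workdir/bbbb"

# digest_refs REPO DIR
# Hypothetical helper: prints one "REPO@sha256:<file name>" reference per
# file in DIR, space-separated, like the printf idiom in the merge steps.
# Assumes DIR is non-empty; an empty directory would leave the glob unexpanded.
digest_refs() {
  local repo="$1" dir="$2" f
  for f in "$dir"/*; do
    printf '%s@sha256:%s ' "$repo" "${f##*/}"
  done
}

digest_refs example.invalid/repo "$workdir"
echo
```

Switching the source repository, as the quay step does when it moves from `local-ai-backends` to `ci-cache`, only changes the prefix; the digests themselves come from the artifact file names.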
28
.github/workflows/backend_pr.yml
vendored
@@ -14,11 +14,13 @@ jobs:
|
||||
matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
|
||||
matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
|
||||
matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
|
||||
merge-matrix: ${{ steps.set-matrix.outputs['merge-matrix'] }}
|
||||
merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
|
||||
merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
|
||||
has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
|
||||
has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
|
||||
has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
|
||||
has-merges: ${{ steps.set-matrix.outputs['has-merges'] }}
|
||||
has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
|
||||
has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v6
|
||||
@@ -97,12 +99,14 @@ jobs:
|
||||
fail-fast: true
|
||||
max-parallel: 8
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
|
||||
backend-merge-jobs:
|
||||
backend-merge-jobs-multiarch:
|
||||
needs: [generate-matrix, backend-jobs-multiarch]
|
||||
# backend_merge.yml's push-side steps are all gated on
|
||||
# github.event_name != 'pull_request', so on a PR the merge job would
|
||||
# do nothing. Skip it entirely to avoid spinning up an empty runner.
|
||||
if: github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges'] == 'true'
|
||||
# !cancelled() lets the merge run even when a few build legs fail —
|
||||
# see the matching note in backend.yml.
|
||||
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
|
||||
uses: ./.github/workflows/backend_merge.yml
|
||||
with:
|
||||
tag-latest: ${{ matrix.tag-latest }}
|
||||
@@ -112,7 +116,21 @@ jobs:
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix']) }}
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
|
||||
|
||||
backend-merge-jobs-singlearch:
|
||||
needs: [generate-matrix, backend-jobs-singlearch]
|
||||
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
|
||||
uses: ./.github/workflows/backend_merge.yml
|
||||
with:
|
||||
tag-latest: ${{ matrix.tag-latest }}
|
||||
tag-suffix: ${{ matrix.tag-suffix }}
|
||||
secrets:
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
|
||||
backend-jobs-darwin:
|
||||
needs: generate-matrix
|
||||
uses: ./.github/workflows/backend_build_darwin.yml
|
||||
|
||||
4
.github/workflows/bump_deps.yaml
vendored
@@ -22,6 +22,10 @@ jobs:
|
||||
variable: "TURBOQUANT_VERSION"
|
||||
branch: "feature/turboquant-kv-cache"
|
||||
file: "backend/cpp/turboquant/Makefile"
|
||||
- repository: "antirez/ds4"
|
||||
variable: "DS4_VERSION"
|
||||
branch: "main"
|
||||
file: "backend/cpp/ds4/Makefile"
|
||||
- repository: "ggml-org/whisper.cpp"
|
||||
variable: "WHISPER_CPP_VERSION"
|
||||
branch: "master"
|
||||
|
||||
94
.github/workflows/image.yml
vendored
@@ -151,7 +151,11 @@
|
||||
ubuntu-codename: 'noble'
|
||||
|
||||
core-image-merge:
|
||||
if: github.repository == 'mudler/LocalAI'
|
||||
# !cancelled(): without it, GHA's default `needs:` cascade skips the
|
||||
# merge whenever any matrix cell of the parent build fails or is
|
||||
# cancelled. Same fix as backend.yml's merge jobs — we still want to
|
||||
# publish the manifest list for tag-suffixes whose legs all succeeded.
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: core-image-build
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
@@ -164,7 +168,7 @@
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
gpu-vulkan-image-merge:
|
||||
if: github.repository == 'mudler/LocalAI'
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: core-image-build
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
@@ -175,7 +179,91 @@
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
|
||||
# Single-arch server-image merges. Same conceptual fix as the backend
|
||||
# singletons in PR #9781: image_build.yml pushes by canonical digest
|
||||
# only, so without a downstream merge step there's no tag for consumers
|
||||
# (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
|
||||
# Each merge job needs only its parent build matrix and is filtered by
|
||||
# tag-suffix in image_merge.yml's artifact-download pattern.
|
||||
gpu-nvidia-cuda-12-image-merge:
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: core-image-build
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
tag-latest: 'auto'
|
||||
tag-suffix: '-gpu-nvidia-cuda-12'
|
||||
secrets:
|
||||
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
gpu-nvidia-cuda-13-image-merge:
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: core-image-build
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
tag-latest: 'auto'
|
||||
tag-suffix: '-gpu-nvidia-cuda-13'
|
||||
secrets:
|
||||
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
gpu-intel-image-merge:
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: core-image-build
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
tag-latest: 'auto'
|
||||
tag-suffix: '-gpu-intel'
|
||||
secrets:
|
||||
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
gpu-hipblas-image-merge:
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: hipblas-jobs
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
tag-latest: 'auto'
|
||||
tag-suffix: '-gpu-hipblas'
|
||||
secrets:
|
||||
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
nvidia-l4t-arm64-image-merge:
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: gh-runner
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
tag-latest: 'auto'
|
||||
tag-suffix: '-nvidia-l4t-arm64'
|
||||
secrets:
|
||||
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
nvidia-l4t-arm64-cuda-13-image-merge:
|
||||
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
|
||||
needs: gh-runner
|
||||
uses: ./.github/workflows/image_merge.yml
|
||||
with:
|
||||
tag-latest: 'auto'
|
||||
tag-suffix: '-nvidia-l4t-arm64-cuda-13'
|
||||
secrets:
|
||||
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
|
||||
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
|
||||
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
|
||||
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
|
||||
|
||||
gh-runner:
|
||||
if: github.repository == 'mudler/LocalAI'
|
||||
uses: ./.github/workflows/image_build.yml
|
||||
|
||||
21
.github/workflows/image_build.yml
vendored
@@ -185,11 +185,28 @@ jobs:
|
||||
digest="${{ steps.build.outputs.digest }}"
|
||||
touch "/tmp/digests/${digest#sha256:}"
|
||||
|
||||
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
|
||||
# and how it interacts with image_merge.yml's cleanup step. Mirrors the
|
||||
# same anchor in backend_build.yml — quay's per-repo manifest GC reaps
|
||||
# untagged manifests in local-ai before the merge runs.
|
||||
- name: Anchor digest in ci-cache so quay GC won't reap before merge
|
||||
if: github.event_name != 'pull_request'
|
||||
env:
|
||||
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
|
||||
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
|
||||
DIGEST: ${{ steps.build.outputs.digest }}
|
||||
SOURCE_IMAGE: quay.io/go-skynet/local-ai
|
||||
run: .github/scripts/anchor-digest-in-cache.sh
|
||||
|
||||
- name: Upload digest artifact
|
||||
if: github.event_name != 'pull_request'
|
||||
uses: actions/upload-artifact@v4
|
||||
uses: actions/upload-artifact@v7
|
||||
with:
|
||||
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-${{ inputs.platform-tag }}
|
||||
# `--` separator + 'single' placeholder for empty platform-tag —
|
||||
# same pattern as backend_build.yml. Prevents prefix collisions
|
||||
# in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
|
||||
# -nvidia-l4t-arm64-cuda-13).
|
||||
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
|
||||
path: /tmp/digests/*
|
||||
if-no-files-found: error
|
||||
retention-days: 1
|
||||
|
||||
36
.github/workflows/image_merge.yml
vendored
@@ -33,10 +33,22 @@ jobs:
|
||||
env:
|
||||
quay_username: ${{ secrets.quayUsername }}
|
||||
steps:
|
||||
- name: Download digests
|
||||
uses: actions/download-artifact@v4
|
||||
# Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
|
||||
# script). Skips the rest of the source tree.
|
||||
- name: Checkout (.github/scripts only)
|
||||
uses: actions/checkout@v6
|
||||
with:
|
||||
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}-*
|
||||
sparse-checkout: |
|
||||
.github/scripts
|
||||
sparse-checkout-cone-mode: false
|
||||
|
||||
- name: Download digests
|
||||
uses: actions/download-artifact@v8
|
||||
with:
|
||||
# `--` separator anchors the glob so we don't over-match sibling
|
||||
# tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
|
||||
# Must stay in sync with image_build.yml's upload-artifact name.
|
||||
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
|
||||
merge-multiple: true
|
||||
path: /tmp/digests
|
||||
|
||||
@@ -72,6 +84,13 @@ jobs:
|
||||
latest=${{ inputs.tag-latest }}
|
||||
suffix=${{ inputs.tag-suffix }},onlatest=true
|
||||
|
||||
# Source from ci-cache, not local-ai. See backend_merge.yml for the
|
||||
# detailed rationale — quay's manifest GC is per-repository, so the
|
||||
# untagged digest in local-ai gets reaped while the same content lives
|
||||
# tagged under ci-cache (anchored by image_build.yml). buildx imagetools
|
||||
# create copies the manifest into local-ai (blobs already cross-mounted)
|
||||
# and publishes the manifest list with user-facing tags. End state in
|
||||
# local-ai is self-contained; no embedded reference to ci-cache.
|
||||
- name: Create manifest list and push (quay)
|
||||
working-directory: /tmp/digests
|
||||
run: |
|
||||
@@ -82,7 +101,7 @@ jobs:
|
||||
else
|
||||
# shellcheck disable=SC2086
|
||||
docker buildx imagetools create $tags \
|
||||
$(printf 'quay.io/go-skynet/local-ai@sha256:%s ' *)
|
||||
$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
|
||||
fi
|
||||
|
||||
- name: Create manifest list and push (dockerhub)
|
||||
@@ -107,6 +126,15 @@ jobs:
|
||||
docker buildx imagetools inspect "$first_tag"
|
||||
fi
|
||||
|
||||
# See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
|
||||
# semantics — fails soft when the registry credential isn't OAuth-scoped.
|
||||
- name: Cleanup keepalive tags in ci-cache
|
||||
if: github.event_name != 'pull_request' && success()
|
||||
env:
|
||||
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
|
||||
QUAY_TOKEN: ${{ secrets.quayPassword }}
|
||||
run: .github/scripts/cleanup-keepalive-tags.sh
|
||||
|
||||
- name: Job summary
|
||||
run: |
|
||||
set -euo pipefail
|
||||
|
||||
27
.github/workflows/test-extra.yml
vendored
@@ -28,6 +28,7 @@ jobs:
|
||||
qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
|
||||
nemo: ${{ steps.detect.outputs.nemo }}
|
||||
voxcpm: ${{ steps.detect.outputs.voxcpm }}
|
||||
liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
|
||||
llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
|
||||
llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
|
||||
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
|
||||
@@ -447,6 +448,32 @@ jobs:
|
||||
run: |
|
||||
make --jobs=5 --output-sync=target -C backend/python/voxcpm
|
||||
make --jobs=5 --output-sync=target -C backend/python/voxcpm test
|
||||
# liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
|
||||
# exercises Health() and LoadModel(mode:finetune) — fine-tune mode
|
||||
# short-circuits before pulling weights (backend.py:192), so no
|
||||
# HuggingFace download or GPU is needed. The full-inference path is
|
||||
# gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
|
||||
tests-liquid-audio:
|
||||
needs: detect-changes
|
||||
if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Clone
|
||||
uses: actions/checkout@v6
|
||||
with:
|
||||
submodules: true
|
||||
- name: Dependencies
|
||||
run: |
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y build-essential ffmpeg
|
||||
sudo apt-get install -y ca-certificates cmake curl patch python3-pip
|
||||
# Install UV
|
||||
curl -LsSf https://astral.sh/uv/install.sh | sh
|
||||
pip install --user --no-cache-dir grpcio-tools==1.64.1
|
||||
- name: Test liquid-audio
|
||||
run: |
|
||||
make --jobs=5 --output-sync=target -C backend/python/liquid-audio
|
||||
make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
|
||||
tests-llama-cpp-quantization:
|
||||
needs: detect-changes
|
||||
if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
|
||||
|
||||
@@ -46,8 +46,52 @@ linters:
|
||||
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
|
||||
- pattern: '^t\.FailNow$'
|
||||
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
|
||||
# In-process config should flow through ApplicationConfig / kong-bound
|
||||
# CLI flags, not via os.Getenv. The CLI layer is the legitimate
|
||||
# env→struct boundary (kong's `env:"..."` tag); anything deeper that
|
||||
# reads env directly leaks process state into business logic and
|
||||
# makes flags impossible to test or override per-request. Backend
|
||||
# subprocesses, the system/capabilities probe, and a few places that
|
||||
# read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
|
||||
# are exempt — see linters.exclusions.rules below.
|
||||
- pattern: '^os\.(Getenv|LookupEnv|Environ)$'
|
||||
msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
|
||||
exclusions:
|
||||
paths:
|
||||
# Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
|
||||
- 'backend/go/whisper/sources'
|
||||
- 'docs/'
|
||||
rules:
|
||||
# CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
|
||||
# boundary, and a handful of subcommands legitimately propagate values
|
||||
# to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
|
||||
- path: ^core/cli/
|
||||
text: 'os\.(Getenv|LookupEnv|Environ)'
|
||||
linters: [forbidigo]
|
||||
# Backend subprocesses are independent binaries with their own env
|
||||
# surface; they're not "in-process config" of the LocalAI server.
|
||||
- path: ^backend/
|
||||
text: 'os\.(Getenv|LookupEnv|Environ)'
|
||||
linters: [forbidigo]
|
||||
# System capability probe reads HOME, PATH-style vars to discover
|
||||
# GPUs, default paths, etc. — not LocalAI config.
|
||||
- path: ^pkg/system/
|
||||
text: 'os\.(Getenv|LookupEnv|Environ)'
|
||||
linters: [forbidigo]
|
||||
# gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
|
||||
# time; model.Loader sets/inherits env to communicate with subprocesses.
|
||||
- path: ^pkg/grpc/
|
||||
text: 'os\.(Getenv|LookupEnv|Environ)'
|
||||
linters: [forbidigo]
|
||||
- path: ^pkg/model/
|
||||
text: 'os\.(Getenv|LookupEnv|Environ)'
|
||||
linters: [forbidigo]
|
||||
# Top-level main binaries (local-ai, launcher) are entry points.
|
||||
- path: ^cmd/
|
||||
text: 'os\.(Getenv|LookupEnv|Environ)'
|
||||
linters: [forbidigo]
|
||||
# Tests legitimately read $HOME, $TMPDIR, and gating env vars
|
||||
# (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
|
||||
- path: _test\.go$
|
||||
text: 'os\.(Getenv|LookupEnv|Environ)'
|
||||
linters: [forbidigo]
|
||||
|
||||
@@ -25,11 +25,13 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
|
||||
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
|
||||
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
|
||||
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
|
||||
| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
|
||||
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
|
||||
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
|
||||
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
|
||||
| [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
|
||||
| [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
|
||||
| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |
|
||||
|
||||
## Quick Reference
|
||||
|
||||
|
||||
@@ -305,7 +305,7 @@ EOT
|
||||
###################################
|
||||
|
||||
# Build React UI
|
||||
FROM node:25-slim AS react-ui-builder
|
||||
FROM node:26-slim AS react-ui-builder
|
||||
WORKDIR /app
|
||||
COPY core/http/react-ui/package*.json ./
|
||||
RUN npm install
|
||||
|
||||
17
Makefile
@@ -1,5 +1,5 @@
|
||||
# Disable parallel execution for backend builds
|
||||
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx
|
||||
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
|
||||
|
||||
GOCMD=go
|
||||
GOTEST=$(GOCMD) test
|
||||
@@ -463,6 +463,7 @@ prepare-test-extra: protogen-python
|
||||
$(MAKE) -C backend/python/vllm-omni
|
||||
$(MAKE) -C backend/python/sglang
|
||||
$(MAKE) -C backend/python/vibevoice
|
||||
$(MAKE) -C backend/python/liquid-audio
|
||||
$(MAKE) -C backend/python/moonshine
|
||||
$(MAKE) -C backend/python/pocket-tts
|
||||
$(MAKE) -C backend/python/qwen-tts
|
||||
@@ -488,6 +489,7 @@ test-extra: prepare-test-extra
|
||||
$(MAKE) -C backend/python/vllm test
|
||||
$(MAKE) -C backend/python/vllm-omni test
|
||||
$(MAKE) -C backend/python/vibevoice test
|
||||
$(MAKE) -C backend/python/liquid-audio test
|
||||
$(MAKE) -C backend/python/moonshine test
|
||||
$(MAKE) -C backend/python/pocket-tts test
|
||||
$(MAKE) -C backend/python/qwen-tts test
|
||||
@@ -1009,6 +1011,10 @@ backends/llama-cpp-darwin: build
|
||||
bash ./scripts/build/llama-cpp-darwin.sh
|
||||
./local-ai backends install "ocifile://$(abspath ./backend-images/llama-cpp.tar)"
|
||||
|
||||
backends/ds4-darwin: build
|
||||
bash ./scripts/build/ds4-darwin.sh
|
||||
./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
|
||||
|
||||
build-darwin-python-backend: build
|
||||
bash ./scripts/build/python-darwin.sh
|
||||
|
||||
@@ -1050,6 +1056,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
|
||||
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
|
||||
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
|
||||
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
|
||||
# ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
|
||||
# Single-model; hardware-only validation lives at tests/e2e-backends/
|
||||
# (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
|
||||
BACKEND_DS4 = ds4|ds4|.|false|false
|
||||
|
||||
# Golang backends
|
||||
BACKEND_PIPER = piper|golang|.|false|true
|
||||
@@ -1084,6 +1094,7 @@ BACKEND_SGLANG = sglang|python|.|false|true
|
||||
BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
|
||||
BACKEND_CHATTERBOX = chatterbox|python|.|false|true
|
||||
BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
|
||||
BACKEND_LIQUID_AUDIO = liquid-audio|python|.|--progress=plain|true
|
||||
BACKEND_MOONSHINE = moonshine|python|.|false|true
|
||||
BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
|
||||
BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
|
||||
@@ -1135,6 +1146,7 @@ endef
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
|
||||
@@ -1160,6 +1172,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_LIQUID_AUDIO)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
|
||||
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
|
||||
@@ -1188,7 +1201,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
|
||||
docker-save-%: backend-images
|
||||
docker save local-ai-backend:$* -o backend-images/$*.tar
|
||||
|
||||
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
|
||||
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
|
||||
|
||||
########################################################
|
||||
### Mock Backend for E2E Tests
|
||||
|
||||
41
backend/Dockerfile.ds4
Normal file
@@ -0,0 +1,41 @@
|
||||
ARG BASE_IMAGE=ubuntu:24.04
|
||||
ARG APT_MIRROR=""
|
||||
ARG APT_PORTS_MIRROR=""
|
||||
|
||||
# BASE_IMAGE is either ubuntu:24.04 (for cpu builds) or nvidia/cuda:13.0.0-devel-ubuntu24.04
|
||||
# (for cublas builds). Both ship apt + Ubuntu Noble packages; the nvidia/cuda base
|
||||
# additionally provides /usr/local/cuda. Darwin (Metal) builds bypass this Dockerfile
|
||||
# entirely via scripts/build/ds4-darwin.sh.
|
||||
FROM ${BASE_IMAGE} AS builder
|
||||
ARG BUILD_TYPE
|
||||
ARG TARGETARCH
|
||||
ARG TARGETVARIANT
|
||||
|
||||
ENV BUILD_TYPE=${BUILD_TYPE} \
|
||||
DEBIAN_FRONTEND=noninteractive \
|
||||
PATH=/usr/local/cuda/bin:${PATH}
|
||||
|
||||
WORKDIR /build
|
||||
|
||||
# Install build-time deps via plain apt - install-base-deps.sh's full pipeline
|
||||
# (CUDA keyring + from-source gRPC) is unnecessary here:
|
||||
# - CUDA: when BASE_IMAGE=nvidia/cuda:*, /usr/local/cuda is already populated;
|
||||
# for the cpu build we don't need CUDA at all.
|
||||
# - gRPC/Protobuf: system apt packages are sufficient; ds4's wrapper only links
|
||||
# against them, it doesn't ship the gRPC source tree.
|
||||
# - nlohmann-json: dsml_renderer's only third-party dep.
|
||||
RUN apt-get update && \
|
||||
apt-get install -y --no-install-recommends \
|
||||
git cmake build-essential pkg-config ca-certificates \
|
||||
libgrpc++-dev libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
|
||||
nlohmann-json3-dev && \
|
||||
apt-get clean && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
COPY . /LocalAI
|
||||
|
||||
RUN --mount=type=cache,target=/root/.ccache,id=ds4-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
|
||||
make -C /LocalAI/backend/cpp/ds4 BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
|
||||
|
||||
FROM scratch
|
||||
COPY --from=builder /LocalAI/backend/cpp/ds4/package/. ./
|
||||
@@ -117,6 +117,12 @@ ARG CUDA_DOCKER_ARCH
|
||||
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
|
||||
ARG CMAKE_ARGS
|
||||
ENV CMAKE_ARGS=${CMAKE_ARGS}
|
||||
# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
|
||||
# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
|
||||
# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
|
||||
# time. The builder-fromsource stage above already does this; mirror it here.
|
||||
ARG AMDGPU_TARGETS
|
||||
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
|
||||
ARG TARGETARCH
|
||||
ARG TARGETVARIANT
|
||||
|
||||
|
||||
@@ -48,6 +48,11 @@ service Backend {
|
||||
|
||||
rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {}
|
||||
rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {}
|
||||
// AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
|
||||
// that load a speech-to-speech model consume input audio frames and emit
|
||||
// interleaved audio + transcript + tool-call deltas as typed events.
|
||||
// Backends without S2S support return UNIMPLEMENTED.
|
||||
rpc AudioToAudioStream(stream AudioToAudioRequest) returns (stream AudioToAudioResponse) {}
|
||||
|
||||
rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}
|
||||
|
||||
@@ -768,6 +773,93 @@ message AudioTransformFrameResponse {
|
||||
int64 frame_index = 2;
|
||||
}
|
||||
|
||||
// === AudioToAudioStream messages =========================================
|
||||
//
|
||||
// Bidirectional stream between the LocalAI core and an any-to-any audio
|
||||
// model. The client opens the stream with a Config payload, then alternates
|
||||
// Frame (input audio) and Control (turn boundaries, function-call results,
|
||||
// session updates) payloads. The server streams back typed events: audio
|
||||
// frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
|
||||
// `meta`; the stream ends with a `response.done` (success) or `error` event.
|
||||
|
||||
message AudioToAudioRequest {
|
||||
oneof payload {
|
||||
AudioToAudioConfig config = 1;
|
||||
AudioToAudioFrame frame = 2;
|
||||
AudioToAudioControl control = 3;
|
||||
}
|
||||
}
|
||||
|
||||
message AudioToAudioConfig {
|
||||
// PCM format for client→server audio. 0 => backend default
|
||||
// (16 kHz for the LFM2-Audio Conformer encoder).
|
||||
int32 input_sample_rate = 1;
|
||||
// Preferred server→client audio rate. 0 => backend default
|
||||
// (24 kHz for the LFM2-Audio vocoder).
|
||||
int32 output_sample_rate = 2;
|
||||
// Optional system prompt override. Empty => backend chooses based on
|
||||
// mode (e.g. "Respond with interleaved text and audio.").
|
||||
string system_prompt = 3;
|
||||
// Optional baked-voice id. Models that only ship a fixed set of
|
||||
// voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
|
||||
// this against their voice table; an empty string keeps the default.
|
||||
string voice = 4;
|
||||
// JSON-encoded array of tool definitions in OpenAI Chat Completions
|
||||
// format. Empty => no tools.
|
||||
string tools = 5;
|
||||
// Free-form sampling / decoding parameters (temperature, top_k,
|
||||
// max_new_tokens, audio_top_k, etc).
|
||||
map<string, string> params = 6;
|
||||
// True => reset any session-scoped state before processing further
|
||||
// frames on this stream. The first Config implicitly resets.
|
||||
bool reset = 7;
|
||||
}
|
||||
|
||||
message AudioToAudioFrame {
|
||||
// Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
|
||||
// is a valid "user finished speaking" marker without trailing audio.
|
||||
bytes pcm = 1;
|
||||
// Marks the last frame of a user turn. The backend may begin emitting
|
||||
// a response immediately after seeing this.
|
||||
bool end_of_input = 2;
|
||||
}
|
||||
|
||||
message AudioToAudioControl {
|
||||
// Free-form control event names. Initial set:
|
||||
// "input_audio_buffer.commit" — user finished speaking
|
||||
// "response.cancel" — abort in-flight generation
|
||||
// "conversation.item.create" — inject a non-audio item (e.g.
|
||||
// function_call_output as JSON in
|
||||
// `payload`)
|
||||
// "session.update" — re-configure mid-stream
|
||||
string event = 1;
|
||||
// Event-specific JSON payload.
|
||||
bytes payload = 2;
|
||||
}
|
||||
|
||||
message AudioToAudioResponse {
|
||||
// Event identifies what this frame carries. Mirrors the OpenAI Realtime
|
||||
// API server-event names where applicable. Initial set:
|
||||
// "response.audio.delta"
|
||||
// "response.audio_transcript.delta"
|
||||
// "response.function_call_arguments.delta"
|
||||
// "response.function_call_arguments.done"
|
||||
// "response.done"
|
||||
// "error"
|
||||
string event = 1;
|
||||
// Populated when event = response.audio.delta.
|
||||
bytes pcm = 2;
|
||||
// Populated alongside pcm to identify its rate. 0 => same as the
|
||||
// session's negotiated output_sample_rate.
|
||||
int32 sample_rate = 3;
|
||||
// JSON payload for non-PCM events (transcript chunk, tool args, error
|
||||
// body).
|
||||
bytes meta = 4;
|
||||
// Monotonic per-stream counter, useful for client reordering and
|
||||
// debugging.
|
||||
int64 sequence = 5;
|
||||
}
|
||||
|
||||
message ModelMetadataResponse {
|
||||
bool supports_thinking = 1;
|
||||
string rendered_template = 2; // The rendered chat template with enable_thinking=true (empty if not applicable)
|
||||
|
||||
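The `AudioToAudioConfig` and `AudioToAudioFrame` comments above describe two conventions that are easy to get wrong on the client side: a sample rate of 0 means "use the backend default", and frames carry raw mono s16le PCM (2 bytes per sample). The sketch below illustrates both in C++. The helper names `effective_rate` and `pcm_duration_ms` are hypothetical and not part of LocalAI; the 16 kHz and 24 kHz defaults are the values quoted in the proto comments for LFM2-Audio, and other backends may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Defaults quoted in the AudioToAudioConfig comments (0 => backend default).
constexpr std::int32_t kDefaultInputRate = 16000;   // LFM2-Audio Conformer encoder
constexpr std::int32_t kDefaultOutputRate = 24000;  // LFM2-Audio vocoder

// Hypothetical helper: resolve the "0 means backend default" convention.
inline std::int32_t effective_rate(std::int32_t requested, std::int32_t backend_default) {
    assert(backend_default > 0);
    return requested > 0 ? requested : backend_default;
}

// Hypothetical helper: duration in milliseconds of a mono s16le buffer.
// Each sample occupies 2 bytes; a trailing odd byte is ignored.
inline std::int64_t pcm_duration_ms(std::size_t pcm_bytes, std::int32_t sample_rate) {
    assert(sample_rate > 0);
    const std::int64_t samples = static_cast<std::int64_t>(pcm_bytes / 2);
    return samples * 1000 / sample_rate;
}
```

For instance, a 640-byte input frame at the default input rate holds 320 samples, which is 20 ms of audio. The same reasoning applies to `AudioToAudioResponse.sample_rate`, where 0 means the session's negotiated output rate.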
9
backend/cpp/ds4/.gitignore
vendored
Normal file
@@ -0,0 +1,9 @@
|
||||
ds4/
|
||||
build/
|
||||
package/
|
||||
grpc-server
|
||||
*.o
|
||||
backend.pb.cc
|
||||
backend.pb.h
|
||||
backend.grpc.pb.cc
|
||||
backend.grpc.pb.h
|
||||
101
backend/cpp/ds4/CMakeLists.txt
Normal file
@@ -0,0 +1,101 @@
|
||||
cmake_minimum_required(VERSION 3.15)
|
||||
project(ds4-grpc-server LANGUAGES CXX C)
|
||||
|
||||
set(CMAKE_CXX_STANDARD 17)
|
||||
set(CMAKE_CXX_STANDARD_REQUIRED ON)
|
||||
set(TARGET grpc-server)
|
||||
|
||||
option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
|
||||
set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
|
||||
set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
|
||||
|
||||
find_package(Threads REQUIRED)
|
||||
find_package(Protobuf CONFIG QUIET)
|
||||
if(NOT Protobuf_FOUND)
|
||||
find_package(Protobuf REQUIRED)
|
||||
endif()
|
||||
find_package(gRPC CONFIG QUIET)
|
||||
if(NOT gRPC_FOUND)
|
||||
# Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
|
||||
find_library(GRPCPP_LIB grpc++ REQUIRED)
|
||||
find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
|
||||
add_library(gRPC::grpc++ INTERFACE IMPORTED)
|
||||
set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
|
||||
add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
|
||||
set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
|
||||
endif()
|
||||
|
||||
find_program(_PROTOC NAMES protoc REQUIRED)
|
||||
find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
|
||||
|
||||
get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
|
||||
get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
|
||||
|
||||
set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
|
||||
set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
|
||||
set(HW_GRPC_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
|
||||
set(HW_GRPC_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
|
||||
|
||||
add_custom_command(
|
||||
OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
|
||||
COMMAND ${_PROTOC}
|
||||
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
|
||||
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
|
||||
-I "${HW_PROTO_PATH}"
|
||||
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
|
||||
"${HW_PROTO}"
|
||||
DEPENDS "${HW_PROTO}")
|
||||
|
||||
add_library(hw_grpc_proto STATIC
|
||||
${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
|
||||
${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
|
||||
target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
|
||||
|
||||
set(DS4_OBJS "${DS4_DIR}/ds4.o")
|
||||
if(DS4_GPU STREQUAL "cuda")
|
||||
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_cuda.o")
|
||||
elseif(DS4_GPU STREQUAL "metal")
|
||||
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_metal.o")
|
||||
elseif(DS4_GPU STREQUAL "cpu")
|
||||
set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
|
||||
endif()
|
||||
|
||||
add_executable(${TARGET}
|
||||
grpc-server.cpp
|
||||
dsml_parser.cpp
|
||||
dsml_renderer.cpp
|
||||
kv_cache.cpp)
|
||||
|
||||
target_include_directories(${TARGET} PRIVATE ${DS4_DIR})
|
||||
|
||||
foreach(obj ${DS4_OBJS})
|
||||
target_sources(${TARGET} PRIVATE ${obj})
|
||||
set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
|
||||
endforeach()
|
||||
|
||||
target_link_libraries(${TARGET} PRIVATE
|
||||
hw_grpc_proto
|
||||
gRPC::grpc++
|
||||
gRPC::grpc++_reflection
|
||||
protobuf::libprotobuf
|
||||
Threads::Threads
|
||||
m)
|
||||
|
||||
if(DS4_GPU STREQUAL "cuda")
|
||||
find_package(CUDAToolkit REQUIRED)
|
||||
target_link_libraries(${TARGET} PRIVATE CUDA::cudart CUDA::cublas)
|
||||
elseif(DS4_GPU STREQUAL "metal")
|
||||
find_library(FOUNDATION_LIB Foundation REQUIRED)
|
||||
find_library(METAL_LIB Metal REQUIRED)
|
||||
target_link_libraries(${TARGET} PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
|
||||
elseif(DS4_GPU STREQUAL "cpu")
|
||||
target_compile_definitions(${TARGET} PRIVATE DS4_NO_GPU)
|
||||
endif()
|
||||
|
||||
if(DS4_NATIVE)
|
||||
if(APPLE)
|
||||
target_compile_options(${TARGET} PRIVATE -mcpu=native)
|
||||
else()
|
||||
target_compile_options(${TARGET} PRIVATE -march=native)
|
||||
endif()
|
||||
endif()
|
||||
78
backend/cpp/ds4/Makefile
Normal file
@@ -0,0 +1,78 @@
|
||||
# ds4 backend Makefile.
|
||||
#
|
||||
# Upstream pin lives below as DS4_VERSION?=<commit>, kept in that form so the bump script
|
||||
# (.github/bump_deps.sh) can find and update it - matches the
|
||||
# llama-cpp / ik-llama-cpp / turboquant convention.
|
||||
|
||||
DS4_VERSION?=c9dd9499bfa57c1bbfbb4446eff963330ab5329b
|
||||
DS4_REPO?=https://github.com/antirez/ds4
|
||||
|
||||
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
|
||||
BUILD_DIR := build
|
||||
|
||||
BUILD_TYPE ?=
|
||||
NATIVE ?= false
|
||||
JOBS ?= $(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
|
||||
|
||||
UNAME_S := $(shell uname -s)
|
||||
|
||||
CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
|
||||
|
||||
ifeq ($(BUILD_TYPE),cublas)
|
||||
CMAKE_ARGS += -DDS4_GPU=cuda
|
||||
DS4_OBJ_TARGET := ds4.o ds4_cuda.o
|
||||
else ifeq ($(UNAME_S),Darwin)
|
||||
CMAKE_ARGS += -DDS4_GPU=metal
|
||||
DS4_OBJ_TARGET := ds4.o ds4_metal.o
|
||||
else
|
||||
# CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
|
||||
CMAKE_ARGS += -DDS4_GPU=cpu
|
||||
DS4_OBJ_TARGET := ds4_cpu.o
|
||||
endif
|
||||
|
||||
ifneq ($(NATIVE),true)
|
||||
CMAKE_ARGS += -DDS4_NATIVE=OFF
|
||||
endif
|
||||
|
||||
.PHONY: grpc-server package clean purge test all
|
||||
all: grpc-server
|
||||
|
||||
# Clone the upstream ds4 source at the pinned commit. Directory acts as the
|
||||
# target so make only re-clones when missing. After a DS4_VERSION bump,
|
||||
# run 'make purge && make' to refetch (or rely on CI's clean build).
|
||||
ds4:
|
||||
mkdir -p ds4
|
||||
cd ds4 && \
|
||||
git init -q && \
|
||||
git remote add origin $(DS4_REPO) && \
|
||||
git fetch --depth 1 origin $(DS4_VERSION) && \
|
||||
git checkout FETCH_HEAD
|
||||
|
||||
# Build ds4's engine object files via its own Makefile, which already encodes
|
||||
# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
|
||||
ds4/ds4.o: ds4
|
||||
ifeq ($(BUILD_TYPE),cublas)
|
||||
+$(MAKE) -C ds4 ds4.o ds4_cuda.o
|
||||
else ifeq ($(UNAME_S),Darwin)
|
||||
+$(MAKE) -C ds4 ds4.o ds4_metal.o
|
||||
else
|
||||
+$(MAKE) -C ds4 ds4_cpu.o
|
||||
endif
|
||||
|
||||
grpc-server: ds4/ds4.o
|
||||
mkdir -p $(BUILD_DIR)
|
||||
cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
|
||||
cp $(BUILD_DIR)/grpc-server grpc-server
|
||||
|
||||
package: grpc-server
|
||||
bash package.sh
|
||||
|
||||
test:
|
||||
@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"
|
||||
|
||||
clean:
|
||||
rm -rf $(BUILD_DIR) grpc-server package
|
||||
if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi
|
||||
|
||||
purge: clean
|
||||
rm -rf ds4
|
||||
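The ds4 Makefile above picks the GPU backend and the engine object files from two inputs, `BUILD_TYPE` and `uname -s`, and the CMake side expects a matching `DS4_GPU` value. As a reading aid, the same decision table can be written as a small C++ function. This is only a sketch of the branching order; `Ds4BuildPlan` and `plan_ds4_build` are hypothetical names that do not exist in the repository.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary of one build configuration.
struct Ds4BuildPlan {
    std::string gpu;                   // value passed as -DDS4_GPU=...
    std::vector<std::string> objects;  // objects requested from ds4's own Makefile
};

// Mirrors the ifeq / else ifeq / else chain: cublas is checked first,
// then Darwin, and everything else falls back to the CPU reference path.
inline Ds4BuildPlan plan_ds4_build(const std::string &build_type,
                                   const std::string &uname_s) {
    if (build_type == "cublas") return {"cuda", {"ds4.o", "ds4_cuda.o"}};
    if (uname_s == "Darwin") return {"metal", {"ds4.o", "ds4_metal.o"}};
    return {"cpu", {"ds4_cpu.o"}};
}

// Usage: one call per branch of the Makefile.
inline void demo_plan_ds4_build() {
    assert(plan_ds4_build("cublas", "Linux").gpu == "cuda");
    assert(plan_ds4_build("", "Darwin").gpu == "metal");
    assert(plan_ds4_build("", "Linux").objects.front() == "ds4_cpu.o");
}
```

Note that the CPU branch yields only `ds4_cpu.o`, which matches the `set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")` override in `CMakeLists.txt`.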
359
backend/cpp/ds4/dsml_parser.cpp
Normal file
@@ -0,0 +1,359 @@
|
||||
#include "dsml_parser.h"
|
||||
|
||||
#include <algorithm>
|
||||
#include <cstdio>
|
||||
#include <cstring>
|
||||
#include <chrono>
|
||||
#include <random>
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
namespace ds4cpp {
|
||||
|
||||
namespace {
|
||||
|
||||
constexpr const char *kThinkOpen = "<think>";
|
||||
constexpr const char *kThinkClose = "</think>";
|
||||
constexpr const char *kToolsOpen = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // <|DSML|tool_calls>
|
||||
constexpr const char *kToolsClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // </|DSML|tool_calls>
|
||||
constexpr const char *kInvokeOpenPfx = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\""; // <|DSML|invoke name="
|
||||
constexpr const char *kInvokeClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>"; // </|DSML|invoke>
|
||||
constexpr const char *kParamOpenPfx = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\""; // <|DSML|parameter name="
|
||||
constexpr const char *kParamClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>"; // </|DSML|parameter>
|
||||
|
||||
// All structural markers the parser might encounter - used to detect "buf
|
||||
// might be a partial marker, don't drain yet" conditions.
|
||||
const std::vector<std::string> &all_markers() {
|
||||
static const std::vector<std::string> v = {
|
||||
kThinkOpen, kThinkClose,
|
||||
kToolsOpen, kToolsClose,
|
||||
kInvokeOpenPfx, kInvokeClose,
|
||||
kParamOpenPfx, kParamClose,
|
||||
};
|
||||
return v;
|
||||
}
|
||||
|
||||
// Returns true if `buf` could be a *prefix* of any marker (i.e., we should
|
||||
// wait for more text before draining as plain content). The marker-prefix
|
||||
// loop handles fixed markers exactly. For markers with variable-length
|
||||
// internal data (kInvokeOpenPfx, kParamOpenPfx have an open quote, then the
|
||||
// tool/param name, then a closing quote and `>`), we also wait while buf
|
||||
// starts with `<` and has not yet seen a `>`: the leading `<` could be the
|
||||
// start of one of those open markers, or a literal that we can confirm only
|
||||
// once we know what follows. Anything after the first `>` arrives is either
|
||||
// consumed by TryConsumeMarker or emitted as a literal `<` by the caller.
|
||||
bool looks_like_prefix(const std::string &buf) {
|
||||
for (const auto &m : all_markers()) {
|
||||
if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) return true;
|
||||
}
|
||||
if (!buf.empty() && buf[0] == '<' && buf.find('>') == std::string::npos) {
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
}
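The hold-back rule in `looks_like_prefix` can be tried in isolation. The sketch below repeats the same two checks with the marker table passed in as a parameter, so it compiles on its own. `might_be_marker_prefix` and `demo_markers` are hypothetical names, and the two-entry table is an illustrative subset of the real marker list.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Same two checks as looks_like_prefix, with the markers supplied by the caller.
inline bool might_be_marker_prefix(const std::string &buf,
                                   const std::vector<std::string> &markers) {
    for (const auto &m : markers) {
        // buf is strictly shorter than m and matches its beginning.
        if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) return true;
    }
    // An unterminated '<' may still grow into a variable-length open marker.
    return !buf.empty() && buf[0] == '<' && buf.find('>') == std::string::npos;
}

// Illustrative subset of the marker table.
inline const std::vector<std::string> &demo_markers() {
    static const std::vector<std::string> v = {"<think>", "</think>"};
    return v;
}

// Usage: a split marker is held back, plain text is not.
inline void demo_prefix() {
    assert(might_be_marker_prefix("<th", demo_markers()));
    assert(!might_be_marker_prefix("hello", demo_markers()));
}
```

Two edge cases follow from the code as written: a complete marker such as `<think>` is no longer a prefix (consuming it is the caller's job), and an empty buffer counts as a prefix of every marker, so the function returns true for it.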
|
||||
|
||||
bool consume_literal(std::string &buf, const std::string &lit) {
|
||||
if (buf.compare(0, lit.size(), lit) == 0) {
|
||||
buf.erase(0, lit.size());
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
// Find the next '<' in buf starting at offset; returns std::string::npos if none.
|
||||
size_t next_tag(const std::string &buf, size_t off = 0) {
|
||||
return buf.find('<', off);
|
||||
}
|
||||
|
||||
std::string json_escape(const std::string &in) {
|
||||
std::string out;
|
||||
out.reserve(in.size() + 2);
|
||||
for (char c : in) {
|
||||
switch (c) {
|
||||
case '"': out += "\\\""; break;
|
||||
case '\\': out += "\\\\"; break;
|
||||
case '\b': out += "\\b"; break;
|
||||
case '\f': out += "\\f"; break;
|
||||
case '\n': out += "\\n"; break;
|
||||
case '\r': out += "\\r"; break;
|
||||
case '\t': out += "\\t"; break;
|
||||
default:
|
||||
if (static_cast<unsigned char>(c) < 0x20) {
|
||||
char tmp[8];
|
||||
std::snprintf(tmp, sizeof(tmp), "\\u%04x", c);
|
||||
out += tmp;
|
||||
} else {
|
||||
out += c;
|
||||
}
|
||||
}
|
||||
}
|
||||
return out;
|
||||
}
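The `json_escape` helper passes a plain `char` to `%04x`. That is safe here because the branch is only reached for values below 0x20, but the intent is arguably clearer if the byte is widened through `unsigned char` first. The variant below is a sketch of that alternative, not the code in this change; `json_escape_sketch` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Illustrative variant: same escapes as json_escape, but control bytes are
// formatted from an unsigned value so %04x never sees a negative int.
inline std::string json_escape_sketch(const std::string &in) {
    std::string out;
    out.reserve(in.size() + 2);
    for (char c : in) {
        const unsigned char uc = static_cast<unsigned char>(c);
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\b': out += "\\b"; break;
        case '\f': out += "\\f"; break;
        case '\n': out += "\\n"; break;
        case '\r': out += "\\r"; break;
        case '\t': out += "\\t"; break;
        default:
            if (uc < 0x20) {
                char tmp[8];  // "\uXXXX" is 6 characters plus the terminator
                std::snprintf(tmp, sizeof(tmp), "\\u%04x", static_cast<unsigned>(uc));
                out += tmp;
            } else {
                out += c;  // printable ASCII and UTF-8 bytes pass through unchanged
            }
        }
    }
    return out;
}

// Usage: quotes and newlines are escaped, ordinary text is untouched.
inline void demo_json_escape() {
    assert(json_escape_sketch("a\"b") == "a\\\"b");
    assert(json_escape_sketch("plain") == "plain");
}
```

Multi-byte UTF-8 sequences, such as the fullwidth bar used in the DSML markers, are copied byte for byte, which is valid inside a JSON string.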
|
||||
|
||||
} // namespace
|
||||
|
||||
DsmlParser::DsmlParser() = default;
|
||||
|
||||
bool DsmlParser::IsInDsmlStructural() const {
|
||||
switch (state_) {
|
||||
case State::TOOL_CALLS:
|
||||
case State::INVOKE:
|
||||
return true;
|
||||
case State::PARAM_VALUE: // payload bytes; user sampling applies
|
||||
case State::TEXT:
|
||||
case State::THINK:
|
||||
return false;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
void DsmlParser::EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out) {
|
||||
if (chunk.empty()) return;
|
||||
ParserEvent e;
|
||||
e.type = ParserEvent::TOOL_ARGS;
|
||||
e.text = chunk;
|
||||
e.index = tool_index_;
|
||||
out.push_back(std::move(e));
|
||||
}
|
||||
|
||||
void DsmlParser::FinishCurrentToolCall(std::vector<ParserEvent> &out) {
|
||||
if (tool_index_ < 0) return;
|
||||
// Close the JSON object that was opened on the first parameter.
|
||||
if (args_emitted_open_brace_) {
|
||||
EmitArgsChunk("}", out);
|
||||
} else {
|
||||
EmitArgsChunk("{}", out);
|
||||
}
|
||||
ParserEvent e;
|
||||
e.type = ParserEvent::TOOL_END;
|
||||
e.index = tool_index_;
|
||||
out.push_back(std::move(e));
|
||||
current_tool_name_.clear();
|
||||
args_emitted_open_brace_ = false;
|
||||
args_param_count_ = 0;
|
||||
}
|
||||
|
||||
bool DsmlParser::TryConsumeMarker(std::vector<ParserEvent> &out) {
|
||||
switch (state_) {
|
||||
case State::TEXT: {
|
||||
if (consume_literal(buf_, kThinkOpen)) { state_ = State::THINK; return true; }
|
||||
if (consume_literal(buf_, kToolsOpen)) { state_ = State::TOOL_CALLS; return true; }
|
||||
return false;
|
||||
}
|
||||
case State::THINK: {
|
||||
if (consume_literal(buf_, kThinkClose)) { state_ = State::TEXT; return true; }
|
||||
return false;
|
||||
}
|
||||
case State::TOOL_CALLS: {
|
||||
if (consume_literal(buf_, kToolsClose)) { state_ = State::TEXT; return true; }
|
||||
// <|DSML|invoke name="X">
|
||||
if (buf_.compare(0, std::strlen(kInvokeOpenPfx), kInvokeOpenPfx) == 0) {
|
||||
size_t close_q = buf_.find('"', std::strlen(kInvokeOpenPfx));
|
||||
if (close_q == std::string::npos) return false; // need more bytes
|
||||
size_t close_gt = buf_.find('>', close_q);
|
||||
if (close_gt == std::string::npos) return false;
|
||||
current_tool_name_ = buf_.substr(std::strlen(kInvokeOpenPfx),
|
||||
close_q - std::strlen(kInvokeOpenPfx));
|
||||
tool_index_++;
|
||||
buf_.erase(0, close_gt + 1);
|
||||
ParserEvent e;
|
||||
e.type = ParserEvent::TOOL_START;
|
||||
e.tool_name = current_tool_name_;
|
||||
e.tool_id = RandomToolId();
|
||||
e.index = tool_index_;
|
||||
out.push_back(std::move(e));
|
||||
args_emitted_open_brace_ = false;
|
||||
args_param_count_ = 0;
|
||||
state_ = State::INVOKE;
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
case State::INVOKE: {
|
||||
if (consume_literal(buf_, kInvokeClose)) {
|
||||
FinishCurrentToolCall(out);
|
||||
state_ = State::TOOL_CALLS;
|
||||
return true;
|
||||
}
|
||||
// <|DSML|parameter name="K" string="true|false">
|
||||
if (buf_.compare(0, std::strlen(kParamOpenPfx), kParamOpenPfx) == 0) {
|
||||
size_t close_q = buf_.find('"', std::strlen(kParamOpenPfx));
|
||||
if (close_q == std::string::npos) return false;
|
||||
size_t string_attr = buf_.find("string=\"", close_q);
|
||||
if (string_attr == std::string::npos) return false;
|
||||
size_t string_q = buf_.find('"', string_attr + 8);
|
||||
if (string_q == std::string::npos) return false;
|
||||
size_t close_gt = buf_.find('>', string_q);
|
||||
if (close_gt == std::string::npos) return false;
|
||||
param_name_ = buf_.substr(std::strlen(kParamOpenPfx),
|
||||
close_q - std::strlen(kParamOpenPfx));
|
||||
std::string string_val = buf_.substr(string_attr + 8,
|
||||
string_q - (string_attr + 8));
|
||||
param_is_string_ = (string_val == "true");
|
||||
param_value_.clear();
|
||||
buf_.erase(0, close_gt + 1);
|
||||
// Emit args JSON opener / separator.
|
||||
std::string opener;
|
||||
if (!args_emitted_open_brace_) { opener = "{"; args_emitted_open_brace_ = true; }
|
||||
else { opener = ","; }
|
||||
opener += "\"" + json_escape(param_name_) + "\":";
|
||||
if (param_is_string_) opener += "\"";
|
||||
EmitArgsChunk(opener, out);
|
||||
args_param_count_++;
|
||||
state_ = State::PARAM_VALUE;
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
case State::PARAM_VALUE: {
|
||||
if (consume_literal(buf_, kParamClose)) {
|
||||
if (param_is_string_) EmitArgsChunk("\"", out);
|
||||
state_ = State::INVOKE;
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
void DsmlParser::DrainPlain(std::vector<ParserEvent> &out) {
|
||||
// Drain everything up to the next '<' that *might* start a marker.
|
||||
// Anything before the next '<' is safe to emit; the '<...' tail stays buffered.
|
||||
while (!buf_.empty()) {
|
||||
size_t lt = next_tag(buf_, 0);
|
||||
if (lt == std::string::npos) {
|
||||
// No tag at all - emit (or accumulate) the whole buffer.
|
||||
ParserEvent e;
|
||||
if (state_ == State::PARAM_VALUE) {
|
||||
std::string esc = param_is_string_ ? json_escape(buf_) : buf_;
|
||||
EmitArgsChunk(esc, out);
|
||||
} else if (state_ == State::THINK) {
|
||||
e.type = ParserEvent::REASONING;
|
||||
e.text = buf_;
|
||||
out.push_back(std::move(e));
|
||||
} else if (state_ == State::TEXT) {
|
||||
e.type = ParserEvent::CONTENT;
|
||||
e.text = buf_;
|
||||
out.push_back(std::move(e));
|
||||
}
|
||||
// Inside INVOKE / TOOL_CALLS with no marker, raw bytes are
|
||||
// structural whitespace - discard.
|
||||
buf_.clear();
|
||||
return;
|
||||
}
|
||||
if (lt > 0) {
|
||||
std::string chunk = buf_.substr(0, lt);
|
||||
buf_.erase(0, lt);
|
||||
ParserEvent e;
|
||||
if (state_ == State::PARAM_VALUE) {
|
||||
std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
|
||||
EmitArgsChunk(esc, out);
|
||||
} else if (state_ == State::THINK) {
|
||||
e.type = ParserEvent::REASONING;
|
||||
e.text = chunk;
|
||||
out.push_back(std::move(e));
|
||||
} else if (state_ == State::TEXT) {
|
||||
e.type = ParserEvent::CONTENT;
|
||||
e.text = chunk;
|
||||
out.push_back(std::move(e));
|
||||
}
|
||||
}
|
||||
// buf_[0] == '<' - try consuming a marker. If we consumed one, loop again.
|
||||
if (!TryConsumeMarker(out)) {
|
||||
// Could be a partial marker - wait for more bytes.
|
||||
if (looks_like_prefix(buf_)) return;
|
||||
// Otherwise this '<' is a literal - emit one char and continue.
|
||||
std::string one(1, buf_[0]);
|
||||
buf_.erase(0, 1);
|
||||
ParserEvent e;
|
||||
if (state_ == State::PARAM_VALUE) {
|
||||
std::string esc = param_is_string_ ? json_escape(one) : one;
|
||||
EmitArgsChunk(esc, out);
|
||||
} else if (state_ == State::THINK) {
|
||||
e.type = ParserEvent::REASONING;
|
||||
e.text = one;
|
||||
out.push_back(std::move(e));
|
||||
} else if (state_ == State::TEXT) {
|
||||
e.type = ParserEvent::CONTENT;
|
||||
e.text = one;
|
||||
out.push_back(std::move(e));
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
void DsmlParser::Feed(const std::string &chunk, std::vector<ParserEvent> &out) {
|
||||
buf_ += chunk;
|
||||
DrainPlain(out);
|
||||
}
|
||||
|
||||
void DsmlParser::Flush(std::vector<ParserEvent> &out) {
|
||||
// At flush time we no longer wait for marker completion - drain everything
|
||||
// (the trailing bytes won't grow). Mirror DrainPlain's state-aware
|
||||
// classification: PARAM_VALUE bytes become TOOL_ARGS, THINK bytes become
|
||||
// REASONING, TEXT bytes become CONTENT, and INVOKE/TOOL_CALLS bytes are
|
||||
// structural whitespace (discarded).
|
||||
auto emit_plain = [&](const std::string &chunk) {
|
||||
if (chunk.empty()) return;
|
||||
if (state_ == State::PARAM_VALUE) {
|
||||
std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
|
||||
EmitArgsChunk(esc, out);
|
||||
return;
|
||||
}
|
||||
if (state_ == State::THINK) {
|
||||
ParserEvent e;
|
||||
e.type = ParserEvent::REASONING;
|
||||
e.text = chunk;
|
||||
out.push_back(std::move(e));
|
||||
return;
|
||||
}
|
||||
if (state_ == State::TEXT) {
|
||||
ParserEvent e;
|
||||
e.type = ParserEvent::CONTENT;
|
||||
e.text = chunk;
|
||||
out.push_back(std::move(e));
|
||||
return;
|
||||
}
|
||||
// INVOKE / TOOL_CALLS: structural whitespace, discard.
|
||||
};
|
||||
while (!buf_.empty()) {
|
||||
size_t lt = next_tag(buf_, 0);
|
||||
if (lt == std::string::npos) {
|
||||
emit_plain(buf_);
|
||||
buf_.clear();
|
||||
return;
|
||||
}
|
||||
if (lt > 0) {
|
||||
std::string chunk = buf_.substr(0, lt);
|
||||
buf_.erase(0, lt);
|
||||
emit_plain(chunk);
|
||||
}
|
||||
if (!TryConsumeMarker(out)) {
|
||||
// Definitely a literal '<' now (no chance of more bytes arriving).
|
||||
std::string one(1, buf_[0]);
|
||||
buf_.erase(0, 1);
|
||||
emit_plain(one);
|
||||
}
|
||||
}
|
||||
// If we ended mid-tool-call (model truncated), close it cleanly.
|
||||
if (state_ == State::INVOKE || state_ == State::PARAM_VALUE) {
|
||||
if (state_ == State::PARAM_VALUE && param_is_string_) EmitArgsChunk("\"", out);
|
||||
FinishCurrentToolCall(out);
|
||||
state_ = State::TEXT;
|
||||
}
|
||||
}
|
||||
|
||||
std::string RandomToolId() {
|
||||
static thread_local std::mt19937_64 rng{
|
||||
static_cast<uint64_t>(std::chrono::system_clock::now().time_since_epoch().count())};
|
||||
const char *alphabet =
|
||||
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
|
||||
std::string out = "call_";
|
||||
for (int i = 0; i < 16; ++i) {
|
||||
out += alphabet[rng() % 62];
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
} // namespace ds4cpp
|
||||
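The hold-back rule that `DrainPlain` and `looks_like_prefix` implement can be tried in isolation. What follows is a minimal standalone sketch, not the backend's code: the marker strings and the names `kMarkers`, `is_marker_prefix` and `drain` are hypothetical stand-ins, and the sketch omits the extra rule that holds any `<...` tail with no `>` yet. It only shows how a tail that could still grow into a marker stays buffered while everything before it is emitted.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical marker set for illustration; the real parser uses DSML markers.
static const std::vector<std::string> kMarkers = {"<think>", "</think>"};

// True when `buf` is a strict prefix of some marker, i.e. more bytes could
// still complete it.
static bool is_marker_prefix(const std::string &buf) {
    for (const auto &m : kMarkers) {
        if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) return true;
    }
    return false;
}

// Move text that can no longer be part of a marker from `buf` to `emitted`.
// Complete markers are dropped; a possible partial marker stays in `buf`.
static void drain(std::string &buf, std::string &emitted) {
    while (!buf.empty()) {
        size_t lt = buf.find('<');
        if (lt == std::string::npos) {      // no '<' at all: everything is plain text
            emitted += buf;
            buf.clear();
            return;
        }
        emitted.append(buf, 0, lt);         // text before '<' is always safe
        buf.erase(0, lt);
        bool consumed = false;
        for (const auto &m : kMarkers) {
            if (buf.compare(0, m.size(), m) == 0) {
                buf.erase(0, m.size());     // complete marker: consume it
                consumed = true;
                break;
            }
        }
        if (consumed) continue;
        if (is_marker_prefix(buf)) return;  // partial marker: wait for more bytes
        emitted += buf[0];                  // otherwise '<' is a literal
        buf.erase(0, 1);
    }
}
```

Feeding `"a <th"` and then `"ink>b"` through `drain` emits `"a "` first and `"b"` second; the split marker never reaches the output, which is the property the streaming path relies on.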
77
backend/cpp/ds4/dsml_parser.h
Normal file
@@ -0,0 +1,77 @@
|
||||
#pragma once
|
||||
#include <functional>
|
||||
#include <string>
|
||||
#include <vector>
|
||||
|
||||
namespace ds4cpp {
|
||||
|
||||
struct ParserEvent {
|
||||
enum Type { CONTENT, REASONING, TOOL_START, TOOL_ARGS, TOOL_END };
|
||||
Type type;
|
||||
std::string text; // CONTENT, REASONING, TOOL_ARGS
|
||||
std::string tool_name; // TOOL_START
|
||||
std::string tool_id; // TOOL_START (generated by the parser via RandomToolId())
|
||||
int index = 0; // TOOL_START / TOOL_ARGS / TOOL_END
|
||||
};
|
||||
|
||||
// Streaming parser. Instances share no state; create one per Predict call.
|
||||
class DsmlParser {
|
||||
public:
|
||||
DsmlParser();
|
||||
|
||||
// Feed a chunk of raw model-emitted text. Appends classified events to
|
||||
// `out`. May buffer the tail of `chunk` internally if it looks like a
|
||||
// marker prefix.
|
||||
void Feed(const std::string &chunk, std::vector<ParserEvent> &out);
|
||||
|
||||
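// Flush any remaining buffered text, classified by the current state
// (CONTENT, REASONING or TOOL_ARGS), and close a tool call that was cut off
// mid-stream. Call once at generation end.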
|
||||
void Flush(std::vector<ParserEvent> &out);
|
||||
|
||||
// True when the parser is inside a DSML structural position - that is,
|
||||
// tags/markers between tool-call boundaries where the model is expected
|
||||
// to emit protocol bytes verbatim. Mirrors ds4_server.c's "force
|
||||
// temperature=0 unless dsml_decode_state_uses_payload_sampling" rule:
|
||||
//
|
||||
// TEXT / THINK -> false (user sampling applies)
|
||||
// PARAM_VALUE -> false (payload uses user sampling)
|
||||
// TOOL_CALLS / INVOKE -> true (structural; force greedy)
|
||||
//
|
||||
// Callers should use this BEFORE the next sample() call to pick the
|
||||
// effective temperature; the parser's state reflects what's already
|
||||
// been consumed, so it predicts the next token's classification.
|
||||
bool IsInDsmlStructural() const;
|
||||
|
||||
private:
|
||||
enum class State { TEXT, THINK, TOOL_CALLS, INVOKE, PARAM_VALUE };
|
||||
State state_ = State::TEXT;
|
||||
std::string buf_;
|
||||
std::string current_tool_name_;
|
||||
int tool_index_ = -1;
|
||||
// While parsing a parameter value:
|
||||
std::string param_name_;
|
||||
bool param_is_string_ = true;
|
||||
std::string param_value_;
|
||||
// Incrementally-built arguments JSON for the active tool call.
|
||||
std::string args_json_so_far_;
|
||||
bool args_emitted_open_brace_ = false;
|
||||
int args_param_count_ = 0;
|
||||
|
||||
// Try to consume one structural marker starting at buf_[0]. Returns true
|
||||
// and advances state if a complete marker was consumed; false otherwise
// (no marker at this position, or a partial marker that needs more bytes).
|
||||
bool TryConsumeMarker(std::vector<ParserEvent> &out);
|
||||
|
||||
// Drain plain text from buf_ as far as we're sure it's not a marker prefix.
|
||||
// Emits CONTENT, REASONING or TOOL_ARGS depending on the current state.
|
||||
void DrainPlain(std::vector<ParserEvent> &out);
|
||||
|
||||
// Emit the next chunk of arguments JSON to the consumer.
|
||||
void EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out);
|
||||
void FinishCurrentToolCall(std::vector<ParserEvent> &out);
|
||||
};
|
||||
|
||||
// Generate a random tool call ID: "call_" followed by 16 alphanumeric
// characters. DsmlParser calls this when it emits TOOL_START for a new call.
|
||||
std::string RandomToolId();
|
||||
|
||||
} // namespace ds4cpp
|
||||
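The header above describes the arguments JSON as something built incrementally, one fragment per parser step. A small sketch of that bookkeeping, under stated assumptions: `ArgsBuilder` and `escape_min` are hypothetical names that do not exist in the backend, and the escaper handles only quote, backslash and newline. It shows how the opening brace, the comma separators and the string quotes can be emitted as separate fragments that still concatenate to one valid JSON object.

```cpp
#include <cassert>
#include <string>

// Minimal escaper for the sketch: handles only quote, backslash and newline.
static std::string escape_min(const std::string &in) {
    std::string out;
    for (char c : in) {
        if (c == '"') out += "\\\"";
        else if (c == '\\') out += "\\\\";
        else if (c == '\n') out += "\\n";
        else out += c;
    }
    return out;
}

// Hypothetical helper mirroring the opener/separator bookkeeping: each call
// returns the next fragment of the arguments JSON.
struct ArgsBuilder {
    bool open = false;       // has '{' been emitted yet?
    bool is_string = false;  // is the current parameter a string?

    std::string BeginParam(const std::string &name, bool string_param) {
        std::string frag = open ? "," : "{";
        open = true;
        is_string = string_param;
        frag += "\"" + escape_min(name) + "\":";
        if (is_string) frag += "\"";  // opening quote of a string value
        return frag;
    }
    // String payloads are escaped; non-string payloads are passed through
    // because they are expected to be JSON text already.
    std::string Value(const std::string &chunk) const {
        return is_string ? escape_min(chunk) : chunk;
    }
    std::string EndParam() const { return is_string ? "\"" : ""; }
    // Close the object, or emit an empty one if no parameter was seen.
    std::string Finish() {
        std::string frag = open ? "}" : "{}";
        open = false;
        return frag;
    }
};
```

Because every fragment is a suffix of a growing JSON text, a consumer can forward each one as a streaming delta and still end up with a parseable object once `Finish` has run.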
140
backend/cpp/ds4/dsml_renderer.cpp
Normal file
@@ -0,0 +1,140 @@
|
||||
#include "dsml_renderer.h"
|
||||
|
||||
// Requires nlohmann::json. The header may come from the apt-installed
// nlohmann-json3-dev package or from a bundled copy on the include path.
// There is no hand-rolled fallback parser: if <nlohmann/json.hpp> cannot be
// found, compilation stops with the explicit error below.
|
||||
#if __has_include(<nlohmann/json.hpp>)
|
||||
#include <nlohmann/json.hpp>
|
||||
using json = nlohmann::json;
|
||||
#else
|
||||
#error "nlohmann/json.hpp not found; install nlohmann-json3-dev"
|
||||
#endif
|
||||
|
||||
#include <sstream>
|
||||
|
||||
namespace ds4cpp {
|
||||
|
||||
namespace {
|
||||
|
||||
void render_param(std::ostringstream &os, const std::string &name,
|
||||
const json &value) {
|
||||
bool is_string = value.is_string();
|
||||
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"" << name
|
||||
<< "\" string=\"" << (is_string ? "true" : "false") << "\">";
|
||||
if (is_string) {
|
||||
os << value.get<std::string>();
|
||||
} else {
|
||||
os << value.dump();
|
||||
}
|
||||
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n";
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
std::string RenderAssistantToolCalls(const std::string &tool_calls_json) {
|
||||
if (tool_calls_json.empty()) return "";
|
||||
json arr;
|
||||
try {
|
||||
arr = json::parse(tool_calls_json);
|
||||
} catch (const std::exception &) {
|
||||
return "";
|
||||
}
|
||||
if (!arr.is_array() || arr.empty()) return "";
|
||||
|
||||
std::ostringstream os;
|
||||
os << "\n\n<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n";
|
||||
for (const auto &call : arr) {
|
||||
// OpenAI shape: { id, type, function: { name, arguments (JSON string) } }
|
||||
// Anthropic shape comes through normalized by LocalAI.
|
||||
std::string name;
|
||||
std::string args_str;
|
||||
if (call.contains("function")) {
|
||||
const auto &fn = call["function"];
|
||||
if (fn.contains("name") && fn["name"].is_string())
|
||||
name = fn["name"].get<std::string>();
|
||||
if (fn.contains("arguments") && fn["arguments"].is_string())
|
||||
args_str = fn["arguments"].get<std::string>();
|
||||
}
|
||||
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"" << name << "\">\n";
|
||||
if (!args_str.empty()) {
|
||||
json args;
|
||||
try {
|
||||
args = json::parse(args_str);
|
||||
} catch (...) {
|
||||
args = json{};
|
||||
}
|
||||
if (args.is_object()) {
|
||||
for (auto it = args.begin(); it != args.end(); ++it) {
|
||||
render_param(os, it.key(), it.value());
|
||||
}
|
||||
}
|
||||
}
|
||||
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n";
|
||||
}
|
||||
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>";
|
||||
return os.str();
|
||||
}
|
||||
|
||||
std::string RenderToolResult(const std::string &tool_call_id, const std::string &content) {
|
||||
std::ostringstream os;
|
||||
// ds4_server.c wraps tool results in a "tool_result" DSML tag carrying
|
||||
// the tool_call_id. Match that shape.
|
||||
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result id=\"" << tool_call_id << "\">"
|
||||
<< content
|
||||
<< "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result>";
|
||||
return os.str();
|
||||
}
|
||||
|
||||
std::string RenderToolsManifest(const std::string &tools_json) {
|
||||
if (tools_json.empty()) return "";
|
||||
json arr;
|
||||
try {
|
||||
arr = json::parse(tools_json);
|
||||
} catch (const std::exception &) {
|
||||
return "";
|
||||
}
|
||||
if (!arr.is_array() || arr.empty()) return "";
|
||||
|
||||
// Extract each OpenAI tool's `function` object, dump as compact JSON, one
|
||||
// per line. Mirrors openai_function_schema_from_tool() in ds4_server.c.
|
||||
std::ostringstream schemas;
|
||||
for (const auto &tool : arr) {
|
||||
if (tool.contains("function") && tool["function"].is_object()) {
|
||||
schemas << tool["function"].dump() << "\n";
|
||||
} else if (tool.is_object()) {
|
||||
// Anthropic / direct-schema form: pass through.
|
||||
schemas << tool.dump() << "\n";
|
||||
}
|
||||
}
|
||||
if (schemas.tellp() == std::streampos(0)) return "";
|
||||
|
||||
// Verbatim text from ds4_server.c append_tools_prompt_text. Do NOT
|
||||
// paraphrase - the model was trained on these exact bytes.
|
||||
std::ostringstream os;
|
||||
os << "## Tools\n\n"
|
||||
"You have access to a set of tools to help answer the user question. "
|
||||
"You can invoke tools by writing a \"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\" block like the following:\n\n"
|
||||
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n"
|
||||
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME\">\n"
|
||||
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"$PARAMETER_NAME\" string=\"true|false\">$PARAMETER_VALUE</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n"
|
||||
"...\n"
|
||||
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
|
||||
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME2\">\n"
|
||||
"...\n"
|
||||
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
|
||||
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n\n"
|
||||
"String parameters should be specified as raw text and set `string=\"true\"`. "
|
||||
"Preserve characters such as `>`, `&`, and `&&` exactly; never replace normal string characters with XML or HTML entity escapes. "
|
||||
"Only if a string value itself contains the exact closing parameter tag `</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>`, write that tag as `</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>` inside the value. "
|
||||
"For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string=\"false\"`.\n\n"
|
||||
"If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.\n\n"
|
||||
"Otherwise, output directly after </think> with tool calls or final response.\n\n"
|
||||
"### Available Tool Schemas\n\n"
|
||||
<< schemas.str()
|
||||
<< "\nYou MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls. "
|
||||
"Use the exact parameter names from the schemas.";
|
||||
return os.str();
|
||||
}
|
||||
|
||||
} // namespace ds4cpp
|
||||
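The parameter element that `render_param` writes can be reproduced with the standard library alone, which makes the byte layout easier to inspect. This is a sketch under assumptions, not the renderer itself: `kBar` and `render_param_text` are hypothetical names, and non-string values are assumed to arrive already serialized as JSON text instead of going through nlohmann::json.

```cpp
#include <cassert>
#include <string>

// U+FF5C FULLWIDTH VERTICAL LINE encoded as UTF-8, as used in the DSML tags.
// Keeping it in its own literal avoids the hex escape swallowing the 'D' of
// "DSML" ('D' is a valid hex digit).
static const std::string kBar = "\xef\xbd\x9c";

// Hypothetical helper: render one parameter element. String values are
// written raw with string="true"; anything else is expected to be JSON text
// already and is tagged string="false".
static std::string render_param_text(const std::string &name,
                                     const std::string &value,
                                     bool is_string) {
    std::string out = "<" + kBar + "DSML" + kBar + "parameter name=\"" + name +
                      "\" string=\"" + (is_string ? "true" : "false") + "\">";
    out += value;  // no entity escaping: characters such as '>' stay as-is
    out += "</" + kBar + "DSML" + kBar + "parameter>\n";
    return out;
}
```

For example, `render_param_text("path", "a > b", true)` keeps the `>` inside the value untouched, in line with the manifest's instruction to preserve such characters exactly.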
27
backend/cpp/ds4/dsml_renderer.h
Normal file
@@ -0,0 +1,27 @@
|
||||
#pragma once
|
||||
#include <string>
|
||||
|
||||
namespace ds4cpp {
|
||||
|
||||
// Render an assistant message's tool_calls JSON array into the DSML block
|
||||
// that ds4 expects in its prompt. `tool_calls_json` is the value of
|
||||
// proto.Message.tool_calls (OpenAI shape: array of {id, type, function:{name, arguments}}).
|
||||
// Returns the DSML text to append after the assistant's content.
|
||||
std::string RenderAssistantToolCalls(const std::string &tool_calls_json);
|
||||
|
||||
// Render a role="tool" message into the DSML "tool result" block. ds4's
|
||||
// prompt template expects tool results inside a specific tag; we wrap the
|
||||
// `content` with that tag and include the `tool_call_id` so the model can
|
||||
// correlate.
|
||||
std::string RenderToolResult(const std::string &tool_call_id, const std::string &content);
|
||||
|
||||
// Render the "## Tools" manifest that ds4 expects in the SYSTEM prompt when
|
||||
// tools are available. Without this preamble the model has no idea tools
|
||||
// exist and will not emit DSML tool calls. Mirrors append_tools_prompt_text()
|
||||
// in ds4_server.c (~line 1646): a fixed preamble + "### Available Tool
|
||||
// Schemas" section + one JSON schema per line (extracted from each OpenAI
|
||||
// tool's .function object) + a fixed closing instruction. Returns empty
|
||||
// when tools_json is empty / unparseable.
|
||||
std::string RenderToolsManifest(const std::string &tools_json);
|
||||
|
||||
} // namespace ds4cpp
|
||||
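The manifest layout described in the header comment (fixed preamble, an "### Available Tool Schemas" section with one schema per line, then a fixed closing instruction) can be sketched as plain string assembly. The function name `assemble_manifest` and its parameters are hypothetical: `preamble` and `closing` stand in for the fixed text, which the real renderer must copy byte for byte, and `schemas` is assumed to hold compact JSON that was serialized elsewhere.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical assembler: `preamble` and `closing` are placeholders for the
// fixed text, and `schemas` holds already-serialized compact JSON, one entry
// per tool.
static std::string assemble_manifest(const std::string &preamble,
                                     const std::vector<std::string> &schemas,
                                     const std::string &closing) {
    if (schemas.empty()) return "";  // no tools: emit nothing at all
    std::string out = preamble;
    out += "### Available Tool Schemas\n\n";
    for (const auto &s : schemas) {
        out += s;
        out += "\n";  // one schema per line
    }
    out += "\n";      // blank line before the closing instruction
    out += closing;
    return out;
}
```

Returning an empty string when there are no schemas lets the caller skip the system-prompt injection entirely, so a request without tools is rendered exactly as it would be without this feature.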
696
backend/cpp/ds4/grpc-server.cpp
Normal file
@@ -0,0 +1,696 @@
|
||||
// ds4 LocalAI gRPC backend.
|
||||
//
|
||||
// Wraps antirez/ds4's `ds4_engine_*` / `ds4_session_*` public API
|
||||
// (see ds4/ds4.h) over LocalAI's backend.proto. Tool calls, thinking
|
||||
// mode, and disk KV cache are wired in follow-up commits; this commit
|
||||
// is just the bind/listen/Health/Free skeleton.
|
||||
|
||||
#include "backend.pb.h"
|
||||
#include "backend.grpc.pb.h"
|
||||
|
||||
#include "dsml_parser.h" // populated in Task 12
|
||||
#include "dsml_renderer.h" // populated in Task 16
|
||||
#include "kv_cache.h" // populated in Task 17
|
||||
|
||||
extern "C" {
|
||||
#include "ds4.h"
|
||||
}
|
||||
|
||||
#include <grpcpp/grpcpp.h>
|
||||
#include <grpcpp/server.h>
|
||||
#include <grpcpp/server_builder.h>
|
||||
#include <grpcpp/ext/proto_server_reflection_plugin.h>
|
||||
|
||||
#include <atomic>
|
||||
#include <chrono>
|
||||
#include <csignal>
|
||||
#include <cstring>
|
||||
#include <iostream>
|
||||
#include <memory>
|
||||
#include <mutex>
|
||||
#include <string>
|
||||
#include <thread>
|
||||
#include <vector>
|
||||
|
||||
using grpc::Server;
|
||||
using grpc::ServerBuilder;
|
||||
using grpc::ServerContext;
|
||||
using grpc::ServerWriter;
|
||||
// NOTE: do NOT alias `grpc::Status` as `Status` - the Status RPC method below
|
||||
// would shadow the type, breaking the other RPC method declarations that use
|
||||
// it as a return type. Use GStatus instead.
|
||||
using GStatus = ::grpc::Status;
|
||||
using grpc::StatusCode;
|
||||
|
||||
namespace {
|
||||
|
||||
// Global state - ds4 is single-engine-per-process by design.
|
||||
std::mutex g_engine_mu;
|
||||
ds4_engine *g_engine = nullptr;
|
||||
ds4_session *g_session = nullptr;
|
||||
int g_ctx_size = 32768;
|
||||
std::string g_kv_cache_dir; // empty disables disk cache
|
||||
|
||||
std::atomic<Server *> g_server{nullptr};
|
||||
|
||||
// Parse a "key:value" option string. Returns {opt, ""} when there is no colon.
|
||||
static std::pair<std::string, std::string> split_option(const std::string &opt) {
|
||||
auto colon = opt.find(':');
|
||||
if (colon == std::string::npos) return {opt, ""};
|
||||
return {opt.substr(0, colon), opt.substr(colon + 1)};
|
||||
}
|
||||
|
||||
static void append_token_text(ds4_engine *engine, int token, std::string &out) {
|
||||
size_t len = 0;
|
||||
const char *text = ds4_token_text(engine, token, &len);
|
||||
if (text && len > 0) out.append(text, len);
|
||||
}
|
||||
|
||||
struct CollectCtx {
|
||||
ds4_engine *engine;
|
||||
std::string raw_buf; // exact raw bytes for Reply.message
|
||||
ds4cpp::DsmlParser parser;
|
||||
backend::Reply *reply;
|
||||
int tokens;
|
||||
|
||||
// Per-tool aggregation: accumulate ChatDelta tool_calls so we emit one
|
||||
// delta with all calls, mirroring how vllm's non-streaming path returns.
|
||||
struct Pending {
|
||||
std::string id;
|
||||
std::string name;
|
||||
std::string args;
|
||||
};
|
||||
std::vector<Pending> pending;
|
||||
|
||||
std::string content_buf;
|
||||
std::string reasoning_buf;
|
||||
};
|
||||
|
||||
static void apply_events(CollectCtx *c, const std::vector<ds4cpp::ParserEvent> &events) {
|
||||
for (const auto &e : events) {
|
||||
switch (e.type) {
|
||||
case ds4cpp::ParserEvent::CONTENT:
|
||||
c->content_buf += e.text;
|
||||
break;
|
||||
case ds4cpp::ParserEvent::REASONING:
|
||||
c->reasoning_buf += e.text;
|
||||
break;
|
||||
case ds4cpp::ParserEvent::TOOL_START:
|
||||
if ((int)c->pending.size() <= e.index)
|
||||
c->pending.resize(e.index + 1);
|
||||
c->pending[e.index].id = e.tool_id;
|
||||
c->pending[e.index].name = e.tool_name;
|
||||
break;
|
||||
case ds4cpp::ParserEvent::TOOL_ARGS:
|
||||
if ((int)c->pending.size() > e.index)
|
||||
c->pending[e.index].args += e.text;
|
||||
break;
|
||||
case ds4cpp::ParserEvent::TOOL_END:
|
||||
// No-op for non-streaming: the final delta is emitted at the end.
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
static void collect_emit(void *ud, int token) {
|
||||
auto *c = static_cast<CollectCtx *>(ud);
|
||||
if (token == ds4_token_eos(c->engine)) return;
|
||||
size_t len = 0;
|
||||
const char *text = ds4_token_text(c->engine, token, &len);
|
||||
if (!text || len == 0) return;
|
||||
std::string chunk(text, len);
|
||||
c->raw_buf += chunk;
|
||||
std::vector<ds4cpp::ParserEvent> events;
|
||||
c->parser.Feed(chunk, events);
|
||||
apply_events(c, events);
|
||||
c->tokens++;
|
||||
}
|
||||
static void collect_done(void *) {}
|
||||
|
||||
struct StreamCtx {
|
||||
ds4_engine *engine;
|
||||
ServerWriter<backend::Reply> *writer;
|
||||
ds4cpp::DsmlParser parser;
|
||||
int tokens;
|
||||
bool aborted;
|
||||
// Track which tool indices we've seen TOOL_START for, so subsequent
|
||||
// ARGS deltas can elide the redundant id/name fields.
|
||||
std::vector<bool> tool_started;
|
||||
};
|
||||
|
||||
static void stream_emit(void *ud, int token) {
|
||||
auto *s = static_cast<StreamCtx *>(ud);
|
||||
if (s->aborted) return;
|
||||
if (token == ds4_token_eos(s->engine)) return;
|
||||
size_t len = 0;
|
||||
const char *text = ds4_token_text(s->engine, token, &len);
|
||||
if (!text || len == 0) return;
|
||||
std::string chunk(text, len);
|
||||
std::vector<ds4cpp::ParserEvent> events;
|
||||
s->parser.Feed(chunk, events);
|
||||
if (events.empty()) { s->tokens++; return; }
|
||||
|
||||
backend::Reply reply;
|
||||
auto *delta = reply.add_chat_deltas();
|
||||
bool any_field = false;
|
||||
for (const auto &e : events) {
|
||||
switch (e.type) {
|
||||
case ds4cpp::ParserEvent::CONTENT:
|
||||
delta->set_content(delta->content() + e.text);
|
||||
any_field = true;
|
||||
break;
|
||||
case ds4cpp::ParserEvent::REASONING:
|
||||
delta->set_reasoning_content(delta->reasoning_content() + e.text);
|
||||
any_field = true;
|
||||
break;
|
||||
case ds4cpp::ParserEvent::TOOL_START: {
|
||||
if ((int)s->tool_started.size() <= e.index)
|
||||
s->tool_started.resize(e.index + 1, false);
|
||||
s->tool_started[e.index] = true;
|
||||
auto *tc = delta->add_tool_calls();
|
||||
tc->set_index(e.index);
|
||||
tc->set_id(e.tool_id);
|
||||
tc->set_name(e.tool_name);
|
||||
any_field = true;
|
||||
break;
|
||||
}
|
||||
case ds4cpp::ParserEvent::TOOL_ARGS: {
|
||||
auto *tc = delta->add_tool_calls();
|
||||
tc->set_index(e.index);
|
||||
tc->set_arguments(e.text);
|
||||
any_field = true;
|
||||
break;
|
||||
}
|
||||
case ds4cpp::ParserEvent::TOOL_END:
|
||||
// No marker delta needed - the Go side closes the tool call on
|
||||
// the final aggregator pass.
|
||||
break;
|
||||
}
|
||||
}
|
||||
reply.set_message(chunk);
|
||||
reply.set_tokens(1);
|
||||
if (any_field) {
|
||||
if (!s->writer->Write(reply)) s->aborted = true;
|
||||
}
|
||||
s->tokens++;
|
||||
}
|
||||
static void stream_done(void *) {}
|
||||
|
||||
// Per-thread RNG seed for ds4_session_sample. Initialized lazily from
|
||||
// system_clock; ds4 owns the random walk after that.
|
||||
static uint64_t *get_rng() {
|
||||
static thread_local uint64_t seed = 0;
|
||||
if (seed == 0) {
|
||||
seed = static_cast<uint64_t>(
|
||||
std::chrono::system_clock::now().time_since_epoch().count());
|
||||
if (seed == 0) seed = 1;
|
||||
}
|
||||
return &seed;
|
||||
}
|
||||
|
||||
struct SampleParams {
|
||||
float temperature;
|
||||
int top_k;
|
||||
float top_p;
|
||||
float min_p;
|
||||
};
|
||||
|
||||
// Compute the effective sampling parameters for the next token, mirroring
|
||||
// ds4_server.c:7102-7115:
|
||||
// - thinking mode enabled -> override (T=1, top_k=0, top_p=1, min_p=0)
|
||||
// - inside DSML structural position (tool-call markers) -> force T=0
|
||||
// - otherwise -> the request's user-supplied sampling settings
|
||||
// The parser argument carries state from tokens emitted so far; its
|
||||
// IsInDsmlStructural() predicts the next token's classification.
|
||||
static SampleParams compute_sample_params(const backend::PredictOptions *request,
|
||||
const ds4cpp::DsmlParser &parser,
|
||||
bool think_enabled);
|
||||
|
||||
static ds4_think_mode parse_think_mode(const backend::PredictOptions *request) {
|
||||
// Per the vllm backend convention, "enable_thinking" gates thinking on/off,
|
||||
// and "reasoning_effort" picks the strength when on.
|
||||
const auto &md = request->metadata();
|
||||
auto et = md.find("enable_thinking");
|
||||
bool enabled = true; // default ON per ds4-server
|
||||
if (et != md.end()) enabled = (et->second == "true" || et->second == "1");
|
||||
if (!enabled) return DS4_THINK_NONE;
|
||||
auto re = md.find("reasoning_effort");
|
||||
if (re != md.end() && (re->second == "max" || re->second == "xhigh"))
|
||||
return DS4_THINK_MAX;
|
||||
return DS4_THINK_HIGH;
|
||||
}
|
||||
|
||||
static SampleParams compute_sample_params(const backend::PredictOptions *request,
|
||||
const ds4cpp::DsmlParser &parser,
|
||||
bool think_enabled) {
|
||||
SampleParams p = {
|
||||
request->temperature(),
|
||||
request->topk(),
|
||||
request->topp(),
|
||||
request->minp(),
|
||||
};
|
||||
if (think_enabled) {
|
||||
// Match ds4-server: thinking mode wants creativity in the reasoning
|
||||
// pass and the trailing content, so the entire generation overrides
|
||||
// sampling unless DSML structural bytes take over below.
|
||||
p.temperature = 1.0f;
|
||||
p.top_k = 0;
|
||||
p.top_p = 1.0f;
|
||||
p.min_p = 0.0f;
|
||||
}
|
||||
if (parser.IsInDsmlStructural()) {
|
||||
// Tool-call structural bytes (tags, markers, headers) must parse
|
||||
// cleanly. Force greedy regardless of user/thinking settings.
|
||||
p.temperature = 0.0f;
|
||||
}
|
||||
return p;
|
||||
}
|
||||
|
||||
// Build the rendered text for cache keying. We feed the same text the model
|
||||
// will see; that lets the cache survive small client-side reformatting of
|
||||
// chat history (the cache is keyed on bytes, not tokens).
|
||||
static std::string render_prompt_text(const backend::PredictOptions *request) {
|
||||
// Two-mode: either the raw prompt or the chat-template path. We mirror
|
||||
// build_prompt's branching but accumulate text (not tokens) so we can
|
||||
// SHA1 it for the cache key. ds4_session caches a tokens-indexed
|
||||
// checkpoint, but the disk format keys on bytes per ds4-server's design.
|
||||
if (!request->usetokenizertemplate() || request->messages_size() == 0) {
|
||||
return request->prompt();
|
||||
}
|
||||
std::string out;
|
||||
const std::string sys_role = "system";
|
||||
for (const auto &m : request->messages()) {
|
||||
if (m.role() == sys_role) { out += "[sys] " + m.content() + "\n"; break; }
|
||||
}
|
||||
for (const auto &m : request->messages()) {
|
||||
if (m.role() == sys_role) continue;
|
||||
out += "[" + m.role() + "] " + m.content() + "\n";
|
||||
}
|
||||
return out;
|
||||
}
|
||||
|
||||
ds4cpp::KvCache g_kv_cache;
|
||||
|
||||
// Try to recover prefill state for `rendered`. Returns the matched prefix length.
|
||||
static size_t maybe_load_cache(const std::string &rendered) {
|
||||
if (!g_kv_cache.enabled() || !g_session) return 0;
|
||||
return g_kv_cache.LoadLongestPrefix(g_session, rendered, g_ctx_size);
|
||||
}
|
||||
|
||||
static void maybe_save_cache(const std::string &rendered) {
|
||||
if (g_kv_cache.enabled() && g_session) {
|
||||
g_kv_cache.Save(g_session, rendered, g_ctx_size);
|
||||
}
|
||||
}
|
||||
|
||||
static void build_prompt(ds4_engine *engine, const backend::PredictOptions *request,
|
||||
ds4_tokens *out) {
|
||||
if (!request->usetokenizertemplate() || request->messages_size() == 0) {
|
||||
ds4_tokenize_text(engine, request->prompt().c_str(), out);
|
||||
return;
|
||||
}
|
||||
// Chat-template path: render via ds4's helpers.
|
||||
ds4_chat_begin(engine, out);
|
||||
|
||||
ds4_think_mode think = parse_think_mode(request);
|
||||
|
||||
// ds4_encode_chat_prompt is convenient when there is exactly one
|
||||
// system+user pair, but for arbitrary turn lists we use the granular
|
||||
// append helpers. Pull the first system message (if any), then append
|
||||
// every other message in order.
|
||||
const std::string sys_role = "system";
|
||||
std::string system_text;
|
||||
for (const auto &m : request->messages()) {
|
||||
if (m.role() == sys_role) { system_text = m.content(); break; }
|
||||
}
|
||||
// Inject the tools manifest into the system prompt when tools are present.
|
||||
// ds4 was trained to emit DSML tool calls ONLY when this preamble is in
|
||||
// the system message - without it, the model has no idea tools exist and
|
||||
// the e2e tool-call test will fail. The renderer lives in dsml_renderer
|
||||
// and is a verbatim port of ds4_server.c's append_tools_prompt_text.
|
||||
std::string tools_manifest;
|
||||
if (!request->tools().empty()) {
|
||||
tools_manifest = ds4cpp::RenderToolsManifest(request->tools());
|
||||
}
|
||||
if (!system_text.empty() || !tools_manifest.empty()) {
|
||||
std::string combined = system_text;
|
||||
if (!tools_manifest.empty()) {
|
||||
if (!combined.empty()) combined += "\n\n";
|
||||
combined += tools_manifest;
|
||||
}
|
||||
ds4_chat_append_message(engine, out, "system", combined.c_str());
|
||||
}
|
||||
for (const auto &m : request->messages()) {
|
||||
if (m.role() == sys_role) continue;
|
||||
if (m.role() == "assistant" && !m.tool_calls().empty()) {
|
||||
std::string combined = m.content();
|
||||
combined += ds4cpp::RenderAssistantToolCalls(m.tool_calls());
|
||||
ds4_chat_append_message(engine, out, "assistant", combined.c_str());
|
||||
} else if (m.role() == "tool") {
|
||||
std::string body = ds4cpp::RenderToolResult(m.tool_call_id(), m.content());
|
||||
ds4_chat_append_message(engine, out, "user", body.c_str());
|
||||
} else {
|
||||
ds4_chat_append_message(engine, out, m.role().c_str(), m.content().c_str());
|
||||
}
|
||||
}
|
||||
ds4_chat_append_assistant_prefix(engine, out, think);
|
||||
}
|
||||
|
||||
class DS4Backend final : public backend::Backend::Service {
|
||||
public:
|
||||
GStatus Health(ServerContext *, const backend::HealthMessage *,
|
||||
backend::Reply *reply) override {
|
||||
reply->set_message(std::string("OK"));
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
GStatus Free(ServerContext *, const backend::HealthMessage *,
|
||||
backend::Result *result) override {
|
||||
std::lock_guard<std::mutex> lock(g_engine_mu);
|
||||
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
|
||||
if (g_engine) { ds4_engine_close(g_engine); g_engine = nullptr; }
|
||||
result->set_success(true);
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
GStatus LoadModel(ServerContext *, const backend::ModelOptions *request,
|
||||
backend::Result *result) override {
|
||||
std::lock_guard<std::mutex> lock(g_engine_mu);
|
||||
|
||||
if (g_engine) {
|
||||
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
|
||||
ds4_engine_close(g_engine);
|
||||
g_engine = nullptr;
|
||||
}
|
||||
|
||||
std::string model_path = request->modelfile();
|
||||
if (model_path.empty()) model_path = request->model();
|
||||
if (model_path.empty()) {
|
||||
result->set_success(false);
|
||||
result->set_message("ds4: ModelOptions.Model or .ModelFile must be set");
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
std::string mtp_path;
|
||||
int mtp_draft = 0;
|
||||
float mtp_margin = 3.0f;
|
||||
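for (const auto &opt : request->options()) {
    auto [k, v] = split_option(opt);
    // std::stoi / std::stof throw on malformed input; report the bad
    // option instead of letting the exception escape the RPC handler.
    try {
        if (k == "mtp_path") mtp_path = v;
        else if (k == "mtp_draft") mtp_draft = std::stoi(v);
        else if (k == "mtp_margin") mtp_margin = std::stof(v);
        else if (k == "kv_cache_dir") g_kv_cache_dir = v;
    } catch (const std::exception &) {
        result->set_success(false);
        result->set_message("ds4: invalid value for option '" + k + "': " + v);
        return GStatus::OK;
    }
}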
|
||||
|
||||
g_kv_cache.SetDir(g_kv_cache_dir);
|
||||
|
||||
ds4_engine_options opt = {};
|
||||
opt.model_path = model_path.c_str();
|
||||
opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
|
||||
opt.n_threads = request->threads() > 0 ? request->threads() : 0;
|
||||
opt.mtp_draft_tokens = mtp_draft;
|
||||
opt.mtp_margin = mtp_margin;
|
||||
opt.directional_steering_file = nullptr;
|
||||
opt.warm_weights = false;
|
||||
opt.quality = false;
|
||||
|
||||
#if defined(DS4_NO_GPU)
|
||||
opt.backend = DS4_BACKEND_CPU;
|
||||
#elif defined(__APPLE__)
|
||||
opt.backend = DS4_BACKEND_METAL;
|
||||
#else
|
||||
opt.backend = DS4_BACKEND_CUDA;
|
||||
#endif
|
||||
|
||||
int rc = ds4_engine_open(&g_engine, &opt);
|
||||
if (rc != 0 || !g_engine) {
|
||||
result->set_success(false);
|
||||
result->set_message("ds4_engine_open failed (rc=" + std::to_string(rc) + ")");
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
g_ctx_size = request->contextsize() > 0 ? request->contextsize() : 32768;
|
||||
rc = ds4_session_create(&g_session, g_engine, g_ctx_size);
|
||||
if (rc != 0 || !g_session) {
|
||||
ds4_engine_close(g_engine);
|
||||
g_engine = nullptr;
|
||||
result->set_success(false);
|
||||
result->set_message("ds4_session_create failed (rc=" + std::to_string(rc) + ")");
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
result->set_success(true);
|
||||
result->set_message("loaded " + model_path);
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
GStatus TokenizeString(ServerContext *, const backend::PredictOptions *request,
|
||||
backend::TokenizationResponse *response) override {
|
||||
std::lock_guard<std::mutex> lock(g_engine_mu);
|
||||
if (!g_engine) return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
|
||||
ds4_tokens out = {};
|
||||
ds4_tokenize_text(g_engine, request->prompt().c_str(), &out);
|
||||
for (int i = 0; i < out.len; ++i) response->add_tokens(out.v[i]);
|
||||
response->set_length(out.len);
|
||||
ds4_tokens_free(&out);
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
GStatus Predict(ServerContext *, const backend::PredictOptions *request,
|
||||
backend::Reply *reply) override {
|
||||
std::lock_guard<std::mutex> lock(g_engine_mu);
|
||||
if (!g_engine || !g_session) {
|
||||
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
|
||||
}
|
||||
ds4_tokens prompt = {};
|
||||
build_prompt(g_engine, request, &prompt);
|
||||
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
|
||||
|
||||
CollectCtx collect = {g_engine, "", {}, reply, 0, {}, "", ""};
|
||||
std::string cache_key = render_prompt_text(request);
|
||||
size_t cache_hit = maybe_load_cache(cache_key);
|
||||
(void)cache_hit; // future: skip prompt prefix if hit covers full prompt
|
||||
|
||||
// Manual generation loop on g_session. When MTP speculative weights
|
||||
// were loaded (LoadModel option 'mtp_path:'), we use the
|
||||
// ds4_session_eval_speculative_argmax path which may accept N>1
|
||||
// tokens per outer iteration. Otherwise per-token argmax + eval.
|
||||
// Either way g_session advances so the disk KV cache picks up a
|
||||
// real checkpoint after the call (see maybe_save_cache below).
|
||||
char err[256] = {0};
|
||||
int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
|
||||
int prompt_len = prompt.len;
|
||||
ds4_tokens_free(&prompt);
|
||||
if (rc == 0) {
|
||||
const int eos = ds4_token_eos(g_engine);
|
||||
const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
|
||||
const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
|
||||
int produced = 0;
|
||||
while (produced < n_predict) {
|
||||
SampleParams sp = compute_sample_params(request, collect.parser, think_enabled);
|
||||
int first;
|
||||
if (sp.temperature <= 0.0f) {
|
||||
first = ds4_session_argmax(g_session);
|
||||
} else {
|
||||
first = ds4_session_sample(g_session,
|
||||
sp.temperature, sp.top_k,
|
||||
sp.top_p, sp.min_p, get_rng());
|
||||
}
|
||||
if (first == eos) break;
|
||||
// MTP only when sampling is greedy (ds4-server gate).
|
||||
if (draft_max > 0 && sp.temperature <= 0.0f) {
|
||||
constexpr int kAcceptedMax = 8;
|
||||
int accepted[kAcceptedMax];
|
||||
int cap = std::min(kAcceptedMax, draft_max + 1);
|
||||
int n = ds4_session_eval_speculative_argmax(
|
||||
g_session, first, draft_max, eos,
|
||||
accepted, cap, err, sizeof(err));
|
||||
if (n < 0) { rc = -1; break; }
|
||||
bool stop = false;
|
||||
for (int j = 0; j < n; ++j) {
|
||||
if (accepted[j] == eos) { stop = true; break; }
|
||||
collect_emit(&collect, accepted[j]);
|
||||
if (++produced >= n_predict) { stop = true; break; }
|
||||
}
|
||||
if (stop) break;
|
||||
} else {
|
||||
collect_emit(&collect, first);
|
||||
if (++produced >= n_predict) break;
|
||||
rc = ds4_session_eval(g_session, first, err, sizeof(err));
|
||||
if (rc != 0) break;
|
||||
}
|
||||
}
|
||||
collect_done(&collect);
|
||||
}
|
||||
// Only checkpoint a session that synced and generated without error.
if (rc == 0) maybe_save_cache(cache_key);
|
||||
|
||||
// Flush any buffered parser state.
|
||||
std::vector<ds4cpp::ParserEvent> events;
|
||||
collect.parser.Flush(events);
|
||||
apply_events(&collect, events);
|
||||
|
||||
if (rc != 0) {
|
||||
return GStatus(StatusCode::INTERNAL,
|
||||
std::string("ds4 generation failed: ") + err);
|
||||
}
|
||||
|
||||
// Emit one ChatDelta with content/reasoning/tool_calls.
|
||||
auto *delta = reply->add_chat_deltas();
|
||||
delta->set_content(collect.content_buf);
|
||||
delta->set_reasoning_content(collect.reasoning_buf);
|
||||
for (size_t i = 0; i < collect.pending.size(); ++i) {
|
||||
auto *tc = delta->add_tool_calls();
|
||||
tc->set_index(static_cast<int32_t>(i));
|
||||
tc->set_id(collect.pending[i].id);
|
||||
tc->set_name(collect.pending[i].name);
|
||||
tc->set_arguments(collect.pending[i].args);
|
||||
}
|
||||
|
||||
reply->set_message(collect.raw_buf);
|
||||
reply->set_tokens(collect.tokens);
|
||||
reply->set_prompt_tokens(prompt_len);
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
|
||||
ServerWriter<backend::Reply> *writer) override {
|
||||
std::lock_guard<std::mutex> lock(g_engine_mu);
|
||||
if (!g_engine || !g_session) {
|
||||
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
|
||||
}
|
||||
ds4_tokens prompt = {};
|
||||
build_prompt(g_engine, request, &prompt);
|
||||
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
|
||||
|
||||
StreamCtx s = {g_engine, writer, {}, 0, false, {}};
|
||||
std::string cache_key = render_prompt_text(request);
|
||||
size_t cache_hit = maybe_load_cache(cache_key);
|
||||
(void)cache_hit;
|
||||
|
||||
// Manual loop on g_session - see Predict() above for the rationale.
|
||||
// MTP speculative path used when ds4_engine_mtp_draft_tokens > 0.
|
||||
char err[256] = {0};
|
||||
int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
|
||||
ds4_tokens_free(&prompt);
|
||||
if (rc == 0) {
|
||||
const int eos = ds4_token_eos(g_engine);
|
||||
const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
|
||||
const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
|
||||
int produced = 0;
|
||||
while (produced < n_predict && !s.aborted) {
|
||||
SampleParams sp = compute_sample_params(request, s.parser, think_enabled);
|
||||
int first;
|
||||
if (sp.temperature <= 0.0f) {
|
||||
first = ds4_session_argmax(g_session);
|
||||
} else {
|
||||
first = ds4_session_sample(g_session,
|
||||
sp.temperature, sp.top_k,
|
||||
sp.top_p, sp.min_p, get_rng());
|
||||
}
|
||||
if (first == eos) break;
|
||||
if (draft_max > 0 && sp.temperature <= 0.0f) {
|
||||
constexpr int kAcceptedMax = 8;
|
||||
int accepted[kAcceptedMax];
|
||||
int cap = std::min(kAcceptedMax, draft_max + 1);
|
||||
int n = ds4_session_eval_speculative_argmax(
|
||||
g_session, first, draft_max, eos,
|
||||
accepted, cap, err, sizeof(err));
|
||||
if (n < 0) { rc = -1; break; }
|
||||
bool stop = false;
|
||||
for (int j = 0; j < n; ++j) {
|
||||
if (accepted[j] == eos) { stop = true; break; }
|
||||
stream_emit(&s, accepted[j]);
|
||||
if (s.aborted) { stop = true; break; }
|
||||
if (++produced >= n_predict) { stop = true; break; }
|
||||
}
|
||||
if (stop) break;
|
||||
} else {
|
||||
stream_emit(&s, first);
|
||||
if (s.aborted || ++produced >= n_predict) break;
|
||||
rc = ds4_session_eval(g_session, first, err, sizeof(err));
|
||||
if (rc != 0) break;
|
||||
}
|
||||
}
|
||||
stream_done(&s);
|
||||
}
|
||||
// Only checkpoint a session that synced and generated without error.
if (rc == 0) maybe_save_cache(cache_key);
|
||||
|
||||
// Flush parser state.
|
||||
std::vector<ds4cpp::ParserEvent> events;
|
||||
s.parser.Flush(events);
|
||||
if (!events.empty() && !s.aborted) {
|
||||
backend::Reply reply;
|
||||
auto *delta = reply.add_chat_deltas();
|
||||
for (const auto &e : events) {
|
||||
if (e.type == ds4cpp::ParserEvent::CONTENT) {
|
||||
delta->set_content(delta->content() + e.text);
|
||||
} else if (e.type == ds4cpp::ParserEvent::REASONING) {
|
||||
delta->set_reasoning_content(delta->reasoning_content() + e.text);
|
||||
}
|
||||
}
|
||||
s.writer->Write(reply);
|
||||
}
|
||||
|
||||
if (rc != 0 && !s.aborted) {
|
||||
return GStatus(StatusCode::INTERNAL,
|
||||
std::string("ds4 generation failed: ") + err);
|
||||
}
|
||||
return GStatus::OK;
|
||||
}
|
||||
|
||||
GStatus Status(ServerContext *, const backend::HealthMessage *,
|
||||
backend::StatusResponse *response) override {
|
||||
std::lock_guard<std::mutex> lock(g_engine_mu);
|
||||
response->set_state(g_engine ? backend::StatusResponse::READY
|
||||
: backend::StatusResponse::UNINITIALIZED);
|
||||
return GStatus::OK;
|
||||
}
|
||||
};
|
||||
|
||||
void RunServer(const std::string &addr) {
|
||||
DS4Backend service;
|
||||
grpc::EnableDefaultHealthCheckService(true);
|
||||
grpc::reflection::InitProtoReflectionServerBuilderPlugin();
|
||||
|
||||
ServerBuilder builder;
|
||||
builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
|
||||
builder.RegisterService(&service);
|
||||
builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
|
||||
builder.SetMaxSendMessageSize(64 * 1024 * 1024);
|
||||
|
||||
std::unique_ptr<Server> server(builder.BuildAndStart());
|
||||
if (!server) {
|
||||
std::cerr << "ds4 grpc-server: failed to bind " << addr << "\n";
|
||||
std::exit(1);
|
||||
}
|
||||
g_server = server.get();
|
||||
std::cerr << "ds4 grpc-server listening on " << addr << "\n";
|
||||
server->Wait();
|
||||
}
|
||||
|
||||
void signal_handler(int) {
|
||||
if (auto *srv = g_server.load()) {
|
||||
srv->Shutdown(std::chrono::system_clock::now() +
|
||||
std::chrono::seconds(3));
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
int main(int argc, char *argv[]) {
|
||||
std::string addr = "127.0.0.1:50051";
|
||||
for (int i = 1; i < argc; ++i) {
|
||||
std::string a = argv[i];
|
||||
const std::string addr_flag = "--addr=";
|
||||
if (a.rfind(addr_flag, 0) == 0) addr = a.substr(addr_flag.size());
|
||||
else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
|
||||
else if (a == "--help" || a == "-h") {
|
||||
std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
std::signal(SIGINT, signal_handler);
|
||||
std::signal(SIGTERM, signal_handler);
|
||||
RunServer(addr);
|
||||
return 0;
|
||||
}
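
The sampling precedence applied by `compute_sample_params` earlier in this file can be summarized in a small standalone sketch. It assumes the thinking override is gated on a boolean flag (the visible part of the function does not show the condition), and `resolve_sampling` plus this `SampleParams` layout are hypothetical stand-ins, not the backend's own definitions.

```cpp
// Hypothetical stand-in for the backend's SampleParams; the field names
// follow the ones assigned in compute_sample_params.
struct SampleParams {
    float temperature;
    int   top_k;
    float top_p;
    float min_p;
};

// Precedence sketch (lowest to highest):
//   1. values taken from the request,
//   2. the thinking-mode override (temperature 1.0, no top-k/top-p/min-p cut),
//   3. greedy decoding while the parser is inside DSML structural bytes.
// Assumption: thinking mode is signalled by a plain bool.
SampleParams resolve_sampling(SampleParams requested, bool thinking,
                              bool in_structural) {
    SampleParams p = requested;
    if (thinking) {
        p.temperature = 1.0f;
        p.top_k = 0;
        p.top_p = 1.0f;
        p.min_p = 0.0f;
    }
    if (in_structural) {
        // Only the temperature is forced; the generation loops treat
        // temperature <= 0 as "use argmax".
        p.temperature = 0.0f;
    }
    return p;
}
```

Because the structural check runs last, a tool call emitted during a thinking-mode generation is still decoded greedily, which is also the only case in which the loops above take the MTP speculative path.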
|
||||
205
backend/cpp/ds4/kv_cache.cpp
Normal file
@@ -0,0 +1,205 @@
|
||||
#include "kv_cache.h"
|
||||
|
||||
#include <cerrno>
#include <cstdint>
|
||||
#include <cstdio>
|
||||
#include <cstring>
|
||||
#include <dirent.h>
|
||||
#include <fstream>
|
||||
#include <sys/stat.h>
|
||||
#include <vector>
|
||||
|
||||
namespace ds4cpp {
|
||||
|
||||
namespace {
|
||||
|
||||
// Minimal SHA-1 (public-domain reference implementation). Kept local to this
// file and exposed only through Sha1Hex below.
|
||||
struct Sha1 {
|
||||
uint32_t h[5];
|
||||
uint64_t bits;
|
||||
uint8_t block[64];
|
||||
size_t used;
|
||||
Sha1() { h[0]=0x67452301; h[1]=0xEFCDAB89; h[2]=0x98BADCFE; h[3]=0x10325476; h[4]=0xC3D2E1F0; bits=0; used=0; }
|
||||
static uint32_t rol(uint32_t x, int n){ return (x<<n)|(x>>(32-n)); }
|
||||
void transform(const uint8_t *b) {
|
||||
uint32_t w[80];
|
||||
for (int i=0;i<16;i++) w[i] = (uint32_t)b[i*4]<<24 | (uint32_t)b[i*4+1]<<16 | (uint32_t)b[i*4+2]<<8 | b[i*4+3];
|
||||
for (int i=16;i<80;i++) w[i] = rol(w[i-3]^w[i-8]^w[i-14]^w[i-16], 1);
|
||||
uint32_t a=h[0],bb=h[1],c=h[2],d=h[3],e=h[4];
|
||||
for (int i=0;i<80;i++) {
|
||||
uint32_t f,k;
|
||||
if (i<20) { f=(bb&c)|((~bb)&d); k=0x5A827999; }
|
||||
else if (i<40) { f=bb^c^d; k=0x6ED9EBA1; }
|
||||
else if (i<60) { f=(bb&c)|(bb&d)|(c&d); k=0x8F1BBCDC; }
|
||||
else { f=bb^c^d; k=0xCA62C1D6; }
|
||||
uint32_t t = rol(a,5)+f+e+k+w[i];
|
||||
e=d; d=c; c=rol(bb,30); bb=a; a=t;
|
||||
}
|
||||
h[0]+=a; h[1]+=bb; h[2]+=c; h[3]+=d; h[4]+=e;
|
||||
}
|
||||
void update(const void *p, size_t n) {
|
||||
const uint8_t *bp = (const uint8_t*)p;
|
||||
bits += (uint64_t)n*8;
|
||||
while (n) {
|
||||
size_t take = 64-used;
|
||||
if (take>n) take=n;
|
||||
std::memcpy(block+used, bp, take);
|
||||
used += take; bp += take; n -= take;
|
||||
if (used == 64) { transform(block); used = 0; }
|
||||
}
|
||||
}
|
||||
void final(uint8_t out[20]) {
|
||||
uint8_t pad[64] = {0x80};
|
||||
size_t padlen = (used < 56) ? (56-used) : (120-used);
|
||||
uint64_t lb = bits;
|
||||
uint8_t len[8];
|
||||
for (int i=0;i<8;i++) len[7-i] = (uint8_t)(lb >> (i*8));
|
||||
update(pad, padlen);
|
||||
update(len, 8);
|
||||
for (int i=0;i<5;i++) {
|
||||
out[i*4] = h[i]>>24;
|
||||
out[i*4+1] = h[i]>>16;
|
||||
out[i*4+2] = h[i]>>8;
|
||||
out[i*4+3] = h[i];
|
||||
}
|
||||
}
|
||||
};
|
||||
|
||||
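// Creates `d` if it does not exist. Despite the name this is a single mkdir():
// parent directories must already exist, and a failed mkdir is not reported.
std::string mkdir_p(const std::string &d) {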
|
||||
if (d.empty()) return d;
|
||||
struct stat st{};
|
||||
if (stat(d.c_str(), &st) == 0) return d;
|
||||
mkdir(d.c_str(), 0755);
|
||||
return d;
|
||||
}
|
||||
|
||||
bool file_exists(const std::string &p) {
|
||||
struct stat st{};
|
||||
return stat(p.c_str(), &st) == 0;
|
||||
}
|
||||
|
||||
} // namespace
|
||||
|
||||
std::string Sha1Hex(const void *data, size_t len) {
|
||||
Sha1 s;
|
||||
s.update(data, len);
|
||||
uint8_t out[20];
|
||||
s.final(out);
|
||||
char hex[41];
|
||||
for (int i = 0; i < 20; ++i) std::snprintf(hex + i*2, 3, "%02x", out[i]);
|
||||
hex[40] = 0;
|
||||
return std::string(hex);
|
||||
}
|
||||
|
||||
KvCache::KvCache() = default;
|
||||
|
||||
void KvCache::SetDir(const std::string &dir) {
|
||||
dir_ = dir;
|
||||
if (!dir_.empty()) {
|
||||
mkdir_p(dir_);
|
||||
std::fprintf(stderr, "ds4 KvCache: enabled at %s\n", dir_.c_str());
|
||||
} else {
|
||||
std::fprintf(stderr, "ds4 KvCache: disabled (no dir set)\n");
|
||||
}
|
||||
}
|
||||
|
||||
std::string KvCache::Path(const std::string &rendered_text) const {
|
||||
if (dir_.empty()) return "";
|
||||
return dir_ + "/" + Sha1Hex(rendered_text.data(), rendered_text.size()) + ".kv";
|
||||
}
|
||||
|
||||
size_t KvCache::LoadLongestPrefix(ds4_session *session,
                                  const std::string &rendered_text,
                                  int ctx_size) {
    if (dir_.empty() || !session) return 0;
    // Strategy: enumerate all .kv files in dir, read their stored prefix
    // header, pick the longest one that is also a prefix of rendered_text.
    DIR *d = opendir(dir_.c_str());
    if (!d) return 0;
    struct dirent *de;
    size_t best_len = 0;
    std::string best_path;
    while ((de = readdir(d)) != nullptr) {
        std::string name = de->d_name;
        if (name.size() < 4 || name.substr(name.size()-3) != ".kv") continue;
        std::string path = dir_ + "/" + name;
        std::ifstream f(path, std::ios::binary);
        if (!f) continue;
        char magic[4]; f.read(magic, 4);
        if (f.gcount() != 4 || std::memcmp(magic, "DS4G", 4) != 0) continue;
        uint32_t version=0, file_ctx=0, prefix_len=0;
        f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
        if (!f) continue;  // truncated header
        if (version != 1) continue;
        if ((int)file_ctx != ctx_size) continue;
        // An empty prefix can never beat best_len == 0, so skip it early.
        if (prefix_len == 0 || prefix_len > rendered_text.size()) continue;
        std::vector<char> prefix(prefix_len);
        f.read(prefix.data(), prefix_len);
        if (!f) continue;  // truncated prefix text
        if (std::memcmp(prefix.data(), rendered_text.data(), prefix_len) != 0) continue;
        if (prefix_len > best_len) {
            best_len = prefix_len;
            best_path = path;
        }
    }
    closedir(d);
    if (best_len == 0) return 0;

    // Load best_path's payload into session. ds4_session_load_payload reads
    // from a FILE*, so the header is skipped with stdio as well.
    FILE *fp = std::fopen(best_path.c_str(), "rb");
    if (!fp) return 0;
    // Skip magic + version + ctx_size + prefix_len (4 bytes each) and the
    // prefix text, then read the payload_bytes field.
    uint64_t payload_bytes = 0;
    if (std::fseek(fp, 16 + static_cast<long>(best_len), SEEK_SET) != 0 ||
        std::fread(&payload_bytes, 8, 1, fp) != 1) {
        std::fclose(fp);
        return 0;
    }
    char errbuf[256] = {0};
    int rc = ds4_session_load_payload(session, fp, payload_bytes, errbuf, sizeof(errbuf));
    std::fclose(fp);
    if (rc != 0) return 0;
    return best_len;
}
|
||||
|
||||
void KvCache::Save(ds4_session *session, const std::string &rendered_text, int ctx_size) {
|
||||
if (dir_.empty()) {
|
||||
std::fprintf(stderr, "ds4 KvCache::Save: skipped (dir empty)\n");
|
||||
return;
|
||||
}
|
||||
if (!session) {
|
||||
std::fprintf(stderr, "ds4 KvCache::Save: skipped (session null)\n");
|
||||
return;
|
||||
}
|
||||
std::string path = Path(rendered_text);
|
||||
uint64_t payload_bytes = ds4_session_payload_bytes(session);
|
||||
std::fprintf(stderr, "ds4 KvCache::Save: path=%s payload_bytes=%llu prefix_len=%zu\n",
|
||||
path.c_str(), (unsigned long long)payload_bytes, rendered_text.size());
|
||||
FILE *fp = std::fopen(path.c_str(), "wb");
|
||||
if (!fp) {
|
||||
std::fprintf(stderr, "ds4 KvCache::Save: fopen failed: %s\n", std::strerror(errno));
|
||||
return;
|
||||
}
|
||||
char magic[4] = {'D','S','4','G'};
|
||||
uint32_t version = 1;
|
||||
uint32_t ctx = static_cast<uint32_t>(ctx_size);
|
||||
uint32_t prefix_len = static_cast<uint32_t>(rendered_text.size());
|
||||
std::fwrite(magic, 4, 1, fp);
|
||||
std::fwrite(&version, 4, 1, fp);
|
||||
std::fwrite(&ctx, 4, 1, fp);
|
||||
std::fwrite(&prefix_len, 4, 1, fp);
|
||||
std::fwrite(rendered_text.data(), prefix_len, 1, fp);
|
||||
std::fwrite(&payload_bytes, 8, 1, fp);
|
||||
char errbuf[256] = {0};
|
||||
int rc = ds4_session_save_payload(session, fp, errbuf, sizeof(errbuf));
|
||||
std::fclose(fp);
|
||||
if (rc != 0) {
|
||||
std::fprintf(stderr, "ds4 KvCache::Save: ds4_session_save_payload rc=%d err=%s; removing %s\n",
|
||||
rc, errbuf, path.c_str());
|
||||
std::remove(path.c_str());
|
||||
} else {
|
||||
std::fprintf(stderr, "ds4 KvCache::Save: wrote %s ok\n", path.c_str());
|
||||
}
|
||||
}
|
||||
|
||||
} // namespace ds4cpp
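
The selection rule inside `KvCache::LoadLongestPrefix` (scan every candidate, keep the longest stored text that is a byte prefix of the new prompt) can be illustrated without any file I/O. The sketch below is a simplified in-memory analogue; `longest_prefix_index` is a hypothetical helper and does not exist in the backend.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the longest entry in `stored` that is a byte prefix
// of `rendered`, or -1 when nothing matches. Empty entries never match,
// mirroring the "best_len == 0 means no hit" convention of the cache.
int longest_prefix_index(const std::vector<std::string> &stored,
                         const std::string &rendered) {
    int best = -1;
    std::size_t best_len = 0;
    for (std::size_t i = 0; i < stored.size(); ++i) {
        const std::string &s = stored[i];
        if (s.size() > rendered.size()) continue;            // too long to be a prefix
        if (rendered.compare(0, s.size(), s) != 0) continue;  // bytes differ
        if (s.size() > best_len) {
            best_len = s.size();
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Matching on rendered bytes means a conversation that only appends turns keeps hitting its earlier checkpoint, while any edit to an earlier message falls back to a shorter match or to no hit at all.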
|
||||
44
backend/cpp/ds4/kv_cache.h
Normal file
@@ -0,0 +1,44 @@
#pragma once
#include <string>
extern "C" {
#include "ds4.h"
}

namespace ds4cpp {

// Disk-backed KV cache for ds4 sessions. Keyed by SHA1(rendered prompt prefix).
// Format (our own, NOT bit-compatible with ds4-server's KVC files - interop
// is a follow-up plan):
//
//   "DS4G" (4 bytes magic) + u32 version=1 + u32 ctx_size +
//   u32 prefix_text_len + prefix_text + u64 payload_bytes + payload
class KvCache {
public:
    KvCache();  // disabled (dir empty)

    // Set the cache directory. Empty disables.
    void SetDir(const std::string &dir);

    // Returns the cache file path for a given rendered text prefix.
    std::string Path(const std::string &rendered_text) const;

    // Look up the longest cached prefix that is also a prefix of
    // `rendered_text`. Loads it into `session` if found. Returns the
    // matched prefix length in bytes (0 if no hit).
    size_t LoadLongestPrefix(ds4_session *session,
                             const std::string &rendered_text,
                             int ctx_size);

    // Save the current session, associated with this rendered text prefix.
    void Save(ds4_session *session, const std::string &rendered_text, int ctx_size);

    bool enabled() const { return !dir_.empty(); }

private:
    std::string dir_;
};

// Compute SHA1 of arbitrary bytes; returns 40-char hex.
std::string Sha1Hex(const void *data, size_t len);

} // namespace ds4cpp
|
||||
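
The file layout documented in `kv_cache.h` ("DS4G" magic, three `u32` fields, the prefix text, then a `u64` payload size) can be exercised with an in-memory round trip. This is a minimal sketch under stated assumptions: `CacheHeader`, `encode_header` and `decode_header` are hypothetical names introduced here, and the integers use host byte order because the backend writes them with plain `fwrite`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper type; the backend has no such struct.
struct CacheHeader {
    uint32_t version = 0;
    uint32_t ctx_size = 0;
    std::string prefix;
    uint64_t payload_bytes = 0;
};

// Serialize the header fields in the documented order (host byte order).
std::vector<uint8_t> encode_header(const CacheHeader &h) {
    std::vector<uint8_t> out;
    auto put = [&out](const void *p, std::size_t n) {
        const uint8_t *b = static_cast<const uint8_t *>(p);
        out.insert(out.end(), b, b + n);
    };
    const uint32_t prefix_len = static_cast<uint32_t>(h.prefix.size());
    put("DS4G", 4);
    put(&h.version, 4);
    put(&h.ctx_size, 4);
    put(&prefix_len, 4);
    put(h.prefix.data(), prefix_len);
    put(&h.payload_bytes, 8);
    return out;
}

// Parse a buffer. Returns false on bad magic or truncation; on success
// *payload_offset is the position where the payload bytes would start.
bool decode_header(const std::vector<uint8_t> &buf, CacheHeader *h,
                   std::size_t *payload_offset) {
    std::size_t pos = 0;
    auto get = [&](void *p, std::size_t n) {
        if (buf.size() - pos < n) return false;  // truncated input
        std::memcpy(p, buf.data() + pos, n);
        pos += n;
        return true;
    };
    char magic[4];
    uint32_t prefix_len = 0;
    if (!get(magic, 4) || std::memcmp(magic, "DS4G", 4) != 0) return false;
    if (!get(&h->version, 4) || !get(&h->ctx_size, 4) || !get(&prefix_len, 4)) return false;
    if (buf.size() - pos < prefix_len) return false;
    h->prefix.assign(reinterpret_cast<const char *>(buf.data()) + pos, prefix_len);
    pos += prefix_len;
    if (!get(&h->payload_bytes, 8)) return false;
    *payload_offset = pos;
    return true;
}
```

With this layout the payload begins at byte `16 + prefix_text_len + 8`. Since the fields are written in host byte order, a cache file is presumably not portable between machines of different endianness.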
39
backend/cpp/ds4/package.sh
Executable file
@@ -0,0 +1,39 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."

mkdir -p "$CURDIR/package/lib"
cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
cp -rfv "$CURDIR/run.sh" "$CURDIR/package/"

UNAME_S=$(uname -s)
if [ "$UNAME_S" = "Darwin" ]; then
    # Darwin: bundle dylibs via otool -L (handled by scripts/build/ds4-darwin.sh).
    echo "package.sh: Darwin handled by ds4-darwin.sh"
    exit 0
fi

if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
    LIBDIR=/lib/x86_64-linux-gnu
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
    LIBDIR=/lib/aarch64-linux-gnu
else
    echo "package.sh: unknown architecture" >&2; exit 1
fi

for lib in libc.so.6 libgcc_s.so.1 libstdc++.so.6 libm.so.6 libgomp.so.1 \
           libdl.so.2 librt.so.1 libpthread.so.0; do
    cp -arfLv "$LIBDIR/$lib" "$CURDIR/package/lib/$lib"
done

GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
fi

echo "ds4 package contents:"
ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"
|
||||
9
backend/cpp/ds4/run.sh
Executable file
@@ -0,0 +1,9 @@
#!/bin/bash
# Entry point for the ds4 backend image / BACKEND_BINARY mode.
set -e
CURDIR=$(dirname "$(realpath "$0")")
# Append the existing value only when it is set: a trailing ':' would add an
# empty entry, which the dynamic loader treats as the current directory.
export LD_LIBRARY_PATH="$CURDIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
if [ -f "$CURDIR/lib/ld.so" ]; then
    exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
fi
exec "$CURDIR/grpc-server" "$@"
|
||||
@@ -1,5 +1,5 @@
|
||||
|
||||
IK_LLAMA_VERSION?=23127139cb6fa314899c3b5f4935b88b3374c56c
|
||||
IK_LLAMA_VERSION?=c35189d83c91aad780aba62b89f2830cb2916223
|
||||
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
|
||||
|
||||
CMAKE_ARGS?=
|
||||
|
||||
@@ -1,5 +1,5 @@
|
||||
|
||||
LLAMA_VERSION?=389ff61d77b5c71cec0cf92fe4e5d01ace80b797
|
||||
LLAMA_VERSION?=87589042cac2c390cec8d68fb2fad64e0a2a252a
|
||||
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
|
||||
|
||||
CMAKE_ARGS?=
|
||||
|
||||
@@ -32,10 +32,13 @@
|
||||
#include <grpcpp/health_check_service_interface.h>
|
||||
#include <grpcpp/security/server_credentials.h>
|
||||
#include <regex>
|
||||
#include <algorithm>
|
||||
#include <atomic>
|
||||
#include <cstdlib>
|
||||
#include <fstream>
|
||||
#include <iterator>
|
||||
#include <list>
|
||||
#include <map>
|
||||
#include <mutex>
|
||||
#include <signal.h>
|
||||
#include <thread>
|
||||
@@ -443,10 +446,24 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
|
||||
// Draft model for speculative decoding
|
||||
if (!request->draftmodel().empty()) {
|
||||
params.speculative.draft.mparams.path = request->draftmodel();
|
||||
// Default to draft type if a draft model is set but no explicit type
|
||||
// Default to draft type if a draft model is set but no explicit type.
|
||||
// Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
|
||||
// vector; the turboquant fork still uses the legacy scalar. The
|
||||
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
|
||||
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
|
||||
// Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
|
||||
// in ggml-org/llama.cpp#22964; the fork still uses the old name.
|
||||
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
|
||||
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
|
||||
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
|
||||
}
|
||||
#else
|
||||
const bool no_spec_type = params.speculative.types.empty() ||
|
||||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
|
||||
if (no_spec_type) {
|
||||
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
// params.model_alias ??
|
||||
@@ -671,12 +688,178 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
|
||||
// If conversion fails, keep default value (8)
|
||||
}
|
||||
}
|
||||
|
||||
// --- physical batch size (upstream -ub / --ubatch-size) ---
|
||||
// Note: line ~482 already aliases n_ubatch to n_batch as a default; this
|
||||
        // option lets users decouple the two (useful for embeddings/rerank).
        } else if (!strcmp(optname, "n_ubatch") || !strcmp(optname, "ubatch")) {
            if (optval != NULL) {
                try { params.n_ubatch = std::stoi(optval_str); } catch (...) {}
            }

        // --- main-model batch threads (upstream -tb / --threads-batch) ---
        } else if (!strcmp(optname, "threads_batch") || !strcmp(optname, "n_threads_batch")) {
            if (optval != NULL) {
                try {
                    int n = std::stoi(optval_str);
                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
                    params.cpuparams_batch.n_threads = n;
                } catch (...) {}
            }

        // --- pooling type for embeddings (upstream --pooling) ---
        } else if (!strcmp(optname, "pooling_type") || !strcmp(optname, "pooling")) {
            if (optval != NULL) {
                if (optval_str == "none") params.pooling_type = LLAMA_POOLING_TYPE_NONE;
                else if (optval_str == "mean") params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
                else if (optval_str == "cls") params.pooling_type = LLAMA_POOLING_TYPE_CLS;
                else if (optval_str == "last") params.pooling_type = LLAMA_POOLING_TYPE_LAST;
                else if (optval_str == "rank") params.pooling_type = LLAMA_POOLING_TYPE_RANK;
                // unknown values silently leave UNSPECIFIED (auto-detect)
            }

        // --- llama log verbosity threshold (upstream -lv / --verbosity) ---
        } else if (!strcmp(optname, "verbosity")) {
            if (optval != NULL) {
                try { params.verbosity = std::stoi(optval_str); } catch (...) {}
            }

        // --- O_DIRECT model loading (upstream --direct-io) ---
        } else if (!strcmp(optname, "direct_io") || !strcmp(optname, "use_direct_io")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.use_direct_io = true;
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.use_direct_io = false;
            }

        // --- embedding normalization (upstream --embd-normalize) ---
        // -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm
        } else if (!strcmp(optname, "embd_normalize") || !strcmp(optname, "embedding_normalize")) {
            if (optval != NULL) {
                try { params.embd_normalize = std::stoi(optval_str); } catch (...) {}
            }

        // --- reasoning parser (upstream --reasoning-format) ---
        // Picks the parser for <think> blocks emitted by reasoning models.
        // none / auto / deepseek / deepseek-legacy
        } else if (!strcmp(optname, "reasoning_format")) {
            if (optval != NULL) {
                if (optval_str == "none") params.reasoning_format = COMMON_REASONING_FORMAT_NONE;
                else if (optval_str == "auto") params.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
                else if (optval_str == "deepseek") params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
                else if (optval_str == "deepseek-legacy" || optval_str == "deepseek_legacy")
                    params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
                // unknown values silently keep the upstream default (DEEPSEEK)
            }

        // --- reasoning budget (upstream --reasoning-budget) ---
        // -1 unlimited, 0 disabled, >0 token budget for thinking blocks.
        // Distinct from per-request `enable_thinking` (chat_template_kwargs).
        } else if (!strcmp(optname, "enable_reasoning") || !strcmp(optname, "reasoning_budget")) {
            if (optval != NULL) {
                try { params.enable_reasoning = std::stoi(optval_str); } catch (...) {}
            }

        // --- prefill assistant turn (upstream --no-prefill-assistant) ---
        } else if (!strcmp(optname, "prefill_assistant")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.prefill_assistant = true;
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.prefill_assistant = false;
            }

        // --- mmproj GPU offload (upstream --no-mmproj-offload, inverted) ---
        } else if (!strcmp(optname, "mmproj_use_gpu") || !strcmp(optname, "mmproj_offload")) {
            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
                params.mmproj_use_gpu = true;
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.mmproj_use_gpu = false;
            }

        // --- per-image vision token budget (upstream --image-min/max-tokens) ---
        } else if (!strcmp(optname, "image_min_tokens")) {
            if (optval != NULL) {
                try { params.image_min_tokens = std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "image_max_tokens")) {
            if (optval != NULL) {
                try { params.image_max_tokens = std::stoi(optval_str); } catch (...) {}
            }

        // --- main-model tensor buffer overrides (upstream --override-tensor) ---
        // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
        // Mirrors the existing `draft_override_tensor` parser below.
        } else if (!strcmp(optname, "override_tensor") || !strcmp(optname, "tensor_buft_overrides")) {
            ggml_backend_load_all();
            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
                auto * dev = ggml_backend_dev_get(i);
                auto * buft = ggml_backend_dev_buffer_type(dev);
                if (buft) {
                    buft_list[ggml_backend_buft_name(buft)] = buft;
                }
            }
            static std::list<std::string> override_names;
            std::string cur;
            auto flush = [&](const std::string & spec) {
                auto pos = spec.find('=');
                if (pos == std::string::npos) return;
                const std::string name = spec.substr(0, pos);
                const std::string type = spec.substr(pos + 1);
                auto it = buft_list.find(type);
                if (it == buft_list.end()) return; // unknown buffer type: ignore
                override_names.push_back(name);
                params.tensor_buft_overrides.push_back(
                    {override_names.back().c_str(), it->second});
            };
            for (char c : optval_str) {
                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
                else { cur.push_back(c); }
            }
            if (!cur.empty()) flush(cur);

        // Speculative decoding options
        } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
-            auto type = common_speculative_type_from_name(optval_str);
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
            // Fork only knows a single scalar `type`. Take the first comma-
            // separated value and assign it via the singular helper.
            std::string first = optval_str;
            const auto comma = first.find(',');
            if (comma != std::string::npos) first = first.substr(0, comma);
            auto type = common_speculative_type_from_name(first);
            if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
                params.speculative.type = type;
            }
#else
            // Upstream switched to a vector of types (comma-separated for multi-type
            // chaining via common_speculative_types_from_names). We keep accepting a
            // single value here, but also tolerate comma-separated lists.
            //
            // ggml-org/llama.cpp#22964 also renamed the registered names from
            // underscore- to dash-separated form, and replaced the bare
            // `draft`/`eagle3` aliases with `draft-simple`/`draft-eagle3`. We
            // normalize each token here so existing model configs keep working.
            auto normalize_spec_name = [](std::string s) -> std::string {
                std::replace(s.begin(), s.end(), '_', '-');
                if (s == "draft") return "draft-simple";
                if (s == "eagle3") return "draft-eagle3";
                return s;
            };
            std::vector<std::string> names;
            std::string item;
            for (char c : optval_str) {
                if (c == ',') {
                    if (!item.empty()) { names.push_back(normalize_spec_name(item)); item.clear(); }
                } else {
                    item.push_back(c);
                }
            }
            if (!item.empty()) names.push_back(normalize_spec_name(item));
            auto parsed = common_speculative_types_from_names(names);
            if (!parsed.empty()) {
                params.speculative.types = parsed;
            }
#endif
        } else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
            if (optval != NULL) {
                try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
||||
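The non-legacy `spec_type` branch above splits the option value on commas, drops empty items, and rewrites older names before looking them up. The following is a minimal Python sketch of that normalization step only, written for illustration: the function names are hypothetical, `ngram_mod` is just a sample input, and the backend itself does this in C++ before calling `common_speculative_types_from_names`.

```python
def normalize_spec_name(name: str) -> str:
    """Map an older option spelling to the dash-separated form."""
    name = name.replace("_", "-")
    # Bare aliases that the diff says were replaced upstream.
    aliases = {"draft": "draft-simple", "eagle3": "draft-eagle3"}
    return aliases.get(name, name)


def split_spec_types(value: str) -> list[str]:
    """Split a comma-separated option value, skipping empty items."""
    return [normalize_spec_name(tok) for tok in value.split(",") if tok]


if __name__ == "__main__":
    print(split_spec_types("draft,ngram_mod"))
```

Like the C++ loop, this sketch does not trim whitespace around items, so `"draft, eagle3"` would keep the leading space on the second name.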
@@ -710,10 +893,155 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "draft_ctx_size")) {
-            if (optval != NULL) {
-                try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
-            }
+            // The draft context size is no longer a separate field upstream: the draft
+            // shares the target context size. Accept the option for backward
+            // compatibility but silently ignore it.

        // Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
        // (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
        // `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
        // fields. The turboquant fork branched before that, so its build defines
        // LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
        // keys become unrecognized (silently dropped, like any unknown opt) for it.
        //
        // The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
        // closing-brace position of the `draft_ctx_size` branch on purpose: in the
        // legacy build the chain ends here (the brace closes draft_ctx_size), and in
        // the modern build the chain continues with `} else if (...)` instead, so the
        // brace count stays balanced under both branches of the preprocessor.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
        }
#else
        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
            if (optval != NULL) {
                try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
            if (optval != NULL) {
                try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
            if (optval != NULL) {
                try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
            }

        // --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
        } else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
            if (optval != NULL) {
                try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
            if (optval != NULL) {
                try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
            if (optval != NULL) {
                try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
            }

        // --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
            if (optval != NULL) {
                try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
            if (optval != NULL) {
                try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
            }
        } else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
            if (optval != NULL) {
                try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
            }

        // --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
        } else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
            params.speculative.ngram_cache.lookup_cache_static = optval_str;
        } else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
            params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;

        // --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
        } else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
            params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
        } else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
            params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);

        // --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
        } else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
            if (optval != NULL) {
                try {
                    int n = std::stoi(optval_str);
                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
                    params.speculative.draft.cpuparams.n_threads = n;
                } catch (...) {}
            }
        } else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
            if (optval != NULL) {
                try {
                    int n = std::stoi(optval_str);
                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
                    params.speculative.draft.cpuparams_batch.n_threads = n;
                } catch (...) {}
            }

        // --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
        } else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
            // Bool-style flag: optval may be missing, "true"/"1"/"yes" enables.
            const bool enable = (optval == NULL) ||
                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
                optval_str == "on" || optval_str == "enabled";
            if (enable) {
                params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
            }
        } else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
            if (optval != NULL) {
                try {
                    int n = std::stoi(optval_str);
                    if (n < 0) n = 0;
                    // Keep override-name storage alive for the lifetime of the params struct
                    // (mirrors upstream arg.cpp behavior with a function-local static).
                    static std::list<std::string> buft_overrides_draft;
                    for (int i = 0; i < n; ++i) {
                        buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
                        params.speculative.draft.tensor_buft_overrides.push_back(
                            {buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
                    }
                } catch (...) {}
            }

        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
            // Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
            // We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
            ggml_backend_load_all();
            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
                auto * dev = ggml_backend_dev_get(i);
                auto * buft = ggml_backend_dev_buffer_type(dev);
                if (buft) {
                    buft_list[ggml_backend_buft_name(buft)] = buft;
                }
            }
            static std::list<std::string> draft_override_names;
            std::string cur;
            auto flush = [&](const std::string & spec) {
                auto pos = spec.find('=');
                if (pos == std::string::npos) return;
                const std::string name = spec.substr(0, pos);
                const std::string type = spec.substr(pos + 1);
                auto it = buft_list.find(type);
                if (it == buft_list.end()) return; // unknown buffer type: ignore
                draft_override_names.push_back(name);
                params.speculative.draft.tensor_buft_overrides.push_back(
                    {draft_override_names.back().c_str(), it->second});
            };
            for (char c : optval_str) {
                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
                else { cur.push_back(c); }
            }
            if (!cur.empty()) flush(cur);
        }
#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
    }

    // Set params.n_parallel from environment variable if not set via options (fallback)
||||
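Both `override_tensor` and `draft_override_tensor` above accept a value of the form `<tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...`, split each item at the first `=`, and ignore items with no `=` or with an unknown buffer type. The parsing rule can be sketched in Python as follows. This is an illustration, not the backend code: `parse_tensor_overrides` is a hypothetical name, and the set of known buffer types is passed in by the caller, whereas the C++ builds it from the loaded ggml devices.

```python
def parse_tensor_overrides(value, known_buffer_types):
    """Return (tensor_regex, buffer_type) pairs from a comma-separated spec."""
    overrides = []
    for spec in value.split(","):
        if not spec:
            continue  # empty item, e.g. from a doubled comma
        pattern, sep, buffer_type = spec.partition("=")  # split at the first '='
        if not sep:
            continue  # no '=' in this item: ignored
        if buffer_type not in known_buffer_types:
            continue  # unknown buffer type: ignored
        overrides.append((pattern, buffer_type))
    return overrides


if __name__ == "__main__":
    # "CPU" stands in for whatever buffer type names the devices report.
    print(parse_tensor_overrides(r"blk\.\d+\.ffn_.*=CPU,oops,x=UNKNOWN", {"CPU"}))
```

Because items are split on every comma, a tensor regex that itself contains a comma (for example a `{1,2}` quantifier) cannot be expressed in this format; the same holds for the character loop in the C++ branches.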
@@ -2610,7 +2938,9 @@ public:
             }
         }

-        int embd_normalize = 2; // default to Euclidean/L2 norm
+        // Honor the load-time embd_normalize set via options:embd_normalize.
+        // -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm.
+        int embd_normalize = params_base.embd_normalize;
         // create and queue the task
         auto rd = ctx_server.get_response_reader();
         {
||||
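The `embd_normalize` values listed in the comment above (-1 none, 0 max-abs, 1 taxicab, 2 L2, >2 p-norm) select which norm the embedding vector is divided by. The sketch below uses the textbook definitions of those norms to show what each mode means; `normalize_embedding` is a hypothetical helper, and the upstream llama.cpp routine may apply additional scaling (for instance in the max-abs mode), so treat the exact magnitudes as an assumption.

```python
def normalize_embedding(vec, mode=2):
    """Scale `vec` according to an embd_normalize-style mode."""
    if mode < 0:
        return list(vec)  # -1: leave the vector untouched
    if mode == 0:
        # 0: divide by the largest absolute component
        scale = max((abs(x) for x in vec), default=0.0)
    else:
        # 1 taxicab, 2 Euclidean (L2), >2 general p-norm
        scale = sum(abs(x) ** mode for x in vec) ** (1.0 / mode)
    if scale == 0:
        return [0.0 for _ in vec]  # avoid dividing by zero
    return [x / scale for x in vec]


if __name__ == "__main__":
    for mode in (-1, 0, 1, 2, 3):
        print(mode, normalize_embedding([3.0, -4.0], mode))
```

With the default mode 2, the result has unit Euclidean length, which is what cosine-similarity search usually expects.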
@@ -2704,7 +3034,7 @@ public:

         tasks.reserve(documents.size());
         for (size_t i = 0; i < documents.size(); i++) {
-            auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
+            auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
             server_task task = server_task(SERVER_TASK_TYPE_RERANK);
             task.id = rd.queue_tasks.get_new_id();
             task.index = i;
@@ -2882,7 +3212,7 @@ public:
         // Get template source and reconstruct a common_chat_template for analysis
         std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
         if (!tmpl_src.empty()) {
-            const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
+            const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
             std::string token_bos, token_eos;
             if (vocab) {
                 auto bos_id = llama_vocab_bos(vocab);

@@ -1,7 +1,7 @@

 # Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
-TURBOQUANT_VERSION?=69d8e4be47243e83b3d0d71e932bc7aa61c644dc
+TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
 LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant

 CMAKE_ARGS?=

@@ -108,4 +108,47 @@ else
    echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi

# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
# ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
# exposes the field as `model` on `server_context_impl`. The two call sites
# are in the Rerank and ModelMetadata RPC handlers.
if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
    echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
    sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
    echo "==> model_tgt rename OK"
else
    echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
fi

# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
# blocks reference struct fields that simply do not exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
    echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else
    echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
    # Insert the define before the very first `#include` so it precedes all the
    # speculative-decoding code paths.
    awk '
        !done && /^#include/ {
            print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
            print ""
            done = 1
        }
        { print }
        END {
            if (!done) {
                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
                exit 1
            }
        }
    ' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
    echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
fi

echo "==> all patches applied"
||||
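Step 5 of the patch script above inserts a `#define` before the first `#include` and skips the file if the define is already present, so running the script twice leaves the file unchanged. As a rough illustration of that idempotent insertion, here is a Python sketch operating on a string. `inject_define` is a hypothetical helper, not part of the repository, and it omits the explanatory comment line that the awk program also prints.

```python
def inject_define(source: str, macro: str = "LOCALAI_LEGACY_LLAMA_CPP_SPEC") -> str:
    """Insert `#define <macro> 1` before the first #include, at most once."""
    lines = source.splitlines(keepends=True)
    # Already patched: return the input unchanged (same check as the grep guard).
    if any(line.startswith(f"#define {macro}") for line in lines):
        return source
    for index, line in enumerate(lines):
        if line.startswith("#include"):
            lines[index:index] = [f"#define {macro} 1\n", "\n"]
            return "".join(lines)
    # Mirrors the awk END block: no anchor means the patch cannot be applied.
    raise ValueError("no #include anchor found")


if __name__ == "__main__":
    patched = inject_define("// header\n#include <string>\n#include <vector>\n")
    print(patched)
    print(inject_define(patched) == patched)
```

Raising when no anchor exists keeps a silently unpatched file from reaching the compiler, which is the same reason the awk program exits with a non-zero status.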
@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # stablediffusion.cpp (ggml)
 STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
-STABLEDIFFUSION_GGML_VERSION?=90e87bc846f17059771efb8aaa31e9ef0cab6f78
+STABLEDIFFUSION_GGML_VERSION?=bd17f53b7386fb5f60e8587b75e73c4b2fed3426

 CMAKE_ARGS+=-DGGML_MAX_NAME=128

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)

 # whisper.cpp version
 WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
-WHISPER_CPP_VERSION?=c33c5618b72bb345df029b730b36bc0e369845a3
+WHISPER_CPP_VERSION?=968eebe77225d25e57a3f981da7c696310f0e881
 SO_TARGET?=libgowhisper.so

 CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF
@@ -72,6 +72,29 @@
    nvidia-cuda-12: "cuda12-turboquant"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant"
- &ds4
  name: "ds4"
  alias: "ds4"
  license: mit
  description: |
    antirez/ds4 - DeepSeek V4 Flash inference engine. Single-model,
    optimized for Metal (Darwin) and CUDA (Linux). Requires the GGUFs
    published at huggingface.co/antirez/deepseek-v4-gguf.
  urls:
    - https://github.com/antirez/ds4
  tags:
    - text-to-text
    - LLM
    - CPU
    - CUDA
    - Metal
  capabilities:
    default: "cpu-ds4"
    nvidia: "cuda13-ds4"
    nvidia-cuda-13: "cuda13-ds4"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4"
    metal: "metal-ds4"
    metal-darwin-arm64: "metal-ds4"
- &whispercpp
  name: "whisper"
  alias: "whisper"
@@ -824,6 +847,35 @@
    nvidia-l4t-cuda-12: "nvidia-l4t-vibevoice"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-vibevoice"
  icon: https://avatars.githubusercontent.com/u/6154722?s=200&v=4
- &liquid-audio
  urls:
    - https://github.com/Liquid4All/liquid-audio
    - https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B
  description: |
    LiquidAI LFM2 / LFM2.5 Audio Python backend. End-to-end speech-to-speech, ASR,
    TTS (4 baked voices), and text chat from a single 1.5B model. Wraps the
    upstream `liquid-audio` package; supports fine-tuning via LocalAI's
    /v1/fine-tuning/jobs endpoint.
  tags:
    - speech-to-speech
    - any-to-any
    - text-to-speech
    - speech-to-text
    - TTS
    - ASR
    - realtime
  license: LFM-Open-License-v1.0
  name: "liquid-audio"
  alias: "liquid-audio"
  capabilities:
    nvidia: "cuda12-liquid-audio"
    intel: "intel-liquid-audio"
    amd: "rocm-liquid-audio"
    default: "cpu-liquid-audio"
    nvidia-cuda-13: "cuda13-liquid-audio"
    nvidia-cuda-12: "cuda12-liquid-audio"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio"
  icon: https://cdn-avatars.huggingface.co/v1/production/uploads/61b8e2ba285851687028d395/7_6D7rWrLxp2hb6OHSV1p.png
- &qwen-tts
  urls:
    - https://github.com/QwenLM/Qwen3-TTS
@@ -1127,6 +1179,15 @@
    nvidia-cuda-12: "cuda12-turboquant-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
- !!merge <<: *ds4
  name: "ds4-development"
  capabilities:
    default: "cpu-ds4-development"
    nvidia: "cuda13-ds4-development"
    nvidia-cuda-13: "cuda13-ds4-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-ds4-development"
    metal: "metal-ds4-development"
    metal-darwin-arm64: "metal-ds4-development"
- !!merge <<: *stablediffusionggml
  name: "stablediffusion-ggml-development"
  capabilities:
@@ -1673,6 +1734,47 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
## ds4
- !!merge <<: *ds4
  name: "cpu-ds4"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-ds4"
  mirrors:
    - localai/localai-backends:latest-cpu-ds4
- !!merge <<: *ds4
  name: "cpu-ds4-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ds4"
  mirrors:
    - localai/localai-backends:master-cpu-ds4
- !!merge <<: *ds4
  name: "cuda13-ds4"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-ds4"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-ds4
- !!merge <<: *ds4
  name: "cuda13-ds4-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-ds4"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-ds4
- !!merge <<: *ds4
  name: "cuda13-nvidia-l4t-arm64-ds4"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-ds4"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-ds4
- !!merge <<: *ds4
  name: "cuda13-nvidia-l4t-arm64-ds4-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-ds4"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-ds4
- !!merge <<: *ds4
  name: "metal-ds4"
  uri: "quay.io/go-skynet/local-ai-backends:latest-metal-darwin-arm64-ds4"
  mirrors:
    - localai/localai-backends:latest-metal-darwin-arm64-ds4
- !!merge <<: *ds4
  name: "metal-ds4-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-ds4"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-ds4
## whisper
- !!merge <<: *whispercpp
  name: "whisper-development"
@@ -3364,6 +3466,77 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-vibevoice"
  mirrors:
    - localai/localai-backends:master-metal-darwin-arm64-vibevoice
## liquid-audio
- !!merge <<: *liquid-audio
  name: "liquid-audio-development"
  capabilities:
    nvidia: "cuda12-liquid-audio-development"
    intel: "intel-liquid-audio-development"
    amd: "rocm-liquid-audio-development"
    default: "cpu-liquid-audio-development"
    nvidia-cuda-13: "cuda13-liquid-audio-development"
    nvidia-cuda-12: "cuda12-liquid-audio-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
- !!merge <<: *liquid-audio
  name: "cpu-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-cpu-liquid-audio
- !!merge <<: *liquid-audio
  name: "cpu-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-cpu-liquid-audio"
  mirrors:
    - localai/localai-backends:master-cpu-liquid-audio
- !!merge <<: *liquid-audio
  name: "cuda12-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-12-liquid-audio
- !!merge <<: *liquid-audio
  name: "cuda12-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-12-liquid-audio
- !!merge <<: *liquid-audio
  name: "cuda13-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-nvidia-cuda-13-liquid-audio
- !!merge <<: *liquid-audio
  name: "cuda13-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-nvidia-cuda-13-liquid-audio
- !!merge <<: *liquid-audio
  name: "intel-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-intel-liquid-audio
- !!merge <<: *liquid-audio
  name: "intel-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-intel-liquid-audio
- !!merge <<: *liquid-audio
  name: "rocm-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-gpu-rocm-hipblas-liquid-audio
- !!merge <<: *liquid-audio
  name: "rocm-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-liquid-audio"
  mirrors:
    - localai/localai-backends:master-gpu-rocm-hipblas-liquid-audio
- !!merge <<: *liquid-audio
  name: "cuda13-nvidia-l4t-arm64-liquid-audio"
  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio"
  mirrors:
    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-liquid-audio
- !!merge <<: *liquid-audio
  name: "cuda13-nvidia-l4t-arm64-liquid-audio-development"
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-liquid-audio
## qwen-tts
- !!merge <<: *qwen-tts
  name: "qwen-tts-development"
||||
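The image entries added to `backend/index.yaml` above follow a regular pattern: release entries use a `latest-` tag, `-development` entries use a `master-` tag, and each `uri` on `quay.io/go-skynet/local-ai-backends` has a mirror on `localai/localai-backends` with the same tag. The following Python sketch restates that pattern as code so the naming can be checked at a glance. `image_refs` and its parameters are hypothetical and only reproduce what is visible in these hunks; the gallery file itself is maintained by hand.

```python
def image_refs(variant: str, development: bool = False) -> dict:
    """Build the uri/mirrors pair for one gallery image entry.

    `variant` is the tag without its channel prefix, e.g. "cpu-ds4".
    """
    channel = "master" if development else "latest"
    tag = f"{channel}-{variant}"
    return {
        "uri": f"quay.io/go-skynet/local-ai-backends:{tag}",
        "mirrors": [f"localai/localai-backends:{tag}"],
    }


if __name__ == "__main__":
    print(image_refs("cpu-ds4"))
    print(image_refs("gpu-nvidia-cuda-13-liquid-audio", development=True))
```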
23
backend/python/liquid-audio/Makefile
Normal file
@@ -0,0 +1,23 @@
.PHONY: liquid-audio
liquid-audio:
	bash install.sh

.PHONY: run
run: liquid-audio
	@echo "Running liquid-audio..."
	bash run.sh
	@echo "liquid-audio run."

.PHONY: test
test: liquid-audio
	@echo "Testing liquid-audio..."
	bash test.sh
	@echo "liquid-audio tested."

.PHONY: protogen-clean
protogen-clean:
	$(RM) backend_pb2_grpc.py backend_pb2.py

.PHONY: clean
clean: protogen-clean
	rm -rf venv __pycache__
871
backend/python/liquid-audio/backend.py
Normal file
@@ -0,0 +1,871 @@
#!/usr/bin/env python3
"""
Liquid Audio backend for LocalAI.

Wraps LiquidAI's `liquid-audio` Python package (https://github.com/Liquid4All/liquid-audio).
The same model serves four roles, selected by the `mode` option at load time:
chat, asr, tts, s2s. Fine-tuning is exposed via StartFineTune.
"""
from concurrent import futures
import argparse
import json
import os
import queue
import signal
import sys
import threading
import time
import traceback
import uuid

import grpc

sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors # noqa: E402
from python_utils import parse_options # noqa: E402

import backend_pb2 # noqa: E402
import backend_pb2_grpc # noqa: E402

_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))

# Voice id → system-prompt suffix. The model only ships these four voices.
VOICE_PROMPTS = {
    "us_male": "Perform TTS. Use the US male voice.",
    "us_female": "Perform TTS. Use the US female voice.",
    "uk_male": "Perform TTS. Use the UK male voice.",
    "uk_female": "Perform TTS. Use the UK female voice.",
}
DEFAULT_VOICE = "us_female"

# Special-token IDs that LFM2-Audio emits to delimit modality boundaries.
# Sourced from liquid_audio/model/lfm2_audio.py (see generate_sequential/_sample_*).
TEXT_END_TOKEN = 130 # <|text_end|>
AUDIO_START_TOKEN = 128 # <|audio_start|>
IM_END_TOKEN = 7 # <|im_end|>
AUDIO_EOS_CODE = 2048 # signals end-of-audio in any codebook position

_PATCHED_LOCAL_PATHS = False


def _patch_liquid_audio_local_paths():
|
||||
"""Make liquid_audio.utils.get_model_dir() tolerate local directories.
|
||||
|
||||
Upstream always passes its argument to huggingface_hub.snapshot_download,
|
||||
which only accepts `owner/repo` ids. LocalAI's gallery hands us absolute
|
||||
paths under <ModelPath>/<owner>/<repo>, so we intercept snapshot_download
|
||||
in the liquid_audio.utils namespace and return the directory as-is when
|
||||
it already exists on disk. Idempotent.
|
||||
"""
|
||||
global _PATCHED_LOCAL_PATHS
|
||||
if _PATCHED_LOCAL_PATHS:
|
||||
return
|
||||
import liquid_audio.utils as _la_utils
|
||||
_orig_snapshot_download = _la_utils.snapshot_download
|
||||
|
||||
def _local_first_snapshot_download(repo_id, revision=None, **kwargs):
|
||||
if isinstance(repo_id, (str, os.PathLike)) and os.path.isdir(str(repo_id)):
|
||||
return str(repo_id)
|
||||
return _orig_snapshot_download(repo_id, revision=revision, **kwargs)
|
||||
|
||||
_la_utils.snapshot_download = _local_first_snapshot_download
|
||||
_PATCHED_LOCAL_PATHS = True
|
||||
|
||||
|
||||
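The local-first patch above can be tried in isolation. The following is a minimal sketch, assuming a stand-in namespace (`fake_utils`) in place of `liquid_audio.utils`; the names `fake_utils`, `patch_local_first`, and the `_local_first_patched` flag are hypothetical and exist only for illustration.

```python
import os
import tempfile
import types

# Hypothetical stand-in for a third-party module that always downloads.
fake_utils = types.SimpleNamespace(
    snapshot_download=lambda repo_id, revision=None, **kwargs: f"downloaded:{repo_id}"
)


def patch_local_first(module):
    """Wrap module.snapshot_download so existing directories are returned as-is."""
    if getattr(module, "_local_first_patched", False):
        return  # idempotent: a second call leaves the first wrapper in place
    original = module.snapshot_download

    def local_first(repo_id, revision=None, **kwargs):
        if isinstance(repo_id, (str, os.PathLike)) and os.path.isdir(str(repo_id)):
            return str(repo_id)
        return original(repo_id, revision=revision, **kwargs)

    module.snapshot_download = local_first
    module._local_first_patched = True


patch_local_first(fake_utils)
patch_local_first(fake_utils)  # no double wrapping

with tempfile.TemporaryDirectory() as staged:
    local_result = fake_utils.snapshot_download(staged)  # directory exists: returned unchanged
remote_result = fake_utils.snapshot_download("owner/repo")  # falls through to the original
```

Storing the flag on the patched object rather than in a module-level global is one possible design; the backend itself uses a global, which is equally valid for a single patched module.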
def _select_device():
    import torch
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


class ActiveJob:
    """Tracks an in-flight fine-tune so FineTuneProgress can stream from its queue."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.progress_queue = queue.Queue()
        self.thread = None
        self.stopped = False
        self.completed = False
        self.error = None


class BackendServicer(backend_pb2_grpc.BackendServicer):
    def __init__(self):
        self.processor = None
        self.model = None
        self.device = "cpu"
        self.dtype = None
        self.options = {}
        self.model_id = None
        self.active_job = None

    @property
    def mode(self):
        return str(self.options.get("mode", "chat")).lower()

    @property
    def voice(self):
        v = str(self.options.get("voice", DEFAULT_VOICE)).lower()
        return v if v in VOICE_PROMPTS else DEFAULT_VOICE


    def Free(self, request, context):
        # Called by LocalAI when unloading the model. Drop GPU tensors so the
        # next load starts from a clean state instead of bumping into OOM.
        try:
            for attr in ("model", "processor", "tokenizer"):
                if hasattr(self, attr):
                    try:
                        # Reset to None rather than delattr so the later
                        # `self.model is None` checks keep working.
                        setattr(self, attr, None)
                    except Exception:
                        pass
            import gc
            gc.collect()
            try:
                import torch
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
            except Exception:
                pass
            return backend_pb2.Result(success=True, message="OK")
        except Exception as exc:
            print(f"Free failed: {exc}", file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(exc))


    def Health(self, request, context):
        return backend_pb2.Reply(message=bytes("OK", 'utf-8'))


    def LoadModel(self, request, context):
        try:
            import torch

            self.options = parse_options(request.Options)
            if self.options.get("voice") and self.options["voice"] not in VOICE_PROMPTS:
                print(f"Warning: unknown voice '{self.options['voice']}'; defaulting to '{DEFAULT_VOICE}'",
                      file=sys.stderr)

            requested_device = self.options.get("device")
            self.device = requested_device or _select_device()
            if self.device == "cuda" and not torch.cuda.is_available():
                return backend_pb2.Result(success=False, message="CUDA requested but not available")
            if self.device == "mps" and not (hasattr(torch.backends, "mps") and
                                             torch.backends.mps.is_available()):
                print("MPS not available; falling back to CPU", file=sys.stderr)
                self.device = "cpu"

            dtype_name = str(self.options.get("dtype", "bfloat16")).lower()
            self.dtype = {
                "bfloat16": torch.bfloat16,
                "bf16": torch.bfloat16,
                "float16": torch.float16,
                "fp16": torch.float16,
                "half": torch.float16,
                "float32": torch.float32,
                "fp32": torch.float32,
            }.get(dtype_name, torch.bfloat16)

            # request.Model holds the raw `parameters.model` value (an HF
            # repo id like "LiquidAI/LFM2.5-Audio-1.5B"); request.ModelFile
            # is LocalAI's ModelPath-prefixed local copy that exists only
            # when the gallery supplied a `files:` list. Mirror the
            # transformers/vibevoice convention: prefer the repo id and
            # only switch to the local path if it's been staged on disk.
            model_id = request.Model
            if not model_id:
                model_id = request.ModelFile
            if not model_id:
                return backend_pb2.Result(success=False, message="No model identifier provided")
            if request.ModelFile and os.path.isdir(request.ModelFile):
                model_id = request.ModelFile
            self.model_id = model_id

            # Pure fine-tune jobs don't need an in-memory inference model — the
            # Trainer instantiates its own copy at StartFineTune time.
            if self.mode == "finetune":
                print(f"Loaded liquid-audio backend in fine-tune mode (model id: {model_id})",
                      file=sys.stderr)
                return backend_pb2.Result(success=True, message="OK")

            from liquid_audio import LFM2AudioModel, LFM2AudioProcessor

            # liquid_audio's from_pretrained unconditionally routes through
            # huggingface_hub.snapshot_download, which rejects local paths
            # (HFValidationError on `/models/LiquidAI/LFM2.5-Audio-1.5B`).
            # When LocalAI's gallery has already staged the weights on disk,
            # short-circuit the download to return the local directory.
            _patch_liquid_audio_local_paths()

            print(f"Loading liquid-audio model '{model_id}' on {self.device} ({self.dtype})",
                  file=sys.stderr)
            self.processor = LFM2AudioProcessor.from_pretrained(model_id, device=self.device).eval()
            self.model = LFM2AudioModel.from_pretrained(
                model_id, device=self.device, dtype=self.dtype
            ).eval()

            print(f"Liquid-audio mode={self.mode}, voice={self.voice}", file=sys.stderr)
            return backend_pb2.Result(success=True, message="OK")

        except Exception as exc:
            print(f"LoadModel failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(exc))
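The model-identifier precedence in `LoadModel` (repo id first, local path only when it is staged on disk) is easy to get backwards, so here is a minimal sketch of just that rule. The helper name `resolve_model_id` is hypothetical; the backend performs these checks inline.

```python
import os
import tempfile


def resolve_model_id(model, model_file):
    """Prefer the repo id; switch to model_file only if it is a directory on disk."""
    model_id = model or model_file
    if not model_id:
        return None  # the backend reports "No model identifier provided" here
    if model_file and os.path.isdir(model_file):
        model_id = model_file
    return model_id


with tempfile.TemporaryDirectory() as staged_dir:
    # Weights staged locally: the directory wins over the repo id.
    staged = resolve_model_id("LiquidAI/LFM2.5-Audio-1.5B", staged_dir)

# Path not present on disk: keep the repo id so the hub download can run.
unstaged = resolve_model_id("owner/repo", "/nonexistent/path/for/illustration")
# No repo id at all: fall back to whatever ModelFile says.
fallback = resolve_model_id("", "/nonexistent/path/for/illustration")
missing = resolve_model_id("", "")
```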
    def Predict(self, request, context):
        try:
            text = "".join(self._generate_text_stream(request))
            return backend_pb2.Reply(message=text.encode("utf-8"))
        except Exception as exc:
            print(f"Predict failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(str(exc))
            return backend_pb2.Reply()

    def PredictStream(self, request, context):
        try:
            for delta in self._generate_text_stream(request):
                yield backend_pb2.Reply(message=delta.encode("utf-8"))
        except Exception as exc:
            print(f"PredictStream failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(str(exc))


    def VAD(self, request, context):
        # Stub voice-activity detector: RMS-energy threshold over 30ms frames at
        # 16 kHz. Good enough for the realtime endpoint's handleVAD loop, which
        # only inspects segment presence + last segment end. The proper signal
        # would come from the model's audio encoder, but that ride-along is a
        # PR-D scope item — until then this keeps the legacy pipeline path
        # working without forcing the operator to install a separate VAD model.
        import numpy as np
        try:
            audio = np.asarray(request.audio, dtype=np.float32)
            if audio.size == 0:
                return backend_pb2.VADResponse(segments=[])

            sample_rate = 16000
            frame_size = sample_rate * 30 // 1000  # 30ms → 480 samples
            threshold = float(self.options.get("vad_rms_threshold", 0.01))
            min_speech_frames = int(self.options.get("vad_min_speech_frames", 2))  # ≥60ms
            # handleVAD ticks every 300 ms and only inspects segment presence
            # + last segment end relative to silence_threshold (~500 ms). Cap
            # the analysed window to the tail of the buffer so we don't redo
            # the entire growing utterance every tick.
            window_s = float(self.options.get("vad_window_s", 5.0))
            window_samples = int(window_s * sample_rate)
            time_offset_s = 0.0
            if audio.size > window_samples:
                time_offset_s = (audio.size - window_samples) / sample_rate
                audio = audio[-window_samples:]

            n_frames = audio.size // frame_size
            if n_frames == 0:
                return backend_pb2.VADResponse(segments=[])
            frames = audio[: n_frames * frame_size].reshape(n_frames, frame_size)
            rms = np.sqrt(np.mean(frames ** 2, axis=1))
            speech = rms > threshold

            def _emit(start_idx, end_idx, out):
                if end_idx - start_idx >= min_speech_frames:
                    out.append(backend_pb2.VADSegment(
                        start=time_offset_s + start_idx * frame_size / sample_rate,
                        end=time_offset_s + end_idx * frame_size / sample_rate,
                    ))

            segments = []
            start_idx = None
            for i, is_speech in enumerate(speech):
                if is_speech and start_idx is None:
                    start_idx = i
                elif not is_speech and start_idx is not None:
                    _emit(start_idx, i, segments)
                    start_idx = None
            if start_idx is not None:
                _emit(start_idx, n_frames, segments)
            return backend_pb2.VADResponse(segments=segments)
        except Exception as exc:
            print(f"VAD failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(str(exc))
            return backend_pb2.VADResponse(segments=[])
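The energy-threshold detector in `VAD` reduces to a few lines once the gRPC and NumPy plumbing is removed. Below is a dependency-free sketch of the same idea (30 ms frames, RMS above a threshold, runs shorter than `min_speech_frames` dropped). The function name `rms_vad` and the synthetic test signal are illustrative assumptions, and the tail-window cap from the backend is left out for brevity.

```python
import math


def rms_vad(samples, sample_rate=16000, frame_ms=30, threshold=0.01, min_speech_frames=2):
    """Return (start_s, end_s) tuples for runs of frames whose RMS exceeds threshold."""
    frame_size = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_size
    segments = []
    start = None
    # Iterate one step past the last frame so a run reaching the end is closed.
    for i in range(n_frames + 1):
        if i < n_frames:
            frame = samples[i * frame_size:(i + 1) * frame_size]
            rms = math.sqrt(sum(x * x for x in frame) / frame_size)
            is_speech = rms > threshold
        else:
            is_speech = False
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_speech_frames:
                segments.append((start * frame_size / sample_rate,
                                 i * frame_size / sample_rate))
            start = None
    return segments


# 0.3 s of silence, 0.3 s of a 440 Hz tone, 0.3 s of silence at 16 kHz.
sr = 16000
silence = [0.0] * (sr * 3 // 10)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr * 3 // 10)]
segments = rms_vad(silence + tone + silence, sample_rate=sr)
```

With this input the tone occupies frames 10 through 19, so a single segment spanning roughly 0.3 s to 0.6 s is expected.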
    def TTS(self, request, context):
        try:
            if self.model is None or self.processor is None:
                return backend_pb2.Result(success=False, message="Model not loaded")

            import torch
            import torchaudio
            from liquid_audio import ChatState

            voice = request.voice.lower() if request.voice else self.voice
            voice = voice.removeprefix("lfm2:").removeprefix("lfm:")
            if voice not in VOICE_PROMPTS:
                voice = self.voice
            system_prompt = VOICE_PROMPTS[voice]

            chat = ChatState(self.processor)
            chat.new_turn("system")
            chat.add_text(system_prompt)
            chat.end_turn()
            chat.new_turn("user")
            chat.add_text(request.text or "")
            chat.end_turn()
            chat.new_turn("assistant")

            audio_top_k = int(self.options.get("audio_top_k", 64))
            audio_temp = float(self.options.get("audio_temperature", 0.8))
            max_new = int(self.options.get("max_new_tokens", 2048))

            audio_out = []
            for tok in self.model.generate_sequential(
                **chat,
                max_new_tokens=max_new,
                audio_temperature=audio_temp,
                audio_top_k=audio_top_k,
            ):
                if tok.numel() > 1:
                    audio_out.append(tok)

            if len(audio_out) <= 1:
                return backend_pb2.Result(success=False, message="No audio frames generated")

            # Drop the trailing end-of-audio frame, matching the package's examples.
            audio_codes = torch.stack(audio_out[:-1], 1).unsqueeze(0)
            waveform = self.processor.decode(audio_codes)

            out_path = request.dst
            if not out_path:
                return backend_pb2.Result(success=False, message="dst path is required")
            os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
            # soundfile in preference to torchaudio.save — the latter routes
            # through torchcodec, whose native libs need NVIDIA NPP that we
            # don't bundle in the cuda13 image.
            import soundfile as _sf
            _sf.write(out_path, waveform.cpu().numpy().squeeze(0).T, 24_000)

            return backend_pb2.Result(success=True)
        except Exception as exc:
            print(f"TTS failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            return backend_pb2.Result(success=False, message=str(exc))


    def AudioToAudioStream(self, request_iterator, context):
        """Bidirectional any-to-any speech-to-speech stream.

        See `backend.proto` AudioToAudioStream for the wire protocol. Audio
        is decoded once per turn here; chunked detokenization for sub-second
        TTFB is left to a future iteration once the LFM2AudioDetokenizer
        gains a streaming entry point.
        """
        try:
            yield from self._audio_to_audio_stream(request_iterator, context)
        except Exception as exc:
            print(f"AudioToAudioStream failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            yield backend_pb2.AudioToAudioResponse(
                event="error",
                meta=json.dumps({"message": str(exc)}).encode("utf-8"),
            )

    def _audio_to_audio_stream(self, request_iterator, context):
        if self.model is None or self.processor is None:
            raise RuntimeError("Model not loaded")

        import torch
        import torchaudio
        from liquid_audio import ChatState

        cfg = None
        chat = None
        input_sample_rate = 16000
        output_sample_rate = 24000
        sequence = 0

        def _new_event(event, **kwargs):
            nonlocal sequence
            sequence += 1
            kwargs.setdefault("sequence", sequence)
            return backend_pb2.AudioToAudioResponse(event=event, **kwargs)

        def _ensure_chat():
            """Build a fresh ChatState seeded with the system prompt."""
            nonlocal chat
            chat = ChatState(self.processor)
            system_prompt = (cfg.system_prompt if cfg and cfg.system_prompt
                             else "Respond with interleaved text and audio.")
            chat.new_turn("system")
            chat.add_text(system_prompt)
            chat.end_turn()

        # Buffers for the in-flight user turn
        pcm_buffer = bytearray()

        def _consume_user_turn():
            nonlocal pcm_buffer
            if not pcm_buffer:
                return
            # Avoid the bytes(pcm_buffer) copy and let the float widen happen
            # in-place: numpy view → torch view → in-place divide.
            import numpy as np
            arr = np.frombuffer(memoryview(pcm_buffer), dtype=np.int16)
            wav = torch.from_numpy(arr).to(torch.float32).div_(32768.0).unsqueeze(0)
            chat.new_turn("user")
            chat.add_audio(wav, input_sample_rate)
            chat.end_turn()
            pcm_buffer = bytearray()

        def _run_generation():
            """Run generate_interleaved; yield response events as we go."""
            chat.new_turn("assistant")
            audio_top_k = int(self.options.get("audio_top_k", 4))
            audio_temp = float(self.options.get("audio_temperature", 1.0))
            text_top_k = int(self.options.get("text_top_k", 0)) or None
            text_temp = float(self.options.get("text_temperature", 0)) or None
            max_new = int(self.options.get("max_new_tokens", 512))

            audio_tokens = []
            for tok in self.model.generate_interleaved(
                **chat,
                max_new_tokens=max_new,
                text_temperature=text_temp,
                text_top_k=text_top_k,
                audio_temperature=audio_temp,
                audio_top_k=audio_top_k,
            ):
                if tok.numel() == 1:
                    if tok.item() == IM_END_TOKEN:
                        break
                    text = self.processor.text.decode(tok)
                    if not text:
                        continue
                    yield _new_event(
                        "response.audio_transcript.delta",
                        meta=json.dumps({"delta": text}).encode("utf-8"),
                    )
                else:
                    audio_tokens.append(tok)

            # Detokenize the accumulated audio at end-of-turn — the
            # LFM2AudioDetokenizer is non-streaming today.
            if len(audio_tokens) > 1:
                audio_codes = torch.stack(audio_tokens[:-1], 1).unsqueeze(0)
                waveform = self.processor.decode(audio_codes)
                # Convert to s16le PCM bytes at output_sample_rate
                if output_sample_rate != 24000:
                    waveform = torchaudio.functional.resample(
                        waveform.cpu(), 24000, output_sample_rate
                    )
                pcm = (waveform.cpu().squeeze(0).clamp(-1, 1) * 32767.0).to(
                    torch.int16
                ).numpy().tobytes()
                yield _new_event(
                    "response.audio.delta",
                    pcm=pcm,
                    sample_rate=output_sample_rate,
                )

            yield _new_event("response.done", meta=b"{}")

        for req in request_iterator:
            if not context.is_active():
                return
            payload = req.WhichOneof("payload")
            if payload == "config":
                cfg = req.config
                if cfg.input_sample_rate > 0:
                    input_sample_rate = cfg.input_sample_rate
                if cfg.output_sample_rate > 0:
                    output_sample_rate = cfg.output_sample_rate
                # The first config implicitly resets state.
                _ensure_chat()
                pcm_buffer = bytearray()
            elif payload == "frame":
                if chat is None:
                    _ensure_chat()
                if req.frame.pcm:
                    pcm_buffer.extend(req.frame.pcm)
                if req.frame.end_of_input:
                    _consume_user_turn()
                    yield from _run_generation()
            elif payload == "control":
                event = req.control.event
                if event == "input_audio_buffer.commit":
                    _consume_user_turn()
                    yield from _run_generation()
                elif event == "response.cancel":
                    # Synchronous generation here means cancel can only
                    # take effect between turns; we ack so the client unblocks.
                    yield _new_event("response.done", meta=b'{"cancelled":true}')
                elif event == "session.update":
                    # Free-form session re-config; treat as a soft reset.
                    _ensure_chat()
                    pcm_buffer = bytearray()
                # Unknown events are ignored — forward-compatible.
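The stream handler converts between s16le PCM bytes and float waveforms in both directions: incoming samples are divided by 32768, outgoing samples are clamped to [-1, 1] and scaled by 32767. The sketch below reproduces that arithmetic with the standard library only, so it can be checked without torch or NumPy. The helper names `s16le_to_floats` and `floats_to_s16le` are hypothetical, and the odd-length check is an added assumption rather than something the backend does.

```python
import array
import sys


def s16le_to_floats(pcm_bytes):
    """Decode little-endian signed 16-bit PCM into floats in [-1.0, 1.0)."""
    if len(pcm_bytes) % 2:
        raise ValueError("s16le PCM must contain an even number of bytes")
    ints = array.array("h")
    ints.frombytes(bytes(pcm_bytes))
    if sys.byteorder == "big":
        ints.byteswap()  # array uses native order; the wire format is little-endian
    return [v / 32768.0 for v in ints]


def floats_to_s16le(samples):
    """Clamp to [-1, 1], scale by 32767, truncate, and encode as little-endian int16."""
    ints = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767.0) for s in samples))
    if sys.byteorder == "big":
        ints.byteswap()
    return ints.tobytes()


decoded = s16le_to_floats(b"\x00\x80\x00\x40")      # raw values -32768 and 16384
encoded = floats_to_s16le([0.0, 1.0, -1.0, 2.0])    # 2.0 is clamped to 1.0
```

The asymmetric constants (divide by 32768, multiply by 32767) mean a round trip does not reproduce full-scale negative samples exactly; that is the usual trade-off for avoiding overflow at +1.0.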
    def AudioTranscription(self, request, context):
        try:
            if self.model is None or self.processor is None:
                return backend_pb2.TranscriptResult(segments=[], text="")

            import torchaudio
            from liquid_audio import ChatState

            audio_path = request.dst
            if not audio_path:
                return backend_pb2.TranscriptResult(segments=[], text="")

            chat = ChatState(self.processor)
            chat.new_turn("system")
            chat.add_text("Perform ASR.")
            chat.end_turn()
            chat.new_turn("user")
            # soundfile in preference to torchaudio.load — the latter routes
            # through torchcodec which needs NVIDIA NPP libs we don't bundle.
            import soundfile as _sf
            import torch
            audio_np, sr = _sf.read(audio_path, dtype="float32", always_2d=True)
            wav = torch.from_numpy(audio_np.T)  # (channels, samples)
            if wav.shape[0] > 1:
                # Down-mix to mono — the processor expects a single channel
                wav = wav.mean(dim=0, keepdim=True)
            chat.add_audio(wav, sr)
            chat.end_turn()
            chat.new_turn("assistant")

            max_new = int(self.options.get("max_new_tokens", 1024))

            pieces = []
            for tok in self.model.generate_sequential(**chat, max_new_tokens=max_new):
                if tok.numel() == 1:
                    if tok.item() == IM_END_TOKEN:
                        break
                    pieces.append(self.processor.text.decode(tok))

            text = "".join(pieces).strip()
            duration_ms = int((wav.shape[1] / sr) * 1000)
            segment = backend_pb2.TranscriptSegment(
                id=0, start=0, end=duration_ms, text=text, tokens=[],
            )
            return backend_pb2.TranscriptResult(segments=[segment], text=text)
        except Exception as exc:
            print(f"AudioTranscription failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            return backend_pb2.TranscriptResult(segments=[], text="")


    def StartFineTune(self, request, context):
        if self.active_job is not None and not self.active_job.completed:
            return backend_pb2.FineTuneJobResult(
                job_id="", success=False,
                message="A fine-tuning job is already running",
            )

        job_id = request.job_id or str(uuid.uuid4())
        job = ActiveJob(job_id)
        self.active_job = job

        thread = threading.Thread(target=self._run_training, args=(request, job), daemon=True)
        job.thread = thread
        thread.start()

        return backend_pb2.FineTuneJobResult(
            job_id=job_id, success=True, message="Training started",
        )

    def FineTuneProgress(self, request, context):
        if self.active_job is None or self.active_job.job_id != request.job_id:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details(f"Job {request.job_id} not found")
            return

        job = self.active_job
        while True:
            try:
                update = job.progress_queue.get(timeout=1.0)
            except queue.Empty:
                if job.completed or job.stopped:
                    break
                if not context.is_active():
                    break
                continue
            if update is None:
                break
            yield update
            if update.status in ("completed", "failed", "stopped"):
                break

    def StopFineTune(self, request, context):
        # We can't kill the Accelerate training loop mid-step cleanly from here;
        # LocalAI's job manager kills the backend process on stop. The flag below
        # at least lets the progress stream terminate quickly.
        if self.active_job is not None and self.active_job.job_id == request.job_id:
            self.active_job.stopped = True
            self.active_job.progress_queue.put(None)
        return backend_pb2.Result(success=True, message="OK")

    def _run_training(self, request, job):
        try:
            self._do_train(request, job)
            job.completed = True
            job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
                job_id=job.job_id, status="completed", message="Training completed",
                progress_percent=100.0,
            ))
        except Exception as exc:
            job.error = str(exc)
            job.completed = True
            print(f"Training failed: {exc}", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
            job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
                job_id=job.job_id, status="failed", message=str(exc),
            ))
        finally:
            # The stop path raises KeyboardInterrupt, which `except Exception`
            # does not catch; mark the job finished here so a later
            # StartFineTune is not rejected as "already running".
            job.completed = True
            job.progress_queue.put(None)
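`FineTuneProgress` and `_run_training` communicate through a `queue.Queue` that carries progress updates followed by a `None` sentinel. The sketch below shows that producer/consumer shape with plain dictionaries standing in for `FineTuneProgressUpdate` messages; `worker` and `stream_progress` are hypothetical names, and the worker's loop is a placeholder for real training.

```python
import queue
import threading


def worker(progress_queue, total_steps):
    """Hypothetical training loop: emit one update per step, then a sentinel."""
    try:
        for step in range(1, total_steps + 1):
            progress_queue.put({"step": step, "status": "training"})
        progress_queue.put({"step": total_steps, "status": "completed"})
    finally:
        progress_queue.put(None)  # always unblock the consumer, even on error


def stream_progress(progress_queue, poll_s=1.0):
    """Yield updates until a terminal status or the None sentinel arrives."""
    while True:
        try:
            update = progress_queue.get(timeout=poll_s)
        except queue.Empty:
            continue  # a real server would also check job state and client liveness here
        if update is None:
            break
        yield update
        if update["status"] in ("completed", "failed", "stopped"):
            break


q = queue.Queue()
t = threading.Thread(target=worker, args=(q, 3), daemon=True)
t.start()
updates = list(stream_progress(q))
t.join()
```

Polling with a timeout instead of blocking indefinitely is what lets the consumer notice a disconnected client or a stopped job between updates.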
    def _do_train(self, request, job):
        from liquid_audio import LFM2AudioModel  # noqa: F401 (sanity import)
        from liquid_audio.data.dataloader import LFM2DataLoader
        from liquid_audio.trainer import Trainer

        model_id = request.model or self.model_id or "LiquidAI/LFM2.5-Audio-1.5B"

        dataset_path = request.dataset_source
        if not dataset_path:
            raise ValueError("dataset_source is required (path to a preprocessed dataset)")

        extras = dict(request.extra_options) if request.extra_options else {}
        val_path = extras.get("val_dataset")

        # Map FineTuneRequest hyperparameters to liquid_audio.Trainer constructor args
        lr = request.learning_rate or 3e-5
        max_steps = request.max_steps or 1000
        warmup_steps = request.warmup_steps or min(100, max_steps // 10)
        batch_size = request.batch_size or 16
        save_interval = request.save_steps or max(1, max_steps // 4)

        output_dir = request.output_dir or os.path.join(
            os.environ.get("LIQUID_AUDIO_OUTPUT_DIR", "/tmp"),
            f"liquid-audio-{job.job_id}",
        )
        os.makedirs(output_dir, exist_ok=True)

        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="loading_dataset",
            message=f"Loading preprocessed dataset from {dataset_path}",
        ))
        train_data = LFM2DataLoader(dataset_path)
        val_data = LFM2DataLoader(val_path) if val_path else None

        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="loading_model",
            message=f"Loading base model {model_id}",
        ))

        # The Liquid Trainer logs via self.accelerator.print; we subclass it to
        # also push progress events onto the queue every logging_interval steps.
        progress_q = job.progress_queue

        class QueuedTrainer(Trainer):
            def log(self_, model_output):
                if self_.step > 0 and self_.step % self_.logging_interval == 0:
                    try:
                        loss = self_.accelerator.reduce(
                            model_output.loss.detach(), reduction="mean"
                        ).item()
                    except Exception:
                        loss = float("nan")
                    lr_now = self_.optimizer.param_groups[0]["lr"]
                    pct = (self_.step / self_.max_steps * 100.0) if self_.max_steps else 0.0
                    progress_q.put(backend_pb2.FineTuneProgressUpdate(
                        job_id=job.job_id,
                        current_step=int(self_.step),
                        total_steps=int(self_.max_steps),
                        current_epoch=float(self_.epoch),
                        loss=float(loss),
                        learning_rate=float(lr_now),
                        progress_percent=float(pct),
                        status="training",
                    ))
                # Honour stop requests: raising here terminates the loop cleanly
                if job.stopped:
                    raise KeyboardInterrupt("stop requested")
                return super().log(model_output)

            def validate(self_):
                progress_q.put(backend_pb2.FineTuneProgressUpdate(
                    job_id=job.job_id, current_step=int(self_.step),
                    total_steps=int(self_.max_steps), status="training",
                    message=f"Running validation at step {self_.step}",
                ))
                return super().validate()

        trainer = QueuedTrainer(
            model_id=model_id,
            train_data=train_data,
            val_data=val_data,
            lr=lr,
            max_steps=max_steps,
            warmup_steps=warmup_steps,
            batch_size=batch_size,
            save_interval=save_interval,
            output_dir=output_dir,
            weight_decay=request.weight_decay or 0.1,
        )

        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="training", message="Training started",
            total_steps=int(max_steps),
        ))
        trainer.train()

        job.progress_queue.put(backend_pb2.FineTuneProgressUpdate(
            job_id=job.job_id, status="saving",
            message=f"Saved final model to {output_dir}",
            checkpoint_path=os.path.join(output_dir, "final"),
        ))


    def _build_chat_state(self, messages, user_prompt, tools_prelude=None):
        """Build a ChatState from a list of (role, content) tuples plus an optional final user turn.

        tools_prelude, when non-empty, is prepended as an extra system turn carrying
        the LFM2 tool-list block — mirrors gallery/lfm.yaml's `function:` template
        so the model sees the same prompt shape whether served via llama-cpp or here.
        """
        from liquid_audio import ChatState
        chat = ChatState(self.processor)
        if tools_prelude:
            chat.new_turn("system")
            chat.add_text(tools_prelude)
            chat.end_turn()
        for role, content in messages:
            chat.new_turn(role)
            chat.add_text(content)
            chat.end_turn()
        if user_prompt:
            chat.new_turn("user")
            chat.add_text(user_prompt)
            chat.end_turn()
        chat.new_turn("assistant")
        return chat

    def _collect_messages(self, request):
        """Translate PredictOptions.Messages into (role, content) tuples."""
        out = []
        for m in request.Messages:
            role = (m.role or "user").lower()
            if role not in ("system", "user", "assistant"):
                role = "user"
            out.append((role, m.content or ""))
        return out

    def _render_tools_prelude(self, request):
        """Build the LFM2 `<|tool_list_start|>…<|tool_list_end|>` system prelude
        from request.Tools (OpenAI Chat-Completions tool JSON). Returns "" when
        no tools are attached. Output mirrors gallery/lfm.yaml's `function:`
        template so the model sees the same prompt whether routed via llama-cpp
        or this backend."""
        tools_raw = getattr(request, "Tools", "") or ""
        if not tools_raw:
            return ""
        try:
            tools = json.loads(tools_raw)
        except json.JSONDecodeError:
            print(f"liquid-audio: ignoring malformed Tools JSON: {tools_raw[:200]!r}",
                  file=sys.stderr)
            return ""
        if not isinstance(tools, list) or not tools:
            return ""
        # The LFM2 chat template uses single-quoted Python-dict-ish syntax in
        # examples, but the tokenizer treats this whole block as opaque text;
        # JSON works fine and is what other backends emit.
        return (
            "You are a function calling AI model. You are provided with functions to "
            "execute. You may call one or more functions to assist with the user query. "
            "Don't make assumptions about what values to plug into functions.\n"
            "List of tools: <|tool_list_start|>"
            + json.dumps(tools, separators=(",", ":"))
            + "<|tool_list_end|>"
        )
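To see what the tool-list block looks like for a concrete request, the sketch below applies the same steps as `_render_tools_prelude` to a JSON string: parse, reject anything that is not a non-empty list, and re-serialise compactly between the two marker tokens. The instruction preamble is omitted for brevity, and both the function name `render_tools_prelude` and the `get_time` tool are made up for the example.

```python
import json


def render_tools_prelude(tools_raw):
    """Return the tool-list text, or "" for missing, malformed, or empty input."""
    if not tools_raw:
        return ""
    try:
        tools = json.loads(tools_raw)
    except json.JSONDecodeError:
        return ""
    if not isinstance(tools, list) or not tools:
        return ""
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    return (
        "List of tools: <|tool_list_start|>"
        + json.dumps(tools, separators=(",", ":"))
        + "<|tool_list_end|>"
    )


example = render_tools_prelude('[{"type": "function", "function": {"name": "get_time"}}]')
malformed = render_tools_prelude("not json")
not_a_list = render_tools_prelude("{}")
```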
    def _generate_text_stream(self, request):
        """Yield text-only deltas from generate_sequential. Caller joins for unary Predict."""
        if self.model is None or self.processor is None:
            raise RuntimeError("Model not loaded")
        messages = self._collect_messages(request)
        user_prompt = request.Prompt or None
        tools_prelude = self._render_tools_prelude(request)
        # If the request already carries Messages, Prompt is the templated form
        # of the same content — don't append a duplicate user turn.
        chat = self._build_chat_state(
            messages,
            user_prompt if not messages else None,
            tools_prelude=tools_prelude,
        )

        max_new = request.Tokens if request.Tokens > 0 else int(self.options.get("max_new_tokens", 512))
        temperature = request.Temperature if request.Temperature > 0 else None
        top_k = request.TopK if request.TopK > 0 else None

        for tok in self.model.generate_sequential(
            **chat,
            max_new_tokens=max_new,
            text_temperature=temperature,
            text_top_k=top_k,
        ):
            if tok.numel() == 1:
                if tok.item() == IM_END_TOKEN:
                    break
                yield self.processor.text.decode(tok)


def serve(address):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
        options=[
            ('grpc.max_message_length', 50 * 1024 * 1024),
            ('grpc.max_send_message_length', 50 * 1024 * 1024),
            ('grpc.max_receive_message_length', 50 * 1024 * 1024),
        ],
        interceptors=get_auth_interceptors(),
    )
    backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
    server.add_insecure_port(address)
    server.start()
    print(f"Liquid-audio backend listening on {address}", file=sys.stderr, flush=True)

    def stop(_signum, _frame):
        server.stop(0)
        sys.exit(0)

    signal.signal(signal.SIGTERM, stop)
    signal.signal(signal.SIGINT, stop)

    try:
        while True:
            time.sleep(_ONE_DAY_IN_SECONDS)
    except KeyboardInterrupt:
        server.stop(0)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Liquid Audio gRPC backend")
    parser.add_argument("--addr", default="localhost:50051", help="gRPC server address")
    args = parser.parse_args()
    serve(args.addr)
18
backend/python/liquid-audio/install.sh
Executable file
@@ -0,0 +1,18 @@
#!/bin/bash
set -e

# liquid-audio requires Python ≥ 3.12 (per its pyproject.toml); the default
# portable Python in libbackend.sh is 3.10. Override before sourcing.
export PYTHON_VERSION="${PYTHON_VERSION:-3.12}"
export PYTHON_PATCH="${PYTHON_PATCH:-11}"

backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

# liquid-audio's torch wheels are large; allow upgrades to satisfy transitive pins
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
installRequirements
11
backend/python/liquid-audio/protogen.sh
Executable file
@@ -0,0 +1,11 @@
#!/bin/bash
set -e

backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

runProtogen
13
backend/python/liquid-audio/requirements-cpu.txt
Normal file
@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/cpu
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0
13
backend/python/liquid-audio/requirements-cublas12.txt
Normal file
@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/cu121
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0
13
backend/python/liquid-audio/requirements-cublas13.txt
Normal file
@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/cu130
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0
13
backend/python/liquid-audio/requirements-hipblas.txt
Normal file
@@ -0,0 +1,13 @@
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0
13
backend/python/liquid-audio/requirements-l4t13.txt
Normal file
@@ -0,0 +1,13 @@
--extra-index-url https://pypi.jetson-ai-lab.io/jp7/cu130
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0
12
backend/python/liquid-audio/requirements-mps.txt
Normal file
@@ -0,0 +1,12 @@
torch>=2.8.0
torchaudio>=2.8.0
torchcodec>=0.9.1
transformers>=4.55.4
accelerate>=1.10.1
datasets>=4.8.4
einops>=0.8.1
librosa>=0.11.0
soundfile>=0.12.1
sentencepiece>=0.2.1
huggingface-hub>=1.3.0
liquid-audio>=1.2.0
3
backend/python/liquid-audio/requirements.txt
Normal file
@@ -0,0 +1,3 @@
grpcio==1.78.1
protobuf
certifi
10
backend/python/liquid-audio/run.sh
Executable file
@@ -0,0 +1,10 @@
#!/bin/bash

backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

startBackend $@
89
backend/python/liquid-audio/test.py
Normal file
@@ -0,0 +1,89 @@
"""Smoke tests for the liquid-audio backend.

These run without contacting HuggingFace or loading model weights:
they only verify that the gRPC service starts and Health() responds.

To run an end-to-end inference test, set LIQUID_AUDIO_MODEL_ID
(e.g. "LiquidAI/LFM2.5-Audio-1.5B") in the environment — see test_inference().
"""
import os
import subprocess
import sys
import time
import unittest

import grpc

# Ensure generated protobuf stubs are importable
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

import backend_pb2
import backend_pb2_grpc


class TestBackend(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        addr = os.environ.get("LIQUID_AUDIO_TEST_ADDR", "localhost:50053")
        cls.addr = addr
        cls.server = subprocess.Popen(
            [sys.executable, os.path.join(os.path.dirname(__file__), "backend.py"), "--addr", addr],
        )
        time.sleep(2)  # Give the server a moment to bind

    @classmethod
    def tearDownClass(cls):
        cls.server.terminate()
        try:
            cls.server.wait(timeout=5)
        except subprocess.TimeoutExpired:
            cls.server.kill()

    def _stub(self):
        channel = grpc.insecure_channel(self.addr)
        return backend_pb2_grpc.BackendStub(channel)

    def test_health(self):
        stub = self._stub()
        reply = stub.Health(backend_pb2.HealthMessage(), timeout=5)
        self.assertEqual(reply.message, b"OK")

    def test_load_finetune_mode_without_weights(self):
        """Loading in fine-tune mode should succeed without pulling model weights."""
        stub = self._stub()
        result = stub.LoadModel(
            backend_pb2.ModelOptions(
                Model="LiquidAI/LFM2.5-Audio-1.5B",
                Options=["mode:finetune"],
            ),
            timeout=10,
        )
        self.assertTrue(result.success, msg=result.message)

    @unittest.skipUnless(os.environ.get("LIQUID_AUDIO_MODEL_ID"),
                         "Set LIQUID_AUDIO_MODEL_ID to run an end-to-end inference smoke test")
    def test_inference(self):
        """End-to-end: load a real LFM2-Audio model and run one short prediction."""
        stub = self._stub()
        model_id = os.environ["LIQUID_AUDIO_MODEL_ID"]
        result = stub.LoadModel(
            backend_pb2.ModelOptions(
                Model=model_id,
                Options=["mode:chat"],
            ),
            timeout=600,
        )
        self.assertTrue(result.success, msg=result.message)
        reply = stub.Predict(
            backend_pb2.PredictOptions(
                Prompt="Hello!",
                Tokens=8,
                Temperature=0.0,
            ),
            timeout=120,
        )
        self.assertGreater(len(reply.message), 0)


if __name__ == "__main__":
    unittest.main()
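`setUpClass` in `test.py` gives the server a fixed two seconds to bind before the tests run. On a slow machine that may be too short, and on a fast one it is wasted time. One possible alternative, sketched here with only the standard library, is to poll the TCP port until it accepts a connection or a deadline passes. `wait_for_port` is a hypothetical helper and is not part of the backend or of `grpc`.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Demonstrate against a throwaway listener on an OS-assigned port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        print(wait_for_port("127.0.0.1", listener.getsockname()[1], timeout=2.0))
```

In `setUpClass` such a call would take the place of `time.sleep(2)`, with the host and port split out of `addr`. A successful TCP connect only shows that the port is bound; it does not prove that the gRPC service is ready to answer `Health()`.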
11
backend/python/liquid-audio/test.sh
Executable file
@@ -0,0 +1,11 @@
#!/bin/bash
set -e

backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
    source $backend_dir/common/libbackend.sh
else
    source $backend_dir/../common/libbackend.sh
fi

runUnittests
@@ -2,9 +2,9 @@ torch==2.7.1
llvmlite==0.43.0
numba==0.60.0
accelerate
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.5.0
diffusers
soundfile
protobuf==6.33.5
@@ -2,9 +2,9 @@ torch==2.7.1
accelerate
llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.5.0
diffusers
soundfile
protobuf==6.33.5
@@ -2,9 +2,9 @@
torch==2.9.0
llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.5.0
diffusers
soundfile
protobuf==6.33.5
@@ -1,11 +1,11 @@
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
accelerate
transformers>=5.0.0
transformers>=5.8.0
llvmlite==0.43.0
numba==0.60.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.5.0
diffusers
soundfile
protobuf==6.33.5
@@ -3,9 +3,9 @@ torch
optimum[openvino]
llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.5.0
diffusers
soundfile
protobuf==6.33.5
@@ -2,9 +2,9 @@ torch==2.7.1
llvmlite==0.43.0
numba==0.60.0
accelerate
transformers>=5.0.0
transformers>=5.8.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.5.0
diffusers
soundfile
protobuf==6.33.5
@@ -33,7 +33,7 @@ dependencies = [
"certifi",
"setuptools",
"pillow",
"charset-normalizer>=3.4.0",
"charset-normalizer>=3.4.7",
"chardet",
# L4T-specific accelerator stack (sourced from jetson-ai-lab below).
"torch",
@@ -3,5 +3,5 @@
# on a cu130 host. Pull the cu130-flavoured wheel from vLLM's per-tag index
# instead — the cublas13 case in install.sh adds --index-strategy=unsafe-best-match
# so uv consults this index alongside PyPI.
--extra-index-url https://wheels.vllm.ai/0.20.2/cu130
vllm==0.20.2
--extra-index-url https://wheels.vllm.ai/0.21.0/cu130
vllm==0.21.0
@@ -3,5 +3,5 @@ protobuf
certifi
setuptools
pillow
charset-normalizer>=3.4.0
charset-normalizer>=3.4.7
chardet
@@ -169,7 +169,7 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB, configLoade
|
||||
cfg.Distributed.HealthCheckIntervalOrDefault(),
|
||||
cfg.Distributed.StaleNodeThresholdOrDefault(),
|
||||
routerAuthToken,
|
||||
cfg.Distributed.PerModelHealthCheck,
|
||||
!cfg.Distributed.DisablePerModelHealthCheck,
|
||||
)
|
||||
|
||||
// Initialize job store
|
||||
|
||||
@@ -212,12 +212,12 @@ func New(opts ...config.AppOption) (*Application, error) {
|
||||
}
|
||||
}
|
||||
|
||||
if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, nil, options.ModelsURL...); err != nil {
|
||||
if err := coreStartup.InstallModels(options.Context, application.GalleryService(), options.Galleries, options.BackendGalleries, options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.RequireBackendIntegrity, nil, options.ModelsURL...); err != nil {
|
||||
xlog.Error("error installing models", "error", err)
|
||||
}
|
||||
|
||||
for _, backend := range options.ExternalBackends {
|
||||
if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", ""); err != nil {
|
||||
if err := galleryop.InstallExternalBackend(options.Context, options.BackendGalleries, options.SystemState, application.ModelLoader(), nil, backend, "", "", options.RequireBackendIntegrity); err != nil {
|
||||
xlog.Error("error installing external backend", "error", err)
|
||||
}
|
||||
}
|
||||
@@ -267,13 +267,13 @@ func New(opts ...config.AppOption) (*Application, error) {
|
||||
}
|
||||
|
||||
if options.PreloadJSONModels != "" {
|
||||
if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels); err != nil {
|
||||
if err := galleryop.ApplyGalleryFromString(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadJSONModels, options.RequireBackendIntegrity); err != nil {
|
||||
return nil, err
|
||||
}
|
||||
}
|
||||
|
||||
if options.PreloadModelsFromPath != "" {
|
||||
if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath); err != nil {
|
||||
if err := galleryop.ApplyGalleryFromFile(options.SystemState, application.ModelLoader(), options.EnforcePredownloadScans, options.AutoloadBackendGalleries, options.Galleries, options.BackendGalleries, options.PreloadModelsFromPath, options.RequireBackendIntegrity); err != nil {
|
||||
return nil, err
|
||||
}
|
||||
}
|
||||
|
||||
@@ -217,7 +217,7 @@ func (uc *UpgradeChecker) runCheck(ctx context.Context) {
|
||||
err = bm.UpgradeBackend(ctx, name, nil)
|
||||
} else {
|
||||
err = gallery.UpgradeBackend(ctx, uc.systemState, uc.modelLoader,
|
||||
uc.galleries, name, nil)
|
||||
uc.galleries, name, nil, uc.appConfig.RequireBackendIntegrity)
|
||||
}
|
||||
if err != nil {
|
||||
xlog.Error("Failed to auto-upgrade backend",
|
||||
|
||||
@@ -86,7 +86,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
|
||||
if !slices.Contains(modelNames, modelName) {
|
||||
utils.ResetDownloadTimers()
|
||||
// if we failed to load the model, we try to download it
|
||||
err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries)
|
||||
err := gallery.InstallModelFromGallery(ctx, o.Galleries, o.BackendGalleries, o.SystemState, loader, modelName, gallery.GalleryModel{}, utils.DisplayDownloadFunction, o.EnforcePredownloadScans, o.AutoloadBackendGalleries, o.RequireBackendIntegrity)
|
||||
if err != nil {
|
||||
xlog.Error("failed to install model from gallery", "error", err, "model", modelFile)
|
||||
//return nil, err
|
||||
|
||||
@@ -17,9 +17,10 @@ import (
|
||||
)
|
||||
|
||||
type BackendsCMDFlags struct {
|
||||
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
|
||||
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
|
||||
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
|
||||
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
|
||||
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"storage"`
|
||||
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
|
||||
RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
|
||||
}
|
||||
|
||||
type BackendsList struct {
|
||||
@@ -126,7 +127,7 @@ func (bi *BackendsInstall) Run(ctx *cliContext.Context) error {
|
||||
}
|
||||
|
||||
modelLoader := model.NewModelLoader(systemState)
|
||||
err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias)
|
||||
err = galleryop.InstallExternalBackend(context.Background(), galleries, systemState, modelLoader, progressCallback, bi.BackendArgs, bi.Name, bi.Alias, bi.RequireBackendIntegrity)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
@@ -197,7 +198,7 @@ func (bu *BackendsUpgrade) Run(ctx *cliContext.Context) error {
|
||||
}
|
||||
}
|
||||
|
||||
if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback); err != nil {
|
||||
if err := gallery.UpgradeBackend(context.Background(), systemState, modelLoader, galleries, name, progressCallback, bu.RequireBackendIntegrity); err != nil {
|
||||
fmt.Printf("Failed to upgrade %s: %v\n", name, err)
|
||||
} else {
|
||||
fmt.Printf("Backend %s upgraded successfully\n", name)
|
||||
|
||||
@@ -32,6 +32,7 @@ type ModelsList struct {
|
||||
|
||||
type ModelsInstall struct {
|
||||
DisablePredownloadScan bool `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
|
||||
RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
|
||||
AutoloadBackendGalleries bool `env:"LOCALAI_AUTOLOAD_BACKEND_GALLERIES" help:"If true, automatically loads backend galleries" group:"backends" default:"true"`
|
||||
ModelArgs []string `arg:"" optional:"" name:"models" help:"Model configuration URLs to load"`
|
||||
|
||||
@@ -71,7 +72,6 @@ func (ml *ModelsList) Run(ctx *cliContext.Context) error {
|
||||
}
|
||||
|
||||
func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
|
||||
|
||||
systemState, err := system.GetSystemState(
|
||||
system.WithModelPath(mi.ModelsPath),
|
||||
system.WithBackendPath(mi.BackendsPath),
|
||||
@@ -135,7 +135,7 @@ func (mi *ModelsInstall) Run(ctx *cliContext.Context) error {
|
||||
}
|
||||
|
||||
modelLoader := model.NewModelLoader(systemState)
|
||||
err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, progressCallback, modelName)
|
||||
err = startup.InstallModels(context.Background(), galleryService, galleries, backendGalleries, systemState, modelLoader, !mi.DisablePredownloadScan, mi.AutoloadBackendGalleries, mi.RequireBackendIntegrity, progressCallback, modelName)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
@@ -67,6 +67,7 @@ type RunCMD struct {
|
||||
OllamaAPIRootEndpoint bool `env:"LOCALAI_OLLAMA_API_ROOT_ENDPOINT" default:"false" help:"Register Ollama-compatible health check on / (replaces web UI on root path). The /api/* Ollama endpoints are always available regardless of this flag" group:"api"`
|
||||
DisableRuntimeSettings bool `env:"LOCALAI_DISABLE_RUNTIME_SETTINGS,DISABLE_RUNTIME_SETTINGS" default:"false" help:"Disables the runtime settings. When set to true, the server will not load the runtime settings from the runtime_settings.json file" group:"api"`
|
||||
DisablePredownloadScan bool `env:"LOCALAI_DISABLE_PREDOWNLOAD_SCAN" help:"If true, disables the best-effort security scanner before downloading any files." group:"hardening" default:"false"`
|
||||
RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, backend installs without a configured signature verification policy (for OCI URIs) or SHA256 (for tarball/HTTP URIs) are rejected. Default is to warn and install. Set this in production once your gallery's verification: block is populated." group:"hardening" default:"false"`
|
||||
OpaqueErrors bool `env:"LOCALAI_OPAQUE_ERRORS" default:"false" help:"If true, all error responses are replaced with blank 500 errors. This is intended only for hardening against information leaks and is normally not recommended." group:"hardening"`
|
||||
UseSubtleKeyComparison bool `env:"LOCALAI_SUBTLE_KEY_COMPARISON" default:"false" help:"If true, API Key validation comparisons will be performed using constant-time comparisons rather than simple equality. This trades off performance on each request for resiliency against timing attacks." group:"hardening"`
|
||||
DisableApiKeyRequirementForHttpGet bool `env:"LOCALAI_DISABLE_API_KEY_REQUIREMENT_FOR_HTTP_GET" default:"false" help:"If true, a valid API key is not required to issue GET requests to portions of the web ui. This should only be enabled in secure testing environments" group:"hardening"`
|
||||
@@ -503,6 +504,10 @@ func (r *RunCMD) Run(ctx *cliContext.Context) error {
|
||||
opts = append(opts, config.WithAutoUpgradeBackends(r.AutoUpgradeBackends))
|
||||
}
|
||||
|
||||
if r.RequireBackendIntegrity {
|
||||
opts = append(opts, config.WithRequireBackendIntegrity(r.RequireBackendIntegrity))
|
||||
}
|
||||
|
||||
if r.PreferDevelopmentBackends {
|
||||
opts = append(opts, config.WithPreferDevelopmentBackends(r.PreferDevelopmentBackends))
|
||||
}
|
||||
|
||||
@@ -1,10 +1,11 @@
|
||||
package worker
|
||||
|
||||
type WorkerFlags struct {
|
||||
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
|
||||
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
|
||||
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
|
||||
ExtraLLamaCPPArgs string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
|
||||
BackendsPath string `env:"LOCALAI_BACKENDS_PATH,BACKENDS_PATH" type:"path" default:"${basepath}/backends" help:"Path containing backends used for inferencing" group:"backends"`
|
||||
BackendGalleries string `env:"LOCALAI_BACKEND_GALLERIES,BACKEND_GALLERIES" help:"JSON list of backend galleries" group:"backends" default:"${backends}"`
|
||||
BackendsSystemPath string `env:"LOCALAI_BACKENDS_SYSTEM_PATH,BACKEND_SYSTEM_PATH" type:"path" default:"/var/lib/local-ai/backends" help:"Path containing system backends used for inferencing" group:"backends"`
|
||||
RequireBackendIntegrity bool `env:"LOCALAI_REQUIRE_BACKEND_INTEGRITY,REQUIRE_BACKEND_INTEGRITY" help:"If true, reject backend installs without a configured signature verification policy (OCI URIs) or SHA256 (tarball/HTTP URIs)." group:"hardening" default:"false"`
|
||||
ExtraLLamaCPPArgs string `name:"llama-cpp-args" env:"LOCALAI_EXTRA_LLAMA_CPP_ARGS,EXTRA_LLAMA_CPP_ARGS" help:"Extra arguments to pass to llama-cpp-rpc-server"`
|
||||
}
|
||||
|
||||
type Worker struct {
|
||||
|
||||
@@ -18,7 +18,7 @@ import (
|
||||
// installing the backend from the gallery if it isn't present.
|
||||
// `name` is the gallery entry name (for vLLM the meta entry "vllm"
|
||||
// resolves to a platform-specific package via capability lookup).
|
||||
func findBackendPath(name, galleries string, systemState *system.SystemState) (string, error) {
|
||||
func findBackendPath(name, galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
|
||||
backends, err := gallery.ListSystemBackends(systemState)
|
||||
if err != nil {
|
||||
return "", err
|
||||
@@ -33,7 +33,7 @@ func findBackendPath(name, galleries string, systemState *system.SystemState) (s
|
||||
xlog.Error("failed loading galleries", "error", err)
|
||||
return "", err
|
||||
}
|
||||
if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true); err != nil {
|
||||
if err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, name, nil, true, requireIntegrity); err != nil {
|
||||
xlog.Error("backend not found, failed to install it", "name", name, "error", err)
|
||||
return "", err
|
||||
}
|
||||
|
||||
@@ -27,7 +27,7 @@ const (
|
||||
llamaCPPGalleryName = "llama-cpp"
|
||||
)
|
||||
|
||||
func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (string, error) {
|
||||
func findLLamaCPPBackend(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
|
||||
backends, err := gallery.ListSystemBackends(systemState)
|
||||
if err != nil {
|
||||
xlog.Warn("Failed listing system backends", "error", err)
|
||||
@@ -43,7 +43,7 @@ func findLLamaCPPBackend(galleries string, systemState *system.SystemState) (str
|
||||
xlog.Error("failed loading galleries", "error", err)
|
||||
return "", err
|
||||
}
|
||||
err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true)
|
||||
err := gallery.InstallBackendFromGallery(context.Background(), gals, systemState, ml, llamaCPPGalleryName, nil, true, requireIntegrity)
|
||||
if err != nil {
|
||||
xlog.Error("llama-cpp backend not found, failed to install it", "error", err)
|
||||
return "", err
|
||||
@@ -76,7 +76,7 @@ func (r *LLamaCPP) Run(ctx *cliContext.Context) error {
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
|
||||
grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
@@ -9,8 +9,8 @@ import (
|
||||
|
||||
const mlxDistributedGalleryName = "mlx-distributed"
|
||||
|
||||
func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState) (string, error) {
|
||||
return findBackendPath(mlxDistributedGalleryName, galleries, systemState)
|
||||
func findMLXDistributedBackendPath(galleries string, systemState *system.SystemState, requireIntegrity bool) (string, error) {
|
||||
return findBackendPath(mlxDistributedGalleryName, galleries, systemState, requireIntegrity)
|
||||
}
|
||||
|
||||
// buildMLXCommand builds the exec.Cmd to launch the mlx-distributed backend.
|
||||
|
||||
@@ -28,7 +28,7 @@ func (r *MLXDistributed) Run(ctx *cliContext.Context) error {
|
||||
return err
|
||||
}
|
||||
|
||||
backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
|
||||
backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
|
||||
if err != nil {
|
||||
return fmt.Errorf("cannot find mlx-distributed backend: %w", err)
|
||||
}
|
||||
|
||||
@@ -73,7 +73,7 @@ func (r *P2P) Run(ctx *cliContext.Context) error {
|
||||
for {
|
||||
xlog.Info("Starting llama-cpp-rpc-server", "address", address, "port", port)
|
||||
|
||||
grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState)
|
||||
grpcProcess, err := findLLamaCPPBackend(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
|
||||
if err != nil {
|
||||
xlog.Error("Failed to find llama-cpp-rpc-server", "error", err)
|
||||
return
|
||||
|
||||
@@ -48,7 +48,7 @@ func (r *P2PMLX) Run(ctx *cliContext.Context) error {
|
||||
c, cancel := context.WithCancel(context.Background())
|
||||
defer cancel()
|
||||
|
||||
backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState)
|
||||
backendPath, err := findMLXDistributedBackendPath(r.BackendGalleries, systemState, r.RequireBackendIntegrity)
|
||||
if err != nil {
|
||||
xlog.Warn("Could not find mlx-distributed backend from gallery, will try backend.py directly", "error", err)
|
||||
}
|
||||
|
||||
@@ -77,7 +77,7 @@ func (r *VLLMDistributed) Run(ctx *cliContext.Context) error {
|
||||
return fmt.Errorf("getting system state: %w", err)
|
||||
}
|
||||
|
||||
backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState)
|
||||
backendPath, err := findBackendPath("vllm", r.BackendGalleries, systemState, r.RequireBackendIntegrity)
|
||||
if err != nil {
|
||||
return fmt.Errorf("cannot find vllm backend: %w", err)
|
||||
}
|
||||
|
||||
@@ -60,6 +60,13 @@ type ApplicationConfig struct {
|
||||
AutoUpgradeBackends bool
|
||||
PreferDevelopmentBackends bool
|
||||
|
||||
// RequireBackendIntegrity promotes a missing SHA256 (tarball/HTTP URIs)
|
||||
// or missing verification policy (OCI URIs) from a warning to a hard
|
||||
// failure during backend install/upgrade. Off by default to keep
|
||||
// upgrades non-breaking; operators opt in explicitly via
|
||||
// --require-backend-integrity / LOCALAI_REQUIRE_BACKEND_INTEGRITY.
|
||||
RequireBackendIntegrity bool
|
||||
|
||||
SingleBackend bool // Deprecated: use MaxActiveBackends = 1 instead
|
||||
MaxActiveBackends int // Maximum number of active backends (0 = unlimited, 1 = single backend mode)
|
||||
WatchDogIdle bool
|
||||
@@ -436,6 +443,10 @@ func WithAutoUpgradeBackends(v bool) AppOption {
|
||||
return func(o *ApplicationConfig) { o.AutoUpgradeBackends = v }
|
||||
}
|
||||
|
||||
func WithRequireBackendIntegrity(v bool) AppOption {
|
||||
return func(o *ApplicationConfig) { o.RequireBackendIntegrity = v }
|
||||
}
|
||||
|
||||
func WithPreferDevelopmentBackends(v bool) AppOption {
|
||||
return func(o *ApplicationConfig) { o.PreferDevelopmentBackends = v }
|
||||
}
|
||||
|
||||
@@ -24,6 +24,7 @@ const (
|
||||
UsecaseVAD = "vad"
|
||||
UsecaseAudioTransform = "audio_transform"
|
||||
UsecaseDiarization = "diarization"
|
||||
UsecaseRealtimeAudio = "realtime_audio"
|
||||
)
|
||||
|
||||
// GRPCMethod identifies a Backend service RPC from backend.proto.
|
||||
@@ -45,6 +46,7 @@ const (
|
||||
MethodVAD GRPCMethod = "VAD"
|
||||
MethodAudioTransform GRPCMethod = "AudioTransform"
|
||||
MethodDiarize GRPCMethod = "Diarize"
|
||||
MethodAudioToAudioStream GRPCMethod = "AudioToAudioStream"
|
||||
)
|
||||
|
||||
// UsecaseInfo describes a single known_usecase value and how it maps
|
||||
@@ -147,6 +149,11 @@ var UsecaseInfoMap = map[string]UsecaseInfo{
|
||||
GRPCMethod: MethodDiarize,
|
||||
Description: "Speaker diarization (who-spoke-when, per-speaker segments) via the Diarize RPC.",
|
||||
},
|
||||
UsecaseRealtimeAudio: {
|
||||
Flag: FLAG_REALTIME_AUDIO,
|
||||
GRPCMethod: MethodAudioToAudioStream,
|
||||
Description: "Self-contained any-to-any audio model for the Realtime API — accepts microphone audio and emits speech + transcript (+ optional function calls) from a single backend via the AudioToAudioStream RPC.",
|
||||
},
|
||||
}
|
||||
|
||||
// BackendCapability describes which gRPC methods and usecases a backend supports.
|
||||
@@ -397,6 +404,15 @@ var BackendCapabilities = map[string]BackendCapability{
|
||||
Description: "Meta MusicGen via transformers — music generation from text",
|
||||
},
|
||||
|
||||
// --- Any-to-any audio backends ---
|
||||
"liquid-audio": {
|
||||
GRPCMethods: []GRPCMethod{MethodPredict, MethodPredictStream, MethodAudioTranscription, MethodTTS, MethodAudioToAudioStream, MethodVAD},
|
||||
PossibleUsecases: []string{UsecaseChat, UsecaseCompletion, UsecaseTranscript, UsecaseTTS, UsecaseRealtimeAudio, UsecaseVAD},
|
||||
DefaultUsecases: []string{UsecaseRealtimeAudio, UsecaseChat, UsecaseTranscript, UsecaseTTS, UsecaseVAD},
|
||||
AcceptsAudios: true,
|
||||
Description: "LFM2 / LFM2.5-Audio — self-contained any-to-any audio model for the Realtime API; also exposes chat, transcription, TTS and a stub energy-based VAD endpoint",
|
||||
},
|
||||
|
||||
// --- Audio transform backends ---
|
||||
"localvqe": {
|
||||
GRPCMethods: []GRPCMethod{MethodAudioTransform},
|
||||
|
||||
@@ -31,7 +31,15 @@ type DistributedConfig struct {
|
||||
DrainTimeout time.Duration // Time to wait for in-flight requests during drain (default 30s)
|
||||
HealthCheckInterval time.Duration // Health monitor check interval (default 15s)
|
||||
StaleNodeThreshold time.Duration // Time before a node is considered stale (default 60s)
|
||||
PerModelHealthCheck bool // Enable per-model backend health checking (default false)
|
||||
// DisablePerModelHealthCheck turns off the health monitor's per-model
|
||||
// gRPC probe. With the probe on (the default, field left false), the monitor pings each model's
|
||||
// gRPC address and removes stale node_models rows whose backend has
|
||||
// crashed even though the worker's node-level heartbeat is still arriving.
|
||||
// Without per-model probing, /embeddings and /completions can be dispatched
|
||||
// to a backend that silently returns garbage (see also the cascading
|
||||
// model-row cleanup on MarkUnhealthy / MarkDraining).
|
||||
DisablePerModelHealthCheck bool
|
||||
|
||||
MCPCIJobTimeout time.Duration // MCP CI job execution timeout (default 10m)
|
||||
|
||||
MaxUploadSize int64 // Maximum upload body size in bytes (default 50 GB)
|
||||
|
||||
@@ -1,6 +1,37 @@
|
||||
package config
|
||||
|
||||
type Gallery struct {
|
||||
URL string `json:"url" yaml:"url"`
|
||||
Name string `json:"name" yaml:"name"`
|
||||
// GalleryVerification declares the keyless-cosign signature policy that
|
||||
// every OCI backend image fetched from this gallery must satisfy.
|
||||
//
|
||||
// Verification is opt-in: galleries without a Verification block install
|
||||
// backends with no signature check (the downloader logs a warning when
|
||||
// LOCALAI_REQUIRE_BACKEND_INTEGRITY is unset; that flag turns the warning
|
||||
// into a hard error).
|
||||
//
|
||||
// Identity matching: set Issuer (exact) or IssuerRegex, AND Identity
|
||||
// (exact) or IdentityRegex. For GitHub Actions keyless signing the
|
||||
// typical shape is:
|
||||
//
|
||||
// verification:
|
||||
// issuer: "https://token.actions.githubusercontent.com"
|
||||
// identity_regex: "^https://github\\.com/mudler/local-ai-backends/\\.github/workflows/build\\.yaml@refs/heads/master$"
|
||||
// not_before: "2026-05-01T00:00:00Z"
|
||||
//
|
||||
// NotBefore is the revocation lever: advance it to invalidate every
|
||||
// signature produced before a known compromise window. Keyless cosign
|
||||
// certs are ephemeral so there is no CA-side revocation.
|
||||
type GalleryVerification struct {
|
||||
Issuer string `json:"issuer,omitempty" yaml:"issuer,omitempty"`
|
||||
IssuerRegex string `json:"issuer_regex,omitempty" yaml:"issuer_regex,omitempty"`
|
||||
Identity string `json:"identity,omitempty" yaml:"identity,omitempty"`
|
||||
IdentityRegex string `json:"identity_regex,omitempty" yaml:"identity_regex,omitempty"`
|
||||
|
||||
// NotBefore is an RFC3339 timestamp. Empty disables the time check.
|
||||
NotBefore string `json:"not_before,omitempty" yaml:"not_before,omitempty"`
|
||||
}
|
||||
|
||||
type Gallery struct {
|
||||
URL string `json:"url" yaml:"url"`
|
||||
Name string `json:"name" yaml:"name"`
|
||||
Verification *GalleryVerification `json:"verification,omitempty" yaml:"verification,omitempty"`
|
||||
}
|
||||
|
||||
@@ -54,6 +54,13 @@ func guessGGUFFromFile(cfg *ModelConfig, f *gguf.GGUFFile, defaultCtx int) {
|
||||
cfg.modelTemplate = chatTemplate.ValueString()
|
||||
}
|
||||
|
||||
// Auto-enable Multi-Token Prediction (ggml-org/llama.cpp#22673) when the
|
||||
// GGUF carries an embedded MTP head. Skipped silently for non-MTP models
|
||||
// and when the user already configured a spec_type.
|
||||
if n, ok := HasEmbeddedMTPHead(f); ok {
|
||||
ApplyMTPDefaults(cfg, n)
|
||||
}
|
||||
|
||||
// Thinking support detection is done after model load via DetectThinkingSupportFromBackend
|
||||
|
||||
// template estimations
|
||||
|
||||
@@ -636,6 +636,7 @@ const (
|
||||
FLAG_SPEAKER_RECOGNITION ModelConfigUsecase = 0b1000000000000000
|
||||
FLAG_AUDIO_TRANSFORM ModelConfigUsecase = 0b10000000000000000
|
||||
FLAG_DIARIZATION ModelConfigUsecase = 0b100000000000000000
|
||||
FLAG_REALTIME_AUDIO ModelConfigUsecase = 0b1000000000000000000
|
||||
|
||||
// Common Subsets
|
||||
FLAG_LLM ModelConfigUsecase = FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT
|
||||
@@ -645,12 +646,12 @@ const (
|
||||
// Flags within the same group are NOT orthogonal (e.g., chat and completion are
|
||||
// both text/language). A model is multimodal when its usecases span 2+ groups.
|
||||
var ModalityGroups = []ModelConfigUsecase{
|
||||
FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language
|
||||
FLAG_VISION | FLAG_DETECTION, // visual understanding
|
||||
FLAG_TRANSCRIPT, // speech input
|
||||
FLAG_TTS | FLAG_SOUND_GENERATION, // audio output
|
||||
FLAG_AUDIO_TRANSFORM, // audio in/out transforms
|
||||
FLAG_IMAGE | FLAG_VIDEO, // visual generation
|
||||
FLAG_CHAT | FLAG_COMPLETION | FLAG_EDIT, // text/language
|
||||
FLAG_VISION | FLAG_DETECTION, // visual understanding
|
||||
FLAG_TRANSCRIPT | FLAG_REALTIME_AUDIO, // speech input — realtime_audio is any-to-any, so it counts here too
|
||||
FLAG_TTS | FLAG_SOUND_GENERATION | FLAG_REALTIME_AUDIO, // audio output — and here, so a lone realtime_audio flag still reads as multimodal
|
||||
FLAG_AUDIO_TRANSFORM, // audio in/out transforms
|
||||
FLAG_IMAGE | FLAG_VIDEO, // visual generation
|
||||
}
|
||||
|
||||
// IsMultimodal returns true if the given usecases span two or more orthogonal
|
||||
@@ -692,6 +693,7 @@ func GetAllModelConfigUsecases() map[string]ModelConfigUsecase {
|
||||
"FLAG_SPEAKER_RECOGNITION": FLAG_SPEAKER_RECOGNITION,
|
||||
"FLAG_AUDIO_TRANSFORM": FLAG_AUDIO_TRANSFORM,
|
||||
"FLAG_DIARIZATION": FLAG_DIARIZATION,
|
||||
"FLAG_REALTIME_AUDIO": FLAG_REALTIME_AUDIO,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -866,6 +868,16 @@ func (c *ModelConfig) GuessUsecases(u ModelConfigUsecase) bool {
|
||||
}
|
||||
}
|
||||
|
||||
if (u & FLAG_REALTIME_AUDIO) == FLAG_REALTIME_AUDIO {
|
||||
// Backends that own a single any-to-any loop and implement
|
||||
// AudioToAudioStream — listed here so models without an explicit
|
||||
// known_usecases still surface on the Talk page.
|
||||
realtimeAudioBackends := []string{"liquid-audio"}
|
||||
if !slices.Contains(realtimeAudioBackends, c.Backend) {
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
return true
|
||||
}
|
||||
|
||||
|
||||
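The `ModalityGroups` change above places `FLAG_REALTIME_AUDIO` in both the speech-input and the audio-output group, so a lone realtime flag spans two groups. The sketch below shows why that works under the stated rule ("multimodal when usecases span 2+ groups"). The type, the bit positions, and `spansMultipleGroups` are illustrative stand-ins, not the real `IsMultimodal` implementation.

```go
package main

import "fmt"

// Usecase stands in for ModelConfigUsecase; bit positions are illustrative
// and do not match the real FLAG_* values.
type Usecase uint32

const (
	Chat Usecase = 1 << iota
	Transcript
	TTS
	RealtimeAudio
)

// groups mirrors the shape of ModalityGroups: one bitmask per modality.
var groups = []Usecase{
	Chat,                       // text/language
	Transcript | RealtimeAudio, // speech input
	TTS | RealtimeAudio,        // audio output
}

// spansMultipleGroups is a hypothetical version of the 2+ groups rule:
// count the groups that share at least one bit with u.
func spansMultipleGroups(u Usecase) bool {
	hit := 0
	for _, g := range groups {
		if u&g != 0 {
			hit++
		}
	}
	return hit >= 2
}

func main() {
	fmt.Println(spansMultipleGroups(Chat))          // one group only
	fmt.Println(spansMultipleGroups(RealtimeAudio)) // counted in two groups
	fmt.Println(spansMultipleGroups(Chat | TTS))    // two distinct groups
}
```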
84
core/config/mtp.go
Normal file
@@ -0,0 +1,84 @@
|
||||
package config
|
||||
|
||||
import (
|
||||
"strings"
|
||||
|
||||
gguf "github.com/gpustack/gguf-parser-go"
|
||||
"github.com/mudler/xlog"
|
||||
)
|
||||
|
||||
// mtpSpecOptions lists the speculative-decoding option keys auto-applied when
|
||||
// an MTP head is detected on a llama-cpp GGUF. Defaults track the upstream
|
||||
// MTP PR (ggml-org/llama.cpp#22673):
|
||||
//
|
||||
// - spec_type:draft-mtp activates Multi-Token Prediction
|
||||
// - spec_n_max:6 draft window
|
||||
// - spec_p_min:0.75 pinned because upstream marked the 0.75 default
|
||||
// with a "change to 0.0f" TODO; locking it here keeps acceptance
|
||||
// thresholds stable across future bumps
|
||||
var mtpSpecOptions = []string{
|
||||
"spec_type:draft-mtp",
|
||||
"spec_n_max:6",
|
||||
"spec_p_min:0.75",
|
||||
}
|
||||
|
||||
// MTPSpecOptions returns a copy of the option keys auto-applied when an MTP
|
||||
// head is detected. Exported for testing and for the importer.
|
||||
func MTPSpecOptions() []string {
|
||||
out := make([]string, len(mtpSpecOptions))
|
||||
copy(out, mtpSpecOptions)
|
||||
return out
|
||||
}
|
||||
|
||||
// HasEmbeddedMTPHead reports whether the parsed GGUF declares a Multi-Token
|
||||
// Prediction head. Detection reads `<arch>.nextn_predict_layers`, which is
|
||||
// what `gguf_writer.add_nextn_predict_layers(n)` emits in upstream's
|
||||
// `conversion/qwen.py` MTP mixin. A positive layer count means the head is
|
||||
// present in the same GGUF as the trunk.
|
||||
func HasEmbeddedMTPHead(f *gguf.GGUFFile) (uint32, bool) {
|
||||
if f == nil {
|
||||
return 0, false
|
||||
}
|
||||
arch := f.Architecture().Architecture
|
||||
if arch == "" {
|
||||
return 0, false
|
||||
}
|
||||
v, ok := f.Header.MetadataKV.Get(arch + ".nextn_predict_layers")
|
||||
if !ok {
|
||||
return 0, false
|
||||
}
|
||||
n := gguf.ValueNumeric[uint32](v)
|
||||
return n, n > 0
|
||||
}
|
||||
|
||||
// hasSpecTypeOption returns true when the slice already contains a
|
||||
// user-configured `spec_type:` / `speculative_type:` entry. Used to avoid
|
||||
// clobbering an explicit choice with the MTP auto-defaults.
|
||||
func hasSpecTypeOption(opts []string) bool {
|
||||
for _, o := range opts {
|
||||
if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
// ApplyMTPDefaults appends the auto-MTP option keys to cfg.Options when none
|
||||
// is already configured. It is a no-op when the user already picked a
|
||||
// `spec_type` (either via YAML or via the importer's preferences flow).
|
||||
//
|
||||
// `layers` is the value read from `<arch>.nextn_predict_layers` and is only
|
||||
// used for the diagnostic log line.
|
||||
func ApplyMTPDefaults(cfg *ModelConfig, layers uint32) {
|
||||
if cfg == nil {
|
||||
return
|
||||
}
|
||||
if hasSpecTypeOption(cfg.Options) {
|
||||
xlog.Debug("[mtp] embedded MTP head detected but spec_type already configured; leaving user choice intact",
|
||||
"name", cfg.Name, "nextn_layers", layers)
|
||||
return
|
||||
}
|
||||
cfg.Options = append(cfg.Options, mtpSpecOptions...)
|
||||
xlog.Info("[mtp] embedded MTP head detected; enabling draft-mtp speculative decoding",
|
||||
"name", cfg.Name, "nextn_layers", layers, "spec_n_max", 6, "spec_p_min", 0.75)
|
||||
}
|
||||
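The behaviour of `ApplyMTPDefaults` can be tried without a GGUF file. The sketch below reproduces the prefix check and the append on a plain string slice; `applyDefaults` and `hasSpecType` are simplified stand-ins that omit `ModelConfig` and logging.

```go
package main

import (
	"fmt"
	"strings"
)

// mtpDefaults copies the three option strings from mtpSpecOptions.
var mtpDefaults = []string{"spec_type:draft-mtp", "spec_n_max:6", "spec_p_min:0.75"}

// hasSpecType mirrors hasSpecTypeOption: either prefix counts as a user choice.
func hasSpecType(opts []string) bool {
	for _, o := range opts {
		if strings.HasPrefix(o, "spec_type:") || strings.HasPrefix(o, "speculative_type:") {
			return true
		}
	}
	return false
}

// applyDefaults is a simplified stand-in for ApplyMTPDefaults that works on a
// bare option slice and returns the result instead of mutating a config.
func applyDefaults(opts []string) []string {
	if hasSpecType(opts) {
		return opts // leave the explicit choice untouched
	}
	return append(opts, mtpDefaults...)
}

func main() {
	fmt.Println(applyDefaults([]string{"use_jinja:true"}))
	fmt.Println(applyDefaults([]string{"speculative_type:ngram-mod"}))
}
```

Only the speculative type is checked: a user who sets `spec_n_max` alone still receives all three defaults appended after it.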
86
core/config/mtp_test.go
Normal file
@@ -0,0 +1,86 @@
|
||||
package config_test
|
||||
|
||||
import (
|
||||
. "github.com/mudler/LocalAI/core/config"
|
||||
|
||||
. "github.com/onsi/ginkgo/v2"
|
||||
. "github.com/onsi/gomega"
|
||||
)
|
||||
|
||||
var _ = Describe("MTP auto-defaults", func() {
|
||||
Context("MTPSpecOptions", func() {
|
||||
It("returns the upstream-recommended speculative tuple", func() {
|
||||
Expect(MTPSpecOptions()).To(Equal([]string{
|
||||
"spec_type:draft-mtp",
|
||||
"spec_n_max:6",
|
||||
"spec_p_min:0.75",
|
||||
}))
|
||||
})
|
||||
|
||||
It("returns a defensive copy so callers cannot mutate the package default", func() {
|
||||
opts := MTPSpecOptions()
|
||||
opts[0] = "spec_type:none"
|
||||
Expect(MTPSpecOptions()[0]).To(Equal("spec_type:draft-mtp"))
|
||||
})
|
||||
})
|
||||
|
||||
Context("ApplyMTPDefaults", func() {
|
||||
It("appends MTP options when nothing is configured", func() {
|
||||
cfg := &ModelConfig{Name: "qwen-mtp"}
|
||||
ApplyMTPDefaults(cfg, 1)
|
||||
Expect(cfg.Options).To(Equal([]string{
|
||||
"spec_type:draft-mtp",
|
||||
"spec_n_max:6",
|
||||
"spec_p_min:0.75",
|
||||
}))
|
||||
})
|
||||
|
||||
It("preserves unrelated options already on the config", func() {
|
||||
cfg := &ModelConfig{
|
||||
Name: "qwen-mtp",
|
||||
Options: []string{"use_jinja:true", "cache_reuse:256"},
|
||||
}
|
||||
ApplyMTPDefaults(cfg, 1)
|
||||
Expect(cfg.Options).To(Equal([]string{
|
||||
"use_jinja:true",
|
||||
"cache_reuse:256",
|
||||
"spec_type:draft-mtp",
|
||||
"spec_n_max:6",
|
||||
"spec_p_min:0.75",
|
||||
}))
|
||||
})
|
||||
|
||||
It("is a no-op when the user already configured spec_type", func() {
|
||||
cfg := &ModelConfig{
|
||||
Name: "qwen-mtp",
|
||||
Options: []string{"spec_type:ngram-simple", "use_jinja:true"},
|
||||
}
|
||||
ApplyMTPDefaults(cfg, 1)
|
||||
Expect(cfg.Options).To(Equal([]string{
|
||||
"spec_type:ngram-simple",
|
||||
"use_jinja:true",
|
||||
}))
|
||||
})
|
||||
|
||||
It("also respects the legacy speculative_type alias", func() {
|
||||
cfg := &ModelConfig{
|
||||
Name: "qwen-mtp",
|
||||
Options: []string{"speculative_type:ngram-mod"},
|
||||
}
|
||||
ApplyMTPDefaults(cfg, 1)
|
||||
Expect(cfg.Options).To(Equal([]string{"speculative_type:ngram-mod"}))
|
||||
})
|
||||
|
||||
It("tolerates a nil config", func() {
|
||||
Expect(func() { ApplyMTPDefaults(nil, 1) }).ToNot(Panic())
|
||||
})
|
||||
})
|
||||
|
||||
Context("HasEmbeddedMTPHead", func() {
|
||||
It("returns false on a nil GGUF file", func() {
|
||||
n, ok := HasEmbeddedMTPHead(nil)
|
||||
Expect(ok).To(BeFalse())
|
||||
Expect(n).To(BeZero())
|
||||
})
|
||||
})
|
||||
})
|
||||
@@ -16,6 +16,7 @@ import (
|
||||
"github.com/mudler/LocalAI/pkg/downloader"
|
||||
"github.com/mudler/LocalAI/pkg/model"
|
||||
"github.com/mudler/LocalAI/pkg/oci"
|
||||
"github.com/mudler/LocalAI/pkg/oci/cosignverify"
|
||||
"github.com/mudler/LocalAI/pkg/system"
|
||||
"github.com/mudler/xlog"
|
||||
cp "github.com/otiai10/copy"
|
||||
@@ -102,8 +103,81 @@ func writeBackendMetadata(backendPath string, metadata *BackendMetadata) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
// backendDownloadOptions translates the gallery's verification policy into
|
||||
// downloader options, and gates the call on strict-integrity mode. Both
|
||||
// InstallBackend and UpgradeBackend MUST route their download through these
|
||||
// options — without them, the corresponding code path silently downloads
|
||||
// and activates unverified backend bytes even when the gallery has a
|
||||
// verification: policy configured.
|
||||
//
|
||||
// For OCI URIs with a verification policy, returns a slice containing
|
||||
// downloader.WithImageVerifier(v) — the downloader will then run cosign
|
||||
// signature verification between fetching the manifest and extracting
|
||||
// layers (see pkg/downloader/uri.go OCI branch).
|
||||
//
|
||||
// For OCI URIs without a verification policy, or non-OCI URIs without a
|
||||
// SHA256, the function either returns no options, so the install proceeds
|
||||
// with only a logged warning (requireIntegrity false), or fails the install
|
||||
// (requireIntegrity true). Local directories are never checked.
|
||||
func backendDownloadOptions(config *GalleryBackend, requireIntegrity bool) ([]downloader.DownloadOption, error) {
|
||||
uri := downloader.URI(config.URI)
|
||||
hasVerification := config.Gallery.Verification != nil
|
||||
hasSHA := config.SHA256 != ""
|
||||
|
||||
switch {
|
||||
case uri.LooksLikeOCI():
|
||||
if !hasVerification {
|
||||
if requireIntegrity {
|
||||
return nil, fmt.Errorf("strict integrity: gallery %q has no verification policy for OCI backend %q (set verification: in the gallery YAML or disable --require-backend-integrity)",
|
||||
config.Gallery.Name, config.Name)
|
||||
}
|
||||
xlog.Warn("installing OCI backend without signature verification",
|
||||
"backend", config.Name, "gallery", config.Gallery.Name, "uri", config.URI)
|
||||
return nil, nil
|
||||
}
|
||||
v, err := newGalleryVerifier(config.Gallery.Verification)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("gallery %q verification policy: %w", config.Gallery.Name, err)
|
||||
}
|
||||
return []downloader.DownloadOption{downloader.WithImageVerifier(v)}, nil
|
||||
|
||||
case uri.LooksLikeDir():
|
||||
// Local directory — out of scope for integrity checks.
|
||||
return nil, nil
|
||||
|
||||
default:
|
||||
if !hasSHA && requireIntegrity {
|
||||
return nil, fmt.Errorf("strict integrity: backend %q has no SHA256 (gallery %q)",
|
||||
config.Name, config.Gallery.Name)
|
||||
}
|
||||
// Non-strict: pkg/downloader already emits a warning when sha is empty.
|
||||
return nil, nil
|
||||
}
|
||||
}
|
||||
|
||||
// newGalleryVerifier constructs a cosignverify.Verifier from the gallery
|
||||
// policy. Parses NotBefore (RFC3339) here so YAML errors surface at install
|
||||
// time rather than during signature verification.
|
||||
func newGalleryVerifier(p *config.GalleryVerification) (*cosignverify.Verifier, error) {
|
||||
pol := cosignverify.Policy{
|
||||
Issuer: p.Issuer,
|
||||
IssuerRegex: p.IssuerRegex,
|
||||
Identity: p.Identity,
|
||||
IdentityRegex: p.IdentityRegex,
|
||||
}
|
||||
if p.NotBefore != "" {
|
||||
t, err := time.Parse(time.RFC3339, p.NotBefore)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("not_before %q: %w", p.NotBefore, err)
|
||||
}
|
||||
pol.NotBefore = t
|
||||
}
|
||||
return cosignverify.NewVerifier(pol, nil, nil)
|
||||
}
|
||||
|
||||
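The branches of `backendDownloadOptions` reduce to a small decision table. The sketch below restates it as a pure function so each combination can be checked by hand. `uriKind`, `decide`, and the outcome strings are hypothetical names introduced for illustration; the real function returns downloader options or an error instead.

```go
package main

import "fmt"

// uriKind is a hypothetical classification of the backend URI.
type uriKind int

const (
	kindOCI uriKind = iota
	kindDir
	kindOther
)

// decide restates the switch in backendDownloadOptions as an outcome label.
func decide(kind uriKind, hasPolicy, hasSHA, strict bool) string {
	switch kind {
	case kindOCI:
		if hasPolicy {
			return "verify signature"
		}
		if strict {
			return "fail"
		}
		return "warn and install"
	case kindDir:
		return "install unchecked" // local directories are out of scope
	default:
		if !hasSHA && strict {
			return "fail"
		}
		return "install"
	}
}

func main() {
	fmt.Println(decide(kindOCI, true, false, true))
	fmt.Println(decide(kindOCI, false, false, true))
	fmt.Println(decide(kindOCI, false, false, false))
	fmt.Println(decide(kindOther, false, false, true))
	fmt.Println(decide(kindDir, false, false, true))
}
```

Note that a SHA256 does not substitute for a signature policy on OCI URIs: in strict mode an OCI backend without `verification:` fails regardless of `hasSHA`.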
// InstallBackendFromGallery installs a backend from the gallery.
|
||||
func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force bool) error {
|
||||
// requireIntegrity escalates a missing SHA256 / verification policy from a
|
||||
// warning to a hard failure (see backendDownloadOptions).
|
||||
func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery, systemState *system.SystemState, modelLoader *model.ModelLoader, name string, downloadStatus func(string, string, string, float64), force, requireIntegrity bool) error {
|
||||
if !force {
|
||||
// check if we already have the backend installed
|
||||
backends, err := ListSystemBackends(systemState)
|
||||
@@ -149,7 +223,7 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
|
||||
xlog.Debug("Installing backend from meta backend", "name", name, "bestBackend", bestBackend.Name)
|
||||
|
||||
// Then, let's install the best backend
|
||||
if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus); err != nil {
|
||||
if err := InstallBackend(ctx, systemState, modelLoader, bestBackend, downloadStatus, requireIntegrity); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
@@ -175,10 +249,10 @@ func InstallBackendFromGallery(ctx context.Context, galleries []config.Gallery,
|
||||
return nil
|
||||
}
|
||||
|
||||
return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus)
|
||||
return InstallBackend(ctx, systemState, modelLoader, backend, downloadStatus, requireIntegrity)
|
||||
}
|
||||
|
||||
func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64)) error {
|
||||
func InstallBackend(ctx context.Context, systemState *system.SystemState, modelLoader *model.ModelLoader, config *GalleryBackend, downloadStatus func(string, string, string, float64), requireIntegrity bool) error {
|
||||
// Get configurable fallback tag values from SystemState
|
||||
latestTag, masterTag, devSuffix := getFallbackTagValues(systemState)
|
||||
|
||||
@@ -213,6 +287,14 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
|
||||
return fmt.Errorf("failed to create base path: %v", err)
|
||||
}
|
||||
|
||||
// Build the download options once and reuse for every retry path —
|
||||
// mirrors and tag fallbacks must verify against the same gallery
|
||||
// policy or we open a hole where a non-default URI bypasses the check.
|
||||
downloadOpts, optsErr := backendDownloadOptions(config, requireIntegrity)
|
||||
if optsErr != nil {
|
||||
return fmt.Errorf("backend %q: %w", config.Name, optsErr)
|
||||
}
|
||||
|
||||
uri := downloader.URI(config.URI)
|
||||
// Check if it is a directory
|
||||
if uri.LooksLikeDir() {
|
||||
@@ -222,7 +304,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
|
||||
}
|
||||
} else {
|
||||
xlog.Debug("Downloading backend", "uri", config.URI, "backendPath", backendPath)
|
||||
if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err != nil {
|
||||
if err := uri.DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err != nil {
|
||||
xlog.Debug("Backend download failed, trying fallback", "backendPath", backendPath, "error", err)
|
||||
|
||||
// resetBackendPath cleans up partial state from a failed OCI extraction
|
||||
@@ -243,7 +325,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
|
||||
default:
|
||||
}
|
||||
resetBackendPath()
|
||||
if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
|
||||
if err := downloader.URI(mirror).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
|
||||
success = true
|
||||
xlog.Debug("Downloaded backend from mirror", "uri", config.URI, "backendPath", backendPath)
|
||||
break
|
||||
@@ -256,7 +338,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
|
||||
if fallbackURI != string(config.URI) {
|
||||
resetBackendPath()
|
||||
xlog.Info("Trying fallback URI", "original", config.URI, "fallback", fallbackURI)
|
||||
if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
|
||||
if err := downloader.URI(fallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
|
||||
xlog.Info("Downloaded backend using fallback URI", "uri", fallbackURI, "backendPath", backendPath)
|
||||
success = true
|
||||
} else {
|
||||
@@ -265,7 +347,7 @@ func InstallBackend(ctx context.Context, systemState *system.SystemState, modelL
|
||||
resetBackendPath()
|
||||
devFallbackURI := fallbackURI + "-" + devSuffix
|
||||
xlog.Info("Trying development fallback URI", "fallback", devFallbackURI)
|
||||
if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus); err == nil {
|
||||
if err := downloader.URI(devFallbackURI).DownloadFileWithContext(ctx, backendPath, config.SHA256, 1, 1, downloadStatus, downloadOpts...); err == nil {
|
||||
xlog.Info("Downloaded backend using development fallback URI", "uri", devFallbackURI, "backendPath", backendPath)
|
||||
success = true
|
||||
} else {
|
||||
|
||||
@@ -117,13 +117,13 @@ var _ = Describe("Gallery Backends", func() {
|
||||
|
||||
Describe("InstallBackendFromGallery", func() {
|
||||
It("should return error when backend is not found", func() {
|
||||
err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true)
|
||||
err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "non-existent", nil, true, false)
|
||||
Expect(err).To(HaveOccurred())
|
||||
Expect(err.Error()).To(ContainSubstring("no backend found with name \"non-existent\""))
|
||||
})
|
||||
|
||||
It("should install backend from gallery", func() {
|
||||
err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true)
|
||||
err := InstallBackendFromGallery(context.TODO(), galleries, systemState, ml, "test-backend", nil, true, false)
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(filepath.Join(tempDir, "test-backend", "run.sh")).To(BeARegularFile())
|
||||
})
|
||||
@@ -545,7 +545,7 @@ var _ = Describe("Gallery Backends", func() {
|
||||
VRAM: 1000000000000,
|
||||
Backend: system.Backend{BackendsPath: tempDir},
|
||||
}
|
||||
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
|
||||
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
|
||||
metaBackendPath := filepath.Join(tempDir, "meta-backend")
|
||||
@@ -625,7 +625,7 @@ var _ = Describe("Gallery Backends", func() {
|
||||
VRAM: 1000000000000,
|
||||
Backend: system.Backend{BackendsPath: tempDir},
|
||||
}
|
||||
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
|
||||
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
|
||||
metaBackendPath := filepath.Join(tempDir, "meta-backend")
|
||||
@@ -709,7 +709,7 @@ var _ = Describe("Gallery Backends", func() {
|
||||
VRAM: 1000000000000,
|
||||
Backend: system.Backend{BackendsPath: tempDir},
|
||||
}
|
||||
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true)
|
||||
err = InstallBackendFromGallery(context.TODO(), []config.Gallery{gallery}, nvidiaSystemState, ml, "meta-backend", nil, true, false)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
|
||||
metaBackendPath := filepath.Join(tempDir, "meta-backend")
|
||||
@@ -808,7 +808,7 @@ var _ = Describe("Gallery Backends", func() {
|
||||
system.WithBackendPath(newPath),
|
||||
)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
|
||||
Expect(newPath).To(BeADirectory())
|
||||
Expect(err).To(HaveOccurred()) // Will fail due to invalid URI, but path should be created
|
||||
})
|
||||
@@ -840,7 +840,7 @@ var _ = Describe("Gallery Backends", func() {
|
||||
system.WithBackendPath(tempDir),
|
||||
)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
|
||||
dat, err := os.ReadFile(filepath.Join(tempDir, "test-backend", "metadata.json"))
|
||||
@@ -873,7 +873,7 @@ var _ = Describe("Gallery Backends", func() {
|
||||
|
||||
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).ToNot(BeARegularFile())
|
||||
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
|
||||
})
|
||||
@@ -894,7 +894,7 @@ var _ = Describe("Gallery Backends", func() {
|
||||
system.WithBackendPath(tempDir),
|
||||
)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil)
|
||||
err = InstallBackend(context.TODO(), systemState, ml, &backend, nil, false)
|
||||
Expect(err).ToNot(HaveOccurred())
|
||||
Expect(filepath.Join(tempDir, "test-backend", "metadata.json")).To(BeARegularFile())
|
||||
|
||||
|
||||
@@ -47,7 +47,7 @@ var _ = Describe("Backend versioning", func() {
|
||||
backend.URI = srcDir
|
||||
backend.Version = "1.2.3"
|
||||
|
||||
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
|
||||
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
|
||||
// Read the metadata file and check version
|
||||
@@ -74,7 +74,7 @@ var _ = Describe("Backend versioning", func() {
|
||||
backend.URI = srcDir
|
||||
backend.Version = "2.0.0"
|
||||
|
||||
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
|
||||
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
|
||||
metadataPath := filepath.Join(tempDir, "test-backend-uri", "metadata.json")
|
||||
@@ -100,7 +100,7 @@ var _ = Describe("Backend versioning", func() {
|
||||
backend.URI = srcDir
|
||||
// Version intentionally left empty
|
||||
|
||||
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil)
|
||||
err = gallery.InstallBackend(context.Background(), systemState, modelLoader, backend, nil, false)
|
||||
Expect(err).NotTo(HaveOccurred())
|
||||
|
||||
metadataPath := filepath.Join(tempDir, "test-backend-noversion", "metadata.json")
|
||||
|
||||
130
core/gallery/importers/ds4.go
Normal file
@@ -0,0 +1,130 @@
|
||||
package importers
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"path/filepath"
|
||||
"strings"
|
||||
|
||||
"github.com/mudler/LocalAI/core/config"
|
||||
"github.com/mudler/LocalAI/core/gallery"
|
||||
"github.com/mudler/LocalAI/core/schema"
|
||||
"github.com/mudler/LocalAI/pkg/downloader"
|
||||
"github.com/mudler/LocalAI/pkg/functions"
|
||||
"go.yaml.in/yaml/v2"
|
||||
)
|
||||
|
||||
var _ Importer = &DS4Importer{}
|
||||
|
||||
// DS4Importer detects antirez/ds4 weights - single-model DeepSeek V4 Flash
|
||||
// inference engine. ds4 only loads the GGUFs published at
|
||||
// huggingface.co/antirez/deepseek-v4-gguf; auto-detect keys on:
|
||||
//
|
||||
// - the repo name itself ("antirez/deepseek-v4-gguf" anywhere in URI)
|
||||
// - the canonical filename pattern "DeepSeek-V4-Flash-*.gguf"
|
||||
//
|
||||
// Must register BEFORE LlamaCPPImporter - both match .gguf, but ds4 is
|
||||
// more specific and first-match-wins.
|
||||
type DS4Importer struct{}
|
||||
|
||||
func (i *DS4Importer) Name() string { return "ds4" }
|
||||
func (i *DS4Importer) Modality() string { return "text" }
|
||||
func (i *DS4Importer) AutoDetects() bool { return true }
func (i *DS4Importer) Match(details Details) bool {
	preferences, err := details.Preferences.MarshalJSON()
	if err != nil {
		return false
	}
	preferencesMap := make(map[string]any)
	if len(preferences) > 0 {
		_ = json.Unmarshal(preferences, &preferencesMap)
	}

	if b, ok := preferencesMap["backend"].(string); ok && b == "ds4" {
		return true
	}

	if strings.Contains(details.URI, "antirez/deepseek-v4-gguf") {
		return true
	}

	base := filepath.Base(details.URI)
	if strings.HasPrefix(base, "DeepSeek-V4-Flash-") && strings.HasSuffix(base, ".gguf") {
		return true
	}

	if details.HuggingFace != nil {
		for _, file := range details.HuggingFace.Files {
			fb := filepath.Base(file.Path)
			if strings.HasPrefix(fb, "DeepSeek-V4-Flash-") && strings.HasSuffix(fb, ".gguf") {
				return true
			}
		}
	}

	return false
}

func (i *DS4Importer) Import(details Details) (gallery.ModelConfig, error) {
	preferences, err := details.Preferences.MarshalJSON()
	if err != nil {
		return gallery.ModelConfig{}, err
	}
	preferencesMap := make(map[string]any)
	if len(preferences) > 0 {
		_ = json.Unmarshal(preferences, &preferencesMap)
	}

	name, ok := preferencesMap["name"].(string)
	if !ok {
		name = filepath.Base(details.URI)
		name = strings.TrimSuffix(name, ".gguf")
	}
	description, ok := preferencesMap["description"].(string)
	if !ok {
		description = "DeepSeek V4 Flash - antirez/ds4 backend"
	}

	modelConfig := config.ModelConfig{
		Name:                name,
		Description:         description,
		KnownUsecaseStrings: []string{config.UsecaseChat},
		Backend:             "ds4",
		PredictionOptions: schema.PredictionOptions{
			BasicModelRequest: schema.BasicModelRequest{
				Model: "ds4flash.gguf",
			},
		},
		TemplateConfig: config.TemplateConfig{
			UseTokenizerTemplate: true,
		},
		FunctionsConfig: functions.FunctionsConfig{
			GrammarConfig: functions.GrammarConfig{NoGrammar: true},
			// ds4 emits OpenAI-shape tool_calls in ChatDelta natively via
			// our DSML parser; the Go-side regex fallback should NOT fire.
			AutomaticToolParsingFallback: false,
		},
	}

	cfg := gallery.ModelConfig{
		Name:        name,
		Description: description,
	}

	// The file to fetch: derive from the URI. We standardize the local
	// filename to "ds4flash.gguf" to match ds4's own convention (its CLI
	// defaults to that path), so users can run the model without extra
	// config.
	uri := downloader.URI(details.URI)
	cfg.Files = append(cfg.Files, gallery.File{
		Filename: "ds4flash.gguf",
		URI:      string(uri),
	})

	out, err := yaml.Marshal(modelConfig)
	if err != nil {
		return gallery.ModelConfig{}, err
	}
	cfg.ConfigFile = string(out)
	return cfg, nil
}
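The matching rules in `DS4Importer.Match` can be sketched as a standalone function. This is a minimal sketch rather than the importer itself: `matchesDS4`, `isDS4File`, and the flattened parameters `backendPref` and `hfFiles` are hypothetical names introduced here so the example runs without LocalAI's `Details` type.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isDS4File reports whether the base name of path follows the
// DeepSeek-V4-Flash-*.gguf pattern that the importer checks.
func isDS4File(path string) bool {
	base := filepath.Base(path)
	return strings.HasPrefix(base, "DeepSeek-V4-Flash-") && strings.HasSuffix(base, ".gguf")
}

// matchesDS4 is a hypothetical, flattened stand-in for DS4Importer.Match.
// backendPref plays the role of preferences["backend"]; hfFiles plays the
// role of the HuggingFace file listing.
func matchesDS4(uri, backendPref string, hfFiles []string) bool {
	if backendPref == "ds4" {
		return true // explicit override wins regardless of the URI
	}
	if strings.Contains(uri, "antirez/deepseek-v4-gguf") {
		return true
	}
	if isDS4File(uri) {
		return true
	}
	for _, f := range hfFiles {
		if isDS4File(f) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matchesDS4("huggingface://TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf", "", nil))
	fmt.Println(matchesDS4("https://example.com/some-other.gguf", "ds4", nil))
}
```

Note that the explicit `backend` preference is checked first, so it matches any URI, including GGUFs that would otherwise fall through to llama-cpp.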
69
core/gallery/importers/ds4_test.go
Normal file
@@ -0,0 +1,69 @@
package importers_test

import (
	"encoding/json"
	"strings"

	. "github.com/mudler/LocalAI/core/gallery/importers"
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("DS4Importer", func() {
	var importer *DS4Importer

	BeforeEach(func() {
		importer = &DS4Importer{}
	})

	Context("Match", func() {
		It("matches the canonical HuggingFace repo URI", func() {
			details := Details{
				URI: "huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
			}
			Expect(importer.Match(details)).To(BeTrue())
		})

		It("matches when filename has the DeepSeek-V4-Flash prefix", func() {
			details := Details{
				URI: "https://example.com/mirror/DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf",
			}
			Expect(importer.Match(details)).To(BeTrue())
		})

		It("matches when backend preference is ds4", func() {
			prefs := json.RawMessage(`{"backend": "ds4"}`)
			details := Details{
				URI:         "https://example.com/some-other.gguf",
				Preferences: prefs,
			}
			Expect(importer.Match(details)).To(BeTrue())
		})

		It("does not match arbitrary GGUFs (must fall through to llama-cpp)", func() {
			details := Details{URI: "huggingface://TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf"}
			Expect(importer.Match(details)).To(BeFalse())
		})

		It("does not match non-GGUF assets", func() {
			details := Details{URI: "https://example.com/model.bin"}
			Expect(importer.Match(details)).To(BeFalse())
		})
	})

	Context("Import", func() {
		It("emits backend: ds4 and the standard ds4flash.gguf filename", func() {
			details := Details{
				URI: "huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf",
			}
			cfg, err := importer.Import(details)
			Expect(err).NotTo(HaveOccurred())
			Expect(cfg.Files).To(HaveLen(1))
			Expect(cfg.Files[0].Filename).To(Equal("ds4flash.gguf"))
			Expect(cfg.Files[0].URI).To(Equal(details.URI))
			Expect(strings.Contains(cfg.ConfigFile, "backend: ds4")).To(BeTrue(),
				"ConfigFile must specify backend: ds4, got: %s", cfg.ConfigFile)
			Expect(strings.Contains(cfg.ConfigFile, "use_tokenizer_template: true")).To(BeTrue())
		})
	})
})
@@ -130,6 +130,8 @@ var defaultImporters = []Importer{
	// and would otherwise swallow the C++ port's GGUF bundles.
	&VibeVoiceCppImporter{},
	&VibeVoiceImporter{},
	// LiquidAudio (Python) — keep before LlamaCPP so non-GGUF LFM2-Audio repos route here.
	&LiquidAudioImporter{},
	&CoquiImporter{},
	// Image/Video (Batch 3)
	&StableDiffusionGGMLImporter{},
@@ -153,6 +155,11 @@ var defaultImporters = []Importer{
	// checkpoints may carry tokenizer-adjacent artefacts.
	&RFDetrImporter{},
	// Existing
	// DS4Importer must precede LlamaCPPImporter: ds4 weights are GGUFs and
	// would otherwise be claimed by the generic .gguf-handling llama-cpp
	// importer. Unless preferences.backend="ds4" is set explicitly, it
	// matches only the antirez/deepseek-v4-gguf repo or the
	// DeepSeek-V4-Flash-*.gguf filename pattern, so arbitrary GGUFs still
	// fall through to llama-cpp.
	&DS4Importer{},
	&LlamaCPPImporter{},
	&MLXImporter{},
	&VLLMImporter{},
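Why the ordering matters can be illustrated with a small sketch. It assumes the registry is walked in order and the first importer whose `Match` returns true claims the URI; the iteration code is not shown in this diff, so treat that as an assumption. The `matcher` type, `pick`, and the two simplified match functions are hypothetical stand-ins, not LocalAI APIs.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher is a hypothetical, reduced stand-in for the Importer interface.
type matcher struct {
	name  string
	match func(uri string) bool
}

// pick returns the name of the first matcher that claims the URI, or "".
// First-match-wins is an assumption about how the registry is consumed.
func pick(ms []matcher, uri string) string {
	for _, m := range ms {
		if m.match(uri) {
			return m.name
		}
	}
	return ""
}

// Simplified match rules: the specific importer looks for the repo name,
// the generic one accepts any .gguf file.
var ds4Matcher = matcher{
	name:  "ds4",
	match: func(u string) bool { return strings.Contains(u, "antirez/deepseek-v4-gguf") },
}

var llamaMatcher = matcher{
	name:  "llama-cpp",
	match: func(u string) bool { return strings.HasSuffix(u, ".gguf") },
}

func main() {
	uri := "huggingface://antirez/deepseek-v4-gguf/DeepSeek-V4-Flash-chat-v2.gguf"
	fmt.Println(pick([]matcher{ds4Matcher, llamaMatcher}, uri)) // specific importer listed first
	fmt.Println(pick([]matcher{llamaMatcher, ds4Matcher}, uri)) // generic importer listed first
}
```

Under that assumption, listing the generic `.gguf` matcher first would route ds4 weights to llama-cpp, which is the situation the ordering comment guards against.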
145
core/gallery/importers/liquid-audio.go
Normal file
@@ -0,0 +1,145 @@
package importers

import (
	"encoding/json"
	"path/filepath"
	"strings"

	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/gallery"
	"github.com/mudler/LocalAI/core/schema"
	"go.yaml.in/yaml/v2"
)

var _ Importer = &LiquidAudioImporter{}

// LiquidAudioImporter recognises LiquidAI's LFM2-Audio family (LFM2-Audio-1.5B,
// LFM2.5-Audio-1.5B, community finetunes) and routes them to the Python
// `liquid-audio` backend. Detection is by repo-name substring so third-party
// mirrors still match. preferences.backend="liquid-audio" overrides detection.
//
// Once upstream llama.cpp PR #18641 lands and the GGUF gallery entries are
// added, GGUF mirrors of these models should route to llama-cpp; that's
// handled by ordering LlamaCPPImporter after this one and by the explicit
// "-gguf" exclusion below.
type LiquidAudioImporter struct{}

func (i *LiquidAudioImporter) Name() string      { return "liquid-audio" }
func (i *LiquidAudioImporter) Modality() string  { return "tts" }
func (i *LiquidAudioImporter) AutoDetects() bool { return true }

func (i *LiquidAudioImporter) Match(details Details) bool {
	preferences, err := details.Preferences.MarshalJSON()
	if err != nil {
		return false
	}
	preferencesMap := make(map[string]any)
	if len(preferences) > 0 {
		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
			return false
		}
	}

	if b, ok := preferencesMap["backend"].(string); ok && b == "liquid-audio" {
		return true
	}

	matchRepo := func(repo string) bool {
		r := strings.ToLower(repo)
		// Cede GGUF mirrors to the (later-ordered) llama-cpp importer.
		if strings.HasSuffix(r, "-gguf") {
			return false
		}
		return strings.Contains(r, "lfm2-audio") || strings.Contains(r, "lfm2.5-audio")
	}

	if details.HuggingFace != nil {
		repoName := details.HuggingFace.ModelID
		if idx := strings.Index(repoName, "/"); idx >= 0 {
			repoName = repoName[idx+1:]
		}
		if matchRepo(repoName) {
			return true
		}
	}

	if _, repo, ok := HFOwnerRepoFromURI(details.URI); ok {
		return matchRepo(repo)
	}
	return false
}

func (i *LiquidAudioImporter) Import(details Details) (gallery.ModelConfig, error) {
	preferences, err := details.Preferences.MarshalJSON()
	if err != nil {
		return gallery.ModelConfig{}, err
	}
	preferencesMap := make(map[string]any)
	if len(preferences) > 0 {
		if err := json.Unmarshal(preferences, &preferencesMap); err != nil {
			return gallery.ModelConfig{}, err
		}
	}

	name, ok := preferencesMap["name"].(string)
	if !ok {
		name = filepath.Base(details.URI)
	}

	description, ok := preferencesMap["description"].(string)
	if !ok {
		description = "Imported from " + details.URI
	}

	model := details.URI
	if details.HuggingFace != nil && details.HuggingFace.ModelID != "" {
		model = details.HuggingFace.ModelID
	}

	// Preferences may pin the mode (chat / asr / tts / s2s / finetune).
	// Default to s2s — the headline any-to-any use case.
	mode, _ := preferencesMap["mode"].(string)
	if mode == "" {
		mode = "s2s"
	}

	options := []string{"mode:" + mode}
	if voice, ok := preferencesMap["voice"].(string); ok && voice != "" {
		options = append(options, "voice:"+voice)
	}

	usecases := []string{"chat"}
	switch mode {
	case "asr":
		usecases = []string{"transcript"}
	case "tts":
		usecases = []string{"tts"}
	case "s2s":
		// realtime_audio surfaces the model on the Talk page; chat/tts/
		// transcript/vad keep the standalone OpenAI-compatible endpoints
		// working since liquid-audio implements all of them.
		usecases = []string{"realtime_audio", "chat", "tts", "transcript", "vad"}
	}

	modelConfig := config.ModelConfig{
		Name:                name,
		Description:         description,
		Backend:             "liquid-audio",
		KnownUsecaseStrings: usecases,
		Options:             options,
		PredictionOptions: schema.PredictionOptions{
			BasicModelRequest: schema.BasicModelRequest{Model: model},
		},
	}

	data, err := yaml.Marshal(modelConfig)
	if err != nil {
		return gallery.ModelConfig{}, err
	}

	return gallery.ModelConfig{
		Name:        name,
		Description: description,
		ConfigFile:  string(data),
	}, nil
}
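The mode handling in `LiquidAudioImporter.Import` can be read as a small lookup from the `mode` preference to the option string and usecase list that end up in the generated config. The sketch below isolates that mapping; `planForMode` is a hypothetical helper name, not a function in the diff, and it takes the preference values as plain strings instead of a preferences map.

```go
package main

import "fmt"

// planForMode is a hypothetical helper that mirrors the mode/voice handling
// in LiquidAudioImporter.Import: an empty mode defaults to "s2s", the mode
// and optional voice become "key:value" options, and the mode selects the
// usecase list (falling back to "chat" for any other mode).
func planForMode(mode, voice string) (options, usecases []string) {
	if mode == "" {
		mode = "s2s"
	}
	options = []string{"mode:" + mode}
	if voice != "" {
		options = append(options, "voice:"+voice)
	}

	usecases = []string{"chat"}
	switch mode {
	case "asr":
		usecases = []string{"transcript"}
	case "tts":
		usecases = []string{"tts"}
	case "s2s":
		usecases = []string{"realtime_audio", "chat", "tts", "transcript", "vad"}
	}
	return options, usecases
}

func main() {
	for _, mode := range []string{"", "asr", "tts", "finetune"} {
		opts, uses := planForMode(mode, "")
		fmt.Printf("mode=%q options=%v usecases=%v\n", mode, opts, uses)
	}
}
```

Modes without a dedicated case, such as `chat` or `finetune`, keep the default `chat` usecase while still recording the requested mode in the options.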
91
core/gallery/importers/liquid-audio_test.go
Normal file
@@ -0,0 +1,91 @@
package importers_test

import (
	"encoding/json"
	"fmt"

	"github.com/mudler/LocalAI/core/gallery/importers"
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("LiquidAudioImporter", func() {
	Context("detection from HuggingFace", func() {
		It("matches LiquidAI/LFM2.5-Audio-1.5B", func() {
			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
			preferences := json.RawMessage(`{}`)

			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)

			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
			Expect(modelConfig.ConfigFile).To(ContainSubstring("LiquidAI/LFM2.5-Audio-1.5B"))
		})

		It("matches LiquidAI/LFM2-Audio-1.5B (older variant)", func() {
			uri := "https://huggingface.co/LiquidAI/LFM2-Audio-1.5B"
			preferences := json.RawMessage(`{}`)

			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)

			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
		})

		It("cedes -GGUF mirrors to the llama-cpp importer", func() {
			// LiquidAI/LFM2.5-Audio-1.5B-GGUF should NOT route to liquid-audio.
			// Once upstream PR #18641 lands and the GGUF gallery entry exists,
			// this is the path that lets users opt into the C++ runtime.
			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B-GGUF"
			preferences := json.RawMessage(`{}`)

			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)

			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
			Expect(modelConfig.ConfigFile).ToNot(ContainSubstring("backend: liquid-audio"),
				fmt.Sprintf("GGUF repo should not match Python importer; got: %s", modelConfig.ConfigFile))
		})
	})

	Context("preference override", func() {
		It("honours preferences.backend=liquid-audio for arbitrary URIs", func() {
			uri := "https://example.com/some-unrelated-model"
			preferences := json.RawMessage(`{"backend": "liquid-audio"}`)

			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)

			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: liquid-audio"))
		})

		It("picks up the mode preference", func() {
			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
			preferences := json.RawMessage(`{"mode": "asr"}`)

			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)

			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
			Expect(modelConfig.ConfigFile).To(ContainSubstring("mode:asr"))
			Expect(modelConfig.ConfigFile).To(ContainSubstring("transcript"))
		})

		It("picks up the voice preference", func() {
			uri := "https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B"
			preferences := json.RawMessage(`{"mode": "tts", "voice": "uk_male"}`)

			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)

			Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("Error: %v", err))
			Expect(modelConfig.ConfigFile).To(ContainSubstring("voice:uk_male"))
		})
	})

	Context("Importer interface metadata", func() {
		It("exposes name/modality/autodetect", func() {
			imp := &importers.LiquidAudioImporter{}
			Expect(imp.Name()).To(Equal("liquid-audio"))
			Expect(imp.Modality()).To(Equal("tts"))
			Expect(imp.AutoDetects()).To(BeTrue())
		})
	})
})
@@ -1,10 +1,13 @@
package importers

import (
	"context"
	"encoding/json"
	"path/filepath"
	"strings"
	"time"

	gguf "github.com/gpustack/gguf-parser-go"
	"github.com/mudler/LocalAI/core/config"
	"github.com/mudler/LocalAI/core/gallery"
	"github.com/mudler/LocalAI/core/schema"
@@ -261,6 +264,13 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
	// Apply per-model-family inference parameter defaults
	config.ApplyInferenceDefaults(&modelConfig, details.URI)

	// Auto-detect Multi-Token Prediction heads (ggml-org/llama.cpp#22673) and
	// enable speculative decoding. Mirrors the load-time hook so freshly
	// imported configs already carry spec_type:draft-mtp before the model is
	// ever loaded - users see it in the YAML preview rather than discovering
	// it after the first start.
	maybeApplyMTPDefaults(&modelConfig, details, &cfg)

	data, err := yaml.Marshal(modelConfig)
	if err != nil {
		return gallery.ModelConfig{}, err
@@ -291,6 +301,85 @@ func pickPreferredGroup(groups []hfapi.ShardGroup, prefs []string) *hfapi.ShardG
	return &groups[len(groups)-1]
}

// maybeApplyMTPDefaults parses the picked GGUF header (range-fetched over
// HTTP for HF/URL imports) and, if the file declares a Multi-Token Prediction
// head, appends the auto-MTP option keys to modelConfig.Options. Failures
// during the probe are non-fatal: the importer keeps the config without MTP
// so an unrelated network blip or weird header doesn't break the import.
//
// OCI/Ollama URIs are skipped because the artifact isn't directly fetchable
// as a GGUF byte stream - the load-time hook (core/config/gguf.go) covers
// those once the model is materialised on disk.
func maybeApplyMTPDefaults(modelConfig *config.ModelConfig, details Details, cfg *gallery.ModelConfig) {
	probeURL := pickMTPProbeURL(details, cfg)
	if probeURL == "" {
		return
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	defer func() {
		if r := recover(); r != nil {
			xlog.Debug("[mtp-importer] panic while probing GGUF header", "uri", probeURL, "recover", r)
		}
	}()

	f, err := gguf.ParseGGUFFileRemote(ctx, probeURL)
	if err != nil {
		xlog.Debug("[mtp-importer] failed to read remote GGUF header for MTP detection", "uri", probeURL, "error", err)
		return
	}

	n, ok := config.HasEmbeddedMTPHead(f)
	if !ok {
		return
	}
	config.ApplyMTPDefaults(modelConfig, n)
}

// pickMTPProbeURL returns an HTTP(S) URL pointing at the main (non-mmproj)
// GGUF shard that should be inspected for an MTP head, or "" when no
// suitable URL is available. Custom URI schemes (`huggingface://`,
// `ollama://`, etc.) are run through `downloader.URI.ResolveURL` so the
// resulting URL is something `gguf.ParseGGUFFileRemote` can actually open.
// OCI/Ollama URIs are skipped because the artifact is not directly
// streamable as a GGUF byte range.
func pickMTPProbeURL(details Details, cfg *gallery.ModelConfig) string {
	uri := downloader.URI(details.URI)

	if uri.LooksLikeOCI() {
		return ""
	}

	if strings.HasSuffix(strings.ToLower(details.URI), ".gguf") {
		return resolveHTTPProbe(details.URI)
	}

	for _, f := range cfg.Files {
		lower := strings.ToLower(f.Filename)
		if strings.Contains(lower, "mmproj") {
			continue
		}
		if !strings.HasSuffix(lower, ".gguf") {
			continue
		}
		return resolveHTTPProbe(f.URI)
	}
	return ""
}

// resolveHTTPProbe resolves an importer-side URI to the HTTP(S) URL that
// `gguf.ParseGGUFFileRemote` can range-fetch. Returns "" if the URI can't
// be reduced to an HTTP(S) endpoint (e.g. local path, unsupported scheme).
func resolveHTTPProbe(uri string) string {
	resolved := downloader.URI(uri).ResolveURL()
	if downloader.URI(resolved).LooksLikeHTTPURL() {
		return resolved
	}
	return ""
}
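The file-selection part of `pickMTPProbeURL` can be tried in isolation. The sketch below is a simplified illustration under stated assumptions: `file` is a hypothetical stand-in for `gallery.File`, `pickProbe` is a hypothetical name, and the OCI check and the `downloader.URI` resolution step from the diff are deliberately left out so the example depends only on the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// file is a hypothetical stand-in for gallery.File.
type file struct {
	Filename string
	URI      string
}

// pickProbe returns the URI of the GGUF that should be inspected, or "" if
// none qualifies. A direct .gguf URI is used as-is; otherwise the first
// listed file that is a .gguf and not an mmproj projector is chosen.
// The OCI check and URL resolution from the diff are omitted here.
func pickProbe(uri string, files []file) string {
	if strings.HasSuffix(strings.ToLower(uri), ".gguf") {
		return uri
	}
	for _, f := range files {
		lower := strings.ToLower(f.Filename)
		if strings.Contains(lower, "mmproj") || !strings.HasSuffix(lower, ".gguf") {
			continue
		}
		return f.URI
	}
	return ""
}

func main() {
	files := []file{
		{Filename: "mmproj-model-f16.gguf", URI: "https://example.com/mmproj-model-f16.gguf"},
		{Filename: "README.md", URI: "https://example.com/README.md"},
		{Filename: "model-Q4_K_M.gguf", URI: "https://example.com/model-Q4_K_M.gguf"},
	}
	fmt.Println(pickProbe("https://example.com/repo", files))
}
```

Returning an empty string for "nothing to probe" lets the caller skip MTP detection without treating it as an error, which matches the non-fatal behaviour described for `maybeApplyMTPDefaults`.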

// appendShardGroup copies every shard of group into cfg.Files under dest,
// skipping any entry whose target filename is already present so repeated
// calls (e.g. the rare case of mmproj + model picking the same group)
Some files were not shown because too many files have changed in this diff