Compare commits

..

16 Commits

Author SHA1 Message Date
Ettore Di Giacinto
cd56a05c3e ci(vllm): disable tests-vllm-grpc job (heterogeneous runners)
Both ubuntu-latest and bigger-runner have inconsistent CPU baselines:
some instances support the AVX-512 VNNI/BF16 instructions the prebuilt
vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of
vllm.model_executor.models.registry. The libnuma packaging fix doesn't
help when the wheel itself can't be loaded.

FROM_SOURCE=true compiles vllm against the actual host CPU and works
everywhere, but takes 30-50 minutes per run — too slow for a smoke
test on every PR.

Comment out the job for now. The test itself is intact and passes
locally; run it via 'make test-extra-backend-vllm' on a host with the
required SIMD baseline. Re-enable when:
  - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or
  - vllm publishes a CPU wheel with a wider baseline, or
  - we set up a docker layer cache that makes FROM_SOURCE acceptable

The detect-changes vllm output, the test harness changes (tests/
e2e-backends + tools cap), the make target (test-extra-backend-vllm),
the package.sh and the Dockerfile/install.sh plumbing all stay in
place.
2026-04-13 07:46:57 +00:00
Ettore Di Giacinto
d74cd56b14 feat(vllm): bundle libnuma/libgomp via package.sh
The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at
import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP).
Without these on the host, vllm._C silently fails to register its
torch ops and EngineCore crashes with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Rather than asking every user to install libnuma1/libgomp1 on their
host (or every LocalAI base image to ship them), bundle them into
the backend image itself — same pattern fish-speech and the GPU libs
already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at
run time so the bundled copies are picked up automatically.

- backend/python/vllm/package.sh (new): copies libnuma.so.1 and
  libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib,
  preserving soname symlinks. Runs during Dockerfile.python's
  'Run backend-specific packaging' step (which already invokes
  package.sh if present).
- backend/Dockerfile.python: install libnuma1 + libgomp1 in the
  builder stage so package.sh has something to copy (the Ubuntu
  base image otherwise only has libgomp in the gcc dep chain).
- test-extra.yml: drop the workaround that installed these libs on
  the runner host — with the backend image self-contained, the
  runner no longer needs them, and the test now exercises the
  packaging path end-to-end the way a production host would.
2026-04-12 20:20:21 +00:00
Ettore Di Giacinto
017bdee4e4 ci(vllm): install libnuma1 + libgomp1 on bigger-runner
The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens
libnuma.so.1 at import time. When the runner host doesn't have it,
the extension silently fails to register its torch ops, so
EngineCore crashes on init_device with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be
safe on stripped-down runners.
2026-04-12 20:18:13 +00:00
Ettore Di Giacinto
c4dc495ea1 ci(vllm): install make + build deps on bigger-runner
bigger-runner is a bare self-hosted runner used by backend.yml for
docker image builds — it has docker but not the usual ubuntu-latest
toolchain. The make-based test target needs make, build-essential
(cgo in 'go test'), and curl/unzip (the Makefile protoc target
downloads protoc from github releases).

protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the
install-go-tools target, which setup-go makes possible.
2026-04-12 20:08:09 +00:00
Ettore Di Giacinto
ea2bbabffd ci(vllm): use bigger-runner instead of source build
The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512
VNNI/BF16) that stock ubuntu-latest GitHub runners don't support —
vllm.model_executor.models.registry SIGILLs on import during LoadModel.

Source compilation works but takes 30-40 minutes per CI run, which is
too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the
bigger-runner self-hosted label (already used by backend.yml for the
llama-cpp CUDA build) — that hardware has the required SIMD baseline
and the prebuilt wheel runs cleanly.

FROM_SOURCE=true is kept as an opt-in escape hatch:
- install.sh still has the CPU source-build path for hosts that need it
- backend/Dockerfile.python still declares the ARG + ENV
- Makefile docker-build-backend still forwards the build-arg when set
Default CI path uses the fast prebuilt wheel; source build can be
re-enabled by exporting FROM_SOURCE=true in the environment.
2026-04-12 16:02:49 +00:00
Ettore Di Giacinto
329df11989 fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel
The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with
SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU
supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns
the model_executor.models.registry subprocess for introspection, so
LoadModel never reaches the actual inference path.

- install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide
  requirements-cpu-after.txt so installRequirements installs the base
  deps + torch CPU without pulling the prebuilt wheel, then clone vllm
  and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries
  target the host's actual CPU.
- backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose
  it as an ENV so install.sh sees it during `make`.
- Makefile docker-build-backend: forward FROM_SOURCE as --build-arg
  when set, so backends that need source builds can opt in.
- Makefile test-extra-backend-vllm: call docker-build-vllm via a
  recursive $(MAKE) invocation so FROM_SOURCE flows through.
- .github/workflows/test-extra.yml: set FROM_SOURCE=true on the
  tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only
  works on hosts that share the build-time SIMD baseline.

Answers 'did you test locally?': yes, end-to-end on my local machine
with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU
gap was not covered locally — this commit plugs that gap.
2026-04-12 15:14:42 +00:00
Ettore Di Giacinto
c7f444d18b ci(test-extra): run vllm e2e tests on CPU
Adds tests-vllm-grpc to the test-extra workflow, mirroring the
llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under
backend/python/vllm/ change (or on run-all), builds the local-ai
vllm container image, and runs the tests/e2e-backends harness with
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes,
and the tools capability enabled.

Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm
wheel we pinned in requirements-cpu-after.txt. Frees disk space
before the build since the docker image + torch + vllm wheel is
sizeable.
2026-04-12 14:53:44 +00:00
Ettore Di Giacinto
e7f406169a test(e2e-backends): add tools capability + HF model name support
Extends tests/e2e-backends to cover backends that:
- Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of
  loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as
  ModelOptions.Model with no download/ModelFile.
- Parse tool calls into ChatDelta.tool_calls: new "tools" capability
  sends a Predict with a get_weather function definition and asserts
  the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate
  with OpenAI-style Messages so the backend can wire tools into the
  model's chat template.
- Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set
  e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time.

Adds make target test-extra-backend-vllm that:
- docker-build-vllm
- loads Qwen/Qwen2.5-0.5B-Instruct
- runs health,load,predict,stream,tools with tool_parser:hermes

Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those
standalone scripts were scaffolding used while bringing up the Python
backend; the e2e-backends harness now covers the same ground uniformly
alongside llama-cpp and ik-llama-cpp.
2026-04-12 14:51:58 +00:00
Ettore Di Giacinto
034a60bf76 ci(backend): build cpu-vllm container image
Add the cpu-vllm variant to the backend container build matrix so the
image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development)
is actually produced by CI.

Follows the same pattern as the other CPU python backends
(cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA.
backend_pr.yml auto-picks this up via its matrix filter from backend.yml.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
c99188f106 fix(vllm): tool parser constructor compat + e2e tool calling test
Concrete vLLM tool parsers override the abstract base's __init__ and
drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer).
Instantiating with tools= raised TypeError which was silently caught,
leaving chat_deltas.tool_calls empty.

Retry the constructor without the tools kwarg on TypeError — tools
aren't required by these parsers since extract_tool_calls finds tool
syntax in the raw model output directly.

Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU:
the backend correctly returns ToolCallDelta{name='get_weather',
arguments='{"location": "Paris, France"}'} in ChatDelta.

test_tool_calls.py is a standalone smoke test that spawns the gRPC
backend, sends a chat completion with tools, and asserts the response
contains a structured tool call.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
c2f73a987e fix(vllm): CPU build compatibility with vllm 0.14.1
Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict,
TokenizeString, Free all working).

- requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from
  GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU
  wheel whose torch dependency resolves against published PyTorch builds
  (torch==2.9.1+cpu). Later vllm CPU wheels currently require
  torch==2.10.0+cpu which is only available on the PyTorch test channel
  with incompatible torchvision.
- requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio
  so uv resolves them consistently from the PyTorch CPU index.
- install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv
  can mix the PyTorch index and PyPI for transitive deps (matches the
  existing intel profile behaviour).
- backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config
  so the old code path errored out with AttributeError on model load.
  Switch to the new get_tokenizer()/tokenizer accessor with a fallback
  to building the tokenizer directly from request.Model.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
b215843807 feat(vllm): CPU support + shared utils + vllm-omni feature parity
- Split vllm install per acceleration: move generic `vllm` out of
  requirements-after.txt into per-profile after files (cublas12, hipblas,
  intel) and add CPU wheel URL for cpu-after.txt
- requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index
- backend/index.yaml: register cpu-vllm / cpu-vllm-development variants
- New backend/python/common/vllm_utils.py: shared parse_options,
  messages_to_dicts, setup_parsers helpers (used by both vllm backends)
- vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template,
  wire native parsers via shared utils, emit ChatDelta with token counts,
  add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE
- Add test_cpu_inference.py: standalone script to validate CPU build with
  a small model (Qwen2.5-0.5B-Instruct)
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
6786f05c64 feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs
- Use vLLM's ToolParserManager/ReasoningParserManager to extract structured
  output (tool calls, reasoning content) instead of reimplementing parsing
- Convert proto Messages to dicts and pass tools to apply_chat_template
- Emit ChatDelta with content/reasoning_content/tool_calls in Reply
- Extract prompt_tokens, completion_tokens, and logprobs from output
- Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar
- Add TokenizeString and Free RPC methods
- Fix missing `time` import used by load_video()
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
6cf8263c30 feat(config): add vLLM parser defaults hook and importer auto-detection
Introduces parser_defaults.json mapping model families to vLLM
tool_parser/reasoning_parser names, with longest-pattern-first matching.

The vllmDefaults hook auto-fills tool_parser and reasoning_parser
options at load time for known families, while the VLLMImporter writes
the same values into generated YAML so users can review and edit them.

Adds tests covering MatchParserDefaults, hook registration via
SetDefaults, and the user-override behavior.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
a30719f04a refactor(config): introduce backend hook system and migrate llama-cpp defaults
Adds RegisterBackendHook/runBackendHooks so each backend can register
default-filling functions that run during ModelConfig.SetDefaults().

Migrates the existing GGUF guessing logic into hooks_llamacpp.go,
registered for both 'llama-cpp' and the empty backend (auto-detect).
Removes the old guesser.go shim.
2026-04-12 14:48:28 +00:00
Ettore Di Giacinto
40b1c6f943 fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto
The ToProto conversion was dropping tool_call_id and reasoning_content
even though both proto and Go fields existed, breaking multi-turn tool
calling and reasoning passthrough to backends.
2026-04-12 14:48:28 +00:00
183 changed files with 2375 additions and 13120 deletions

View File

@@ -129,30 +129,6 @@ After adding a new backend, verify:
- [ ] No Makefile syntax errors (check with linter)
- [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
## Bundling runtime shared libraries (`package.sh`)
The final `Dockerfile.python` stage is `FROM scratch` — there is no system `libc`, no `apt`, no fallback library path. Only files explicitly copied from the builder stage end up in the backend image. That means any runtime `dlopen` your backend (or its Python deps) needs **must** be packaged into `${BACKEND}/lib/`.
Pattern:
1. Make sure the library is installed in the builder stage of `backend/Dockerfile.python` (add it to the top-level `apt-get install`).
2. Drop a `package.sh` in your backend directory that copies the library — and its soname symlinks — into `$(dirname $0)/lib`. See `backend/python/vllm/package.sh` for a reference implementation that walks `/usr/lib/x86_64-linux-gnu`, `/usr/lib/aarch64-linux-gnu`, etc.
3. `Dockerfile.python` already runs `package.sh` automatically if it exists, after `package-gpu-libs.sh`.
4. `libbackend.sh` automatically prepends `${EDIR}/lib` to `LD_LIBRARY_PATH` at run time, so anything packaged this way is found by `dlopen`.
How to find missing libs: when a Python module silently fails to register torch ops or you see `AttributeError: '_OpNamespace' '...' object has no attribute '...'`, run the backend image's Python with `LD_DEBUG=libs` to see which `dlopen` failed. The filename in the error message (e.g. `libnuma.so.1`) is what you need to package.
To verify packaging works without trusting the host:
```bash
make docker-build-<backend>
CID=$(docker create --entrypoint=/run.sh local-ai-backend:<backend>)
docker cp $CID:/lib /tmp/check && docker rm $CID
ls /tmp/check # expect the bundled .so files + symlinks
```
Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.
## 6. Example: Adding a Python Backend
For reference, when `moonshine` was added:

View File

@@ -1,115 +0,0 @@
# Working on the vLLM Backend
The vLLM backend lives at `backend/python/vllm/backend.py` (async gRPC) and the multimodal variant at `backend/python/vllm-omni/backend.py` (sync gRPC). Both wrap vLLM's `AsyncLLMEngine` / `Omni` and translate the LocalAI gRPC `PredictOptions` into vLLM `SamplingParams` + outputs into `Reply.chat_deltas`.
This file captures the non-obvious bits — most of the bring-up was a single PR (`feat/vllm-parity`) and the things below are easy to get wrong.
## Tool calling and reasoning use vLLM's *native* parsers
Do not write regex-based tool-call extractors for vLLM. vLLM ships:
- `vllm.tool_parsers.ToolParserManager` — 50+ registered parsers (`hermes`, `llama3_json`, `llama4_pythonic`, `mistral`, `qwen3_xml`, `deepseek_v3`, `granite4`, `openai`, `kimi_k2`, `glm45`, …)
- `vllm.reasoning.ReasoningParserManager` — 25+ registered parsers (`deepseek_r1`, `qwen3`, `mistral`, `gemma4`, …)
Both can be used standalone: instantiate with a tokenizer, call `extract_tool_calls(text, request=None)` / `extract_reasoning(text, request=None)`. The backend stores the parser *classes* on `self.tool_parser_cls` / `self.reasoning_parser_cls` at LoadModel time and instantiates them per request.
**Selection:** vLLM does *not* auto-detect parsers from model name — neither does the LocalAI backend. The user (or `core/config/hooks_vllm.go`) must pick one and pass it via `Options[]`:
```yaml
options:
- tool_parser:hermes
- reasoning_parser:qwen3
```
Auto-defaults for known model families live in `core/config/parser_defaults.json` and are applied:
- at gallery import time by `core/gallery/importers/vllm.go`
- at model load time by the `vllm` / `vllm-omni` backend hook in `core/config/hooks_vllm.go`
User-supplied `tool_parser:`/`reasoning_parser:` in the config wins over defaults — the hook checks for existing entries before appending.
**When to update `parser_defaults.json`:** any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by *family pattern* matched against `normalizeModelID(cfg.Model)` (lowercase, org-prefix stripped, `_``-`). Patterns are checked **longest-first** — keep `qwen3.5` before `qwen3`, `llama-3.3` before `llama-3`, etc., or the wrong family wins. Add a covering test in `core/config/hooks_test.go`.
**Sister file — `core/config/inference_defaults.json`:** same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by `core/config/inference_defaults.go` and applied by `ApplyInferenceDefaults()`. The schema is `map[string]float64` only — *strings don't fit*, which is why parser defaults needed their own JSON file. The inference file is **auto-generated from unsloth** via `go generate ./core/config/` (see `core/config/gen_inference_defaults/`) — don't hand-edit it; instead update the upstream source or regenerate. Both files share `normalizeModelID()` and the longest-first pattern ordering.
**Constructor compatibility gotcha:** the abstract `ToolParser.__init__` accepts `tools=`, but several concrete parsers (Hermes2ProToolParser, etc.) override `__init__` and *only* accept `tokenizer`. Always:
```python
try:
tp = self.tool_parser_cls(self.tokenizer, tools=tools)
except TypeError:
tp = self.tool_parser_cls(self.tokenizer)
```
## ChatDelta is the streaming contract
The Go side (`core/backend/llm.go`, `pkg/functions/chat_deltas.go`) consumes `Reply.chat_deltas` to assemble the OpenAI response. For tool calls to surface in `chat/completions`, the Python backend **must** populate `Reply.chat_deltas[].tool_calls` with `ToolCallDelta{index, id, name, arguments}`. Returning the raw `<tool_call>...</tool_call>` text in `Reply.message` is *not* enough — the Go regex fallback exists for llama.cpp, not for vllm.
Same story for `reasoning_content` — emit it on `ChatDelta.reasoning_content`, not as part of `content`.
## Message conversion to chat templates
`tokenizer.apply_chat_template()` expects a list of dicts, not proto Messages. The shared helper in `backend/python/common/vllm_utils.py` (`messages_to_dicts`) handles the mapping including:
- `tool_call_id` and `name` for `role="tool"` messages
- `tool_calls` JSON-string field → parsed Python list for `role="assistant"`
- `reasoning_content` for thinking models
Pass `tools=json.loads(request.Tools)` and (when `request.Metadata.get("enable_thinking") == "true"`) `enable_thinking=True` to `apply_chat_template`. Wrap in `try/except TypeError` because not every tokenizer template accepts those kwargs.
## CPU support and the SIMD/library minefield
vLLM publishes prebuilt CPU wheels at `https://github.com/vllm-project/vllm/releases/...`. The pin lives in `backend/python/vllm/requirements-cpu-after.txt`.
**Version compatibility — important:** newer vllm CPU wheels (≥ 0.15) declare `torch==2.10.0+cpu` as a hard dep, but `torch==2.10.0` only exists on the PyTorch test channel and pulls in an incompatible `torchvision`. Stay on **`vllm 0.14.1+cpu` + `torch 2.9.1+cpu`** until both upstream catch up. Bumping requires verifying torchvision/torchaudio match.
`requirements-cpu.txt` uses `--extra-index-url https://download.pytorch.org/whl/cpu`. `install.sh` adds `--index-strategy=unsafe-best-match` for the `cpu` profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.
**SIMD baseline:** the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing `vllm.model_executor.models.registry` SIGILLs at `_run_in_subprocess` time during model inspection. There is no runtime flag to disable it. Workarounds:
1. **Run on a host with the right SIMD baseline** (default — fast)
2. **Build from source** with `FROM_SOURCE=true` env var. Plumbing exists end-to-end:
- `install.sh` hides `requirements-cpu-after.txt`, runs `installRequirements` for the base deps, then clones vllm and `VLLM_TARGET_DEVICE=cpu uv pip install --no-deps .`
- `backend/Dockerfile.python` declares `ARG FROM_SOURCE` + `ENV FROM_SOURCE`
- `Makefile` `docker-build-backend` macro forwards `--build-arg FROM_SOURCE=$(FROM_SOURCE)` when set
- Source build takes 3050 minutes — too slow for per-PR CI but fine for local.
**Runtime shared libraries:** vLLM's `vllm._C` extension `dlopen`s `libnuma.so.1` at import time. If missing, the C extension silently fails and `torch.ops._C_utils.init_cpu_threads_env` is never registered → `EngineCore` crashes on `init_device` with:
```
AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'
```
`backend/python/vllm/package.sh` bundles `libnuma.so.1` and `libgomp.so.1` into `${BACKEND}/lib/`, which `libbackend.sh` adds to `LD_LIBRARY_PATH` at run time. The builder stage in `backend/Dockerfile.python` installs `libnuma1`/`libgomp1` so package.sh has something to copy. Do *not* assume the production host has these — backend images are `FROM scratch`.
## Backend hook system (`core/config/backend_hooks.go`)
Per-backend defaults that used to be hardcoded in `ModelConfig.Prepare()` now live in `core/config/hooks_*.go` files and self-register via `init()`:
- `hooks_llamacpp.go` → GGUF metadata parsing, context size, GPU layers, jinja template
- `hooks_vllm.go` → tool/reasoning parser auto-selection from `parser_defaults.json`
Hook keys:
- `"llama-cpp"`, `"vllm"`, `"vllm-omni"`, … — backend-specific
- `""` — runs only when `cfg.Backend` is empty (auto-detect case)
- `"*"` — global catch-all, runs for every backend before specific hooks
Multiple hooks per key are supported and run in registration order. Adding a new backend default:
```go
// core/config/hooks_<backend>.go
func init() {
RegisterBackendHook("<backend>", myDefaults)
}
func myDefaults(cfg *ModelConfig, modelPath string) {
// only fill in fields the user didn't set
}
```
## The `Messages.ToProto()` fields you need to set
`core/schema/message.go:ToProto()` must serialize:
- `ToolCallID``proto.Message.ToolCallId` (for `role="tool"` messages — links result back to the call)
- `Reasoning``proto.Message.ReasoningContent`
- `ToolCalls``proto.Message.ToolCalls` (JSON-encoded string)
These were originally not serialized and tool-calling conversations broke silently — the C++ llama.cpp backend reads them but always got empty strings. Any new field added to `schema.Message` *and* `proto.Message` needs a matching line in `ToProto()`.

446
.github/gallery-agent/agent.go vendored Normal file
View File

@@ -0,0 +1,446 @@
package main
import (
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"regexp"
"slices"
"strings"
"github.com/ghodss/yaml"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
"github.com/mudler/cogito"
"github.com/mudler/cogito/clients"
"github.com/mudler/cogito/structures"
"github.com/sashabaranov/go-openai/jsonschema"
)
var (
openAIModel = os.Getenv("OPENAI_MODEL")
openAIKey = os.Getenv("OPENAI_KEY")
openAIBaseURL = os.Getenv("OPENAI_BASE_URL")
galleryIndexPath = os.Getenv("GALLERY_INDEX_PATH")
//defaultclient
llm = clients.NewOpenAILLM(openAIModel, openAIKey, openAIBaseURL)
)
// cleanTextContent removes trailing spaces, tabs, and normalizes line endings
// to prevent YAML linting issues like trailing spaces and multiple empty lines
func cleanTextContent(text string) string {
lines := strings.Split(text, "\n")
var cleanedLines []string
var prevEmpty bool
for _, line := range lines {
// Remove all trailing whitespace (spaces, tabs, etc.)
trimmed := strings.TrimRight(line, " \t\r")
// Avoid multiple consecutive empty lines
if trimmed == "" {
if !prevEmpty {
cleanedLines = append(cleanedLines, "")
}
prevEmpty = true
} else {
cleanedLines = append(cleanedLines, trimmed)
prevEmpty = false
}
}
// Remove trailing empty lines from the result
result := strings.Join(cleanedLines, "\n")
return stripThinkingTags(strings.TrimRight(result, "\n"))
}
type galleryModel struct {
Name string `yaml:"name"`
Urls []string `yaml:"urls"`
}
// isModelExisting checks if a specific model ID exists in the gallery using text search
func isModelExisting(modelID string) (bool, error) {
indexPath := getGalleryIndexPath()
content, err := os.ReadFile(indexPath)
if err != nil {
return false, fmt.Errorf("failed to read %s: %w", indexPath, err)
}
var galleryModels []galleryModel
err = yaml.Unmarshal(content, &galleryModels)
if err != nil {
return false, fmt.Errorf("failed to unmarshal %s: %w", indexPath, err)
}
for _, galleryModel := range galleryModels {
if slices.Contains(galleryModel.Urls, modelID) {
return true, nil
}
}
return false, nil
}
// filterExistingModels removes models that already exist in the gallery
func filterExistingModels(models []ProcessedModel) ([]ProcessedModel, error) {
var filteredModels []ProcessedModel
for _, model := range models {
exists, err := isModelExisting(model.ModelID)
if err != nil {
fmt.Printf("Error checking if model %s exists: %v, skipping\n", model.ModelID, err)
continue
}
if !exists {
filteredModels = append(filteredModels, model)
} else {
fmt.Printf("Skipping existing model: %s\n", model.ModelID)
}
}
fmt.Printf("Filtered out %d existing models, %d new models remaining\n",
len(models)-len(filteredModels), len(filteredModels))
return filteredModels, nil
}
// getGalleryIndexPath returns the gallery index file path, with a default fallback
func getGalleryIndexPath() string {
if galleryIndexPath != "" {
return galleryIndexPath
}
return "gallery/index.yaml"
}
func stripThinkingTags(content string) string {
// Remove content between <thinking> and </thinking> (including multi-line)
content = regexp.MustCompile(`(?s)<thinking>.*?</thinking>`).ReplaceAllString(content, "")
// Remove content between <think> and </think> (including multi-line)
content = regexp.MustCompile(`(?s)<think>.*?</think>`).ReplaceAllString(content, "")
// Clean up any extra whitespace
content = strings.TrimSpace(content)
return content
}
func getRealReadme(ctx context.Context, repository string) (string, error) {
// Create a conversation fragment
fragment := cogito.NewEmptyFragment().
AddMessage("user",
`Your task is to get a clear description of a large language model from huggingface by using the provided tool. I will share with you a repository that might be quantized, and as such probably not by the original model author. We need to get the real description of the model, and not the one that might be quantized. You will have to call the tool to get the readme more than once by figuring out from the quantized readme which is the base model readme. This is the repository: `+repository)
// Execute with tools
result, err := cogito.ExecuteTools(llm, fragment,
cogito.WithIterations(3),
cogito.WithMaxAttempts(3),
cogito.DisableSinkState,
cogito.WithTools(&HFReadmeTool{client: hfapi.NewClient()}))
if err != nil {
return "", err
}
result = result.AddMessage("user", "Describe the model in a clear and concise way that can be shared in a model gallery.")
// Get a response
_, err = llm.Ask(ctx, result)
if err != nil {
return "", err
}
content := result.LastMessage().Content
return cleanTextContent(content), nil
}
func selectMostInterestingModels(ctx context.Context, searchResult *SearchResult) ([]ProcessedModel, error) {
if len(searchResult.Models) == 1 {
return searchResult.Models, nil
}
// Create a conversation fragment
fragment := cogito.NewEmptyFragment().
AddMessage("user",
`Your task is to analyze a list of AI models and select the most interesting ones for a model gallery. You will be given detailed information about multiple models including their metadata, file information, and README content.
Consider the following criteria when selecting models:
1. Model popularity (download count)
2. Model recency (last modified date)
3. Model completeness (has preferred model file, README, etc.)
4. Model uniqueness (not duplicates or very similar models)
5. Model quality (based on README content and description)
6. Model utility (practical applications)
You should select models that would be most valuable for users browsing a model gallery. Prioritize models that are:
- Well-documented with clear READMEs
- Recently updated
- Popular (high download count)
- Have the preferred quantization format available
- Offer unique capabilities or are from reputable authors
Return your analysis and selection reasoning.`)
// Add the search results as context
modelsInfo := fmt.Sprintf("Found %d models matching '%s' with quantization preference '%s':\n\n",
searchResult.TotalModelsFound, searchResult.SearchTerm, searchResult.Quantization)
for i, model := range searchResult.Models {
modelsInfo += fmt.Sprintf("Model %d:\n", i+1)
modelsInfo += fmt.Sprintf(" ID: %s\n", model.ModelID)
modelsInfo += fmt.Sprintf(" Author: %s\n", model.Author)
modelsInfo += fmt.Sprintf(" Downloads: %d\n", model.Downloads)
modelsInfo += fmt.Sprintf(" Last Modified: %s\n", model.LastModified)
modelsInfo += fmt.Sprintf(" Files: %d files\n", len(model.Files))
if model.PreferredModelFile != nil {
modelsInfo += fmt.Sprintf(" Preferred Model File: %s (%d bytes)\n",
model.PreferredModelFile.Path, model.PreferredModelFile.Size)
} else {
modelsInfo += " No preferred model file found\n"
}
if model.ReadmeContent != "" {
modelsInfo += fmt.Sprintf(" README: %s\n", model.ReadmeContent)
}
if model.ProcessingError != "" {
modelsInfo += fmt.Sprintf(" Processing Error: %s\n", model.ProcessingError)
}
modelsInfo += "\n"
}
fragment = fragment.AddMessage("user", modelsInfo)
fragment = fragment.AddMessage("user", "Based on your analysis, select the top 5 most interesting models and provide a brief explanation for each selection. Also, create a filtered SearchResult with only the selected models. Return just a list of repositories IDs, you will later be asked to output it as a JSON array with the json tool.")
// Get a response
newFragment, err := llm.Ask(ctx, fragment)
if err != nil {
return nil, err
}
fmt.Println(newFragment.LastMessage().Content)
repositories := struct {
Repositories []string `json:"repositories"`
}{}
s := structures.Structure{
Schema: jsonschema.Definition{
Type: jsonschema.Object,
AdditionalProperties: false,
Properties: map[string]jsonschema.Definition{
"repositories": {
Type: jsonschema.Array,
Items: &jsonschema.Definition{Type: jsonschema.String},
Description: "The trending repositories IDs",
},
},
Required: []string{"repositories"},
},
Object: &repositories,
}
err = newFragment.ExtractStructure(ctx, llm, s)
if err != nil {
return nil, err
}
filteredModels := []ProcessedModel{}
for _, m := range searchResult.Models {
if slices.Contains(repositories.Repositories, m.ModelID) {
filteredModels = append(filteredModels, m)
}
}
return filteredModels, nil
}
// ModelMetadata represents extracted metadata from a model
type ModelMetadata struct {
Tags []string `json:"tags"`
License string `json:"license"`
}
// extractModelMetadata extracts tags and license from model README and documentation
func extractModelMetadata(ctx context.Context, model ProcessedModel) ([]string, string, error) {
// Create a conversation fragment
fragment := cogito.NewEmptyFragment().
AddMessage("user",
`Your task is to extract metadata from an AI model's README and documentation. You will be provided with:
1. Model information (ID, author, description)
2. README content
You need to extract:
1. **Tags**: An array of relevant tags that describe the model. Use common tags from the gallery such as:
- llm, gguf, gpu, cpu, multimodal, image-to-text, text-to-text, text-to-speech, tts
- thinking, reasoning, chat, instruction-tuned, code, vision
- Model family names (e.g., llama, qwen, mistral, gemma) if applicable
- Any other relevant descriptive tags
Select 3-8 most relevant tags.
2. **License**: The license identifier (e.g., "apache-2.0", "mit", "llama2", "gpl-3.0", "bsd", "cc-by-4.0").
If no license is found, return an empty string.
Return the extracted metadata in a structured format.`)
// Add model information
modelInfo := "Model Information:\n"
modelInfo += fmt.Sprintf(" ID: %s\n", model.ModelID)
modelInfo += fmt.Sprintf(" Author: %s\n", model.Author)
modelInfo += fmt.Sprintf(" Downloads: %d\n", model.Downloads)
if model.ReadmeContent != "" {
modelInfo += fmt.Sprintf(" README Content:\n%s\n", model.ReadmeContent)
} else if model.ReadmeContentPreview != "" {
modelInfo += fmt.Sprintf(" README Preview: %s\n", model.ReadmeContentPreview)
}
fragment = fragment.AddMessage("user", modelInfo)
fragment = fragment.AddMessage("user", "Extract the tags and license from the model information. Return the metadata as a JSON object with 'tags' (array of strings) and 'license' (string).")
// Get a response
newFragment, err := llm.Ask(ctx, fragment)
if err != nil {
return nil, "", err
}
// Extract structured metadata
metadata := ModelMetadata{}
s := structures.Structure{
Schema: jsonschema.Definition{
Type: jsonschema.Object,
AdditionalProperties: false,
Properties: map[string]jsonschema.Definition{
"tags": {
Type: jsonschema.Array,
Items: &jsonschema.Definition{Type: jsonschema.String},
Description: "Array of relevant tags describing the model",
},
"license": {
Type: jsonschema.String,
Description: "License identifier (e.g., apache-2.0, mit, llama2). Empty string if not found.",
},
},
Required: []string{"tags", "license"},
},
Object: &metadata,
}
err = newFragment.ExtractStructure(ctx, llm, s)
if err != nil {
return nil, "", err
}
return metadata.Tags, metadata.License, nil
}
// extractIconFromReadme scans the README content for image URLs and returns the first suitable icon URL found
func extractIconFromReadme(readmeContent string) string {
if readmeContent == "" {
return ""
}
// Regular expressions to match image URLs in various formats (case-insensitive)
// Match markdown image syntax: ![alt](url) - case insensitive extensions
markdownImageRegex := regexp.MustCompile(`(?i)!\[[^\]]*\]\(([^)]+\.(png|jpg|jpeg|svg|webp|gif))\)`)
// Match HTML img tags: <img src="url">
htmlImageRegex := regexp.MustCompile(`(?i)<img[^>]+src=["']([^"']+\.(png|jpg|jpeg|svg|webp|gif))["']`)
// Match plain URLs ending with image extensions
plainImageRegex := regexp.MustCompile(`(?i)https?://[^\s<>"']+\.(png|jpg|jpeg|svg|webp|gif)`)
// Try markdown format first
matches := markdownImageRegex.FindStringSubmatch(readmeContent)
if len(matches) > 1 && matches[1] != "" {
url := strings.TrimSpace(matches[1])
// Prefer HuggingFace CDN URLs or absolute URLs
if strings.HasPrefix(strings.ToLower(url), "http") {
return url
}
}
// Try HTML img tags
matches = htmlImageRegex.FindStringSubmatch(readmeContent)
if len(matches) > 1 && matches[1] != "" {
url := strings.TrimSpace(matches[1])
if strings.HasPrefix(strings.ToLower(url), "http") {
return url
}
}
// Try plain URLs
matches = plainImageRegex.FindStringSubmatch(readmeContent)
if len(matches) > 0 {
url := strings.TrimSpace(matches[0])
if strings.HasPrefix(strings.ToLower(url), "http") {
return url
}
}
return ""
}
// getHuggingFaceAvatarURL attempts to get the HuggingFace avatar URL for a user
func getHuggingFaceAvatarURL(author string) string {
if author == "" {
return ""
}
// Try to fetch user info from HuggingFace API
// HuggingFace API endpoint: https://huggingface.co/api/users/{username}
baseURL := "https://huggingface.co"
userURL := fmt.Sprintf("%s/api/users/%s", baseURL, author)
req, err := http.NewRequest("GET", userURL, nil)
if err != nil {
return ""
}
client := &http.Client{}
resp, err := client.Do(req)
if err != nil {
return ""
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return ""
}
// Parse the response to get avatar URL
var userInfo map[string]any
body, err := io.ReadAll(resp.Body)
if err != nil {
return ""
}
if err := json.Unmarshal(body, &userInfo); err != nil {
return ""
}
// Try to extract avatar URL from response
if avatar, ok := userInfo["avatarUrl"].(string); ok && avatar != "" {
return avatar
}
if avatar, ok := userInfo["avatar"].(string); ok && avatar != "" {
return avatar
}
return ""
}
// extractModelIcon extracts icon URL from README or falls back to HuggingFace avatar
func extractModelIcon(model ProcessedModel) string {
// First, try to extract icon from README
if icon := extractIconFromReadme(model.ReadmeContent); icon != "" {
return icon
}
// Fallback: Try to get HuggingFace user avatar
if model.Author != "" {
if avatar := getHuggingFaceAvatarURL(model.Author); avatar != "" {
return avatar
}
}
return ""
}

View File

@@ -7,8 +7,8 @@ import (
"os"
"strings"
"github.com/ghodss/yaml"
"github.com/mudler/LocalAI/core/gallery/importers"
"sigs.k8s.io/yaml"
)
func formatTextContent(text string) string {

View File

@@ -1,301 +0,0 @@
package main
import (
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"regexp"
"strings"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
"sigs.k8s.io/yaml"
)
var galleryIndexPath = os.Getenv("GALLERY_INDEX_PATH")
// getGalleryIndexPath returns the gallery index file path, with a default fallback
func getGalleryIndexPath() string {
if galleryIndexPath != "" {
return galleryIndexPath
}
return "gallery/index.yaml"
}
type galleryModel struct {
Name string `yaml:"name"`
Urls []string `yaml:"urls"`
}
// loadGalleryURLSet parses gallery/index.yaml once and returns the set of
// HuggingFace model URLs already present in the gallery.
func loadGalleryURLSet() (map[string]struct{}, error) {
indexPath := getGalleryIndexPath()
content, err := os.ReadFile(indexPath)
if err != nil {
return nil, fmt.Errorf("failed to read %s: %w", indexPath, err)
}
var galleryModels []galleryModel
if err := yaml.Unmarshal(content, &galleryModels); err != nil {
return nil, fmt.Errorf("failed to unmarshal %s: %w", indexPath, err)
}
set := make(map[string]struct{}, len(galleryModels))
for _, gm := range galleryModels {
for _, u := range gm.Urls {
set[u] = struct{}{}
}
}
// Also skip URLs already proposed in open (unmerged) gallery-agent PRs.
// The workflow injects these via EXTRA_SKIP_URLS so we don't keep
// re-proposing the same model every run while a PR is waiting to merge.
for _, line := range strings.FieldsFunc(os.Getenv("EXTRA_SKIP_URLS"), func(r rune) bool {
return r == '\n' || r == ',' || r == ' '
}) {
u := strings.TrimSpace(line)
if u != "" {
set[u] = struct{}{}
}
}
return set, nil
}
// modelAlreadyInGallery checks whether a HuggingFace model repo is already
// referenced in the gallery URL set.
func modelAlreadyInGallery(set map[string]struct{}, modelID string) bool {
_, ok := set["https://huggingface.co/"+modelID]
return ok
}
// baseModelFromTags returns the first `base_model:<repo>` value found in the
// tag list, or "" if none is present. HuggingFace surfaces the base model
// declared in the model card's YAML frontmatter as such a tag.
func baseModelFromTags(tags []string) string {
for _, t := range tags {
if strings.HasPrefix(t, "base_model:") {
return strings.TrimPrefix(t, "base_model:")
}
}
return ""
}
// licenseFromTags returns the `license:<id>` value from the tag list, or "".
func licenseFromTags(tags []string) string {
for _, t := range tags {
if strings.HasPrefix(t, "license:") {
return strings.TrimPrefix(t, "license:")
}
}
return ""
}
// curatedTags produces the gallery tag list from HuggingFace's raw tag set.
// Always includes llm + gguf, then adds whitelisted family / capability
// markers when they appear in the HF tag list.
func curatedTags(hfTags []string) []string {
whitelist := []string{
"gpu", "cpu",
"llama", "mistral", "mixtral", "qwen", "qwen2", "qwen3",
"gemma", "gemma2", "gemma3", "phi", "phi3", "phi4",
"deepseek", "yi", "falcon", "command-r",
"vision", "multimodal", "code", "chat",
"instruction-tuned", "reasoning", "thinking",
}
seen := map[string]struct{}{}
out := []string{"llm", "gguf"}
seen["llm"] = struct{}{}
seen["gguf"] = struct{}{}
hfSet := map[string]struct{}{}
for _, t := range hfTags {
hfSet[strings.ToLower(t)] = struct{}{}
}
for _, w := range whitelist {
if _, ok := hfSet[w]; ok {
if _, dup := seen[w]; !dup {
out = append(out, w)
seen[w] = struct{}{}
}
}
}
return out
}
// resolveReadme fetches a description-quality README for a (possibly
// quantized) repo: if a `base_model:` tag is present, fetch the base repo's
// README; otherwise fall back to the repo's own README.
func resolveReadme(client *hfapi.Client, modelID string, hfTags []string) (string, error) {
if base := baseModelFromTags(hfTags); base != "" && base != modelID {
if content, err := client.GetReadmeContent(base, "README.md"); err == nil && strings.TrimSpace(content) != "" {
return cleanTextContent(content), nil
}
}
content, err := client.GetReadmeContent(modelID, "README.md")
if err != nil {
return "", err
}
return cleanTextContent(content), nil
}
// extractDescription turns a raw HuggingFace README into a concise plain-text
// description suitable for embedding in gallery/index.yaml: strips YAML
// frontmatter, HTML tags/comments, markdown images, link URLs (keeping the
// link text), markdown tables, and then truncates at a paragraph boundary
// around ~1200 characters. Raw README should still be used for icon
// extraction — call this only for the `description:` field.
func extractDescription(readme string) string {
s := readme
// Strip leading YAML frontmatter: `---\n...\n---\n` at start of file.
if strings.HasPrefix(strings.TrimLeft(s, " \t\n"), "---") {
trimmed := strings.TrimLeft(s, " \t\n")
rest := strings.TrimPrefix(trimmed, "---")
if idx := strings.Index(rest, "\n---"); idx >= 0 {
after := rest[idx+len("\n---"):]
after = strings.TrimPrefix(after, "\n")
s = after
}
}
// Strip HTML comments and tags.
s = regexp.MustCompile(`(?s)<!--.*?-->`).ReplaceAllString(s, "")
s = regexp.MustCompile(`(?is)<[^>]+>`).ReplaceAllString(s, "")
// Strip markdown images entirely.
s = regexp.MustCompile(`!\[[^\]]*\]\([^)]*\)`).ReplaceAllString(s, "")
// Replace markdown links `[text](url)` with just `text`.
s = regexp.MustCompile(`\[([^\]]+)\]\([^)]+\)`).ReplaceAllString(s, "$1")
// Drop table lines and horizontal rules, and flatten all leading
// whitespace: generateYAMLEntry embeds this under a `description: |`
// literal block whose indentation is set by the first non-empty line.
// If any line has extra leading whitespace (e.g. from an indented
// `<p align="center">` block in the original README), YAML will pick
// that up as the block's indent and every later line at a smaller
// indent blows the block scalar. Stripping leading whitespace here
// guarantees uniform 4-space indentation after formatTextContent runs.
var kept []string
for _, line := range strings.Split(s, "\n") {
t := strings.TrimLeft(line, " \t")
ts := strings.TrimSpace(t)
if strings.HasPrefix(ts, "|") {
continue
}
if strings.HasPrefix(ts, ":--") || strings.HasPrefix(ts, "---") || strings.HasPrefix(ts, "===") {
continue
}
kept = append(kept, t)
}
s = strings.Join(kept, "\n")
// Normalise whitespace and drop any leading blank lines so the literal
// block in YAML doesn't start with a blank first line (which would
// break the indentation detector the same way).
s = cleanTextContent(s)
s = strings.TrimLeft(s, " \t\n")
// Truncate at a paragraph boundary around maxLen chars.
const maxLen = 1200
if len(s) > maxLen {
cut := strings.LastIndex(s[:maxLen], "\n\n")
if cut < maxLen/3 {
cut = maxLen
}
s = strings.TrimRight(s[:cut], " \t\n") + "\n\n..."
}
return s
}
// cleanTextContent removes trailing spaces/tabs and collapses multiple empty
// lines so README content embeds cleanly into YAML without lint noise.
func cleanTextContent(text string) string {
lines := strings.Split(text, "\n")
var cleaned []string
var prevEmpty bool
for _, line := range lines {
trimmed := strings.TrimRight(line, " \t\r")
if trimmed == "" {
if !prevEmpty {
cleaned = append(cleaned, "")
}
prevEmpty = true
} else {
cleaned = append(cleaned, trimmed)
prevEmpty = false
}
}
return strings.TrimRight(strings.Join(cleaned, "\n"), "\n")
}
// extractIconFromReadme scans README content for an image URL usable as a
// gallery entry icon.
func extractIconFromReadme(readmeContent string) string {
if readmeContent == "" {
return ""
}
markdownImageRegex := regexp.MustCompile(`(?i)!\[[^\]]*\]\(([^)]+\.(png|jpg|jpeg|svg|webp|gif))\)`)
htmlImageRegex := regexp.MustCompile(`(?i)<img[^>]+src=["']([^"']+\.(png|jpg|jpeg|svg|webp|gif))["']`)
plainImageRegex := regexp.MustCompile(`(?i)https?://[^\s<>"']+\.(png|jpg|jpeg|svg|webp|gif)`)
if m := markdownImageRegex.FindStringSubmatch(readmeContent); len(m) > 1 && strings.HasPrefix(strings.ToLower(m[1]), "http") {
return strings.TrimSpace(m[1])
}
if m := htmlImageRegex.FindStringSubmatch(readmeContent); len(m) > 1 && strings.HasPrefix(strings.ToLower(m[1]), "http") {
return strings.TrimSpace(m[1])
}
if m := plainImageRegex.FindStringSubmatch(readmeContent); len(m) > 0 && strings.HasPrefix(strings.ToLower(m[0]), "http") {
return strings.TrimSpace(m[0])
}
return ""
}
// getHuggingFaceAvatarURL returns the HF avatar URL for a user, or "".
func getHuggingFaceAvatarURL(author string) string {
if author == "" {
return ""
}
userURL := fmt.Sprintf("https://huggingface.co/api/users/%s/overview", author)
resp, err := http.Get(userURL)
if err != nil {
return ""
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return ""
}
body, err := io.ReadAll(resp.Body)
if err != nil {
return ""
}
var info map[string]any
if err := json.Unmarshal(body, &info); err != nil {
return ""
}
if v, ok := info["avatarUrl"].(string); ok && v != "" {
return v
}
if v, ok := info["avatar"].(string); ok && v != "" {
return v
}
return ""
}
// extractModelIcon extracts an icon URL from the README, falling back to the
// HuggingFace user avatar.
func extractModelIcon(model ProcessedModel) string {
if icon := extractIconFromReadme(model.ReadmeContent); icon != "" {
return icon
}
if model.Author != "" {
if avatar := getHuggingFaceAvatarURL(model.Author); avatar != "" {
return avatar
}
}
return ""
}

View File

@@ -6,6 +6,7 @@ import (
"fmt"
"os"
"strconv"
"strings"
"time"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
@@ -38,6 +39,16 @@ type ProcessedModel struct {
Icon string `json:"icon,omitempty"`
}
// SearchResult represents the complete result of searching and processing models
type SearchResult struct {
SearchTerm string `json:"search_term"`
Limit int `json:"limit"`
Quantization string `json:"quantization"`
TotalModelsFound int `json:"total_models_found"`
Models []ProcessedModel `json:"models"`
FormattedOutput string `json:"formatted_output"`
}
// AddedModelSummary represents a summary of models added to the gallery
type AddedModelSummary struct {
SearchTerm string `json:"search_term"`
@@ -52,16 +63,19 @@ type AddedModelSummary struct {
func main() {
startTime := time.Now()
// Synthetic mode for local testing
if sm := os.Getenv("SYNTHETIC_MODE"); sm == "true" || sm == "1" {
// Check for synthetic mode
syntheticMode := os.Getenv("SYNTHETIC_MODE")
if syntheticMode == "true" || syntheticMode == "1" {
fmt.Println("Running in SYNTHETIC MODE - generating random test data")
if err := runSyntheticMode(); err != nil {
err := runSyntheticMode()
if err != nil {
fmt.Fprintf(os.Stderr, "Error in synthetic mode: %v\n", err)
os.Exit(1)
}
return
}
// Get configuration from environment variables
searchTerm := os.Getenv("SEARCH_TERM")
if searchTerm == "" {
searchTerm = "GGUF"
@@ -69,7 +83,7 @@ func main() {
limitStr := os.Getenv("LIMIT")
if limitStr == "" {
limitStr = "15"
limitStr = "5"
}
limit, err := strconv.Atoi(limitStr)
if err != nil {
@@ -78,197 +92,287 @@ func main() {
}
quantization := os.Getenv("QUANTIZATION")
if quantization == "" {
quantization = "Q4_K_M"
}
maxModelsStr := os.Getenv("MAX_MODELS")
if maxModelsStr == "" {
maxModelsStr = "1"
maxModels := os.Getenv("MAX_MODELS")
if maxModels == "" {
maxModels = "1"
}
maxModels, err := strconv.Atoi(maxModelsStr)
maxModelsInt, err := strconv.Atoi(maxModels)
if err != nil {
fmt.Fprintf(os.Stderr, "Error parsing MAX_MODELS: %v\n", err)
os.Exit(1)
}
// Print configuration
fmt.Printf("Gallery Agent Configuration:\n")
fmt.Printf(" Search Term: %s\n", searchTerm)
fmt.Printf(" Limit: %d\n", limit)
fmt.Printf(" Quantization: %s\n", quantization)
fmt.Printf(" Max Models to Add: %d\n", maxModels)
fmt.Printf(" Gallery Index Path: %s\n", getGalleryIndexPath())
fmt.Printf(" Max Models to Add: %d\n", maxModelsInt)
fmt.Printf(" Gallery Index Path: %s\n", os.Getenv("GALLERY_INDEX_PATH"))
fmt.Println()
// Phase 1: load current gallery and query HuggingFace.
gallerySet, err := loadGalleryURLSet()
result, err := searchAndProcessModels(searchTerm, limit, quantization)
if err != nil {
fmt.Fprintf(os.Stderr, "Error loading gallery index: %v\n", err)
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
os.Exit(1)
}
fmt.Printf("Loaded %d existing gallery entries\n", len(gallerySet))
client := hfapi.NewClient()
fmt.Println(result.FormattedOutput)
var models []ProcessedModel
fmt.Println("Searching for trending models on HuggingFace...")
rawModels, err := client.GetTrending(searchTerm, limit)
if err != nil {
fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
os.Exit(1)
}
fmt.Printf("Found %d trending models matching %q\n", len(rawModels), searchTerm)
totalFound := len(rawModels)
// Phase 2: drop anything already in the gallery *before* any expensive
// per-model work (GetModelDetails, README fetches, icon lookups).
fresh := rawModels[:0]
for _, m := range rawModels {
if modelAlreadyInGallery(gallerySet, m.ModelID) {
fmt.Printf("Skipping existing model: %s\n", m.ModelID)
continue
if len(result.Models) > 1 {
fmt.Println("More than one model found (", len(result.Models), "), using AI agent to select the most interesting models")
for _, model := range result.Models {
fmt.Println("Model: ", model.ModelID)
}
fresh = append(fresh, m)
// Use AI agent to select the most interesting models
fmt.Println("Using AI agent to select the most interesting models...")
models, err = selectMostInterestingModels(context.Background(), result)
if err != nil {
fmt.Fprintf(os.Stderr, "Error in model selection: %v\n", err)
// Continue with original result if selection fails
models = result.Models
}
} else if len(result.Models) == 1 {
models = result.Models
fmt.Println("Only one model found, using it directly")
}
fmt.Printf("%d candidates after gallery dedup\n", len(fresh))
// Phase 3: HuggingFace already returned these in trendingScore order —
// just cap to MAX_MODELS.
if len(fresh) > maxModels {
fresh = fresh[:maxModels]
fmt.Print(models)
// Filter out models that already exist in the gallery
fmt.Println("Filtering out existing models...")
models, err = filterExistingModels(models)
if err != nil {
fmt.Fprintf(os.Stderr, "Error filtering existing models: %v\n", err)
os.Exit(1)
}
if len(fresh) == 0 {
// Limit to maxModelsInt after filtering
if len(models) > maxModelsInt {
models = models[:maxModelsInt]
}
// Track added models for summary
var addedModelIDs []string
var addedModelURLs []string
// Generate YAML entries and append to gallery/index.yaml
if len(models) > 0 {
for _, model := range models {
addedModelIDs = append(addedModelIDs, model.ModelID)
// Generate Hugging Face URL for the model
modelURL := fmt.Sprintf("https://huggingface.co/%s", model.ModelID)
addedModelURLs = append(addedModelURLs, modelURL)
}
fmt.Println("Generating YAML entries for selected models...")
err = generateYAMLForModels(context.Background(), models, quantization)
if err != nil {
fmt.Fprintf(os.Stderr, "Error generating YAML entries: %v\n", err)
os.Exit(1)
}
} else {
fmt.Println("No new models to add to the gallery.")
writeSummary(AddedModelSummary{
SearchTerm: searchTerm,
TotalFound: totalFound,
ModelsAdded: 0,
Quantization: quantization,
ProcessingTime: time.Since(startTime).String(),
})
return
}
// Phase 4: fetch details and build ProcessedModel entries for survivors.
var processed []ProcessedModel
quantPrefs := []string{quantization, "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K", "Q8_0"}
for _, m := range fresh {
fmt.Printf("Processing model: %s (downloads=%d)\n", m.ModelID, m.Downloads)
pm := ProcessedModel{
ModelID: m.ModelID,
Author: m.Author,
Downloads: m.Downloads,
LastModified: m.LastModified,
QuantizationPreferences: quantPrefs,
}
details, err := client.GetModelDetails(m.ModelID)
if err != nil {
fmt.Printf(" Error getting model details: %v (skipping)\n", err)
continue
}
preferred := hfapi.FindPreferredModelFile(details.Files, quantPrefs)
if preferred == nil {
fmt.Printf(" No GGUF file matching %v — skipping\n", quantPrefs)
continue
}
pm.Files = make([]ProcessedModelFile, len(details.Files))
for j, f := range details.Files {
fileType := "other"
if f.IsReadme {
fileType = "readme"
} else if f.Path == preferred.Path {
fileType = "model"
}
pm.Files[j] = ProcessedModelFile{
Path: f.Path,
Size: f.Size,
SHA256: f.SHA256,
IsReadme: f.IsReadme,
FileType: fileType,
}
if f.Path == preferred.Path {
copyFile := pm.Files[j]
pm.PreferredModelFile = &copyFile
}
if f.IsReadme {
copyFile := pm.Files[j]
pm.ReadmeFile = &copyFile
}
}
// Deterministic README resolution: follow base_model tag if set.
// Keep the raw (HTML-bearing) README around while we extract the
// icon, then strip it down to a plain-text description for the
// `description:` YAML field.
readme, err := resolveReadme(client, m.ModelID, m.Tags)
if err != nil {
fmt.Printf(" Warning: failed to fetch README: %v\n", err)
}
pm.ReadmeContent = readme
pm.License = licenseFromTags(m.Tags)
pm.Tags = curatedTags(m.Tags)
pm.Icon = extractModelIcon(pm)
if pm.ReadmeContent != "" {
pm.ReadmeContent = extractDescription(pm.ReadmeContent)
pm.ReadmeContentPreview = truncateString(pm.ReadmeContent, 200)
}
fmt.Printf(" License: %s, Tags: %v, Icon: %s\n", pm.License, pm.Tags, pm.Icon)
processed = append(processed, pm)
}
if len(processed) == 0 {
fmt.Println("No processable models after detail fetch.")
writeSummary(AddedModelSummary{
SearchTerm: searchTerm,
TotalFound: totalFound,
ModelsAdded: 0,
Quantization: quantization,
ProcessingTime: time.Since(startTime).String(),
})
return
}
// Phase 5: write YAML entries.
var addedIDs, addedURLs []string
for _, pm := range processed {
addedIDs = append(addedIDs, pm.ModelID)
addedURLs = append(addedURLs, "https://huggingface.co/"+pm.ModelID)
}
fmt.Println("Generating YAML entries for selected models...")
if err := generateYAMLForModels(context.Background(), processed, quantization); err != nil {
fmt.Fprintf(os.Stderr, "Error generating YAML entries: %v\n", err)
os.Exit(1)
}
writeSummary(AddedModelSummary{
// Create and write summary
processingTime := time.Since(startTime).String()
summary := AddedModelSummary{
SearchTerm: searchTerm,
TotalFound: totalFound,
ModelsAdded: len(addedIDs),
AddedModelIDs: addedIDs,
AddedModelURLs: addedURLs,
TotalFound: result.TotalModelsFound,
ModelsAdded: len(addedModelIDs),
AddedModelIDs: addedModelIDs,
AddedModelURLs: addedModelURLs,
Quantization: quantization,
ProcessingTime: time.Since(startTime).String(),
})
}
ProcessingTime: processingTime,
}
func writeSummary(summary AddedModelSummary) {
data, err := json.MarshalIndent(summary, "", " ")
// Write summary to file
summaryData, err := json.MarshalIndent(summary, "", " ")
if err != nil {
fmt.Fprintf(os.Stderr, "Error marshaling summary: %v\n", err)
return
} else {
err = os.WriteFile("gallery-agent-summary.json", summaryData, 0644)
if err != nil {
fmt.Fprintf(os.Stderr, "Error writing summary file: %v\n", err)
} else {
fmt.Printf("Summary written to gallery-agent-summary.json\n")
}
}
if err := os.WriteFile("gallery-agent-summary.json", data, 0644); err != nil {
fmt.Fprintf(os.Stderr, "Error writing summary file: %v\n", err)
return
}
func searchAndProcessModels(searchTerm string, limit int, quantization string) (*SearchResult, error) {
client := hfapi.NewClient()
var outputBuilder strings.Builder
fmt.Println("Searching for models...")
// Initialize the result struct
result := &SearchResult{
SearchTerm: searchTerm,
Limit: limit,
Quantization: quantization,
Models: []ProcessedModel{},
}
fmt.Println("Summary written to gallery-agent-summary.json")
models, err := client.GetLatest(searchTerm, limit)
if err != nil {
return nil, fmt.Errorf("failed to fetch models: %w", err)
}
fmt.Println("Models found:", len(models))
result.TotalModelsFound = len(models)
if len(models) == 0 {
outputBuilder.WriteString("No models found.\n")
result.FormattedOutput = outputBuilder.String()
return result, nil
}
outputBuilder.WriteString(fmt.Sprintf("Found %d models matching '%s':\n\n", len(models), searchTerm))
// Process each model
for i, model := range models {
outputBuilder.WriteString(fmt.Sprintf("%d. Processing Model: %s\n", i+1, model.ModelID))
outputBuilder.WriteString(fmt.Sprintf(" Author: %s\n", model.Author))
outputBuilder.WriteString(fmt.Sprintf(" Downloads: %d\n", model.Downloads))
outputBuilder.WriteString(fmt.Sprintf(" Last Modified: %s\n", model.LastModified))
// Initialize processed model struct
processedModel := ProcessedModel{
ModelID: model.ModelID,
Author: model.Author,
Downloads: model.Downloads,
LastModified: model.LastModified,
QuantizationPreferences: []string{quantization, "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K"},
}
// Get detailed model information
details, err := client.GetModelDetails(model.ModelID)
if err != nil {
errorMsg := fmt.Sprintf(" Error getting model details: %v\n", err)
outputBuilder.WriteString(errorMsg)
processedModel.ProcessingError = err.Error()
result.Models = append(result.Models, processedModel)
continue
}
// Define quantization preferences (in order of preference)
quantizationPreferences := []string{quantization, "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K"}
// Find preferred model file
preferredModelFile := hfapi.FindPreferredModelFile(details.Files, quantizationPreferences)
// Process files
processedFiles := make([]ProcessedModelFile, len(details.Files))
for j, file := range details.Files {
fileType := "other"
if file.IsReadme {
fileType = "readme"
} else if preferredModelFile != nil && file.Path == preferredModelFile.Path {
fileType = "model"
}
processedFiles[j] = ProcessedModelFile{
Path: file.Path,
Size: file.Size,
SHA256: file.SHA256,
IsReadme: file.IsReadme,
FileType: fileType,
}
}
processedModel.Files = processedFiles
// Set preferred model file
if preferredModelFile != nil {
for _, file := range processedFiles {
if file.Path == preferredModelFile.Path {
processedModel.PreferredModelFile = &file
break
}
}
}
// Print file information
outputBuilder.WriteString(fmt.Sprintf(" Files found: %d\n", len(details.Files)))
if preferredModelFile != nil {
outputBuilder.WriteString(fmt.Sprintf(" Preferred Model File: %s (SHA256: %s)\n",
preferredModelFile.Path,
preferredModelFile.SHA256))
} else {
outputBuilder.WriteString(fmt.Sprintf(" No model file found with quantization preferences: %v\n", quantizationPreferences))
}
if details.ReadmeFile != nil {
outputBuilder.WriteString(fmt.Sprintf(" README File: %s\n", details.ReadmeFile.Path))
// Find and set readme file
for _, file := range processedFiles {
if file.IsReadme {
processedModel.ReadmeFile = &file
break
}
}
fmt.Println("Getting real readme for", model.ModelID, "waiting...")
// Use agent to get the real readme and prepare the model description
readmeContent, err := getRealReadme(context.Background(), model.ModelID)
if err == nil {
processedModel.ReadmeContent = readmeContent
processedModel.ReadmeContentPreview = truncateString(readmeContent, 200)
outputBuilder.WriteString(fmt.Sprintf(" README Content Preview: %s\n",
processedModel.ReadmeContentPreview))
} else {
fmt.Printf(" Warning: Failed to get real readme: %v\n", err)
}
fmt.Println("Real readme got", readmeContent)
// Extract metadata (tags, license) from README using LLM
fmt.Println("Extracting metadata for", model.ModelID, "waiting...")
tags, license, err := extractModelMetadata(context.Background(), processedModel)
if err == nil {
processedModel.Tags = tags
processedModel.License = license
outputBuilder.WriteString(fmt.Sprintf(" Tags: %v\n", tags))
outputBuilder.WriteString(fmt.Sprintf(" License: %s\n", license))
} else {
fmt.Printf(" Warning: Failed to extract metadata: %v\n", err)
}
// Extract icon from README or use HuggingFace avatar
icon := extractModelIcon(processedModel)
if icon != "" {
processedModel.Icon = icon
outputBuilder.WriteString(fmt.Sprintf(" Icon: %s\n", icon))
}
// Get README content
// readmeContent, err := client.GetReadmeContent(model.ModelID, details.ReadmeFile.Path)
// if err == nil {
// processedModel.ReadmeContent = readmeContent
// processedModel.ReadmeContentPreview = truncateString(readmeContent, 200)
// outputBuilder.WriteString(fmt.Sprintf(" README Content Preview: %s\n",
// processedModel.ReadmeContentPreview))
// }
}
// Print all files with their checksums
outputBuilder.WriteString(" All Files:\n")
for _, file := range processedFiles {
outputBuilder.WriteString(fmt.Sprintf(" - %s (%s, %d bytes", file.Path, file.FileType, file.Size))
if file.SHA256 != "" {
outputBuilder.WriteString(fmt.Sprintf(", SHA256: %s", file.SHA256))
}
outputBuilder.WriteString(")\n")
}
outputBuilder.WriteString("\n")
result.Models = append(result.Models, processedModel)
}
result.FormattedOutput = outputBuilder.String()
return result, nil
}
func truncateString(s string, maxLen int) string {
@@ -277,4 +381,3 @@ func truncateString(s string, maxLen int) string {
}
return s[:maxLen] + "..."
}

46
.github/gallery-agent/tools.go vendored Normal file
View File

@@ -0,0 +1,46 @@
package main
import (
"fmt"
hfapi "github.com/mudler/LocalAI/pkg/huggingface-api"
openai "github.com/sashabaranov/go-openai"
jsonschema "github.com/sashabaranov/go-openai/jsonschema"
)
// Get repository README from HF
type HFReadmeTool struct {
client *hfapi.Client
}
func (s *HFReadmeTool) Execute(args map[string]any) (string, any, error) {
q, ok := args["repository"].(string)
if !ok {
return "", nil, fmt.Errorf("no query")
}
readme, err := s.client.GetReadmeContent(q, "README.md")
if err != nil {
return "", nil, err
}
return readme, nil, nil
}
func (s *HFReadmeTool) Tool() openai.Tool {
return openai.Tool{
Type: openai.ToolTypeFunction,
Function: &openai.FunctionDefinition{
Name: "hf_readme",
Description: "A tool to get the README content of a huggingface repository",
Parameters: jsonschema.Definition{
Type: jsonschema.Object,
Properties: map[string]jsonschema.Definition{
"repository": {
Type: jsonschema.String,
Description: "The huggingface repository to get the README content of",
},
},
Required: []string{"repository"},
},
},
}
}

View File

@@ -66,19 +66,6 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-cpu-sglang'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'true'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -118,25 +105,6 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
# tinygrad ships a single image — its CPU device uses bundled
# libLLVM, and its CUDA / HIP / Metal devices dlopen the host
# driver libraries at runtime via tinygrad's ctypes autogen
# wrappers. There is no toolkit-version split because tinygrad
# generates kernels itself (PTX renderer for CUDA) and never
# links against cuDNN/cuBLAS/torch.
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-tinygrad'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'true'
backend: "tinygrad"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -385,19 +353,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -424,19 +379,6 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-sglang'
runs-on: 'arc-runner-set'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -854,19 +796,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -880,19 +809,6 @@ jobs:
backend: "llama-cpp"
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-turboquant'
base-image: "ubuntu:24.04"
runs-on: 'ubuntu-24.04-arm'
ubuntu-version: '2404'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1414,19 +1330,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-turboquant'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1453,19 +1356,6 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-sglang'
runs-on: 'arc-runner-set'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1676,19 +1566,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-turboquant'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1702,19 +1579,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-turboquant'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1728,19 +1592,6 @@ jobs:
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sglang'
runs-on: 'arc-runner-set'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "sglang"
dockerfile: "./backend/Dockerfile.python"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2107,19 +1958,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-turboquant'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -2146,19 +1984,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-turboquant'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2204'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2172,19 +1997,6 @@ jobs:
dockerfile: "./backend/Dockerfile.llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-turboquant'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
# Stablediffusion-ggml
- build-type: ''
cuda-major-version: ""

View File

@@ -18,10 +18,6 @@ jobs:
variable: "IK_LLAMA_VERSION"
branch: "main"
file: "backend/cpp/ik-llama-cpp/Makefile"
- repository: "TheTom/llama-cpp-turboquant"
variable: "TURBOQUANT_VERSION"
branch: "feature/turboquant-kv-cache"
file: "backend/cpp/turboquant/Makefile"
- repository: "ggml-org/whisper.cpp"
variable: "WHISPER_CPP_VERSION"
branch: "master"

View File

@@ -48,71 +48,21 @@ jobs:
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Process gallery-agent PR commands
env:
GH_TOKEN: ${{ secrets.UPDATE_BOT_TOKEN }}
REPO: ${{ github.repository }}
SEARCH: 'gallery agent in:title'
run: |
# Walk open gallery-agent PRs and act on maintainer comments:
# /gallery-agent blacklist → label `gallery-agent/blacklisted` + close (never repropose)
# /gallery-agent recreate → close without label (next run may repropose)
# Only comments from OWNER / MEMBER / COLLABORATOR are honored so
# random users can't drive the bot.
gh label create gallery-agent/blacklisted \
--repo "$REPO" --color ededed \
--description "gallery-agent must not repropose this model" 2>/dev/null || true
prs=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" --json number --jq '.[].number')
for pr in $prs; do
cmds=$(gh pr view "$pr" --repo "$REPO" --json comments \
--jq '.comments[] | select(.authorAssociation=="OWNER" or .authorAssociation=="MEMBER" or .authorAssociation=="COLLABORATOR") | .body')
if echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+blacklist([[:space:]]|$)'; then
echo "PR #$pr: blacklist command found"
gh pr edit "$pr" --repo "$REPO" --add-label gallery-agent/blacklisted || true
gh pr close "$pr" --repo "$REPO" --comment "Blacklisted via \`/gallery-agent blacklist\`. This model will not be reproposed." || true
elif echo "$cmds" | grep -qE '(^|[[:space:]])/gallery-agent[[:space:]]+recreate([[:space:]]|$)'; then
echo "PR #$pr: recreate command found"
gh pr close "$pr" --repo "$REPO" --comment "Closed via \`/gallery-agent recreate\`. The next scheduled run will propose this model again." || true
fi
done
- name: Collect skip URLs for the gallery agent
id: open_prs
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO: ${{ github.repository }}
SEARCH: 'gallery agent in:title'
run: |
# Skip set =
# URLs from any open gallery-agent PR (avoid duplicate PRs for the same model while one is pending)
# + URLs from closed PRs carrying the `gallery-agent/blacklisted` label (hard blacklist)
# Plain-closed PRs without the label are ignored — closing a PR is
# not by itself a "never propose again" signal; maintainers must
# opt in via the /gallery-agent blacklist comment command.
urls_open=$(gh pr list --repo "$REPO" --state open --search "$SEARCH" \
--json body --jq '[.[].body] | join("\n")' \
| grep -oE 'https://huggingface\.co/[^ )]+' || true)
urls_blacklist=$(gh pr list --repo "$REPO" --state closed --search "$SEARCH" \
--label gallery-agent/blacklisted \
--json body --jq '[.[].body] | join("\n")' \
| grep -oE 'https://huggingface\.co/[^ )]+' || true)
urls=$(printf '%s\n%s\n' "$urls_open" "$urls_blacklist" | sort -u | sed '/^$/d')
echo "Skip URLs:"
echo "$urls"
{
echo "urls<<EOF"
echo "$urls"
echo "EOF"
} >> "$GITHUB_OUTPUT"
- uses: mudler/localai-github-action@v1.1
with:
model: 'https://huggingface.co/unsloth/Qwen3.5-2B-GGUF'
- name: Run gallery agent
env:
#OPENAI_MODEL: ${{ secrets.OPENAI_MODEL }}
OPENAI_MODEL: Qwen3.5-2B-GGUF
OPENAI_BASE_URL: "http://localhost:8080"
OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
#OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}
SEARCH_TERM: ${{ github.event.inputs.search_term || 'GGUF' }}
LIMIT: ${{ github.event.inputs.limit || '15' }}
QUANTIZATION: ${{ github.event.inputs.quantization || 'Q4_K_M' }}
MAX_MODELS: ${{ github.event.inputs.max_models || '1' }}
EXTRA_SKIP_URLS: ${{ steps.open_prs.outputs.urls }}
run: |
export GALLERY_INDEX_PATH=$PWD/gallery/index.yaml
go run ./.github/gallery-agent
@@ -174,21 +124,7 @@ jobs:
**Added Models:**
${{ steps.read_summary.outputs.added_models || '- No models added' }}
### Bot commands
Maintainers (owner / member / collaborator) can control this PR
by leaving a comment with one of:
- `/gallery-agent recreate` — close this PR; the next scheduled
run will propose this model again (useful if the entry needs
to be regenerated with fresh metadata).
- `/gallery-agent blacklist` — close this PR and permanently
prevent the gallery agent from ever reproposing this model.
Plain "Close" (without a command) is treated as a no-op: the
model may be reproposed by a future run.
**Workflow Details:**
- Triggered by: `${{ github.event_name }}`
- Run ID: `${{ github.run_id }}`

View File

@@ -59,7 +59,7 @@ jobs:
hugo --minify --baseURL "${{ steps.pages.outputs.base_url }}/"
- name: Upload artifact
uses: actions/upload-pages-artifact@v5
uses: actions/upload-pages-artifact@v4
with:
path: docs/public

View File

@@ -39,7 +39,7 @@ jobs:
run: |
make build-launcher-darwin
- name: Upload DMG to Release
uses: softprops/action-gh-release@v3
uses: softprops/action-gh-release@v2
with:
files: ./dist/LocalAI.dmg
launcher-build-linux:
@@ -59,6 +59,6 @@ jobs:
sudo apt-get install golang gcc libgl1-mesa-dev xorg-dev libxkbcommon-dev
make build-launcher-linux
- name: Upload Linux launcher artifacts
uses: softprops/action-gh-release@v3
uses: softprops/action-gh-release@v2
with:
files: ./local-ai-launcher-linux.tar.xz

View File

@@ -31,9 +31,7 @@ jobs:
llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
turboquant: ${{ steps.detect.outputs.turboquant }}
vllm: ${{ steps.detect.outputs.vllm }}
sglang: ${{ steps.detect.outputs.sglang }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -487,23 +485,6 @@ jobs:
- name: Build llama-cpp backend image and run gRPC e2e tests
run: |
make test-extra-backend-llama-cpp
tests-llama-cpp-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
run: |
make test-extra-backend-llama-cpp-transcription
tests-ik-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -521,29 +502,6 @@ jobs:
- name: Build ik-llama-cpp backend image and run gRPC e2e tests
run: |
make test-extra-backend-ik-llama-cpp
tests-turboquant-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.turboquant == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
# Exercises the turboquant (llama.cpp fork) backend with KV-cache
# quantization enabled. The convenience target sets
# BACKEND_TEST_CACHE_TYPE_K / _V=q8_0, which are plumbed into the
# ModelOptions.CacheTypeKey/Value gRPC fields. LoadModel-success +
# backend stdout/stderr (captured by the Ginkgo suite) prove the
# cache-type config path reaches the fork's KV-cache init.
- name: Build turboquant backend image and run gRPC e2e tests
run: |
make test-extra-backend-turboquant
# tests-vllm-grpc is currently disabled in CI.
#
# The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16
@@ -590,48 +548,6 @@ jobs:
# - name: Build vllm (cpu) backend image and run gRPC e2e tests
# run: |
# make test-extra-backend-vllm
# tests-sglang-grpc is currently disabled in CI for the same reason as
# tests-vllm-grpc: sglang's CPU kernel (sgl-kernel) uses __m512 AVX-512
# intrinsics unconditionally in shm.cpp, so the from-source build
# requires `-march=sapphirerapids` (already set in install.sh) and the
# resulting binary SIGILLs at import on CPUs without AVX-512 VNNI/BF16.
# The ubuntu-latest runner pool does not guarantee that ISA baseline.
#
# The test itself (tests/e2e-backends + make test-extra-backend-sglang)
# is fully working and validated locally on a host with the right
# SIMD baseline. Run it manually with:
#
# make test-extra-backend-sglang
#
# Re-enable this job once we have a self-hosted runner label with
# guaranteed AVX-512 VNNI/BF16 support.
#
# tests-sglang-grpc:
# needs: detect-changes
# if: needs.detect-changes.outputs.sglang == 'true' || needs.detect-changes.outputs.run-all == 'true'
# runs-on: bigger-runner
# timeout-minutes: 90
# steps:
# - name: Clone
# uses: actions/checkout@v6
# with:
# submodules: true
# - name: Dependencies
# run: |
# sudo apt-get update
# sudo apt-get install -y --no-install-recommends \
# make build-essential curl unzip ca-certificates git tar
# - name: Setup Go
# uses: actions/setup-go@v5
# with:
# go-version: '1.25.4'
# - name: Free disk space
# run: |
# sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
# df -h
# - name: Build sglang (cpu) backend image and run gRPC e2e tests
# run: |
# make test-extra-backend-sglang
tests-acestep-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.acestep-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'

View File

@@ -10,7 +10,6 @@ This file is an index to detailed topic guides in the `.agents/` directory. Read
| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |

127
Makefile
View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp
GOCMD=go
GOTEST=$(GOCMD) test
@@ -419,7 +419,6 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/chatterbox
$(MAKE) -C backend/python/vllm
$(MAKE) -C backend/python/vllm-omni
$(MAKE) -C backend/python/sglang
$(MAKE) -C backend/python/vibevoice
$(MAKE) -C backend/python/moonshine
$(MAKE) -C backend/python/pocket-tts
@@ -433,7 +432,6 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/whisperx
$(MAKE) -C backend/python/ace-step
$(MAKE) -C backend/python/trl
$(MAKE) -C backend/python/tinygrad
$(MAKE) -C backend/rust/kokoros kokoros-grpc
test-extra: prepare-test-extra
@@ -456,7 +454,6 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/whisperx test
$(MAKE) -C backend/python/ace-step test
$(MAKE) -C backend/python/trl test
$(MAKE) -C backend/python/tinygrad test
$(MAKE) -C backend/rust/kokoros test
##
@@ -496,17 +493,11 @@ test-extra-backend: protogen-go
BACKEND_TEST_MODEL_URL="$${BACKEND_TEST_MODEL_URL:-$(BACKEND_TEST_MODEL_URL)}" \
BACKEND_TEST_MODEL_FILE="$$BACKEND_TEST_MODEL_FILE" \
BACKEND_TEST_MODEL_NAME="$$BACKEND_TEST_MODEL_NAME" \
BACKEND_TEST_MMPROJ_URL="$$BACKEND_TEST_MMPROJ_URL" \
BACKEND_TEST_MMPROJ_FILE="$$BACKEND_TEST_MMPROJ_FILE" \
BACKEND_TEST_AUDIO_URL="$$BACKEND_TEST_AUDIO_URL" \
BACKEND_TEST_AUDIO_FILE="$$BACKEND_TEST_AUDIO_FILE" \
BACKEND_TEST_CAPS="$$BACKEND_TEST_CAPS" \
BACKEND_TEST_PROMPT="$$BACKEND_TEST_PROMPT" \
BACKEND_TEST_OPTIONS="$$BACKEND_TEST_OPTIONS" \
BACKEND_TEST_TOOL_PROMPT="$$BACKEND_TEST_TOOL_PROMPT" \
BACKEND_TEST_TOOL_NAME="$$BACKEND_TEST_TOOL_NAME" \
BACKEND_TEST_CACHE_TYPE_K="$$BACKEND_TEST_CACHE_TYPE_K" \
BACKEND_TEST_CACHE_TYPE_V="$$BACKEND_TEST_CACHE_TYPE_V" \
go test -v -timeout 30m ./tests/e2e-backends/...
## Convenience wrappers: build the image, then exercise it.
@@ -516,31 +507,6 @@ test-extra-backend-llama-cpp: docker-build-llama-cpp
test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend
## turboquant: exercises the llama.cpp-fork backend with the fork's
## *TurboQuant-specific* KV-cache types (turbo3 for both K and V). turbo3
## is what makes this backend distinct from stock llama-cpp — picking q8_0
## here would only test the standard llama.cpp code path that the upstream
## llama-cpp backend already covers. The fork auto-enables flash_attention
## when turbo3/turbo4 are active, so we don't need to set it explicitly.
test-extra-backend-turboquant: docker-build-turboquant
BACKEND_IMAGE=local-ai-backend:turboquant \
BACKEND_TEST_CACHE_TYPE_K=q8_0 \
BACKEND_TEST_CACHE_TYPE_V=turbo3 \
$(MAKE) test-extra-backend
## Audio transcription wrapper for the llama-cpp backend.
## Drives the new AudioTranscription / AudioTranscriptionStream RPCs against
## ggml-org/Qwen3-ASR-0.6B-GGUF (a small ASR model that requires its mmproj
## audio encoder companion). The audio fixture is a short public-domain
## "jfk.wav" clip ggml-org bundles with whisper.cpp's CI assets.
test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
BACKEND_IMAGE=local-ai-backend:llama-cpp \
BACKEND_TEST_MODEL_URL=https://huggingface.co/ggml-org/Qwen3-ASR-0.6B-GGUF/resolve/main/Qwen3-ASR-0.6B-Q8_0.gguf \
BACKEND_TEST_MMPROJ_URL=https://huggingface.co/ggml-org/Qwen3-ASR-0.6B-GGUF/resolve/main/mmproj-Qwen3-ASR-0.6B-Q8_0.gguf \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## vllm is resolved from a HuggingFace model id (no file download) and
## exercises Predict + streaming + tool-call extraction via the hermes parser.
## Requires a host CPU with the SIMD instructions the prebuilt vllm CPU
@@ -553,83 +519,6 @@ test-extra-backend-vllm: docker-build-vllm
BACKEND_TEST_OPTIONS=tool_parser:hermes \
$(MAKE) test-extra-backend
## tinygrad mirrors the vllm target (same model, same caps, same parser) so
## the two backends are directly comparable. The LLM path covers Predict,
## streaming and native tool-call extraction. Companion targets below cover
## embeddings, Stable Diffusion and Whisper — run them individually or via
## the `test-extra-backend-tinygrad-all` aggregate.
test-extra-backend-tinygrad: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen3-0.6B \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_OPTIONS=tool_parser:hermes \
$(MAKE) test-extra-backend
## tinygrad — embeddings via LLM last-hidden-state pooling. Reuses the same
## Qwen3-0.6B as the chat target so we don't need a separate BERT vendor;
## the Embedding RPC mean-pools and L2-normalizes the last-layer hidden
## state.
test-extra-backend-tinygrad-embeddings: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen3-0.6B \
BACKEND_TEST_CAPS=health,load,embeddings \
$(MAKE) test-extra-backend
## tinygrad — Stable Diffusion 1.5. The original CompVis/runwayml repos have
## been gated, so we use the community-maintained mirror at
## stable-diffusion-v1-5/stable-diffusion-v1-5 with the EMA-only pruned
## checkpoint (~4.3GB). Step count is kept low (4) so a CPU-only run finishes
## in a few minutes; bump BACKEND_TEST_IMAGE_STEPS for higher quality.
test-extra-backend-tinygrad-sd: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=stable-diffusion-v1-5/stable-diffusion-v1-5 \
BACKEND_TEST_CAPS=health,load,image \
$(MAKE) test-extra-backend
## tinygrad — Whisper. Loads OpenAI's tiny.en checkpoint (smallest at ~75MB)
## from the original azure CDN through tinygrad's `fetch` helper, and
## transcribes the canonical jfk.wav fixture from whisper.cpp's CI samples.
## Exercises both AudioTranscription and AudioTranscriptionStream.
test-extra-backend-tinygrad-whisper: docker-build-tinygrad
BACKEND_IMAGE=local-ai-backend:tinygrad \
BACKEND_TEST_MODEL_NAME=openai/whisper-tiny.en \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
test-extra-backend-tinygrad-all: \
test-extra-backend-tinygrad \
test-extra-backend-tinygrad-embeddings \
test-extra-backend-tinygrad-sd \
test-extra-backend-tinygrad-whisper
## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
test-extra-backend-sglang: docker-build-sglang
BACKEND_IMAGE=local-ai-backend:sglang \
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_OPTIONS=tool_parser:qwen \
$(MAKE) test-extra-backend
## mlx is Apple-Silicon-first — the MLX backend auto-detects the right tool
## parser from the chat template, so no tool_parser: option is needed (it
## would be ignored at runtime). Run this on macOS / arm64 with Metal; the
## Linux/CPU mlx variant is untested in CI.
test-extra-backend-mlx: docker-build-mlx
BACKEND_IMAGE=local-ai-backend:mlx \
BACKEND_TEST_MODEL_NAME=mlx-community/Qwen2.5-0.5B-Instruct-4bit \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
$(MAKE) test-extra-backend
test-extra-backend-mlx-vlm: docker-build-mlx-vlm
BACKEND_IMAGE=local-ai-backend:mlx-vlm \
BACKEND_TEST_MODEL_NAME=mlx-community/Qwen2.5-0.5B-Instruct-4bit \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
$(MAKE) test-extra-backend
DOCKER_IMAGE?=local-ai
IMAGE_TYPE?=core
BASE_IMAGE?=ubuntu:24.04
@@ -725,9 +614,6 @@ backend-images:
BACKEND_LLAMA_CPP = llama-cpp|llama-cpp|.|false|false
# ik-llama-cpp is a fork of llama.cpp with superior CPU performance
BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
@@ -753,7 +639,6 @@ BACKEND_NEUTTS = neutts|python|.|false|true
BACKEND_KOKORO = kokoro|python|.|false|true
BACKEND_VLLM = vllm|python|.|false|true
BACKEND_VLLM_OMNI = vllm-omni|python|.|false|true
BACKEND_SGLANG = sglang|python|.|false|true
BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
BACKEND_CHATTERBOX = chatterbox|python|.|false|true
BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
@@ -767,12 +652,9 @@ BACKEND_NEMO = nemo|python|.|false|true
BACKEND_VOXCPM = voxcpm|python|.|false|true
BACKEND_WHISPERX = whisperx|python|.|false|true
BACKEND_ACE_STEP = ace-step|python|.|false|true
BACKEND_MLX = mlx|python|.|false|true
BACKEND_MLX_VLM = mlx-vlm|python|.|false|true
BACKEND_MLX_DISTRIBUTED = mlx-distributed|python|./|false|true
BACKEND_TRL = trl|python|.|false|true
BACKEND_LLAMA_CPP_QUANTIZATION = llama-cpp-quantization|python|.|false|true
BACKEND_TINYGRAD = tinygrad|python|.|false|true
# Rust backends
BACKEND_KOKOROS = kokoros|rust|.|false|true
@@ -804,7 +686,6 @@ endef
# Generate all docker-build targets
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -824,7 +705,6 @@ $(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
$(eval $(call generate-docker-build-target,$(BACKEND_VLLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_VLLM_OMNI)))
$(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
$(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
$(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
@@ -840,12 +720,9 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_VLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED)))
$(eval $(call generate-docker-build-target,$(BACKEND_TRL)))
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
$(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
$(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
@@ -853,7 +730,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
########################################################
### Mock Backend for E2E Tests

View File

@@ -58,8 +58,6 @@ ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}

View File

@@ -1,290 +0,0 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
# The grpc target does one thing, it builds and installs GRPC. This is in it's own layer so that it can be effectively cached by CI.
# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
FROM ${GRPC_BASE_IMAGE} AS grpc
# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG GRPC_VERSION=v1.65.0
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
# and running make install in the target container
RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
mkdir -p /build/grpc/cmake/build && \
cd /build/grpc/cmake/build && \
sed -i "216i\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
make && \
make install && \
rm -rf /build
FROM ${BASE_IMAGE} AS builder
ARG CMAKE_FROM_SOURCE=false
ARG CMAKE_VERSION=3.31.10
# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG SKIP_DRIVERS=false
ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Cuda
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS requirements
ENV PATH=/opt/rocm/bin:${PATH}
# Vulkan requirements
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent && \
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "$TARGETARCH" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
mkdir -p /opt/vulkan-sdk && \
mv 1.4.335.0 /opt/vulkan-sdk/ && \
cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "$TARGETARCH" ]; then
mkdir vulkan && cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ && \
cd ../.. && \
rm -rf vulkan
fi
ldconfig && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# CuBLAS requirements
RUN <<EOT bash
if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "$TARGETARCH" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
fi
if [ "arm64" = "$TARGETARCH" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
else
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm -f cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
apt-get install -y --no-install-recommends \
libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
fi
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get install -y nvpl
fi
EOT
# If we are building with clblas support, we need the libraries for the builds
RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
libclblast-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* \
; fi
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig && \
# Log which GPU architectures have rocBLAS kernel support
echo "rocBLAS library data architectures:" && \
(ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
echo "WARNING: No rocBLAS kernel data found" \
; fi
RUN echo "TARGETARCH: $TARGETARCH"
# We need protoc installed, and the version in 22.04 is too old. We will create one as part installing the GRPC build below
# but that will also being in a newer version of absl which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build
RUN <<EOT bash
if [ "amd64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
if [ "arm64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
EOT
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
COPY --from=grpc /opt/grpc /usr/local
COPY . /LocalAI
RUN <<'EOT' bash
set -euxo pipefail
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/turboquant-*-build
fi
cd /LocalAI/backend/cpp/turboquant
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
else
make turboquant-avx
make turboquant-avx2
make turboquant-avx512
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
fi
EOT
# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/turboquant package
FROM scratch
# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
COPY --from=builder /LocalAI/backend/cpp/turboquant/package/. ./

View File

@@ -17,7 +17,6 @@ service Backend {
rpc GenerateImage(GenerateImageRequest) returns (Result) {}
rpc GenerateVideo(GenerateVideoRequest) returns (Result) {}
rpc AudioTranscription(TranscriptRequest) returns (TranscriptResult) {}
rpc AudioTranscriptionStream(TranscriptRequest) returns (stream TranscriptStreamResponse) {}
rpc TTS(TTSRequest) returns (Result) {}
rpc TTSStream(TTSRequest) returns (stream Reply) {}
rpc SoundGeneration(SoundGenerationRequest) returns (Result) {}
@@ -323,21 +322,11 @@ message TranscriptRequest {
bool translate = 5;
bool diarize = 6;
string prompt = 7;
float temperature = 8;
repeated string timestamp_granularities = 9;
bool stream = 10;
}
message TranscriptResult {
repeated TranscriptSegment segments = 1;
string text = 2;
string language = 3;
float duration = 4;
}
message TranscriptStreamResponse {
string delta = 1;
TranscriptResult final_result = 2;
}
message TranscriptSegment {
@@ -557,7 +546,6 @@ message ModelMetadataResponse {
bool supports_thinking = 1;
string rendered_template = 2; // The rendered chat template with enable_thinking=true (empty if not applicable)
ToolFormatMarkers tool_format = 3; // Auto-detected tool format markers from differential template analysis
string media_marker = 4; // Marker the backend expects in the prompt for each multimodal input (images/audio/video). Empty when the backend does not use a marker.
}
// Fine-tuning messages

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=8befd92ea5f702494ea9813fe42a52fb015db5fe
IK_LLAMA_VERSION?=08ae48c667e3dcd3025821a8585190b4a46c2f7c
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

View File

@@ -62,18 +62,7 @@ add_executable(${TARGET} grpc-server.cpp json.hpp httplib.h)
target_include_directories(${TARGET} PRIVATE ../llava)
target_include_directories(${TARGET} PRIVATE ${CMAKE_SOURCE_DIR})
# Upstream llama.cpp renamed the `common` helpers library to `llama-common`.
# Forks that branched before the rename (e.g. llama-cpp-turboquant) still
# expose it as `common`. Detect which one is present so the same CMakeLists
# drives both builds — otherwise an unresolved name silently degrades to a
# plain `-l` flag and the PUBLIC include dir (where common.h lives) is lost.
if (TARGET llama-common)
set(_LLAMA_COMMON_TARGET llama-common)
else()
set(_LLAMA_COMMON_TARGET common)
endif()
target_link_libraries(${TARGET} PRIVATE ${_LLAMA_COMMON_TARGET} llama mtmd ${CMAKE_THREAD_LIBS_INIT} absl::flags hw_grpc_proto
target_link_libraries(${TARGET} PRIVATE common llama mtmd ${CMAKE_THREAD_LIBS_INIT} absl::flags hw_grpc_proto
absl::flags_parse
gRPC::${_REFLECTION}
gRPC::${_GRPC_GRPCPP}

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=4f02d4733934179386cbc15b3454be26237940bb
LLAMA_VERSION?=ff5ef8278615a2462b79b50abdf3cc95cfb31c6f
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=
@@ -33,7 +33,7 @@ else ifeq ($(BUILD_TYPE),hipblas)
ROCM_PATH ?= /opt/rocm
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1200,gfx1201
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=1
@@ -132,7 +132,7 @@ llama.cpp:
cd llama.cpp && \
git init && \
git remote add origin $(LLAMA_REPO) && \
git fetch --all --tags && \
git fetch origin && \
git checkout -b build $(LLAMA_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch

View File

@@ -26,8 +26,6 @@
#include <regex>
#include <atomic>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <mutex>
#include <signal.h>
#include <thread>
@@ -78,27 +76,6 @@ static grpc::Status checkAuth(grpc::ServerContext* context) {
return grpc::Status(grpc::StatusCode::UNAUTHENTICATED, "invalid token");
}
// Minimal base64 encoder. The C++ backend already pulls in base64_decode from
// llama.cpp's server-common.cpp, but no encoder is exposed — and we need one to
// hand audio bytes to the existing PredictOptions.audios path (which expects
// base64-encoded strings, just like images).
static std::string base64_encode_bytes(const unsigned char* data, size_t len) {
static const char tbl[] =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
std::string out;
out.reserve(((len + 2) / 3) * 4);
for (size_t i = 0; i < len; i += 3) {
uint32_t triple = (uint32_t(data[i]) << 16);
if (i + 1 < len) triple |= (uint32_t(data[i + 1]) << 8);
if (i + 2 < len) triple |= uint32_t(data[i + 2]);
out.push_back(tbl[(triple >> 18) & 0x3F]);
out.push_back(tbl[(triple >> 12) & 0x3F]);
out.push_back(i + 1 < len ? tbl[(triple >> 6) & 0x3F] : '=');
out.push_back(i + 2 < len ? tbl[triple & 0x3F] : '=');
}
return out;
}
// END LocalAI
@@ -2814,13 +2791,6 @@ public:
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
// Report the active multimodal media marker so the Go layer can emit the
// same string when rendering prompts outside the tokenizer-template path.
// Only meaningful when an mtmd context was initialized (vision/audio models).
if (ctx_server.impl->mctx != nullptr) {
response->set_media_marker(get_media_marker());
}
// Check if chat templates are initialized
if (ctx_server.impl->chat_params.tmpls == nullptr) {
// If templates are not initialized, we can't detect thinking support
@@ -2961,119 +2931,6 @@ public:
return grpc::Status::OK;
}
// runTranscriptionAsCompletion implements OAI /v1/audio/transcriptions on
// top of the existing chat-completion + multimodal-audio pipeline, exactly
// the way upstream llama.cpp's server does it (see
// tools/server/server-context.cpp post_transcriptions_oai → forwards into
// handle_completions_impl with a single user message attaching the audio
// file via the mtmd marker).
//
// We synthesize a backend::PredictOptions with one user message
// ("Transcribe audio to text" + optional language hint) and the audio
// bytes attached via the existing PredictOptions.audios field, then
// delegate to our own Predict() handler. This keeps every multimodal
// codepath identical to the chat path and avoids duplicating ~700 lines
// of task-construction logic.
grpc::Status runTranscriptionAsCompletion(grpc::ServerContext* context,
const backend::TranscriptRequest* request,
backend::Reply* out_reply) {
if (params_base.model.path.empty()) {
return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
}
if (request->dst().empty()) {
return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "dst (audio file path) is required");
}
// Read audio bytes from the path LocalAI's HTTP layer wrote.
std::ifstream f(request->dst(), std::ios::binary);
if (!f.is_open()) {
return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "failed to open audio file: " + request->dst());
}
std::vector<unsigned char> bytes((std::istreambuf_iterator<char>(f)),
std::istreambuf_iterator<char>());
f.close();
if (bytes.empty()) {
return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "audio file is empty: " + request->dst());
}
std::string b64 = base64_encode_bytes(bytes.data(), bytes.size());
// Build the same prompt upstream uses in convert_transcriptions_to_chatcmpl.
std::string user_prompt = "Transcribe audio to text";
if (!request->language().empty()) {
user_prompt += " (language: " + request->language() + ")";
}
if (!request->prompt().empty()) {
// Optional context hint from the caller.
user_prompt += "\n" + request->prompt();
}
backend::PredictOptions synthetic;
synthetic.set_usetokenizertemplate(true);
synthetic.set_temperature(request->temperature());
// Generation length: leave at 0 so parse_options uses -1 (model default).
// The model's stop tokens / EOS handle termination naturally for ASR.
backend::Message* msg = synthetic.add_messages();
msg->set_role("user");
msg->set_content(user_prompt);
synthetic.add_audios(b64);
return Predict(context, &synthetic, out_reply);
}
grpc::Status AudioTranscription(ServerContext* context,
const backend::TranscriptRequest* request,
backend::TranscriptResult* response) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
backend::Reply reply;
grpc::Status st = runTranscriptionAsCompletion(context, request, &reply);
if (!st.ok()) {
return st;
}
response->set_text(reply.message());
if (!request->language().empty()) {
response->set_language(request->language());
}
return grpc::Status::OK;
}
grpc::Status AudioTranscriptionStream(ServerContext* context,
const backend::TranscriptRequest* request,
grpc::ServerWriter<backend::TranscriptStreamResponse>* writer) override {
auto auth = checkAuth(context);
if (!auth.ok()) return auth;
// Buffered streaming: run the transcription as a normal chat
// completion, then emit one delta + one final event. Real
// token-by-token streaming would require refactoring PredictStream's
// 700-line writer-coupled body; the HTTP/SSE contract is identical
// either way, and clients that only consume the assembled text don't
// notice the difference.
backend::Reply reply;
grpc::Status st = runTranscriptionAsCompletion(context, request, &reply);
if (!st.ok()) {
return st;
}
const std::string& text = reply.message();
if (!text.empty()) {
backend::TranscriptStreamResponse delta_chunk;
delta_chunk.set_delta(text);
writer->Write(delta_chunk);
}
backend::TranscriptStreamResponse final_chunk;
backend::TranscriptResult* final_result = final_chunk.mutable_final_result();
final_result->set_text(text);
if (!request->language().empty()) {
final_result->set_language(request->language());
}
writer->Write(final_chunk);
return grpc::Status::OK;
}
};

View File

@@ -1,81 +0,0 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=627ebbc6e27727bd4f65422d8aa60b13404993c8
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
TARGET?=--target grpc-server
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
ARCH?=$(shell uname -m)
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
GREEN := \033[0;32m
RESET := \033[0m
# turboquant is a llama.cpp fork. Rather than duplicating grpc-server.cpp / CMakeLists.txt /
# prepare.sh we reuse the ones in backend/cpp/llama-cpp, and only swap which repo+sha the
# fetch step pulls. Each flavor target copies ../llama-cpp into a sibling ../turboquant-<flavor>-build
# directory, then invokes llama-cpp's own build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION
# overridden to point at the fork.
PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches
# Each flavor target:
# 1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + CMakeLists.txt + Makefile)
# into a sibling turboquant-<flavor>-build directory;
# 2. clones the turboquant fork into turboquant-<flavor>-build/llama.cpp via the copy's
# own `llama.cpp` target, overriding LLAMA_REPO/LLAMA_VERSION;
# 3. applies patches from backend/cpp/turboquant/patches/ to the cloned fork sources
# (needed until the fork catches up with upstream server-context.cpp changes);
# 4. runs the copy's `grpc-server` target, which produces the binary we copy up as
# turboquant-<flavor>.
define turboquant-build
rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build purge
# Augment the copied grpc-server.cpp's KV-cache allow-list with the
# fork's turbo2/turbo3/turbo4 types. We patch the *copy*, never the
# original under backend/cpp/llama-cpp/, so the stock llama-cpp build
# stays compiling against vanilla upstream.
bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build/grpc-server.cpp
$(info $(GREEN)I turboquant build info:$(1)$(RESET))
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build llama.cpp
bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build/llama.cpp $(PATCHES_DIR)
CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" \
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(TURBOQUANT_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../turboquant-$(1)-build/grpc-server turboquant-$(1)
endef
turboquant-avx2:
$(call turboquant-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
turboquant-avx512:
$(call turboquant-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
turboquant-avx:
$(call turboquant-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
turboquant-fallback:
$(call turboquant-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
turboquant-grpc:
$(call turboquant-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
turboquant-rpc-server: turboquant-grpc
cp -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-grpc-build/llama.cpp/build/bin/rpc-server turboquant-rpc-server
package:
bash package.sh
purge:
rm -rf $(CURRENT_MAKEFILE_DIR)/../turboquant-*-build
rm -rf turboquant-* package
clean: purge

View File

@@ -1,50 +0,0 @@
#!/bin/bash
# Apply the turboquant patch series to a cloned llama-cpp-turboquant checkout.
#
# The turboquant fork branched from upstream llama.cpp before a few API changes
# that the shared backend/cpp/llama-cpp/grpc-server.cpp depends on. We carry
# those upstream commits as patch files under backend/cpp/turboquant/patches/
# and apply them here so the reused grpc-server source compiles against the
# fork unmodified.
#
# Drop the corresponding patch from patches/ whenever the fork catches up with
# upstream — the build will fail fast if a patch stops applying, which is the
# signal to retire it.
set -euo pipefail
if [[ $# -ne 2 ]]; then
echo "usage: $0 <llama.cpp-src-dir> <patches-dir>" >&2
exit 2
fi
SRC_DIR=$1
PATCHES_DIR=$2
if [[ ! -d "$SRC_DIR" ]]; then
echo "source dir does not exist: $SRC_DIR" >&2
exit 2
fi
if [[ ! -d "$PATCHES_DIR" ]]; then
echo "no patches dir at $PATCHES_DIR, nothing to apply"
exit 0
fi
shopt -s nullglob
patches=("$PATCHES_DIR"/*.patch)
shopt -u nullglob
if [[ ${#patches[@]} -eq 0 ]]; then
echo "no .patch files in $PATCHES_DIR, nothing to apply"
exit 0
fi
cd "$SRC_DIR"
for patch in "${patches[@]}"; do
echo "==> applying $patch"
git apply --verbose "$patch"
done
echo "all turboquant patches applied successfully"

View File

@@ -1,57 +0,0 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/turboquant-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

View File

@@ -1,80 +0,0 @@
#!/bin/bash
# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
# turboquant build to account for two gaps between upstream and the fork:
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
# 2. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
# server-side random per-instance marker) with the legacy "<__media__>"
# literal. The fork branched before that PR, so server-common.cpp has no
# get_media_marker symbol. The fork's mtmd_default_marker() still returns
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the turboquant path.
#
# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
# against vanilla upstream.
#
# Idempotent: skips each insertion if its marker is already present (so re-runs
# of the same build dir don't double-insert).
set -euo pipefail
if [[ $# -ne 1 ]]; then
echo "usage: $0 <grpc-server.cpp>" >&2
exit 2
fi
SRC=$1
if [[ ! -f "$SRC" ]]; then
echo "grpc-server.cpp not found at $SRC" >&2
exit 2
fi
if grep -q 'GGML_TYPE_TURBO2_0' "$SRC"; then
echo "==> $SRC already has TurboQuant cache types, skipping KV allow-list patch"
else
echo "==> patching $SRC to allow turbo2/turbo3/turbo4 KV-cache types"
# Insert the three TURBO entries right after the first ` GGML_TYPE_Q5_1,`
# line (the kv_cache_types[] allow-list). Using awk because the builder image
# does not ship python3, and GNU sed's multi-line `a\` quoting is awkward.
awk '
/^ GGML_TYPE_Q5_1,$/ && !done {
print
print " // turboquant fork extras — added by patch-grpc-server.sh"
print " GGML_TYPE_TURBO2_0,"
print " GGML_TYPE_TURBO3_0,"
print " GGML_TYPE_TURBO4_0,"
done = 1
next
}
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: anchor ` GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> KV allow-list patch OK"
fi
if grep -q 'get_media_marker()' "$SRC"; then
echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
# Only one call site today (ModelMetadata), but replace all occurrences to
# stay robust if upstream adds more. Use a temp file to avoid relying on
# sed -i portability (the builder image uses GNU sed, but keeping this
# consistent with the awk block above).
sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> get_media_marker() substitution OK"
else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
echo "==> all patches applied"

View File

@@ -1,65 +0,0 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
BINARY=turboquant-fallback
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/turboquant-avx ]; then
BINARY=turboquant-avx
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/turboquant-avx2 ]; then
BINARY=turboquant-avx2
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/turboquant-avx512 ]; then
BINARY=turboquant-avx512
fi
fi
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
if [ -e $CURDIR/turboquant-grpc ]; then
BINARY=turboquant-grpc
fi
fi
# Extend ld library path with the dir where this script is located/lib
if [ "$(uname)" == "Darwin" ]; then
export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
else
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
if [ -d "$CURDIR/lib/rocblas/library" ]; then
export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
fi
fi
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using binary: $BINARY"
exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
fi
echo "Using binary: $BINARY"
exec $CURDIR/$BINARY "$@"
# We should never reach this point, however just in case we do, run fallback
exec $CURDIR/turboquant-fallback "$@"

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# stablediffusion.cpp (ggml)
STABLEDIFFUSION_GGML_REPO?=https://github.com/leejet/stable-diffusion.cpp
STABLEDIFFUSION_GGML_VERSION?=7d33d4b2ddeafa672761a5880ec33bdff452504d
STABLEDIFFUSION_GGML_VERSION?=6b675a5ede9b0edf0a0f44191e8b79d7ef27615a
CMAKE_ARGS+=-DGGML_MAX_NAME=128

View File

@@ -26,10 +26,6 @@
#include "stb_image_resize.h"
#include <stdlib.h>
#include <regex>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
@@ -984,251 +980,6 @@ int gen_image(sd_img_gen_params_t *p, int steps, char *dst, float cfg_scale, cha
return !ret;
}
// ---------------- Video generation ----------------
sd_vid_gen_params_t* sd_vid_gen_params_new(void) {
sd_vid_gen_params_t *params = (sd_vid_gen_params_t *)std::malloc(sizeof(sd_vid_gen_params_t));
sd_vid_gen_params_init(params);
sd_sample_params_init(&params->sample_params);
sd_sample_params_init(&params->high_noise_sample_params);
sd_cache_params_init(&params->cache);
return params;
}
// Persistent storage for cleaned video prompts (kept alive for the duration of generation)
static std::string cleaned_vid_prompt_storage;
static std::string cleaned_vid_negative_prompt_storage;
void sd_vid_gen_params_set_prompts(sd_vid_gen_params_t *params, const char *prompt, const char *negative_prompt) {
lora_vec.clear();
lora_strings.clear();
std::string prompt_str = prompt ? prompt : "";
std::string negative_prompt_str = negative_prompt ? negative_prompt : "";
const char* lora_dir_to_use = lora_dir_path.empty() ? nullptr : lora_dir_path.c_str();
auto [loras, cleaned_prompt] = parse_loras_from_prompt(prompt_str, lora_dir_to_use);
lora_vec = loras;
cleaned_vid_prompt_storage = cleaned_prompt;
auto [neg_loras, cleaned_negative] = parse_loras_from_prompt(negative_prompt_str, lora_dir_to_use);
cleaned_vid_negative_prompt_storage = cleaned_negative;
params->prompt = cleaned_vid_prompt_storage.c_str();
params->negative_prompt = cleaned_vid_negative_prompt_storage.c_str();
params->loras = lora_vec.empty() ? nullptr : lora_vec.data();
params->lora_count = static_cast<uint32_t>(lora_vec.size());
}
void sd_vid_gen_params_set_dimensions(sd_vid_gen_params_t *params, int width, int height) {
params->width = width;
params->height = height;
}
void sd_vid_gen_params_set_seed(sd_vid_gen_params_t *params, int64_t seed) {
params->seed = seed;
}
void sd_vid_gen_params_set_video_frames(sd_vid_gen_params_t *params, int n) {
params->video_frames = n;
}
// Load an image file into an sd_image_t, resizing to target dims if needed.
// Returns a heap-allocated buffer the caller must free (or nullptr on failure).
static uint8_t* load_and_resize_image(const char* path, int target_width, int target_height, sd_image_t* out) {
if (!path || strlen(path) == 0) {
*out = {0, 0, 0, nullptr};
return nullptr;
}
int c = 0, img_w = 0, img_h = 0;
uint8_t* buf = stbi_load(path, &img_w, &img_h, &c, 3);
if (!buf) {
fprintf(stderr, "Failed to load image from '%s'\n", path);
*out = {0, 0, 0, nullptr};
return nullptr;
}
if (img_w != target_width || img_h != target_height) {
fprintf(stderr, "Resizing image from %dx%d to %dx%d\n", img_w, img_h, target_width, target_height);
uint8_t* resized = (uint8_t*)malloc((size_t)target_width * target_height * 3);
if (!resized) { free(buf); *out = {0, 0, 0, nullptr}; return nullptr; }
stbir_resize(buf, img_w, img_h, 0,
resized, target_width, target_height, 0, STBIR_TYPE_UINT8,
3, STBIR_ALPHA_CHANNEL_NONE, 0,
STBIR_EDGE_CLAMP, STBIR_EDGE_CLAMP,
STBIR_FILTER_BOX, STBIR_FILTER_BOX,
STBIR_COLORSPACE_SRGB, nullptr);
free(buf);
buf = resized;
}
*out = {(uint32_t)target_width, (uint32_t)target_height, 3, buf};
return buf;
}
// Pipe raw RGB/RGBA frames to ffmpeg stdin and let it produce an MP4 at dst.
// Uses fork+execvp to avoid shell interpretation of dst.
static int ffmpeg_mux_raw_to_mp4(sd_image_t* frames, int num_frames, int fps, const char* dst) {
if (num_frames <= 0 || !frames || !frames[0].data) {
fprintf(stderr, "ffmpeg_mux: empty frames\n");
return 1;
}
int width = (int)frames[0].width;
int height = (int)frames[0].height;
int channels = (int)frames[0].channel;
const char* pix_fmt_in = (channels == 4) ? "rgba" : "rgb24";
char size_str[32];
char fps_str[32];
snprintf(size_str, sizeof(size_str), "%dx%d", width, height);
snprintf(fps_str, sizeof(fps_str), "%d", fps);
int pipefd[2];
if (pipe(pipefd) != 0) { perror("pipe"); return 1; }
pid_t pid = fork();
if (pid < 0) { perror("fork"); close(pipefd[0]); close(pipefd[1]); return 1; }
if (pid == 0) {
// child
close(pipefd[1]);
if (dup2(pipefd[0], STDIN_FILENO) < 0) { perror("dup2"); _exit(127); }
close(pipefd[0]);
std::vector<char*> argv = {
const_cast<char*>("ffmpeg"),
const_cast<char*>("-y"),
const_cast<char*>("-hide_banner"),
const_cast<char*>("-loglevel"), const_cast<char*>("warning"),
const_cast<char*>("-f"), const_cast<char*>("rawvideo"),
const_cast<char*>("-pix_fmt"), const_cast<char*>(pix_fmt_in),
const_cast<char*>("-s"), size_str,
const_cast<char*>("-framerate"), fps_str,
const_cast<char*>("-i"), const_cast<char*>("-"),
const_cast<char*>("-c:v"), const_cast<char*>("libx264"),
const_cast<char*>("-pix_fmt"), const_cast<char*>("yuv420p"),
const_cast<char*>("-movflags"), const_cast<char*>("+faststart"),
const_cast<char*>(dst),
nullptr
};
execvp(argv[0], argv.data());
perror("execvp ffmpeg");
_exit(127);
}
// parent
close(pipefd[0]);
// Ignore SIGPIPE so a dying ffmpeg surfaces via write() errno instead of killing us.
signal(SIGPIPE, SIG_IGN);
for (int i = 0; i < num_frames; i++) {
if (!frames[i].data) continue;
size_t frame_bytes = (size_t)frames[i].width * frames[i].height * frames[i].channel;
const uint8_t* p = frames[i].data;
size_t remaining = frame_bytes;
while (remaining > 0) {
ssize_t n = write(pipefd[1], p, remaining);
if (n < 0) {
if (errno == EINTR) continue;
perror("write frame to ffmpeg");
close(pipefd[1]);
int status;
waitpid(pid, &status, 0);
return 1;
}
p += n;
remaining -= (size_t)n;
}
}
close(pipefd[1]);
int status = 0;
while (waitpid(pid, &status, 0) < 0) {
if (errno != EINTR) { perror("waitpid"); return 1; }
}
if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
fprintf(stderr, "ffmpeg exited with status %d\n", status);
return 1;
}
return 0;
}
int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int fps, char *init_image, char *end_image) {
if (!p) return 1;
if (!dst || strlen(dst) == 0) {
fprintf(stderr, "gen_video: dst is empty\n");
std::free(p);
return 1;
}
std::vector<int> skip_layers = {7, 8, 9};
fprintf(stderr, "Generating video: %dx%d, frames=%d, fps=%d, steps=%d, cfg=%.2f\n",
p->width, p->height, p->video_frames, fps, steps, cfg_scale);
// Sample params (shared by both low and high-noise passes — MoE models use the high-noise
// set during the first phase; single-model Wan2.1 ignores it. Same defaults for both is fine.)
p->sample_params.guidance.txt_cfg = cfg_scale;
p->sample_params.guidance.slg.layers = skip_layers.data();
p->sample_params.guidance.slg.layer_count = skip_layers.size();
p->sample_params.sample_method = sample_method;
p->sample_params.sample_steps = steps;
p->sample_params.scheduler = scheduler;
p->sample_params.flow_shift = flow_shift;
p->high_noise_sample_params.guidance.txt_cfg = cfg_scale;
p->high_noise_sample_params.guidance.slg.layers = skip_layers.data();
p->high_noise_sample_params.guidance.slg.layer_count = skip_layers.size();
p->high_noise_sample_params.sample_method = sample_method;
p->high_noise_sample_params.sample_steps = steps;
p->high_noise_sample_params.scheduler = scheduler;
p->high_noise_sample_params.flow_shift = flow_shift;
// Load init/end reference images if provided (resized to output dims).
uint8_t* init_buf = nullptr;
uint8_t* end_buf = nullptr;
sd_image_t init_img = {0, 0, 0, nullptr};
sd_image_t end_img = {0, 0, 0, nullptr};
if (init_image && strlen(init_image) > 0) {
init_buf = load_and_resize_image(init_image, p->width, p->height, &init_img);
if (!init_buf) { std::free(p); return 1; }
}
if (end_image && strlen(end_image) > 0) {
end_buf = load_and_resize_image(end_image, p->width, p->height, &end_img);
if (!end_buf) { if (init_buf) free(init_buf); std::free(p); return 1; }
}
p->init_image = init_img;
p->end_image = end_img;
// Generate
int num_frames_out = 0;
sd_image_t* frames = generate_video(sd_c, p, &num_frames_out);
std::free(p);
if (!frames || num_frames_out == 0) {
fprintf(stderr, "generate_video produced no frames\n");
if (init_buf) free(init_buf);
if (end_buf) free(end_buf);
return 1;
}
fprintf(stderr, "Generated %d frames, muxing to %s via ffmpeg\n", num_frames_out, dst);
int rc = ffmpeg_mux_raw_to_mp4(frames, num_frames_out, fps, dst);
for (int i = 0; i < num_frames_out; i++) {
if (frames[i].data) free(frames[i].data);
}
free(frames);
if (init_buf) free(init_buf);
if (end_buf) free(end_buf);
if (rc == 0) {
fprintf(stderr, "gen_video done: %s\n", dst);
}
fflush(stderr);
return rc;
}
int unload() {
free_sd_ctx(sd_c);
return 0;

View File

@@ -23,7 +23,6 @@ type SDGGML struct {
var (
LoadModel func(model, model_apth string, options []uintptr, threads int32, diff int) int
GenImage func(params uintptr, steps int, dst string, cfgScale float32, srcImage string, strength float32, maskImage string, refImages []uintptr, refImagesCount int) int
GenVideo func(params uintptr, steps int, dst string, cfgScale float32, fps int, initImage string, endImage string) int
TilingParamsSetEnabled func(params uintptr, enabled bool)
TilingParamsSetTileSizes func(params uintptr, tileSizeX int, tileSizeY int)
@@ -35,12 +34,6 @@ var (
ImgGenParamsSetDimensions func(params uintptr, width int, height int)
ImgGenParamsSetSeed func(params uintptr, seed int64)
ImgGenParamsGetVaeTilingParams func(params uintptr) uintptr
VidGenParamsNew func() uintptr
VidGenParamsSetPrompts func(params uintptr, prompt string, negativePrompt string)
VidGenParamsSetDimensions func(params uintptr, width int, height int)
VidGenParamsSetSeed func(params uintptr, seed int64)
VidGenParamsSetVideoFrames func(params uintptr, n int)
)
// Copied from Purego internal/strings
@@ -160,58 +153,3 @@ func (sd *SDGGML) GenerateImage(opts *pb.GenerateImageRequest) error {
return nil
}
func (sd *SDGGML) GenerateVideo(opts *pb.GenerateVideoRequest) error {
dst := opts.Dst
if dst == "" {
return fmt.Errorf("dst is empty")
}
width := int(opts.Width)
height := int(opts.Height)
if width == 0 {
width = 512
}
if height == 0 {
height = 512
}
numFrames := int(opts.NumFrames)
if numFrames <= 0 {
numFrames = 16
}
fps := int(opts.Fps)
if fps <= 0 {
fps = 16
}
steps := int(opts.Step)
if steps <= 0 {
steps = 20
}
cfg := opts.CfgScale
if cfg == 0 {
cfg = sd.cfgScale
}
if cfg == 0 {
cfg = 5.0
}
// sd_vid_gen_params_new allocates; gen_video frees it after the generation call.
p := VidGenParamsNew()
VidGenParamsSetPrompts(p, opts.Prompt, opts.NegativePrompt)
VidGenParamsSetDimensions(p, width, height)
VidGenParamsSetSeed(p, int64(opts.Seed))
VidGenParamsSetVideoFrames(p, numFrames)
fmt.Fprintf(os.Stderr, "GenerateVideo: dst=%s size=%dx%d frames=%d fps=%d steps=%d cfg=%.2f\n",
dst, width, height, numFrames, fps, steps, cfg)
ret := GenVideo(p, steps, dst, cfg, fps, opts.StartImage, opts.EndImage)
if ret != 0 {
return fmt.Errorf("video inference failed (code %d)", ret)
}
return nil
}

View File

@@ -18,13 +18,6 @@ void sd_img_gen_params_set_seed(sd_img_gen_params_t *params, int64_t seed);
int load_model(const char *model, char *model_path, char* options[], int threads, int diffusionModel);
int gen_image(sd_img_gen_params_t *p, int steps, char *dst, float cfg_scale, char *src_image, float strength, char *mask_image, char* ref_images[], int ref_images_count);
sd_vid_gen_params_t* sd_vid_gen_params_new(void);
void sd_vid_gen_params_set_prompts(sd_vid_gen_params_t *params, const char *prompt, const char *negative_prompt);
void sd_vid_gen_params_set_dimensions(sd_vid_gen_params_t *params, int width, int height);
void sd_vid_gen_params_set_seed(sd_vid_gen_params_t *params, int64_t seed);
void sd_vid_gen_params_set_video_frames(sd_vid_gen_params_t *params, int n);
int gen_video(sd_vid_gen_params_t *p, int steps, char *dst, float cfg_scale, int fps, char *init_image, char *end_image);
#ifdef __cplusplus
}
#endif

View File

@@ -32,7 +32,6 @@ func main() {
libFuncs := []LibFuncs{
{&LoadModel, "load_model"},
{&GenImage, "gen_image"},
{&GenVideo, "gen_video"},
{&TilingParamsSetEnabled, "sd_tiling_params_set_enabled"},
{&TilingParamsSetTileSizes, "sd_tiling_params_set_tile_sizes"},
{&TilingParamsSetRelSizes, "sd_tiling_params_set_rel_sizes"},
@@ -43,12 +42,6 @@ func main() {
{&ImgGenParamsSetDimensions, "sd_img_gen_params_set_dimensions"},
{&ImgGenParamsSetSeed, "sd_img_gen_params_set_seed"},
{&ImgGenParamsGetVaeTilingParams, "sd_img_gen_params_get_vae_tiling_params"},
{&VidGenParamsNew, "sd_vid_gen_params_new"},
{&VidGenParamsSetPrompts, "sd_vid_gen_params_set_prompts"},
{&VidGenParamsSetDimensions, "sd_vid_gen_params_set_dimensions"},
{&VidGenParamsSetSeed, "sd_vid_gen_params_set_seed"},
{&VidGenParamsSetVideoFrames, "sd_vid_gen_params_set_video_frames"},
}
for _, lf := range libFuncs {

View File

@@ -56,6 +56,5 @@ func (v *Voxtral) AudioTranscription(opts *pb.TranscriptRequest) (pb.TranscriptR
return pb.TranscriptResult{
Segments: segments,
Text: text,
Language: opts.Language,
}, nil
}

View File

@@ -8,7 +8,7 @@ JOBS?=$(shell nproc --ignore=1)
# whisper.cpp version
WHISPER_REPO?=https://github.com/ggml-org/whisper.cpp
WHISPER_CPP_VERSION?=166c20b473d5f4d04052e699f992f625ea2a2fdd
WHISPER_CPP_VERSION?=95ea8f9bfb03a15db08a8989966fd1ae3361e20d
SO_TARGET?=libgowhisper.so
CMAKE_ARGS+=-DBUILD_SHARED_LIBS=OFF

View File

@@ -120,12 +120,6 @@ func (w *Whisper) AudioTranscription(opts *pb.TranscriptRequest) (pb.TranscriptR
}
data := buf.AsFloat32Buffer().Data
// whisper.cpp resamples to 16 kHz internally; this matches buf.Format.SampleRate
// for the converted file produced by AudioToWav above.
var duration float32
if buf.Format != nil && buf.Format.SampleRate > 0 {
duration = float32(len(data)) / float32(buf.Format.SampleRate)
}
segsLen := uintptr(0xdeadbeef)
segsLenPtr := unsafe.Pointer(&segsLen)
@@ -164,7 +158,5 @@ func (w *Whisper) AudioTranscription(opts *pb.TranscriptRequest) (pb.TranscriptR
return pb.TranscriptResult{
Segments: segments,
Text: strings.TrimSpace(text),
Language: opts.Language,
Duration: duration,
}, nil
}

View File

@@ -43,35 +43,6 @@
- CPU
capabilities:
default: "cpu-ik-llama-cpp"
- &turboquant
name: "turboquant"
alias: "turboquant"
license: mit
description: |
Fork of llama.cpp adding the TurboQuant KV-cache quantization scheme.
Reuses the LocalAI llama.cpp gRPC server sources against the fork's libllama.
urls:
- https://github.com/TheTom/llama-cpp-turboquant
tags:
- text-to-text
- LLM
- CPU
- GPU
- CUDA
- HIP
- turboquant
- kv-cache
capabilities:
default: "cpu-turboquant"
nvidia: "cuda12-turboquant"
intel: "intel-sycl-f16-turboquant"
amd: "rocm-turboquant"
vulkan: "vulkan-turboquant"
nvidia-l4t: "nvidia-l4t-arm64-turboquant"
nvidia-cuda-13: "cuda13-turboquant"
nvidia-cuda-12: "cuda12-turboquant"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant"
- &whispercpp
name: "whisper"
alias: "whisper"
@@ -227,28 +198,6 @@
intel: "intel-vllm"
nvidia-cuda-12: "cuda12-vllm"
cpu: "cpu-vllm"
- &sglang
name: "sglang"
license: apache-2.0
urls:
- https://github.com/sgl-project/sglang
tags:
- text-to-text
- multimodal
icon: https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png
description: |
SGLang is a fast serving framework for large language models and vision language models.
It co-designs the backend runtime (RadixAttention, continuous batching, structured
decoding) and the frontend language to make interaction with models faster and more
controllable. Features include fast backend runtime, flexible frontend language,
extensive model support, and an active community.
alias: "sglang"
capabilities:
nvidia: "cuda12-sglang"
amd: "rocm-sglang"
intel: "intel-sglang"
nvidia-cuda-12: "cuda12-sglang"
cpu: "cpu-sglang"
- &vllm-omni
name: "vllm-omni"
license: apache-2.0
@@ -383,34 +332,6 @@
intel: "intel-rerankers"
amd: "rocm-rerankers"
metal: "metal-rerankers"
- &tinygrad
name: "tinygrad"
alias: "tinygrad"
license: MIT
description: |
tinygrad is a minimalist deep-learning framework with zero runtime
dependencies that targets CUDA, ROCm, Metal, WebGPU and CPU (CLANG).
The LocalAI tinygrad backend exposes a single multimodal runtime that
covers LLM text generation (Llama / Qwen / Mistral via safetensors or
GGUF) with native tool-call extraction, BERT-family embeddings,
Stable Diffusion 1.x / 2 / XL image generation, and Whisper speech-to-text.
Single image: tinygrad generates its own GPU kernels and dlopens the
host driver libraries at runtime, so there is no per-toolkit build
split. The same image runs CPU-only or accelerates against
CUDA / ROCm / Metal when the host driver is visible.
urls:
- https://github.com/tinygrad/tinygrad
uri: "quay.io/go-skynet/local-ai-backends:latest-tinygrad"
mirrors:
- localai/localai-backends:latest-tinygrad
tags:
- text-to-text
- LLM
- embeddings
- image-generation
- transcription
- multimodal
- &transformers
name: "transformers"
icon: https://avatars.githubusercontent.com/u/25720743?s=200&v=4
@@ -995,33 +916,6 @@
name: "ik-llama-cpp-development"
capabilities:
default: "cpu-ik-llama-cpp-development"
- !!merge <<: *turboquant
name: "turboquant-development"
capabilities:
default: "cpu-turboquant-development"
nvidia: "cuda12-turboquant-development"
intel: "intel-sycl-f16-turboquant-development"
amd: "rocm-turboquant-development"
vulkan: "vulkan-turboquant-development"
nvidia-l4t: "nvidia-l4t-arm64-turboquant-development"
nvidia-cuda-13: "cuda13-turboquant-development"
nvidia-cuda-12: "cuda12-turboquant-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
- !!merge <<: *stablediffusionggml
name: "stablediffusion-ggml-development"
capabilities:
default: "cpu-stablediffusion-ggml-development"
nvidia: "cuda12-stablediffusion-ggml-development"
intel: "intel-sycl-f16-stablediffusion-ggml-development"
# amd: "rocm-stablediffusion-ggml-development"
vulkan: "vulkan-stablediffusion-ggml-development"
nvidia-l4t: "nvidia-l4t-arm64-stablediffusion-ggml-development"
metal: "metal-stablediffusion-ggml-development"
nvidia-cuda-13: "cuda13-stablediffusion-ggml-development"
nvidia-cuda-12: "cuda12-stablediffusion-ggml-development"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-stablediffusion-ggml-development"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-stablediffusion-ggml-development"
- !!merge <<: *neutts
name: "cpu-neutts"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-neutts"
@@ -1463,97 +1357,6 @@
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-ik-llama-cpp"
mirrors:
- localai/localai-backends:master-cpu-ik-llama-cpp
## turboquant
- !!merge <<: *turboquant
name: "cpu-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-turboquant"
mirrors:
- localai/localai-backends:latest-cpu-turboquant
- !!merge <<: *turboquant
name: "cpu-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-turboquant"
mirrors:
- localai/localai-backends:master-cpu-turboquant
- !!merge <<: *turboquant
name: "cuda12-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-turboquant
- !!merge <<: *turboquant
name: "cuda12-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-turboquant"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-turboquant
- !!merge <<: *turboquant
name: "cuda13-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-13-turboquant
- !!merge <<: *turboquant
name: "cuda13-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-turboquant"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-13-turboquant
- !!merge <<: *turboquant
name: "rocm-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-turboquant
- !!merge <<: *turboquant
name: "rocm-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-turboquant"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f32-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f32-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f32-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f32-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f32-turboquant"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f32-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f16-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sycl-f16-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-intel-sycl-f16-turboquant
- !!merge <<: *turboquant
name: "intel-sycl-f16-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sycl-f16-turboquant"
mirrors:
- localai/localai-backends:master-gpu-intel-sycl-f16-turboquant
- !!merge <<: *turboquant
name: "vulkan-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-vulkan-turboquant"
mirrors:
- localai/localai-backends:latest-gpu-vulkan-turboquant
- !!merge <<: *turboquant
name: "vulkan-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-vulkan-turboquant"
mirrors:
- localai/localai-backends:master-gpu-vulkan-turboquant
- !!merge <<: *turboquant
name: "nvidia-l4t-arm64-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-turboquant"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-arm64-turboquant
- !!merge <<: *turboquant
name: "nvidia-l4t-arm64-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-turboquant"
mirrors:
- localai/localai-backends:master-nvidia-l4t-arm64-turboquant
- !!merge <<: *turboquant
name: "cuda13-nvidia-l4t-arm64-turboquant"
uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-turboquant"
mirrors:
- localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-turboquant
- !!merge <<: *turboquant
name: "cuda13-nvidia-l4t-arm64-turboquant-development"
uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant"
mirrors:
- localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
## whisper
- !!merge <<: *whispercpp
name: "nvidia-l4t-arm64-whisper"
@@ -1802,54 +1605,6 @@
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-vllm"
mirrors:
- localai/localai-backends:master-cpu-vllm
# sglang
- !!merge <<: *sglang
name: "sglang-development"
capabilities:
nvidia: "cuda12-sglang-development"
amd: "rocm-sglang-development"
intel: "intel-sglang-development"
cpu: "cpu-sglang-development"
- !!merge <<: *sglang
name: "cuda12-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:latest-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "rocm-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-rocm-hipblas-sglang"
mirrors:
- localai/localai-backends:latest-gpu-rocm-hipblas-sglang
- !!merge <<: *sglang
name: "intel-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-intel-sglang"
mirrors:
- localai/localai-backends:latest-gpu-intel-sglang
- !!merge <<: *sglang
name: "cpu-sglang"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-sglang"
mirrors:
- localai/localai-backends:latest-cpu-sglang
- !!merge <<: *sglang
name: "cuda12-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-sglang"
mirrors:
- localai/localai-backends:master-gpu-nvidia-cuda-12-sglang
- !!merge <<: *sglang
name: "rocm-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-rocm-hipblas-sglang"
mirrors:
- localai/localai-backends:master-gpu-rocm-hipblas-sglang
- !!merge <<: *sglang
name: "intel-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-gpu-intel-sglang"
mirrors:
- localai/localai-backends:master-gpu-intel-sglang
- !!merge <<: *sglang
name: "cpu-sglang-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-sglang"
mirrors:
- localai/localai-backends:master-cpu-sglang
# vllm-omni
- !!merge <<: *vllm-omni
name: "vllm-omni-development"
@@ -2105,15 +1860,6 @@
uri: "quay.io/go-skynet/local-ai-backends:master-metal-darwin-arm64-rerankers"
mirrors:
- localai/localai-backends:master-metal-darwin-arm64-rerankers
## tinygrad
## Single image — the meta anchor above carries the latest uri directly
## since there is only one variant. The development entry below points at
## the master tag.
- !!merge <<: *tinygrad
name: "tinygrad-development"
uri: "quay.io/go-skynet/local-ai-backends:master-tinygrad"
mirrors:
- localai/localai-backends:master-tinygrad
## Transformers
- !!merge <<: *transformers
name: "transformers-development"

View File

@@ -344,16 +344,7 @@ function ensureVenv() {
if [ ! -d "${EDIR}/venv" ]; then
if [ "x${USE_PIP}" == "xtrue" ]; then
# --copies is only needed when we will later relocate the venv via
# _makeVenvPortable (PORTABLE_PYTHON=true). Some Python builds —
# notably macOS system Python — refuse to create a venv with
# --copies because the build doesn't support it. Fall back to
# symlinks in that case.
local venv_args=""
if [ "x${PORTABLE_PYTHON}" == "xtrue" ]; then
venv_args="--copies"
fi
"${interpreter}" -m venv ${venv_args} "${EDIR}/venv"
"${interpreter}" -m venv --copies "${EDIR}/venv"
source "${EDIR}/venv/bin/activate"
"${interpreter}" -m pip install --upgrade pip
else

View File

@@ -1,100 +0,0 @@
"""Shared utilities for the mlx and mlx-vlm gRPC backends.
These helpers wrap mlx-lm's and mlx-vlm's native tool-parser modules, which
auto-detect the right parser from the model's chat template. Each tool
module exposes ``tool_call_start``, ``tool_call_end`` and
``parse_tool_call(text, tools) -> dict | list[dict]``.
The split-reasoning helper is generic enough to work with any think-start /
think-end delimiter pair.
"""
import json
import re
import sys
import uuid
def split_reasoning(text, think_start, think_end):
"""Split ``<think>...</think>`` blocks out of ``text``.
Returns ``(reasoning_content, remaining_text)``. When ``think_start`` is
empty or not found, returns ``("", text)`` unchanged.
"""
if not think_start or not text or think_start not in text:
return "", text
pattern = re.compile(
re.escape(think_start) + r"(.*?)" + re.escape(think_end or ""),
re.DOTALL,
)
reasoning_parts = pattern.findall(text)
if not reasoning_parts:
return "", text
remaining = pattern.sub("", text).strip()
return "\n".join(p.strip() for p in reasoning_parts), remaining
def parse_tool_calls(text, tool_module, tools):
"""Extract tool calls from ``text`` using a mlx-lm tool module.
Ports the ``process_tool_calls`` logic from
``mlx_vlm/server.py`` (v0.10 onwards). ``tool_module`` must expose
``tool_call_start``, ``tool_call_end`` and ``parse_tool_call``.
Returns ``(calls, remaining_text)`` where ``calls`` is a list of dicts:
[{"index": int, "id": str, "name": str, "arguments": str (JSON)}]
and ``remaining_text`` is the free-form text with the tool call blocks
removed. ``(calls, text)`` is returned unchanged if ``tool_module`` is
``None`` or the start delimiter isn't present.
"""
if tool_module is None or not text:
return [], text
start = getattr(tool_module, "tool_call_start", None)
end = getattr(tool_module, "tool_call_end", None)
parse_fn = getattr(tool_module, "parse_tool_call", None)
if not start or parse_fn is None or start not in text:
return [], text
if end == "" or end is None:
pattern = re.compile(
re.escape(start) + r".*?(?:\n|$)",
re.DOTALL,
)
else:
pattern = re.compile(
re.escape(start) + r".*?" + re.escape(end),
re.DOTALL,
)
matches = pattern.findall(text)
if not matches:
return [], text
remaining = pattern.sub(" ", text).strip()
calls = []
for match in matches:
call_body = match.strip().removeprefix(start)
if end:
call_body = call_body.removesuffix(end)
call_body = call_body.strip()
try:
parsed = parse_fn(call_body, tools)
except Exception as e:
print(
f"[mlx_utils] Invalid tool call: {call_body!r} ({e})",
file=sys.stderr,
)
continue
if not isinstance(parsed, list):
parsed = [parsed]
for tc in parsed:
calls.append(
{
"index": len(calls),
"id": str(uuid.uuid4()),
"name": (tc.get("name") or "").strip(),
"arguments": json.dumps(tc.get("arguments", {}), ensure_ascii=False),
}
)
return calls, remaining

View File

@@ -1,65 +0,0 @@
"""Generic utilities shared across Python gRPC backends.
These helpers don't depend on any specific inference framework and can be
imported by any backend that needs to parse LocalAI gRPC options or build a
chat-template-compatible message list from proto Message objects.
"""
import json
def parse_options(options_list):
"""Parse Options[] list of ``key:value`` strings into a dict.
Supports type inference for common cases (bool, int, float). Unknown or
mixed-case values are returned as strings.
Used by LoadModel to extract backend-specific options passed via
``ModelOptions.Options`` in ``backend.proto``.
"""
opts = {}
for opt in options_list:
if ":" not in opt:
continue
key, value = opt.split(":", 1)
key = key.strip()
value = value.strip()
# Try type conversion
if value.lower() in ("true", "false"):
opts[key] = value.lower() == "true"
else:
try:
opts[key] = int(value)
except ValueError:
try:
opts[key] = float(value)
except ValueError:
opts[key] = value
return opts
def messages_to_dicts(proto_messages):
"""Convert proto ``Message`` objects to dicts suitable for ``apply_chat_template``.
Handles: ``role``, ``content``, ``name``, ``tool_call_id``,
``reasoning_content``, ``tool_calls`` (JSON string → Python list).
HuggingFace chat templates (and their MLX/vLLM wrappers) expect a list of
plain dicts — proto Message objects don't work directly with Jinja, so
this conversion is needed before every ``apply_chat_template`` call.
"""
result = []
for msg in proto_messages:
d = {"role": msg.role, "content": msg.content or ""}
if msg.name:
d["name"] = msg.name
if msg.tool_call_id:
d["tool_call_id"] = msg.tool_call_id
if msg.reasoning_content:
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
except json.JSONDecodeError:
pass
result.append(d)
return result

View File

@@ -1,22 +1,63 @@
"""vLLM-specific helpers for the vllm and vllm-omni gRPC backends.
Generic helpers (``parse_options``, ``messages_to_dicts``) live in
``python_utils`` and are re-exported here for backwards compatibility with
existing imports in both backends.
"""
"""Shared utilities for vLLM-based backends."""
import json
import sys
from python_utils import messages_to_dicts, parse_options
__all__ = ["parse_options", "messages_to_dicts", "setup_parsers"]
def parse_options(options_list):
"""Parse Options[] list of 'key:value' strings into a dict.
Supports type inference for common cases (bool, int, float).
Used by LoadModel to extract backend-specific options.
"""
opts = {}
for opt in options_list:
if ":" not in opt:
continue
key, value = opt.split(":", 1)
key = key.strip()
value = value.strip()
# Try type conversion
if value.lower() in ("true", "false"):
opts[key] = value.lower() == "true"
else:
try:
opts[key] = int(value)
except ValueError:
try:
opts[key] = float(value)
except ValueError:
opts[key] = value
return opts
def messages_to_dicts(proto_messages):
"""Convert proto Message objects to list of dicts for apply_chat_template().
Handles: role, content, name, tool_call_id, reasoning_content, tool_calls (JSON string -> list).
"""
result = []
for msg in proto_messages:
d = {"role": msg.role, "content": msg.content or ""}
if msg.name:
d["name"] = msg.name
if msg.tool_call_id:
d["tool_call_id"] = msg.tool_call_id
if msg.reasoning_content:
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
except json.JSONDecodeError:
pass
result.append(d)
return result
def setup_parsers(opts):
"""Return ``(tool_parser_cls, reasoning_parser_cls)`` from an opts dict.
"""Return (tool_parser_cls, reasoning_parser_cls) tuple from opts dict.
Uses vLLM's native ``ToolParserManager`` / ``ReasoningParserManager``.
Returns ``(None, None)`` if vLLM isn't installed or the requested
parser name can't be resolved.
Uses vLLM's native ToolParserManager and ReasoningParserManager.
Returns (None, None) if vLLM is not installed or parsers not available.
"""
tool_parser_cls = None
reasoning_parser_cls = None

View File

@@ -15,21 +15,17 @@ Two startup modes:
import asyncio
from concurrent import futures
import argparse
import gc
import json
import os
import signal
import sys
import tempfile
import types
from typing import List
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
from python_utils import messages_to_dicts, parse_options as _shared_parse_options
from mlx_utils import parse_tool_calls, split_reasoning
import backend_pb2
@@ -66,10 +62,37 @@ def mlx_distributed_init(rank, hostfile, backend="ring", coordinator=None):
raise ValueError(f"Unknown backend: {backend}")
# Re-export the shared helper under the local name for back-compat with
# any callers (and the existing distributed worker tests) that imported
# parse_options directly from this module.
parse_options = _shared_parse_options
def is_float(s):
try:
float(s)
return True
except ValueError:
return False
def is_int(s):
try:
int(s)
return True
except ValueError:
return False
def parse_options(options):
"""Parse key:value option strings into a dict."""
result = {}
for opt in options:
if ":" not in opt:
continue
key, value = opt.split(":", 1)
if is_float(value):
value = float(value)
elif is_int(value):
value = int(value)
elif value.lower() in ["true", "false"]:
value = value.lower() == "true"
result[key] = value
return result
class BackendServicer(backend_pb2_grpc.BackendServicer):
@@ -165,20 +188,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
)
print("[Rank 0] Model loaded (single-node with prompt cache)", file=sys.stderr)
# Log auto-detected TokenizerWrapper capabilities. Same shape
# as the mlx backend: has_tool_calling / has_thinking from
# mlx_lm.tokenizer_utils + the start/end markers it sniffed
# from the chat template / vocab.
has_tools = bool(getattr(self.tokenizer, "has_tool_calling", False))
has_thinking = bool(getattr(self.tokenizer, "has_thinking", False))
tcs = getattr(self.tokenizer, "tool_call_start", None)
tce = getattr(self.tokenizer, "tool_call_end", None)
print(
f"[Rank 0] Tokenizer capabilities: has_tool_calling={has_tools} "
f"has_thinking={has_thinking} tool_call_start={tcs!r} tool_call_end={tce!r}",
file=sys.stderr,
)
except Exception as err:
print(f"[Rank 0] Error loading model: {err}", file=sys.stderr)
return backend_pb2.Result(success=False, message=f"Error loading model: {err}")
@@ -192,7 +201,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
try:
import mlx.core as mx
from mlx_lm import stream_generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler
from mlx_lm.sample_utils import make_sampler
prompt_text = self._prepare_prompt(request)
tokens = self._get_tokens_from_prompt(prompt_text)
@@ -202,7 +211,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
self.coordinator.broadcast_command(CMD_GENERATE, len(tokens))
self.coordinator.broadcast_tokens(tokens)
max_tokens, sampler_params, logits_params, stop_words = self._build_generation_params(request)
max_tokens, sampler_params = self._build_generation_params(request)
if self.coordinator:
gen_params = self.coordinator.broadcast_generation_params(
@@ -213,7 +222,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
max_tokens = gen_params["max_tokens"]
sampler = make_sampler(**sampler_params)
logits_processors = make_logits_processors(**logits_params) if logits_params else None
# Use prompt cache in single-node mode
gen_kwargs = {}
@@ -230,44 +238,22 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
tokens = remaining_tokens if remaining_tokens else cache_key
generated = []
last_response = None
for response in stream_generate(
self.model,
self.tokenizer,
prompt=tokens,
max_tokens=max_tokens,
sampler=sampler,
logits_processors=logits_processors,
**gen_kwargs,
):
generated.append(response.text)
last_response = response
if cache_key is not None:
cache_key.append(response.token)
if stop_words and any(s in "".join(generated) for s in stop_words):
break
if self.lru_cache is not None and cache_key is not None:
self.lru_cache.insert_cache(self.model_key, cache_key, prompt_cache)
full_text = self._truncate_at_stop("".join(generated), stop_words)
content, reasoning_content, tool_calls_proto, prompt_tokens, completion_tokens, logprobs_bytes = (
self._finalize_output(request, full_text, last_response)
)
return backend_pb2.Reply(
message=bytes(content, encoding='utf-8'),
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
logprobs=logprobs_bytes,
chat_deltas=[
backend_pb2.ChatDelta(
content=content,
reasoning_content=reasoning_content,
tool_calls=tool_calls_proto,
)
],
)
return backend_pb2.Reply(message=bytes(''.join(generated), encoding='utf-8'))
except Exception as e:
print(f"[Rank 0] Error in Predict: {e}", file=sys.stderr)
@@ -282,7 +268,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
try:
import mlx.core as mx
from mlx_lm import stream_generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler
from mlx_lm.sample_utils import make_sampler
prompt_text = self._prepare_prompt(request)
tokens = self._get_tokens_from_prompt(prompt_text)
@@ -292,9 +278,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
self.coordinator.broadcast_command(CMD_GENERATE, len(tokens))
self.coordinator.broadcast_tokens(tokens)
max_tokens, sampler_params, logits_params, stop_words = self._build_generation_params(
request, default_max_tokens=512
)
max_tokens, sampler_params = self._build_generation_params(request, default_max_tokens=512)
if self.coordinator:
gen_params = self.coordinator.broadcast_generation_params(
@@ -305,7 +289,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
max_tokens = gen_params["max_tokens"]
sampler = make_sampler(**sampler_params)
logits_processors = make_logits_processors(**logits_params) if logits_params else None
# Use prompt cache in single-node mode
gen_kwargs = {}
@@ -321,45 +304,17 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
gen_kwargs['prompt_cache'] = prompt_cache
tokens = remaining_tokens if remaining_tokens else cache_key
accumulated = []
last_response = None
for response in stream_generate(
self.model,
self.tokenizer,
prompt=tokens,
max_tokens=max_tokens,
sampler=sampler,
logits_processors=logits_processors,
**gen_kwargs,
):
if cache_key is not None:
cache_key.append(response.token)
accumulated.append(response.text)
last_response = response
yield backend_pb2.Reply(
message=bytes(response.text, encoding='utf-8'),
chat_deltas=[backend_pb2.ChatDelta(content=response.text)],
)
if stop_words and any(s in "".join(accumulated) for s in stop_words):
break
full_text = self._truncate_at_stop("".join(accumulated), stop_words)
content, reasoning_content, tool_calls_proto, prompt_tokens, completion_tokens, logprobs_bytes = (
self._finalize_output(request, full_text, last_response)
)
yield backend_pb2.Reply(
message=b"",
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
logprobs=logprobs_bytes,
chat_deltas=[
backend_pb2.ChatDelta(
content="",
reasoning_content=reasoning_content,
tool_calls=tool_calls_proto,
)
],
)
yield backend_pb2.Reply(message=bytes(response.text, encoding='utf-8'))
except Exception as e:
print(f"[Rank 0] Error in PredictStream: {e}", file=sys.stderr)
@@ -380,74 +335,12 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
context.set_details("Embeddings are not supported in the MLX distributed backend.")
return backend_pb2.EmbeddingResult()
async def TokenizeString(self, request, context):
if not hasattr(self, "tokenizer") or self.tokenizer is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("tokenizer not loaded")
return backend_pb2.TokenizationResponse()
try:
tokens = self.tokenizer.encode(request.Prompt)
if hasattr(tokens, "tolist"):
tokens = tokens.tolist()
tokens = list(tokens)
return backend_pb2.TokenizationResponse(length=len(tokens), tokens=tokens)
except Exception as e:
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(e))
return backend_pb2.TokenizationResponse()
async def Free(self, request, context):
try:
# If we're rank 0 of a distributed run, tell workers to shut
# down their per-request loops first so they release the model.
if self.coordinator is not None:
try:
from coordinator import CMD_SHUTDOWN
self.coordinator.broadcast_command(CMD_SHUTDOWN)
except Exception as e:
print(f"[Rank 0] failed to broadcast shutdown: {e}", file=sys.stderr)
if hasattr(self, "model"):
del self.model
if hasattr(self, "tokenizer"):
del self.tokenizer
if self.lru_cache is not None:
try:
self.lru_cache.clear()
except Exception:
pass
self.lru_cache = None
self.coordinator = None
self.group = None
gc.collect()
try:
import mlx.core as mx # type: ignore
if hasattr(mx, "clear_cache"):
mx.clear_cache()
elif hasattr(mx, "metal") and hasattr(mx.metal, "clear_cache"):
mx.metal.clear_cache()
except Exception:
pass
return backend_pb2.Result(success=True, message="MLX distributed model freed")
except Exception as e:
return backend_pb2.Result(success=False, message=str(e))
def _prepare_prompt(self, request):
if not request.Prompt and request.UseTokenizerTemplate and request.Messages:
messages = messages_to_dicts(request.Messages)
kwargs = {"tokenize": False, "add_generation_prompt": True}
if request.Tools:
try:
kwargs["tools"] = json.loads(request.Tools)
except json.JSONDecodeError:
pass
if request.Metadata.get("enable_thinking", "").lower() == "true":
kwargs["enable_thinking"] = True
try:
return self.tokenizer.apply_chat_template(messages, **kwargs)
except TypeError:
return self.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
messages = [{"role": msg.role, "content": msg.content} for msg in request.Messages]
return self.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
return request.Prompt
def _get_tokens_from_prompt(self, prompt_text: str) -> List[int]:
@@ -456,82 +349,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
return tokens.tolist()
return list(tokens)
def _tool_module_from_tokenizer(self):
"""Same shim as the mlx backend: fall back to json.loads when the
installed mlx-lm doesn't expose a tool_parser callable on the
wrapper (true on 0.29.x — only HEAD ships parsers)."""
start = getattr(self.tokenizer, "tool_call_start", None)
end = getattr(self.tokenizer, "tool_call_end", None)
if not start:
return None
parse_fn = getattr(self.tokenizer, "tool_parser", None)
if parse_fn is None:
def parse_fn(body, tools): # noqa: E306
return json.loads(body.strip())
return types.SimpleNamespace(
tool_call_start=start,
tool_call_end=end or "",
parse_tool_call=parse_fn,
)
def _truncate_at_stop(self, text, stop_words):
if not stop_words:
return text
earliest = len(text)
for stop in stop_words:
if not stop:
continue
idx = text.find(stop)
if idx >= 0 and idx < earliest:
earliest = idx
return text[:earliest] if earliest < len(text) else text
def _finalize_output(self, request, generated_text, last_response):
content = generated_text
reasoning_content = ""
if getattr(self.tokenizer, "has_thinking", False):
think_start = getattr(self.tokenizer, "think_start", "") or ""
think_end = getattr(self.tokenizer, "think_end", "") or ""
reasoning_content, content = split_reasoning(content, think_start, think_end)
tool_calls_proto: List[backend_pb2.ToolCallDelta] = []
tool_module = None
if getattr(self.tokenizer, "has_tool_calling", False):
tool_module = self._tool_module_from_tokenizer()
if tool_module is not None:
parsed_tools = None
if request.Tools:
try:
parsed_tools = json.loads(request.Tools)
except json.JSONDecodeError:
parsed_tools = None
calls, content = parse_tool_calls(content, tool_module, parsed_tools)
for c in calls:
tool_calls_proto.append(
backend_pb2.ToolCallDelta(
index=c["index"], id=c["id"], name=c["name"], arguments=c["arguments"],
)
)
prompt_token_count = int(getattr(last_response, "prompt_tokens", 0) or 0) if last_response else 0
completion_token_count = int(getattr(last_response, "generation_tokens", 0) or 0) if last_response else 0
logprobs_bytes = b""
if last_response is not None and int(getattr(request, "Logprobs", 0) or 0) > 0:
try:
lp = getattr(last_response, "logprobs", None)
if lp is not None:
token_id = int(getattr(last_response, "token", 0) or 0)
token_text = self.tokenizer.decode([token_id]) if token_id else ""
top_logprob = float(lp[token_id]) if hasattr(lp, "__getitem__") else 0.0
logprobs_bytes = json.dumps(
{"content": [{"token": token_text, "logprob": top_logprob}]}
).encode("utf-8")
except Exception as e:
print(f"[Rank 0] Logprobs extraction failed: {e}", file=sys.stderr)
return content, reasoning_content, tool_calls_proto, prompt_token_count, completion_token_count, logprobs_bytes
def _build_generation_params(self, request, default_max_tokens=200):
import mlx.core as mx
@@ -556,22 +373,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
'xtc_probability': 0.0,
}
# Logits processor parameters — pulled from the request and
# forwarded to make_logits_processors. Rank 0 is the only rank
# running the sampler so we don't need to broadcast these to
# workers (workers participate in the pipeline-parallel forward
# pass only).
logits_params = {}
repetition_penalty = getattr(request, 'RepetitionPenalty', 0.0) or 0.0
if repetition_penalty and repetition_penalty != 1.0:
logits_params['repetition_penalty'] = repetition_penalty
presence_penalty = getattr(request, 'PresencePenalty', 0.0) or 0.0
if presence_penalty:
logits_params['presence_penalty'] = presence_penalty
frequency_penalty = getattr(request, 'FrequencyPenalty', 0.0) or 0.0
if frequency_penalty:
logits_params['frequency_penalty'] = frequency_penalty
seed = getattr(request, 'Seed', 0)
if seed != 0:
mx.random.seed(seed)
@@ -591,15 +392,9 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
for opt_key, param_key in option_mapping.items():
if opt_key in self.options:
sampler_params[param_key] = self.options[opt_key]
for opt_key in ('repetition_penalty', 'presence_penalty', 'frequency_penalty'):
if opt_key in self.options:
logits_params[opt_key] = self.options[opt_key]
if 'seed' in self.options:
mx.random.seed(self.options['seed'])
stop_words = list(getattr(request, 'StopPrompts', []) or [])
return max_tokens, sampler_params, logits_params, stop_words
# XTC special tokens
xtc_special_tokens = []
if hasattr(self.tokenizer, 'eos_token_ids') and self.tokenizer.eos_token_ids:

View File

@@ -1,6 +1,3 @@
import os
import sys
import types
import unittest
import subprocess
import time
@@ -9,12 +6,6 @@ import grpc
import backend_pb2
import backend_pb2_grpc
# Make the shared helpers importable so we can unit-test them without a
# running gRPC server.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
from python_utils import messages_to_dicts, parse_options
from mlx_utils import parse_tool_calls, split_reasoning
class TestBackendServicer(unittest.TestCase):
def setUp(self):
@@ -94,44 +85,3 @@ class TestBackendServicer(unittest.TestCase):
self.fail("sampling params service failed")
finally:
self.tearDown()
class TestSharedHelpers(unittest.TestCase):
"""Server-less unit tests for the helpers the mlx-distributed backend depends on."""
def test_parse_options_typed(self):
opts = parse_options(["temperature:0.7", "max_tokens:128", "trust:true"])
self.assertEqual(opts["temperature"], 0.7)
self.assertEqual(opts["max_tokens"], 128)
self.assertIs(opts["trust"], True)
def test_messages_to_dicts_roundtrip(self):
msgs = [
backend_pb2.Message(role="user", content="hi"),
backend_pb2.Message(
role="assistant",
content="",
tool_calls='[{"id":"call_1","type":"function","function":{"name":"f","arguments":"{}"}}]',
),
backend_pb2.Message(role="tool", content="42", tool_call_id="call_1", name="f"),
]
out = messages_to_dicts(msgs)
self.assertEqual(out[0], {"role": "user", "content": "hi"})
self.assertEqual(out[1]["tool_calls"][0]["function"]["name"], "f")
self.assertEqual(out[2]["tool_call_id"], "call_1")
def test_split_reasoning(self):
r, c = split_reasoning("<think>plan</think>final", "<think>", "</think>")
self.assertEqual(r, "plan")
self.assertEqual(c, "final")
def test_parse_tool_calls_with_shim(self):
tm = types.SimpleNamespace(
tool_call_start="<tool_call>",
tool_call_end="</tool_call>",
parse_tool_call=lambda body, tools: {"name": "get_weather", "arguments": {"location": body.strip()}},
)
calls, remaining = parse_tool_calls("<tool_call>Paris</tool_call>", tm, tools=None)
self.assertEqual(len(calls), 1)
self.assertEqual(calls[0]["name"], "get_weather")
self.assertEqual(calls[0]["arguments"], '{"location": "Paris"}')

View File

@@ -2,14 +2,11 @@
import asyncio
from concurrent import futures
import argparse
import gc
import json
import signal
import sys
import os
import tempfile
import types
from typing import List
import time
import backend_pb2
import backend_pb2_grpc
@@ -18,18 +15,30 @@ import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
from python_utils import messages_to_dicts, parse_options
from mlx_utils import parse_tool_calls, split_reasoning
from mlx_vlm import load, stream_generate
from mlx_vlm import load, generate, stream_generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.tool_parsers import _infer_tool_parser, load_tool_module
from mlx_vlm.utils import load_config
from mlx_lm.sample_utils import make_logits_processors, make_sampler
from mlx_vlm.utils import load_config, load_image
import mlx.core as mx
import base64
import io
from PIL import Image
import tempfile
def is_float(s):
"""Check if a string can be converted to float."""
try:
float(s)
return True
except ValueError:
return False
def is_int(s):
"""Check if a string can be converted to int."""
try:
int(s)
return True
except ValueError:
return False
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
@@ -69,52 +78,36 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
try:
print(f"Loading MLX-VLM model: {request.Model}", file=sys.stderr)
print(f"Request: {request}", file=sys.stderr)
# Parse Options[] key:value strings into a typed dict
self.options = parse_options(request.Options)
# Parse options like in the diffusers backend
options = request.Options
self.options = {}
# The options are a list of strings in this form optname:optvalue
# We store all the options in a dict for later use
for opt in options:
if ":" not in opt:
continue
key, value = opt.split(":", 1) # Split only on first colon to handle values with colons
if is_float(value):
value = float(value)
elif is_int(value):
value = int(value)
elif value.lower() in ["true", "false"]:
value = value.lower() == "true"
self.options[key] = value
print(f"Options: {self.options}", file=sys.stderr)
# Load model and processor using MLX-VLM
# mlx-vlm load function returns (model, processor) instead of (model, tokenizer)
self.model, self.processor = load(request.Model)
# Load model config for chat template support
self.config = load_config(request.Model)
# Auto-infer the tool parser from the chat template. mlx-vlm has
# its own _infer_tool_parser that falls back to mlx-lm parsers.
tokenizer = (
self.processor.tokenizer if hasattr(self.processor, "tokenizer") else self.processor
)
self.tool_module = None
if hasattr(tokenizer, "chat_template"):
try:
parser_type = _infer_tool_parser(tokenizer.chat_template)
if parser_type is not None:
self.tool_module = load_tool_module(parser_type)
print(
f"[mlx-vlm] auto-detected tool parser: {parser_type}",
file=sys.stderr,
)
else:
print(
"[mlx-vlm] no tool parser matched the chat template",
file=sys.stderr,
)
except Exception as e:
print(
f"[mlx-vlm] failed to load tool parser: {e}",
file=sys.stderr,
)
# Reasoning tokens — check if the tokenizer advertises thinking
# markers. Fall back to empty strings (split_reasoning no-ops).
self.think_start = getattr(tokenizer, "think_start", "") or ""
self.think_end = getattr(tokenizer, "think_end", "") or ""
self.has_thinking = bool(
getattr(tokenizer, "has_thinking", False) or self.think_start
)
except Exception as err:
print(f"Error loading MLX-VLM model {err=}, {type(err)=}", file=sys.stderr)
return backend_pb2.Result(success=False, message=f"Error loading MLX-VLM model: {err}")
@@ -135,72 +128,63 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
"""
temp_files = []
try:
image_paths, audio_paths = self._collect_media(request, temp_files)
prompt = self._prepare_prompt(
request,
num_images=len(image_paths),
num_audios=len(audio_paths),
)
max_tokens, sampler_params, logits_params, stop_words = self._build_generation_params(request)
sampler = make_sampler(**sampler_params)
logits_processors = make_logits_processors(**logits_params) if logits_params else None
print(
f"Generating text with MLX-VLM - max_tokens: {max_tokens}, "
f"images: {len(image_paths)}, audios: {len(audio_paths)}",
file=sys.stderr,
)
accumulated = []
last_response = None
for response in stream_generate(
# Process images and audios from request
image_paths = []
audio_paths = []
# Process images
if request.Images:
for img_data in request.Images:
img_path = self.load_image_from_base64(img_data)
if img_path:
image_paths.append(img_path)
temp_files.append(img_path)
# Process audios
if request.Audios:
for audio_data in request.Audios:
audio_path = self.load_audio_from_base64(audio_data)
if audio_path:
audio_paths.append(audio_path)
temp_files.append(audio_path)
# Prepare the prompt with multimodal information
prompt = self._prepare_prompt(request, num_images=len(image_paths), num_audios=len(audio_paths))
# Build generation parameters using request attributes and options
max_tokens, generation_params = self._build_generation_params(request)
print(f"Generating text with MLX-VLM - max_tokens: {max_tokens}, params: {generation_params}", file=sys.stderr)
print(f"Images: {len(image_paths)}, Audios: {len(audio_paths)}", file=sys.stderr)
# Generate text using MLX-VLM with multimodal inputs
response = generate(
model=self.model,
processor=self.processor,
prompt=prompt,
image=image_paths if image_paths else None,
audio=audio_paths if audio_paths else None,
max_tokens=max_tokens,
sampler=sampler,
logits_processors=logits_processors,
):
accumulated.append(response.text)
last_response = response
if stop_words and any(s in "".join(accumulated) for s in stop_words):
break
full_text = self._truncate_at_stop("".join(accumulated), stop_words)
content, reasoning_content, tool_calls_proto, prompt_tokens, completion_tokens, logprobs_bytes = (
self._finalize_output(request, full_text, last_response)
temperature=generation_params.get('temp', 0.6),
top_p=generation_params.get('top_p', 1.0),
verbose=False
)
return backend_pb2.Reply(
message=bytes(content, encoding='utf-8'),
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
logprobs=logprobs_bytes,
chat_deltas=[
backend_pb2.ChatDelta(
content=content,
reasoning_content=reasoning_content,
tool_calls=tool_calls_proto,
)
],
)
return backend_pb2.Reply(message=bytes(response, encoding='utf-8'))
except Exception as e:
print(f"Error in MLX-VLM Predict: {e}", file=sys.stderr)
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"Generation failed: {str(e)}")
return backend_pb2.Reply(message=bytes("", encoding='utf-8'))
finally:
# Clean up temporary files
self.cleanup_temp_files(temp_files)
def Embedding(self, request, context):
"""
A gRPC method that calculates embeddings for a given sentence.
Note: MLX-VLM doesn't support embeddings directly. This method returns an error.
Args:
@@ -215,79 +199,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
context.set_details("Embeddings are not supported in the MLX-VLM backend.")
return backend_pb2.EmbeddingResult()
def _collect_media(self, request, temp_files):
"""Decode base64 Images and Audios into temp file paths.
Appends every temp file to ``temp_files`` so the finally block can
clean up even on mid-generation errors.
"""
image_paths = []
audio_paths = []
if request.Images:
for img_data in request.Images:
img_path = self.load_image_from_base64(img_data)
if img_path:
image_paths.append(img_path)
temp_files.append(img_path)
if request.Audios:
for audio_data in request.Audios:
audio_path = self.load_audio_from_base64(audio_data)
if audio_path:
audio_paths.append(audio_path)
temp_files.append(audio_path)
return image_paths, audio_paths
async def TokenizeString(self, request, context):
"""Tokenize ``request.Prompt`` via the processor's tokenizer."""
if not hasattr(self, "processor") or self.processor is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("processor not loaded")
return backend_pb2.TokenizationResponse()
try:
tokenizer = (
self.processor.tokenizer
if hasattr(self.processor, "tokenizer")
else self.processor
)
tokens = tokenizer.encode(request.Prompt)
if hasattr(tokens, "tolist"):
tokens = tokens.tolist()
tokens = list(tokens)
return backend_pb2.TokenizationResponse(length=len(tokens), tokens=tokens)
except Exception as e:
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(e))
return backend_pb2.TokenizationResponse()
async def Free(self, request, context):
"""Drop the loaded model, processor and tool module."""
try:
if hasattr(self, "model"):
del self.model
if hasattr(self, "processor"):
del self.processor
if hasattr(self, "config"):
del self.config
self.tool_module = None
gc.collect()
# mlx.clear_cache (mlx >= 0.30) supersedes mlx.metal.clear_cache.
try:
if hasattr(mx, "clear_cache"):
mx.clear_cache()
elif hasattr(mx, "metal") and hasattr(mx.metal, "clear_cache"):
mx.metal.clear_cache()
except Exception:
pass
try:
import torch # type: ignore
if torch.cuda.is_available():
torch.cuda.empty_cache()
except Exception:
pass
return backend_pb2.Result(success=True, message="MLX-VLM model freed")
except Exception as e:
return backend_pb2.Result(success=False, message=str(e))
async def PredictStream(self, request, context):
"""
Generates text based on the given prompt and sampling parameters, and streams the results using MLX-VLM with multimodal support.
@@ -301,28 +212,36 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
"""
temp_files = []
try:
image_paths, audio_paths = self._collect_media(request, temp_files)
prompt = self._prepare_prompt(
request,
num_images=len(image_paths),
num_audios=len(audio_paths),
)
max_tokens, sampler_params, logits_params, stop_words = self._build_generation_params(
request, default_max_tokens=512
)
sampler = make_sampler(**sampler_params)
logits_processors = make_logits_processors(**logits_params) if logits_params else None
print(
f"Streaming text with MLX-VLM - max_tokens: {max_tokens}, "
f"images: {len(image_paths)}, audios: {len(audio_paths)}",
file=sys.stderr,
)
accumulated = []
last_response = None
# Process images and audios from request
image_paths = []
audio_paths = []
# Process images
if request.Images:
for img_data in request.Images:
img_path = self.load_image_from_base64(img_data)
if img_path:
image_paths.append(img_path)
temp_files.append(img_path)
# Process audios
if request.Audios:
for audio_data in request.Audios:
audio_path = self.load_audio_from_base64(audio_data)
if audio_path:
audio_paths.append(audio_path)
temp_files.append(audio_path)
# Prepare the prompt with multimodal information
prompt = self._prepare_prompt(request, num_images=len(image_paths), num_audios=len(audio_paths))
# Build generation parameters using request attributes and options
max_tokens, generation_params = self._build_generation_params(request, default_max_tokens=512)
print(f"Streaming text with MLX-VLM - max_tokens: {max_tokens}, params: {generation_params}", file=sys.stderr)
print(f"Images: {len(image_paths)}, Audios: {len(audio_paths)}", file=sys.stderr)
# Stream text generation using MLX-VLM with multimodal inputs
for response in stream_generate(
model=self.model,
processor=self.processor,
@@ -330,91 +249,77 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
image=image_paths if image_paths else None,
audio=audio_paths if audio_paths else None,
max_tokens=max_tokens,
sampler=sampler,
logits_processors=logits_processors,
temperature=generation_params.get('temp', 0.6),
top_p=generation_params.get('top_p', 1.0),
):
accumulated.append(response.text)
last_response = response
yield backend_pb2.Reply(
message=bytes(response.text, encoding='utf-8'),
chat_deltas=[backend_pb2.ChatDelta(content=response.text)],
)
if stop_words and any(s in "".join(accumulated) for s in stop_words):
break
full_text = self._truncate_at_stop("".join(accumulated), stop_words)
content, reasoning_content, tool_calls_proto, prompt_tokens, completion_tokens, logprobs_bytes = (
self._finalize_output(request, full_text, last_response)
)
yield backend_pb2.Reply(
message=b"",
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
logprobs=logprobs_bytes,
chat_deltas=[
backend_pb2.ChatDelta(
content="",
reasoning_content=reasoning_content,
tool_calls=tool_calls_proto,
)
],
)
yield backend_pb2.Reply(message=bytes(response.text, encoding='utf-8'))
except Exception as e:
print(f"Error in MLX-VLM PredictStream: {e}", file=sys.stderr)
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"Streaming generation failed: {str(e)}")
yield backend_pb2.Reply(message=bytes("", encoding='utf-8'))
finally:
# Clean up temporary files
self.cleanup_temp_files(temp_files)
def _build_template_kwargs(self, request, num_images, num_audios):
"""Collect kwargs for ``apply_chat_template`` that survive model variants."""
kwargs = {"num_images": num_images, "num_audios": num_audios}
if request.Tools:
try:
kwargs["tools"] = json.loads(request.Tools)
except json.JSONDecodeError:
pass
if request.Metadata.get("enable_thinking", "").lower() == "true":
kwargs["enable_thinking"] = True
return kwargs
def _apply_template(self, request, messages, num_images, num_audios):
kwargs = self._build_template_kwargs(request, num_images, num_audios)
try:
return apply_chat_template(self.processor, self.config, messages, **kwargs)
except TypeError:
# Fallback for older mlx-vlm versions that reject tools=/enable_thinking=
return apply_chat_template(
self.processor,
self.config,
messages,
num_images=num_images,
num_audios=num_audios,
)
def _prepare_prompt(self, request, num_images=0, num_audios=0):
"""
Prepare the prompt for MLX-VLM generation, handling chat templates and
multimodal inputs. Forwards tool definitions and enable_thinking when
present on the request.
Prepare the prompt for MLX-VLM generation, handling chat templates and multimodal inputs.
Args:
request: The gRPC request containing prompt and message information.
num_images: Number of images in the request.
num_audios: Number of audio files in the request.
Returns:
str: The prepared prompt.
"""
# If tokenizer template is enabled and messages are provided instead of prompt, apply the tokenizer template
if not request.Prompt and request.UseTokenizerTemplate and request.Messages:
messages = messages_to_dicts(request.Messages)
return self._apply_template(request, messages, num_images, num_audios)
if request.Prompt:
# Convert gRPC messages to the format expected by apply_chat_template
messages = []
for msg in request.Messages:
messages.append({"role": msg.role, "content": msg.content})
# Use mlx-vlm's apply_chat_template which handles multimodal inputs
prompt = apply_chat_template(
self.processor,
self.config,
messages,
num_images=num_images,
num_audios=num_audios
)
return prompt
elif request.Prompt:
# If we have a direct prompt but also have images/audio, we need to format it properly
if num_images > 0 or num_audios > 0:
# Create a simple message structure for multimodal prompt
messages = [{"role": "user", "content": request.Prompt}]
return self._apply_template(request, messages, num_images, num_audios)
return request.Prompt
# Fallback to empty prompt with multimodal template if we have media
if num_images > 0 or num_audios > 0:
messages = [{"role": "user", "content": ""}]
return self._apply_template(request, messages, num_images, num_audios)
return ""
prompt = apply_chat_template(
self.processor,
self.config,
messages,
num_images=num_images,
num_audios=num_audios
)
return prompt
else:
return request.Prompt
else:
# Fallback to empty prompt with multimodal template if we have media
if num_images > 0 or num_audios > 0:
messages = [{"role": "user", "content": ""}]
prompt = apply_chat_template(
self.processor,
self.config,
messages,
num_images=num_images,
num_audios=num_audios
)
return prompt
else:
return ""
@@ -422,122 +327,62 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
def _build_generation_params(self, request, default_max_tokens=200):
"""
Build generation parameters from request attributes and options.
Build generation parameters from request attributes and options for MLX-VLM.
Args:
request: The gRPC request.
default_max_tokens: Default max_tokens if not specified.
Returns:
tuple: (max_tokens, sampler_params, logits_params, stop_words)
tuple: (max_tokens, generation_params dict)
"""
max_tokens = getattr(request, 'Tokens', default_max_tokens) or default_max_tokens
temp = getattr(request, 'Temperature', 0.0) or 0.6
top_p = getattr(request, 'TopP', 0.0) or 1.0
min_p = getattr(request, 'MinP', 0.0) or 0.0
top_k = getattr(request, 'TopK', 0) or 0
sampler_params = {
# Extract max_tokens
max_tokens = getattr(request, 'Tokens', default_max_tokens)
if max_tokens == 0:
max_tokens = default_max_tokens
# Extract generation parameters from request attributes
temp = getattr(request, 'Temperature', 0.0)
if temp == 0.0:
temp = 0.6 # Default temperature
top_p = getattr(request, 'TopP', 0.0)
if top_p == 0.0:
top_p = 1.0 # Default top_p
# Initialize generation parameters for MLX-VLM
generation_params = {
'temp': temp,
'top_p': top_p,
'min_p': min_p,
'top_k': top_k,
}
logits_params = {}
repetition_penalty = getattr(request, 'RepetitionPenalty', 0.0) or 0.0
if repetition_penalty and repetition_penalty != 1.0:
logits_params['repetition_penalty'] = repetition_penalty
presence_penalty = getattr(request, 'PresencePenalty', 0.0) or 0.0
if presence_penalty:
logits_params['presence_penalty'] = presence_penalty
frequency_penalty = getattr(request, 'FrequencyPenalty', 0.0) or 0.0
if frequency_penalty:
logits_params['frequency_penalty'] = frequency_penalty
# Add seed if specified
seed = getattr(request, 'Seed', 0)
if seed != 0:
mx.random.seed(seed)
# Override with options if available
if hasattr(self, 'options'):
# Max tokens from options
if 'max_tokens' in self.options:
max_tokens = self.options['max_tokens']
option_mapping = {
'temp': 'temp', 'temperature': 'temp',
'top_p': 'top_p', 'min_p': 'min_p', 'top_k': 'top_k',
# Generation parameters from options
param_option_mapping = {
'temp': 'temp',
'temperature': 'temp', # alias
'top_p': 'top_p',
}
for option_key, param_key in option_mapping.items():
for option_key, param_key in param_option_mapping.items():
if option_key in self.options:
sampler_params[param_key] = self.options[option_key]
for option_key in ('repetition_penalty', 'presence_penalty', 'frequency_penalty'):
if option_key in self.options:
logits_params[option_key] = self.options[option_key]
generation_params[param_key] = self.options[option_key]
# Handle seed from options
if 'seed' in self.options:
mx.random.seed(self.options['seed'])
stop_words = list(getattr(request, 'StopPrompts', []) or [])
return max_tokens, sampler_params, logits_params, stop_words
def _finalize_output(self, request, generated_text, last_response):
"""Split reasoning + tool calls out of generated_text and return the
tuple consumed by Reply-builders."""
content = generated_text
reasoning_content = ""
if getattr(self, "has_thinking", False):
reasoning_content, content = split_reasoning(content, self.think_start, self.think_end)
tool_calls_proto: List[backend_pb2.ToolCallDelta] = []
if self.tool_module is not None:
parsed_tools = None
if request.Tools:
try:
parsed_tools = json.loads(request.Tools)
except json.JSONDecodeError:
parsed_tools = None
calls, content = parse_tool_calls(content, self.tool_module, parsed_tools)
for c in calls:
tool_calls_proto.append(
backend_pb2.ToolCallDelta(
index=c["index"],
id=c["id"],
name=c["name"],
arguments=c["arguments"],
)
)
prompt_tokens = int(getattr(last_response, "prompt_tokens", 0) or 0) if last_response else 0
completion_tokens = int(getattr(last_response, "generation_tokens", 0) or 0) if last_response else 0
logprobs_bytes = b""
if last_response is not None and int(getattr(request, "Logprobs", 0) or 0) > 0:
try:
lp = getattr(last_response, "logprobs", None)
if lp is not None:
token_id = int(getattr(last_response, "token", 0) or 0)
tokenizer = (
self.processor.tokenizer
if hasattr(self.processor, "tokenizer")
else self.processor
)
token_text = tokenizer.decode([token_id]) if token_id else ""
top_logprob = float(lp[token_id]) if hasattr(lp, "__getitem__") else 0.0
logprobs_bytes = json.dumps(
{"content": [{"token": token_text, "logprob": top_logprob}]}
).encode("utf-8")
except Exception as e:
print(f"[mlx-vlm] Logprobs extraction failed: {e}", file=sys.stderr)
return content, reasoning_content, tool_calls_proto, prompt_tokens, completion_tokens, logprobs_bytes
def _truncate_at_stop(self, text, stop_words):
if not stop_words:
return text
earliest = len(text)
for stop in stop_words:
if not stop:
continue
idx = text.find(stop)
if idx >= 0 and idx < earliest:
earliest = idx
return text[:earliest] if earliest < len(text) else text
return max_tokens, generation_params
def load_image_from_base64(self, image_data: str):
"""

View File

@@ -1,19 +1,17 @@
import os
import sys
import types
import unittest
import subprocess
import time
import grpc
import backend_pb2
import backend_pb2_grpc
# Make the shared helpers importable so we can unit-test them without a
# running gRPC server.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
from python_utils import messages_to_dicts, parse_options
from mlx_utils import parse_tool_calls, split_reasoning
import grpc
import unittest
import subprocess
import time
import grpc
import backend_pb2_grpc
import backend_pb2
class TestBackendServicer(unittest.TestCase):
"""
@@ -145,55 +143,4 @@ class TestBackendServicer(unittest.TestCase):
print(err)
self.fail("Embedding service failed")
finally:
self.tearDown()
class TestSharedHelpers(unittest.TestCase):
"""Server-less unit tests for the helpers the mlx-vlm backend depends on."""
def test_parse_options_typed(self):
opts = parse_options(["temperature:0.7", "max_tokens:128", "trust:true", "name:hello"])
self.assertEqual(opts["temperature"], 0.7)
self.assertEqual(opts["max_tokens"], 128)
self.assertIs(opts["trust"], True)
self.assertEqual(opts["name"], "hello")
def test_messages_to_dicts_roundtrip(self):
msgs = [
backend_pb2.Message(role="user", content="hi"),
backend_pb2.Message(
role="assistant",
content="",
tool_calls='[{"id":"call_1","type":"function","function":{"name":"f","arguments":"{}"}}]',
),
backend_pb2.Message(
role="tool",
content="42",
tool_call_id="call_1",
name="f",
),
]
out = messages_to_dicts(msgs)
self.assertEqual(out[0], {"role": "user", "content": "hi"})
self.assertEqual(out[1]["tool_calls"][0]["function"]["name"], "f")
self.assertEqual(out[2]["tool_call_id"], "call_1")
def test_split_reasoning(self):
r, c = split_reasoning("<think>plan</think>final", "<think>", "</think>")
self.assertEqual(r, "plan")
self.assertEqual(c, "final")
def test_parse_tool_calls_with_shim(self):
tm = types.SimpleNamespace(
tool_call_start="<tool_call>",
tool_call_end="</tool_call>",
parse_tool_call=lambda body, tools: {"name": "get_weather", "arguments": {"location": body.strip()}},
)
calls, remaining = parse_tool_calls(
"<tool_call>Paris</tool_call>",
tm,
tools=None,
)
self.assertEqual(len(calls), 1)
self.assertEqual(calls[0]["name"], "get_weather")
self.assertEqual(calls[0]["arguments"], '{"location": "Paris"}')
self.tearDown()

View File

@@ -2,13 +2,11 @@
import asyncio
from concurrent import futures
import argparse
import gc
import json
import signal
import sys
import os
import types
from typing import List
import time
import backend_pb2
import backend_pb2_grpc
@@ -17,13 +15,13 @@ import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
from python_utils import messages_to_dicts, parse_options
from mlx_utils import parse_tool_calls, split_reasoning
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler
from mlx_lm import load, generate, stream_generate
from mlx_lm.sample_utils import make_sampler
from mlx_lm.models.cache import make_prompt_cache, can_trim_prompt_cache, trim_prompt_cache
import mlx.core as mx
import base64
import io
from mlx_cache import ThreadSafeLRUPromptCache
@@ -32,6 +30,21 @@ _ONE_DAY_IN_SECONDS = 60 * 60 * 24
# If MAX_WORKERS are specified in the environment use it, otherwise default to 1
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
def is_float(s):
"""Check if a string can be converted to float."""
try:
float(s)
return True
except ValueError:
return False
def is_int(s):
"""Check if a string can be converted to int."""
try:
int(s)
return True
except ValueError:
return False
# Implement the BackendServicer class with the service methods
class BackendServicer(backend_pb2_grpc.BackendServicer):
"""
@@ -65,27 +78,46 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
try:
print(f"Loading MLX model: {request.Model}", file=sys.stderr)
print(f"Request: {request}", file=sys.stderr)
# Parse Options[] key:value strings into a typed dict (shared helper)
self.options = parse_options(request.Options)
# Parse options like in the diffusers backend
options = request.Options
self.options = {}
# The options are a list of strings in this form optname:optvalue
# We store all the options in a dict for later use
for opt in options:
if ":" not in opt:
continue
key, value = opt.split(":", 1) # Split only on first colon to handle values with colons
# Convert numeric values to appropriate types
if is_float(value):
value = float(value)
elif is_int(value):
value = int(value)
elif value.lower() in ["true", "false"]:
value = value.lower() == "true"
self.options[key] = value
print(f"Options: {self.options}", file=sys.stderr)
# Build tokenizer config for MLX using options
tokenizer_config = {}
# Handle trust_remote_code from request or options
if request.TrustRemoteCode or self.options.get("trust_remote_code", False):
tokenizer_config["trust_remote_code"] = True
# Handle EOS token from options
if "eos_token" in self.options:
tokenizer_config["eos_token"] = self.options["eos_token"]
# Handle other tokenizer config options
for key in ["pad_token", "bos_token", "unk_token", "sep_token", "cls_token", "mask_token"]:
if key in self.options:
tokenizer_config[key] = self.options[key]
# Load model and tokenizer using MLX
if tokenizer_config:
print(f"Loading with tokenizer_config: {tokenizer_config}", file=sys.stderr)
@@ -93,21 +125,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
else:
self.model, self.tokenizer = load(request.Model)
# mlx_lm.load() returns a TokenizerWrapper that detects tool
# calling and thinking markers from the chat template / vocab.
# mlx-lm >= 0.30 also exposes a parser callable on the wrapper;
# earlier versions don't (we fall back to json.loads inside
# _tool_module_from_tokenizer below).
has_tools = bool(getattr(self.tokenizer, "has_tool_calling", False))
has_thinking = bool(getattr(self.tokenizer, "has_thinking", False))
tcs = getattr(self.tokenizer, "tool_call_start", None)
tce = getattr(self.tokenizer, "tool_call_end", None)
print(
f"MLX tokenizer capabilities: has_tool_calling={has_tools} "
f"has_thinking={has_thinking} tool_call_start={tcs!r} tool_call_end={tce!r}",
file=sys.stderr,
)
# Initialize thread-safe LRU prompt cache for efficient generation
max_cache_entries = self.options.get("max_cache_entries", 10)
self.max_kv_size = self.options.get("max_kv_size", None)
@@ -117,7 +134,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
can_trim_fn=can_trim_prompt_cache,
trim_fn=trim_prompt_cache,
)
except Exception as err:
print(f"Error loading MLX model {err=}, {type(err)=}", file=sys.stderr)
return backend_pb2.Result(success=False, message=f"Error loading MLX model: {err}")
@@ -155,58 +172,30 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
remaining_tokens = cache_key
# Build generation parameters using request attributes and options
max_tokens, sampler_params, logits_params, stop_words = self._build_generation_params(request)
max_tokens, sampler_params = self._build_generation_params(request)
print(
f"Generating text with MLX - max_tokens: {max_tokens}, "
f"cache_hit: {len(remaining_tokens) < len(cache_key)}",
file=sys.stderr,
)
print(f"Generating text with MLX - max_tokens: {max_tokens}, cache_hit: {len(remaining_tokens) < len(cache_key)}", file=sys.stderr)
# Create sampler and optional logits processors (penalties)
# Create sampler with parameters
sampler = make_sampler(**sampler_params)
logits_processors = make_logits_processors(**logits_params) if logits_params else None
# Use stream_generate to collect text + track tokens for cache key
# Use stream_generate to track generated tokens for cache key
generated_text = []
last_response = None
for response in stream_generate(
self.model,
self.tokenizer,
prompt=remaining_tokens if remaining_tokens else cache_key,
max_tokens=max_tokens,
sampler=sampler,
logits_processors=logits_processors,
prompt_cache=prompt_cache,
):
generated_text.append(response.text)
cache_key.append(response.token)
last_response = response
# Early stop on user-provided stop sequences
if stop_words and any(s in "".join(generated_text) for s in stop_words):
break
# Insert completed cache
self.lru_cache.insert_cache(self.model_key, cache_key, prompt_cache)
full_text = self._truncate_at_stop("".join(generated_text), stop_words)
content, reasoning_content, tool_calls_proto, prompt_tokens, completion_tokens, logprobs_bytes = (
self._finalize_output(request, full_text, last_response)
)
return backend_pb2.Reply(
message=bytes(content, encoding='utf-8'),
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
logprobs=logprobs_bytes,
chat_deltas=[
backend_pb2.ChatDelta(
content=content,
reasoning_content=reasoning_content,
tool_calls=tool_calls_proto,
)
],
)
return backend_pb2.Reply(message=bytes(''.join(generated_text), encoding='utf-8'))
except Exception as e:
print(f"Error in MLX Predict: {e}", file=sys.stderr)
@@ -217,7 +206,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
def Embedding(self, request, context):
"""
A gRPC method that calculates embeddings for a given sentence.
Note: MLX-LM doesn't support embeddings directly. This method returns an error.
Args:
@@ -232,62 +221,6 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
context.set_details("Embeddings are not supported in the MLX backend.")
return backend_pb2.EmbeddingResult()
async def TokenizeString(self, request, context):
"""Tokenize ``request.Prompt`` using the loaded model's tokenizer."""
if not hasattr(self, "tokenizer") or self.tokenizer is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("tokenizer not loaded")
return backend_pb2.TokenizationResponse()
try:
tokens = self.tokenizer.encode(request.Prompt)
if hasattr(tokens, "tolist"):
tokens = tokens.tolist()
tokens = list(tokens)
return backend_pb2.TokenizationResponse(length=len(tokens), tokens=tokens)
except Exception as e:
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(e))
return backend_pb2.TokenizationResponse()
async def Free(self, request, context):
"""Drop the loaded model, tokenizer and prompt cache.
Metal / CUDA memory is released via ``gc.collect()`` + the
platform-specific cache clear hooks when available.
"""
try:
if hasattr(self, "model"):
del self.model
if hasattr(self, "tokenizer"):
del self.tokenizer
if hasattr(self, "lru_cache") and self.lru_cache is not None:
try:
self.lru_cache.clear()
except Exception:
pass
self.lru_cache = None
gc.collect()
# Metal: drop the cached allocator. mlx.clear_cache (mlx >= 0.30)
# supersedes the now-deprecated mlx.metal.clear_cache.
try:
if hasattr(mx, "clear_cache"):
mx.clear_cache()
elif hasattr(mx, "metal") and hasattr(mx.metal, "clear_cache"):
mx.metal.clear_cache()
except Exception:
pass
# CUDA: release the torch cache if a CUDA-backed mlx variant
# happens to be installed alongside torch (best-effort).
try:
import torch # type: ignore
if torch.cuda.is_available():
torch.cuda.empty_cache()
except Exception:
pass
return backend_pb2.Result(success=True, message="MLX model freed")
except Exception as e:
return backend_pb2.Result(success=False, message=str(e))
async def PredictStream(self, request, context):
"""
Generates text based on the given prompt and sampling parameters, and streams the results using MLX.
@@ -318,64 +251,24 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
remaining_tokens = cache_key
# Build generation parameters using request attributes and options
max_tokens, sampler_params, logits_params, stop_words = self._build_generation_params(
request, default_max_tokens=512
)
max_tokens, sampler_params = self._build_generation_params(request, default_max_tokens=512)
print(
f"Streaming text with MLX - max_tokens: {max_tokens}, "
f"cache_hit: {len(remaining_tokens) < len(cache_key)}",
file=sys.stderr,
)
print(f"Streaming text with MLX - max_tokens: {max_tokens}, cache_hit: {len(remaining_tokens) < len(cache_key)}", file=sys.stderr)
# Create sampler and optional logits processors (penalties)
# Create sampler with parameters
sampler = make_sampler(**sampler_params)
logits_processors = make_logits_processors(**logits_params) if logits_params else None
accumulated = []
last_response = None
# Stream text generation using MLX with proper parameters
for response in stream_generate(
self.model,
self.tokenizer,
prompt=remaining_tokens if remaining_tokens else cache_key,
max_tokens=max_tokens,
sampler=sampler,
logits_processors=logits_processors,
prompt_cache=prompt_cache,
):
cache_key.append(response.token)
accumulated.append(response.text)
last_response = response
# Emit a content delta. Structured reasoning / tool parsing
# happens on the final chunk so we don't fragment the state
# machine in v1.
yield backend_pb2.Reply(
message=bytes(response.text, encoding='utf-8'),
chat_deltas=[backend_pb2.ChatDelta(content=response.text)],
)
# Early stop on user-provided stop sequences
if stop_words and any(s in "".join(accumulated) for s in stop_words):
break
# Final chunk: run reasoning + tool parsing on accumulated text
# and emit the structured ChatDelta with token counts + logprobs.
full_text = self._truncate_at_stop("".join(accumulated), stop_words)
content, reasoning_content, tool_calls_proto, prompt_tokens, completion_tokens, logprobs_bytes = (
self._finalize_output(request, full_text, last_response)
)
yield backend_pb2.Reply(
message=b"",
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
logprobs=logprobs_bytes,
chat_deltas=[
backend_pb2.ChatDelta(
content="",
reasoning_content=reasoning_content,
tool_calls=tool_calls_proto,
)
],
)
yield backend_pb2.Reply(message=bytes(response.text, encoding='utf-8'))
except Exception as e:
print(f"Error in MLX PredictStream: {e}", file=sys.stderr)
@@ -401,33 +294,21 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
Returns:
str: The prepared prompt.
"""
# If tokenizer template is enabled and messages are provided instead
# of prompt, apply the tokenizer template (forwards tool definitions
# and enable_thinking when the model supports them).
# If tokenizer template is enabled and messages are provided instead of prompt, apply the tokenizer template
if not request.Prompt and request.UseTokenizerTemplate and request.Messages:
messages = messages_to_dicts(request.Messages)
# Convert gRPC messages to the format expected by apply_chat_template
messages = []
for msg in request.Messages:
messages.append({"role": msg.role, "content": msg.content})
kwargs = {"tokenize": False, "add_generation_prompt": True}
if request.Tools:
try:
kwargs["tools"] = json.loads(request.Tools)
except json.JSONDecodeError:
pass
enable_thinking = request.Metadata.get("enable_thinking", "").lower()
if enable_thinking == "true":
kwargs["enable_thinking"] = True
try:
return self.tokenizer.apply_chat_template(messages, **kwargs)
except TypeError:
# Fallback for tokenizers whose template doesn't accept
# tools= or enable_thinking=.
return self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
return request.Prompt
prompt = self.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
return prompt
else:
return request.Prompt
def _get_tokens_from_prompt(self, prompt_text: str) -> List[int]:
"""
@@ -457,19 +338,18 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
default_max_tokens: Default max_tokens if not specified.
Returns:
tuple: (max_tokens, sampler_params dict, logits_processor_params dict,
stop_words list)
tuple: (max_tokens, sampler_params dict)
"""
# Extract max_tokens
max_tokens = getattr(request, 'Tokens', default_max_tokens)
if max_tokens == 0:
max_tokens = default_max_tokens
# Extract sampler parameters from request attributes
temp = getattr(request, 'Temperature', 0.0)
if temp == 0.0:
temp = 0.6 # Default temperature
top_p = getattr(request, 'TopP', 0.0)
if top_p == 0.0:
top_p = 1.0 # Default top_p
@@ -489,31 +369,18 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
'xtc_threshold': 0.0,
'xtc_probability': 0.0,
}
# Logits processor parameters — only set fields the request actually
# provides so we can feed them unconditionally to make_logits_processors.
logits_params = {}
repetition_penalty = getattr(request, 'RepetitionPenalty', 0.0) or 0.0
if repetition_penalty and repetition_penalty != 1.0:
logits_params['repetition_penalty'] = repetition_penalty
presence_penalty = getattr(request, 'PresencePenalty', 0.0) or 0.0
if presence_penalty:
logits_params['presence_penalty'] = presence_penalty
frequency_penalty = getattr(request, 'FrequencyPenalty', 0.0) or 0.0
if frequency_penalty:
logits_params['frequency_penalty'] = frequency_penalty
# Add seed if specified
seed = getattr(request, 'Seed', 0)
if seed != 0:
mx.random.seed(seed)
# Override with options if available
if hasattr(self, 'options'):
# Max tokens from options
if 'max_tokens' in self.options:
max_tokens = self.options['max_tokens']
# Sampler parameters from options
sampler_option_mapping = {
'temp': 'temp',
@@ -524,142 +391,32 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
'xtc_threshold': 'xtc_threshold',
'xtc_probability': 'xtc_probability',
}
for option_key, param_key in sampler_option_mapping.items():
if option_key in self.options:
sampler_params[param_key] = self.options[option_key]
# Logits processor overrides
for option_key in ('repetition_penalty', 'presence_penalty', 'frequency_penalty'):
if option_key in self.options:
logits_params[option_key] = self.options[option_key]
# Handle seed from options
if 'seed' in self.options:
mx.random.seed(self.options['seed'])
# Special tokens for XTC sampling (if tokenizer has eos_token_ids)
xtc_special_tokens = []
if hasattr(self.tokenizer, 'eos_token_ids') and self.tokenizer.eos_token_ids:
xtc_special_tokens = list(self.tokenizer.eos_token_ids)
elif hasattr(self.tokenizer, 'eos_token_id') and self.tokenizer.eos_token_id is not None:
xtc_special_tokens = [self.tokenizer.eos_token_id]
# Add newline token if available
try:
newline_tokens = self.tokenizer.encode("\n")
xtc_special_tokens.extend(newline_tokens)
except Exception:
except:
pass # Skip if encoding fails
sampler_params['xtc_special_tokens'] = xtc_special_tokens
# Stop sequences are applied post-decode (mlx-lm doesn't have a
# built-in stop-sequence sampler param). Preserve the list here.
stop_words = list(getattr(request, 'StopPrompts', []) or [])
return max_tokens, sampler_params, logits_params, stop_words
def _tool_module_from_tokenizer(self):
"""Build a duck-typed tool module from the TokenizerWrapper.
On mlx-lm >= 0.30 the wrapper exposes a ``tool_parser`` callable
that's been resolved from the model's chat template. On older
releases (e.g. 0.29.x) the wrapper only carries the start/end
markers — fall back to ``json.loads`` of the body, which matches
what ``mlx_lm.tool_parsers.json_tools.parse_tool_call`` does on
HEAD and covers the only format 0.29 detects (``<tool_call>``).
"""
start = getattr(self.tokenizer, "tool_call_start", None)
end = getattr(self.tokenizer, "tool_call_end", None)
if not start:
return None
parse_fn = getattr(self.tokenizer, "tool_parser", None)
if parse_fn is None:
def parse_fn(body, tools): # noqa: E306 — local fallback
return json.loads(body.strip())
return types.SimpleNamespace(
tool_call_start=start,
tool_call_end=end or "",
parse_tool_call=parse_fn,
)
def _finalize_output(self, request, generated_text, last_response):
"""Build a ChatDelta + token counts + logprobs from accumulated output.
Returns ``(content, reasoning_content, tool_calls_proto,
prompt_token_count, completion_token_count, logprobs_bytes)``.
"""
content = generated_text
reasoning_content = ""
if getattr(self.tokenizer, "has_thinking", False):
think_start = getattr(self.tokenizer, "think_start", "") or ""
think_end = getattr(self.tokenizer, "think_end", "") or ""
reasoning_content, content = split_reasoning(content, think_start, think_end)
tool_calls_proto: List[backend_pb2.ToolCallDelta] = []
tool_module = None
if getattr(self.tokenizer, "has_tool_calling", False):
tool_module = self._tool_module_from_tokenizer()
if tool_module is not None:
parsed_tools = None
if request.Tools:
try:
parsed_tools = json.loads(request.Tools)
except json.JSONDecodeError:
parsed_tools = None
calls, content = parse_tool_calls(content, tool_module, parsed_tools)
for c in calls:
tool_calls_proto.append(
backend_pb2.ToolCallDelta(
index=c["index"],
id=c["id"],
name=c["name"],
arguments=c["arguments"],
)
)
prompt_token_count = int(getattr(last_response, "prompt_tokens", 0) or 0) if last_response else 0
completion_token_count = int(getattr(last_response, "generation_tokens", 0) or 0) if last_response else 0
logprobs_bytes = b""
# Logprobs extraction — only when the request asked for them.
if last_response is not None and int(getattr(request, "Logprobs", 0) or 0) > 0:
try:
lp = getattr(last_response, "logprobs", None)
if lp is not None:
# GenerationResponse.logprobs on the last chunk is the
# logprob distribution of the final token. Without a
# per-token history we at minimum surface the last token's
# top-1 logprob so clients get a non-empty field.
token_id = int(getattr(last_response, "token", 0) or 0)
token_text = self.tokenizer.decode([token_id]) if token_id else ""
top_logprob = float(lp[token_id]) if hasattr(lp, "__getitem__") else 0.0
logprobs_bytes = json.dumps(
{
"content": [
{"token": token_text, "logprob": top_logprob}
]
}
).encode("utf-8")
except Exception as e:
print(f"[mlx] Logprobs extraction failed: {e}", file=sys.stderr)
return content, reasoning_content, tool_calls_proto, prompt_token_count, completion_token_count, logprobs_bytes
def _truncate_at_stop(self, text, stop_words):
"""Truncate ``text`` at the first occurrence of any stop sequence."""
if not stop_words:
return text
earliest = len(text)
for stop in stop_words:
if not stop:
continue
idx = text.find(stop)
if idx >= 0 and idx < earliest:
earliest = idx
return text[:earliest] if earliest < len(text) else text
return max_tokens, sampler_params
async def serve(address):
# Start asyncio gRPC server

View File

@@ -1,20 +1,11 @@
import os
import sys
import unittest
import subprocess
import time
import types
import grpc
import backend_pb2
import backend_pb2_grpc
# Make the shared helpers importable so we can unit-test them without a
# running gRPC server.
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
from python_utils import messages_to_dicts, parse_options
from mlx_utils import parse_tool_calls, split_reasoning
class TestBackendServicer(unittest.TestCase):
"""
TestBackendServicer is the class that tests the gRPC service.
@@ -240,104 +231,4 @@ class TestBackendServicer(unittest.TestCase):
self.tearDown()
def test_tokenize_string(self):
"""TokenizeString should return a non-empty token list for a known prompt."""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(
backend_pb2.ModelOptions(Model="mlx-community/Llama-3.2-1B-Instruct-4bit")
)
self.assertTrue(response.success)
resp = stub.TokenizeString(backend_pb2.PredictOptions(Prompt="Hello, world"))
self.assertGreater(resp.length, 0)
self.assertEqual(len(list(resp.tokens)), resp.length)
except Exception as err:
print(err)
self.fail("TokenizeString service failed")
finally:
self.tearDown()
def test_free(self):
"""Free should release the model and not crash on subsequent calls."""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(
backend_pb2.ModelOptions(Model="mlx-community/Llama-3.2-1B-Instruct-4bit")
)
self.assertTrue(response.success)
free_resp = stub.Free(backend_pb2.HealthMessage())
self.assertTrue(free_resp.success)
except Exception as err:
print(err)
self.fail("Free service failed")
finally:
self.tearDown()
class TestSharedHelpers(unittest.TestCase):
"""Server-less unit tests for the helpers the mlx backend depends on."""
def test_parse_options_typed(self):
opts = parse_options(["temperature:0.7", "max_tokens:128", "trust:true", "name:hello", "no_colon_skipped"])
self.assertEqual(opts["temperature"], 0.7)
self.assertEqual(opts["max_tokens"], 128)
self.assertIs(opts["trust"], True)
self.assertEqual(opts["name"], "hello")
self.assertNotIn("no_colon_skipped", opts)
def test_messages_to_dicts_roundtrip(self):
# Build proto Message objects (via backend_pb2 to match real gRPC)
msgs = [
backend_pb2.Message(role="user", content="hi"),
backend_pb2.Message(
role="assistant",
content="",
tool_calls='[{"id":"call_1","type":"function","function":{"name":"f","arguments":"{}"}}]',
),
backend_pb2.Message(
role="tool",
content="42",
tool_call_id="call_1",
name="f",
),
]
out = messages_to_dicts(msgs)
self.assertEqual(out[0], {"role": "user", "content": "hi"})
self.assertEqual(out[1]["role"], "assistant")
self.assertEqual(out[1]["tool_calls"][0]["function"]["name"], "f")
self.assertEqual(out[2]["tool_call_id"], "call_1")
self.assertEqual(out[2]["name"], "f")
def test_split_reasoning(self):
r, c = split_reasoning("<think>step 1\nstep 2</think>The answer is 42.", "<think>", "</think>")
self.assertEqual(r, "step 1\nstep 2")
self.assertEqual(c, "The answer is 42.")
def test_split_reasoning_no_marker(self):
r, c = split_reasoning("just text", "<think>", "</think>")
self.assertEqual(r, "")
self.assertEqual(c, "just text")
def test_parse_tool_calls_with_shim(self):
tm = types.SimpleNamespace(
tool_call_start="<tool_call>",
tool_call_end="</tool_call>",
parse_tool_call=lambda body, tools: {"name": "get_weather", "arguments": {"location": body.strip()}},
)
calls, remaining = parse_tool_calls(
"Sure: <tool_call>Paris</tool_call>",
tm,
tools=None,
)
self.assertEqual(len(calls), 1)
self.assertEqual(calls[0]["name"], "get_weather")
self.assertEqual(calls[0]["arguments"], '{"location": "Paris"}')
self.assertEqual(calls[0]["index"], 0)
self.assertNotIn("<tool_call>", remaining)
# Unit tests for ThreadSafeLRUPromptCache are in test_mlx_cache.py

View File

@@ -1,17 +0,0 @@
.PHONY: sglang
sglang:
bash install.sh
.PHONY: run
run: sglang
@echo "Running sglang..."
bash run.sh
@echo "sglang run."
.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py
.PHONY: clean
clean: protogen-clean
rm -rf venv __pycache__

View File

@@ -1,502 +0,0 @@
#!/usr/bin/env python3
"""LocalAI gRPC backend for sglang.
Wraps sglang's async Engine API behind the Backend gRPC contract defined
in backend.proto. Mirrors the structure of backend/python/vllm/backend.py
so that the two backends stay behavior-equivalent at the protocol level.
The streaming path applies sglang's per-request FunctionCallParser and
ReasoningParser so tool_calls and reasoning_content are emitted
incrementally inside ChatDelta, which is a capability sglang exposes
natively and vLLM does not.
"""
import asyncio
from concurrent import futures
import argparse
import signal
import sys
import os
import json
import gc
import uuid
import base64
import io
from typing import Dict, List, Optional, Tuple
from PIL import Image
import backend_pb2
import backend_pb2_grpc
import grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors
# sglang imports. Engine is the stable public entry point; parser modules
# are wrapped in try/except so older / leaner installs that omit them
# still load the backend for plain text generation.
from sglang.srt.entrypoints.engine import Engine
try:
from sglang.srt.function_call.function_call_parser import FunctionCallParser
# sglang's FunctionCallParser expects a list of pydantic Tool objects
# (protocol.Tool with .function.name), not plain dicts. Wrap at the
# request boundary to match.
from sglang.srt.entrypoints.openai.protocol import Tool as SglTool
HAS_TOOL_PARSERS = True
except Exception:
FunctionCallParser = None # type: ignore
SglTool = None # type: ignore
HAS_TOOL_PARSERS = False
try:
from sglang.srt.parser.reasoning_parser import ReasoningParser
HAS_REASONING_PARSERS = True
except Exception:
ReasoningParser = None # type: ignore
HAS_REASONING_PARSERS = False
try:
from transformers import AutoTokenizer
HAS_TRANSFORMERS = True
except Exception:
AutoTokenizer = None # type: ignore
HAS_TRANSFORMERS = False
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
class BackendServicer(backend_pb2_grpc.BackendServicer):
"""gRPC servicer implementing the Backend service for sglang."""
def _parse_options(self, options_list) -> Dict[str, str]:
opts: Dict[str, str] = {}
for opt in options_list:
if ":" not in opt:
continue
key, value = opt.split(":", 1)
opts[key.strip()] = value.strip()
return opts
def _messages_to_dicts(self, messages) -> List[dict]:
result: List[dict] = []
for msg in messages:
d = {"role": msg.role, "content": msg.content or ""}
if msg.name:
d["name"] = msg.name
if msg.tool_call_id:
d["tool_call_id"] = msg.tool_call_id
if msg.reasoning_content:
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
except json.JSONDecodeError:
pass
result.append(d)
return result
def Health(self, request, context):
return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
async def LoadModel(self, request, context):
engine_kwargs = {"model_path": request.Model}
if request.Quantization:
engine_kwargs["quantization"] = request.Quantization
if request.LoadFormat:
engine_kwargs["load_format"] = request.LoadFormat
if request.GPUMemoryUtilization:
engine_kwargs["mem_fraction_static"] = float(request.GPUMemoryUtilization)
if request.TrustRemoteCode:
engine_kwargs["trust_remote_code"] = True
if request.EnforceEager:
engine_kwargs["disable_cuda_graph"] = True
if request.TensorParallelSize:
engine_kwargs["tp_size"] = int(request.TensorParallelSize)
if request.MaxModelLen:
engine_kwargs["context_length"] = int(request.MaxModelLen)
if request.DType:
engine_kwargs["dtype"] = request.DType
opts = self._parse_options(request.Options)
# Cache parser names — actual parser instances are created per
# request because sglang's parsers are stateful.
self.tool_parser_name: Optional[str] = opts.get("tool_parser") or None
self.reasoning_parser_name: Optional[str] = opts.get("reasoning_parser") or None
# Also hand the parser names to sglang's engine so its HTTP/OAI
# paths work identically if someone hits the engine directly.
if self.tool_parser_name:
engine_kwargs["tool_call_parser"] = self.tool_parser_name
if self.reasoning_parser_name:
engine_kwargs["reasoning_parser"] = self.reasoning_parser_name
try:
self.llm = Engine(**engine_kwargs)
except Exception as err:
print(f"sglang Engine init failed: {err!r}", file=sys.stderr)
return backend_pb2.Result(success=False, message=f"{err!r}")
# sglang does not expose a uniform get_tokenizer() off Engine.
# Use transformers directly — same path sglang uses internally.
self.tokenizer = None
if HAS_TRANSFORMERS:
try:
self.tokenizer = AutoTokenizer.from_pretrained(
request.Model,
trust_remote_code=bool(request.TrustRemoteCode),
)
except Exception as err:
print(f"AutoTokenizer load failed (non-fatal): {err!r}", file=sys.stderr)
print("Model loaded successfully", file=sys.stderr)
return backend_pb2.Result(message="Model loaded successfully", success=True)
async def Predict(self, request, context):
gen = self._predict(request, context, streaming=False)
res = await gen.__anext__()
return res
async def PredictStream(self, request, context):
iterations = self._predict(request, context, streaming=True)
try:
async for iteration in iterations:
yield iteration
finally:
try:
await iterations.aclose()
except Exception:
pass
async def TokenizeString(self, request, context):
if not getattr(self, "tokenizer", None):
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("tokenizer not loaded")
return backend_pb2.TokenizationResponse()
try:
tokens = self.tokenizer.encode(request.Prompt)
return backend_pb2.TokenizationResponse(length=len(tokens), tokens=tokens)
except Exception as e:
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(str(e))
return backend_pb2.TokenizationResponse()
async def Free(self, request, context):
try:
if hasattr(self, "llm"):
try:
self.llm.shutdown()
except Exception:
pass
del self.llm
if hasattr(self, "tokenizer"):
del self.tokenizer
self.tool_parser_name = None
self.reasoning_parser_name = None
gc.collect()
try:
import torch
if torch.cuda.is_available():
torch.cuda.empty_cache()
except ImportError:
pass
return backend_pb2.Result(success=True, message="Model freed")
except Exception as e:
return backend_pb2.Result(success=False, message=str(e))
def _build_sampling_params(self, request) -> dict:
sampling_params: dict = {"temperature": 0.7, "max_new_tokens": 200}
mapping = {
"N": "n",
"PresencePenalty": "presence_penalty",
"FrequencyPenalty": "frequency_penalty",
"RepetitionPenalty": "repetition_penalty",
"Temperature": "temperature",
"TopP": "top_p",
"TopK": "top_k",
"MinP": "min_p",
"Seed": "seed",
"StopPrompts": "stop",
"StopTokenIds": "stop_token_ids",
"IgnoreEOS": "ignore_eos",
"Tokens": "max_new_tokens",
"MinTokens": "min_new_tokens",
"SkipSpecialTokens": "skip_special_tokens",
}
for proto_field, sgl_key in mapping.items():
if not hasattr(request, proto_field):
continue
value = getattr(request, proto_field)
if value in (None, 0, 0.0, [], False, ""):
continue
# repeated fields come back as RepeatedScalarContainer — convert
if hasattr(value, "__iter__") and not isinstance(value, (str, bytes)):
value = list(value)
if not value:
continue
sampling_params[sgl_key] = value
# Grammar → JSON schema or EBNF structured decoding.
if getattr(request, "Grammar", ""):
grammar = request.Grammar
try:
json.loads(grammar)
sampling_params["json_schema"] = grammar
except json.JSONDecodeError:
sampling_params["ebnf"] = grammar
return sampling_params
def _build_prompt(self, request) -> str:
prompt = request.Prompt
if prompt or not request.UseTokenizerTemplate or not request.Messages:
return prompt
if self.tokenizer is None:
print(
"UseTokenizerTemplate requested but tokenizer not loaded; "
"falling back to naive concatenation",
file=sys.stderr,
)
return "\n".join(m.content or "" for m in request.Messages)
messages_dicts = self._messages_to_dicts(request.Messages)
template_kwargs: dict = {"tokenize": False, "add_generation_prompt": True}
if request.Tools:
try:
template_kwargs["tools"] = json.loads(request.Tools)
except json.JSONDecodeError:
pass
if request.Metadata.get("enable_thinking", "").lower() == "true":
template_kwargs["enable_thinking"] = True
try:
return self.tokenizer.apply_chat_template(messages_dicts, **template_kwargs)
except TypeError:
return self.tokenizer.apply_chat_template(
messages_dicts, tokenize=False, add_generation_prompt=True,
)
def _make_parsers(self, request):
"""Construct fresh per-request parser instances (stateful)."""
tool_parser = None
reasoning_parser = None
if HAS_TOOL_PARSERS and self.tool_parser_name and request.Tools:
try:
tools_raw = json.loads(request.Tools)
tools = [SglTool.model_validate(t) for t in tools_raw] if SglTool else tools_raw
tool_parser = FunctionCallParser(
tools=tools, tool_call_parser=self.tool_parser_name,
)
except Exception as e:
print(f"FunctionCallParser init failed: {e!r}", file=sys.stderr)
if HAS_REASONING_PARSERS and self.reasoning_parser_name:
try:
reasoning_parser = ReasoningParser(
model_type=self.reasoning_parser_name,
stream_reasoning=True,
)
except Exception as e:
print(f"ReasoningParser init failed: {e!r}", file=sys.stderr)
return tool_parser, reasoning_parser
async def _predict(self, request, context, streaming: bool = False):
sampling_params = self._build_sampling_params(request)
prompt = self._build_prompt(request)
tool_parser, reasoning_parser = self._make_parsers(request)
image_data = list(request.Images) if request.Images else None
video_data = list(request.Videos) if request.Videos else None
# Kick off streaming generation. We always use stream=True so the
# non-stream path still gets parser coverage on the final text.
try:
iterator = await self.llm.async_generate(
prompt=prompt,
sampling_params=sampling_params,
image_data=image_data,
video_data=video_data,
stream=True,
)
except Exception as e:
print(f"sglang async_generate failed: {e!r}", file=sys.stderr)
yield backend_pb2.Reply(message=bytes(f"error: {e!r}", "utf-8"))
return
generated_text = ""
last_chunk: Optional[dict] = None
# Track tool call ids once per (request, tool_index) to match the
# OpenAI streaming contract (id sent on first chunk for that tool).
tool_ids_seen: Dict[int, str] = {}
try:
async for chunk in iterator:
last_chunk = chunk
cumulative = chunk.get("text", "") if isinstance(chunk, dict) else ""
delta_text = cumulative[len(generated_text):] if cumulative.startswith(generated_text) else cumulative
generated_text = cumulative
if not delta_text:
continue
reasoning_delta = ""
content_delta = delta_text
if reasoning_parser is not None:
try:
r, n = reasoning_parser.parse_stream_chunk(delta_text)
reasoning_delta = r or ""
content_delta = n or ""
except Exception as e:
print(f"reasoning_parser.parse_stream_chunk: {e!r}", file=sys.stderr)
tool_call_deltas: List[backend_pb2.ToolCallDelta] = []
if tool_parser is not None and content_delta:
try:
normal_text, calls = tool_parser.parse_stream_chunk(content_delta)
content_delta = normal_text or ""
for tc in calls:
idx = int(getattr(tc, "tool_index", 0) or 0)
tc_id = tool_ids_seen.get(idx)
if tc_id is None:
tc_id = f"call_{uuid.uuid4().hex[:24]}"
tool_ids_seen[idx] = tc_id
tool_call_deltas.append(backend_pb2.ToolCallDelta(
index=idx,
id=tc_id,
name=getattr(tc, "name", "") or "",
arguments=getattr(tc, "parameters", "") or "",
))
except Exception as e:
print(f"tool_parser.parse_stream_chunk: {e!r}", file=sys.stderr)
if streaming and (content_delta or reasoning_delta or tool_call_deltas):
yield backend_pb2.Reply(
message=bytes(content_delta, "utf-8"),
chat_deltas=[backend_pb2.ChatDelta(
content=content_delta,
reasoning_content=reasoning_delta,
tool_calls=tool_call_deltas,
)],
)
finally:
try:
await iterator.aclose()
except Exception:
pass
# Extract token counts from the final chunk's meta_info.
meta = {}
if isinstance(last_chunk, dict):
meta = last_chunk.get("meta_info") or {}
prompt_tokens = int(meta.get("prompt_tokens", 0) or 0)
completion_tokens = int(meta.get("completion_tokens", 0) or 0)
# Non-streaming path: re-parse the full text with fresh parsers
# so we return a clean, complete ChatDelta. Streaming parsers
# used above have accumulated state we don't want to reuse.
final_content = generated_text
final_reasoning = ""
final_tool_calls: List[backend_pb2.ToolCallDelta] = []
if not streaming:
final_reasoning_parser = None
if HAS_REASONING_PARSERS and self.reasoning_parser_name:
try:
final_reasoning_parser = ReasoningParser(
model_type=self.reasoning_parser_name,
stream_reasoning=False,
)
except Exception:
final_reasoning_parser = None
if final_reasoning_parser is not None:
try:
r, n = final_reasoning_parser.parse_non_stream(generated_text)
final_reasoning = r or ""
final_content = n if n is not None else generated_text
except Exception as e:
print(f"reasoning_parser.parse_non_stream: {e!r}", file=sys.stderr)
if HAS_TOOL_PARSERS and self.tool_parser_name and request.Tools:
try:
tools_raw = json.loads(request.Tools)
tools = [SglTool.model_validate(t) for t in tools_raw] if SglTool else tools_raw
fresh_tool_parser = FunctionCallParser(
tools=tools, tool_call_parser=self.tool_parser_name,
)
normal, calls = fresh_tool_parser.parse_non_stream(final_content)
if calls:
final_content = normal
for tc in calls:
idx = int(getattr(tc, "tool_index", 0) or 0)
final_tool_calls.append(backend_pb2.ToolCallDelta(
index=idx,
id=f"call_{uuid.uuid4().hex[:24]}",
name=getattr(tc, "name", "") or "",
arguments=getattr(tc, "parameters", "") or "",
))
except Exception as e:
print(f"tool_parser.parse_non_stream: {e!r}", file=sys.stderr)
chat_delta = backend_pb2.ChatDelta(
content=final_content if not streaming else "",
reasoning_content=final_reasoning,
tool_calls=final_tool_calls,
)
if streaming:
yield backend_pb2.Reply(
message=b"",
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
chat_deltas=[chat_delta],
)
return
yield backend_pb2.Reply(
message=bytes(final_content or "", "utf-8"),
prompt_tokens=prompt_tokens,
tokens=completion_tokens,
chat_deltas=[chat_delta],
)
async def serve(address):
server = grpc.aio.server(
migration_thread_pool=futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
options=[
('grpc.max_message_length', 50 * 1024 * 1024),
('grpc.max_send_message_length', 50 * 1024 * 1024),
('grpc.max_receive_message_length', 50 * 1024 * 1024),
],
interceptors=get_auth_interceptors(aio=True),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
loop = asyncio.get_event_loop()
for sig in (signal.SIGINT, signal.SIGTERM):
loop.add_signal_handler(sig, lambda: asyncio.ensure_future(server.stop(5)))
await server.start()
print("Server started. Listening on: " + address, file=sys.stderr)
await server.wait_for_termination()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run the sglang gRPC server.")
parser.add_argument(
"--addr", default="localhost:50051", help="The address to bind the server to.",
)
args = parser.parse_args()
asyncio.run(serve(args.addr))

View File

@@ -1,87 +0,0 @@
#!/bin/bash
set -e
EXTRA_PIP_INSTALL_FLAGS="--no-build-isolation"
# Avoid overcommitting the CPU during builds that compile native code.
export NVCC_THREADS=2
export MAX_JOBS=1
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
if [ "x${BUILD_PROFILE}" == "xintel" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --upgrade --index-strategy=unsafe-first-match"
fi
if [ "x${BUILD_PROFILE}" == "xcpu" ]; then
EXTRA_PIP_INSTALL_FLAGS+=" --index-strategy=unsafe-best-match"
fi
# sglang's CPU path has no prebuilt wheel on PyPI — upstream publishes
# a separate pyproject_cpu.toml that must be swapped in before `pip install`.
# Reference: docker/xeon.Dockerfile in the sglang upstream repo.
#
# When BUILD_TYPE is empty (CPU profile) or FROM_SOURCE=true is forced,
# install torch/transformers/etc from requirements-cpu.txt, then clone
# sglang and install its python/ and sgl-kernel/ packages from source
# using the CPU pyproject.
if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
# sgl-kernel's CPU build links against libnuma and libtbb. Install
# them here (Docker builder stage) before running the source build.
# Harmless no-op on runs outside the docker build since installRequirements
# below still needs them only if we reach the source build branch.
if command -v apt-get >/dev/null 2>&1 && [ "$(id -u)" = "0" ]; then
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
libnuma-dev numactl libtbb-dev libgomp1 libomp-dev google-perftools \
build-essential cmake ninja-build
fi
installRequirements
# sgl-kernel's pyproject_cpu.toml uses scikit-build-core as its build
# backend. With --no-build-isolation, that (and ninja/cmake) must be
# present in the venv before we build from source.
uv pip install --no-build-isolation "scikit-build-core>=0.10" ninja cmake
# sgl-kernel's CPU shm.cpp uses __m512 AVX-512 intrinsics unconditionally.
# csrc/cpu/CMakeLists.txt hard-codes add_compile_options(-march=native),
# which on runners without AVX-512 in /proc/cpuinfo fails with
# "__m512 return without 'avx512f' enabled changes the ABI".
# CXXFLAGS alone is insufficient because CMake's add_compile_options()
# appends -march=native *after* CXXFLAGS, overriding it.
# We therefore patch the CMakeLists.txt to replace -march=native with
# -march=sapphirerapids so the flag is consistent throughout the build.
# The resulting binary still requires an AVX-512 capable CPU at runtime,
# same constraint sglang upstream documents in docker/xeon.Dockerfile.
_sgl_src=$(mktemp -d)
trap 'rm -rf "${_sgl_src}"' EXIT
git clone --depth 1 https://github.com/sgl-project/sglang "${_sgl_src}/sglang"
# Patch -march=native → -march=sapphirerapids in the CPU kernel CMakeLists
sed -i 's/-march=native/-march=sapphirerapids/g' \
"${_sgl_src}/sglang/sgl-kernel/csrc/cpu/CMakeLists.txt"
pushd "${_sgl_src}/sglang/sgl-kernel"
if [ -f pyproject_cpu.toml ]; then
cp pyproject_cpu.toml pyproject.toml
fi
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
popd
pushd "${_sgl_src}/sglang/python"
if [ -f pyproject_cpu.toml ]; then
cp pyproject_cpu.toml pyproject.toml
fi
uv pip install ${EXTRA_PIP_INSTALL_FLAGS:-} .
popd
else
installRequirements
fi

View File

@@ -1,63 +0,0 @@
#!/bin/bash
# Package runtime shared libraries for the sglang backend.
#
# Dockerfile.python's final stage is FROM scratch — every system library
# the backend dlopens at runtime must be explicitly copied into
# ${BACKEND}/lib, which libbackend.sh adds to LD_LIBRARY_PATH.
#
# sglang's CPU kernel links against libnuma and libtbb; torch's CPU
# kernels use libgomp; tcmalloc + iomp5 are preloaded per sglang's
# docker/xeon.Dockerfile recipe for best CPU throughput. Missing any of
# these makes the engine crash on import.
set -e
CURDIR=$(dirname "$(realpath "$0")")
LIB_DIR="${CURDIR}/lib"
mkdir -p "${LIB_DIR}"
copy_with_symlinks() {
local soname="$1"
local hit=""
for dir in \
/usr/lib/x86_64-linux-gnu \
/usr/lib/aarch64-linux-gnu \
/lib/x86_64-linux-gnu \
/lib/aarch64-linux-gnu \
/usr/lib \
/lib; do
if [ -e "${dir}/${soname}" ]; then
hit="${dir}/${soname}"
break
fi
done
if [ -z "${hit}" ]; then
echo "warning: ${soname} not found in standard lib paths" >&2
return 0
fi
local real
real=$(readlink -f "${hit}")
cp -v "${real}" "${LIB_DIR}/"
local real_base
real_base=$(basename "${real}")
if [ "${real_base}" != "${soname}" ]; then
ln -sf "${real_base}" "${LIB_DIR}/${soname}"
fi
}
copy_with_symlinks libnuma.so.1
copy_with_symlinks libgomp.so.1
copy_with_symlinks libtbb.so.12
copy_with_symlinks libtbbmalloc.so.2
copy_with_symlinks libtcmalloc.so.4
# intel-openmp ships libiomp5.so inside the venv under venv/lib/ — sglang's
# CPU kernel was compiled against its __kmpc_* symbols, so it must be on
# LD_LIBRARY_PATH at runtime. Copy it into the backend lib dir where
# libbackend.sh will pick it up.
if [ -f "${CURDIR}/venv/lib/libiomp5.so" ]; then
cp -v "${CURDIR}/venv/lib/libiomp5.so" "${LIB_DIR}/"
fi
echo "sglang packaging completed successfully"
ls -liah "${LIB_DIR}/"

View File

@@ -1,2 +0,0 @@
# sglang is installed per-acceleration in requirements-{profile}-after.txt
# (cublas12, hipblas, intel, cpu)

View File

@@ -1,3 +0,0 @@
# sglang has no prebuilt CPU wheel on PyPI. install.sh performs a
# from-source build using the upstream pyproject_cpu.toml recipe from
# docker/xeon.Dockerfile when BUILD_TYPE is empty (CPU profile).

View File

@@ -1,7 +0,0 @@
--extra-index-url https://download.pytorch.org/whl/cpu
accelerate
torch==2.9.0
torchvision
torchaudio
transformers
intel-openmp; platform_machine == 'x86_64'

View File

@@ -1,3 +0,0 @@
# Bump this pin deliberately — sglang releases weekly and API surfaces
# (FunctionCallParser, ReasoningParser) move between releases.
sglang[all]>=0.4.0

View File

@@ -1,5 +0,0 @@
accelerate
torch==2.7.1
torchvision
torchaudio==2.7.1
transformers

View File

@@ -1,2 +0,0 @@
# sglang's ROCm build is installed from source per docker/rocm.Dockerfile
# upstream; install.sh handles the source build when BUILD_TYPE=hipblas.

View File

@@ -1,5 +0,0 @@
--extra-index-url https://download.pytorch.org/whl/nightly/rocm7.0
accelerate
torch
torchvision
transformers

View File

@@ -1,6 +0,0 @@
# sglang and sgl-kernel do not declare full PEP517 build deps; install the
# basic build tooling into the venv before pulling the rest of the stack.
packaging
setuptools
wheel
setuptools-scm

View File

@@ -1,2 +0,0 @@
# sglang's Intel XPU build is installed from source per docker/xpu.Dockerfile
# upstream; install.sh handles the source build when BUILD_PROFILE=intel.

View File

@@ -1,7 +0,0 @@
--extra-index-url https://download.pytorch.org/whl/xpu
accelerate
torch
torchvision
transformers
optimum[openvino]
setuptools

View File

@@ -1,4 +0,0 @@
grpcio==1.80.0
protobuf
certifi
setuptools

View File

@@ -1,29 +0,0 @@
#!/bin/bash
backend_dir=$(dirname $(realpath $0))
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
# sglang's CPU kernel references LLVM OpenMP (__kmpc_*) symbols that are
# not declared in its NEEDED list — they get resolved through LD_PRELOAD
# of libiomp5.so in sglang's own docker/xeon.Dockerfile. Do the same here.
# Harmless on GPU builds where libiomp5.so is absent.
if [ -f "${backend_dir}/lib/libiomp5.so" ]; then
if [ -n "${LD_PRELOAD:-}" ]; then
export LD_PRELOAD="${backend_dir}/lib/libiomp5.so:${LD_PRELOAD}"
else
export LD_PRELOAD="${backend_dir}/lib/libiomp5.so"
fi
fi
# sglang CPU engine requires this env var to switch to the CPU backend.
# No-op on GPU builds. See docker/xeon.Dockerfile in sglang upstream.
if [ -f "${backend_dir}/lib/libiomp5.so" ]; then
export SGLANG_USE_CPU_ENGINE=1
fi
startBackend $@

View File

@@ -1,25 +0,0 @@
.DEFAULT_GOAL := install
.PHONY: install
install:
bash install.sh
.PHONY: run
run: install
@echo "Running tinygrad..."
bash run.sh
@echo "tinygrad run."
.PHONY: test
test: install
@echo "Testing tinygrad..."
bash test.sh
@echo "tinygrad tested."
.PHONY: protogen-clean
protogen-clean:
$(RM) backend_pb2_grpc.py backend_pb2.py
.PHONY: clean
clean: protogen-clean
rm -rf venv __pycache__

View File

@@ -1,785 +0,0 @@
#!/usr/bin/env python3
"""
LocalAI gRPC backend for tinygrad.
LLM execution is delegated to `tinygrad.apps.llm.Transformer` — we keep
only a thin HF → GGUF-name adapter (vendor/appsllm_adapter.py) for the
safetensors path; GGUF models load through `Transformer.from_gguf()`
with native Q4/Q6/Q8 support.
Scope:
- LLM text generation via apps.llm (Qwen3 / Qwen3.5 / Llama 3.x /
GLM-4 / OLMoE / Kimi-K2 / Moonlight — anything apps.llm supports).
- Native tool-call extraction via pluggable parsers (hermes,
llama3_json, qwen3_xml, mistral).
- Embeddings — mean-pooled last-hidden-state over the block stack.
- Stable Diffusion 1.x, Whisper — handled by the vendored paths.
Sampling is greedy-only because `apps.llm.Transformer.generate` (in the
tinygrad 0.12.0 PyPI release) ends with `.argmax(-1)` and takes no
temperature / top-k / top-p / repetition-penalty arguments. These
request fields are accepted and ignored.
The heavy imports (tinygrad, tokenizers, tinygrad.apps.llm) are deferred
until `LoadModel`, because tinygrad binds its compute device at import
time from env vars. `_select_tinygrad_device()` maps LocalAI's BUILD_TYPE
onto the corresponding tinygrad env flag before any import happens.
"""
from __future__ import annotations
import argparse
import asyncio
import json
import os
import signal
import sys
import time
from concurrent import futures
from pathlib import Path
from typing import Any, Optional
import grpc
import backend_pb2
import backend_pb2_grpc
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'common'))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'common'))
from grpc_auth import get_auth_interceptors # noqa: E402
from tool_parsers import resolve_parser # noqa: E402
from tool_parsers.base import ToolCall # noqa: E402
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
# ---------------------------------------------------------------------------
# Device selection — must run BEFORE `import tinygrad` anywhere.
#
# In production this is set by run.sh based on which driver libraries the
# host has injected into the container (libcuda.so.1 → CUDA, libamdhip64
# → HIP, otherwise CLANG). This helper is only a fallback for direct
# invocations like the unit tests.
# ---------------------------------------------------------------------------
def _select_tinygrad_device() -> None:
if any(os.environ.get(k) == "1" for k in ("CUDA", "HIP", "METAL", "CLANG", "AMD", "NV")):
return
os.environ["CLANG"] = "1"
# ---------------------------------------------------------------------------
# Model asset discovery
# ---------------------------------------------------------------------------
def _resolve_model_assets(model_ref: str) -> Path:
"""
Accept either a local path or a HuggingFace repo id (e.g.
"unsloth/Qwen3.5-0.8B-GGUF") and return the local directory / file.
HF ids are materialized via `huggingface_hub.snapshot_download` — we
pull both safetensors (for fp16 HF repos) and GGUF (for quantized
repos) so the same code path handles either.
"""
p = Path(model_ref)
if p.exists():
return p
if "/" in model_ref and not model_ref.startswith(("/", ".")):
from huggingface_hub import snapshot_download
local = snapshot_download(
repo_id=model_ref,
allow_patterns=[
"config.json",
"tokenizer.json",
"tokenizer_config.json",
"special_tokens_map.json",
"generation_config.json",
"*.safetensors",
"*.safetensors.index.json",
"*.gguf",
],
)
return Path(local)
raise FileNotFoundError(f"Model not found: {model_ref}")
def _gguf_path(model_ref: Path) -> Optional[Path]:
"""Return the GGUF file to load from a path that may be a file or dir."""
if model_ref.is_file() and str(model_ref).endswith(".gguf"):
return model_ref
if model_ref.is_dir():
ggufs = sorted(model_ref.glob("*.gguf"))
if ggufs:
return ggufs[0]
return None
def _load_hf_safetensors(model_dir: Path) -> dict[str, Any]:
"""Load sharded or single-file HF safetensors from a directory."""
from tinygrad.nn.state import safe_load
index = model_dir / "model.safetensors.index.json"
if index.exists():
with open(index) as fp:
weight_map = json.load(fp)["weight_map"]
shards: dict[str, Any] = {}
for shard_name in set(weight_map.values()):
shards[shard_name] = safe_load(str(model_dir / shard_name))
return {k: shards[n][k] for k, n in weight_map.items()}
single = model_dir / "model.safetensors"
if single.exists():
return safe_load(str(single))
raise FileNotFoundError(f"No safetensors weights found under {model_dir}")
def _auto_tool_parser(model_ref: Optional[str], config: dict) -> Optional[str]:
"""Pick a tool parser automatically from model family heuristics.
Order of precedence: architecture name from config.json, then model ref
string. Returns None to fall through to the passthrough parser.
"""
arches = " ".join(a.lower() for a in config.get("architectures", []))
ref = (model_ref or "").lower()
blob = f"{arches} {ref}"
if "qwen3" in blob:
return "qwen3_xml"
if "hermes" in blob or "qwen2" in blob or "qwen" in blob:
return "hermes"
if "llama-3" in blob or "llama_3" in blob or "llama3" in blob:
return "llama3_json"
if "mistral" in blob or "mixtral" in blob:
return "mistral"
return None
# ---------------------------------------------------------------------------
# Servicer
# ---------------------------------------------------------------------------
class BackendServicer(backend_pb2_grpc.BackendServicer):
"""gRPC servicer for the tinygrad backend."""
def __init__(self) -> None:
self._reset_state()
def _reset_state(self) -> None:
self.model_ref: Optional[str] = None
self.model_type: str = "llm"
self.options: dict[str, str] = {}
# LLM state
self.llm_model = None
self.llm_config: dict = {}
self.llm_tokenizer = None
self.llm_eos_ids: list[int] = []
self.chat_template: Optional[str] = None
self.tool_parser = resolve_parser(None)
self.max_context = 4096
# Stable Diffusion state
self.sd_model = None
# Whisper state
self.whisper_model = None
self.whisper_tokenizer = None
# --------------------- helpers --------------------------------------
@staticmethod
def _parse_options(options_list) -> dict[str, str]:
opts: dict[str, str] = {}
for opt in options_list:
if ":" not in opt:
continue
key, value = opt.split(":", 1)
opts[key.strip()] = value.strip()
return opts
@staticmethod
def _detect_model_type(model_ref: str, explicit: Optional[str]) -> str:
if explicit:
return explicit
name = (model_ref or "").lower()
if "whisper" in name:
return "whisper"
if "sdxl" in name:
return "sdxl"
if "sd-v1" in name or "v1-5" in name or "stable-diffusion" in name:
return "sd15"
if any(tag in name for tag in ("bge", "e5", "minilm", "bert")):
return "bert"
return "llm"
def _messages_to_dicts(self, messages) -> list[dict]:
result = []
for msg in messages:
d: dict = {"role": msg.role, "content": msg.content or ""}
if msg.name:
d["name"] = msg.name
if msg.tool_call_id:
d["tool_call_id"] = msg.tool_call_id
if msg.reasoning_content:
d["reasoning_content"] = msg.reasoning_content
if msg.tool_calls:
try:
d["tool_calls"] = json.loads(msg.tool_calls)
except json.JSONDecodeError:
pass
result.append(d)
return result
def _render_prompt(self, request) -> str:
"""Render messages + tools into the model's chat template, or fall
back to the raw Prompt field for models without a template."""
if not request.Messages and request.Prompt:
return request.Prompt
if not self.chat_template:
# No template known — concatenate role/content lines.
lines = []
for msg in request.Messages:
lines.append(f"{msg.role}: {msg.content or ''}")
return "\n".join(lines) + "\nassistant:"
from jinja2 import Environment
env = Environment(trim_blocks=True, lstrip_blocks=True)
template = env.from_string(self.chat_template)
tools = None
if request.Tools:
try:
tools = json.loads(request.Tools)
except json.JSONDecodeError:
tools = None
return template.render(
messages=self._messages_to_dicts(request.Messages),
tools=tools,
add_generation_prompt=True,
# Qwen3's chat template enables <think>...</think> reasoning
# by default. On small models (0.6B) that reasoning preamble
# eats the whole token budget before a tool call emerges, so
# we disable it. Templates that don't know this var ignore it.
enable_thinking=False,
)
# --------------------- LLM path -------------------------------------
def _load_llm(self, model_path: Path) -> None:
"""Load an LLM through `tinygrad.apps.llm.Transformer`.
Two paths:
- GGUF file (anywhere in the tree) → `Transformer.from_gguf()`
handles config, weight conversion (incl. Q4/Q6/Q8 quantization)
and RoPE permute natively.
- HF safetensors directory → build `TransformerConfig` from
config.json and load weights via a small HF→GGUF-name adapter.
"""
from tinygrad import Device, Tensor, dtypes
from tinygrad.apps.llm import Transformer
from tinygrad.nn.state import load_state_dict
from vendor.appsllm_adapter import (
_hf_to_appsllm_state_dict,
_hf_to_transformer_kwargs,
)
max_context_cap = 8192
gguf_file = _gguf_path(model_path)
if gguf_file is not None:
# GGUF path: apps.llm handles everything — config, quant, RoPE.
gguf_tensor = Tensor.empty(
os.stat(gguf_file).st_size, dtype=dtypes.uint8,
device=f"disk:{gguf_file}",
).to(Device.DEFAULT)
model, kv = Transformer.from_gguf(gguf_tensor, max_context=max_context_cap)
self.llm_model = model
self.max_context = model.max_context
# Preserve a config-shaped dict for tool-parser heuristics and
# the "loaded" message.
arch = kv.get("general.architecture", "")
self.llm_config = {
"architectures": [kv.get("general.name", arch) or arch],
"gguf_kv": kv,
}
# Tokenizer: prefer sidecar tokenizer.json (richer HF Jinja2
# templates), fall back to apps.llm's SimpleTokenizer built
# from GGUF metadata.
self._load_tokenizer_for_dir(model_path if model_path.is_dir() else gguf_file.parent, gguf_kv=kv)
else:
# HF safetensors path.
if not model_path.is_dir():
raise FileNotFoundError(f"Expected HF model directory, got file: {model_path}")
config_path = model_path / "config.json"
if not config_path.exists():
raise FileNotFoundError(f"config.json not found under {model_path}")
with open(config_path) as fp:
hf_config = json.load(fp)
self.llm_config = hf_config
raw_weights = _load_hf_safetensors(model_path)
n_layers = hf_config["num_hidden_layers"]
state_dict = _hf_to_appsllm_state_dict(raw_weights, n_layers)
kwargs = _hf_to_transformer_kwargs(hf_config, state_dict, max_context_cap)
self.max_context = kwargs["max_context"]
model = Transformer(**kwargs)
load_state_dict(model, state_dict, strict=False, consume=True)
self.llm_model = model
self._load_tokenizer_for_dir(model_path, gguf_kv=None)
# Auto-pick tool parser from options or model family.
parser_name = self.options.get("tool_parser") or _auto_tool_parser(self.model_ref, self.llm_config)
self.tool_parser = resolve_parser(parser_name)
def _load_tokenizer_for_dir(self, model_dir: Path, gguf_kv: Optional[dict]) -> None:
"""Load HF tokenizer + chat template + EOS ids from a model directory.
Falls back to apps.llm's `SimpleTokenizer.from_gguf_kv` when there
is no `tokenizer.json` sidecar (single-file GGUF, no HF repo).
"""
tokenizer_json = model_dir / "tokenizer.json"
if tokenizer_json.exists():
from tokenizers import Tokenizer as HFTokenizer
self.llm_tokenizer = HFTokenizer.from_file(str(tokenizer_json))
elif gguf_kv is not None:
from tinygrad.apps.llm import SimpleTokenizer
self.llm_tokenizer = SimpleTokenizer.from_gguf_kv(gguf_kv)
else:
raise FileNotFoundError(f"tokenizer.json not found under {model_dir}")
tok_cfg_path = model_dir / "tokenizer_config.json"
if tok_cfg_path.exists():
with open(tok_cfg_path) as fp:
tok_cfg = json.load(fp)
self.chat_template = tok_cfg.get("chat_template")
self.llm_eos_ids = []
for cfg_name in ("generation_config.json", "config.json"):
cfg_path = model_dir / cfg_name
if not cfg_path.exists():
continue
with open(cfg_path) as fp:
cfg = json.load(fp)
eos = cfg.get("eos_token_id")
if isinstance(eos, list):
self.llm_eos_ids.extend(int(x) for x in eos)
elif isinstance(eos, int):
self.llm_eos_ids.append(eos)
if self.llm_eos_ids:
break
if not self.llm_eos_ids and gguf_kv is not None:
eos = gguf_kv.get("tokenizer.ggml.eos_token_id")
if isinstance(eos, int):
self.llm_eos_ids.append(eos)
# --------------------- Stable Diffusion path ------------------------
def _load_sd(self, model_ref: str) -> None:
"""Load a Stable Diffusion 1.x checkpoint (CompVis `.ckpt` format)."""
from huggingface_hub import hf_hub_download
from tinygrad.nn.state import load_state_dict, torch_load
from vendor.stable_diffusion import StableDiffusion
ckpt_path = Path(model_ref)
if not ckpt_path.exists():
# Accept an HF repo id — fetch the canonical v1-5-pruned-emaonly.ckpt
# from the requested repo. Common case is runwayml/stable-diffusion-v1-5.
repo_id = model_ref if "/" in model_ref else "runwayml/stable-diffusion-v1-5"
ckpt_file = self.options.get("sd_ckpt_filename", "v1-5-pruned-emaonly.ckpt")
ckpt_path = Path(hf_hub_download(repo_id=repo_id, filename=ckpt_file))
model = StableDiffusion()
state_dict = torch_load(str(ckpt_path))
if isinstance(state_dict, dict) and "state_dict" in state_dict:
state_dict = state_dict["state_dict"]
load_state_dict(model, state_dict, strict=False, verbose=False, realize=False)
self.sd_model = model
# --------------------- Whisper path ---------------------------------
def _load_whisper(self, model_ref: str) -> None:
"""Load a Whisper checkpoint (OpenAI `.pt` format).
Accepts a model-size alias (tiny / tiny.en / base / base.en / small /
small.en) OR an explicit `.pt` file path OR the HF repo id naming
convention `openai/whisper-*` (mapped to the matching OpenAI alias).
"""
from vendor.whisper import init_whisper, MODEL_URLS
alias = model_ref
if "/" in alias and alias.startswith("openai/whisper-"):
alias = alias.removeprefix("openai/whisper-")
if alias not in MODEL_URLS:
# Explicit path to a .pt checkpoint — fall back to size heuristic
# via filename.
basename = Path(alias).name.lower()
for name in MODEL_URLS:
if name in basename:
alias = name
break
else:
raise ValueError(
f"Unknown Whisper model_ref={model_ref!r}; expected one of {list(MODEL_URLS)} "
f"or an openai/whisper-* HF id"
)
model, enc = init_whisper(alias, batch_size=1)
self.whisper_model = model
self.whisper_tokenizer = enc
# --------------------- LLM generation -------------------------------
def _encode_prompt(self, prompt: str) -> list[int]:
"""Normalize tokenizer output: HF `tokenizers.Tokenizer.encode()`
returns an `Encoding` with `.ids`; apps.llm's `SimpleTokenizer.encode()`
returns `list[int]` directly."""
encoded = self.llm_tokenizer.encode(prompt)
return list(getattr(encoded, "ids", encoded))
def _decode_tokens(self, ids: list[int]) -> str:
return self.llm_tokenizer.decode(ids)
def _generate_tokens(self, prompt: str, max_new_tokens: int, temperature: float):
"""Yield (token_id, token_text) pairs using `apps.llm.Transformer.generate()`.
tinygrad 0.12.0's `generate()` is greedy-only (its `forward` ends
with `.argmax(-1)` and it takes no temperature / top-k / top-p
knobs). We accept `temperature` in the signature for API
compatibility but it is ignored.
"""
del temperature # tinygrad.apps.llm.Transformer.generate is greedy-only
ids = self._encode_prompt(prompt)
if not ids:
return
count = 0
for next_tok in self.llm_model.generate(list(ids)):
if next_tok in self.llm_eos_ids:
break
yield next_tok, self._decode_tokens([next_tok])
count += 1
if count >= max_new_tokens:
break
# --------------------- gRPC methods ---------------------------------
def Health(self, request, context):
return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
async def LoadModel(self, request, context):
try:
_select_tinygrad_device()
self._reset_state()
self.options = self._parse_options(list(request.Options))
self.model_ref = request.ModelFile or request.Model
self.model_type = self._detect_model_type(self.model_ref, self.options.get("model_type"))
if self.model_type in ("sd15", "sd", "stable-diffusion"):
self._load_sd(self.model_ref)
return backend_pb2.Result(
success=True, message="tinygrad Stable Diffusion 1.x loaded",
)
if self.model_type == "whisper":
self._load_whisper(self.model_ref)
return backend_pb2.Result(
success=True, message="tinygrad Whisper loaded",
)
if self.model_type != "llm":
return backend_pb2.Result(
success=False,
message=f"tinygrad: model_type={self.model_type} not yet implemented",
)
model_path = _resolve_model_assets(self.model_ref)
self._load_llm(model_path)
return backend_pb2.Result(
success=True,
message=f"tinygrad LLM loaded (arch={self.llm_config.get('architectures', ['?'])[0]}, "
f"parser={self.tool_parser.name})",
)
except Exception as exc:
import traceback
traceback.print_exc()
return backend_pb2.Result(success=False, message=f"LoadModel failed: {exc}")
async def Predict(self, request, context):
if self.llm_model is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("LLM not loaded")
return backend_pb2.Reply()
try:
prompt = self._render_prompt(request)
max_new = request.Tokens if request.Tokens > 0 else 256
temperature = request.Temperature if request.Temperature > 0 else 0.7
t0 = time.monotonic()
pieces: list[str] = []
ntok = 0
for _, text in self._generate_tokens(prompt, max_new, temperature):
pieces.append(text)
ntok += 1
elapsed = time.monotonic() - t0
full = "".join(pieces)
from tool_parsers.hermes import HermesToolParser
if isinstance(self.tool_parser, HermesToolParser):
result = self.tool_parser.parse_full(full)
content, calls, reasoning = result.content, result.tool_calls, result.reasoning
else:
content, calls = self.tool_parser.parse(full)
reasoning = ""
delta = backend_pb2.ChatDelta(
content=content,
reasoning_content=reasoning,
tool_calls=[
backend_pb2.ToolCallDelta(index=c.index, id=c.id, name=c.name, arguments=c.arguments)
for c in calls
],
)
return backend_pb2.Reply(
message=content.encode("utf-8"),
tokens=ntok,
timing_token_generation=elapsed,
chat_deltas=[delta],
)
except Exception as exc:
import traceback
traceback.print_exc()
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"Predict failed: {exc}")
return backend_pb2.Reply()
async def PredictStream(self, request, context):
if self.llm_model is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("LLM not loaded")
return
try:
prompt = self._render_prompt(request)
max_new = request.Tokens if request.Tokens > 0 else 256
temperature = request.Temperature if request.Temperature > 0 else 0.7
buffer = ""
for _, text in self._generate_tokens(prompt, max_new, temperature):
buffer += text
yield backend_pb2.Reply(
message=text.encode("utf-8"),
chat_deltas=[backend_pb2.ChatDelta(content=text)],
)
# Final emission carries the extracted tool calls (vLLM semantics).
from tool_parsers.hermes import HermesToolParser
if isinstance(self.tool_parser, HermesToolParser):
result = self.tool_parser.parse_full(buffer)
calls = result.tool_calls
reasoning = result.reasoning
else:
_, calls = self.tool_parser.parse(buffer)
reasoning = ""
if calls or reasoning:
yield backend_pb2.Reply(
chat_deltas=[backend_pb2.ChatDelta(
reasoning_content=reasoning,
tool_calls=[
backend_pb2.ToolCallDelta(index=c.index, id=c.id, name=c.name, arguments=c.arguments)
for c in calls
],
)],
)
except Exception as exc:
import traceback
traceback.print_exc()
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"PredictStream failed: {exc}")
async def Embedding(self, request, context):
if self.llm_model is None or self.llm_tokenizer is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("No model loaded")
return backend_pb2.EmbeddingResult()
try:
text = request.Embeddings
if not text:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("Embeddings field is empty")
return backend_pb2.EmbeddingResult()
from tinygrad import Tensor, dtypes
from vendor.appsllm_adapter import _embed_hidden
ids = self._encode_prompt(text)
if not ids:
return backend_pb2.EmbeddingResult(embeddings=[])
# Clamp to context window — truncate long inputs rather than blow up.
ids = ids[: self.max_context]
tokens = Tensor([ids])
hidden = _embed_hidden(self.llm_model, tokens) # (1, seqlen, dim)
# Mean pool over sequence dim
pooled = hidden.mean(axis=1).squeeze(0) # (dim,)
# L2 normalize
norm = pooled.square().sum().sqrt()
normalized = (pooled / (norm + 1e-12))
vec = normalized.cast(dtypes.float32).tolist()
return backend_pb2.EmbeddingResult(embeddings=[float(x) for x in vec])
except Exception as exc:
import traceback
traceback.print_exc()
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"Embedding failed: {exc}")
return backend_pb2.EmbeddingResult()
async def GenerateImage(self, request, context):
if self.sd_model is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("No Stable Diffusion model loaded")
return backend_pb2.Result(success=False, message="not loaded")
try:
from PIL import Image
from vendor.stable_diffusion import run_sd15
steps = request.step if request.step > 0 else 20
guidance = 7.5
seed = request.seed if request.seed != 0 else None
img_tensor = run_sd15(
model=self.sd_model,
prompt=request.positive_prompt or "",
negative_prompt=request.negative_prompt or "",
steps=steps,
guidance=guidance,
seed=seed,
)
arr = img_tensor.numpy()
image = Image.fromarray(arr)
dst = request.dst or "/tmp/tinygrad_image.png"
image.save(dst)
return backend_pb2.Result(success=True, message=dst)
except Exception as exc:
import traceback
traceback.print_exc()
return backend_pb2.Result(success=False, message=f"GenerateImage failed: {exc}")
def _transcribe(self, audio_path: str, language: Optional[str]) -> tuple[str, float]:
from vendor.whisper import load_file_waveform, transcribe_waveform
waveform = load_file_waveform(audio_path)
text = transcribe_waveform(
self.whisper_model,
self.whisper_tokenizer,
[waveform],
language=language or None,
)
duration = float(len(waveform)) / 16000.0
return text, duration
async def AudioTranscription(self, request, context):
if self.whisper_model is None or self.whisper_tokenizer is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("No Whisper model loaded")
return backend_pb2.TranscriptResult()
try:
if not request.dst:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("TranscriptRequest.dst (audio file path) is required")
return backend_pb2.TranscriptResult()
text, duration = self._transcribe(request.dst, request.language)
segments = [backend_pb2.TranscriptSegment(id=0, start=0, end=0, text=text)]
return backend_pb2.TranscriptResult(
text=text,
language=request.language or "en",
duration=duration,
segments=segments,
)
except Exception as exc:
import traceback
traceback.print_exc()
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"AudioTranscription failed: {exc}")
return backend_pb2.TranscriptResult()
async def AudioTranscriptionStream(self, request, context):
if self.whisper_model is None or self.whisper_tokenizer is None:
context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
context.set_details("No Whisper model loaded")
return
try:
if not request.dst:
context.set_code(grpc.StatusCode.INVALID_ARGUMENT)
context.set_details("TranscriptRequest.dst (audio file path) is required")
return
# The vendored tinygrad whisper loop is chunked at the file level
# (one inference pass per 30s segment), not token-level. To still
# produce a streaming response we run the full transcription and
# emit it as a single delta + a final-result envelope so the client
# gets both code paths exercised.
text, duration = self._transcribe(request.dst, request.language)
yield backend_pb2.TranscriptStreamResponse(delta=text)
final = backend_pb2.TranscriptResult(
text=text,
language=request.language or "en",
duration=duration,
segments=[backend_pb2.TranscriptSegment(id=0, start=0, end=0, text=text)],
)
yield backend_pb2.TranscriptStreamResponse(final_result=final)
except Exception as exc:
import traceback
traceback.print_exc()
context.set_code(grpc.StatusCode.INTERNAL)
context.set_details(f"AudioTranscriptionStream failed: {exc}")
async def Status(self, request, context):
return backend_pb2.StatusResponse(state=backend_pb2.StatusResponse.READY)
async def Free(self, request, context):
self._reset_state()
return backend_pb2.Result(success=True, message="freed")
async def serve(address):
server = grpc.aio.server(
migration_thread_pool=futures.ThreadPoolExecutor(max_workers=MAX_WORKERS),
options=[
('grpc.max_message_length', 50 * 1024 * 1024),
('grpc.max_send_message_length', 50 * 1024 * 1024),
('grpc.max_receive_message_length', 50 * 1024 * 1024),
],
interceptors=get_auth_interceptors(aio=True),
)
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
loop = asyncio.get_event_loop()
for sig in (signal.SIGINT, signal.SIGTERM):
loop.add_signal_handler(sig, lambda: asyncio.ensure_future(server.stop(5)))
await server.start()
print("Server started. Listening on: " + address, file=sys.stderr)
await server.wait_for_termination()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run the tinygrad gRPC backend.")
parser.add_argument("--addr", default="localhost:50051", help="Bind address")
args = parser.parse_args()
asyncio.run(serve(args.addr))

View File

@@ -1,17 +0,0 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
# tinygrad >= 0.12 requires Python >= 3.11 (pyproject: `requires-python = ">=3.11"`).
# LocalAI's default portable python is 3.10, so we pin to 3.11.x here.
PYTHON_VERSION="3.11"
PYTHON_PATCH="14"
PY_STANDALONE_TAG="20260203"
installRequirements

View File

@@ -1,103 +0,0 @@
#!/bin/bash
# Script to package runtime shared libraries for the tinygrad backend.
#
# The final Dockerfile.python stage is FROM scratch, so system libraries
# must be explicitly copied into ${BACKEND}/lib so the backend can run on
# any host without installing them. libbackend.sh automatically prepends
# that directory to LD_LIBRARY_PATH at run time.
#
# tinygrad's CPU device (CLANG / LLVM renderer) JIT-compiles kernels at
# runtime. The default `CLANG` path invokes the external `clang` binary via
# subprocess, which does not exist in the scratch image. We force the
# in-process LLVM path (`CPU_LLVM=1` in run.sh) which loads libLLVM.so.*
# through ctypes and bundle the library + its runtime dependencies here.
#
# Also bundle libgomp (pulled by librosa / numpy via numba) and libsndfile
# (required by soundfile -> librosa audio I/O for Whisper).
set -e
CURDIR=$(dirname "$(realpath "$0")")
LIB_DIR="${CURDIR}/lib"
mkdir -p "${LIB_DIR}"
SEARCH_DIRS=(
/usr/lib/x86_64-linux-gnu
/usr/lib/aarch64-linux-gnu
/lib/x86_64-linux-gnu
/lib/aarch64-linux-gnu
/usr/lib
/lib
)
copy_with_symlinks() {
local soname="$1"
local hit=""
for dir in "${SEARCH_DIRS[@]}"; do
if [ -e "${dir}/${soname}" ]; then
hit="${dir}/${soname}"
break
fi
done
if [ -z "${hit}" ]; then
echo "warning: ${soname} not found in standard lib paths" >&2
return 0
fi
local real
real=$(readlink -f "${hit}")
cp -v "${real}" "${LIB_DIR}/"
local real_base
real_base=$(basename "${real}")
if [ "${real_base}" != "${soname}" ]; then
ln -sf "${real_base}" "${LIB_DIR}/${soname}"
fi
}
# tinygrad searches for libLLVM under these sonames (see
# tinygrad/runtime/autogen/llvm.py). Ubuntu 24.04's `llvm` metapackage
# installs `libLLVM-18.so.1` into `/usr/lib/llvm-18/lib/`. Also scan the
# standard lib directories in case a different distro layout puts it in
# /usr/lib/x86_64-linux-gnu.
llvm_so=""
shopt -s nullglob
LLVM_EXTRA_DIRS=(/usr/lib/llvm-*/lib /usr/lib/llvm-*)
# First try the versioned symlink (libLLVM-18.so) since that's what
# tinygrad's DLL loader matches against (see llvm.py DLL name list).
for dir in "${SEARCH_DIRS[@]}" "${LLVM_EXTRA_DIRS[@]}"; do
for candidate in "${dir}"/libLLVM-[0-9]*.so "${dir}"/libLLVM-[0-9]*.so.[0-9]*; do
if [ -e "${candidate}" ]; then
llvm_so="${candidate}"
break 2
fi
done
done
# Fallback: any libLLVM.so file under /usr.
if [ -z "${llvm_so}" ]; then
llvm_so=$(find /usr -maxdepth 5 -name 'libLLVM*.so*' 2>/dev/null | head -1)
fi
shopt -u nullglob
if [ -z "${llvm_so}" ]; then
echo "ERROR: libLLVM not found — tinygrad CPU device needs it." >&2
echo "Install the Ubuntu \`llvm\` package in the builder stage." >&2
exit 1
fi
echo "Found libLLVM at: ${llvm_so}"
llvm_base=$(basename "${llvm_so}")
real_llvm=$(readlink -f "${llvm_so}")
cp -v "${real_llvm}" "${LIB_DIR}/"
real_base=$(basename "${real_llvm}")
if [ "${real_base}" != "${llvm_base}" ]; then
ln -sf "${real_base}" "${LIB_DIR}/${llvm_base}"
fi
# libLLVM has soft runtime deps on libedit / libtinfo; pick them up if
# present. They're optional but loading without them can fail.
copy_with_symlinks libedit.so.2
copy_with_symlinks libtinfo.so.6
# Audio I/O for the Whisper path.
copy_with_symlinks libsndfile.so.1
copy_with_symlinks libgomp.so.1
echo "tinygrad packaging completed successfully"
ls -liah "${LIB_DIR}/"

View File

@@ -1,11 +0,0 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runProtogen

View File

@@ -1 +0,0 @@
# tinygrad CPU backend uses CLANG device (no extra deps required).

View File

@@ -1,2 +0,0 @@
# tinygrad drives CUDA through its own JIT (CUDA=1 env var).
# Requires the CUDA 12 runtime from the base image; no extra Python deps.

View File

@@ -1,2 +0,0 @@
# tinygrad drives CUDA through its own JIT (CUDA=1 env var).
# Requires the CUDA 13 runtime from the base image; no extra Python deps.

View File

@@ -1,15 +0,0 @@
grpcio==1.80.0
protobuf==6.33.5
certifi
setuptools
numpy>=2.0.0
tinygrad>=0.12.0
tokenizers>=0.21.0
huggingface_hub
jinja2>=3.1.0
tiktoken
sentencepiece
safetensors
Pillow
librosa
soundfile

View File

@@ -1,55 +0,0 @@
#!/bin/bash
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
# tinygrad binds its compute device at import time from a single env var
# (CUDA / HIP / METAL / CLANG). We pick one here based on what driver
# libraries the host has injected into the container — when a user runs
# the image with `--gpus all` (or the equivalent rocm runtime), the
# nvidia-container-toolkit / rocm runtime mounts the right libraries
# under /usr/lib so we can detect them.
#
# tinygrad's CUDA path uses two compiler pairs: an NVRTC-backed one and
# an in-process PTX renderer. We force the PTX renderer here
# (`CUDA_PTX=1`) so the image is independent of the host CUDA toolkit
# version — only libcuda.so.1 (the driver) is required.
find_lib() {
local soname="$1"
for dir in /usr/lib/x86_64-linux-gnu /usr/lib64 /usr/lib /lib/x86_64-linux-gnu /lib64 /lib; do
if [ -e "${dir}/${soname}" ]; then
echo "${dir}/${soname}"
return 0
fi
done
return 1
}
if [ -z "${CUDA:-}${HIP:-}${METAL:-}${CLANG:-}" ]; then
if find_lib libcuda.so.1 >/dev/null; then
export CUDA=1
export CUDA_PTX=1
elif find_lib libamdhip64.so >/dev/null || find_lib libamdhip64.so.6 >/dev/null; then
export HIP=1
else
export CLANG=1
fi
fi
# The CPU path (CLANG=1) JIT-compiles via libLLVM. Force tinygrad's
# in-process LLVM compiler so we don't need an external `clang` binary
# (which is not present in the scratch image).
export CPU_LLVM=1
if [ -z "${LLVM_PATH:-}" ]; then
for candidate in "${EDIR}"/lib/libLLVM-*.so "${EDIR}"/lib/libLLVM-*.so.* "${EDIR}"/lib/libLLVM.so.*; do
if [ -e "${candidate}" ]; then
export LLVM_PATH="${candidate}"
break
fi
done
fi
startBackend $@

View File

@@ -1,153 +0,0 @@
"""
Unit tests for the tinygrad gRPC backend.
These tests cover the cheap paths that don't need a real model checkpoint:
- Health responds OK
- Tool-call parsers emit expected ToolCall structures
The full LLM / embeddings / Stable Diffusion / Whisper paths are exercised by
the root-level `make test-extra-backend-tinygrad-all` e2e targets, which boot
the containerized backend against real HF checkpoints.
"""
import os
import subprocess
import sys
import time
import unittest
import grpc
import backend_pb2
import backend_pb2_grpc
sys.path.insert(0, os.path.dirname(__file__))
from tool_parsers.hermes import HermesToolParser # noqa: E402
from vendor.appsllm_adapter import _hf_to_appsllm_state_dict # noqa: E402
class TestHealth(unittest.TestCase):
def setUp(self):
self.service = subprocess.Popen(
["python3", "backend.py", "--addr", "localhost:50051"]
)
time.sleep(5)
def tearDown(self):
self.service.kill()
self.service.wait()
def test_health(self):
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.Health(backend_pb2.HealthMessage())
self.assertEqual(response.message, b"OK")
class TestHermesParser(unittest.TestCase):
def test_single_tool_call(self):
parser = HermesToolParser()
text = (
"Sure, let me check.\n"
"<tool_call>\n"
'{"name": "get_weather", "arguments": {"city": "Paris"}}\n'
"</tool_call>\n"
"Done."
)
content, calls = parser.parse(text)
self.assertIn("Sure", content)
self.assertIn("Done", content)
self.assertEqual(len(calls), 1)
self.assertEqual(calls[0].name, "get_weather")
self.assertIn("Paris", calls[0].arguments)
def test_multi_call_and_thinking(self):
parser = HermesToolParser()
text = (
"<think>I need both.</think>"
'<tool_call>{"name":"a","arguments":{"x":1}}</tool_call>'
'<tool_call>{"name":"b","arguments":{}}</tool_call>'
)
result = parser.parse_full(text)
self.assertEqual(result.reasoning, "I need both.")
self.assertEqual([c.name for c in result.tool_calls], ["a", "b"])
self.assertEqual(result.tool_calls[0].index, 0)
self.assertEqual(result.tool_calls[1].index, 1)
def test_no_tool_call_is_passthrough(self):
parser = HermesToolParser()
text = "plain assistant answer with no tool call"
content, calls = parser.parse(text)
self.assertEqual(content, text)
self.assertEqual(calls, [])
class TestAppsLLMAdapter(unittest.TestCase):
"""Smoke tests for the HF → tinygrad.apps.llm state-dict keymap."""
def _fake_hf_weights(self, n_layers: int = 2, include_lm_head: bool = True):
keys = [
"model.embed_tokens.weight",
"model.norm.weight",
]
if include_lm_head:
keys.append("lm_head.weight")
for l in range(n_layers):
keys += [
f"model.layers.{l}.input_layernorm.weight",
f"model.layers.{l}.post_attention_layernorm.weight",
f"model.layers.{l}.self_attn.q_proj.weight",
f"model.layers.{l}.self_attn.k_proj.weight",
f"model.layers.{l}.self_attn.v_proj.weight",
f"model.layers.{l}.self_attn.o_proj.weight",
f"model.layers.{l}.self_attn.q_norm.weight",
f"model.layers.{l}.self_attn.k_norm.weight",
f"model.layers.{l}.mlp.gate_proj.weight",
f"model.layers.{l}.mlp.up_proj.weight",
f"model.layers.{l}.mlp.down_proj.weight",
]
# sentinel objects so we can verify identity-based aliasing
return {k: object() for k in keys}
def test_keymap_renames_every_hf_key(self):
hf = self._fake_hf_weights(n_layers=2)
sd = _hf_to_appsllm_state_dict(hf, 2)
expected = {
"token_embd.weight", "output_norm.weight", "output.weight",
"blk.0.attn_norm.weight", "blk.0.ffn_norm.weight",
"blk.0.attn_q.weight", "blk.0.attn_k.weight", "blk.0.attn_v.weight",
"blk.0.attn_output.weight",
"blk.0.attn_q_norm.weight", "blk.0.attn_k_norm.weight",
"blk.0.ffn_gate.weight", "blk.0.ffn_up.weight", "blk.0.ffn_down.weight",
"blk.1.attn_norm.weight", "blk.1.ffn_norm.weight",
"blk.1.attn_q.weight", "blk.1.attn_k.weight", "blk.1.attn_v.weight",
"blk.1.attn_output.weight",
"blk.1.attn_q_norm.weight", "blk.1.attn_k_norm.weight",
"blk.1.ffn_gate.weight", "blk.1.ffn_up.weight", "blk.1.ffn_down.weight",
}
self.assertEqual(set(sd.keys()), expected)
def test_tied_embedding_fallback_when_lm_head_missing(self):
hf = self._fake_hf_weights(n_layers=1, include_lm_head=False)
sd = _hf_to_appsllm_state_dict(hf, 1)
self.assertIn("output.weight", sd)
self.assertIs(sd["output.weight"], sd["token_embd.weight"])
def test_unknown_keys_are_skipped(self):
hf = self._fake_hf_weights(n_layers=1)
hf["model.layers.0.self_attn.rotary_emb.inv_freq"] = object()
hf["model.some_unknown.weight"] = object()
sd = _hf_to_appsllm_state_dict(hf, 1)
self.assertNotIn("model.some_unknown.weight", sd)
# Renamed keys still present
self.assertIn("blk.0.attn_q.weight", sd)
def test_qkv_bias_models_rejected(self):
hf = self._fake_hf_weights(n_layers=1)
hf["model.layers.0.self_attn.q_proj.bias"] = object()
with self.assertRaises(ValueError) as ctx:
_hf_to_appsllm_state_dict(hf, 1)
self.assertIn("Qwen3", str(ctx.exception))
if __name__ == "__main__":
unittest.main()

View File

@@ -1,11 +0,0 @@
#!/bin/bash
set -e
backend_dir=$(dirname $0)
if [ -d $backend_dir/common ]; then
source $backend_dir/common/libbackend.sh
else
source $backend_dir/../common/libbackend.sh
fi
runUnittests

View File

@@ -1,11 +0,0 @@
"""Tool-call parsers for the tinygrad backend.
Each parser takes raw model output and extracts OpenAI-style tool calls so
the backend can populate `ChatDelta.tool_calls[]` natively (matching vLLM's
behavior, which the Go core prefers over regex fallback parsing).
"""
from __future__ import annotations
from .base import ToolCall, ToolParser, resolve_parser
__all__ = ["ToolCall", "ToolParser", "resolve_parser"]

View File

@@ -1,85 +0,0 @@
"""Common types + parser registry for tool-call extraction."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class ToolCall:
"""One extracted tool call — maps 1:1 to backend_pb2.ToolCallDelta."""
index: int
name: str
arguments: str # JSON string
id: str = ""
class ToolParser:
"""Parser interface.
Subclasses implement `parse` (full non-streaming pass) and optionally
`parse_stream` (incremental). The default `parse_stream` buffers until a
full response is available and then delegates to `parse`.
"""
name: str = "base"
def __init__(self) -> None:
self._stream_buffer = ""
self._stream_index = 0
def parse(self, text: str) -> tuple[str, list[ToolCall]]:
"""Return (content_for_user, tool_calls)."""
raise NotImplementedError
def parse_stream(self, delta: str, finished: bool = False) -> tuple[str, list[ToolCall]]:
"""Accumulate a streaming delta. Emits any tool calls that have closed.
Default behavior: buffer until `finished=True`, then parse once.
Subclasses can override to emit mid-stream.
"""
self._stream_buffer += delta
if not finished:
return "", []
content, calls = self.parse(self._stream_buffer)
# Re-index starting from whatever we've already emitted in this stream.
reindexed: list[ToolCall] = []
for i, c in enumerate(calls):
reindexed.append(ToolCall(
index=self._stream_index + i,
name=c.name,
arguments=c.arguments,
id=c.id,
))
self._stream_index += len(reindexed)
return content, reindexed
def reset(self) -> None:
self._stream_buffer = ""
self._stream_index = 0
_REGISTRY: dict[str, type[ToolParser]] = {}
def register(cls: type[ToolParser]) -> type[ToolParser]:
_REGISTRY[cls.name] = cls
return cls
def resolve_parser(name: Optional[str]) -> ToolParser:
"""Return a parser instance by name, falling back to a no-op passthrough."""
# Import for side effects — each module registers itself.
from . import hermes, llama3_json, mistral, qwen3_xml # noqa: F401
if name and name in _REGISTRY:
return _REGISTRY[name]()
return PassthroughToolParser()
class PassthroughToolParser(ToolParser):
"""No-op parser — used when no tool_parser is configured."""
name = "passthrough"
def parse(self, text: str) -> tuple[str, list[ToolCall]]:
return text, []

View File

@@ -1,74 +0,0 @@
"""Hermes-format tool-call parser.
Hermes 2 / 2.5 / 3 (and Qwen 2.5 Instruct, which adopted the same convention)
emit tool calls wrapped in `<tool_call>...</tool_call>` tags, where the inner
content is a JSON object with `name` and `arguments` keys:
<tool_call>
{"name": "get_weather", "arguments": {"city": "Paris"}}
</tool_call>
Multiple tool calls may appear back-to-back. Text outside the tags is plain
assistant content that should surface to the user.
This parser also strips `<think>...</think>` reasoning blocks and returns them
via the reasoning_content channel (Qwen 3, DeepSeek-R1 distills).
"""
from __future__ import annotations
import json
import re
from dataclasses import dataclass
from .base import ToolCall, ToolParser, register
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
@dataclass
class HermesParseResult:
content: str
reasoning: str
tool_calls: list[ToolCall]
@register
class HermesToolParser(ToolParser):
name = "hermes"
def _parse_full(self, text: str) -> HermesParseResult:
reasoning_parts: list[str] = []
def _capture_reasoning(match: re.Match[str]) -> str:
reasoning_parts.append(match.group(1).strip())
return ""
text_wo_think = _THINK_RE.sub(_capture_reasoning, text)
calls: list[ToolCall] = []
for idx, match in enumerate(_TOOL_CALL_RE.finditer(text_wo_think)):
raw = match.group(1)
try:
obj = json.loads(raw)
except json.JSONDecodeError:
continue
if not isinstance(obj, dict):
continue
name = obj.get("name")
if not isinstance(name, str):
continue
args = obj.get("arguments", {})
args_str = args if isinstance(args, str) else json.dumps(args, ensure_ascii=False)
calls.append(ToolCall(index=idx, name=name, arguments=args_str))
content = _TOOL_CALL_RE.sub("", text_wo_think).strip()
reasoning = "\n\n".join(reasoning_parts).strip()
return HermesParseResult(content=content, reasoning=reasoning, tool_calls=calls)
def parse(self, text: str) -> tuple[str, list[ToolCall]]:
result = self._parse_full(text)
return result.content, result.tool_calls
def parse_full(self, text: str) -> HermesParseResult:
return self._parse_full(text)

View File

@@ -1,86 +0,0 @@
"""Llama 3.1 / 3.2 / 3.3 JSON tool-call parser.
Meta's Llama 3.1+ instruct chat templates emit tool calls in two broadly
compatible shapes:
1. With the `<|python_tag|>` lead-in:
<|python_tag|>{"name": "get_weather", "parameters": {"city": "Paris"}}
2. As a bare JSON object (or list of objects) at the end of the turn.
We also handle multi-call shapes where the model emits several JSON objects
separated by `;` or newlines, and JSON arrays `[{...}, {...}]`. The key field
for Llama 3 is historically `parameters` (older docs) but recent checkpoints
also emit `arguments` — accept either.
"""
from __future__ import annotations
import json
import re
from dataclasses import dataclass
from .base import ToolCall, ToolParser, register
_PYTHON_TAG = "<|python_tag|>"
_JSON_OBJECT_RE = re.compile(r"\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}", re.DOTALL)
def _coerce_call(obj: object, index: int) -> ToolCall | None:
if not isinstance(obj, dict):
return None
name = obj.get("name")
if not isinstance(name, str):
return None
args = obj.get("arguments", obj.get("parameters", {}))
args_str = args if isinstance(args, str) else json.dumps(args, ensure_ascii=False)
return ToolCall(index=index, name=name, arguments=args_str)
@register
class Llama3JsonToolParser(ToolParser):
name = "llama3_json"
def parse(self, text: str) -> tuple[str, list[ToolCall]]:
calls: list[ToolCall] = []
# Strip <|python_tag|> segments first — each segment is one tool call
# body. The content after the final python_tag (if any) is the call.
remaining = text
if _PYTHON_TAG in text:
head, *tails = text.split(_PYTHON_TAG)
remaining = head
for tail in tails:
parsed = _try_parse(tail.strip(), len(calls))
calls.extend(parsed)
# Any JSON objects / arrays left in `remaining` count as tool calls too
# if they parse to a {"name": ..., "arguments": ...} shape.
for match in _JSON_OBJECT_RE.finditer(remaining):
parsed = _try_parse(match.group(0), len(calls))
if parsed:
calls.extend(parsed)
remaining = remaining.replace(match.group(0), "", 1)
content = remaining.strip()
return content, calls
def _try_parse(blob: str, start_index: int) -> list[ToolCall]:
"""Parse a fragment that may be a JSON object or a JSON array of objects."""
blob = blob.strip().rstrip(";")
if not blob:
return []
try:
obj = json.loads(blob)
except json.JSONDecodeError:
return []
if isinstance(obj, dict):
call = _coerce_call(obj, start_index)
return [call] if call else []
if isinstance(obj, list):
calls: list[ToolCall] = []
for i, item in enumerate(obj):
c = _coerce_call(item, start_index + i)
if c:
calls.append(c)
return calls
return []

View File

@@ -1,56 +0,0 @@
"""Mistral / Mixtral tool-call parser.
Mistral Nemo / Small / Large Instruct emit tool calls prefixed with the
`[TOOL_CALLS]` control token, followed by a JSON array:
[TOOL_CALLS][{"name": "get_weather", "arguments": {"city": "Paris"}}]
Multiple calls live inside the same array. Any text before `[TOOL_CALLS]` is
normal assistant content and should surface to the user.
"""
from __future__ import annotations
import json
import re
from .base import ToolCall, ToolParser, register
_MARKER = "[TOOL_CALLS]"
_JSON_ARRAY_RE = re.compile(r"\[\s*(?:\{.*?\}\s*,?\s*)+\]", re.DOTALL)
@register
class MistralToolParser(ToolParser):
name = "mistral"
def parse(self, text: str) -> tuple[str, list[ToolCall]]:
if _MARKER not in text:
return text.strip(), []
head, tail = text.split(_MARKER, 1)
content = head.strip()
match = _JSON_ARRAY_RE.search(tail)
if not match:
return content, []
try:
arr = json.loads(match.group(0))
except json.JSONDecodeError:
return content, []
if not isinstance(arr, list):
return content, []
calls: list[ToolCall] = []
for i, obj in enumerate(arr):
if not isinstance(obj, dict):
continue
name = obj.get("name")
if not isinstance(name, str):
continue
args = obj.get("arguments", {})
args_str = args if isinstance(args, str) else json.dumps(args, ensure_ascii=False)
calls.append(ToolCall(index=i, name=name, arguments=args_str))
return content, calls

View File

@@ -1,74 +0,0 @@
"""Qwen 3 XML tool-call parser.
Qwen 3 Instruct emits tool calls wrapped in a two-level tag structure:
<tool_call>
<function=get_weather>
<parameter=city>
Paris
</parameter>
<parameter=unit>
celsius
</parameter>
</function>
</tool_call>
Parameter values are raw text — we treat them as strings unless they look
like JSON (in which case we try to parse so numbers / booleans round-trip
cleanly). Qwen 3 also supports `<think>...</think>` reasoning blocks before
the tool call — these are captured via the shared Hermes convention.
"""
from __future__ import annotations
import json
import re
from .base import ToolCall, ToolParser, register
_TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
_FUNCTION_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
_PARAMETER_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
def _maybe_json(value: str):
value = value.strip()
if not value:
return value
if value[0] in "{[\"" or value in ("true", "false", "null") or value.lstrip("-").replace(".", "", 1).isdigit():
try:
return json.loads(value)
except json.JSONDecodeError:
return value
return value
@register
class Qwen3XmlToolParser(ToolParser):
name = "qwen3_xml"
def parse(self, text: str) -> tuple[str, list[ToolCall]]:
# Strip reasoning blocks from the user-visible content.
stripped = _THINK_RE.sub("", text)
calls: list[ToolCall] = []
for match in _TOOL_CALL_RE.finditer(stripped):
body = match.group(1)
fn_match = _FUNCTION_RE.search(body)
if not fn_match:
continue
name = fn_match.group(1).strip()
params_body = fn_match.group(2)
params: dict[str, object] = {}
for pm in _PARAMETER_RE.finditer(params_body):
params[pm.group(1).strip()] = _maybe_json(pm.group(2))
calls.append(ToolCall(
index=len(calls),
name=name,
arguments=json.dumps(params, ensure_ascii=False),
))
content = _TOOL_CALL_RE.sub("", stripped).strip()
return content, calls

View File

@@ -1,6 +0,0 @@
"""Vendored upstream tinygrad reference code (MIT-licensed).
Source: https://github.com/tinygrad/tinygrad
These files are not part of the `tinygrad` pip package (the `extra/` tree is
excluded from `pyproject.toml` `packages`), so we carry a pinned copy here.
"""

View File

@@ -1,102 +0,0 @@
"""Glue code between LocalAI's HF-shaped model assets and tinygrad.apps.llm.
apps.llm's `Transformer` uses GGUF-native weight names and consumes a
`TransformerConfig` dataclass. LocalAI resolves models from HuggingFace
snapshots (HF safetensors + config.json) so we translate both sides here.
This module does NOT subclass anything from apps.llm. With the Qwen3+
scope the backend targets, we can use `apps.llm.Transformer` unchanged
(no qkv_bias, no RoPE permute). Everything below is a thin adapter.
"""
from __future__ import annotations
from typing import Any
def _hf_to_appsllm_state_dict(hf_weights: dict[str, Any], n_layers: int) -> dict[str, Any]:
"""Rename a HuggingFace-style state dict to the GGUF-native keys that
`tinygrad.apps.llm.Transformer` expects.
HF and apps.llm both store RoPE weights in half-split layout, so no
permute is required — only a direct key rename and a tied-embedding
fallback for models like Llama 3.2 that drop `lm_head.weight`.
"""
keymap: dict[str, str] = {
"model.embed_tokens.weight": "token_embd.weight",
"model.norm.weight": "output_norm.weight",
"lm_head.weight": "output.weight",
}
for layer in range(n_layers):
keymap[f"model.layers.{layer}.input_layernorm.weight"] = f"blk.{layer}.attn_norm.weight"
keymap[f"model.layers.{layer}.post_attention_layernorm.weight"] = f"blk.{layer}.ffn_norm.weight"
for hf_proj, gguf_proj in (("q", "q"), ("k", "k"), ("v", "v"), ("o", "output")):
keymap[f"model.layers.{layer}.self_attn.{hf_proj}_proj.weight"] = f"blk.{layer}.attn_{gguf_proj}.weight"
keymap[f"model.layers.{layer}.self_attn.q_norm.weight"] = f"blk.{layer}.attn_q_norm.weight"
keymap[f"model.layers.{layer}.self_attn.k_norm.weight"] = f"blk.{layer}.attn_k_norm.weight"
for hf_name, gguf_name in (("gate", "gate"), ("up", "up"), ("down", "down")):
keymap[f"model.layers.{layer}.mlp.{hf_name}_proj.weight"] = f"blk.{layer}.ffn_{gguf_name}.weight"
# Fail loudly if the model carries Q/K/V projection bias (Qwen2 / 2.5).
# apps.llm's `TransformerBlock` hardcodes `bias=False`, so these weights
# would be silently dropped by `load_state_dict(strict=False)` and the
# model would produce garbage. Supported families (Qwen3, Qwen3.5,
# Llama 3.x, GLM-4, Mistral) have no qkv bias.
bias_keys = [k for k in hf_weights
if k.startswith("model.layers.") and
any(k.endswith(f".self_attn.{p}_proj.bias") for p in ("q", "k", "v"))]
if bias_keys:
raise ValueError(
"tinygrad backend: model has Q/K/V projection bias ("
f"{bias_keys[0]} etc). Supported families are Qwen3, Qwen3.5, "
"Llama 3.x, GLM-4, Mistral. For Qwen2 / 2.5 please use a "
"newer model or the vLLM / llama.cpp backends."
)
sd = {dst: hf_weights[src] for src, dst in keymap.items() if src in hf_weights}
if "output.weight" not in sd and "token_embd.weight" in sd:
sd["output.weight"] = sd["token_embd.weight"]
return sd
def _hf_to_transformer_kwargs(hf_config: dict, state_dict: dict[str, Any], max_context: int) -> dict:
"""Build the kwargs dict for `tinygrad.apps.llm.Transformer(**kwargs)`.
Supports dense Qwen3 / Qwen3.5 / Llama 3.x / GLM-4 / Mistral-shaped
models. The tinygrad 0.12.0 `Transformer` takes keyword-only args (no
`TransformerConfig` dataclass) — so we return a plain dict.
"""
n_heads = hf_config["num_attention_heads"]
head_dim = hf_config.get("head_dim") or (hf_config["hidden_size"] // n_heads)
# Detect qk_norm presence from the GGUF-shaped state dict (matches
# apps.llm's own heuristic in `from_gguf`).
qk_norm = 0
qn = state_dict.get("blk.0.attn_q_norm.weight")
if qn is not None:
qk_norm = int(qn.shape[0])
max_pos = hf_config.get("max_position_embeddings", 4096)
return dict(
num_blocks=hf_config["num_hidden_layers"],
dim=hf_config["hidden_size"],
hidden_dim=hf_config["intermediate_size"],
n_heads=n_heads,
n_kv_heads=hf_config.get("num_key_value_heads", n_heads),
norm_eps=hf_config.get("rms_norm_eps", 1e-5),
vocab_size=hf_config["vocab_size"],
head_dim=head_dim,
rope_theta=float(hf_config.get("rope_theta", 10000.0)),
max_context=min(max_pos, max_context),
qk_norm=qk_norm,
)
def _embed_hidden(model, tokens):
"""Return mean-poolable hidden states by running the block stack
without going through the LM head + Gumbel-max sampler baked into
`Transformer.forward`."""
x = model.token_embd(tokens).float()
for blk in model.blk:
x = blk(x, 0)
return model.output_norm(x)

View File

@@ -1,83 +0,0 @@
# Vendored verbatim from tinygrad examples/audio_helpers.py (MIT license).
# Upstream: https://github.com/tinygrad/tinygrad/blob/master/examples/audio_helpers.py
# Copyright (c) 2023- the tinygrad authors
# SPDX-License-Identifier: MIT
from typing import Optional
from tinygrad import Tensor
from tinygrad.dtype import DTypeLike, dtypes
import math
# rewritten from numpy
def rfftfreq(n: int, d: float = 1.0, device=None) -> Tensor:
val = 1.0 / (n * d)
N = n // 2 + 1
results = Tensor.arange(N, device=device)
return results * val
# just like in librosa
def fft_frequencies(sr: float, n_fft: int) -> Tensor:
return rfftfreq(n=n_fft, d=1.0 / sr)
def hz_to_mel(freq: Tensor) -> Tensor:
# linear part
f_min = 0.0
f_sp = 200.0 / 3
mels = (freq - f_min) / f_sp
# log-scale part
min_log_hz = 1000.0 # beginning of log region (Hz)
mask = freq >= min_log_hz
return mask.where(((min_log_hz - f_min) / f_sp) + (freq / min_log_hz).log() / (math.log(6.4) / 27.0), mels)
def mel_to_hz(mels: Tensor) -> Tensor:
# linear scale
f_min = 0.0
f_sp = 200.0 / 3
freqs = f_min + f_sp * mels
# nonlinear scale
min_log_hz = 1000.0 # beginning of log region (Hz)
min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels)
logstep = math.log(6.4) / 27.0 # step size for log region
log_t = mels >= min_log_mel
freqs = log_t.where(min_log_hz * ((logstep * (mels - min_log_mel)).exp()), freqs)
return freqs
def mel_frequencies(n_mels: int = 128, *, fmin: float = 0.0, fmax: float = 11025.0) -> Tensor:
# center freqs of mel bands - uniformly spaced between limits
min_max_mel = hz_to_mel(Tensor([fmin, fmax]))
mels = Tensor.linspace(min_max_mel[0], min_max_mel[1], n_mels)
hz = mel_to_hz(mels)
return hz
def mel(
*,
sr: float,
n_fft: int,
n_mels: int = 128,
fmin: float = 0.0,
fmax: Optional[float] = None,
dtype: DTypeLike = dtypes.default_float,
) -> Tensor:
if fmax is None:
fmax = float(sr) / 2
n_mels = int(n_mels)
fftfreqs = fft_frequencies(sr=sr, n_fft=n_fft) # center freqs of each FFT bin
mel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax) # center freqs of mel bands
fdiff = mel_f[1:] - mel_f[:-1]
ramps = mel_f[None].T.expand(-1, fftfreqs.shape[-1]) - fftfreqs
lower = -ramps[:n_mels] / fdiff[:n_mels][None].T
upper = ramps[2 : n_mels + 2] / fdiff[1 : n_mels + 1][None].T
weights = lower.minimum(upper).maximum(0)
# Slaney-style mel is scaled to be approx constant energy per channel
enorm = 2.0 / (mel_f[2 : n_mels + 2] - mel_f[:n_mels])
weights *= enorm[:, None]
return weights

View File

@@ -1,484 +0,0 @@
# Vendored verbatim from tinygrad extra/models/clip.py (MIT license).
# Upstream: https://github.com/tinygrad/tinygrad/blob/master/extra/models/clip.py
# Copyright (c) 2023- the tinygrad authors
# SPDX-License-Identifier: MIT
from tinygrad import Tensor, dtypes
from tinygrad.helpers import fetch
from tinygrad.nn import Linear, LayerNorm, Embedding, Conv2d
from typing import List, Optional, Union, Tuple, Dict
from abc import ABC, abstractmethod
from functools import lru_cache
import numpy as np
import re, gzip
# Allow for monkeypatching for mlperf.
gelu = Tensor.gelu
@lru_cache()
def default_bpe():
# Clip tokenizer, taken from https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py (MIT license)
return fetch("https://github.com/openai/CLIP/raw/main/clip/bpe_simple_vocab_16e6.txt.gz", "bpe_simple_vocab_16e6.txt.gz")
class Tokenizer:
"""
Namespace for CLIP Text Tokenizer components.
"""
@staticmethod
def get_pairs(word):
"""
Return set of symbol pairs in a word.
Word is represented as tuple of symbols (symbols being variable-length strings).
"""
return set(zip(word, word[1:]))
@staticmethod
def whitespace_clean(text):
text = re.sub(r'\s+', ' ', text)
text = text.strip()
return text
@staticmethod
def bytes_to_unicode():
"""
Returns list of utf-8 byte and a corresponding list of unicode strings.
The reversible bpe codes work on unicode strings.
This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
This is a significant percentage of your normal, say, 32K bpe vocab.
To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
And avoids mapping to whitespace/control characters the bpe code barfs on.
"""
bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
cs = bs[:]
n = 0
for b in range(2**8):
if b not in bs:
bs.append(b)
cs.append(2**8+n)
n += 1
cs = [chr(n) for n in cs]
return dict(zip(bs, cs))
class ClipTokenizer:
def __init__(self, version=None):
self.byte_encoder, self.version = Tokenizer.bytes_to_unicode(), version
merges = gzip.open(default_bpe()).read().decode("utf-8").split('\n')
merges = merges[1:49152-256-2+1]
merges = [tuple(merge.split()) for merge in merges]
vocab = list(Tokenizer.bytes_to_unicode().values())
vocab = vocab + [v+'</w>' for v in vocab]
for merge in merges:
vocab.append(''.join(merge))
if self.version == "sd_mlperf_v5_0":
import regex
vocab.extend(['<start_of_text>', '<end_of_text>'])
self.cache = {'<start_of_text>': '<start_of_text>', '<end_of_text>': '<end_of_text>'}
self.pat = regex.compile(r"""<start_of_text>|<end_of_text>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""", regex.IGNORECASE)
else:
vocab.extend(['<|startoftext|>', '<|endoftext|>'])
self.cache = {'<|startoftext|>': '<|startoftext|>', '<|endoftext|>': '<|endoftext|>'}
self.pat = re.compile(r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[^\s]+""", re.IGNORECASE)
self.encoder = dict(zip(vocab, range(len(vocab))))
self.bpe_ranks = dict(zip(merges, range(len(merges))))
def bpe(self, token):
if token in self.cache:
return self.cache[token]
word = tuple(token[:-1]) + ( token[-1] + '</w>',)
pairs = Tokenizer.get_pairs(word)
if not pairs:
return token+'</w>'
while True:
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
if bigram not in self.bpe_ranks:
break
first, second = bigram
new_word = []
i = 0
while i < len(word):
try:
j = word.index(first, i)
new_word.extend(word[i:j])
i = j
except Exception:
new_word.extend(word[i:])
break
if word[i] == first and i < len(word)-1 and word[i+1] == second:
new_word.append(first+second)
i += 2
else:
new_word.append(word[i])
i += 1
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
break
pairs = Tokenizer.get_pairs(word)
word = ' '.join(word)
self.cache[token] = word
return word
def encode(self, text:str, pad_with_zeros:bool=False) -> List[int]:
bpe_tokens: List[int] = []
if self.version == "sd_mlperf_v5_0":
import regex, ftfy, html
text = ftfy.fix_text(text)
text = html.unescape(html.unescape(text)).strip()
text = Tokenizer.whitespace_clean(text).lower()
re_module = regex
else:
text = Tokenizer.whitespace_clean(text.strip()).lower()
re_module = re
for token in re_module.findall(self.pat, text):
token = ''.join(self.byte_encoder[b] for b in token.encode('utf-8'))
bpe_tokens.extend(self.encoder[bpe_token] for bpe_token in self.bpe(token).split(' '))
# Truncation, keeping two slots for start and end tokens.
if len(bpe_tokens) > 75:
bpe_tokens = bpe_tokens[:75]
return [49406] + bpe_tokens + [49407] + ([0] if pad_with_zeros else [49407]) * (77 - len(bpe_tokens) - 2)
class Embedder(ABC):
input_key: str
@abstractmethod
def __call__(self, x:Union[str,List[str],Tensor]) -> Union[Tensor,Tuple[Tensor,...]]:
pass
class Closed:
"""
Namespace for OpenAI CLIP model components.
"""
class ClipMlp:
def __init__(self):
self.fc1 = Linear(768, 3072)
self.fc2 = Linear(3072, 768)
def __call__(self, h:Tensor) -> Tensor:
h = self.fc1(h)
h = h.quick_gelu()
h = self.fc2(h)
return h
class ClipAttention:
def __init__(self):
self.embed_dim = 768
self.num_heads = 12
self.head_dim = self.embed_dim // self.num_heads
self.k_proj = Linear(self.embed_dim, self.embed_dim)
self.v_proj = Linear(self.embed_dim, self.embed_dim)
self.q_proj = Linear(self.embed_dim, self.embed_dim)
self.out_proj = Linear(self.embed_dim, self.embed_dim)
def __call__(self, hidden_states:Tensor, causal_attention_mask:Tensor) -> Tensor:
bsz, tgt_len, embed_dim = hidden_states.shape
q,k,v = self.q_proj(hidden_states), self.k_proj(hidden_states), self.v_proj(hidden_states)
q,k,v = [x.reshape(bsz, tgt_len, self.num_heads, self.head_dim).transpose(1, 2) for x in (q,k,v)]
attn_output = Tensor.scaled_dot_product_attention(q, k, v, attn_mask=causal_attention_mask)
return self.out_proj(attn_output.transpose(1, 2).reshape(bsz, tgt_len, embed_dim))
class ClipEncoderLayer:
def __init__(self):
self.self_attn = Closed.ClipAttention()
self.layer_norm1 = LayerNorm(768)
self.mlp = Closed.ClipMlp()
self.layer_norm2 = LayerNorm(768)
def __call__(self, hidden_states:Tensor, causal_attention_mask:Tensor) -> Tensor:
residual = hidden_states
hidden_states = self.layer_norm1(hidden_states)
hidden_states = self.self_attn(hidden_states, causal_attention_mask)
hidden_states = residual + hidden_states
residual = hidden_states
hidden_states = self.layer_norm2(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states
class ClipTextEmbeddings:
def __init__(self):
self.token_embedding = Embedding(49408, 768)
self.position_embedding = Embedding(77, 768)
def __call__(self, input_ids:Tensor, position_ids:Tensor) -> Tensor:
return self.token_embedding(input_ids) + self.position_embedding(position_ids)
class ClipEncoder:
def __init__(self, layer_count:int=12):
self.layers = [Closed.ClipEncoderLayer() for _ in range(layer_count)]
def __call__(self, x:Tensor, causal_attention_mask:Tensor, ret_layer_idx:Optional[int]=None) -> Tensor:
# the indexing of layers is NOT off by 1, the original code considers the "input" as the first hidden state
layers = self.layers if ret_layer_idx is None else self.layers[:ret_layer_idx]
for l in layers:
x = l(x, causal_attention_mask)
return x
class ClipTextTransformer:
def __init__(self, ret_layer_idx:Optional[int]=None):
self.embeddings = Closed.ClipTextEmbeddings()
self.encoder = Closed.ClipEncoder()
self.final_layer_norm = LayerNorm(768)
self.ret_layer_idx = ret_layer_idx
def __call__(self, input_ids:Tensor) -> Tensor:
x = self.embeddings(input_ids, Tensor.arange(input_ids.shape[1]).reshape(1, -1))
x = self.encoder(x, Tensor.full((1, 1, 77, 77), float("-inf")).triu(1), self.ret_layer_idx)
return self.final_layer_norm(x) if (self.ret_layer_idx is None) else x
class ClipTextModel:
def __init__(self, ret_layer_idx:Optional[int]):
self.text_model = Closed.ClipTextTransformer(ret_layer_idx=ret_layer_idx)
# https://github.com/Stability-AI/generative-models/blob/fbdc58cab9f4ee2be7a5e1f2e2787ecd9311942f/sgm/modules/encoders/modules.py#L331
class FrozenClosedClipEmbedder(Embedder):
def __init__(self, ret_layer_idx:Optional[int]=None):
self.tokenizer = Tokenizer.ClipTokenizer()
self.transformer = Closed.ClipTextModel(ret_layer_idx)
self.input_key = "txt"
def __call__(self, texts:Union[str,List[str],Tensor]) -> Union[Tensor,Tuple[Tensor,...]]:
if isinstance(texts, str): texts = [texts]
assert isinstance(texts, (list,tuple)), f"expected list of strings, got {type(texts).__name__}"
tokens = Tensor.cat(*[Tensor(self.tokenizer.encode(text)) for text in texts], dim=0)
return self.transformer.text_model(tokens.reshape(len(texts),-1))
class Open:
"""
Namespace for OpenCLIP model components.
"""
class MultiheadAttention:
def __init__(self, dims:int, n_heads:int):
self.dims = dims
self.n_heads = n_heads
self.d_head = self.dims // self.n_heads
self.in_proj_bias = Tensor.empty(3*dims)
self.in_proj_weight = Tensor.empty(3*dims, dims)
self.out_proj = Linear(dims, dims)
def __call__(self, x:Tensor, attn_mask:Optional[Tensor]=None) -> Tensor:
T,B,C = x.shape
proj = x.linear(self.in_proj_weight.T, self.in_proj_bias)
proj = proj.unflatten(-1, (3,C)).unsqueeze(0).transpose(0, -2)
q,k,v = [y.reshape(T, B*self.n_heads, self.d_head).transpose(0, 1).reshape(B, self.n_heads, T, self.d_head) for y in proj.chunk(3)]
attn_output = Tensor.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
attn_output = attn_output.permute(2, 0, 1, 3).reshape(T, B, C)
attn_output = self.out_proj(attn_output)
return attn_output
class Mlp:
def __init__(self, dims, hidden_dims):
self.c_fc = Linear(dims, hidden_dims)
self.c_proj = Linear(hidden_dims, dims)
self.gelu = gelu
def __call__(self, x:Tensor) -> Tensor:
return x.sequential([self.c_fc, self.gelu, self.c_proj])
# https://github.com/mlfoundations/open_clip/blob/58e4e39aaabc6040839b0d2a7e8bf20979e4558a/src/open_clip/transformer.py#L210
class ResidualAttentionBlock:
def __init__(self, dims:int, n_heads:int, mlp_ratio:float):
self.ln_1 = LayerNorm(dims)
self.attn = Open.MultiheadAttention(dims, n_heads)
self.ln_2 = LayerNorm(dims)
self.mlp = Open.Mlp(dims, int(dims * mlp_ratio))
def __call__(self, x:Tensor, attn_mask:Optional[Tensor]=None, transpose:bool=False) -> Tensor:
q_x = self.ln_1(x)
attn_out = self.attn(q_x.transpose(0, 1) if transpose else q_x, attn_mask=attn_mask)
attn_out = attn_out.transpose(0, 1) if transpose else attn_out
x = x + attn_out
x = x + self.mlp(self.ln_2(x))
return x
# https://github.com/mlfoundations/open_clip/blob/58e4e39aaabc6040839b0d2a7e8bf20979e4558a/src/open_clip/transformer.py#L317
class ClipTransformer:
def __init__(self, dims:int, layers:int, n_heads:int, mlp_ratio:float=4.0):
self.resblocks = [
Open.ResidualAttentionBlock(dims, n_heads, mlp_ratio) for _ in range(layers)
]
def __call__(self, x:Tensor, attn_mask:Optional[Tensor]=None) -> Tensor:
for r in self.resblocks:
x = r(x, attn_mask=attn_mask, transpose=True)
return x
# https://github.com/mlfoundations/open_clip/blob/58e4e39aaabc6040839b0d2a7e8bf20979e4558a/src/open_clip/model.py#L220
# https://github.com/mlfoundations/open_clip/blob/58e4e39aaabc6040839b0d2a7e8bf20979e4558a/src/open_clip/transformer.py#L661
class ClipTextTransformer:
def __init__(self, width:int, n_heads:int, layers:int, vocab_size:int=49408, ctx_length:int=77):
self.token_embedding = Embedding(vocab_size, width)
self.positional_embedding = Tensor.empty(ctx_length, width)
self.transformer = Open.ClipTransformer(width, layers, n_heads)
self.ln_final = LayerNorm(width)
self.text_projection = Tensor.empty(width, width)
self.attn_mask = Tensor.full((77, 77), float("-inf")).triu(1).realize()
def __call__(self, text:Tensor) -> Tensor:
seq_len = text.shape[1]
x = self.token_embedding(text)
x = x + self.positional_embedding[:seq_len]
x = self.transformer(x, attn_mask=self.attn_mask)
x = self.ln_final(x)
pooled = x[:, text.argmax(dim=-1)] @ self.text_projection
return pooled
class ClipVisionTransformer:
def __init__(self, width:int, layers:int, d_head:int, image_size:int, patch_size:int):
grid_size = image_size // patch_size
n_heads = width // d_head
assert n_heads * d_head == width
self.conv1 = Conv2d(3, width, kernel_size=patch_size, stride=patch_size, bias=False)
self.class_embedding = Tensor.empty(width)
self.positional_embedding = Tensor.empty(grid_size * grid_size + 1, width)
self.transformer = Open.ClipTransformer(width, layers, n_heads)
self.ln_pre = LayerNorm(width)
self.ln_post = LayerNorm(width)
self.proj = Tensor.empty(width, 1024)
def __call__(self, x:Tensor) -> Tensor:
x = self.conv1(x)
x = x.reshape(x.shape[0], x.shape[1], -1).permute(0, 2, 1)
x = self.class_embedding.reshape(1, 1, -1).expand(x.shape[0], 1, -1).cat(x, dim=1)
x = x + self.positional_embedding
x = self.ln_pre(x)
x = self.transformer(x)
x = self.ln_post(x)
pooled = x[:, 0] @ self.proj
return pooled
# https://github.com/Stability-AI/generative-models/blob/fbdc58cab9f4ee2be7a5e1f2e2787ecd9311942f/sgm/modules/encoders/modules.py#L396
# https://github.com/Stability-AI/generative-models/blob/fbdc58cab9f4ee2be7a5e1f2e2787ecd9311942f/sgm/modules/encoders/modules.py#L498
class FrozenOpenClipEmbedder(Embedder):
def __init__(self, dims:int, n_heads:int, layers:int, return_pooled:bool, ln_penultimate:bool=False, clip_tokenizer_version=None):
self.tokenizer = Tokenizer.ClipTokenizer(version=clip_tokenizer_version)
self.model = Open.ClipTextTransformer(dims, n_heads, layers)
self.return_pooled = return_pooled
self.input_key = "txt"
self.ln_penultimate = ln_penultimate
def tokenize(self, text:str, device:Optional[str]=None) -> Tensor:
return Tensor(self.tokenizer.encode(text, pad_with_zeros=True), dtype=dtypes.int32, device=device).reshape(1,-1)
def text_transformer_forward(self, x:Tensor, attn_mask:Optional[Tensor]=None):
for r in self.model.transformer.resblocks:
x, penultimate = r(x, attn_mask=attn_mask), x
return x.permute(1, 0, 2), penultimate.permute(1, 0, 2)
def embed_tokens(self, tokens:Tensor) -> Union[Tensor,Tuple[Tensor,...]]:
x = self.model.token_embedding(tokens).add(self.model.positional_embedding).permute(1,0,2)
x, penultimate = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
if self.ln_penultimate:
penultimate = self.model.ln_final(penultimate)
if self.return_pooled:
x = self.model.ln_final(x)
index = tokens.argmax(axis=-1).reshape(-1,1,1).expand(x.shape[0],1,x.shape[-1])
pooled = x.gather(1, index).squeeze(1) @ self.model.text_projection
return penultimate, pooled
else:
return penultimate
def __call__(self, texts:Union[str,List[str],Tensor]) -> Union[Tensor,Tuple[Tensor,...]]:
if isinstance(texts, str): texts = [texts]
assert isinstance(texts, (list,tuple)), f"expected list of strings, got {type(texts).__name__}"
tokens = Tensor.cat(*[self.tokenize(text) for text in texts], dim=0)
return self.embed_tokens(tokens)
clip_configs: Dict = {
"ViT-H-14": {
"dims": 1024,
"vision_cfg": {
"width": 1280,
"layers": 32,
"d_head": 80,
"image_size": 224,
"patch_size": 14,
},
"text_cfg": {
"width": 1024,
"n_heads": 16,
"layers": 24,
"ctx_length": 77,
"vocab_size": 49408,
},
"return_pooled": False,
"ln_penultimate": True,
}
}
class OpenClipEncoder:
def __init__(self, dims:int, text_cfg:Dict, vision_cfg:Dict, **_):
self.visual = Open.ClipVisionTransformer(**vision_cfg)
text = Open.ClipTextTransformer(**text_cfg)
self.transformer = text.transformer
self.token_embedding = text.token_embedding
self.positional_embedding = text.positional_embedding
self.ln_final = text.ln_final
self.text_projection = text.text_projection
self.attn_mask = Tensor.full((77, 77), float("-inf")).triu(1).realize()
self.mean = Tensor([0.48145466, 0.45782750, 0.40821073]).reshape(-1, 1, 1)
self.std = Tensor([0.26862954, 0.26130258, 0.27577711]).reshape(-1, 1, 1)
# TODO:
# Should be doable in pure tinygrad, would just require some work and verification.
# This is very desirable since it would allow for full generation->evaluation in a single JIT call.
def prepare_image(self, image) -> Tensor:
from PIL import Image
SIZE = 224
w, h = image.size
scale = min(SIZE / h, SIZE / w)
image = image.resize((max(int(w*scale),SIZE),max(int(h*scale),SIZE)), Image.Resampling.BICUBIC)
w, h = image.size
if w > SIZE:
left = (w - SIZE) // 2
image = image.crop((left, left+SIZE, 0, SIZE))
elif h > SIZE:
top = (h - SIZE) // 2
image = image.crop((0, SIZE, top, top+SIZE))
x = Tensor(np.array(image.convert('RGB')), device=self.std.device)
x = x.permute(2, 0, 1).cast(dtypes.float32) / 255.0
return (x - self.mean) / self.std
def encode_tokens(self, tokens:Tensor) -> Tensor:
x = self.token_embedding(tokens)
x = x + self.positional_embedding
x = self.transformer(x, attn_mask=self.attn_mask)
x = self.ln_final(x)
x = x[Tensor.arange(x.shape[0], device=x.device), tokens.argmax(axis=-1)]
x = x @ self.text_projection
return x
def get_clip_score(self, tokens:Tensor, image:Tensor) -> Tensor:
image_features: Tensor = self.visual(image)
image_features /= image_features.square().sum(-1, keepdim=True).sqrt() # Frobenius Norm
text_features = self.encode_tokens(tokens)
text_features /= text_features.square().sum(-1, keepdim=True).sqrt() # Frobenius Norm
return (image_features * text_features).sum(axis=-1)

View File

@@ -1,232 +0,0 @@
# Adapted from tinygrad examples/stable_diffusion.py (MIT license).
# Upstream: https://github.com/tinygrad/tinygrad/blob/master/examples/stable_diffusion.py
# Copyright (c) 2023- the tinygrad authors
# SPDX-License-Identifier: MIT
#
# Local modifications: removed the MLPerf training branch (pulls
# examples/mlperf/initializers which we don't vendor) and the __main__
# argparse / fetch / profile blocks. Kept the core classes so the LocalAI
# tinygrad backend can instantiate and drive Stable Diffusion v1.x from a
# single checkpoint path.
from collections import namedtuple
from typing import Any, Dict
import numpy as np
from tinygrad import Tensor, dtypes
from tinygrad.nn import Conv2d, GroupNorm
from . import clip as clip_mod
from . import unet as unet_mod
from .clip import Closed, Tokenizer
from .unet import UNetModel
class AttnBlock:
def __init__(self, in_channels):
self.norm = GroupNorm(32, in_channels)
self.q = Conv2d(in_channels, in_channels, 1)
self.k = Conv2d(in_channels, in_channels, 1)
self.v = Conv2d(in_channels, in_channels, 1)
self.proj_out = Conv2d(in_channels, in_channels, 1)
def __call__(self, x):
h_ = self.norm(x)
q, k, v = self.q(h_), self.k(h_), self.v(h_)
b, c, h, w = q.shape
q, k, v = [t.reshape(b, c, h * w).transpose(1, 2) for t in (q, k, v)]
h_ = Tensor.scaled_dot_product_attention(q, k, v).transpose(1, 2).reshape(b, c, h, w)
return x + self.proj_out(h_)
class ResnetBlock:
def __init__(self, in_channels, out_channels=None):
self.norm1 = GroupNorm(32, in_channels)
self.conv1 = Conv2d(in_channels, out_channels, 3, padding=1)
self.norm2 = GroupNorm(32, out_channels)
self.conv2 = Conv2d(out_channels, out_channels, 3, padding=1)
self.nin_shortcut = Conv2d(in_channels, out_channels, 1) if in_channels != out_channels else (lambda x: x)
def __call__(self, x):
h = self.conv1(self.norm1(x).swish())
h = self.conv2(self.norm2(h).swish())
return self.nin_shortcut(x) + h
class Mid:
def __init__(self, block_in):
self.block_1 = ResnetBlock(block_in, block_in)
self.attn_1 = AttnBlock(block_in)
self.block_2 = ResnetBlock(block_in, block_in)
def __call__(self, x):
return x.sequential([self.block_1, self.attn_1, self.block_2])
class Decoder:
def __init__(self):
sz = [(128, 256), (256, 512), (512, 512), (512, 512)]
self.conv_in = Conv2d(4, 512, 3, padding=1)
self.mid = Mid(512)
arr = []
for i, s in enumerate(sz):
arr.append({"block": [ResnetBlock(s[1], s[0]), ResnetBlock(s[0], s[0]), ResnetBlock(s[0], s[0])]})
if i != 0:
arr[-1]['upsample'] = {"conv": Conv2d(s[0], s[0], 3, padding=1)}
self.up = arr
self.norm_out = GroupNorm(32, 128)
self.conv_out = Conv2d(128, 3, 3, padding=1)
def __call__(self, x):
x = self.conv_in(x)
x = self.mid(x)
for l in self.up[::-1]:
for b in l['block']:
x = b(x)
if 'upsample' in l:
bs, c, py, px = x.shape
x = x.reshape(bs, c, py, 1, px, 1).expand(bs, c, py, 2, px, 2).reshape(bs, c, py * 2, px * 2)
x = l['upsample']['conv'](x)
x.realize()
return self.conv_out(self.norm_out(x).swish())
class Encoder:
def __init__(self):
sz = [(128, 128), (128, 256), (256, 512), (512, 512)]
self.conv_in = Conv2d(3, 128, 3, padding=1)
arr = []
for i, s in enumerate(sz):
arr.append({"block": [ResnetBlock(s[0], s[1]), ResnetBlock(s[1], s[1])]})
if i != 3:
arr[-1]['downsample'] = {"conv": Conv2d(s[1], s[1], 3, stride=2, padding=(0, 1, 0, 1))}
self.down = arr
self.mid = Mid(512)
self.norm_out = GroupNorm(32, 512)
self.conv_out = Conv2d(512, 8, 3, padding=1)
def __call__(self, x):
x = self.conv_in(x)
for l in self.down:
for b in l['block']:
x = b(x)
if 'downsample' in l:
x = l['downsample']['conv'](x)
x = self.mid(x)
return self.conv_out(self.norm_out(x).swish())
class AutoencoderKL:
def __init__(self):
self.encoder = Encoder()
self.decoder = Decoder()
self.quant_conv = Conv2d(8, 8, 1)
self.post_quant_conv = Conv2d(4, 4, 1)
def __call__(self, x):
latent = self.encoder(x)
latent = self.quant_conv(latent)
latent = latent[:, 0:4]
latent = self.post_quant_conv(latent)
return self.decoder(latent)
def get_alphas_cumprod(beta_start=0.00085, beta_end=0.0120, n_training_steps=1000):
betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, n_training_steps, dtype=np.float32) ** 2
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas, axis=0)
return Tensor(alphas_cumprod)
# SD1.x UNet hyperparameters (same as upstream `unet_params`).
UNET_PARAMS_SD1: Dict[str, Any] = {
"adm_in_ch": None,
"in_ch": 4,
"out_ch": 4,
"model_ch": 320,
"attention_resolutions": [4, 2, 1],
"num_res_blocks": 2,
"channel_mult": [1, 2, 4, 4],
"n_heads": 8,
"transformer_depth": [1, 1, 1, 1],
"ctx_dim": 768,
"use_linear": False,
}
class StableDiffusion:
"""Stable Diffusion 1.x pipeline, adapted from tinygrad's reference example.
Drives the native CompVis `sd-v1-*.ckpt` checkpoint format (the only one
the vendored weight layout handles). For HuggingFace safetensors pipelines
the caller is expected to download / merge the `.ckpt` equivalent before
calling LoadModel.
"""
def __init__(self):
self.alphas_cumprod = get_alphas_cumprod()
self.first_stage_model = AutoencoderKL()
self.cond_stage_model = namedtuple("CondStageModel", ["transformer"])(
transformer=namedtuple("Transformer", ["text_model"])(text_model=Closed.ClipTextTransformer())
)
self.model = namedtuple("DiffusionModel", ["diffusion_model"])(
diffusion_model=UNetModel(**UNET_PARAMS_SD1)
)
# DDIM update step.
def _update(self, x, e_t, a_t, a_prev):
sqrt_one_minus_at = (1 - a_t).sqrt()
pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
dir_xt = (1.0 - a_prev).sqrt() * e_t
return a_prev.sqrt() * pred_x0 + dir_xt
def _model_output(self, uncond, cond, latent, timestep, guidance):
latents = self.model.diffusion_model(latent.expand(2, *latent.shape[1:]), timestep, uncond.cat(cond, dim=0))
uncond_latent, cond_latent = latents[0:1], latents[1:2]
return uncond_latent + guidance * (cond_latent - uncond_latent)
def step(self, uncond, cond, latent, timestep, a_t, a_prev, guidance):
e_t = self._model_output(uncond, cond, latent, timestep, guidance)
return self._update(latent, e_t, a_t, a_prev).realize()
def decode(self, x):
x = self.first_stage_model.post_quant_conv(1 / 0.18215 * x)
x = self.first_stage_model.decoder(x)
x = (x + 1.0) / 2.0
x = x.reshape(3, 512, 512).permute(1, 2, 0).clip(0, 1) * 255
return x.cast(dtypes.uint8)
def encode_prompt(self, tokenizer, prompt: str):
ids = Tensor([tokenizer.encode(prompt)])
return self.cond_stage_model.transformer.text_model(ids).realize()
def run_sd15(model: StableDiffusion, prompt: str, negative_prompt: str, steps: int, guidance: float, seed: int):
"""Generate a single 512x512 image. Returns a (512,512,3) uint8 tensor."""
tokenizer = Tokenizer.ClipTokenizer()
context = model.encode_prompt(tokenizer, prompt)
uncond = model.encode_prompt(tokenizer, negative_prompt)
timesteps = list(range(1, 1000, 1000 // steps))
alphas = model.alphas_cumprod[Tensor(timesteps)]
alphas_prev = Tensor([1.0]).cat(alphas[:-1])
if seed is not None:
Tensor.manual_seed(seed)
latent = Tensor.randn(1, 4, 64, 64)
for index in range(len(timesteps) - 1, -1, -1):
timestep = timesteps[index]
tid = Tensor([index])
latent = model.step(
uncond, context, latent,
Tensor([timestep]),
alphas[tid], alphas_prev[tid],
Tensor([guidance]),
)
return model.decode(latent).realize()

View File

@@ -1,267 +0,0 @@
# Vendored verbatim from tinygrad extra/models/unet.py (MIT license).
# Upstream: https://github.com/tinygrad/tinygrad/blob/master/extra/models/unet.py
# Copyright (c) 2023- the tinygrad authors
# SPDX-License-Identifier: MIT
from tinygrad import Tensor, dtypes, nn
from tinygrad.device import is_dtype_supported
from typing import Optional, Union, List, Any, Tuple, Callable
import math
# allow for monkeypatching
Linear, Conv2d, GroupNorm, LayerNorm = nn.Linear, nn.Conv2d, nn.GroupNorm, nn.LayerNorm
attention, gelu, mixed_precision_dtype = Tensor.scaled_dot_product_attention, Tensor.gelu, dtypes.float16
# https://github.com/Stability-AI/generative-models/blob/fbdc58cab9f4ee2be7a5e1f2e2787ecd9311942f/sgm/modules/diffusionmodules/util.py#L207
def timestep_embedding(timesteps:Tensor, dim:int, max_period=10000):
half = dim // 2
freqs = (-math.log(max_period) * Tensor.arange(half, device=timesteps.device) / half).exp()
args = timesteps.unsqueeze(1) * freqs.unsqueeze(0)
out = Tensor.cat(args.cos(), args.sin(), dim=-1)
return out.cast(mixed_precision_dtype) if is_dtype_supported(mixed_precision_dtype) else out
class ResBlock:
def __init__(self, channels:int, emb_channels:int, out_channels:int, num_groups:int=32):
self.in_layers = [
GroupNorm(num_groups, channels),
Tensor.silu,
Conv2d(channels, out_channels, 3, padding=1),
]
self.emb_layers = [
Tensor.silu,
Linear(emb_channels, out_channels),
]
self.out_layers = [
GroupNorm(num_groups, out_channels),
Tensor.silu,
lambda x: x, # needed for weights loading code to work
Conv2d(out_channels, out_channels, 3, padding=1),
]
self.skip_connection = Conv2d(channels, out_channels, 1) if channels != out_channels else (lambda x: x)
def __call__(self, x:Tensor, emb:Tensor) -> Tensor:
h = x.sequential(self.in_layers)
emb_out = emb.sequential(self.emb_layers)
h = h + emb_out.reshape(*emb_out.shape, 1, 1)
h = h.sequential(self.out_layers)
return self.skip_connection(x) + h
class CrossAttention:
def __init__(self, query_dim:int, ctx_dim:int, n_heads:int, d_head:int):
self.to_q = Linear(query_dim, n_heads*d_head, bias=False)
self.to_k = Linear(ctx_dim, n_heads*d_head, bias=False)
self.to_v = Linear(ctx_dim, n_heads*d_head, bias=False)
self.num_heads = n_heads
self.head_size = d_head
self.attn = attention
self.to_out = [Linear(n_heads*d_head, query_dim)]
def __call__(self, x:Tensor, ctx:Optional[Tensor]=None) -> Tensor:
ctx = x if ctx is None else ctx
q,k,v = self.to_q(x), self.to_k(ctx), self.to_v(ctx)
q,k,v = [y.reshape(x.shape[0], -1, self.num_heads, self.head_size).transpose(1,2) for y in (q,k,v)]
attention = self.attn(q, k, v).transpose(1,2)
h_ = attention.reshape(x.shape[0], -1, self.num_heads * self.head_size)
return h_.sequential(self.to_out)
class GEGLU:
def __init__(self, dim_in:int, dim_out:int):
self.proj = Linear(dim_in, dim_out * 2)
self.gelu = gelu
self.dim_out = dim_out
def __call__(self, x:Tensor) -> Tensor:
x, gate = self.proj(x).chunk(2, dim=-1)
return x * self.gelu(gate)
class FeedForward:
def __init__(self, dim:int, mult:int=4):
self.net: tuple[GEGLU, Callable, nn.Linear] = (
GEGLU(dim, dim*mult),
lambda x: x, # needed for weights loading code to work
Linear(dim*mult, dim)
)
def __call__(self, x:Tensor) -> Tensor:
return x.sequential(list(self.net))
class BasicTransformerBlock:
def __init__(self, dim:int, ctx_dim:int, n_heads:int, d_head:int):
self.attn1 = CrossAttention(dim, dim, n_heads, d_head)
self.ff = FeedForward(dim)
self.attn2 = CrossAttention(dim, ctx_dim, n_heads, d_head)
self.norm1 = LayerNorm(dim)
self.norm2 = LayerNorm(dim)
self.norm3 = LayerNorm(dim)
def __call__(self, x:Tensor, ctx:Optional[Tensor]=None) -> Tensor:
x = x + self.attn1(self.norm1(x))
x = x + self.attn2(self.norm2(x), ctx=ctx)
x = x + self.ff(self.norm3(x))
return x
# https://github.com/Stability-AI/generative-models/blob/fbdc58cab9f4ee2be7a5e1f2e2787ecd9311942f/sgm/modules/attention.py#L619
class SpatialTransformer:
def __init__(self, channels:int, n_heads:int, d_head:int, ctx_dim:Union[int,List[int]], use_linear:bool, depth:int=1,
norm_eps:float=1e-5):
if isinstance(ctx_dim, int):
ctx_dim = [ctx_dim]*depth
else:
assert isinstance(ctx_dim, list) and depth == len(ctx_dim)
self.norm = GroupNorm(32, channels, eps=norm_eps)
assert channels == n_heads * d_head
self.proj_in = Linear(channels, channels) if use_linear else Conv2d(channels, channels, 1)
self.transformer_blocks = [BasicTransformerBlock(channels, ctx_dim[d], n_heads, d_head) for d in range(depth)]
self.proj_out = Linear(channels, channels) if use_linear else Conv2d(channels, channels, 1)
self.use_linear = use_linear
def __call__(self, x:Tensor, ctx:Optional[Tensor]=None) -> Tensor:
b, c, h, w = x.shape
x_in = x
x = self.norm(x)
ops = [ (lambda z: z.reshape(b, c, h*w).permute(0,2,1)), (lambda z: self.proj_in(z)) ]
x = x.sequential(ops if self.use_linear else ops[::-1])
for block in self.transformer_blocks:
x = block(x, ctx=ctx)
ops = [ (lambda z: self.proj_out(z)), (lambda z: z.permute(0,2,1).reshape(b, c, h, w)) ]
x = x.sequential(ops if self.use_linear else ops[::-1])
return x + x_in
class Downsample:
def __init__(self, channels:int):
self.op = Conv2d(channels, channels, 3, stride=2, padding=1)
def __call__(self, x:Tensor) -> Tensor:
return self.op(x)
class Upsample:
def __init__(self, channels:int):
self.conv = Conv2d(channels, channels, 3, padding=1)
def __call__(self, x:Tensor) -> Tensor:
bs,c,py,px = x.shape
z = x.reshape(bs, c, py, 1, px, 1).expand(bs, c, py, 2, px, 2).reshape(bs, c, py*2, px*2)
return self.conv(z)
# https://github.com/Stability-AI/generative-models/blob/fbdc58cab9f4ee2be7a5e1f2e2787ecd9311942f/sgm/modules/diffusionmodules/openaimodel.py#L472
class UNetModel:
def __init__(self, adm_in_ch:Optional[int], in_ch:int, out_ch:int, model_ch:int, attention_resolutions:List[int], num_res_blocks:int,
channel_mult:List[int], transformer_depth:List[int], ctx_dim:Union[int,List[int]], use_linear:bool=False, d_head:Optional[int]=None,
n_heads:Optional[int]=None, num_groups:int=32, st_norm_eps:float=1e-5):
self.model_ch = model_ch
self.num_res_blocks = [num_res_blocks] * len(channel_mult)
self.attention_resolutions = attention_resolutions
self.d_head = d_head
self.n_heads = n_heads
def get_d_and_n_heads(dims:int) -> Tuple[int,int]:
if self.d_head is None:
assert self.n_heads is not None, f"d_head and n_heads cannot both be None"
return dims // self.n_heads, self.n_heads
else:
assert self.n_heads is None, f"d_head and n_heads cannot both be non-None"
return self.d_head, dims // self.d_head
time_embed_dim = model_ch * 4
self.time_embed = [
Linear(model_ch, time_embed_dim),
Tensor.silu,
Linear(time_embed_dim, time_embed_dim),
]
if adm_in_ch is not None:
self.label_emb = [
[
Linear(adm_in_ch, time_embed_dim),
Tensor.silu,
Linear(time_embed_dim, time_embed_dim),
]
]
self.input_blocks: List[Any] = [
[Conv2d(in_ch, model_ch, 3, padding=1)]
]
input_block_channels = [model_ch]
ch = model_ch
ds = 1
for idx, mult in enumerate(channel_mult):
for _ in range(self.num_res_blocks[idx]):
layers: List[Any] = [
ResBlock(ch, time_embed_dim, model_ch*mult, num_groups),
]
ch = mult * model_ch
if ds in attention_resolutions:
d_head, n_heads = get_d_and_n_heads(ch)
layers.append(SpatialTransformer(ch, n_heads, d_head, ctx_dim, use_linear, depth=transformer_depth[idx], norm_eps=st_norm_eps))
self.input_blocks.append(layers)
input_block_channels.append(ch)
if idx != len(channel_mult) - 1:
self.input_blocks.append([
Downsample(ch),
])
input_block_channels.append(ch)
ds *= 2
d_head, n_heads = get_d_and_n_heads(ch)
self.middle_block: List = [
ResBlock(ch, time_embed_dim, ch, num_groups),
SpatialTransformer(ch, n_heads, d_head, ctx_dim, use_linear, depth=transformer_depth[-1], norm_eps=st_norm_eps),
ResBlock(ch, time_embed_dim, ch, num_groups),
]
self.output_blocks = []
for idx, mult in list(enumerate(channel_mult))[::-1]:
for i in range(self.num_res_blocks[idx] + 1):
ich = input_block_channels.pop()
layers = [
ResBlock(ch + ich, time_embed_dim, model_ch*mult, num_groups),
]
ch = model_ch * mult
if ds in attention_resolutions:
d_head, n_heads = get_d_and_n_heads(ch)
layers.append(SpatialTransformer(ch, n_heads, d_head, ctx_dim, use_linear, depth=transformer_depth[idx], norm_eps=st_norm_eps))
if idx > 0 and i == self.num_res_blocks[idx]:
layers.append(Upsample(ch))
ds //= 2
self.output_blocks.append(layers)
self.out = [
GroupNorm(num_groups, ch),
Tensor.silu,
Conv2d(model_ch, out_ch, 3, padding=1),
]
def __call__(self, x:Tensor, tms:Tensor, ctx:Tensor, y:Optional[Tensor]=None) -> Tensor:
t_emb = timestep_embedding(tms, self.model_ch)
emb = t_emb.sequential(self.time_embed)
if y is not None:
assert y.shape[0] == x.shape[0]
emb = emb + y.sequential(self.label_emb[0])
if is_dtype_supported(mixed_precision_dtype):
emb = emb.cast(mixed_precision_dtype)
ctx = ctx.cast(mixed_precision_dtype)
x = x .cast(mixed_precision_dtype)
def run(x:Tensor, bb) -> Tensor:
if isinstance(bb, ResBlock): x = bb(x, emb)
elif isinstance(bb, SpatialTransformer): x = bb(x, ctx)
else: x = bb(x)
return x
saved_inputs = []
for b in self.input_blocks:
for bb in b:
x = run(x, bb)
saved_inputs.append(x)
for bb in self.middle_block:
x = run(x, bb)
for b in self.output_blocks:
x = x.cat(saved_inputs.pop(), dim=1)
for bb in b:
x = run(x, bb)
return x.sequential(self.out)

View File

@@ -1,274 +0,0 @@
# Adapted from tinygrad examples/whisper.py (MIT license).
# Upstream: https://github.com/tinygrad/tinygrad/blob/master/examples/whisper.py
# Copyright (c) 2023- the tinygrad authors
# SPDX-License-Identifier: MIT
#
# Local modifications: removed the pyaudio listener / __main__ block; the rest
# is the core Whisper model + preprocessing + single-file transcription path.
from __future__ import annotations
import base64
import collections
import itertools
from typing import List, Literal, Optional, Union
import numpy as np
from tinygrad import Tensor, TinyJit, Variable, dtypes, nn
from tinygrad.helpers import fetch
from tinygrad.nn.state import load_state_dict, torch_load
from .audio_helpers import mel
class MultiHeadAttention:
def __init__(self, n_state, n_head, kv_caching: Literal['cross', 'self', None] = None, max_self_attn_cache_len=None):
self.n_head = n_head
self.query = nn.Linear(n_state, n_state)
self.key = nn.Linear(n_state, n_state, bias=False)
self.value = nn.Linear(n_state, n_state)
self.out = nn.Linear(n_state, n_state)
self.kv_caching = kv_caching
self.max_self_attn_cache_len = max_self_attn_cache_len
def __call__(self, x, xa=None, mask=None, len=None):
if self.kv_caching == 'cross':
if xa is not None:
k, v = self.key(xa), self.value(xa)
if not hasattr(self, 'cache_k'):
self.cache_k, self.cache_v = k, v
else:
self.cache_k.assign(k).realize()
self.cache_v.assign(v).realize()
else:
k, v = self.cache_k, self.cache_v
else:
k, v = self.key(x), self.value(x)
if self.kv_caching == 'self':
if not hasattr(self, 'cache_k'):
self.cache_k = Tensor.zeros(x.shape[0], self.max_self_attn_cache_len, x.shape[2])
self.cache_v = Tensor.zeros(x.shape[0], self.max_self_attn_cache_len, x.shape[2])
k = self.cache_k.shrink((None, (0, len), None)).cat(k, dim=1)
v = self.cache_v.shrink((None, (0, len), None)).cat(v, dim=1)
padding = self.max_self_attn_cache_len - len - x.shape[1]
self.cache_k.assign(k.pad((None, (0, padding), None)).contiguous()).realize()
self.cache_v.assign(v.pad((None, (0, padding), None)).contiguous()).realize()
q = self.query(x)
n_ctx = q.shape[1]
head_dim = q.shape[-1] // self.n_head
q = q.reshape(*q.shape[:2], self.n_head, head_dim).permute(0, 2, 1, 3)
k = k.reshape(*k.shape[:2], self.n_head, head_dim).permute(0, 2, 1, 3)
v = v.reshape(*v.shape[:2], self.n_head, head_dim).permute(0, 2, 1, 3)
attn = Tensor.scaled_dot_product_attention(q, k, v, mask[:n_ctx, :n_ctx] if mask is not None else None)
wv = attn.permute(0, 2, 1, 3).flatten(start_dim=2)
return self.out(wv)
class ResidualAttentionBlock:
def __init__(self, n_state, n_head, is_decoder_block=False, max_self_attn_cache_len=None):
self.attn = MultiHeadAttention(n_state, n_head, kv_caching='self' if is_decoder_block else None, max_self_attn_cache_len=max_self_attn_cache_len)
self.attn_ln = nn.LayerNorm(n_state)
self.cross_attn = MultiHeadAttention(n_state, n_head, kv_caching='cross') if is_decoder_block else None
self.cross_attn_ln = nn.LayerNorm(n_state) if is_decoder_block else None
self.mlp = [nn.Linear(n_state, n_state * 4), Tensor.gelu, nn.Linear(n_state * 4, n_state)]
self.mlp_ln = nn.LayerNorm(n_state)
def __call__(self, x, xa=None, mask=None, len=None):
x = x + self.attn(self.attn_ln(x), mask=mask, len=len)
if self.cross_attn:
x = x + self.cross_attn(self.cross_attn_ln(x), xa)
x = x + self.mlp_ln(x).sequential(self.mlp)
return x.realize()
class AudioEncoder:
def __init__(self, n_mels, n_audio_ctx, n_audio_state, n_audio_head, n_audio_layer, **_):
self.conv1 = nn.Conv1d(n_mels, n_audio_state, kernel_size=3, padding=1)
self.conv2 = nn.Conv1d(n_audio_state, n_audio_state, kernel_size=3, stride=2, padding=1)
self.blocks = [ResidualAttentionBlock(n_audio_state, n_audio_head) for _ in range(n_audio_layer)]
self.ln_post = nn.LayerNorm(n_audio_state)
self.positional_embedding = Tensor.empty(n_audio_ctx, n_audio_state)
self.encode = TinyJit(self.__call__)
def __call__(self, x):
x = self.conv1(x).gelu()
x = self.conv2(x).gelu()
x = x.permute(0, 2, 1)
x = x + self.positional_embedding[:x.shape[1]]
x = x.sequential(self.blocks)
x = self.ln_post(x)
return x.realize()
class TextDecoder:
def __init__(self, n_vocab, n_text_ctx, n_text_state, n_text_head, n_text_layer, **_):
self.max_tokens_to_sample = n_text_ctx // 2
self.max_self_attn_cache_len = n_text_ctx
self.token_embedding = nn.Embedding(n_vocab, n_text_state)
self.positional_embedding = Tensor.empty(n_text_ctx, n_text_state)
self.blocks = [ResidualAttentionBlock(n_text_state, n_text_head, is_decoder_block=True, max_self_attn_cache_len=self.max_self_attn_cache_len) for _ in range(n_text_layer)]
self.ln = nn.LayerNorm(n_text_state)
self.mask = Tensor.full((n_text_ctx, n_text_ctx), -np.inf).triu(1).realize()
self.getjitted = collections.defaultdict(lambda: TinyJit(self.forward))
def __call__(self, x, pos, encoded_audio):
pos = Variable("self_attn_cache_len", 1, self.max_self_attn_cache_len - 1).bind(pos) if pos else 0
return self.getjitted[x.shape](x, pos, encoded_audio)
def forward(self, x, pos, encoded_audio):
seqlen = x.shape[-1]
x = self.token_embedding(x) + self.positional_embedding.shrink(((pos, pos + seqlen), None))
for block in self.blocks:
x = block(x, xa=encoded_audio, mask=self.mask, len=pos)
return self.output_tok(x)
def output_tok(self, x):
return (self.ln(x) @ self.token_embedding.weight.T).realize()
class Whisper:
def __init__(self, dims, batch_size=1):
self.encoder = AudioEncoder(**dims)
self.decoder = TextDecoder(**dims)
self.is_multilingual = dims["n_vocab"] == 51865
self.batch_size = batch_size
RATE = 16000
SEGMENT_SECONDS = 30
SAMPLES_PER_SEGMENT = RATE * SEGMENT_SECONDS
N_FFT = 400
HOP_LENGTH = 160
N_MELS = 80
FRAMES_PER_SEGMENT = SAMPLES_PER_SEGMENT // HOP_LENGTH
def prep_audio(waveforms: List[np.ndarray], batch_size: int, truncate: bool = False) -> np.ndarray:
import librosa
def pad_or_trim(arr, target_len):
if len(arr) == target_len:
return arr
if len(arr) < target_len:
return np.pad(arr, (0, target_len - len(arr)), 'constant')
return arr[:target_len]
max_len = SAMPLES_PER_SEGMENT if truncate else max(len(w) for w in waveforms)
if (r := max_len % SAMPLES_PER_SEGMENT) > 0:
max_len += SAMPLES_PER_SEGMENT - r
waveforms = np.array(list(map(lambda w: pad_or_trim(w, max_len), waveforms)))
if waveforms.shape[0] < batch_size:
waveforms = np.pad(waveforms, pad_width=((0, batch_size - waveforms.shape[0]), (0, 0)))
stft = librosa.stft(waveforms, n_fft=N_FFT, hop_length=HOP_LENGTH, window='hann', dtype=np.csingle)
magnitudes = np.absolute(stft[..., :-1]) ** 2
mel_spec = mel(sr=RATE, n_fft=N_FFT, n_mels=N_MELS).numpy() @ magnitudes
log_spec = np.log10(np.clip(mel_spec, 1e-10, None))
log_spec = np.maximum(log_spec, log_spec.max((1, 2), keepdims=True) - 8.0)
log_spec = (log_spec + 4.0) / 4.0
return log_spec
LANGUAGES = {
"en": "english", "zh": "chinese", "de": "german", "es": "spanish", "ru": "russian", "ko": "korean",
"fr": "french", "ja": "japanese", "pt": "portuguese", "tr": "turkish", "pl": "polish", "it": "italian",
}
def get_encoding(encoding_name: str):
import tiktoken
with fetch(f"https://raw.githubusercontent.com/openai/whisper/main/whisper/assets/{encoding_name}.tiktoken").open() as f:
ranks = {base64.b64decode(token): int(rank) for token, rank in (line.split() for line in f if line)}
n_vocab = len(ranks)
specials = [
"<|endoftext|>",
"<|startoftranscript|>",
*[f"<|{lang}|>" for lang in LANGUAGES.keys()],
"<|translate|>",
"<|transcribe|>",
"<|startoflm|>",
"<|startofprev|>",
"<|nospeech|>",
"<|notimestamps|>",
*[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
]
special_tokens = dict(zip(specials, itertools.count(n_vocab)))
return tiktoken.Encoding(
name=encoding_name,
explicit_n_vocab=n_vocab + len(specials),
pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
mergeable_ranks=ranks,
special_tokens=special_tokens,
)
MODEL_URLS = {
"tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
"tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
"base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
"base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
"small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
"small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
}
def init_whisper(model_name: str = "base", batch_size: int = 1):
filename = fetch(MODEL_URLS[model_name])
state = torch_load(filename)
model = Whisper(state['dims'], batch_size)
load_state_dict(model, state['model_state_dict'], strict=False)
enc = get_encoding("multilingual" if model.is_multilingual else "gpt2")
return model, enc
def load_file_waveform(filename: str):
import librosa
waveform, _ = librosa.load(filename, sr=RATE)
return waveform
def transcribe_waveform(model: Whisper, enc, waveforms, language: Optional[str] = None, truncate: bool = False) -> str:
log_spec = prep_audio(waveforms, model.batch_size, truncate)
nsample = model.decoder.max_tokens_to_sample
nctx = model.decoder.max_self_attn_cache_len
start_tokens = [enc._special_tokens["<|startoftranscript|>"]]
if model.is_multilingual:
lang = language if (language and language in LANGUAGES) else "en"
language_token = enc._special_tokens["<|startoftranscript|>"] + 1 + tuple(LANGUAGES.keys()).index(lang)
start_tokens.append(language_token)
start_tokens.append(enc._special_tokens["<|transcribe|>"])
start_tokens.append(enc._special_tokens["<|notimestamps|>"])
eot = enc._special_tokens["<|endoftext|>"]
def inferloop(ctx, encoded_audio):
pos, next_tokens = 0, ctx
for _ in range(nsample):
next_tokens = model.decoder(Tensor(next_tokens, dtype=dtypes.int32), pos, encoded_audio)[:, -1].argmax(axis=-1).numpy().astype(np.int32).reshape(-1, 1)
next_tokens[ctx[:, -1] == eot] = eot
ctx = np.concatenate((ctx, next_tokens), axis=1)
pos = ctx.shape[-1] - 1
if (next_tokens == eot).all() or pos == nctx:
break
return ctx
ctx = np.tile(start_tokens, (model.batch_size, 1))
transcriptions: list[list[int]] = [[] for _ in waveforms]
for curr_frame in range(0, log_spec.shape[-1], FRAMES_PER_SEGMENT):
encoded_audio = model.encoder.encode(Tensor(log_spec[:, :, curr_frame:curr_frame + FRAMES_PER_SEGMENT]))
ctx_arr = inferloop(np.array(ctx), encoded_audio)
for i, arr in enumerate(ctx_arr):
if i >= len(waveforms):
break
end_idxs = np.where(arr == eot)[0]
start_idx = np.where(arr == start_tokens[-1])[0][0] + 1
end_idx = end_idxs[0] if len(end_idxs) else None
transcriptions[i].extend(arr[start_idx:end_idx])
ctx = ctx_arr
texts = [enc.decode([int(t) for t in toks]).strip() for toks in transcriptions]
return texts[0] if len(texts) == 1 else "\n".join(texts)

View File

@@ -4,7 +4,7 @@ numba==0.60.0
accelerate
transformers>=5.0.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.2.3
diffusers
soundfile
protobuf==6.33.5

View File

@@ -4,7 +4,7 @@ llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.2.3
diffusers
soundfile
protobuf==6.33.5

View File

@@ -4,7 +4,7 @@ llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.2.3
diffusers
soundfile
protobuf==6.33.5

View File

@@ -5,7 +5,7 @@ transformers>=5.0.0
llvmlite==0.43.0
numba==0.60.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.2.3
diffusers
soundfile
protobuf==6.33.5

View File

@@ -5,7 +5,7 @@ llvmlite==0.43.0
numba==0.60.0
transformers>=5.0.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.2.3
diffusers
soundfile
protobuf==6.33.5

View File

@@ -4,7 +4,7 @@ numba==0.60.0
accelerate
transformers>=5.0.0
bitsandbytes
sentence-transformers==5.4.0
sentence-transformers==5.2.3
diffusers
soundfile
protobuf==6.33.5

View File

@@ -1,6 +1,3 @@
# whisperx hard-pins torch~=2.8.0, which is not available in the rocm7.x indexes
# (they start at torch 2.10). Keep rocm6.4 wheels here — they still load against
# the rocm7.2.1 runtime via AMD's forward-compatibility window.
--extra-index-url https://download.pytorch.org/whl/rocm6.4
torch==2.8.0+rocm6.4
--extra-index-url https://download.pytorch.org/whl/rocm7.0
torch==2.10.0+rocm7.0
whisperx @ git+https://github.com/m-bain/whisperX.git

View File

@@ -341,16 +341,6 @@ impl Backend for KokorosService {
Err(Status::unimplemented("Not supported"))
}
type AudioTranscriptionStreamStream =
ReceiverStream<Result<backend::TranscriptStreamResponse, Status>>;
async fn audio_transcription_stream(
&self,
_: Request<backend::TranscriptRequest>,
) -> Result<Response<Self::AudioTranscriptionStreamStream>, Status> {
Err(Status::unimplemented("Not supported"))
}
async fn sound_generation(
&self,
_: Request<backend::SoundGenerationRequest>,

View File

@@ -242,20 +242,14 @@ func initDistributed(cfg *config.ApplicationConfig, authDB *gorm.DB) (*Distribut
DB: authDB,
})
// Create ReplicaReconciler for auto-scaling model replicas. Adapter +
// RegistrationToken feed the state-reconciliation passes: pending op
// drain uses the adapter, and model health probes use the token to auth
// against workers' gRPC HealthCheck.
// Create ReplicaReconciler for auto-scaling model replicas
reconciler := nodes.NewReplicaReconciler(nodes.ReplicaReconcilerOptions{
Registry: registry,
Scheduler: router,
Unloader: remoteUnloader,
Adapter: remoteUnloader,
RegistrationToken: cfg.Distributed.RegistrationToken,
DB: authDB,
Interval: 30 * time.Second,
ScaleDownDelay: 5 * time.Minute,
ProbeStaleAfter: 2 * time.Minute,
Registry: registry,
Scheduler: router,
Unloader: remoteUnloader,
DB: authDB,
Interval: 30 * time.Second,
ScaleDownDelay: 5 * time.Minute,
})
// Create ModelRouterAdapter to wire into ModelLoader

View File

@@ -235,12 +235,7 @@ func New(opts ...config.AppOption) (*Application, error) {
// In distributed mode, uses PostgreSQL advisory lock so only one frontend
// instance runs periodic checks (avoids duplicate upgrades across replicas).
if len(options.BackendGalleries) > 0 {
// Pass a lazy getter for the backend manager so the checker always
// uses the active one — DistributedBackendManager is swapped in above
// and asks workers for their installed backends, which is what
// upgrade detection needs in distributed mode.
bmFn := func() galleryop.BackendManager { return application.GalleryService().BackendManager() }
uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB(), bmFn)
uc := NewUpgradeChecker(options, application.ModelLoader(), application.distributedDB())
application.upgradeChecker = uc
go uc.Run(options.Context)
}

View File

@@ -8,7 +8,6 @@ import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/gallery"
"github.com/mudler/LocalAI/core/services/advisorylock"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/pkg/model"
"github.com/mudler/LocalAI/pkg/system"
"github.com/mudler/xlog"
@@ -27,12 +26,6 @@ type UpgradeChecker struct {
galleries []config.Gallery
systemState *system.SystemState
db *gorm.DB // non-nil in distributed mode
// backendManagerFn lazily returns the current backend manager (may be
// swapped from Local to Distributed after startup). Pulled through each
// check so the UpgradeChecker uses whichever is active. In distributed
// mode this ensures CheckUpgrades asks workers instead of the (empty)
// frontend filesystem — fixing the bug where upgrades never surfaced.
backendManagerFn func() galleryop.BackendManager
checkInterval time.Duration
stop chan struct{}
@@ -47,22 +40,18 @@ type UpgradeChecker struct {
// NewUpgradeChecker creates a new UpgradeChecker service.
// Pass db=nil for standalone mode, or a *gorm.DB for distributed mode
// (uses advisory locks so only one instance runs periodic checks).
// backendManagerFn is optional; when set, CheckUpgrades is routed through
// the active backend manager — required in distributed mode so the check
// aggregates from workers rather than the empty frontend filesystem.
func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB, backendManagerFn func() galleryop.BackendManager) *UpgradeChecker {
func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoader, db *gorm.DB) *UpgradeChecker {
return &UpgradeChecker{
appConfig: appConfig,
modelLoader: ml,
galleries: appConfig.BackendGalleries,
systemState: appConfig.SystemState,
db: db,
backendManagerFn: backendManagerFn,
checkInterval: 6 * time.Hour,
stop: make(chan struct{}),
done: make(chan struct{}),
triggerCh: make(chan struct{}, 1),
lastUpgrades: make(map[string]gallery.UpgradeInfo),
appConfig: appConfig,
modelLoader: ml,
galleries: appConfig.BackendGalleries,
systemState: appConfig.SystemState,
db: db,
checkInterval: 6 * time.Hour,
stop: make(chan struct{}),
done: make(chan struct{}),
triggerCh: make(chan struct{}, 1),
lastUpgrades: make(map[string]gallery.UpgradeInfo),
}
}
@@ -75,16 +64,13 @@ func NewUpgradeChecker(appConfig *config.ApplicationConfig, ml *model.ModelLoade
func (uc *UpgradeChecker) Run(ctx context.Context) {
defer close(uc.done)
// Initial delay: don't slow down startup. Short enough that operators
// don't stare at an empty upgrade banner for long; long enough that
// workers have registered and reported their installed backends.
initialDelay := 10 * time.Second
// Initial delay: don't slow down startup
select {
case <-ctx.Done():
return
case <-uc.stop:
return
case <-time.After(initialDelay):
case <-time.After(30 * time.Second):
}
// First check always runs locally (to warm the cache on this instance)
@@ -158,18 +144,7 @@ func (uc *UpgradeChecker) GetAvailableUpgrades() map[string]gallery.UpgradeInfo
}
func (uc *UpgradeChecker) runCheck(ctx context.Context) {
var (
upgrades map[string]gallery.UpgradeInfo
err error
)
if uc.backendManagerFn != nil {
if bm := uc.backendManagerFn(); bm != nil {
upgrades, err = bm.CheckUpgrades(ctx)
}
}
if upgrades == nil && err == nil {
upgrades, err = gallery.CheckBackendUpgrades(ctx, uc.galleries, uc.systemState)
}
upgrades, err := gallery.CheckBackendUpgrades(ctx, uc.galleries, uc.systemState)
uc.mu.Lock()
uc.lastCheckTime = time.Now()

View File

@@ -15,7 +15,6 @@ import (
"github.com/mudler/LocalAI/core/config"
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/services/galleryop"
"github.com/mudler/LocalAI/core/templates"
"github.com/mudler/LocalAI/core/trace"
"github.com/mudler/LocalAI/core/gallery"
@@ -95,25 +94,15 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
return nil, err
}
// Probe the backend for model-scoped metadata after LoadModel succeeds.
// Two signals are captured: thinking-mode detection (only meaningful when the
// tokenizer template path is active) and the multimodal media marker (needed
// by custom chat templates so markers line up with what mtmd expects).
// We probe whenever any of those slots is still empty.
needsThinkingProbe := c.TemplateConfig.UseTokenizerTemplate &&
c.ReasoningConfig.DisableReasoning == nil &&
c.ReasoningConfig.DisableReasoningTagPrefill == nil
needsMarkerProbe := c.MediaMarker == ""
if needsThinkingProbe || needsMarkerProbe {
// Detect thinking support after model load (only if not already detected)
// This needs to happen after LoadModel succeeds so the backend can render templates
if (c.ReasoningConfig.DisableReasoning == nil && c.ReasoningConfig.DisableReasoningTagPrefill == nil) && c.TemplateConfig.UseTokenizerTemplate {
modelOpts := grpcModelOpts(*c, o.SystemState.Model.ModelsPath)
config.DetectThinkingSupportFromBackend(ctx, c, inferenceModel, modelOpts)
// Update the config in the loader so it persists for future requests
cl.UpdateModelConfig(c.Name, func(cfg *config.ModelConfig) {
cfg.ReasoningConfig.DisableReasoning = c.ReasoningConfig.DisableReasoning
cfg.ReasoningConfig.DisableReasoningTagPrefill = c.ReasoningConfig.DisableReasoningTagPrefill
if c.MediaMarker != "" {
cfg.MediaMarker = c.MediaMarker
}
})
}
@@ -132,17 +121,7 @@ func ModelInference(ctx context.Context, s string, messages schema.Messages, ima
for k, v := range metadata {
opts.Metadata[k] = v
}
// The prompt was rendered with the sentinel "<__media__>" marker because
// middleware templating runs before the backend is loaded and probed.
// Once we know the backend's actual media marker, substitute so marker
// count matches the bitmap count passed through opts.Images/Videos/Audios.
// No-op when MediaMarker is unset, matches the sentinel, or the prompt has
// no media placeholders.
prompt := s
if c.MediaMarker != "" && c.MediaMarker != templates.DefaultMultiMediaMarker {
prompt = strings.ReplaceAll(prompt, templates.DefaultMultiMediaMarker, c.MediaMarker)
}
opts.Prompt = prompt
opts.Prompt = s
opts.Messages = protoMessages
opts.UseTokenizerTemplate = c.TemplateConfig.UseTokenizerTemplate
opts.Images = images

View File

@@ -10,68 +10,26 @@ import (
"github.com/mudler/LocalAI/core/schema"
"github.com/mudler/LocalAI/core/trace"
grpcPkg "github.com/mudler/LocalAI/pkg/grpc"
"github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/LocalAI/pkg/model"
)
// TranscriptionRequest groups the parameters accepted by ModelTranscription.
// Use this so callers don't have to pass long positional arg lists when they
// only care about a subset of fields.
type TranscriptionRequest struct {
Audio string
Language string
Translate bool
Diarize bool
Prompt string
Temperature float32
TimestampGranularities []string
}
func (r *TranscriptionRequest) toProto(threads uint32) *proto.TranscriptRequest {
return &proto.TranscriptRequest{
Dst: r.Audio,
Language: r.Language,
Translate: r.Translate,
Diarize: r.Diarize,
Threads: threads,
Prompt: r.Prompt,
Temperature: r.Temperature,
TimestampGranularities: r.TimestampGranularities,
}
}
func loadTranscriptionModel(ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (grpcPkg.Backend, error) {
func ModelTranscription(audio, language string, translate, diarize bool, prompt string, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (*schema.TranscriptionResult, error) {
if modelConfig.Backend == "" {
modelConfig.Backend = model.WhisperBackend
}
opts := ModelOptions(modelConfig, appConfig)
transcriptionModel, err := ml.Load(opts...)
if err != nil {
recordModelLoadFailure(appConfig, modelConfig.Name, modelConfig.Backend, err, nil)
return nil, err
}
if transcriptionModel == nil {
return nil, fmt.Errorf("could not load transcription model")
}
return transcriptionModel, nil
}
func ModelTranscription(audio, language string, translate, diarize bool, prompt string, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (*schema.TranscriptionResult, error) {
return ModelTranscriptionWithOptions(TranscriptionRequest{
Audio: audio,
Language: language,
Translate: translate,
Diarize: diarize,
Prompt: prompt,
}, ml, modelConfig, appConfig)
}
func ModelTranscriptionWithOptions(req TranscriptionRequest, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig) (*schema.TranscriptionResult, error) {
transcriptionModel, err := loadTranscriptionModel(ml, modelConfig, appConfig)
if err != nil {
return nil, err
}
var startTime time.Time
var audioSnippet map[string]any
@@ -79,18 +37,25 @@ func ModelTranscriptionWithOptions(req TranscriptionRequest, ml *model.ModelLoad
trace.InitBackendTracingIfEnabled(appConfig.TracingMaxItems)
startTime = time.Now()
// Capture audio before the backend call — the backend may delete the file.
audioSnippet = trace.AudioSnippet(req.Audio)
audioSnippet = trace.AudioSnippet(audio)
}
r, err := transcriptionModel.AudioTranscription(context.Background(), req.toProto(uint32(*modelConfig.Threads)))
r, err := transcriptionModel.AudioTranscription(context.Background(), &proto.TranscriptRequest{
Dst: audio,
Language: language,
Translate: translate,
Diarize: diarize,
Threads: uint32(*modelConfig.Threads),
Prompt: prompt,
})
if err != nil {
if appConfig.EnableTracing {
errData := map[string]any{
"audio_file": req.Audio,
"language": req.Language,
"translate": req.Translate,
"diarize": req.Diarize,
"prompt": req.Prompt,
"audio_file": audio,
"language": language,
"translate": translate,
"diarize": diarize,
"prompt": prompt,
}
if audioSnippet != nil {
maps.Copy(errData, audioSnippet)
@@ -101,83 +66,15 @@ func ModelTranscriptionWithOptions(req TranscriptionRequest, ml *model.ModelLoad
Type: trace.BackendTraceTranscription,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Summary: trace.TruncateString(req.Audio, 200),
Summary: trace.TruncateString(audio, 200),
Error: err.Error(),
Data: errData,
})
}
return nil, err
}
tr := transcriptResultFromProto(r)
if appConfig.EnableTracing {
data := map[string]any{
"audio_file": req.Audio,
"language": req.Language,
"translate": req.Translate,
"diarize": req.Diarize,
"prompt": req.Prompt,
"result_text": tr.Text,
"segments_count": len(tr.Segments),
}
if audioSnippet != nil {
maps.Copy(data, audioSnippet)
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceTranscription,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Summary: trace.TruncateString(req.Audio+" -> "+tr.Text, 200),
Data: data,
})
}
return tr, err
}
// TranscriptionStreamChunk is a streaming event emitted by
// ModelTranscriptionStream. Either Delta carries an incremental text fragment,
// or Final carries the completed transcription as the very last event.
type TranscriptionStreamChunk struct {
Delta string
Final *schema.TranscriptionResult
}
// ModelTranscriptionStream runs the gRPC streaming transcription RPC and
// invokes onChunk for each event the backend produces. Backends that don't
// support real streaming should still emit one terminal event with Final set,
// which the HTTP layer turns into a single delta + done SSE pair.
func ModelTranscriptionStream(req TranscriptionRequest, ml *model.ModelLoader, modelConfig config.ModelConfig, appConfig *config.ApplicationConfig, onChunk func(TranscriptionStreamChunk)) error {
transcriptionModel, err := loadTranscriptionModel(ml, modelConfig, appConfig)
if err != nil {
return err
}
pbReq := req.toProto(uint32(*modelConfig.Threads))
pbReq.Stream = true
return transcriptionModel.AudioTranscriptionStream(context.Background(), pbReq, func(chunk *proto.TranscriptStreamResponse) {
if chunk == nil {
return
}
out := TranscriptionStreamChunk{Delta: chunk.Delta}
if chunk.FinalResult != nil {
out.Final = transcriptResultFromProto(chunk.FinalResult)
}
onChunk(out)
})
}
func transcriptResultFromProto(r *proto.TranscriptResult) *schema.TranscriptionResult {
if r == nil {
return &schema.TranscriptionResult{}
}
tr := &schema.TranscriptionResult{
Text: r.Text,
Language: r.Language,
Duration: float64(r.Duration),
Text: r.Text,
}
for _, s := range r.Segments {
var tks []int
@@ -194,5 +91,30 @@ func transcriptResultFromProto(r *proto.TranscriptResult) *schema.TranscriptionR
Speaker: s.Speaker,
})
}
return tr
if appConfig.EnableTracing {
data := map[string]any{
"audio_file": audio,
"language": language,
"translate": translate,
"diarize": diarize,
"prompt": prompt,
"result_text": tr.Text,
"segments_count": len(tr.Segments),
}
if audioSnippet != nil {
maps.Copy(data, audioSnippet)
}
trace.RecordBackendTrace(trace.BackendTrace{
Timestamp: startTime,
Duration: time.Since(startTime),
Type: trace.BackendTraceTranscription,
ModelName: modelConfig.Name,
Backend: modelConfig.Backend,
Summary: trace.TruncateString(audio+" -> "+tr.Text, 200),
Data: data,
})
}
return tr, err
}

Some files were not shown because too many files have changed in this diff Show More