diff --git a/.agents/adding-backends.md b/.agents/adding-backends.md
index e775a4492..31eff8474 100644
--- a/.agents/adding-backends.md
+++ b/.agents/adding-backends.md
@@ -129,6 +129,30 @@ After adding a new backend, verify:
 - [ ] No Makefile syntax errors (check with linter)
 - [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
+## Bundling runtime shared libraries (`package.sh`)
+
+The final `Dockerfile.python` stage is `FROM scratch` — there is no system `libc`, no `apt`, no fallback library path. Only files explicitly copied from the builder stage end up in the backend image. That means any runtime `dlopen` your backend (or its Python deps) needs **must** be packaged into `${BACKEND}/lib/`.
+
+Pattern:
+
+1. Make sure the library is installed in the builder stage of `backend/Dockerfile.python` (add it to the top-level `apt-get install`).
+2. Drop a `package.sh` in your backend directory that copies the library — and its soname symlinks — into `$(dirname $0)/lib`. See `backend/python/vllm/package.sh` for a reference implementation that walks `/usr/lib/x86_64-linux-gnu`, `/usr/lib/aarch64-linux-gnu`, etc.
+3. `Dockerfile.python` already runs `package.sh` automatically if it exists, after `package-gpu-libs.sh`.
+4. `libbackend.sh` automatically prepends `${EDIR}/lib` to `LD_LIBRARY_PATH` at run time, so anything packaged this way is found by `dlopen`.
+
+How to find missing libs: when a Python module silently fails to register torch ops or you see `AttributeError: '_OpNamespace' '...' object has no attribute '...'`, run the backend image's Python with `LD_DEBUG=libs` to see which `dlopen` failed. The filename in the error message (e.g. `libnuma.so.1`) is what you need to package.
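A `package.sh` must copy not just the real `.so` file but the whole soname symlink chain, or `dlopen` will ask for a name that is not in `${BACKEND}/lib/`. A rough Python sketch of that copy step (the `bundle_lib` helper is hypothetical and for illustration only; the real script is shell):

```python
import os
import shutil


def bundle_lib(soname, search_dirs, dest):
    """Copy `soname` and every symlink in its chain into `dest`.

    Illustrates the package.sh pattern: symlinks are recreated by name
    so dlopen finds the exact soname it asked for, and the real file
    behind the chain is copied last.
    """
    os.makedirs(dest, exist_ok=True)
    copied = []
    for d in search_dirs:
        path = os.path.join(d, soname)
        if not os.path.lexists(path):
            continue
        # Walk the chain, e.g. libnuma.so.1 -> libnuma.so.1.0.0
        while os.path.islink(path):
            raw = os.readlink(path)
            nxt = raw if os.path.isabs(raw) else os.path.join(os.path.dirname(path), raw)
            os.symlink(os.path.basename(nxt), os.path.join(dest, os.path.basename(path)))
            copied.append(os.path.basename(path))
            path = nxt
        shutil.copy2(path, dest)  # the real shared object
        copied.append(os.path.basename(path))
        break
    return copied
```

The `search_dirs` argument stands in for probing both `/usr/lib/x86_64-linux-gnu` and `/usr/lib/aarch64-linux-gnu`, which the reference script does per architecture.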
+
+To verify packaging works without trusting the host:
+
+```bash
+make docker-build-
+CID=$(docker create --entrypoint=/run.sh local-ai-backend:)
+docker cp $CID:/lib /tmp/check && docker rm $CID
+ls /tmp/check # expect the bundled .so files + symlinks
+```
+
+Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.
+
 ## 6. Example: Adding a Python Backend
 
 For reference, when `moonshine` was added:
diff --git a/.agents/vllm-backend.md b/.agents/vllm-backend.md
new file mode 100644
index 000000000..a2b9e614e
--- /dev/null
+++ b/.agents/vllm-backend.md
@@ -0,0 +1,115 @@
+# Working on the vLLM Backend
+
+The vLLM backend lives at `backend/python/vllm/backend.py` (async gRPC) and the multimodal variant at `backend/python/vllm-omni/backend.py` (sync gRPC). Both wrap vLLM's `AsyncLLMEngine` / `Omni` and translate the LocalAI gRPC `PredictOptions` into vLLM `SamplingParams` + outputs into `Reply.chat_deltas`.
+
+This file captures the non-obvious bits — most of the bring-up was a single PR (`feat/vllm-parity`) and the things below are easy to get wrong.
+
+## Tool calling and reasoning use vLLM's *native* parsers
+
+Do not write regex-based tool-call extractors for vLLM. vLLM ships:
+
+- `vllm.tool_parsers.ToolParserManager` — 50+ registered parsers (`hermes`, `llama3_json`, `llama4_pythonic`, `mistral`, `qwen3_xml`, `deepseek_v3`, `granite4`, `openai`, `kimi_k2`, `glm45`, …)
+- `vllm.reasoning.ReasoningParserManager` — 25+ registered parsers (`deepseek_r1`, `qwen3`, `mistral`, `gemma4`, …)
+
+Both can be used standalone: instantiate with a tokenizer, call `extract_tool_calls(text, request=None)` / `extract_reasoning(text, request=None)`. The backend stores the parser *classes* on `self.tool_parser_cls` / `self.reasoning_parser_cls` at LoadModel time and instantiates them per request.
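The manager classes behave like name-to-class registries, which is why the backend can resolve the class once at LoadModel time and instantiate it cheaply per request. A toy stand-in for that lookup pattern (this is *not* the vLLM API; the names here are invented for illustration):

```python
class ToyParserManager:
    """Toy name -> parser-class registry, mimicking the manager pattern."""

    _registry = {}

    @classmethod
    def register(cls, name):
        def deco(parser_cls):
            cls._registry[name] = parser_cls
            return parser_cls
        return deco

    @classmethod
    def get(cls, name):
        # Raises KeyError for an unknown parser name
        return cls._registry[name]


@ToyParserManager.register("hermes")
class HermesLikeParser:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


# LoadModel time: resolve the class once and store it...
parser_cls = ToyParserManager.get("hermes")
# ...request time: instantiate with the model's tokenizer
parser = parser_cls(tokenizer="tok")
```

The point of storing the *class* rather than an instance is that some parsers keep per-request state, so a fresh instance per request is the safe default.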
+
+**Selection:** vLLM does *not* auto-detect parsers from model name — neither does the LocalAI backend. The user (or `core/config/hooks_vllm.go`) must pick one and pass it via `Options[]`:
+
+```yaml
+options:
+  - tool_parser:hermes
+  - reasoning_parser:qwen3
+```
+
+Auto-defaults for known model families live in `core/config/parser_defaults.json` and are applied:
+
+- at gallery import time by `core/gallery/importers/vllm.go`
+- at model load time by the `vllm` / `vllm-omni` backend hook in `core/config/hooks_vllm.go`
+
+User-supplied `tool_parser:`/`reasoning_parser:` entries in the config win over defaults — the hook checks for existing entries before appending.
+
+**When to update `parser_defaults.json`:** any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by *family pattern* matched against `normalizeModelID(cfg.Model)` (lowercase, org prefix stripped, `_`→`-`). Patterns are checked **longest-first** — keep `qwen3.5` before `qwen3`, `llama-3.3` before `llama-3`, etc., or the wrong family wins. Add a covering test in `core/config/hooks_test.go`.
+
+**Sister file — `core/config/inference_defaults.json`:** same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by `core/config/inference_defaults.go` and applied by `ApplyInferenceDefaults()`. The schema is `map[string]float64` only — *strings don't fit*, which is why parser defaults needed their own JSON file. The inference file is **auto-generated from unsloth** via `go generate ./core/config/` (see `core/config/gen_inference_defaults/`) — don't hand-edit it; instead update the upstream source or regenerate. Both files share `normalizeModelID()` and the longest-first pattern ordering.
+
+**Constructor compatibility gotcha:** the abstract `ToolParser.__init__` accepts `tools=`, but several concrete parsers (Hermes2ProToolParser, etc.)
+override `__init__` and *only* accept `tokenizer`. Always:
+
+```python
+try:
+    tp = self.tool_parser_cls(self.tokenizer, tools=tools)
+except TypeError:
+    tp = self.tool_parser_cls(self.tokenizer)
+```
+
+## ChatDelta is the streaming contract
+
+The Go side (`core/backend/llm.go`, `pkg/functions/chat_deltas.go`) consumes `Reply.chat_deltas` to assemble the OpenAI response. For tool calls to surface in `chat/completions`, the Python backend **must** populate `Reply.chat_deltas[].tool_calls` with `ToolCallDelta{index, id, name, arguments}`. Returning the raw `...` text in `Reply.message` is *not* enough — the Go regex fallback exists for llama.cpp, not for vllm.
+
+Same story for `reasoning_content` — emit it on `ChatDelta.reasoning_content`, not as part of `content`.
+
+## Message conversion to chat templates
+
+`tokenizer.apply_chat_template()` expects a list of dicts, not proto Messages. The shared helper in `backend/python/common/vllm_utils.py` (`messages_to_dicts`) handles the mapping including:
+
+- `tool_call_id` and `name` for `role="tool"` messages
+- `tool_calls` JSON-string field → parsed Python list for `role="assistant"`
+- `reasoning_content` for thinking models
+
+Pass `tools=json.loads(request.Tools)` and (when `request.Metadata.get("enable_thinking") == "true"`) `enable_thinking=True` to `apply_chat_template`. Wrap in `try/except TypeError` because not every tokenizer template accepts those kwargs.
+
+## CPU support and the SIMD/library minefield
+
+vLLM publishes prebuilt CPU wheels at `https://github.com/vllm-project/vllm/releases/...`. The pin lives in `backend/python/vllm/requirements-cpu-after.txt`.
+
+**Version compatibility — important:** newer vllm CPU wheels (≥ 0.15) declare `torch==2.10.0+cpu` as a hard dep, but `torch==2.10.0` only exists on the PyTorch test channel and pulls in an incompatible `torchvision`. Stay on **`vllm 0.14.1+cpu` + `torch 2.9.1+cpu`** until both upstream catch up.
+Bumping requires verifying that torchvision/torchaudio match.
+
+`requirements-cpu.txt` uses `--extra-index-url https://download.pytorch.org/whl/cpu`. `install.sh` adds `--index-strategy=unsafe-best-match` for the `cpu` profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.
+
+**SIMD baseline:** the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing `vllm.model_executor.models.registry` SIGILLs at `_run_in_subprocess` time during model inspection. There is no runtime flag to disable it. Workarounds:
+
+1. **Run on a host with the right SIMD baseline** (default — fast)
+2. **Build from source** with the `FROM_SOURCE=true` env var. Plumbing exists end-to-end:
+   - `install.sh` hides `requirements-cpu-after.txt`, runs `installRequirements` for the base deps, then clones vllm and runs `VLLM_TARGET_DEVICE=cpu uv pip install --no-deps .`
+   - `backend/Dockerfile.python` declares `ARG FROM_SOURCE` + `ENV FROM_SOURCE`
+   - the `Makefile` `docker-build-backend` macro forwards `--build-arg FROM_SOURCE=$(FROM_SOURCE)` when set
+   - A source build takes 30–50 minutes — too slow for per-PR CI but fine locally.
+
+**Runtime shared libraries:** vLLM's `vllm._C` extension `dlopen`s `libnuma.so.1` at import time. If it is missing, the C extension silently fails and `torch.ops._C_utils.init_cpu_threads_env` is never registered → `EngineCore` crashes on `init_device` with:
+
+```
+AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'
+```
+
+`backend/python/vllm/package.sh` bundles `libnuma.so.1` and `libgomp.so.1` into `${BACKEND}/lib/`, which `libbackend.sh` adds to `LD_LIBRARY_PATH` at run time. The builder stage in `backend/Dockerfile.python` installs `libnuma1`/`libgomp1` so `package.sh` has something to copy. Do *not* assume the production host has these — backend images are `FROM scratch`.
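The normalization and longest-first family matching that `parser_defaults.json` and `inference_defaults.json` rely on (described in the parser-defaults section above) can be sketched as follows. This is a Python approximation of the Go logic; the function names and sample defaults are illustrative, not the real implementation:

```python
def normalize_model_id(model_id):
    """Approximation of normalizeModelID: lowercase, drop org prefix, '_' -> '-'."""
    name = model_id.lower().rsplit("/", 1)[-1]  # "Qwen/Qwen3-4B" -> "qwen3-4b"
    return name.replace("_", "-")


def match_family(model_id, defaults):
    """Try patterns longest-first so e.g. 'qwen3.5' wins over 'qwen3'."""
    name = normalize_model_id(model_id)
    for pattern in sorted(defaults, key=len, reverse=True):
        if pattern in name:
            return defaults[pattern]
    return None


# Illustrative defaults, keyed by family pattern
SAMPLE_DEFAULTS = {
    "qwen3": {"reasoning_parser": "qwen3"},
    "qwen3.5": {"tool_parser": "hermes"},
}
```

Sorting by length instead of relying on file order makes the "keep `qwen3.5` before `qwen3`" rule hold mechanically; the real Go code keys off the JSON file's pattern ordering instead, which is why the file itself must stay longest-first.

```python
match_family("Qwen/Qwen3.5-7B_Instruct", SAMPLE_DEFAULTS)  # -> {"tool_parser": "hermes"}
```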
+
+## Backend hook system (`core/config/backend_hooks.go`)
+
+Per-backend defaults that used to be hardcoded in `ModelConfig.Prepare()` now live in `core/config/hooks_*.go` files and self-register via `init()`:
+
+- `hooks_llamacpp.go` → GGUF metadata parsing, context size, GPU layers, jinja template
+- `hooks_vllm.go` → tool/reasoning parser auto-selection from `parser_defaults.json`
+
+Hook keys:
+
+- `"llama-cpp"`, `"vllm"`, `"vllm-omni"`, … — backend-specific
+- `""` — runs only when `cfg.Backend` is empty (auto-detect case)
+- `"*"` — global catch-all, runs for every backend before specific hooks
+
+Multiple hooks per key are supported and run in registration order. Adding a new backend default:
+
+```go
+// core/config/hooks_.go
+func init() {
+	RegisterBackendHook("", myDefaults)
+}
+
+func myDefaults(cfg *ModelConfig, modelPath string) {
+	// only fill in fields the user didn't set
+}
+```
+
+## The `Messages.ToProto()` fields you need to set
+
+`core/schema/message.go:ToProto()` must serialize:
+
+- `ToolCallID` → `proto.Message.ToolCallId` (for `role="tool"` messages — links the result back to the call)
+- `Reasoning` → `proto.Message.ReasoningContent`
+- `ToolCalls` → `proto.Message.ToolCalls` (JSON-encoded string)
+
+These were originally not serialized and tool-calling conversations broke silently — the C++ llama.cpp backend reads them but always got empty strings. Any new field added to `schema.Message` *and* `proto.Message` needs a matching line in `ToProto()`.
diff --git a/AGENTS.md b/AGENTS.md
index d3d8037e0..207f5e83e 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -10,6 +10,7 @@ This file is an index to detailed topic guides in the `.agents/` directory.
Read
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
+| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
 | [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
 | [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |