mirror of
https://github.com/mudler/LocalAI.git
synced 2026-04-16 12:59:33 -04:00
docs(agents): capture vllm backend lessons + runtime lib packaging (#9333)
New .agents/vllm-backend.md with everything that's easy to get wrong
on the vllm/vllm-omni backends:
- Use vLLM's native ToolParserManager / ReasoningParserManager — do
not write regex-based parsers. Selection is explicit via Options[],
defaults live in core/config/parser_defaults.json.
- Concrete parsers don't always accept the tools= kwarg the abstract
base declares; try/except TypeError is mandatory.
- ChatDelta.tool_calls is the contract — Reply.message text alone
won't surface tool calls in /v1/chat/completions.
- vllm version pin trap: 0.14.1+cpu pairs with torch 2.9.1+cpu.
Newer wheels declare torch==2.10.0+cpu which only exists on the
PyTorch test channel and pulls an incompatible torchvision.
- SIMD baseline: prebuilt wheel needs AVX-512 VNNI/BF16. SIGILL
symptom + FROM_SOURCE=true escape hatch are documented.
- libnuma.so.1 + libgomp.so.1 must be bundled because vllm._C
silently fails to register torch ops if they're missing.
- backend_hooks system: hooks_llamacpp / hooks_vllm split + the
'*' / '' / named-backend keys.
- ToProto() must serialize ToolCallID and Reasoning — easy to miss
when adding fields to schema.Message.
Also extended .agents/adding-backends.md with a generic 'Bundling
runtime shared libraries' section: Dockerfile.python is FROM scratch,
package.sh is the mechanism, libbackend.sh adds ${EDIR}/lib to
LD_LIBRARY_PATH, and how to verify packaging without trusting the
host (extract image, boot in fresh ubuntu container).
Index in AGENTS.md updated.
This commit is contained in:
committed by
GitHub
parent
d67623230f
commit
daa0272f2e
@@ -129,6 +129,30 @@ After adding a new backend, verify:
|
||||
- [ ] No Makefile syntax errors (check with linter)
|
||||
- [ ] Follows the same pattern as similar backends (e.g., if it's a transcription backend, follow `faster-whisper` pattern)
|
||||
|
||||
## Bundling runtime shared libraries (`package.sh`)
|
||||
|
||||
The final `Dockerfile.python` stage is `FROM scratch` — there is no system `libc`, no `apt`, no fallback library path. Only files explicitly copied from the builder stage end up in the backend image. That means any runtime `dlopen` your backend (or its Python deps) needs **must** be packaged into `${BACKEND}/lib/`.
|
||||
|
||||
Pattern:
|
||||
|
||||
1. Make sure the library is installed in the builder stage of `backend/Dockerfile.python` (add it to the top-level `apt-get install`).
|
||||
2. Drop a `package.sh` in your backend directory that copies the library — and its soname symlinks — into `$(dirname $0)/lib`. See `backend/python/vllm/package.sh` for a reference implementation that walks `/usr/lib/x86_64-linux-gnu`, `/usr/lib/aarch64-linux-gnu`, etc.
|
||||
3. `Dockerfile.python` already runs `package.sh` automatically if it exists, after `package-gpu-libs.sh`.
|
||||
4. `libbackend.sh` automatically prepends `${EDIR}/lib` to `LD_LIBRARY_PATH` at run time, so anything packaged this way is found by `dlopen`.
|
||||
|
||||
How to find missing libs: when a Python module silently fails to register torch ops or you see `AttributeError: '_OpNamespace' '...' object has no attribute '...'`, run the backend image's Python with `LD_DEBUG=libs` to see which `dlopen` failed. The filename in the error message (e.g. `libnuma.so.1`) is what you need to package.
|
||||
|
||||
To verify packaging works without trusting the host:
|
||||
|
||||
```bash
|
||||
make docker-build-<backend>
|
||||
CID=$(docker create --entrypoint=/run.sh local-ai-backend:<backend>)
|
||||
docker cp $CID:/lib /tmp/check && docker rm $CID
|
||||
ls /tmp/check # expect the bundled .so files + symlinks
|
||||
```
|
||||
|
||||
Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.
|
||||
|
||||
## 6. Example: Adding a Python Backend
|
||||
|
||||
For reference, when `moonshine` was added:
|
||||
|
||||
115
.agents/vllm-backend.md
Normal file
115
.agents/vllm-backend.md
Normal file
@@ -0,0 +1,115 @@
|
||||
# Working on the vLLM Backend
|
||||
|
||||
The vLLM backend lives at `backend/python/vllm/backend.py` (async gRPC) and the multimodal variant at `backend/python/vllm-omni/backend.py` (sync gRPC). Both wrap vLLM's `AsyncLLMEngine` / `Omni` and translate the LocalAI gRPC `PredictOptions` into vLLM `SamplingParams` + outputs into `Reply.chat_deltas`.
|
||||
|
||||
This file captures the non-obvious bits — most of the bring-up was a single PR (`feat/vllm-parity`) and the things below are easy to get wrong.
|
||||
|
||||
## Tool calling and reasoning use vLLM's *native* parsers
|
||||
|
||||
Do not write regex-based tool-call extractors for vLLM. vLLM ships:
|
||||
|
||||
- `vllm.tool_parsers.ToolParserManager` — 50+ registered parsers (`hermes`, `llama3_json`, `llama4_pythonic`, `mistral`, `qwen3_xml`, `deepseek_v3`, `granite4`, `openai`, `kimi_k2`, `glm45`, …)
|
||||
- `vllm.reasoning.ReasoningParserManager` — 25+ registered parsers (`deepseek_r1`, `qwen3`, `mistral`, `gemma4`, …)
|
||||
|
||||
Both can be used standalone: instantiate with a tokenizer, call `extract_tool_calls(text, request=None)` / `extract_reasoning(text, request=None)`. The backend stores the parser *classes* on `self.tool_parser_cls` / `self.reasoning_parser_cls` at LoadModel time and instantiates them per request.
|
||||
|
||||
**Selection:** vLLM does *not* auto-detect parsers from model name — neither does the LocalAI backend. The user (or `core/config/hooks_vllm.go`) must pick one and pass it via `Options[]`:
|
||||
|
||||
```yaml
|
||||
options:
|
||||
- tool_parser:hermes
|
||||
- reasoning_parser:qwen3
|
||||
```
|
||||
|
||||
Auto-defaults for known model families live in `core/config/parser_defaults.json` and are applied:
|
||||
- at gallery import time by `core/gallery/importers/vllm.go`
|
||||
- at model load time by the `vllm` / `vllm-omni` backend hook in `core/config/hooks_vllm.go`
|
||||
|
||||
User-supplied `tool_parser:`/`reasoning_parser:` in the config wins over defaults — the hook checks for existing entries before appending.
|
||||
|
||||
**When to update `parser_defaults.json`:** any time vLLM ships a new tool or reasoning parser, or you onboard a new model family that LocalAI users will pull from HuggingFace. The file is keyed by *family pattern* matched against `normalizeModelID(cfg.Model)` (lowercase, org-prefix stripped, `_`→`-`). Patterns are checked **longest-first** — keep `qwen3.5` before `qwen3`, `llama-3.3` before `llama-3`, etc., or the wrong family wins. Add a covering test in `core/config/hooks_test.go`.
|
||||
|
||||
**Sister file — `core/config/inference_defaults.json`:** same pattern but for sampling parameters (temperature, top_p, top_k, min_p, repeat_penalty, presence_penalty). Loaded by `core/config/inference_defaults.go` and applied by `ApplyInferenceDefaults()`. The schema is `map[string]float64` only — *strings don't fit*, which is why parser defaults needed their own JSON file. The inference file is **auto-generated from unsloth** via `go generate ./core/config/` (see `core/config/gen_inference_defaults/`) — don't hand-edit it; instead update the upstream source or regenerate. Both files share `normalizeModelID()` and the longest-first pattern ordering.
|
||||
|
||||
**Constructor compatibility gotcha:** the abstract `ToolParser.__init__` accepts `tools=`, but several concrete parsers (Hermes2ProToolParser, etc.) override `__init__` and *only* accept `tokenizer`. Always:
|
||||
|
||||
```python
|
||||
try:
|
||||
tp = self.tool_parser_cls(self.tokenizer, tools=tools)
|
||||
except TypeError:
|
||||
tp = self.tool_parser_cls(self.tokenizer)
|
||||
```
|
||||
|
||||
## ChatDelta is the streaming contract
|
||||
|
||||
The Go side (`core/backend/llm.go`, `pkg/functions/chat_deltas.go`) consumes `Reply.chat_deltas` to assemble the OpenAI response. For tool calls to surface in `chat/completions`, the Python backend **must** populate `Reply.chat_deltas[].tool_calls` with `ToolCallDelta{index, id, name, arguments}`. Returning the raw `<tool_call>...</tool_call>` text in `Reply.message` is *not* enough — the Go regex fallback exists for llama.cpp, not for vllm.
|
||||
|
||||
Same story for `reasoning_content` — emit it on `ChatDelta.reasoning_content`, not as part of `content`.
|
||||
|
||||
## Message conversion to chat templates
|
||||
|
||||
`tokenizer.apply_chat_template()` expects a list of dicts, not proto Messages. The shared helper in `backend/python/common/vllm_utils.py` (`messages_to_dicts`) handles the mapping including:
|
||||
|
||||
- `tool_call_id` and `name` for `role="tool"` messages
|
||||
- `tool_calls` JSON-string field → parsed Python list for `role="assistant"`
|
||||
- `reasoning_content` for thinking models
|
||||
|
||||
Pass `tools=json.loads(request.Tools)` and (when `request.Metadata.get("enable_thinking") == "true"`) `enable_thinking=True` to `apply_chat_template`. Wrap in `try/except TypeError` because not every tokenizer template accepts those kwargs.
|
||||
|
||||
## CPU support and the SIMD/library minefield
|
||||
|
||||
vLLM publishes prebuilt CPU wheels at `https://github.com/vllm-project/vllm/releases/...`. The pin lives in `backend/python/vllm/requirements-cpu-after.txt`.
|
||||
|
||||
**Version compatibility — important:** newer vllm CPU wheels (≥ 0.15) declare `torch==2.10.0+cpu` as a hard dep, but `torch==2.10.0` only exists on the PyTorch test channel and pulls in an incompatible `torchvision`. Stay on **`vllm 0.14.1+cpu` + `torch 2.9.1+cpu`** until both upstream catch up. Bumping requires verifying torchvision/torchaudio match.
|
||||
|
||||
`requirements-cpu.txt` uses `--extra-index-url https://download.pytorch.org/whl/cpu`. `install.sh` adds `--index-strategy=unsafe-best-match` for the `cpu` profile so uv resolves transformers/vllm from PyPI while pulling torch from the PyTorch index.
|
||||
|
||||
**SIMD baseline:** the prebuilt CPU wheel is compiled with AVX-512 VNNI/BF16. On a CPU without those instructions, importing `vllm.model_executor.models.registry` SIGILLs at `_run_in_subprocess` time during model inspection. There is no runtime flag to disable it. Workarounds:
|
||||
|
||||
1. **Run on a host with the right SIMD baseline** (default — fast)
|
||||
2. **Build from source** with `FROM_SOURCE=true` env var. Plumbing exists end-to-end:
|
||||
- `install.sh` hides `requirements-cpu-after.txt`, runs `installRequirements` for the base deps, then clones vllm and `VLLM_TARGET_DEVICE=cpu uv pip install --no-deps .`
|
||||
- `backend/Dockerfile.python` declares `ARG FROM_SOURCE` + `ENV FROM_SOURCE`
|
||||
- `Makefile` `docker-build-backend` macro forwards `--build-arg FROM_SOURCE=$(FROM_SOURCE)` when set
|
||||
- Source build takes 30–50 minutes — too slow for per-PR CI but fine for local.
|
||||
|
||||
**Runtime shared libraries:** vLLM's `vllm._C` extension `dlopen`s `libnuma.so.1` at import time. If missing, the C extension silently fails and `torch.ops._C_utils.init_cpu_threads_env` is never registered → `EngineCore` crashes on `init_device` with:
|
||||
|
||||
```
|
||||
AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env'
|
||||
```
|
||||
|
||||
`backend/python/vllm/package.sh` bundles `libnuma.so.1` and `libgomp.so.1` into `${BACKEND}/lib/`, which `libbackend.sh` adds to `LD_LIBRARY_PATH` at run time. The builder stage in `backend/Dockerfile.python` installs `libnuma1`/`libgomp1` so package.sh has something to copy. Do *not* assume the production host has these — backend images are `FROM scratch`.
|
||||
|
||||
## Backend hook system (`core/config/backend_hooks.go`)
|
||||
|
||||
Per-backend defaults that used to be hardcoded in `ModelConfig.Prepare()` now live in `core/config/hooks_*.go` files and self-register via `init()`:
|
||||
|
||||
- `hooks_llamacpp.go` → GGUF metadata parsing, context size, GPU layers, jinja template
|
||||
- `hooks_vllm.go` → tool/reasoning parser auto-selection from `parser_defaults.json`
|
||||
|
||||
Hook keys:
|
||||
- `"llama-cpp"`, `"vllm"`, `"vllm-omni"`, … — backend-specific
|
||||
- `""` — runs only when `cfg.Backend` is empty (auto-detect case)
|
||||
- `"*"` — global catch-all, runs for every backend before specific hooks
|
||||
|
||||
Multiple hooks per key are supported and run in registration order. Adding a new backend default:
|
||||
|
||||
```go
|
||||
// core/config/hooks_<backend>.go
|
||||
func init() {
|
||||
RegisterBackendHook("<backend>", myDefaults)
|
||||
}
|
||||
func myDefaults(cfg *ModelConfig, modelPath string) {
|
||||
// only fill in fields the user didn't set
|
||||
}
|
||||
```
|
||||
|
||||
## The `Messages.ToProto()` fields you need to set
|
||||
|
||||
`core/schema/message.go:ToProto()` must serialize:
|
||||
- `ToolCallID` → `proto.Message.ToolCallId` (for `role="tool"` messages — links result back to the call)
|
||||
- `Reasoning` → `proto.Message.ReasoningContent`
|
||||
- `ToolCalls` → `proto.Message.ToolCalls` (JSON-encoded string)
|
||||
|
||||
These were originally not serialized and tool-calling conversations broke silently — the C++ llama.cpp backend reads them but always got empty strings. Any new field added to `schema.Message` *and* `proto.Message` needs a matching line in `ToProto()`.
|
||||
@@ -10,6 +10,7 @@ This file is an index to detailed topic guides in the `.agents/` directory. Read
|
||||
| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
|
||||
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
|
||||
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
|
||||
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
|
||||
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
|
||||
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
|
||||
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
|
||||
|
||||
Reference in New Issue
Block a user