mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-11 18:27:32 -04:00
docs(dllm): backend documentation and agents topic guide
User docs: dllm section in text-generation (setup, eb_* options table,
n_predict canvas rounding, enable_thinking metadata, honest GB10 throughput
numbers).

Agents guide: .agents/dllm-backend.md covering the purego C-ABI contract,
serialization rules, template provenance, test layers, and known limitations.

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
134
.agents/dllm-backend.md
Normal file
@@ -0,0 +1,134 @@
# Working on the dllm Backend

`mudler/dllm.cpp` is a standalone C++/ggml engine for DiffusionGemma
block-diffusion models. LocalAI wraps it with a **pure-Go** backend at
`backend/go/dllm/` that dlopens `libdllm.so` via purego (ebitengine/purego) -
NOT cgo, and NOT a C++ grpc-server fork. The Go side owns chat templating
(gemma4 renderer) and output parsing (gemma4 streaming parser) and implements
the rich gRPC interface (`PredictRich`/`PredictStreamRich`, ChatDelta replies).

> NOTE: github.com/mudler/dllm.cpp is still **private** (publishing is
> planned). Until then the Makefile's anonymous clone fails; use the local-dev
> symlink shortcut documented at the top of `backend/go/dllm/Makefile`
> (symlink an out-of-tree `build/libdllm.so` into the backend dir and skip the
> clone), or a git credential helper with repo access.

## Pin

`backend/go/dllm/Makefile` pins `DLLM_VERSION?=<sha>` at the top
(whisper / parakeet-cpp / ds4 convention). The bump-deps bot
(`.github/workflows/bump_deps.yaml`) tracks `mudler/dllm.cpp` `main` and
rewrites that variable. After a manual bump, run `make -C backend/go/dllm purge &&
make -C backend/go/dllm`: the purge is required because the clone step is keyed
on the directory existing, not on the sha.

## C-ABI and the serialization contract

The binding covers the 9-symbol flat C-ABI from dllm.cpp's
`include/dllm_capi.h` (ABI v1; `main.go` hard-fails on a version mismatch):
`abi_version, load, free, last_error, free_string, tokenize_json, generate,
generate_stream, cancel`. Contract points the Go wiring encodes (the `capi.go`
header comment has the full list):

- **One ctx = one concurrent generate/tokenize.** A per-model worker
  goroutine (`Dllm.jobs` in `dllm.go`) owns ALL C calls, making serialization
  structural rather than a matter of lock discipline.
- **`dllm_capi_cancel` is the ONE exception**: it only flips an atomic and may
  be called from any goroutine mid-generate, so `Dllm.Cancel` bypasses the
  worker queue. The flag resets at the start of each generate, so a watchdog
  racing a new generate must re-issue cancel.
- **`last_error` is a borrowed pointer** and must only be read AFTER the
  failing call returned (never while a generate is in flight on the same ctx).
- **Free vs in-flight requests**: requests hold `genMu.RLock` for their full
  duration; `Free` takes the write lock, so it only runs when nothing is in
  flight, then drains and closes the worker. Post-Free requests get a clean
  "model not loaded" error.
- `tokenize_json`/`generate` return malloc'd `char*` (bound as `uintptr`,
  copied, then `dllm_capi_free_string`d); opts/params JSON must be a FLAT
  object of scalars (`buildOptsJSON` rejects anything else).
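
The first two contract points can be sketched in plain Go. This is a minimal illustration, assuming a stand-in `fakeEngine` in place of the real purego binding: one goroutine drains a job channel and makes every engine call, while `Cancel` only flips an atomic flag and skips the queue. The names `fakeEngine`, `worker`, `job`, and `result` are hypothetical and are not taken from `dllm.go`.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// fakeEngine stands in for the C context; it is NOT the real binding.
type fakeEngine struct {
	cancelled atomic.Bool
}

// generate mimics a blocking engine call that polls the cancel flag per step.
// The flag is reset at the start of each call, as the contract describes.
func (e *fakeEngine) generate(steps int) (string, error) {
	e.cancelled.Store(false)
	for i := 0; i < steps; i++ {
		if e.cancelled.Load() {
			return "", errors.New("cancelled")
		}
	}
	return fmt.Sprintf("ran %d steps", steps), nil
}

type result struct {
	text string
	err  error
}

type job struct {
	steps int
	reply chan result
}

// worker owns the engine: every generate call runs on a single goroutine,
// so callers never need a lock around the engine itself.
type worker struct {
	eng  *fakeEngine
	jobs chan job
}

func newWorker() *worker {
	w := &worker{eng: &fakeEngine{}, jobs: make(chan job)}
	go func() {
		for j := range w.jobs {
			text, err := w.eng.generate(j.steps)
			j.reply <- result{text: text, err: err}
		}
	}()
	return w
}

// Generate enqueues a job and waits for the worker's reply.
func (w *worker) Generate(steps int) (string, error) {
	reply := make(chan result, 1)
	w.jobs <- job{steps: steps, reply: reply}
	r := <-reply
	return r.text, r.err
}

// Cancel bypasses the queue: it only flips the atomic flag.
func (w *worker) Cancel() { w.eng.cancelled.Store(true) }

func main() {
	w := newWorker()
	w.Cancel() // issued before the generate starts, so the reset discards it
	out, err := w.Generate(3)
	fmt.Println(out, err)
}
```

Because the sketch resets the flag when a generate begins, a cancel issued just before the call is lost, which is why the contract asks a racing watchdog to re-issue it.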

## Wire shape

| RPC | Implementation |
|---|---|
| LoadModel | `dllm_capi_load` (params: `n_gpu_layers`, `n_threads`, `ctx_len`); `Options[]` parsed into per-request gen opts (`eb_*`, `blocks`, `kv_cache`) by `parseModelGenOpts` |
| PredictRich | render (if templated) → `dllm_capi_generate` → parse → ONE Reply with aggregated ChatDeltas + legacy `Message` bytes |
| PredictStreamRich | `dllm_capi_generate_stream`; per committed diffusion block → UTF-8 holdback → parser.Feed → one Reply per non-empty delta batch (channel closed by the CALLER, per `pkg/grpc/interface.go`) |
| Predict / PredictStream | Legacy paths, delegate to the rich pair (legacy stream INVERTS channel ownership: the impl closes) |
| TokenizeString | `dllm_capi_tokenize_json` (C side prepends BOS per `vocab.add_bos`) |
| Cancel | `dllm_capi_cancel`; currently INERT in practice - the gRPC server does not hand the request/stream context to backends, so client disconnects never reach it (plumbing is future work) |

`n_threads` and `ctx_len` are accepted-but-ignored by the engine at the
current pin (the context bound comes from GGUF `n_ctx_train`); they are sent
for forward compatibility.
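
To make the `Options[]` handling concrete, here is a hedged Go sketch of turning `key:value` entries into the flat, scalar-only JSON object the C side expects. `parseOptions` is a hypothetical helper written for this guide; it is not the real `parseModelGenOpts` or `buildOptsJSON`, and the type-guessing rule (integer, then float, then string) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// parseOptions is a hypothetical helper, not LocalAI's parseModelGenOpts.
// It splits each "key:value" entry on the first colon and stores a scalar:
// an integer if the value parses as one, else a float, else the raw string.
func parseOptions(opts []string) (map[string]any, error) {
	out := make(map[string]any, len(opts))
	for _, o := range opts {
		key, val, ok := strings.Cut(o, ":")
		if !ok || key == "" || val == "" {
			return nil, fmt.Errorf("malformed option %q (want key:value)", o)
		}
		if n, err := strconv.ParseInt(val, 10, 64); err == nil {
			out[key] = n
		} else if f, err := strconv.ParseFloat(val, 64); err == nil {
			out[key] = f
		} else {
			out[key] = val
		}
	}
	return out, nil
}

func main() {
	m, err := parseOptions([]string{"eb_max_steps:24", "eb_t_min:0.4", "kv_cache:auto"})
	if err != nil {
		panic(err)
	}
	// encoding/json emits map keys in sorted order, so the output is stable.
	b, err := json.Marshal(m)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Since every value is a scalar by construction, the resulting object is flat; a real implementation would additionally validate key names and value ranges.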

## Renderer / parser (the templated chat path)

With `use_tokenizer_template` + raw Messages, the backend owns templating and
parsing (the ds4 precedent, but in Go):

- `gemma4_renderer.go` - `RenderGemma4(msgs, toolsJSON, enableThinking,
  addGenerationPrompt)`. The file embeds the FULL `tokenizer.chat_template`
  jinja (17466 bytes, md5 `8c34cf93c7a7815b3fdb300a009c4c17`) extracted
  verbatim from `diffusiongemma-26B-A4B-it-BF16.gguf` via gguf-py - e.g.
  `python scripts/dump_gguf.py model.gguf | grep -A400 chat_template` in the
  dllm.cpp checkout - as a numbered comment block; every Go rule cites its
  "tpl L<n>" line. Re-verify the md5 before blaming the renderer for a
  mismatch with a new GGUF. **BOS exception**: the template emits
  `{{- bos_token -}}` but the renderer deliberately does NOT - dllm.cpp's
  `run_generate` tokenizes with `prepend_bos = vocab.add_bos` (true for
  gemma4), so a literal `<bos>` would double it.
- `gemma4_parser.go` - streaming state machine turning raw model text
  (fragments can split anywhere, including mid-marker) into ChatDeltas:
  thought channels → `reasoning_content`, `<|tool_call>call:name{...}` →
  ToolCallDelta, `<turn|>` → done. Marker grammar cross-checked against vLLM
  PR #45163's gemma4 tool/reasoning parsers. Malformed payloads are re-emitted
  raw as content, never dropped.
- Thinking is **opt-in** for this family (`Metadata["enable_thinking"]`,
  default OFF - the inverse of ds4): the template gates every thinking branch
  on `enable_thinking`, and the no-thinking render pre-closes an empty thought
  channel, so the parser always starts in content state.
- **UTF-8 boundary holdback** (`splitValidUTF8` in `dllm.go`): per-block
  detokenization can split a multi-byte character across block boundaries, and
  grpc-go refuses to marshal invalid UTF-8 in proto3 strings. An incomplete
  trailing sequence (at most 3 bytes) is carried into the next block; genuinely
  undecodable bytes become U+FFFD.

Without `use_tokenizer_template`, the prompt passes through verbatim and the
output is NOT gemma4-parsed (plain content, like any non-autoparsing backend).
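
The UTF-8 boundary holdback described in this section can be illustrated with the standard library alone. The sketch below is an assumption about how such a split could work, not a copy of `splitValidUTF8`: `splitHoldback` and the `holdback` type are hypothetical names, and replacing invalid bytes with `strings.ToValidUTF8` (which collapses each run of invalid bytes into one U+FFFD) is a choice made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitHoldback returns the prefix that is safe to emit and the trailing
// bytes (at most 3) of an unfinished multi-byte sequence to carry forward.
func splitHoldback(b []byte) (emit, carry []byte) {
	// Look back at most 3 bytes for the lead byte of the last sequence.
	for i := len(b) - 1; i >= 0 && i >= len(b)-3; i-- {
		if utf8.RuneStart(b[i]) {
			if !utf8.FullRune(b[i:]) {
				return b[:i], b[i:] // lead byte present, continuation missing
			}
			break // the last sequence is complete (or invalid, handled later)
		}
	}
	return b, nil
}

// holdback carries incomplete trailing bytes from one block to the next.
type holdback struct{ carry []byte }

// Feed prepends the carried bytes, emits the valid prefix, and keeps any
// incomplete tail. Undecodable bytes in the emitted part become U+FFFD.
func (h *holdback) Feed(block []byte) string {
	data := append(h.carry, block...)
	emit, carry := splitHoldback(data)
	h.carry = append([]byte(nil), carry...)
	return strings.ToValidUTF8(string(emit), "\uFFFD")
}

func main() {
	h := &holdback{}
	// "é" is 0xC3 0xA9; the block boundary falls between the two bytes.
	fmt.Printf("%q\n", h.Feed([]byte("caf\xc3")))
	fmt.Printf("%q\n", h.Feed([]byte("\xa9!")))
}
```

The first call emits only `caf` and holds `0xC3` back; the second call completes the character, so every emitted string is valid UTF-8 on its own.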

## Tests

| Layer | Gate | What |
|---|---|---|
| `backend/go/dllm/*_test.go` (renderer/parser/wiring) | none - run in plain `go test ./backend/go/dllm/...` | Ginkgo specs over a fake `generator` seam; canonical renderer fixtures from transformers' `test_modeling_diffusion_gemma.py`, parser tables from the vLLM gemma4 parsers |
| `backend/go/dllm/dllm_test.go` C-ABI smoke | `DLLM_TEST_LIBRARY` + `DLLM_TEST_TINY_MODEL` (dllm.cpp's `tests/fixtures/tiny_with_vocab.gguf`); Skips when unset | Drives the real `libdllm.so`: ABI check, load, tokenize `[2,18]`, deterministic generate, cancel |
| `tests/e2e-backends/dllm_test.go` | `BACKEND_TEST_DLLM=1` + `BACKEND_BINARY` (packaged run.sh) + `BACKEND_TEST_MODEL_FILE` (tiny fixture) | Templated chat round trip (Messages + UseTokenizerTemplate) over the real gRPC binary, non-streaming + streaming |
| Real-model e2e | `BACKEND_TEST_DLLM_REAL_MODEL_FILE` (26B BF16, ~50 GB) + `BACKEND_TEST_DLLM_REAL_GPU_LAYERS` | CUDA-13-class hardware only |

Tool-call e2e is deliberately absent from the tiny-model spec: the fixture has
random weights and cannot be coaxed into emitting tool markup; the unit tables
carry that coverage.

## Build matrix

`cpu-dllm` (amd64 + arm64), `cuda13-dllm` (amd64 + arm64), and
`cuda13-nvidia-l4t-arm64-dllm` (Jetson / DGX Spark GB10), via
`.github/backend-matrix.yml`. No darwin/Metal. CUDA builds forward
`-DDLLM_CUDA=ON` (dllm.cpp gates ggml's CUDA behind its own flag - a bare
`-DGGML_CUDA=ON` is overridden by the cache FORCE). `libdllm.so` is
self-contained (ggml statically absorbed, PIC), so packaging only ships the
one .so plus the usual ldd walk.

## Known limitations

- **Cancel is unwired**: nothing calls `Dllm.Cancel` on client disconnect
  until the gRPC server plumbs the request context through to backends.
- **Throughput**: ~0.15 tok/s on the 26B at default settings (GB10) - every
  denoise step recomputes the full prompt+canvas. The upstream prefix-KV
  cache (dllm.cpp P3) is the fix; `kv_cache:on` errors until it lands
  (`auto`/`off` are accepted no-ops).
- **Repo privacy**: see the note at the top - CI clone of dllm.cpp needs the
  repo published (or credentials) before the backend images can build.
- Engine spec/validation references: dllm.cpp `docs/validation.md` and
  LocalAI `docs/superpowers/specs/2026-06-10-dllm-cpp-design.md`.

@@ -26,6 +26,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
| [.agents/dllm-backend.md](.agents/dllm-backend.md) | Working on the dllm backend (DiffusionGemma block-diffusion) - purego C-ABI binding, per-ctx serialization contract, gemma4 renderer/parser, gated test layers |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |

@@ -655,6 +655,123 @@ The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` fl
- [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)

### dllm (DiffusionGemma block-diffusion)

[dllm.cpp](https://github.com/mudler/dllm.cpp) is a standalone C++/ggml engine for **DiffusionGemma** block-diffusion language models (GGUF weights). Instead of sampling one token at a time, generation works on fixed-size token **canvases** (256 tokens for the published model): each canvas is iteratively denoised with the Entropy-Bound (EB) sampler, committed as a whole block, and committed blocks feed back as prompt for the next canvas. LocalAI wraps the engine with a native Go backend (`dllm`) that also owns chat templating and output parsing: the model's thought channels and tool calls stream natively as `reasoning_content` and `tool_calls` deltas, with no jinja template involved.

{{% notice note %}}

This backend is **experimental**, and the engine does not yet have a prompt-KV prefix cache: every denoise step recomputes the full prompt+canvas forward pass, so throughput is low (~0.15 tok/s at default settings on a single GB10 GPU) and drops further as the context fills up. The prefix cache is the planned fix in upstream dllm.cpp.

{{% /notice %}}

#### Features

- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}}) - tool calls are parsed natively by the backend (gemma4 `<|tool_call>` markers), not by LocalAI's grammar/regex fallback
- Reasoning - opt-in thinking streams as `reasoning_content` (see below)

#### Supported platforms

| Flavor | Hardware |
|---|---|
| `cpu-dllm` | CPU (amd64 + arm64) - functional but very slow on the 26B model; mainly useful for wiring tests |
| `cuda13-dllm` | NVIDIA CUDA 13 (amd64 + arm64) |
| `cuda13-nvidia-l4t-arm64-dllm` | NVIDIA L4T (Jetson / DGX Spark GB10) |

macOS/Metal is not available yet.

#### Setup

The easiest path is the model gallery; the entry installs the backend and the model together:

```bash
local-ai models install diffusiongemma-26b-a4b-it
```

Or configure it manually with a YAML file pointing at the GGUF (BF16 is the only published file the engine's validation is calibrated for; the model card flags quantized MoE exports as problematic):

```yaml
name: diffusiongemma
backend: dllm
parameters:
  model: diffusiongemma-26B-A4B-it-BF16.gguf
context_size: 4096
stopwords:
- <turn|>
# The backend parses tool calls natively; keep LocalAI's generated tool
# grammar from overriding that pipeline.
function:
  grammar:
    disable: true
template:
  use_tokenizer_template: true
```

`use_tokenizer_template: true` is what routes chat requests through the backend's native gemma4 renderer/parser (messages and tools in, `content`/`reasoning_content`/`tool_calls` out). Without it, your own prompt template output is passed to the engine verbatim and the raw model text comes back as plain content.

#### Backend options

Model-level generation options go in the `options:` array (format: `key:value`), like other backends:

```yaml
options:
- eb_max_steps:24
- kv_cache:auto
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `blocks` | integer | unset | Generation budget in whole diffusion canvases (`blocks * canvas_length` tokens, 256 per canvas for the published model). Must be >= 1. When both `blocks` and a token budget are present, `blocks` wins. |
| `kv_cache` | string | `auto` | One of `auto`, `off`, `on`. The engine has no KV cache yet, so `auto` and `off` are accepted no-ops; `kv_cache:on` fails the request until the prefix-KV cache lands upstream. |
| `eb_max_steps` | integer | 48 | Maximum denoise steps per canvas. Blocks exit early once stable **and** confident, so this is a ceiling, not a fixed cost. Lower values are faster but can degrade quality. |
| `eb_t_min` | float | 0.4 | Lower bound of the linear temperature schedule. |
| `eb_t_max` | float | 0.8 | Upper bound of the linear temperature schedule: `t = t_min + (t_max - t_min) * cur_step/max_steps`, with `cur_step` counting down, so denoising anneals from `t_max` toward `t_min`. |
| `eb_entropy_bound` | float | 0.1 | Per-step acceptance budget: canvas positions are sorted by entropy (ascending) and accepted while the cumulative entropy, minus the position's own, stays at or below the bound. Higher accepts more tokens per step (faster, riskier). |
| `eb_stability_threshold` | integer | 1 | Consecutive identical argmax canvases required before a block counts as stable (`0` = always stable; at `1` the earliest exit is the 2nd identical step). |
| `eb_confidence_threshold` | float | 0.005 | Mean-entropy ceiling for the "confident" half of the early-exit test; a block stops denoising only when it is both stable and below this. |

Defaults for the `eb_*` knobs come from the GGUF's `diffusion.*` metadata when present, falling back to the engine defaults shown (DiffusionGemma's canonical values). The published `diffusiongemma-26B-A4B-it` GGUF carries only `diffusion.canvas_length`, so the fallbacks above are what you actually get.
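
The two formulas in the table can be sketched in Go to show how the knobs interact. This is an illustration of the table's wording, not the engine's code: `temperature` and `acceptByEntropy` are hypothetical names, and the acceptance loop reflects one reading of "cumulative entropy, minus the position's own" (the entropy accumulated before the current position is compared with the bound).

```go
package main

import (
	"fmt"
	"sort"
)

// temperature follows the schedule quoted in the table; curStep counts down
// from maxSteps, so the value anneals from tMax toward tMin.
func temperature(tMin, tMax float64, curStep, maxSteps int) float64 {
	return tMin + (tMax-tMin)*float64(curStep)/float64(maxSteps)
}

// acceptByEntropy returns the canvas positions accepted in one step: visit
// positions in ascending entropy and accept while the entropy accumulated
// BEFORE the current position stays at or below the bound.
func acceptByEntropy(entropy []float64, bound float64) []int {
	order := make([]int, len(entropy))
	for i := range order {
		order[i] = i
	}
	sort.SliceStable(order, func(a, b int) bool {
		return entropy[order[a]] < entropy[order[b]]
	})
	var accepted []int
	cum := 0.0
	for _, pos := range order {
		if cum > bound {
			break
		}
		accepted = append(accepted, pos)
		cum += entropy[pos]
	}
	return accepted
}

func main() {
	// Defaults from the table: eb_t_min=0.4, eb_t_max=0.8, eb_max_steps=48.
	for _, step := range []int{48, 24, 0} {
		fmt.Printf("cur_step=%2d t=%.2f\n", step, temperature(0.4, 0.8, step, 48))
	}
	// Hypothetical per-position entropies for a 4-token canvas.
	fmt.Println(acceptByEntropy([]float64{0.5, 0.01, 0.2, 0.03}, 0.1))
}
```

Under this reading the lowest-entropy position is always accepted, so each denoise step makes progress even with a very small `eb_entropy_bound`.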

Per-request parameters: `max_tokens` maps to the engine's `n_predict` (omitted: engine default of 256), and a **positive** `seed` gives deterministic output (absent, zero or negative = a fresh random seed per call). Autoregressive sampling fields (`temperature`, `top_p`, `top_k`, ...) are **not used**: the EB sampler's own temperature schedule (`eb_t_min`/`eb_t_max`) replaces them.

{{% notice note %}}

**`max_tokens` rounds up to whole canvases.** The scheduler always commits whole canvases, so the token budget rounds **up** to `ceil(n_predict / canvas_length)` blocks and the completion may run slightly past the requested `max_tokens` (canonical DiffusionGemma behavior). Generation can still end earlier when the model emits an end-of-turn token, which finalizes the canvas.

{{% /notice %}}
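
As a worked example of the rounding rule, the following Go sketch computes `ceil(n_predict / canvas_length)` with integer arithmetic for a 256-token canvas. `blocksFor` is a hypothetical helper, and returning 0 for non-positive inputs is an assumption of this sketch (the engine itself falls back to its own default when `max_tokens` is omitted).

```go
package main

import "fmt"

// blocksFor returns how many whole canvases a token budget rounds up to,
// i.e. ceil(nPredict / canvasLength) in integer arithmetic.
func blocksFor(nPredict, canvasLength int) int {
	if nPredict <= 0 || canvasLength <= 0 {
		return 0 // sketch-only convention for invalid input
	}
	return (nPredict + canvasLength - 1) / canvasLength
}

func main() {
	const canvas = 256 // canvas length of the published model
	for _, n := range []int{100, 256, 300, 1000} {
		b := blocksFor(n, canvas)
		fmt.Printf("max_tokens=%d -> %d block(s), up to %d tokens\n", n, b, b*canvas)
	}
}
```

A request with `max_tokens: 300` therefore budgets two canvases, so up to 512 tokens may be generated unless an end-of-turn token finishes the canvas earlier.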

#### Thinking

DiffusionGemma's chat template makes thinking **opt-in** (the default render pre-closes an empty thought channel), so the backend defaults to thinking OFF - the opposite of most reasoning models. Enable it per request via the `metadata` field ([per-request override]({{%relref "advanced/model-configuration#per-request-override-via-metadata" %}})):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "diffusiongemma",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "metadata": {"enable_thinking": "true"}
  }'
```

The model's thought channel then streams as `reasoning_content`, separate from the final `content`.

#### Performance expectations

Honest numbers from validation on a DGX Spark (GB10, CUDA 13, BF16 26B model, full GPU offload):

- Engine load: ~33 s (50 GB of weights to GPU)
- Forward pass: ~5.6 s per denoise step (256-token canvas); a block takes up to `eb_max_steps` steps but typically exits early (24/48 observed on a normal prompt, 4 steps on a trivial one)
- End-to-end: ~0.15 tok/s at default settings, dominated by the per-step full recompute - this is the cost the upstream prefix-KV cache work targets

On CPU the same forward step takes ~139 s (20 Grace cores): treat the CPU flavor as functional, not practical, for the 26B model.

#### Reference

- [dllm.cpp](https://github.com/mudler/dllm.cpp)
- [unsloth/diffusiongemma-26B-A4B-it-GGUF](https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF)

### vLLM

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.