feat(grpc): request cancellation for Go backends via the Cancellable capability

The llama.cpp C++ backend aborts generation when its gRPC context is
cancelled (grpc-server.cpp polls context->IsCancelled() in the result
loops), but Go backends served by pkg/grpc never observed context
cancellation: a disconnected client left the generation running to
completion. Add an optional Cancellable capability: the server registers
a context.AfterFunc callback on the request/stream context (after the
Locking block, so queued requests cannot abort the current owner), and
this covers both the rich and legacy paths. dllm implements it: measured
cancel latency is ~10ms, versus ~10s of orphaned generation before, and
follow-up requests no longer queue behind cancelled ones (~220ms vs ~9s
in the e2e proof).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-11 17:50:04 +00:00
parent eb61e1d770
commit ad6d1dbc8b
8 changed files with 349 additions and 17 deletions

@@ -670,6 +670,7 @@ This backend is **experimental**, and the engine does not yet have a prompt-KV p
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}}) - tool calls are parsed natively by the backend (gemma4 `<|tool_call>` markers), not by LocalAI's grammar/regex fallback
- Reasoning - opt-in thinking streams as `reasoning_content` (see below)
- Request cancellation - disconnecting the client (or a request timeout) aborts the in-flight generation server-side, so an abandoned slow run does not keep the GPU busy

#### Supported platforms