feat(grpc): request cancellation for Go backends via the Cancellable capability

The llama.cpp C++ backend aborts generation when its gRPC context is cancelled (grpc-server.cpp polls context->IsCancelled() in the result loops), but Go backends served by pkg/grpc never observed context cancellation: a disconnected client left the generation running to completion.

Add an optional Cancellable capability; the server registers context.AfterFunc on the request/stream context (after the Locking block so queued requests cannot abort the current owner) covering both rich and legacy paths.

dllm implements it: measured cancel latency ~10ms vs ~10s of orphaned generation, and follow-up requests no longer queue behind cancelled ones (~220ms vs ~9s in the e2e proof).

Assisted-by: Claude Code (Fable 5)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-12 10:47:23 -04:00 · 2026-06-11 17:50:04 +00:00
parent eb61e1d770
commit ad6d1dbc8b
8 changed files with 349 additions and 17 deletions
--- a/docs/content/features/text-generation.md
+++ b/docs/content/features/text-generation.md
@@ -670,6 +670,7 @@ This backend is **experimental**, and the engine does not yet have a prompt-KV p
 - [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
 - [🔥 OpenAI functions]({{%relref "features/openai-functions" %}}) - tool calls are parsed natively by the backend (gemma4 `<|tool_call>` markers), not by LocalAI's grammar/regex fallback
 - Reasoning - opt-in thinking streams as `reasoning_content` (see below)
+- Request cancellation - disconnecting the client (or a request timeout) aborts the in-flight generation server-side, so an abandoned slow run does not keep the GPU busy

 #### Supported platforms