Mirror of https://github.com/mudler/LocalAI.git, synced 2026-06-13 03:09:03 -04:00
In the /v1/responses streaming handler a reasoning model's thinking
monologue was streamed to the client as normal message text (a msg_
output item with output_text.delta) and only reclassified into a
reasoning item after the stream completed. Subsequent output_text.delta
events also kept referencing the old msg_ item id instead of the
reasoning_ id.

Root causes:

1. The live reasoning item was gated on extractor.Reasoning(), which is
   only updated by the Go-side raw-tag parser (ProcessToken). When the
   C++ autoparser drives reasoning through reasoning_content ChatDeltas,
   the reasoning delta is computed via ProcessChatDeltaReasoning into a
   separate accumulator, so extractor.Reasoning() stays empty and the
   gate never fired. The reasoning item was thus only reconstructed at
   end-of-stream.

2. The non-tool-call path created the message/msg_ output item eagerly
   before any token, forcing reasoning to a higher output index and
   making mis-split <think> text land on the pre-existing message item.

3. Neither path carried the sticky preferAutoparser flag, so a
   content-only autoparser (the non-jinja pure-content fallback, #9985)
   could leak <think>...</think> tokens into content.

Extract the per-token reasoning-vs-message classification into a pure,
unit-tested streamReasoningRouter (mirroring chooseDeferredReasoning and
processStream in the chat streaming worker): it gates the reasoning item
on the reasoning delta, opens the message item lazily on the first
content delta, and keeps a sticky preferAutoparser fallback. Both
streaming paths now route reasoning deltas to the reasoning_ id and
order the reasoning item ahead of the message at completion.

Assisted-by: claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
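The routing described above can be illustrated with a minimal, standalone Go sketch. It is not the code from this commit: only the names streamReasoningRouter and preferAutoparser come from the message, while the fields, the route and completionOrder methods, the itemKind constants, and the item ids are hypothetical. In particular, the sketch assumes that preferAutoparser latches the first time the autoparser supplies a reasoning delta, and that until then the raw-tag parser's split is used, which is one plausible reading of the "sticky fallback".

```go
package main

import "fmt"

// itemKind names the output item a delta belongs to (hypothetical type).
type itemKind int

const (
	kindReasoning itemKind = iota
	kindMessage
)

// routedDelta is one piece of text addressed to a specific output item.
type routedDelta struct {
	Kind   itemKind
	ItemID string
	Text   string
}

// streamReasoningRouter is an illustrative stand-in for the router named
// in the commit message; its fields and methods are assumptions.
type streamReasoningRouter struct {
	reasoningID, messageID     string
	reasoningOpen, messageOpen bool
	preferAutoparser           bool // sticky: never reset once set
}

func newRouter(reasoningID, messageID string) *streamReasoningRouter {
	return &streamReasoningRouter{reasoningID: reasoningID, messageID: messageID}
}

// route classifies the deltas of one token. autoReasoning/autoContent come
// from the autoparser, rawReasoning/rawContent from the raw-tag parser.
func (r *streamReasoningRouter) route(autoReasoning, autoContent, rawReasoning, rawContent string) []routedDelta {
	// Assumption: only an autoparser *reasoning* delta latches the flag, so a
	// content-only autoparser never overrides the raw-tag split.
	if autoReasoning != "" {
		r.preferAutoparser = true
	}
	reasoning, content := rawReasoning, rawContent
	if r.preferAutoparser {
		reasoning, content = autoReasoning, autoContent
	}

	var out []routedDelta
	if reasoning != "" {
		// Gate the reasoning item on the reasoning delta itself.
		r.reasoningOpen = true
		out = append(out, routedDelta{Kind: kindReasoning, ItemID: r.reasoningID, Text: reasoning})
	}
	if content != "" {
		// Open the message item lazily, on the first content delta.
		r.messageOpen = true
		out = append(out, routedDelta{Kind: kindMessage, ItemID: r.messageID, Text: content})
	}
	return out
}

// completionOrder lists the opened items with reasoning ahead of the message.
func (r *streamReasoningRouter) completionOrder() []string {
	var ids []string
	if r.reasoningOpen {
		ids = append(ids, r.reasoningID)
	}
	if r.messageOpen {
		ids = append(ids, r.messageID)
	}
	return ids
}

func main() {
	r := newRouter("reasoning_1", "msg_1")
	steps := [][4]string{
		{"Let me think.", "", "", ""}, // autoparser reasoning_content delta
		{"", "Hello", "", "Hello"},    // first content delta opens the message
	}
	for _, s := range steps {
		for _, d := range r.route(s[0], s[1], s[2], s[3]) {
			fmt.Println(d.ItemID, d.Text)
		}
	}
	fmt.Println(r.completionOrder())
}
```

Keeping the router free of I/O, as in this sketch, is what makes it unit-testable: a test can feed it delta tuples and assert on the returned item ids without starting a stream.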