fix(reasoning): suppress partial tag tokens during autoparser warm-up

The C++ PEG parser needs a few tokens to identify the reasoning format
(e.g. "<|channel>thought\n" for Gemma 4). During this warm-up, the gRPC
layer was sending raw partial tag tokens to Go, which leaked into the
reasoning field.

- Clear reply.message in gRPC when autoparser is active but has no diffs
  yet, matching llama.cpp server behavior of only emitting classified output
- Prefer C++ autoparser chat deltas for reasoning/content in all streaming
  paths, falling back to Go-side extraction for backends without autoparser
  (e.g. vLLM); a Go-side sketch of this fallback follows the diff below
- Override non-streaming no-tools result with chat delta content when available
- Guard PrependThinkingTokenIfNeeded against partial tag prefixes during
  streaming accumulation (see the sketch after this list)
- Reorder default thinking tokens so <|channel>thought is checked before
  <|think|> (Gemma 4 templates contain both)
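
The prefix guard and the opener ordering from the last two bullets can be pictured with a minimal Go sketch; the package name, helper name, and opener list below are illustrative assumptions, not the actual LocalAI code. While the accumulated stream text is still a strict prefix of a known opener, a caller such as PrependThinkingTokenIfNeeded should hold the text back instead of classifying it, and "<|channel>thought\n" is listed ahead of "<|think|>" so templates containing both (e.g. Gemma 4) match the more specific opener first.

    package reasoning

    import "strings"

    // Hypothetical ordering of the default thinking openers: the longer,
    // more specific "<|channel>thought\n" marker is checked before
    // "<|think|>" so that templates containing both match it first.
    var defaultThinkingOpeners = []string{
        "<|channel>thought\n",
        "<|think|>",
    }

    // isPartialTagPrefix reports whether the accumulated stream text could
    // still grow into one of the known openers. While this is true, the
    // caller should buffer the text instead of treating the partial tag as
    // reasoning or content.
    func isPartialTagPrefix(accumulated string) bool {
        for _, opener := range defaultThinkingOpeners {
            if accumulated != "" &&
                len(accumulated) < len(opener) &&
                strings.HasPrefix(opener, accumulated) {
                return true
            }
        }
        return false
    }

For example, isPartialTagPrefix("<|chan") returns true, so that token stays buffered, while isPartialTagPrefix("Hello") returns false and the text flows through unchanged.
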
Author: Ettore Di Giacinto
Date: 2026-04-04 20:45:50 +00:00
parent c5a840f6af
commit 53deeb1107
4 changed files with 34 additions and 28 deletions

@@ -1608,8 +1608,18 @@ public:
     auto attach_chat_deltas = [](backend::Reply & reply, server_task_result * raw_result) {
         // Try streaming partial result first
         auto* partial = dynamic_cast<server_task_result_cmpl_partial*>(raw_result);
-        if (partial && !partial->oaicompat_msg_diffs.empty()) {
-            populate_chat_deltas_from_diffs(reply, partial->oaicompat_msg_diffs);
+        if (partial) {
+            if (!partial->oaicompat_msg_diffs.empty()) {
+                populate_chat_deltas_from_diffs(reply, partial->oaicompat_msg_diffs);
+            } else if (partial->is_updated) {
+                // Autoparser is active but hasn't classified this chunk yet
+                // (PEG parser warming up). Clear the raw message so the Go
+                // side doesn't try to parse partial tag tokens (e.g. "<|channel>"
+                // before the full "<|channel>thought\n" is received).
+                // This matches llama.cpp server behavior which only emits SSE
+                // chunks when the parser produces diffs.
+                reply.set_message("");
+            }
             return;
         }
         // Try final result
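
On the Go side, the preference described in the second and third bullets can be summarized with a small sketch; the chunk struct, its field names, and the <think> fallback tags are assumptions for illustration and do not mirror the actual LocalAI proto or extraction code. When the C++ autoparser has classified a chunk, its reasoning/content deltas win; during warm-up the raw message now arrives empty (because of the change above) and nothing is emitted; only backends without an autoparser (e.g. vLLM) fall through to Go-side tag extraction.

    package reasoning

    import "strings"

    // chunk is a stand-in for one streamed gRPC reply.
    type chunk struct {
        Message        string // raw text, may be "" during parser warm-up
        ReasoningDelta string // filled by the C++ autoparser when it has diffs
        ContentDelta   string // filled by the C++ autoparser when it has diffs
    }

    // splitChunk prefers the C++-classified deltas, emits nothing for
    // warm-up chunks with an empty raw message, and only falls back to
    // Go-side extraction when no deltas were provided.
    func splitChunk(c chunk) (reasoning, content string) {
        if c.ReasoningDelta != "" || c.ContentDelta != "" {
            // Autoparser already classified this chunk; trust it.
            return c.ReasoningDelta, c.ContentDelta
        }
        if c.Message == "" {
            // Parser warm-up: the gRPC layer cleared the raw message,
            // so there is nothing to emit for this chunk.
            return "", ""
        }
        // Fallback for backends without an autoparser (e.g. vLLM).
        return extractReasoningTags(c.Message)
    }

    // extractReasoningTags is a very simplified stand-in for the Go-side
    // extraction path, splitting on a single <think>...</think> span.
    func extractReasoningTags(s string) (reasoning, content string) {
        const openTag, closeTag = "<think>", "</think>"
        start := strings.Index(s, openTag)
        end := strings.Index(s, closeTag)
        if start >= 0 && end > start {
            return s[start+len(openTag) : end], s[:start] + s[end+len(closeTag):]
        }
        return "", s
    }

Keeping classification in the C++ layer means the Go fallback only runs for backends that never populate the deltas, so partial tags from the llama.cpp path can no longer leak into the reasoning field.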