* fix(vllm): don't stream raw tool-call markup as content when a tool parser is active When a tool_parser is configured and the request carries tools, the streaming loop emitted every text delta as delta.content — including the model's raw tool-call markup (e.g. <tool_call>...) — because extract_tool_calls only runs on the full output after the stream. Clients streaming a tool call therefore saw the unparsed tool-call syntax as assistant content. Buffer the text while a tool parser is active for the request; the existing end-of-stream chat_delta already carries the parsed tool_calls (or the cleaned content), which the Go side converts to SSE deltas. Non-tool-parser streaming is unchanged. Add a server-less regression test covering both the tool-call case (no raw markup leaked as content) and the plain-text case (content delivered exactly once — guards against double-emitting the buffered content). Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * test(vllm): add expectedFailure test for progressive streaming with tool parser (Case 3, #582) Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * test(vllm): add Cases 4+5 — marker split across chunks + false-positive prefix (TDD, Option B state machine, #582) Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * feat(vllm): progressive streaming via parser.extract_tool_calls_streaming When a tool parser is active for a tool-enabled streaming request, #10346 buffers the entire generation and surfaces it on the final chunk to prevent raw tool-call markup from leaking as delta.content. This is correct but turns the request into effectively non-streaming for plain-text responses — the client sees nothing until the model stops. Every concrete tool parser shipped with vLLM 0.23+ already implements extract_tool_calls_streaming (Granite4, Qwen3Coder, DeepSeekV31, Jamba, Ernie45, Hermes2Pro, llama3_json, mistral, …). Use it: instantiate the parser before the streaming loop and call its streaming method per delta, emitting DeltaMessage(content=…) or DeltaMessage(tool_calls=[…]) when the parser is ready. Falls back to the existing #10346 buffer path when: - the parser does not have extract_tool_calls_streaming, OR - extract_tool_calls_streaming raises mid-stream (logged, the rest of the request finishes via post-loop extract_tool_calls). Tests (TestStreamingToolParser): 1. Buffer path: no markup leaked, no content duplication 2. Native streaming: plain-text response streams progressively 3. Native streaming: tool_call structured, no markup leaked 4. Native streaming exception → graceful fallback, no markup, no crash 5. No tool parser → unchanged per-delta content stream E2E verified against qwen3_coder on vLLM 0.23.0 (NVIDIA GB10 / arm64 / CUDA 13). Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * docs(vllm): add server-side TTFT benchmark for the streaming tool-parser path Self-contained stdlib-only script that measures time-to-first-token (TTFT) for the vLLM backend's two streaming scenarios: - tool_call: request mentions a tool; model is expected to call it - plain_text: request offers a tool but explicitly asks for prose Use this to compare: - the buffer-all path (#10346) → plain_text TTFT ≈ total response time - the native-streaming path (this PR) → plain_text TTFT ≈ true first-token time python examples/vllm-bench/ttft_streaming_tool_parser.py \\ --url http://localhost:8080 --model my-coder --runs 3 Lives under examples/ so it does not interfere with the test suite. Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> * examples/vllm-bench: add long-text scenario (8 paragraphs, 1500 tokens) The long-text scenario shows the buffering vs streaming difference most dramatically: with the buffer-all path, the client receives nothing for 20+ seconds and then the entire 1500-token response at once. With native streaming, the first token arrives in tens of milliseconds and the response flows progressively. Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> --------- Signed-off-by: pos-ei-don <1822533+pos-ei-don@users.noreply.github.com> Co-authored-by: Philipp Wacker <philipp.wacker@ibf-solutions.com>
2.4 KiB
vLLM streaming + tool-parser benchmark
A small, self-contained Python script (stdlib only) that measures time-to-first-token (TTFT) for the vLLM backend's streaming path with a tool parser configured.
Why this exists
When a vLLM tool parser is active and a streaming chat completion is requested,
LocalAI used to buffer the full generation to prevent raw tool-call markup
(e.g. <tool_call>...) from leaking as delta.content. That was correct
for tool-call responses, but it turned plain-text responses into effectively
non-streaming — the client received nothing until the model finished.
With native parser-side streaming (parser.extract_tool_calls_streaming,
implemented by every concrete vLLM 0.23+ tool parser), each delta can be
classified per-token: emit as content, emit as a structured tool_call, or
suppress.
Three scenarios
| Scenario | Request | Expected outcome |
|---|---|---|
tool_call |
"What is the weather in Paris? Please use the tool." | Model calls get_weather. delta.tool_calls chunks; no content leak. |
plain_text_short |
"Explain in 3 short sentences what a hash table is. Do NOT call any tool." | Model writes ~3 sentences. |
plain_text_long |
"Write a thorough 8-paragraph explanation of how Python's GIL works…" | Model writes ~1500 tokens of prose. |
The long scenario is where the streaming/buffering difference is most dramatic: with the buffer-all path, the client sees nothing for 20+ seconds and then everything at once; with native streaming, the first token arrives in <100ms and the response flows progressively.
What the script reports
For each scenario, across N runs:
ttf_content_s— time until the firstdelta.contentchunkttf_tool_s— time until the firstdelta.tool_callschunkn_content_chunks— total content deltas (1 = bundled, >>1 = streamed)n_tool_chunks— total tool_call deltastotal_s— total wall-clock until[DONE]finish_reason—tool_calls/stop/length
The big tell is n_content_chunks vs total_s ratio:
- Buffer-all:
n_content_chunks≈ 1,ttf_content_s≈total_s(one chunk at end) - Streaming:
n_content_chunks≈ token count,ttf_content_s≈ first-token latency
Usage
python ttft_streaming_tool_parser.py --url http://localhost:8080 --model my-coder --runs 3
JSON results are written to ttft_bench_<label>.json (default label: run).