LocalAI/examples/vllm-bench/README.md

# vLLM streaming + tool-parser benchmark

A small, self-contained Python script (stdlib only) that measures
time-to-first-token (TTFT) for the vLLM backend's streaming path with
a tool parser configured.

## Why this exists

When a vLLM tool parser is active and a streaming chat completion is requested,
LocalAI used to buffer the full generation to prevent raw tool-call markup
(e.g. `<tool_call>...`) from leaking as `delta.content`. That was correct
for tool-call responses, but it turned plain-text responses into effectively
non-streaming — the client received nothing until the model finished.

With native parser-side streaming (`parser.extract_tool_calls_streaming`,
implemented by every concrete vLLM 0.23+ tool parser), each delta can be
classified per-token: emit as content, emit as a structured tool_call, or
suppress.

## Three scenarios

| Scenario | Request | Expected outcome |
|---|---|---|
| `tool_call`         | "What is the weather in Paris? Please use the tool." | Model calls `get_weather`. `delta.tool_calls` chunks; no content leak. |
| `plain_text_short`  | "Explain in 3 short sentences what a hash table is. Do NOT call any tool." | Model writes ~3 sentences. |
| `plain_text_long`   | "Write a thorough 8-paragraph explanation of how Python's GIL works…" | Model writes ~1500 tokens of prose. |

The **long scenario** is where the streaming/buffering difference is most
dramatic: with the buffer-all path, the client sees nothing for 20+ seconds
and then everything at once; with native streaming, the first token arrives
in <100ms and the response flows progressively.

## What the script reports

For each scenario, across N runs:

- `ttf_content_s` — time until the first `delta.content` chunk
- `ttf_tool_s` — time until the first `delta.tool_calls` chunk
- `n_content_chunks` — total content deltas (1 = bundled, >>1 = streamed)
- `n_tool_chunks` — total tool_call deltas
- `total_s` — total wall-clock until `[DONE]`
- `finish_reason` — `tool_calls` / `stop` / `length`

The big tell is **`n_content_chunks` vs `total_s` ratio**:
- Buffer-all: `n_content_chunks` ≈ 1, `ttf_content_s` ≈ `total_s` (one chunk at end)
- Streaming: `n_content_chunks` ≈ token count, `ttf_content_s` ≈ first-token latency

## Usage

```bash
python ttft_streaming_tool_parser.py --url http://localhost:8080 --model my-coder --runs 3
```

JSON results are written to `ttft_bench_<label>.json` (default label: `run`).