mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-23 16:19:07 -04:00
The 0003 gather-read was single-stream only (GGML_ASSERT k->ne[3]==1). Lift it to N streams: one index column per stream over the unified batch, gathered with a single ggml_get_rows along the stream axis. Each column is position-sorted (preserving the flash-attn online-softmax reduction order that makes the read byte-identical) and padded to the max non-empty count across streams with a masked (empty) cell, which contributes exp(-inf)=0.

Core touch stays additive: the one-line build_attn hook is unchanged; only the two kv-cache gather helpers (now per-stream) and src/paged-attn.cpp grow.

Gate 0 (CPU, Qwen3-0.6B-Q8_0): a multi-sequence greedy driver (non-unified KV, k->ne[3]>1) is token-identical between stock (env unset) and LLAMA_KV_PAGED=1:

- 3 seqs x 40 tok: identical
- 2 seqs x 32 tok: identical
- 5 seqs x 32 tok: identical
- single-stream llama-simple: unchanged

Debug log confirms n_stream=3 engaged the multi path.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
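
For reference, the per-stream index layout described above can be sketched in standalone C++. This is an illustration only, not the code in src/paged-attn.cpp: the Cell struct, build_index_columns, masked_weight and the masked_row parameter are hypothetical names, and the sketch assumes the caller already knows the row of a cell that the attention mask sets to -inf. It shows the two properties the message relies on: every column is sorted by position, and shorter columns are padded to the longest one with a masked row whose softmax weight is exp(-inf)=0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical cell record (not the real llama.cpp kv-cache structs).
struct Cell {
    int32_t stream; // stream (sequence) the cell belongs to
    int32_t pos;    // token position; negative means the cell is empty
    int32_t row;    // row of the cell inside the K/V tensor
};

// Build one index column per stream. Each column holds the rows of that
// stream's non-empty cells in ascending position order, then masked_row
// repeated until the column is as long as the longest one.
std::vector<std::vector<int32_t>> build_index_columns(
        const std::vector<Cell> & cells, int32_t n_stream, int32_t masked_row) {
    assert(n_stream > 0);

    // Bucket the non-empty cells by stream.
    std::vector<std::vector<Cell>> per_stream(n_stream);
    for (const Cell & c : cells) {
        if (c.pos >= 0 && c.stream >= 0 && c.stream < n_stream) {
            per_stream[c.stream].push_back(c);
        }
    }

    // Sort each bucket by position and track the longest bucket.
    std::size_t n_max = 0;
    for (auto & bucket : per_stream) {
        std::stable_sort(bucket.begin(), bucket.end(),
                         [](const Cell & a, const Cell & b) { return a.pos < b.pos; });
        n_max = std::max(n_max, bucket.size());
    }

    // Emit the row indices and pad every column to n_max entries.
    std::vector<std::vector<int32_t>> cols(n_stream);
    for (int32_t s = 0; s < n_stream; ++s) {
        for (const Cell & c : per_stream[s]) {
            cols[s].push_back(c.row);
        }
        cols[s].resize(n_max, masked_row);
    }
    return cols;
}

// Softmax weight of a masked cell: its score is -inf, so exp() yields 0.
inline float masked_weight() {
    return std::exp(-std::numeric_limits<float>::infinity());
}
```

With three streams where only the first two hold tokens, the third column consists entirely of masked_row entries, so all columns keep the same length and a single gather over the stream axis stays rectangular.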