LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 01:16:58 -04:00

Files

Ettore Di Giacinto a72385257a docs(paged): decisive node-level decode timeline gap - bubbles refuted

Fresh nsys --cuda-graph-trace=node capture of one steady decode step on
q36-27b-nvfp4 dense at npl128 (clean Lever-1 build-cuda-base). The decode step
is a single CUDA graph; node-level expansion shows it is 99.94% GPU-busy on a
single stream with 0.225 ms/step inter-kernel idle (0.06%, zero gaps >5us).

This refutes the "~60% idle bubbles / 57 ms = 100% bubble" hypothesis and
confirms the cudagraph-coverage source verdict. Real decode mix: gated_delta_net
196 ms = 51.6% of the step (4.08 ms/call x48; the prior 1.47 ms/call "near-vLLM"
was a prefill-contaminated eager average), FP4 GEMM+quantize 29%, gating glue
(Lever 3 target) only 3.35%, gdn_gather 0.06 ms. By roofline-decode's own sizing
test (idle < 57 ms => gap is elsewhere) the 14% gap to vLLM lives in kernel
GPU-time, dominated by the bandwidth-bound GDN recurrence, not in bubbles; Lever
3 fusion is resized to ~3% and reframed as byte-reduction, not bubble removal.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-25 14:57:37 +00:00

paged

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

2026-06-21 23:16:28 +00:00

patches

docs(paged): decisive node-level decode timeline gap - bubbles refuted

2026-06-25 14:57:37 +00:00

CMakeLists.txt

fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413 )

2026-04-18 20:30:28 +02:00

grpc-server.cpp

feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 0016, continuous-batch P1)