Mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-25 09:09:07 -04:00)
Append the four-point synthesis to A2_CUDAGRAPH_DECODE.md:

- the measured CUDA-graph lever size (<1%, not the guessed 10-20%);
- the corrected 'eager' premise (default paged decode already captures);
- the unchanged 37-38% of vLLM at npl128;
- the honest verdict that A.2 closes none of the 2.6x gap, because paged attention touches ~0.4% of decode on this hybrid-SSM model.

The residual lever is the qwen35 gated-DeltaNet SSM path (state D2D copy + get_rows gather), which is orthogonal to paged attention.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>