mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-25 09:09:07 -04:00
Mirror of the llama-paged-dev patch 0018 engine change, plus the measured
results. Per SSM layer, per step, decode no longer D2D-copies the full
~225 MB recurrent state into the cache: the fused gated_delta_net op writes
the final state in place at the active sequence's cache slot (new
ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM's
fused_recurrent_gated_delta_rule. SSM math is unchanged (greedy output is
bit-identical).
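
The copy-versus-in-place difference described above can be sketched as a
toy in Python. This is an illustration under loud assumptions, not the
ggml code: the update rule is a stand-in rather than the real gated delta
rule, and every name in it (step, decode_with_copy, decode_in_place) is
hypothetical.

```python
# Toy illustration only: the update below is a stand-in, not the real
# gated delta rule, and all names here are hypothetical.

def step(state, x, decay=0.5):
    """Return the next recurrent state for one token (toy update)."""
    return [decay * s + x for s in state]

def decode_with_copy(cache, slot, x):
    """Old shape: compute into a scratch buffer, then copy into the cache."""
    scratch = step(cache[slot], x)   # op output lands in a temporary
    cache[slot] = list(scratch)      # extra full-state copy on every step
    return len(scratch)              # number of elements copied

def decode_in_place(cache, slot, x, decay=0.5):
    """New shape: write each final state element straight into the slot."""
    dst = cache[slot]
    for i in range(len(dst)):
        dst[i] = decay * dst[i] + x  # same arithmetic as step()
    return 0                         # no separate copy is made

cache_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
cache_b = [row[:] for row in cache_a]
copied_a = decode_with_copy(cache_a, 1, 0.25)
copied_b = decode_in_place(cache_b, 1, 0.25)
```

Both paths leave the same state in the slot; only the amount of data
moved differs, which is the property the patch relies on.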
Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on):
  q36-27b-nvfp4 dense:   npl32  113.74 -> 136.39 (+19.9 percent)
                         npl128 146.23 -> 180.53 (+23.5 percent, equal to
                         the predicted copy-removal ceiling)
  q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent)
nsys D2D memcpy bucket: 18.9 -> 0.23 percent (356 -> 2.93 GB).
Dense npl128 as a share of vLLM (391 @128): 37.4 -> 46.2 percent.
See SSM_DECODE_FIX_RESULTS.md.
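
The percentages above can be rechecked from the raw figures. A minimal
sketch in Python; the helper names (gain_percent, share_percent) are
hypothetical and only the numbers come from this message:

```python
def gain_percent(before, after):
    """Relative throughput gain, in percent."""
    return (after / before - 1.0) * 100.0

def share_percent(value, reference):
    """Value expressed as a percentage of a reference figure."""
    return value / reference * 100.0

# (before, after) S_TG figures quoted in this message.
runs = {
    "dense npl32": (113.74, 136.39),
    "dense npl128": (146.23, 180.53),
    "moe npl128": (313.36, 372.62),
}
gains = {name: round(gain_percent(b, a), 1) for name, (b, a) in runs.items()}

# Dense npl128 throughput relative to the vLLM figure of 391.
vllm_share = (
    round(share_percent(146.23, 391.0), 1),
    round(share_percent(180.53, 391.0), 1),
)

# Fraction of D2D memcpy volume removed (356 GB down to 2.93 GB).
d2d_removed = round((1.0 - 2.93 / 356.0) * 100.0, 1)

for name, gain in gains.items():
    print(f"{name}: +{gain} percent")
print("vLLM share:", vllm_share, "D2D volume removed:", d2d_removed)
```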
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>