LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 09:26:55 -04:00

Files

Ettore Di Giacinto 8f8777e0f4 feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)

This is the no-regret, bit-exact conv-state cleanup from the GDN recurrence
byte-gate design (point 3). After the recurrence verdict (NO-BUILD: the
gated-DeltaNet recurrence is already single-pass at the f32 byte floor), the
decode conv path was the only remaining bit-exact lever.

New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
by a non-null src[3]). On the single-token decode path it replaces the conv chain
(qkv transpose, ggml_concat (concat_cont), ggml_ssm_conv, ggml_silu, and the
ggml_cpy of the shifted ring state (cpy_scalar)) with one kernel. Per (channel,
sequence), that kernel:

- assembles the width-K window in registers from the K-1 cached taps plus the
  current qkv_mixed token
- computes the depthwise conv with the SAME ascending-tap FMA order as
  ssm_conv_f32 at i==0
- folds silu and writes the conv output
- writes the 1-token-shifted ring state back IN PLACE into the conv cache slot
  at kv_head

This matches vLLM's causal_conv1d_update and mirrors the 0018 in-place write-back
and 0019 patterns. The read source (the build_rs tap gather) and the write target
(the cache view) are disjoint buffers, so the op is race-free by construction and
needs no ids/identity logic.
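
The per-(channel, sequence) work described above can be sketched as a scalar CPU
reference. This is a minimal illustration, not the patch's kernel: the function
name conv_update_ref, the flat std::vector layouts, and the choice to update the
state buffer directly (the real op reads a gathered copy and writes a separate
cache view) are all assumptions made for the example. It accumulates in ascending
tap order with plain multiply-adds and makes no claim about bit-exactness against
the CUDA kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference for a single-token causal conv1d update.
// Assumed layouts (fastest-varying dimension first):
//   state  : [K-1, ch, n_seqs]  cached taps, oldest tap first
//   kernel : [K,   ch]          depthwise weights
//   x_cur  : [ch,  n_seqs]      current token
//   out    : [ch,  n_seqs]      conv output (optionally passed through silu)
static float silu(float v) { return v / (1.0f + std::exp(-v)); }

void conv_update_ref(std::vector<float>& state,
                     const std::vector<float>& kernel,
                     const std::vector<float>& x_cur,
                     std::vector<float>& out,
                     int K, int ch, int n_seqs, bool fuse_silu) {
    assert(K >= 2 && ch > 0 && n_seqs > 0);
    assert(state.size()  == static_cast<size_t>(K - 1) * ch * n_seqs);
    assert(kernel.size() == static_cast<size_t>(K) * ch);
    assert(x_cur.size()  == static_cast<size_t>(ch) * n_seqs);
    assert(out.size()    == static_cast<size_t>(ch) * n_seqs);

    for (int s = 0; s < n_seqs; ++s) {
        for (int c = 0; c < ch; ++c) {
            const size_t idx = static_cast<size_t>(s) * ch + c;
            float*       st  = &state[idx * (K - 1)];
            const float* w   = &kernel[static_cast<size_t>(c) * K];
            const float  x   = x_cur[idx];

            // Window = K-1 cached taps followed by the current token,
            // accumulated in ascending tap order.
            float acc = 0.0f;
            for (int k = 0; k < K - 1; ++k) {
                acc += st[k] * w[k];
            }
            acc += x * w[K - 1];

            out[idx] = fuse_silu ? silu(acc) : acc;

            // Shift the ring state by one token: drop the oldest tap,
            // append the current token as the newest.
            for (int k = 0; k + 1 < K - 1; ++k) {
                st[k] = st[k + 1];
            }
            st[K - 2] = x;
        }
    }
}
```

With K=4, one channel, one sequence, cached taps {1, 2, 3}, an all-ones kernel,
and a current token of 4, the sketch produces an output of 10 (silu off) and
leaves the state as {2, 3, 4}.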

- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
  src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
  op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
  ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
  + branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
  conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
  (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
  the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
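
The two selection rules in the list above (the src[3]-discriminated branch inside
the op, and the model-side routing condition) can be sketched as follows. The
node struct is a hypothetical stand-in, not the real ggml_tensor, and the
function names select_conv_path, wants_silu, and use_fused_decode are invented
for illustration; only the conditions themselves come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a graph node (NOT the real ggml_tensor).
struct node {
    const node* src[4]       = {nullptr, nullptr, nullptr, nullptr};
    int32_t     op_params[1] = {0};
};

enum class conv_path { classic, fused_update };

// The fused update shares the op code with the classic conv; a non-null
// src[3] (the in-place ring-state destination) selects the fused branch.
conv_path select_conv_path(const node& op) {
    return op.src[3] != nullptr ? conv_path::fused_update : conv_path::classic;
}

// op_params[0] carries the fuse_silu flag.
bool wants_silu(const node& op) {
    return op.op_params[0] != 0;
}

// Model-side routing: only the single-token decode path takes the fused op;
// prefill, chunked, and rollback keep the original chain.
bool use_fused_decode(int n_seq_tokens, int n_rs_seq, bool fused_gdn_ar) {
    assert(n_seq_tokens >= 0 && n_rs_seq >= 0);
    return n_seq_tokens == 1 && n_rs_seq == 0 && fused_gdn_ar;
}
```

Discriminating on an extra source rather than adding a new op code keeps the
existing backend dispatch tables unchanged, which is the design the commit
describes.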

test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.

Greedy decoding (--temp 0 --seed 1 --ignore-eos -n 256) is byte-identical to the
Lever-1 (0019/0020) baseline for both models: q36-27b-nvfp4 md5 675cd522...,
q36-35b-a3b-nvfp4 md5 ac163882...

decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
  dense q36-27b-nvfp4 : npl 32  199.76 -> 202.99 (+1.6%)
                        npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
  MoE   q36-35b-a3b   : npl 32  421.72 -> 432.39 (+2.5%)
                        npl 128 689.74 -> 713.54 (+3.5%)
The lift holds in eager mode too (dense npl128 333.62 -> 342.97). Step time drops
by 11.9 ms/step (dense npl128: 380.6 -> 368.7). In the nsys eager decode trace,
concat_cont (1152 calls) and the decode cpy_scalar are gone; ssm_conv_f32 at
decode is replaced by ssm_conv_update (1152 calls); the conv path drops from
~20.9 to ~7.6 ms/step. The change is bit-exact with no regression, and it
de-risks the bf16-state conv-cache plumbing.
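
The percentages quoted above follow from simple ratios, which can be re-derived
with a short helper. The function names lift_percent and share_percent are
hypothetical and exist only for this check; the input numbers are the ones
reported in this message.

```cpp
#include <cassert>

// Relative change from `before` to `after`, in percent.
double lift_percent(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// `value` expressed as a percentage of `reference`.
double share_percent(double value, double reference) {
    assert(reference > 0.0);
    return value / reference * 100.0;
}
```

For the dense npl 128 row, lift_percent(336.35, 347.14) is about 3.2 and
share_percent(347.14, 391.0) is about 88.8, matching the figures reported.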

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-25 16:56:35 +00:00

0001-vendor-paged-kv-manager.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-gather-read-plan.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)