LocalAI/backend/cpp/llama-cpp-localai-paged/patches/paged at 6edbb56b069da495677604dbee7bf4c4ff9e695b

mirror/LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-30 19:37:00 -04:00

Ettore Di Giacinto bd100dd20a fix(paged): repair the patch series, sync to the fork branch (drop dev-tree 0044/0045, add f32-only M5 as 0047)

The 0044/0045 patches were exported from the old bf16/hybrid dev tree and no
longer apply on the f32-only series (0026 ssm_bf16_tau is dropped), so the
build broke at `git apply`. Re-sync the vendored series to the now
feature-complete fork branch mudler/llama.cpp:localai-paged, which is the
canonical source (pin 0ed235ea + the paged patch commits in order).

- git rm the dev-tree-based 0044 (GDN M5, bf16-machinery base) and 0045
  (Marlin W4A16 offline-repack, not part of the fork branch).
- Add the fork branch's newest commit (2c32ab8b7, "GDN M5 tensor-core
  chunked-scan prefill, f32-only re-port") as 0047, generated with a single
  git format-patch off that branch. It sequences after 0046 (its parent on
  the branch) and recovers the prefill speedup that 0044 provided (+3.5% S_PP
  @npp512, +17.7% @npp2048) while staying bit-exact per path
  (test-backend-ops GATED_DELTA_NET 46/46 both by default and with M5 forced;
  greedy md5 default-on == M5-forced == canonical).
- Track patch 0046 (dense-prefill geometry gate), which was on disk but never
  committed, so the series is complete in git.
- README: patch-table header 0001-0046 -> 0001-0047, replace the 0044 row with
  the f32-only 0047 row, fix the dangling 0044 prose references, note the
  bf16 M6/M7/M8 variants are not part of this f32-only series, and add a
  maintenance bullet that the series is now generated from the fork branch so
  there is no more patch-export drift.

Verified: on a pristine llama.cpp at pin 0ed235ea the full series 0001-0043,
0046, 0047 applies clean in sorted order with the Makefile's exact
`git apply --verbose` method (37/37 OK), and the resulting tree is
byte-identical to the fork branch tip 2c32ab8b7.
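The verification step above (apply every patch in sorted filename order with `git apply --verbose`, then count successes) can be sketched in Python. This is not the LocalAI Makefile: the helper names `git_apply` and `apply_series` are hypothetical, and stopping at the first failed patch is an assumption made here because later patches in a series usually depend on earlier ones.

```python
import subprocess
from pathlib import Path
from typing import Callable


def git_apply(patch: Path, tree: Path) -> bool:
    """Apply one patch to `tree` with `git apply --verbose`; True if git exits 0."""
    result = subprocess.run(
        ["git", "apply", "--verbose", str(patch)],
        cwd=tree,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def apply_series(patch_dir: Path, tree: Path,
                 apply: Callable[[Path, Path], bool] = git_apply) -> tuple[int, int]:
    """Apply every *.patch in `patch_dir` to `tree` in sorted filename order.

    Hypothetical helper: stops at the first patch that fails (assumption, since
    later patches typically build on earlier ones) and returns (applied_ok, total).
    """
    patches = sorted(patch_dir.glob("*.patch"))  # zero-padded NNNN- prefixes sort numerically
    ok = 0
    for patch in patches:
        if not apply(patch, tree):
            print(f"FAILED: {patch.name}")
            break
        ok += 1
    return ok, len(patches)
```

Run against a pristine checkout, e.g. `ok, total = apply_series(Path("patches/paged"), Path("llama.cpp"))` followed by `print(f"{ok}/{total} OK")`; the `apply` parameter lets the ordering logic be exercised without touching a real tree.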

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-30 07:54:46 +00:00

0001-vendor-paged-kv-manager.patch

paged-kv-manager.h: add missing <cstddef> for size_t

2026-06-28 04:09:16 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (fix amd64/non-arm64 build; compile-only)

2026-06-28 06:18:56 +00:00

0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch

paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (fix amd64/non-arm64 build; compile-only)

2026-06-28 06:18:56 +00:00

0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0012-paged-mask-pad-invariant-assert.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0013-paged-decoupled-prefill-token-budget.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0014-paged-expert-aware-moe-token-tile-cap.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0015-paged-expert-density-aware-moe-token-tile-auto-select.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0016-paged-dynamic-prefill-budget-continuous-batch.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0017-fp4-gemm-decode-tile-tune.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0018-qwen35-ssm-decode-inplace-state.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0019-qwen35-ssm-decode-fused-gather.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0020-qwen35-gdn-oproj-mmq-reshape.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0021-qwen35-conv-state-inplace-fusion.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0022-qwen35-gdn-recurrence-occupancy-retune.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0023-qwen35moe-nvfp4-quant-dedup.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0024-paged-pool-burst-reclaim.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0025-qwen35moe-nvfp4-moe-decode-regraph.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0028-qwen35-recurrent-state-gather-fusion.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0029-qwen35-blocktable-within-step-cache.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0030-fused-op-backend-gate.patch

refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

2026-06-27 11:01:22 +00:00

0031-paged-chunked-gdn-prefill-scan-kernel.patch

fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV

2026-06-28 19:37:05 +00:00

0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch

feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)

2026-06-28 17:42:15 +00:00

0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch

feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off

2026-06-29 06:15:10 +00:00

0035-feat-paged-marlin-w4a16-grouped-moe-prefill-gemm.patch

feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off

2026-06-29 06:15:10 +00:00

0040-feat-paged-S1-paged-decode-graph-reuse-across-servin.patch

feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041)

2026-06-28 18:04:28 +00:00

0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch

fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B regression)

2026-06-29 05:00:11 +00:00

0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch

feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off

2026-06-29 06:15:10 +00:00

0043-feat-paged-default-on-full-step-moe-decode-cuda-graph.patch

feat(paged): tail-fusion (0042) + full-step decode CUDA graph default-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off

2026-06-29 06:15:10 +00:00

0046-paged-gate-GDN-prefill-geometry-by-scan-length.patch

fix(paged): repair the patch series, sync to the fork branch (drop dev-tree 0044/0045, add f32-only M5 as 0047)

2026-06-30 07:54:46 +00:00

0047-paged-GDN-M5-tensor-core-chunked-scan-f32.patch

fix(paged): repair the patch series, sync to the fork branch (drop dev-tree 0044/0045, add f32-only M5 as 0047)

2026-06-30 07:54:46 +00:00
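The commit message describes the series as "0001-0043, 0046, 0047" yet reports 37/37 patches, because the numbering has gaps (0026, 0044 and 0045 are the dropped patches the commit names; the other gaps are not explained here). A small Python sketch, with hypothetical helper names, that recovers the numeric prefixes from the filenames above and lists the missing numbers:

```python
import re

# Patch filenames in this directory look like NNNN-some-subject.patch.
PATCH_NUM = re.compile(r"^(\d{4})-.+\.patch$")


def series_numbers(filenames):
    """Sorted numeric prefixes of the NNNN-*.patch files in a listing."""
    nums = []
    for name in filenames:
        m = PATCH_NUM.match(name)
        if m:
            nums.append(int(m.group(1)))
    return sorted(nums)


def numbering_gaps(nums):
    """Numbers between the first and last patch that have no file."""
    present = set(nums)
    return [n for n in range(nums[0], nums[-1] + 1) if n not in present]


# Prefixes of the patches listed in this directory (transcribed by hand).
LISTED = [1, 2, 3, 4, *range(6, 26), 28, 29, 30, 31, 33, 34, 35,
          40, 41, 42, 43, 46, 47]

print(len(LISTED), numbering_gaps(LISTED))
# -> 37 [5, 26, 27, 32, 36, 37, 38, 39, 44, 45]
```

Of the 43 numbers in 0001-0043, eight are unused (0005, 0026, 0027, 0032, 0036-0039), leaving 35 patches; adding 0046 and 0047 gives the 37 that apply cleanly.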