From ec7c1b1f687ed578659498d029e645b7913ed4b2 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 14:12:36 +0000
Subject: [PATCH] feat(paged): pin-sync patchset to llama.cpp 9d5d882d
 (re-export 4 patches)

The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the
paged patch-stack (0001-0024) to the new tip: the stack was rebased onto
9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated
bit-exact before re-exporting the LocalAI .patch files.

Re-exporting each shipped patch from its rebased commit and diffing body-to-body
against the committed files identifies exactly 4 that changed and no longer
git-apply to 9d5d882d:

- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to
  the refactored update_slots() lambda (continue->return, batch.n_tokens->
  batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break
  re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream
  inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The
  inserted lines are unchanged. (This one rebased cleanly via 3-way but its
  committed .patch no longer applies with plain git apply, so it is caught by the
  per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch =
  batch.size() + add_ok=false against the refactored loop.

All four are byte-faithful format-patch exports of the gate-green rebased commits.
Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green
tree byte-for-byte across every code file.

The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only
(hunk bodies byte-identical, only @@ line-numbers shifted) and still apply
cleanly, so they are left unchanged. The remaining 12 patches are identical.

Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):
- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT
  1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4
  5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4
  07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s
  (ref 745.7, +0.7%), both within noise.

Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 ...uest-prefix-share-env-LLAMA_KV_PAGED.patch |  36 ++--
 ...paged-decoupled-prefill-token-budget.patch |  41 ++--
 ...ity-aware-moe-token-tile-auto-select.patch |   8 +-
 ...amic-prefill-budget-continuous-batch.patch |  84 +++-----
 .../patches/paged/PIN_SYNC_9d5d882d.md        | 202 ++++++++++++++++++
 5 files changed, 279 insertions(+), 92 deletions(-)
 create mode 100644 backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md

diff --git a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
index d0e32349e..a739919ff 100644
--- a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
@@ -1,4 +1,4 @@
-From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
+From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Mon, 22 Jun 2026 17:02:22 +0200
 Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
@@ -51,10 +51,10 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  1 file changed, 50 insertions(+)
 
 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index da6a475..04c6361 100644
+index 39b7eb2..b5f9d37 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -15,6 +15,16 @@
+@@ -16,6 +16,16 @@
  #include "mtmd.h"
  #include "mtmd-helper.h"
  
@@ -71,7 +71,7 @@ index da6a475..04c6361 100644
  #include <algorithm>
  #include <cstddef>
  #include <cinttypes>
-@@ -3007,6 +3017,37 @@ private:
+@@ -3335,6 +3345,37 @@ private:
                              }
                          }
  
@@ -109,22 +109,22 @@ index da6a475..04c6361 100644
                          // [TAG_PROMPT_LOGITS]
                          if (n_past == slot.task->n_tokens() && n_past > 0) {
                              SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
-@@ -3427,6 +3468,15 @@ private:
-                     // prompt evaluated for next-token prediction
-                     slot.state = SLOT_STATE_GENERATING;
+@@ -3741,6 +3782,15 @@ private:
+                 // prompt evaluated for next-token prediction
+                 slot.state = SLOT_STATE_GENERATING;
  
-+                    // [paged 0008] Publish this slot's computed prefix so concurrent/later
-+                    // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
-+                    // for [0, n_tokens) has just run, so the prefix KV is computed.
-+                    static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
-+                    if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
-+                        const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
-+                        paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
-+                    }
++                // [paged 0008] Publish this slot's computed prefix so concurrent/later
++                // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
++                // for [0, n_tokens) has just run, so the prefix KV is computed.
++                static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
++                if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
++                    const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
++                    paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
++                }
 +
-                     if (slot.can_speculate()) {
-                         common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
-                     }
+                 if (slot.can_speculate()) {
+                     common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
+                 }
 -- 
 2.43.0
 
diff --git a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
index ffbd01f8e..29a9ca226 100644
--- a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
@@ -1,4 +1,4 @@
-From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
+From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Tue, 23 Jun 2026 11:52:45 +0200
 Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
@@ -62,14 +62,14 @@ stays disjoint from the paged allocation hunks.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
- tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
- 1 file changed, 34 insertions(+), 1 deletion(-)
+ tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 33 insertions(+), 1 deletion(-)
 
 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 04c6361..5d83b30 100644
+index b5f9d37..afcdebe 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -2723,6 +2723,29 @@ private:
+@@ -3043,6 +3043,29 @@ private:
          int32_t n_batch  = llama_n_batch(ctx_tgt);
          int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
  
@@ -96,42 +96,41 @@ index 04c6361..5d83b30 100644
 +        }
 +        int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
 +
-         float  alora_scale       = -1.0f;
-         size_t alora_disabled_id = 0;
+         auto & alora_scale       = batch.alora_scale;
+         auto & alora_disabled_id = batch.alora_disabled_id;
  
-@@ -3159,7 +3182,10 @@ private:
-                     const bool n_before_user_known = n_before_user > 0;
+@@ -3487,7 +3510,10 @@ private:
+                     const auto last_user_pos = spans.last_user_message_pos();
  
                      // add prompt tokens for processing in the current batch
--                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
+-                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) {
 +                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
 +                    // prompt is split across more steps and leaves batch room for co-batched decode
-+                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
++                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
 +                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
                          // get next token to process
                          llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
                          if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3185,6 +3211,7 @@ private:
+@@ -3512,6 +3538,7 @@ private:
                          slot.prompt.tokens.push_back(cur_tok);
  
                          slot.n_prompt_tokens_processed++;
 +                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
  
-                         // stop the prompt batch exactly before the latest user input, so a checkpoint
-                         // can be created after the previous messages
-@@ -3293,6 +3320,12 @@ private:
-                 if (batch.n_tokens >= n_batch) {
-                     break;
+                         // stop the prompt batch exactly before a user message
+                         if (spans.is_user_start(slot.prompt.n_tokens())) {
+@@ -3597,6 +3624,11 @@ private:
+                 if (!slot_batched) {
+                     slot_batched = &slot;
                  }
-+
 +                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
 +                // leaving the remaining batch capacity for co-batched decode of other slots
 +                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
-+                    break;
++                    add_ok = false;
 +                }
-             }
+             });
          }
- 
+     }
 -- 
 2.43.0
 
diff --git a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
index 81dfd8d5f..519ad7ab1 100644
--- a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
@@ -1,4 +1,4 @@
-From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
+From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Tue, 23 Jun 2026 21:03:00 +0200
 Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
@@ -207,12 +207,12 @@ index cff608e..9718b12 100644
      }
  
 diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index 15ae389..f219309 100644
+index c83e91f..62a0989 100644
 --- a/tests/test-backend-ops.cpp
 +++ b/tests/test-backend-ops.cpp
-@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-     // gpt-oss issue with Vulkan mmq_id
+@@ -8603,6 +8603,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
      test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
  
 +    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
 +    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
diff --git a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
index 17b73a7ee..ca7e4040f 100644
--- a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
@@ -1,54 +1,40 @@
-From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
+From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
-Date: Wed, 24 Jun 2026 07:44:25 +0000
+Date: Wed, 24 Jun 2026 10:11:48 +0200
 Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
  0016, continuous-batch P1)
 
 Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
 decode-first token budget: the P1 of the token-granular continuous-batch
-scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
-change only inside update_slots(): no new slot states, no batch-formation
-rewrite, zero libllama changes. llama-server already emits one unified
-mixed prefill+decode batch per step (Phase 1 appends every ready decode
-token unconditionally; Phase 2 fills prefill into the same batch); 0013
-already ships that mixed ubatch. 0016 only changes the COUNT of prefill
-tokens admitted per step.
+scheduler. POLICY change only inside update_slots(): no new slot states, no
+batch-formation rewrite, zero libllama changes. llama-server already emits one
+unified mixed prefill+decode batch per step (Phase 1 appends every ready decode
+token unconditionally; Phase 2 fills prefill into the same batch). 0016 only
+changes the COUNT of prefill tokens admitted per step.
 
 The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
 == D (the live decode load) is known there. Instead of 0013's constant
-LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
-one long prompt monopolise the step), compute a dynamic budget:
+LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one
+long prompt monopolise the step), compute a dynamic budget:
 
-  T  = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
-       n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
+  T  = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch)
   prefill_budget_step  = max(n_ubatch, T - D)   (leftover after decode,
        auto-shrinks as decode load rises so the step never inflates past T)
-  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
-       (the long_prefill_token_threshold analogue: one long prompt cannot
-       eat the whole leftover; LLAMA_PREFILL_CAP overrides)
+  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch,
+       pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides)
 
 Phase 2's inner prompt-fill loop and outer admission break are bounded by
 prefill_budget_step (across slots) and a new per-slot slot_prompt_added
-counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
-ceiling stays as the compute bound. Decode is structurally claimed first
-and never capped (Phase 1), so the decode-first guarantee is free.
-
-Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
-that is net-negative at low npl and costs MoE TTFT; the T - D budget is
-self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
-decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
-tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
-lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
-TTFT + tuning-free robustness + clean supersession of 0013.
+counter; the n_batch hard ceiling stays as the compute bound. Decode is
+structurally claimed first and never capped (Phase 1), so the decode-first
+guarantee is free.
 
 DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
-to stock. The degenerate T == n_batch case is byte-identical to stock/0013
-(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
-n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
-ceiling at the same point, so no new bound fires. The legacy
-LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
-LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
-to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.
+to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the
+determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly
+(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly
+subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical
+decisions paged on or off.
 
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -57,10 +43,10 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
  1 file changed, 85 insertions(+), 22 deletions(-)
 
 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 5d83b30..f7a114c 100644
+index afcdebe..b8b8f00 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -2723,24 +2723,78 @@ private:
+@@ -3043,24 +3043,78 @@ private:
          int32_t n_batch  = llama_n_batch(ctx_tgt);
          int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
  
@@ -112,7 +98,7 @@ index 5d83b30..f7a114c 100644
 +        // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
 +        // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
 +        // scheduler policy, identical decisions with paged on or off.
-+        const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
++        const int32_t n_decode_in_batch = batch.size();    // D: Phase 1 appended D decode tokens above
 +        int32_t prefill_budget_step  = 0; // 0 = disabled (stock n_batch-only chunking)
 +        int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
          {
@@ -154,9 +140,9 @@ index 5d83b30..f7a114c 100644
                  }
              }
          }
-@@ -3181,11 +3235,18 @@ private:
-                     const int32_t n_before_user = slot.task->params.n_before_user;
-                     const bool n_before_user_known = n_before_user > 0;
+@@ -3509,11 +3563,18 @@ private:
+                     const auto & spans = slot.task->params.message_spans;
+                     const auto last_user_pos = spans.last_user_message_pos();
  
 +                    // (patch 0016) per-slot prompt tokens added this step, for the per-slot
 +                    // chunk cap (resets each slot); n_batch stays the hard compute ceiling
@@ -169,14 +155,14 @@ index 5d83b30..f7a114c 100644
 +                    // (the T - D leftover) is spent across all slots, or (b) this slot's
 +                    // per-slot chunk cap is hit, so a long prompt is split across more steps
 +                    // and leaves batch room for co-batched decode of the other slots
-                     while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+                     while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
 -                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
 +                           (prefill_budget_step  == 0 || n_prompt_budgeted < prefill_budget_step) &&
 +                           (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
                          // get next token to process
                          llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
                          if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3211,7 +3272,8 @@ private:
+@@ -3538,7 +3599,8 @@ private:
                          slot.prompt.tokens.push_back(cur_tok);
  
                          slot.n_prompt_tokens_processed++;
@@ -184,12 +170,12 @@ index 5d83b30..f7a114c 100644
 +                        n_prompt_budgeted++;  // (patch 0016) toward the dynamic per-step prefill budget
 +                        slot_prompt_added++;  // (patch 0016) toward this slot's per-step chunk cap
  
-                         // stop the prompt batch exactly before the latest user input, so a checkpoint
-                         // can be created after the previous messages
-@@ -3321,9 +3383,10 @@ private:
-                     break;
+                         // stop the prompt batch exactly before a user message
+                         if (spans.is_user_start(slot.prompt.n_tokens())) {
+@@ -3624,9 +3686,10 @@ private:
+                 if (!slot_batched) {
+                     slot_batched = &slot;
                  }
- 
 -                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
 -                // leaving the remaining batch capacity for co-batched decode of other slots
 -                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
@@ -197,9 +183,9 @@ index 5d83b30..f7a114c 100644
 +                // budget (the T - D leftover) is spent, leaving the remaining batch
 +                // capacity for co-batched decode of the other slots
 +                if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
-                     break;
+                     add_ok = false;
                  }
-             }
+             });
 -- 
 2.43.0
 
diff --git a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
new file mode 100644
index 000000000..3ad2b3dfb
--- /dev/null
+++ b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
@@ -0,0 +1,202 @@
+# Pin-sync: paged patch-stack -> llama.cpp 9d5d882d
+
+Status: COMPLETE. The paged patch-stack (0001-0024) was rebased onto llama.cpp
+`9d5d882d`, rebuilt clean (CUDA sm_121), and the bit-exact gate is GREEN on both
+the dense and MoE NVFP4 baselines. The LocalAI-side `.patch` files were then
+re-exported from the rebased commits; **4 patch files changed** and are updated
+in this commit. A quick decode bench confirms the patchset performs the same on
+the new tip.
+
+## Upstream jump
+
+- OLD LocalAI pin: `8be759e6`
+- NEW LocalAI pin (target): `9d5d882d` ("model : Add label for LFM2.5-230M (#25008)")
+- Upstream jump `8be759e6..9d5d882d` = **17 commits**.
+
+### Note on the dev-tree base (important)
+The DGX dev tree's `paged` branch was NOT based on the old pin `8be759e6`. Its
+real base (merge-base of `paged` with both pins) is `f3e1828`
+("mtmd: llava_uhd should no longer use batch dim (#24732)"), an ancestor 92
+commits behind `8be759e6`. So the rebase traversed `f3e1828..9d5d882d` =
+**109 upstream commits**, a strictly larger surface than the 17-commit pin bump.
+The end state (paged patches on `9d5d882d`) is identical either way; the larger
+traversal only means the rebase faced the worst-case conflict surface, and it
+still came through bit-exact.
+
+## Rebase
+
+- Command: `git rebase --onto 9d5d882d f3e1828 paged` (merge.conflictStyle=diff3).
+- 26 commits replayed (23 shipped patch-commits + the 2 dev-scaffolding "Gate-0/
+  FA-gate driver" commits + 1 docs commit; the scaffolding/docs commits are not
+  shipped as `.patch` files).
+- Backup ref before rebase: `paged-prerebase-backup` = `a8a9d12` (old patch 0024).
+- New rebased range: `9d5d882d..paged`, HEAD = `2ee65c2` (patch 0024).
+
+### Conflicts during rebase (3 commits, ALL in `tools/server/server-context.cpp`)
+
+Every rebase conflict was in the llama-server continuous-batch scheduler wiring,
+all of which is gated behind env (`LLAMA_KV_PAGED` / `LLAMA_PREFILL_BUDGET` /
+`LLAMA_MAX_BATCH_TOKENS`) and therefore a strict no-op for the gate (the gate
+uses `llama-completion`, not the server, with no env set). The root cause was a
+single upstream refactor of `update_slots()`:
+
+- the outer slot loop became `iterate(slots, [&](server_slot & slot){...})`,
+  replacing bottom-of-loop `break` with a top-of-lambda
+  `if (!add_ok || batch.size() >= n_batch) return;` (the `add_ok` flag is set
+  false on `batch.add()` failure);
+- the embedding/rerank early-exits changed `continue;` -> `return;`;
+- the `server_batch` token count accessor was renamed `batch.n_tokens` ->
+  `batch.size()` (`server_batch` has a `.size()` method and **no** `.n_tokens`
+  member; the raw `llama_batch` in `send_embedding`/`send_rerank` keeps `.n_tokens`).
+
+**patch 0008** (`240758e`, cross-request prefix share) - 1 conflict.
+Hunk 3 (the prefix-commit block) collided with the `continue`->`return` refactor.
+Hunks 1 (namespace shim) and 2 (the share block) applied cleanly. Resolved by
+keeping HEAD's refactored structure and re-inserting the `[paged 0008]`
+`paged_prefix_api::commit(...)` block verbatim after `slot.state = SLOT_STATE_GENERATING;`
+and before `if (slot.can_speculate())`, re-indented to the new (de-nested) level,
+with the identical `paged_kv_commit && cache_prompt && !has_mtmd` guard. Semantics
+unchanged.
+
+**patch 0013** (`6d37431`, static `LLAMA_PREFILL_BUDGET`) - 3 conflicts.
+- C1: inserted the `n_prefill_budget` / `n_prompt_budgeted` var block before
+  HEAD's new `auto & alora_scale = batch.alora_scale;` references (upstream moved
+  alora_scale/disabled_id into the `server_batch` struct).
+- C2: merged the budget gate into HEAD's `while (... batch.size() < n_batch ...)`
+  (took upstream's `batch.size()` rename, kept the budget condition).
+- C3: the original outer `break` was translated to the new idiom `add_ok = false;`
+  (exact semantic equivalent of "stop admitting prompts to remaining slots"); the
+  upstream-removed `if (batch.n_tokens >= n_batch) break;` was dropped (now handled
+  by the top-of-lambda check).
+
+**patch 0016** (`02fa047`, dynamic decode-first budget, supersedes 0013) - 2
+conflicts + 1 clean-hunk fix.
+- The big budget-block rewrite hunk applied cleanly (its expected parent == the
+  faithfully-resolved 0013 block).
+- Clean-hunk fix: the clean-applied line `const int32_t n_decode_in_batch = batch.n_tokens;`
+  read `.n_tokens` from `server_batch`, which has no such member -> changed to
+  `batch.size()` (== D, the Phase-1 decode load; identical value).
+- C-A: while-condition -> took THEIRS (dynamic `prefill_budget_step` +
+  `prefill_cap_per_slot`), adopted `batch.size()`.
+- C-B: admission break -> 0016 dynamic budget check with `break` -> `add_ok = false`,
+  dropped the upstream-removed `batch.n_tokens >= n_batch` break.
+
+OFF-path invariant verified by construction in all three: with the env knobs
+unset (`prefill_budget_step == prefill_cap_per_slot == 0`, `paged_kv_* == false`)
+the added conditions never fire, so the scheduler is byte-identical to stock HEAD.
+
+### Kernel patches: ZERO rebase conflicts
+Patches 0017-0024 - which touch the bit-exact compute paths
+(`gated_delta_net.cu` +330, `mmq.cu`/`mmq.cuh` +209, `ssm-conv.cu` +112,
+`quantize.cu`, `fattn.cu`, `src/models/qwen35.cpp`/`qwen35moe.cpp`/`qwen3next.cpp`,
+`src/llama-kv-cache.*`, `src/paged-*`, `tests/test-backend-ops.cpp` +79) - all
+applied **cleanly** during the rebase (3-way). No math, reduction order, or kernel
+context was touched during conflict resolution.
+
+## Clean rebuild
+`cmake --build build-cuda --target clean && cmake --build build-cuda -j20`,
+preserving the existing CMakeCache (CMAKE_CUDA_ARCHITECTURES=121, GGML_CUDA=ON,
+GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, GGML_CUDA_NCCL=ON). Result: BUILD_EXIT=0,
+all targets at 100%. (The only log "error" is a benign webui `dist.tar.gz`
+download miss, unrelated to the gate binaries.)
+
+## GATE: ALL GREEN
+
+(a) `test-backend-ops` (Backend CUDA0):
+| op | result |
+|----|--------|
+| GATED_DELTA_NET | 36/36 OK |
+| SSM_CONV        | 45/45 OK |
+| MUL_MAT         | 1146/1146 OK |
+| MUL_MAT_ID      | 806/806 OK |
+
+(b) greedy md5 (`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`):
+| model | md5 | baseline | verdict |
+|-------|-----|----------|---------|
+| dense `q36-27b-nvfp4`     | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
+| MoE `q36-35b-a3b-nvfp4`   | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
+
+Bit-exactness preserved across the upstream jump.
+
+## Decode bench sanity (rebased build, post-pin-sync)
+
+`llama-batched-bench -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
+S_TG (decode) tok/s at npl128, patch defaults on:
+| model | npl128 S_TG (new tip) | post-0023 reference | delta |
+|-------|----------------------|---------------------|-------|
+| dense `q36-27b-nvfp4`   | **366.41** | 373.2 | -1.8% |
+| MoE `q36-35b-a3b-nvfp4` | **751.11** | 745.7 | +0.7% |
+
+Both within the +/-3% noise band -> the patchset performs the same on `9d5d882d`.
+(npl32 also matches: dense 205.83 vs 207.6; MoE 438.29 vs 440.0.)
+
+## Export phase: re-export `.patch` files and pick the ones that changed
+
+The committed `.patch` files were generated against the old base. Each shipped
+patch was re-exported from its rebased commit (`git format-patch -1 <commit>`) and
+compared body-to-body against the committed file (ignoring the volatile `From`
+commit-hash line and the `index` blob-hash lines). Classification:
+
+- **CONTENT (real hunk-body change -> MUST update):** `0008`, `0013`, `0015`, `0016`.
+- **LINENUM only (hunk bodies byte-identical, only `@@` line-numbers shifted ->
+  still apply cleanly, left as-is):** `0009`, `0017`, `0018`, `0019`, `0020`,
+  `0021`, `0024`.
+- **IDENTICAL (no change at all):** `0001`, `0002`, `0003`, `0004`, `0006`,
+  `0007`, `0010`, `0011`, `0012`, `0014`, `0022`, `0023`.
+
+An independent isolated `git apply --check` sweep (each shipped patch vs the
+rebased pre-state tree) agreed exactly: the same 4 (`0008`/`0013`/`0015`/`0016`)
+are the only ones that no longer `git apply` to `9d5d882d`. The build applies the
+series with plain `git apply` (Makefile), which tolerates `@@` line-number offsets,
+so the 7 LINENUM patches still apply (verified) and are intentionally left as-is.
+
+### 0015 was a 4th change beyond the 3 conflicting commits
+The rebase reported conflicts in only 3 commits (`0008`/`0013`/`0016`). `0015`
+(expert-density MoE token-tile auto-select) rebased *cleanly* via 3-way merge, but
+its committed `.patch` no longer applies to `9d5d882d` via plain `git apply`:
+upstream inserted a new test case
+(`test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)`)
+in `tests/test-backend-ops.cpp` right at `0015`'s insertion anchor, so the hunk's
+context lines shifted. `0015`'s own inserted lines are unchanged - it is a pure
+context re-anchor, no behavioral change. This is exactly why a per-patch
+re-export/apply-check was run instead of trusting the rebase conflict count.
+
+### What changed in each updated patch (From/index hash noise aside)
+- `0008`: same `[paged 0008]` commit block (identical env-guard + `paged_prefix_api::commit`
+  call), re-indented to the refactored `update_slots` lambda level and re-anchored
+  after `slot.state = SLOT_STATE_GENERATING;`; `@@` headers updated.
+- `0013`: budget var-block / while-gate / admission-break re-expressed against the
+  refactored loop (`batch.size()`, `add_ok=false`); `@@` headers updated.
+- `0015`: hunk context re-anchored around the new upstream test case; inserted
+  lines identical; `@@` header updated.
+- `0016`: dynamic budget block + `n_decode_in_batch = batch.size()` + admission
+  `add_ok=false` against the refactored loop; `@@` headers updated.
+
+## Equivalence proof (the updated series == the gate-green tree)
+
+The 4 updated files are byte-faithful `git format-patch -1` exports of the
+gate-green rebased commits (`240758e`, `6d37431`, `5349f82`, `02fa047`). Applying
+the full corrected series (the 19 unchanged committed patches + the 4 re-exports)
+in order to a fresh bare `9d5d882d` checkout with plain `git apply` succeeds for
+all 23 patches, and the resulting tree is **byte-identical to the gate-green
+`paged` tip (`2ee65c2`) for every code file** (`git diff` over all paths except
+`*.md` and the unshipped `examples/simple/*` scaffold drivers is empty). So the
+shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
+the md5 bit-exact gate, and the bench.
+
+## Pre-existing finding (NOT introduced by this pin-sync, NOT fixed here)
+Committed patch `0019` carries a *modify* hunk against the dev-only doc
+`SSM_DECODE_FIX_RESULTS.md` (`index 2e7c8c2..77879e4 100644`), a file that exists
+only because of an unshipped docs commit on the dev tree and is absent from a
+clean llama.cpp checkout. Under strict `git apply` that hunk fails ("No such file
+or directory"). This is pin-independent (the file is upstream-absent on both
+`8be759e6` and `9d5d882d`) and present identically in the old and new `0019`
+(LINENUM class), so it is left untouched to keep the pin-sync faithful. (`0021`'s
+`CONV_STATE_FUSION_RESULTS.md` is a *create* hunk and applies fine.) Stripping the
+stray dev-doc hunks from the shipped patches is a separate cleanup, out of scope
+for the pin-sync.
+
+## Source of truth
+The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD
+`2ee65c2`) is the source of truth; `paged-prerebase-backup` (`a8a9d12`) retains
+the pre-rebase state.