feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the
paged patch-stack (0001-0024) to the new tip: the stack was rebased onto
9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated
bit-exact before re-exporting the LocalAI .patch files.

Re-exporting each shipped patch from its rebased commit and diffing body-to-body
against the committed files identifies exactly 4 whose hunk bodies changed and
that no longer apply to 9d5d882d with plain git apply:

- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to
  the refactored update_slots() lambda (continue->return, batch.n_tokens->
  batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break
  re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream
  inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The
  inserted lines are unchanged. (This one rebased cleanly via 3-way but its
  committed .patch no longer applies with plain git apply, so it is caught by the
  per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch =
  batch.size() + add_ok=false against the refactored loop.

All four are byte-faithful format-patch exports of the gate-green rebased commits.
Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green
tree byte-for-byte across every code file.

The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only
(hunk bodies byte-identical, only @@ line-numbers shifted) and still apply
cleanly, so they are left unchanged. The remaining 12 patches are identical.

Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):
- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT
  1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4
  5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4
  07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s
  (ref 745.7, +0.7%), both within the +/-3% noise band.

Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-26 14:12:36 +00:00
parent 30a2b590d9
commit ec7c1b1f68
5 changed files with 279 additions and 92 deletions


@@ -1,4 +1,4 @@
From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 17:02:22 +0200
Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
@@ -51,10 +51,10 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
1 file changed, 50 insertions(+)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index da6a475..04c6361 100644
index 39b7eb2..b5f9d37 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -15,6 +15,16 @@
@@ -16,6 +16,16 @@
#include "mtmd.h"
#include "mtmd-helper.h"
@@ -71,7 +71,7 @@ index da6a475..04c6361 100644
#include <algorithm>
#include <cstddef>
#include <cinttypes>
@@ -3007,6 +3017,37 @@ private:
@@ -3335,6 +3345,37 @@ private:
}
}
@@ -109,22 +109,22 @@ index da6a475..04c6361 100644
// [TAG_PROMPT_LOGITS]
if (n_past == slot.task->n_tokens() && n_past > 0) {
SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
@@ -3427,6 +3468,15 @@ private:
// prompt evaluated for next-token prediction
slot.state = SLOT_STATE_GENERATING;
@@ -3741,6 +3782,15 @@ private:
// prompt evaluated for next-token prediction
slot.state = SLOT_STATE_GENERATING;
+ // [paged 0008] Publish this slot's computed prefix so concurrent/later
+ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
+ // for [0, n_tokens) has just run, so the prefix KV is computed.
+ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
+ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
+ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
+ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
+ }
+ // [paged 0008] Publish this slot's computed prefix so concurrent/later
+ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
+ // for [0, n_tokens) has just run, so the prefix KV is computed.
+ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
+ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
+ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
+ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
+ }
+
if (slot.can_speculate()) {
common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
}
if (slot.can_speculate()) {
common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
}
--
2.43.0


@@ -1,4 +1,4 @@
From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 11:52:45 +0200
Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
@@ -62,14 +62,14 @@ stays disjoint from the paged allocation hunks.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)
tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 04c6361..5d83b30 100644
index b5f9d37..afcdebe 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2723,6 +2723,29 @@ private:
@@ -3043,6 +3043,29 @@ private:
int32_t n_batch = llama_n_batch(ctx_tgt);
int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
@@ -96,42 +96,41 @@ index 04c6361..5d83b30 100644
+ }
+ int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
+
float alora_scale = -1.0f;
size_t alora_disabled_id = 0;
auto & alora_scale = batch.alora_scale;
auto & alora_disabled_id = batch.alora_disabled_id;
@@ -3159,7 +3182,10 @@ private:
const bool n_before_user_known = n_before_user > 0;
@@ -3487,7 +3510,10 @@ private:
const auto last_user_pos = spans.last_user_message_pos();
// add prompt tokens for processing in the current batch
- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) {
+ // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+ // prompt is split across more steps and leaves batch room for co-batched decode
+ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
+ (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
// get next token to process
llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3185,6 +3211,7 @@ private:
@@ -3512,6 +3538,7 @@ private:
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
+ n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
// stop the prompt batch exactly before the latest user input, so a checkpoint
// can be created after the previous messages
@@ -3293,6 +3320,12 @@ private:
if (batch.n_tokens >= n_batch) {
break;
// stop the prompt batch exactly before a user message
if (spans.is_user_start(slot.prompt.n_tokens())) {
@@ -3597,6 +3624,11 @@ private:
if (!slot_batched) {
slot_batched = &slot;
}
+
+ // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+ // leaving the remaining batch capacity for co-batched decode of other slots
+ if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+ break;
+ add_ok = false;
+ }
}
});
}
}
--
2.43.0


@@ -1,4 +1,4 @@
From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 21:03:00 +0200
Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
@@ -207,12 +207,12 @@ index cff608e..9718b12 100644
}
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 15ae389..f219309 100644
index c83e91f..62a0989 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
// gpt-oss issue with Vulkan mmq_id
@@ -8603,6 +8603,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+ // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
+ // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,


@@ -1,54 +1,40 @@
From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 24 Jun 2026 07:44:25 +0000
Date: Wed, 24 Jun 2026 10:11:48 +0200
Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
0016, continuous-batch P1)
Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
decode-first token budget: the P1 of the token-granular continuous-batch
scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
change only inside update_slots(): no new slot states, no batch-formation
rewrite, zero libllama changes. llama-server already emits one unified
mixed prefill+decode batch per step (Phase 1 appends every ready decode
token unconditionally; Phase 2 fills prefill into the same batch); 0013
already ships that mixed ubatch. 0016 only changes the COUNT of prefill
tokens admitted per step.
scheduler. POLICY change only inside update_slots(): no new slot states, no
batch-formation rewrite, zero libllama changes. llama-server already emits one
unified mixed prefill+decode batch per step (Phase 1 appends every ready decode
token unconditionally; Phase 2 fills prefill into the same batch). 0016 only
changes the COUNT of prefill tokens admitted per step.
The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
== D (the live decode load) is known there. Instead of 0013's constant
LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
one long prompt monopolise the step), compute a dynamic budget:
LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one
long prompt monopolise the step), compute a dynamic budget:
T = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
T = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch)
prefill_budget_step = max(n_ubatch, T - D) (leftover after decode,
auto-shrinks as decode load rises so the step never inflates past T)
prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
(the long_prefill_token_threshold analogue: one long prompt cannot
eat the whole leftover; LLAMA_PREFILL_CAP overrides)
prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch,
pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides)
Phase 2's inner prompt-fill loop and outer admission break are bounded by
prefill_budget_step (across slots) and a new per-slot slot_prompt_added
counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
ceiling stays as the compute bound. Decode is structurally claimed first
and never capped (Phase 1), so the decode-first guarantee is free.
Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
that is net-negative at low npl and costs MoE TTFT; the T - D budget is
self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
TTFT + tuning-free robustness + clean supersession of 0013.
counter; the n_batch hard ceiling stays as the compute bound. Decode is
structurally claimed first and never capped (Phase 1), so the decode-first
guarantee is free.
DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
to stock. The degenerate T == n_batch case is byte-identical to stock/0013
(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
ceiling at the same point, so no new bound fires. The legacy
LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.
to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the
determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly
(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly
subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical
decisions paged on or off.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -57,10 +43,10 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
1 file changed, 85 insertions(+), 22 deletions(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 5d83b30..f7a114c 100644
index afcdebe..b8b8f00 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -2723,24 +2723,78 @@ private:
@@ -3043,24 +3043,78 @@ private:
int32_t n_batch = llama_n_batch(ctx_tgt);
int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
@@ -112,7 +98,7 @@ index 5d83b30..f7a114c 100644
+ // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
+ // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
+ // scheduler policy, identical decisions with paged on or off.
+ const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
+ const int32_t n_decode_in_batch = batch.size(); // D: Phase 1 appended D decode tokens above
+ int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking)
+ int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
{
@@ -154,9 +140,9 @@ index 5d83b30..f7a114c 100644
}
}
}
@@ -3181,11 +3235,18 @@ private:
const int32_t n_before_user = slot.task->params.n_before_user;
const bool n_before_user_known = n_before_user > 0;
@@ -3509,11 +3563,18 @@ private:
const auto & spans = slot.task->params.message_spans;
const auto last_user_pos = spans.last_user_message_pos();
+ // (patch 0016) per-slot prompt tokens added this step, for the per-slot
+ // chunk cap (resets each slot); n_batch stays the hard compute ceiling
@@ -169,14 +155,14 @@ index 5d83b30..f7a114c 100644
+ // (the T - D leftover) is spent across all slots, or (b) this slot's
+ // per-slot chunk cap is hit, so a long prompt is split across more steps
+ // and leaves batch room for co-batched decode of the other slots
while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
- (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+ (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) &&
+ (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
// get next token to process
llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3211,7 +3272,8 @@ private:
@@ -3538,7 +3599,8 @@ private:
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
@@ -184,12 +170,12 @@ index 5d83b30..f7a114c 100644
+ n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget
+ slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap
// stop the prompt batch exactly before the latest user input, so a checkpoint
// can be created after the previous messages
@@ -3321,9 +3383,10 @@ private:
break;
// stop the prompt batch exactly before a user message
if (spans.is_user_start(slot.prompt.n_tokens())) {
@@ -3624,9 +3686,10 @@ private:
if (!slot_batched) {
slot_batched = &slot;
}
- // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
- // leaving the remaining batch capacity for co-batched decode of other slots
- if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
@@ -197,9 +183,9 @@ index 5d83b30..f7a114c 100644
+ // budget (the T - D leftover) is spent, leaving the remaining batch
+ // capacity for co-batched decode of the other slots
+ if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
break;
add_ok = false;
}
}
});
--
2.43.0


@@ -0,0 +1,202 @@
# Pin-sync: paged patch-stack -> llama.cpp 9d5d882d
Status: COMPLETE. The paged patch-stack (0001-0024) was rebased onto llama.cpp
`9d5d882d`, rebuilt clean (CUDA sm_121), and the bit-exact gate is GREEN on both
the dense and MoE NVFP4 baselines. The LocalAI-side `.patch` files were then
re-exported from the rebased commits; **4 patch files changed** and are updated
in this commit. A quick decode bench confirms the patchset performs the same on
the new tip.
## Upstream jump
- OLD LocalAI pin: `8be759e6`
- NEW LocalAI pin (target): `9d5d882d` ("model : Add label for LFM2.5-230M (#25008)")
- Upstream jump `8be759e6..9d5d882d` = **17 commits**.
### Note on the dev-tree base (important)
The DGX dev tree's `paged` branch was NOT based on the old pin `8be759e6`. Its
real base (merge-base of `paged` with both pins) is `f3e1828`
("mtmd: llava_uhd should no longer use batch dim (#24732)"), an ancestor of
`8be759e6` that sits 92 commits behind it. So the rebase traversed
`f3e1828..9d5d882d` = 92 + 17 = **109 upstream commits**, a strictly larger
surface than the 17-commit pin bump. The end state (paged patches on `9d5d882d`)
is identical either way; the longer traverse only means the rebase faced the
worst-case conflict surface, and it still came through bit-exact.
## Rebase
- Command: `git rebase --onto 9d5d882d f3e1828 paged` (merge.conflictStyle=diff3).
- 26 commits replayed (23 shipped patch-commits + 2 dev-scaffolding "Gate-0/
FA-gate driver" commits + 1 docs commit; the scaffolding and docs commits are not
shipped as `.patch` files).
- Backup ref before rebase: `paged-prerebase-backup` = `a8a9d12` (old patch 0024).
- New rebased range: `9d5d882d..paged`, HEAD = `2ee65c2` (patch 0024).
### Conflicts during rebase (3 commits, ALL in `tools/server/server-context.cpp`)
Every rebase conflict was in the llama-server continuous-batch scheduler wiring.
All of that wiring is gated behind environment variables (`LLAMA_KV_PAGED` /
`LLAMA_PREFILL_BUDGET` / `LLAMA_MAX_BATCH_TOKENS`) and is therefore a strict
no-op for the gate, which runs `llama-completion` rather than the server and sets
none of them. The root cause was a single upstream refactor of `update_slots()`:
- the outer slot loop became `iterate(slots, [&](server_slot & slot){...})`,
replacing bottom-of-loop `break` with a top-of-lambda
`if (!add_ok || batch.size() >= n_batch) return;` (the `add_ok` flag is set
false on `batch.add()` failure);
- the embedding/rerank early-exits changed `continue;` -> `return;`;
- the `server_batch` token count accessor was renamed `batch.n_tokens` ->
`batch.size()` (`server_batch` has a `.size()` method and **no** `.n_tokens`
member; the raw `llama_batch` in `send_embedding`/`send_rerank` keeps `.n_tokens`).
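
The refactor described above can be sketched in isolation. This is a minimal illustration, not the upstream code: `Slot`, `Batch`, and the `iterate` helper below are hypothetical stand-ins, and only the control-flow shape (a loop-level `break` versus an `add_ok` flag checked at the top of a lambda) is taken from the description.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-ins; the real server_slot / server_batch are far richer.
struct Slot { int id; int n_prompt; };

struct Batch {
    std::vector<int> tokens;
    int size() const { return static_cast<int>(tokens.size()); }
    // Returns false when the batch is already full (assumed failure mode).
    bool add(int tok, int n_batch) {
        if (size() >= n_batch) return false;
        tokens.push_back(tok);
        return true;
    }
};

// Hypothetical helper mirroring the iterate(slots, lambda) shape.
inline void iterate(std::vector<Slot> & slots, const std::function<void(Slot &)> & fn) {
    for (auto & s : slots) fn(s);
}

// Old shape: a plain loop that stops admitting slots with a bottom-of-loop break.
inline std::vector<int> fill_with_break(std::vector<Slot> & slots, int n_batch) {
    Batch batch;
    for (auto & slot : slots) {
        for (int i = 0; i < slot.n_prompt && batch.size() < n_batch; ++i) {
            batch.add(slot.id, n_batch);
        }
        if (batch.size() >= n_batch) break;
    }
    return batch.tokens;
}

// New shape: a lambda cannot break the caller's loop, so a flag plus an
// early return at the top of the lambda skips the remaining slots.
inline std::vector<int> fill_with_flag(std::vector<Slot> & slots, int n_batch) {
    Batch batch;
    bool add_ok = true;
    iterate(slots, [&](Slot & slot) {
        if (!add_ok || batch.size() >= n_batch) return;
        for (int i = 0; i < slot.n_prompt; ++i) {
            if (!batch.add(slot.id, n_batch)) { add_ok = false; break; }
        }
    });
    return batch.tokens;
}
```

In this toy model both shapes admit the same tokens in the same order, which is the property the conflict resolutions below rely on when they translate `break` into `add_ok = false`.
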

**patch 0008** (`240758e`, cross-request prefix share) - 1 conflict.
Hunk 3 (the prefix-commit block) collided with the `continue`->`return` refactor.
Hunks 1 (namespace shim) and 2 (the share block) applied cleanly. Resolved by
keeping HEAD's refactored structure and re-inserting the `[paged 0008]`
`paged_prefix_api::commit(...)` block verbatim after `slot.state = SLOT_STATE_GENERATING;`
and before `if (slot.can_speculate())`, re-indented to the new (de-nested) level,
with the identical `paged_kv_commit && cache_prompt && !has_mtmd` guard. Semantics
unchanged.

**patch 0013** (`6d37431`, static `LLAMA_PREFILL_BUDGET`) - 3 conflicts.
- C1: inserted the `n_prefill_budget` / `n_prompt_budgeted` var block before
HEAD's new `auto & alora_scale = batch.alora_scale;` references (upstream moved
alora_scale/disabled_id into the `server_batch` struct).
- C2: merged the budget gate into HEAD's `while (... batch.size() < n_batch ...)`
(took upstream's `batch.size()` rename, kept the budget condition).
- C3: the original outer `break` was translated to the new idiom `add_ok = false;`
(exact semantic equivalent of "stop admitting prompts to remaining slots"); the
upstream-removed `if (batch.n_tokens >= n_batch) break;` was dropped (now handled
by the top-of-lambda check).

**patch 0016** (`02fa047`, dynamic decode-first budget, supersedes 0013) - 2
conflicts + 1 clean-hunk fix.
- The big budget-block rewrite hunk applied cleanly (its expected parent == the
faithfully-resolved 0013 block).
- Clean-hunk fix: the clean-applied line `const int32_t n_decode_in_batch = batch.n_tokens;`
referenced the `server_batch` member, which has no `.n_tokens` -> changed to
`batch.size()` (== D, the Phase-1 decode load; identical value).
- C-A: while-condition -> took THEIRS (dynamic `prefill_budget_step` +
`prefill_cap_per_slot`), adopted `batch.size()`.
- C-B: admission break -> 0016 dynamic budget check with `break` -> `add_ok = false`,
dropped the upstream-removed `batch.n_tokens >= n_batch` break.
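
For orientation, the dynamic budget that patch 0016 computes can be sketched from the formulas in its commit message. This is a hedged sketch, not the patch body: the struct and function names are hypothetical, environment parsing and the `LLAMA_PREFILL_CAP` override are omitted, and it assumes `n_ubatch <= n_batch` and that `LLAMA_MAX_BATCH_TOKENS` is set.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical container for the three derived values.
struct PrefillBudget {
    int32_t T;             // per-step token target
    int32_t budget_step;   // prefill tokens admitted per step, across slots
    int32_t cap_per_slot;  // prefill tokens admitted per step, per slot
};

// max_batch_tokens: the requested LLAMA_MAX_BATCH_TOKENS value (assumed set).
// n_decode:         D, the decode tokens Phase 1 already put in the batch.
inline PrefillBudget dynamic_prefill_budget(int32_t max_batch_tokens, int32_t n_batch,
                                            int32_t n_ubatch, int32_t n_ctx,
                                            int32_t n_decode) {
    PrefillBudget b{};
    // T = clamp(max_batch_tokens, n_ubatch, n_batch)
    b.T = std::max(n_ubatch, std::min(max_batch_tokens, n_batch));
    // Leftover after decode, never below one ubatch.
    b.budget_step = std::max(n_ubatch, b.T - n_decode);
    if (b.T == n_batch) {
        // Degenerate case: the cap is pinned to n_batch.
        b.cap_per_slot = n_batch;
    } else {
        const int32_t ctx_cap = static_cast<int32_t>(std::ceil(0.04 * n_ctx));
        b.cap_per_slot = std::max(n_ubatch, std::min(b.T, ctx_cap));
    }
    return b;
}
```

With `n_batch = 2048`, `n_ubatch = 512`, `n_ctx = 16384`, and a requested target of 1024, a decode load of 128 leaves a step budget of 896 and a per-slot cap of 656. Raising the decode load to 900 shrinks the step budget to the `n_ubatch` floor of 512.
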

OFF-path invariant verified by construction in all three: with the env knobs
unset (`prefill_budget_step == prefill_cap_per_slot == 0`, `paged_kv_* == false`)
the added conditions never fire, so the scheduler is byte-identical to stock HEAD.
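
The "never fire" argument can be checked on a reduced model of the prompt-fill gate. The sketch below is an assumption-laden simplification: it models a single slot, uses hypothetical function names, and keeps only the loop condition shown in the 0016 hunk (`0` meaning "disabled" for both knobs).

```cpp
#include <cassert>
#include <cstdint>

// Stock gate: fill until the prompt is consumed or the batch is full.
inline int32_t admit_prompt_tokens_stock(int32_t n_prompt_left, int32_t n_in_batch,
                                         int32_t n_batch) {
    int32_t added = 0;
    while (n_prompt_left > 0 && n_in_batch < n_batch) {
        --n_prompt_left; ++n_in_batch; ++added;
    }
    return added;
}

// Patched gate (single-slot model): the two extra conditions are skipped
// entirely when their knob is 0, so they cannot stop the loop early.
inline int32_t admit_prompt_tokens(int32_t n_prompt_left, int32_t n_in_batch,
                                   int32_t n_batch, int32_t prefill_budget_step,
                                   int32_t prefill_cap_per_slot) {
    int32_t n_prompt_budgeted = 0;  // across slots in the real code; one slot here
    int32_t slot_prompt_added = 0;
    while (n_prompt_left > 0 && n_in_batch < n_batch &&
           (prefill_budget_step  == 0 || n_prompt_budgeted < prefill_budget_step) &&
           (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
        --n_prompt_left; ++n_in_batch;
        ++n_prompt_budgeted; ++slot_prompt_added;
    }
    return slot_prompt_added;
}
```

With both knobs at `0` the patched gate admits exactly what the stock gate admits; with non-zero knobs the smaller of the two bounds wins.
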
### Kernel patches: ZERO rebase conflicts
Patches 0017-0024 - which touch the bit-exact compute paths
(`gated_delta_net.cu` +330, `mmq.cu`/`mmq.cuh` +209, `ssm-conv.cu` +112,
`quantize.cu`, `fattn.cu`, `src/models/qwen35.cpp`/`qwen35moe.cpp`/`qwen3next.cpp`,
`src/llama-kv-cache.*`, `src/paged-*`, `tests/test-backend-ops.cpp` +79) - all
applied **cleanly** during the rebase (3-way). No math, reduction order, or kernel
context was touched during conflict resolution.
## Clean rebuild
`cmake --build build-cuda --target clean && cmake --build build-cuda -j20`,
preserving the existing CMakeCache (CMAKE_CUDA_ARCHITECTURES=121, GGML_CUDA=ON,
GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, GGML_CUDA_NCCL=ON). Result: BUILD_EXIT=0,
all targets at 100%. (The only log "error" is a benign webui `dist.tar.gz`
download miss, unrelated to the gate binaries.)
## GATE: ALL GREEN
(a) `test-backend-ops` (Backend CUDA0):

| op | result |
|----|--------|
| GATED_DELTA_NET | 36/36 OK |
| SSM_CONV | 45/45 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |

(b) greedy md5 (`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`):

| model | md5 | baseline | verdict |
|-------|-----|----------|---------|
| dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |

Bit-exactness preserved across the upstream jump.
## Decode bench sanity (rebased build, post-pin-sync)
`llama-batched-bench -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
S_TG (decode) tok/s at npl128, patch defaults on:

| model | npl128 S_TG (new tip) | post-0023 reference | delta |
|-------|----------------------|---------------------|-------|
| dense `q36-27b-nvfp4` | **366.41** | 373.2 | -1.8% |
| MoE `q36-35b-a3b-nvfp4` | **751.11** | 745.7 | +0.7% |

Both deltas fall within the +/-3% noise band, so decode throughput on `9d5d882d`
is unchanged within measurement noise.
(npl32 also matches: dense 205.83 vs 207.6; MoE 438.29 vs 440.0.)
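
The delta column and the noise-band verdict are plain relative differences. A small sketch of that arithmetic follows; the function names are hypothetical, and the 3% default is the band quoted above.

```cpp
#include <cassert>
#include <cmath>

// Relative change of a measurement against its reference, in percent.
inline double delta_pct(double measured, double reference) {
    return (measured - reference) / reference * 100.0;
}

// True when the measurement is within +/- band_pct of the reference.
inline bool within_noise(double measured, double reference, double band_pct = 3.0) {
    return std::fabs(delta_pct(measured, reference)) <= band_pct;
}
```

For the dense row, `delta_pct(366.41, 373.2)` rounds to -1.8, and for the MoE row `delta_pct(751.11, 745.7)` rounds to +0.7. The npl32 figures pass the same check.
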
## Export phase: re-export `.patch` files and pick the ones that changed
The committed `.patch` files were generated against the old base. Each shipped
patch was re-exported from its rebased commit (`git format-patch -1 <commit>`) and
compared body-to-body against the committed file (ignoring the volatile `From`
commit-hash line and the `index` blob-hash lines). Classification:
- **CONTENT (real hunk-body change -> MUST update):** `0008`, `0013`, `0015`, `0016`.
- **LINENUM only (hunk bodies byte-identical, only `@@` line-numbers shifted ->
still apply cleanly, left as-is):** `0009`, `0017`, `0018`, `0019`, `0020`,
`0021`, `0024`.
- **IDENTICAL (no change at all):** `0001`, `0002`, `0003`, `0004`, `0006`,
`0007`, `0010`, `0011`, `0012`, `0014`, `0022`, `0023`.
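
The three-way classification can be expressed as two normalized comparisons. The sketch below is an illustration under stated assumptions, not the tooling actually used: the names are hypothetical, it drops any line starting with `From ` or `index ` (which would also drop such a line inside a commit message), and it collapses each `@@` hunk header, including its trailing context text, to a bare `@@`.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

enum class PatchDelta { IDENTICAL, LINENUM, CONTENT };

inline bool starts_with(const std::string & s, const std::string & prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Split a patch into lines, dropping the volatile "From <sha> ..." and
// "index <blob>..<blob>" lines. With strip_hunk_numbers set, every hunk
// header is reduced to "@@" so shifted line numbers no longer matter.
inline std::vector<std::string> normalize(const std::string & patch, bool strip_hunk_numbers) {
    std::vector<std::string> out;
    std::istringstream in(patch);
    std::string line;
    while (std::getline(in, line)) {
        if (starts_with(line, "From ") || starts_with(line, "index ")) continue;
        if (strip_hunk_numbers && starts_with(line, "@@ ")) line = "@@";
        out.push_back(line);
    }
    return out;
}

inline PatchDelta classify(const std::string & committed, const std::string & reexported) {
    if (normalize(committed, false) == normalize(reexported, false)) return PatchDelta::IDENTICAL;
    if (normalize(committed, true)  == normalize(reexported, true))  return PatchDelta::LINENUM;
    return PatchDelta::CONTENT;
}
```

A patch whose only differences are the commit hash and blob hashes classifies as `IDENTICAL`; one whose hunk headers also moved classifies as `LINENUM`; any difference in context or added lines yields `CONTENT`.
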

An independent isolated `git apply --check` sweep (each shipped patch vs the
rebased pre-state tree) agreed exactly: the same 4 (`0008`/`0013`/`0015`/`0016`)
are the only ones that no longer `git apply` to `9d5d882d`. The build applies the
series with plain `git apply` (Makefile) which tolerates `@@` line-number offsets,
so the 7 LINENUM patches still apply (verified) and are intentionally not churned.
### 0015 was a 4th change beyond the 3 conflicted rebase commits
The rebase reported conflicts in only 3 commits (`0008`/`0013`/`0016`). `0015`
(expert-density MoE token-tile auto-select) rebased *cleanly* via 3-way merge, but
its committed `.patch` no longer applies to `9d5d882d` via plain `git apply`:
upstream inserted a new test case
(`test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)`)
in `tests/test-backend-ops.cpp` right at `0015`'s insertion anchor, so the hunk's
context lines changed. `0015`'s own inserted lines are unchanged - it is a pure
context re-anchor, no behavioral change. This is exactly why a per-patch
re-export/apply-check was run instead of trusting the list of conflicted commits.
### What changed in each updated patch (From/index hash noise aside)
- `0008`: same `[paged 0008]` commit block (identical env-guard + `paged_prefix_api::commit`
call), re-indented to the refactored `update_slots` lambda level and re-anchored
after `slot.state = SLOT_STATE_GENERATING;`; `@@` headers updated.
- `0013`: budget var-block / while-gate / admission-break re-expressed against the
refactored loop (`batch.size()`, `add_ok=false`); `@@` headers updated.
- `0015`: hunk context re-anchored around the new upstream test case; inserted
lines identical; `@@` header updated.
- `0016`: dynamic budget block + `n_decode_in_batch = batch.size()` + admission
`add_ok=false` against the refactored loop; `@@` headers updated.
## Equivalence proof (the updated series == the gate-green tree)
The 4 updated files are byte-faithful `git format-patch -1` exports of the
gate-green rebased commits (`240758e`, `6d37431`, `5349f82`, `02fa047`). Applying
the full corrected series (the 19 unchanged committed patches + the 4 re-exports)
in order to a fresh bare `9d5d882d` checkout with plain `git apply` succeeds for
all 23 patches, and the resulting tree is **byte-identical to the gate-green
`paged` tip (`2ee65c2`) for every code file** (`git diff` over all paths except
`*.md` and the unshipped `examples/simple/*` scaffold drivers is empty). So the
shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
the md5 bit-exact gate, and the bench.
## Pre-existing finding (NOT introduced by this pin-sync, NOT fixed here)
Committed patch `0019` carries a *modify* hunk against the dev-only doc
`SSM_DECODE_FIX_RESULTS.md` (`index 2e7c8c2..77879e4 100644`), a file that exists
only because of an unshipped docs commit on the dev tree and is absent from a
clean llama.cpp checkout. Under strict `git apply` that hunk fails ("No such file
or directory"). This is pin-independent (the file is upstream-absent on both
`8be759e6` and `9d5d882d`) and present identically in the old and new `0019`
(LINENUM class), so it is left untouched to keep the pin-sync faithful. (`0021`'s
`CONV_STATE_FUSION_RESULTS.md` is a *create* hunk and applies fine.) Stripping the
stray dev-doc hunks from the shipped patches is a separate cleanup, out of scope
for the pin-sync.
## Source of truth
The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD
`2ee65c2`) is the source of truth; `paged-prerebase-backup` (`a8a9d12`) retains
the pre-rebase state.