From ec7c1b1f687ed578659498d029e645b7913ed4b2 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Fri, 26 Jun 2026 14:12:36 +0000 Subject: [PATCH] feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches) The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the paged patch-stack (0001-0024) to the new tip: the stack was rebased onto 9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated bit-exact before re-exporting the LocalAI .patch files. Re-exporting each shipped patch from its rebased commit and diffing body-to-body against the committed files identifies exactly 4 that changed and no longer git-apply to 9d5d882d: - 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to the refactored update_slots() lambda (continue->return, batch.n_tokens-> batch.size()); identical env-guarded logic. - 0013 static prefill budget: budget var-block / while-gate / admission-break re-expressed against the refactored loop (add_ok=false idiom). - 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The inserted lines are unchanged. (This one rebased cleanly via 3-way but its committed .patch no longer applies with plain git apply, so it is caught by the per-patch apply-check, not by the rebase conflict count.) - 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch = batch.size() + add_ok=false against the refactored loop. All four are byte-faithful format-patch exports of the gate-green rebased commits. Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green tree byte-for-byte across every code file. The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only (hunk bodies byte-identical, only @@ line-numbers shifted) and still apply cleanly, so they are left unchanged. The remaining patches are identical. 
Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121): - test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 all OK. - greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd, both == baseline. - decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s (ref 745.7, +0.7%), both within noise. Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- ...uest-prefix-share-env-LLAMA_KV_PAGED.patch | 36 ++-- ...paged-decoupled-prefill-token-budget.patch | 41 ++-- ...ity-aware-moe-token-tile-auto-select.patch | 8 +- ...amic-prefill-budget-continuous-batch.patch | 84 +++----- .../patches/paged/PIN_SYNC_9d5d882d.md | 202 ++++++++++++++++++ 5 files changed, 279 insertions(+), 92 deletions(-) create mode 100644 backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md diff --git a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch index d0e32349e..a739919ff 100644 --- a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch +++ b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch @@ -1,4 +1,4 @@ -From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001 +From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Mon, 22 Jun 2026 17:02:22 +0200 Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED) @@ -51,10 +51,10 @@ Signed-off-by: Ettore Di Giacinto 1 file changed, 50 insertions(+) diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp -index 
da6a475..04c6361 100644 +index 39b7eb2..b5f9d37 100644 --- a/tools/server/server-context.cpp +++ b/tools/server/server-context.cpp -@@ -15,6 +15,16 @@ +@@ -16,6 +16,16 @@ #include "mtmd.h" #include "mtmd-helper.h" @@ -71,7 +71,7 @@ index da6a475..04c6361 100644 #include #include #include -@@ -3007,6 +3017,37 @@ private: +@@ -3335,6 +3345,37 @@ private: } } @@ -109,22 +109,22 @@ index da6a475..04c6361 100644 // [TAG_PROMPT_LOGITS] if (n_past == slot.task->n_tokens() && n_past > 0) { SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens()); -@@ -3427,6 +3468,15 @@ private: - // prompt evaluated for next-token prediction - slot.state = SLOT_STATE_GENERATING; +@@ -3741,6 +3782,15 @@ private: + // prompt evaluated for next-token prediction + slot.state = SLOT_STATE_GENERATING; -+ // [paged 0008] Publish this slot's computed prefix so concurrent/later -+ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode -+ // for [0, n_tokens) has just run, so the prefix KV is computed. -+ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr; -+ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) { -+ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens(); -+ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size()); -+ } ++ // [paged 0008] Publish this slot's computed prefix so concurrent/later ++ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode ++ // for [0, n_tokens) has just run, so the prefix KV is computed. 
++ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr; ++ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) { ++ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens(); ++ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size()); ++ } + - if (slot.can_speculate()) { - common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens()); - } + if (slot.can_speculate()) { + common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens()); + } -- 2.43.0 diff --git a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch index ffbd01f8e..29a9ca226 100644 --- a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch +++ b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch @@ -1,4 +1,4 @@ -From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001 +From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Tue, 23 Jun 2026 11:52:45 +0200 Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch @@ -62,14 +62,14 @@ stays disjoint from the paged allocation hunks. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto --- - tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++- - 1 file changed, 34 insertions(+), 1 deletion(-) + tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++- + 1 file changed, 33 insertions(+), 1 deletion(-) diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp -index 04c6361..5d83b30 100644 +index b5f9d37..afcdebe 100644 --- a/tools/server/server-context.cpp +++ b/tools/server/server-context.cpp -@@ -2723,6 +2723,29 @@ private: +@@ -3043,6 +3043,29 @@ private: int32_t n_batch = llama_n_batch(ctx_tgt); int32_t n_ubatch = llama_n_ubatch(ctx_tgt); @@ -96,42 +96,41 @@ index 04c6361..5d83b30 100644 + } + int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots) + - float alora_scale = -1.0f; - size_t alora_disabled_id = 0; + auto & alora_scale = batch.alora_scale; + auto & alora_disabled_id = batch.alora_disabled_id; -@@ -3159,7 +3182,10 @@ private: - const bool n_before_user_known = n_before_user > 0; +@@ -3487,7 +3510,10 @@ private: + const auto last_user_pos = spans.last_user_message_pos(); // add prompt tokens for processing in the current batch -- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) { +- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) { + // (patch 0013) also stop once the per-step prefill budget is spent, so a long + // prompt is split across more steps and leaves batch room for co-batched decode -+ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch && ++ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch && + (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) { // get next token to process llama_token cur_tok = input_tokens[slot.prompt.n_tokens()]; if (cur_tok == LLAMA_TOKEN_NULL) { -@@ -3185,6 +3211,7 @@ private: +@@ -3512,6 +3538,7 @@ 
private: slot.prompt.tokens.push_back(cur_tok); slot.n_prompt_tokens_processed++; + n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget - // stop the prompt batch exactly before the latest user input, so a checkpoint - // can be created after the previous messages -@@ -3293,6 +3320,12 @@ private: - if (batch.n_tokens >= n_batch) { - break; + // stop the prompt batch exactly before a user message + if (spans.is_user_start(slot.prompt.n_tokens())) { +@@ -3597,6 +3624,11 @@ private: + if (!slot_batched) { + slot_batched = &slot; } -+ + // (patch 0013) stop adding prompts once the per-step prefill budget is spent, + // leaving the remaining batch capacity for co-batched decode of other slots + if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) { -+ break; ++ add_ok = false; + } - } + }); } - + } -- 2.43.0 diff --git a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch index 81dfd8d5f..519ad7ab1 100644 --- a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch +++ b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch @@ -1,4 +1,4 @@ -From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001 +From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Tue, 23 Jun 2026 21:03:00 +0200 Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select @@ -207,12 +207,12 @@ index cff608e..9718b12 100644 } diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp -index 15ae389..f219309 100644 +index c83e91f..62a0989 100644 --- a/tests/test-backend-ops.cpp +++ b/tests/test-backend-ops.cpp -@@ -8575,6 +8575,22 @@ static std::vector> make_test_cases_eval() { - // gpt-oss issue with Vulkan mmq_id +@@ -8603,6 +8603,22 
@@ static std::vector> make_test_cases_eval() { test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)); + test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)); + // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert- + // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8, diff --git a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch index 17b73a7ee..ca7e4040f 100644 --- a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch +++ b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch @@ -1,54 +1,40 @@ -From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001 +From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto -Date: Wed, 24 Jun 2026 07:44:25 +0000 +Date: Wed, 24 Jun 2026 10:11:48 +0200 Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch 0016, continuous-batch P1) Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC, decode-first token budget: the P1 of the token-granular continuous-batch -scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY -change only inside update_slots(): no new slot states, no batch-formation -rewrite, zero libllama changes. llama-server already emits one unified -mixed prefill+decode batch per step (Phase 1 appends every ready decode -token unconditionally; Phase 2 fills prefill into the same batch); 0013 -already ships that mixed ubatch. 0016 only changes the COUNT of prefill -tokens admitted per step. +scheduler. POLICY change only inside update_slots(): no new slot states, no +batch-formation rewrite, zero libllama changes. 
llama-server already emits one +unified mixed prefill+decode batch per step (Phase 1 appends every ready decode +token unconditionally; Phase 2 fills prefill into the same batch). 0016 only +changes the COUNT of prefill tokens admitted per step. The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens == D (the live decode load) is known there. Instead of 0013's constant -LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets -one long prompt monopolise the step), compute a dynamic budget: +LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one +long prompt monopolise the step), compute a dynamic budget: - T = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at - n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob) + T = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch) prefill_budget_step = max(n_ubatch, T - D) (leftover after decode, auto-shrinks as decode load rises so the step never inflates past T) - prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch - (the long_prefill_token_threshold analogue: one long prompt cannot - eat the whole leftover; LLAMA_PREFILL_CAP overrides) + prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch, + pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides) Phase 2's inner prompt-fill loop and outer admission break are bounded by prefill_budget_step (across slots) and a new per-slot slot_prompt_added -counter (per-slot cap), instead of the static 0013 cap; the n_batch hard -ceiling stays as the compute bound. Decode is structurally claimed first -and never capped (Phase 1), so the decode-first guarantee is free. 
- -Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense) -that is net-negative at low npl and costs MoE TTFT; the T - D budget is -self-tuning across npl 8..128 and across dense vs MoE, holding the GB10 -decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload -tuning while collapsing burst TTFT. Steady-state decode throughput is NOT -lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is -TTFT + tuning-free robustness + clean supersession of 0013. +counter; the n_batch hard ceiling stays as the compute bound. Decode is +structurally claimed first and never capped (Phase 1), so the decode-first +guarantee is free. DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical -to stock. The degenerate T == n_batch case is byte-identical to stock/0013 -(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the -n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch` -ceiling at the same point, so no new bound fires. The legacy -LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when -LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal -to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off. +to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the +determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly +(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly +subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical +decisions paged on or off. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto @@ -57,10 +43,10 @@ Signed-off-by: Ettore Di Giacinto 1 file changed, 85 insertions(+), 22 deletions(-) diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp -index 5d83b30..f7a114c 100644 +index afcdebe..b8b8f00 100644 --- a/tools/server/server-context.cpp +++ b/tools/server/server-context.cpp -@@ -2723,24 +2723,78 @@ private: +@@ -3043,24 +3043,78 @@ private: int32_t n_batch = llama_n_batch(ctx_tgt); int32_t n_ubatch = llama_n_ubatch(ctx_tgt); @@ -112,7 +98,7 @@ index 5d83b30..f7a114c 100644 + // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no + // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure + // scheduler policy, identical decisions with paged on or off. -+ const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above ++ const int32_t n_decode_in_batch = batch.size(); // D: Phase 1 appended D decode tokens above + int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking) + int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap) { @@ -154,9 +140,9 @@ index 5d83b30..f7a114c 100644 } } } -@@ -3181,11 +3235,18 @@ private: - const int32_t n_before_user = slot.task->params.n_before_user; - const bool n_before_user_known = n_before_user > 0; +@@ -3509,11 +3563,18 @@ private: + const auto & spans = slot.task->params.message_spans; + const auto last_user_pos = spans.last_user_message_pos(); + // (patch 0016) per-slot prompt tokens added this step, for the per-slot + // chunk cap (resets each slot); n_batch stays the hard compute ceiling @@ -169,14 +155,14 @@ index 5d83b30..f7a114c 100644 + // (the T - D leftover) is spent across all slots, or (b) this slot's + // per-slot chunk cap is hit, so a long prompt is split across more steps + // and leaves batch room for co-batched decode of the other slots - while (slot.prompt.n_tokens() 
< slot.task->n_tokens() && batch.n_tokens < n_batch && + while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch && - (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) { + (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) && + (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) { // get next token to process llama_token cur_tok = input_tokens[slot.prompt.n_tokens()]; if (cur_tok == LLAMA_TOKEN_NULL) { -@@ -3211,7 +3272,8 @@ private: +@@ -3538,7 +3599,8 @@ private: slot.prompt.tokens.push_back(cur_tok); slot.n_prompt_tokens_processed++; @@ -184,12 +170,12 @@ index 5d83b30..f7a114c 100644 + n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget + slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap - // stop the prompt batch exactly before the latest user input, so a checkpoint - // can be created after the previous messages -@@ -3321,9 +3383,10 @@ private: - break; + // stop the prompt batch exactly before a user message + if (spans.is_user_start(slot.prompt.n_tokens())) { +@@ -3624,9 +3686,10 @@ private: + if (!slot_batched) { + slot_batched = &slot; } - - // (patch 0013) stop adding prompts once the per-step prefill budget is spent, - // leaving the remaining batch capacity for co-batched decode of other slots - if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) { @@ -197,9 +183,9 @@ index 5d83b30..f7a114c 100644 + // budget (the T - D leftover) is spent, leaving the remaining batch + // capacity for co-batched decode of the other slots + if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) { - break; + add_ok = false; } - } + }); -- 2.43.0 diff --git a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md new file mode 100644 index 000000000..3ad2b3dfb --- /dev/null +++ b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md @@ -0,0 +1,202 
@@ +# Pin-sync: paged patch-stack -> llama.cpp 9d5d882d + +Status: COMPLETE. The paged patch-stack (0001-0024) was rebased onto llama.cpp +`9d5d882d`, rebuilt clean (CUDA sm_121), and the bit-exact gate is GREEN on both +the dense and MoE NVFP4 baselines. The LocalAI-side `.patch` files were then +re-exported from the rebased commits; **4 patch files changed** and are updated +in this commit. A quick decode bench confirms the patchset performs the same on +the new tip. + +## Upstream jump + +- OLD LocalAI pin: `8be759e6` +- NEW LocalAI pin (target): `9d5d882d` ("model : Add label for LFM2.5-230M (#25008)") +- Upstream jump `8be759e6..9d5d882d` = **17 commits**. + +### Note on the dev-tree base (important) +The DGX dev tree's `paged` branch was NOT based on the old pin `8be759e6`. Its +real base (merge-base of `paged` with both pins) is `f3e1828` +("mtmd: llava_uhd should no longer use batch dim (#24732)"), which is an ancestor +of `8be759e6` by 92 commits. So the rebase traversed `f3e1828..9d5d882d` = +**109 upstream commits**, a strictly larger surface than the 17-commit pin bump. +The end state (paged patches on `9d5d882d`) is identical either way; the larger +traverse only means the conflict surface was the worst case, and it still came +through bit-exact. + +## Rebase + +- Command: `git rebase --onto 9d5d882d f3e1828 paged` (merge.conflictStyle=diff3). +- 26 commits replayed (24 shipped patch-commits + the 2 dev-scaffolding "Gate-0/ + FA-gate driver" commits and 1 docs commit; the scaffolding/docs commits are not + shipped as `.patch` files). +- Backup ref before rebase: `paged-prerebase-backup` = `a8a9d12` (old patch 0024). +- New rebased range: `9d5d882d..paged`, HEAD = `2ee65c2` (patch 0024). 
+ +### Conflicts during rebase (3 commits, ALL in `tools/server/server-context.cpp`) + +Every rebase conflict was in the llama-server continuous-batch scheduler wiring, +all of which is gated behind env (`LLAMA_KV_PAGED` / `LLAMA_PREFILL_BUDGET` / +`LLAMA_MAX_BATCH_TOKENS`) and therefore a strict no-op for the gate (the gate +uses `llama-completion`, not the server, with no env set). The root cause was a +single upstream refactor of `update_slots()`: + +- the outer slot loop became `iterate(slots, [&](server_slot & slot){...})`, + replacing bottom-of-loop `break` with a top-of-lambda + `if (!add_ok || batch.size() >= n_batch) return;` (the `add_ok` flag is set + false on `batch.add()` failure); +- the embedding/rerank early-exits changed `continue;` -> `return;`; +- the `server_batch` token count accessor was renamed `batch.n_tokens` -> + `batch.size()` (`server_batch` has a `.size()` method and **no** `.n_tokens` + member; the raw `llama_batch` in `send_embedding`/`send_rerank` keeps `.n_tokens`). + +**patch 0008** (`240758e`, cross-request prefix share) - 1 conflict. +Hunk 3 (the prefix-commit block) collided with the `continue`->`return` refactor. +Hunks 1 (namespace shim) and 2 (the share block) applied cleanly. Resolved by +keeping HEAD's refactored structure and re-inserting the `[paged 0008]` +`paged_prefix_api::commit(...)` block verbatim after `slot.state = SLOT_STATE_GENERATING;` +and before `if (slot.can_speculate())`, re-indented to the new (de-nested) level, +with the identical `paged_kv_commit && cache_prompt && !has_mtmd` guard. Semantics +unchanged. + +**patch 0013** (`6d37431`, static `LLAMA_PREFILL_BUDGET`) - 3 conflicts. +- C1: inserted the `n_prefill_budget` / `n_prompt_budgeted` var block before + HEAD's new `auto & alora_scale = batch.alora_scale;` references (upstream moved + alora_scale/disabled_id into the `server_batch` struct). +- C2: merged the budget gate into HEAD's `while (... 
batch.size() < n_batch ...)` + (took upstream's `batch.size()` rename, kept the budget condition). +- C3: the original outer `break` was translated to the new idiom `add_ok = false;` + (exact semantic equivalent of "stop admitting prompts to remaining slots"); the + upstream-removed `if (batch.n_tokens >= n_batch) break;` was dropped (now handled + by the top-of-lambda check). + +**patch 0016** (`02fa047`, dynamic decode-first budget, supersedes 0013) - 2 +conflicts + 1 clean-hunk fix. +- The big budget-block rewrite hunk applied cleanly (its expected parent == the + faithfully-resolved 0013 block). +- Clean-hunk fix: the clean-applied line `const int32_t n_decode_in_batch = batch.n_tokens;` + referenced the `server_batch` member, which has no `.n_tokens` -> changed to + `batch.size()` (== D, the Phase-1 decode load; identical value). +- C-A: while-condition -> took THEIRS (dynamic `prefill_budget_step` + + `prefill_cap_per_slot`), adopted `batch.size()`. +- C-B: admission break -> 0016 dynamic budget check with `break` -> `add_ok = false`, + dropped the upstream-removed `batch.n_tokens >= n_batch` break. + +OFF-path invariant verified by construction in all three: with the env knobs +unset (`prefill_budget_step == prefill_cap_per_slot == 0`, `paged_kv_* == false`) +the added conditions never fire, so the scheduler is byte-identical to stock HEAD. + +### Kernel patches: ZERO rebase conflicts +Patches 0017-0024 - which touch the bit-exact compute paths +(`gated_delta_net.cu` +330, `mmq.cu`/`mmq.cuh` +209, `ssm-conv.cu` +112, +`quantize.cu`, `fattn.cu`, `src/models/qwen35.cpp`/`qwen35moe.cpp`/`qwen3next.cpp`, +`src/llama-kv-cache.*`, `src/paged-*`, `tests/test-backend-ops.cpp` +79) - all +applied **cleanly** during the rebase (3-way). No math, reduction order, or kernel +context was touched during conflict resolution. 
+ +## Clean rebuild +`cmake --build build-cuda --target clean && cmake --build build-cuda -j20`, +preserving the existing CMakeCache (CMAKE_CUDA_ARCHITECTURES=121, GGML_CUDA=ON, +GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, GGML_CUDA_NCCL=ON). Result: BUILD_EXIT=0, +all targets at 100%. (The only log "error" is a benign webui `dist.tar.gz` +download miss, unrelated to the gate binaries.) + +## GATE: ALL GREEN + +(a) `test-backend-ops` (Backend CUDA0): +| op | result | +|----|--------| +| GATED_DELTA_NET | 36/36 OK | +| SSM_CONV | 45/45 OK | +| MUL_MAT | 1146/1146 OK | +| MUL_MAT_ID | 806/806 OK | + +(b) greedy md5 (`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`): +| model | md5 | baseline | verdict | +|-------|-----|----------|---------| +| dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS | +| MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS | + +Bit-exactness preserved across the upstream jump. + +## Decode bench sanity (rebased build, post-pin-sync) + +`llama-batched-bench -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`, +S_TG (decode) tok/s at npl128, patch defaults on: +| model | npl128 S_TG (new tip) | post-0023 reference | delta | +|-------|----------------------|---------------------|-------| +| dense `q36-27b-nvfp4` | **366.41** | 373.2 | -1.8% | +| MoE `q36-35b-a3b-nvfp4` | **751.11** | 745.7 | +0.7% | + +Both within the +/-3% noise band -> the patchset performs the same on `9d5d882d`. +(npl32 also matches: dense 205.83 vs 207.6; MoE 438.29 vs 440.0.) + +## Export phase: re-export `.patch` files and pick the ones that changed + +The committed `.patch` files were generated against the old base. 
Each shipped +patch was re-exported from its rebased commit (`git format-patch -1 `) and +compared body-to-body against the committed file (ignoring the volatile `From` +commit-hash line and the `index` blob-hash lines). Classification: + +- **CONTENT (real hunk-body change -> MUST update):** `0008`, `0013`, `0015`, `0016`. +- **LINENUM only (hunk bodies byte-identical, only `@@` line-numbers shifted -> + still apply cleanly, left as-is):** `0009`, `0017`, `0018`, `0019`, `0020`, + `0021`, `0024`. +- **IDENTICAL (no change at all):** `0001`, `0002`, `0003`, `0004`, `0006`, + `0007`, `0010`, `0011`, `0012`, `0014`, `0022`, `0023`. + +An independent isolated `git apply --check` sweep (each shipped patch vs the +rebased pre-state tree) agreed exactly: the same 4 (`0008`/`0013`/`0015`/`0016`) +are the only ones that no longer `git apply` to `9d5d882d`. The build applies the +series with plain `git apply` (Makefile) which tolerates `@@` line-number offsets, +so the 7 LINENUM patches still apply (verified) and are intentionally not churned. + +### 0015 was a 4th change beyond the 3 rebase conflicts +The rebase reported only 3 conflicts (`0008`/`0013`/`0016`). `0015` +(expert-density MoE token-tile auto-select) rebased *cleanly* via 3-way merge, but +its committed `.patch` no longer applies to `9d5d882d` via plain `git apply`: +upstream inserted a new test case +(`test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)`) +in `tests/test-backend-ops.cpp` right at `0015`'s insertion anchor, so the hunk's +context lines shifted. `0015`'s own inserted lines are unchanged - it is a pure +context re-anchor, no behavioral change. This is exactly why a per-patch +re-export/apply-check was run instead of trusting the 3-conflict count. 
+ +### What changed in each updated patch (From/index hash noise aside) +- `0008`: same `[paged 0008]` commit block (identical env-guard + `paged_prefix_api::commit` + call), re-indented to the refactored `update_slots` lambda level and re-anchored + after `slot.state = SLOT_STATE_GENERATING;`; `@@` headers updated. +- `0013`: budget var-block / while-gate / admission-break re-expressed against the + refactored loop (`batch.size()`, `add_ok=false`); `@@` headers updated. +- `0015`: hunk context re-anchored around the new upstream test case; inserted + lines identical; `@@` header updated. +- `0016`: dynamic budget block + `n_decode_in_batch = batch.size()` + admission + `add_ok=false` against the refactored loop; `@@` headers updated. + +## Equivalence proof (the updated series == the gate-green tree) + +The 4 updated files are byte-faithful `git format-patch -1` exports of the +gate-green rebased commits (`240758e`, `6d37431`, `5349f82`, `02fa047`). Applying +the full corrected series (the 19 unchanged committed patches + the 4 re-exports) +in order to a fresh bare `9d5d882d` checkout with plain `git apply` succeeds for +all 23 patches, and the resulting tree is **byte-identical to the gate-green +`paged` tip (`2ee65c2`) for every code file** (`git diff` over all paths except +`*.md` and the unshipped `examples/simple/*` scaffold drivers is empty). So the +shipped `.patch` series reproduces exactly the tree that passed test-backend-ops, +the md5 bit-exact gate, and the bench. + +## Pre-existing finding (NOT introduced by this pin-sync, NOT fixed here) +Committed patch `0019` carries a *modify* hunk against the dev-only doc +`SSM_DECODE_FIX_RESULTS.md` (`index 2e7c8c2..77879e4 100644`), a file that exists +only because of an unshipped docs commit on the dev tree and is absent from a +clean llama.cpp checkout. Under strict `git apply` that hunk fails ("No such file +or directory"). 
This is pin-independent (the file is upstream-absent on both +`8be759e6` and `9d5d882d`) and present identically in the old and new `0019` +(LINENUM class), so it is left untouched to keep the pin-sync faithful. (`0021`'s +`CONV_STATE_FUSION_RESULTS.md` is a *create* hunk and applies fine.) Stripping the +stray dev-doc hunks from the shipped patches is a separate cleanup, out of scope +for the pin-sync. + +## Source of truth +The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD +`2ee65c2`) is the source of truth; `paged-prerebase-backup` (`a8a9d12`) retains +the pre-rebase state.
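
Reviewer sketch (not part of the patch): the dynamic decode-first budget that patch `0016` re-anchors is described in its commit message as `T = clamp(LLAMA_MAX_BATCH_TOKENS, n_ubatch, n_batch)`, `prefill_budget_step = max(n_ubatch, T - D)` and `prefill_cap_per_slot = min(T, ceil(0.04*n_ctx))` floored at `n_ubatch` and pinned to `n_batch` when `T == n_batch`. A minimal standalone restatement of that arithmetic might look like the following. The `prefill_budget` struct, the `compute_prefill_budget` name, and the convention that `max_batch_tokens <= 0` means "env unset" are hypothetical, introduced only for illustration; the `LLAMA_PREFILL_CAP` override and the legacy `LLAMA_PREFILL_BUDGET` path are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical container; the real patch keeps these as locals in update_slots().
struct prefill_budget {
    int32_t step;      // prompt tokens admitted per step across all slots (0 = disabled)
    int32_t per_slot;  // prompt tokens admitted per step for one slot    (0 = disabled)
};

// Sketch of the 0016 budget arithmetic as stated in its commit message.
// Assumption: max_batch_tokens <= 0 stands in for "LLAMA_MAX_BATCH_TOKENS unset",
// in which case both bounds stay 0 and the scheduler keeps stock chunking.
static prefill_budget compute_prefill_budget(int32_t n_batch, int32_t n_ubatch,
                                             int32_t n_ctx, int32_t n_decode,
                                             int32_t max_batch_tokens) {
    assert(n_batch > 0 && n_ubatch > 0);
    if (max_batch_tokens <= 0) {
        return {0, 0};
    }

    // T: requested batch size, capped at n_batch and floored at n_ubatch.
    // Written as max/min rather than std::clamp so it stays defined even if
    // n_ubatch > n_batch.
    const int32_t T = std::max(n_ubatch, std::min(max_batch_tokens, n_batch));

    // Leftover after the decode tokens (D) already in the batch; never below n_ubatch.
    const int32_t step = std::max(n_ubatch, T - n_decode);

    // ceil(0.04 * n_ctx) in integer arithmetic to avoid floating-point rounding.
    const int32_t ctx_cap = (int32_t) (((int64_t) n_ctx * 4 + 99) / 100);

    // Degenerate T == n_batch case: the per-slot cap is pinned to n_batch.
    const int32_t per_slot = (T == n_batch)
        ? n_batch
        : std::max(n_ubatch, std::min(T, ctx_cap));

    return {step, per_slot};
}
```

For instance, with `n_batch = 2048`, `n_ubatch = 512`, `n_ctx = 33000`, 128 decode tokens already batched and a requested `T` of 1024, this sketch yields a per-step budget of 896 and a per-slot cap of 1024; with the knob unset it returns 0 for both, matching the OFF-path invariant noted above.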