mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)
The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the paged patch-stack (0001-0024) to the new tip: the stack was rebased onto 9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated bit-exact before re-exporting the LocalAI .patch files.

Re-exporting each shipped patch from its rebased commit and diffing body-to-body against the committed files identifies exactly 4 that changed and no longer git-apply to 9d5d882d:

- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to the refactored update_slots() lambda (continue -> return, batch.n_tokens -> batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The inserted lines are unchanged. (This one rebased cleanly via 3-way but its committed .patch no longer applies with plain git apply, so it is caught by the per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch = batch.size() + add_ok=false against the refactored loop.

All four are byte-faithful format-patch exports of the gate-green rebased commits. Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green tree byte-for-byte across every code file.

The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only (hunk bodies byte-identical, only @@ line-numbers shifted) and still apply cleanly, so they are left unchanged. The remaining patches are identical.

Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):

- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT 1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s (ref 745.7, +0.7%), both within noise.

Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -1,4 +1,4 @@
-From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
+From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Mon, 22 Jun 2026 17:02:22 +0200
 Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
@@ -51,10 +51,10 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 1 file changed, 50 insertions(+)

 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index da6a475..04c6361 100644
+index 39b7eb2..b5f9d37 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -15,6 +15,16 @@
+@@ -16,6 +16,16 @@
 #include "mtmd.h"
 #include "mtmd-helper.h"

@@ -71,7 +71,7 @@ index da6a475..04c6361 100644
 #include <algorithm>
 #include <cstddef>
 #include <cinttypes>
-@@ -3007,6 +3017,37 @@ private:
+@@ -3335,6 +3345,37 @@ private:
 }
 }

@@ -109,22 +109,22 @@ index da6a475..04c6361 100644
 // [TAG_PROMPT_LOGITS]
 if (n_past == slot.task->n_tokens() && n_past > 0) {
 SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
-@@ -3427,6 +3468,15 @@ private:
-// prompt evaluated for next-token prediction
-slot.state = SLOT_STATE_GENERATING;
+@@ -3741,6 +3782,15 @@ private:
+// prompt evaluated for next-token prediction
+slot.state = SLOT_STATE_GENERATING;

-+ // [paged 0008] Publish this slot's computed prefix so concurrent/later
-+ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
-+ // for [0, n_tokens) has just run, so the prefix KV is computed.
-+ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
-+ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
-+ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
-+ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
-+ }
++ // [paged 0008] Publish this slot's computed prefix so concurrent/later
++ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
++ // for [0, n_tokens) has just run, so the prefix KV is computed.
++ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
++ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
++ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
++ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
++ }
 +
-if (slot.can_speculate()) {
-common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
-}
+if (slot.can_speculate()) {
+common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
+}
 --
 2.43.0

@@ -1,4 +1,4 @@
-From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
+From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Tue, 23 Jun 2026 11:52:45 +0200
 Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
@@ -62,14 +62,14 @@ stays disjoint from the paged allocation hunks.
 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 ---
-tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
-1 file changed, 34 insertions(+), 1 deletion(-)
+tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++-
+1 file changed, 33 insertions(+), 1 deletion(-)

 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 04c6361..5d83b30 100644
+index b5f9d37..afcdebe 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -2723,6 +2723,29 @@ private:
+@@ -3043,6 +3043,29 @@ private:
 int32_t n_batch = llama_n_batch(ctx_tgt);
 int32_t n_ubatch = llama_n_ubatch(ctx_tgt);

@@ -96,42 +96,41 @@ index 04c6361..5d83b30 100644
 + }
 + int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
 +
-float alora_scale = -1.0f;
-size_t alora_disabled_id = 0;
+auto & alora_scale = batch.alora_scale;
+auto & alora_disabled_id = batch.alora_disabled_id;

-@@ -3159,7 +3182,10 @@ private:
-const bool n_before_user_known = n_before_user > 0;
+@@ -3487,7 +3510,10 @@ private:
+const auto last_user_pos = spans.last_user_message_pos();

 // add prompt tokens for processing in the current batch
-- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
+- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) {
 + // (patch 0013) also stop once the per-step prefill budget is spent, so a long
 + // prompt is split across more steps and leaves batch room for co-batched decode
-+ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
++ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
 + (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
 // get next token to process
 llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
 if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3185,6 +3211,7 @@ private:
+@@ -3512,6 +3538,7 @@ private:
 slot.prompt.tokens.push_back(cur_tok);

 slot.n_prompt_tokens_processed++;
 + n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget

-// stop the prompt batch exactly before the latest user input, so a checkpoint
-// can be created after the previous messages
-@@ -3293,6 +3320,12 @@ private:
-if (batch.n_tokens >= n_batch) {
-break;
+// stop the prompt batch exactly before a user message
+if (spans.is_user_start(slot.prompt.n_tokens())) {
+@@ -3597,6 +3624,11 @@ private:
+if (!slot_batched) {
+slot_batched = &slot;
 }
-+
 + // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
 + // leaving the remaining batch capacity for co-batched decode of other slots
 + if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
-+ break;
++ add_ok = false;
 + }
-}
+});
 }
-
+}
 --
 2.43.0

@@ -1,4 +1,4 @@
-From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
+From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
 Date: Tue, 23 Jun 2026 21:03:00 +0200
 Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
@@ -207,12 +207,12 @@ index cff608e..9718b12 100644
 }

 diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index 15ae389..f219309 100644
+index c83e91f..62a0989 100644
 --- a/tests/test-backend-ops.cpp
 +++ b/tests/test-backend-ops.cpp
-@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-// gpt-oss issue with Vulkan mmq_id
+@@ -8603,6 +8603,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
 test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));

 + // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
 + // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
@@ -1,54 +1,40 @@
-From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
+From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001
 From: Ettore Di Giacinto <mudler@localai.io>
-Date: Wed, 24 Jun 2026 07:44:25 +0000
+Date: Wed, 24 Jun 2026 10:11:48 +0200
 Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
 0016, continuous-batch P1)

 Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
 decode-first token budget: the P1 of the token-granular continuous-batch
-scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
-change only inside update_slots(): no new slot states, no batch-formation
-rewrite, zero libllama changes. llama-server already emits one unified
-mixed prefill+decode batch per step (Phase 1 appends every ready decode
-token unconditionally; Phase 2 fills prefill into the same batch); 0013
-already ships that mixed ubatch. 0016 only changes the COUNT of prefill
-tokens admitted per step.
+scheduler. POLICY change only inside update_slots(): no new slot states, no
+batch-formation rewrite, zero libllama changes. llama-server already emits one
+unified mixed prefill+decode batch per step (Phase 1 appends every ready decode
+token unconditionally; Phase 2 fills prefill into the same batch). 0016 only
+changes the COUNT of prefill tokens admitted per step.

 The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
 == D (the live decode load) is known there. Instead of 0013's constant
-LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
-one long prompt monopolise the step), compute a dynamic budget:
+LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one
+long prompt monopolise the step), compute a dynamic budget:

-T = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
-n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
+T = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch)
 prefill_budget_step = max(n_ubatch, T - D) (leftover after decode,
 auto-shrinks as decode load rises so the step never inflates past T)
-prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
-(the long_prefill_token_threshold analogue: one long prompt cannot
-eat the whole leftover; LLAMA_PREFILL_CAP overrides)
+prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch,
+pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides)

 Phase 2's inner prompt-fill loop and outer admission break are bounded by
 prefill_budget_step (across slots) and a new per-slot slot_prompt_added
-counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
-ceiling stays as the compute bound. Decode is structurally claimed first
-and never capped (Phase 1), so the decode-first guarantee is free.
-
-Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
-that is net-negative at low npl and costs MoE TTFT; the T - D budget is
-self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
-decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
-tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
-lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
-TTFT + tuning-free robustness + clean supersession of 0013.
+counter; the n_batch hard ceiling stays as the compute bound. Decode is
+structurally claimed first and never capped (Phase 1), so the decode-first
+guarantee is free.

 DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
-to stock. The degenerate T == n_batch case is byte-identical to stock/0013
-(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
-n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
-ceiling at the same point, so no new bound fires. The legacy
-LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
-LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
-to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.
+to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the
+determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly
+(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly
+subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical
+decisions paged on or off.

 Assisted-by: Claude:opus-4.8 [Claude Code]
 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -57,10 +43,10 @@ Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
 1 file changed, 85 insertions(+), 22 deletions(-)

 diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 5d83b30..f7a114c 100644
+index afcdebe..b8b8f00 100644
 --- a/tools/server/server-context.cpp
 +++ b/tools/server/server-context.cpp
-@@ -2723,24 +2723,78 @@ private:
+@@ -3043,24 +3043,78 @@ private:
 int32_t n_batch = llama_n_batch(ctx_tgt);
 int32_t n_ubatch = llama_n_ubatch(ctx_tgt);

@@ -112,7 +98,7 @@ index 5d83b30..f7a114c 100644
 + // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
 + // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
 + // scheduler policy, identical decisions with paged on or off.
-+ const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
++ const int32_t n_decode_in_batch = batch.size(); // D: Phase 1 appended D decode tokens above
 + int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking)
 + int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
 {
@@ -154,9 +140,9 @@ index 5d83b30..f7a114c 100644
 }
 }
 }
-@@ -3181,11 +3235,18 @@ private:
-const int32_t n_before_user = slot.task->params.n_before_user;
-const bool n_before_user_known = n_before_user > 0;
+@@ -3509,11 +3563,18 @@ private:
+const auto & spans = slot.task->params.message_spans;
+const auto last_user_pos = spans.last_user_message_pos();

 + // (patch 0016) per-slot prompt tokens added this step, for the per-slot
 + // chunk cap (resets each slot); n_batch stays the hard compute ceiling
@@ -169,14 +155,14 @@ index 5d83b30..f7a114c 100644
 + // (the T - D leftover) is spent across all slots, or (b) this slot's
 + // per-slot chunk cap is hit, so a long prompt is split across more steps
 + // and leaves batch room for co-batched decode of the other slots
-while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
 - (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
 + (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) &&
 + (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
 // get next token to process
 llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
 if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3211,7 +3272,8 @@ private:
+@@ -3538,7 +3599,8 @@ private:
 slot.prompt.tokens.push_back(cur_tok);

 slot.n_prompt_tokens_processed++;
@@ -184,12 +170,12 @@ index 5d83b30..f7a114c 100644
 + n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget
 + slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap

-// stop the prompt batch exactly before the latest user input, so a checkpoint
-// can be created after the previous messages
-@@ -3321,9 +3383,10 @@ private:
-break;
+// stop the prompt batch exactly before a user message
+if (spans.is_user_start(slot.prompt.n_tokens())) {
+@@ -3624,9 +3686,10 @@ private:
+if (!slot_batched) {
+slot_batched = &slot;
 }
-
 - // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
 - // leaving the remaining batch capacity for co-batched decode of other slots
 - if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
@@ -197,9 +183,9 @@ index 5d83b30..f7a114c 100644
 + // budget (the T - D leftover) is spent, leaving the remaining batch
 + // capacity for co-batched decode of the other slots
 + if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
-break;
+add_ok = false;
 }
-}
+});
 --
 2.43.0

202
backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md
Normal file
@@ -0,0 +1,202 @@
# Pin-sync: paged patch-stack -> llama.cpp 9d5d882d

Status: COMPLETE. The paged patch-stack (0001-0024) was rebased onto llama.cpp
`9d5d882d`, rebuilt clean (CUDA sm_121), and the bit-exact gate is GREEN on both
the dense and MoE NVFP4 baselines. The LocalAI-side `.patch` files were then
re-exported from the rebased commits; **4 patch files changed** and are updated
in this commit. A quick decode bench confirms the patchset performs the same on
the new tip.

## Upstream jump

- OLD LocalAI pin: `8be759e6`
- NEW LocalAI pin (target): `9d5d882d` ("model : Add label for LFM2.5-230M (#25008)")
- Upstream jump `8be759e6..9d5d882d` = **17 commits**.

### Note on the dev-tree base (important)
The DGX dev tree's `paged` branch was NOT based on the old pin `8be759e6`. Its
real base (merge-base of `paged` with both pins) is `f3e1828`
("mtmd: llava_uhd should no longer use batch dim (#24732)"), which sits 92 commits
behind `8be759e6`. So the rebase traversed `f3e1828..9d5d882d` = 92 + 17 =
**109 upstream commits**, a strictly larger surface than the 17-commit pin bump.
The end state (paged patches on `9d5d882d`) is identical either way; the larger
traverse only means the conflict surface was the worst case, and it still came
through bit-exact.

## Rebase

- Command: `git rebase --onto 9d5d882d f3e1828 paged` (merge.conflictStyle=diff3).
- 26 commits replayed (24 shipped patch-commits + the 2 dev-scaffolding "Gate-0/
  FA-gate driver" commits and 1 docs commit; the scaffolding/docs commits are not
  shipped as `.patch` files).
- Backup ref before rebase: `paged-prerebase-backup` = `a8a9d12` (old patch 0024).
- New rebased range: `9d5d882d..paged`, HEAD = `2ee65c2` (patch 0024).

### Conflicts during rebase (3 commits, ALL in `tools/server/server-context.cpp`)

Every rebase conflict was in the llama-server continuous-batch scheduler wiring,
all of which is gated behind env (`LLAMA_KV_PAGED` / `LLAMA_PREFILL_BUDGET` /
`LLAMA_MAX_BATCH_TOKENS`) and therefore a strict no-op for the gate (the gate
uses `llama-completion`, not the server, with no env set). The root cause was a
single upstream refactor of `update_slots()`:

- the outer slot loop became `iterate(slots, [&](server_slot & slot){...})`,
  replacing bottom-of-loop `break` with a top-of-lambda
  `if (!add_ok || batch.size() >= n_batch) return;` (the `add_ok` flag is set
  false on `batch.add()` failure);
- the embedding/rerank early-exits changed `continue;` -> `return;`;
- the `server_batch` token count accessor was renamed `batch.n_tokens` ->
  `batch.size()` (`server_batch` has a `.size()` method and **no** `.n_tokens`
  member; the raw `llama_batch` in `send_embedding`/`send_rerank` keeps `.n_tokens`).

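The `break` -> `add_ok = false` translation described above can be illustrated with a toy model. This is a minimal sketch, not the llama-server code: `demo_slot`, `fill_slot`, `admit_with_loop`, `admit_with_lambda`, and this `iterate` helper are hypothetical stand-ins, and the batch is reduced to a token counter. It only shows that a flag checked at the top of the lambda stops admission at the same slot as a bottom-of-loop `break`, assuming `n_batch > 0` and an empty batch on entry.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a server slot: just an id and a pending prompt length.
struct demo_slot {
    int id;
    int n_prompt;
};

// Assumed shape of the helper: call fn on every element. A lambda body cannot
// `break` out of this loop; it can only `return` from the current call.
template <typename T, typename F>
void iterate(const std::vector<T> & items, F && fn) {
    for (const auto & item : items) {
        fn(item);
    }
}

// Add prompt tokens for one slot, bounded by n_batch and (when > 0) by the budget.
static int fill_slot(const demo_slot & s, int & n_tokens, int & budgeted, int n_batch, int budget) {
    int added = 0;
    while (added < s.n_prompt && n_tokens < n_batch && (budget == 0 || budgeted < budget)) {
        ++added;
        ++n_tokens;
        ++budgeted;
    }
    return added;
}

// Pre-refactor shape: plain loop with bottom-of-loop `break`.
std::vector<int> admit_with_loop(const std::vector<demo_slot> & slots, int n_batch, int budget) {
    assert(n_batch > 0);
    std::vector<int> added_per_slot;
    int n_tokens = 0;
    int budgeted = 0;
    for (const auto & s : slots) {
        added_per_slot.push_back(fill_slot(s, n_tokens, budgeted, n_batch, budget));
        if (n_tokens >= n_batch) {
            break;
        }
        if (budget > 0 && budgeted >= budget) {
            break;
        }
    }
    return added_per_slot;
}

// Post-refactor shape: the n_batch check moves to the top of the lambda and the
// budget `break` becomes `add_ok = false`, which makes every later call return early.
std::vector<int> admit_with_lambda(const std::vector<demo_slot> & slots, int n_batch, int budget) {
    assert(n_batch > 0);
    std::vector<int> added_per_slot;
    int  n_tokens = 0;
    int  budgeted = 0;
    bool add_ok   = true;
    iterate(slots, [&](const demo_slot & s) {
        if (!add_ok || n_tokens >= n_batch) {
            return;
        }
        added_per_slot.push_back(fill_slot(s, n_tokens, budgeted, n_batch, budget));
        if (budget > 0 && budgeted >= budget) {
            add_ok = false;
        }
    });
    return added_per_slot;
}
```

With slots holding 3, 5 and 4 pending tokens, both variants admit `{3, 5}` for `n_batch = 8` with the budget disabled, and `{3, 1}` for a budget of 4 under a large `n_batch`.
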
**patch 0008** (`240758e`, cross-request prefix share) - 1 conflict.
Hunk 3 (the prefix-commit block) collided with the `continue`->`return` refactor.
Hunks 1 (namespace shim) and 2 (the share block) applied cleanly. Resolved by
keeping HEAD's refactored structure and re-inserting the `[paged 0008]`
`paged_prefix_api::commit(...)` block verbatim after `slot.state = SLOT_STATE_GENERATING;`
and before `if (slot.can_speculate())`, re-indented to the new (de-nested) level,
with the identical `paged_kv_commit && cache_prompt && !has_mtmd` guard. Semantics
unchanged.

**patch 0013** (`6d37431`, static `LLAMA_PREFILL_BUDGET`) - 3 conflicts.
- C1: inserted the `n_prefill_budget` / `n_prompt_budgeted` var block before
  HEAD's new `auto & alora_scale = batch.alora_scale;` references (upstream moved
  alora_scale/disabled_id into the `server_batch` struct).
- C2: merged the budget gate into HEAD's `while (... batch.size() < n_batch ...)`
  (took upstream's `batch.size()` rename, kept the budget condition).
- C3: the original outer `break` was translated to the new idiom `add_ok = false;`
  (exact semantic equivalent of "stop admitting prompts to remaining slots"); the
  upstream-removed `if (batch.n_tokens >= n_batch) break;` was dropped (now handled
  by the top-of-lambda check).

**patch 0016** (`02fa047`, dynamic decode-first budget, supersedes 0013) - 2
conflicts + 1 clean-hunk fix.
- The big budget-block rewrite hunk applied cleanly (its expected parent == the
  faithfully-resolved 0013 block).
- Clean-hunk fix: the clean-applied line `const int32_t n_decode_in_batch = batch.n_tokens;`
  referenced the `server_batch` member, which has no `.n_tokens` -> changed to
  `batch.size()` (== D, the Phase-1 decode load; identical value).
- C-A: while-condition -> took THEIRS (dynamic `prefill_budget_step` +
  `prefill_cap_per_slot`), adopted `batch.size()`.
- C-B: admission break -> 0016 dynamic budget check with `break` -> `add_ok = false`,
  dropped the upstream-removed `batch.n_tokens >= n_batch` break.

OFF-path invariant verified by construction in all three: with the env knobs
unset (`prefill_budget_step == prefill_cap_per_slot == 0`, `paged_kv_* == false`)
the added conditions never fire, so the scheduler is byte-identical to stock HEAD.

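The budget values that patch 0016 feeds into those conditions follow the formulas in its commit message (`T`, `prefill_budget_step`, `prefill_cap_per_slot`). The sketch below restates that arithmetic as a standalone function. It is an illustration under assumptions, not the patch's code: `prefill_budget` and `compute_prefill_budget` are hypothetical names, an unset `LLAMA_MAX_BATCH_TOKENS` is modelled as `max_batch_tokens <= 0`, and the `LLAMA_PREFILL_CAP` override and the legacy `LLAMA_PREFILL_BUDGET` path are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: 0 means "disabled", matching the OFF-path invariant.
struct prefill_budget {
    int32_t step;         // prompt tokens allowed this step, across all slots
    int32_t cap_per_slot; // prompt tokens allowed this step for a single slot
};

// n_decode is D, the number of decode tokens Phase 1 already put in the batch.
// max_batch_tokens <= 0 models the knob being unset (an assumption of this sketch).
prefill_budget compute_prefill_budget(int32_t n_batch, int32_t n_ubatch, int32_t n_ctx,
                                      int32_t n_decode, int32_t max_batch_tokens) {
    assert(n_ubatch > 0 && n_ubatch <= n_batch);

    if (max_batch_tokens <= 0) {
        return {0, 0}; // knobs unset: neither bound is active
    }

    // T = clamp(LLAMA_MAX_BATCH_TOKENS, n_ubatch, n_batch)
    const int32_t T = std::min(std::max(max_batch_tokens, n_ubatch), n_batch);

    // prefill_budget_step = max(n_ubatch, T - D): the leftover after decode
    const int32_t step = std::max(n_ubatch, T - n_decode);

    // prefill_cap_per_slot = min(T, ceil(0.04 * n_ctx)), floored at n_ubatch,
    // pinned to n_batch when T == n_batch
    const int32_t four_pct = static_cast<int32_t>((static_cast<int64_t>(n_ctx) * 4 + 99) / 100);
    int32_t cap = std::max(std::min(T, four_pct), n_ubatch);
    if (T == n_batch) {
        cap = n_batch;
    }

    return {step, cap};
}
```

For example, with `n_batch = 2048`, `n_ubatch = 512`, `n_ctx = 32768`, `D = 128` and a 1024-token limit, the step budget is `1024 - 128 = 896` and the per-slot cap is 1024. With the knob unset both values are 0, which is the case where the added loop conditions never fire.
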
### Kernel patches: ZERO rebase conflicts
Patches 0017-0024 - which touch the bit-exact compute paths
(`gated_delta_net.cu` +330, `mmq.cu`/`mmq.cuh` +209, `ssm-conv.cu` +112,
`quantize.cu`, `fattn.cu`, `src/models/qwen35.cpp`/`qwen35moe.cpp`/`qwen3next.cpp`,
`src/llama-kv-cache.*`, `src/paged-*`, `tests/test-backend-ops.cpp` +79) - all
applied **cleanly** during the rebase (3-way). No math, reduction order, or kernel
context was touched during conflict resolution.

## Clean rebuild
`cmake --build build-cuda --target clean && cmake --build build-cuda -j20`,
preserving the existing CMakeCache (CMAKE_CUDA_ARCHITECTURES=121, GGML_CUDA=ON,
GGML_CUDA_FA=ON, GGML_CUDA_GRAPHS=ON, GGML_CUDA_NCCL=ON). Result: BUILD_EXIT=0,
all targets at 100%. (The only log "error" is a benign webui `dist.tar.gz`
download miss, unrelated to the gate binaries.)

## GATE: ALL GREEN

(a) `test-backend-ops` (Backend CUDA0):
| op | result |
|----|--------|
| GATED_DELTA_NET | 36/36 OK |
| SSM_CONV | 45/45 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |

(b) greedy md5 (`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`):
| model | md5 | baseline | verdict |
|-------|-----|----------|---------|
| dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |

Bit-exactness preserved across the upstream jump.

## Decode bench sanity (rebased build, post-pin-sync)

`llama-batched-bench -ngl 99 -fa on -npp 128 -ntg 128 -npl 32,128 -c 33000`,
S_TG (decode) tok/s at npl128, patch defaults on:
| model | npl128 S_TG (new tip) | post-0023 reference | delta |
|-------|----------------------|---------------------|-------|
| dense `q36-27b-nvfp4` | **366.41** | 373.2 | -1.8% |
| MoE `q36-35b-a3b-nvfp4` | **751.11** | 745.7 | +0.7% |

Both within the +/-3% noise band -> the patchset performs the same on `9d5d882d`.
(npl32 also matches: dense 205.83 vs 207.6; MoE 438.29 vs 440.0.)

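The delta column is the relative difference against the post-0023 reference, and the verdict is a comparison against the +/-3% band. A small sketch of that arithmetic follows; `delta_pct` and `within_noise_band` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Relative difference of a measured throughput against a reference, in percent.
double delta_pct(double measured, double reference) {
    assert(reference > 0.0);
    return (measured - reference) / reference * 100.0;
}

// True when the measurement lies within +/- band_pct of the reference.
bool within_noise_band(double measured, double reference, double band_pct) {
    return std::fabs(delta_pct(measured, reference)) <= band_pct;
}
```

For the dense row, `delta_pct(366.41, 373.2)` is about -1.82, which the table rounds to -1.8%. The MoE row gives about +0.73, rounded to +0.7%.
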
## Export phase: re-export `.patch` files and pick the ones that changed

The committed `.patch` files were generated against the old base. Each shipped
patch was re-exported from its rebased commit (`git format-patch -1 <commit>`) and
compared body-to-body against the committed file (ignoring the volatile `From`
commit-hash line and the `index` blob-hash lines). Classification:

- **CONTENT (real hunk-body change -> MUST update):** `0008`, `0013`, `0015`, `0016`.
- **LINENUM only (hunk bodies byte-identical, only `@@` line-numbers shifted ->
  still apply cleanly, left as-is):** `0009`, `0017`, `0018`, `0019`, `0020`,
  `0021`, `0024`.
- **IDENTICAL (no change at all):** `0001`, `0002`, `0003`, `0004`, `0006`,
  `0007`, `0010`, `0011`, `0012`, `0014`, `0022`, `0023`.

An independent isolated `git apply --check` sweep (each shipped patch vs the
rebased pre-state tree) agreed exactly: the same 4 (`0008`/`0013`/`0015`/`0016`)
are the only ones that no longer `git apply` to `9d5d882d`. The build applies the
series with plain `git apply` (Makefile) which tolerates `@@` line-number offsets,
so the 7 LINENUM patches still apply (verified) and are intentionally not churned.

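The three-way classification above can be sketched as a text comparison over two patch files. This is an illustrative reconstruction, not the script that was actually run: `patch_class`, `normalise`, and `classify` are hypothetical names, and the `@@` handling (dropping everything after the marker, including the trailing function context) is a simplifying assumption.

```cpp
#include <sstream>
#include <string>

enum class patch_class { IDENTICAL, LINENUM, CONTENT };

static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// Drop the volatile lines (the mbox "From <sha> ..." header and the "index a..b"
// blob hashes). With strip_hunk_pos set, also reduce every "@@ -a,b +c,d @@ ..."
// hunk header to a bare "@@" so that pure line-number shifts compare equal.
static std::string normalise(const std::string & patch, bool strip_hunk_pos) {
    std::istringstream in(patch);
    std::string line;
    std::string out;
    while (std::getline(in, line)) {
        if (starts_with(line, "From ") || starts_with(line, "index ")) {
            continue; // note: "From: author" is kept, it does not match "From "
        }
        if (strip_hunk_pos && starts_with(line, "@@ ")) {
            line = "@@";
        }
        out += line;
        out += '\n';
    }
    return out;
}

patch_class classify(const std::string & committed, const std::string & reexported) {
    if (normalise(committed, false) == normalise(reexported, false)) {
        return patch_class::IDENTICAL;
    }
    if (normalise(committed, true) == normalise(reexported, true)) {
        return patch_class::LINENUM;
    }
    return patch_class::CONTENT;
}
```

A text comparison like this only sorts patches into classes. Whether a LINENUM patch still applies is a separate question, which is why the `git apply --check` sweep was run independently.
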
### 0015 was a 4th change beyond the 3 commits that conflicted
The rebase reported conflicts in only 3 commits (`0008`/`0013`/`0016`). `0015`
(expert-density MoE token-tile auto-select) rebased *cleanly* via 3-way merge, but
its committed `.patch` no longer applies to `9d5d882d` via plain `git apply`:
upstream inserted a new test case
(`test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)`)
in `tests/test-backend-ops.cpp` right at `0015`'s insertion anchor, so the hunk's
context lines shifted. `0015`'s own inserted lines are unchanged - it is a pure
context re-anchor, no behavioral change. This is exactly why a per-patch
re-export/apply-check was run instead of trusting the count of conflicting commits.

### What changed in each updated patch (From/index hash noise aside)
- `0008`: same `[paged 0008]` commit block (identical env-guard + `paged_prefix_api::commit`
  call), re-indented to the refactored `update_slots` lambda level and re-anchored
  after `slot.state = SLOT_STATE_GENERATING;`; `@@` headers updated.
- `0013`: budget var-block / while-gate / admission-break re-expressed against the
  refactored loop (`batch.size()`, `add_ok=false`); `@@` headers updated.
- `0015`: hunk context re-anchored around the new upstream test case; inserted
  lines identical; `@@` header updated.
- `0016`: dynamic budget block + `n_decode_in_batch = batch.size()` + admission
  `add_ok=false` against the refactored loop; `@@` headers updated.

## Equivalence proof (the updated series == the gate-green tree)

The 4 updated files are byte-faithful `git format-patch -1` exports of the
gate-green rebased commits (`240758e`, `6d37431`, `5349f82`, `02fa047`). Applying
the full corrected series (the 19 unchanged committed patches + the 4 re-exports)
in order to a fresh bare `9d5d882d` checkout with plain `git apply` succeeds for
all 23 patches, and the resulting tree is **byte-identical to the gate-green
`paged` tip (`2ee65c2`) for every code file** (`git diff` over all paths except
`*.md` and the unshipped `examples/simple/*` scaffold drivers is empty). So the
shipped `.patch` series reproduces exactly the tree that passed test-backend-ops,
the md5 bit-exact gate, and the bench.

## Pre-existing finding (NOT introduced by this pin-sync, NOT fixed here)
Committed patch `0019` carries a *modify* hunk against the dev-only doc
`SSM_DECODE_FIX_RESULTS.md` (`index 2e7c8c2..77879e4 100644`), a file that exists
only because of an unshipped docs commit on the dev tree and is absent from a
clean llama.cpp checkout. Under strict `git apply` that hunk fails ("No such file
or directory"). This is pin-independent (the file is upstream-absent on both
`8be759e6` and `9d5d882d`) and present identically in the old and new `0019`
(LINENUM class), so it is left untouched to keep the pin-sync faithful. (`0021`'s
`CONV_STATE_FUSION_RESULTS.md` is a *create* hunk and applies fine.) Stripping the
stray dev-doc hunks from the shipped patches is a separate cleanup, out of scope
for the pin-sync.

## Source of truth
The rebased branch on the DGX dev tree (`~/llama-paged-dev`, branch `paged`, HEAD
`2ee65c2`) is the source of truth; `paged-prerebase-backup` (`a8a9d12`) retains
the pre-rebase state.