LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 12:57:02 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	ace1ffab28	docs(paged): record audited current snapshot Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates. Assisted-by: Codex:gpt-5	2026-07-01 03:48:27 +00:00
Ettore Di Giacinto	a0194125f5	chore(paged): summarize snapshot inference gates Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot. Assisted-by: Codex:gpt-5	2026-07-01 03:35:54 +00:00
Ettore Di Giacinto	7108b68a70	chore(paged): record snapshot hardware class Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons. Assisted-by: Codex:gpt-5	2026-07-01 03:31:11 +00:00
Ettore Di Giacinto	ff3f0620de	chore(paged): add current serving snapshot harness Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries. Assisted-by: Codex:gpt-5	2026-07-01 03:19:36 +00:00
Ettore Di Giacinto	c99678da42	docs(paged): refresh current serving snapshot Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision. Assisted-by: Codex:gpt-5	2026-07-01 03:15:30 +00:00
Ettore Di Giacinto	cced07c7fe	docs(paged): add MTP shape trace patch Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results. Assisted-by: Codex:gpt-5	2026-07-01 02:54:29 +00:00
Ettore Di Giacinto	4d171e62bb	docs(paged): reject MTP serving lever Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:29:28 +00:00
Ettore Di Giacinto	e169058e73	chore(paged): add DGX inference gate runner Add a reusable paged llama.cpp gate script for DGX work. It checks docker/local-ai-worker/GPU lock state, runs the canonical MoE and dense transcript md5 gates, and runs selected test-backend-ops filters. Verified on dgx.casa: MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT_ID 806/806. Artifact: /home/mudler/bench/paged_inference_gates/20260701_040048. Assisted-by: Codex:gpt-5	2026-07-01 02:01:55 +00:00
Ettore Di Giacinto	de34cd5954	docs(paged): refresh parity handoff state Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight. Assisted-by: Codex:gpt-5	2026-06-30 15:27:44 +00:00
Ettore Di Giacinto	6edbb56b06	docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED) Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid models. Captures the definitive same-session both-engine benchmark (prefill S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for both the MoE 35B-A3B and dense 27B models), the complete lever map (every prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality, GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on GB10, not missing optimizations; the path to parity is datacenter Blackwell. Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP, DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are marked estimated. Add a section-9 summary + link in the backend README. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 11:57:36 +00:00
Ettore Di Giacinto	bd100dd20a	fix(paged): repair the patch series, sync to the fork branch (drop dev-tree 0044/0045, add f32-only M5 as 0047) The 0044/0045 patches were exported from the old bf16/hybrid dev tree and no longer apply on the f32-only series (0026 ssm_bf16_tau is dropped), so the build broke at `git apply`. Re-sync the vendored series to the now feature-complete fork branch mudler/llama.cpp:localai-paged, which is the canonical source (pin 0ed235ea + the paged patch commits in order). - git rm the dev-tree-based 0044 (GDN M5, bf16-machinery base) and 0045 (Marlin W4A16 offline-repack, not part of the fork branch). - Add the fork branch's newest commit (2c32ab8b7, "GDN M5 tensor-core chunked-scan prefill, f32-only re-port") as 0047, generated with a single git format-patch off that branch. It sequences after 0046 (its parent on the branch) and recovers the prefill win 0044 encoded (+3.5% S_PP @npp512, +17.7% @npp2048), bit-exact per-path (test-backend-ops GATED_DELTA_NET 46/46 default and force-M5; greedy md5 default-on == M5-forced == canonical). - Track patch 0046 (dense-prefill geometry gate), which was on disk but never committed, so the series is complete in git. - README: patch-table header 0001-0046 -> 0001-0047, replace the 0044 row with the f32-only 0047 row, fix the dangling 0044 prose references, note the bf16 M6/M7/M8 variants are not part of this f32-only series, and add a maintenance bullet that the series is now generated from the fork branch so there is no more patch-export drift. Verified: on a pristine llama.cpp at pin 0ed235ea the full series 0001-0043, 0046, 0047 applies clean in sorted order with the Makefile's exact `git apply --verbose` method (37/37 OK), and the resulting tree is byte-identical to the fork branch tip 2c32ab8b7. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 07:54:46 +00:00
Ettore Di Giacinto	7b38c6b2a3	feat(paged): GDN M5 tensor-core chunked-scan prefill, default-on under paged KV (patch 0044) Land the tensor-core forms of the chunked gated-DeltaNet prefill scan (0031) as a single GDN_TC-selected build and ship the M5 variant (full TC form-T solve + state-update mma) default-ON when LLAMA_KV_PAGED is set. The dispatch defaults GDN_TC=5 and GDN_CHUNK_MIN=64 under paged KV (both env-overridable; OFF/INT_MAX when not paged, so stock/non-paged stays regression-free). GDN_CHUNK_MIN is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence; 64 was tuned from a {1,32,64,128,256} sweep (32/64/128 all win on prefill, 256 barely fires because the MoE-prefill per-call count is < 256, 1 collapses decode S_TG ~25%). Measured GB10, q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1, llama-batched-bench -ngl 99 -fa on -ntg 4 -npl 32: -npp 512 S_PP 2208.96 -> 2286.5 t/s (+3.5%, mean of 3 interleaved A/B) -npp 2048 S_PP 2021.5 -> 2379.8 t/s (+17.7%) Decode S_TG unchanged (~399 vs ~397 t/s, within noise). Bit-exactness (per-path greedy md5, n=48 --temp 0 --seed 1, paged): default-on == M5-forced == canonical on the gate prompt - MoE 8cb0ce23, dense 5951a5b4. test-backend-ops GATED_DELTA_NET 94/94 vs CPU with M5 forced (incl. multi-chunk up to n_tokens=256). On a long MoE prompt the default (M5 fires at >=64 tokens) and the sequential path agree word-for-word until one benign greedy token-flip; dense is byte-identical. The chunked scan is a NEW per-path result (different FP reduction order), NMSE-validated benign. CUDA-only, gencode arch=compute_121a,code=sm_121a (GB10 / sm_121a). README sections 3 (0044 row, 0031 superseded note) and 5 (dev-notes verdict) updated. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 06:42:11 +00:00
Ettore Di Giacinto	f1c98ff0b9	fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B regression) Patch 0041 (LLAMA_PAGED_DECODE_STABLE) was made default-on-when-paged, but a measured end-to-end A/B proved that is a serving mistake. S3 defers prefill admission on the period-8 cadence, which delays prompt admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency. Its apparent decode_agg gain was a metric artifact (faster per-step decode bought by starving prefill). Flip the s3_enabled default so an unset LLAMA_PAGED_DECODE_STABLE means OFF; the mechanism stays available as an explicit opt-in (LLAMA_PAGED_DECODE_STABLE=1) for decode-dominated, low-arrival traffic where TTFT is not a concern. The default now prefers prompt prefill admission for good TTFT. S1 (patch 0040) keeps shipping default-on; only S3's default changes. Re-exports patch 0041 (change folded into its source commit) and updates the README 0041 row plus the decode-serving narrative to record the A/B finding. Greedy md5 gate unchanged (single-sequence llama-completion path, not update_slots): paged MoE 8cb0ce23, dense 5951a5b4. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-29 05:00:11 +00:00
Ettore Di Giacinto	b028c81eda	docs(paged): record padded/fixed-slot decode shape as tested-and-rejected The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap. Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream. Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing. Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0. 
Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load; (3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime. The residual burst gap is GPU-compute, which a host-side reuse lever cannot close. Patch series unchanged: this rejected lever is NOT added to patches/paged/. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 20:47:43 +00:00
Ettore Di Giacinto	2fa8ef8fc5	fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 on under paged KV FIX A (patch 0031 compose break): the chunked GDN prefill patch carried '#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but those were introduced by the dropped bf16-tau patch 0026, so on the bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git apply' failed. The same 0026 drop also shifted 0031's later hunks off their context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA && !keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF (GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies clean on 0ed235ea ('git apply --check' green for every patch). FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF. Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in update_slots(); re-exported from the dev tree, README 0041 row updated to match. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 19:37:05 +00:00
Ettore Di Giacinto	d706980c2b	feat(paged): close the continuous-serving decode gap (S1+S3, patches 0040/0041) Add the two decode-serving graph-reuse levers (validated on GB10) that close the host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real continuous serving while tying it in static batched-bench). - 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5 byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%. - 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse. Default-off policy on top of 0016; bit-exact (per-stream independent). S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%); S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table + DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 18:04:28 +00:00
Ettore Di Giacinto	e610347367	feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031) Adds patch 0031 to the paged llama.cpp series: an FLA-style chunked parallel-scan prefill kernel for gated DeltaNet (the upstream gated_delta_net.cu "Add chunked kernel for even faster pre-fill" TODO). Scope: non-KDA scalar gate, f32 state, final-state-only, homogeneous. Bit-exact-benign (NEW per-path): test-backend-ops GATED_DELTA_NET 91/91 within the 1e-7 NMSE gate vs the CPU reference (patch adds 8 S_v=128 prefill cases: exact-multiple / tail / multi-seq / GQA / permuted); numpy prototype confirms f32 chunked-vs-sequential NMSE ~1e-13. OPT-IN, default OFF: GB10's 99KB dynamic-smem opt-in forces C=16 (the 128x128 f32 state is 64KB of the all-shared layout), pinning the kernel to 1 block/SM with serial dk-reductions. Measured ~761 t/s chunked vs ~971 t/s sequential (~22% slower) on q36-27b-nvfp4 prefill, so it defaults OFF (enable with GDN_CHUNK_MIN=<n>); the backend default is regression-free. Beating the 84.7%-of-peak sequential scan needs tensor-core matmuls / register-resident state with larger chunks (recorded in README section 5). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 17:09:38 +00:00
Ettore Di Giacinto	4cd90bfae9	paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit) The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit. Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent). Removed: - patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch - the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads) - the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention. The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 16:06:06 +00:00
Ettore Di Giacinto	0b84fda496	docs(paged): add the bf16-tau opt-in line to the decode plots Per request, the plots now show all four series: llama.cpp (standard), vLLM, LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau (opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:25:02 +00:00
Ettore Di Giacinto	1431f72b92	docs(paged): regenerate decode plots (3-way) from re-measured data + overview Rebuild the two committed decode plots from the re-measured CSV and add a combined overview. Three series per the comparison that matters: llama.cpp (standard) vs vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau stays out of the plot (it remains in the CSV + the README table as the opt-in row). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:20:12 +00:00
Ettore Di Giacinto	3466094c68	docs(paged): re-measure DGX benchmarks on one harness (stock/patched/bf16-tau) Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs: - stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated). - patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE). - patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p). final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows). peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session. Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 22:05:59 +00:00
Ettore Di Giacinto	ed5eb705c7	docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7 The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md. Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 21:34:10 +00:00
Ettore Di Giacinto	53f66a6f03	fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-server link The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin. grpc-server.cpp is SHARED with the stock backend and tracks the stock pin; c299a92c's upstream server-API refactor pulled stream_* helpers into the headers grpc-server.cpp includes, whose definitions the stock-aligned build does not compile -> every paged variant failed to LINK (undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup / stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact gate never exercises the full grpc-server build, so it slipped through. Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact AND grpc-server links - the original DGX-proven baseline). Document the hard constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the paged pin must track the stock pin, and a pin-sync must pass the full CI grpc-server build, not only the bit-exact gate. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 20:28:28 +00:00
Ettore Di Giacinto	08b754f910	chore(paged): keep patches/ patch-only; README to backend root, docs to docs/ The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv, dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv. Restore the invariant that patches/ holds only the .patch series. Moves: - patches/paged/README.md -> README.md (canonical doc at the backend root) - patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md, final_benchmark.csv, qwen36_.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/ - patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README) Deletes: - patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section) - patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide) Repoint every reference to the moved files: README internal links (docs/ + the .github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md, .github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml, the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml, docs/content/features/backends.md, gallery/index.yaml. The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged) is unchanged and still resolves to the 28 patches. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-27 13:20:05 +00:00

24 Commits
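
Several commit messages above quote a before/after throughput pair together with a percentage. As a minimal sketch (not part of the repository), the arithmetic behind those percentages can be re-derived as below. The helper name pct_change and the dictionary labels are hypothetical; the throughput figures are copied from commits 7b38c6b2a3 and e610347367, and the "~" on the e610347367 values means that result is only approximate.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after) throughput pairs in t/s, as quoted in the commit messages.
# The labels are illustrative only; they are not identifiers from the repo.
quoted = {
    "GDN M5 S_PP, -npp 512 (7b38c6b2a3)": (2208.96, 2286.5),
    "GDN M5 S_PP, -npp 2048 (7b38c6b2a3)": (2021.5, 2379.8),
    "sequential -> chunked GDN prefill (e610347367)": (971.0, 761.0),
}

for label, (before, after) in quoted.items():
    # A negative value is a regression relative to the baseline.
    print(f"{label}: {pct_change(before, after):+.1f}%")
```

Rounded to whole or one-decimal percentages, the results agree with the +3.5%, +17.7% and ~22% slower figures stated in those commits.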