LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	ff3f0620de	chore(paged): add current serving snapshot harness Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries. Assisted-by: Codex:gpt-5	2026-07-01 03:19:36 +00:00
Ettore Di Giacinto	c99678da42	docs(paged): refresh current serving snapshot Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision. Assisted-by: Codex:gpt-5	2026-07-01 03:15:30 +00:00
Ettore Di Giacinto	310eb3c866	docs(paged): reject MTP draft-shape scheduler Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates. Assisted-by: Codex:gpt-5	2026-07-01 03:03:49 +00:00
Ettore Di Giacinto	cced07c7fe	docs(paged): add MTP shape trace patch Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results. Assisted-by: Codex:gpt-5	2026-07-01 02:54:29 +00:00
Ettore Di Giacinto	6e35476340	docs(paged): scope MTP graph-shape follow-up Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment. Assisted-by: Codex:gpt-5	2026-07-01 02:37:21 +00:00
Ettore Di Giacinto	ae76d42a96	docs(paged): profile MTP graph reuse loss Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression. Assisted-by: Codex:gpt-5	2026-07-01 02:32:49 +00:00
Ettore Di Giacinto	4d171e62bb	docs(paged): reject MTP serving lever Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:29:28 +00:00
Ettore Di Giacinto	70394364a3	docs(paged): gate MTP rollback safety Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:15:11 +00:00
Ettore Di Giacinto	2074b4fb5b	docs(paged): reject GDN global Ai32 prototype Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10. Assisted-by: Codex:gpt-5	2026-07-01 01:51:53 +00:00
Ettore Di Giacinto	adabd11919	docs(paged): scope GDN global Ai32 prototype Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan. Assisted-by: Codex:gpt-5	2026-07-01 01:38:51 +00:00
Ettore Di Giacinto	1b5ae227eb	docs(paged): reject GDN M5 QS-early phase Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact. Assisted-by: Codex:gpt-5	2026-07-01 01:29:44 +00:00
Ettore Di Giacinto	3da3b169fb	docs(paged): reject GDN C32 slab phase Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch. Assisted-by: Codex:gpt-5	2026-07-01 01:15:00 +00:00
Ettore Di Giacinto	34c4b5ce8d	docs(paged): scope phase7 serving candidates Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates. Assisted-by: Codex:gpt-5	2026-06-30 23:12:09 +00:00
Ettore Di Giacinto	85c88320ef	patches(paged): pad W4A16 A shared tile stride Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 22:15:21 +00:00
Ettore Di Giacinto	c5f2545cdd	patches(paged): tune W4A16 grouped tile shape Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 21:57:42 +00:00
Ettore Di Giacinto	d8edc615e7	patches(paged): mirror W4A16 packed metadata Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off. Assisted-by: Codex:gpt-5	2026-06-30 21:21:53 +00:00
Ettore Di Giacinto	de34cd5954	docs(paged): refresh parity handoff state Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight. Assisted-by: Codex:gpt-5	2026-06-30 15:27:44 +00:00
Ettore Di Giacinto	1b9176c2c8	docs(paged): codify fork-first patch workflow as mandatory policy The fork mudler/llama.cpp branch localai-paged is the canonical source of truth for all paged-backend kernel/patch work. Always update it FIRST: commit the change on the fork branch and push it, then regenerate the LocalAI patch series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via git format-patch so the series is a 1:1 drift-free mirror of the branch. Never edit the LocalAI patch files directly, and never add a patch with no corresponding fork-branch commit. The series is a derivative; the fork is the source. The fork branch is also where the build and the per-path bit-exact md5 gate actually run, so it is the only place a change is truly validated. Codified in two places: - .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow (MANDATORY)" section at the top of the patch/pin-sync material, plus the "Encapsulating your work" bullet now points at it. - backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the f32-only M5 which is worktree 0047). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 15:12:36 +00:00
Ettore Di Giacinto	8bb47e5a8a	docs(paged): correct PARITY_HANDOFF ahead/behind + note dense CDEF gate md5 Ground-check follow-up to `2431090ff`. Two factual corrections: - Section 7 worktree line had the ahead/behind counts swapped ("25 ahead, 197 behind"); the branch is actually ~199 ahead / 25 behind origin/master. - Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...); the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for dense, which likewise differs from the canonical dense gate 5951a5b4). Both CDEF values come from combined_definitive.sh's own gate command, not the canonical bit-exact gate in section 3.3, so neither is sanctioned and both must be KL-validated. Everything else in the handoff verified accurate: fork branch localai-paged HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e, all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all referenced file/artifact paths. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:49:06 +00:00
Ettore Di Giacinto	2431090ff3	docs(paged): future-agent vLLM-parity HANDOFF guide (GB10, how-to companion to FINAL) Adds docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md (the why/record) with TL;DR state, the hard gates (per-path bit-exact md5, KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node decode-profiling rule that caused 4 wrong analyses), the complete tested-and-rejected lever map, methodology lessons, the three forward directions, and a key file/artifact index with the open discrepancies to reconcile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-30 14:42:44 +00:00

20 Commits