LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	e189e5a4ca	feat(paged): add moe mmq launch trace patch Assisted-by: Codex:gpt-5	2026-07-01 04:54:33 +00:00
Ettore Di Giacinto	b28b448c68	docs(paged): record mmq shape serving profile Assisted-by: Codex:gpt-5	2026-07-01 04:36:04 +00:00
Ettore Di Giacinto	2148fa466b	feat(paged): add moe mmq shape trace patch Assisted-by: Codex:gpt-5	2026-07-01 04:32:12 +00:00
Ettore Di Giacinto	3b9ec3e1f1	docs(paged): record mmq occupancy rejection Assisted-by: Codex:gpt-5	2026-07-01 04:18:12 +00:00
Ettore Di Giacinto	3c2cb9f4ab	docs(paged): record graph-node serving profile Record the Phase 27 current-stack llama.cpp n128 serving profile captured with CUDA graph node tracing and gated before and after the run. Assisted-by: Codex:gpt-5	2026-07-01 04:00:14 +00:00
Ettore Di Giacinto	ace1ffab28	docs(paged): record audited current snapshot Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates. Assisted-by: Codex:gpt-5	2026-07-01 03:48:27 +00:00
Ettore Di Giacinto	a0194125f5	chore(paged): summarize snapshot inference gates Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot. Assisted-by: Codex:gpt-5	2026-07-01 03:35:54 +00:00
Ettore Di Giacinto	7108b68a70	chore(paged): record snapshot hardware class Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons. Assisted-by: Codex:gpt-5	2026-07-01 03:31:11 +00:00
Ettore Di Giacinto	7aa15ce539	docs(paged): refresh parity handoff coordinates Update the paged parity handoff to the current fork head, patch count, mirror invariant, current serving harness, and LocalAI AI-attribution policy after Phases 20-22. Assisted-by: Codex:gpt-5	2026-07-01 03:25:14 +00:00
Ettore Di Giacinto	6c165747a9	docs(paged): verify patch-series mirror invariant Record the Phase 22 strict git-apply mirror check proving the LocalAI paged patch series reconstructs the canonical llama.cpp fork tree after patch 0055. Assisted-by: Codex:gpt-5	2026-07-01 03:22:43 +00:00
Ettore Di Giacinto	ff3f0620de	chore(paged): add current serving snapshot harness Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries. Assisted-by: Codex:gpt-5	2026-07-01 03:19:36 +00:00
Ettore Di Giacinto	c99678da42	docs(paged): refresh current serving snapshot Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision. Assisted-by: Codex:gpt-5	2026-07-01 03:15:30 +00:00
Ettore Di Giacinto	310eb3c866	docs(paged): reject MTP draft-shape scheduler Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates. Assisted-by: Codex:gpt-5	2026-07-01 03:03:49 +00:00
Ettore Di Giacinto	cced07c7fe	docs(paged): add MTP shape trace patch Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results. Assisted-by: Codex:gpt-5	2026-07-01 02:54:29 +00:00
Ettore Di Giacinto	6e35476340	docs(paged): scope MTP graph-shape follow-up Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment. Assisted-by: Codex:gpt-5	2026-07-01 02:37:21 +00:00
Ettore Di Giacinto	ae76d42a96	docs(paged): profile MTP graph reuse loss Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression. Assisted-by: Codex:gpt-5	2026-07-01 02:32:49 +00:00
Ettore Di Giacinto	4d171e62bb	docs(paged): reject MTP serving lever Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:29:28 +00:00
Ettore Di Giacinto	70394364a3	docs(paged): gate MTP rollback safety Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates. Assisted-by: Codex:gpt-5	2026-07-01 02:15:11 +00:00
Ettore Di Giacinto	ede23df333	docs(paged): close W4A16 pad checklist Mark the rejected-branch disposition as not taken because Phase 4 was kept as patch 0050 with recorded md5, op, perf, and mirror gates. Assisted-by: Codex:gpt-5	2026-07-01 01:58:22 +00:00
Ettore Di Giacinto	abc70c209e	docs(paged): close ragged MoE dispatch shortcut Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work. Assisted-by: Codex:gpt-5	2026-07-01 01:57:45 +00:00
Ettore Di Giacinto	2074b4fb5b	docs(paged): reject GDN global Ai32 prototype Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10. Assisted-by: Codex:gpt-5	2026-07-01 01:51:53 +00:00
Ettore Di Giacinto	adabd11919	docs(paged): scope GDN global Ai32 prototype Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan. Assisted-by: Codex:gpt-5	2026-07-01 01:38:51 +00:00
Ettore Di Giacinto	1b5ae227eb	docs(paged): reject GDN M5 QS-early phase Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact. Assisted-by: Codex:gpt-5	2026-07-01 01:29:44 +00:00
Ettore Di Giacinto	24e778de47	docs(paged): scope GDN M5 state-boundary phase Add the Phase 11 design and implementation plan for a default-off C16 M5 QS-early GDN experiment after rejecting C32 slabs. Assisted-by: Codex:gpt-5	2026-07-01 01:21:05 +00:00
Ettore Di Giacinto	3da3b169fb	docs(paged): reject GDN C32 slab phase Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch. Assisted-by: Codex:gpt-5	2026-07-01 01:15:00 +00:00
Ettore Di Giacinto	ff3ad84191	docs(paged): record GDN C32 slab baseline Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut. Assisted-by: Codex:gpt-5	2026-07-01 00:58:54 +00:00
Ettore Di Giacinto	9bbe02c161	fix(paged): gate MTP backend sampling Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase. Assisted-by: Codex:gpt-5	2026-07-01 00:54:25 +00:00
Ettore Di Giacinto	b862e2c568	docs(paged): stop ragged dispatch source shortcut Assisted-by: Codex:gpt-5	2026-07-01 00:42:36 +00:00
Ettore Di Giacinto	b009de0ee0	test(paged): mirror ragged MoE dispatch gate Assisted-by: Codex:gpt-5	2026-07-01 00:41:21 +00:00
Ettore Di Giacinto	89ef3a4020	docs(paged): record ragged MoE profile gate Assisted-by: Codex:gpt-5	2026-07-01 00:35:21 +00:00
Ettore Di Giacinto	ef14748f06	docs(paged): scope ragged MoE dispatch phase Assisted-by: Codex:gpt-5	2026-07-01 00:26:01 +00:00
Ettore Di Giacinto	b6885aa446	docs(paged): reject weighted combine fusion candidate Assisted-by: Codex:gpt-5	2026-07-01 00:20:53 +00:00
Ettore Di Giacinto	4b6fc0fa1c	test(paged): mirror MoE weighted combine gate Assisted-by: Codex:gpt-5	2026-06-30 23:51:52 +00:00
Ettore Di Giacinto	22a93ce1a3	docs(paged): select weighted combine candidate Assisted-by: Codex:gpt-5	2026-06-30 23:47:34 +00:00
Ettore Di Giacinto	3cf7fa1715	docs(paged): reject swiglu down fusion candidate Assisted-by: Codex:gpt-5	2026-06-30 23:41:38 +00:00
Ettore Di Giacinto	d0fa463eac	test(paged): mirror MoE swiglu down gate Mirror the llama.cpp Phase 7 test gate for the merged MoE gate_up/SWIGLU/down chain and record the DGX md5/op gate evidence. Assisted-by: Codex:gpt-5	2026-06-30 23:20:52 +00:00
Ettore Di Giacinto	34c4b5ce8d	docs(paged): scope phase7 serving candidates Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates. Assisted-by: Codex:gpt-5	2026-06-30 23:12:09 +00:00
Ettore Di Giacinto	b647460dee	docs(paged): record phase6 serving classifier Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work. Assisted-by: Codex:gpt-5	2026-06-30 23:04:15 +00:00
Ettore Di Giacinto	f9e015d8e2	docs(paged): record W4A16 Wq padding rejection Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch. Assisted-by: Codex:gpt-5	2026-06-30 22:23:14 +00:00
Ettore Di Giacinto	85c88320ef	patches(paged): pad W4A16 A shared tile stride Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 22:15:21 +00:00
Ettore Di Giacinto	8b413d1cbd	docs(paged): record W4A16 scale broadcast rejection Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch. Assisted-by: Codex:gpt-5	2026-06-30 22:06:17 +00:00
Ettore Di Giacinto	c5f2545cdd	patches(paged): tune W4A16 grouped tile shape Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification. Assisted-by: Codex:gpt-5	2026-06-30 21:57:42 +00:00
Ettore Di Giacinto	d8edc615e7	patches(paged): mirror W4A16 packed metadata Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off. Assisted-by: Codex:gpt-5	2026-06-30 21:21:53 +00:00
Ettore Di Giacinto	1c0709b700	docs(paged): record W4A16 phase1 kill gate Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:40:40 +00:00
Ettore Di Giacinto	337ebb8a37	docs(paged): record phase0 decode repro Record comparable graph-node-traced paged and vLLM decode difference-method artifacts for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:35:43 +00:00
Ettore Di Giacinto	ef5d4af203	docs(paged): record phase0 prefill baseline Record clean-source MoE and dense prefill baselines for the GB10 parity reopen and mark the plan checkpoint complete. Assisted-by: Codex:gpt-5	2026-06-30 20:22:18 +00:00
Ettore Di Giacinto	a9a2efb296	docs(paged): record phase0 clean build gates Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen. Assisted-by: Codex:gpt-5	2026-06-30 20:19:14 +00:00
Ettore Di Giacinto	d288a0300f	docs(paged): add GB10 parity implementation plan Add the Superpowers implementation plan for the GB10 parity reopen, including Phase 0 provenance, decode repro, W4A16 kill gates, and later kernel workstream entry criteria. Assisted-by: Codex:gpt-5	2026-06-30 15:50:01 +00:00
Ettore Di Giacinto	4cd90bfae9	paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit) The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit. Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent). Removed: - patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch - the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads) - the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention. The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 16:06:06 +00:00
Ettore Di Giacinto	ea72a56e2c	Merge origin/master + pin-sync paged backend to 0ed235ea master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the shared grpc-server.cpp. The paged backend's pin must track the stock pin (the grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged patches apply clean on 0ed235ea (verified against a fresh upstream clone). The bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full grpc-server build verify on GPU/CI to follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-28 07:56:47 +00:00

628 Commits