LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	bf61db6214	docs(paged): scope vLLM-class execution re-architecture (additive program) Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict to a ggml-execution-architecture-conditional one: same-silicon 2-3x is software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased additive program (P1 bf16-native stream, P2 expert-major fused MoE region, P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery arithmetic grounded in the both-engine nsys buckets, and upstream-clash analysis. Point the README docs list and PARITY_HANDOFF forward-direction at it. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 10:50:00 +00:00
Ettore Di Giacinto	b529cc5420	patches(paged): trim series to Phase135 routed-FFN line, sync to fork 1edddc8fe The campaign patches 0048-0063 were added without matching fork commits. After a keep/drop review, the series is trimmed and re-mirrored 1:1 onto the fork branch mudler/llama.cpp:localai-paged (HEAD 1edddc8fe, on 51168c5ee). Kept, renumbered from the fork (now carry Assisted-by + Signed-off-by): - 0048 test(paged): cover MoE swiglu down chain (was 0051, fd920cf8a) - 0049 test(paged): cover MoE weighted combine chain (was 0052, a85c1e098) - 0050 test(paged): cover ragged MoE dispatch (was 0053, 2fed6aacf) - 0051 fix(speculative): disable backend sampling for MTP drafts (was 0054, f1d976f06) - 0052 feat(paged): whole-pattern MoE matcher + routed-FFN fused NVFP4-quant down MMQ (new, 1edddc8fe) Dropped (no fork commits, removed from the series): - 0048-0050 W4A16 grouped-tile pack/tune/pad: dead line, W4A16 ~1.5x slower than grouped-MMQ. - 0055-0063 speculative/moe/mul-mat/cublas route traces + the rejected small-M tile-policy knob (0059). - All other 110-140 campaign markers not needed by Phase135 (GPU-sort, W4A16-direct-A, boundary trace/timing, Phase133 sorted-F32, Phase134 fused-SWIGLU, Phase138 finalize) carry no code in this tree. Tree-hash proof (the mirror invariant): a fresh detached worktree at LLAMA_VERSION 0ed235ea2c17a19fc8238668653946721ed136fd with every on-disk patches/paged/0*.patch applied in numeric order (git apply) stages to tree 097c862c6834b7d8b90419b305b8402155ef8373, byte-identical to fork HEAD 1edddc8fe's tree. Series is 43 patches (0001-0047 unchanged + 0048-0052). Gated on GB10 sm_121a: default md5 MoE 8cb0ce23 / dense 5951a5b4 unchanged; opt-in md5-clean; MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6; six mmq_moe_quantized_raw markers with zero sorted launches on the opt-in sentinel. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 10:19:10 +00:00
Ettore Di Giacinto	1aba41082b	docs(paged): record phases 112-140 + series trim decision Record the phase 110-140 GDN/MoE campaign benchmark log and append the series-trim decision to the parity handoff: keep the Phase135 routed-FFN fused-quant line plus the MoE test sentinels and the MTP-draft correctness fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort, W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded in the handoff and the per-phase bench artifacts. Fork re-mirrored on 51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree 097c862c). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 10:16:53 +00:00
Ettore Di Giacinto	67d2c4c9d4	docs(paged): record BF16 GDN state cache phase Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion. Assisted-by: Codex:gpt-5	2026-07-01 16:26:09 +00:00
Ettore Di Giacinto	d091eb30f2	docs(paged): record GDN identity shortcut phase Assisted-by: Codex:gpt-5	2026-07-01 15:46:41 +00:00
Ettore Di Giacinto	bbfaa66f02	docs(paged): record GDN BV32 decode A/B phase Assisted-by: Codex:gpt-5	2026-07-01 15:35:06 +00:00
Ettore Di Giacinto	04ed7fe52f	docs(paged): record GDN launch sweep phase Assisted-by: Codex:gpt-5	2026-07-01 15:13:45 +00:00
Ettore Di Giacinto	a9454b45c8	docs(paged): record MoE decode-only profile phase Assisted-by: Codex:gpt-5	2026-07-01 15:05:45 +00:00
Ettore Di Giacinto	f21b393746	docs(paged): record current MoE graph profile phase Assisted-by: Codex:gpt-5	2026-07-01 14:56:39 +00:00
Ettore Di Giacinto	26a41fad1a	docs(paged): record post-PoC GDN audit phase Assisted-by: Codex:gpt-5	2026-07-01 14:44:17 +00:00
Ettore Di Giacinto	5369219729	docs(paged): record GDN blocked-solve PoC phase Assisted-by: Codex:gpt-5	2026-07-01 14:39:09 +00:00
Ettore Di Giacinto	eb82ff138f	docs(paged): record datacenter Blackwell readiness phase Assisted-by: Codex:gpt-5	2026-07-01 14:28:41 +00:00
Ettore Di Giacinto	2efb0ec362	docs(paged): record TTFT min32 serving phase Assisted-by: Codex:gpt-5	2026-07-01 14:18:54 +00:00
Ettore Di Giacinto	e5c5746c0a	docs(paged): record GDN tensor-core revalidation phase Assisted-by: Codex:gpt-5	2026-07-01 14:05:20 +00:00
Ettore Di Giacinto	6cf8b782d1	docs(paged): record BF16 F32 output broader serving phase Assisted-by: Codex:gpt-5	2026-07-01 13:26:50 +00:00
Ettore Di Giacinto	e573194799	docs(paged): record patch mirror readiness phase Assisted-by: Codex:gpt-5	2026-07-01 13:11:57 +00:00
Ettore Di Giacinto	2b2b1f0b25	docs(paged): record BF16 F32 output dense serving phase Assisted-by: Codex:gpt-5	2026-07-01 13:06:49 +00:00
Ettore Di Giacinto	e67b329eb1	docs(paged): record BF16 cuBLAS F32 output phase Assisted-by: Codex:gpt-5	2026-07-01 12:54:24 +00:00
Ettore Di Giacinto	60954d484a	docs(paged): record quant kernel timing phase Assisted-by: Codex:gpt-5	2026-07-01 12:45:19 +00:00
Ettore Di Giacinto	3fbdfc21c9	docs(paged): record quant trace phase Assisted-by: Codex:gpt-5	2026-07-01 12:42:13 +00:00
Ettore Di Giacinto	55df9100dc	docs(paged): record layout trace phase Assisted-by: Codex:gpt-5	2026-07-01 12:32:05 +00:00
Ettore Di Giacinto	2e19e5c90f	docs(paged): record prefill bucket attribution phase Assisted-by: Codex:gpt-5	2026-07-01 12:20:42 +00:00
Ettore Di Giacinto	6a2618b6dc	docs(paged): record MTP verify-cost rejection Assisted-by: Codex:gpt-5	2026-07-01 11:51:29 +00:00
Ettore Di Giacinto	f7d76389b0	docs(paged): record W4A16 direct activation rejection Assisted-by: Codex:gpt-5	2026-07-01 11:28:11 +00:00
Ettore Di Giacinto	ef578866c8	docs(paged): scope W4A16 direct activation experiment Assisted-by: Codex:gpt-5	2026-07-01 10:59:56 +00:00
Ettore Di Giacinto	fc5d5e4ff3	docs(paged): profile current W4A16 prefill Assisted-by: Codex:gpt-5	2026-07-01 10:56:48 +00:00
Ettore Di Giacinto	ef7dbfa5f7	docs(paged): compare MoE min32 against vLLM Assisted-by: Codex:gpt-5	2026-07-01 10:46:32 +00:00
Ettore Di Giacinto	c41d1a5b4f	docs(paged): record waiting-threshold TTFT defer Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision. Assisted-by: Codex:gpt-5	2026-07-01 10:31:09 +00:00
Ettore Di Giacinto	9be291e6b0	docs(paged): reject capped TTFT defer sweep Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path. Assisted-by: Codex:gpt-5	2026-07-01 10:18:41 +00:00
Ettore Di Giacinto	902bcc7717	docs(paged): validate TTFT prefill-first A/B Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision. Assisted-by: Codex:gpt-5	2026-07-01 10:05:23 +00:00
Ettore Di Giacinto	999cf09532	docs(paged): record TTFT prefill-first A/B Record Phase55 default-off scheduler A/B, DGX md5/op gates, dense serving results, and the pending fork push/mirror status. Assisted-by: Codex:gpt-5	2026-07-01 09:57:55 +00:00
Ettore Di Giacinto	3dbf34e739	docs(paged): record admission histogram trace Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision. Assisted-by: Codex:gpt-5	2026-07-01 09:40:50 +00:00
Ettore Di Giacinto	347a5c05bd	docs(paged): reject admission budget sweep Assisted-by: Codex:gpt-5	2026-07-01 09:27:20 +00:00
Ettore Di Giacinto	2aa76702df	docs(paged): record dense admission trace Assisted-by: Codex:gpt-5	2026-07-01 09:18:43 +00:00
Ettore Di Giacinto	b5f65152e2	docs(paged): record serving admission trace Assisted-by: Codex:gpt-5	2026-07-01 09:08:42 +00:00
Ettore Di Giacinto	c299dcd231	docs(paged): record dense true decode profile Assisted-by: Codex:gpt-5	2026-07-01 08:55:23 +00:00
Ettore Di Giacinto	cd59e5d61f	fix(paged): scrub harness vars for vllm serve Assisted-by: Codex:gpt-5	2026-07-01 08:23:05 +00:00
Ettore Di Giacinto	96825a224e	docs(paged): record dense serving snapshot Assisted-by: Codex:gpt-5	2026-07-01 08:20:26 +00:00
Ettore Di Giacinto	440129c98e	fix(paged): harden serving snapshot readiness Assisted-by: Codex:gpt-5	2026-07-01 08:07:48 +00:00
Ettore Di Giacinto	e69ee0e867	feat(paged): parameterize served model name Assisted-by: Codex:gpt-5	2026-07-01 07:50:19 +00:00
Ettore Di Giacinto	2a0fc0f4b9	docs(paged): record inference gate guard Assisted-by: Codex:gpt-5	2026-07-01 07:45:52 +00:00
Ettore Di Giacinto	ae8284f5fb	feat(paged): parameterize vllm serving snapshot Assisted-by: Codex:gpt-5	2026-07-01 07:41:55 +00:00
Ettore Di Giacinto	ecaf406c0b	docs(paged): reject persistent gate fusion shortcut Assisted-by: Codex:gpt-5	2026-07-01 07:34:27 +00:00
Ettore Di Giacinto	b9eff5bca3	docs(paged): reconcile next parity target Assisted-by: Codex:gpt-5	2026-07-01 07:31:26 +00:00
Ettore Di Giacinto	aa848d5afb	docs(paged): record low-concurrency serving check Assisted-by: Codex:gpt-5	2026-07-01 07:24:28 +00:00
Ettore Di Giacinto	d44e164c96	docs(paged): record max-concurrency parity check Assisted-by: Codex:gpt-5	2026-07-01 07:13:48 +00:00
Ettore Di Giacinto	52c11b1ce5	docs(paged): reject graph-time gate fusion shortcut Assisted-by: Codex:gpt-5	2026-07-01 06:56:01 +00:00
Ettore Di Giacinto	5354adcffb	docs(paged): scope gate projection policy Assisted-by: Codex:gpt-5	2026-07-01 06:50:19 +00:00
Ettore Di Giacinto	9f75da01f9	feat(paged): add cublas tensor-name trace patch Add patch 0063 extending LLAMA_CUBLAS_ROUTE_TRACE with src0/src1/dst tensor names. Record Phase 37 gates and the conclusion that SGEMM traces to MoE gate tensors. Assisted-by: Codex:gpt-5	2026-07-01 06:41:00 +00:00
Ettore Di Giacinto	fbdc200886	feat(paged): add cublas route trace patch Add patch 0062 with default-off LLAMA_CUBLAS_ROUTE_TRACE instrumentation for generic cuBLAS MUL_MAT subroutes. Record Phase 36 DGX gates, serving trace results, and the next projection follow-up scope. Assisted-by: Codex:gpt-5	2026-07-01 06:24:46 +00:00
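
The tree-hash proof described in commit b529cc5420 (apply every patch in numeric order on a detached worktree, then compare the staged tree against the fork HEAD's tree) could be reproduced with something like the sketch below. This is not the project's own tooling: the function names, the four-digit prefix regex, and the use of `git apply --index` followed by `git write-tree` are assumptions chosen for illustration; the commit message only says the patches are applied with `git apply` and staged.

```python
import re
import subprocess
from pathlib import Path

# Assumption: patch files are named like "0001-some-title.patch".
PATCH_RE = re.compile(r"^(\d{4})-.*\.patch$")


def ordered_patches(names):
    """Return patch file names sorted by their numeric prefix; skip other files."""
    numbered = []
    for name in names:
        match = PATCH_RE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def staged_tree_hash(repo, base_commit, patch_dir, worktree):
    """Apply the series on a fresh detached worktree and return the staged tree hash.

    Hypothetical helper: `repo` is a local clone that contains `base_commit`,
    `patch_dir` holds the patch files, `worktree` is a path that does not exist yet.
    """
    def git(*args, cwd):
        result = subprocess.run(
            ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
        )
        return result.stdout.strip()

    git("worktree", "add", "--detach", str(worktree), base_commit, cwd=repo)
    patch_dir = Path(patch_dir).resolve()
    for name in ordered_patches(p.name for p in patch_dir.iterdir()):
        # --index updates both the working tree and the index, so the
        # result is staged and can be hashed with write-tree.
        git("apply", "--index", str(patch_dir / name), cwd=worktree)
    return git("write-tree", cwd=worktree)
```

Under these assumptions, the mirror invariant holds when `staged_tree_hash(...)` called with the LLAMA_VERSION commit equals the tree hash quoted in the commit message (097c862c6834b7d8b90419b305b8402155ef8373), which in turn should match the output of `git rev-parse 1edddc8fe^{tree}` in the fork.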

876 Commits