Commit Graph

7182 Commits

Author SHA1 Message Date
Ettore Di Giacinto
b2784ccbca docs(paged): fix EXECUTION_REARCH_SCOPE seam citations to fork 1edddc8fe
Adversarial verification against the canonical fork mudler/llama.cpp:localai-paged
HEAD 1edddc8fe found the scope doc's section-3 seam references were anchored to
the abandoned pre-trim tree 237ad9b96, which the immediately-preceding commit
b529cc5420 reset away. Two classes of defect, both corrected:

- Phantom scaffolding (honesty): the doc claimed "the team has already started
  scaffolding P1 and P3" citing four commits (237ad9b96 bf16 GDN state cache,
  afc2c7030 act-quant trace, ea0875d14 LLAMA_BF16_CUBLAS_F32_OUT, 7967ad47f
  W4A16 direct-A stub) that b529cc5420 TRIMMED - none exist at 1edddc8fe (git
  cat-file: not a valid object). w4a16-policy.h, test-cuda-w4a16-policy.cpp and
  ggml_cuda_mul_mat_id_w4a16_grouped_direct_a are absent from the tree. Reworded
  P1 plank-1 and the P3 mechanism/files/effort to say these must be re-introduced
  on top of the surviving grouped W4A16 path (patch 0035), not "finished".

- Stale line numbers (additivity): every file:line was off (computed against the
  larger 237ad9b96 tree). Re-anchored to 1edddc8fe: ggml_cuda_try_fuse 4232 (was
  4661), capture loop 4908 (was 5444), moe whole-pattern matcher 4157 (was 4678),
  routed_ffn_poc moe-ffn.cu:275 (was 456), grouped W4A16 hook ggml-cuda.cu:2797
  (was 3093/3188; the direct-A hooks 3085/3171 never existed), concurrent_event
  machinery 4769 (was 5305-5318), continuous-batch budget server-context.cpp
  3083-3135 with LLAMA_MAX_BATCH_TOKENS at 3105 / prefill_budget_step at 3113
  (was 3122-3200).

Numbers (attribution table, recovery arithmetic), the six P0 kill-gates, and the
unreachable-floor honesty were verified sound and left unchanged.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 11:03:07 +00:00
Ettore Di Giacinto
bf61db6214 docs(paged): scope vLLM-class execution re-architecture (additive program)
Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict
to a ggml-execution-architecture-conditional one: same-silicon 2-3x is
software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased
additive program (P1 bf16-native stream, P2 expert-major fused MoE region,
P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve
GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path
md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery
arithmetic grounded in the both-engine nsys buckets, and upstream-clash
analysis. Point the README docs list and PARITY_HANDOFF forward-direction at
it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 10:50:00 +00:00
Ettore Di Giacinto
b529cc5420 patches(paged): trim series to Phase135 routed-FFN line, sync to fork 1edddc8fe
The campaign patches 0048-0063 were added without matching fork commits.
After a keep/drop review, the series is trimmed and re-mirrored 1:1 onto the
fork branch mudler/llama.cpp:localai-paged (HEAD 1edddc8fe, on 51168c5ee).

Kept, renumbered from the fork (now carry Assisted-by + Signed-off-by):
- 0048 test(paged): cover MoE swiglu down chain      (was 0051, fd920cf8a)
- 0049 test(paged): cover MoE weighted combine chain (was 0052, a85c1e098)
- 0050 test(paged): cover ragged MoE dispatch        (was 0053, 2fed6aacf)
- 0051 fix(speculative): disable backend sampling for MTP drafts (was 0054, f1d976f06)
- 0052 feat(paged): whole-pattern MoE matcher + routed-FFN fused NVFP4-quant
       down MMQ (new, 1edddc8fe)

Dropped (no fork commits, removed from the series):
- 0048-0050 W4A16 grouped-tile pack/tune/pad: dead line, W4A16 ~1.5x slower
  than grouped-MMQ.
- 0055-0063 speculative/moe/mul-mat/cublas route traces + the rejected small-M
  tile-policy knob (0059).
- All other 110-140 campaign markers not needed by Phase135 (GPU-sort,
  W4A16-direct-A, boundary trace/timing, Phase133 sorted-F32, Phase134
  fused-SWIGLU, Phase138 finalize) carry no code in this tree.

Tree-hash proof (the mirror invariant): a fresh detached worktree at
LLAMA_VERSION 0ed235ea2c17a19fc8238668653946721ed136fd with every on-disk
patches/paged/0*.patch applied in numeric order (git apply) stages to tree
097c862c6834b7d8b90419b305b8402155ef8373, byte-identical to fork HEAD
1edddc8fe's tree. Series is 43 patches (0001-0047 unchanged + 0048-0052).
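
The mirror check described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the commit: the helper names `ordered_patches` and `staged_tree_hash` are hypothetical, the patch file naming (a zero-padded numeric prefix followed by `.patch`) is assumed from the `0*.patch` glob, and `git apply --index` followed by `git write-tree` is one plausible way to "stage to a tree"; the original may have staged differently.

```python
import re
import subprocess
from pathlib import Path

# Assumed naming: a numeric prefix starting with 0, e.g. "0048-some-title.patch".
PATCH_RE = re.compile(r"^(\d+).*\.patch$")


def ordered_patches(names):
    """Return the names matching 0*.patch, sorted by their numeric prefix."""
    numbered = []
    for name in names:
        match = PATCH_RE.match(name)
        if match and name.startswith("0"):
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def staged_tree_hash(worktree, patch_dir):
    """Apply every patch in numeric order to the index of `worktree`
    and return the hash of the resulting tree object.

    `worktree` is assumed to be a fresh detached worktree already checked
    out at the pinned base commit (for example via `git worktree add --detach`).
    """
    # Resolve to an absolute path because `git -C` changes the working directory.
    patch_dir = Path(patch_dir).resolve()
    names = [p.name for p in patch_dir.iterdir()]
    for name in ordered_patches(names):
        subprocess.run(
            ["git", "-C", str(worktree), "apply", "--index", str(patch_dir / name)],
            check=True,
        )
    result = subprocess.run(
        ["git", "-C", str(worktree), "write-tree"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Under these assumptions, the invariant holds when the returned hash equals the tree of the fork HEAD, which could be read with `git rev-parse 1edddc8fe^{tree}` in a clone of the fork.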

Gated on GB10 sm_121a: default md5 MoE 8cb0ce23 / dense 5951a5b4 unchanged;
opt-in md5-clean; MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46,
MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6; six mmq_moe_quantized_raw
markers with zero sorted launches on the opt-in sentinel.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 10:19:10 +00:00
Ettore Di Giacinto
1aba41082b docs(paged): record phases 112-140 + series trim decision
Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 10:16:53 +00:00
Ettore Di Giacinto
67d2c4c9d4 docs(paged): record BF16 GDN state cache phase
Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion.

Assisted-by: Codex:gpt-5
2026-07-01 16:26:09 +00:00
Ettore Di Giacinto
d091eb30f2 docs(paged): record GDN identity shortcut phase
Assisted-by: Codex:gpt-5
2026-07-01 15:46:41 +00:00
Ettore Di Giacinto
bbfaa66f02 docs(paged): record GDN BV32 decode A/B phase
Assisted-by: Codex:gpt-5
2026-07-01 15:35:06 +00:00
Ettore Di Giacinto
04ed7fe52f docs(paged): record GDN launch sweep phase
Assisted-by: Codex:gpt-5
2026-07-01 15:13:45 +00:00
Ettore Di Giacinto
a9454b45c8 docs(paged): record MoE decode-only profile phase
Assisted-by: Codex:gpt-5
2026-07-01 15:05:45 +00:00
Ettore Di Giacinto
f21b393746 docs(paged): record current MoE graph profile phase
Assisted-by: Codex:gpt-5
2026-07-01 14:56:39 +00:00
Ettore Di Giacinto
26a41fad1a docs(paged): record post-PoC GDN audit phase
Assisted-by: Codex:gpt-5
2026-07-01 14:44:17 +00:00
Ettore Di Giacinto
5369219729 docs(paged): record GDN blocked-solve PoC phase
Assisted-by: Codex:gpt-5
2026-07-01 14:39:09 +00:00
Ettore Di Giacinto
eb82ff138f docs(paged): record datacenter Blackwell readiness phase
Assisted-by: Codex:gpt-5
2026-07-01 14:28:41 +00:00
Ettore Di Giacinto
2efb0ec362 docs(paged): record TTFT min32 serving phase
Assisted-by: Codex:gpt-5
2026-07-01 14:18:54 +00:00
Ettore Di Giacinto
e5c5746c0a docs(paged): record GDN tensor-core revalidation phase
Assisted-by: Codex:gpt-5
2026-07-01 14:05:20 +00:00
Ettore Di Giacinto
6cf8b782d1 docs(paged): record BF16 F32 output broader serving phase
Assisted-by: Codex:gpt-5
2026-07-01 13:26:50 +00:00
Ettore Di Giacinto
e573194799 docs(paged): record patch mirror readiness phase
Assisted-by: Codex:gpt-5
2026-07-01 13:11:57 +00:00
Ettore Di Giacinto
2b2b1f0b25 docs(paged): record BF16 F32 output dense serving phase
Assisted-by: Codex:gpt-5
2026-07-01 13:06:49 +00:00
Ettore Di Giacinto
e67b329eb1 docs(paged): record BF16 cuBLAS F32 output phase
Assisted-by: Codex:gpt-5
2026-07-01 12:54:24 +00:00
Ettore Di Giacinto
60954d484a docs(paged): record quant kernel timing phase
Assisted-by: Codex:gpt-5
2026-07-01 12:45:19 +00:00
Ettore Di Giacinto
3fbdfc21c9 docs(paged): record quant trace phase
Assisted-by: Codex:gpt-5
2026-07-01 12:42:13 +00:00
Ettore Di Giacinto
55df9100dc docs(paged): record layout trace phase
Assisted-by: Codex:gpt-5
2026-07-01 12:32:05 +00:00
Ettore Di Giacinto
2e19e5c90f docs(paged): record prefill bucket attribution phase
Assisted-by: Codex:gpt-5
2026-07-01 12:20:42 +00:00
Ettore Di Giacinto
6a2618b6dc docs(paged): record MTP verify-cost rejection
Assisted-by: Codex:gpt-5
2026-07-01 11:51:29 +00:00
Ettore Di Giacinto
f7d76389b0 docs(paged): record W4A16 direct activation rejection
Assisted-by: Codex:gpt-5
2026-07-01 11:28:11 +00:00
Ettore Di Giacinto
4645935fa5 docs(paged): mark W4A16 direct routing stub done
Assisted-by: Codex:gpt-5
2026-07-01 11:14:55 +00:00
Ettore Di Giacinto
b425d8ce03 docs(paged): mark W4A16 direct policy tests done
Assisted-by: Codex:gpt-5
2026-07-01 11:06:10 +00:00
Ettore Di Giacinto
ef578866c8 docs(paged): scope W4A16 direct activation experiment
Assisted-by: Codex:gpt-5
2026-07-01 10:59:56 +00:00
Ettore Di Giacinto
fc5d5e4ff3 docs(paged): profile current W4A16 prefill
Assisted-by: Codex:gpt-5
2026-07-01 10:56:48 +00:00
Ettore Di Giacinto
ef7dbfa5f7 docs(paged): compare MoE min32 against vLLM
Assisted-by: Codex:gpt-5
2026-07-01 10:46:32 +00:00
Ettore Di Giacinto
c41d1a5b4f docs(paged): record waiting-threshold TTFT defer
Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision.

Assisted-by: Codex:gpt-5
2026-07-01 10:31:09 +00:00
Ettore Di Giacinto
9be291e6b0 docs(paged): reject capped TTFT defer sweep
Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.

Assisted-by: Codex:gpt-5
2026-07-01 10:18:41 +00:00
Ettore Di Giacinto
902bcc7717 docs(paged): validate TTFT prefill-first A/B
Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision.

Assisted-by: Codex:gpt-5
2026-07-01 10:05:23 +00:00
Ettore Di Giacinto
999cf09532 docs(paged): record TTFT prefill-first A/B
Record Phase55 default-off scheduler A/B, DGX md5/op gates, dense serving results, and the pending fork push/mirror status.

Assisted-by: Codex:gpt-5
2026-07-01 09:57:55 +00:00
Ettore Di Giacinto
3dbf34e739 docs(paged): record admission histogram trace
Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision.

Assisted-by: Codex:gpt-5
2026-07-01 09:40:50 +00:00
Ettore Di Giacinto
347a5c05bd docs(paged): reject admission budget sweep
Assisted-by: Codex:gpt-5
2026-07-01 09:27:20 +00:00
Ettore Di Giacinto
2aa76702df docs(paged): record dense admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:18:43 +00:00
Ettore Di Giacinto
b5f65152e2 docs(paged): record serving admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:08:42 +00:00
Ettore Di Giacinto
c299dcd231 docs(paged): record dense true decode profile
Assisted-by: Codex:gpt-5
2026-07-01 08:55:23 +00:00
Ettore Di Giacinto
cd59e5d61f fix(paged): scrub harness vars for vllm serve
Assisted-by: Codex:gpt-5
2026-07-01 08:23:05 +00:00
Ettore Di Giacinto
96825a224e docs(paged): record dense serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 08:20:26 +00:00
Ettore Di Giacinto
440129c98e fix(paged): harden serving snapshot readiness
Assisted-by: Codex:gpt-5
2026-07-01 08:07:48 +00:00
Ettore Di Giacinto
e69ee0e867 feat(paged): parameterize served model name
Assisted-by: Codex:gpt-5
2026-07-01 07:50:19 +00:00
Ettore Di Giacinto
2a0fc0f4b9 docs(paged): record inference gate guard
Assisted-by: Codex:gpt-5
2026-07-01 07:45:52 +00:00
Ettore Di Giacinto
ae8284f5fb feat(paged): parameterize vllm serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 07:41:55 +00:00
Ettore Di Giacinto
ecaf406c0b docs(paged): reject persistent gate fusion shortcut
Assisted-by: Codex:gpt-5
2026-07-01 07:34:27 +00:00
Ettore Di Giacinto
b9eff5bca3 docs(paged): reconcile next parity target
Assisted-by: Codex:gpt-5
2026-07-01 07:31:26 +00:00
Ettore Di Giacinto
aa848d5afb docs(paged): record low-concurrency serving check
Assisted-by: Codex:gpt-5
2026-07-01 07:24:28 +00:00
Ettore Di Giacinto
d44e164c96 docs(paged): record max-concurrency parity check
Assisted-by: Codex:gpt-5
2026-07-01 07:13:48 +00:00
Ettore Di Giacinto
52c11b1ce5 docs(paged): reject graph-time gate fusion shortcut
Assisted-by: Codex:gpt-5
2026-07-01 06:56:01 +00:00