Commit Graph

89 Commits

Author SHA1 Message Date
Ettore Di Giacinto
ac2b0211ff docs(paged): record P6 fp8-KV BLOCKED-ON-INFRA + the analytical decode ceiling
P6 (final program phase) could not run its kill-gate: the DGX/GB10 was
unreachable for the entire window (cloudflared access via prem-vm returned
HTTP 530 / websocket bad-handshake on every probe; re-confirmed with 5 fresh
probes). Stage 0a (measured nsys graph-node decode ceiling) and Stage 0b
(fp8-e4m3 kernel + kill-gate A/B) were physically impossible with no GPU.

Records the honest infra-block (NOT a measured NO-GO, NOT a NO-GO-by-ceiling)
plus the load-bearing artifact: the analytical fp8-KV decode ceiling table.
fp8 halves KV bytes -> theoretical-max decode saving = 0.5 x flash-attn share:
ctx256 0.65% (standard shape hard NO-GO), ctx1024 2.55%, ctx2048 4.98% (first
crosses +3%), ctx4096 9.49%, ctx8192 17.34%. The win, if realizable, lives
only at ctx>=2048; the hybrid-GDN structure (10/40 layers carry KV, 30 GDN
layers hold fixed-size recurrent state with no KV) caps what any KV-dtype
lever can save. The dominant null stands unrefuted: Q8_0 KV was a measured
+7.8% decode regression on GB10. Notes the capacity-play framing (fp8-KV as a
memory feature remains open even if throughput-flat).
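
The ceiling arithmetic above can be checked with a short sketch. It assumes, as the message states, that fp8 halves KV bytes relative to a 16-bit cache, and additionally assumes that flash-attn decode time scales linearly with KV bytes, which is what makes the ceiling a best case. The function names and constants are illustrative and are not part of the recorded docs.

```python
# Ceiling figures quoted in the commit message: context length -> max decode saving (%).
CEILING_PCT = {256: 0.65, 1024: 2.55, 2048: 4.98, 4096: 9.49, 8192: 17.34}

KV_BYTE_RATIO = 0.5      # assumption: fp8 KV uses half the bytes of a 16-bit KV cache
GO_THRESHOLD_PCT = 3.0   # the "+3%" bar mentioned in the message


def implied_flash_attn_share(ceiling_pct, byte_ratio=KV_BYTE_RATIO):
    """Invert ceiling = (1 - byte_ratio) * share to recover the flash-attn share (%)."""
    return ceiling_pct / (1.0 - byte_ratio)


def first_ctx_crossing(table, threshold_pct):
    """Return the smallest context length whose ceiling exceeds the threshold, or None."""
    for ctx in sorted(table):
        if table[ctx] > threshold_pct:
            return ctx
    return None


if __name__ == "__main__":
    for ctx in sorted(CEILING_PCT):
        share = implied_flash_attn_share(CEILING_PCT[ctx])
        print(f"ctx{ctx}: ceiling {CEILING_PCT[ctx]:.2f}%, implied flash-attn share {share:.2f}%")
    print("first ctx above +3%:", first_ctx_crossing(CEILING_PCT, GO_THRESHOLD_PCT))
```

Under these assumptions the table implies a flash-attn share of about 1.3% at ctx256 and about 34.7% at ctx8192, consistent with the statement that the win, if realizable, lives only at ctx>=2048.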

Fork localai-paged untouched at 653bb2f3d; series stays at 46 patches
(0001-0055); P3's p3-w4a16-direct work undisturbed. Docs-only; no code, no
topic branch, no patches. Not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 21:43:36 +00:00
Ettore Di Giacinto
5b8b33a302 docs(paged): record P5 FLA GDN NO-GO - GDN prefill bucket is a confirmed shared-hardware floor
P5 ported the full six-kernel vLLM-FLA chunk_gated_delta_rule_fwd pipeline
(cumsum, chunk_scaled_dot_kkt, blocked merge_16x16_to_64x64 solve_tril,
recompute_w_u, register/smem-resident chunk_gated_delta_rule_fwd_h with the
chunk loop in-kernel, chunk_fwd_o) to CUDA tf32 mma, per-kernel validated vs
host fp64 (o NMSE 2.2e-7, final-state 1.2e-7), integrated behind
LLAMA_GDN_FLA_CHUNK=1 (default-off), and A/B'd in-backend vs the shipped M5
chunked scan.

P0 perf kill-gate FAILED decisively: nsys --cuda-graph-trace=node, MoE
q36-35b-a3b, npp2048 M5 56.31 vs FLA 119.46 us/tok (FLA 2.12x SLOWER,
gdn_delta_pct -112.1); npp512 2.29x slower; end-to-end S_PP -13.33%/-13.12%.
GO required FLA >10% faster at npp2048.
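
The kill-gate figures above can be reproduced from the two us/tok numbers. This is a minimal sketch: the sign convention (positive means the candidate is faster) is inferred from the quoted gdn_delta_pct of -112.1, and the function names are hypothetical rather than taken from the bench harness.

```python
def slowdown_ratio(control_us, candidate_us):
    """How many times slower the candidate is than the control (>1 means slower)."""
    return candidate_us / control_us


def delta_pct(control_us, candidate_us):
    """Relative improvement of the candidate over the control, in percent.

    Positive means the candidate is faster; negative means it is slower.
    """
    return (control_us - candidate_us) / control_us * 100.0


def kill_gate_go(control_us, candidate_us, required_pct=10.0):
    """GO only if the candidate beats the control by more than required_pct."""
    return delta_pct(control_us, candidate_us) > required_pct


if __name__ == "__main__":
    m5_us, fla_us = 56.31, 119.46  # npp2048 GDN bucket, us/tok, from the message
    print(f"ratio  {slowdown_ratio(m5_us, fla_us):.2f}x")
    print(f"delta  {delta_pct(m5_us, fla_us):.1f}%")
    print(f"GO     {kill_gate_go(m5_us, fla_us)}")
```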

Novel decomposition: the blocked solve_tril is only ~2.8% of the FLA bucket
(55.6 ms); fwd_h 46.2% (903 ms) + fwd_o 31.5% (617 ms) dominate. The cost is
the state-recurrence GEMMs plus per-chunk h-state materialization to global
LPDDR5x that FLA's split-kernel structure forces; the fused M5 single kernel
keeps the 128x128 state resident in smem and never materializes per-chunk h,
so it is 2.1x faster on GB10's low-bandwidth memory. So the GDN prefill bucket
(+59.2, the single largest prefill lever) is a confirmed shared-hardware /
memory-bandwidth floor, NOT recoverable by the blocked-solve algorithm.
Extends Phase74 (standalone blocked-inverse 0.59x) and bf16-C64 (-18.75%).
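
As a consistency check on the decomposition above, each (share, milliseconds) pair should imply roughly the same total for the FLA bucket. The sketch below uses only the figures quoted in the message; the share left for the three kernels that are not itemized (cumsum, chunk_scaled_dot_kkt, recompute_w_u) is derived here by subtraction and is an inference, not a measured value.

```python
# (share of FLA bucket in percent, time in ms) as quoted in the message.
KERNELS = {
    "solve_tril": (2.8, 55.6),
    "fwd_h": (46.2, 903.0),
    "fwd_o": (31.5, 617.0),
}


def implied_total_ms(share_pct, time_ms):
    """Total bucket time implied by one kernel's share and absolute time."""
    return time_ms / (share_pct / 100.0)


def remaining_share_pct(kernels):
    """Share left over for kernels that are not itemized in the message."""
    return 100.0 - sum(share for share, _ in kernels.values())


if __name__ == "__main__":
    for name, (share, ms) in KERNELS.items():
        print(f"{name:10s} {share:5.1f}%  {ms:7.1f} ms  -> bucket ~{implied_total_ms(share, ms):.0f} ms")
    print(f"not itemized: {remaining_share_pct(KERNELS):.1f}%")
```

The three implied totals agree to within about 2%, which is in line with the rounding of the quoted shares (the solve_tril share is given as approximate).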

Gates: SMEM PASS (max 96KB < 99KB cap). KL band GREEN (FLA KLD 0.137028 vs
control 0.136563, delta +0.000465 < 0.01; same-top-p 84.61%). DEFAULT path
untouched (canonical md5 GREEN both models default-off AND FLA-on; MoE
8cb0ce23, dense 5951a5b4; GATED_DELTA_NET DEFAULT 46/46). Nothing landed;
series stays at 46 patches. WIP on DGX fork branch p5-fla-gdn (2d64c37f0),
NOT pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 21:04:37 +00:00
Ettore Di Giacinto
7b129a51f1 docs(paged): finalize P4 CBv2 record with the measured A/B verdict
The forced-report placeholders are replaced with the completed 60/60-raw A/B
from dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md: NO-GO confirmed by
measurement, and stronger than flat. CBv2 fair-share chunked prefill regresses
TTFT under staggered load (N=32 p50 +33.6%, N=128 p50 +15.5%) and regresses
aggregate/decode -6.9% beyond noise at staggered N=128. Analysis recorded:
processor-sharing delays near-uniform prompt completion by construction; the
scheduler-shaped-TTFT premise is partially refuted for GB10 (patch 0016 already
captures the schedulable win); TTFT parity routes through P3/P5 prefill compute.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 18:09:55 +00:00
Ettore Di Giacinto
865e77c4ec docs(paged): record P4 CBv2 NO-GO at the perf kill-gate
P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2,
default-off) stopped honestly at the P0 perf kill-gate. The kill-gate subset
(per-seq chunked-prefill cursors + adaptive decode bucketing, server-side only,
zero ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was
implemented and correctness-proven green (canonical md5 both models default-off
AND cbv2-on: MoE 8cb0ce23, dense 5951a5b4; test-backend-ops MUL_MAT 1146/1146,
MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46; cursor-interleave PROVEN via
LLAMA_CBV2_TRACE with decode+prefill co-batched and per-seq cursors advancing
across steps, dbucket==n_decode no-pad; determinism-NEUTRAL: CBv2 diverges from
control no more than control diverges from itself, the paged concurrent-greedy
path being inherently non-deterministic run-to-run in the baseline too).

The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and
serving-aggregate not regressed - was NOT demonstrated: the staggered/burst TTFT
A/B was force-terminated by the harness mid-run (CONTROL-only, 30/60 raws), so
the TTFT deltas are not-yet-measured placeholders, not measured neutrality. Per
the phased contract go=false was the kill-gate default: nothing built beyond P0
(no SLOT_STATE_PREEMPTED, no aging/starvation-freedom), nothing landed. This is
the scope-anticipated outcome - P4 is a GB10 TTFT/fairness/enabler lever, not a
throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is
expected and any throughput payoff is non-GB10.

Records the honest rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection)
and PARITY_HANDOFF.md chronology, including the re-score path: read the finalized
DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm
completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores
go=true and triggers the full P4 build-out. Fork localai-paged untouched at
653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on
the DGX fork at ebb649335 (base 653bb2f3d, not pushed).
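
The re-score rule above can be sketched as a small predicate. Several details are assumptions rather than the harness's actual definition: the drop is computed from mean TTFT, the stdev is taken over the control samples and expressed as a percentage of the control mean, and the md5 and aggregate checks are passed in as booleans. The sample values in the usage block are made up for illustration and are not measurements.

```python
from statistics import mean, stdev


def ttft_drop_pct(control_ms, candidate_ms):
    """Percent reduction in mean TTFT of the candidate relative to the control."""
    base = mean(control_ms)
    return (base - mean(candidate_ms)) / base * 100.0


def noise_floor_pct(control_ms):
    """max(2%, 3*stdev), with stdev expressed as a percentage of the control mean."""
    stdev_pct = stdev(control_ms) / mean(control_ms) * 100.0
    return max(2.0, 3.0 * stdev_pct)


def rescore_go(control_ms, candidate_ms, md5_green, aggregate_not_regressed,
               required_drop_pct=20.0):
    """go=true only if the drop exceeds 20%, clears the noise floor, and the gates hold."""
    drop = ttft_drop_pct(control_ms, candidate_ms)
    return (
        md5_green
        and aggregate_not_regressed
        and drop > required_drop_pct
        and drop > noise_floor_pct(control_ms)
    )


if __name__ == "__main__":
    control = [100.0, 102.0, 98.0, 101.0, 99.0]   # illustrative TTFT samples (ms)
    candidate = [75.0, 76.0, 74.0, 77.0, 73.0]    # illustrative TTFT samples (ms)
    print("drop %:", round(ttft_drop_pct(control, candidate), 2))
    print("noise floor %:", round(noise_floor_pct(control), 2))
    print("go:", rescore_go(control, candidate, True, True))
```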

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 18:03:34 +00:00
Ettore Di Giacinto
586639d016 docs(paged): record P2 MoE-region NO-GO (kill-gate flat + seam-shape gap)
P2 (expert-major fused routed-FFN region executor, LLAMA_MOE_REGION_EXECUTOR,
default-off) is recorded as NO-GO on two independent signals; nothing built
beyond the P0 kill-gate, nothing landed, fork localai-paged HEAD untouched at
653bb2f3d (LocalAI series stays at 46 patches, 0001-0055).

(1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs
grouped-MMQ control 1021.61 us = -0.05% (needed >5% faster); n=128 -0.34%;
MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never engages). All inside the
5-sample spread - reproduces the six prior one-boundary transplants
(phases 113/114/122/123/125/127). A compact expert-major layout + single sort,
both GEMMs still ragged grouped-MMQ, does not move the ragged-tile tax; that
needs P3 Marlin persistent-CTA, not a P2 layout swap.

(2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate
ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split, not the
merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern matcher
requires; the matcher, region executor, and pre-existing POC/fused-quant all
engage 0x on q36 in prefill and decode. KL delta 0.000000 is vacuous (0
engagement). Default md5 canonical both models (MoE 8cb0ce23, dense 5951a5b4);
test-backend-ops all green both arms.

Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's
separate/scaled/swiglu-split FFN shape before any MoE-region lever can engage,
then re-evaluate a fused two-GEMM region (not a layout swap). Topic branch
p2-moe-region retained on the fork for forensics at 2d87564dd (base 653bb2f3d),
not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 15:43:57 +00:00
Ettore Di Giacinto
ccf75d1dcd docs(paged): record P1 bf16-stream landing (GO)
P1 of the EXECUTION_REARCH_SCOPE additive program landed: LLAMA_BF16_STREAM
(default-off) bf16-resident residual-segment executor for the q36 MoE model's
projection boundaries.

- EXECUTION_REARCH_SCOPE.md: dated "P1 RESULT" subsection (P0 kill-gate GO,
  full build-out deltas, KL, correctness gates, honest magnitude, provenance).
- PARITY_HANDOFF.md: chronology note (verdict, engagement, prefill/KL numbers,
  fork commits, deferred-not-failed measurements).

Key reframe recorded: q36 GDN/attention projections are BF16 weights (not
NVFP4), so bf16-stream is a MoE-model prefill lever; the dense model quantizes
those projections to NVFP4 and engages nothing (stays bit-identical). Prefill
MoE @512 +1.99% (reproducible, at noise floor), KL delta -0.00052 (KL-improving),
all md5 + test-backend-ops gates green. Fork HEAD 653bb2f3d, tree 6cf1523047.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 14:34:26 +00:00
Ettore Di Giacinto
bf61db6214 docs(paged): scope vLLM-class execution re-architecture (additive program)
Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict
to a ggml-execution-architecture-conditional one: same-silicon 2-3x is
software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased
additive program (P1 bf16-native stream, P2 expert-major fused MoE region,
P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve
GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path
md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery
arithmetic grounded in the both-engine nsys buckets, and upstream-clash
analysis. Point the README docs list and PARITY_HANDOFF forward-direction at
it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 10:50:00 +00:00
Ettore Di Giacinto
1aba41082b docs(paged): record phases 112-140 + series trim decision
Record the phase 110-140 GDN/MoE campaign benchmark log and append the
series-trim decision to the parity handoff: keep the Phase135 routed-FFN
fused-quant line plus the MoE test sentinels and the MTP-draft correctness
fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort,
W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded
in the handoff and the per-phase bench artifacts. Fork re-mirrored on
51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree
097c862c).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 10:16:53 +00:00
Ettore Di Giacinto
67d2c4c9d4 docs(paged): record BF16 GDN state cache phase
Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion.

Assisted-by: Codex:gpt-5
2026-07-01 16:26:09 +00:00
Ettore Di Giacinto
d091eb30f2 docs(paged): record GDN identity shortcut phase
Assisted-by: Codex:gpt-5
2026-07-01 15:46:41 +00:00
Ettore Di Giacinto
bbfaa66f02 docs(paged): record GDN BV32 decode A/B phase
Assisted-by: Codex:gpt-5
2026-07-01 15:35:06 +00:00
Ettore Di Giacinto
04ed7fe52f docs(paged): record GDN launch sweep phase
Assisted-by: Codex:gpt-5
2026-07-01 15:13:45 +00:00
Ettore Di Giacinto
a9454b45c8 docs(paged): record MoE decode-only profile phase
Assisted-by: Codex:gpt-5
2026-07-01 15:05:45 +00:00
Ettore Di Giacinto
f21b393746 docs(paged): record current MoE graph profile phase
Assisted-by: Codex:gpt-5
2026-07-01 14:56:39 +00:00
Ettore Di Giacinto
26a41fad1a docs(paged): record post-PoC GDN audit phase
Assisted-by: Codex:gpt-5
2026-07-01 14:44:17 +00:00
Ettore Di Giacinto
5369219729 docs(paged): record GDN blocked-solve PoC phase
Assisted-by: Codex:gpt-5
2026-07-01 14:39:09 +00:00
Ettore Di Giacinto
eb82ff138f docs(paged): record datacenter Blackwell readiness phase
Assisted-by: Codex:gpt-5
2026-07-01 14:28:41 +00:00
Ettore Di Giacinto
2efb0ec362 docs(paged): record TTFT min32 serving phase
Assisted-by: Codex:gpt-5
2026-07-01 14:18:54 +00:00
Ettore Di Giacinto
e5c5746c0a docs(paged): record GDN tensor-core revalidation phase
Assisted-by: Codex:gpt-5
2026-07-01 14:05:20 +00:00
Ettore Di Giacinto
6cf8b782d1 docs(paged): record BF16 F32 output broader serving phase
Assisted-by: Codex:gpt-5
2026-07-01 13:26:50 +00:00
Ettore Di Giacinto
e573194799 docs(paged): record patch mirror readiness phase
Assisted-by: Codex:gpt-5
2026-07-01 13:11:57 +00:00
Ettore Di Giacinto
2b2b1f0b25 docs(paged): record BF16 F32 output dense serving phase
Assisted-by: Codex:gpt-5
2026-07-01 13:06:49 +00:00
Ettore Di Giacinto
e67b329eb1 docs(paged): record BF16 cuBLAS F32 output phase
Assisted-by: Codex:gpt-5
2026-07-01 12:54:24 +00:00
Ettore Di Giacinto
60954d484a docs(paged): record quant kernel timing phase
Assisted-by: Codex:gpt-5
2026-07-01 12:45:19 +00:00
Ettore Di Giacinto
3fbdfc21c9 docs(paged): record quant trace phase
Assisted-by: Codex:gpt-5
2026-07-01 12:42:13 +00:00
Ettore Di Giacinto
55df9100dc docs(paged): record layout trace phase
Assisted-by: Codex:gpt-5
2026-07-01 12:32:05 +00:00
Ettore Di Giacinto
2e19e5c90f docs(paged): record prefill bucket attribution phase
Assisted-by: Codex:gpt-5
2026-07-01 12:20:42 +00:00
Ettore Di Giacinto
6a2618b6dc docs(paged): record MTP verify-cost rejection
Assisted-by: Codex:gpt-5
2026-07-01 11:51:29 +00:00
Ettore Di Giacinto
f7d76389b0 docs(paged): record W4A16 direct activation rejection
Assisted-by: Codex:gpt-5
2026-07-01 11:28:11 +00:00
Ettore Di Giacinto
ef578866c8 docs(paged): scope W4A16 direct activation experiment
Assisted-by: Codex:gpt-5
2026-07-01 10:59:56 +00:00
Ettore Di Giacinto
fc5d5e4ff3 docs(paged): profile current W4A16 prefill
Assisted-by: Codex:gpt-5
2026-07-01 10:56:48 +00:00
Ettore Di Giacinto
ef7dbfa5f7 docs(paged): compare MoE min32 against vLLM
Assisted-by: Codex:gpt-5
2026-07-01 10:46:32 +00:00
Ettore Di Giacinto
c41d1a5b4f docs(paged): record waiting-threshold TTFT defer
Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision.

Assisted-by: Codex:gpt-5
2026-07-01 10:31:09 +00:00
Ettore Di Giacinto
9be291e6b0 docs(paged): reject capped TTFT defer sweep
Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.

Assisted-by: Codex:gpt-5
2026-07-01 10:18:41 +00:00
Ettore Di Giacinto
902bcc7717 docs(paged): validate TTFT prefill-first A/B
Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision.

Assisted-by: Codex:gpt-5
2026-07-01 10:05:23 +00:00
Ettore Di Giacinto
999cf09532 docs(paged): record TTFT prefill-first A/B
Record Phase55 default-off scheduler A/B, DGX md5/op gates, dense serving results, and the pending fork push/mirror status.

Assisted-by: Codex:gpt-5
2026-07-01 09:57:55 +00:00
Ettore Di Giacinto
3dbf34e739 docs(paged): record admission histogram trace
Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision.

Assisted-by: Codex:gpt-5
2026-07-01 09:40:50 +00:00
Ettore Di Giacinto
347a5c05bd docs(paged): reject admission budget sweep
Assisted-by: Codex:gpt-5
2026-07-01 09:27:20 +00:00
Ettore Di Giacinto
2aa76702df docs(paged): record dense admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:18:43 +00:00
Ettore Di Giacinto
b5f65152e2 docs(paged): record serving admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:08:42 +00:00
Ettore Di Giacinto
c299dcd231 docs(paged): record dense true decode profile
Assisted-by: Codex:gpt-5
2026-07-01 08:55:23 +00:00
Ettore Di Giacinto
cd59e5d61f fix(paged): scrub harness vars for vllm serve
Assisted-by: Codex:gpt-5
2026-07-01 08:23:05 +00:00
Ettore Di Giacinto
96825a224e docs(paged): record dense serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 08:20:26 +00:00
Ettore Di Giacinto
440129c98e fix(paged): harden serving snapshot readiness
Assisted-by: Codex:gpt-5
2026-07-01 08:07:48 +00:00
Ettore Di Giacinto
e69ee0e867 feat(paged): parameterize served model name
Assisted-by: Codex:gpt-5
2026-07-01 07:50:19 +00:00
Ettore Di Giacinto
2a0fc0f4b9 docs(paged): record inference gate guard
Assisted-by: Codex:gpt-5
2026-07-01 07:45:52 +00:00
Ettore Di Giacinto
ae8284f5fb feat(paged): parameterize vllm serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 07:41:55 +00:00
Ettore Di Giacinto
ecaf406c0b docs(paged): reject persistent gate fusion shortcut
Assisted-by: Codex:gpt-5
2026-07-01 07:34:27 +00:00
Ettore Di Giacinto
b9eff5bca3 docs(paged): reconcile next parity target
Assisted-by: Codex:gpt-5
2026-07-01 07:31:26 +00:00
Ettore Di Giacinto
aa848d5afb docs(paged): record low-concurrency serving check
Assisted-by: Codex:gpt-5
2026-07-01 07:24:28 +00:00