Commit Graph

109 Commits

Author SHA1 Message Date
Ettore Di Giacinto
6cf8b782d1 docs(paged): record BF16 F32 output broader serving phase
Assisted-by: Codex:gpt-5
2026-07-01 13:26:50 +00:00
Ettore Di Giacinto
e573194799 docs(paged): record patch mirror readiness phase
Assisted-by: Codex:gpt-5
2026-07-01 13:11:57 +00:00
Ettore Di Giacinto
2b2b1f0b25 docs(paged): record BF16 F32 output dense serving phase
Assisted-by: Codex:gpt-5
2026-07-01 13:06:49 +00:00
Ettore Di Giacinto
e67b329eb1 docs(paged): record BF16 cuBLAS F32 output phase
Assisted-by: Codex:gpt-5
2026-07-01 12:54:24 +00:00
Ettore Di Giacinto
60954d484a docs(paged): record quant kernel timing phase
Assisted-by: Codex:gpt-5
2026-07-01 12:45:19 +00:00
Ettore Di Giacinto
3fbdfc21c9 docs(paged): record quant trace phase
Assisted-by: Codex:gpt-5
2026-07-01 12:42:13 +00:00
Ettore Di Giacinto
55df9100dc docs(paged): record layout trace phase
Assisted-by: Codex:gpt-5
2026-07-01 12:32:05 +00:00
Ettore Di Giacinto
2e19e5c90f docs(paged): record prefill bucket attribution phase
Assisted-by: Codex:gpt-5
2026-07-01 12:20:42 +00:00
Ettore Di Giacinto
6a2618b6dc docs(paged): record MTP verify-cost rejection
Assisted-by: Codex:gpt-5
2026-07-01 11:51:29 +00:00
Ettore Di Giacinto
f7d76389b0 docs(paged): record W4A16 direct activation rejection
Assisted-by: Codex:gpt-5
2026-07-01 11:28:11 +00:00
Ettore Di Giacinto
ef578866c8 docs(paged): scope W4A16 direct activation experiment
Assisted-by: Codex:gpt-5
2026-07-01 10:59:56 +00:00
Ettore Di Giacinto
fc5d5e4ff3 docs(paged): profile current W4A16 prefill
Assisted-by: Codex:gpt-5
2026-07-01 10:56:48 +00:00
Ettore Di Giacinto
ef7dbfa5f7 docs(paged): compare MoE min32 against vLLM
Assisted-by: Codex:gpt-5
2026-07-01 10:46:32 +00:00
Ettore Di Giacinto
c41d1a5b4f docs(paged): record waiting-threshold TTFT defer
Record Phase 58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision.

Assisted-by: Codex:gpt-5
2026-07-01 10:31:09 +00:00
Ettore Di Giacinto
9be291e6b0 docs(paged): reject capped TTFT defer sweep
Record Phase 57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path.

Assisted-by: Codex:gpt-5
2026-07-01 10:18:41 +00:00
Ettore Di Giacinto
902bcc7717 docs(paged): validate TTFT prefill-first A/B
Record Phase 56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision.

Assisted-by: Codex:gpt-5
2026-07-01 10:05:23 +00:00
Ettore Di Giacinto
999cf09532 docs(paged): record TTFT prefill-first A/B
Record Phase 55 default-off scheduler A/B, DGX md5/op gates, dense serving results, and the pending fork push/mirror status.

Assisted-by: Codex:gpt-5
2026-07-01 09:57:55 +00:00
Ettore Di Giacinto
3dbf34e739 docs(paged): record admission histogram trace
Record Phase 54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision.

Assisted-by: Codex:gpt-5
2026-07-01 09:40:50 +00:00
Ettore Di Giacinto
347a5c05bd docs(paged): reject admission budget sweep
Assisted-by: Codex:gpt-5
2026-07-01 09:27:20 +00:00
Ettore Di Giacinto
2aa76702df docs(paged): record dense admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:18:43 +00:00
Ettore Di Giacinto
b5f65152e2 docs(paged): record serving admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:08:42 +00:00
Ettore Di Giacinto
c299dcd231 docs(paged): record dense true decode profile
Assisted-by: Codex:gpt-5
2026-07-01 08:55:23 +00:00
Ettore Di Giacinto
cd59e5d61f fix(paged): scrub harness vars for vllm serve
Assisted-by: Codex:gpt-5
2026-07-01 08:23:05 +00:00
Ettore Di Giacinto
96825a224e docs(paged): record dense serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 08:20:26 +00:00
Ettore Di Giacinto
440129c98e fix(paged): harden serving snapshot readiness
Assisted-by: Codex:gpt-5
2026-07-01 08:07:48 +00:00
Ettore Di Giacinto
e69ee0e867 feat(paged): parameterize served model name
Assisted-by: Codex:gpt-5
2026-07-01 07:50:19 +00:00
Ettore Di Giacinto
2a0fc0f4b9 docs(paged): record inference gate guard
Assisted-by: Codex:gpt-5
2026-07-01 07:45:52 +00:00
Ettore Di Giacinto
ae8284f5fb feat(paged): parameterize vllm serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 07:41:55 +00:00
Ettore Di Giacinto
ecaf406c0b docs(paged): reject persistent gate fusion shortcut
Assisted-by: Codex:gpt-5
2026-07-01 07:34:27 +00:00
Ettore Di Giacinto
b9eff5bca3 docs(paged): reconcile next parity target
Assisted-by: Codex:gpt-5
2026-07-01 07:31:26 +00:00
Ettore Di Giacinto
aa848d5afb docs(paged): record low-concurrency serving check
Assisted-by: Codex:gpt-5
2026-07-01 07:24:28 +00:00
Ettore Di Giacinto
d44e164c96 docs(paged): record max-concurrency parity check
Assisted-by: Codex:gpt-5
2026-07-01 07:13:48 +00:00
Ettore Di Giacinto
52c11b1ce5 docs(paged): reject graph-time gate fusion shortcut
Assisted-by: Codex:gpt-5
2026-07-01 06:56:01 +00:00
Ettore Di Giacinto
5354adcffb docs(paged): scope gate projection policy
Assisted-by: Codex:gpt-5
2026-07-01 06:50:19 +00:00
Ettore Di Giacinto
9f75da01f9 feat(paged): add cublas tensor-name trace patch
Add patch 0063 extending LLAMA_CUBLAS_ROUTE_TRACE with src0/src1/dst tensor names.

Record Phase 37 gates and the conclusion that SGEMM traces to MoE gate tensors.

Assisted-by: Codex:gpt-5
2026-07-01 06:41:00 +00:00
Ettore Di Giacinto
fbdc200886 feat(paged): add cublas route trace patch
Add patch 0062 with default-off LLAMA_CUBLAS_ROUTE_TRACE instrumentation for generic cuBLAS MUL_MAT subroutes.

Record Phase 36 DGX gates, serving trace results, and the next projection follow-up scope.

Assisted-by: Codex:gpt-5
2026-07-01 06:24:46 +00:00
Ettore Di Giacinto
49cce0b5a2 feat(paged): add mul mat route trace patch
Add LocalAI patch 0061 from the llama.cpp fork and record Phase 35 gates, serving route counts, and the updated patch mirror invariant.

Assisted-by: Codex:gpt-5
2026-07-01 05:52:09 +00:00
Ettore Di Giacinto
ba1979a689 feat(paged): add moe mmid route trace patch
Add LocalAI patch 0060 from the llama.cpp fork and record Phase 34 gates, serving route counts, and the updated patch mirror invariant.

Assisted-by: Codex:gpt-5
2026-07-01 05:37:53 +00:00
Ettore Di Giacinto
7665422bfa feat(paged): add moe small-m mmq tile policy gate
Assisted-by: Codex:gpt-5
2026-07-01 05:20:18 +00:00
Ettore Di Giacinto
70a4c31f36 feat(paged): add moe small-m mmq candidate trace
Assisted-by: Codex:gpt-5
2026-07-01 05:08:31 +00:00
Ettore Di Giacinto
e189e5a4ca feat(paged): add moe mmq launch trace patch
Assisted-by: Codex:gpt-5
2026-07-01 04:54:33 +00:00
Ettore Di Giacinto
b28b448c68 docs(paged): record mmq shape serving profile
Assisted-by: Codex:gpt-5
2026-07-01 04:36:04 +00:00
Ettore Di Giacinto
2148fa466b feat(paged): add moe mmq shape trace patch
Assisted-by: Codex:gpt-5
2026-07-01 04:32:12 +00:00
Ettore Di Giacinto
3b9ec3e1f1 docs(paged): record mmq occupancy rejection
Assisted-by: Codex:gpt-5
2026-07-01 04:18:12 +00:00
Ettore Di Giacinto
3c2cb9f4ab docs(paged): record graph-node serving profile
Record the Phase 27 current-stack llama.cpp n128 serving profile captured with CUDA graph node tracing and gated before and after the run.

Assisted-by: Codex:gpt-5
2026-07-01 04:00:14 +00:00
Ettore Di Giacinto
ace1ffab28 docs(paged): record audited current snapshot
Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 03:48:27 +00:00
Ettore Di Giacinto
a0194125f5 chore(paged): summarize snapshot inference gates
Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot.

Assisted-by: Codex:gpt-5
2026-07-01 03:35:54 +00:00
Ettore Di Giacinto
7108b68a70 chore(paged): record snapshot hardware class
Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons.

Assisted-by: Codex:gpt-5
2026-07-01 03:31:11 +00:00
Ettore Di Giacinto
7aa15ce539 docs(paged): refresh parity handoff coordinates
Update the paged parity handoff to the current fork head, patch count, mirror invariant, current serving harness, and LocalAI AI-attribution policy after Phases 20-22.

Assisted-by: Codex:gpt-5
2026-07-01 03:25:14 +00:00
Ettore Di Giacinto
6c165747a9 docs(paged): verify patch-series mirror invariant
Record the Phase 22 strict git-apply mirror check proving the LocalAI paged patch series reconstructs the canonical llama.cpp fork tree after patch 0055.

Assisted-by: Codex:gpt-5
2026-07-01 03:22:43 +00:00