Commit Graph

7146 Commits

Author SHA1 Message Date
Ettore Di Giacinto
2aa76702df docs(paged): record dense admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:18:43 +00:00
Ettore Di Giacinto
b5f65152e2 docs(paged): record serving admission trace
Assisted-by: Codex:gpt-5
2026-07-01 09:08:42 +00:00
Ettore Di Giacinto
c299dcd231 docs(paged): record dense true decode profile
Assisted-by: Codex:gpt-5
2026-07-01 08:55:23 +00:00
Ettore Di Giacinto
cd59e5d61f fix(paged): scrub harness vars for vllm serve
Assisted-by: Codex:gpt-5
2026-07-01 08:23:05 +00:00
Ettore Di Giacinto
96825a224e docs(paged): record dense serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 08:20:26 +00:00
Ettore Di Giacinto
440129c98e fix(paged): harden serving snapshot readiness
Assisted-by: Codex:gpt-5
2026-07-01 08:07:48 +00:00
Ettore Di Giacinto
e69ee0e867 feat(paged): parameterize served model name
Assisted-by: Codex:gpt-5
2026-07-01 07:50:19 +00:00
Ettore Di Giacinto
2a0fc0f4b9 docs(paged): record inference gate guard
Assisted-by: Codex:gpt-5
2026-07-01 07:45:52 +00:00
Ettore Di Giacinto
ae8284f5fb feat(paged): parameterize vllm serving snapshot
Assisted-by: Codex:gpt-5
2026-07-01 07:41:55 +00:00
Ettore Di Giacinto
ecaf406c0b docs(paged): reject persistent gate fusion shortcut
Assisted-by: Codex:gpt-5
2026-07-01 07:34:27 +00:00
Ettore Di Giacinto
b9eff5bca3 docs(paged): reconcile next parity target
Assisted-by: Codex:gpt-5
2026-07-01 07:31:26 +00:00
Ettore Di Giacinto
aa848d5afb docs(paged): record low-concurrency serving check
Assisted-by: Codex:gpt-5
2026-07-01 07:24:28 +00:00
Ettore Di Giacinto
d44e164c96 docs(paged): record max-concurrency parity check
Assisted-by: Codex:gpt-5
2026-07-01 07:13:48 +00:00
Ettore Di Giacinto
52c11b1ce5 docs(paged): reject graph-time gate fusion shortcut
Assisted-by: Codex:gpt-5
2026-07-01 06:56:01 +00:00
Ettore Di Giacinto
5354adcffb docs(paged): scope gate projection policy
Assisted-by: Codex:gpt-5
2026-07-01 06:50:19 +00:00
Ettore Di Giacinto
9f75da01f9 feat(paged): add cublas tensor-name trace patch
Add patch 0063 extending LLAMA_CUBLAS_ROUTE_TRACE with src0/src1/dst tensor names.

Record Phase 37 gates and the conclusion that SGEMM traces to MoE gate tensors.

Assisted-by: Codex:gpt-5
2026-07-01 06:41:00 +00:00
Ettore Di Giacinto
fbdc200886 feat(paged): add cublas route trace patch
Add patch 0062 with default-off LLAMA_CUBLAS_ROUTE_TRACE instrumentation for generic cuBLAS MUL_MAT subroutes.

Record Phase 36 DGX gates, serving trace results, and the next projection follow-up scope.

Assisted-by: Codex:gpt-5
2026-07-01 06:24:46 +00:00
Ettore Di Giacinto
49cce0b5a2 feat(paged): add mul mat route trace patch
Add LocalAI patch 0061 from the llama.cpp fork and record Phase 35 gates, serving route counts, and the updated patch mirror invariant.

Assisted-by: Codex:gpt-5
2026-07-01 05:52:09 +00:00
Ettore Di Giacinto
ba1979a689 feat(paged): add moe mmid route trace patch
Add LocalAI patch 0060 from the llama.cpp fork and record Phase 34 gates, serving route counts, and the updated patch mirror invariant.

Assisted-by: Codex:gpt-5
2026-07-01 05:37:53 +00:00
Ettore Di Giacinto
7665422bfa feat(paged): add moe small-m mmq tile policy gate
Assisted-by: Codex:gpt-5
2026-07-01 05:20:18 +00:00
Ettore Di Giacinto
70a4c31f36 feat(paged): add moe small-m mmq candidate trace
Assisted-by: Codex:gpt-5
2026-07-01 05:08:31 +00:00
Ettore Di Giacinto
e189e5a4ca feat(paged): add moe mmq launch trace patch
Assisted-by: Codex:gpt-5
2026-07-01 04:54:33 +00:00
Ettore Di Giacinto
b28b448c68 docs(paged): record mmq shape serving profile
Assisted-by: Codex:gpt-5
2026-07-01 04:36:04 +00:00
Ettore Di Giacinto
2148fa466b feat(paged): add moe mmq shape trace patch
Assisted-by: Codex:gpt-5
2026-07-01 04:32:12 +00:00
Ettore Di Giacinto
3b9ec3e1f1 docs(paged): record mmq occupancy rejection
Assisted-by: Codex:gpt-5
2026-07-01 04:18:12 +00:00
Ettore Di Giacinto
3c2cb9f4ab docs(paged): record graph-node serving profile
Record the Phase 27 current-stack llama.cpp n128 serving profile captured with CUDA graph node tracing and gated before and after the run.

Assisted-by: Codex:gpt-5
2026-07-01 04:00:14 +00:00
Ettore Di Giacinto
ace1ffab28 docs(paged): record audited current snapshot
Record the Phase 26 current-stack paged-vs-vLLM serving snapshot with hardware classification and compact pre/post inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 03:48:27 +00:00
Ettore Di Giacinto
a0194125f5 chore(paged): summarize snapshot inference gates
Emit a compact gate_summary.tsv from current serving snapshots so each artifact records the pre/post MoE md5, dense md5, and backend op checks. Add a summary-only mode for auditing existing artifacts and document the Phase 25 backfill on the Phase 20 snapshot.

Assisted-by: Codex:gpt-5
2026-07-01 03:35:54 +00:00
Ettore Di Giacinto
7108b68a70 chore(paged): record snapshot hardware class
Add a hardware report to the current serving snapshot harness so every paged-vs-vLLM artifact records the GPU identity and conservative hardware class before any server starts. Document the Phase 24 dry run and the GB10 classification for future parity comparisons.

Assisted-by: Codex:gpt-5
2026-07-01 03:31:11 +00:00
Ettore Di Giacinto
7aa15ce539 docs(paged): refresh parity handoff coordinates
Update the paged parity handoff to the current fork head, patch count, mirror invariant, current serving harness, and LocalAI AI-attribution policy after Phases 20-22.

Assisted-by: Codex:gpt-5
2026-07-01 03:25:14 +00:00
Ettore Di Giacinto
6c165747a9 docs(paged): verify patch-series mirror invariant
Record the Phase 22 strict git-apply mirror check proving the LocalAI paged patch series reconstructs the canonical llama.cpp fork tree after patch 0055.

Assisted-by: Codex:gpt-5
2026-07-01 03:22:43 +00:00
Ettore Di Giacinto
ff3f0620de chore(paged): add current serving snapshot harness
Add a reusable current-stack paged-vs-vLLM serving snapshot harness that targets the clean DGX mirror, enforces idle/lock preflight, runs pre/post inference gates, and records ratio summaries.

Assisted-by: Codex:gpt-5
2026-07-01 03:19:36 +00:00
Ettore Di Giacinto
c99678da42 docs(paged): refresh current serving snapshot
Record the Phase 20 same-session MoE paged-vs-vLLM serving snapshot on the current clean DGX mirror, including pre/post inference gates and the resulting GB10 parity decision.

Assisted-by: Codex:gpt-5
2026-07-01 03:15:30 +00:00
Ettore Di Giacinto
310eb3c866 docs(paged): reject MTP draft-shape scheduler
Record the Phase 19 trace-only serving entropy run and close the group/defer-by-draft follow-up based on measured shape distribution, throughput regression, and green inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 03:03:49 +00:00
Ettore Di Giacinto
cced07c7fe docs(paged): add MTP shape trace patch
Add the next incremental llama.cpp patch for default-off speculative batch-shape tracing and record the Phase 18 red/green and inference-gate results.

Assisted-by: Codex:gpt-5
2026-07-01 02:54:29 +00:00
Ettore Di Giacinto
6e35476340 docs(paged): scope MTP graph-shape follow-up
Record Phase 17 source inspection: MTP verification rows change hard graph dimensions, padding is not a safe shortcut, and any future work should start with shape instrumentation before an opt-in scheduler experiment.

Assisted-by: Codex:gpt-5
2026-07-01 02:37:21 +00:00
Ettore Di Giacinto
ae76d42a96 docs(paged): profile MTP graph reuse loss
Record Phase 16 nsys evidence that current MTP serving loses paged decode graph reuse and increases GPU work, explaining the Phase 15 serving regression.

Assisted-by: Codex:gpt-5
2026-07-01 02:32:49 +00:00
Ettore Di Giacinto
4d171e62bb docs(paged): reject MTP serving lever
Add the repeatable MTP serving A/B runner and record Phase 15 results showing current llama-server MTP regresses GB10 serving throughput despite passing inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:29:28 +00:00
Ettore Di Giacinto
70394364a3 docs(paged): gate MTP rollback safety
Record Phase 14 MTP rollback evidence, normalized greedy-prefix checks, and canonical inference gates.

Assisted-by: Codex:gpt-5
2026-07-01 02:15:11 +00:00
Ettore Di Giacinto
e169058e73 chore(paged): add DGX inference gate runner
Add a reusable paged llama.cpp gate script for DGX work. It checks docker/local-ai-worker/GPU lock state, runs the canonical MoE and dense transcript md5 gates, and runs selected test-backend-ops filters.

Verified on dgx.casa: MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439, MUL_MAT_ID 806/806. Artifact: /home/mudler/bench/paged_inference_gates/20260701_040048.

Assisted-by: Codex:gpt-5
2026-07-01 02:01:55 +00:00
Ettore Di Giacinto
ede23df333 docs(paged): close W4A16 pad checklist
Mark the rejected-branch disposition as not taken because Phase 4 was kept as patch 0050 with recorded md5, op, perf, and mirror gates.

Assisted-by: Codex:gpt-5
2026-07-01 01:58:22 +00:00
Ettore Di Giacinto
abc70c209e docs(paged): close ragged MoE dispatch shortcut
Record the Phase 8 safety rerun, canonical transcript md5 gates, full and ragged MUL_MAT_ID op gates, and the no-production-patch decision for metadata-only fused dispatch work.

Assisted-by: Codex:gpt-5
2026-07-01 01:57:45 +00:00
Ettore Di Giacinto
2074b4fb5b docs(paged): reject GDN global Ai32 prototype
Record the default-off Global-Ai32 implementation, exact md5 gates, GB10 A/B regression, rejected diff artifact, and the resulting stop decision for GDN kernel work on GB10.

Assisted-by: Codex:gpt-5
2026-07-01 01:51:53 +00:00
Ettore Di Giacinto
adabd11919 docs(paged): scope GDN global Ai32 prototype
Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.

Assisted-by: Codex:gpt-5
2026-07-01 01:38:51 +00:00
Ettore Di Giacinto
1b5ae227eb docs(paged): reject GDN M5 QS-early phase
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.

Assisted-by: Codex:gpt-5
2026-07-01 01:29:44 +00:00
Ettore Di Giacinto
24e778de47 docs(paged): scope GDN M5 state-boundary phase
Add the Phase 11 design and implementation plan for a default-off C16 M5 QS-early GDN experiment after rejecting C32 slabs.

Assisted-by: Codex:gpt-5
2026-07-01 01:21:05 +00:00
Ettore Di Giacinto
3da3b169fb docs(paged): reject GDN C32 slab phase
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.

Assisted-by: Codex:gpt-5
2026-07-01 01:15:00 +00:00
Ettore Di Giacinto
ff3ad84191 docs(paged): record GDN C32 slab baseline
Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut.

Assisted-by: Codex:gpt-5
2026-07-01 00:58:54 +00:00
Ettore Di Giacinto
9bbe02c161 fix(paged): gate MTP backend sampling
Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase.

Assisted-by: Codex:gpt-5
2026-07-01 00:54:25 +00:00
Ettore Di Giacinto
b862e2c568 docs(paged): stop ragged dispatch source shortcut
Assisted-by: Codex:gpt-5
2026-07-01 00:42:36 +00:00