LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 12:57:02 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	a1a3b99960	docs(paged): record P3 W4A16 direct-A NO-GO + write program-level prefill conclusion P3 (the last big prefill lever) is a decisive NO-GO. The direct-A W4A16 Marlin path was re-created per the section-3 contract, engaged behind LLAMA_W4A16_DIRECT_A, and A/B'd against the FP4-MMQ default: -46.9/-48.0/-49.1% at M=512/1024/2048 (MoE q36-35b-a3b, 3-iter medians). The forensics retry is REFUTED - the integration tax it blamed was genuinely removed (act-quant 18.92 -> ~0 us/tok; host expert-sort + src1-gather + separate cast eliminated) and direct-A still lost. nsys graph-node decomposition: the mature bf16 grouped-W4A16 GEMM = 323.90 us/tok = 1.97x the FP4-MMQ int8 GEMM (164.6) = exactly bf16 = half int8/FP4 tensor-core peak on sm_121. Bucket 2 (GEMM tiling, +56.5) is now a CONFIRMED FP4-MMQ-optimal floor on GB10, joining bucket 1 (GDN scan, P5-confirmed). Novel sub-finding: fusing the A-gather in-kernel is a NET pessimization vs a separate bf16 pre-cast (+128 > ~63 tax removed), a GB10-specific inversion of the no-round-trips heuristic. KL in-band and better than control (KLD 0.130260 / same-top-p 85.172%); default md5s green both models; engagement proven (7680 env-on, 0 default). Nothing built beyond P0, nothing landed; fork localai-paged HEAD untouched at 653bb2f3d, series stays 46 patches; topic branch p3-w4a16-direct retained on the DGX fork at 8eef7ba43 (NOT pushed). Because P3 is the last major lever, this also writes the program-level conclusion into EXECUTION_REARCH_SCOPE.md section 4a (dated) and corrects the pre-execution projection to measured reality: six phases gated, exactly one landed (P1 +2% MoE prefill, bucket-3 projection boundary); P2/P3/P4/P5 rejected, P6 blocked-on-infra. 
Prefill closes to ~50-51% of vLLM (not ~55-65%), serving-agg stays ~60.7% (not ~80%), decode-GPU-steady stays ~86% (not ~95%), TTFT stays ~3.4x - because the two largest prefill buckets (1+2 = +115.7 of the 198.9 gap) are confirmed silicon/bandwidth floors that lift only on datacenter Blackwell. This confirms and strengthens the standing conclusion that GB10 throughput-parity is unreachable by exhaustion; the paged fork's precision parity + memory advantage stand. Default path untouched; canonical md5s green. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 22:31:14 +00:00
Ettore Di Giacinto	ac2b0211ff	docs(paged): record P6 fp8-KV BLOCKED-ON-INFRA + the analytical decode ceiling P6 (final program phase) could not run its kill-gate: the DGX/GB10 was unreachable for the entire window (cloudflared access via prem-vm returned HTTP 530 / websocket bad-handshake on every probe; re-confirmed with 5 fresh probes). Stage 0a (measured nsys graph-node decode ceiling) and Stage 0b (fp8-e4m3 kernel + kill-gate A/B) were physically impossible with no GPU. Records the honest infra-block (NOT a measured NO-GO, NOT a NO-GO-by-ceiling) plus the load-bearing artifact: the analytical fp8-KV decode ceiling table. fp8 halves KV bytes -> theoretical-max decode saving = 0.5 x flash-attn share: ctx256 0.65% (standard shape hard NO-GO), ctx1024 2.55%, ctx2048 4.98% (first crosses +3%), ctx4096 9.49%, ctx8192 17.34%. The win, if realizable, lives only at ctx>=2048; the hybrid-GDN structure (10/40 layers carry KV, 30 GDN layers hold fixed-size recurrent state with no KV) caps what any KV-dtype lever can save. The dominant null stands unrefuted: Q8_0 KV was a measured +7.8% decode regression on GB10. Notes the capacity-play framing (fp8-KV as a memory feature remains open even if throughput-flat). Fork localai-paged untouched at 653bb2f3d; series stays at 46 patches (0001-0055); P3's p3-w4a16-direct work undisturbed. Docs-only; no code, no topic branch, no patches. Not pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 21:43:36 +00:00
Ettore Di Giacinto	5b8b33a302	docs(paged): record P5 FLA GDN NO-GO - GDN prefill bucket is a confirmed shared-hardware floor P5 ported the full six-kernel vLLM-FLA chunk_gated_delta_rule_fwd pipeline (cumsum, chunk_scaled_dot_kkt, blocked merge_16x16_to_64x64 solve_tril, recompute_w_u, register/smem-resident chunk_gated_delta_rule_fwd_h with the chunk loop in-kernel, chunk_fwd_o) to CUDA tf32 mma, per-kernel validated vs host fp64 (o NMSE 2.2e-7, final-state 1.2e-7), integrated behind LLAMA_GDN_FLA_CHUNK=1 (default-off), and A/B'd in-backend vs the shipped M5 chunked scan. P0 perf kill-gate FAILED decisively: nsys --cuda-graph-trace=node, MoE q36-35b-a3b, npp2048 M5 56.31 vs FLA 119.46 us/tok (FLA 2.12x SLOWER, gdn_delta_pct -112.1); npp512 2.29x slower; end-to-end S_PP -13.33%/-13.12%. GO required FLA >10% faster at npp2048. Novel decomposition: the blocked solve_tril is only ~2.8% of the FLA bucket (55.6 ms); fwd_h 46.2% (903 ms) + fwd_o 31.5% (617 ms) dominate. The cost is the state-recurrence GEMMs plus per-chunk h-state materialization to global LPDDR5x that FLA's split-kernel structure forces; the fused M5 single kernel keeps the 128x128 state resident in smem and never materializes per-chunk h, so it is 2.1x faster on GB10's low-bandwidth memory. So the GDN prefill bucket (+59.2, the single largest prefill lever) is a confirmed shared-hardware / memory-bandwidth floor, NOT recoverable by the blocked-solve algorithm. Extends Phase74 (standalone blocked-inverse 0.59x) and bf16-C64 (-18.75%). Gates: SMEM PASS (max 96KB < 99KB cap). KL band GREEN (FLA KLD 0.137028 vs control 0.136563, delta +0.000465 < 0.01; same-top-p 84.61%). DEFAULT path untouched (canonical md5 GREEN both models default-off AND FLA-on; MoE 8cb0ce23, dense 5951a5b4; GATED_DELTA_NET DEFAULT 46/46). Nothing landed; series stays at 46 patches. WIP on DGX fork branch p5-fla-gdn (2d64c37f0), NOT pushed. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 21:04:37 +00:00
Ettore Di Giacinto	7b129a51f1	docs(paged): finalize P4 CBv2 record with the measured A/B verdict The forced-report placeholders are replaced with the completed 60/60-raw A/B from dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md: NO-GO confirmed by measurement, and stronger than flat. CBv2 fair-share chunked prefill regresses TTFT under staggered load (N=32 p50 +33.6%, N=128 p50 +15.5%) and regresses aggregate/decode -6.9% beyond noise at staggered N=128. Analysis recorded: processor-sharing delays near-uniform prompt completion by construction; the scheduler-shaped-TTFT premise is partially refuted for GB10 (patch 0016 already captures the schedulable win); TTFT parity routes through P3/P5 prefill compute. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 18:09:55 +00:00
Ettore Di Giacinto	865e77c4ec	docs(paged): record P4 CBv2 NO-GO at the perf kill-gate P4 (token-granular continuous-batching scheduler, LLAMA_CONTINUOUS_BATCH_V2, default-off) stopped honestly at the P0 perf kill-gate. The kill-gate subset (per-seq chunked-prefill cursors + adaptive decode bucketing, server-side only, zero ggml/ files, ~68 LOC + a new unit-tested server-admission-policy.h) was implemented and correctness-proven green (canonical md5 both models default-off AND cbv2-on: MoE 8cb0ce23, dense 5951a5b4; test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46; cursor-interleave PROVEN via LLAMA_CBV2_TRACE with decode+prefill co-batched and per-seq cursors advancing across steps, dbucket==n_decode no-pad; determinism-NEUTRAL: CBv2 diverges from control no more than control diverges from itself, the paged concurrent-greedy path being inherently non-deterministic run-to-run in the baseline too). The kill-gate GO criterion - a >20% TTFT-under-load drop with md5 green and serving-aggregate not regressed - was NOT demonstrated: the staggered/burst TTFT A/B was force-terminated by the harness mid-run (CONTROL-only, 30/60 raws), so the TTFT deltas are not-yet-measured placeholders, not measured neutrality. Per the phased contract go=false was the kill-gate default: nothing built beyond P0 (no SLOT_STATE_PREEMPTED, no aging/starvation-freedom), nothing landed. This is the scope-anticipated outcome - P4 is a GB10 TTFT/fairness/enabler lever, not a throughput lever (decode is GPU-compute-bound), so a NO-GO on the TTFT gate is expected and any throughput payoff is non-GB10. Records the honest rejection in EXECUTION_REARCH_SCOPE.md (P4 RESULT subsection) and PARITY_HANDOFF.md chronology, including the re-score path: read the finalized DGX ~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md once the CANDIDATE arm completes; a genuine >20% staggered-TTFT drop clearing max(2%, 3*stdev) re-scores go=true and triggers the full P4 build-out. 
Fork localai-paged untouched at 653bb2f3d; LocalAI series stays at 46 patches; topic branch p4-cbv2 retained on the DGX fork at ebb649335 (base 653bb2f3d, not pushed). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 18:03:34 +00:00
Ettore Di Giacinto	586639d016	docs(paged): record P2 MoE-region NO-GO (kill-gate flat + seam-shape gap) P2 (expert-major fused routed-FFN region executor, LLAMA_MOE_REGION_EXECUTOR, default-off) is recorded as NO-GO on two independent signals; nothing built beyond the P0 kill-gate, nothing landed, fork localai-paged HEAD untouched at 653bb2f3d (LocalAI series stays at 46 patches, 0001-0055). (1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs grouped-MMQ control 1021.61 us = -0.05% (needed >5% faster); n=128 -0.34%; MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never engages). All inside the 5-sample spread - reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/127). A compact expert-major layout + single sort, both GEMMs still ragged grouped-MMQ, does not move the ragged-tile tax; that needs P3 Marlin persistent-CTA, not a P2 layout swap. (2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split, not the merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern matcher requires; the matcher, region executor, and pre-existing POC/fused-quant all engage 0x on q36 in prefill and decode. KL delta 0.000000 is vacuous (0 engagement). Default md5 canonical both models (MoE 8cb0ce23, dense 5951a5b4); test-backend-ops all green both arms. Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's separate/scaled/swiglu-split FFN shape before any MoE-region lever can engage, then re-evaluate a fused two-GEMM region (not a layout swap). Topic branch p2-moe-region retained on the fork for forensics at 2d87564dd (base 653bb2f3d), not pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 15:43:57 +00:00
Ettore Di Giacinto	ccf75d1dcd	docs(paged): record P1 bf16-stream landing (GO) P1 of the EXECUTION_REARCH_SCOPE additive program landed: LLAMA_BF16_STREAM (default-off) bf16-resident residual-segment executor for the q36 MoE model's projection boundaries. - EXECUTION_REARCH_SCOPE.md: dated "P1 RESULT" subsection (P0 kill-gate GO, full build-out deltas, KL, correctness gates, honest magnitude, provenance). - PARITY_HANDOFF.md: chronology note (verdict, engagement, prefill/KL numbers, fork commits, deferred-not-failed measurements). Key reframe recorded: q36 GDN/attention projections are BF16 weights (not NVFP4), so bf16-stream is a MoE-model prefill lever; the dense model quantizes those projections to NVFP4 and engages nothing (stays bit-identical). Prefill MoE @512 +1.99% (reproducible, at noise floor), KL delta -0.00052 (KL-improving), all md5 + test-backend-ops gates green. Fork HEAD 653bb2f3d, tree 6cf1523047. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 14:34:26 +00:00
Ettore Di Giacinto	500d653bfa	feat(paged): regenerate patch series 0053-0055 (P1 bf16-stream) Additive regen mirroring fork mudler/llama.cpp:localai-paged HEAD 653bb2f3d (base 1edddc8fe + 3 P1 commits). Patches 0001-0052 are untouched. - 0053 residual-segment executor + norm-bf16.{cu,cuh} + LLAMA_BF16_CUBLAS_F32_OUT - 0054 bf16 residual-add + rope op-variants - 0055 BF16_STREAM_SEGMENT test-backend-ops sentinel Kill-gate: a fresh detached worktree at pin 0ed235ea2c17a19fc8238668653946721ed136fd applied all 46 on-disk patches in numeric order (strict git apply) and staged tree 6cf1523047e0e38679baff20844bdc9e6829eb22, byte-for-byte == fork HEAD tree. All default-off (LLAMA_BF16_STREAM); default md5 canonical both models (MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 14:34:14 +00:00
Ettore Di Giacinto	b2784ccbca	docs(paged): fix EXECUTION_REARCH_SCOPE seam citations to fork 1edddc8fe Adversarial verification against the canonical fork mudler/llama.cpp:localai-paged HEAD 1edddc8fe found the scope doc's section-3 seam references were anchored to the abandoned pre-trim tree 237ad9b96, which the immediately-preceding commit `b529cc5420` reset away. Two classes of defect, both corrected: - Phantom scaffolding (honesty): the doc claimed "the team has already started scaffolding P1 and P3" citing four commits (237ad9b96 bf16 GDN state cache, afc2c7030 act-quant trace, ea0875d14 LLAMA_BF16_CUBLAS_F32_OUT, 7967ad47f W4A16 direct-A stub) that `b529cc5420` TRIMMED - none exist at 1edddc8fe (git cat-file: not a valid object). w4a16-policy.h, test-cuda-w4a16-policy.cpp and ggml_cuda_mul_mat_id_w4a16_grouped_direct_a are absent from the tree. Reworded P1 plank-1 and the P3 mechanism/files/effort to say these must be re-introduced on top of the surviving grouped W4A16 path (patch 0035), not "finished". - Stale line numbers (additivity): every file:line was off (computed against the larger 237ad9b96 tree). Re-anchored to 1edddc8fe: ggml_cuda_try_fuse 4232 (was 4661), capture loop 4908 (was 5444), moe whole-pattern matcher 4157 (was 4678), routed_ffn_poc moe-ffn.cu:275 (was 456), grouped W4A16 hook ggml-cuda.cu:2797 (was 3093/3188; the direct-A hooks 3085/3171 never existed), concurrent_event machinery 4769 (was 5305-5318), continuous-batch budget server-context.cpp 3083-3135 with LLAMA_MAX_BATCH_TOKENS at 3105 / prefill_budget_step at 3113 (was 3122-3200). Numbers (attribution table, recovery arithmetic), the six P0 kill-gates, and the unreachable-floor honesty were verified sound and left unchanged. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 11:03:07 +00:00
Ettore Di Giacinto	bf61db6214	docs(paged): scope vLLM-class execution re-architecture (additive program) Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict to a ggml-execution-architecture-conditional one: same-silicon 2-3x is software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased additive program (P1 bf16-native stream, P2 expert-major fused MoE region, P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery arithmetic grounded in the both-engine nsys buckets, and upstream-clash analysis. Point the README docs list and PARITY_HANDOFF forward-direction at it. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 10:50:00 +00:00
Ettore Di Giacinto	b529cc5420	patches(paged): trim series to Phase135 routed-FFN line, sync to fork 1edddc8fe The campaign patches 0048-0063 were added without matching fork commits. After a keep/drop review, the series is trimmed and re-mirrored 1:1 onto the fork branch mudler/llama.cpp:localai-paged (HEAD 1edddc8fe, on 51168c5ee). Kept, renumbered from the fork (now carry Assisted-by + Signed-off-by): - 0048 test(paged): cover MoE swiglu down chain (was 0051, fd920cf8a) - 0049 test(paged): cover MoE weighted combine chain (was 0052, a85c1e098) - 0050 test(paged): cover ragged MoE dispatch (was 0053, 2fed6aacf) - 0051 fix(speculative): disable backend sampling for MTP drafts (was 0054, f1d976f06) - 0052 feat(paged): whole-pattern MoE matcher + routed-FFN fused NVFP4-quant down MMQ (new, 1edddc8fe) Dropped (no fork commits, removed from the series): - 0048-0050 W4A16 grouped-tile pack/tune/pad: dead line, W4A16 ~1.5x slower than grouped-MMQ. - 0055-0063 speculative/moe/mul-mat/cublas route traces + the rejected small-M tile-policy knob (0059). - All other 110-140 campaign markers not needed by Phase135 (GPU-sort, W4A16-direct-A, boundary trace/timing, Phase133 sorted-F32, Phase134 fused-SWIGLU, Phase138 finalize) carry no code in this tree. Tree-hash proof (the mirror invariant): a fresh detached worktree at LLAMA_VERSION 0ed235ea2c17a19fc8238668653946721ed136fd with every on-disk patches/paged/0*.patch applied in numeric order (git apply) stages to tree 097c862c6834b7d8b90419b305b8402155ef8373, byte-identical to fork HEAD 1edddc8fe's tree. Series is 43 patches (0001-0047 unchanged + 0048-0052). Gated on GB10 sm_121a: default md5 MoE 8cb0ce23 / dense 5951a5b4 unchanged; opt-in md5-clean; MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6; six mmq_moe_quantized_raw markers with zero sorted launches on the opt-in sentinel. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 10:19:10 +00:00
Ettore Di Giacinto	1aba41082b	docs(paged): record phases 112-140 + series trim decision Record the phase 110-140 GDN/MoE campaign benchmark log and append the series-trim decision to the parity handoff: keep the Phase135 routed-FFN fused-quant line plus the MoE test sentinels and the MTP-draft correctness fix; drop the W4A16 structural line, the trace/tile-policy patches, GPU-sort, W4A16-direct-A, and the finalize fusion. Rejected/neutral levers are recorded in the handoff and the per-phase bench artifacts. Fork re-mirrored on 51168c5ee: fd920cf8a a85c1e098 2fed6aacf f1d976f06 1edddc8fe (HEAD tree 097c862c). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-07-02 10:16:53 +00:00
Ettore Di Giacinto	67d2c4c9d4	docs(paged): record BF16 GDN state cache phase Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion. Assisted-by: Codex:gpt-5	2026-07-01 16:26:09 +00:00
Ettore Di Giacinto	d091eb30f2	docs(paged): record GDN identity shortcut phase Assisted-by: Codex:gpt-5	2026-07-01 15:46:41 +00:00
Ettore Di Giacinto	bbfaa66f02	docs(paged): record GDN BV32 decode A/B phase Assisted-by: Codex:gpt-5	2026-07-01 15:35:06 +00:00
Ettore Di Giacinto	04ed7fe52f	docs(paged): record GDN launch sweep phase Assisted-by: Codex:gpt-5	2026-07-01 15:13:45 +00:00
Ettore Di Giacinto	a9454b45c8	docs(paged): record MoE decode-only profile phase Assisted-by: Codex:gpt-5	2026-07-01 15:05:45 +00:00
Ettore Di Giacinto	f21b393746	docs(paged): record current MoE graph profile phase Assisted-by: Codex:gpt-5	2026-07-01 14:56:39 +00:00
Ettore Di Giacinto	26a41fad1a	docs(paged): record post-PoC GDN audit phase Assisted-by: Codex:gpt-5	2026-07-01 14:44:17 +00:00
Ettore Di Giacinto	5369219729	docs(paged): record GDN blocked-solve PoC phase Assisted-by: Codex:gpt-5	2026-07-01 14:39:09 +00:00
Ettore Di Giacinto	eb82ff138f	docs(paged): record datacenter Blackwell readiness phase Assisted-by: Codex:gpt-5	2026-07-01 14:28:41 +00:00
Ettore Di Giacinto	2efb0ec362	docs(paged): record TTFT min32 serving phase Assisted-by: Codex:gpt-5	2026-07-01 14:18:54 +00:00
Ettore Di Giacinto	e5c5746c0a	docs(paged): record GDN tensor-core revalidation phase Assisted-by: Codex:gpt-5	2026-07-01 14:05:20 +00:00
Ettore Di Giacinto	6cf8b782d1	docs(paged): record BF16 F32 output broader serving phase Assisted-by: Codex:gpt-5	2026-07-01 13:26:50 +00:00
Ettore Di Giacinto	e573194799	docs(paged): record patch mirror readiness phase Assisted-by: Codex:gpt-5	2026-07-01 13:11:57 +00:00
Ettore Di Giacinto	2b2b1f0b25	docs(paged): record BF16 F32 output dense serving phase Assisted-by: Codex:gpt-5	2026-07-01 13:06:49 +00:00
Ettore Di Giacinto	e67b329eb1	docs(paged): record BF16 cuBLAS F32 output phase Assisted-by: Codex:gpt-5	2026-07-01 12:54:24 +00:00
Ettore Di Giacinto	60954d484a	docs(paged): record quant kernel timing phase Assisted-by: Codex:gpt-5	2026-07-01 12:45:19 +00:00
Ettore Di Giacinto	3fbdfc21c9	docs(paged): record quant trace phase Assisted-by: Codex:gpt-5	2026-07-01 12:42:13 +00:00
Ettore Di Giacinto	55df9100dc	docs(paged): record layout trace phase Assisted-by: Codex:gpt-5	2026-07-01 12:32:05 +00:00
Ettore Di Giacinto	2e19e5c90f	docs(paged): record prefill bucket attribution phase Assisted-by: Codex:gpt-5	2026-07-01 12:20:42 +00:00
Ettore Di Giacinto	6a2618b6dc	docs(paged): record MTP verify-cost rejection Assisted-by: Codex:gpt-5	2026-07-01 11:51:29 +00:00
Ettore Di Giacinto	f7d76389b0	docs(paged): record W4A16 direct activation rejection Assisted-by: Codex:gpt-5	2026-07-01 11:28:11 +00:00
Ettore Di Giacinto	ef578866c8	docs(paged): scope W4A16 direct activation experiment Assisted-by: Codex:gpt-5	2026-07-01 10:59:56 +00:00
Ettore Di Giacinto	fc5d5e4ff3	docs(paged): profile current W4A16 prefill Assisted-by: Codex:gpt-5	2026-07-01 10:56:48 +00:00
Ettore Di Giacinto	ef7dbfa5f7	docs(paged): compare MoE min32 against vLLM Assisted-by: Codex:gpt-5	2026-07-01 10:46:32 +00:00
Ettore Di Giacinto	c41d1a5b4f	docs(paged): record waiting-threshold TTFT defer Record Phase58 prompt-backlog threshold A/B, DGX gates, MoE and dense serving results, and the repeat-before-default decision. Assisted-by: Codex:gpt-5	2026-07-01 10:31:09 +00:00
Ettore Di Giacinto	9be291e6b0	docs(paged): reject capped TTFT defer sweep Record Phase57 capped TTFT prefill-first sweep, DGX gates, and the decision to keep the cap as an A/B knob rather than a parity path. Assisted-by: Codex:gpt-5	2026-07-01 10:18:41 +00:00
Ettore Di Giacinto	902bcc7717	docs(paged): validate TTFT prefill-first A/B Record Phase56 MoE and lower-concurrency validation for the TTFT prefill-first policy, including DGX gates and the opt-in-only decision. Assisted-by: Codex:gpt-5	2026-07-01 10:05:23 +00:00
Ettore Di Giacinto	999cf09532	docs(paged): record TTFT prefill-first A/B Record Phase55 default-off scheduler A/B, DGX md5/op gates, dense serving results, and the pending fork push/mirror status. Assisted-by: Codex:gpt-5	2026-07-01 09:57:55 +00:00
Ettore Di Giacinto	3dbf34e739	docs(paged): record admission histogram trace Record Phase54 trace-only histogram work, DGX md5/op gates, dense serving histogram evidence, and the next scheduler decision. Assisted-by: Codex:gpt-5	2026-07-01 09:40:50 +00:00
Ettore Di Giacinto	347a5c05bd	docs(paged): reject admission budget sweep Assisted-by: Codex:gpt-5	2026-07-01 09:27:20 +00:00
Ettore Di Giacinto	2aa76702df	docs(paged): record dense admission trace Assisted-by: Codex:gpt-5	2026-07-01 09:18:43 +00:00
Ettore Di Giacinto	b5f65152e2	docs(paged): record serving admission trace Assisted-by: Codex:gpt-5	2026-07-01 09:08:42 +00:00
Ettore Di Giacinto	c299dcd231	docs(paged): record dense true decode profile Assisted-by: Codex:gpt-5	2026-07-01 08:55:23 +00:00
Ettore Di Giacinto	cd59e5d61f	fix(paged): scrub harness vars for vllm serve Assisted-by: Codex:gpt-5	2026-07-01 08:23:05 +00:00
Ettore Di Giacinto	96825a224e	docs(paged): record dense serving snapshot Assisted-by: Codex:gpt-5	2026-07-01 08:20:26 +00:00
Ettore Di Giacinto	440129c98e	fix(paged): harden serving snapshot readiness Assisted-by: Codex:gpt-5	2026-07-01 08:07:48 +00:00
Ettore Di Giacinto	e69ee0e867	feat(paged): parameterize served model name Assisted-by: Codex:gpt-5	2026-07-01 07:50:19 +00:00
Ettore Di Giacinto	2a0fc0f4b9	docs(paged): record inference gate guard Assisted-by: Codex:gpt-5	2026-07-01 07:45:52 +00:00
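
Note on commit ac2b0211ff: the analytical fp8-KV decode ceiling it records follows one rule, "theoretical-max decode saving = 0.5 x flash-attn share". The sketch below is a minimal illustration of that arithmetic, not code from the repository. The saving percentages are the ones quoted in the commit message. The flash-attn shares are back-computed from them (share = saving / 0.5) and are an assumption, since the log does not report them separately. The function names are hypothetical.

```python
# Minimal sketch of the fp8-KV ceiling arithmetic quoted in commit ac2b0211ff.
# Rule from the commit: max decode saving = 0.5 * flash-attn share of decode time
# (fp8 halves the KV bytes relative to the 16-bit baseline).

# Savings quoted in the commit message, keyed by context length (percent).
QUOTED_SAVING_PCT = {256: 0.65, 1024: 2.55, 2048: 4.98, 4096: 9.49, 8192: 17.34}

# Fraction of KV bytes removed by moving to fp8 (stated in the commit).
KV_BYTE_FRACTION_REMOVED = 0.5


def implied_flash_attn_share(saving_pct: float) -> float:
    """Back-compute the flash-attn share (percent) implied by a quoted saving.

    ASSUMPTION: the share is derived here, not taken from a reported table.
    """
    return saving_pct / KV_BYTE_FRACTION_REMOVED


def contexts_crossing(threshold_pct: float) -> list[int]:
    """Context lengths whose theoretical-max saving exceeds the threshold."""
    return sorted(
        ctx for ctx, saving in QUOTED_SAVING_PCT.items() if saving > threshold_pct
    )


if __name__ == "__main__":
    for ctx, saving in sorted(QUOTED_SAVING_PCT.items()):
        share = implied_flash_attn_share(saving)
        print(f"ctx{ctx}: ceiling {saving:.2f}% (implied flash-attn share {share:.2f}%)")
    print("crosses +3%:", contexts_crossing(3.0))
```

With the +3% bar mentioned in the commit, only ctx2048 and above clear the ceiling, which matches the message's statement that the win, if realizable, lives only at ctx>=2048. The ceiling is an upper bound on what a KV-dtype change could save, not a measured result.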

885 Commits