mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): reconcile next parity target
Assisted-by: Codex:gpt-5
@@ -2371,11 +2371,11 @@ Decision:
 Any future C1 rerun must push beyond this tested point and keep the same
 md5 plus `MUL_MAT`/`MUL_MAT_ID` gates.

-## Phase 41 Low-Concurrency D1 Check
+## Phase 41 Low-Concurrency Serving Check

 Phase 41 measured the opposite serving regime after Phase40 rejected the tested
 max-concurrency shortcut: low concurrency and latency-sensitive decode. This is
-the regime where the D1/full-step graph-capture direction should matter most.
+the regime where any remaining host/scheduler gap should be most visible.

 Artifacts:

@@ -2433,11 +2433,39 @@ Ratios:

 Decision:

-- D1/full-step graph capture remains relevant for low-concurrency and latency
-work, but this current-stack snapshot does not show an easy parity bridge:
-paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at `n=32`.
+- The low-concurrency gap is real, but Phase41 does not reopen D1/full-step graph
+capture. Patch `0043` already ships that behavior default-on, and Phase34
+route tracing found `host_sync=0/4096` for the current n128 serving path.
+Paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at `n=32`.
 - TTFT is the bigger user-visible low-concurrency gap, especially by `n=8/32`;
 prefill GDN and MoE GEMM work therefore still matters even in a decode-focused
 serving discussion.
-- The next implementation phase should require a separately built A/B and the
-same md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before claiming any D1 improvement.
+- Do not fund another D1 graph-capture patch on GB10 unless a fresh route trace
+first proves a host-sync fallback or graph-disable condition has returned. The
+next implementation target should be a measured non-D1 bucket, gated by the
+same md5 plus `MUL_MAT`/`MUL_MAT_ID` checks.
+
+## Phase 42 D1/GDN/GEMM Target Reconciliation
+
+Phase 42 challenged the Phase41 wording against the patch stack and read-only
+subagent analysis. It resolves the next-target decision before any source work.
+
+Evidence:
+
+| track | evidence | decision |
+|-------|----------|----------|
+| D1/full-step graph capture | Patch `0043` is default-on for grouped MMQ decode and opt-out via `LLAMA_MOE_NO_FORCE_GRAPHS=1`; Phase34 route trace found `host_sync=0/4096`; `VLLM_PARITY_FINAL.md` marks D1 shipped and the host-sync premise refuted | closed on current GB10 path |
+| S3 decode-shape-stable scheduling | Patch `0041` is shipped default-off after end-to-end A/B showed worse TTFT and lower throughput despite better per-step decode metrics | keep opt-in only |
+| GDN prefill | Patches `0046`/`0047` are the shipped GB10 GDN wins; C32 slab, QS-early, and Global-Ai32 were md5-clean but slower | do not add another low-conflict GB10 GDN reorder |
+| W4A16 / prefill GEMM | Patches `0033`/`0034`/`0035` are default-off; `0048`-`0050` improved forced W4A16 only marginally and did not beat default MMQ | do not add another small W4A16 body/metadata tweak |
+
+Next target:
+
+- The only small incremental candidate left from the current evidence is the
+persistent/load-time F32 combined gate projection scoped in Phase38/39:
+combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once, run one
+F32 gate matmul, and split/view the output. Do not use graph-time
+`ggml_concat()`.
+- It must be default-off, fork-first, and validated with MoE/dense md5,
+`MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving
+benchmark.
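
Reviewer sketch (not part of the commit): the "Next target" bullets describe stacking the two gate weights once at load time, running a single matmul, and splitting the output. The toy Python below illustrates only that arithmetic. The helper names, the plain-list representation, and the shapes (three routed experts, one shared-expert gate row) are assumptions made for illustration; none of it is llama.cpp or ggml code.

```python
# Hypothetical illustration of a load-time combined gate projection.
# All names and shapes here are assumptions, not llama.cpp/ggml APIs.

def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def combine_gate_weights(gate_inp, gate_inp_shexp):
    """Stack router rows and shared-expert gate rows once, at load time."""
    return [list(row) for row in gate_inp] + [list(row) for row in gate_inp_shexp]


def combined_gate_projection(combined, x, n_expert):
    """Run one matmul over the stacked weight, then split the output by slicing."""
    out = matvec(combined, x)
    return out[:n_expert], out[n_expert:]


# Toy stand-ins for ffn_gate_inp.weight and ffn_gate_inp_shexp.weight.
gate_inp = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
gate_inp_shexp = [[2.0, -1.0]]

# Built once, outside the per-token path (no per-step concatenation).
combined = combine_gate_weights(gate_inp, gate_inp_shexp)

x = [3.0, 4.0]
router_logits, shexp_gate = combined_gate_projection(combined, x, n_expert=len(gate_inp))
```

Under these assumptions the split outputs equal what two separate matmuls would produce, so the only change is that the stacking cost is paid once instead of on every graph evaluation.
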
@@ -525,16 +525,28 @@ and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE
 concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push
 beyond this tested point and keep the same md5/op gates.

-Phase 41 records the low-concurrency counterpart for D1. Artifact:
+Phase 41 records the low-concurrency counterpart to the Phase40 high-concurrency
+check. Artifact:
 `/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The snapshot ran
 with `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and
 `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE
 `8cb0ce23777bf55f92f63d0292c756b0`, dense
 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID`
 `806/806`. Paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at
-`n=32`; TTFT is `1.38x`, `3.14x`, and `3.40x` vLLM respectively. Keep D1 in
-scope for low-concurrency/latency, but require a separately built A/B and the
-same md5/op gates before claiming improvement.
+`n=32`; TTFT is `1.38x`, `3.14x`, and `3.40x` vLLM respectively. Do not reopen
+D1 from this result: `0043` already ships grouped-MMQ full-step graph capture
+default-on, Phase34 found `host_sync=0/4096`, and S3 is default-off because it
+regressed TTFT/end-to-end throughput.
+
+Phase 42 reconciles the target list after parallel read-only review. D1 is
+closed on the current GB10 path; GDN low-conflict work is exhausted after
+`0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM
+micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. The next small
+GB10 source candidate is the Phase38/39 persistent/load-time F32 combined gate
+projection: combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once,
+run one F32 gate matmul, split/view outputs, default-off, no graph-time
+`ggml_concat()`, and gate with MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` before
+benchmarking. If md5 changes, run KL first.

 ---

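
Reviewer sketch (not part of the commit): the hunk above states the gate order in prose, namely MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`, then KL if an md5 changes, and only then a serving benchmark. The Python below encodes that order as a small decision function. The `GateResult` type, the `next_action` name, and the returned strings are hypothetical; the two expected md5 values are the ones quoted in the Phase 41 text.

```python
from dataclasses import dataclass

# md5 values quoted in the Phase 41 paragraph above.
EXPECTED_MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"
EXPECTED_DENSE_MD5 = "5951a5b4d624ce891e22ab5fca9bc439"


@dataclass
class GateResult:
    """Hypothetical container for one pre/post gate run."""
    moe_md5: str
    dense_md5: str
    mul_mat: tuple      # (passed, total), e.g. (1146, 1146)
    mul_mat_id: tuple   # (passed, total), e.g. (806, 806)


def next_action(result):
    """Return the next step implied by the gate order described in the docs."""
    for passed, total in (result.mul_mat, result.mul_mat_id):
        if passed != total:
            return "reject"       # an op test failed: stop here
    md5_changed = (result.moe_md5 != EXPECTED_MOE_MD5
                   or result.dense_md5 != EXPECTED_DENSE_MD5)
    if md5_changed:
        return "run-kl"           # output changed: check KL before benchmarking
    return "benchmark"            # all gates green: serving benchmark may run


green = GateResult(EXPECTED_MOE_MD5, EXPECTED_DENSE_MD5, (1146, 1146), (806, 806))
print(next_action(green))
```

The op checks are evaluated first here so that a failing `MUL_MAT`/`MUL_MAT_ID` run is never excused by a matching md5. That ordering is a choice made for this sketch; the docs do not specify it.
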
@@ -619,7 +631,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
 - `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path.
 - `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window.
 - `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
-- `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency D1 check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`.
+- `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1.
 - Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
 - Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
 - A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -1038,10 +1038,10 @@ the memory-footprint advantage as a parity claim at this tested point; any
 future C1 retry must push beyond it and keep md5 plus `MUL_MAT`/`MUL_MAT_ID`
 gates.

-### Phase 41 low-concurrency D1 check
+### Phase 41 low-concurrency serving check

-Phase 41 measured the low-concurrency serving regime where D1/full-step graph
-capture should be most useful. Artifact:
+Phase 41 measured the low-concurrency serving regime where any remaining
+host/scheduler gap should be most visible. Artifact:
 `/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The run used
 `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and
 `OPS=MUL_MAT,MUL_MAT_ID`.
@@ -1058,10 +1058,35 @@ Result:
 | 8 | `0.7518` | `0.7398` | `0.6334` | `3.1425` |
 | 32 | `0.6649` | `0.6397` | `0.5282` | `3.4014` |

-Decision: D1 remains a real low-concurrency/latency lever, but Phase41 does not
-make it a shortcut to parity. The implementation gate remains a separately built
-A/B with md5 plus `MUL_MAT`/`MUL_MAT_ID` checks, and TTFT evidence keeps prefill
-GDN/MoE work in scope for serving quality.
+Decision: low-concurrency remains a gap, but Phase41 does not reopen
+D1/full-step graph capture. Patch `0043` already ships grouped-MMQ full-step
+decode graph capture default-on, Phase34 found `host_sync=0/4096`, and S3 is
+intentionally default-off because it hurts TTFT/end-to-end throughput. Treat
+D1 as closed on the current GB10 path unless a fresh route trace proves a
+host-sync fallback or graph-disable condition has returned. TTFT evidence keeps
+prefill GDN/MoE work in scope for serving quality.
+
+### Phase 42 target reconciliation
+
+Phase 42 challenged the current target list with three read-only subagent
+reviews:
+
+- D1/full-step graph capture: closed on current GB10 path. `0040` S1 is
+default-on graph reuse, `0041` S3 is opt-in only, and `0043` D1 is default-on
+grouped-MMQ full-step CUDA graph capture.
+- GDN prefill: the shipped GB10 wins are `0046`/`0047`; later C32 slab,
+QS-early, and Global-Ai32 variants were correctness-clean but slower. Do not
+add another low-conflict GDN reorder on GB10.
+- W4A16 / prefill GEMM: `0033`/`0034`/`0035` remain default-off; `0048`-`0050`
+improved forced W4A16 only marginally and still did not beat default MMQ. Do
+not add another small W4A16 body/metadata tweak.
+
+The next small source candidate, if we stay on GB10, is the persistent/load-time
+F32 combined gate projection from Phase38/39: combine `ffn_gate_inp.weight` and
+`ffn_gate_inp_shexp.weight` once, issue one F32 gate matmul, and split/view the
+outputs. It must be default-off, avoid graph-time `ggml_concat()`, and pass
+MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID`; if md5 changes, run KL before any
+serving benchmark.

 Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

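
Reviewer sketch (not part of the commit): the Phase 42 text says to "run KL" when an md5 changes, without saying how KL is computed. As a reminder of what such a check measures, the Python below computes the KL divergence between two next-token distributions derived from logits. The function names and the toy logits are assumptions; the project's real KL tooling is not shown in this diff.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy logits for one token position: reference build vs. candidate build.
reference = softmax([2.0, 1.0, 0.1])
candidate = softmax([2.0, 1.0, 0.2])

kl_same = kl_divergence(reference, reference)
kl_diff = kl_divergence(reference, candidate)
```

A value near zero means the candidate's distribution is close to the reference at that position even though the byte-exact md5 no longer matches. What threshold counts as acceptable is a project decision that this sketch does not attempt to set.
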
@@ -93,7 +93,7 @@ vllm 32 454.6 592.4 17.43 5376.5 818.6

 - [x] **Step 4: Record decision**

-Decision: Phase41 confirms D1 remains relevant for low-concurrency/latency work, but the measured current-stack gap is around `0.75x` vLLM at `n=1/8` and `0.665x` at `n=32`, not an immediate parity bridge. TTFT remains the larger user-visible gap.
+Decision: Phase41 confirms a low-concurrency/latency gap, but it does not reopen D1/full-step graph capture. Patch `0043` already ships grouped-MMQ full-step graph capture default-on, and Phase34 found `host_sync=0/4096`. The measured current-stack gap is around `0.75x` vLLM at `n=1/8` and `0.665x` at `n=32`; TTFT remains the larger user-visible gap.

 ### Task 3: Update Handoff Docs

@@ -0,0 +1,108 @@
+# Target Reconciliation Phase42 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Reconcile the post-Phase41 target list so the next parity phase does not chase a closed D1/GDN/W4A16 premise.
+
+**Architecture:** Use read-only parallel subagent analysis over D1 graph capture, GDN prefill, and W4A16/MoE prefill GEMM. Record the resulting target decision in the parity docs.
+
+**Tech Stack:** LocalAI docs, llama.cpp patch mirrors, `/home/mudler/_git/llama.cpp` fork, Git.
+
+---
+
+### Task 1: Run Parallel Target Reviews
+
+**Files:**
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0040-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0041-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0043-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0031-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0046-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0047-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0033-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0034-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0035-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0048-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0049-*.patch`
+- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0050-*.patch`
+
+- [x] **Step 1: Review D1**
+
+Ask a read-only explorer to reconcile whether D1/full-step graph capture is shipped or still open.
+
+Observed:
+
+```text
+D1/full-step MoE decode CUDA graph capture is shipped and default-on.
+The host-sync premise is closed/refuted for current GB10 NVFP4 grouped-MMQ decode.
+```
+
+- [x] **Step 2: Review GDN**
+
+Ask a read-only explorer to inspect GDN tensor-core/chunking state.
+
+Observed:
+
+```text
+0046/0047 are shipped GB10 wins.
+0031 scalar chunking stayed opt-in/slower.
+C32 slab, QS-early, and Global-Ai32 were correctness-clean but slower.
+Do not add another GDN GB10 patch.
+```
+
+- [x] **Step 3: Review W4A16/GEMM**
+
+Ask a read-only explorer to inspect the prefill GEMM / W4A16 state.
+
+Observed:
+
+```text
+0033/0034/0035 are default-off.
+0048/0049/0050 improve forced W4A16 only marginally.
+Production defaults still use FP4-MMQ.
+Do not add another small W4A16 body/metadata patch.
+```
+
+### Task 2: Record Phase42 Decision
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md`
+- Create: `docs/superpowers/plans/2026-07-01-target-reconciliation-phase42.md`
+
+- [x] **Step 1: Correct Phase41 D1 wording**
+
+Change Phase41 from "D1 remains relevant" to "low-concurrency remains a gap, but D1 graph capture is already shipped/default-on and not reopened."
+
+- [x] **Step 2: Add Phase42 decision**
+
+Record:
+
+```text
+D1: closed on current GB10 path.
+GDN: low-conflict GB10 work exhausted.
+W4A16/GEMM: micro-patch track exhausted.
+Next small GB10 source candidate: persistent/load-time F32 combined gate projection.
+```
+
+- [x] **Step 3: Verify and commit**
+
+Run:
+
+```bash
+git diff --check
+git status --short
+```
+
+Commit with:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md
+git add -f docs/superpowers/plans/2026-07-01-target-reconciliation-phase42.md
+git commit -m "docs(paged): reconcile next parity target" -m "Assisted-by: Codex:gpt-5"
+```