mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
docs(paged): FUTURE_LEVERS - parked decode-parity exploration trail
Ranked pick-up points after the 95%-bit-exact plateau: hybrid-precision SSM state (per-head f32/bf16 split - the bf16 error is concentrated in long-memory heads, so a split could capture most of the +25-31% while passing the f32 KL gate), dense CUDA-graph instability, the rms_norm->fp4 fold (flat-risk), datacenter Blackwell sm_100 (no LPDDR5x floor), adaptive prefill budget, MoE-specific recurrence tuning. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
77
backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md
Normal file
@@ -0,0 +1,77 @@
# Decode-Parity: Parked Levers (future exploration)

**Context.** The bit-exact decode-parity effort shipped patches **0018-0023**: dense decode
38% -> **95% of vLLM** @npl128 on GB10 / DGX Spark (LPDDR5x ~273 GB/s), every patch
**byte-identical to llama's own f32 output** (md5-gated). The gated-DeltaNet recurrence (the
dominant ~50% kernel) now runs at **84.6% of peak BW = past vLLM's 82.4%**, at the DRAM floor.
bf16 SSM state was fully built and **shelved** (real +25-31% lever but fails the f32 KL gate).

The remaining non-recurrence kernels (FP4 GEMM, attention, lm_head) are at their bit-exact
floor: any knob changes a reduction order vs the f32 reference. So further *bit-exact* decode
gains are marginal; the levers below are the honest pick-up points, ranked by promise.

---

## 1. Hybrid-precision SSM state (the most promising)

The bf16 build (`BF16_SSM_STATE_RESULTS.md`) proved the throughput lever is large -
recurrence **-49%/call** (dense 3.38 -> 1.73 ms), dense decode ~**490 t/s = 125% of vLLM** (clean
runs), MoE @128 **+24.9%** - but bf16 fails the f32 KL gate (KLD 0.06-0.17 at >=1024 ctx,
~10% argmax flips). The discrimination showed the error is **intrinsic to bf16 over the
long-memory heads** (exp(g) ~ 1, where the per-step decay does not contract the rounding);
short/fast-decaying heads are fine.
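
The long-memory vs fast-decay contrast above can be illustrated with a toy scalar recurrence. This is only a sketch under stated assumptions, not the GDN kernel: the state update `s <- decay * s + 1.0`, the constant input, the step count, and the names `round_to_bf16`, `final_state` and `relative_error` are all hypothetical, and the reference is Python double precision rather than the f32 path.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round a value to bfloat16 (round-to-nearest-even), returned as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even bf16 mantissa
    bits = (bits + bias) & 0xFFFF0000    # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def final_state(decay: float, steps: int, store) -> float:
    """Run s <- store(decay * s + 1.0) for `steps` steps, starting from s = 0."""
    s = 0.0
    for _ in range(steps):
        s = store(decay * s + 1.0)
    return s


def relative_error(decay: float, steps: int = 2000) -> float:
    """Relative error of a bf16-stored state against a double-precision state."""
    exact = final_state(decay, steps, lambda v: v)      # reference: no rounding on store
    approx = final_state(decay, steps, round_to_bf16)   # state rounded to bf16 every step
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    # decay = 0.5 stands in for a fast-decaying head, 0.999 for a long-memory head.
    for decay in (0.5, 0.999):
        print(f"decay={decay}: relative error {relative_error(decay):.3f}")
```

In this toy setup the state with decay close to 1 grows large enough that the per-step update falls below half a bf16 spacing, so the stored value stops moving, while the fast-decaying state stays in a range where bf16 still resolves the update.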

**Lever:** a per-head (or per-channel) precision split - keep the long-memory heads (decay exp(g)
near 1) in f32, store the fast-decaying heads (decay exp(g) well below 1, where rounding contracts)
in bf16. Could capture most of the speedup while passing the KL gate. Needs a g-magnitude
classifier at graph build + a mixed-dtype recurrent-state cache. **HIGH promise, moderate effort.**
The bf16 kernel plumbing already exists (DGX `~/llama-paged-dev/BF16_SSM_STATE.diff`); this adds
the per-head dtype selection on top.
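
One possible shape for the classifier is sketched below. It assumes a single representative log-decay value `g` per head is available when the graph is built; the threshold `0.95`, the per-head element count, and the names `assign_state_dtypes` and `state_bytes` are hypothetical and not taken from the patches.

```python
import math
from typing import List

BYTES_PER_ELEM = {"f32": 4, "bf16": 2}


def assign_state_dtypes(g_per_head: List[float],
                        f32_decay_threshold: float = 0.95) -> List[str]:
    """Keep heads whose decay exp(g) is at or above the threshold in f32, the rest in bf16.

    The threshold is a hypothetical tuning knob; it would have to be chosen so the
    resulting mix still passes the f32 KL gate.
    """
    return ["f32" if math.exp(g) >= f32_decay_threshold else "bf16"
            for g in g_per_head]


def state_bytes(dtypes: List[str], elems_per_head: int) -> int:
    """Total recurrent-state bytes for one layer under a per-head dtype assignment."""
    return sum(BYTES_PER_ELEM[d] * elems_per_head for d in dtypes)


if __name__ == "__main__":
    g = [-0.001, -0.01, -0.7, -2.0]   # made-up per-head log-decays
    dtypes = assign_state_dtypes(g)
    mixed = state_bytes(dtypes, elems_per_head=128 * 128)
    all_f32 = state_bytes(["f32"] * len(g), elems_per_head=128 * 128)
    print(dtypes, f"state traffic vs all-f32: {mixed / all_f32:.2f}")
```

Because the recurrence sits at the DRAM floor, the byte ratio is a rough first-order proxy for how much of the bf16 speedup a given split could keep; the actual gain would still need to be measured.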

*Note:* plain bf16 (no split) is also a legitimate **opt-in for precision-tolerant deployments** -
it is exactly vLLM's own GDN precision (vLLM's recurrent cache is bf16), so "match vLLM speed at
vLLM precision" is one flag away if a user wants it. We declined it as the *default* because our
f32 is a strictly higher bar.

## 2. Dense CUDA-graph instability

The bf16 dense decode was **bimodal** across runs (287 / 336 / 487 / 498 t/s) - a dense-path
CUDA-graph capture/replay instability (good runs hit ~490). The f32 dense path measured stable
(371-376) but the bimodality is a latent fragility worth root-causing; a robust graph capture on
the dense path could stabilize and possibly lift dense decode. **Moderate promise**, diagnostic.
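
A simple way to flag this kind of run-to-run instability in a bench harness is sketched below. The spread metric, the `0.05` threshold, and the function names are assumptions for illustration; the bf16 samples are the four runs quoted above, and the f32 list uses only the two endpoints of the quoted 371-376 range.

```python
from statistics import median
from typing import Sequence


def relative_spread(samples: Sequence[float]) -> float:
    """(max - min) / median of the throughput samples."""
    return (max(samples) - min(samples)) / median(samples)


def is_unstable(samples: Sequence[float], max_spread: float = 0.05) -> bool:
    """Flag a configuration whose run-to-run spread exceeds a (hypothetical) threshold."""
    return relative_spread(samples) > max_spread


if __name__ == "__main__":
    bf16_dense = [287, 336, 487, 498]   # t/s, runs quoted in this section
    f32_dense = [371, 376]              # t/s, endpoints of the quoted stable range
    for name, runs in (("bf16", bf16_dense), ("f32", f32_dense)):
        print(f"{name}: spread={relative_spread(runs):.3f} unstable={is_unstable(runs)}")
```

A check like this only detects the symptom; root-causing the capture/replay behaviour would still need profiling of the dense path.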

## 3. Dense rms_norm -> fp4 producer-fold (~1.5-2.5%, parked as flat-risk)

The last bit-exact bucket (`RMSNORM_FP4_FOLD.md`). Folding the standalone `quantize_mmq_nvfp4`
into the rms_norm+mul producer at the FFN boundary (f32 output dead -> droppable) could recover
~1.5-2.5% dense. Parked because: the Lever-2 precedent measured **flat**, it has the worst
gain/plumbing ratio (3-op `{RMS_NORM,MUL,MUL_MAT(NVFP4)}` graph fusion + a pre-quantized-src1
GEMM path + scratch-pool / CUDA-graph-lifetime plumbing), and the gain risks being swallowed by
the ~0.3-0.5% bench noise floor. Revisit only with the inter-node graph-CSE plumbing built and
proven on a same-build flag toggle (decode_agg lift above noise AND md5 == 0023). **LOW promise.**
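
The acceptance condition at the end of the paragraph above (lift above noise AND md5 equal to the 0023 reference) can be written down as a small check. This is a sketch only: the function names, argument layout, and the way outputs are passed as bytes are assumptions, not the project's bench tooling.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a captured output buffer."""
    return hashlib.md5(data).hexdigest()


def fold_passes_gate(baseline_tps: float, candidate_tps: float, noise_floor: float,
                     candidate_output: bytes, reference_output: bytes) -> bool:
    """True only if the throughput lift exceeds the noise floor AND output is bit-exact.

    `noise_floor` is a fraction (0.005 means 0.5%); `reference_output` stands for the
    output of the patch-0023 build with the flag off.
    """
    lift = candidate_tps / baseline_tps - 1.0
    bit_exact = md5_hex(candidate_output) == md5_hex(reference_output)
    return lift > noise_floor and bit_exact


if __name__ == "__main__":
    ref = b"example decode output"
    print(fold_passes_gate(100.0, 102.0, 0.005, ref, ref))   # lift 2.0%, same bytes
    print(fold_passes_gate(100.0, 100.4, 0.005, ref, ref))   # lift 0.4%, inside noise
```

Requiring both conditions on the same build with only the flag toggled keeps unrelated build differences from being mistaken for a gain.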

## 4. Datacenter Blackwell (sm_100)

This effort targeted **consumer** Blackwell sm_12x (sm_120 RTX 50-series, sm_121 GB10). Datacenter
Blackwell (B100/B200/GB200, sm_100 / cc 10.0) has HBM3e (much higher BW) and different MMA
characteristics - the LPDDR5x bandwidth floor that dominates GB10 decode does **not** apply, so the
whole calculus changes (likely compute-bound, not BW-bound; the recurrence would not be the binding
kernel). A separate investigation if datacenter Blackwell becomes a target.
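
The bandwidth-floor reasoning can be made concrete with a back-of-envelope helper. The 273 GB/s peak and the 84.6% fraction come from the context section; the 1 GB traffic figure and the datacenter bandwidth value are placeholders chosen purely for illustration (substitute the real spec and the real bytes moved), and the function names are hypothetical.

```python
def achieved_bandwidth_gb_per_s(peak_gb_per_s: float, fraction_of_peak: float) -> float:
    """Bandwidth a kernel actually sustains, given its fraction of peak."""
    return peak_gb_per_s * fraction_of_peak


def bandwidth_floor_ms(bytes_moved: float, peak_gb_per_s: float,
                       fraction_of_peak: float = 1.0) -> float:
    """Lower bound on kernel time (ms) if it is purely memory-bandwidth bound."""
    bytes_per_s = achieved_bandwidth_gb_per_s(peak_gb_per_s, fraction_of_peak) * 1e9
    return bytes_moved / bytes_per_s * 1e3


if __name__ == "__main__":
    GB10_PEAK = 273.0            # GB/s, LPDDR5x figure quoted in this document
    HBM_PLACEHOLDER = 8000.0     # GB/s, placeholder only, not a figure from this document
    traffic = 1e9                # bytes per call, hypothetical
    print(f"GB10 recurrence BW: {achieved_bandwidth_gb_per_s(GB10_PEAK, 0.846):.1f} GB/s")
    print(f"GB10 floor:        {bandwidth_floor_ms(traffic, GB10_PEAK):.3f} ms")
    print(f"placeholder floor: {bandwidth_floor_ms(traffic, HBM_PLACEHOLDER):.3f} ms")
```

When the floor shrinks by an order of magnitude, the same kernel is far more likely to be limited by compute or launch overhead, which is why the GB10 ranking of levers may not carry over.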

## 5. Prefill / TTFT scheduler

The chunked-prefill QoS budget (patches 0013/0016, `LLAMA_MAX_BATCH_TOKENS`) bounds TTFT but uses a
single static default. A **dynamic/adaptive** budget (by concurrency + queue depth) could improve the
TTFT-vs-decode tradeoff at high concurrency. **Moderate promise** for the serving experience (not raw
decode tok/s).
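
To make the idea of an adaptive budget concrete, one hypothetical policy is sketched below. It is not the scheduler in patches 0013/0016: the scaling formula, the `min_budget` clamp, and the name `adaptive_prefill_budget` are assumptions, with `static_budget` standing in for whatever `LLAMA_MAX_BATCH_TOKENS` is set to.

```python
def adaptive_prefill_budget(static_budget: int, active_decodes: int,
                            queued_prefills: int, min_budget: int = 256) -> int:
    """Per-step prefill token budget scaled by concurrency and queue depth.

    More active decode streams shrink the budget (protecting decode latency);
    a deeper prefill queue grows it back toward the static value (protecting TTFT).
    The result is clamped to [min_budget, static_budget].
    """
    share = (1 + queued_prefills) / (1 + queued_prefills + active_decodes)
    budget = int(static_budget * share)
    return max(min_budget, min(static_budget, budget))


if __name__ == "__main__":
    for decodes, queued in ((0, 0), (3, 0), (3, 3), (127, 0)):
        b = adaptive_prefill_budget(2048, decodes, queued)
        print(f"decodes={decodes:3d} queued={queued} -> budget {b}")
```

Any real policy would need to be validated against both TTFT and aggregate decode throughput, since the two pull the budget in opposite directions.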

## 6. MoE-specific recurrence tuning

The occupancy retune (0022) was tuned on the dense path; it lifted MoE +8.3% as a side effect. The
MoE path (`MUL_MAT_ID` grouped GEMM + the shared GDN recurrence, expert routing changes the GEMM
shapes) may have MoE-specific occupancy headroom. Worth a MoE-targeted reprofile.

---

*All shelved per the host handover - experiments parked. Pick up from the linked result docs in this
directory.*