mirror of https://github.com/mudler/LocalAI.git (synced 2026-06-27 09:57:14 -04:00)
docs(paged): arch-generality audit - optimization classification (0017-0029)
Classify the paged-attention optimizations as arch-GENERAL (ship everywhere), GB10-TUNED (per-arch retune), or Blackwell-precision-specific; add the per-arch expected story (sm_100/Hopper/Ada/Metal/CPU) and the SAFETY gap (fused GDN/conv ops are CUDA+CPU-only with backend-ungated emission). Extends the prior build/gallery-targeting audit in the same file.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -218,4 +218,187 @@ description + tags. Recommend a one-line Blackwell-recommended hardware note +
consistent `nvfp4`/`blackwell` tags on all six, and tempering the GB10 bench
claims with the "runs slower off-Blackwell" caveat.

## Section: optimization-generality (patches 0013/0016 + 0017-0029)

Classifies each optimization as arch-GENERAL (ship everywhere, helps any arch),
GB10-TUNED (needs per-arch retuning of the magnitude/constants), or
Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists). Read from the
patch commit bodies + the diffs they touch; bit-exactness verdicts are the
patches' own md5/test-backend-ops gates.

Arch axis used: NVFP4 FP4-MMA needs `BLACKWELL_MMA_AVAILABLE` (sm_120/121 consumer +
sm_100 datacenter); Hopper sm_90 / Ada sm_89 / Ampere sm_80-86 have none;
Metal/CPU/AMD/Intel have no NVFP4-MMA. Datacenter Blackwell sm_100 has FP4-MMA but
HBM3e (~8 TB/s) so it is COMPUTE-bound, not bandwidth-bound: every GB10
"bandwidth-bound" verdict inverts there. The FP4-MMA kernel itself is UPSTREAM
ggml-cuda gated by `BLACKWELL_MMA_AVAILABLE`; none of these patches add it - they
reshape/route/dedup around it (0017/0020/0023/0025) or are precision-agnostic.
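The arch axis above can be restated as a small lookup. This is a minimal sketch, assuming the compute capability is passed as a `(major, minor)` tuple; the function names are hypothetical, and the real gate is the compile-time `BLACKWELL_MMA_AVAILABLE` macro in ggml-cuda, not a runtime table like this one.

```python
# Hypothetical restatement of the arch axis used in this audit; not ggml code.
# The set lists only the archs the text names as having NVFP4 FP4-MMA.
FP4_MMA_ARCHS = {(10, 0), (12, 0), (12, 1)}  # sm_100, sm_120, sm_121


def has_fp4_mma(cc):
    """True where the audit says FP4-MMA is available."""
    return tuple(cc) in FP4_MMA_ARCHS


def nvfp4_path(cc):
    """FP4-MMA fast path on Blackwell, run-via-dequant (correct, slow) elsewhere."""
    return "fp4-mma" if has_fp4_mma(cc) else "dequant"
```

For example, `nvfp4_path((12, 1))` (GB10) selects the fast path, while `nvfp4_path((9, 0))` (Hopper) and `nvfp4_path((8, 9))` (Ada) fall back to dequant.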

### A. ARCH-GENERAL (ship everywhere; pure win or provably neutral)

Graph-shape, host-side, or gather/copy-elimination changes. No FP4, no
bandwidth-floor assumption. Bit-exact. Help or are neutral on any arch that runs
the code path.

- 0013 decoupled prefill-token budget - pure `update_slots()` scheduler policy,
zero libllama/ggml change, orthogonal to LLAMA_KV_PAGED, default-off
byte-identical. Latency/fairness lever (flattens decode-ITL spike from a
co-batched long prefill). No arch assumption.
- 0016 dynamic decode-first prefill budget - supersedes 0013; still pure
`update_slots()` policy, default-off byte-identical, T==n_batch degenerate case
== stock. Arch-neutral, identical paged on/off.
- 0024 paged-pool burst-reclaim - host-side block accounting + defrag + slot
release; never touches KV values or compute. Gated behind LLAMA_KV_PAGED. Fixes
a real fragmentation/throughput-collapse bug on long-lived servers.
Arch-independent host bookkeeping.
- 0029 block-table within-step host cache - memcpy-reuse of the host block table
across full-attention layers in one step; bit-exact, LLAMA_PAGED_NO_BT_CACHE=1
off. Helps host-bound decode (dense +2.7% on GB10), neutral when compute-bound
(MoE flat). The faster the GPU (e.g. sm_100), the MORE host-bound decode is, so
the BIGGER this win elsewhere.
- 0028 recurrent-state (conv-tap) gather fusion - eliminates a k_get_rows by
reading cache[ids[s]] in-kernel; bit-identical. Deleting a gather vLLM has no
equivalent of is a win on ANY arch running the GDN path; not FP4, not
bandwidth-floor specific.
- 0018 in-place SSM-state write-back + 0019 fused SSM-state gather + 0021
conv-state in-place fusion - remove a D2D state copy-back (0018), a state
get_rows (0019), and the 4-op conv chain + ring-state copy (0021), mirroring
vLLM's in-place recurrent update. Arithmetic byte-identical; what is removed is
plumbing, so wins on ANY arch running the gated-DeltaNet recurrence.
- Paged KV core (0001-0012) - paged KV manager, on-demand alloc, prefix caching,
in-kernel paged read. No precision or bandwidth-floor assumption; the most
portable part of the work, helps capacity/serving anywhere it compiles.

NOTE: 0018/0019/0021/0028 + the base GDN op have CUDA + CPU kernels ONLY (every
gate is "CUDA0 vs CPU"). General within {CUDA, HIP/ROCm (hipified ggml-cuda), CPU};
NOT covered on Metal/SYCL/Vulkan - see SAFETY #1.
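The plumbing-removal fusions (0018/0019/0028) share one idea: instead of gathering state rows into a temporary, updating them, and copying them back, the op indexes `cache[ids[s]]` directly and writes in place. A toy sketch of why the result is identical, assuming unique `ids` and using a stand-in update (add a per-sequence scalar) rather than the real gated-DeltaNet recurrence; both function names are hypothetical.

```python
def gather_then_update(cache, ids, x):
    """Unfused: get_rows into a temporary, update, then copy the rows back."""
    gathered = [list(cache[i]) for i in ids]            # extra copy per state row
    updated = [[v + xs for v in row] for row, xs in zip(gathered, x)]
    for i, row in zip(ids, updated):                     # copy-back step
        cache[i] = row
    return cache


def fused_update_inplace(cache, ids, x):
    """Fused: read cache[ids[s]] directly and write in place, no temporaries."""
    for s, i in enumerate(ids):
        row = cache[i]
        for j in range(len(row)):
            row[j] = row[j] + x[s]                       # same arithmetic, same order
    return cache
```

Because the arithmetic and its order are unchanged, the two forms agree exactly; only the intermediate copies disappear, which is why the text treats these as plumbing removal rather than a numerical change.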

### B. GENERAL-IN-DIRECTION but the MAGNITUDE was measured on the GB10 floor

Correct + beneficial everywhere, but the specific %/constants are GB10-bound.

- 0020 o_proj GDN MMVQ->MMQ reshape - collapses the GDN output to 2D so the
ssm_out matmul sees src1->ne[1]=128 and routes to MMQ (M=128 GEMM that amortizes
the weight read across 128 tokens) instead of MMVQ (built for batch<=8 with the
128 sequences stuck in ne[2]). Zero-cost view change, bit-identical, gated to the
gated-DeltaNet path. UNIVERSAL: MMVQ (mul_mat_vec_q) is structurally a batch<=8
GEMV and cannot amortize the weight read at a real M=128; MMQ does, on ALL CUDA
archs (dp4a pre-tensor-core still amortizes) and on HIP. So MMQ > MMVQ at M=128
is NOT GB10-specific - pure win wherever MMQ exists. RE-TUNE: the +31.7%
magnitude is on the GB10 BW floor; smaller % on sm_100 but still correct.
REGRESSION RISK: only at a genuinely tiny real M (single-stream decode n_seqs<=8)
could forcing MMQ be slower than MMVQ - see SAFETY #2. On Metal/Vulkan/SYCL the
MMVQ/MMQ split differs; the reshape is harmless (a view) but yields no benefit.
- 0023 MoE NVFP4 activation-quantize de-dup - for broadcast up/gate proj (ne11==1)
quantize the unique token activations once and gather the identical FP4 blocks
instead of re-quantizing per expert; bit-exact, ..._DEDUP=0 off.
DIRECTION-GENERAL (de-duplicating identical work is always good) but
NVFP4-block-layout specific (uint4 copy of block_fp4_mmq) and only matters where
activation-quant is a measurable decode bucket - on a compute-bound arch the
saved quant time may be off the critical path (even on GB10 the MoE TG win is
only +1.7%).
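The 0020 reshape can be pictured as a shape-only view change that flips a router decision. A minimal sketch, assuming a router that looks only at `src1->ne[1]` with the "batch<=8" threshold quoted above; the real ggml-cuda dispatch has more conditions, and `n_embd = 2048` is a placeholder value, not taken from any model.

```python
MMVQ_MAX_BATCH = 8  # "built for batch<=8" per the text; simplified, hypothetical router


def route(src1_ne):
    """Pick a kernel family from src1->ne[1] alone (simplified)."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


def collapse_to_2d(ne):
    """View [ne0, ne1, ne2] as [ne0, ne1*ne2, 1]: same elements, no copy."""
    ne0, ne1, ne2 = ne
    return (ne0, ne1 * ne2, 1)


# 128 sequences stuck in ne[2]: the router sees ne[1] == 1 and picks the GEMV path.
before = (2048, 1, 128)   # n_embd is a placeholder
after = collapse_to_2d(before)
```

With these assumptions `route(before)` is `"MMVQ"` and `route(after)` is `"MMQ"`, while the element count is unchanged, which matches the text's description of a zero-cost, bit-identical view change.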

### C. GB10-TUNED (constants are GB10 winners; re-sweep per arch)

- 0022 GDN recurrence occupancy/coalescing retune - column-folding template params
NUM_WARPS/COLS_PER_WARP, default (16,8), env-selectable GDN_NW/GDN_CPW. The
reduction/FMA order is byte-identical (md5-gateable); only the warp/block->column
assignment changes to raise memory-level parallelism on a BANDWIDTH-BOUND kernel.
(16,8) is explicitly "the measured GB10 winner". Textbook per-arch tune: optimal
values depend on DRAM latency / L2 / SM count / occupancy. SAFE everywhere
(bit-exact, env-overridable, no forbidden float4 load) but unlikely optimal off
GB10; on a compute-bound arch (sm_100) the kernel may not even be the
bottleneck. Needs a per-arch GDN_NW/CPW sweep.
- 0017 FP4 dense-GEMM decode tile tune - shipped as a P0 bit-exact gate + DEFAULT-
OFF occupancy levers (GGML_CUDA_FP4_MMQ_Y / ..._MINBLOCKS / ..._DENSE_MMQ_X).
Honest GB10 verdict was a KILL-GATE: every cheap occupancy probe REGRESSED on
sm_121 (the M=128 tile is already weight-read optimal). Nothing on by default =>
byte-identical to stock everywhere. On a DIFFERENT FP4 arch (sm_100) the
kill-gate could flip; the levers are in place and inert, ready to re-sweep.
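The per-arch re-sweep asked for above could be driven by a small harness. A sketch under stated assumptions: the env var names `GDN_NW` / `GDN_CPW` are taken from the text as written, the candidate grid is hypothetical (the set of values the kernel actually instantiates is not given here), and `bench` is a caller-supplied function that launches the real workload and returns tokens/s.

```python
import itertools
import os


def sweep_gdn(bench, nw_values=(4, 8, 16), cpw_values=(4, 8, 16)):
    """Try each (NUM_WARPS, COLS_PER_WARP) pair; return (best_pair, all_results).

    bench: zero-argument callable returning tokens/s. It is expected to start
    the workload so that it picks up GDN_NW / GDN_CPW from the environment.
    """
    results = {}
    saved = {k: os.environ.get(k) for k in ("GDN_NW", "GDN_CPW")}
    try:
        for nw, cpw in itertools.product(nw_values, cpw_values):
            os.environ["GDN_NW"] = str(nw)
            os.environ["GDN_CPW"] = str(cpw)
            results[(nw, cpw)] = bench()
    finally:
        # Restore the caller's environment whether or not a run failed.
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
    best = max(results, key=results.get)
    return best, results
```

Since the text says the reduction/FMA order is byte-identical across settings, a sweep like this only has to compare throughput; an md5 gate on the output can run alongside as a sanity check.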

### D. Blackwell-PRECISION-specific (only meaningful where FP4-MMA exists)

- 0025 MoE NVFP4 MoE-decode re-graph - keeps CUDA graphs on for the grouped
stream-k mul_mat_q id-path; env-gated LLAMA_MOE_FORCE_GRAPHS, default-off
byte-identical. The CUDA-graph mechanism is general, but the specific guard
condition (mmvq_mmid_max==8 for NVFP4 on sm_121) and the "graphs are safe here"
reasoning are tied to the NVFP4 grouped path on Blackwell. On a non-FP4 arch the
node would not take that branch -> inert.
- 0026 hybrid per-head SSM-state precision (bf16 SSM/conv cache) - adds
--cache-type-ssm/-conv + --ssm-bf16-tau (per-head f32-vs-bf16 by memory length).
Default f32 = bit-exact. PRECISION/bandwidth lever: bf16 halves the dominant GDN
decode byte stream, which only pays off on a BANDWIDTH-bound arch (GB10). On
sm_100 HBM3e it buys little. Value is bandwidth-floor specific; correctness is
precision-specific (opt-in, default-safe).
- NVFP4 GGUFs + the 6 gallery -paged rows - inherently Blackwell-precision-specific
for the FAST path: NVFP4 weights only get FP4-MMA on sm_120/121/100. Elsewhere
they run-via-dequant (correct, slow) per the gallery-targeting section above.
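The 0026 trade-off (half the bytes, no longer bit-exact) can be seen with a plain bf16 round trip. A minimal sketch using round-to-nearest-even on the upper 16 bits of the f32 pattern; NaN is not special-cased, the helper names are hypothetical, and this is not the patch's conversion code.

```python
import struct


def f32_to_bf16_bits(x):
    """Round an f32 value to bf16 (top 16 bits, round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Widen bf16 bits back to f32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def state_bytes(n_elems, dtype):
    """Bytes streamed for a state tensor of n_elems in 'f32' or 'bf16'."""
    return n_elems * {"f32": 4, "bf16": 2}[dtype]
```

`state_bytes(n, "bf16")` is exactly half of `state_bytes(n, "f32")`, which is the bandwidth lever. The round trip of 0.1 comes back as 0.10009765625, which is why the default stays f32 and the bf16 cache is opt-in.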

### Per-arch expected story

- Consumer Blackwell sm_120/121 (GB10 / dGPU): the validated target. dGPU sm_120
(GDDR7 ~1 TB/s) is less BW-starved than GB10 LPDDR5x 273 GB/s, so the
bandwidth-floor wins (0018/0019/0022/0026) shrink in % while the host-pipeline +
graph wins (0029/0025) and the MMQ reshape (0020) hold.
- Datacenter Blackwell sm_100 (HBM3e ~8 TB/s): FP4-MMA WORKS so NVFP4 stays fast
(precision bucket + 0025 carry over), BUT the BW floor is GONE -> compute-bound.
Re-tune: 0022 GDN_NW/CPW sweep; 0017 kill-gate may flip (levers ready). The
bandwidth-motivated wins (0018/0019, 0026 bf16-state) shrink toward neutral; the
host-pipeline/graph/MMQ-reshape general wins (0029/0025/0020) still help. Net:
works, faster GPU, needs a re-tune pass, do NOT assume the GB10 constants.
- Hopper sm_90 / Ada sm_89 / Ampere sm_80-86: NO FP4-MMA. NVFP4 GGUFs + the FP4
levers (0017/0023/0025) are out of scope -> use a DIFFERENT quant (Q4_K/AWQ/GPTQ
etc). BUT the precision-agnostic infra still helps: paged KV core, scheduler
(0013/0016), burst-reclaim (0024), block-table cache (0029), SSM/conv
plumbing-removal (0018/0019/0021/0028); 0020 still routes the o_proj
MMVQ->MMQ in whatever quant it uses. Story: "no FP4 -> another quant, but paged +
SSM + scheduler infra is a pure win".
- Metal / CPU / AMD-ROCm / Intel-SYCL / Vulkan (all built by the matrix): no
NVFP4-MMA; paged KV + scheduler infra is the portable value. CPU has reference
kernels for every fused op (the bit-exact gate is CUDA0-vs-CPU). ROCm/HIP reuses
ggml-cuda (hipify) so it inherits the fused-op kernels. Metal/SYCL/Vulkan do NOT
get the new fused-op kernels (SAFETY #1).
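The "bandwidth-bound on GB10, compute-bound on sm_100" distinction is a roofline argument: a kernel's time is bounded below by the larger of its memory time and its compute time. A sketch of that arithmetic, using the two bandwidth figures quoted in this section; the kernel's byte/FLOP counts and the peak-FLOPs value in the usage note are placeholders, not measurements of either chip.

```python
GB10_BW = 273e9    # LPDDR5x bytes/s, from the text
SM100_BW = 8e12    # HBM3e bytes/s, "~8 TB/s" from the text


def bound(bytes_moved, flops, bw_bytes_per_s, peak_flops_per_s):
    """Return (regime, lower-bound seconds) for one kernel under a simple roofline."""
    t_mem = bytes_moved / bw_bytes_per_s
    t_compute = flops / peak_flops_per_s
    regime = "bandwidth" if t_mem >= t_compute else "compute"
    return regime, max(t_mem, t_compute)
```

With a placeholder kernel moving 1e9 bytes for 1e12 FLOPs and a placeholder peak of 1e15 FLOP/s held fixed, `bound(1e9, 1e12, GB10_BW, 1e15)` lands in the bandwidth regime and `bound(1e9, 1e12, SM100_BW, 1e15)` in the compute regime. Once a kernel is in the compute regime, shaving bytes (0018/0019/0026) no longer moves the lower bound, which is the sense in which those wins "shrink toward neutral".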

### SAFETY / regression risks

1. Fused GDN/conv ops are CUDA + CPU only; emission is NOT backend-gated.
0018/0019/0021/0028 add new ops (ggml_gated_delta_net_inplace[_ids],
ggml_ssm_conv_update_inplace[_ids]) implemented for CUDA + CPU only. They are
emitted by the graph builder whenever cparams.fused_gdn_ar/fused_gdn_ch is set
(constructor sets fused_gdn_ch=true, auto_fgdn=true), with NO check that the
active compute backend is CUDA/HIP. Fine on CUDA/HIP/CPU. On Metal/SYCL/Vulkan
two failure modes: (a) the new GATED_DELTA_NET op has no kernel -> loud
supports_op/assert (but the base GDN op may already be CUDA/CPU-only upstream,
so a qwen35 model likely cannot run there regardless); (b) the fused conv
variant REUSES GGML_OP_SSM_CONV discriminated by a non-null src[3]/src[4] - a
backend that supports plain SSM_CONV but ignores the discriminator would compute
the WRONG plain conv -> SILENT corruption. That is the one genuine
silent-correctness risk. RECOMMENDATION: gate fused-op emission on the compute
backend being CUDA/HIP (or add a supports_op guard rejecting the discriminated
SSM_CONV where the fused handling is absent).
2. 0020 MMVQ->MMQ at tiny real M. MMQ is right at decode M=128 (the gallery
batched-serving regime); it would be wrong only at a genuine M<=8 (single-stream
decode, n_seqs=1). Bit-identical either way - only a potential perf regression
at tiny batch on non-GB10 archs (never the measured GB10 case). Worth confirming
the reshape still picks the better kernel at n_seqs=1 elsewhere.
3. 0022 default (16,8) off GB10: safe (bit-exact, env-overridable) but non-optimal;
do not ship it as the default for sm_100/Hopper/Ada without a GDN_NW/CPW sweep.
No correctness risk.
4. Gallery rows do not state a GPU-arch requirement (covered in the
gallery-targeting section): add a Blackwell (sm_120/121/100) recommended note.
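The recommendation in item 1 has two halves: do not emit the fused ops on backends without kernels, and make any backend without the fused handling reject the discriminated SSM_CONV instead of computing the plain conv. A toy model of both guards, assuming a graph node is a dict with an `op` name and a `src` list; these functions are hypothetical illustrations, not ggml's `supports_op` API. CPU is included in the capable set because the text says it has reference kernels for every fused op.

```python
# Backends the text says have the fused kernels (HIP via hipified ggml-cuda).
FUSED_CAPABLE_BACKENDS = {"CUDA", "HIP", "CPU"}


def emit_fused_ops(backend, fused_requested):
    """Builder-side gate: only emit fused GDN/conv ops where a kernel exists."""
    return fused_requested and backend in FUSED_CAPABLE_BACKENDS


def is_fused_ssm_conv(node):
    """The fused variant reuses SSM_CONV, marked by a non-null src[3] or src[4]."""
    return node["op"] == "SSM_CONV" and any(s is not None for s in node["src"][3:5])


def supports_op(node, backend):
    """Backend-side guard: refuse the discriminated SSM_CONV rather than miscompute it."""
    if is_fused_ssm_conv(node) and backend not in FUSED_CAPABLE_BACKENDS:
        return False
    return True  # toy default; a real backend checks far more than this
```

Either guard alone turns failure mode (b) from silent corruption into a loud refusal or a fallback to the unfused chain; the builder-side gate additionally avoids failure mode (a).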

### One-line verdict

The PORTABLE core (paged KV 0001-0012, scheduler 0013/0016, burst-reclaim 0024,
block-table cache 0029, the SSM/conv plumbing-removal 0018/0019/0021/0028, the
o_proj MMQ reshape 0020) is arch-general and ships everywhere it compiles -
bit-exact, mostly default-safe, pure win or neutral. The FP4/NVFP4 levers
(0017/0023/0025, the GGUFs, the gallery) are Blackwell-precision-specific. The
occupancy/precision tunes (0017 levers, 0022, 0026) are GB10-bandwidth-floor-tuned
and need a per-arch re-sweep (especially on sm_100 where the BW floor is gone and
the regime flips to compute-bound). The single real SAFETY gap: the new fused
GDN/conv ops are CUDA+CPU-only with backend-ungated emission, so a Vulkan/SYCL/Metal
paged build of a gated-DeltaNet model could assert (GDN op) or silently miscompute
(discriminated SSM_CONV) - it should be compute-backend-gated.

Assisted-by: Claude:opus-4.8 [Claude Code]