mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-25 17:12:10 -04:00
docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction
Synthesize the cross-engine bit-exactness and f32-preserving-parity study. Resolve the contradiction between sub-agents (one f32, two bf16) by reading every link of vLLM's state-dtype chain on live source: - config.json text_config.mamba_ssm_dtype = "float32" (both served models) - cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override - vllm.py __post_init__ -> try_verify_and_update_config (config finalize) - Qwen3_5ForConditionalGenerationConfig override copies "float32" into mamba_ssm_cache_dtype before state-dtype resolution - mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16) - qwen_gdn_linear_attn allocates the temporal cache at f32 Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED (equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED (vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever); B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction trees, plus general FP non-associativity). bf16 temporal state degrades BELOW vLLM's f32 recurrent precision -> an over-clock, not a parity requirement. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
339
backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md
Normal file
339
backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md
Normal file
@@ -0,0 +1,339 @@
|
||||
# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
|
||||
|
||||
Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
|
||||
Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4
|
||||
(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`,
|
||||
vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
|
||||
|
||||
> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded
|
||||
> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM)
|
||||
> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")`
|
||||
> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that
|
||||
> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is
|
||||
> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width
|
||||
> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.
|
||||
>
|
||||
> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Two separate sub-agents reached
|
||||
> *opposite* dtype readings (one f32, two bf16). The contradiction was resolved by reading every
|
||||
> link of the chain directly, not by majority vote. All eight links confirm **float32 temporal
|
||||
> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models);
|
||||
> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts
|
||||
> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only**
|
||||
> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`,
|
||||
> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at
|
||||
> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps
|
||||
> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` ->
|
||||
> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`)
|
||||
> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires**
|
||||
> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch ->
|
||||
> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16);
|
||||
> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd)
|
||||
> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`).
|
||||
> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and
|
||||
> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration
|
||||
> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound
|
||||
> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only
|
||||
> reading consistent with both source *and* the measured time.
|
||||
|
||||
## Headline (two answers)
|
||||
|
||||
1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any
|
||||
two distinct engines.** B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two
|
||||
kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different
|
||||
reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of
|
||||
that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different
|
||||
accumulation - a >>ULP divergence in every projection and the LM head.
|
||||
|
||||
2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever
|
||||
was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the
|
||||
same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal
|
||||
bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact
|
||||
vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence),
|
||||
not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
|
||||
|
||||
---
|
||||
|
||||
# The five questions, answered (synthesis)
|
||||
|
||||
**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources make
|
||||
bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the
|
||||
activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16
|
||||
(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every*
|
||||
projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))`
|
||||
(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded
|
||||
into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees*
|
||||
(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**.
|
||||
A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal
|
||||
rounding. Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct
|
||||
GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
|
||||
*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate
|
||||
(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds**
|
||||
with length (once one token differs the SSM/KV states fork), and is *weaker here* than a
|
||||
same-precision run because of the A8-vs-A4 GEMM gap.
|
||||
|
||||
**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
|
||||
and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven +
|
||||
re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM
|
||||
81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda`
|
||||
74% -> ~81%** - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward
|
||||
fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the
|
||||
per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero
|
||||
precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it;
|
||||
worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
|
||||
bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
|
||||
other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
|
||||
chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible
|
||||
one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
|
||||
don't need to eliminate the write). bf16 remains the **only** lever that goes *ahead* of vLLM on the
|
||||
recurrence (~440 tok/s) - an **over-clock**, not a requirement.
|
||||
|
||||
**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This
|
||||
corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on
|
||||
the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small
|
||||
**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal**
|
||||
state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate
|
||||
precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state
|
||||
precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
|
||||
which costs llama precision only on a tiny stream and buys almost no BW.
|
||||
|
||||
**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar -
|
||||
never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)**
|
||||
throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement,
|
||||
KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32
|
||||
registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable
|
||||
is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity"
|
||||
must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the
|
||||
precision-for-speed trade it is.
|
||||
|
||||
**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the
|
||||
"bf16 is required" framing - keep the bit-exact levers.**
|
||||
- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient
|
||||
than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient).
|
||||
- **B2 REFUTED** - not a f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
|
||||
- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32
|
||||
occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
|
||||
- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP
|
||||
non-associativity across distinct kernel stacks).
|
||||
- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it.
|
||||
**Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the
|
||||
parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond
|
||||
vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a
|
||||
parity-matching change.
|
||||
|
||||
---
|
||||
|
||||
# PART A - Divergence inventory (per source: bit-identical vs close)
|
||||
|
||||
Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise
|
||||
conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers:
|
||||
RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
|
||||
|
||||
## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
|
||||
|
||||
- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path
|
||||
(`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**,
|
||||
block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by
|
||||
`mmq_y`/`mmq_x` + the warp-MMA fragment layout.
|
||||
- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations,
|
||||
w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback
|
||||
(**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`.
|
||||
- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or
|
||||
16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order
|
||||
differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest
|
||||
divergence and is present in every projection + the LM head, so logits differ materially on its own.
|
||||
|
||||
## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
|
||||
|
||||
Both single-pass over an **f32** state (Part B). llama: `gated_delta_net.cu`
|
||||
`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py`
|
||||
`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`.
|
||||
With S[k][v] (llama, transposed) == h[v][k] (vLLM):
|
||||
|
||||
```
|
||||
llama: kv[v] = Sigma_k S_old[k][v]*k[k] # OLD state; g applied AFTER the sum
|
||||
delta = (v[v] - g*kv[v])*beta; S_new = g*S_old + k(x)delta; o[v]=Sigma_k S_new[k][v]*q[k]
|
||||
vLLM: h' = g*h_old # decay rounded into EVERY element first
|
||||
kv[v]=Sigma_k h'[v][k]*k[k]=Sigma_k round(g*h_old)*k; b_v=(v[v]-kv[v])*beta
|
||||
h_new = h' + b_v(x)k; o[v]=Sigma_k h_new[v][k]*q[k]
|
||||
```
|
||||
|
||||
Algebraically identical (g scalar). **Numerically not**, for two structural reasons that survive even
|
||||
with identical f32 state, identical inputs, and identical reduction tree:
|
||||
- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction);
|
||||
vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction).
|
||||
Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
|
||||
- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step
|
||||
butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis).
|
||||
**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.**
|
||||
|
||||
## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
|
||||
llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM
|
||||
`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the
|
||||
*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
|
||||
(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.)
|
||||
|
||||
## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
|
||||
L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold
|
||||
(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log:
|
||||
`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`.
|
||||
Different variance reduction tree / eps placement / fusion boundary.
|
||||
|
||||
## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
|
||||
MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
|
||||
order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes
|
||||
`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert
|
||||
time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different
|
||||
rounding points + convert-time vs runtime `exp(A_log)`.
|
||||
|
||||
## A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
|
||||
llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per
|
||||
`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
|
||||
rounding.
|
||||
|
||||
## A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs matched (they never do)
|
||||
Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical
|
||||
inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
|
||||
llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points.
|
||||
|
||||
### Inventory summary
|
||||
|
||||
| Source | bit-identical? | divergence size |
|
||||
|---|---|---|
|
||||
| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** |
|
||||
| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree |
|
||||
| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
|
||||
| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
|
||||
| MRoPE | NO | ~ULP-1e-6 |
|
||||
| gate softplus/exp | NO | rounding points |
|
||||
| flash attention | NO | softmax tiling |
|
||||
| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
|
||||
|
||||
Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are
|
||||
not close-to-equal at the bit level; they agree only to a few significant digits.
|
||||
|
||||
---
|
||||
|
||||
# PART B - The decisive f32-state correction (proof from source)
|
||||
|
||||
The byte-gate inferred vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the
|
||||
"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters
|
||||
were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified:
|
||||
|
||||
1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`,
|
||||
and **`text_config.mamba_ssm_dtype = "float32"`**.
|
||||
2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and
|
||||
`"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`.
|
||||
3. `Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562):
|
||||
`mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if
|
||||
`cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes
|
||||
**`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**.
|
||||
4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()`
|
||||
(vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`.
|
||||
It runs **before** any layer/model state-dtype resolution.
|
||||
5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager
|
||||
--gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log`
|
||||
non-default args confirm none). So the override fires and the value is "float32".
|
||||
6. State dtype resolution reads the **already-overridden** value:
|
||||
- `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16,
|
||||
cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`.
|
||||
- `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes
|
||||
`vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto".
|
||||
- `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**;
|
||||
temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]`
|
||||
= **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`).
|
||||
7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple
|
||||
element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel
|
||||
round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))`
|
||||
with `p_ht.dtype == initial_state.dtype == float32`.
|
||||
|
||||
**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
|
||||
to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the
|
||||
override at step 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny
|
||||
byte stream).
|
||||
|
||||
## Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
|
||||
|
||||
| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 peak |
|
||||
|---|---|---|---|---|---|
|
||||
| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** |
|
||||
| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** |
|
||||
|
||||
- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim
|
||||
"more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte
|
||||
count vLLM is at ~81%, i.e. **more** efficient than llama.
|
||||
- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap
|
||||
is bandwidth **efficiency** on equal bytes, not width.
|
||||
- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*.
|
||||
|
||||
---
|
||||
|
||||
# PART C - The f32-preserving lever, and where recompute/bf16 land
|
||||
|
||||
Since vLLM hits ~81% on the **same f32 byte volume** llama runs at ~74%, the missed lever is **raising
|
||||
llama's `gated_delta_net_cuda` achieved BW 74% -> ~81%**, bit-exact, NOT dtype width:
|
||||
- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks/128 thr, each warp owns ONE state
|
||||
column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1),
|
||||
each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each,
|
||||
a butterfly reduce/token) is the likely ~7-point deficit. Retune toward fewer/larger blocks (more
|
||||
columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load
|
||||
width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**.
|
||||
- Upper bound: 74%->81% on ~50% of the step ~= +17 ms/step (384 -> ~367), ~+5% -> ~351 tok/s (~90% of
|
||||
vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
|
||||
|
||||
**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the
|
||||
occupancy tune above is the real bit-exact win:**
|
||||
- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed
|
||||
2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
|
||||
full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS.
|
||||
- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full
|
||||
dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full
|
||||
read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after
|
||||
replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor,
|
||||
at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed
|
||||
bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you
|
||||
need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated.
|
||||
- *2nd stream / overlap / pipelining:* DRAM BW (273) is one shared resource; the whole decode step is
|
||||
uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so
|
||||
overlapping two BW-bound phases sums to ~0. FAILS.
|
||||
- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at
|
||||
decode); attention/materialization-free form is O(t) over the prefix. FAILS.
|
||||
|
||||
**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the
|
||||
recurrence (halve 805 -> ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32
|
||||
precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only parity-class
|
||||
throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj
|
||||
MMQ + activation fold); none require bf16.
|
||||
|
||||
---
|
||||
|
||||
# PART D - Verdict on B4 + the meaningful weaker form
|
||||
|
||||
## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
|
||||
|
||||
1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
|
||||
`g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits
|
||||
cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated
|
||||
expression*, differing **even given identical f32 state and inputs**.
|
||||
2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce,
|
||||
fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer.
|
||||
Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM
|
||||
inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver
|
||||
**cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every*
|
||||
op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a
|
||||
meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5).
|
||||
|
||||
## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
|
||||
Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap
|
||||
(A1-A2), the **argmax** frequently coincides whenever the top-1/top-2 logit margin exceeds the
|
||||
cross-engine noise. This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement /
|
||||
KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it
|
||||
**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so
|
||||
agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8
|
||||
gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison,
|
||||
weakening greedy agreement for this model specifically.
|
||||
|
||||
**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact
|
||||
gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify;
|
||||
bf16 state does not and must be KL/PPL-gated vs llama's own f32).
|
||||
|
||||
Assisted-by: Claude:opus-4.8 [Claude Code]
|
||||
Reference in New Issue
Block a user