docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction

Synthesize the cross-engine bit-exactness and f32-preserving-parity study.
Resolve the contradiction between sub-agents (one f32, two bf16) by reading
every link of vLLM's state-dtype chain on live source:

- config.json text_config.mamba_ssm_dtype = "float32" (both served models)
- cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override
- vllm.py __post_init__ -> try_verify_and_update_config (config finalize)
- Qwen3_5ForConditionalGenerationConfig override copies "float32" into
  mamba_ssm_cache_dtype before state-dtype resolution
- mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16)
- qwen_gdn_linear_attn allocates the temporal cache at f32

Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED
(equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED
(vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune
of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever);
B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand
gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction
trees, plus general FP non-associativity). bf16 temporal state degrades BELOW
vLLM's f32 recurrent precision -> an over-clock, not a parity requirement.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-25 16:55:25 +00:00
parent 17855735c7
commit 5cec1a6a21


@@ -0,0 +1,339 @@
# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4
(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`,
vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded
> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM)
> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")`
> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that
> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is
> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width
> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.
>
> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Three sub-agents reached
> *conflicting* dtype readings (one f32, two bf16). The contradiction was resolved by reading every
> link of the chain directly, not by majority vote. All eight links confirm **float32 temporal
> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models);
> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts
> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only**
> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`,
> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at
> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps
> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` ->
> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`)
> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires**
> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch ->
> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16);
> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd)
> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`).
> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and
> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration
> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound
> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only
> reading consistent with both source *and* the measured time.
## Headline (two answers)
1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any
two distinct engines.** B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two
kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different
reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of
that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different
accumulation - a >>ULP divergence in every projection and the LM head.
2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever
was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the
same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal
bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact
vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence),
not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
---
# The five questions, answered (synthesis)
**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources each make
bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the
activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16
(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every*
projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))`
(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded
into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees*
(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**.
A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal
rounding. Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct
GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate
(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds**
with length (once one token differs the SSM/KV states fork), and is *weaker here* than a
same-precision run because of the A8-vs-A4 GEMM gap.
**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven +
re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM
81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda`
74% -> ~81%** - it launches 196608 tiny blocks (one state column per warp, butterfly-reduce per token); fold toward
fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the
per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero
precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it;
worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible
one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
don't need to eliminate the write). bf16 remains the **only** lever that goes *ahead* of vLLM on the
recurrence (~440 tok/s) - an **over-clock**, not a requirement.
**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This
corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on
the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small
**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal**
state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate
precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state
precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
which costs llama precision only on a tiny stream and buys almost no BW.
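
The mantissa gap can be made concrete with a minimal sketch. It emulates bf16 storage by rounding f32 values to their upper 16 bits (round-to-nearest-even) and measures the resulting relative error. `to_bf16` is a hypothetical helper written for illustration, not code from either engine, and it ignores NaN/Inf handling.

```python
import numpy as np

def to_bf16(x):
    """Round f32 values to bf16 precision (round-to-nearest-even), returned as f32."""
    u = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit, then clear the low 16 bits.
    bias = ((u >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    return ((u + bias) & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
state = rng.uniform(1.0, 2.0, size=10_000).astype(np.float32)  # synthetic state values

stored = to_bf16(state)
rel_err = np.abs(stored.astype(np.float64) - state.astype(np.float64)) / state
max_rel_err = float(rel_err.max())

print(f"max relative error after bf16 storage: {max_rel_err:.3e}")
print(f"bf16 unit roundoff 2^-8 = {2.0**-8:.3e}, f32 unit roundoff 2^-24 = {2.0**-24:.3e}")
```

The error is paid on every store of the temporal state, which is why the text treats bf16 temporal state as a KL/PPL-gated trade rather than a bit-exact change.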
**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar -
never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)**
throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement,
KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32
registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable
is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity"
must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the
precision-for-speed trade it is.
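
A minimal sketch of the three quality-bar metrics named above, assuming per-step logits from both engines are available as `[steps, vocab]` arrays. The function names and the synthetic logits are assumptions for illustration; this is not the bench harness.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def parity_metrics(logits_a, logits_b, targets):
    """logits_*: [steps, vocab] from two engines; targets: [steps] reference token ids."""
    la = log_softmax(np.asarray(logits_a, dtype=np.float64))
    lb = log_softmax(np.asarray(logits_b, dtype=np.float64))
    top1 = float((la.argmax(-1) == lb.argmax(-1)).mean())
    kl = float((np.exp(la) * (la - lb)).sum(-1).mean())      # KL(a || b) per step
    idx = np.arange(len(targets))
    ppl_a = float(np.exp(-la[idx, targets].mean()))
    ppl_b = float(np.exp(-lb[idx, targets].mean()))
    return {"top1_agreement": top1, "kl_per_step": kl, "ppl_delta": ppl_a - ppl_b}

# Synthetic stand-ins: engine B is engine A plus small noise.
rng = np.random.default_rng(0)
logits_a = rng.normal(size=(64, 1000))
logits_b = logits_a + rng.normal(scale=1e-2, size=logits_a.shape)
targets = rng.integers(0, 1000, size=64)

same = parity_metrics(logits_a, logits_a, targets)
noisy = parity_metrics(logits_a, logits_b, targets)
print(same)
print(noisy)
```

A gate built on these would set thresholds per metric; the thresholds themselves are not specified in this study.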
**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the
"bf16 is required" framing - keep the bit-exact levers.**
- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient
than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient).
- **B2 REFUTED** - not a f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32
occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP
non-associativity across distinct kernel stacks).
- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it.
**Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the
parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond
vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a
parity-matching change.
---
# PART A - Divergence inventory (per source: bit-identical vs close)
Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise
conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers:
RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path
(`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**,
block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by
`mmq_y`/`mmq_x` + the warp-MMA fragment layout.
- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations,
w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback
(**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`.
- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or
16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order
differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest
divergence and is present in every projection + the LM head, so logits differ materially on its own.
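
The operand-precision gap in (a) can be illustrated with a simplified round-trip sketch: the same activation vector is quantized once with an 8-bit per-block scheme (q8_1-like, block 32, scale = amax/127) and once with a 4-bit e2m1 grid (block 16, scale = amax/6). This is an assumption-laden toy, not either engine's kernel: it leaves the block scales unrounded (no ue4m3/e4m3 scale quantization, no global tensor scale) and measures only activation reconstruction error, not GEMM output error.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 e2m1.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def fake_quant_int8(x, block=32):
    """Per-block symmetric int8 round trip (scale = amax / 127)."""
    xb = x.reshape(-1, block)
    d = np.abs(xb).max(axis=1, keepdims=True) / 127.0
    d = np.where(d == 0, 1.0, d)
    return (np.round(xb / d) * d).reshape(x.shape).astype(np.float32)

def fake_quant_fp4(x, block=16):
    """Per-block e2m1 round trip with an unrounded scale = amax / 6."""
    xb = x.reshape(-1, block)
    d = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    d = np.where(d == 0, 1.0, d)
    mag = np.abs(xb) / d
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)   # nearest grid point
    return (np.sign(xb) * E2M1_GRID[idx] * d).reshape(x.shape).astype(np.float32)

def rel_rms(x, xq):
    return float(np.sqrt(np.mean((x - xq) ** 2) / np.mean(x ** 2)))

rng = np.random.default_rng(0)
act = rng.normal(size=4096).astype(np.float32)   # synthetic activation row

err_a8 = rel_rms(act, fake_quant_int8(act))
err_a4 = rel_rms(act, fake_quant_fp4(act))
print(f"relative RMS error, 8-bit activations: {err_a8:.3e}")
print(f"relative RMS error, 4-bit activations: {err_a4:.3e}")
```

Because the two engines feed differently rounded operands into every projection, their GEMM outputs cannot agree to the last bit regardless of accumulation order.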
## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
Both single-pass over an **f32** state (Part B). llama: `gated_delta_net.cu`
`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py`
`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`.
With S[k][v] (llama, transposed) == h[v][k] (vLLM):
```
llama: kv[v] = Sigma_k S_old[k][v]*k[k]            # OLD state; g applied AFTER the sum
       delta = (v[v] - g*kv[v])*beta
       S_new = g*S_old + k(x)delta
       o[v]  = Sigma_k S_new[k][v]*q[k]
vLLM:  h'    = g*h_old                             # decay rounded into EVERY element first
       kv[v] = Sigma_k h'[v][k]*k[k] = Sigma_k round(g*h_old)*k
       b_v   = (v[v] - kv[v])*beta
       h_new = h' + b_v(x)k
       o[v]  = Sigma_k h_new[v][k]*q[k]
```
Algebraically identical (g scalar). **Numerically not**, for two structural reasons that survive even
with identical f32 state and identical inputs (the first survives even an identical reduction tree):
- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction);
vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction).
Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step
butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis).
**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.**
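
Both effects can be reproduced in a few lines of NumPy on synthetic f32 data. The sketch below is illustrative only: it mimics the two operation orders and two simple reduction trees, not the actual `gated_delta_net_cuda` or Triton kernels, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K = V = 128
S = rng.random((K, V), dtype=np.float32)   # synthetic state, indexed S[k][v]
k = rng.random(K, dtype=np.float32)
g = np.float32(0.9)                        # scalar decay

# (1) Reassociation, using the SAME reduction routine on both sides.
kv_decay_after = g * (S * k[:, None]).sum(axis=0)        # g * Sigma round(S*k)
kv_decay_before = ((g * S) * k[:, None]).sum(axis=0)     # Sigma round(round(g*S)*k)
reassoc_mismatch = int((kv_decay_after != kv_decay_before).sum())

# (2) Same expression, two different reduction trees.
def sequential_sum(col):
    acc = np.float32(0.0)
    for x in col:
        acc = acc + x                      # f32 + f32 stays f32
    return acc

def butterfly_sum(col):
    x = col.copy()
    while x.size > 1:                      # size is a power of two here
        half = x.size // 2
        x = x[:half] + x[half:]
    return x[0]

prod = S * k[:, None]
seq = np.array([sequential_sum(prod[:, v]) for v in range(V)], dtype=np.float32)
bfly = np.array([butterfly_sum(prod[:, v]) for v in range(V)], dtype=np.float32)
tree_mismatch = int((seq != bfly).sum())

print(f"columns differing from reassociation alone: {reassoc_mismatch}/{V}")
print(f"columns differing from reduction tree alone: {tree_mismatch}/{V}")
```

The results stay numerically close in every column, but a bit-level comparison (md5) fails as soon as a single column differs.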
## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM
`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the
*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.)
## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold
(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log:
`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`.
Different variance reduction tree / eps placement / fusion boundary.
## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes
`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert
time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different
rounding points + convert-time vs runtime `exp(A_log)`.
## A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per
`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
rounding.
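
The tiling effect can be sketched with a streaming (online) softmax run at two tile sizes over the same synthetic scores. This is a simplified single-query f32 model for illustration, not the `fattn-*`, FlashInfer or FA2 kernels; the function name and tile sizes are assumptions.

```python
import numpy as np

def online_softmax_pv(scores, values, tile):
    """Streaming softmax(scores) @ values over key tiles with f32 running max/sum."""
    m = np.float32(-np.inf)                          # running max
    l = np.float32(0.0)                              # running denominator
    acc = np.zeros(values.shape[1], dtype=np.float32)
    for start in range(0, scores.size, tile):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        m_new = np.maximum(m, s.max())
        scale = np.exp(m - m_new)                    # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
scores = rng.normal(size=1024).astype(np.float32)
values = rng.normal(size=(1024, 64)).astype(np.float32)

out_small = online_softmax_pv(scores, values, tile=32)
out_large = online_softmax_pv(scores, values, tile=128)

# float64 reference computed in one shot.
w = np.exp(scores.astype(np.float64) - scores.max())
ref = (w / w.sum()) @ values.astype(np.float64)

print("elements differing between tilings:", int((out_small != out_large).sum()))
print("max abs error vs reference:", float(np.abs(out_small - ref).max()))
```

Each tile boundary introduces a rescale and an extra rounding step, so two tilings agree closely with the reference but are not expected to agree with each other bit for bit.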
## A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs match (they never do)
Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical
inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points.
### Inventory summary
| Source | bit-identical? | divergence size |
|---|---|---|
| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** |
| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree |
| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
| MRoPE | NO | ~ULP-1e-6 |
| gate softplus/exp | NO | rounding points |
| flash attention | NO | softmax tiling |
| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are
not close-to-equal at the bit level; they agree only to a few significant digits.
---
# PART B - The decisive f32-state correction (proof from source)
The byte-gate inferred vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the
"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters
were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified:
1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`,
and **`text_config.mamba_ssm_dtype = "float32"`**.
2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and
`"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`.
3. `Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562):
`mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if
`cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes
**`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**.
4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()`
(vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`.
It runs **before** any layer/model state-dtype resolution.
5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager
--gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log`
non-default args confirm none). So the override fires and the value is "float32".
6. State dtype resolution reads the **already-overridden** value:
- `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16,
cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`.
- `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes
`vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto".
- `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**;
temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]`
= **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`).
7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple
element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel
round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))`
with `p_ht.dtype == initial_state.dtype == float32`.
**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the
override at step 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny
byte stream).
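
The ordering argument in steps 3 to 6 can be modeled as a toy. The classes and functions below are hypothetical stand-ins written for this sketch; they mirror the logic described above using plain strings and are not vLLM's actual code or signatures.

```python
from dataclasses import dataclass

@dataclass
class CacheConfig:                       # hypothetical stand-in, not vLLM's class
    mamba_cache_dtype: str = "auto"
    mamba_ssm_cache_dtype: str = "auto"

def apply_model_override(cache_config, hf_mamba_ssm_dtype):
    """Step 3: copy the checkpoint's mamba_ssm_dtype over an 'auto' default."""
    if cache_config.mamba_ssm_cache_dtype == "auto":
        cache_config.mamba_ssm_cache_dtype = hf_mamba_ssm_dtype

def resolve_state_dtypes(model_dtype, cache_config):
    """Step 6: 'auto' falls back to the model dtype; anything else is used as given."""
    conv = (model_dtype if cache_config.mamba_cache_dtype == "auto"
            else cache_config.mamba_cache_dtype)
    temporal = (model_dtype if cache_config.mamba_ssm_cache_dtype == "auto"
                else cache_config.mamba_ssm_cache_dtype)
    return conv, temporal

# The refuted reading: resolve against the raw default, skipping the override.
skipped = resolve_state_dtypes("bfloat16", CacheConfig())

# The verified order: override at config finalization, then resolve.
cfg = CacheConfig()
apply_model_override(cfg, "float32")
verified = resolve_state_dtypes("bfloat16", cfg)

print("override skipped :", skipped)     # (conv, temporal)
print("override applied :", verified)
```

The two readings differ only in whether the override runs before resolution, which is exactly the link the bf16 readings did not trace.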
## Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 peak |
|---|---|---|---|---|---|
| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** |
| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** |
- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim
"more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte
count vLLM is at ~81%, i.e. **more** efficient than llama.
- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap
is bandwidth **efficiency** on equal bytes, not width.
- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*.
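
The table can be re-derived from the measured durations and the state shape. The sketch below assumes the shape quoted in Part C (48 heads, 128 x 128 state per head, 128 sequences) and one full read plus one full write per call; the variable names are illustrative.

```python
H, S_K, S_V, N_SEQS = 48, 128, 128, 128   # heads, state rows, state cols, decode batch
PEAK_GBPS = 273.0                         # GB10 peak DRAM bandwidth used in the table

def state_bytes_rw(bytes_per_elem):
    """One full read plus one full write of the temporal state per call."""
    return 2 * H * S_K * S_V * N_SEQS * bytes_per_elem

def eff_bw_gbps(nbytes, duration_ms):
    return nbytes / (duration_ms * 1e-3) / 1e9

rows = {
    "llama f32 @ 3.98 ms": eff_bw_gbps(state_bytes_rw(4), 3.98),
    "vLLM  f32 @ 3.62 ms": eff_bw_gbps(state_bytes_rw(4), 3.62),
    "vLLM bf16 @ 3.62 ms (refuted reading)": eff_bw_gbps(state_bytes_rw(2), 3.62),
}
for name, bw in rows.items():
    print(f"{name}: {bw:6.1f} GB/s = {100 * bw / PEAK_GBPS:4.1f}% of peak")
```

Only the f32 byte count puts the vLLM kernel in the bandwidth regime expected of a tuned BW-bound kernel, matching the numeric corroboration in the correction notice.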
---
# PART C - The f32-preserving lever, and where recompute/bf16 land
Since vLLM hits ~81% on the **same f32 byte volume** on which llama runs at ~74%, the missed lever is
**raising llama's `gated_delta_net_cuda` achieved BW 74% -> ~81%**, bit-exactly; it is NOT dtype width:
- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks/128 thr, each warp owns ONE state
column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1),
each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work per warp,
a butterfly reduce/token) is the likely cause of the ~7-point deficit. Retune toward fewer/larger blocks (more
columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load
width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**.
- Upper bound: 74%->81% on ~50% of the step ~= -17 ms/step (384 -> ~367), ~+5% -> ~351 tok/s (~90% of
vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
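
The grid counts and the upper-bound estimate above can be checked with back-of-envelope arithmetic. The sketch assumes the recurrence is exactly 50% of a 384 ms step and that recurrence time scales inversely with achieved bandwidth; it lands within about 1 tok/s of the rounded figures quoted above.

```python
llama_blocks = 48 * 128 * 32        # (H, n_seqs, ceil(S_v/4))
vllm_programs = 4 * 6144            # (NV, B*HV)
print(f"llama blocks: {llama_blocks}, vLLM programs: {vllm_programs}, "
      f"ratio: {llama_blocks // vllm_programs}x")

step_ms, tok_s = 384.0, 335.0       # current llama step time and throughput (from the text)
vllm_tok_s = 391.0
recurrence_share = 0.50             # assumed share of the step spent in the recurrence

# Raising achieved BW from 74% to 81% shrinks recurrence time by (1 - 74/81).
saved_ms = step_ms * recurrence_share * (1 - 74 / 81)
new_step_ms = step_ms - saved_ms
new_tok_s = tok_s * step_ms / new_step_ms

print(f"saved: {saved_ms:.1f} ms/step, new step: {new_step_ms:.0f} ms")
print(f"throughput: {new_tok_s:.0f} tok/s ({100 * new_tok_s / vllm_tok_s:.0f}% of vLLM)")
```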
**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the
occupancy tune above is the real bit-exact win:**
- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed
2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS.
- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full
dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full
read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after
replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor,
at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed
bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you
need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated.
- *2nd stream / overlap / pipelining:* DRAM BW (273) is one shared resource; the whole decode step is
uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so
overlapping two BW-bound phases sums to ~0. FAILS.
- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at
decode); attention/materialization-free form is O(t) over the prefix. FAILS.
**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the
recurrence (halving the 805 MB/call state traffic to ~402 MB -> ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32
precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only parity-class
throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj
MMQ + activation fold); none require bf16.
---
# PART D - Verdict on B4 + the meaningful weaker form
## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
`g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits
cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated
expression*, differing **even given identical f32 state and inputs**.
2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce,
fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer.
Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM
inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver
**cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every*
op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a
meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5).
## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap
(A1-A2), the **argmax** frequently coincides whenever the top-1/top-2 logit margin exceeds the
cross-engine noise. This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement /
KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it
**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so
agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8
gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison,
weakening greedy agreement for this model specifically.
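
The margin condition can be stated as a simple sufficient test: if every logit moves by at most `noise_bound`, the argmax cannot change when the top-1/top-2 margin exceeds twice that bound. The sketch below is illustrative; `noise_bound` is a hypothetical per-logit cross-engine error that would have to be measured, and this study does not supply a value for it.

```python
import numpy as np

def argmax_is_stable(logits, noise_bound):
    """True if no perturbation with |delta_i| <= noise_bound can change the argmax."""
    top2 = np.partition(np.asarray(logits, dtype=np.float64), -2)[-2:]  # two largest, ascending
    margin = top2[1] - top2[0]
    return bool(margin > 2.0 * noise_bound)

wide_margin = argmax_is_stable([2.0, 1.0, 0.5], noise_bound=0.4)    # margin 1.0 vs 0.8
narrow_margin = argmax_is_stable([2.0, 1.9, 0.5], noise_bound=0.1)  # margin 0.1 vs 0.2

print("wide-margin step stable:  ", wide_margin)
print("narrow-margin step stable:", narrow_margin)
```

Steps that fail this test are the low-margin positions where greedy tokens can flip and the sequences fork.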
**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact
gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify;
bf16 state does not and must be KL/PPL-gated vs llama's own f32).
Assisted-by: Claude:opus-4.8 [Claude Code]