Synthesize the cross-engine bit-exactness and f32-preserving-parity study. Resolve the contradiction between sub-agents (one f32, two bf16) by reading every link of vLLM's state-dtype chain on live source: - config.json text_config.mamba_ssm_dtype = "float32" (both served models) - cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override - vllm.py __post_init__ -> try_verify_and_update_config (config finalize) - Qwen3_5ForConditionalGenerationConfig override copies "float32" into mamba_ssm_cache_dtype before state-dtype resolution - mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16) - qwen_gdn_linear_attn allocates the temporal cache at f32 Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED (equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED (vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever); B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction trees, plus general FP non-associativity). bf16 temporal state degrades BELOW vLLM's f32 recurrent precision -> an over-clock, not a parity requirement. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
24 KiB
Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
Model: q36-27b-nvfp4 (dense, Qwen3_5ForConditionalGeneration) / q36-35b-a3b-nvfp4
(MoE, Qwen3_5MoeForConditionalGeneration). Engines: llama dev ~/llama-paged-dev,
vLLM 0.23.0 ~/vllm-bench. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
CORRECTION NOTICE (supersedes the earlier draft of this file). A prior pass concluded "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM) width" (old B2/B3). That is wrong. It read
gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")as auto->model-dtype=bf16, but it did not trace the Qwen3.5-specific config override that reassignsmamba_ssm_cache_dtypefrom"auto"to"float32"before the state dtype is resolved. vLLM stores this model's gated-DeltaNet temporal state in float32, the same width as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.INDEPENDENT RE-VERIFICATION (this pass, live DGX source). Two separate sub-agents reached opposite dtype readings (one f32, two bf16). The contradiction was resolved by reading every link of the chain directly, not by majority vote. All eight links confirm float32 temporal state:
config.json text_config.mamba_ssm_dtype = "float32"(both served models);config/cache.py:129defaultmamba_ssm_cache_dtype = "auto"; the bench scripts (h2h_dense_vllm.sh,h2h_moe_serve_vllm.sh,serve_nvfp4.sh) pass only--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096(no--mamba-ssm-cache-dtype, no--dtype);config/vllm.py:847 __post_init__->:856 try_verify_and_update_config()(runs at finalize, before any state-dtype resolution);MODELS_CONFIG_MAP(models/config.py:622-623) maps bothQwen3_5ForConditionalGenerationandQwen3_5MoeForConditionalGeneration->Qwen3_5ForConditionalGenerationConfig; its override body (config.py:546-549)if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtypefires (value "float32");mamba_utils.py:91-94then takes the!= "auto"branch ->temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32(conv stays bf16);qwen_gdn_linear_attn.py:1101_, state_dtype = self.get_state_dtype()takes the temporal (2nd) tuple element and allocates the cache (:1136) at f32;ssm_state = self_kv_cache[1](:1316/1596/1664). The two bf16 sub-agent readings are refuted - they stopped at thecache.pydefault "auto" and never traced the__post_init__override. Numeric corroboration: at the measured vLLM duration 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only reading consistent with both source and the measured time.
Headline (two answers)
-
Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any two distinct engines. B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two kernels evaluate an algebraically reassociated expression (
g.SigmavsSigma.g) on different reduction trees, so they diverge even if both ran pure f32 with identical inputs. On top of that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different accumulation - a >>ULP divergence in every projection and the LM head. -
bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever was missed. vLLM reaches its throughput with an f32 GDN state (proven). Both engines move the same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-efficiency gap on equal bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is bit-exact vs llama's own f32. bf16 state is an optional over-clock (goes AHEAD of vLLM on the recurrence), not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
The five questions, answered (synthesis)
Q1. Can llama be BIT-EXACT with vLLM? NO. Two binding (>>ULP) divergence sources make
bit-identical logits impossible on their own: (A1) the FP4 GEMM - llama MMQ quantizes the
activation to q8_1 (8-bit) while vLLM runs cutlass w4a4 (4-bit acts) or marlin w4a16
(16-bit acts); different operand precision + accumulation order -> ~1e-2 relative error in every
projection and the LM head; (A2) the GDN recurrence - llama computes g*(Sigma round(S*k))
(scalar decay after the reduction) while vLLM computes Sigma round(round(g*h)*k) (decay rounded
into each element before the reduction): an IEEE-754 reassociation on different reduction trees
(warp butterfly vs Triton tl.sum) that diverges even with identical pure-f32 state and inputs.
A dozen further ops (L2/RMSNorm, MRoPE, gate exp, flash-attn softmax) add close-but-not-equal
rounding. Cross-engine bit-exactness is impossible in general (FP non-associativity across distinct
GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
within one engine. Weaker form reachable: greedy top-1 token agreement is the right gate
(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), compounds
with length (once one token differs the SSM/KV states fork), and is weaker here than a
same-precision run because of the A8-vs-A4 GEMM gap.
Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
and bf16 is not even required for parity. vLLM carries the same f32 temporal state (proven +
re-verified), so the recurrence gap is bandwidth EFFICIENCY on equal f32 bytes (llama 74% vs vLLM
81% of GB10 peak), ~10% per call, not a 2x width gap. The lever: retune gated_delta_net_cuda
74% -> ~81% - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward
fewer/larger BV x BK tiles + vectorized f32x4 loads + better row coalescing, keeping the
per-column reduction order -> BIT-EXACT vs llama's own f32 (md5-gateable). Cost vs bf16: zero
precision risk and bit-exact, but it can only match vLLM's recurrence BW (81%), never beat it;
worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
chunked recurrence) FAILS to beat it; recompute is bit-exact but only ties the irreducible
one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
don't need to eliminate the write). bf16 remains the only lever that goes ahead of vLLM on the
recurrence (~440 tok/s) - an over-clock, not a requirement.
Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM. (This
corrects the precision-ground-truth sub-agent's "matching, not degrading" claim, which rested on
the refuted bf16 reading.) vLLM keeps the temporal/recurrent state in f32; only its small
conv state is bf16 (llama keeps conv f32, so llama is more precise there). So bf16 temporal
state in llama (~8 mantissa bits) sits below vLLM's f32 temporal (~24 bits) - it is a deliberate
precision-for-speed trade, KL/PPL-gated vs llama's own f32 and a step under vLLM's recurrent-state
precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
which costs llama precision only on a tiny stream and buys almost no BW.
Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar - never identical bits. Bit-identical logits are impossible cross-engine, so "parity" = (a) throughput (tok/s in the harness) at (b) a quality bar measured by top-1 greedy agreement, KL(llama||vLLM)/step, and PPL-delta, never md5. Both engines already run the recurrence math in f32 registers; at equal precision (llama f32 temporal == vLLM f32 temporal) the only open variable is throughput, and that gap is closable bit-exactly (Q2). If llama adopts bf16 temporal, "parity" must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the precision-for-speed trade it is.
Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the "bf16 is required" framing - keep the bit-exact levers.
- B1 TRUE (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient than vLLM (41%)" is REFUTED (41% was the bf16 artifact; vLLM is ~81%, more efficient).
- B2 REFUTED - not a f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
- B3 REFUTED as written - vLLM reaches its throughput with f32 state; a bit-exact f32 occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
- B4 CONFIRMED - impossible, on two independent grounds (structural A1+A2; general FP non-associativity across distinct kernel stacks).
- Plan disposition: do not overturn the conv-fusion (0021) bit-exact lever - keep it.
Re-prioritize the bit-exact f32 occupancy/coalescing retune of
gated_delta_net_cudaas the parity path. Treat bf16 temporal state as an explicitly-gated over-clock for going beyond vLLM, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a parity-matching change.
PART A - Divergence inventory (per source: bit-identical vs close)
Per decode layer the two engines run different kernels for: FP4 GEMMs (proj + LM head), depthwise conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers: RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
- llama: MMQ (
mmq.cuhblock_fp4_mmq, nvfp4 block=16, 4x ue4m3 sub-scales). Host path (ggml-cuda.cu~1955-2014) quantizes the activation (src1) to q8_1 (block_q8_1_mmq, 8-bit, block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set bymmq_y/mmq_x+ the warp-MMA fragment layout. - vLLM:
compressed_tensors_w4a4_nvfp4-> cutlass FP4 GEMM on Blackwell (4-bit activations, w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback (16-bit activations, w4a16, dequant->bf16 then bf16 GEMM).apply_weights->self.kernel. - Verdict: not close. (a) Operand precision differs: llama 8-bit acts vs vLLM 4-bit (cutlass) or 16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest divergence and is present in every projection + the LM head, so logits differ materially on its own.
A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
Both single-pass over an f32 state (Part B). llama: gated_delta_net.cu
gated_delta_net_cuda<128,KDA=false>; vLLM: fused_recurrent.py
fused_recurrent_gated_delta_rule_packed_decode_kernel. Scalar-gate (GDA) path, g.ne0==1.
With S[k][v] (llama, transposed) == h[v][k] (vLLM):
llama: kv[v] = Sigma_k S_old[k][v]*k[k] # OLD state; g applied AFTER the sum
delta = (v[v] - g*kv[v])*beta; S_new = g*S_old + k(x)delta; o[v]=Sigma_k S_new[k][v]*q[k]
vLLM: h' = g*h_old # decay rounded into EVERY element first
kv[v]=Sigma_k h'[v][k]*k[k]=Sigma_k round(g*h_old)*k; b_v=(v[v]-kv[v])*beta
h_new = h' + b_v(x)k; o[v]=Sigma_k h_new[v][k]*q[k]
Algebraically identical (g scalar). Numerically not, for two structural reasons that survive even with identical f32 state, identical inputs, and identical reduction tree:
- Reassociation: llama forms
g*(Sigma round(S*k))(scalar multiply after the reduction); vLLM formsSigma round(round(g*h)*k)(decay rounded into each element before the reduction). Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob. - Different reduction trees: llama
warp_reduce_sum<32>(4 sequential per-lane FMAs + 5-step butterfly) vs vLLMtl.sum(...,1)(Triton tree over the 128-wide BK axis). Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.
A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
llama ggml_ssm_conv (ascending-j f32 FMA) + ggml_silu, conv state cached f32. vLLM
causal_conv1d_update (Triton) + SiLU, conv state cached bf16 (conv_state_dtype = bf16; only the
temporal SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
(Patch 0021 fuses llama's chain bit-exactly vs llama's own f32 path, not vs vLLM.)
A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
L2-norm: llama standalone ggml_l2_norm (f32 tree) vs vLLM l2norm_fwd/in-kernel fold
(USE_QK_L2NORM_IN_KERNEL). RMSNorm: llama ggml_rms_norm vs vLLM vllm_c fused kernel (run log:
rms_norm=['vllm_c','native']); gated out-norm build_norm_gated=RMS*SiLU(z) vs RMSNormGated.
Different variance reduction tree / eps placement / fusion boundary.
A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
MRoPE: ggml_rope_multi (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
order). Gate: vLLM computes -exp(A_log)*softplus(a+dt) then exp in-kernel; llama computes
softplus(alpha+ssm_dt)*ssm_a as split graph ops with ssm_a baking -exp(A_log) at GGUF-convert
time (rounded once), writes/reloads the intermediate, expf in-kernel. Same algebra, different
rounding points + convert-time vs runtime exp(A_log).
A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
llama ggml_flash_attn_ext -> fattn-mma-f16/fattn-vec (online softmax, F16/F32 PV accum per
GGML_PREC) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
rounding.
A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs matched (they never do)
Both ultimately use the same hardware expf/__nv_expf; the primitives could match given identical
inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
llama's separate ops (run log fuse_act_quant=True), moving the rounding points.
Inventory summary
| Source | bit-identical? | divergence size |
|---|---|---|
| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | NO | >>ULP (~1e-2) |
| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | NO (provable even in f32) | reassoc + tree |
| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
| MRoPE | NO | ~ULP-1e-6 |
| gate softplus/exp | NO | rounding points |
| flash attention | NO | softmax tiling |
| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are not close-to-equal at the bit level; they agree only to a few significant digits.
PART B - The decisive f32-state correction (proof from source)
The byte-gate inferred vLLM's GDN temporal state is bf16 (402 MB/call, 41% peak) and built the "bf16-width is the lever" case on it. The byte count was inferred from the dtype; ncu byte counters were blocked, so only the duration (3.62 ms/call) was measured. The dtype inference is falsified:
config.json:architectures=["Qwen3_5ForConditionalGeneration"],text_config.dtype=bfloat16, andtext_config.mamba_ssm_dtype = "float32".models/config.py:590 MODELS_CONFIG_MAPmaps"Qwen3_5ForConditionalGeneration"(line 622) and"Qwen3_5MoeForConditionalGeneration"(623) toQwen3_5ForConditionalGenerationConfig.Qwen3_5ForConditionalGenerationConfig.verify_and_update_config(config.py:536-562):mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")(="float32"); ifcache_config.mamba_ssm_cache_dtype == "auto"(the default) it executescache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype-> sets it to "float32".- This override runs at config finalization:
config/vllm.py:856->try_verify_and_update_config()(vllm.py:1880-1900) looks up the arch inMODELS_CONFIG_MAPand callsverify_and_update_config. It runs before any layer/model state-dtype resolution. - The bench left it default:
h2h_dense_vllm.sh=vllm serve .../q36-27b-nvfp4-vllm --enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096(no--mamba-ssm-cache-dtype;dl-logs/vllm_dense.lognon-default args confirm none). So the override fires and the value is "float32". - State dtype resolution reads the already-overridden value:
gdn/base.py:53-57get_state_dtype()->gated_delta_net_state_dtype(model_dtype=bf16, cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32").qwen3_5.py:678 get_mamba_state_dtype_from_configlikewise passesvllm_config.cache_config.mamba_ssm_cache_dtype(= "float32", post-override) - not a raw "auto".mamba_utils.py _mamba_state_dtype: conv_state =get_kv_cache_torch_dtype("auto", bf16)= bf16; temporal_state, sincemamba_ssm_cache_dtype != "auto", =STR_DTYPE_TO_TORCH_DTYPE["float32"]= torch.float32 (key verified:torch_utils.py:33 "float32": torch.float32).
qwen_gdn_linear_attn.py:1101_, state_dtype = self.get_state_dtype()takes the second tuple element (temporal) = float32, allocates the cachedtype=state_dtype. The packed_decode kernel round-trips f32:b_h = tl.load(p_h0).to(f32)...tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))withp_ht.dtype == initial_state.dtype == float32.
=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
to llama's f32 state. The earlier "bf16" reading hardcoded the third arg as "auto" and missed the
override at step 3-4. Only the small conv state is bf16 in vLLM (f32 in llama: divergence A3, tiny
byte stream).
Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 peak |
|---|---|---|---|---|---|
llama gated_delta_net_cuda |
f32 | 805 MB | 3.98 ms | 202 GB/s | 74% |
vLLM ..._packed_decode |
f32 (not bf16) | 805 MB (not 402) | 3.62 ms | 222 GB/s | ~81% |
- B1 (single-pass f32 byte floor): TRUE (load-once/store-once
s_shard, coalesced). Sub-claim "more BW-efficient than vLLM (41%)" REFUTED - 41% was the bf16 artifact; at the correct f32 byte count vLLM is at ~81%, i.e. more efficient than llama. - B2 ("the gap is f32-vs-bf16 width"): REFUTED. Equal f32 bytes both sides; the ~10% per-call gap is bandwidth efficiency on equal bytes, not width.
- B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED. vLLM reaches it with f32 state.
PART C - The f32-preserving lever, and where recompute/bf16 land
Since vLLM hits ~81% on the same f32 byte volume llama runs at ~74%, the missed lever is raising
llama's gated_delta_net_cuda achieved BW 74% -> ~81%, bit-exact, NOT dtype width:
- llama grid
(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608blocks/128 thr, each warp owns ONE state column + warp-reduces over 128 rows. vLLM grid(NV=4, B*HV=6144) = 24576programs (num_warps=1), each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each, a butterfly reduce/token) is the likely ~7-point deficit. Retune toward fewer/larger blocks (more columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load width only, keeps the per-column reduction order -> bit-exact vs llama's own f32. - Upper bound: 74%->81% on ~50% of the step ~= +17 ms/step (384 -> ~367), ~+5% -> ~351 tok/s (~90% of vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the occupancy tune above is the real bit-exact win:
- Lossless sub-f32 state: generic float compression is data-dependent (1.1-1.5x, never a guaranteed
2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
full-rank, non-symmetric (sum of
k(x)delta, k!=delta) -> no low-rank/half-storage. FAILS. - Recompute (checkpoint every N + rank-1 replay): eliminates the per-step WRITE; the per-step full
dense f32 READ (the
S^T k/S^T qmatvecs need every element; the checkpoint is itself a full read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor, at large kernel/graph complexity. Note: this was over-weighted before because vLLM was assumed bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you need to match vLLM's achieved BW on the same f32 bytes. Recompute is dominated. - 2nd stream / overlap / pipelining: DRAM BW (273) is one shared resource; the whole decode step is uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so overlapping two BW-bound phases sums to ~0. FAILS.
- Equivalent recurrence with less decode traffic: chunked gated-delta-rule is a prefill lever (C=1 at decode); attention/materialization-free form is O(t) over the prefix. FAILS.
bf16 SSM state is therefore an OPTIONAL over-clock, the only lever that goes ahead of vLLM on the recurrence (halve 805 -> ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32 precision, so it must be KL/PPL-gated vs llama's own f32, never md5. f32-only parity-class throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj MMQ + activation fold); none require bf16.
PART D - Verdict on B4 + the meaningful weaker form
Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
- Structural (this model): A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
g.SigmavsSigma.g+ different reduction trees) make per-layer outputs differ by >>ULP, so logits cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a reassociated expression, differing even given identical f32 state and inputs. - General (any two engines): IEEE-754 add/mul are non-associative; two engines that tile, reduce, fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer. Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM inference") delivers run-to-run determinism WITHIN one engine; it does not and cannot deliver cross-engine bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for every op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a meaningful gate within an engine (how llama patches 0018-0021 are validated by md5).
Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap (A1-A2), the argmax frequently coincides whenever the top-1/top-2 logit margin exceeds the cross-engine noise. This is the only meaningful cross-engine "equivalence"; gate on top-1 agreement / KL / PPL-delta, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it compounds - once one greedy token differs the sequences fork and the KV/SSM states diverge, so agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8 gap (A1) makes the per-step divergence larger here than a same-precision bf16-vs-bf16 comparison, weakening greedy agreement for this model specifically.
Bottom line: target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify; bf16 state does not and must be KL/PPL-gated vs llama's own f32).
Assisted-by: Claude:opus-4.8 [Claude Code]