Split the dtype template parameter of the gated-delta Metal/CUDA kernels
into separate input (InT) and state (StT) types, so activations can stay
in bf16/fp16 while the accumulated delta state is kept in float32.
Allocate the delta state and qwen3_5's no-cache zero state in float32 to
match.
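
For illustration only, the InT/StT split can be sketched in host-side
C++ as below. This is not the kernel code: the `bf16` struct is a toy
type that truncates a float32 to its top 16 bits (real conversions
normally round), and `accumulate` and `self_check` are hypothetical
names. It only shows why a running state updated many times benefits
from a wider type than the inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy bfloat16 (assumption, for illustration): keeps the top 16 bits of
// an IEEE float32 by truncation, with no rounding.
struct bf16 {
    uint16_t bits;
    bf16() : bits(0) {}
    bf16(float f) {
        uint32_t u;
        std::memcpy(&u, &f, sizeof u);
        bits = static_cast<uint16_t>(u >> 16);
    }
    operator float() const {
        uint32_t u = static_cast<uint32_t>(bits) << 16;
        float f;
        std::memcpy(&f, &u, sizeof f);
        return f;
    }
};

// Hypothetical stand-in for a kernel body. InT is the activation type,
// StT is the type the running state is stored in between updates.
template <typename InT, typename StT>
StT accumulate(const InT* x, int n) {
    StT state = StT(0.0f);
    for (int i = 0; i < n; ++i) {
        // Arithmetic is done in float; the result is stored back as StT,
        // so any precision loss comes from the state type alone.
        state = StT(static_cast<float>(state) + static_cast<float>(x[i]));
    }
    return state;
}

// Sum 1000 ones with a float32 state and with a bf16 state.
inline void self_check() {
    bf16 ones[1000];
    for (auto& v : ones) v = bf16(1.0f);
    // float32 state: every partial sum is exactly representable.
    assert((accumulate<bf16, float>(ones, 1000) == 1000.0f));
    // bf16 state: 8 significand bits, so the sum stalls once it hits 256.
    assert((static_cast<float>(accumulate<bf16, bf16>(ones, 1000)) == 256.0f));
}
```

With a single dtype parameter the second case is the only option for
bf16 inputs; with separate InT and StT the inputs can stay narrow while
the state keeps float32 precision.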