Synthesis of the four read-only/GPU investigations (A: MoE grouped-GEMM,
B: cuBLAS lm_head, C: TTFT/paged-pool burst, D: dense CUDA-graph):
- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than
vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is
bounded on this bandwidth-bound a3b model. Keep down_proj quantize
retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a
later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4) and nvjet already reaches ~72% of peak HBM
bandwidth; the bit-exact ceiling is <1%, and the only big win (NVFP4
head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last.
- C: the paged-pool burst-degradation BUG (Part 2) is a true correctness
defect (prefill collapses 507->65 t/s after a burst, restart cures it),
caused by a reclamation gap on partial seq_rm plus free-queue
fragmentation. Separately, the static decode-first budget (Part 1)
explains the 903s/213s burst TTFT and motivates the chunked-interleave
fix.
- D: the f32 dense CUDA-graph is STABLE (<1%, no bimodality); the
bimodality reported in the brief came from the shelved BF16 SSM path.
Closed.
First build target: the paged-pool burst-degradation bug fix (Fix-1
truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3
release-on-slot-completion). It is small, localized, default-off and
byte-identical, and has a crisp repro (npl64 burst then npl8: prefill
within 10% of fresh + num_free restored).
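The repro's pass/fail criterion can be sketched as a small check. This is
only an illustration: ReproSample, burst_recovered and their fields are
hypothetical names, not part of the code base, and the sketch assumes that
"within 10% of fresh" means a one-sided bound on prefill throughput.

```python
from dataclasses import dataclass


@dataclass
class ReproSample:
    """One measurement of the repro; all field names are hypothetical."""
    prefill_tps: float  # prefill throughput in tokens/s
    num_free: int       # free blocks reported by the paged pool


def burst_recovered(fresh: ReproSample, after_burst: ReproSample,
                    tolerance: float = 0.10) -> bool:
    """Return True if the post-burst npl8 run matches a fresh server.

    Assumption: a post-burst run that is *faster* than fresh is not a
    failure, so only the lower bound is checked.
    """
    # Prefill must stay within `tolerance` of the fresh throughput.
    prefill_ok = after_burst.prefill_tps >= (1.0 - tolerance) * fresh.prefill_tps
    # The pool must have returned to at least the fresh free-block count.
    pool_ok = after_burst.num_free >= fresh.num_free
    return prefill_ok and pool_ok
```

With the numbers above, a post-burst prefill of 65 t/s against a fresh
507 t/s fails the check regardless of the num_free value.
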
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>