LocalAI/backend/cpp/llama-cpp
Ettore Di Giacinto b061e4aef0 docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target
Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM,
B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph):

- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than
  vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is
  bounded on this bandwidth-bound a3b model. Keep down_proj quantize
  retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a
  later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4) and nvjet already reaches ~72% of peak HBM
  bandwidth; the bit-exact ceiling is <1%, and the only big win (NVFP4
  head) is non-bit-exact and unfair vs vLLM. Dead end. Rank last.
- C: paged-pool burst-degradation BUG (Part 2) is a true correctness
  defect (prefill collapses 507->65 t/s after a burst, restart cures it):
  reclamation gap on partial seq_rm + free-queue fragmentation. Separately,
  the static decode-first budget (Part 1) explains the 903s/213s burst
  TTFT and motivates the chunked-interleave fix.
- D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's
  bimodality was the shelved BF16 SSM path. Closed.
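
The reclamation gap in item C can be illustrated with a toy model. This is
a minimal sketch, not llama.cpp's actual data structures: ToyPagedPool,
seq_rm_tail and the truncate flag are hypothetical names invented here. It
assumes the gap means that a partial removal shortens a sequence's logical
length without returning the now-unused tail pages to the free queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical toy model of a paged KV pool (illustration only).
struct ToyPagedPool {
    std::size_t page_size;                    // tokens per page
    std::deque<int> free_queue;               // ids of free pages
    std::vector<std::vector<int>> seq_pages;  // pages owned by each sequence
    std::vector<std::size_t> seq_len;         // tokens held by each sequence

    ToyPagedPool(int n_pages, std::size_t page_sz, int n_seqs)
        : page_size(page_sz), seq_pages(n_seqs), seq_len(n_seqs, 0) {
        for (int p = 0; p < n_pages; ++p) free_queue.push_back(p);
    }

    std::size_t num_free() const { return free_queue.size(); }

    // Number of pages needed to hold n tokens (ceiling division).
    std::size_t pages_for(std::size_t n) const {
        return (n + page_size - 1) / page_size;
    }

    // Append n tokens to seq; returns false if the pool cannot satisfy it.
    bool append(int seq, std::size_t n) {
        std::size_t need = pages_for(seq_len[seq] + n);
        std::size_t have = seq_pages[seq].size();
        if (need > have && need - have > free_queue.size()) return false;
        while (seq_pages[seq].size() < need) {
            seq_pages[seq].push_back(free_queue.front());
            free_queue.pop_front();
        }
        seq_len[seq] += n;
        return true;
    }

    // Keep only the first new_len tokens of seq.
    // truncate == false models the assumed gap: tail pages stay owned.
    // truncate == true returns the unused tail pages to the free queue.
    void seq_rm_tail(int seq, std::size_t new_len, bool truncate) {
        assert(new_len <= seq_len[seq]);
        seq_len[seq] = new_len;
        if (!truncate) return;
        std::size_t keep = pages_for(new_len);
        while (seq_pages[seq].size() > keep) {
            free_queue.push_back(seq_pages[seq].back());
            seq_pages[seq].pop_back();
        }
    }
};
```

In this model, a 64-token sequence on 16-token pages holds four pages;
trimming it to 16 tokens without truncation leaves num_free unchanged,
while the truncating variant returns three pages.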

First build target: the paged-pool burst-degradation bug fix (Fix-1
truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot-
completion). Small, localized, default-off byte-identical, crisp repro
(npl64 burst then npl8: prefill within 10% of fresh + num_free restored).
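
The repro criterion above can be written down as a predicate. This is a
sketch under assumptions: RunStats and burst_recovered are hypothetical
names, "within 10% of fresh" is read as a one-sided lower bound on prefill
throughput, and "num_free restored" is read as exact equality with the
fresh-start free-page count.

```cpp
#include <cstddef>

// Hypothetical measurements taken from one npl8 run.
struct RunStats {
    double prefill_tps;    // prefill throughput in tokens/s
    std::size_t num_free;  // free pages in the pool once the run is idle
};

// True when the post-burst run meets both conditions of the repro:
// prefill no more than tol below the fresh run, and num_free restored.
bool burst_recovered(const RunStats& fresh, const RunStats& after_burst,
                     double tol = 0.10) {
    const bool speed_ok =
        after_burst.prefill_tps >= fresh.prefill_tps * (1.0 - tol);
    const bool pool_ok = after_burst.num_free == fresh.num_free;
    return speed_ok && pool_ok;
}
```

With the figures quoted in item C, a fresh run at 507 t/s and a post-burst
run at 65 t/s fails the check regardless of the page count.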

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-26 09:42:55 +00:00