LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 17:37:07 -04:00

Files

Ettore Di Giacinto b061e4aef0 docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target

Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM,
B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph):

- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than
  vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is
  bounded on this bandwidth-bound a3b model. Keep down_proj quantize
  retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a
  later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact
  ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair
  vs vLLM. Dead end. Rank last.
- C: paged-pool burst-degradation BUG (Part 2) is a true correctness
  defect (prefill collapses 507->65 t/s after a burst, restart cures it):
  reclamation gap on partial seq_rm + free-queue fragmentation. Plus the
  static decode-first budget (Part 1) explains 903s/213s burst TTFT and
  the chunked-interleave fix.
- D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's
  bimodality was the shelved BF16 SSM path. Closed.

First build target: the paged-pool burst-degradation bug fix (Fix-1
truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot-
completion). Small, localized, default-off byte-identical, crisp repro
(npl64 burst then npl8: prefill within 10% of fresh + num_free restored).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-26 09:42:55 +00:00

paged

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

2026-06-21 23:16:28 +00:00

patches

docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target

2026-06-26 09:42:55 +00:00

CMakeLists.txt

feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497 )

2026-06-25 15:47:03 +02:00

grpc-server.cpp

Merge branch 'master' into worktree-feat+paged-attention