LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto 7aa61d4c32 docs(paged): DGX Blackwell gap analysis + lever plan (living doc)

Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency
sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8
MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the
lever plan.

Measured wins:
- Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill
  plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48),
  near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap).
- Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch.

Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8
GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and
the complete paged attention (on-demand alloc + gather-read + continuous
batching + prefix sharing). Honest scope: each is multi-week kernel/systems
work.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-19 20:15:14 +00:00

paged

docs(paged): DGX Blackwell gap analysis + lever plan (living doc)

2026-06-19 20:15:14 +00:00

CMakeLists.txt

fix(turboquant): resolve common.h by detecting llama-common vs common target (#9413 )

2026-04-18 20:30:28 +02:00

grpc-server.cpp

feat: generic chat_template_kwargs (model config + per-request metadata) (#10359 )

2026-06-16 12:16:34 +02:00

Makefile

chore: ⬆️ Update ggml-org/llama.cpp to f3e182816421c648188b5eab269853bf1531d950 (#10379 )