LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 01:47:18 -04:00

Files

Ettore Di Giacinto 4d3fecd524 docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B findings

Mirror of llama.cpp dev-tree patch 0025 (qwen35moe NVFP4 MoE-decode re-graph) and the GPU-agent B
findings in SPEEDUP_HUNT.md: re-confirmed MoE decode decomposition @npl128, the measured re-graph
lever (+4.4%/+2.9%/+1.9% decode_agg at npl 32/64/128; bit-exact: test-backend-ops MUL_MAT_ID 806/806
+ parallel-greedy np16 byte-identical ON==OFF), grouped-GEMM occupancy headroom (exhausted on this
bandwidth-bound model), and the W4A16 assessment (rejected: non-bit-exact, slower BF16 MMA).

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-26 14:53:14 +00:00

paged

feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen harness, projection

2026-06-21 23:16:28 +00:00

patches

docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B findings

2026-06-26 14:53:14 +00:00

CMakeLists.txt

feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) (#10497)

2026-06-25 15:47:03 +02:00

grpc-server.cpp

Merge branch 'master' into worktree-feat+paged-attention