LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-26 09:26:55 -04:00

Files

Ettore Di Giacinto 02cbae5ea9 feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill
lever: ggml mul_mat_id re-quantizes each token's activation once per expert for
the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread
reduction, so the gathered blocks are byte-identical across experts. The lever
quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the
expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is
untouched and down_proj keeps the stock path.

Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128
745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4
byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms).
Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022);
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default;
GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-25 21:49:15 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378 )

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847 )

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395 )

2026-06-19 00:03:37 +02:00

llama-cpp

feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)