LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-27 09:57:14 -04:00

Files

Ettore Di Giacinto 6c6a925213 docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win

Cross-agent synthesis on top of the both-engine nsys decomposition (3b5957157):
settles the user's "can we do what vLLM does on MoE?" question using three
converging investigations (ground-truth measurement, vllm-marlin source read,
marlin-port feasibility).

Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a
-1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged
tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16
dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led
by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax
(+3.3), and router/glue (+5.4).
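
To make the decomposition above easier to check, here is a minimal sketch that tallies the quoted buckets and ranks them. It assumes every bucket is in the same unit as the one labeled figure (ms); the message does not state this explicitly. The dictionary keys and function names are illustrative and are not identifiers from the repository.

```python
# Gap buckets quoted in the verdict (positive = vLLM ahead, negative = llama ahead).
# ASSUMPTION: all values share the unit of the single labeled figure (ms).
GAP_BUCKETS_MS = {
    "moe_gemm (llama win)": -1.7,
    "bf16 dense projections": 6.5,
    "recurrence state-gather": 6.6,  # led by k_get_rows (5.2)
    "graph/stream overlap": 7.0,     # quoted as "~+7", so approximate
    "w4a4 act-quant": 3.3,
    "router/glue": 5.4,
}


def net_gap(buckets):
    """Sum all buckets, rounded to the one-decimal precision of the inputs."""
    return round(sum(buckets.values()), 1)


def ranked(buckets):
    """Return (name, delta) pairs, largest vLLM advantage first."""
    return sorted(buckets.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, delta in ranked(GAP_BUCKETS_MS):
        print(f"{delta:+5.1f}  {name}")
    print(f"net gap: {net_gap(GAP_BUCKETS_MS):+.1f}")
```

Under that assumption the GEMM bucket is the only negative term, so it cannot explain a vLLM lead; the remaining five buckets carry the whole gap.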

A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would
regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy
wall the dense scaffold already STOPPED at, and its entire intrinsic upside is
the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph
already banked, and closeable bit-exactly by fusing the act-quant.

Recommended build (none a new MoE GEMM):
  1. fuse the k_get_rows SSM-state gather (bit-exact, ~+5, biggest
     single-kernel win);
  2. extend CUDA-graph coverage + stream overlap (bit-exact, ~+7);
  3. fuse the W4A4 act-quant into RMSNorm/SiLU (bit-exact, +3.3);
  4. NVFP4-quantize the still-bf16 GDN/attn projections + lm_head
     (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM makes).

Bit-exact levers alone reach ~94% of vLLM; with the projection quant ~96-97%.
Parity-or-better is physically in reach, since both heaviest kernels (SSM core,
MoE GEMM) are already llama wins.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-26 20:14:30 +00:00

0001-vendor-paged-kv-manager.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-gather-read-plan.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): cross-request prefix caching patch 0006

2026-06-22 10:14:27 +00:00

0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch

feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)

2026-06-22 10:47:10 +00:00

0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch

paged: in-kernel decode read patch 0009 (kill the gather regression)

2026-06-22 18:04:09 +00:00

0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch

feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)

2026-06-22 20:37:12 +00:00

0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch

feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)

2026-06-22 22:38:28 +00:00

0012-paged-mask-pad-invariant-assert.patch

feat(paged): assert mask-pad invariant for the paged tile route (patch 0012)

2026-06-23 09:13:08 +00:00

0013-paged-decoupled-prefill-token-budget.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0014-paged-expert-aware-moe-token-tile-cap.patch

feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

2026-06-23 13:49:15 +00:00

0015-paged-expert-density-aware-moe-token-tile-auto-select.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0016-paged-dynamic-prefill-budget-continuous-batch.patch

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

0017-fp4-gemm-decode-tile-tune.patch

docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)

2026-06-24 17:58:00 +00:00

0018-qwen35-ssm-decode-inplace-state.patch

feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018)

2026-06-24 22:45:49 +00:00

0019-qwen35-ssm-decode-fused-gather.patch

feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)

2026-06-24 23:47:51 +00:00

0020-qwen35-gdn-oproj-mmq-reshape.patch

feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)

2026-06-25 10:41:38 +00:00

0021-qwen35-conv-state-inplace-fusion.patch

feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)

2026-06-25 16:56:35 +00:00

0022-qwen35-gdn-recurrence-occupancy-retune.patch

feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022)

2026-06-25 18:34:17 +00:00

0023-qwen35moe-nvfp4-quant-dedup.patch

feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

2026-06-25 21:49:15 +00:00

0024-paged-pool-burst-reclaim.patch

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

0025-qwen35moe-nvfp4-moe-decode-regraph.patch

docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B findings

2026-06-26 14:53:14 +00:00

0026-qwen35-hybrid-perhead-ssm-state.patch

feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast mode

2026-06-26 19:51:00 +00:00

A2_CUDAGRAPH_DECODE.md

docs(paged): A.2 final synthesis - CUDA-graph decode verdict

2026-06-24 21:45:42 +00:00

A_HYBRID_PROGRESS.md

feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026)

2026-06-26 17:44:05 +00:00

A_HYBRID_SSM_RESULTS.md

feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + gate sweep (patch 0026)

2026-06-26 17:44:05 +00:00

ADDITIVE_DESIGN.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

B_MOE_PROGRESS.md

docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM

2026-06-26 19:10:24 +00:00

B_MOE_RESULTS.md

docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceiling ~85% of vLLM

2026-06-26 19:10:24 +00:00

BENCHMARK_PROGRESS.md

bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs vLLM 0.23.0, GB10)

2026-06-26 03:47:24 +00:00

BF16_SSM_STATE_PLAN.md

docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL gate, bench, risks)

2026-06-25 16:46:59 +00:00

BF16_SSM_STATE_PROGRESS.md

docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own precision)

2026-06-26 00:49:49 +00:00

BF16_SSM_STATE_RESULTS.md

docs(paged): correct vLLM recurrent-state precision (f32, not bf16)

2026-06-26 06:22:08 +00:00

BITEXACT_VS_VLLM.md

docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correction

2026-06-25 16:55:25 +00:00

BYTEGATE_PROGRESS.md

docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, build bf16 state not fused kernel

2026-06-25 15:24:49 +00:00

CONTINUOUS_BATCH_SCHEDULER_SCOPE.md

docs(paged): adversarial review of the continuous-batch scheduler scope

2026-06-23 22:48:31 +00:00

CONV_STATE_FUSION_RESULTS.md

feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)

2026-06-25 16:56:35 +00:00

CRITICALPATH_GAP_ANALYSIS.md

docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan, verdict

2026-06-25 15:03:18 +00:00

DECODE_GAP_STUDY.md

docs(paged): decode-step gap study vs vLLM on GB10

2026-06-22 15:44:24 +00:00

DECODE_PARITY_EXPLORE.md

docs(paged): synthesize decode-parity exploration - the o_proj MMVQ lever

2026-06-25 09:06:50 +00:00

F16_DENSE_RESIDUAL_PROBE.md

docs(paged): finalize f16 glue probe - cost analysis + build verdict

2026-06-26 09:12:55 +00:00

final_benchmark.csv

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

FP4_GEMM_SCOPE_B.md

docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go

2026-06-24 14:31:35 +00:00

FUTURE_LEVERS.md

docs(paged): correct vLLM recurrent-state precision (f32, not bf16)

2026-06-26 06:22:08 +00:00

GDN_DECODE_VERIFY.md

docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.4x lever

2026-06-24 11:21:44 +00:00

GDN_RECURRENCE_BYTE_GATE.md

docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv fusion + bf16 state

2026-06-25 15:27:04 +00:00

LEVER1_OPROJ_MMQ_RESULTS.md

feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)

2026-06-25 10:41:38 +00:00

LOCALAI_LLAMACPP_BACKEND_PLAN.md

feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery

2026-06-26 12:58:56 +00:00

MOE_DENSITY_AUTO_TILE.md

feat(paged): mirror MoE token-tile density-aware auto-select (patch 0015)

2026-06-23 19:04:55 +00:00

MOE_GAP_PROGRESS.md

docs(paged): both-engine MoE decode decomposition - the 15% is NOT the Marlin GEMM

2026-06-26 20:11:40 +00:00

MOE_GAP_VS_VLLM.md

docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the GEMM is a llama win

2026-06-26 20:14:30 +00:00

MOE_GROUPED_GEMM_SCOPE.md

docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10

2026-06-23 13:17:03 +00:00

MOE_QUANT_DEDUP_RESULTS.md

feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

2026-06-25 21:49:15 +00:00

MOE_TOKEN_TILE_CAP.md

feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap

2026-06-23 13:49:15 +00:00

NONRECURRENCE_BITEXACT.md

feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)

2026-06-25 21:49:15 +00:00

OCCUPANCY_RETUNE_RESULTS.md

feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune (patch 0022)

2026-06-25 18:34:17 +00:00

OTHER_PATHS_INVESTIGATION.md

docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick paged-pool burst bug as first build target

2026-06-26 09:42:55 +00:00

P1_DYNAMIC_BUDGET_RESULTS.md

docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget

2026-06-24 10:56:13 +00:00

PAGED_BENCH.md

docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results

2026-06-22 12:59:09 +00:00

PAGED_GPU_VERIFY.md

docs(paged): record GPU correctness + CUDA backend-build verification

2026-06-22 11:50:01 +00:00

PAGED_POOL_BURST_FIX.md

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

PAGED_VLLM_APPLES.md

docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+prefix cache)

2026-06-22 14:16:52 +00:00

PAGED_VLLM_COMPARE.md

docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix comparison

2026-06-22 13:48:01 +00:00

paged-burst-bench.cpp

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

paged-reclaim-unit.cpp

feat(paged): paged-pool burst-reclaim (truncate + defrag + slot release) (patch 0024)

2026-06-26 10:44:33 +00:00

PIN_SYNC_9d5d882d.md

feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 patches)

2026-06-26 14:12:36 +00:00

qwen36_dense_decode_vs_npl.png

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

qwen36_moe_decode_vs_npl.png

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

QWEN36_NVFP4_BENCH.md

docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode-vs-npl plots

2026-06-26 03:51:35 +00:00

RMSNORM_FP4_FOLD.md

docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling at 95% of vLLM

2026-06-25 22:42:08 +00:00

SERVER_SWEEP.md

docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)

2026-06-23 12:22:15 +00:00

SPEEDUP_HUNT.md

docs(paged): speedup-hunt C section + final RANK + PLAN synthesis

2026-06-26 14:56:53 +00:00

SSM_DECODE_FIX_RESULTS.md

feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)

2026-06-24 23:47:51 +00:00

THROUGHPUT_B_P1_RESULTS.md

docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate results (patch 0017)

2026-06-24 17:58:00 +00:00

VLLM_DECODE_GROUNDING.md

docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp

2026-06-24 07:44:07 +00:00