mirror of https://github.com/mudler/LocalAI.git synced 2026-07-04 05:16:42 -04:00

Files

Ettore Di Giacinto f8d7b026cf docs(paged): scope GB10 parity reopen plan

Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries.

Assisted-by: Codex:gpt-5

2026-06-30 15:44:11 +00:00

11 KiB

Raw Blame History

GB10 vLLM Parity Reopen Spec

Status: scoped follow-up. This document intentionally challenges the current VLLM_PARITY_FINAL.md conclusion that GB10 parity is closed. The final record is still useful as a baseline, but the follow-up work must treat it as a hypothesis to test, not as a proof of impossibility.

Goal

Determine whether llama.cpp / ggml can close the remaining GB10 parity gap for Qwen3.6 NVFP4 hybrid gated-DeltaNet models by porting or adapting concrete vLLM implementation ideas, while preserving LocalAI's hard correctness gates.

Success means one of two outcomes:

A measured, source-backed path improves paged llama.cpp materially toward vLLM parity on GB10.
The remaining gap is rejected with clean provenance: clean source, clean DGX host state, artifact-pinned A/B results, and explicit correctness gates.

Non-goals

Do not accept a "closed" conclusion based only on existing docs.
Do not run long builds or benchmarks without a recorded DGX preflight.
Do not edit patches/paged/*.patch directly. Kernel changes land fork-first in mudler/llama.cpp:localai-paged, then the LocalAI patch series is regenerated.
Do not treat a standalone PoC as a result. Every performance claim requires an in-backend A/B.
Do not ship lossy paths default-on. Non-byte-identical paths require KL gates.

Required Preflight

Before any DGX build, benchmark, or profile:

docker ps must show no running containers, especially no local-ai-worker.
nvidia-smi --query-compute-apps=pid must show zero compute apps.
~/gpu_bench_lock/owner must be absent or FREE*.
Record hostname, git SHA, dirty status, build arch, binary mtimes, model paths, benchmark command, and environment variables.

Use ~/_git/llama.cpp as the local source of truth. DGX source trees are allowed for builds and artifact inspection, but dirty DGX checkouts must not be treated as canonical source.

Evidence From Subagent Audit

Four read-only subagents audited the current state:

llama.cpp / ggml source audit.
vLLM source and installed package audit.
LocalAI patch and docs audit.
DGX artifact and profile audit.

Their shared conclusion: the final docs are a useful snapshot, but several claims are broader than the available evidence.

Key findings:

The strongest unresolved implementation target is W4A16 grouped MoE prefill. vLLM uses Marlin W4A16 on GB10, and llama.cpp already has a correct but untuned scaffold in ggml/src/ggml-cuda/w4a16-gemm.cu.
The existing W4A16 rejection is a first-implementation failure, not a proof of impossibility. The patch header names fixable costs: f32 to bf16 cast pre-pass, host tile-map setup, small copies, scalar dequant, and ragged tile waste.
The 924 t/s paged GPU-steady decode figure is artifact-backed, but the vLLM 1078 t/s true GPU-steady figure was not found as a self-contained ntg16/ntg64 difference-method artifact. Reproduce before relying on the 86% claim.
GDN M5 is real, but M5/M8 provenance is muddy because CDEF records a dirty dev-tree M8 commit while docs describe production M5 defaults.
S3 fixed-period scheduling and fixed-slot padding were rejected, but adaptive scheduling remains unproven.

Candidate Workstreams

A. Provenance And Baseline Reproduction

Purpose: make later claims defensible.

Tasks:

Build from clean ~/_git/llama.cpp localai-paged source, or a clean DGX clone generated from that source.
Re-run canonical md5 gates for paged MoE and dense:
- paged MoE: 8cb0ce23777bf55f92f63d0292c756b0
- dense: 5951a5b4d624ce891e22ab5fca9bc439
Re-run a short prefill baseline for MoE and dense at npp=512,2048.
Re-run graph-node-traced decode for paged and vLLM using the same difference-method shape: ntg=16 and ntg=64 at N=128 or N=256.

Gate:

No implementation work starts until the baseline artifact names, source SHAs, and commands are recorded.

B. W4A16 Grouped MoE Prefill Attack

Purpose: port the vLLM Marlin W4A16 advantage into ggml's in-backend MoE prefill path.

Current hooks:

ggml/src/ggml-cuda/w4a16-gemm.cu
ggml/src/ggml-cuda/w4a16-gemm.cuh
ggml/src/ggml-cuda/ggml-cuda.cu around ggml_cuda_mul_mat_id
ggml/src/ggml-cuda/mmq.cu around LLAMA_W4A16_PREFILL_M

Known current costs:

Separate f32 to bf16 activation cast pass.
Host-built tile metadata and H2D copies.
Scalar in-register FP4 to bf16 dequant.
4-byte weight staging.
Ragged expert tile waste.
Interaction with the generic token-sorting fallback.

Phased experiments:

Reconfirm current 0035 W4A16 performance with clean provenance.
Remove or fuse the f32 to bf16 activation cast pre-pass.
Move tile metadata generation device-side or cache it across repeated shapes.
Improve weight staging width and shared-memory layout.
Tune tile shapes for ragged per-expert M distribution.
Compare against FP4-MMQ and vLLM Marlin buckets with nsys.

Correctness gate:

test-backend-ops MUL_MAT_ID forced W4A16.
Greedy md5 for unaffected default-off path.
KL gate for engaged W4A16 path: KLD(W4A16||f16) <= KLD(FP4-MMQ||f16).
Decode path unchanged when LLAMA_W4A16_PREFILL_M=0.

Benchmark gate:

Beat default FP4-MMQ on MoE S_PP at npp=512 and npp=2048.
No material peak-memory increase.
No decode regression in the default path.

C. Native Ragged Grouped FP4-MMA Prefill

Purpose: test whether patch 0034's native FP4-MMA PoC failed due to integration, not due to the core kernel idea.

Current hooks:

ggml/src/ggml-cuda/fp4-gemm.cu
ggml/src/ggml-cuda/fp4-gemm.cuh
LLAMA_FP4_PREFILL_M

Experiment:

Build a graph-safe ragged grouped FP4-MMA MoE prefill kernel that avoids the per-expert host-sync loop.

Correctness gate:

Same KL and op gates as W4A16.
Explicit proof that the per-expert host fallback is not on the hot path.

Benchmark gate:

Beat current FP4-MMQ or lose decisively enough to close this branch.

D. GDN Chunked Scan Follow-up

Purpose: compare vLLM's in-tree FLA-derived GDN path against the current M5 implementation without relying on muddy dev-tree artifacts.

Current hooks:

ggml/src/ggml-cuda/gated_delta_net.cu
GDN_TC
GDN_CHUNK_MIN
existing M5 tensor-core ladder

Phased experiments:

Clean A/B: current production M5 against sequential and against recorded M8 dev-tree behavior.
C=32 and C=64 variants.
dv slab variants.
cp.async staging variants.
Register-state variant only if the lower-risk variants show headroom.

Correctness gate:

test-backend-ops GATED_DELTA_NET, including multi-chunk, tail-chunk, multi-seq, and adversarial decay cases.
Greedy md5 per path.
KL gate for any non-byte-identical path.

Benchmark gate:

Beat current M5, not just old sequential.
Preserve decode behavior by keeping GDN_CHUNK_MIN > 1.

E. MoE Weighted Fan-in Fusion

Purpose: remove generic graph-level MoE reduction overhead that vLLM avoids or amortizes through fused MoE handling.

Current source:

src/llama-graph.cpp, MoE down projection and expert reduction.
ggml/src/ggml-cuda/ggml-cuda.cu, CUDA fusion and MoE support.

Experiment:

Add a CUDA-specific fused path for down_experts * weights -> sum expert_used while preserving the current reduction order where required.

Correctness gate:

Bit-exact for supported shapes, or KL-benign if reduction order changes.
Handles all n_expert_used used by Qwen3.6 MoE.

Benchmark gate:

Move MoE prefill or decode wall time by more than noise. If it is only a 2-3% dispatch bucket, record and deprioritize.

F. Adaptive Serving Scheduler

Purpose: keep S3's decode-window benefit without reproducing its TTFT collapse.

Current hooks:

tools/server/server-context.cpp
LLAMA_PAGED_DECODE_STABLE
LLAMA_PAGED_PREFILL_PERIOD
existing dynamic prefill budget patches.

Experiment:

Replace fixed-period prefill deferral with adaptive admission based on live decode width, waiting prefill backlog, and TTFT budget.

Correctness gate:

Serving output correctness unchanged.
No starvation of prefill requests.

Benchmark gate:

Improve aggregate throughput or decode throughput at N=128 or N=256 without the 2.5x TTFT regression from fixed S3.

G. Projection And GDN Glue Fusion

Purpose: steal vLLM's prepare_gdn_attention_core_inputs idea where ggml still pays small copy, cat, slice, or unpack kernels.

Current source:

src/models/qwen35.cpp
src/models/qwen35moe.cpp
ggml/src/ggml-cuda/ggml-cuda.cu

Experiment:

Fuse q/k/v/z unpacking, BA projection preparation, RMSNorm-gated output prep, and FP4/FP8 quant prep where the graph pattern is stable.

Correctness gate:

Per-op tests for new fusion.
Greedy md5 for model paths.

Benchmark gate:

Only continue if nsys shows this bucket is material after MoE and GDN work.

Subagent Plan

Use subagents for independent read, implementation, and review slices. Do not use subagents to edit the same files in parallel.

Recommended roles by phase:

Phase 0 source/provenance agent: owns command capture and source SHA checks.
Phase 0 artifact agent: owns parsing existing and new benchmark artifacts.
W4A16 kernel agent: owns w4a16-gemm.*.
W4A16 integration agent: owns ggml-cuda.cu and mmq.cu dispatch plumbing.
GDN kernel agent: owns gated_delta_net.cu.
Scheduler agent: owns server scheduling files only.
Reviewer agent: reviews gates, provenance, and whether measured claims match artifacts.

Subagent output requirements:

File paths and functions inspected or changed.
Exact commands run.
Exact artifacts produced.
Pass/fail result against the phase gate.
Any uncertainty labeled explicitly.

Phase Order

Phase 0 - Reproduce And Correct The Record

Do first.

Deliverables:

Clean source/build provenance.
Short prefill baseline.
Graph-node-traced decode difference-method for paged and vLLM.
Updated docs if the 86% decode claim or CDEF provenance changes.

Exit criteria:

Baseline is trustworthy enough to judge optimization deltas.

Phase 1 - W4A16 MoE Prefill

Do second.

Deliverables:

Reconfirmed current W4A16 baseline.
At least one targeted W4A16 overhead removal.
A/B against default FP4-MMQ.

Exit criteria:

Either W4A16 beats FP4-MMQ and continues, or it is rejected with direct artifact-backed evidence.

Phase 2 - GDN Follow-up

Do after Phase 1 unless Phase 0 proves decode/GDN is the larger immediate gap.

Deliverables:

Clean M5 vs candidate geometry A/B.
Correctness gates for all candidate variants.

Exit criteria:

Keep the best variant or close the branch with measured evidence.

Phase 3 - MoE Fan-in And Glue Fusions

Do after kernel work identifies remaining non-kernel buckets.

Deliverables:

nsys-backed bucket selection.
Fusion implementation only for material buckets.

Exit criteria:

Keep only fusions that move end-to-end numbers beyond noise.

Phase 4 - Adaptive Serving

Do after compute kernels are stable.

Deliverables:

Adaptive scheduling policy.
Serving A/B at N=8,32,128,256.

Exit criteria:

Improve serving without TTFT collapse.

Decision Rules

Prefer measured in-backend results over source plausibility.
Prefer small kill-gate experiments over multi-week rewrites.
Continue a branch only if it beats the current shipped path, not an obsolete baseline.
Document rejected branches with artifact paths so they are not rerun.
Keep the fork branch canonical and regenerate LocalAI patches from it.

11 KiB Raw Blame History

GB10 vLLM Parity Reopen Spec

Goal

Non-goals

Required Preflight

Evidence From Subagent Audit

Candidate Workstreams

A. Provenance And Baseline Reproduction

B. W4A16 Grouped MoE Prefill Attack

C. Native Ragged Grouped FP4-MMA Prefill

D. GDN Chunked Scan Follow-up

E. MoE Weighted Fan-in Fusion

F. Adaptive Serving Scheduler

G. Projection And GDN Glue Fusion

Subagent Plan

Phase Order

Phase 0 - Reproduce And Correct The Record

Phase 1 - W4A16 MoE Prefill

Phase 2 - GDN Follow-up

Phase 3 - MoE Fan-in And Glue Fusions

Phase 4 - Adaptive Serving

Decision Rules

11 KiB

Raw Blame History