LocalAI/backend/cpp/llama-cpp-localai-paged/docs at bf61db62143ba8c9a53eae9173f8f9b30cc930f0 - LocalAI

mirror/LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-03 04:46:54 -04:00

Ettore Di Giacinto bf61db6214 docs(paged): scope vLLM-class execution re-architecture (additive program)

Reframe the GB10 vLLM-parity gap from a per-lever "hardware floor" verdict
to a ggml-execution-architecture-conditional one: same-silicon 2-3x is
software architecture, not silicon. Add EXECUTION_REARCH_SCOPE.md, a phased
additive program (P1 bf16-native stream, P2 expert-major fused MoE region,
P3 Marlin large-M retry on P1+P2, P4 token-budget scheduler, P5 blocked-solve
GDN, P6 fp8 KV), each with the ggml/fork seam, default-off env gate, per-path
md5/KL correctness gate, a falsifiable P0 kill-gate, expected-recovery
arithmetic grounded in the both-engine nsys buckets, and upstream-clash
analysis. Point the README docs list and PARITY_HANDOFF forward-direction at
it.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-07-02 10:50:00 +00:00

ACCELERATOR_PORTING_SCOPE.md

docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)

2026-06-28 08:34:32 +00:00

BENCHMARK.md

docs(paged): record phases 112-140 + series trim decision

2026-07-02 10:16:53 +00:00

DECODE_SERVING_SCOPE.md

docs(paged): record padded/fixed-slot decode shape as tested-and-rejected

2026-06-28 20:47:43 +00:00

EXECUTION_REARCH_SCOPE.md

docs(paged): scope vLLM-class execution re-architecture (additive program)

2026-07-02 10:50:00 +00:00

final_benchmark.csv

paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=100000 flat, zero speed benefit)

2026-06-28 16:06:06 +00:00

GB10_PARITY_PHASE0_RESULTS.md

docs(paged): record BF16 F32 output broader serving phase

2026-07-01 13:26:50 +00:00

GB10_PARITY_REOPEN_SPEC.md

docs(paged): scope GB10 parity reopen plan

2026-06-30 15:44:11 +00:00

GDN_SHARED_AI_COST_MODEL.md

docs(paged): reject GDN global Ai32 prototype

2026-07-01 01:51:53 +00:00

LOCALAI_LLAMACPP_BACKEND_PLAN.md

chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

2026-06-27 13:20:05 +00:00

PAGED_BITEXACT_NOTE.md

chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

2026-06-27 13:20:05 +00:00

paged-burst-bench.cpp

chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

2026-06-27 13:20:05 +00:00

paged-reclaim-unit.cpp

chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

2026-06-27 13:20:05 +00:00

PARITY_HANDOFF.md

docs(paged): scope vLLM-class execution re-architecture (additive program)

2026-07-02 10:50:00 +00:00

PATCH_MAINTENANCE.md

docs(paged): record patch mirror readiness phase

2026-07-01 13:11:57 +00:00

PREFILL_GEMM_RESULTS.md

feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch 0033, default-off)

2026-06-28 17:42:15 +00:00

PREFILL_GEMM_SCOPE.md

docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only)

2026-06-28 16:42:23 +00:00

qwen36_decode_overview.png

docs(paged): add the bf16-tau opt-in line to the decode plots

2026-06-27 22:25:02 +00:00

qwen36_dense_decode_vs_npl.png

docs(paged): add the bf16-tau opt-in line to the decode plots

2026-06-27 22:25:02 +00:00

qwen36_moe_decode_vs_npl.png

docs(paged): add the bf16-tau opt-in line to the decode plots

2026-06-27 22:25:02 +00:00

TENSORCORE_GDN_BUILD_PLAN.md

docs(paged): record GDN tensor-core revalidation phase

2026-07-01 14:05:20 +00:00

TENSORCORE_GDN_SCOPE.md

docs(paged): scope tensor-core (mma) chunked GDN prefill kernel

2026-06-28 17:23:51 +00:00

UPSTREAM_LAYER2_SCOPE.md

docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan (+ROCm)

2026-06-28 08:34:32 +00:00

VLLM_PARITY_FINAL.md

docs(paged): profile MTP graph reuse loss

2026-07-01 02:32:49 +00:00

VLLM_PARITY_LEVER_MAP.md

docs(paged): record datacenter Blackwell readiness phase

2026-07-01 14:28:41 +00:00