LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 08:08:52 -04:00

Files

Ettore Di Giacinto 0dd45f0da5 docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results

Record the belt-and-suspenders GPU run of the 0007 prefix-engine driver and a
shared-prefix throughput benchmark. The committed CPU driver reports ALL PASS;
the CUDA build fails only the strict greedy-token-equality assertions (the same
binary fails them at ngl=0 too), which is CUDA float-kernel non-determinism, not
a paged-logic defect: every structural KV-reuse invariant passes on GPU.

The shared-prefix benchmark shows a real win that grows with K: prefill wall
time drops by 7.2x (32B K=16) to 10.3x (32B K=32) when the shared prefix is
computed once and reused via the paged cross-request prefix cache.
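
The idea behind the shared-prefix result can be illustrated with a small toy model. The sketch below is not the LocalAI or llama.cpp implementation; the class `ToyPrefixCache`, the function `count_prefill_tokens`, the block size of 16, and the reading of K as the number of requests sharing one prefix are all assumptions made for illustration. It counts how many tokens need a prefill pass when full blocks of an identical prefix are looked up in a cache instead of being recomputed.

```python
from typing import Dict, List, Tuple


class ToyPrefixCache:
    """Hypothetical block-level prefix cache (illustration only)."""

    def __init__(self, block_size: int = 16) -> None:
        self.block_size = block_size
        # Maps the full token prefix ending at a block boundary to a block id.
        self.blocks: Dict[Tuple[int, ...], int] = {}
        self.computed_tokens = 0  # tokens that actually needed prefill

    def prefill(self, tokens: List[int]) -> List[int]:
        """Return block ids for all full blocks, reusing cached ones."""
        ids: List[int] = []
        n_full = (len(tokens) // self.block_size) * self.block_size
        for end in range(self.block_size, n_full + 1, self.block_size):
            key = tuple(tokens[:end])
            if key not in self.blocks:
                # Cache miss: this block has to be computed once.
                self.blocks[key] = len(self.blocks)
                self.computed_tokens += self.block_size
            ids.append(self.blocks[key])
        # A trailing partial block is never cached in this toy model.
        self.computed_tokens += len(tokens) - n_full
        return ids


def count_prefill_tokens(prefix: List[int],
                         suffixes: List[List[int]],
                         share: bool) -> int:
    """Total tokens computed for K = len(suffixes) requests."""
    shared = ToyPrefixCache()
    total = 0
    for suffix in suffixes:
        # Without sharing, every request starts from an empty cache.
        cache = shared if share else ToyPrefixCache()
        before = cache.computed_tokens
        cache.prefill(prefix + suffix)
        total += cache.computed_tokens - before
    return total


if __name__ == "__main__":
    prefix = list(range(64))                       # 4 full blocks
    suffixes = [[1000 + i] * 16 for i in range(4)]  # K = 4 distinct tails
    print(count_prefill_tokens(prefix, suffixes, share=False))
    print(count_prefill_tokens(prefix, suffixes, share=True))
```

The model counts tokens only, so it does not reproduce the measured wall-time ratios above; it just shows why the saving grows as more requests share one prefix.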

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-22 12:59:09 +00:00

ds4

chore: ⬆️ Update antirez/ds4 to 80ebbc396aee40eedc1d829222f3362d10fa4c6c (#10378)

2026-06-18 00:32:13 +02:00

grpc

fix: speedup git submodule update with --single-branch (#2847)

2024-07-13 22:32:25 +02:00

ik-llama-cpp

chore: ⬆️ Update ikawrakow/ik_llama.cpp to b3dfb7858cfcb9166e92f366e5af87f19ebc94be (#10395)

2026-06-19 00:03:37 +02:00

llama-cpp

docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results