LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto d1ba327843 docs(paged): record GPU correctness + CUDA backend-build verification

GPU (DGX Spark, GB10/sm_121, CUDA 13.0) verification of the paged-KV series:
core token-identical gate and 4-stream multiseq are byte-identical stock-vs-paged
at -ngl 99, the device gather is confirmed firing, and a 32B paged run is coherent.
Full backend: patches/paged apply clean to the pin and grpc-server compiles+links
under CUDA sm_121. Notes also flag a double patch-application in the LLAMA_PAGED=on
make flow (git apply + prepare.sh) and a token divergence in the unshipped
prefix-recompute-skip dev driver (same on CPU and GPU).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-06-22 11:50:01 +00:00

0001-vendor-paged-kv-manager.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-gather-read-plan.md

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)

2026-06-22 09:22:36 +00:00

0003-paged-gather-read-env-LLAMA_KV_PAGED.patch

build(llama-cpp): isolate paged patches in patches/paged/ behind LLAMA_PAGED flag (default on)