diff --git a/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md b/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
new file mode 100644
index 000000000..3733bb300
--- /dev/null
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
@@ -0,0 +1,170 @@
+# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
+
+Target hardware: **~2x H200** (282 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is
+the *test* rig, not the target - and several earlier "no win" findings are GB10-specific
+artifacts (low bandwidth caps throughput before KV memory ever binds). This document
+delivers the three things needed to push paged KV toward the real target:
+
+1. **Correctness** of the paged path - verified (and a blocking bug found + fixed).
+2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`).
+3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
+
+---
+
+## 1. Correctness: PASS (after fixing the auto-fit OOM)
+
+`test-paged-kv-e2e` checks the paged decode path against the contiguous reference
+(greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** -
+it aborted at context creation. Root cause found:
+
+- `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides**
+  `n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the
+  GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** ->
+  `cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's
+  explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on.
+
+**Fix (item-1 patch, applied on the box):**
+
+```diff
+--- a/tests/test-paged-kv-e2e.cpp
++++ b/tests/test-paged-kv-e2e.cpp
+@@ run_paged()
+     params.kv_paged = true;
++    params.fit_params = false;   // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
+     params.n_gpu_blocks = 64;
+```
+
+**Result (Qwen3-0.6B-Q8_0, GB10):**
+
+```
+test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
+test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
+test-paged-kv-e2e: PASSED
+```
+
+The paged op is **numerically greedy-equivalent to the contiguous path**. The fix for the
+reshape bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout.
+
+**Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is
+brittle and must be hardened before it runs on a real serving box - even though
+`ggml_backend_dev_memory` is expected to report correctly on a discrete H200, the function should
+still (a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so
+`n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and
+(c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`.
+
+---
+
+## 2. Dynamic-load benchmark - `paged-loadgen.cpp`
+
+**Why the existing tools show no paged win:** `llama-batched-bench` and the stock
+`examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt**
+load. That has no over-reservation and no fragmentation, so contiguous KV is already
+memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The
+paged win only exists under **variable lengths + continuous arrival + shared prefixes** -
+the real serving regime. No tool in the tree creates it.
+
+`paged-loadgen.cpp` (committed here) creates that regime, via the confirmed
+`llama_paged_scheduler_*` API:
+
+- **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises
+  cross-request prefix sharing,
+- **variable prompt length** (per-request suffix of `LG_SUFMIN..LG_SUFMAX` tokens),
+- **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else
+  `LG_GENSHORT`) - the over-reservation driver,
+- **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time
+  one finishes.
+
+It reports the load-bearing number for the buy decision - the **capacity ratio**:
+
+```
+paged peak KV      = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
+contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token   (worst-case per slot)
+CAPACITY RATIO     = contiguous_reserve / paged_peak                (+ prefix sharing on top)
+```
+
+`kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against
+`llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**).
+
+**How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its
+CMakeLists next to `llama-paged`, build, then e.g.
+`LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99`.
+Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point.
+It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
+the ratio is uninteresting because throughput plateaus before memory binds (see below).
+
+---
+
+## 3. Projection to 2x H200 (grounded in measured GB10 numbers)
+
+### Measured on GB10 (this work)
+
+| model | decode plateau (aggregate) | plateau concurrency | bound by |
+|---|---|---|---|
+| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
+| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
+
+### Hardware ratios (per GPU, then 2x TP at ~85% scaling)
+
+| | GB10 | H200 | per-GPU x | 2x H200 (TP) x |
+|---|---|---|---|---|
+| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
+| BF16 compute | ~213 TFLOP/s | ~989 TFLOP/s | 4.6 | ~8 |
+| memory capacity | 119 GB | 141 GB | 1.18 | 2.4 (282 GB) |
+
+Decode is bandwidth-bound, so **both the aggregate ceiling and the concurrency at which it
+is reached scale with bandwidth (~30x on 2x H200)**:
+
+- **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at
+  ~128 x 30 ~= **3,800 concurrent sequences**.
+
+### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
+
+To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math:
+
+- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
+- 32B KV = 256 KiB/token. At an avg held context of ~2,000 tokens, **per seq ~= 500 MiB**.
+- Contiguous unified KV (reserve for the live set) fits ~250 GB / ~0.5 GB ~= **~500
+  sequences** - **~8x short of the 3,800 needed to reach the throughput ceiling.**
+
+So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**,
+and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This
+is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth
+caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is
+inverted on the real target.
+
+### Magnitude of the paged win
+
+Paging recovers concurrency two ways, both multiplicative on achievable throughput:
+
+1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses
+   `ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15%
+   long, prompts ~512) the average held context is several-fold below `max_ctx` ->
+   `paged-loadgen` capacity ratio projected at **~4-10x** (it measures the exact number for
+   your workload's length distribution).
+2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional,
+   workload-dependent (chained-hash block cache; vLLM's `block_pool.py`).
+
+Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800**
+concurrent 32B sequences in HBM, i.e. between reaching a fraction of the **~16k t/s** decode
+ceiling and reaching ~all of it. **That is the datacenter payoff, and it is real on the target
+even though GB10 cannot exhibit it.**
+
+### Honest caveats for the buy case
+
+- These are **projections** from GB10 + spec ratios. The ~30x factor assumes decode is bandwidth-bound,
+  yet the 32B GB10 plateau is listed as compute-bound (ratio ~8x); the capacity multiplier depends on the
+  context-length distribution (more variable -> bigger win) and TP efficiency, which `paged-loadgen` measures on target-GPU time.
+- The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13%
+  *slower* than the mature contiguous flash-attention path at equal concurrency
+  (`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has
+  the fit-robustness bug above. Adopting paged KV for the target means either hardening
+  #22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct,
+  competitive* op, which is the remaining engineering.
+- Prefill on either KV layout is compute-capped, not a paged concern.
+
+**Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target -
+the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now
+**correctness-verified**, the **benchmark to size the win exists**, and the projection
+says the payoff is **~4-10x concurrent-tenant capacity -> several-fold higher aggregate
+decode** on the target. The remaining work is hardening/finishing the paged op, not
+proving the thesis.
diff --git a/backend/cpp/llama-cpp/paged/paged-loadgen.cpp b/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
new file mode 100644
index 000000000..1491bcd7c
--- /dev/null
+++ b/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
@@ -0,0 +1,169 @@
+// paged-loadgen: a dynamic-load benchmark for paged KV that actually exercises the
+// regime where paging wins - variable prompt lengths, variable generation lengths,
+// staggered (continuous) arrival, and a shared system prefix. The stock
+// examples/paged/paged.cpp adds all requests up front with a fixed n_predict from a
+// 20-prompt pool, so it never creates KV-memory pressure or fragmentation and
+// therefore never shows a paged advantage (see PAGED_KV_HIGH_CONCURRENCY.md).
+//
+// Build: drop into PR #22569's examples/paged/ and add to its CMakeLists.txt next to
+// llama-paged (it uses the same llama_paged_scheduler_* API). Run on the TARGET GPU
+// (e.g. 2xH200) where bandwidth lets decode scale to thousands of sequences and KV
+// memory becomes the binding constraint - that is where paged KV pays off and where
+// this harness produces a meaningful number. On a low-bandwidth box (GB10) throughput
+// plateaus long before memory binds, so the win is not observable there regardless.
+//
+// Metrics reported:
+//   - goodput (decode tokens/s aggregate) under the dynamic load
+//   - peak concurrent in-flight sequences actually sustained
+//   - paged peak KV bytes used vs the contiguous reservation a unified cache needs
+//     (n_seq_peak * max_ctx), i.e. the capacity ratio = the headroom paging unlocks
+//
+// The capacity ratio is the load-bearing number for the buy decision: it is how many
+// times more concurrent tenants a fixed HBM budget serves with paging than without.
+
+#include "common.h"
+#include "llama.h"
+
+#include <algorithm>
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+#include <random>
+#include <vector>
+
+// ---- workload knobs (env-overridable so the harness is sweepable without rebuilds) ----
+static int env_int(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
+
+struct workload_cfg {
+    int total_requests  = env_int("LG_TOTAL",    2000);  // total requests to serve
+    int target_inflight = env_int("LG_INFLIGHT", 256);   // continuous-batching concurrency target
+    int prefix_tokens   = env_int("LG_PREFIX",   512);   // shared system-prompt prefix (prefix-cache target)
+    int suffix_min      = env_int("LG_SUFMIN",   16);    // per-request prompt suffix range
+    int suffix_max      = env_int("LG_SUFMAX",   768);
+    int gen_short       = env_int("LG_GENSHORT", 32);    // bimodal generation: most short...
+    int gen_long        = env_int("LG_GENLONG",  1024);  // ...some long (the over-reservation driver)
+    int gen_long_pct    = env_int("LG_LONGPCT",  15);    // % of requests that are long
+    int block_size      = env_int("LG_BLOCK",    16);    // must match -kvbls
+    unsigned seed       = (unsigned) env_int("LG_SEED", 1234);
+};
+
+// Per-request plan drawn from the workload distribution.
+struct req_plan { int prompt_len; int gen_len; };
+
+int main(int argc, char ** argv) {
+    common_params params;
+    params.n_predict = -1; // per-request, controlled by the plan below
+    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PAGED)) {
+        fprintf(stderr, "usage: %s -m <model> -kvp --fit off -ngpub N -ncpub M -ngl 99\n", argv[0]);
+        return 1;
+    }
+    params.kv_paged = true;
+
+    common_init_result init = common_init_from_params(params);
+    llama_model   * model = init.model.get();
+    llama_context * ctx   = init.context.get();
+    if (!model || !ctx) { fprintf(stderr, "load failed\n"); return 1; }
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+
+    workload_cfg cfg;
+    std::mt19937 rng(cfg.seed);
+    std::uniform_int_distribution<int> suf(cfg.suffix_min, cfg.suffix_max);
+    std::uniform_int_distribution<int> pct(1, 100);
+
+    // KV bytes/token = 2(K,V) * n_layers * n_head_kv * head_dim * sizeof(f16), matching the block_bytes
+    // formula in llama-kv-cache-paged.cpp. NOTE: n_embd / n_head misreports head_dim on models that decouple
+    // it (e.g. Qwen3, head_dim 128); absolute GiB figures are then off, but the factor cancels in the ratio.
+    const int n_layers  = llama_model_n_layer(model);
+    const int n_head_kv = llama_model_n_head_kv(model);
+    const int head_dim  = llama_model_n_embd(model) / llama_model_n_head(model);
+    const size_t kv_bytes_per_token = (size_t)2 * n_layers * n_head_kv * head_dim * sizeof(uint16_t);
+
+    // A long shared system prefix that every request reuses (the prefix-cache target).
+    const llama_token fill_tok = common_tokenize(ctx, "x", false).front(); // repeat one id: a run of 'x' chars would merge
+    std::vector<llama_token> prefix((size_t) cfg.prefix_tokens, fill_tok);  // so the prefix is exactly prefix_tokens long
+
+    // Pre-draw all request plans so paged peak usage and the contiguous reservation use the SAME workload.
+    std::vector<req_plan> plans(cfg.total_requests);
+    int max_ctx = 0;
+    for (auto & p : plans) {
+        p.prompt_len = cfg.prefix_tokens + suf(rng);
+        p.gen_len    = (pct(rng) <= cfg.gen_long_pct) ? cfg.gen_long : cfg.gen_short;
+        max_ctx      = std::max(max_ctx, p.prompt_len + p.gen_len);
+    }
+
+    llama_paged_scheduler * sched = llama_paged_scheduler_init(ctx);
+    if (!sched) { fprintf(stderr, "scheduler init failed\n"); return 1; }
+
+    // ---- continuous-arrival loop: keep ~target_inflight requests live at all times ----
+    int next_req = 0, done = 0, inflight = 0, peak_inflight = 0;
+    long total_decoded = 0;
+    size_t peak_kv_bytes_paged = 0;   // peak of the paged footprint estimate computed below
+    size_t live_used_tokens = 0;      // running sum of planned KV tokens held by live seqs
+
+    auto admit = [&](int rid) {
+        const req_plan & p = plans[rid];
+        std::vector<llama_token> toks = prefix;                                          // shared prefix...
+        toks.insert(toks.end(), (size_t) (p.prompt_len - cfg.prefix_tokens), fill_tok);  // ...+ filler suffix up to the planned length
+        if (llama_paged_scheduler_add_request(sched, toks.data(), toks.size(), rid)) {
+            inflight++; peak_inflight = std::max(peak_inflight, inflight);
+            live_used_tokens += p.prompt_len;
+        }
+    };
+
+    const int64_t t0 = ggml_time_us();
+    for (int i = 0; i < cfg.target_inflight && next_req < cfg.total_requests; ++i) admit(next_req++);
+
+    llama_batch batch = {};
+    std::vector<llama_token> sampled; std::vector<int8_t> stop_flags; // element types assumed; match llama_paged_scheduler_update()
+
+    while (done < cfg.total_requests) {
+        if (!llama_paged_scheduler_prepare_batch(sched, &batch)) break;
+        const llama_paged_batch_info * info = llama_paged_scheduler_get_batch_info(sched);
+        sampled.assign(info->n_seq, 0); stop_flags.assign(info->n_seq, 0);
+
+        // (decode is done inside the scheduler/update path in PR #22569; greedy here)
+        for (int i = 0; i < info->n_seq; ++i) {
+            const int rid = info->seq_ids[i];
+            llama_paged_seq_state st{};
+            llama_paged_scheduler_get_seq_state(sched, rid, &st);
+            // greedy argmax from the i-th row of logits
+            const float * lg = llama_get_logits_ith(ctx, i);
+            int best = 0; float bv = lg[0];
+            for (int t = 1; t < llama_vocab_n_tokens(vocab); ++t) if (lg[t] > bv) { bv = lg[t]; best = t; }
+            sampled[i] = best;
+            const bool stop = llama_vocab_is_eog(vocab, best) || st.n_decoded + 1 >= plans[rid].gen_len;
+            stop_flags[i] = stop ? 1 : 0;
+            if (!stop) { total_decoded++; live_used_tokens++; }
+            if (stop) {
+                done++; inflight--;
+                live_used_tokens -= (plans[rid].prompt_len + st.n_decoded);
+                if (next_req < cfg.total_requests) admit(next_req++);   // continuous arrival
+            }
+        }
+        // paged peak KV: blocks are allocated per live seq = ceil(used/block). Rounding the aggregate
+        // live_used_tokens instead is an approximation that under-counts by < block_size tokens per live seq.
+        const size_t paged_now = (size_t)std::ceil((double)live_used_tokens / cfg.block_size)
+                                 * cfg.block_size * kv_bytes_per_token;
+        peak_kv_bytes_paged = std::max(peak_kv_bytes_paged, paged_now);
+
+        llama_paged_scheduler_update(sched, &batch, sampled.data(), stop_flags.data());
+    }
+    const double secs = (ggml_time_us() - t0) / 1e6;
+
+    // Contiguous unified-KV reservation needed to serve the SAME peak concurrency without
+    // mid-generation eviction: every live slot must be backed for the worst-case context.
+    const size_t contig_reserve = (size_t)peak_inflight * max_ctx * kv_bytes_per_token;
+
+    printf("\n==== paged-loadgen ====\n");
+    printf("requests served     : %d  (target inflight %d, peak inflight %d)\n", done, cfg.target_inflight, peak_inflight);
+    printf("goodput (decode)    : %.1f tok/s  (%ld tokens / %.2f s)\n", total_decoded / secs, total_decoded, secs);
+    printf("kv bytes / token    : %zu  (n_layer=%d n_head_kv=%d head_dim=%d f16)\n", kv_bytes_per_token, n_layers, n_head_kv, head_dim);
+    printf("paged peak KV       : %.2f GiB  (allocated on demand)\n", peak_kv_bytes_paged / 1073741824.0);
+    printf("contiguous reserve  : %.2f GiB  (peak_inflight * max_ctx %d)\n", contig_reserve / 1073741824.0, max_ctx);
+    printf("CAPACITY RATIO      : %.2fx  <- tenants-per-HBM paging unlocks\n",
+           peak_kv_bytes_paged ? (double)contig_reserve / peak_kv_bytes_paged : 0.0);
+    printf("  (plus cross-request prefix sharing of the %d-token shared prefix, not counted above)\n", cfg.prefix_tokens);
+
+    llama_paged_scheduler_free(sched);
+    return 0;
+}
diff --git a/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch b/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
new file mode 100644
index 000000000..5de1bb641
--- /dev/null
+++ b/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
@@ -0,0 +1,12 @@
+diff --git a/tests/test-paged-kv-e2e.cpp b/tests/test-paged-kv-e2e.cpp
+index 5a352e3..06ead50 100644
+--- a/tests/test-paged-kv-e2e.cpp
++++ b/tests/test-paged-kv-e2e.cpp
+@@ -115,6 +115,7 @@ static path_result run_paged(const std::string & model_path) {
+     params.sampling.temp = 0.0f; // greedy
+     params.warmup = false;
+     params.kv_paged = true;
++    params.fit_params = false;   // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
+     params.n_gpu_blocks = 64;
+     params.n_cpu_blocks = 16;
+     params.n_sequences = 1;