mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record TTFT min32 serving phase
Assisted-by: Codex:gpt-5
@@ -12,49 +12,77 @@ with artifact path, gates, benchmark rows, and decision.
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
- Current tested source: DGX mirror
  `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Latest attempt: Phase71.
- Latest decision: keep shipped GDN M5 default as-is. It still beats
  sequential-disabled and serial-chunked GDN, and forced `GDN_TC=4` is within
  noise of the current default. Do not reopen smaller GDN kernel reorders on
  GB10.
- Latest attempt: Phase72.
- Latest decision: keep `LLAMA_TTFT_PREFILL_FIRST=1`
  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed broad
  serving aggregate, decode, TTFT, and wall time at `n=8`, `n=32`, and `n=128`.

## Current Serving Record

Phase70 broader serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`.
Phase72 broader serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`.

Artifact:

- `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`
- `/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`

| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s |
|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:|
| llama default | `8` | `178.5` | `242.6` | `29.82` | `1767.2` | `754.8` | `2.868` |
| llama opt-in | `8` | `158.8` | `218.3` | `26.60` | `1541.1` | `848.9` | `3.225` |
| vLLM | `8` | `260.9` | `299.5` | `36.67` | `5415.6` | `239.0` | `1.917` |
| llama default | `32` | `250.1` | `418.7` | `11.75` | `1661.2` | `2717.0` | `8.187` |
| llama opt-in | `32` | `247.9` | `417.6` | `11.79` | `1650.3` | `2803.9` | `8.261` |
| vLLM | `32` | `465.3` | `608.4` | `17.74` | `5394.4` | `782.7` | `4.314` |
| llama default | `128` | `322.5` | `706.2` | `3.87` | `1613.9` | `7836.5` | `25.401` |
| llama opt-in | `128` | `324.8` | `697.9` | `3.88` | `1671.1` | `7720.9` | `25.220` |
| vLLM | `128` | `659.9` | `1020.4` | `6.75` | `5228.0` | `2543.1` | `12.060` |
| llama default | `8` | `170.4` | `231.3` | `28.42` | `1693.4` | `786.4` | `3.004` |
| llama min32 | `8` | `158.5` | `218.4` | `26.27` | `1547.8` | `816.2` | `3.230` |
| vLLM | `8` | `260.0` | `305.9` | `37.32` | `4659.7` | `266.4` | `1.915` |
| llama default | `32` | `257.8` | `430.2` | `12.09` | `1720.4` | `2625.2` | `7.943` |
| llama min32 | `32` | `242.7` | `411.7` | `11.58` | `1617.4` | `2881.6` | `8.439` |
| vLLM | `32` | `463.6` | `601.0` | `17.60` | `5496.2` | `773.7` | `4.357` |
| llama default | `128` | `325.8` | `714.0` | `3.92` | `1628.8` | `7822.5` | `25.148` |
| llama min32 | `128` | `316.0` | `697.9` | `3.81` | `1606.0` | `8056.9` | `25.926` |
| vLLM | `128` | `666.4` | `1029.5` | `6.81` | `5292.5` | `2511.7` | `11.933` |

Ratios:

| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM | default agg/vLLM | opt agg/vLLM |
|--:|----------------:|-------------------:|-----------------:|--------------------:|----------------:|-----------------:|-------------:|
| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` | `0.6842` | `0.6087` |
| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` | `0.5375` | `0.5328` |
| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` | `0.4887` | `0.4922` |

| n | min32/default agg | min32/default decode | min32/default TTFT | default decode/vLLM | min32 decode/vLLM |
|--:|------------------:|---------------------:|-------------------:|--------------------:|----------------:|
| `8` | `0.9302` | `0.9442` | `1.0379` | `0.7561` | `0.7140` |
| `32` | `0.9414` | `0.9570` | `1.0977` | `0.7158` | `0.6850` |
| `128` | `0.9699` | `0.9775` | `1.0300` | `0.6935` | `0.6779` |

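The min32 ratio columns can be recomputed from the Phase72 serving rows (the `llama min32` arm and its matching `llama default` and `vLLM` rows). The Python sketch below does that as a cross-check. The dictionary layout and the function name are illustrative assumptions, not part of the benchmark harness; only the numbers are taken from the table.

```python
# Phase72 serving rows copied from the serving table.
# Tuple layout is an assumption of this sketch:
# (agg_tps, decode_agg_tps, ttft_mean_ms).
ROWS = {
    8: {
        "default": (170.4, 231.3, 786.4),
        "min32": (158.5, 218.4, 816.2),
        "vllm": (260.0, 305.9, 266.4),
    },
    32: {
        "default": (257.8, 430.2, 2625.2),
        "min32": (242.7, 411.7, 2881.6),
        "vllm": (463.6, 601.0, 773.7),
    },
    128: {
        "default": (325.8, 714.0, 7822.5),
        "min32": (316.0, 697.9, 8056.9),
        "vllm": (666.4, 1029.5, 2511.7),
    },
}


def phase72_ratios(n):
    """Return the five ratio columns for concurrency n, rounded to 4 places."""
    d, m, v = ROWS[n]["default"], ROWS[n]["min32"], ROWS[n]["vllm"]
    return {
        "min32/default agg": round(m[0] / d[0], 4),
        "min32/default decode": round(m[1] / d[1], 4),
        "min32/default TTFT": round(m[2] / d[2], 4),
        "default decode/vLLM": round(d[1] / v[1], 4),
        "min32 decode/vLLM": round(m[1] / v[1], 4),
    }


if __name__ == "__main__":
    for n in ROWS:
        print(n, phase72_ratios(n))
```

Throughput ratios below `1.0` and TTFT ratios above `1.0` both indicate that min32 is worse than the default arm.
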
Decision:

- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`.
- Keep as default-off opt-in only.
- The opt-in regressed `n=8` throughput and TTFT materially, and slightly
  widened the vLLM decode gap at `n=32` and `n=128`.
- Reject default-on for `LLAMA_TTFT_PREFILL_FIRST=1`
  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32`.
- Keep min32 as opt-in only.
- The opt-in regressed aggregate, decode, TTFT, and wall time at every tested
  concurrency and widened the vLLM decode gap.

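The "regressed at every tested concurrency" claim for min32 depends on the direction of each metric: throughput regresses when it drops, while TTFT and wall time regress when they rise. A minimal Python sketch of that check follows, using the default and min32 values from the Phase72 serving rows. The helper names and the strict comparison (no noise band) are assumptions of this sketch, not the project's actual gate logic.

```python
# Direction of "better" per metric (True means higher is better).
HIGHER_IS_BETTER = {
    "agg_tps": True,
    "decode_agg_tps": True,
    "ttft_mean_ms": False,
    "wall_s": False,
}

# (default, min32) pairs copied from the Phase72 serving rows, keyed by n.
PAIRS = {
    8: {"agg_tps": (170.4, 158.5), "decode_agg_tps": (231.3, 218.4),
        "ttft_mean_ms": (786.4, 816.2), "wall_s": (3.004, 3.230)},
    32: {"agg_tps": (257.8, 242.7), "decode_agg_tps": (430.2, 411.7),
         "ttft_mean_ms": (2625.2, 2881.6), "wall_s": (7.943, 8.439)},
    128: {"agg_tps": (325.8, 316.0), "decode_agg_tps": (714.0, 697.9),
          "ttft_mean_ms": (7822.5, 8056.9), "wall_s": (25.148, 25.926)},
}


def regressed(metric, default, candidate):
    """True if candidate is strictly worse than default for this metric."""
    if HIGHER_IS_BETTER[metric]:
        return candidate < default
    return candidate > default


def regressions(pairs):
    """Map each concurrency to the list of metrics where the candidate regressed."""
    return {
        n: [m for m, (d, c) in metrics.items() if regressed(m, d, c)]
        for n, metrics in pairs.items()
    }


if __name__ == "__main__":
    for n, metrics in regressions(PAIRS).items():
        print(n, metrics)
```

A stricter gate would likely compare against a noise threshold rather than a bare inequality; the threshold value is not stated in this record, so none is assumed here.
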
## Attempt Log

### Phase72: TTFT Min32 Broader Serving

- Date: 2026-07-01.
- Plan: `docs/superpowers/plans/2026-07-01-ttft-min32-serving-phase72.md`.
- Artifact:
  `/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`.
- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
  `PARALLEL=128`, `CTX=131072`.
- Env gate: `LLAMA_TTFT_PREFILL_FIRST=1`
  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32`.

Gates:

| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|------|---------|-----------|-----------|--------------|
| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
| pre min32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |
| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |
| post min32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |

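The md5 gates in the table amount to comparing a digest per model against an expected value before and after each arm. The Python sketch below shows one way to express that comparison. It assumes the expected MoE md5 is the value every row reports, and it does not know what bytes the real harness hashes; `md5_hex` and `failing_gates` are hypothetical helper names.

```python
import hashlib

# Expected digests. The dense value is the canonical md5 from this record;
# treating the MoE value as the expected one is an assumption of this sketch.
EXPECTED = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}

# Digests reported by the four Phase72 gate rows.
GATES = {
    "pre default": dict(EXPECTED),
    "pre min32": dict(EXPECTED),
    "post default": dict(EXPECTED),
    "post min32": dict(EXPECTED),
}


def md5_hex(data: bytes) -> str:
    """Hex md5 of raw bytes, e.g. a captured generation output."""
    return hashlib.md5(data).hexdigest()


def failing_gates(gates, expected):
    """Return the names of gates whose digests differ from the expected ones."""
    return [name for name, got in gates.items() if got != expected]


if __name__ == "__main__":
    bad = failing_gates(GATES, EXPECTED)
    print("gates green" if not bad else f"gates failed: {bad}")
```

An empty list from `failing_gates` corresponds to "gates stayed green"; any changed digest would name the arm that diverged.
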
Result:

- Reject default-on for min32 in the broader serving shape.
- Keep the scheduler knob opt-in only.
- min32 regressed aggregate, decode, TTFT, and wall time for every tested
  concurrency.

### Phase71: GDN Tensor-Core Revalidation

- Date: 2026-07-01.

@@ -1150,3 +1150,35 @@ Post-Phase71 do-not-reopen list for GB10:
The only GDN work that should be reconsidered is a larger FLA/CuteDSL-class
blocked-solve implementation or a hardware pivot where the GB10 constraints no
longer apply.

## 17. PHASE72 RESULT: TTFT MIN32 BROADER SERVING

Phase72 broadened the Phase59 min32 scheduler result to the same serving shape
used by Phase70. Plan:
`docs/superpowers/plans/2026-07-01-ttft-min32-serving-phase72.md`.
Benchmark ledger:
`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`.
DGX artifact:
`/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`.

Source under test stayed at DGX mirror commit
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was
changed.

Gates stayed green. Pre default matched MoE md5 `8cb0ce23`, dense md5
`5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Pre/post min32 and
post default md5 gates also matched MoE `8cb0ce23` and dense `5951a5b4`.

Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`,
`PARALLEL=128`.

| n | min32/default agg | min32/default decode | min32/default TTFT | default decode/vLLM | min32 decode/vLLM |
|---:|------------------:|---------------------:|-------------------:|--------------------:|------------------:|
| `8` | `0.9302` | `0.9442` | `1.0379` | `0.7561` | `0.7140` |
| `32` | `0.9414` | `0.9570` | `1.0977` | `0.7158` | `0.6850` |
| `128` | `0.9699` | `0.9775` | `1.0300` | `0.6935` | `0.6779` |

Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed aggregate,
decode, TTFT, and wall time at every tested concurrency in the broader shape,
and widened the vLLM decode gap. Do not default this scheduler policy on GB10.

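Because the policy stays opt-in, an A/B run has to make sure the default arm does not inherit the two variables from the parent shell. The Python sketch below builds a clean environment per arm. `arm_env` is a hypothetical helper, and it assumes the server treats unset variables as "off"; how the binary is launched is left out because it is not described here.

```python
import os

# The two variables that together enable the min32 scheduler policy.
OPT_IN = {
    "LLAMA_TTFT_PREFILL_FIRST": "1",
    "LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING": "32",
}


def arm_env(arm, base=None):
    """Return an environment dict for the 'default' or 'min32' arm.

    The opt-in variables are always stripped first so that the default arm
    cannot pick them up from the calling shell.
    """
    if arm not in ("default", "min32"):
        raise ValueError(f"unknown arm: {arm!r}")
    env = dict(os.environ if base is None else base)
    for key in OPT_IN:
        env.pop(key, None)
    if arm == "min32":
        env.update(OPT_IN)
    return env


if __name__ == "__main__":
    for arm in ("default", "min32"):
        env = arm_env(arm)
        print(arm, {k: env.get(k) for k in OPT_IN})
```

The returned dict can be passed as the `env` argument of `subprocess.run` when starting each benchmark arm.
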
@@ -1579,6 +1579,13 @@ lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill,
`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the
scheduler knob opt-in and return parity work to the prefill / MoE compute gap.

Phase72 broadened that min32 result to the Phase70 serving shape. Artifact:
`/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`. Gates stayed
green, but min32 regressed every tested concurrency: aggregate ratios
`0.9302`/`0.9414`/`0.9699`, decode ratios `0.9442`/`0.9570`/`0.9775`, and TTFT
ratios `1.0379`/`1.0977`/`1.0300` at `n=8/32/128`. Keep min32 opt-in only and
do not default it on GB10.

### Phase 60 current W4A16 prefill profile

Phase60 re-profiled the current W4A16 grouped MoE prefill path after the