mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
docs(paged): record BF16 GDN state cache phase
Record Phase81 default-off BF16 persistent S-cache results, including md5 drift, op gates, decode profile, and KL smoke. Scope Phase82 as full f16-reference KL plus serving A/B before patch-series promotion. Assisted-by: Codex:gpt-5
This commit is contained in:
@@ -12,13 +12,15 @@ with artifact path, gates, benchmark rows, and decision.
|
||||
- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`.
|
||||
- Current tested source: DGX mirror
|
||||
`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
|
||||
- Latest attempt: Phase80.
|
||||
- Latest decision: default-off source A/B `GDN_ASSUME_IDENTITY_IDS=1` is
|
||||
rejected for carry-forward/default. It removed the tiny `gdn_gather` bucket
|
||||
(`0.79 ms`) and kept gates green, but left `gdn_core` effectively unchanged
|
||||
(`1411.65 ms` baseline vs `1409.28 ms` candidate). The assumption is too
|
||||
narrow for the measured gain. The next real GDN lever remains recurrent-state
|
||||
precision/traffic, with a separate invasive scope.
|
||||
- Latest attempt: Phase81.
|
||||
- Latest decision: default-off `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` is a
|
||||
promising carry-forward candidate, not default-on. Same-source decode-only
|
||||
profiling cut normalized `gdn_core` from about `2.34 ms/launch` to
|
||||
`1.20 ms/launch` and total kernel time from `3.6157 s` to `3.5244 s`, but MoE
|
||||
greedy md5 changed from the paged canonical `8cb0...` to `07db...`. Dense md5
|
||||
stayed canonical and `MUL_MAT`/`MUL_MAT_ID` gates stayed green. Phase82 must
|
||||
run the full f16-reference KL gate and serving A/B before this can be promoted
|
||||
beyond an opt-in experiment.
|
||||
|
||||
## Current Serving Record
|
||||
|
||||
@@ -58,6 +60,76 @@ Decision:
|
||||
|
||||
## Attempt Log
|
||||
|
||||
### Phase81: Qwen35 BF16 Persistent GDN S-Cache
|
||||
|
||||
- Date: 2026-07-01.
|
||||
- Source: `/home/mudler/llama-phase81-bf16-state-source`, local fork patch in
|
||||
`/home/mudler/_git/llama.cpp` branch `localai-paged`.
|
||||
- Build artifact: `/home/mudler/llama-phase81-bf16-state-source/build-cuda`.
|
||||
- Gate artifact:
|
||||
`/home/mudler/bench/phase81_bf16_s_cache_gates/20260701_161350`.
|
||||
- Profile artifacts:
|
||||
- default F32:
|
||||
`/home/mudler/bench/phase81_bf16_s_cache_profile/default_20260701_162117`
|
||||
- BF16 S-cache:
|
||||
`/home/mudler/bench/phase81_bf16_s_cache_profile/bf16_20260701_162028`
|
||||
- KL smoke artifact:
|
||||
`/home/mudler/bench/phase81_bf16_s_cache_kl/20260701_162322`.
|
||||
- Result type: source experiment. `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`
|
||||
stores Qwen35/Qwen35MoE persistent recurrent S cache in BF16 while keeping GDN
|
||||
recurrence math, q/k/v/g/beta, and output in F32. Default remains F32.
|
||||
|
||||
Implementation scope:
|
||||
|
||||
- Added BF16 state support for `ggml_gated_delta_net_inplace_ids` only.
|
||||
- Added CPU/CUDA BF16 state load/store conversion at the persistent cache
|
||||
boundary.
|
||||
- Added BF16 CPU/CUDA `SCALE` support because recurrent cache zeroing uses
|
||||
`ggml_scale_inplace(..., 0)` on the S cache.
|
||||
- Added tests for BF16 `GATED_DELTA_NET_INPLACE_IDS` and BF16 in-place `SCALE`.
|
||||
|
||||
Local verification:
|
||||
|
||||
| check | result |
|
||||
|-------|--------|
|
||||
| RED test before implementation | `ggml_gated_delta_net_inplace_ids` rejected BF16 state at `state->type == GGML_TYPE_F32` |
|
||||
| CPU `SCALE -p bf16` | `1/1` passed |
|
||||
| CPU `GATED_DELTA_NET_INPLACE_IDS` | `2/2` passed |
|
||||
| DGX CUDA build | completed for `llama-completion`, `llama-batched-bench`, `test-backend-ops`, `llama-server`, later `llama-perplexity` |
|
||||
|
||||
Gates:
|
||||
|
||||
| mode | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
|
||||
|------|---------|-----------|-----------|--------------|
|
||||
| default F32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
|
||||
| BF16 S-cache | `07db32c2bcb78d17a43ed18bc22705cd` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
|
||||
|
||||
Profile:
|
||||
|
||||
| arm | env | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | `mmq_nvfp4` ms |
|
||||
|-----|-----|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|------------------:|---------------:|
|
||||
| default F32 | none | `128` | `65` | `87` | `3.6157` | `1480.44` | `40.94%` | `1399.30` | `599` | `2.336 ms` | `1394.28` |
|
||||
| BF16 S-cache | `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` | `128` | `65` | `91` | `3.5244` | `961.61` | `27.28%` | `863.57` | `720` | `1.199 ms` | `1665.38` |
|
||||
|
||||
KL smoke against same-source F32 base:
|
||||
|
||||
| check | result |
|
||||
|-------|--------|
|
||||
| shape | MoE, `-c 256 -b 256 --chunks 32`, Wikitext-2 raw |
|
||||
| F32 floor KLD vs F32 base | `0.000000 +/- 0.000000`, same-top-p `99.975%` |
|
||||
| BF16 S-cache KLD vs F32 base | `0.055499 +/- 0.001705`, same-top-p `88.361%` |
|
||||
| BF16 PPL ratio vs F32 base | `1.010356 +/- 0.005817` |
|
||||
|
||||
Decision:
|
||||
|
||||
- Carry forward as a default-off candidate and run Phase82 full gates.
|
||||
- Do not make it default-on: MoE greedy md5 is not canonical, and the KL smoke is
|
||||
not the full f16-reference acceptance gate.
|
||||
- Required Phase82 before patch-series promotion:
|
||||
full f16-reference KL gate for MoE and dense, same-source serving A/B against
|
||||
F32 default and vLLM, then regenerate LocalAI patches from the fork only if
|
||||
serving and KL both hold.
|
||||
|
||||
### Phase80: GDN Identity-Ids Shortcut Source A/B
|
||||
|
||||
- Date: 2026-07-01.
|
||||
|
||||
@@ -45,6 +45,18 @@ Read order for a cold start:
|
||||
> llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits
|
||||
> are local and DGX-gated but not pushed, so the LocalAI patch series has not
|
||||
> been regenerated.
|
||||
>
|
||||
> 2026-07-01 Phase81 update: the next viable GDN lever is no longer launch
|
||||
> shape or gather removal. A default-off Qwen35/Qwen35MoE BF16 persistent
|
||||
> recurrent S-cache experiment (`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`) cut
|
||||
> same-source decode-only `gdn_core` from `1399.30 ms / 599 launches`
|
||||
> (`2.34 ms/launch`) to `863.57 ms / 720 launches` (`1.20 ms/launch`). Default
|
||||
> F32 md5 gates and op gates stayed green, and BF16 dense md5 stayed canonical,
|
||||
> but BF16 MoE md5 changed to `07db32c2bcb78d17a43ed18bc22705cd`. A quick
|
||||
> MoE KL smoke vs the same-source F32 base showed KLD `0.055499 +/- 0.001705`,
|
||||
> same-top-p `88.361%`, and PPL ratio `1.010356`. Treat this as a promising
|
||||
> default-off candidate only. Phase82 must run the full f16-reference KL gate
|
||||
> plus serving A/B before regenerating LocalAI patches or considering promotion.
|
||||
|
||||
- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
|
||||
unreachable. Treat that as superseded where Phase50-54 provide newer dense
|
||||
|
||||
Reference in New Issue
Block a user