mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
feat(paged): parameterize served model name
Assisted-by: Codex:gpt-5
@@ -2581,3 +2581,37 @@ Decision:
- Current DGX phase36 build still passes the canonical inference md5/op gates.
- Phase44 did not touch inference code; Phase45 provides the post-change guard
  artifact for future handoff and comparison.

## Phase 46 Served-Model-Name Harness Readiness

Phase 46 removes the remaining hardcoded `q36` model name from the audited
serving snapshot harness. This is a harness-only hardware-pivot readiness
change: it does not change llama.cpp inference code, patch-series source, md5
gates, op gates, or any throughput result.

New override:

| variable | default | used for |
|----------|---------|----------|
| `SERVED_MODEL_NAME` | `q36` | vLLM `--served-model-name`, vLLM readiness check, and h2h `--model` requests for both paged and vLLM arms |

Verification:

- Red help-text check first proved `SERVED_MODEL_NAME` was absent from
  `paged-current-serving-snapshot.sh --help`.
- Red DGX dry-run check first proved the harness did not print
  `SERVED_MODEL_NAME=dense-q36` when supplied.
- Green checks after the patch included `bash -n`, help-text grep, a source grep
  proving no hardcoded `q36` serve/request names remain in the harness, and DGX
  `DRY_RUN=1` preflight with the override value printed before any server
  starts. Artifact:
  `/home/mudler/bench/phase46_served_model_name_dryrun/20260701_094849`.

Decision:

- Future dense, MoE, or hardware-pivot snapshots can keep the same audited
  harness while setting model paths and the served OpenAI model name from the
  environment.
- This does not claim a new parity result. Full runs still require the normal
  preflight, `hardware.txt`, pre/post md5 gates, `MUL_MAT`/`MUL_MAT_ID`, and
  KL-if-md5-changes gates before interpreting throughput.
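
The source grep mentioned under Verification is not reproduced in this commit. A minimal sketch of such a check follows, assuming the hardcoded names only ever appear as `--model q36`, `--served-model-name q36`, or a quoted `"q36"` readiness pattern after a `/v1/models` URL. The function name `find_hardcoded_model_names` and the pattern are illustrative assumptions, not the command that was actually run.

```shell
# Hypothetical check: print lines that still pass a literal q36 as a
# serve/request model name. The help text and the ${SERVED_MODEL_NAME:-q36}
# default legitimately contain "q36", so the pattern is deliberately narrow.
# Exit status follows grep: 0 if a hardcoded name is found, 1 if none remain.
find_hardcoded_model_names() {
  grep -nE -- '(--model|--served-model-name)[[:space:]]+q36([[:space:]]|$)|/v1/models"[[:space:]]+"q36"' "$1"
}
```

A green run would then be `find_hardcoded_model_names paged-current-serving-snapshot.sh` exiting `1` with no output. A bare `grep q36` would not work as a gate here, because the default value itself must keep matching.
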
@@ -574,6 +574,14 @@ phase36 build passed MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5
`806/806`. Docker, `local-ai-worker`, and GPU compute preflight were all zero
before and after the run.

Phase 46 removes the last hardcoded `q36` served-model name from the audited
serving snapshot harness. Set `SERVED_MODEL_NAME` to drive vLLM
`--served-model-name`, the vLLM readiness check, and h2h `--model` on both
engines. DGX dry run:
`/home/mudler/bench/phase46_served_model_name_dryrun/20260701_094849`, with
`SERVED_MODEL_NAME=dense-q36` printed during `DRY_RUN=1`. This is harness-only
hardware-pivot readiness, not a throughput result.

---

## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes)
@@ -660,6 +668,7 @@ Only pursue if (a)+(b) are not options and someone explicitly wants the residual
- `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1.
- `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts.
- `~/bench/phase45_inference_gate_guard/20260701_094320` - post-Phase44 inference guard; MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` backend-op gates green.
- `~/bench/phase46_served_model_name_dryrun/20260701_094849` - harness-only dry-run artifact proving `SERVED_MODEL_NAME` is printed and preflighted before any server starts.
- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`.
- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30.
- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`.
@@ -1123,6 +1123,21 @@ Results stayed green on the DGX phase36 build: MoE md5
`MUL_MAT_ID` `806/806`. This confirms the current build still satisfies the
inference-safety gates before any later hardware-pivot or larger kernel work.

### Phase 46 served-model-name harness readiness

Phase 46 removes the hardcoded `q36` served model name from
`paged-current-serving-snapshot.sh`. The new `SERVED_MODEL_NAME` environment
variable defaults to `q36` and is used consistently for vLLM
`--served-model-name`, the vLLM `/v1/models` readiness check, and h2h `--model`
requests on both arms.

DGX dry-run artifact:
`/home/mudler/bench/phase46_served_model_name_dryrun/20260701_094849`.
Preflight was clean and the dry run printed
`SERVED_MODEL_NAME=dense-q36` before any server launch. This is another
harness-only portability step for dense or hardware-pivot snapshots; it does not
change inference code or produce a new throughput result.

Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`.

### Phase 10 GDN C32 slab update
@@ -17,6 +17,7 @@ Environment overrides:
 BIN llama.cpp build bin dir (default: $SRC/build-cuda/bin)
 MODEL paged GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf)
 VLLM_MODEL vLLM model dir (default: ~/bench/q36-35b-a3b-nvfp4-vllm)
+SERVED_MODEL_NAME OpenAI model name used by llama-server, vLLM, and h2h (default: q36)
 H2H h2h client (default: ~/bench/h2h_cli3.py)
 ART artifact dir (default: ~/bench/phase_current_serving_snapshot/<timestamp>)
 NPL concurrency list (default: "8 32 128")
@@ -68,6 +69,7 @@ BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"}
 BIN=${BIN:-"$BUILD_DIR/bin"}
 MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"}
 VLLM_MODEL=${VLLM_MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4-vllm"}
+SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-q36}
 H2H=${H2H:-"$HOME/bench/h2h_cli3.py"}
 ART=${ART:-"$HOME/bench/phase_current_serving_snapshot/$(date +%Y%m%d_%H%M%S)"}
 NPL=${NPL:-"8 32 128"}
@@ -231,11 +233,11 @@ run_paged() {
   SERVER_PID=$!
   wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json"
   python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \
-    --model q36 -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_paged_$(date +%s)" --no-cache >/dev/null
+    --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_paged_$(date +%s)" --no-cache >/dev/null
   for n in $NPL; do
     log "paged n=$n"
     python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \
-      --model q36 -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --model "$SERVED_MODEL_NAME" -n "$n" --ptok "$PTOK" --gen "$GEN" \
       --nonce "paged_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
     cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
   done
@@ -257,18 +259,18 @@ run_vllm() {
   fi
   log "starting vLLM server"
   nohup "$VLLM_BIN" serve "$VLLM_MODEL" \
-    --served-model-name q36 --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \
+    --served-model-name "$SERVED_MODEL_NAME" --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \
     --max-num-seqs "$VLLM_MAX_NUM_SEQS" --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \
     "${extra_args[@]}" \
     > "$arm_dir/server.log" 2>&1 &
   SERVER_PID=$!
-  wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "q36" "$arm_dir/server.log" "$arm_dir/models.json"
+  wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" "$arm_dir/server.log" "$arm_dir/models.json"
   python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \
-    --model q36 -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_vllm_$(date +%s)" --no-cache >/dev/null
+    --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_vllm_$(date +%s)" --no-cache >/dev/null
   for n in $NPL; do
     log "vllm n=$n"
     python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \
-      --model q36 -n "$n" --ptok "$PTOK" --gen "$GEN" \
+      --model "$SERVED_MODEL_NAME" -n "$n" --ptok "$PTOK" --gen "$GEN" \
       --nonce "vllm_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json"
     cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log"
   done
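
The body of `wait_http` lies outside the hunks in this commit, so its implementation is not visible here. As a rough sketch of what a readiness poll of this kind does, assuming it fetches a URL repeatedly until the response body contains a fixed string, consider the hypothetical helper below. `fetch_url`, `wait_for_body`, and the retry defaults are illustrative names and values, not the harness's actual code.

```shell
# Hypothetical stand-ins for the harness's wait_http; names and defaults are
# assumptions made for illustration only.
fetch_url() {
  # -f: treat HTTP errors as failures, -sS: quiet but still report errors.
  curl -fsS --max-time 5 "$1"
}

# wait_for_body URL PATTERN [TRIES] [DELAY_SECONDS]
# Polls URL until the response body contains PATTERN as a literal substring.
# The function body is a subshell, so its variables stay local to the call.
wait_for_body() (
  url=$1 pattern=$2 tries=${3:-60} delay=${4:-1}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if body=$(fetch_url "$url" 2>/dev/null); then
      case $body in
        *"$pattern"*) return 0 ;;
      esac
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
)
```

With a substring match like this one, a pattern of `q36` would also accept a server that only lists `dense-q36`. Passing the full `$SERVED_MODEL_NAME`, as the diff above does, narrows that gap; a stricter variant would parse the JSON and compare each model `id` exactly.
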
@@ -394,6 +396,7 @@ log "source=$(git -C "$SRC" log --oneline -1)"
 if [[ "$DRY_RUN" == "1" ]]; then
   log "dry run only; commands validated"
   log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8"
+  log "served model: SERVED_MODEL_NAME=$SERVED_MODEL_NAME"
   log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
   log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN"
   log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]"
155
docs/superpowers/plans/2026-07-01-served-model-name-phase46.md
Normal file
@@ -0,0 +1,155 @@
# Phase46 Served Model Name Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Let the audited serving snapshot harness run MoE, dense, or hardware-pivot model variants without hardcoded `q36` model names.

**Architecture:** Add a single `SERVED_MODEL_NAME` environment variable to `paged-current-serving-snapshot.sh`, defaulting to `q36`. Use it consistently for vLLM `--served-model-name`, vLLM model readiness checks, and h2h `--model` requests on both engines. Print it in `DRY_RUN=1` output so hardware-pivot runs can be audited before launching servers.

**Tech Stack:** Bash serving harness, DGX dry-run preflight, LocalAI parity docs.

---

### Task 1: Prove the override is missing

**Files:**
- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`

- [x] **Step 1: Run help-text red check**

```bash
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'SERVED_MODEL_NAME'
```

Expected: exit `1`, because the harness does not document the model-name override yet.

- [x] **Step 2: Run DGX dry-run red check**

```bash
ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase46_served_model_name_dryrun_red/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 SERVED_MODEL_NAME=dense-q36 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh | grep -F 'SERVED_MODEL_NAME=dense-q36'
```

Expected: exit `1`, because `DRY_RUN=1` does not print the served model name yet.

### Task 2: Add `SERVED_MODEL_NAME`

**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`

- [x] **Step 1: Document the variable**

Add this line after `VLLM_MODEL`:

```bash
SERVED_MODEL_NAME OpenAI model name used by llama-server, vLLM, and h2h (default: q36)
```

- [x] **Step 2: Add the default**

Add this assignment after `VLLM_MODEL`:

```bash
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-q36}
```

- [x] **Step 3: Replace hardcoded h2h model names**

Replace every h2h `--model q36` with:

```bash
--model "$SERVED_MODEL_NAME"
```

- [x] **Step 4: Replace hardcoded vLLM model name and readiness check**

Replace:

```bash
--served-model-name q36
wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "q36" ...
```

with:

```bash
--served-model-name "$SERVED_MODEL_NAME"
wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" ...
```

- [x] **Step 5: Print it in dry-run output**

Add:

```bash
log "served model: SERVED_MODEL_NAME=$SERVED_MODEL_NAME"
```

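
Step 2 relies on the `${VAR:-default}` parameter expansion. The sketch below wraps the same expansion in a hypothetical `resolve_served_model_name` function, which is not part of the harness, to show the three cases that matter: unset, empty, and explicitly set.

```shell
# Hypothetical wrapper around the same expansion the harness uses.
# ${VAR:-default} substitutes the default when VAR is unset OR empty;
# ${VAR-default} (no colon) would keep an explicitly empty value.
resolve_served_model_name() {
  printf '%s\n' "${SERVED_MODEL_NAME:-q36}"
}

unset SERVED_MODEL_NAME
resolve_served_model_name      # unset: falls back to the default

SERVED_MODEL_NAME=''
resolve_served_model_name      # empty: also falls back to the default

SERVED_MODEL_NAME=dense-q36
resolve_served_model_name      # set: the override is used as-is
```

The colon form seems the safer choice for a harness like this one, since an accidentally empty `SERVED_MODEL_NAME=` in a wrapper script would otherwise pass an empty model name to both engines.
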
### Task 3: Verify the harness

**Files:**
- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`

- [x] **Step 1: Shell syntax check**

```bash
bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
```

Expected: exit `0`.

- [x] **Step 2: Help-text green check**

```bash
backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'SERVED_MODEL_NAME'
```

Expected: exit `0`.

- [x] **Step 3: DGX dry-run green check**

```bash
ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase46_served_model_name_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 SERVED_MODEL_NAME=dense-q36 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
```

Expected: exit `0`, clean preflight, and output includes `SERVED_MODEL_NAME=dense-q36`.

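
The red checks in Task 1 and the green checks in Task 3 run the same pipelines and differ only in the exit status they expect. One way to make that expectation explicit is sketched below with a hypothetical `expect_exit` helper, which does not exist in the repository. The sample pipelines use stand-in text rather than the real `--help` output.

```shell
# expect_exit EXPECTED COMMAND [ARGS...]
# Runs COMMAND quietly and succeeds only if its exit status equals EXPECTED.
# Hypothetical helper for illustration; not part of the harness.
expect_exit() {
  expected=$1
  shift
  actual=0
  "$@" >/dev/null 2>&1 || actual=$?
  if [ "$actual" -ne "$expected" ]; then
    printf 'expected exit %s, got %s: %s\n' "$expected" "$actual" "$*" >&2
    return 1
  fi
}

# Red: the name is absent from the (stand-in) help text, so grep exits 1.
expect_exit 1 sh -c 'printf "MODEL\nVLLM_MODEL\n" | grep -F SERVED_MODEL_NAME'

# Green: once the name is documented, the same grep exits 0.
expect_exit 0 sh -c 'printf "VLLM_MODEL\nSERVED_MODEL_NAME\n" | grep -F SERVED_MODEL_NAME'
```

Without `pipefail`, a pipeline reports the status of its last command, so exit `1` in the red checks reflects `grep` finding no match rather than the harness itself failing.
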
### Task 4: Record Phase46

**Files:**
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
- Modify: `docs/superpowers/plans/2026-07-01-served-model-name-phase46.md`

- [x] **Step 1: Append the Phase46 result**

Record that this is harness-only hardware-pivot readiness and cite the DGX dry-run artifact.

- [x] **Step 2: Mark all completed plan items**

Mark this file's remaining task checkboxes complete only after the corresponding command or docs update has happened.

### Task 5: Commit

**Files:**
- Commit Phase46 script, docs, and plan changes.

- [x] **Step 1: Run final checks**

```bash
git diff --check
git status --short
```

Expected: no whitespace errors; only intended files changed plus the pre-existing untracked `.claude/`.

- [x] **Step 2: Commit**

```bash
git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \
  backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
  backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
  backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
git add -f docs/superpowers/plans/2026-07-01-served-model-name-phase46.md
git commit -m "feat(paged): parameterize served model name" -m "Assisted-by: Codex:gpt-5"
```