Commit Graph

6 Commits

Author SHA1 Message Date
Ettore Di Giacinto
9787bee48b fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles
Two more hipblas-only build failures in buun's fattn.cu, fixed under the
same patches/ infrastructure:

1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration +
   TCQ codebook upload paths call the symbol variants of cudaMemcpy.
   ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy*
   name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the
   symbol pair was never added. 15+ "use of undeclared identifier"
   errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883,
   905, 954, 976, 1449, 1463. Add the two missing aliases alongside
   the existing memcpy block.

2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the
   earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp
   butterfly) and 536 (fwht128_store_half neighbor fetch) drop the
   width argument that hip.h:33 requires. Add WARP_SIZE.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 20:09:36 +00:00
Ettore Di Giacinto
42754d33b9 fix(buun-llama-cpp): pass WARP_SIZE to argmax __shfl_xor_sync calls
Two call sites in ggml/src/ggml-cuda/argmax.cu (the top-K intra-warp
merge added by buun) use the 3-arg CUDA form __shfl_xor_sync(mask, var,
laneMask), omitting the optional width parameter. The hipification shim
at ggml/src/ggml-cuda/vendors/hip.h:33 is a function-like macro that
requires all four arguments, so hipcc fails with:

    argmax.cu:265: too few arguments provided to function-like macro
      invocation
    note: macro '__shfl_xor_sync' defined here:
      #define __shfl_xor_sync(mask, var, laneMask, width) \
              __shfl_xor(var, laneMask, width)

Every other call in the same file already passes WARP_SIZE explicitly;
aligning these two with that convention fixes the hipblas build without
changing CUDA codegen (warpSize is the CUDA default).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 16:29:29 +00:00
Ettore Di Giacinto
7f2b7e4ace fix(buun-llama-cpp): shim atomicAdd(double*,double) for pre-sm_60 CUDA
Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls
atomicAdd with a double* destination. Native double atomicAdd is only
available on CUDA compute capability 6.0 and later — LocalAI's CUDA 12
Docker image builds for the full published arch range (which includes
sm_50/sm_52), so nvcc fails with:

    fattn.cu:812: error: no instance of overloaded function "atomicAdd"
    matches the argument list, argument types are: (double *, double)

Add the canonical CAS-loop shim from the CUDA C Programming Guide
(B.15 Atomic Functions) guarded on __CUDA_ARCH__ < 600. On sm_60+ the
guard is false and nvcc picks up the native intrinsic as before.

Patch file lives under backend/cpp/buun-llama-cpp/patches/ and is
applied to the cloned fork tree by apply-patches.sh (the infrastructure
already put in place for exactly this class of backport).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 13:57:30 +00:00
Ettore Di Giacinto
d6bf3a4969 fix(buun-llama-cpp): drop logit_bias_eog arg from params_from_json_cmpl
Previous substitution kept the call as 5 args, but buun predates the
upstream refactor that also *added* the logit_bias_eog parameter to
params_from_json_cmpl — buun's signature is still the 4-arg form
  (const llama_vocab*, const common_params&, int, const json&)
and it still derives logit_bias_eog internally from the common_params.

Replace the substitution with a line-delete. Guard matches both the
original call (ctx_server.get_meta().logit_bias_eog) and the previously
substituted form (params_base.sampling.logit_bias_eog) so the script
stays safe across re-runs and whatever state the tree was left in.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Ettore Di Giacinto
b27d38a53d fix(buun-llama-cpp): backport logit_bias_eog field to grpc-server copy
LocalAI's shared grpc-server.cpp reaches
ctx_server.get_meta().logit_bias_eog twice (the twin params_from_json_cmpl
callsites). That accessor was added to server_context_meta upstream after
buun's 2026-04-05 fork-point, so compiling against buun errors with
  'struct server_context_meta' has no member named 'logit_bias_eog'.

Rewrite the call sites — only in the buun grpc-server.cpp copy — to source
the vector from params_base.sampling.logit_bias_eog instead. That vector is
the underlying data the upstream meta accessor eventually returns (buun
still carries common_params_sampling::logit_bias_eog at common.h:280), so
the substitution yields identical behavior on both trees.

The sed is guarded by a grep for the call site, so this patch is
self-disabling once buun rebases past the upstream refactor.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Ettore Di Giacinto
cd6079b2f3 feat(backend): add buun-llama-cpp fork (DFlash + TCQ KV-cache)
spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds
two independent features on top: DFlash block-diffusion speculative decoding
(via a dedicated DFlashDraftModel GGUF arch) and two extra TCQ KV-cache
variants (turbo2_tcq, turbo3_tcq) on top of TurboQuant's turbo2/turbo3/turbo4.

Follows the turboquant thin-wrapper pattern — reuses backend/cpp/llama-cpp
grpc-server sources verbatim, patches only the build copy to extend the KV
allow-list and wire up buun-exclusive tree_budget / draft_topk options.
DraftModel is already wired end-to-end (proto field 39 → params.speculative),
so DFlash activation only needs the existing options passthrough
(spec_type:dflash) plus the drafter path in draft_model.

CacheTypeOptions now surfaces the five turbo* values so the React UI dropdown
shows them — benefits turboquant too (previously users had to type them in
YAML manually).

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00