mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-16 12:49:08 -04:00
* chore(turboquant): bump TheTom/llama-cpp-turboquant to 7d9715f1

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): drop obsolete legacy-spec shim after fork rebased

The TheTom/llama-cpp-turboquant fork (pin c9aa86a) rebased past the
upstream common_params_speculative refactor (ggml-org/llama.cpp
#22397/#22838/#22964), the model_tgt rename (#22838) and
get_media_marker (#21962). The old fork-compat shim forced now-wrong
legacy code paths, breaking the build with errors like 'struct
common_params_speculative has no member named mparams_dft / type' and
'server_context_impl has no member named model'.

Remove the obsolete LOCALAI_LEGACY_LLAMA_CPP_SPEC branches from the
shared grpc-server.cpp (stock llama-cpp and the modern fork both take
the modern path now), and narrow the one remaining gap (the fork still
lacks common_params::checkpoint_min_step) to a dedicated
LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP guard injected by
patch-grpc-server.sh. The patch script now only adds the turbo2/3/4
KV-cache types and injects that one macro.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* fix(turboquant): HIP-port the fork's CUDA additions (copy2d 3D-peer + cudaEventCreate)

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA
APIs that ggml's HIP/MUSA shim does not provide, breaking the
-gpu-rocm-hipblas-turboquant build.
patches/0001-hip-guard-copy2d-peer-fastpath.patch (applied by
apply-patches.sh) ports them:

- Guard ggml_cuda_copy2d_across_devices's 3D-peer copy fast path with
  #if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) so HIP/MUSA fall
  through to the existing cudaMemcpyAsync staging fallback (HIP
  genuinely lacks cudaMemcpy3DPeerAsync, per the fork's own comment).
- Create the device event in ggml_backend_cuda_device_event_new with the
  HIP-aliased cudaEventCreateWithFlags(.., cudaEventDisableTiming)
  instead of the un-aliased plain cudaEventCreate, matching this file's
  own usage elsewhere.

CUDA builds are unaffected.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* ci(turboquant): drop the ROCm/hipblas build flavor

The TheTom/llama-cpp-turboquant fork is not ROCm-clean at the current
pin: beyond the CUDA-API gaps already patched (3D-peer copy,
cudaEventCreate), its llama.cpp base fails to compile the
flash-attention MMA f16 kernels for head-dim 640 under HIP
(cols_per_warp evaluates to 0 -> division-by-zero / non-constant static
asserts in fattn-mma-f16.cuh). That is a deep ggml-on-ROCm kernel issue,
not something a small fork patch can paper over.

Drop -gpu-rocm-hipblas-turboquant from the build matrix so turboquant
still ships for cpu / cublas / vulkan / sycl. Re-add it once the fork's
HIP path compiles (or upstream ggml fixes the large-head-dim MMA kernels
for ROCm).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
56 lines
2.7 KiB
Diff
hip: port the turboquant CUDA additions that ggml's HIP shim doesn't cover

The turboquant fork adds/modifies a few ggml-cuda.cu spots with CUDA APIs
that ggml's HIP (and MUSA) compatibility layer does not provide, breaking
the -gpu-rocm-hipblas-turboquant build:

1. ggml_cuda_copy2d_across_devices() (host-staged cross-device copy for
   split mul_mat output) uses the CUDA 3D-peer copy APIs
   cudaMemcpy3DPeerParms / make_cudaPitchedPtr / make_cudaExtent /
   cudaMemcpy3DPeerAsync. HIP genuinely does not support these (see the
   fork's own comment "HIP does not support cudaMemcpy3DPeerAsync"), so
   guard the peer fast path with #if !defined(GGML_USE_HIP) &&
   !defined(GGML_USE_MUSA) -- matching how the fork already guards the
   same API for the sibling 2D copy -- and fall through to the existing
   cudaMemcpyAsync staging fallback below (functionally identical,
   slightly slower on multi-GPU ROCm).

2. ggml_backend_cuda_device_event_new() creates its event with plain
   cudaEventCreate, which ggml's HIP shim does not alias (it only aliases
   cudaEventCreateWithFlags). Use cudaEventCreateWithFlags(...,
   cudaEventDisableTiming) -- exactly what the rest of this file already
   does (cf. lines ~1034, ~3461) and HIP-safe.

CUDA builds are unaffected. Drop the relevant hunk once the fork HIP-ports
these; apply-patches.sh fails fast if an anchor goes stale.

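As an illustration of item 1, the guard-and-fall-through pattern can be sketched on the host with no GPU involved. This is a minimal sketch, not the fork's code: `sketch_copy2d`, `copy_path`, and the `SKETCH_NO_PEER_FASTPATH` macro are hypothetical stand-ins (the macro plays the role of `GGML_USE_HIP` / `GGML_USE_MUSA`), and plain `memcpy` stands in for both the peer copy and the staged copy.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for GGML_USE_HIP / GGML_USE_MUSA: compile with
// -DSKETCH_NO_PEER_FASTPATH to mimic a build where the fast path is unavailable.
enum class copy_path { peer_fast, staged };

// Copies `height` rows of `width` bytes between two pitched host buffers and
// reports which path was taken. All names here are illustrative only.
static copy_path sketch_copy2d(unsigned char * dst, size_t dpitch,
                               const unsigned char * src, size_t spitch,
                               size_t width, size_t height, bool peer_access) {
#if !defined(SKETCH_NO_PEER_FASTPATH)
    if (peer_access) {
        // Stand-in for the direct peer copy: row by row, no intermediate buffer.
        for (size_t r = 0; r < height; ++r) {
            std::memcpy(dst + r * dpitch, src + r * spitch, width);
        }
        return copy_path::peer_fast;
    }
#else
    (void) peer_access; // unused once the fast path is compiled out
#endif

    // Fallback: pack all rows into one contiguous staging buffer, then unpack.
    std::vector<unsigned char> staging(width * height);
    for (size_t r = 0; r < height; ++r) {
        std::memcpy(staging.data() + r * width, src + r * spitch, width);
    }
    for (size_t r = 0; r < height; ++r) {
        std::memcpy(dst + r * dpitch, staging.data() + r * width, width);
    }
    return copy_path::staged;
}
```

Because the guard wraps only the fast path, a build that defines the macro always reaches the staging code, which writes the same bytes to the destination; only the route differs.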
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 0427e6b..6352e6a 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -1933,6 +1933,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
         size_t width, size_t height, cudaStream_t dst_stream, cudaStream_t src_stream) {
 
     const auto & info = ggml_cuda_info();
+#if !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA) // 3D-peer copy types unmapped by ggml's HIP/MUSA shim; use staging fallback below
     if (info.peer_access[src_device][dst_device]) {
         cudaMemcpy3DPeerParms p = {};
         p.dstDevice = dst_device;
@@ -1942,6 +1943,7 @@ static cudaError_t ggml_cuda_copy2d_across_devices(
         p.extent = make_cudaExtent(width, height, 1);
         return cudaMemcpy3DPeerAsync(&p, dst_stream);
     }
+#endif // !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
 
     // Fallback: stage all rows through a single contiguous pinned buffer
     int prev_device = ggml_cuda_get_device();
@@ -5714,7 +5716,7 @@ static ggml_backend_event_t ggml_backend_cuda_device_event_new(ggml_backend_dev_
     ggml_cuda_set_device(dev_ctx->device);
 
     cudaEvent_t event;
-    CUDA_CHECK(cudaEventCreate(&event));
+    CUDA_CHECK(cudaEventCreateWithFlags(&event, cudaEventDisableTiming));
 
     return new ggml_backend_event {
         /* .device = */ dev,
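
The last hunk matters because a compatibility shim that works by aliasing names covers only the names it lists. The sketch below imitates that with invented names throughout: the `vendor` namespace and the `shim*` macros are hypothetical and are not ggml's or HIP's actual identifiers. It assumes the shim is a set of preprocessor aliases, as the description above implies.

```cpp
// Hypothetical miniature of a vendor runtime; every name is invented.
namespace vendor {
    struct event { unsigned flags; };
    constexpr unsigned event_disable_timing = 0x2;

    inline int event_create_with_flags(event ** out, unsigned flags) {
        *out = new event{flags};
        return 0; // 0 means success in this sketch
    }
    inline int event_destroy(event * e) { delete e; return 0; }
}

// Hypothetical miniature of a shim header: only the listed names are aliased.
#define shimEvent_t              vendor::event *
#define shimEventCreateWithFlags vendor::event_create_with_flags
#define shimEventDisableTiming   vendor::event_disable_timing
#define shimEventDestroy         vendor::event_destroy
// Deliberately no `shimEventCreate` alias: a caller using that name would
// fail to compile against this shim, which is the failure mode item 2 describes.

// Creates an event through the aliased call and returns the flags it was
// created with, or ~0u on failure.
static unsigned sketch_new_event_flags() {
    shimEvent_t ev = nullptr;
    if (shimEventCreateWithFlags(&ev, shimEventDisableTiming) != 0) {
        return ~0u;
    }
    const unsigned flags = ev->flags;
    shimEventDestroy(ev);
    return flags;
}
```

Switching the call site to the already-aliased name, rather than adding a new alias to the shim, keeps the change local to the one file the patch touches.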