* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994
Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.
Adapt the grpc-server wrapper accordingly:
* `common_params_speculative::type` (single enum) became `types`
(`std::vector<common_speculative_type>`). Update both the
"default to draft when a draft model is set" branch and the
`spec_type`/`speculative_type` option parser. The parser now also
tolerates comma-separated lists, mirroring the upstream
`common_speculative_types_from_names` semantics (see the sketch after this list).
* `common_params_speculative_draft::n_ctx` is gone (draft now shares
the target context size). Keep the `draft_ctx_size` option name for
backward compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the
two reranker / model-metadata call sites.
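For reference, the comma-tolerant parsing reduces to a split-and-map
step along these lines (a minimal sketch; the helper name is
illustrative and error handling is elided):

    #include <sstream>
    #include <string>
    #include <vector>

    // Split a comma-separated spec_type value into individual names; each
    // name is then mapped to a common_speculative_type and pushed into
    // params.speculative.types (the name->enum mapping is elided here).
    static std::vector<std::string> split_spec_type_names(const std::string & csv) {
        std::vector<std::string> names;
        std::stringstream ss(csv);
        for (std::string item; std::getline(ss, item, ','); ) {
            if (!item.empty()) {
                names.push_back(item);
            }
        }
        return names;
    }

The `draft_ctx_size` branch, by contrast, reduces to an accepted-but-empty
case: the option is still recognized so existing configs keep loading,
but there is no longer a field to set.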
Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): expose new speculative-decoding option keys
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and extends the draft-model tuning knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.
New `options:` keys (all under `backend: llama-cpp`; a parser sketch
for the ngram_mod family follows the key list):
ngram_mod (`ngram_mod` type):
spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match
ngram_map_k (`ngram_map_k` type):
spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits
ngram_map_k4v (`ngram_map_k4v` type):
spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
spec_ngram_map_k4v_min_hits
ngram lookup caches (`ngram_cache` type):
spec_lookup_cache_static / lookup_cache_static
spec_lookup_cache_dynamic / lookup_cache_dynamic
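As a rough sketch of how one family's keys reach the new sub-struct
(the struct and field names below are stand-ins inferred from the key
names; the real definitions live upstream):

    #include <string>

    // Hypothetical stand-in for the upstream ngram_mod sub-struct.
    struct ngram_mod_params { int n_min = 0; int n_max = 0; int n_match = 0; };

    // Route one options key/value pair into the sub-struct; returns false
    // for keys of other families so the caller can keep dispatching.
    static bool apply_ngram_mod_option(ngram_mod_params & p,
                                       const std::string & key,
                                       const std::string & value) {
        if (key == "spec_ngram_mod_n_min")   { p.n_min   = std::stoi(value); return true; }
        if (key == "spec_ngram_mod_n_max")   { p.n_max   = std::stoi(value); return true; }
        if (key == "spec_ngram_mod_n_match") { p.n_match = std::stoi(value); return true; }
        return false;
    }

The map_k, map_k4v, and ngram_cache families follow the same pattern
with their respective fields.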
Draft-model tuning (active when `spec_type` is `draft`):
draft_cache_type_k / spec_draft_cache_type_k
draft_cache_type_v / spec_draft_cache_type_v
draft_threads / spec_draft_threads
draft_threads_batch / spec_draft_threads_batch
draft_cpu_moe / spec_draft_cpu_moe (bool flag)
draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
draft_override_tensor / spec_draft_override_tensor
(comma-separated <tensor regex>=<buffer type>; re-implements upstream's
static parse_tensor_buffer_overrides since it isn't exported; see the sketch below)
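A minimal sketch of the parsing half of that re-implementation
(resolving each buffer-type name to an actual ggml buffer type is
elided, and names here are illustrative):

    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Split e.g. "blk\\..*\\.ffn=CPU,attn_q=CUDA0" into
    // (tensor regex, buffer-type name) pairs.
    static std::vector<std::pair<std::string, std::string>>
    split_tensor_overrides(const std::string & csv) {
        std::vector<std::pair<std::string, std::string>> out;
        std::stringstream ss(csv);
        for (std::string entry; std::getline(ss, entry, ','); ) {
            const auto eq = entry.find('=');
            if (eq == std::string::npos) {
                continue; // malformed entry without '='; skipped in this sketch
            }
            out.emplace_back(entry.substr(0, eq), entry.substr(eq + 1));
        }
        return out;
    }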
`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.
Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.
Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout
The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:
* `ctx_server.impl->model_tgt` (fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
(none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
* `params.speculative.types` vector / `common_speculative_types_from_names`
(fork has a scalar `type` and only the singular helper)
Approach:
1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
`LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
discriminations (the "default to draft when a draft model is set" branch
and the `spec_type` / `speculative_type` option parser) fall back to the
singular scalar form, and the entire new-option block (ngram_mod / map_k
/ map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
tensor_buft_overrides}) is preprocessed out (see the sketch below).
The macro is *not* defined in the source tree — stock llama-cpp builds
get the full new API.
2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
- substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
- inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
`#include`, so the guarded blocks above drop out for the fork build.
Both patches are idempotent and follow the existing sed/awk pattern in
this script (KV cache types, `get_media_marker`, flat speculative
renames). Stock llama-cpp's `grpc-server.cpp` is never touched.
Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.
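In outline, each guarded discrimination looks like this (simplified;
the singular helper's exact name and both signatures are assumptions,
only their existence is known from the fork):

    #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
        // fork: scalar field set via the singular helper
        params.speculative.type  = common_speculative_type_from_name(name);
    #else
        // stock llama.cpp after #22838: vector of types via the plural helper
        params.speculative.types = common_speculative_types_from_names(names);
    #endif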
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): close draft_ctx_size brace inside legacy guard
The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.
Move the chain split inside the draft_ctx_size branch:
} else if (... "draft_ctx_size") {
// ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
} // legacy: chain ends here
#else
} else if (... "spec_ngram_mod_n_min") { // modern: chain continues
...
} else if (... "draft_override_tensor") {
...
} // closes last branch
#endif
} // closes for-loop
Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).
Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt
Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.
backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:
    Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
    list of gfx targets e.g. gfx1100,gfx1101. Stop.
    make: *** [Makefile:66: turboquant-fallback] Error 2
The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.
Mirror the existing pattern from Dockerfile.llama-cpp.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI website
LocalAI documentation website
Requirements
In this project, the Docsy theme component is pulled in as a Hugo module, together with other module dependencies:
    $ hugo mod graph
    hugo: collected modules in 566 ms
    github.com/google/docsy-example github.com/google/docsy@v0.5.1-0.20221017155306-99eacb09ffb0
    github.com/google/docsy-example github.com/google/docsy/dependencies@v0.5.1-0.20221014161617-be5da07ecff1
    github.com/google/docsy/dependencies@v0.5.1-0.20221014161617-be5da07ecff1 github.com/twbs/bootstrap@v4.6.2+incompatible
    github.com/google/docsy/dependencies@v0.5.1-0.20221014161617-be5da07ecff1 github.com/FortAwesome/Font-Awesome@v0.0.0-20220831210243-d3a7818c253f
If you want to do SCSS edits and want to publish these, you need to
install PostCSS:

    npm install
Running the website locally
Building and running the site locally requires a recent extended version of Hugo.
You can find out more about how to install Hugo for your environment in our
Getting started guide.
Once you've made your working copy of the site repo, from the repo root
folder, run:

    hugo server
Running a container locally
You can run docsy-example inside a Docker container; the container runs
with a volume bound to the docsy-example folder. This approach doesn't
require you to install any dependencies other than Docker Desktop on
Windows and Mac, and Docker Compose on Linux.
1. Build the docker image:

       docker-compose build

2. Run the built image:

       docker-compose up

   NOTE: You can run both commands at once with docker-compose up --build.

3. Verify that the service is working. Open your web browser and type
   http://localhost:1313 in your navigation bar. This opens a local
   instance of the docsy-example homepage. You can now make changes to
   the docsy example and those changes will immediately show up in your
   browser after you save.
Cleanup
To stop Docker Compose, in your terminal window, press Ctrl + C.
To remove the produced images run:

    docker-compose rm
For more information see the Docker Compose documentation.
Troubleshooting
As you run the website locally, you may run into the following error:
    ➜ hugo server
    INFO 2021/01/21 21:07:55 Using config file:
    Building sites … INFO 2021/01/21 21:07:55 syncing static files to /
    Built in 288 ms
    Error: Error building site: TOCSS: failed to transform "scss/main.scss" (text/x-scss): resource "scss/scss/main.scss_9fadf33d895a46083cdd64396b57ef68" not found in file cache
This error occurs if you have not installed the extended version of Hugo. See this section of the user guide for instructions on how to install Hugo.
Or you may encounter the following error:
    ➜ hugo server
    Error: failed to download modules: binary with name "go" not found

This error occurs if you have not installed the Go programming language
on your system. See this section of the user guide for instructions on
how to install Go.