⬆️ Update ggerganov/llama.cpp (#1655)

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
feat(startup): fetch model definition remotely (#1654)
2026-02-03 11:13:31 -05:00 · 2024-01-28 09:24:44 +01:00 · 2024-01-28 00:14:16 +01:00 · 2024-01-27 00:13:38 +01:00 · 2024-01-27 00:13:19 +01:00 · 2024-01-26 18:35:33 +01:00
193 changed files with 5402 additions and 5343 deletions
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -2,9 +2,7 @@
 name: Bug report
 about: Create a report to help us improve
 title: ''
-labels: bug
-assignees: mudler
-
+labels: bug, unconfirmed, up-for-grabs
 ---

 <!-- Thanks for helping us to improve LocalAI! We welcome all bug reports. Please fill out each area of the template so we can better help you. Comments like this will be hidden when you post but you can delete them if you wish. -->
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -2,9 +2,7 @@
 name: Feature request
 about: Suggest an idea for this project
 title: ''
-labels: enhancement
-assignees: mudler
-
+labels: enhancement, up-for-grabs
 ---

 <!-- Thanks for helping us to improve LocalAI! We welcome all feature requests. Please fill out each area of the template so we can better help you. Comments like this will be hidden when you post but you can delete them if you wish. -->
--- a/.github/workflows/image-pr.yml
+++ b/.github/workflows/image-pr.yml
@@ -0,0 +1,86 @@
+---
+name: 'build container images tests'
+
+on:
+  pull_request:
+
+concurrency:
+  group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
+  cancel-in-progress: true
+
+jobs:
+  extras-image-build:
+    uses: ./.github/workflows/image_build.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+      ffmpeg: ${{ matrix.ffmpeg }}
+      image-type: ${{ matrix.image-type }}
+      build-type: ${{ matrix.build-type }}
+      cuda-major-version: ${{ matrix.cuda-major-version }}
+      cuda-minor-version: ${{ matrix.cuda-minor-version }}
+      platforms: ${{ matrix.platforms }}
+      runs-on: ${{ matrix.runs-on }}
+    secrets:
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      # Pushing with all jobs in parallel
+      # eats the bandwidth of all the nodes
+      max-parallel: ${{ github.event_name != 'pull_request' && 2 || 4 }}
+      matrix:
+        include:
+          - build-type: ''
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda12-ffmpeg'
+            ffmpeg: 'true'
+            image-type: 'extras'
+            runs-on: 'arc-runner-set'
+  core-image-build:
+    uses: ./.github/workflows/image_build.yml
+    with:
+      tag-latest: ${{ matrix.tag-latest }}
+      tag-suffix: ${{ matrix.tag-suffix }}
+      ffmpeg: ${{ matrix.ffmpeg }}
+      image-type: ${{ matrix.image-type }}
+      build-type: ${{ matrix.build-type }}
+      cuda-major-version: ${{ matrix.cuda-major-version }}
+      cuda-minor-version: ${{ matrix.cuda-minor-version }}
+      platforms: ${{ matrix.platforms }}
+      runs-on: ${{ matrix.runs-on }}
+    secrets:
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+    strategy:
+      matrix:
+        include:
+          - build-type: ''
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'ubuntu-latest'
+          - build-type: 'cublas'
+            cuda-major-version: "12"
+            cuda-minor-version: "1"
+            platforms: 'linux/amd64'
+            tag-latest: 'false'
+            tag-suffix: '-cublas-cuda12-ffmpeg-core'
+            ffmpeg: 'true'
+            image-type: 'core'
+            runs-on: 'ubuntu-latest'
--- a/.github/workflows/image.yml
+++ b/.github/workflows/image.yml
@@ -2,7 +2,6 @@
 name: 'build container images'

 on:
-  pull_request:
  push:
    branches:
      - master
@@ -27,8 +26,10 @@ jobs:
      platforms: ${{ matrix.platforms }}
      runs-on: ${{ matrix.runs-on }}
    secrets:
-      dockerUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      dockerPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    strategy:
      # Pushing with all jobs in parallel
      # eats the bandwidth of all the nodes
@@ -107,8 +108,10 @@ jobs:
      platforms: ${{ matrix.platforms }}
      runs-on: ${{ matrix.runs-on }}
    secrets:
-      dockerUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
-      dockerPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+      dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
+      dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
+      quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+      quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
    strategy:
      matrix:
        include:
--- a/.github/workflows/image_build.yml
+++ b/.github/workflows/image_build.yml
@@ -46,6 +46,10 @@ on:
        required: true
      dockerPassword:
        required: true
+      quayUsername:
+        required: true
+      quayPassword:
+        required: true
 jobs:
  reusable_image-build:
    runs-on: ${{ inputs.runs-on }}
@@ -100,7 +104,9 @@ jobs:
        id: meta
        uses: docker/metadata-action@v5
        with:
-          images: quay.io/go-skynet/local-ai
+          images: |
+            quay.io/go-skynet/local-ai
+            localai/localai
          tags: |
            type=ref,event=branch
            type=semver,pattern={{raw}}
@@ -122,10 +128,17 @@ jobs:
        if: github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
-          registry: quay.io
          username: ${{ secrets.dockerUsername }}
          password: ${{ secrets.dockerPassword }}

+      - name: Login to Quay.io
+        if: github.event_name != 'pull_request'
+        uses: docker/login-action@v3
+        with:
+          registry: quay.io
+          username: ${{ secrets.quayUsername }}
+          password: ${{ secrets.quayPassword }}
+
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
--- a/.github/workflows/release.yaml
+++ b/.github/workflows/release.yaml
@@ -34,10 +34,22 @@ jobs:
          sudo apt-get update
          sudo apt-get install build-essential ffmpeg

+      - name: Cache grpc
+        id: cache-grpc
+        uses: actions/cache@v3
+        with:
+          path: grpc
+          key: ${{ runner.os }}-grpc
+      - name: Build grpc
+        if: steps.cache-grpc.outputs.cache-hit != 'true'
+        run: |
          git clone --recurse-submodules -b v1.58.0 --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
-              cd grpc && mkdir -p cmake/build && cd cmake/build && cmake -DgRPC_INSTALL=ON \
-                -DgRPC_BUILD_TESTS=OFF \
-                ../.. && sudo make -j12 install
+          cd grpc && mkdir -p cmake/build && cd cmake/build && cmake -DgRPC_INSTALL=ON \
+            -DgRPC_BUILD_TESTS=OFF \
+            ../.. && sudo make -j12
+      - name: Install gRPC
+        run: |
+          cd grpc && cd cmake/build && sudo make -j12 install

      - name: Build
        id: build
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -86,11 +86,22 @@ jobs:
          sudo cp -rfv sources/go-piper/piper-phonemize/pi/lib/. /usr/lib/ && \
          # Pre-build stable diffusion before we install a newer version of abseil (not compatible with stablediffusion-ncn)
          GO_TAGS="stablediffusion tts" GRPC_BACKENDS=backend-assets/grpc/stablediffusion make build
-
+      - name: Cache grpc
+        id: cache-grpc
+        uses: actions/cache@v3
+        with:
+          path: grpc
+          key: ${{ runner.os }}-grpc
+      - name: Build grpc
+        if: steps.cache-grpc.outputs.cache-hit != 'true'
+        run: |
          git clone --recurse-submodules -b v1.58.0 --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
-              cd grpc && mkdir -p cmake/build && cd cmake/build && cmake -DgRPC_INSTALL=ON \
-                -DgRPC_BUILD_TESTS=OFF \
-                ../.. && sudo make -j12 install
+          cd grpc && mkdir -p cmake/build && cd cmake/build && cmake -DgRPC_INSTALL=ON \
+            -DgRPC_BUILD_TESTS=OFF \
+            ../.. && sudo make -j12
+      - name: Install gRPC
+        run: |
+          cd grpc && cd cmake/build && sudo make -j12 install
      - name: Test
        run: |
          GO_TAGS="stablediffusion tts" make test
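
Review note: the `release.yaml` and `test.yml` changes split the gRPC step into a cached build (skipped when `steps.cache-grpc.outputs.cache-hit` is `'true'`) and an unconditional install. The cache key `${{ runner.os }}-grpc` does not include the gRPC tag, so a cached `grpc` directory would likely be reused even if `v1.58.0` is bumped later. A minimal shell sketch of the gating logic follows; the temporary directory and the `echo` placeholders are assumptions standing in for the real clone, `cmake`, and `make` commands.

```shell
#!/bin/sh
# Sketch of "build only on cache miss, always install".
# All paths and messages here are hypothetical placeholders.
set -eu

work="$(mktemp -d)"
cache_dir="$work/grpc"   # stands in for the cached `grpc` checkout

build_if_missing() {
    # Mirrors `if: steps.cache-grpc.outputs.cache-hit != 'true'`.
    if [ -d "$cache_dir/cmake/build" ]; then
        echo "cache hit, skipping build"
    else
        # Placeholder for: git clone ... && cmake ... && sudo make -j12
        mkdir -p "$cache_dir/cmake/build"
        echo "cache miss, built"
    fi
}

install_always() {
    # Placeholder for: cd grpc && cd cmake/build && sudo make -j12 install
    echo "installed from grpc/cmake/build"
}

first="$(build_if_missing)"    # no cached tree yet, so this builds
second="$(build_if_missing)"   # tree exists now, so the build is skipped
installed="$(install_always)"  # runs in both cases

echo "$first"
echo "$second"
echo "$installed"
rm -rf "$work"
```

The install step has to run on every job because the cache restores only the build tree, not the files that `make install` copies into system directories.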
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +1,6 @@
 [submodule "docs/themes/hugo-theme-relearn"]
 	path = docs/themes/hugo-theme-relearn
 	url = https://github.com/McShelby/hugo-theme-relearn.git
+[submodule "docs/themes/lotusdocs"]
+	path = docs/themes/lotusdocs
+	url = https://github.com/colinwilson/lotusdocs
--- a/16
+++ b/16
@@ -13,9 +13,8 @@ ARG TARGETVARIANT

 ENV BUILD_TYPE=${BUILD_TYPE}

-ENV EXTERNAL_GRPC_BACKENDS="coqui:/build/backend/python/coqui/run.sh,huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh,petals:/build/backend/python/petals/run.sh,transformers:/build/backend/python/transformers/run.sh,sentencetransformers:/build/backend/python/sentencetransformers/run.sh,autogptq:/build/backend/python/autogptq/run.sh,bark:/build/backend/python/bark/run.sh,diffusers:/build/backend/python/diffusers/run.sh,exllama:/build/backend/python/exllama/run.sh,vall-e-x:/build/backend/python/vall-e-x/run.sh,vllm:/build/backend/python/vllm/run.sh,exllama2:/build/backend/python/exllama2/run.sh,transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh"
+ENV EXTERNAL_GRPC_BACKENDS="coqui:/build/backend/python/coqui/run.sh,huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh,petals:/build/backend/python/petals/run.sh,transformers:/build/backend/python/transformers/run.sh,sentencetransformers:/build/backend/python/sentencetransformers/run.sh,autogptq:/build/backend/python/autogptq/run.sh,bark:/build/backend/python/bark/run.sh,diffusers:/build/backend/python/diffusers/run.sh,exllama:/build/backend/python/exllama/run.sh,vall-e-x:/build/backend/python/vall-e-x/run.sh,vllm:/build/backend/python/vllm/run.sh,mamba:/build/backend/python/mamba/run.sh,exllama2:/build/backend/python/exllama2/run.sh,transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh"

-ENV GALLERIES='[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]'
 ARG GO_TAGS="stablediffusion tinydream tts"

 RUN apt-get update && \
@@ -64,12 +63,12 @@ RUN curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmo
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list && \
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list && \
    apt-get update && \
-    apt-get install -y conda
+    apt-get install -y conda && apt-get clean

 ENV PATH="/root/.cargo/bin:${PATH}"
 RUN pip install --upgrade pip
 RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
-RUN apt-get install -y espeak-ng espeak
+RUN apt-get install -y espeak-ng espeak && apt-get clean

 ###################################
 ###################################
@@ -127,10 +126,11 @@ ARG CUDA_MAJOR_VERSION=11
 ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
 ENV NVIDIA_REQUIRE_CUDA="cuda>=${CUDA_MAJOR_VERSION}.0"
 ENV NVIDIA_VISIBLE_DEVICES=all
+ENV PIP_CACHE_PURGE=true

 # Add FFmpeg
 RUN if [ "${FFMPEG}" = "true" ]; then \
-    apt-get install -y ffmpeg \
+    apt-get install -y ffmpeg && apt-get clean \
    ; fi

 WORKDIR /build
@@ -168,6 +168,9 @@ RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
 RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
 	PATH=$PATH:/opt/conda/bin make -C backend/python/vllm \
    ; fi
+RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
+	PATH=$PATH:/opt/conda/bin make -C backend/python/mamba \
+    ; fi
 RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
 	PATH=$PATH:/opt/conda/bin make -C backend/python/sentencetransformers \
    ; fi
@@ -193,6 +196,9 @@ RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
 	PATH=$PATH:/opt/conda/bin make -C backend/python/coqui \
    ; fi

+# Make sure the models directory exists
+RUN mkdir -p /build/models
+
 # Define the health check command
 HEALTHCHECK --interval=1m --timeout=10m --retries=10 \
  CMD curl -f $HEALTHCHECK_ENDPOINT || exit 1
--- a/2
+++ b/2
@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2023 Ettore Di Giacinto
+Copyright (c) 2023-2024 Ettore Di Giacinto (mudler@localai.io)

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
--- a/88
+++ b/88
@@ -8,7 +8,7 @@ GOLLAMA_VERSION?=aeba71ee842819da681ea537e78846dc75949ac0

 GOLLAMA_STABLE_VERSION?=50cee7712066d9e38306eccadcfbb44ea87df4b7

-CPPLLAMA_VERSION?=65e5f6dadbba4b496bba27f573e473c66b446496
+CPPLLAMA_VERSION?=6db2b41a76ee78d5efdd5c3cddd5d7ad3f646855

 # gpt4all version
 GPT4ALL_REPO?=https://github.com/nomic-ai/gpt4all
@@ -140,8 +140,8 @@ endif
 ifeq ($(findstring tts,$(GO_TAGS)),tts)
 #	OPTIONAL_TARGETS+=go-piper/libpiper_binding.a
 #	OPTIONAL_TARGETS+=backend-assets/espeak-ng-data
-	PIPER_CGO_CXXFLAGS+=-I$(shell pwd)/sources/go-piper/piper/src/cpp -I$(shell pwd)/sources/go-piper/piper/build/fi/include -I$(shell pwd)/sources/go-piper/piper/build/pi/include -I$(shell pwd)/sources/go-piper/piper/build/si/include
- 	PIPER_CGO_LDFLAGS+=-L$(shell pwd)/sources/go-piper/piper/build/fi/lib -L$(shell pwd)/sources/go-piper/piper/build/pi/lib -L$(shell pwd)/sources/go-piper/piper/build/si/lib -lfmt -lspdlog -lucd
+	PIPER_CGO_CXXFLAGS+=-I$(CURDIR)/sources/go-piper/piper/src/cpp -I$(CURDIR)/sources/go-piper/piper/build/fi/include -I$(CURDIR)/sources/go-piper/piper/build/pi/include -I$(CURDIR)/sources/go-piper/piper/build/si/include
+	PIPER_CGO_LDFLAGS+=-L$(CURDIR)/sources/go-piper/piper/build/fi/lib -L$(CURDIR)/sources/go-piper/piper/build/pi/lib -L$(CURDIR)/sources/go-piper/piper/build/si/lib -lfmt -lspdlog -lucd
 	OPTIONAL_GRPC+=backend-assets/grpc/piper
 endif

@@ -153,6 +153,10 @@ ifeq ($(GRPC_BACKENDS),)
 	GRPC_BACKENDS=$(ALL_GRPC_BACKENDS)
 endif

+ifeq ($(BUILD_API_ONLY),true)
+	GRPC_BACKENDS=
+endif
+
 .PHONY: all test build vendor

 all: help
@@ -252,15 +256,15 @@ get-sources: backend/cpp/llama/llama.cpp sources/go-llama sources/go-llama-ggml
 	touch $@

 replace:
-	$(GOCMD) mod edit -replace github.com/nomic-ai/gpt4all/gpt4all-bindings/golang=$(shell pwd)/sources/gpt4all/gpt4all-bindings/golang
-	$(GOCMD) mod edit -replace github.com/go-skynet/go-ggml-transformers.cpp=$(shell pwd)/sources/go-ggml-transformers
-	$(GOCMD) mod edit -replace github.com/donomii/go-rwkv.cpp=$(shell pwd)/sources/go-rwkv
-	$(GOCMD) mod edit -replace github.com/ggerganov/whisper.cpp=$(shell pwd)/sources/whisper.cpp
-	$(GOCMD) mod edit -replace github.com/ggerganov/whisper.cpp/bindings/go=$(shell pwd)/sources/whisper.cpp/bindings/go
-	$(GOCMD) mod edit -replace github.com/go-skynet/go-bert.cpp=$(shell pwd)/sources/go-bert
-	$(GOCMD) mod edit -replace github.com/mudler/go-stable-diffusion=$(shell pwd)/sources/go-stable-diffusion
-	$(GOCMD) mod edit -replace github.com/M0Rf30/go-tiny-dream=$(shell pwd)/sources/go-tiny-dream
-	$(GOCMD) mod edit -replace github.com/mudler/go-piper=$(shell pwd)/sources/go-piper
+	$(GOCMD) mod edit -replace github.com/nomic-ai/gpt4all/gpt4all-bindings/golang=$(CURDIR)/sources/gpt4all/gpt4all-bindings/golang
+	$(GOCMD) mod edit -replace github.com/go-skynet/go-ggml-transformers.cpp=$(CURDIR)/sources/go-ggml-transformers
+	$(GOCMD) mod edit -replace github.com/donomii/go-rwkv.cpp=$(CURDIR)/sources/go-rwkv
+	$(GOCMD) mod edit -replace github.com/ggerganov/whisper.cpp=$(CURDIR)/sources/whisper.cpp
+	$(GOCMD) mod edit -replace github.com/ggerganov/whisper.cpp/bindings/go=$(CURDIR)/sources/whisper.cpp/bindings/go
+	$(GOCMD) mod edit -replace github.com/go-skynet/go-bert.cpp=$(CURDIR)/sources/go-bert
+	$(GOCMD) mod edit -replace github.com/mudler/go-stable-diffusion=$(CURDIR)/sources/go-stable-diffusion
+	$(GOCMD) mod edit -replace github.com/M0Rf30/go-tiny-dream=$(CURDIR)/sources/go-tiny-dream
+	$(GOCMD) mod edit -replace github.com/mudler/go-piper=$(CURDIR)/sources/go-piper

 prepare-sources: get-sources replace
 	$(GOCMD) mod download
@@ -290,19 +294,17 @@ clean: ## Remove build related file
 	rm -rf ./sources
 	rm -rf $(BINARY_NAME)
 	rm -rf release/
-	rm -rf ./backend/cpp/grpc/grpc_repo
-	rm -rf ./backend/cpp/grpc/build
-	rm -rf ./backend/cpp/grpc/installed_packages
+	rm -rf backend-assets
+	$(MAKE) -C backend/cpp/grpc clean
 	$(MAKE) -C backend/cpp/llama clean

 ## Build:

-build: grpcs prepare ## Build the project
+build: backend-assets grpcs prepare ## Build the project
 	$(info ${GREEN}I local-ai build info:${RESET})
 	$(info ${GREEN}I BUILD_TYPE: ${YELLOW}$(BUILD_TYPE)${RESET})
 	$(info ${GREEN}I GO_TAGS: ${YELLOW}$(GO_TAGS)${RESET})
 	$(info ${GREEN}I LD_FLAGS: ${YELLOW}$(LD_FLAGS)${RESET})
-
 	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o $(BINARY_NAME) ./

 dist: build
@@ -417,6 +419,7 @@ protogen-python:
 	python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/vall-e-x/ --grpc_python_out=backend/python/vall-e-x/ backend/backend.proto
 	python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/vllm/ --grpc_python_out=backend/python/vllm/ backend/backend.proto
 	python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/petals/ --grpc_python_out=backend/python/petals/ backend/backend.proto
+	python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/mamba/ --grpc_python_out=backend/python/mamba/ backend/backend.proto
 	python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/exllama2/ --grpc_python_out=backend/python/exllama2/ backend/backend.proto

 ## GRPC
@@ -427,6 +430,7 @@ prepare-extra-conda-environments:
 	$(MAKE) -C backend/python/coqui
 	$(MAKE) -C backend/python/diffusers
 	$(MAKE) -C backend/python/vllm
+	$(MAKE) -C backend/python/mamba
 	$(MAKE) -C backend/python/sentencetransformers
 	$(MAKE) -C backend/python/transformers
 	$(MAKE) -C backend/python/transformers-musicgen
@@ -443,12 +447,18 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
 	$(MAKE) -C backend/python/diffusers test

+backend-assets:
+	mkdir -p backend-assets
+ifeq ($(BUILD_API_ONLY),true)
+	touch backend-assets/keep
+endif
+
 backend-assets/grpc:
 	mkdir -p backend-assets/grpc

 backend-assets/grpc/llama: backend-assets/grpc sources/go-llama/libbinding.a
-	$(GOCMD) mod edit -replace github.com/go-skynet/go-llama.cpp=$(shell pwd)/sources/go-llama
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-llama LIBRARY_PATH=$(shell pwd)/sources/go-llama \
+	$(GOCMD) mod edit -replace github.com/go-skynet/go-llama.cpp=$(CURDIR)/sources/go-llama
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-llama LIBRARY_PATH=$(CURDIR)/sources/go-llama \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/llama ./backend/go/llm/llama/
 # TODO: every binary should have its own folder instead, so can have different  implementations
 ifeq ($(BUILD_TYPE),metal)
@@ -467,17 +477,17 @@ ADDED_CMAKE_ARGS=-Dabsl_DIR=${INSTALLED_LIB_CMAKE}/absl \

 backend/cpp/llama/grpc-server:
 ifdef BUILD_GRPC_FOR_BACKEND_LLAMA
-	backend/cpp/grpc/script/build_grpc.sh ${INSTALLED_PACKAGES}
+	$(MAKE) -C backend/cpp/grpc build
 	export _PROTOBUF_PROTOC=${INSTALLED_PACKAGES}/bin/proto && \
 	export _GRPC_CPP_PLUGIN_EXECUTABLE=${INSTALLED_PACKAGES}/bin/grpc_cpp_plugin && \
-	export PATH=${PATH}:${INSTALLED_PACKAGES}/bin && \
+	export PATH="${INSTALLED_PACKAGES}/bin:${PATH}" && \
 	CMAKE_ARGS="${CMAKE_ARGS} ${ADDED_CMAKE_ARGS}" LLAMA_VERSION=$(CPPLLAMA_VERSION) $(MAKE) -C backend/cpp/llama grpc-server
 else
 	echo "BUILD_GRPC_FOR_BACKEND_LLAMA is not defined."
 	LLAMA_VERSION=$(CPPLLAMA_VERSION) $(MAKE) -C backend/cpp/llama grpc-server			
 endif
 ## BACKEND CPP LLAMA END
-		
+
 ##
 backend-assets/grpc/llama-cpp: backend-assets/grpc backend/cpp/llama/grpc-server
 	cp -rfv backend/cpp/llama/grpc-server backend-assets/grpc/llama-cpp
@@ -487,52 +497,52 @@ ifeq ($(BUILD_TYPE),metal)
 endif

 backend-assets/grpc/llama-ggml: backend-assets/grpc sources/go-llama-ggml/libbinding.a
-	$(GOCMD) mod edit -replace github.com/go-skynet/go-llama.cpp=$(shell pwd)/sources/go-llama-ggml
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-llama-ggml LIBRARY_PATH=$(shell pwd)/sources/go-llama-ggml \
+	$(GOCMD) mod edit -replace github.com/go-skynet/go-llama.cpp=$(CURDIR)/sources/go-llama-ggml
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-llama-ggml LIBRARY_PATH=$(CURDIR)/sources/go-llama-ggml \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/llama-ggml ./backend/go/llm/llama-ggml/

 backend-assets/grpc/gpt4all: backend-assets/grpc backend-assets/gpt4all sources/gpt4all/gpt4all-bindings/golang/libgpt4all.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/gpt4all/gpt4all-bindings/golang/ LIBRARY_PATH=$(shell pwd)/sources/gpt4all/gpt4all-bindings/golang/ \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/gpt4all/gpt4all-bindings/golang/ LIBRARY_PATH=$(CURDIR)/sources/gpt4all/gpt4all-bindings/golang/ \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/gpt4all ./backend/go/llm/gpt4all/

 backend-assets/grpc/dolly: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/dolly ./backend/go/llm/dolly/

 backend-assets/grpc/gpt2: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/gpt2 ./backend/go/llm/gpt2/

 backend-assets/grpc/gptj: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/gptj ./backend/go/llm/gptj/

 backend-assets/grpc/gptneox: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/gptneox ./backend/go/llm/gptneox/

 backend-assets/grpc/mpt: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/mpt ./backend/go/llm/mpt/

 backend-assets/grpc/replit: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/replit ./backend/go/llm/replit/

 backend-assets/grpc/falcon-ggml: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/falcon-ggml ./backend/go/llm/falcon-ggml/

 backend-assets/grpc/starcoder: backend-assets/grpc sources/go-ggml-transformers/libtransformers.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-ggml-transformers LIBRARY_PATH=$(shell pwd)/sources/go-ggml-transformers \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-ggml-transformers LIBRARY_PATH=$(CURDIR)/sources/go-ggml-transformers \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/starcoder ./backend/go/llm/starcoder/

 backend-assets/grpc/rwkv: backend-assets/grpc sources/go-rwkv/librwkv.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-rwkv LIBRARY_PATH=$(shell pwd)/sources/go-rwkv \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-rwkv LIBRARY_PATH=$(CURDIR)/sources/go-rwkv \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/rwkv ./backend/go/llm/rwkv

 backend-assets/grpc/bert-embeddings: backend-assets/grpc sources/go-bert/libgobert.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-bert LIBRARY_PATH=$(shell pwd)/sources/go-bert \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-bert LIBRARY_PATH=$(CURDIR)/sources/go-bert \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/bert-embeddings ./backend/go/llm/bert/

 backend-assets/grpc/langchain-huggingface: backend-assets/grpc
@@ -541,20 +551,20 @@ backend-assets/grpc/langchain-huggingface: backend-assets/grpc
 backend-assets/grpc/stablediffusion: backend-assets/grpc
 	if [ ! -f backend-assets/grpc/stablediffusion ]; then \
 		$(MAKE) sources/go-stable-diffusion/libstablediffusion.a; \
-		CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/go-stable-diffusion/ LIBRARY_PATH=$(shell pwd)/sources/go-stable-diffusion/ \
+		CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/go-stable-diffusion/ LIBRARY_PATH=$(CURDIR)/sources/go-stable-diffusion/ \
 		$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/stablediffusion ./backend/go/image/stablediffusion; \
 	fi

 backend-assets/grpc/tinydream: backend-assets/grpc sources/go-tiny-dream/libtinydream.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" LIBRARY_PATH=$(shell pwd)/go-tiny-dream \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" LIBRARY_PATH=$(CURDIR)/go-tiny-dream \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/tinydream ./backend/go/image/tinydream

 backend-assets/grpc/piper: backend-assets/grpc backend-assets/espeak-ng-data sources/go-piper/libpiper_binding.a
-	CGO_CXXFLAGS="$(PIPER_CGO_CXXFLAGS)" CGO_LDFLAGS="$(PIPER_CGO_LDFLAGS)" LIBRARY_PATH=$(shell pwd)/sources/go-piper \
+	CGO_CXXFLAGS="$(PIPER_CGO_CXXFLAGS)" CGO_LDFLAGS="$(PIPER_CGO_LDFLAGS)" LIBRARY_PATH=$(CURDIR)/sources/go-piper \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/piper ./backend/go/tts/

 backend-assets/grpc/whisper: backend-assets/grpc sources/whisper.cpp/libwhisper.a
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(shell pwd)/sources/whisper.cpp LIBRARY_PATH=$(shell pwd)/sources/whisper.cpp \
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" C_INCLUDE_PATH=$(CURDIR)/sources/whisper.cpp LIBRARY_PATH=$(CURDIR)/sources/whisper.cpp \
 	$(GOCMD) build -ldflags "$(LD_FLAGS)" -tags "$(GO_TAGS)" -o backend-assets/grpc/whisper ./backend/go/transcribe/

 grpcs: prepare $(GRPC_BACKENDS)
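
Review note: the `backend/cpp/llama/grpc-server` target now puts `${INSTALLED_PACKAGES}/bin` at the front of `PATH` instead of the end. Presumably the intent is for the freshly built `protoc` and `grpc_cpp_plugin` to take precedence over any system copies. The sketch below illustrates the lookup-order effect with a throwaway tool; `demo-protoc` and the temporary directories are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Sketch: appended vs. prepended PATH entries when two tools share a name.
# Directory and tool names are hypothetical.
set -eu

work="$(mktemp -d)"
mkdir -p "$work/system/bin" "$work/installed/bin"

# Two same-named tools: one plays the system copy, one the locally built copy.
printf '#!/bin/sh\necho system\n' > "$work/system/bin/demo-protoc"
printf '#!/bin/sh\necho installed\n' > "$work/installed/bin/demo-protoc"
chmod +x "$work/system/bin/demo-protoc" "$work/installed/bin/demo-protoc"

base_path="$work/system/bin:$PATH"

# Old form, PATH=${PATH}:<installed>/bin: the system copy is found first.
appended="$( PATH="$base_path:$work/installed/bin"; export PATH; demo-protoc )"

# New form, PATH="<installed>/bin:${PATH}": the local copy is found first.
prepended="$( PATH="$work/installed/bin:$base_path"; export PATH; demo-protoc )"

echo "appended:  $appended"
echo "prepended: $prepended"
rm -rf "$work"
```

The shell resolves a command name by scanning `PATH` entries left to right, so only the prepended form selects the locally installed tool when a system copy is also present.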
--- a/README.md
+++ b/README.md
@@ -20,6 +20,9 @@
 </a>
 </p>

+[<img src="https://img.shields.io/badge/dockerhub-images-important.svg?logo=Docker">](https://hub.docker.com/r/localai/localai)
+[<img src="https://img.shields.io/badge/quay.io-images-important.svg?">](https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest)
+
 > :bulb: Get help - [❓FAQ](https://localai.io/faq/) [💭Discussions](https://github.com/go-skynet/LocalAI/discussions) [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) [:book: Documentation website](https://localai.io/)
 >
 > [💻 Quickstart](https://localai.io/basics/getting_started/) [📣 News](https://localai.io/basics/news/) [ 🛫 Examples ](https://github.com/go-skynet/LocalAI/tree/master/examples/) [ 🖼️ Models ](https://localai.io/models/) [ 🚀 Roadmap ](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)
@@ -40,6 +43,8 @@

 [Roadmap](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)

+- Mamba support: https://github.com/mudler/LocalAI/pull/1589
+- Start and share models with config file: https://github.com/mudler/LocalAI/pull/1522
 - 🐸 Coqui: https://github.com/mudler/LocalAI/pull/1489
 - Inline templates: https://github.com/mudler/LocalAI/pull/1452
 - Mixtral: https://github.com/mudler/LocalAI/pull/1449
@@ -54,6 +59,12 @@ If you want to help and contribute, issues up for grabs: https://github.com/mudl

 ## 💻 [Getting started](https://localai.io/basics/getting_started/index.html)

+For a detailed step-by-step introduction, refer to the [Getting Started](https://localai.io/basics/getting_started/index.html) guide. For those in a hurry, here's a straightforward one-liner to launch a LocalAI instance with [phi-2](https://huggingface.co/microsoft/phi-2) using `docker`:
+
+```
+docker run -ti -p 8080:8080 localai/localai:v2.5.1-ffmpeg-core phi-2
+```
+
 ## 🚀 [Features](https://localai.io/features/)

 - 📖 [Text generation with GPTs](https://localai.io/features/text-generation/) (`llama.cpp`, `gpt4all.cpp`, ... [:book: and more](https://localai.io/model-compatibility/index.html#model-compatibility-table))
@@ -81,6 +92,10 @@ WebUIs:

 Model galleries
 - https://github.com/go-skynet/model-gallery
+  
+Auto Docker / Model setup
+- https://io.midori-ai.xyz/howtos/easy-localai-installer/
+- https://io.midori-ai.xyz/howtos/easy-model-installer/

 Other:
 - Helm chart https://github.com/go-skynet/helm-charts
@@ -98,7 +113,7 @@ Other:
 - [How to build locally](https://localai.io/basics/build/index.html)
 - [How to install in Kubernetes](https://localai.io/basics/getting_started/index.html#run-localai-in-kubernetes)
 - [Projects integrating LocalAI](https://localai.io/integrations/)
- [How tos section](https://localai.io/howtos/) (curated by our community)
+- [How tos section](https://io.midori-ai.xyz/howtos/) (curated by our community)

 ## :book: 🎥 [Media, Blogs, Social](https://localai.io/basics/news/#media-blogs-social)

--- a/api/api.go
+++ b/api/api.go
@@ -16,6 +16,7 @@ import (
 	"github.com/go-skynet/LocalAI/metrics"
 	"github.com/go-skynet/LocalAI/pkg/assets"
 	"github.com/go-skynet/LocalAI/pkg/model"
+	"github.com/go-skynet/LocalAI/pkg/startup"

 	"github.com/gofiber/fiber/v2"
 	"github.com/gofiber/fiber/v2/middleware/cors"
@@ -36,6 +37,8 @@ func Startup(opts ...options.AppOption) (*options.Option, *config.ConfigLoader,
 	log.Info().Msgf("Starting LocalAI using %d threads, with models path: %s", options.Threads, options.Loader.ModelPath)
 	log.Info().Msgf("LocalAI version: %s", internal.PrintableVersion())

+	startup.PreloadModelsConfigurations(options.ModelLibraryURL, options.Loader.ModelPath, options.ModelsURL...)
+
 	cl := config.NewConfigLoader()
 	if err := cl.LoadConfigs(options.Loader.ModelPath); err != nil {
 		log.Error().Msgf("error loading config files: %s", err.Error())
@@ -51,6 +54,18 @@ func Startup(opts ...options.AppOption) (*options.Option, *config.ConfigLoader,
 		log.Error().Msgf("error downloading models: %s", err.Error())
 	}

+	if options.PreloadJSONModels != "" {
+		if err := localai.ApplyGalleryFromString(options.Loader.ModelPath, options.PreloadJSONModels, cl, options.Galleries); err != nil {
+			return nil, nil, err
+		}
+	}
+
+	if options.PreloadModelsFromPath != "" {
+		if err := localai.ApplyGalleryFromFile(options.Loader.ModelPath, options.PreloadModelsFromPath, cl, options.Galleries); err != nil {
+			return nil, nil, err
+		}
+	}
+
 	if options.Debug {
 		for _, v := range cl.ListConfigs() {
 			cfg, _ := cl.GetConfig(v)
@@ -67,18 +82,6 @@ func Startup(opts ...options.AppOption) (*options.Option, *config.ConfigLoader,
 		}
 	}

-	if options.PreloadJSONModels != "" {
-		if err := localai.ApplyGalleryFromString(options.Loader.ModelPath, options.PreloadJSONModels, cl, options.Galleries); err != nil {
-			return nil, nil, err
-		}
-	}
-
-	if options.PreloadModelsFromPath != "" {
-		if err := localai.ApplyGalleryFromFile(options.Loader.ModelPath, options.PreloadModelsFromPath, cl, options.Galleries); err != nil {
-			return nil, nil, err
-		}
-	}
-
 	// turn off any process that was started by GRPC if the context is canceled
 	go func() {
 		<-options.Context.Done()
@@ -213,6 +216,11 @@ func App(opts ...options.AppOption) (*fiber.App, error) {
 		}{Version: internal.PrintableVersion()})
 	})

+	// Make sure directories exist
+	os.MkdirAll(options.ImageDir, 0755)
+	os.MkdirAll(options.AudioDir, 0755)
+	os.MkdirAll(options.Loader.ModelPath, 0755)
+
 	modelGalleryService := localai.CreateModelGalleryService(options.Galleries, options.Loader.ModelPath, galleryService)
 	app.Post("/models/apply", auth, modelGalleryService.ApplyModelGalleryEndpoint())
 	app.Get("/models/available", auth, modelGalleryService.ListModelFromGalleryEndpoint())
--- a/api/api_test.go
+++ b/api/api_test.go
@@ -16,9 +16,9 @@ import (
 	. "github.com/go-skynet/LocalAI/api"
 	"github.com/go-skynet/LocalAI/api/options"
 	"github.com/go-skynet/LocalAI/metrics"
+	"github.com/go-skynet/LocalAI/pkg/downloader"
 	"github.com/go-skynet/LocalAI/pkg/gallery"
 	"github.com/go-skynet/LocalAI/pkg/model"
-	"github.com/go-skynet/LocalAI/pkg/utils"
 	"github.com/gofiber/fiber/v2"
 	. "github.com/onsi/ginkgo/v2"
 	. "github.com/onsi/gomega"
@@ -61,7 +61,7 @@ func getModelStatus(url string) (response map[string]interface{}) {
 }

 func getModels(url string) (response []gallery.GalleryModel) {
-	utils.GetURI(url, func(url string, i []byte) error {
+	downloader.GetURI(url, func(url string, i []byte) error {
 		// Unmarshal JSON data into a struct
 		return json.Unmarshal(i, &response)
 	})
--- a/api/backend/embeddings.go
+++ b/api/backend/embeddings.go
@@ -41,7 +41,7 @@ func ModelEmbedding(s string, tokens []int, loader *model.ModelLoader, c config.

 	var fn func() ([]float32, error)
 	switch model := inferenceModel.(type) {
-	case *grpc.Client:
+	case grpc.Backend:
 		fn = func() ([]float32, error) {
 			predictOptions := gRPCPredictOpts(c, loader.ModelPath)
 			if len(tokens) > 0 {
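Review note: the switch now matches `grpc.Backend` rather than the concrete `*grpc.Client`. Assuming `grpc.Backend` is an interface (the diff does not show its definition), any implementation satisfies the case, not only the gRPC client. The sketch below illustrates that behaviour with hypothetical stand-in types; `Backend`, `client`, `embedded`, and `describe` are invented for illustration and are not LocalAI names.

```go
package main

import "fmt"

// Backend is a hypothetical one-method interface standing in for grpc.Backend.
type Backend interface {
	Embeddings(s string) ([]float32, error)
}

// client is a hypothetical pointer-receiver implementation.
type client struct{}

func (*client) Embeddings(string) ([]float32, error) { return []float32{1}, nil }

// embedded is a hypothetical value-receiver implementation.
type embedded struct{}

func (embedded) Embeddings(string) ([]float32, error) { return []float32{2}, nil }

// describe uses the same switch shape as the hunk: a case on the interface
// type matches every value whose dynamic type implements it.
func describe(m interface{}) string {
	switch m.(type) {
	case Backend:
		return "backend"
	default:
		return "unsupported"
	}
}

func main() {
	fmt.Println(describe(&client{}), describe(embedded{}), describe(42))
}
```

With a case on a concrete pointer type, only that one type would match; the interface case is what lets alternative implementations go through the same code path.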
--- a/api/backend/llm.go
+++ b/api/backend/llm.go
@@ -31,7 +31,7 @@ func ModelInference(ctx context.Context, s string, images []string, loader *mode

 	grpcOpts := gRPCModelOpts(c)

-	var inferenceModel *grpc.Client
+	var inferenceModel grpc.Backend
 	var err error

 	opts := modelOpts(c, o, []model.Option{
@@ -159,6 +159,9 @@ func Finetune(config config.Config, input, prediction string) string {
 	for _, c := range config.TrimSpace {
 		prediction = strings.TrimSpace(strings.TrimPrefix(prediction, c))
 	}
-	return prediction

+	for _, c := range config.TrimSuffix {
+		prediction = strings.TrimSpace(strings.TrimSuffix(prediction, c))
+	}
+	return prediction
 }
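Review note: a minimal, standalone sketch of the trimming order introduced in `Finetune`, using only the standard library. The function name `trimOutput` and its parameters are illustrative assumptions; the two loops copy the logic of the hunk above.

```go
package main

import (
	"fmt"
	"strings"
)

// trimOutput mirrors the hunk above: every entry of trimSpace is removed as a
// prefix first, then every entry of trimSuffix is removed as a suffix.
// Surrounding whitespace is stripped after each step.
// The name and signature are illustrative, not part of LocalAI.
func trimOutput(prediction string, trimSpace, trimSuffix []string) string {
	for _, c := range trimSpace {
		prediction = strings.TrimSpace(strings.TrimPrefix(prediction, c))
	}
	for _, c := range trimSuffix {
		prediction = strings.TrimSpace(strings.TrimSuffix(prediction, c))
	}
	return prediction
}

func main() {
	out := trimOutput(
		"### Response: Hello there </s>",
		[]string{"### Response:"},
		[]string{"</s>"},
	)
	fmt.Printf("%q\n", out)
}
```

One consequence of this order: `strings.TrimSuffix` only matches when the string ends exactly with the suffix at that moment. If the raw prediction has trailing whitespace after the suffix and no earlier step has stripped it, the first suffix entry is left in place.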
--- a/api/backend/options.go
+++ b/api/backend/options.go
@@ -63,6 +63,8 @@ func gRPCModelOpts(c config.Config) *pb.ModelOptions {
 		F16Memory:      c.F16,
 		MLock:          c.MMlock,
 		RopeFreqBase:   c.RopeFreqBase,
+		RopeScaling:    c.RopeScaling,
+		Type:           c.ModelType,
 		RopeFreqScale:  c.RopeFreqScale,
 		NUMA:           c.NUMA,
 		Embeddings:     c.Embeddings,
--- a/api/config/config.go
+++ b/api/config/config.go
@@ -1,6 +1,7 @@
 package api_config

 import (
+	"errors"
 	"fmt"
 	"io/fs"
 	"os"
@@ -8,6 +9,7 @@ import (
 	"strings"
 	"sync"

+	"github.com/go-skynet/LocalAI/pkg/downloader"
 	"github.com/go-skynet/LocalAI/pkg/utils"
 	"github.com/rs/zerolog/log"
 	"gopkg.in/yaml.v3"
@@ -51,6 +53,17 @@ type Config struct {
 	// CUDA
 	// Explicitly enable CUDA or not (some backends might need it)
 	CUDA bool `yaml:"cuda"`
+
+	DownloadFiles []File `yaml:"download_files"`
+
+	Description string `yaml:"description"`
+	Usage       string `yaml:"usage"`
+}
+
+type File struct {
+	Filename string `yaml:"filename" json:"filename"`
+	SHA256   string `yaml:"sha256" json:"sha256"`
+	URI      string `yaml:"uri" json:"uri"`
 }

 type VallE struct {
@@ -102,18 +115,22 @@ type LLMConfig struct {
 	StopWords       []string `yaml:"stopwords"`
 	Cutstrings      []string `yaml:"cutstrings"`
 	TrimSpace       []string `yaml:"trimspace"`
-	ContextSize     int      `yaml:"context_size"`
-	NUMA            bool     `yaml:"numa"`
-	LoraAdapter     string   `yaml:"lora_adapter"`
-	LoraBase        string   `yaml:"lora_base"`
-	LoraScale       float32  `yaml:"lora_scale"`
-	NoMulMatQ       bool     `yaml:"no_mulmatq"`
-	DraftModel      string   `yaml:"draft_model"`
-	NDraft          int32    `yaml:"n_draft"`
-	Quantization    string   `yaml:"quantization"`
-	MMProj          string   `yaml:"mmproj"`
+	TrimSuffix      []string `yaml:"trimsuffix"`
+
+	ContextSize  int     `yaml:"context_size"`
+	NUMA         bool    `yaml:"numa"`
+	LoraAdapter  string  `yaml:"lora_adapter"`
+	LoraBase     string  `yaml:"lora_base"`
+	LoraScale    float32 `yaml:"lora_scale"`
+	NoMulMatQ    bool    `yaml:"no_mulmatq"`
+	DraftModel   string  `yaml:"draft_model"`
+	NDraft       int32   `yaml:"n_draft"`
+	Quantization string  `yaml:"quantization"`
+	MMProj       string  `yaml:"mmproj"`
+
+	RopeScaling string `yaml:"rope_scaling"`
+	ModelType   string `yaml:"type"`

-	RopeScaling    string  `yaml:"rope_scaling"`
 	YarnExtFactor  float32 `yaml:"yarn_ext_factor"`
 	YarnAttnFactor float32 `yaml:"yarn_attn_factor"`
 	YarnBetaFast   float32 `yaml:"yarn_beta_fast"`
@@ -266,22 +283,44 @@ func (cm *ConfigLoader) ListConfigs() []string {
 	return res
 }

+// Preload prepares models that are not local but are referenced by URL or Hugging Face repository.
 func (cm *ConfigLoader) Preload(modelPath string) error {
 	cm.Lock()
 	defer cm.Unlock()

+	status := func(fileName, current, total string, percent float64) {
+		utils.DisplayDownloadFunction(fileName, current, total, percent)
+	}
+
+	log.Info().Msgf("Preloading models from %s", modelPath)
+
 	for i, config := range cm.configs {
+
+		// Download files and verify their SHA
+		for _, file := range config.DownloadFiles {
+			log.Debug().Msgf("Checking %q exists and matches SHA", file.Filename)
+
+			if err := utils.VerifyPath(file.Filename, modelPath); err != nil {
+				return err
+			}
+			// Create file path
+			filePath := filepath.Join(modelPath, file.Filename)
+
+			if err := downloader.DownloadFile(file.URI, filePath, file.SHA256, status); err != nil {
+				return err
+			}
+		}
+
 		modelURL := config.PredictionOptions.Model
-		modelURL = utils.ConvertURL(modelURL)
-		if strings.HasPrefix(modelURL, "http://") || strings.HasPrefix(modelURL, "https://") {
+		modelURL = downloader.ConvertURL(modelURL)
+
+		if downloader.LooksLikeURL(modelURL) {
 			// md5 of model name
 			md5Name := utils.MD5(modelURL)

 			// check if file exists
-			if _, err := os.Stat(filepath.Join(modelPath, md5Name)); err == os.ErrNotExist {
-				err := utils.DownloadFile(modelURL, filepath.Join(modelPath, md5Name), "", func(fileName, current, total string, percent float64) {
-					log.Info().Msgf("Downloading %s: %s/%s (%.2f%%)", fileName, current, total, percent)
-				})
+			if _, err := os.Stat(filepath.Join(modelPath, md5Name)); errors.Is(err, os.ErrNotExist) {
+				err := downloader.DownloadFile(modelURL, filepath.Join(modelPath, md5Name), "", status)
 				if err != nil {
 					return err
 				}
@@ -292,6 +331,15 @@ func (cm *ConfigLoader) Preload(modelPath string) error {
 			c.PredictionOptions.Model = md5Name
 			cm.configs[i] = *c
 		}
+		if cm.configs[i].Name != "" {
+			log.Info().Msgf("Model name: %s", cm.configs[i].Name)
+		}
+		if cm.configs[i].Description != "" {
+			log.Info().Msgf("Model description: %s", cm.configs[i].Description)
+		}
+		if cm.configs[i].Usage != "" {
+			log.Info().Msgf("Model usage: \n%s", cm.configs[i].Usage)
+		}
 	}
 	return nil
 }
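Review note: the switch from `err == os.ErrNotExist` to `errors.Is(err, os.ErrNotExist)` in `Preload` matters because `os.Stat` returns a `*fs.PathError` that wraps the underlying error, so a direct comparison never matches. The sketch below shows the difference using only the standard library; the helper name `missing` and the temporary path are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// missing reports whether path does not exist. errors.Is unwraps the
// *fs.PathError returned by os.Stat before comparing.
func missing(path string) bool {
	_, err := os.Stat(path)
	return errors.Is(err, os.ErrNotExist)
}

func main() {
	// Build a path inside a fresh temporary directory; the file is never created.
	dir, err := os.MkdirTemp("", "preload-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "no-such-model")

	_, statErr := os.Stat(path)
	fmt.Println("direct comparison:", statErr == os.ErrNotExist) // false
	fmt.Println("errors.Is:        ", missing(path))
}
```

With the old comparison, the download branch would have been skipped even when the file was absent, which is consistent with this hunk being a bug fix rather than a style change.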
--- a/api/localai/gallery.go
+++ b/api/localai/gallery.go
@@ -130,6 +130,12 @@ func (g *galleryApplier) Start(c context.Context, cm *config.ConfigLoader) {
 					continue
 				}

+				err = cm.Preload(g.modelPath)
+				if err != nil {
+					updateError(err)
+					continue
+				}
+
 				g.updateStatus(op.id, &galleryOpStatus{Processed: true, Message: "completed", Progress: 100})
 			}
 		}
--- a/api/options/options.go
+++ b/api/options/options.go
@@ -28,6 +28,8 @@ type Option struct {
 	ApiKeys                             []string
 	Metrics                             *metrics.Metrics

+	ModelLibraryURL string
+
 	Galleries []gallery.Gallery

 	BackendAssets     embed.FS
@@ -40,9 +42,12 @@ type Option struct {
 	SingleBackend           bool
 	ParallelBackendRequests bool

-	WatchDogIdle                             bool
-	WatchDogBusy                             bool
-	WatchDog                                 bool
+	WatchDogIdle bool
+	WatchDogBusy bool
+	WatchDog     bool
+
+	ModelsURL []string
+
 	WatchDogBusyTimeout, WatchDogIdleTimeout time.Duration
 }

@@ -63,12 +68,24 @@ func NewOptions(o ...AppOption) *Option {
 	return opt
 }

+func WithModelsURL(urls ...string) AppOption {
+	return func(o *Option) {
+		o.ModelsURL = urls
+	}
+}
+
 func WithCors(b bool) AppOption {
 	return func(o *Option) {
 		o.CORS = b
 	}
 }

+func WithModelLibraryURL(url string) AppOption {
+	return func(o *Option) {
+		o.ModelLibraryURL = url
+	}
+}
+
 var EnableWatchDog = func(o *Option) {
 	o.WatchDog = true
 }
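Review note: `WithModelsURL` and `WithModelLibraryURL` follow the functional-options pattern already used in this file. The sketch below is a trimmed-down, self-contained version. The `Option` struct keeps only the two new fields, the body of `NewOptions` is an assumption (the diff shows only its last line), and the URL is a placeholder.

```go
package main

import "fmt"

// Option is a reduced stand-in for the options struct in the diff.
type Option struct {
	ModelLibraryURL string
	ModelsURL       []string
}

// AppOption mutates an Option, matching the shape of the setters in the diff.
type AppOption func(*Option)

// WithModelsURL records the model URLs to preload at startup.
func WithModelsURL(urls ...string) AppOption {
	return func(o *Option) {
		o.ModelsURL = urls
	}
}

// WithModelLibraryURL records where the model library is fetched from.
func WithModelLibraryURL(url string) AppOption {
	return func(o *Option) {
		o.ModelLibraryURL = url
	}
}

// NewOptions applies each setter in order, so a later setter overrides an
// earlier one for the same field. This body is assumed, not taken from the diff.
func NewOptions(o ...AppOption) *Option {
	opt := &Option{}
	for _, oo := range o {
		oo(opt)
	}
	return opt
}

func main() {
	opts := NewOptions(
		WithModelLibraryURL("https://example.com/library.yaml"), // placeholder URL
		WithModelsURL("phi-2"),
	)
	fmt.Println(opts.ModelLibraryURL, opts.ModelsURL)
}
```

Note that `WithModelsURL` assigns rather than appends, so calling it twice keeps only the URLs from the last call.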
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -112,7 +112,6 @@ message ModelOptions {
  int32 CLIPSkip = 33;
  string ControlNet = 48;

-  // RWKV
  string Tokenizer = 34;

  // LLM (llama.cpp)
@@ -135,6 +134,8 @@ message ModelOptions {
  float YarnAttnFactor = 45;
  float YarnBetaFast = 46;
  float YarnBetaSlow = 47;
+
+  string Type = 49;
 }

 message Result {
--- a/backend/backend_grpc.pb.go
+++ b/backend/backend_grpc.pb.go
@@ -0,0 +1,457 @@
+// Code generated by protoc-gen-go-grpc. DO NOT EDIT.
+// versions:
+// - protoc-gen-go-grpc v1.2.0
+// - protoc             v4.23.4
+// source: backend/backend.proto
+
+package proto
+
+import (
+	context "context"
+	grpc "google.golang.org/grpc"
+	codes "google.golang.org/grpc/codes"
+	status "google.golang.org/grpc/status"
+)
+
+// This is a compile-time assertion to ensure that this generated file
+// is compatible with the grpc package it is being compiled against.
+// Requires gRPC-Go v1.32.0 or later.
+const _ = grpc.SupportPackageIsVersion7
+
+// BackendClient is the client API for Backend service.
+//
+// For semantics around ctx use and closing/ending streaming RPCs, please refer to https://pkg.go.dev/google.golang.org/grpc/?tab=doc#ClientConn.NewStream.
+type BackendClient interface {
+	Health(ctx context.Context, in *HealthMessage, opts ...grpc.CallOption) (*Reply, error)
+	Predict(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (*Reply, error)
+	LoadModel(ctx context.Context, in *ModelOptions, opts ...grpc.CallOption) (*Result, error)
+	PredictStream(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (Backend_PredictStreamClient, error)
+	Embedding(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (*EmbeddingResult, error)
+	GenerateImage(ctx context.Context, in *GenerateImageRequest, opts ...grpc.CallOption) (*Result, error)
+	AudioTranscription(ctx context.Context, in *TranscriptRequest, opts ...grpc.CallOption) (*TranscriptResult, error)
+	TTS(ctx context.Context, in *TTSRequest, opts ...grpc.CallOption) (*Result, error)
+	TokenizeString(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (*TokenizationResponse, error)
+	Status(ctx context.Context, in *HealthMessage, opts ...grpc.CallOption) (*StatusResponse, error)
+}
+
+type backendClient struct {
+	cc grpc.ClientConnInterface
+}
+
+func NewBackendClient(cc grpc.ClientConnInterface) BackendClient {
+	return &backendClient{cc}
+}
+
+func (c *backendClient) Health(ctx context.Context, in *HealthMessage, opts ...grpc.CallOption) (*Reply, error) {
+	out := new(Reply)
+	err := c.cc.Invoke(ctx, "/backend.Backend/Health", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) Predict(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (*Reply, error) {
+	out := new(Reply)
+	err := c.cc.Invoke(ctx, "/backend.Backend/Predict", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) LoadModel(ctx context.Context, in *ModelOptions, opts ...grpc.CallOption) (*Result, error) {
+	out := new(Result)
+	err := c.cc.Invoke(ctx, "/backend.Backend/LoadModel", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) PredictStream(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (Backend_PredictStreamClient, error) {
+	stream, err := c.cc.NewStream(ctx, &Backend_ServiceDesc.Streams[0], "/backend.Backend/PredictStream", opts...)
+	if err != nil {
+		return nil, err
+	}
+	x := &backendPredictStreamClient{stream}
+	if err := x.ClientStream.SendMsg(in); err != nil {
+		return nil, err
+	}
+	if err := x.ClientStream.CloseSend(); err != nil {
+		return nil, err
+	}
+	return x, nil
+}
+
+type Backend_PredictStreamClient interface {
+	Recv() (*Reply, error)
+	grpc.ClientStream
+}
+
+type backendPredictStreamClient struct {
+	grpc.ClientStream
+}
+
+func (x *backendPredictStreamClient) Recv() (*Reply, error) {
+	m := new(Reply)
+	if err := x.ClientStream.RecvMsg(m); err != nil {
+		return nil, err
+	}
+	return m, nil
+}
+
+func (c *backendClient) Embedding(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (*EmbeddingResult, error) {
+	out := new(EmbeddingResult)
+	err := c.cc.Invoke(ctx, "/backend.Backend/Embedding", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) GenerateImage(ctx context.Context, in *GenerateImageRequest, opts ...grpc.CallOption) (*Result, error) {
+	out := new(Result)
+	err := c.cc.Invoke(ctx, "/backend.Backend/GenerateImage", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) AudioTranscription(ctx context.Context, in *TranscriptRequest, opts ...grpc.CallOption) (*TranscriptResult, error) {
+	out := new(TranscriptResult)
+	err := c.cc.Invoke(ctx, "/backend.Backend/AudioTranscription", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) TTS(ctx context.Context, in *TTSRequest, opts ...grpc.CallOption) (*Result, error) {
+	out := new(Result)
+	err := c.cc.Invoke(ctx, "/backend.Backend/TTS", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) TokenizeString(ctx context.Context, in *PredictOptions, opts ...grpc.CallOption) (*TokenizationResponse, error) {
+	out := new(TokenizationResponse)
+	err := c.cc.Invoke(ctx, "/backend.Backend/TokenizeString", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+func (c *backendClient) Status(ctx context.Context, in *HealthMessage, opts ...grpc.CallOption) (*StatusResponse, error) {
+	out := new(StatusResponse)
+	err := c.cc.Invoke(ctx, "/backend.Backend/Status", in, out, opts...)
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+// BackendServer is the server API for Backend service.
+// All implementations must embed UnimplementedBackendServer
+// for forward compatibility
+type BackendServer interface {
+	Health(context.Context, *HealthMessage) (*Reply, error)
+	Predict(context.Context, *PredictOptions) (*Reply, error)
+	LoadModel(context.Context, *ModelOptions) (*Result, error)
+	PredictStream(*PredictOptions, Backend_PredictStreamServer) error
+	Embedding(context.Context, *PredictOptions) (*EmbeddingResult, error)
+	GenerateImage(context.Context, *GenerateImageRequest) (*Result, error)
+	AudioTranscription(context.Context, *TranscriptRequest) (*TranscriptResult, error)
+	TTS(context.Context, *TTSRequest) (*Result, error)
+	TokenizeString(context.Context, *PredictOptions) (*TokenizationResponse, error)
+	Status(context.Context, *HealthMessage) (*StatusResponse, error)
+	mustEmbedUnimplementedBackendServer()
+}
+
+// UnimplementedBackendServer must be embedded to have forward compatible implementations.
+type UnimplementedBackendServer struct {
+}
+
+func (UnimplementedBackendServer) Health(context.Context, *HealthMessage) (*Reply, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method Health not implemented")
+}
+func (UnimplementedBackendServer) Predict(context.Context, *PredictOptions) (*Reply, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method Predict not implemented")
+}
+func (UnimplementedBackendServer) LoadModel(context.Context, *ModelOptions) (*Result, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method LoadModel not implemented")
+}
+func (UnimplementedBackendServer) PredictStream(*PredictOptions, Backend_PredictStreamServer) error {
+	return status.Errorf(codes.Unimplemented, "method PredictStream not implemented")
+}
+func (UnimplementedBackendServer) Embedding(context.Context, *PredictOptions) (*EmbeddingResult, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method Embedding not implemented")
+}
+func (UnimplementedBackendServer) GenerateImage(context.Context, *GenerateImageRequest) (*Result, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method GenerateImage not implemented")
+}
+func (UnimplementedBackendServer) AudioTranscription(context.Context, *TranscriptRequest) (*TranscriptResult, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method AudioTranscription not implemented")
+}
+func (UnimplementedBackendServer) TTS(context.Context, *TTSRequest) (*Result, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method TTS not implemented")
+}
+func (UnimplementedBackendServer) TokenizeString(context.Context, *PredictOptions) (*TokenizationResponse, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method TokenizeString not implemented")
+}
+func (UnimplementedBackendServer) Status(context.Context, *HealthMessage) (*StatusResponse, error) {
+	return nil, status.Errorf(codes.Unimplemented, "method Status not implemented")
+}
+func (UnimplementedBackendServer) mustEmbedUnimplementedBackendServer() {}
+
+// UnsafeBackendServer may be embedded to opt out of forward compatibility for this service.
+// Use of this interface is not recommended, as added methods to BackendServer will
+// result in compilation errors.
+type UnsafeBackendServer interface {
+	mustEmbedUnimplementedBackendServer()
+}
+
+func RegisterBackendServer(s grpc.ServiceRegistrar, srv BackendServer) {
+	s.RegisterService(&Backend_ServiceDesc, srv)
+}
+
+func _Backend_Health_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(HealthMessage)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).Health(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/Health",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).Health(ctx, req.(*HealthMessage))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_Predict_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(PredictOptions)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).Predict(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/Predict",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).Predict(ctx, req.(*PredictOptions))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_LoadModel_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(ModelOptions)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).LoadModel(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/LoadModel",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).LoadModel(ctx, req.(*ModelOptions))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_PredictStream_Handler(srv interface{}, stream grpc.ServerStream) error {
+	m := new(PredictOptions)
+	if err := stream.RecvMsg(m); err != nil {
+		return err
+	}
+	return srv.(BackendServer).PredictStream(m, &backendPredictStreamServer{stream})
+}
+
+type Backend_PredictStreamServer interface {
+	Send(*Reply) error
+	grpc.ServerStream
+}
+
+type backendPredictStreamServer struct {
+	grpc.ServerStream
+}
+
+func (x *backendPredictStreamServer) Send(m *Reply) error {
+	return x.ServerStream.SendMsg(m)
+}
+
+func _Backend_Embedding_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(PredictOptions)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).Embedding(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/Embedding",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).Embedding(ctx, req.(*PredictOptions))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_GenerateImage_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(GenerateImageRequest)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).GenerateImage(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/GenerateImage",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).GenerateImage(ctx, req.(*GenerateImageRequest))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_AudioTranscription_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(TranscriptRequest)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).AudioTranscription(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/AudioTranscription",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).AudioTranscription(ctx, req.(*TranscriptRequest))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_TTS_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(TTSRequest)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).TTS(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/TTS",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).TTS(ctx, req.(*TTSRequest))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_TokenizeString_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(PredictOptions)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).TokenizeString(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/TokenizeString",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).TokenizeString(ctx, req.(*PredictOptions))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+func _Backend_Status_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {
+	in := new(HealthMessage)
+	if err := dec(in); err != nil {
+		return nil, err
+	}
+	if interceptor == nil {
+		return srv.(BackendServer).Status(ctx, in)
+	}
+	info := &grpc.UnaryServerInfo{
+		Server:     srv,
+		FullMethod: "/backend.Backend/Status",
+	}
+	handler := func(ctx context.Context, req interface{}) (interface{}, error) {
+		return srv.(BackendServer).Status(ctx, req.(*HealthMessage))
+	}
+	return interceptor(ctx, in, info, handler)
+}
+
+// Backend_ServiceDesc is the grpc.ServiceDesc for Backend service.
+// It's only intended for direct use with grpc.RegisterService,
+// and not to be introspected or modified (even as a copy)
+var Backend_ServiceDesc = grpc.ServiceDesc{
+	ServiceName: "backend.Backend",
+	HandlerType: (*BackendServer)(nil),
+	Methods: []grpc.MethodDesc{
+		{
+			MethodName: "Health",
+			Handler:    _Backend_Health_Handler,
+		},
+		{
+			MethodName: "Predict",
+			Handler:    _Backend_Predict_Handler,
+		},
+		{
+			MethodName: "LoadModel",
+			Handler:    _Backend_LoadModel_Handler,
+		},
+		{
+			MethodName: "Embedding",
+			Handler:    _Backend_Embedding_Handler,
+		},
+		{
+			MethodName: "GenerateImage",
+			Handler:    _Backend_GenerateImage_Handler,
+		},
+		{
+			MethodName: "AudioTranscription",
+			Handler:    _Backend_AudioTranscription_Handler,
+		},
+		{
+			MethodName: "TTS",
+			Handler:    _Backend_TTS_Handler,
+		},
+		{
+			MethodName: "TokenizeString",
+			Handler:    _Backend_TokenizeString_Handler,
+		},
+		{
+			MethodName: "Status",
+			Handler:    _Backend_Status_Handler,
+		},
+	},
+	Streams: []grpc.StreamDesc{
+		{
+			StreamName:    "PredictStream",
+			Handler:       _Backend_PredictStream_Handler,
+			ServerStreams: true,
+		},
+	},
+	Metadata: "backend/backend.proto",
+}
--- a/backend/cpp/grpc/Makefile
+++ b/backend/cpp/grpc/Makefile
@@ -0,0 +1,66 @@
+# Basic platform detection
+HOST_SYSTEM = $(shell uname | cut -f 1 -d_)
+SYSTEM ?= $(HOST_SYSTEM)
+
+TAG_LIB_GRPC?=v1.59.0
+GIT_REPO_LIB_GRPC?=https://github.com/grpc/grpc.git
+GIT_CLONE_DEPTH?=1
+NUM_BUILD_THREADS?=$(shell nproc --ignore=1)
+
+INSTALLED_PACKAGES=installed_packages
+GRPC_REPO=grpc_repo
+GRPC_BUILD=grpc_build
+
+export CMAKE_ARGS?=
+CMAKE_ARGS+=-DCMAKE_BUILD_TYPE=Release
+CMAKE_ARGS+=-DgRPC_INSTALL=ON
+CMAKE_ARGS+=-DEXECUTABLE_OUTPUT_PATH=../$(INSTALLED_PACKAGES)/grpc/bin
+CMAKE_ARGS+=-DLIBRARY_OUTPUT_PATH=../$(INSTALLED_PACKAGES)/grpc/lib
+CMAKE_ARGS+=-DgRPC_BUILD_TESTS=OFF
+CMAKE_ARGS+=-DgRPC_BUILD_CSHARP_EXT=OFF
+CMAKE_ARGS+=-DgRPC_BUILD_GRPC_CPP_PLUGIN=ON
+CMAKE_ARGS+=-DgRPC_BUILD_GRPC_CSHARP_PLUGIN=OFF
+CMAKE_ARGS+=-DgRPC_BUILD_GRPC_NODE_PLUGIN=OFF
+CMAKE_ARGS+=-DgRPC_BUILD_GRPC_OBJECTIVE_C_PLUGIN=OFF
+CMAKE_ARGS+=-DgRPC_BUILD_GRPC_PHP_PLUGIN=OFF
+CMAKE_ARGS+=-DgRPC_BUILD_GRPC_PYTHON_PLUGIN=ON
+CMAKE_ARGS+=-DgRPC_BUILD_GRPC_RUBY_PLUGIN=OFF
+CMAKE_ARGS+=-Dprotobuf_WITH_ZLIB=ON
+CMAKE_ARGS+=-DRE2_BUILD_TESTING=OFF
+CMAKE_ARGS+=-DCMAKE_INSTALL_PREFIX=../$(INSTALLED_PACKAGES)
+
+# Windows builds need OPENSSL_NO_ASM set. This results in slower crypto performance, but the build fails otherwise.
+# This may be resolvable, but for now it is set. More info: https://stackoverflow.com/a/75240504/480673
+ifeq ($(SYSTEM),MSYS)
+CMAKE_ARGS+=-DOPENSSL_NO_ASM=ON
+endif
+ifeq ($(SYSTEM),MINGW64)
+CMAKE_ARGS+=-DOPENSSL_NO_ASM=ON
+endif
+ifeq ($(SYSTEM),MINGW32)
+CMAKE_ARGS+=-DOPENSSL_NO_ASM=ON
+endif
+ifeq ($(SYSTEM),CYGWIN)
+CMAKE_ARGS+=-DOPENSSL_NO_ASM=ON
+endif
+
+$(INSTALLED_PACKAGES): grpc_build
+
+$(GRPC_REPO):
+	git clone --depth $(GIT_CLONE_DEPTH) -b $(TAG_LIB_GRPC) $(GIT_REPO_LIB_GRPC) $(GRPC_REPO)/grpc
+	cd $(GRPC_REPO)/grpc && git submodule update --init --recursive --depth $(GIT_CLONE_DEPTH)
+
+$(GRPC_BUILD): $(GRPC_REPO)
+	mkdir -p $(GRPC_BUILD)
+	cd $(GRPC_BUILD) && cmake $(CMAKE_ARGS) ../$(GRPC_REPO)/grpc && cmake --build . -- -j ${NUM_BUILD_THREADS} && cmake --build . --target install -- -j ${NUM_BUILD_THREADS}
+
+build: $(INSTALLED_PACKAGES)
+
+rebuild:
+	rm -rf grpc_build
+	$(MAKE) grpc_build
+
+clean:
+	rm -rf grpc_build
+	rm -rf grpc_repo
+	rm -rf installed_packages
--- a/backend/cpp/grpc/script/build_grpc.sh
+++ b/backend/cpp/grpc/script/build_grpc.sh
@@ -1,81 +0,0 @@
-#!/bin/bash
-
-# Builds locally from sources the packages needed by the llama cpp backend.
-
-# Makes sure a few base packages exist.
-# sudo apt-get --no-upgrade -y install g++ gcc binutils cmake git build-essential autoconf libtool pkg-config 
-
-SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
-echo "Script directory: $SCRIPT_DIR"
-
-CPP_INSTALLED_PACKAGES_DIR=$1
-if [ -z ${CPP_INSTALLED_PACKAGES_DIR} ]; then 
-    echo "CPP_INSTALLED_PACKAGES_DIR env variable not set. Don't know where to install: failed."; 
-    echo
-    exit -1
-fi
-
-if [ -d "${CPP_INSTALLED_PACKAGES_DIR}" ]; then
-  echo "gRPC installation directory already exists. Nothing to do."
-  exit 0
-fi
-
-# The depth when cloning a git repo. 1 speeds up the clone when the repo history is not needed.
-GIT_CLONE_DEPTH=1
-
-NUM_BUILD_THREADS=$(nproc --ignore=1)
-
-# Google gRPC --------------------------------------------------------------------------------------
-TAG_LIB_GRPC="v1.59.0"
-GIT_REPO_LIB_GRPC="https://github.com/grpc/grpc.git"
-GRPC_REPO_DIR="${SCRIPT_DIR}/../grpc_repo"
-GRPC_BUILD_DIR="${SCRIPT_DIR}/../grpc_build"
-SRC_DIR_LIB_GRPC="${GRPC_REPO_DIR}/grpc"
-
-echo "SRC_DIR_LIB_GRPC: ${SRC_DIR_LIB_GRPC}"
-echo "GRPC_REPO_DIR: ${GRPC_REPO_DIR}"
-echo "GRPC_BUILD_DIR: ${GRPC_BUILD_DIR}"
-
-mkdir -pv ${GRPC_REPO_DIR}
-
-rm   -rf ${GRPC_BUILD_DIR}
-mkdir -pv ${GRPC_BUILD_DIR}
-
-mkdir -pv ${CPP_INSTALLED_PACKAGES_DIR}
-	
-if [ -d "${SRC_DIR_LIB_GRPC}" ]; then
-  echo "gRPC source already exists locally. Not cloned again."
-else  
-  ( cd ${GRPC_REPO_DIR} && \
-    git clone --depth ${GIT_CLONE_DEPTH} -b ${TAG_LIB_GRPC} ${GIT_REPO_LIB_GRPC} && \
-    cd ${SRC_DIR_LIB_GRPC} && \
-    git submodule update --init --recursive --depth ${GIT_CLONE_DEPTH} 
-  )    
-fi
-
-( cd ${GRPC_BUILD_DIR} && \
-  cmake -G "Unix Makefiles" \
-     -DCMAKE_BUILD_TYPE=Release \
-     -DgRPC_INSTALL=ON \
-     -DEXECUTABLE_OUTPUT_PATH=${CPP_INSTALLED_PACKAGES_DIR}/grpc/bin \
-     -DLIBRARY_OUTPUT_PATH=${CPP_INSTALLED_PACKAGES_DIR}/grpc/lib \
-     -DgRPC_BUILD_TESTS=OFF \
-     -DgRPC_BUILD_CSHARP_EXT=OFF \
-     -DgRPC_BUILD_GRPC_CPP_PLUGIN=ON \
-     -DgRPC_BUILD_GRPC_CSHARP_PLUGIN=OFF \
-     -DgRPC_BUILD_GRPC_NODE_PLUGIN=OFF \
-     -DgRPC_BUILD_GRPC_OBJECTIVE_C_PLUGIN=OFF \
-     -DgRPC_BUILD_GRPC_PHP_PLUGIN=OFF \
-     -DgRPC_BUILD_GRPC_PYTHON_PLUGIN=ON \
-     -DgRPC_BUILD_GRPC_RUBY_PLUGIN=OFF \
-     -Dprotobuf_WITH_ZLIB=ON \
-     -DRE2_BUILD_TESTING=OFF \
-     -DCMAKE_INSTALL_PREFIX=${CPP_INSTALLED_PACKAGES_DIR}/ \
-     ${SRC_DIR_LIB_GRPC}  && \
-  cmake --build .  -- -j ${NUM_BUILD_THREADS} && \
-  cmake --build .  --target install -- -j ${NUM_BUILD_THREADS} 
-)
-
-rm -rf ${GRPC_BUILD_DIR}
-rm -rf ${GRPC_REPO_DIR}
-
--- a/backend/cpp/llama/CMakeLists.txt
+++ b/backend/cpp/llama/CMakeLists.txt
@@ -17,9 +17,17 @@ cmake_minimum_required(VERSION 3.15)
 set(TARGET grpc-server)
 set(_PROTOBUF_LIBPROTOBUF libprotobuf)
 set(_REFLECTION grpc++_reflection)
+
 if (${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
-    link_directories("/opt/homebrew/lib")
-    include_directories("/opt/homebrew/include")
+    # Set correct Homebrew install folder for Apple Silicon and Intel Macs
+    if (CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
+        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
+    else()
+        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
+    endif()
+
+    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
+    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
 endif()

 find_package(absl CONFIG REQUIRED)
--- a/backend/cpp/llama/grpc-server.cpp
+++ b/backend/cpp/llama/grpc-server.cpp
@@ -26,6 +26,7 @@
 #include <mutex>
 #include <chrono>
 #include <regex>
+#include <condition_variable>
 #include <grpcpp/ext/proto_server_reflection_plugin.h>
 #include <grpcpp/grpcpp.h>
 #include <grpcpp/health_check_service_interface.h>
@@ -40,12 +41,15 @@ using backend::HealthMessage;


 ///// LLAMA.CPP server code below
+
 #define DEFAULT_OAICOMPAT_MODEL "gpt-3.5-turbo-0613"
+
 using json = nlohmann::json;

 struct server_params
 {
    std::string hostname = "127.0.0.1";
+    std::string api_key;
    std::string public_path = "examples/server/public";
    int32_t port = 8080;
    int32_t read_timeout = 600;
@@ -89,7 +93,7 @@ static inline bool is_base64(uint8_t c)
    return (isalnum(c) || (c == '+') || (c == '/'));
 }

-static std::vector<uint8_t> base64_decode(std::string const &encoded_string)
+static std::vector<uint8_t> base64_decode(const std::string & encoded_string)
 {
    int i = 0;
    int j = 0;
@@ -154,8 +158,8 @@ static std::vector<uint8_t> base64_decode(std::string const &encoded_string)
 //

 enum task_type {
-    COMPLETION_TASK,
-    CANCEL_TASK
+    TASK_TYPE_COMPLETION,
+    TASK_TYPE_CANCEL,
 };

 struct task_server {
@@ -216,10 +220,10 @@ struct slot_image
    int32_t id;

    bool request_encode_image = false;
-    float* image_embedding = nullptr;
+    float * image_embedding = nullptr;
    int32_t image_tokens = 0;

-    clip_image_u8 img_data;
+    clip_image_u8 * img_data;

    std::string prefix_prompt; // before of this image
 };
@@ -441,20 +445,25 @@ struct llama_client_slot

        generated_token_probs.clear();

-        for (slot_image &img : images)
+        for (slot_image & img : images)
        {
            free(img.image_embedding);
-            delete[] img.img_data.data;
+            if (img.img_data) {
+                clip_image_u8_free(img.img_data);
+            }
            img.prefix_prompt = "";
        }

        images.clear();
-        // llama_set_rng_seed(ctx, params.seed); in batched the seed matter???????
    }

    bool has_budget(gpt_params &global_params) {
+        if (params.n_predict == -1 && global_params.n_predict == -1)
+        {
+            return true; // limitless
+        }
        n_remaining = -1;
-        if(params.n_predict != -1)
+        if (params.n_predict != -1)
        {
            n_remaining = params.n_predict - n_decoded;
        }
@@ -462,7 +471,7 @@ struct llama_client_slot
        {
            n_remaining = global_params.n_predict - n_decoded;
        }
-        return n_remaining > 0 || n_remaining == -1; // no budget || limitless
+        return n_remaining > 0; // no budget
    }

    bool available() const {
@@ -550,7 +559,9 @@ struct llama_server_context
    std::vector<task_result> queue_results;
    std::vector<task_multi>  queue_multitasks;
    std::mutex mutex_tasks; // also guards id_gen, and queue_multitasks
+    std::condition_variable condition_tasks;
    std::mutex mutex_results;
+    std::condition_variable condition_results;

    ~llama_server_context()
    {
@@ -769,6 +780,42 @@ struct llama_server_context
            slot->prompt = "";
        }

+        slot->sparams.penalty_prompt_tokens.clear();
+        slot->sparams.use_penalty_prompt_tokens = false;
+        const auto &penalty_prompt = data.find("penalty_prompt");
+        if (penalty_prompt != data.end())
+        {
+            if (penalty_prompt->is_string())
+            {
+                const auto penalty_prompt_string = penalty_prompt->get<std::string>();
+                auto penalty_tokens = llama_tokenize(model, penalty_prompt_string, false);
+                slot->sparams.penalty_prompt_tokens.swap(penalty_tokens);
+                if (slot->params.n_predict > 0)
+                {
+                    slot->sparams.penalty_prompt_tokens.reserve(slot->sparams.penalty_prompt_tokens.size() + slot->params.n_predict);
+                }
+                slot->sparams.use_penalty_prompt_tokens = true;
+            }
+            else if (penalty_prompt->is_array())
+            {
+                const auto n_tokens = penalty_prompt->size();
+                slot->sparams.penalty_prompt_tokens.reserve(n_tokens + std::max(0, slot->params.n_predict));
+                const int n_vocab = llama_n_vocab(model);
+                for (const auto &penalty_token : *penalty_prompt)
+                {
+                    if (penalty_token.is_number_integer())
+                    {
+                        const auto tok = penalty_token.get<llama_token>();
+                        if (tok >= 0 && tok < n_vocab)
+                        {
+                            slot->sparams.penalty_prompt_tokens.push_back(tok);
+                        }
+                    }
+                }
+                slot->sparams.use_penalty_prompt_tokens = true;
+            }
+        }
+
        slot->sparams.logit_bias.clear();

        if (json_value(data, "ignore_eos", false))
@@ -821,24 +868,17 @@ struct llama_server_context
            {
                for (const auto &img : *images_data)
                {
-                    std::string data_b64 = img["data"].get<std::string>();
+                    const std::vector<uint8_t> image_buffer = base64_decode(img["data"].get<std::string>());
+
                    slot_image img_sl;
                    img_sl.id = img.count("id") != 0 ? img["id"].get<int>() : slot->images.size();
-                    int width, height, channels;
-                    std::vector<uint8_t> image_buffer = base64_decode(data_b64);
-                    data_b64.clear();
-                    auto data = stbi_load_from_memory(image_buffer.data(), image_buffer.size(), &width, &height, &channels, 3);
-                    if (!data) {
+                    img_sl.img_data = clip_image_u8_init();
+                    if (!clip_image_load_from_bytes(image_buffer.data(), image_buffer.size(), img_sl.img_data))
+                    {
                        LOG_TEE("slot %i - failed to load image [id: %i]\n", slot->id, img_sl.id);
                        return false;
                    }
-                    LOG_TEE("slot %i - image loaded [id: %i] resolution (%i x %i)\n", slot->id, img_sl.id, width, height);
-                    img_sl.img_data.nx = width;
-                    img_sl.img_data.ny = height;
-                    img_sl.img_data.size = width * height * 3;
-                    img_sl.img_data.data = new uint8_t[width * height * 3]();
-                    memcpy(img_sl.img_data.data, data, width * height * 3);
-                    stbi_image_free(data);
+                    LOG_TEE("slot %i - loaded image\n", slot->id);
                    img_sl.request_encode_image = true;
                    slot->images.push_back(img_sl);
                }
@@ -893,6 +933,7 @@ struct llama_server_context
            llama_sampling_free(slot->ctx_sampling);
        }
        slot->ctx_sampling = llama_sampling_init(slot->sparams);
+        llama_set_rng_seed(ctx, slot->params.seed);
        slot->command = LOAD_PROMPT;

        all_slots_are_idle = false;
@@ -1000,6 +1041,12 @@ struct llama_server_context
        slot.generated_text += token_str;
        slot.has_next_token = true;

+        if (slot.ctx_sampling->params.use_penalty_prompt_tokens && result.tok != -1)
+        {
+            // we can change penalty_prompt_tokens because it is always created from scratch each request
+            slot.ctx_sampling->params.penalty_prompt_tokens.push_back(result.tok);
+        }
+
        // check if there is incomplete UTF-8 character at the end
        bool incomplete = false;
        for (unsigned i = 1; i < 5 && i <= slot.generated_text.size(); ++i)
@@ -1070,7 +1117,7 @@ struct llama_server_context
        }

        // check the limits
-        if (slot.n_decoded > 2 && slot.has_next_token && !slot.has_budget(params))
+        if (slot.n_decoded > 0 && slot.has_next_token && !slot.has_budget(params))
        {
            slot.stopped_limit = true;
            slot.has_next_token = false;
@@ -1106,8 +1153,8 @@ struct llama_server_context
            {
                continue;
            }
-            clip_image_f32 img_res;
-            if (!clip_image_preprocess(clp_ctx, &img.img_data, &img_res, /*pad2square =*/ true))
+            clip_image_f32 * img_res = clip_image_f32_init();
+            if (!clip_image_preprocess(clp_ctx, img.img_data, img_res, /*pad2square =*/ true))
            {
                LOG_TEE("Error processing the given image");
                clip_free(clp_ctx);
@@ -1122,20 +1169,22 @@ struct llama_server_context
                return false;
            }
            LOG_TEE("slot %i - encoding image [id: %i]\n", slot.id, img.id);
-            if (!clip_image_encode(clp_ctx, params.n_threads, &img_res, img.image_embedding))
+            if (!clip_image_encode(clp_ctx, params.n_threads, img_res, img.image_embedding))
            {
                LOG_TEE("Unable to encode image\n");
                return false;
            }
+            clip_image_f32_free(img_res);
            img.request_encode_image = false;
        }

        return slot.images.size() > 0;
    }

-    void send_error(task_server& task, std::string error)
+    void send_error(task_server& task, const std::string &error)
    {
-        std::lock_guard<std::mutex> lock(mutex_results);
+        LOG_TEE("task %i - error: %s\n", task.id, error.c_str());
+        std::unique_lock<std::mutex> lock(mutex_results);
        task_result res;
        res.id = task.id;
        res.multitask_id = task.multitask_id;
@@ -1143,6 +1192,7 @@ struct llama_server_context
        res.error = true;
        res.result_json = { { "content", error } };
        queue_results.push_back(res);
+        condition_results.notify_all();
    }

    void add_multi_task(int id, std::vector<int>& sub_ids)
@@ -1152,6 +1202,7 @@ struct llama_server_context
        multi.id = id;
        std::copy(sub_ids.begin(), sub_ids.end(), std::inserter(multi.subtasks_remaining, multi.subtasks_remaining.end()));
        queue_multitasks.push_back(multi);
+        condition_tasks.notify_one();
    }

    void update_multi_task(int multitask_id, int subtask_id, task_result& result)
@@ -1163,6 +1214,7 @@ struct llama_server_context
            {
                multitask.subtasks_remaining.erase(subtask_id);
                multitask.results.push_back(result);
+                condition_tasks.notify_one();
            }
        }
    }
@@ -1181,7 +1233,7 @@ struct llama_server_context
            {"n_ctx",             slot.n_ctx},
            {"model",             params.model_alias},
            {"seed",              slot.params.seed},
-            {"temp",              slot.sparams.temp},
+            {"temperature",       slot.sparams.temp},
            {"top_k",             slot.sparams.top_k},
            {"top_p",             slot.sparams.top_p},
            {"min_p",             slot.sparams.min_p},
@@ -1191,6 +1243,8 @@ struct llama_server_context
            {"repeat_penalty",    slot.sparams.penalty_repeat},
            {"presence_penalty",  slot.sparams.penalty_present},
            {"frequency_penalty", slot.sparams.penalty_freq},
+            {"penalty_prompt_tokens", slot.sparams.penalty_prompt_tokens},
+            {"use_penalty_prompt_tokens", slot.sparams.use_penalty_prompt_tokens},
            {"mirostat",          slot.sparams.mirostat},
            {"mirostat_tau",      slot.sparams.mirostat_tau},
            {"mirostat_eta",      slot.sparams.mirostat_eta},
@@ -1208,7 +1262,7 @@ struct llama_server_context

    void send_partial_response(llama_client_slot &slot, completion_token_output tkn)
    {
-        std::lock_guard<std::mutex> lock(mutex_results);
+        std::unique_lock<std::mutex> lock(mutex_results);
        task_result res;
        res.id = slot.task_id;
        res.multitask_id = slot.multitask_id;
@@ -1227,7 +1281,7 @@ struct llama_server_context
        {
            std::vector<completion_token_output> probs_output = {};
            const std::vector<llama_token> to_send_toks = llama_tokenize(ctx, tkn.text_to_send, false);
-            size_t probs_pos = std::min(slot.sent_token_probs_index, slot.generated_token_probs.size());
+            size_t probs_pos      = std::min(slot.sent_token_probs_index,                       slot.generated_token_probs.size());
            size_t probs_stop_pos = std::min(slot.sent_token_probs_index + to_send_toks.size(), slot.generated_token_probs.size());
            if (probs_pos < probs_stop_pos)
            {
@@ -1244,11 +1298,12 @@ struct llama_server_context
        }

        queue_results.push_back(res);
+        condition_results.notify_all();
    }

    void send_final_response(llama_client_slot &slot)
    {
-        std::lock_guard<std::mutex> lock(mutex_results);
+        std::unique_lock<std::mutex> lock(mutex_results);
        task_result res;
        res.id = slot.task_id;
        res.multitask_id = slot.multitask_id;
@@ -1286,7 +1341,7 @@ struct llama_server_context
            {
                probs = std::vector<completion_token_output>(
                                    slot.generated_token_probs.begin(),
-                                    slot.generated_token_probs.begin() + slot.sent_token_probs_index);
+                                    slot.generated_token_probs.end());
            }
            res.result_json["completion_probabilities"] = probs_vector_to_json(ctx, probs);
        }
@@ -1296,6 +1351,11 @@ struct llama_server_context
            res.result_json["oaicompat_token_ctr"] = slot.n_decoded;
            res.result_json["model"] = slot.oaicompat_model;
        }
+        queue_results.push_back(res);
+        condition_results.notify_all();
+
+        // done with results, unlock
+        lock.unlock();

        // parent multitask, if any, needs to be updated
        if (slot.multitask_id != -1)
@@ -1303,12 +1363,11 @@ struct llama_server_context
            update_multi_task(slot.multitask_id, slot.task_id, res);
        }

-        queue_results.push_back(res);
    }

    void send_embedding(llama_client_slot &slot)
    {
-        std::lock_guard<std::mutex> lock(mutex_results);
+        std::unique_lock<std::mutex> lock(mutex_results);
        task_result res;
        res.id = slot.task_id;
        res.multitask_id = slot.multitask_id;
@@ -1336,6 +1395,7 @@ struct llama_server_context
            };
        }
        queue_results.push_back(res);
+        condition_results.notify_all();
    }

    int request_completion(json data, bool infill, bool embedding, int multitask_id)
@@ -1347,11 +1407,11 @@ struct llama_server_context
        task.data = std::move(data);
        task.infill_mode = infill;
        task.embedding_mode = embedding;
-        task.type = COMPLETION_TASK;
+        task.type = TASK_TYPE_COMPLETION;
        task.multitask_id = multitask_id;

        // when a completion task's prompt array is not a singleton, we split it into multiple requests
-        if (task.data.at("prompt").size() > 1)
+        if (task.data.count("prompt") && task.data.at("prompt").size() > 1)
        {
            lock.unlock(); // entering new func scope
            return split_multiprompt_task(task);
@@ -1359,6 +1419,7 @@ struct llama_server_context

        // otherwise, it's a single-prompt task, we actually queue it
        queue_tasks.push_back(task);
+        condition_tasks.notify_one();
        return task.id;
    }

@@ -1366,13 +1427,10 @@ struct llama_server_context
    {
        while (true)
        {
-            std::this_thread::sleep_for(std::chrono::microseconds(5));
-            std::lock_guard<std::mutex> lock(mutex_results);
-
-            if (queue_results.empty())
-            {
-                continue;
-            }
+            std::unique_lock<std::mutex> lock(mutex_results);
+            condition_results.wait(lock, [&]{
+                return !queue_results.empty();
+            });

            for (int i = 0; i < (int) queue_results.size(); i++)
            {
@@ -1468,12 +1526,13 @@ struct llama_server_context

    void request_cancel(int task_id)
    {
-        std::lock_guard<std::mutex> lock(mutex_tasks);
+        std::unique_lock<std::mutex> lock(mutex_tasks);
        task_server task;
        task.id = id_gen++;
-        task.type = CANCEL_TASK;
+        task.type = TASK_TYPE_CANCEL;
        task.target_id = task_id;
        queue_tasks.push_back(task);
+        condition_tasks.notify_one();
    }

    int split_multiprompt_task(task_server& multiprompt_task)
@@ -1499,33 +1558,42 @@ struct llama_server_context

    void process_tasks()
    {
-        std::lock_guard<std::mutex> lock(mutex_tasks);
+        std::unique_lock<std::mutex> lock(mutex_tasks);
+        std::vector<task_server> deferred_tasks;
        while (!queue_tasks.empty())
        {
            task_server task = queue_tasks.front();
            queue_tasks.erase(queue_tasks.begin());
            switch (task.type)
            {
-                case COMPLETION_TASK: {
+                case TASK_TYPE_COMPLETION: {
                    llama_client_slot *slot = get_slot(json_value(task.data, "slot_id", -1));
                    if (slot == nullptr)
                    {
-                        LOG_TEE("slot unavailable\n");
-                        // send error result
-                        send_error(task, "slot unavailable");
-                        return;
+                        // if no slot is available, we defer this task for processing later
+                        deferred_tasks.push_back(task);
+                        break;
                    }

                    if (task.data.contains("system_prompt"))
                    {
+                        if (!all_slots_are_idle) {
+                            send_error(task, "system prompt can only be updated when all slots are idle");
+                            break;
+                        }
                        process_system_prompt_data(task.data["system_prompt"]);
+                        // reset cache_tokens for all slots
+                        for (llama_client_slot &slot : slots)
+                        {
+                            slot.cache_tokens.clear();
+                        }
                    }

                    slot->reset();

-                    slot->infill = task.infill_mode;
-                    slot->embedding = task.embedding_mode;
-                    slot->task_id = task.id;
+                    slot->infill       = task.infill_mode;
+                    slot->embedding    = task.embedding_mode;
+                    slot->task_id      = task.id;
                    slot->multitask_id = task.multitask_id;

                    if (!launch_slot_with_data(slot, task.data))
@@ -1535,7 +1603,7 @@ struct llama_server_context
                        break;
                    }
                } break;
-                case CANCEL_TASK: { // release slot linked with the task id
+                case TASK_TYPE_CANCEL: { // release slot linked with the task id
                    for (auto & slot : slots)
                    {
                        if (slot.task_id == task.target_id)
@@ -1548,7 +1616,14 @@ struct llama_server_context
            }
        }

+        // add all the deferred tasks back to the queue
+        for (task_server &task : deferred_tasks)
+        {
+            queue_tasks.push_back(task);
+        }
+
        // remove finished multitasks from the queue of multitasks, and add the corresponding result to the result queue
+        std::vector<task_result> agg_results;
        auto queue_iterator = queue_multitasks.begin();
        while (queue_iterator != queue_multitasks.end())
        {
@@ -1569,8 +1644,8 @@ struct llama_server_context
                }
                aggregate_result.result_json = json{ "results", result_jsons };

-                std::lock_guard<std::mutex> lock(mutex_results);
-                queue_results.push_back(aggregate_result);
+                agg_results.push_back(aggregate_result);
+                condition_results.notify_all();

                queue_iterator = queue_multitasks.erase(queue_iterator);
            }
@@ -1579,14 +1654,19 @@ struct llama_server_context
                ++queue_iterator;
            }
        }
+        // done with tasks, unlock
+        lock.unlock();
+
+        // copy aggregate results of complete multi-tasks to the results queue
+        std::lock_guard<std::mutex> lock_results(mutex_results);
+        queue_results.insert(queue_results.end(), agg_results.begin(), agg_results.end());
    }

    bool update_slots() {
        // attend tasks
        process_tasks();

-        // update the system prompt wait until all slots are idle state
-        if (system_need_update && all_slots_are_idle)
+        if (system_need_update)
        {
            LOG_TEE("updating system prompt\n");
            update_system_prompt();
@@ -1601,8 +1681,10 @@ struct llama_server_context
                LOG_TEE("all slots are idle and system prompt is empty, clear the KV cache\n");
                kv_cache_clear();
            }
-            // avoid 100% usage of cpu all time
-            std::this_thread::sleep_for(std::chrono::milliseconds(5));
+            std::unique_lock<std::mutex> lock(mutex_tasks);
+            condition_tasks.wait(lock, [&]{
+                return !queue_tasks.empty();
+            });
        }

        for (llama_client_slot &slot : slots)
@@ -1660,7 +1742,6 @@ struct llama_server_context

            llama_batch_add(batch, slot.sampled, system_tokens.size() + slot.n_past, { slot.id }, true);

-            slot.n_decoded += 1;
            slot.n_past += 1;
        }

@@ -1675,7 +1756,8 @@ struct llama_server_context
                const bool has_prompt = slot.prompt.is_array() || (slot.prompt.is_string() && !slot.prompt.get<std::string>().empty()) || !slot.images.empty();

                // empty prompt passed -> release the slot and send empty response
-                if (slot.state == IDLE && slot.command == LOAD_PROMPT && !has_prompt)
+                // note: infill mode allows empty prompt
+                if (slot.state == IDLE && slot.command == LOAD_PROMPT && !has_prompt && !slot.infill)
                {
                    slot.release();
                    slot.print_timings();
@@ -1778,7 +1860,7 @@ struct llama_server_context

                    slot.cache_tokens = prompt_tokens;

-                    if (slot.n_past == slot.num_prompt_tokens)
+                    if (slot.n_past == slot.num_prompt_tokens && slot.n_past > 0)
                    {
                        // we have to evaluate at least 1 token to generate logits.
                        LOG_TEE("slot %d : we have to evaluate at least 1 token to generate logits\n", slot.id);
@@ -1878,6 +1960,7 @@ struct llama_server_context

                llama_sampling_accept(slot.ctx_sampling, ctx, id, true);

+                slot.n_decoded += 1;
                if (slot.n_decoded == 1)
                {
                    slot.t_start_genereration = ggml_time_us();
@@ -1962,28 +2045,35 @@ json oaicompat_completion_params_parse(
    llama_params["__oaicompat"] = true;

    // Map OpenAI parameters to llama.cpp parameters
-    llama_params["model"]             = json_value(body, "model", std::string("uknown"));
+    //
+    // For parameters that are defined by the OpenAI documentation (e.g.
+    // temperature), we explicitly specify OpenAI's intended default; we
+    // need to do that because sometimes OpenAI disagrees with llama.cpp
+    //
+    // https://platform.openai.com/docs/api-reference/chat/create
+    llama_sampling_params default_sparams;
+    llama_params["model"]             = json_value(body, "model", std::string("unknown"));
    llama_params["prompt"]            = format_chatml(body["messages"]); // OpenAI 'messages' to llama.cpp 'prompt'
    llama_params["cache_prompt"]      = json_value(body, "cache_prompt", false);
-    llama_params["temperature"]       = json_value(body, "temperature", 0.8);
-    llama_params["top_k"]             = json_value(body, "top_k", 40);
-    llama_params["top_p"]             = json_value(body, "top_p", 0.95);
+    llama_params["temperature"]       = json_value(body, "temperature", 0.0);
+    llama_params["top_k"]             = json_value(body, "top_k", default_sparams.top_k);
+    llama_params["top_p"]             = json_value(body, "top_p", 1.0);
    llama_params["n_predict"]         = json_value(body, "max_tokens", -1);
    llama_params["logit_bias"]        = json_value(body, "logit_bias",json::object());
    llama_params["frequency_penalty"] = json_value(body, "frequency_penalty", 0.0);
    llama_params["presence_penalty"]  = json_value(body, "presence_penalty", 0.0);
-    llama_params["seed"]              = json_value(body, "seed", 0);
+    llama_params["seed"]              = json_value(body, "seed", LLAMA_DEFAULT_SEED);
    llama_params["stream"]            = json_value(body, "stream", false);
-    llama_params["mirostat"]          = json_value(body, "mirostat", false);
-    llama_params["mirostat_tau"]      = json_value(body, "mirostat_tau", 0.0);
-    llama_params["mirostat_eta"]      = json_value(body, "mirostat_eta", 0.0);
-    llama_params["penalize_nl"]       = json_value(body, "penalize_nl", false);
-    llama_params["typical_p"]         = json_value(body, "typical_p", 0.0);
-    llama_params["repeat_last_n"]     = json_value(body, "repeat_last_n", 0);
+    llama_params["mirostat"]          = json_value(body, "mirostat", default_sparams.mirostat);
+    llama_params["mirostat_tau"]      = json_value(body, "mirostat_tau", default_sparams.mirostat_tau);
+    llama_params["mirostat_eta"]      = json_value(body, "mirostat_eta", default_sparams.mirostat_eta);
+    llama_params["penalize_nl"]       = json_value(body, "penalize_nl", default_sparams.penalize_nl);
+    llama_params["typical_p"]         = json_value(body, "typical_p", default_sparams.typical_p);
+    llama_params["repeat_last_n"]     = json_value(body, "repeat_last_n", default_sparams.penalty_last_n);
    llama_params["ignore_eos"]        = json_value(body, "ignore_eos", false);
-    llama_params["tfs_z"]             = json_value(body, "tfs_z", 0.0);
+    llama_params["tfs_z"]             = json_value(body, "tfs_z", default_sparams.tfs_z);

-    if (llama_params.count("grammar") != 0) {
+    if (body.count("grammar") != 0) {
        llama_params["grammar"] = json_value(body, "grammar", json::object());
    }

@@ -2034,8 +2124,8 @@ static json format_final_response_oaicompat(const json &request, const task_resu
            {"object", streaming ? "chat.completion.chunk" : "chat.completion"},
            {"usage",
                json{{"completion_tokens", num_tokens_predicted},
-                    {"prompt_tokens", num_prompt_tokens},
-                    {"total_tokens", num_tokens_predicted + num_prompt_tokens}}},
+                     {"prompt_tokens",     num_prompt_tokens},
+                     {"total_tokens",      num_tokens_predicted + num_prompt_tokens}}},
            {"id", gen_chatcmplid()}};

    if (server_verbose) {
@@ -2375,10 +2465,10 @@ static void params_parse(const backend::ModelOptions* request,
    const char *env_parallel = std::getenv("LLAMACPP_PARALLEL");
    if (env_parallel != NULL) {
        params.n_parallel = std::stoi(env_parallel);
+        params.cont_batching = true;
    } else {
        params.n_parallel = 1;
    }
-
    // TODO: Add yarn

    if (!request->tensorsplit().empty()) {
--- a/backend/python/autogptq/Makefile
+++ b/backend/python/autogptq/Makefile
@@ -1,5 +1,4 @@
 .PHONY: autogptq
 autogptq:
-	@echo "Creating virtual environment..."
-	@conda env create --name autogptq --file autogptq.yml
-	@echo "Virtual environment created."
+	$(MAKE) -C ../common-env/transformers
+
--- a/backend/python/autogptq/backend_pb2.py
+++ b/backend/python/autogptq/backend_pb2.py
--- a/backend/python/autogptq/run.sh
+++ b/backend/python/autogptq/run.sh
@@ -6,7 +6,7 @@
 export PATH=$PATH:/opt/conda/bin

 # Activate conda environment
-source activate autogptq
+source activate transformers

 # get the directory where the bash script is located
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
--- a/backend/python/bark/backend_pb2.py
+++ b/backend/python/bark/backend_pb2.py
--- a/backend/python/common-env/transformers/install.sh
+++ b/backend/python/common-env/transformers/install.sh
@@ -13,3 +13,12 @@ if conda_env_exists "transformers" ; then
 else 
    echo "Virtual environment already exists."
 fi
+
+if [ "$PIP_CACHE_PURGE" = true ] ; then
+    export PATH=$PATH:/opt/conda/bin
+
+    # Activate conda environment
+    source activate transformers
+
+    pip cache purge
+fi
--- a/backend/python/common-env/transformers/transformers-nvidia.yml
+++ b/backend/python/common-env/transformers/transformers-nvidia.yml
@@ -45,7 +45,7 @@ dependencies:
      - fsspec==2023.6.0
      - funcy==2.0
      - grpcio==1.59.0
-      - huggingface-hub==0.16.4
+      - huggingface-hub
      - idna==3.4
      - jinja2==3.1.2
      - jmespath==1.0.1
@@ -70,7 +70,6 @@ dependencies:
      - packaging==23.2
      - pandas
      - peft==0.5.0
-      - git+https://github.com/bigscience-workshop/petals
      - protobuf==4.24.4
      - psutil==5.9.5
      - pyarrow==13.0.0
@@ -85,17 +84,16 @@ dependencies:
      - scipy==1.11.3
      - six==1.16.0
      - sympy==1.12
-      - tokenizers==0.14.0
-      - torch==2.1.0
-      - torchaudio==2.1.0
+      - tokenizers
+      - torch==2.1.2
+      - torchaudio==2.1.2
      - tqdm==4.66.1
-      - transformers==4.34.0
-      - TTS==0.22.0
      - triton==2.1.0
      - typing-extensions==4.8.0
      - tzdata==2023.3
      - urllib3==1.26.17
      - xxhash==3.4.1
+      - auto-gptq==0.6.0
      - yarl==1.9.2
      - soundfile
      - langid
@@ -114,4 +112,7 @@ dependencies:
      - sudachipy
      - sudachidict_core
      - vocos
+      - vllm==0.2.7
+      - transformers>=4.36.0  # Required for Mixtral.
+      - xformers==0.0.23.post1  
 prefix: /opt/conda/envs/transformers
--- a/backend/python/common-env/transformers/transformers.yml
+++ b/backend/python/common-env/transformers/transformers.yml
@@ -46,7 +46,7 @@ dependencies:
      - fsspec==2023.6.0
      - funcy==2.0
      - grpcio==1.59.0
-      - huggingface-hub==0.16.4
+      - huggingface-hub
      - idna==3.4
      - jinja2==3.1.2
      - jmespath==1.0.1
@@ -59,7 +59,6 @@ dependencies:
      - packaging==23.2
      - pandas
      - peft==0.5.0
-      - git+https://github.com/bigscience-workshop/petals
      - protobuf==4.24.4
      - psutil==5.9.5
      - pyarrow==13.0.0
@@ -74,14 +73,14 @@ dependencies:
      - scipy==1.11.3
      - six==1.16.0
      - sympy==1.12
-      - tokenizers==0.14.0
-      - torch==2.1.0
-      - torchaudio==2.1.0
+      - tokenizers
+      - torch==2.1.2
+      - torchaudio==2.1.2
      - tqdm==4.66.1
-      - transformers==4.34.0
      - triton==2.1.0
      - typing-extensions==4.8.0
      - tzdata==2023.3
+      - auto-gptq==0.6.0
      - urllib3==1.26.17
      - xxhash==3.4.1
      - yarl==1.9.2
@@ -102,4 +101,7 @@ dependencies:
      - sudachipy
      - sudachidict_core
      - vocos
+      - vllm==0.2.7
+      - transformers>=4.36.0  # Required for Mixtral.
+      - xformers==0.0.23.post1  
 prefix: /opt/conda/envs/transformers
--- a/backend/python/coqui/backend_pb2.py
+++ b/backend/python/coqui/backend_pb2.py
--- a/backend/python/coqui/coqui_server.py
+++ b/backend/python/coqui/coqui_server.py
@@ -21,7 +21,7 @@ _ONE_DAY_IN_SECONDS = 60 * 60 * 24

 # If MAX_WORKERS are specified in the environment use it, otherwise default to 1
 MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
-COQUI_LANGUAGE = os.environ.get('COQUI_LANGUAGE', 'en')
+COQUI_LANGUAGE = os.environ.get('COQUI_LANGUAGE', None)

 # Implement the BackendServicer class with the service methods
 class BackendServicer(backend_pb2_grpc.BackendServicer):
@@ -33,11 +33,18 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
    def LoadModel(self, request, context):

        # Get device
-        device = "cuda" if request.CUDA else "cpu"
+        # device = "cuda" if request.CUDA else "cpu"
+        if torch.cuda.is_available():
+            print("CUDA is available", file=sys.stderr)
+            device = "cuda"
+        else:
+            print("CUDA is not available", file=sys.stderr)
+            device = "cpu"

        if not torch.cuda.is_available() and request.CUDA:
            return backend_pb2.Result(success=False, message="CUDA is not available")

+        self.AudioPath = None
        # List available 🐸TTS models
        print(TTS().list_models())
        if os.path.isabs(request.AudioPath):
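
Note on the coqui `LoadModel` hunk above: the device is now chosen from what the host actually provides, and the request is rejected only when CUDA was asked for but is absent. A minimal sketch of that decision as a pure function, with the availability check passed in as a boolean instead of calling `torch.cuda.is_available()`; `select_device` is a hypothetical helper and is not part of this change.

```python
from typing import Optional, Tuple


def select_device(cuda_available: bool, cuda_requested: bool) -> Tuple[str, Optional[str]]:
    """Return (device, error); error is set when CUDA was requested but is absent."""
    # Prefer the GPU whenever one is present, regardless of the request flag.
    device = "cuda" if cuda_available else "cpu"
    # Mirror the guard in the hunk: an explicit CUDA request without CUDA fails.
    if cuda_requested and not cuda_available:
        return device, "CUDA is not available"
    return device, None
```

Keeping the decision free of `torch` makes it easy to exercise on a CPU-only machine.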
--- a/backend/python/diffusers/backend_pb2.py
+++ b/backend/python/diffusers/backend_pb2.py
--- a/backend/python/diffusers/diffusers.yml
+++ b/backend/python/diffusers/diffusers.yml
@@ -53,6 +53,7 @@ dependencies:
      - nvidia-nccl-cu12==2.18.1
      - nvidia-nvjitlink-cu12==12.2.140
      - nvidia-nvtx-cu12==12.1.105
+      - omegaconf
      - packaging==23.2
      - pillow==10.0.1
      - protobuf==4.24.4
--- a/backend/python/exllama/Makefile
+++ b/backend/python/exllama/Makefile
@@ -1,8 +1,6 @@
 .PHONY: exllama
 exllama:
-	@echo "Creating virtual environment..."
-	@conda env create --name exllama --file exllama.yml
-	@echo "Virtual environment created."
+	$(MAKE) -C ../common-env/transformers
 	bash install.sh

 .PHONY: run
--- a/backend/python/exllama/backend_pb2.py
+++ b/backend/python/exllama/backend_pb2.py
--- a/backend/python/exllama/install.sh
+++ b/backend/python/exllama/install.sh
@@ -5,11 +5,15 @@
 export PATH=$PATH:/opt/conda/bin

 # Activate conda environment
-source activate exllama
+source activate transformers

 echo $CONDA_PREFIX


 git clone https://github.com/turboderp/exllama $CONDA_PREFIX/exllama && pushd $CONDA_PREFIX/exllama && pip install -r requirements.txt && popd

-cp -rfv $CONDA_PREFIX/exllama/* ./
+cp -rfv $CONDA_PREFIX/exllama/* ./
+
+if [ "$PIP_CACHE_PURGE" = true ] ; then
+    pip cache purge
+fi
--- a/backend/python/exllama/run.sh
+++ b/backend/python/exllama/run.sh
@@ -6,9 +6,11 @@
 export PATH=$PATH:/opt/conda/bin

 # Activate conda environment
-source activate exllama
+source activate transformers

 # get the directory where the bash script is located
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

+cd $DIR
+
 python $DIR/exllama.py $@
--- a/backend/python/exllama2/Makefile
+++ b/backend/python/exllama2/Makefile
@@ -1,8 +1,6 @@
 .PHONY: exllama2
 exllama2:
-	@echo "Creating virtual environment..."
-	@conda env create --name exllama2 --file exllama2.yml
-	@echo "Virtual environment created."
+	$(MAKE) -C ../common-env/transformers
 	bash install.sh

 .PHONY: run
--- a/backend/python/exllama2/backend_pb2.py
+++ b/backend/python/exllama2/backend_pb2.py
--- a/backend/python/exllama2/install.sh
+++ b/backend/python/exllama2/install.sh
@@ -5,10 +5,14 @@
 export PATH=$PATH:/opt/conda/bin

 # Activate conda environment
-source activate exllama2
+source activate transformers

 echo $CONDA_PREFIX

 git clone https://github.com/turboderp/exllamav2 $CONDA_PREFIX/exllamav2 && pushd $CONDA_PREFIX/exllamav2 && pip install -r requirements.txt && popd

-cp -rfv $CONDA_PREFIX/exllamav2/* ./  
+cp -rfv $CONDA_PREFIX/exllamav2/* ./  
+
+if [ "$PIP_CACHE_PURGE" = true ] ; then
+    pip cache purge
+fi
--- a/backend/python/exllama2/run.sh
+++ b/backend/python/exllama2/run.sh
@@ -6,9 +6,11 @@
 export PATH=$PATH:/opt/conda/bin

 # Activate conda environment
-source activate exllama2
+source activate transformers

 # get the directory where the bash script is located
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

+cd $DIR
+
 python $DIR/exllama2_backend.py $@
--- a/backend/python/mamba/Makefile
+++ b/backend/python/mamba/Makefile
@@ -0,0 +1,16 @@
+.PHONY: mamba
+mamba:
+	$(MAKE) -C ../common-env/transformers
+	bash install.sh
+
+.PHONY: run
+run:
+	@echo "Running mamba..."
+	bash run.sh
+	@echo "mamba run."
+
+.PHONY: test
+test:
+	@echo "Testing mamba..."
+	bash test.sh
+	@echo "mamba tested."
--- a/backend/python/mamba/README.md
+++ b/backend/python/mamba/README.md
@@ -0,0 +1,5 @@
+# Creating a separate environment for the mamba project
+
+```
+make mamba
+```
--- a/backend/python/mamba/backend_mamba.py
+++ b/backend/python/mamba/backend_mamba.py
@@ -0,0 +1,179 @@
+#!/usr/bin/env python3
+from concurrent import futures
+import time
+import argparse
+import signal
+import sys
+import os
+
+import backend_pb2
+import backend_pb2_grpc
+
+import grpc
+
+import torch
+from transformers import AutoTokenizer
+from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
+
+_ONE_DAY_IN_SECONDS = 60 * 60 * 24
+
+# If MAX_WORKERS are specified in the environment use it, otherwise default to 1
+MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
+MAMBA_CHAT = os.environ.get('MAMBA_CHAT', '1') == '1'
+
+# Implement the BackendServicer class with the service methods
+class BackendServicer(backend_pb2_grpc.BackendServicer):
+    """
+    A gRPC servicer that implements the Backend service defined in backend.proto.
+    """
+    def generate(self, prompt, max_new_tokens):
+        """
+        Generates text from the prompt via self.generator, which LoadModel in this file does not set.
+
+        Args:
+            prompt (str): The prompt to generate text from.
+            max_new_tokens (int): The maximum number of new tokens to generate.
+
+        Returns:
+            str: The generated text.
+        """
+        self.generator.end_beam_search()
+
+        # Tokenizing the input
+        ids = self.generator.tokenizer.encode(prompt)
+
+        self.generator.gen_begin_reuse(ids)
+        initial_len = self.generator.sequence[0].shape[0]
+        has_leading_space = False
+        decoded_text = ''
+        for i in range(max_new_tokens):
+            token = self.generator.gen_single_token()
+            if i == 0 and self.generator.tokenizer.tokenizer.IdToPiece(int(token)).startswith('▁'):
+                has_leading_space = True
+
+            decoded_text = self.generator.tokenizer.decode(self.generator.sequence[0][initial_len:])
+            if has_leading_space:
+                decoded_text = ' ' + decoded_text
+
+            if token.item() == self.generator.tokenizer.eos_token_id:
+                break
+        return decoded_text
+
+    def Health(self, request, context):
+        """
+        Returns a health check message.
+
+        Args:
+            request: The health check request.
+            context: The gRPC context.
+
+        Returns:
+            backend_pb2.Reply: The health check reply.
+        """
+        return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
+
+    def LoadModel(self, request, context):
+        """
+        Loads a language model.
+
+        Args:
+            request: The load model request.
+            context: The gRPC context.
+
+        Returns:
+            backend_pb2.Result: The load model result.
+        """
+        try:
+            tokenizerModel = request.Tokenizer
+            if tokenizerModel == "":
+                tokenizerModel = request.Model
+
+            tokenizer = AutoTokenizer.from_pretrained(tokenizerModel)
+            if MAMBA_CHAT:
+                tokenizer.eos_token = "<|endoftext|>"
+                tokenizer.pad_token = tokenizer.eos_token
+            self.tokenizer = tokenizer
+            self.model = MambaLMHeadModel.from_pretrained(request.Model, device="cuda", dtype=torch.float16)
+        except Exception as err:
+            return backend_pb2.Result(success=False, message=f"Unexpected {err=}, {type(err)=}")
+        return backend_pb2.Result(message="Model loaded successfully", success=True)
+
+    def Predict(self, request, context):
+        """
+        Generates text based on the given prompt and sampling parameters.
+
+        Args:
+            request: The predict request.
+            context: The gRPC context.
+
+        Returns:
+            backend_pb2.Reply: The generated text, UTF-8 encoded.
+        """
+        if request.TopP == 0:
+            request.TopP = 0.9
+
+        max_tokens = request.Tokens
+
+        if request.Tokens == 0:
+            max_tokens = 2000
+
+        # encoded_input = self.tokenizer(request.Prompt)
+        tokens = self.tokenizer(request.Prompt, return_tensors="pt")
+        input_ids = tokens.input_ids.to(device="cuda")
+        out = self.model.generate(input_ids=input_ids, max_length=max_tokens, temperature=request.Temperature,
+                                     top_p=request.TopP, eos_token_id=self.tokenizer.eos_token_id)
+
+        decoded = self.tokenizer.batch_decode(out)
+       
+        generated_text = decoded[0]
+
+        # Remove prompt from response if present
+        if request.Prompt in generated_text:
+            generated_text = generated_text.replace(request.Prompt, "")
+
+        return backend_pb2.Reply(message=bytes(generated_text, encoding='utf-8'))
+
+    def PredictStream(self, request, context):
+        """
+        Generates text based on the given prompt and sampling parameters, and streams the results.
+
+        Args:
+            request: The predict stream request.
+            context: The gRPC context.
+
+        Yields:
+            backend_pb2.Reply: A single reply holding the full generated text.
+        """
+        yield self.Predict(request, context)
+
+def serve(address):
+    server = grpc.server(futures.ThreadPoolExecutor(max_workers=MAX_WORKERS))
+    backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
+    server.add_insecure_port(address)
+    server.start()
+    print("Server started. Listening on: " + address, file=sys.stderr)
+
+    # Define the signal handler function
+    def signal_handler(sig, frame):
+        print("Received termination signal. Shutting down...")
+        server.stop(0)
+        sys.exit(0)
+
+    # Set the signal handlers for SIGINT and SIGTERM
+    signal.signal(signal.SIGINT, signal_handler)
+    signal.signal(signal.SIGTERM, signal_handler)
+
+    try:
+        while True:
+            time.sleep(_ONE_DAY_IN_SECONDS)
+    except KeyboardInterrupt:
+        server.stop(0)
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Run the gRPC server.")
+    parser.add_argument(
+        "--addr", default="localhost:50051", help="The address to bind the server to."
+    )
+    args = parser.parse_args()
+
+    serve(args.addr)
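
Note on `Predict` in `backend_mamba.py`: the handler removes the prompt from the decoded output with `str.replace`, which deletes every occurrence of the prompt text, including any the model repeats later in its answer. A minimal sketch of a prefix-only alternative follows; `strip_prompt_prefix` is a hypothetical helper and is not part of this change.

```python
def strip_prompt_prefix(generated_text: str, prompt: str) -> str:
    """Drop the prompt only when it appears at the very start of the output."""
    # An empty prompt is left alone so the output is returned unchanged.
    if prompt and generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text
```

For example, with the prompt `"Hello"` and the output `"Hello world Hello"`, this keeps the trailing `"Hello"`, whereas `replace` would remove both.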
--- a/backend/python/mamba/backend_pb2.py
+++ b/backend/python/mamba/backend_pb2.py
--- a/backend/python/mamba/backend_pb2_grpc.py
+++ b/backend/python/mamba/backend_pb2_grpc.py
@@ -0,0 +1,363 @@
+# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
+"""Client and server classes corresponding to protobuf-defined services."""
+import grpc
+
+import backend_pb2 as backend__pb2
+
+
+class BackendStub(object):
+    """Missing associated documentation comment in .proto file."""
+
+    def __init__(self, channel):
+        """Constructor.
+
+        Args:
+            channel: A grpc.Channel.
+        """
+        self.Health = channel.unary_unary(
+                '/backend.Backend/Health',
+                request_serializer=backend__pb2.HealthMessage.SerializeToString,
+                response_deserializer=backend__pb2.Reply.FromString,
+                )
+        self.Predict = channel.unary_unary(
+                '/backend.Backend/Predict',
+                request_serializer=backend__pb2.PredictOptions.SerializeToString,
+                response_deserializer=backend__pb2.Reply.FromString,
+                )
+        self.LoadModel = channel.unary_unary(
+                '/backend.Backend/LoadModel',
+                request_serializer=backend__pb2.ModelOptions.SerializeToString,
+                response_deserializer=backend__pb2.Result.FromString,
+                )
+        self.PredictStream = channel.unary_stream(
+                '/backend.Backend/PredictStream',
+                request_serializer=backend__pb2.PredictOptions.SerializeToString,
+                response_deserializer=backend__pb2.Reply.FromString,
+                )
+        self.Embedding = channel.unary_unary(
+                '/backend.Backend/Embedding',
+                request_serializer=backend__pb2.PredictOptions.SerializeToString,
+                response_deserializer=backend__pb2.EmbeddingResult.FromString,
+                )
+        self.GenerateImage = channel.unary_unary(
+                '/backend.Backend/GenerateImage',
+                request_serializer=backend__pb2.GenerateImageRequest.SerializeToString,
+                response_deserializer=backend__pb2.Result.FromString,
+                )
+        self.AudioTranscription = channel.unary_unary(
+                '/backend.Backend/AudioTranscription',
+                request_serializer=backend__pb2.TranscriptRequest.SerializeToString,
+                response_deserializer=backend__pb2.TranscriptResult.FromString,
+                )
+        self.TTS = channel.unary_unary(
+                '/backend.Backend/TTS',
+                request_serializer=backend__pb2.TTSRequest.SerializeToString,
+                response_deserializer=backend__pb2.Result.FromString,
+                )
+        self.TokenizeString = channel.unary_unary(
+                '/backend.Backend/TokenizeString',
+                request_serializer=backend__pb2.PredictOptions.SerializeToString,
+                response_deserializer=backend__pb2.TokenizationResponse.FromString,
+                )
+        self.Status = channel.unary_unary(
+                '/backend.Backend/Status',
+                request_serializer=backend__pb2.HealthMessage.SerializeToString,
+                response_deserializer=backend__pb2.StatusResponse.FromString,
+                )
+
+
+class BackendServicer(object):
+    """Missing associated documentation comment in .proto file."""
+
+    def Health(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def Predict(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def LoadModel(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def PredictStream(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def Embedding(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def GenerateImage(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def AudioTranscription(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def TTS(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def TokenizeString(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+    def Status(self, request, context):
+        """Missing associated documentation comment in .proto file."""
+        context.set_code(grpc.StatusCode.UNIMPLEMENTED)
+        context.set_details('Method not implemented!')
+        raise NotImplementedError('Method not implemented!')
+
+
+def add_BackendServicer_to_server(servicer, server):
+    rpc_method_handlers = {
+            'Health': grpc.unary_unary_rpc_method_handler(
+                    servicer.Health,
+                    request_deserializer=backend__pb2.HealthMessage.FromString,
+                    response_serializer=backend__pb2.Reply.SerializeToString,
+            ),
+            'Predict': grpc.unary_unary_rpc_method_handler(
+                    servicer.Predict,
+                    request_deserializer=backend__pb2.PredictOptions.FromString,
+                    response_serializer=backend__pb2.Reply.SerializeToString,
+            ),
+            'LoadModel': grpc.unary_unary_rpc_method_handler(
+                    servicer.LoadModel,
+                    request_deserializer=backend__pb2.ModelOptions.FromString,
+                    response_serializer=backend__pb2.Result.SerializeToString,
+            ),
+            'PredictStream': grpc.unary_stream_rpc_method_handler(
+                    servicer.PredictStream,
+                    request_deserializer=backend__pb2.PredictOptions.FromString,
+                    response_serializer=backend__pb2.Reply.SerializeToString,
+            ),
+            'Embedding': grpc.unary_unary_rpc_method_handler(
+                    servicer.Embedding,
+                    request_deserializer=backend__pb2.PredictOptions.FromString,
+                    response_serializer=backend__pb2.EmbeddingResult.SerializeToString,
+            ),
+            'GenerateImage': grpc.unary_unary_rpc_method_handler(
+                    servicer.GenerateImage,
+                    request_deserializer=backend__pb2.GenerateImageRequest.FromString,
+                    response_serializer=backend__pb2.Result.SerializeToString,
+            ),
+            'AudioTranscription': grpc.unary_unary_rpc_method_handler(
+                    servicer.AudioTranscription,
+                    request_deserializer=backend__pb2.TranscriptRequest.FromString,
+                    response_serializer=backend__pb2.TranscriptResult.SerializeToString,
+            ),
+            'TTS': grpc.unary_unary_rpc_method_handler(
+                    servicer.TTS,
+                    request_deserializer=backend__pb2.TTSRequest.FromString,
+                    response_serializer=backend__pb2.Result.SerializeToString,
+            ),
+            'TokenizeString': grpc.unary_unary_rpc_method_handler(
+                    servicer.TokenizeString,
+                    request_deserializer=backend__pb2.PredictOptions.FromString,
+                    response_serializer=backend__pb2.TokenizationResponse.SerializeToString,
+            ),
+            'Status': grpc.unary_unary_rpc_method_handler(
+                    servicer.Status,
+                    request_deserializer=backend__pb2.HealthMessage.FromString,
+                    response_serializer=backend__pb2.StatusResponse.SerializeToString,
+            ),
+    }
+    generic_handler = grpc.method_handlers_generic_handler(
+            'backend.Backend', rpc_method_handlers)
+    server.add_generic_rpc_handlers((generic_handler,))
+
+
+ # This class is part of an EXPERIMENTAL API.
+class Backend(object):
+    """Missing associated documentation comment in .proto file."""
+
+    @staticmethod
+    def Health(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/Health',
+            backend__pb2.HealthMessage.SerializeToString,
+            backend__pb2.Reply.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def Predict(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/Predict',
+            backend__pb2.PredictOptions.SerializeToString,
+            backend__pb2.Reply.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def LoadModel(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/LoadModel',
+            backend__pb2.ModelOptions.SerializeToString,
+            backend__pb2.Result.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def PredictStream(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_stream(request, target, '/backend.Backend/PredictStream',
+            backend__pb2.PredictOptions.SerializeToString,
+            backend__pb2.Reply.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def Embedding(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/Embedding',
+            backend__pb2.PredictOptions.SerializeToString,
+            backend__pb2.EmbeddingResult.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def GenerateImage(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/GenerateImage',
+            backend__pb2.GenerateImageRequest.SerializeToString,
+            backend__pb2.Result.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def AudioTranscription(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/AudioTranscription',
+            backend__pb2.TranscriptRequest.SerializeToString,
+            backend__pb2.TranscriptResult.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def TTS(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/TTS',
+            backend__pb2.TTSRequest.SerializeToString,
+            backend__pb2.Result.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def TokenizeString(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/TokenizeString',
+            backend__pb2.PredictOptions.SerializeToString,
+            backend__pb2.TokenizationResponse.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
+
+    @staticmethod
+    def Status(request,
+            target,
+            options=(),
+            channel_credentials=None,
+            call_credentials=None,
+            insecure=False,
+            compression=None,
+            wait_for_ready=None,
+            timeout=None,
+            metadata=None):
+        return grpc.experimental.unary_unary(request, target, '/backend.Backend/Status',
+            backend__pb2.HealthMessage.SerializeToString,
+            backend__pb2.StatusResponse.FromString,
+            options, channel_credentials,
+            insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
--- a/backend/python/mamba/install.sh
+++ b/backend/python/mamba/install.sh
@@ -0,0 +1,21 @@
+#!/bin/bash
+
+##
+## A bash script that installs the required dependencies of the mamba backend and prepares the environment
+export PATH=$PATH:/opt/conda/bin
+
+if [ "$BUILD_TYPE" != "cublas" ]; then
+    echo "[mamba] Attention!!! nvcc is required - skipping installation"
+    exit 0
+fi
+
+# Activate conda environment
+source activate transformers
+
+echo $CONDA_PREFIX
+
+pip install causal-conv1d==1.0.0 mamba-ssm==1.0.1
+
+if [ "$PIP_CACHE_PURGE" = true ] ; then
+    pip cache purge
+fi
--- a/backend/python/mamba/run.sh
+++ b/backend/python/mamba/run.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+##
+## A bash script wrapper that runs the mamba server with conda
+
+export PATH=$PATH:/opt/conda/bin
+
+# Activate conda environment
+source activate transformers
+
+# get the directory where the bash script is located
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+
+python $DIR/backend_mamba.py $@
--- a/backend/python/mamba/test.sh
+++ b/backend/python/mamba/test.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+##
+## A bash script wrapper that runs the mamba backend tests with conda
+
+# Activate conda environment
+source activate transformers
+
+# get the directory where the bash script is located
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
+
+python -m unittest $DIR/test_backend_mamba.py
--- a/backend/python/mamba/test_backend_mamba.py
+++ b/backend/python/mamba/test_backend_mamba.py
@@ -0,0 +1,69 @@
+import unittest
+import subprocess
+import time
+import backend_pb2
+import backend_pb2_grpc
+
+import grpc
+
+class TestBackendServicer(unittest.TestCase):
+    """
+    TestBackendServicer is the class that tests the gRPC service.
+
+    This class contains methods to test the startup and shutdown of the gRPC service.
+    """
+    def setUp(self):
+        self.service = subprocess.Popen(["python", "backend_mamba.py", "--addr", "localhost:50051"])
+        time.sleep(10)
+
+    def tearDown(self) -> None:
+        self.service.terminate()
+        self.service.wait()
+
+    def test_server_startup(self):
+        try:
+            # unittest has already run setUp() for this test; no manual call needed
+            with grpc.insecure_channel("localhost:50051") as channel:
+                stub = backend_pb2_grpc.BackendStub(channel)
+                response = stub.Health(backend_pb2.HealthMessage())
+                self.assertEqual(response.message, b'OK')
+        except Exception as err:
+            print(err)
+            self.fail("Server failed to start")
+        finally:
+            self.tearDown()
+    def test_load_model(self):
+        """
+        This method tests if the model is loaded successfully
+        """
+        try:
+            self.setUp()
+            with grpc.insecure_channel("localhost:50051") as channel:
+                stub = backend_pb2_grpc.BackendStub(channel)
+                response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/opt-125m"))
+                self.assertTrue(response.success)
+                self.assertEqual(response.message, "Model loaded successfully")
+        except Exception as err:
+            print(err)
+            self.fail("LoadModel service failed")
+        finally:
+            self.tearDown()
+
+    def test_text(self):
+        """
+        This method tests if text is generated successfully
+        """
+        try:
+            self.setUp()
+            with grpc.insecure_channel("localhost:50051") as channel:
+                stub = backend_pb2_grpc.BackendStub(channel)
+                response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/opt-125m"))
+                self.assertTrue(response.success)
+                req = backend_pb2.PredictOptions(Prompt="The capital of France is")
+                resp = stub.Predict(req)
+                self.assertIsNotNone(resp.message)
+        except Exception as err:
+            print(err)
+            self.fail("text service failed")
+        finally:
+            self.tearDown()
--- a/backend/python/petals/Makefile
+++ b/backend/python/petals/Makefile
@@ -1,6 +1,8 @@
 .PHONY: petals
 petals:
-	$(MAKE) -C ../common-env/transformers
+	@echo "Creating virtual environment..."
+	@conda env create --name petals --file petals.yml
+	@echo "Virtual environment created."

 .PHONY: run
 run:
--- a/backend/python/petals/backend_pb2.py
+++ b/backend/python/petals/backend_pb2.py
--- a/backend/python/petals/run.sh
+++ b/backend/python/petals/run.sh
@@ -5,14 +5,16 @@

 export PATH=$PATH:/opt/conda/bin

+CONDA_ENV=petals
+
 # Activate conda environment
 # if source is available use it, or use conda
 #
 if [ -f /opt/conda/bin/activate ]; then
-    source activate transformers
+    source activate $CONDA_ENV
 else
    eval "$(conda shell.bash hook)"
-    conda activate transformers
+    conda activate $CONDA_ENV
 fi

 # get the directory where the bash script is located
--- a/backend/python/petals/test.sh
+++ b/backend/python/petals/test.sh
@@ -3,7 +3,16 @@
 ## A bash script wrapper that runs the transformers server with conda

 # Activate conda environment
-source activate transformers
+CONDA_ENV=petals
+# Activate conda environment
+# if source is available use it, or use conda
+#
+if [ -f /opt/conda/bin/activate ]; then
+    source activate $CONDA_ENV
+else
+    eval "$(conda shell.bash hook)"
+    conda activate $CONDA_ENV
+fi

 # get the directory where the bash script is located
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
--- a/backend/python/sentencetransformers/backend_pb2.py
+++ b/backend/python/sentencetransformers/backend_pb2.py
--- a/backend/python/transformers-musicgen/backend_pb2.py
+++ b/backend/python/transformers-musicgen/backend_pb2.py
--- a/backend/python/transformers/backend_pb2.py
+++ b/backend/python/transformers/backend_pb2.py
--- a/backend/python/transformers/transformers_server.py
+++ b/backend/python/transformers/transformers_server.py
@@ -15,8 +15,8 @@ import backend_pb2_grpc

 import grpc
 import torch
-
-from transformers import AutoTokenizer, AutoModel
+import torch.cuda
+from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, set_seed

 _ONE_DAY_IN_SECONDS = 60 * 60 * 24

@@ -68,16 +68,19 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        """
        model_name = request.Model
        try:
-            self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # trust_remote_code is needed to use the encode method with embeddings models like jinai-v2
-            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+            if request.Type == "AutoModelForCausalLM":
+                self.model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+            else:
+                self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

-            if request.CUDA:
+            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+            self.CUDA = False
+
+            if request.CUDA or torch.cuda.is_available():
                try:
-                    # TODO: also tensorflow, make configurable
-                    import torch.cuda
-                    if torch.cuda.is_available():
-                        print("Loading model", model_name, "to CUDA.", file=sys.stderr)
-                        self.model = self.model.to("cuda")
+                    print("Loading model", model_name, "to CUDA.", file=sys.stderr)
+                    self.model = self.model.to("cuda")
+                    self.CUDA = True
                except Exception as err:
                    print("Not using CUDA:", err, file=sys.stderr)
        except Exception as err:
@@ -98,6 +101,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            An EmbeddingResult object that contains the calculated embeddings.
        """

+        set_seed(request.Seed)
        # Tokenize input
        max_length = 512
        if request.Tokens != 0:
@@ -113,6 +117,51 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        print("Embeddings:", sentence_embeddings, file=sys.stderr)
        return backend_pb2.EmbeddingResult(embeddings=sentence_embeddings)

+    def Predict(self, request, context):
+        """
+        Generates text based on the given prompt and sampling parameters.
+
+        Args:
+            request: The predict request.
+            context: The gRPC context.
+
+        Returns:
+            backend_pb2.Reply: The predict result.
+        """
+        set_seed(request.Seed)
+        if request.TopP == 0:
+            request.TopP = 0.9
+
+        max_tokens = 200
+        if request.Tokens > 0:
+            max_tokens = request.Tokens
+
+        inputs = self.tokenizer(request.Prompt, return_tensors="pt").input_ids
+        if self.CUDA:
+            inputs = inputs.to("cuda")
+
+        outputs = self.model.generate(inputs, max_new_tokens=max_tokens, temperature=request.Temperature, top_p=request.TopP)
+
+        generated_text = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
+        # Remove prompt from response if present
+        if request.Prompt in generated_text:
+            generated_text = generated_text.replace(request.Prompt, "")
+
+        return backend_pb2.Reply(message=bytes(generated_text, encoding='utf-8'))
+
+    def PredictStream(self, request, context):
+        """
+        Generates text based on the given prompt and sampling parameters, and streams the results.
+
+        Args:
+            request: The predict stream request.
+            context: The gRPC context.
+
+        Returns:
+            backend_pb2.Reply: The predict stream result.
+        """
+        yield self.Predict(request, context)
+

 def serve(address):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=MAX_WORKERS))
--- a/backend/python/vall-e-x/backend_pb2.py
+++ b/backend/python/vall-e-x/backend_pb2.py
--- a/backend/python/vall-e-x/install.sh
+++ b/backend/python/vall-e-x/install.sh
@@ -12,4 +12,8 @@ echo $CONDA_PREFIX

 git clone https://github.com/Plachtaa/VALL-E-X.git $CONDA_PREFIX/vall-e-x && pushd $CONDA_PREFIX/vall-e-x && git checkout -b build $SHA && pip install -r requirements.txt && popd

-cp -rfv $CONDA_PREFIX/vall-e-x/* ./
+cp -rfv $CONDA_PREFIX/vall-e-x/* ./
+
+if [ "$PIP_CACHE_PURGE" = true ] ; then
+    pip cache purge
+fi
--- a/backend/python/vall-e-x/run.sh
+++ b/backend/python/vall-e-x/run.sh
@@ -10,4 +10,6 @@ source activate transformers
 # get the directory where the bash script is located
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

+cd $DIR
+
 python $DIR/ttsvalle.py $@
--- a/backend/python/vllm/Makefile
+++ b/backend/python/vllm/Makefile
@@ -1,8 +1,6 @@
 .PHONY: vllm
 vllm:
-	@echo "Creating virtual environment..."
-	@conda env create --name vllm --file vllm.yml
-	@echo "Virtual environment created."
+	$(MAKE) -C ../common-env/transformers

 .PHONY: run
 run:
--- a/backend/python/vllm/backend_pb2.py
+++ b/backend/python/vllm/backend_pb2.py
--- a/backend/python/vllm/backend_vllm.py
+++ b/backend/python/vllm/backend_vllm.py
@@ -97,12 +97,16 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
            context: The gRPC context.

        Returns:
-            backend_pb2.Result: The predict result.
+            backend_pb2.Reply: The predict result.
        """
        if request.TopP == 0:
            request.TopP = 0.9

-        sampling_params = SamplingParams(temperature=request.Temperature, top_p=request.TopP)
+        max_tokens = 200
+        if request.Tokens > 0:
+            max_tokens = request.Tokens
+
+        sampling_params = SamplingParams(max_tokens=max_tokens, temperature=request.Temperature, top_p=request.TopP)
        outputs = self.llm.generate([request.Prompt], sampling_params)

        generated_text = outputs[0].outputs[0].text
@@ -110,7 +114,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        if request.Prompt in generated_text:
            generated_text = generated_text.replace(request.Prompt, "")

-        return backend_pb2.Result(message=bytes(generated_text, encoding='utf-8'))
+        return backend_pb2.Reply(message=bytes(generated_text, encoding='utf-8'))

    def PredictStream(self, request, context):
        """
@@ -123,11 +127,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
        Returns:
            backend_pb2.Result: The predict stream result.
        """
-        # Implement PredictStream RPC
-        #for reply in some_data_generator():
-        #    yield reply
-        # Not implemented yet
-        return self.Predict(request, context)
+        yield self.Predict(request, context)

 def serve(address):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=MAX_WORKERS))
--- a/backend/python/vllm/run.sh
+++ b/backend/python/vllm/run.sh
@@ -6,7 +6,7 @@
 export PATH=$PATH:/opt/conda/bin

 # Activate conda environment
-source activate vllm
+source activate transformers

 # get the directory where the bash script is located
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
--- a/backend/python/vllm/test.sh
+++ b/backend/python/vllm/test.sh
@@ -3,7 +3,7 @@
 ## A bash script wrapper that runs the transformers server with conda

 # Activate conda environment
-source activate vllm
+source activate transformers

 # get the directory where the bash script is located
 DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
--- a/backend/python/vllm/vllm.yml
+++ b/backend/python/vllm/vllm.yml
@@ -1,99 +0,0 @@
-name: vllm
-channels:
-  - defaults
-dependencies:
-  - _libgcc_mutex=0.1=main
-  - _openmp_mutex=5.1=1_gnu
-  - bzip2=1.0.8=h7b6447c_0
-  - ca-certificates=2023.08.22=h06a4308_0
-  - ld_impl_linux-64=2.38=h1181459_1
-  - libffi=3.4.4=h6a678d5_0
-  - libgcc-ng=11.2.0=h1234567_1
-  - libgomp=11.2.0=h1234567_1
-  - libstdcxx-ng=11.2.0=h1234567_1
-  - libuuid=1.41.5=h5eee18b_0
-  - ncurses=6.4=h6a678d5_0
-  - openssl=3.0.11=h7f8727e_2
-  - pip=23.2.1=py311h06a4308_0
-  - python=3.11.5=h955ad1f_0
-  - readline=8.2=h5eee18b_0
-  - setuptools=68.0.0=py311h06a4308_0
-  - sqlite=3.41.2=h5eee18b_0
-  - tk=8.6.12=h1ccaba5_0
-  - wheel=0.41.2=py311h06a4308_0
-  - xz=5.4.2=h5eee18b_0
-  - zlib=1.2.13=h5eee18b_0
-  - pip:
-      - aiosignal==1.3.1
-      - anyio==3.7.1
-      - attrs==23.1.0
-      - certifi==2023.7.22
-      - charset-normalizer==3.3.0
-      - click==8.1.7
-      - cmake==3.27.6
-      - fastapi==0.103.2
-      - filelock==3.12.4
-      - frozenlist==1.4.0
-      - fsspec==2023.9.2
-      - grpcio==1.59.0
-      - h11==0.14.0
-      - httptools==0.6.0
-      - huggingface-hub==0.17.3
-      - idna==3.4
-      - jinja2==3.1.2
-      - jsonschema==4.19.1
-      - jsonschema-specifications==2023.7.1
-      - lit==17.0.2
-      - markupsafe==2.1.3
-      - mpmath==1.3.0
-      - msgpack==1.0.7
-      - networkx==3.1
-      - ninja==1.11.1
-      - numpy==1.26.0
-      - nvidia-cublas-cu11==11.10.3.66
-      - nvidia-cuda-cupti-cu11==11.7.101
-      - nvidia-cuda-nvrtc-cu11==11.7.99
-      - nvidia-cuda-runtime-cu11==11.7.99
-      - nvidia-cudnn-cu11==8.5.0.96
-      - nvidia-cufft-cu11==10.9.0.58
-      - nvidia-curand-cu11==10.2.10.91
-      - nvidia-cusolver-cu11==11.4.0.1
-      - nvidia-cusparse-cu11==11.7.4.91
-      - nvidia-nccl-cu11==2.14.3
-      - nvidia-nvtx-cu11==11.7.91
-      - packaging==23.2
-      - pandas==2.1.1
-      - protobuf==4.24.4
-      - psutil==5.9.5
-      - pyarrow==13.0.0
-      - pydantic==1.10.13
-      - python-dateutil==2.8.2
-      - python-dotenv==1.0.0
-      - pytz==2023.3.post1
-      - pyyaml==6.0.1
-      - ray==2.7.0
-      - referencing==0.30.2
-      - regex==2023.10.3
-      - requests==2.31.0
-      - rpds-py==0.10.4
-      - safetensors==0.4.0
-      - sentencepiece==0.1.99
-      - six==1.16.0
-      - sniffio==1.3.0
-      - starlette==0.27.0
-      - sympy==1.12
-      - tokenizers==0.14.1
-      - torch==2.0.1
-      - tqdm==4.66.1
-      - transformers==4.34.0
-      - triton==2.0.0
-      - typing-extensions==4.8.0
-      - tzdata==2023.3
-      - urllib3==2.0.6
-      - uvicorn==0.23.2
-      - uvloop==0.17.0
-      - vllm==0.2.0
-      - watchfiles==0.20.0
-      - websockets==11.0.3
-      - xformers==0.0.22
-prefix: /opt/conda/envs/vllm
--- a/docs/assets/jsconfig.json
+++ b/docs/assets/jsconfig.json
@@ -0,0 +1,11 @@
+{
+ "compilerOptions": {
+  "baseUrl": ".",
+  "paths": {
+   "*": [
+    "../../../../.cache/hugo_cache/modules/filecache/modules/pkg/mod/github.com/gohugoio/hugo-mod-jslibs-dist/popperjs/v2@v2.21100.20000/package/dist/cjs/popper.js/*",
+    "../../../../.cache/hugo_cache/modules/filecache/modules/pkg/mod/github.com/twbs/bootstrap@v5.3.2+incompatible/js/*"
+   ]
+  }
+ }
+}
--- a/docs/config.toml
+++ b/docs/config.toml
@@ -1,133 +1,178 @@
-# this is a required setting for this theme to appear on https://themes.gohugo.io/
-# change this to a value appropriate for you; if your site is served from a subdirectory
-# set it like "https://example.com/mysite/"
 baseURL = "https://localai.io/"
+languageCode = "en-GB"
+contentDir = "content"
+enableEmoji = true
+enableGitInfo = true # N.B. .GitInfo does not currently function with git submodule content directories

-# canonicalization will only be used for the sitemap.xml and index.xml files;
-# if set to false, a site served from a subdirectory will generate wrong links
-# inside of the above mentioned files; if you serve the page from the servers root
-# you are free to set the value to false as recommended by the official Hugo documentation
-canonifyURLs = true # true -> all relative URLs would instead be canonicalized using baseURL
-# required value to serve this page from a webserver AND the file system;
-# if you don't want to serve your page from the file system, you can also set this value
-# to false
-relativeURLs = true # true -> rewrite all relative URLs to be relative to the current content
-# if you set uglyURLs to false, this theme will append 'index.html' to any branch bundle link
-# so your page can be also served from the file system; if you don't want that,
-# set disableExplicitIndexURLs=true in the [params] section
-uglyURLs = false     # true -> basic/index.html -> basic.html
+defaultContentLanguage = 'en'

-# the directory where Hugo reads the themes from; this is specific to your
-# installation and most certainly needs be deleted or changed
-#themesdir = "../.."
-# yeah, well, obviously a mandatory setting for your site, if you want to
-# use this theme ;-)
-theme = "hugo-theme-relearn"
-
-# the main language of this site; also an automatic pirrrate translation is
-# available in this showcase
-languageCode = "en"
-
-# make sure your defaultContentLanguage is the first one in the [languages]
-# array below, as the theme needs to make assumptions on it
-defaultContentLanguage = "en"
-
-# the site's title of this showcase; you should change this ;-)
-title = "LocalAI Documentation"
-
-# We disable this for testing the exampleSite; you must do so too
-# if you want to use the themes parameter disableGeneratorVersion=true;
-# otherwise Hugo will create a generator tag on your home page
-disableHugoGeneratorInject = true
-
-[outputs]
-  # add JSON to the home to support Lunr search; This is a mandatory setting
-  # for the search functionality
-  # add PRINT to home, section and page to activate the feature to print whole
-  # chapters
-  home = ["HTML", "RSS", "PRINT", "SEARCH", "SEARCHPAGE"]
-  section = ["HTML", "RSS", "PRINT"]
-  page = ["HTML", "RSS", "PRINT"]

 [markup]
-  [markup.highlight]
-    # if `guessSyntax = true`, there will be no unstyled code even if no language
-    # was given BUT Mermaid and Math codefences will not work anymore! So this is a
-    # mandatory setting for your site if you want to use Mermaid or Math codefences
-    guessSyntax = true
+  defaultMarkdownHandler = "goldmark"
+  [markup.tableOfContents]
+      endLevel = 3
+      startLevel = 1
+  [markup.goldmark]
+    [markup.goldmark.renderer]
+      unsafe = true # https://jdhao.github.io/2019/12/29/hugo_html_not_shown/
+  # [markup.highlight]
+  #   codeFences = false # disables Hugo's default syntax highlighting
+  # [markup.goldmark.parser]
+  #   [markup.goldmark.parser.attribute]
+  #     block = true
+  #     title = true

-    # here in this showcase we use our own modified chroma syntax highlightning style
-    # which is imported in theme-relearn-light.css / theme-relearn-dark.css;
-    # if you want to use a predefined style instead:
-    # - remove the following `noClasses`
-    # - set the following `style` to a predefined style name
-    # - remove the `@import` of the self-defined chroma stylesheet from your CSS files
-    #   (here eg.: theme-relearn-light.css / theme-relearn-dark.css)
-    noClasses = false
-    style = "tango"

-  [markup.goldmark.renderer]
-    # activated for this showcase to use HTML and JavaScript; decide on your own needs;
-    # if in doubt, remove this line
-    unsafe = true

-# allows `hugo server` to display this showcase in IE11; this is used for testing, as we
-# are still supporting IE11 - although with degraded experience; if you don't care about
-# `hugo server` or browsers of ancient times, fell free to remove this whole block
-[server]
-  [[server.headers]]
-    for = "**.html"
-    [server.headers.values]
-       X-UA-Compatible = "IE=edge"
+[params]
+
+  google_fonts = [
+    ["Inter", "300, 400, 600, 700"],
+    ["Fira Code", "500, 700"]
+  ]
+
+  sans_serif_font = "Inter"     # Default is System font
+  secondary_font  = "Inter"     # Default is System font
+  mono_font       = "Fira Code" # Default is System font
+
+    [params.footer]
+        copyright = "© 2023-2024 <a href='https://mudler.pm' target=_blank>Ettore Di Giacinto</a>"
+        version = true # includes git commit info
+
+    [params.social]
+        github = "mudler/LocalAI"        # YOUR_GITHUB_ID or YOUR_GITHUB_URL
+        twitter = "LocalAI_API"       # YOUR_TWITTER_ID
+        discord = "uJAeKSAGDy"
+        # instagram = "colinwilson"     # YOUR_INSTAGRAM_ID
+        rss = true                    # show rss icon with link
+
+    [params.docs] # Parameters for the /docs 'template'
+
+        logo = "https://github.com/go-skynet/LocalAI/assets/2420543/0966aa2a-166e-4f99-a3e5-6c915fc997dd"
+        logo_text = "LocalAI"
+        title           = "LocalAI documentation"           # default html title for documentation pages/sections
+
+        pathName        = "docs"                            # path name for documentation site | default "docs"
+
+        # themeColor      = "cyan"                            # (optional) - Set theme accent colour. Options include: blue (default), green, red, yellow, emerald, cardinal, magenta, cyan
+
+        darkMode        = true                                # enable dark mode option? default false
+
+        prism           = true                                # enable syntax highlighting via Prism
+
+        prismTheme      = "solarized-light"                           # (optional) - Set theme for PrismJS. Options include: lotusdocs (default), solarized-light, twilight, lucario
+
+        # gitinfo
+        repoURL         = "https://github.com/mudler/LocalAI"  # Git repository URL for your site [support for GitHub, GitLab, and BitBucket]
+        repoBranch      = "master"
+        editPage        = true                                # enable 'Edit this page' feature - default false
+        lastMod         = true                                # enable 'Last modified' date on pages - default false
+        lastModRelative = true                                # format 'Last modified' time as relative - default true
+
+        sidebarIcons    = true                                # enable sidebar icons? default false
+        breadcrumbs     = true                                # default is true
+        backToTop       = true                                # enable back-to-top button? default true
+
+        # ToC
+        toc             = true                                # enable table of contents? default is true
+        tocMobile       = true                                # enable table of contents in mobile view? default is true
+        scrollSpy       = true                                # enable scrollspy on ToC? default is true
+
+        # front matter
+        descriptions    = true                                # enable front matter descriptions under content title?
+        titleIcon       = true                                # enable front matter icon title prefix? default is false
+
+        # content navigation
+        navDesc         = true                                # include front matter descriptions in Prev/Next navigation cards
+        navDescTrunc    = 30                                  # Number of characters by which to truncate the Prev/Next descriptions
+
+        listDescTrunc   = 100                                 # Number of characters by which to truncate the list card description
+
+        # Link behaviour
+        intLinkTooltip  = true                                # Enable a tooltip for internal links that displays info about the destination? default false
+        # extLinkNewTab   = false                             # Open external links in a new Tab? default true
+        # logoLinkURL = ""                                    # Set a custom URL destination for the top header logo link.
+
+    [params.flexsearch] # Parameters for FlexSearch
+        enabled             = true
+        # tokenize            = "full"
+        # optimize            = true
+        # cache               = 100
+        # minQueryChar        = 3 # default is 0 (disabled)
+        # maxResult           = 5 # default is 5
+        # searchSectionsIndex = []
+
+    [params.docsearch] # Parameters for DocSearch
+        # appID     = "" # Algolia Application ID
+        # apiKey    = "" # Algolia Search-Only API (Public) Key
+        # indexName = "" # Index Name to perform search on (or set env variable HUGO_PARAM_DOCSEARCH_indexName)
+
+    [params.analytics] # Parameters for Analytics (Google, Plausible)
+        # plausibleURL    = "/docs/s" # (or set via env variable HUGO_PARAM_ANALYTICS_plausibleURL)
+        # plausibleAPI    = "/docs/s" # optional - (or set via env variable HUGO_PARAM_ANALYTICS_plausibleAPI)
+        # plausibleDomain = ""      # (or set via env variable HUGO_PARAM_ANALYTICS_plausibleDomain)
+
+    # [params.feedback]
+    #     enabled = true
+    #     emoticonTpl = true
+    #     eventDest = ["plausible","google"]
+    #     emoticonEventName = "Feedback"
+    #     positiveEventName = "Positive Feedback"
+    #     negativeEventName = "Negative Feedback"
+    #     positiveFormTitle = "What did you like?"
+    #     negativeFormTitle = "What went wrong?"
+    #     successMsg = "Thank you for helping to improve Lotus Docs' documentation!"
+    #     errorMsg = "Sorry! There was an error while attempting to submit your feedback!"
+    #     positiveForm = [
+    #       ["Accurate", "Accurately describes the feature or option."],
+    #       ["Solved my problem", "Helped me resolve an issue."],
+    #       ["Easy to understand", "Easy to follow and comprehend."],
+    #       ["Something else"]
+    #     ]
+    #     negativeForm = [
+    #       ["Inaccurate", "Doesn't accurately describe the feature or option."],
+    #       ["Couldn't find what I was looking for", "Missing important information."],
+    #       ["Hard to understand", "Too complicated or unclear."],
+    #       ["Code sample errors", "One or more code samples are incorrect."],
+    #       ["Something else"]
+    #     ]
+
+[menu]
+ [[menu.primary]]
+    name  = "Docs"
+    url = "docs/"
+    identifier = "docs"
+    weight = 10
+[[menu.primary]]
+    name = "Discord"
+    url = "https://discord.gg/uJAeKSAGDy"
+    identifier = "discord"
+    weight = 20

-# showcase of the menu shortcuts; you can use relative URLs linking
-# to your content or use fully-quallified URLs to link outside of
-# your project
 [languages]
  [languages.en]
    title = "LocalAI documentation"
-    weight = 1
    languageName = "English"
-    [languages.en.params]
-      landingPageName = "<i class='fas fa-home'></i> Home"
-  [[languages.en.menu.shortcuts]]
-    name = "<i class='fas fa-home'></i> Home"
-    url = "/"
-    weight = 1
-  [[languages.en.menu.shortcuts]]
-    name = "<i class='fab fa-fw fa-github'></i> GitHub repo"
-    identifier = "ds"
-    url = "https://github.com/go-skynet/LocalAI"
    weight = 10
+#  [languages.fr]
+#    title = "LocalAI documentation"
+#    languageName = "Français"
+#    contentDir = "content/fr"
+#    weight = 20
+#  [languages.de]
+#    title = "LocalAI documentation"
+#    languageName = "Deutsch"
+#    contentDir = "content/de"
+#    weight = 30

-  [[languages.en.menu.shortcuts]]
-    name = "<i class='fas fa-fw fa-camera'></i> Examples"
-    url = "https://github.com/go-skynet/LocalAI/tree/master/examples/"
-    weight = 11

-  [[languages.en.menu.shortcuts]]
-    name = "<i class='fas fa-fw fa-images'></i> Model Gallery"
-    url = "https://github.com/go-skynet/model-gallery"
-    weight = 12

-  [[languages.en.menu.shortcuts]]
-    name = "<i class='fas fa-fw fa-download'></i> Container images"
-    url = "https://quay.io/repository/go-skynet/local-ai"
-    weight = 20
-  #[[languages.en.menu.shortcuts]]
-  #  name = "<i class='fas fa-fw fa-bullhorn'></i> Credits"
-  #  url = "more/credits/"
-  #  weight = 30
-
-  [[languages.en.menu.shortcuts]]
-    name = "<i class='fas fa-fw fa-tags'></i> Releases"
-    url = "https://github.com/go-skynet/LocalAI/releases"
-    weight = 40


 # mounts are only needed in this showcase to access the publicly available screenshots;
 # remove this section if you don't need further mounts
 [module]
+  replacements = "github.com/colinwilson/lotusdocs -> lotusdocs"
  [[module.mounts]]
    source = 'archetypes'
    target = 'archetypes'
@@ -152,30 +197,11 @@ disableHugoGeneratorInject = true
  [[module.mounts]]
    source = 'static'
    target = 'static'
-
-
-# settings specific to this theme's features; choose to your likings and
-# consult this documentation for explaination
-[params]
-  editURL = "https://github.com/mudler/LocalAI/edit/master/docs/content/"
-  description = "Documentation for LocalAI"
-  author = "Ettore Di Giacinto"
-  showVisitedLinks = true
-  collapsibleMenu = true
-  disableBreadcrumb = false
-  disableInlineCopyToClipBoard = true
-  disableNextPrev = false
-  disableLandingPageButton = true
-  breadcrumbSeparator = ">"
-  titleSeparator = "::"
-  themeVariant = [ "auto", "relearn-bright", "relearn-light", "relearn-dark", "learn", "neon", "blue", "green", "red" ]
-  themeVariantAuto = [ "relearn-light", "relearn-dark" ]
-  disableSeoHiddenPages = true
-  # this is to index search for your native language in other languages, too (eg.
-  # pir in this showcase)
-  additionalContentLanguage = [ "en" ]
-  # this is for the stylesheet generator to allow for interactivity in Mermaid
-  # graphs; you usually will not need it and you should remove this for
-  # security reasons
-  mermaidInitialize = "{ \"securityLevel\": \"loose\" }"
-  mermaidZoom = true
+    # uncomment line below for temporary local development of module
+    # or when using a 'theme' as a git submodule
+  [[module.imports]]
+    path = "github.com/colinwilson/lotusdocs"
+    disable = false
+  [[module.imports]]
+    path = "github.com/gohugoio/hugo-mod-bootstrap-scss/v5"
+    disable = false
--- a/docs/content/advanced/development.md
+++ b/docs/content/advanced/development.md
@@ -1,37 +0,0 @@
-
-+++
-disableToc = false
-title = "Development documentation"
-weight = 7
-+++
-
-{{% notice note %}}
-
-This section is for developers and contributors. If you are looking for the user documentation, this is not the right place!
-
-{{% /notice %}}
-
-This section will collect how-to, notes and development documentation
-
-## Contributing
-
-We use conventional commits and semantic versioning. Please follow the [conventional commits](https://www.conventionalcommits.org/en/v1.0.0/) specification when writing commit messages.
-
-## Creating a gRPC backend
-
-LocalAI backends are `gRPC` servers.
-
-In order to create a new backend you need:
-
- If there are changes required to the protobuf code, modify the [proto](https://github.com/go-skynet/LocalAI/blob/master/pkg/grpc/proto/backend.proto) file and re-generate the code with `make protogen`.
- Modify the `Makefile` to add your new backend and re-generate the client code with `make protogen` if necessary.
- Create a new `gRPC` server in `extra/grpc` if it's not written in go: [link](https://github.com/go-skynet/LocalAI/tree/master/extra/grpc), and create the specific implementation.
-    - Golang `gRPC` servers should be added in the [pkg/backend](https://github.com/go-skynet/LocalAI/tree/master/pkg/backend) directory given their type. See [piper](https://github.com/go-skynet/LocalAI/blob/master/pkg/backend/tts/piper.go) as an example.
-    - Golang servers needs a respective `cmd/grpc` binary that must be created too, see also [cmd/grpc/piper](https://github.com/go-skynet/LocalAI/tree/master/cmd/grpc/piper) as an example, update also the Makefile accordingly to build the binary during build time.
- Update the Dockerfile: if the backend is written in another language, update the `Dockerfile` default *EXTERNAL_GRPC_BACKENDS* variable by listing the new binary [link](https://github.com/go-skynet/LocalAI/blob/c2233648164f67cdb74dd33b8d46244e14436ab3/Dockerfile#L14).
-
-Once you are done, you can either re-build `LocalAI` with your backend or you can try it out by running the `gRPC` server manually and specifying the host and IP to LocalAI with `--external-grpc-backends` or using (`EXTERNAL_GRPC_BACKENDS` environment variable, comma separated list of `name:host:port` tuples, e.g. `my-awesome-backend:host:port`):
-
-```bash
-./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port" ...
-```
--- a/docs/content/docs/advanced/_index.en.md
+++ b/docs/content/docs/advanced/_index.en.md
@@ -0,0 +1,11 @@
+---
+weight: 20
+title: "Advanced"
+description: "Advanced usage"
+icon: science
+lead: ""
+date: 2020-10-06T08:49:15+00:00
+lastmod: 2020-10-06T08:49:15+00:00
+draft: false
+images: []
+---
--- a/docs/content/docs/advanced/advanced-usage.md
+++ b/docs/content/docs/advanced/advanced-usage.md
@@ -1,15 +1,16 @@

 +++
 disableToc = false
-title = "Advanced"
-weight = 6
+title = "Advanced usage"
+weight = 21
+url = '/advanced'
 +++

 ### Advanced configuration with YAML files

 In order to define default prompts, model parameters (such as custom default `top_p` or `top_k`), LocalAI can be configured to serve user-defined models with a set of default parameters and templates.

-You can create multiple `yaml` files in the models path or either specify a single YAML configuration file. 
+To configure a model, you can either create multiple `yaml` files in the models path or specify a single YAML configuration file.
 Consider the following `models` folder in the `example/chatbot-ui`:

 ```
@@ -96,6 +97,12 @@ Specifying a `config-file` via CLI allows to declare models in a single file as

 See also [chatbot-ui](https://github.com/go-skynet/LocalAI/tree/master/examples/chatbot-ui) as an example on how to use config files.

+You can also pass a full URL or a short-hand URL of a YAML model configuration file to `local-ai` at startup. For example, to use phi-2:
+
+```
+local-ai github://mudler/LocalAI/examples/configurations/phi-2.yaml@master
+```
+
 ### Full config model file reference

 ```yaml
@@ -303,7 +310,7 @@ prompt_cache_all: true

 By default LocalAI will try to autoload the model by trying all the backends. This might work for most of models, but some of the backends are NOT configured to autoload.

-The available backends are listed in the [model compatibility table]({{%relref "model-compatibility" %}}).
+The available backends are listed in the [model compatibility table]({{%relref "docs/reference/compatibility-table" %}}).

 In order to specify a backend for your models, create a model config file in your `models` directory specifying the backend:

@@ -337,6 +344,19 @@ Or a remote URI:
 ./local-ai --debug --external-grpc-backends "my-awesome-backend:host:port"
 ```

+For example, to start vllm manually after compiling LocalAI (assuming the command is run from the root of the repository):
+
+```bash
+./local-ai --external-grpc-backends "vllm:$PWD/backend/python/vllm/run.sh"
+```
+
+Note that you first need to create the conda environment with:
+
+```bash
+make -C backend/python/vllm
+```
+
+
 ### Environment variables

 When LocalAI runs in a container,
@@ -359,15 +379,37 @@ docker run --env REBUILD=true localai
 docker run --env-file .env localai
 ```

-### Build only a single backend
+### CLI parameters

-You can control the backends that are built by setting the `GRPC_BACKENDS` environment variable. For instance, to build only the `llama-cpp` backend only:
+You can control LocalAI with command line arguments, for example to specify the bind address or the number of threads.

-```bash
-make GRPC_BACKENDS=backend-assets/grpc/llama-cpp build
-```

-By default, all the backends are built.
+| Parameter                      | Environment Variable            | Default Value                                      | Description                                                         |
+| ------------------------------ | ------------------------------- | -------------------------------------------------- | ------------------------------------------------------------------- |
+| --f16                          | $F16                            | false                                              | Enable f16 mode                                                     |
+| --debug                        | $DEBUG                          | false                                              | Enable debug mode                                                   |
+| --cors                         | $CORS                           | false                                              | Enable CORS support                                                 |
+| --cors-allow-origins value     | $CORS_ALLOW_ORIGINS             |                                                    | Specify origins allowed for CORS                                     |
+| --threads value                | $THREADS                        | 4    | Number of threads to use for parallel computation                    |
+| --models-path value            | $MODELS_PATH                    | ./models       | Path to the directory containing models used for inferencing        |
+| --preload-models value         | $PRELOAD_MODELS                 |           | List of models to preload in JSON format at startup                  |
+| --preload-models-config value  | $PRELOAD_MODELS_CONFIG          |  | A config with a list of models to apply at startup. Specify the path to a YAML config file |
+| --config-file value            | $CONFIG_FILE                    |                                         | Path to the config file                                             |
+| --address value                | $ADDRESS                        | :8080                    | Specify the bind address for the API server                         |
+| --image-path value             | $IMAGE_PATH                     |                                     | Path to the directory used to store generated images                             |
+| --context-size value           | $CONTEXT_SIZE                   | 512                 | Default context size of the model                                   |
+| --upload-limit value           | $UPLOAD_LIMIT                   | 15                         | Default upload limit in megabytes (audio file upload)                                  |
+| --galleries                    | $GALLERIES                      |                                                    | Set galleries from the command line                                 |
+| --parallel-requests            | $PARALLEL_REQUESTS              | false | Enable backends to handle multiple requests in parallel. This applies to backends that support parallel requests, like llama.cpp or vllm |
+| --single-active-backend        | $SINGLE_ACTIVE_BACKEND          | false | Allow only one backend to be running |
+| --api-keys value               | $API_KEY                        | empty | List of API keys to enable API authentication. When this is set, all requests must be authenticated with one of these API keys. |
+| --enable-watchdog-idle         | $WATCHDOG_IDLE                  | false | Enable watchdog for stopping idle backends. This will stop backends that stay idle for too long. |
+| --enable-watchdog-busy         | $WATCHDOG_BUSY                  | false | Enable watchdog for stopping busy backends that exceed a defined threshold. |
+| --watchdog-busy-timeout value | $WATCHDOG_BUSY_TIMEOUT | 5m | Watchdog timeout. This will restart the backend if it crashes.  |
+| --watchdog-idle-timeout value | $WATCHDOG_IDLE_TIMEOUT | 15m | Watchdog idle timeout. This will restart the backend if it crashes. |
+| --preload-backend-only | $PRELOAD_BACKEND_ONLY | false | If set, the API is NOT launched, and only the preloaded models / backends are started. This is intended for multi-node setups. |
+| --external-grpc-backends | $EXTERNAL_GRPC_BACKENDS | none | Comma separated list of external gRPC backends to use. Format: `name:host:port` or `name:/path/to/file` |
+
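+The environment variables listed in the table can also be collected in a `.env` file and passed to the container with `--env-file`, as in the `docker run --env-file .env localai` example earlier on this page. A minimal sketch, assuming the values below (they are illustrative only, not recommended settings):
+
+```
+# .env (example values, adjust to your setup)
+DEBUG=true
+THREADS=8
+CONTEXT_SIZE=1024
+CORS=true
+```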

 ### Extra backends

@@ -391,11 +433,11 @@ RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers
 ENV EXTERNAL_GRPC_BACKENDS="diffusers:/build/backend/python/diffusers/run.sh"
 ```

-{{% notice note %}}
+{{% alert note %}}

 You can specify remote external backends or path to local files. The syntax is `backend-name:/path/to/backend` or `backend-name:host:port`.

-{{% /notice %}}
+{{% /alert %}}

 #### In runtime

--- a/docs/content/docs/advanced/fine-tuning.md
+++ b/docs/content/docs/advanced/fine-tuning.md
@@ -2,12 +2,12 @@
 +++
 disableToc = false
 title = "Fine-tuning LLMs for text generation"
-weight = 3
+weight = 22
 +++

-{{% notice note %}}
+{{% alert note %}}
 Section under construction
-{{% /notice %}}
+{{% /alert %}}

 This section covers how to fine-tune a language model for text generation and consume it in LocalAI.

--- a/docs/content/faq/_index.en.md
+++ b/docs/content/faq/_index.en.md
@@ -2,7 +2,9 @@
 +++
 disableToc = false
 title = "FAQ"
-weight = 9
+weight = 24
+icon = "quiz"
+url = "/faq/"
 +++

 ## Frequently asked questions
@@ -12,25 +14,13 @@ Here are answers to some of the most common questions.

 ### How do I get models? 

-<details>
-
 Most gguf-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open up issues. However, be cautious about downloading models from the internet and directly onto your machine, as there may be security vulnerabilities in lama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, or models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.

-</details>
-
 ### What's the difference with Serge, or XXX?

-
-<details>
-
 LocalAI is a multi-model solution that doesn't focus on a specific model type (e.g., llama.cpp or alpaca.cpp), and it handles all of these internally for faster inference,  easy to set up locally and deploy to Kubernetes.

-</details>
-
-
-### Everything is slow, how come?
-
-<details>
+### Everything is slow, how is that possible?

 There are few situation why this could occur. Some tips are:
 - Don't use HDD to store your models. Prefer SSD over HDD. In case you are stuck with HDD, disable `mmap` in the model config file so it loads everything in memory.
@@ -38,61 +28,31 @@ There are few situation why this could occur. Some tips are:
 - Run LocalAI with `DEBUG=true`. This gives more information, including stats on the token inference speed.
 - Check that you are actually getting an output: run a simple curl request with `"stream": true` to see how fast the model is responding. 

-</details>
-
 ### Can I use it with a Discord bot, or XXX?

-<details>
-
 Yes! If the client uses OpenAI and supports setting a different base URL to send requests to, you can use the LocalAI endpoint. This allows to use this with every application that was supposed to work with OpenAI, but without changing the application!

-</details>
-
-
 ### Can this leverage GPUs? 

-<details>
-
-There is partial GPU support, see build instructions above.
-
-</details>
+GPU support is available, see [GPU acceleration]({{%relref "docs/features/GPU-acceleration" %}}).

 ### Where is the webUI? 

-<details> 
-There is the availability of localai-webui and chatbot-ui in the examples section and can be setup as per the instructions. However as LocalAI is an API you can already plug it into existing projects that provides are UI interfaces to OpenAI's APIs. There are several already on github, and should be compatible with LocalAI already (as it mimics the OpenAI API)
-
-</details>
+localai-webui and chatbot-ui are available in the examples section and can be set up by following their instructions. However, since LocalAI is an API, you can already plug it into existing projects that provide UI interfaces to OpenAI's APIs. There are several on GitHub, and they should be compatible with LocalAI (as it mimics the OpenAI API).

 ### Does it work with AutoGPT? 

-<details>
-
 Yes, see the [examples](https://github.com/go-skynet/LocalAI/tree/master/examples/)!

-</details>
-
 ### How can I troubleshoot when something is wrong?

-<details>
-
 Enable the debug mode by setting `DEBUG=true` in the environment variables. This will give you more information on what's going on.
 You can also specify `--debug` in the command line.

-</details>
-
 ### I'm getting 'invalid pitch' error when running with CUDA, what's wrong?

-<details>
-
 This typically happens when your prompt exceeds the context size. Try to reduce the prompt size, or increase the context size.

-</details>
-
 ### I'm getting a 'SIGILL' error, what's wrong?

-<details>
-
-Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting `REBUILD=true` and disable the CPU instructions that are not compatible with your CPU. For instance: `CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" make build`
-  
-</details>
+Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting `REBUILD=true` and disable the CPU instructions that are not compatible with your CPU. For instance: `CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" make build`
--- a/docs/content/docs/features/GPU-acceleration.md
+++ b/docs/content/docs/features/GPU-acceleration.md
@@ -1,25 +1,35 @@
-
 +++
 disableToc = false
 title = "⚡ GPU acceleration"
-weight = 2
+weight = 9
+url = "/features/gpu-acceleration/"
 +++

-{{% notice note %}}
+{{% alert context="warning" %}}
 Section under construction
-{{% /notice %}}
+{{% /alert %}}

 This section contains instruction on how to use LocalAI with GPU acceleration.

-{{% notice note %}}
-For accelleration for AMD or Metal HW there are no specific container images, see the [build]({{%relref "build/#acceleration" %}})
-{{% /notice %}}
+{{% alert icon="⚡" context="warning" %}}
+There are no specific container images for acceleration on AMD or Metal hardware, see the [build]({{%relref "docs/getting-started/build#Acceleration" %}}) instructions.
+{{% /alert %}}

-### CUDA
+### CUDA (NVIDIA) acceleration
+
+#### Requirements

 Requirement: nvidia-container-toolkit (installation instructions [1](https://www.server-world.info/en/note?os=Ubuntu_22.04&p=nvidia&f=2) [2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html))

-To use CUDA, use the images with the `cublas` tag.
+To check which CUDA version you need, run either `nvidia-smi` or `nvcc --version`.
+
+Alternatively, you can run `nvidia-smi` with docker:
+
+```
+docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
+```
+
+To use CUDA, use the images with the `cublas` tag.

 The image list is on [quay](https://quay.io/repository/go-skynet/local-ai?tab=tags):

--- a/docs/content/docs/features/_index.en.md
+++ b/docs/content/docs/features/_index.en.md
@@ -0,0 +1,8 @@
+
++++
+disableToc = false
+title = "Features"
+weight = 8
+icon = "feature_search"
+url = "/features/"
++++
--- a/docs/content/docs/features/audio-to-text.md
+++ b/docs/content/docs/features/audio-to-text.md
@@ -1,10 +1,13 @@
 +++
 disableToc = false
 title = "🔈 Audio to text"
-weight = 2
+weight = 16
+url = "/features/audio-to-text/"
 +++

-The transcription endpoint allows to convert audio files to text. The endpoint is based on [whisper.cpp](https://github.com/ggerganov/whisper.cpp), a C++ library for audio transcription. The endpoint supports the audio formats supported by `ffmpeg`.
+Audio to text models are models that can generate text from an audio file.
+
+The transcription endpoint allows you to convert audio files to text. The endpoint is based on [whisper.cpp](https://github.com/ggerganov/whisper.cpp), a C++ library for audio transcription. The endpoint accepts all the audio formats supported by `ffmpeg`.

 ## Usage

--- a/docs/content/docs/features/constrained_grammars.md
+++ b/docs/content/docs/features/constrained_grammars.md
@@ -2,20 +2,21 @@
 +++
 disableToc = false
 title = "✍️ Constrained grammars"
-weight = 6
+weight = 15
+url = "/features/constrained_grammars/"
 +++

 The chat endpoint accepts an additional `grammar` parameter which takes a [BNF defined grammar](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form).

 This allows the LLM to constrain the output to a user-defined schema, allowing to generate `JSON`, `YAML`, and everything that can be defined with a BNF grammar.

-{{% notice note %}}
-This feature works only with models compatible with the [llama.cpp](https://github.com/ggerganov/llama.cpp) backend (see also [Model compatibility]({{%relref "model-compatibility" %}})). For details on how it works, see the upstream PRs: https://github.com/ggerganov/llama.cpp/pull/1773, https://github.com/ggerganov/llama.cpp/pull/1887
-{{% /notice %}}
+{{% alert note %}}
+This feature works only with models compatible with the [llama.cpp](https://github.com/ggerganov/llama.cpp) backend (see also [Model compatibility]({{%relref "docs/reference/compatibility-table" %}})). For details on how it works, see the upstream PRs: https://github.com/ggerganov/llama.cpp/pull/1773, https://github.com/ggerganov/llama.cpp/pull/1887
+{{% /alert %}}

 ## Setup

-Follow the setup instructions from the [LocalAI functions]({{%relref "features/openai-functions" %}}) page.
+Follow the setup instructions from the [LocalAI functions]({{%relref "docs/features/openai-functions" %}}) page.

 ## 💡 Usage example

--- a/docs/content/docs/features/embeddings.md
+++ b/docs/content/docs/features/embeddings.md
@@ -2,7 +2,8 @@
 +++
 disableToc = false
 title = "🧠 Embeddings"
-weight = 2
+weight = 13
+url = "/features/embeddings/"
 +++

 LocalAI supports generating embeddings for text or list of tokens.
@@ -73,7 +74,7 @@ parameters:

 The `sentencetransformers` backend uses Python [sentence-transformers](https://github.com/UKPLab/sentence-transformers). For a list of all pre-trained models available see here: https://github.com/UKPLab/sentence-transformers#pre-trained-models

-{{% notice note %}}
+{{% alert note %}}

 - The `sentencetransformers` backend is an optional backend of LocalAI and uses Python. If you are running `LocalAI` from the containers you are good to go and should be already configured for use.
 - If you are running `LocalAI` manually you must install the python dependencies (`make prepare-extra-conda-environments`). This requires `conda` to be installed.
@@ -82,7 +83,7 @@ The `sentencetransformers` backend uses Python [sentence-transformers](https://g
 - The `sentencetransformers` backend does support only embeddings of text, and not of tokens. If you need to embed tokens you can use the `bert` backend or `llama.cpp`.
 - No models are required to be downloaded before using the `sentencetransformers` backend. The models will be downloaded automatically the first time the API is used.

-{{% /notice %}}
+{{% /alert %}}

 ## Llama.cpp embeddings

--- a/docs/content/docs/features/gpt-vision.md
+++ b/docs/content/docs/features/gpt-vision.md
@@ -2,13 +2,10 @@
 +++
 disableToc = false
 title = "🆕 GPT Vision"
-weight = 2
+weight = 14
+url = "/features/gpt-vision/"
 +++

-{{% notice note %}}
-Available only on `master` builds
-{{% /notice %}}
-
 LocalAI supports understanding images by using [LLaVA](https://llava.hliu.cc/), and implements the [GPT Vision API](https://platform.openai.com/docs/guides/vision) from OpenAI.

 ![llava](https://github.com/mudler/LocalAI/assets/2420543/cb0a0897-3b58-4350-af66-e6f4387b58d3)
@@ -27,4 +24,4 @@ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/jso

 ### Setup

-To setup the LLaVa models, follow the full example in the [configuration examples](https://github.com/mudler/LocalAI/blob/master/examples/configurations/README.md#llava).
+To setup the LLaVa models, follow the full example in the [configuration examples](https://github.com/mudler/LocalAI/blob/master/examples/configurations/README.md#llava).
--- a/docs/content/docs/features/image-generation.md
+++ b/docs/content/docs/features/image-generation.md
@@ -2,13 +2,14 @@
 +++
 disableToc = false
 title = "🎨 Image generation"
-weight = 2
+weight = 12
+url = "/features/image-generation/"
 +++

 ![anime_girl](https://github.com/go-skynet/LocalAI/assets/2420543/8aaca62a-e864-4011-98ae-dcc708103928)
 (Generated with [AnimagineXL](https://huggingface.co/Linaqruf/animagine-xl))

-LocalAI supports generating images with Stable diffusion, running on CPU using a C++ implementation, [Stable-Diffusion-NCNN](https://github.com/EdVince/Stable-Diffusion-NCNN) ([binding](https://github.com/mudler/go-stable-diffusion)) and [🧨 Diffusers]({{%relref "model-compatibility/diffusers" %}}).
+LocalAI supports generating images with Stable diffusion, running on CPU using C++ and Python implementations.

 ## Usage

@@ -35,7 +36,9 @@ curl http://localhost:8080/v1/images/generations -H "Content-Type: application/j
 }'
 ```

-## stablediffusion-cpp
+## Backends
+
+### stablediffusion-cpp

 | mode=0                                                                                                                | mode=1 (winograd/sgemm)                                                                                                                |
 |------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
@@ -45,7 +48,7 @@ curl http://localhost:8080/v1/images/generations -H "Content-Type: application/j

 Note: image generator supports images up to 512x512. You can use other tools however to upscale the image, for instance: https://github.com/upscayl/upscayl.

-### Setup
+#### Setup

 Note: In order to use the `images/generation` endpoint with the `stablediffusion` C++ backend, you need to build LocalAI with `GO_TAGS=stablediffusion`. If you are using the container images, it is already enabled.

@@ -128,11 +131,14 @@ models

 {{< /tabs >}}

-## Diffusers
+### Diffusers

-This is an extra backend - in the container is already available and there is nothing to do for the setup.
+[Diffusers](https://huggingface.co/docs/diffusers/index) is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. LocalAI has a diffusers backend which allows image generation using the `diffusers` library.

-### Model setup
+![anime_girl](https://github.com/go-skynet/LocalAI/assets/2420543/8aaca62a-e864-4011-98ae-dcc708103928)
+(Generated with [AnimagineXL](https://huggingface.co/Linaqruf/animagine-xl))
+
+#### Model setup

 The models will be downloaded the first time you use the backend from `huggingface` automatically.

@@ -150,3 +156,198 @@ diffusers:
  cuda: false # Enable for GPU usage (CUDA)
  scheduler_type: euler_a
 ```
+
+#### Dependencies
+
+This is an extra backend: it is already available in the container images and requires no additional setup. Do not use *core* images (ending with `-core`). If you are building manually, see the [build instructions]({{%relref "docs/getting-started/build" %}}).
+
+#### Model setup
+
+The models will be downloaded the first time you use the backend from `huggingface` automatically.
+
+Create a model configuration file in the `models` directory, for instance to use `Linaqruf/animagine-xl` (this example enables CUDA; set `cuda: false` to run on CPU):
+
+```yaml
+name: animagine-xl
+parameters:
+  model: Linaqruf/animagine-xl
+backend: diffusers
+cuda: true
+f16: true
+diffusers:
+  scheduler_type: euler_a
+```
+
+#### Local models
+
+You can also use local models, or modify some parameters like `clip_skip`, `scheduler_type`, for instance:
+
+```yaml
+name: stablediffusion
+parameters:
+  model: toonyou_beta6.safetensors
+backend: diffusers
+step: 30
+f16: true
+cuda: true
+diffusers:
+  pipeline_type: StableDiffusionPipeline
+  enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
+  scheduler_type: "k_dpmpp_sde"
+  cfg_scale: 8
+  clip_skip: 11
+```
+
+#### Configuration parameters
+
+The following parameters are available in the configuration file:
+
+| Parameter | Description | Default |
+| --- | --- | --- |
+| `f16` | Force the usage of `float16` instead of `float32` | `false` |
+| `step` | Number of steps to run the model for | `30` |
+| `cuda` | Enable CUDA acceleration | `false` |
+| `enable_parameters` | Parameters to enable for the model | `negative_prompt,num_inference_steps,clip_skip` |
+| `scheduler_type` | Scheduler type | `k_dpmpp_sde` |
+| `cfg_scale` | Classifier-free guidance (CFG) scale | `8` |
+| `clip_skip` | Clip skip | None |
+| `pipeline_type` | Pipeline type | `AutoPipelineForText2Image` |
+
+Several types of schedulers are available:
+
+| Scheduler | Description |
+| --- | --- |
+| `ddim` | DDIM |
+| `pndm` | PNDM |
+| `heun` | Heun |
+| `unipc` | UniPC |
+| `euler` | Euler |
+| `euler_a` | Euler a |
+| `lms` | LMS |
+| `k_lms` | LMS Karras |
+| `dpm_2` | DPM2 |
+| `k_dpm_2` | DPM2 Karras |
+| `dpm_2_a` | DPM2 a |
+| `k_dpm_2_a` | DPM2 a Karras |
+| `dpmpp_2m` | DPM++ 2M |
+| `k_dpmpp_2m` | DPM++ 2M Karras |
+| `dpmpp_sde` | DPM++ SDE |
+| `k_dpmpp_sde` | DPM++ SDE Karras |
+| `dpmpp_2m_sde` | DPM++ 2M SDE |
+| `k_dpmpp_2m_sde` | DPM++ 2M SDE Karras |
+
+Available pipeline types:
+
+| Pipeline type | Description |
+| --- | --- |
+| `StableDiffusionPipeline` | Stable diffusion pipeline |
+| `StableDiffusionImg2ImgPipeline` | Stable diffusion image to image pipeline |
+| `StableDiffusionDepth2ImgPipeline` | Stable diffusion depth to image pipeline |
+| `DiffusionPipeline` | Diffusion pipeline |
+| `StableDiffusionXLPipeline` | Stable diffusion XL pipeline |
+
+#### Usage
+
+#### Text to Image
+Use the `image` generation endpoint with the `model` name from the configuration file:
+
+```bash
+curl http://localhost:8080/v1/images/generations \
+    -H "Content-Type: application/json" \
+    -d '{
+      "prompt": "<positive prompt>|<negative prompt>", 
+      "model": "animagine-xl", 
+      "step": 51,
+      "size": "1024x1024" 
+    }'
+```
+
+#### Image to Image
+
+https://huggingface.co/docs/diffusers/using-diffusers/img2img
+
+An example model (GPU):
+```yaml
+name: stablediffusion-edit
+parameters:
+  model: nitrosocke/Ghibli-Diffusion
+backend: diffusers
+step: 25
+cuda: true
+f16: true
+diffusers:
+  pipeline_type: StableDiffusionImg2ImgPipeline
+  enable_parameters: "negative_prompt,num_inference_steps,image"
+```
+
+```bash
+IMAGE_PATH=/path/to/your/image
+(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
+curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations
+```
+
+#### Depth to Image
+
+https://huggingface.co/docs/diffusers/using-diffusers/depth2img
+
+```yaml
+name: stablediffusion-depth
+parameters:
+  model: stabilityai/stable-diffusion-2-depth
+backend: diffusers
+step: 50
+# Use the GPU (CUDA) with float16; set `cuda: false` to force CPU usage
+f16: true
+cuda: true
+diffusers:
+  pipeline_type: StableDiffusionDepth2ImgPipeline
+  enable_parameters: "negative_prompt,num_inference_steps,image"
+  cfg_scale: 6
+```
+
+```bash
+(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
+curl -H "Content-Type: application/json" -d @-  http://localhost:8080/v1/images/generations
+```
+
+#### img2vid
+
+
+```yaml
+name: img2vid
+parameters:
+  model: stabilityai/stable-video-diffusion-img2vid
+backend: diffusers
+step: 25
+# Use the GPU (CUDA) with float16; set `cuda: false` to force CPU usage
+f16: true
+cuda: true
+diffusers:
+  pipeline_type: StableVideoDiffusionPipeline
+```
+
+```bash
+(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
+curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
+```
+
+#### txt2vid
+
+```yaml
+name: txt2vid
+parameters:
+  model: damo-vilab/text-to-video-ms-1.7b
+backend: diffusers
+step: 25
+# Use the GPU (CUDA) with float16; set `cuda: false` to force CPU usage
+f16: true
+cuda: true
+diffusers:
+  pipeline_type: VideoDiffusionPipeline
+  cuda: true
+```
+
+```bash
+(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
+curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
+```
--- a/docs/content/docs/features/model-gallery.md
+++ b/docs/content/docs/features/model-gallery.md
@@ -2,7 +2,9 @@
 +++
 disableToc = false
 title = "🖼️ Model gallery"
-weight = 7
+
+weight = 18
+url = '/models'
 +++

 <h1 align="center">
@@ -15,13 +17,13 @@ The model gallery is a (experimental!) collection of models configurations for [

 LocalAI to ease out installations of models provide a way to preload models on start and downloading and installing them in runtime. You can install models manually by copying them over the `models` directory, or use the API to configure, download and verify the model assets for you. As the UI is still a work in progress, you will find here the documentation about the API Endpoints.

-{{% notice note %}}
+{{% alert note %}}
 The models in this gallery are not directly maintained by LocalAI. If you find a model that is not working, please open an issue on the model gallery repository.
-{{% /notice %}}
+{{% /alert %}}

-{{% notice note %}}
+{{% alert note %}}
 GPT and text generation models might have a license which is not permissive for commercial use or might be questionable or without any license at all. Please check the model license before using it. The official gallery contains only open licensed models.
-{{% /notice %}}
+{{% /alert %}}

 ## Useful Links and resources

@@ -48,7 +50,7 @@ GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.

 where `github:go-skynet/model-gallery/index.yaml` will be expanded automatically to `https://raw.githubusercontent.com/go-skynet/model-gallery/main/index.yaml`.

-{{% notice note %}}
+{{% alert note %}}

 As this feature is experimental, you need to run `local-ai` with a list of `GALLERIES`. Currently there are two galleries:

@@ -63,19 +65,19 @@ GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.

 If running with `docker-compose`, simply edit the `.env` file and uncomment the `GALLERIES` variable, and add the one you want to use.

-{{% /notice %}}
+{{% /alert %}}

-{{% notice note %}}
+{{% alert note %}}
 You might not find all the models in this gallery. Automated CI updates the gallery automatically. You can find however most of the models on huggingface (https://huggingface.co/), generally it should be available `~24h` after upload.

 By under any circumstances LocalAI and any developer is not responsible for the models in this gallery, as CI is just indexing them and providing a convenient way to install with an automatic configuration with a consistent API. Don't install models from authors you don't trust, and, check the appropriate license for your use case. Models are automatically indexed and hosted on huggingface (https://huggingface.co/). For any issue with the models, please open an issue on the model gallery repository if it's a LocalAI misconfiguration, otherwise refer to the huggingface repository. If you think a model should not be listed, please reach to us and we will remove it from the gallery.
-{{% /notice %}}
+{{% /alert %}}

-{{% notice note %}}
+{{% alert note %}}

 There is no documentation yet on how to build a gallery or a repository - but you can find an example in the [model-gallery](https://github.com/go-skynet/model-gallery) repository.

-{{% /notice %}}
+{{% /alert %}}


 ### List Models
@@ -117,7 +119,7 @@ where:
 - `bert-embeddings` is the model name in the gallery
  (read its [config here](https://github.com/go-skynet/model-gallery/blob/main/bert-embeddings.yaml)).

-{{% notice note %}}
+{{% alert note %}}
 If the `huggingface` model gallery is enabled (it's enabled by default),
 and the model has an entry in the model gallery's associated YAML config
 (for `huggingface`, see [`model-gallery/huggingface.yaml`](https://github.com/go-skynet/model-gallery/blob/main/huggingface.yaml)),
@@ -132,7 +134,7 @@ curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{
 ```

 Note that the `id` can be used similarly when pre-loading models at start.
-{{% /notice %}}
+{{% /alert %}}


 ## How to install a model (without a gallery)
@@ -217,7 +219,7 @@ YAML:

 </details>

-{{% notice note %}}
+{{% alert note %}}

 You can find already some open licensed models in the [model gallery](https://github.com/go-skynet/model-gallery).

@@ -241,7 +243,7 @@ curl $LOCALAI/models/apply -H "Content-Type: application/json" -d '{

 </details>

-{{% /notice %}}
+{{% /alert %}}

 ## Installing a model with a different name

--- a/docs/content/docs/features/openai-functions.md
+++ b/docs/content/docs/features/openai-functions.md
@@ -2,7 +2,8 @@
 +++
 disableToc = false
 title = "🔥 OpenAI functions"
-weight = 2
+weight = 17
+url = "/features/openai-functions/"
 +++

 LocalAI supports running OpenAI functions with `llama.cpp` compatible models.
@@ -67,13 +68,13 @@ response = openai.ChatCompletion.create(
 # ...
 ```

-{{% notice note %}}
+{{% alert note %}}
 When running the python script, be sure to:

 - Set `OPENAI_API_KEY` environment variable to a random string (the OpenAI api key is NOT required!)
 - Set `OPENAI_API_BASE` to point to your LocalAI service, for example `OPENAI_API_BASE=http://localhost:8080`

-{{% /notice %}}
+{{% /alert %}}

 ## Advanced

--- a/docs/content/docs/features/text-generation.md
+++ b/docs/content/docs/features/text-generation.md
@@ -0,0 +1,264 @@
+
+++
+disableToc = false
+title = "📖 Text generation (GPT)"
+weight = 10
+url = "/features/text-generation/"
+++
+
+LocalAI supports generating text with GPT models using `llama.cpp` and other backends (such as `rwkv.cpp`). See also the [Model compatibility]({{%relref "docs/reference/compatibility-table" %}}) table for an up-to-date list of the supported model families.
+
+Note:
+
+- You can also specify the model name as part of the OpenAI token.
+- If only one model is available, the API will use it for all the requests.
+
+## API Reference
+
+### Chat completions
+
+https://platform.openai.com/docs/api-reference/chat
+
+For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the instruction as the request body:
+
+```bash
+curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "ggml-koala-7b-model-q4_0-r2.bin",
+  "messages": [{"role": "user", "content": "Say this is a test!"}],
+  "temperature": 0.7
+}'
+```
+
+Available additional parameters: `top_p`, `top_k`, `max_tokens`
+
+### Edit completions
+
+https://platform.openai.com/docs/api-reference/edits
+
+To generate an edit completion you can send a POST request to the `/v1/edits` endpoint with the instruction as the request body:
+
+```bash
+curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
+  "model": "ggml-koala-7b-model-q4_0-r2.bin",
+  "instruction": "rephrase",
+  "input": "Black cat jumped out of the window",
+  "temperature": 0.7
+}'
+```
+
+Available additional parameters: `top_p`, `top_k`, `max_tokens`.
+
+### Completions
+
+https://platform.openai.com/docs/api-reference/completions
+
+To generate a completion, you can send a POST request to the `/v1/completions` endpoint with the instruction as the request body:
+
+```bash
+curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
+  "model": "ggml-koala-7b-model-q4_0-r2.bin",
+  "prompt": "A long time ago in a galaxy far, far away",
+  "temperature": 0.7
+}'
+```
+
+Available additional parameters: `top_p`, `top_k`, `max_tokens`
+
+### List models
+
+You can list all the models available with:
+
+```bash
+curl http://localhost:8080/v1/models
+```
+
+## Backends
+
+### AutoGPTQ
+
+[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) is an easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
+
+#### Prerequisites
+
+This is an extra backend. It is already available in the container images, so no setup is needed.
+
+If you are building LocalAI locally, you need to install [AutoGPTQ manually](https://github.com/PanQiWei/AutoGPTQ#quick-installation).
+
+
+#### Model setup
+
+Models are automatically downloaded from `huggingface` the first time they are used, if not already present. You can define models via a `YAML` config file, or just by querying the endpoint with the `huggingface` repository model name. For example, create a `YAML` config file in `models/`:
+
+```yaml
+name: orca
+backend: autogptq
+model_base_name: "orca_mini_v2_13b-GPTQ-4bit-128g.no-act.order"
+parameters:
+  model: "TheBloke/orca_mini_v2_13b-GPTQ"
+# ...
+```
+
+Test with:
+
+```bash
+curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+   "model": "orca",
+   "messages": [{"role": "user", "content": "How are you?"}],
+   "temperature": 0.1
+ }'
+```
+
+### RWKV
+
+A full example of how to run an RWKV model is in the [examples](https://github.com/go-skynet/LocalAI/tree/master/examples/rwkv).
+
+Note: RWKV models need to specify the `rwkv` backend in the YAML config file, and an associated tokenizer file must be provided alongside the model:
+
+```
+36464540 -rw-r--r--  1 mudler mudler 1.2G May  3 10:51 rwkv_small
+36464543 -rw-r--r--  1 mudler mudler 2.4M May  3 10:51 rwkv_small.tokenizer.json
+```
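+
+As a minimal sketch of such a config, assuming the model file is the `rwkv_small` file listed above with its tokenizer next to it as `rwkv_small.tokenizer.json`. The file name `models/rwkv.yaml` and the `name` value are hypothetical; only the `backend` and `parameters.model` keys follow the conventions used elsewhere on this page:
+
+```yaml
+# Hypothetical models/rwkv.yaml; adjust the names to match your files
+name: rwkv
+backend: rwkv
+parameters:
+  # Relative to the models path (assumed to match the listing above)
+  model: rwkv_small
+```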
+
+### llama.cpp
+
+[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.
+
+{{% alert note %}}
+
+The `ggml` file format has been deprecated. If you are using `ggml` models and configuring your model with a YAML file, use the `llama-ggml` backend instead. If you are relying on automatic detection of the model, you should be fine. For `gguf` models, use the `llama` backend. The Go backend is deprecated as well, but it is still available as `go-llama`. The Go backend still supports features not available in the mainline: speculative sampling and embeddings.
+
+{{% /alert %}}
+
+#### Features
+
+The `llama.cpp` backend supports the following features:
+- [📖 Text generation (GPT)]({{%relref "docs/features/text-generation" %}})
+- [🧠 Embeddings]({{%relref "docs/features/embeddings" %}})
+- [🔥 OpenAI functions]({{%relref "docs/features/openai-functions" %}})
+- [✍️ Constrained grammars]({{%relref "docs/features/constrained_grammars" %}})
+
+#### Setup
+
+LocalAI supports `llama.cpp` models out of the box. You can use the `llama.cpp` model in the same way as any other model. 
+
+##### Manual setup
+
+It is sufficient to copy the `ggml` or `gguf` model files into the `models` folder. You can then refer to the model in the `model` parameter of the API calls.
+
+[You can optionally create an associated YAML]({{%relref "docs/advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.
+
+Prompt templates are useful for models that are fine-tuned towards a specific prompt. 
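+
+A minimal sketch of a model config that applies a prompt template, assuming a `gguf` file named `file.gguf.bin` in the models path. The `name` value and the template text are illustrative only; use the prompt format your model was fine-tuned on:
+
+```yaml
+# Hypothetical models/my-model.yaml
+name: my-model
+backend: llama
+parameters:
+  # Relative to the models path
+  model: file.gguf.bin
+template:
+  # Illustrative prompt format; {{.Input}} is replaced with the user input
+  chat: |
+    Instruct: {{.Input}}
+    Output:
+```
+
+With a file like this in place, requests would use `my-model` as the `model` value instead of the raw file name.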
+
+##### Automatic setup
+
+LocalAI supports model galleries, which are indexes of models. For instance, the huggingface gallery contains a large curated index of `ggml` and `gguf` models from the huggingface model hub.
+
+If you have the galleries enabled and LocalAI already running, you can start chatting with models in huggingface by running:
+
+```bash
+curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+     "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
+     "messages": [{"role": "user", "content": "Say this is a test!"}],
+     "temperature": 0.1
+   }'
+```
+
+LocalAI will automatically download and configure the model in the `models` directory.
+
+Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "docs/features/model-gallery" %}}).
+
+#### YAML configuration
+
+To use the `llama.cpp` backend, specify `llama` as the backend in the YAML file:
+
+```yaml
+name: llama
+backend: llama
+parameters:
+  # Relative to the models path
+  model: file.gguf.bin
+```
+
+In the example above, we specify `llama` as the backend to restrict loading to `gguf` models only.
+
+For instance, to use the `llama-ggml` backend for `ggml` models:
+
+```yaml
+name: llama
+backend: llama-ggml
+parameters:
+  # Relative to the models path
+  model: file.ggml.bin
+```
+
+#### Reference
+
+- [llama](https://github.com/ggerganov/llama.cpp)
+- [binding](https://github.com/go-skynet/go-llama.cpp)
+
+
+### exllama/2
+
+[Exllama](https://github.com/turboderp/exllama) is "a more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". Both `exllama` and `exllama2` are supported.
+
+#### Model setup
+
+Download the model as a folder inside the `models` directory and create a YAML file specifying the `exllama` backend. For instance, with the `TheBloke/WizardLM-7B-uncensored-GPTQ` model:
+
+```
+$ git lfs install
+$ git clone https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ models/WizardLM-7B-uncensored-GPTQ
+$ ls models/
+.keep                        WizardLM-7B-uncensored-GPTQ/ exllama.yaml
+$ cat models/exllama.yaml
+name: exllama
+parameters:
+  model: WizardLM-7B-uncensored-GPTQ
+backend: exllama
+# Note: you can also specify "exllama2" if it's an exllama2 model here
+# ...
+```
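+
+For an `exllama2` model, the config would differ only in the backend name, as the comment above notes. A minimal sketch, where the `name` value and the model folder name are hypothetical placeholders:
+
+```yaml
+# Hypothetical models/exllama2.yaml
+name: exllama2
+backend: exllama2
+parameters:
+  # Placeholder: folder name of an exllama2 model inside models/
+  model: my-exllama2-model
+```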
+
+Test with:
+
+```bash
+curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+   "model": "exllama",
+   "messages": [{"role": "user", "content": "How are you?"}],
+   "temperature": 0.1
+ }'
+```
+
+### vLLM
+
+[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.
+
+LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out `vllm` performance [here](https://github.com/vllm-project/vllm#performance).
+
+#### Setup
+
+Create a YAML file for the model you want to use with `vllm`.
+
+To set up a model, just specify the model name in the YAML config file:
+```yaml
+name: vllm
+backend: vllm
+parameters:
+    model: "facebook/opt-125m"
+
+# Uncomment to specify a quantization method (optional)
+# quantization: "awq"
+```
+
+The backend will automatically download the required files in order to run the model.
+
+
+#### Usage
+
+Use the `completions` endpoint by specifying the `vllm` backend:
+```bash
+curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
+   "model": "vllm",
+   "prompt": "Hello, my name is",
+   "temperature": 0.1, "top_p": 0.1
+ }'
+```
--- a/docs/content/docs/features/text-to-audio.md
+++ b/docs/content/docs/features/text-to-audio.md
@@ -0,0 +1,159 @@
+
+++
+disableToc = false
+title = "🗣 Text to audio (TTS)"
+weight = 11
+url = "/features/text-to-audio/"
+++
+
+The `/tts` endpoint can be used to generate speech from text.
+
+## Usage
+
+Input: `input`, `model`
+
+For example, to generate an audio file, you can send a POST request to the `/tts` endpoint with the instruction as the request body:
+
+```bash
+curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
+  "input": "Hello world",
+  "model": "tts"
+}'
+```
+
+Returns an `audio/wav` file.
+
+
+## Backends
+
+### 🐸 Coqui
+
+Required: Don't use `LocalAI` images ending with the `-core` tag. Python dependencies are required in order to use this backend.
+
+Coqui works without any configuration. To test it, you can run the following curl command:
+
+```bash
+curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
+  "backend": "coqui",
+  "model": "tts_models/en/ljspeech/glow-tts",
+  "input": "Hello, this is a test!"
+}'
+```
+
+### Bark
+
+[Bark](https://github.com/suno-ai/bark) generates audio from text prompts.
+
+This is an extra backend. It is already available in the container image, so no setup is needed.
+
+#### Model setup
+
+There is nothing to be done for the model setup. You can already start to use bark. The models will be downloaded the first time you use the backend.
+
+#### Usage
+
+Use the `tts` endpoint by specifying the `bark` backend:
+
+```bash
+curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
+     "backend": "bark",
+     "input":"Hello!"
+   }' | aplay
+```
+
+To specify a voice from the [Bark voice presets](https://github.com/suno-ai/bark#-voice-presets) ([full list](https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c)), use the `model` parameter:
+
+```bash
+curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
+     "backend": "bark",
+     "input":"Hello!",
+     "model": "v2/en_speaker_4"
+   }' | aplay
+```
+
+### Piper
+
+To install the `piper` audio models manually:
+
+- Download voices from https://github.com/rhasspy/piper/releases/tag/v0.0.2
+- Extract the `.tar.tgz` archives (which contain the `.onnx` and `.json` files) inside `models`
+- Run the command below to test that the model is working
+
+To use the `tts` endpoint, specify a backend with the `backend` parameter. For example, to use the `piper` backend:
+```bash
+curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
+  "model":"it-riccardo_fasol-x-low.onnx",
+  "backend": "piper",
+  "input": "Ciao, sono Ettore"
+}' | aplay
+```
+
+Note:
+
+- `aplay` is a Linux command. You can use other tools to play the audio file.
+- The model name is the filename with the extension.
+- The model name is case sensitive.
+- LocalAI must be compiled with the `GO_TAGS=tts` flag.
+
+### Transformers-musicgen
+
+LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
+
+```
+curl --request POST \
+  --url http://localhost:8080/tts \
+  --header 'Content-Type: application/json' \
+  --data '{
+    "backend": "transformers-musicgen",
+    "model": "facebook/musicgen-medium",
+    "input": "Cello Rave"
+}' | aplay
+```
+
+Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
+
+### Vall-E-X
+
+[VALL-E-X](https://github.com/Plachtaa/VALL-E-X) is an open source implementation of Microsoft's VALL-E X zero-shot TTS model.
+
+#### Setup
+
+The backend will automatically download the required files in order to run the model.
+
+This is an extra backend. It is already available in the container image, so no setup is needed. If you are building manually, you need to install Vall-E-X manually first.
+
+#### Usage
+
+Use the `tts` endpoint by specifying the `vall-e-x` backend:
+
+```bash
+curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
+     "backend": "vall-e-x",
+     "input":"Hello!"
+   }' | aplay
+```
+
+#### Voice cloning
+
+In order to use voice cloning capabilities, you must create a `YAML` configuration file to set up a model:
+
+```yaml
+name: cloned-voice
+backend: vall-e-x
+parameters:
+  model: "cloned-voice"
+vall-e:
+  # The path to the audio file to be cloned
+  # relative to the models directory 
+  audio_path: "path-to-wav-source.wav"
+```
+
+Then you can specify the model name in the requests:
+
+```bash
+curl http://localhost:8080/tts -H "Content-Type: application/json" -d '{
+     "backend": "vall-e-x",
+     "model": "cloned-voice",
+     "input":"Hello!"
+   }' | aplay
+```
--- a/docs/content/docs/getting-started/_index.en.md
+++ b/docs/content/docs/getting-started/_index.en.md
@@ -0,0 +1,7 @@
+
+++
+disableToc = false
+title = "Getting started"
+weight = 2
+icon = "rocket_launch"
+++
--- a/docs/content/docs/getting-started/build.md
+++ b/docs/content/docs/getting-started/build.md
@@ -1,22 +1,27 @@

 +++
 disableToc = false
-title = "Build"
-weight = 5
+title = "Build LocalAI from source"
+weight = 6
 url = '/basics/build/'
-
+icon = "rocket_launch"
 +++

-### Build locally
+### Build
+
+LocalAI can be built as a container image or as a single, portable binary. Note that some model architectures might require Python libraries, which are not included in the binary. The binary contains only the core backends written in Go and C++.
+
+LocalAI's extensible architecture allows you to add your own backends, which can be written in any language. For this reason, the container images also contain the Python dependencies needed to run all the available backends (for example, backends like __Diffusers__, which generates images and videos from text).
+
+In some cases you might want to re-build LocalAI from source (for instance to leverage Apple Silicon acceleration), or to build a custom container image with your own backends. This section contains instructions on how to build LocalAI from source.
+
+#### Container image

 Requirements:

-Either Docker/podman, or
- Golang >= 1.21
- Cmake/make
- GCC
+- Docker, Podman, or another container engine

-In order to build the `LocalAI` container image locally you can use `docker`:
+In order to build the `LocalAI` container image locally you can use `docker`, for example:

 ```
 # build the image
@@ -24,7 +29,47 @@ docker build -t localai .
 docker run localai
 ```

-Or you can build the manually binary with `make`:
+#### Build LocalAI locally
+
+##### Requirements
+
+In order to build LocalAI locally, you need the following:
+
+- Golang >= 1.21
+- Cmake/make
+- GCC
+- gRPC
+
+To install the dependencies follow the instructions below:
+
+{{< tabs tabTotal="3"  >}}
+{{% tab tabName="Apple" %}}
+
+```bash
+brew install abseil cmake go grpc protobuf wget
+```
+
+{{% /tab %}}
+{{% tab tabName="Debian" %}}
+
+```bash
+apt install golang protobuf-compiler-grpc libgrpc-dev make cmake
+```
+
+{{% /tab %}}
+{{% tab tabName="From source" %}}
+
+Specify `BUILD_GRPC_FOR_BACKEND_LLAMA=true` to automatically build the gRPC dependencies:
+
+```bash
+make ... BUILD_GRPC_FOR_BACKEND_LLAMA=true build
+```
+
+{{% /tab %}}
+{{< /tabs >}}
+
+##### Build
+To build LocalAI with `make`:

 ```
 git clone https://github.com/go-skynet/LocalAI
@@ -32,9 +77,19 @@ cd LocalAI
 make build
 ```

-To run: `./local-ai`
+This should produce the binary `local-ai`.

-{{% notice note %}}
+The following variables can be used to customize the build:
+
+| Variable | Default | Description |
+| ---------------------| ------- | ----------- |
+| `BUILD_TYPE`         |   None      | Build type. Available: `cublas`, `openblas`, `clblas`, `metal`,`hipblas` |
+| `GO_TAGS`            |   `tts stablediffusion`      | Go tags. Available: `stablediffusion`, `tts`, `tinydream` |
+| `CLBLAST_DIR`        |         | Specify a CLBlast directory |
+| `CUDA_LIBPATH`       |         | Specify a CUDA library path |
+| `BUILD_API_ONLY` | false | Set to true to build only the API (no backends will be built) |
+
+{{% alert note %}}

 #### CPU flagset compatibility

@@ -52,9 +107,9 @@ docker run  quay.io/go-skynet/localai
 docker run --rm -ti -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -e REBUILD=true -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_AVX=OFF -DLLAMA_FMA=OFF" -v $PWD/models:/models quay.io/go-skynet/local-ai:latest
 ```

-{{% /notice %}}
+{{% /alert %}}

-### Build on mac
+### Example: Build on mac

 Building on Mac (M1 or M2) works, but you may need to install some prerequisites using `brew`. 

@@ -96,7 +151,7 @@ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/jso

 **Requirements**: OpenCV, Gomp

-Image generation is experimental and requires `GO_TAGS=stablediffusion` to be set during build:
+Image generation requires `GO_TAGS=stablediffusion` or `GO_TAGS=tinydream` to be set during build:

 ```
 make GO_TAGS=stablediffusion build
@@ -114,15 +169,6 @@ make GO_TAGS=tts build

 ### Acceleration

-List of the variables available to customize the build:
-
-| Variable | Default | Description |
-| ---------------------| ------- | ----------- |
-| `BUILD_TYPE`         |   None      | Build type. Available: `cublas`, `openblas`, `clblas`, `metal`,`hipblas` |
-| `GO_TAGS`            |   `tts stablediffusion`      | Go tags. Available: `stablediffusion`, `tts` |
-| `CLBLAST_DIR`        |         | Specify a CLBlast directory |
-| `CUDA_LIBPATH`       |         | Specify a CUDA library path |
-
 #### OpenBLAS

 Software acceleration.
@@ -179,7 +225,7 @@ make BUILD_TYPE=clblas build

 To specify a clblast dir set: `CLBLAST_DIR`

-### Metal (Apple Silicon)
+#### Metal (Apple Silicon)

 ```
 make BUILD_TYPE=metal build
@@ -188,6 +234,29 @@ make BUILD_TYPE=metal build
 # Note: only models quantized with q4_0 are supported!
 ```

+
 ### Windows compatibility

 Make sure to give enough resources to the running container. See https://github.com/go-skynet/LocalAI/issues/2
+
+### Examples
+
+More advanced build options are available, for instance to build only a single backend.
+
+#### Build only a single backend
+
+You can control the backends that are built by setting the `GRPC_BACKENDS` environment variable. For instance, to build only the `llama-cpp` backend:
+
+```bash
+make GRPC_BACKENDS=backend-assets/grpc/llama-cpp build
+```
+
+By default, all the backends are built.
+
+#### Specific llama.cpp version
+
+To build with a specific version of llama.cpp, set `CPPLLAMA_VERSION` to the desired tag or commit SHA:
+
+```
+CPPLLAMA_VERSION=<sha> make build
+```
--- a/docs/content/docs/getting-started/customize-model.md
+++ b/docs/content/docs/getting-started/customize-model.md
@@ -0,0 +1,73 @@
+++
+disableToc = false
+title = "Customizing the Model"
+weight = 4
+icon = "rocket_launch"
+
+++
+
+To customize the prompt template or the default settings of the model, use a configuration file. This file must follow the LocalAI YAML configuration format. For the full syntax, refer to the [advanced documentation]({{%relref "docs/advanced" %}}). The configuration file can be located either on the local filesystem or at a remote URL (such as a GitHub Gist).
+
+LocalAI can be started using either its container image or binary, with a command that includes URLs of model config files or a shorthand format (like `huggingface://` or `github://`), which is then expanded into complete URLs.
+
+The configuration can also be set via an environment variable. For instance:
+
+```
+# Command-Line Arguments
+local-ai github://owner/repo/file.yaml@branch
+
+# Environment Variable
+MODELS="github://owner/repo/file.yaml@branch,github://owner/repo/file.yaml@branch" local-ai
+```
+
+Here's an example that starts the **phi-2** model:
+
+```bash
+docker run -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core https://gist.githubusercontent.com/mudler/ad601a0488b497b69ec549150d9edd18/raw/a8a8869ef1bb7e3830bf5c0bae29a0cce991ff8d/phi-2.yaml
+```
+
+You can also check all the embedded model configurations [here](https://github.com/mudler/LocalAI/tree/master/embedded/models).
+
+{{% alert icon="" %}}
+The model configurations used in the quickstart are accessible here: [https://github.com/mudler/LocalAI/tree/master/embedded/models](https://github.com/mudler/LocalAI/tree/master/embedded/models). Contributions are welcome; please feel free to submit a Pull Request.
+
+The `phi-2` model configuration from the quickstart is expanded from [https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml](https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml).
+{{% /alert %}}
+
+## Example: Customizing the Prompt Template
+
+To modify the prompt template, create a GitHub gist or a Pastebin file, and copy the content from [https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml](https://github.com/mudler/LocalAI/blob/master/examples/configurations/phi-2.yaml). Alter the fields as needed:
+
+```yaml
+name: phi-2
+context_size: 2048
+f16: true
+threads: 11
+gpu_layers: 90
+mmap: true
+parameters:
+  # Reference any HF model or a local file here
+  model: huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
+  temperature: 0.2
+  top_k: 40
+  top_p: 0.95
+template:
+  
+  chat: &template |
+    Instruct: {{.Input}}
+    Output:
+  # Modify the prompt template here ^^^ as per your requirements
+  completion: *template
+```
+
+Then, launch LocalAI using your gist's URL:
+
+```bash
+## Important! Substitute with your gist's URL!
+docker run -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core https://gist.githubusercontent.com/xxxx/phi-2.yaml
+```
+
+## Next Steps
+
+- Visit the [advanced section]({{%relref "docs/advanced" %}}) for more insights on prompt templates and configuration files.
+- To learn about fine-tuning an LLM model, check out the [fine-tuning section]({{%relref "docs/advanced/fine-tuning" %}}).
--- a/docs/content/docs/getting-started/manual.md
+++ b/docs/content/docs/getting-started/manual.md
@@ -0,0 +1,166 @@
+ 
+++
+disableToc = false
+title = "Run models manually"
+weight = 5
+icon = "rocket_launch"
+
+++
+
+
+1. Ensure you have a model file, a configuration YAML file, or both. Customize model defaults and specific settings with a configuration file. For advanced configurations, refer to the [Advanced Documentation](docs/advanced).
+
+2. For GPU Acceleration instructions, visit [GPU acceleration](docs/features/gpu-acceleration).
+
+{{< tabs tabTotal="5" >}}
+{{% tab tabName="Docker" %}}
+
+```bash
+# Prepare the `models` directory
+mkdir models
+
+# copy your models to it
+cp your-model.gguf models/
+
+# run the LocalAI container
+docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
+# You should see:
+# 
+# ┌───────────────────────────────────────────────────┐
+# │                   Fiber v2.42.0                   │
+# │               http://127.0.0.1:8080               │
+# │       (bound on host 0.0.0.0 and port 8080)       │
+# │                                                   │
+# │ Handlers ............. 1  Processes ........... 1 │
+# │ Prefork ....... Disabled  PID ................. 1 │
+# └───────────────────────────────────────────────────┘
+
+# Try the endpoint with curl
+curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
+     "model": "your-model.gguf",
+     "prompt": "A long time ago in a galaxy far, far away",
+     "temperature": 0.7
+   }'
+```
+
+{{% alert icon="💡" %}}
+
+**Other Docker Images**:
+
+For other Docker images, please see the table in
+https://localai.io/basics/getting_started/#container-images.
+
+{{% /alert %}}
+
+Here is a more specific example:
+
+```bash
+mkdir models
+
+# Download luna-ai-llama2 to models/
+wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2
+
+# Use a template from the examples
+cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl
+
+docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
+
+# Now API is accessible at localhost:8080
+curl http://localhost:8080/v1/models
+# {"object":"list","data":[{"id":"luna-ai-llama2","object":"model"}]}
+
+curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
+     "model": "luna-ai-llama2",
+     "messages": [{"role": "user", "content": "How are you?"}],
+     "temperature": 0.9
+   }'
+# {"model":"luna-ai-llama2","choices":[{"message":{"role":"assistant","content":"I'm doing well, thanks. How about you?"}}]}
+```
+
+{{% alert note %}}
+- If running on Apple Silicon (ARM) it is **not** suggested to run on Docker due to emulation. Follow the [build instructions]({{%relref "docs/getting-started/build" %}}) to use Metal acceleration for full GPU support.
+- If you are running on Apple x86_64, you can use `docker`; there is no additional gain in building it from source.
+{{% /alert %}}
+
+{{% /tab %}}
+{{% tab tabName="Docker compose" %}}
+
+```bash
+# Clone LocalAI
+git clone https://github.com/go-skynet/LocalAI
+
+cd LocalAI
+
+# (optional) Checkout a specific LocalAI tag
+# git checkout -b build <TAG>
+
+# copy your models to models/
+cp your-model.gguf models/
+
+# (optional) Edit the .env file to set things like context size and threads
+# vim .env
+
+# start with docker compose
+docker compose up -d --pull always
+# or you can build the images with:
+# docker compose up -d --build
+
+# Now API is accessible at localhost:8080
+curl http://localhost:8080/v1/models
+# {"object":"list","data":[{"id":"your-model.gguf","object":"model"}]}
+
+curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
+     "model": "your-model.gguf",
+     "prompt": "A long time ago in a galaxy far, far away",
+     "temperature": 0.7
+   }'
+```
+
+{{% alert icon="💡" %}}
+
+**Other Docker Images**:
+
+For other Docker images, please see the table in
+https://localai.io/basics/getting_started/#container-images.
+
+{{% /alert %}}
+
+Note: If you are on Windows, please make sure the project is on the Linux filesystem, otherwise loading models might be slow. For more information, see the [Microsoft Docs](https://learn.microsoft.com/en-us/windows/wsl/filesystems).
+
+{{% /tab %}}
+
+{{% tab tabName="Kubernetes" %}}
+
+For installing LocalAI in Kubernetes, you can use the following helm chart:
+
+```bash
+# Install the helm repository
+helm repo add go-skynet https://go-skynet.github.io/helm-charts/
+# Update the repositories
+helm repo update
+# Get the values
+helm show values go-skynet/local-ai > values.yaml
+
+# Edit the values if needed
+# vim values.yaml ...
+
+# Install the helm chart
+helm install local-ai go-skynet/local-ai -f values.yaml
+```
+
+{{% /tab %}}
+{{% tab tabName="From binary" %}}
+
+LocalAI binary releases are available on [GitHub](https://github.com/go-skynet/LocalAI/releases).
+
+{{% /tab %}}
+
+{{% tab tabName="From source" %}}
+
+See the [build section]({{%relref "docs/getting-started/build" %}}).
+  
+{{% /tab %}}
+
+{{< /tabs >}}
+
+For more model configurations, visit the [Examples Section](https://github.com/mudler/LocalAI/tree/master/examples/configurations).
--- a/docs/content/docs/getting-started/quickstart.md
+++ b/docs/content/docs/getting-started/quickstart.md
@@ -0,0 +1,199 @@
+ 
+++
+disableToc = false
+title = "Quickstart"
+weight = 3
+url = '/basics/getting_started/'
+icon = "rocket_launch"
+
+++
+
+**LocalAI** is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that's compatible with OpenAI API specifications for local inferencing. It allows you to run [LLMs]({{%relref "docs/features/text-generation" %}}) and generate images, audio, and more, locally or on-prem with consumer grade hardware, supporting multiple model families and architectures.
+
+## Installation Methods
+
+LocalAI is available as a container image and binary, compatible with various container engines like Docker, Podman, and Kubernetes. Container images are published on [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest) and [Docker Hub](https://hub.docker.com/r/localai/localai). Binaries can be downloaded from [GitHub](https://github.com/mudler/LocalAI/releases).
+
+
+{{% alert icon="💡" %}}
+
+**Hardware Requirements:** The hardware requirements for LocalAI vary based on the model size and quantization method used. For performance benchmarks with different backends, such as `llama.cpp`, visit [this link](https://github.com/ggerganov/llama.cpp#memorydisk-requirements). The `rwkv` backend is noted for its lower resource consumption.
+
+{{% /alert %}}
+
+## Prerequisites
+
+Before you begin, ensure you have a container engine installed if you are not using the binaries. Suitable options include Docker or Podman. For installation instructions, refer to the following guides:
+
+- [Install Docker Desktop (Mac, Windows, Linux)](https://docs.docker.com/get-docker/)
+- [Install Podman (Linux)](https://podman.io/getting-started/installation)
+- [Install Docker engine (Servers)](https://docs.docker.com/engine/install/#get-started)
+
+
+## Running Models
+
+> _Do you already have a model file? Skip to [Run models manually]({{%relref "docs/getting-started/manual" %}})_.
+
+LocalAI allows one-click runs with popular models. It downloads the model and starts the API with the model loaded. 
+
+There are different categories of models: [LLMs]({{%relref "docs/features/text-generation" %}}), [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}), [Embeddings]({{%relref "docs/features/embeddings" %}}), [Audio to Text]({{%relref "docs/features/audio-to-text" %}}), and [Text to Audio]({{%relref "docs/features/text-to-audio" %}}), depending on the backend being used and the model architecture.
+
+{{% alert icon="💡" %}}
+
+To customize the models, see [Model customization]({{%relref "docs/getting-started/customize-model" %}}). For more model configurations, visit the [Examples Section](https://github.com/mudler/LocalAI/tree/master/examples/configurations). The configurations for the models below are available [here](https://github.com/mudler/LocalAI/tree/master/embedded/models).
+{{% /alert %}}
+
+{{< tabs tabTotal="3" >}}
+{{% tab tabName="CPU-only" %}}
+
+> 💡 Don't need GPU acceleration? Use the CPU images, which are lighter and do not have Nvidia dependencies.
+
+| Model | Category | Docker command |
+| --- | --- | --- |
+| [phi-2](https://huggingface.co/microsoft/phi-2) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core phi-2``` |
+| 🌋 [llava](https://github.com/SkunkworksAI/BakLLaVA) | [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core llava``` |
+| [mistral-openorca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core mistral-openorca``` |
+| [bert-cpp](https://github.com/skeskinen/bert.cpp) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core bert-cpp``` |
+| [all-minilm-l6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg all-minilm-l6-v2``` |
+| whisper-base | [Audio to Text]({{%relref "docs/features/audio-to-text" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core whisper-base``` |
+| rhasspy-voice-en-us-amy | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core rhasspy-voice-en-us-amy``` |
+| 🐸 [coqui](https://github.com/coqui-ai/TTS) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg coqui``` |
+| 🐶 [bark](https://github.com/suno-ai/bark) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg bark``` |
+| 🔊 [vall-e-x](https://github.com/Plachtaa/VALL-E-X)  | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg vall-e-x``` |
+| mixtral-instruct Mixtral-8x7B-Instruct-v0.1 | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core mixtral-instruct``` |
+| [tinyllama-chat](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF) [original model](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core tinyllama-chat``` |
+| [dolphin-2.5-mixtral-8x7b](https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core dolphin-2.5-mixtral-8x7b``` |
+| 🐍 [mamba](https://github.com/state-spaces/mamba) | [LLM]({{%relref "docs/features/text-generation" %}}) | GPU-only |
+| animagine-xl | [Text to Image]({{%relref "docs/features/image-generation" %}}) | GPU-only |
+| transformers-tinyllama | [LLM]({{%relref "docs/features/text-generation" %}}) | GPU-only |
+| [codellama-7b](https://huggingface.co/codellama/CodeLlama-7b-hf) (with transformers) | [LLM]({{%relref "docs/features/text-generation" %}}) | GPU-only |
+| [codellama-7b-gguf](https://huggingface.co/TheBloke/CodeLlama-7B-GGUF) (with llama.cpp) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core codellama-7b-gguf``` |
+{{% /tab %}}
+{{% tab tabName="GPU (CUDA 11)" %}}
+
+
+> To find out which version of CUDA you have available, run `nvidia-smi` or `nvcc --version`. See also [GPU acceleration]({{%relref "docs/features/gpu-acceleration" %}}).
+
+| Model | Category | Docker command |
+| --- | --- | --- |
+| [phi-2](https://huggingface.co/microsoft/phi-2) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core phi-2``` |
+| 🌋 [llava](https://github.com/SkunkworksAI/BakLLaVA) | [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core llava``` |
+| [mistral-openorca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core mistral-openorca``` |
+| [bert-cpp](https://github.com/skeskinen/bert.cpp) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core bert-cpp``` |
+| [all-minilm-l6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 all-minilm-l6-v2``` |
+| whisper-base | [Audio to Text]({{%relref "docs/features/audio-to-text" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core whisper-base``` |
+| rhasspy-voice-en-us-amy | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core rhasspy-voice-en-us-amy``` |
+| 🐸 [coqui](https://github.com/coqui-ai/TTS) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 coqui``` |
+| 🐶 [bark](https://github.com/suno-ai/bark) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 bark``` |
+| 🔊 [vall-e-x](https://github.com/Plachtaa/VALL-E-X) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 vall-e-x``` |
+| mixtral-instruct Mixtral-8x7B-Instruct-v0.1 | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core mixtral-instruct``` |
+| [tinyllama-chat](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF) [original model](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core tinyllama-chat``` |
+| [dolphin-2.5-mixtral-8x7b](https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core dolphin-2.5-mixtral-8x7b``` |
+| 🐍 [mamba](https://github.com/state-spaces/mamba) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 mamba-chat``` |
+| animagine-xl | [Text to Image]({{%relref "docs/features/image-generation" %}}) |  ```docker run -ti -p 8080:8080 -e COMPEL=0 --gpus all localai/localai:{{< version >}}-cublas-cuda11 animagine-xl``` |
+| transformers-tinyllama | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 transformers-tinyllama``` |
+| [codellama-7b](https://huggingface.co/codellama/CodeLlama-7b-hf) | [LLM]({{%relref "docs/features/text-generation" %}})  | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11 codellama-7b``` |
+| [codellama-7b-gguf](https://huggingface.co/TheBloke/CodeLlama-7B-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}})  | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda11-core codellama-7b-gguf``` |
+{{% /tab %}}
+
+
+{{% tab tabName="GPU (CUDA 12)" %}}
+
+> To find out which version of CUDA you have available, run `nvidia-smi` or `nvcc --version`. See also [GPU acceleration]({{%relref "docs/features/gpu-acceleration" %}}).
+
+| Model | Category | Docker command |
+| --- | --- | --- |
+| [phi-2](https://huggingface.co/microsoft/phi-2) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core phi-2``` |
+| 🌋 [llava](https://github.com/SkunkworksAI/BakLLaVA) | [Multimodal LLM]({{%relref "docs/features/gpt-vision" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core llava``` |
+| [mistral-openorca](https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core mistral-openorca``` |
+| [bert-cpp](https://github.com/skeskinen/bert.cpp) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core bert-cpp``` |
+| [all-minilm-l6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | [Embeddings]({{%relref "docs/features/embeddings" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 all-minilm-l6-v2``` |
+| whisper-base | [Audio to Text]({{%relref "docs/features/audio-to-text" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core whisper-base``` |
+| rhasspy-voice-en-us-amy | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core rhasspy-voice-en-us-amy``` |
+| 🐸 [coqui](https://github.com/coqui-ai/TTS) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 coqui``` |
+| 🐶 [bark](https://github.com/suno-ai/bark) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 bark``` |
+| 🔊 [vall-e-x](https://github.com/Plachtaa/VALL-E-X) | [Text to Audio]({{%relref "docs/features/text-to-audio" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 vall-e-x``` |
+| mixtral-instruct Mixtral-8x7B-Instruct-v0.1 | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core mixtral-instruct``` |
+| [tinyllama-chat](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v0.3-GGUF) [original model](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.3) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core tinyllama-chat``` |
+| [dolphin-2.5-mixtral-8x7b](https://huggingface.co/TheBloke/dolphin-2.5-mixtral-8x7b-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core dolphin-2.5-mixtral-8x7b``` |
+| 🐍 [mamba](https://github.com/state-spaces/mamba) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 mamba-chat``` |
+| animagine-xl | [Text to Image]({{%relref "docs/features/image-generation" %}}) | ```docker run -ti -p 8080:8080 -e COMPEL=0 --gpus all localai/localai:{{< version >}}-cublas-cuda12 animagine-xl``` |
+| transformers-tinyllama | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 transformers-tinyllama``` |
+| [codellama-7b](https://huggingface.co/codellama/CodeLlama-7b-hf) | [LLM]({{%relref "docs/features/text-generation" %}}) | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12 codellama-7b``` |
+| [codellama-7b-gguf](https://huggingface.co/TheBloke/CodeLlama-7B-GGUF) | [LLM]({{%relref "docs/features/text-generation" %}})  | ```docker run -ti -p 8080:8080 --gpus all localai/localai:{{< version >}}-cublas-cuda12-core codellama-7b-gguf``` |
+{{% /tab %}}
+
+{{< /tabs >}}
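+
+Once a container from the tables above is running, the API can be exercised with any OpenAI-compatible client. The following is a minimal sketch, assuming the instance was started with `phi-2` and is reachable at `localhost:8080`; the request body follows the OpenAI chat completions format, and the prompt text and `temperature` value are purely illustrative.
+
+```bash
+# Assumes a LocalAI container started with the phi-2 model,
+# with port 8080 published on the local machine.
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "phi-2",
+    "messages": [{"role": "user", "content": "How are you?"}],
+    "temperature": 0.1
+  }'
+```
+
+The `model` value is expected to match the model name passed to `docker run`; for the other categories (embeddings, audio, images) the corresponding OpenAI-style endpoints described in the feature pages apply instead.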
+
+{{% alert icon="💡" %}}
+**Tip:** You can specify multiple models to start an instance with all of them loaded. For example, to have both llava and phi-2 configured:
+
+```bash
+docker run -ti -p 8080:8080 localai/localai:{{< version >}}-ffmpeg-core llava phi-2
+```
+
+{{% /alert %}}
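+
+To check which models an instance has configured, the OpenAI-compatible model listing endpoint can be queried. This is a sketch that assumes the instance from the tip above is listening on `localhost:8080`:
+
+```bash
+# Lists the models known to the running instance (assumed host and port).
+curl http://localhost:8080/v1/models
+```
+
+With the command from the tip, both `llava` and `phi-2` would be expected to appear in the returned list.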
+
+## Container images
+
+LocalAI provides a variety of images to support different environments. These images are available on [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags) and [Docker Hub](https://hub.docker.com/r/localai/localai).
+
+For GPU acceleration on Nvidia graphics cards, use the Nvidia/CUDA images. If you don't have a GPU, use the CPU images. If you have AMD hardware or Apple Silicon, see the [build section]({{%relref "docs/getting-started/build" %}}).
+
+{{% alert icon="💡" %}}
+
+**Available Image Types**:
+
+- Images ending with `-core` are smaller images without pre-downloaded Python dependencies. Use these images if you plan to use the `llama.cpp`, `stablediffusion-ncn`, `tinydream` or `rwkv` backends. If you are not sure which one to use, do **not** use these images.
+- FFMpeg is **not** included in the default images due to [its licensing](https://www.ffmpeg.org/legal.html). If you need FFMpeg, use the images ending with `-ffmpeg`. Note that `ffmpeg` is required by LocalAI's `audio-to-text` features.
+- If you are using an old CPU and no GPU, you might need to set the `REBUILD` environment variable to `true`, along with options to disable the flags that your CPU does not support. Note, however, that inference will be slow. See also [flagset compatibility]({{%relref "docs/getting-started/build#cpu-flagset-compatibility" %}}).
+
+{{% /alert %}}
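+
+As a rough sketch of the last point above, a rebuild on an older CPU could be requested as follows. Only `REBUILD` is taken from this page; the `CMAKE_ARGS` variable and the specific `-DLLAMA_*` options are assumptions based on llama.cpp's CMake flags, and should be checked against the flagset compatibility section before use.
+
+```bash
+# REBUILD=true asks the container to rebuild the binaries at startup.
+# CMAKE_ARGS and the flags below are assumptions: disable only the
+# instruction sets that your CPU actually lacks.
+docker run -ti -p 8080:8080 \
+  -e REBUILD=true \
+  -e CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF" \
+  localai/localai:{{< version >}}-ffmpeg-core phi-2
+```
+
+Rebuilding happens when the container starts, so the first start is likely to take considerably longer than with the prebuilt binaries.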
+
+{{< tabs tabTotal="3" >}}
+{{% tab tabName="Vanilla / CPU Images" %}}
+
+| Description | Quay | Docker Hub                                   |
+| --- | --- |-----------------------------------------------|
+| Latest images from the branch (development) | `quay.io/go-skynet/local-ai:master` | `localai/localai:master`                      |
+| Latest tag | `quay.io/go-skynet/local-ai:latest` | `localai/localai:latest`                      |
+| Versioned image | `quay.io/go-skynet/local-ai:{{< version >}}` | `localai/localai:{{< version >}}`             |
+| Versioned image including FFMpeg| `quay.io/go-skynet/local-ai:{{< version >}}-ffmpeg` | `localai/localai:{{< version >}}-ffmpeg`      |
+| Versioned image including FFMpeg, no python | `quay.io/go-skynet/local-ai:{{< version >}}-ffmpeg-core` | `localai/localai:{{< version >}}-ffmpeg-core` |
+
+{{% /tab %}}
+
+{{% tab tabName="GPU Images CUDA 11" %}}
+
+| Description | Quay | Docker Hub                                                  |
+| --- | --- |-------------------------------------------------------------|
+| Latest images from the branch (development) | `quay.io/go-skynet/local-ai:master-cublas-cuda11` | `localai/localai:master-cublas-cuda11`                      |
+| Latest tag | `quay.io/go-skynet/local-ai:latest-cublas-cuda11` | `localai/localai:latest-cublas-cuda11`                      |
+| Versioned image | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11` | `localai/localai:{{< version >}}-cublas-cuda11`             |
+| Versioned image including FFMpeg| `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11-ffmpeg` | `localai/localai:{{< version >}}-cublas-cuda11-ffmpeg`      |
+| Versioned image including FFMpeg, no python | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11-ffmpeg-core` | `localai/localai:{{< version >}}-cublas-cuda11-ffmpeg-core` |
+
+{{% /tab %}}
+
+{{% tab tabName="GPU Images CUDA 12" %}}
+
+| Description | Quay | Docker Hub                                                  |
+| --- | --- |-------------------------------------------------------------|
+| Latest images from the branch (development) | `quay.io/go-skynet/local-ai:master-cublas-cuda12` | `localai/localai:master-cublas-cuda12`                      |
+| Latest tag | `quay.io/go-skynet/local-ai:latest-cublas-cuda12` | `localai/localai:latest-cublas-cuda12`                      |
+| Versioned image | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12` | `localai/localai:{{< version >}}-cublas-cuda12`             |
+| Versioned image including FFMpeg| `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12-ffmpeg` | `localai/localai:{{< version >}}-cublas-cuda12-ffmpeg`      |
+| Versioned image including FFMpeg, no python | `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12-ffmpeg-core` | `localai/localai:{{< version >}}-cublas-cuda12-ffmpeg-core` |
+
+{{% /tab %}}
+
+{{< /tabs >}}
+
+## What's next?
+
+Explore further resources and community contributions:
+
+- [Community How to's](https://io.midori-ai.xyz/howtos/)
+- [Examples](https://github.com/mudler/LocalAI/tree/master/examples#examples)
+
+[![Screenshot from 2023-04-26 23-59-55](https://user-images.githubusercontent.com/2420543/234715439-98d12e03-d3ce-4f94-ab54-2b256808e05e.png)](https://github.com/mudler/LocalAI/tree/master/examples#examples)
--- a/docs/content/docs/integrations.md
+++ b/docs/content/docs/integrations.md
@@ -0,0 +1,28 @@
+++
+disableToc = false
+title = "Integrations"
+weight = 19
+icon = "rocket_launch"
+
+++
+
+## Community integrations
+
+A list of projects that use LocalAI directly behind the scenes can be found [here](https://github.com/mudler/LocalAI#-community-and-integrations).
+
+The software listed below integrates with LocalAI.
+
+- [AnythingLLM](https://github.com/Mintplex-Labs/anything-llm)
+- [Logseq GPT3 OpenAI plugin](https://github.com/briansunter/logseq-plugin-gpt3-openai) allows setting a base URL, and works with LocalAI.
+- https://github.com/longy2k/obsidian-bmo-chatbot
+- https://github.com/FlowiseAI/Flowise
+- https://github.com/k8sgpt-ai/k8sgpt
+- https://github.com/kairos-io/kairos
+- https://github.com/langchain4j/langchain4j
+- https://github.com/henomis/lingoose
+- https://github.com/trypromptly/LLMStack
+- https://github.com/mattermost/openops
+- https://github.com/charmbracelet/mods
+- https://github.com/cedriking/spark
+  
+Feel free to open a pull request (by clicking "Edit page" below) to get a page made for your project, or if you see an error on one of the pages!
--- a/docs/content/docs/overview.md
+++ b/docs/content/docs/overview.md
@@ -1,8 +1,21 @@
+
 +++
-archetype = "home"
-title = "LocalAI"
+title = "Overview"
+weight = 1
+toc = true
+description = "What is LocalAI?"
+tags = ["Beginners"]
+categories = [""]
+author = "Ettore Di Giacinto"
+# This allows overwriting the landing page
+url = '/'
+icon = "info"
 +++

+<p align="center">
+<a href="https://localai.io"><img width=512 src="https://github.com/go-skynet/LocalAI/assets/2420543/0966aa2a-166e-4f99-a3e5-6c915fc997dd"></a>
+</p>
+
 <p align="center">
 <a href="https://github.com/go-skynet/LocalAI/fork" target="blank">
 <img src="https://img.shields.io/github/forks/go-skynet/LocalAI?style=for-the-badge" alt="LocalAI forks"/>
@@ -18,11 +31,14 @@ title = "LocalAI"
 </a>
 </p>

-> 💡 Get help - [❓FAQ](https://localai.io/faq/) [❓How tos](https://localai.io/howtos/) [💭Discussions](https://github.com/go-skynet/LocalAI/discussions) [💭Discord](https://discord.gg/uJAeKSAGDy)
+[<img src="https://img.shields.io/badge/dockerhub-images-important.svg?logo=Docker">](https://hub.docker.com/r/localai/localai)
+[<img src="https://img.shields.io/badge/quay.io-images-important.svg?">](https://quay.io/repository/go-skynet/local-ai?tab=tags&tag=latest)
+
+> 💡 Get help - [❓FAQ](https://localai.io/faq/) [❓How tos](https://io.midori-ai.xyz/howtos/) [💭Discussions](https://github.com/go-skynet/LocalAI/discussions) [💭Discord](https://discord.gg/uJAeKSAGDy)
 >
 > [💻 Quickstart](https://localai.io/basics/getting_started/) [📣 News](https://localai.io/basics/news/) [ 🛫 Examples ](https://github.com/go-skynet/LocalAI/tree/master/examples/) [ 🖼️ Models ](https://localai.io/models/) [ 🚀 Roadmap ](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)

-**LocalAI** is the free, Open Source OpenAI alternative. LocalAI act as a drop-in replacement REST API that's compatible with OpenAI API specifications for local inferencing. It allows you to run LLMs, generate images, audio (and not only) locally or on-prem with consumer grade hardware, supporting multiple model families that are compatible with the ggml format. Does not require GPU. It is maintained by [mudler](https://github.com/mudler).
+**LocalAI** is the free, Open Source OpenAI alternative. LocalAI acts as a drop-in replacement REST API that's compatible with OpenAI API specifications for local inferencing. It allows you to run LLMs, generate images and audio, and more, locally or on-prem with consumer-grade hardware, supporting multiple model families and architectures. It does not require a GPU. It is maintained by [mudler](https://github.com/mudler).

 <p align="center">
 <a href="https://twitter.com/LocalAI_API" target="blank">
@@ -36,10 +52,10 @@ In a nutshell:

 - Local, OpenAI drop-in alternative REST API. You own your data.
 - NO GPU required. NO Internet access is required either
-  - Optional, GPU Acceleration is available in `llama.cpp`-compatible LLMs. See also the [build section](https://localai.io/basics/build/index.html).
+  - Optional GPU acceleration is available. See also the [build section](https://localai.io/basics/build/index.html).
 - Supports multiple models
 - 🏃 Once loaded the first time, it keep models loaded in memory for faster inference
- ⚡ Doesn't shell-out, but uses C++ bindings for a faster inference and better performance.
+- ⚡ Doesn't shell out, but uses bindings for faster inference and better performance.

 LocalAI is focused on making the AI accessible to anyone. Any contribution, feedback and PR is welcome!

@@ -62,7 +78,7 @@ Note that this started just as a fun weekend project by [mudler](https://github.

 LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be effortlessly implemented as a substitute, even on consumer-grade hardware. This capability is achieved by employing various C++ backends, including [ggml](https://github.com/ggerganov/ggml), to perform inference on LLMs using both CPU and, if desired, GPU. Internally LocalAI backends are just gRPC server, indeed you can specify and build your own gRPC server and extend LocalAI in runtime as well. It is possible to specify external gRPC server and/or binaries that LocalAI will manage internally.

-LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, ...). You can check [the model compatibility table]({{%relref "model-compatibility" %}}) to learn about all the components of LocalAI.
+LocalAI uses a mixture of backends written in various languages (C++, Golang, Python, ...). You can check [the model compatibility table]({{%relref "docs/reference/compatibility-table" %}}) to learn about all the components of LocalAI.

 ![localai](https://github.com/go-skynet/localai-website/assets/2420543/6492e685-8282-4217-9daa-e229a31548bc)

--- a/docs/content/docs/reference/_index.en.md
+++ b/docs/content/docs/reference/_index.en.md
@@ -0,0 +1,11 @@
+---
+weight: 23
+title: "References"
+description: "Reference"
+icon: science
+lead: ""
+date: 2020-10-06T08:49:15+00:00
+lastmod: 2020-10-06T08:49:15+00:00
+draft: false
+images: []
+---
--- a/docs/content/docs/reference/compatibility-table.md
+++ b/docs/content/docs/reference/compatibility-table.md
@@ -1,29 +1,22 @@

 +++
 disableToc = false
-title = "Model compatibility"
-weight = 4
+title = "Model compatibility table"
+weight = 24
+url = "/model-compatibility/"
 +++

-LocalAI is compatible with the models supported by [llama.cpp](https://github.com/ggerganov/llama.cpp) supports also [GPT4ALL-J](https://github.com/nomic-ai/gpt4all) and [cerebras-GPT with ggml](https://huggingface.co/lxe/Cerebras-GPT-2.7B-Alpaca-SP-ggml).
-
-{{% notice note %}}
-
-LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See [the advanced section]({{%relref "advanced" %}}) for more details.
-
-{{% /notice %}}
-
-### Hardware requirements
-
-Depending on the model you are attempting to run might need more RAM or CPU resources. Check out also [here](https://github.com/ggerganov/llama.cpp#memorydisk-requirements) for `gguf` based backends. `rwkv` is less expensive on resources.
-
-### Model compatibility table
-
 Besides llama based models, LocalAI is compatible also with other architectures. The table below lists all the compatible models families and the associated binding repository.

+{{% alert note %}}
+
+LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See [the advanced section]({{%relref "docs/advanced" %}}) for more details.
+
+{{% /alert %}}
+
 | Backend and Bindings                                                             | Compatible models     | Completion/Chat endpoint | Capability | Embeddings support                | Token stream support | Acceleration |
 |----------------------------------------------------------------------------------|-----------------------|--------------------------|---------------------------|-----------------------------------|----------------------|--------------|
-| [llama.cpp]({{%relref "model-compatibility/llama-cpp" %}})        | Vicuna, Alpaca, LLaMa | yes                      | GPT and Functions                        | yes** | yes                  | CUDA, openCL, cuBLAS, Metal |
+| [llama.cpp]({{%relref "docs/features/text-generation#llama.cpp" %}})        | Vicuna, Alpaca, LLaMa | yes                      | GPT and Functions                        | yes** | yes                  | CUDA, openCL, cuBLAS, Metal |
 | [gpt4all-llama](https://github.com/nomic-ai/gpt4all)      | Vicuna, Alpaca, LLaMa | yes                      | GPT                        | no                                | yes                  | N/A  |
 | [gpt4all-mpt](https://github.com/nomic-ai/gpt4all)          | MPT                   | yes                      | GPT                        | no                                | yes                  | N/A  |
 | [gpt4all-j](https://github.com/nomic-ai/gpt4all)           | GPT4ALL-J             | yes                      | GPT                        | no                                | yes                  | N/A  |
@@ -43,40 +36,21 @@ Besides llama based models, LocalAI is compatible also with other architectures.
 | [langchain-huggingface](https://github.com/tmc/langchaingo)                                                                    | Any text generators available on HuggingFace through API | yes                      | GPT                        | no                                | no                   | N/A |
 | [piper](https://github.com/rhasspy/piper) ([binding](https://github.com/mudler/go-piper))                                                                     | Any piper onnx model | no                      | Text to voice                        | no                                | no                   | N/A |
 | [falcon](https://github.com/cmp-nct/ggllm.cpp/tree/c12b2d65f732a0d8846db2244e070f0f3e73505c) ([binding](https://github.com/mudler/go-ggllm.cpp))                                                                      | Falcon *** | yes                      | GPT                        | no                                | yes                   | CUDA |
-| `huggingface-embeddings` [sentence-transformers](https://github.com/UKPLab/sentence-transformers) | BERT                   | no                       | Embeddings only                  | yes                               | no                   | N/A |
+| [sentencetransformers](https://github.com/UKPLab/sentence-transformers) | BERT                   | no                       | Embeddings only                  | yes                               | no                   | N/A |
 | `bark`  | bark                   | no                       | Audio generation                  | no                               | no                   | yes |
-| `AutoGPTQ` | GPTQ                   | yes                       | GPT                  | yes                               | no                   | N/A |
+| `autogptq` | GPTQ                   | yes                       | GPT                  | yes                               | no                   | N/A |
 | `exllama`  | GPTQ                   | yes                       | GPT only                  | no                               | no                   | N/A |
 | `diffusers`  | SD,...                   | no                       | Image generation    | no                               | no                   | N/A |
 | `vall-e-x` | Vall-E    | no                       | Audio generation and Voice cloning    | no                               | no                   | CPU/CUDA |
 | `vllm` | Various GPTs and quantization formats | yes                      | GPT             | no | no                  | CPU/CUDA |
 | `exllama2`  | GPTQ                   | yes                       | GPT only                  | no                               | no                   | N/A |
 | `transformers-musicgen`  |                    | no                       | Audio generation                | no                               | no                   | N/A |
+| [tinydream](https://github.com/symisc/tiny-dream#tiny-dreaman-embedded-header-only-stable-diffusion-inference-c-librarypixlabiotiny-dream)         | stablediffusion               | no                       | Image                 | no                                | no                   | N/A |
+| `coqui` | Coqui    | no                       | Audio generation and Voice cloning    | no                               | no                   | CPU/CUDA |
+| `petals` | Various GPTs and quantization formats | yes                      | GPT             | no | no                  | CPU/CUDA |

-Note: any backend name listed above can be used in the `backend` field of the model configuration file (See [the advanced section]({{%relref "advanced" %}})).
+Note: any backend name listed above can be used in the `backend` field of the model configuration file (See [the advanced section]({{%relref "docs/advanced" %}})).

 - \* 7b ONLY
 - ** doesn't seem to be accurate
- *** 7b and 40b with the `ggccv` format, for instance: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
-
-Tested with:
-
- [X] Automatically by CI with OpenLLAMA and GPT4ALL.
- [X] LLaMA 🦙
- [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
- [Alpaca](https://github.com/ggerganov/llama.cpp#instruction-mode-with-alpaca)
- [X] [GPT4ALL](https://gpt4all.io) (see also [using GPT4All](https://github.com/ggerganov/llama.cpp#using-gpt4all))
- [X] [GPT4ALL-J](https://gpt4all.io/models/ggml-gpt4all-j.bin) (no changes required)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/) 🐨
- [X] Cerebras-GPT
- [X] [WizardLM](https://github.com/nlpxucan/WizardLM)
- [X] [RWKV](https://github.com/BlinkDL/RWKV-LM) models with [rwkv.cpp](https://github.com/saharNooby/rwkv.cpp)
- [X] [bloom.cpp](https://github.com/NouamaneTazi/bloomz.cpp)
- [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
- [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
- [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
- [X] [Pygmalion 7B / Metharme 7B](https://github.com/ggerganov/llama.cpp#using-pygmalion-7b--metharme-7b)
- [X] [HuggingFace Inference](https://huggingface.co/inference-api) models available through API
- [X] Falcon
-
-Note: You might need to convert some models from older models to the new format, for indications, see [the README in llama.cpp](https://github.com/ggerganov/llama.cpp#using-gpt4all) for instance to run `gpt4all`.
+- *** 7b and 40b with the `ggccv` format, for instance: https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
--- a/docs/content/docs/whats-new.md
+++ b/docs/content/docs/whats-new.md
@@ -1,11 +1,17 @@
 +++
 disableToc = false
-title = "🆕 What's New"
-weight = 2
+title = "News"
+weight = 7
 url = '/basics/news/'
-
+icon = "newspaper"
 +++

+Release notes have now moved entirely to GitHub releases.
+
+You can see the release notes [here](https://github.com/mudler/LocalAI/releases).
+
+# Older release notes
+
 ## 04-12-2023: __v2.0.0__

 This release brings a major overhaul in some backends. 
@@ -68,7 +74,7 @@ From this release the `llama` backend supports only `gguf` files (see {{< pr "94

 ### Image generation enhancements

-The [Diffusers]({{%relref "model-compatibility/diffusers" %}}) backend got now various enhancements, including support to generate images from images, longer prompts, and support for more kernels schedulers. See the [Diffusers]({{%relref "model-compatibility/diffusers" %}}) documentation for more information.
+The [Diffusers]({{%relref "docs/features/image-generation" %}}) backend got now various enhancements, including support to generate images from images, longer prompts, and support for more kernels schedulers. See the [Diffusers]({{%relref "docs/features/image-generation" %}}) documentation for more information.

 ### Lora adapters

@@ -80,7 +86,7 @@ It is now possible for single-devices with one GPU to specify `--single-active-b

 ### Community spotlight

-![2023_08_26_15_09_27](https://github.com/go-skynet/localai-website/assets/2420543/f0204f8f-7462-4cdd-9154-4538683c1eef)
+

 #### Resources management

@@ -89,7 +95,7 @@ There is an ongoing effort in the community to better handling of resources. See

 #### New how-to section

-Thanks to the community efforts now we have a new [how-to section]({{%relref "howtos" %}}) with various examples on how to use LocalAI. This is a great starting point for new users! We are currently working on improving it, a huge shout out to {{< github "lunamidori5" >}} from the community for the impressive efforts on this!
+Thanks to the community efforts now we have a new [how-to website](https://io.midori-ai.xyz/howtos/) with various examples on how to use LocalAI. This is a great starting point for new users! We are currently working on improving it, a huge shout out to {{< github "lunamidori5" >}} from the community for the impressive efforts on this!

 #### 💡 More examples!

@@ -131,7 +137,7 @@ The full changelog is available [here](https://github.com/go-skynet/LocalAI/rele

 ## 🔥🔥🔥🔥 12-08-2023: __v1.24.0__ 🔥🔥🔥🔥

-This is release brings four(!) new additional backends to LocalAI: [🐶 Bark]({{%relref "model-compatibility/bark" %}}), 🦙 [AutoGPTQ]({{%relref "model-compatibility/autogptq" %}}), [🧨 Diffusers]({{%relref "model-compatibility/diffusers" %}}), 🦙 [exllama]({{%relref "model-compatibility/exllama" %}}) and a lot of improvements!
+This is release brings four(!) new additional backends to LocalAI: [🐶 Bark]({{%relref "docs/features/text-to-audio#bark" %}}), 🦙 [AutoGPTQ]({{%relref "docs/features/text-generation#autogptq" %}}), [🧨 Diffusers]({{%relref "docs/features/image-generation" %}}), 🦙 [exllama]({{%relref "docs/features/text-generation#exllama" %}}) and a lot of improvements!

 ### Major improvements:

@@ -143,23 +149,23 @@ This is release brings four(!) new additional backends to LocalAI: [🐶 Bark]({

 ### 🐶 Bark

-[Bark]({{%relref "model-compatibility/bark" %}}) is a text-prompted generative audio model - it combines GPT techniques to generate Audio from text. It is a great addition to LocalAI, and it's available in the container images by default.
+[Bark]({{%relref "docs/features/text-to-audio#bark" %}}) is a text-prompted generative audio model - it combines GPT techniques to generate Audio from text. It is a great addition to LocalAI, and it's available in the container images by default.

 It can also generate music, see the example: [lion.webm](https://user-images.githubusercontent.com/5068315/230684766-97f5ea23-ad99-473c-924b-66b6fab24289.webm)

 ### 🦙 AutoGPTQ

-[AutoGPTQ]({{%relref "model-compatibility/autogptq" %}}) is an easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.
+[AutoGPTQ]({{%relref "docs/features/text-generation#autogptq" %}}) is an easy-to-use LLMs quantization package with user-friendly apis, based on GPTQ algorithm.

-It is targeted mainly for GPU usage only. Check out the [AutoGPTQ documentation]({{%relref "model-compatibility/autogptq" %}}) for usage.
+It is targeted mainly at GPU usage. Check out the [AutoGPTQ documentation]({{%relref "docs/features/text-generation" %}}) for usage.

 ### 🦙 Exllama

-[Exllama]({{%relref "model-compatibility/exllama" %}}) is a "A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". It is a faster alternative to run LLaMA models on GPU.Check out the [Exllama documentation]({{%relref "model-compatibility/exllama" %}}) for usage.
+[Exllama]({{%relref "docs/features/text-generation#exllama" %}}) is a "A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights". It is a faster alternative to run LLaMA models on GPU.Check out the [Exllama documentation]({{%relref "docs/features/text-generation#exllama" %}}) for usage.

 ### 🧨 Diffusers

-[Diffusers]({{%relref "model-compatibility/diffusers" %}}) is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Currently it is experimental, and supports generation only of images so you might encounter some issues on models which weren't tested yet. Check out the [Diffusers documentation]({{%relref "model-compatibility/diffusers" %}}) for usage.
+[Diffusers]({{%relref "docs/features/image-generation#diffusers" %}}) is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules. Currently it is experimental, and supports generation only of images so you might encounter some issues on models which weren't tested yet. Check out the [Diffusers documentation]({{%relref "docs/features/image-generation" %}}) for usage.

 ### 🔑 API Keys

@@ -195,11 +201,11 @@ Most notably, this release brings important fixes for CUDA (and not only):
 * fix: select function calls if 'name' is set in the request by {{< github "mudler" >}} in {{< pr "827" >}}
 * fix: symlink libphonemize in the container by {{< github "mudler" >}} in {{< pr "831" >}}
  
-{{% notice note %}}
+{{% alert note %}}

-From this release [OpenAI functions]({{%relref "features/openai-functions" %}}) are available in the `llama` backend. The `llama-grammar` has been deprecated. See also [OpenAI functions]({{%relref "features/openai-functions" %}}).
+From this release [OpenAI functions]({{%relref "docs/features/openai-functions" %}}) are available in the `llama` backend. The `llama-grammar` has been deprecated. See also [OpenAI functions]({{%relref "docs/features/openai-functions" %}}).

-{{% /notice %}}
+{{% /alert %}}

 The full [changelog is available here](https://github.com/go-skynet/LocalAI/releases/tag/v1.23.0)

@@ -213,15 +219,15 @@ The full [changelog is available here](https://github.com/go-skynet/LocalAI/rele
 * feat: backends improvements by {{< github "mudler" >}} in {{< pr "778" >}}
 * feat(llama2): add template for chat messages by {{< github "dave-gray101" >}}  in {{< pr "782" >}}

-{{% notice note %}}
+{{% alert note %}}

-From this release to use the OpenAI functions you need to use the `llama-grammar` backend. It has been added a `llama` backend for tracking `llama.cpp` master and `llama-grammar` for the grammar functionalities that have not been merged yet upstream. See also [OpenAI functions]({{%relref "features/openai-functions" %}}). Until the feature is merged we will have two llama backends.
+From this release to use the OpenAI functions you need to use the `llama-grammar` backend. It has been added a `llama` backend for tracking `llama.cpp` master and `llama-grammar` for the grammar functionalities that have not been merged yet upstream. See also [OpenAI functions]({{%relref "docs/features/openai-functions" %}}). Until the feature is merged we will have two llama backends.

-{{% /notice %}}
+{{% /alert %}}

 ## Huggingface embeddings

-In this release is now possible to specify to LocalAI external `gRPC` backends that can be used for inferencing {{< pr "778" >}}. It is now possible to write internal backends in any language, and a `huggingface-embeddings` backend is now available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also [Embeddings]({{%relref "features/embeddings" %}}).
+In this release is now possible to specify to LocalAI external `gRPC` backends that can be used for inferencing {{< pr "778" >}}. It is now possible to write internal backends in any language, and a `huggingface-embeddings` backend is now available in the container image to be used with https://github.com/UKPLab/sentence-transformers. See also [Embeddings]({{%relref "docs/features/embeddings" %}}).

 ## LLaMa 2 has been released!

@@ -266,7 +272,7 @@ The former, ggml-based backend has been renamed to `falcon-ggml`.

 ### Default pre-compiled binaries

-From this release the default behavior of images has changed. Compilation is not triggered on start automatically, to recompile `local-ai` from scratch on start and switch back to the old behavior, you can set `REBUILD=true` in the environment variables. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the [build section]({{%relref "build" %}}) for more information.
+From this release the default behavior of images has changed. Compilation is not triggered on start automatically, to recompile `local-ai` from scratch on start and switch back to the old behavior, you can set `REBUILD=true` in the environment variables. Rebuilding can be necessary if your CPU and/or architecture is old and the pre-compiled binaries are not compatible with your platform. See the [build section]({{%relref "docs/getting-started/build" %}}) for more information.

 [Full release changelog](https://github.com/go-skynet/LocalAI/releases/tag/v1.21.0)

@@ -276,8 +282,8 @@ From this release the default behavior of images has changed. Compilation is not

 ### Exciting New Features 🎉

-* Add Text-to-Audio generation with `go-piper` by {{< github "mudler" >}} in {{< pr "649" >}} See [API endpoints]({{%relref "features/text-to-audio" %}}) in our documentation.
-* Add gallery repository by {{< github "mudler" >}} in {{< pr "663" >}}. See [models]({{%relref "models" %}}) for documentation.
+* Add Text-to-Audio generation with `go-piper` by {{< github "mudler" >}} in {{< pr "649" >}} See [API endpoints]({{%relref "docs/features/text-to-audio" %}}) in our documentation.
+* Add gallery repository by {{< github "mudler" >}} in {{< pr "663" >}}. See [models]({{%relref "docs/features/model-gallery" %}}) for documentation.

 ### Container images
 - Standard (GPT + `stablediffusion`): `quay.io/go-skynet/local-ai:v1.20.0`
@@ -289,7 +295,7 @@ From this release the default behavior of images has changed. Compilation is not

 Updates to `llama.cpp`, `go-transformers`, `gpt4all.cpp` and `rwkv.cpp`.

-The NUMA option was enabled by {{< github "mudler" >}} in {{< pr "684" >}}, along with many new parameters (`mmap`,`mmlock`, ..). See [advanced]({{%relref "advanced" %}}) for the full list of parameters.
+The NUMA option was enabled by {{< github "mudler" >}} in {{< pr "684" >}}, along with many new parameters (`mmap`,`mmlock`, ..). See [advanced]({{%relref "docs/advanced" %}}) for the full list of parameters.

 ### Gallery repositories

@@ -313,13 +319,13 @@ or a `tts` voice with:
 curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{ "id": "model-gallery@voice-en-us-kathleen-low" }'
 ```

-See also [models]({{%relref "models" %}}) for a complete documentation.
+See also [models]({{%relref "docs/features/model-gallery" %}}) for a complete documentation.

 ### Text to Audio

 Now `LocalAI` uses [piper](https://github.com/rhasspy/piper) and [go-piper](https://github.com/mudler/go-piper) to generate audio from text. This is an experimental feature, and it requires `GO_TAGS=tts` to be set during build. It is enabled by default in the pre-built container images.

-To setup audio models, you can use the new galleries, or setup the models manually as described in [the API section of the documentation]({{%relref "features/text-to-audio" %}}).
+To setup audio models, you can use the new galleries, or setup the models manually as described in [the API section of the documentation]({{%relref "docs/features/text-to-audio" %}}).

 You can check the full changelog in [Github](https://github.com/go-skynet/LocalAI/releases/tag/v1.20.0)

@@ -347,7 +353,7 @@ We now support a vast variety of models, while being backward compatible with pr
 ### New features

 - ✨ Added support for `falcon`-based model families (7b)  ( [mudler](https://github.com/mudler) )
- ✨ Experimental support for Metal Apple Silicon GPU - ( [mudler](https://github.com/mudler) and thanks to [Soleblaze](https://github.com/Soleblaze) for testing! ). See the [build section]({{%relref "build#Acceleration" %}}).
+- ✨ Experimental support for Metal Apple Silicon GPU - ( [mudler](https://github.com/mudler) and thanks to [Soleblaze](https://github.com/Soleblaze) for testing! ). See the [build section]({{%relref "docs/getting-started/build#Acceleration" %}}).
 - ✨ Support for token stream in the `/v1/completions` endpoint ( [samm81](https://github.com/samm81) )
 - ✨ Added huggingface backend ( [Evilfreelancer](https://github.com/EvilFreelancer) )
 - 📷 Stablediffusion now can output `2048x2048` images size with `esrgan`! ( [mudler](https://github.com/mudler) )
@@ -388,7 +394,7 @@ Two new projects offer now direct integration with LocalAI!

 Support for OpenCL has been added while building from sources.

-You can now build LocalAI from source with `BUILD_TYPE=clblas` to have an OpenCL build. See also the [build section]({{%relref "build#Acceleration" %}}).
+You can now build LocalAI from source with `BUILD_TYPE=clblas` to have an OpenCL build. See also the [build section]({{%relref "docs/getting-started/build#Acceleration" %}}).

 For instructions on how to install OpenCL/CLBlast see [here](https://github.com/ggerganov/llama.cpp#blas-build).

@@ -418,7 +424,7 @@ prompt_cache_path: "alpaca-cache"
 prompt_cache_all: true
 ```

-See also the [advanced section]({{%relref "advanced" %}}).
+See also the [advanced section]({{%relref "docs/advanced" %}}).

 ## Media, Blogs, Social

@@ -431,7 +437,7 @@ See also the [advanced section]({{%relref "advanced" %}}).

 - 23-05-2023: __v1.15.0__ released. `go-gpt2.cpp` backend got renamed to `go-ggml-transformers.cpp` updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. This impacts RedPajama, GptNeoX, MPT(not `gpt4all-mpt`), Dolly, GPT2 and Starcoder based models. [Binary releases available](https://github.com/go-skynet/LocalAI/releases), various fixes, including {{< pr "341" >}} .
 - 21-05-2023: __v1.14.0__ released. Minor updates to the `/models/apply` endpoint, `llama.cpp` backend updated including https://github.com/ggerganov/llama.cpp/pull/1508 which breaks compatibility with older models. `gpt4all` is still compatible with the old format. 
- 19-05-2023: __v1.13.0__ released! 🔥🔥 updates to the `gpt4all` and `llama` backend, consolidated CUDA support ( {{< pr "310" >}} thanks to @bubthegreat and @Thireus ), preliminar support for [installing models via API]({{%relref "advanced#" %}}).
+- 19-05-2023: __v1.13.0__ released! 🔥🔥 updates to the `gpt4all` and `llama` backend, consolidated CUDA support ( {{< pr "310" >}} thanks to @bubthegreat and @Thireus ), preliminar support for [installing models via API]({{%relref "docs/advanced#" %}}).
 - 17-05-2023:  __v1.12.0__ released! 🔥🔥 Minor fixes, plus CUDA ({{< pr "258" >}}) support for `llama.cpp`-compatible models and image generation ({{< pr "272" >}}).
 - 16-05-2023: 🔥🔥🔥 Experimental support for CUDA ({{< pr "258" >}}) in the `llama.cpp` backend and Stable diffusion CPU image generation ({{< pr "272" >}}) in `master`.

--- a/docs/content/features/_index.en.md
+++ b/docs/content/features/_index.en.md
@@ -1,17 +0,0 @@
-
-+++
-disableToc = false
-title = "Features"
-weight = 3
-+++
-
-This section contains the documentation for the features supported by LocalAI.
-
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🗣 Text to Audio]({{%relref "features/text-to-audio" %}})
- [🔈 Audio to text]({{%relref "features/audio-to-text" %}})
- [🎨 Image generation]({{%relref "features/image-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
- [🆕 GPT Vision API]({{%relref "features/gpt-vision" %}})
- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})
Author	SHA1	Message	Date
LocalAI [bot]	abd678e147	⬆️ Update ggerganov/llama.cpp (#1655 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-28 09:24:44 +01:00
Ettore Di Giacinto	6ac5d814fb	feat(startup): fetch model definition remotely (#1654 )	2024-01-28 00:14:16 +01:00
LocalAI [bot]	f928899338	⬆️ Update ggerganov/llama.cpp (#1652 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-27 00:13:38 +01:00
Ettore Di Giacinto	5a6fd98839	fix(paths): automatically create paths (#1650 ) Especially useful when running inside a container. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-01-27 00:13:19 +01:00
Ettore Di Giacinto	072f71dfb7	Update codellama-7b.yaml Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-26 18:35:33 +01:00
Ettore Di Giacinto	670cee8274	Update transformers-tinyllama.yaml Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-26 18:29:38 +01:00
Ettore Di Giacinto	9f1be45552	Update quickstart.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-26 17:55:20 +01:00
Ettore Di Giacinto	f1846ae5ac	Update phi-2.yaml Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-26 16:22:54 +01:00
LocalAI [bot]	ac19998e5e	⬆️ Update ggerganov/llama.cpp (#1644 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-26 00:13:39 +01:00
Ettore Di Giacinto	cb7512734d	transformers: correctly load automodels (#1643 ) * backends(transformers): use AutoModel with LLM types * examples: animagine-xl * Add codellama examples	2024-01-26 00:13:21 +01:00
LocalAI [bot]	3733250b3c	⬆️ Update ggerganov/llama.cpp (#1642 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-24 22:51:59 +01:00
LocalAI [bot]	da3cd8993d	⬆️ Update docs version mudler/LocalAI (#1631 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-24 19:50:33 +01:00
LocalAI [bot]	7690caf020	⬆️ Update ggerganov/llama.cpp (#1632 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-23 23:07:51 +01:00
Ettore Di Giacinto	5e335eaead	feat(transformers): support also text generation (#1630 ) * feat(transformers): support also text generation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * embedded: set seed -1 --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-01-23 23:07:31 +01:00
coyzeng	d5d82ba344	feat(grpc): backend SPI pluggable in embedding mode (#1621 ) * run server * grpc backend embedded support * backend providable	2024-01-23 08:56:36 +01:00
LocalAI [bot]	efe2883c5d	⬆️ Update ggerganov/llama.cpp (#1626 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-22 23:22:01 +01:00
LocalAI [bot]	47237c7c3c	⬆️ Update ggerganov/llama.cpp (#1623 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-22 08:54:06 +01:00
Ettore Di Giacinto	697c769b64	fix(llama.cpp): enable cont batching when parallel is set (#1622 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-01-21 14:59:48 +01:00
Ettore Di Giacinto	94261b1717	Update gpt-vision.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-21 10:07:30 +01:00
Sebastian	eaf85a30f9	fix(llama.cpp): Enable parallel requests (#1616 ) integrate changes from llama.cpp Signed-off-by: Sebastian <tauven@gmail.com>	2024-01-21 09:56:14 +01:00
LocalAI [bot]	6a88b030ea	⬆️ Update ggerganov/llama.cpp (#1620 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-20 23:34:46 +01:00
LocalAI [bot]	f538416fb3	⬆️ Update docs version mudler/LocalAI (#1619 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-20 21:37:02 +00:00
Ettore Di Giacinto	06cd9ef98d	feat(extra-backends): Improvements, adding mamba example (#1618 ) * feat(extra-backends): Improvements vllm: add max_tokens, wire up stream event mamba: fixups, adding examples for mamba-chat * examples(mamba-chat): add * docs: update	2024-01-20 17:56:08 +01:00
James Braza	f3d71f8819	Modernized LlamaIndex integration (#1613 ) Updated LlamaIndex example	2024-01-20 10:06:32 +01:00
James Braza	b7127c2dc9	Expanded and interlinked Docker documentation (#1614 ) * Corrected dockerhub to Docker Hub * Consolidated two Docker examples * Linked Container Images in Manual Images	2024-01-20 10:05:14 +01:00
LocalAI [bot]	b2dc5fbd7e	⬆️ Update ggerganov/llama.cpp (#1612 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-20 00:38:14 +01:00
Ettore Di Giacinto	9e653d6abe	feat: 🐍 add mamba support (#1589 ) feat(mamba): Initial import This is a first iteration of the mamba backend, loosely based on mamba-chat(https://github.com/havenhq/mamba-chat).	2024-01-19 23:42:50 +01:00
Ettore Di Giacinto	52c9a7f45d	Update README.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-19 19:30:29 +01:00
Ettore Di Giacinto	ee42c9bfe6	docs: re-use original permalinks (#1610 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-01-19 19:23:58 +01:00
Ettore Di Giacinto	e6c3e483a1	Update build.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-19 19:09:35 +01:00
Ettore Di Giacinto	3a253c6cd7	Makefile: allow to build without GRPC_BACKENDS (#1607 )	2024-01-19 15:38:43 +01:00
Luna Midori	e9c3bbc6d7	Update README.md (#1601 ) Signed-off-by: Luna Midori <118759930+lunamidori5@users.noreply.github.com>	2024-01-19 08:55:37 +01:00
LocalAI [bot]	23d64ac53a	⬆️ Update ggerganov/llama.cpp (#1604 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-18 21:20:50 +00:00
Ettore Di Giacinto	34f9f20ff4	Update quickstart.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-18 20:49:04 +01:00
Ettore Di Giacinto	a4a72a79ae	Update integrations.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-18 19:53:41 +01:00
Ettore Di Giacinto	6ca4d38a01	docs/examples: enhancements (#1572 ) * docs: re-order sections * fix references * Add mixtral-instruct, tinyllama-chat, dolphin-2.5-mixtral-8x7b * Fix link * Minor corrections * fix: models is a StringSlice, not a String Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * WIP: switch docs theme * content * Fix GH link * enhancements * enhancements * Fixed how to link Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * fixups * logo fix * more fixups * final touches --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>	2024-01-18 19:41:08 +01:00
LocalAI [bot]	b5c93f176a	⬆️ Update ggerganov/llama.cpp (#1599 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-18 14:39:30 +01:00
LocalAI [bot]	1aaf88098d	⬆️ Update ggerganov/llama.cpp (#1597 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-17 09:27:02 +01:00
Dionysius	6f447e613d	docs: missing golang requirement for local build for debian (#1596 ) docs: fix missing golang requirement for local build for debian	2024-01-17 09:26:43 +01:00
LocalAI [bot]	dfb7c3b1aa	⬆️ Update ggerganov/llama.cpp (#1594 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-16 14:47:57 +01:00
Dionysius	b41eb5e1f3	prepend built binaries in PATH for BUILD_GRPC_FOR_BACKEND_LLAMA (#1593 ) prepend built binaries in PATH	2024-01-16 14:47:47 +01:00
LocalAI [bot]	9c2d264979	⬆️ Update ggerganov/llama.cpp (#1590 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-15 09:01:07 +01:00
LocalAI [bot]	b996c3198c	⬆️ Update ggerganov/llama.cpp (#1587 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-14 09:46:47 +00:00
Ettore Di Giacinto	f879c07c86	Update README.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-14 10:00:46 +01:00
Dionysius	441e2965ff	move BUILD_GRPC_FOR_BACKEND_LLAMA logic to makefile: errors in this section now immediately fail the build (#1576 ) * move BUILD_GRPC_FOR_BACKEND_LLAMA option to makefile * review: oversight, fixup cmake_args Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com> Signed-off-by: Dionysius <1341084+dionysius@users.noreply.github.com> --------- Signed-off-by: Dionysius <1341084+dionysius@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-13 10:08:26 +01:00
LocalAI [bot]	cbe9a03e3c	⬆️ Update ggerganov/llama.cpp (#1583 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-12 23:04:04 +01:00
LocalAI [bot]	4ee7e73d00	⬆️ Update ggerganov/llama.cpp (#1578 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-12 16:04:33 +01:00
lunamidori5	1cca449726	Moving the how tos to self hosted (#1574 ) * Update _index.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Delete docs/content/howtos/easy-setup-sd.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Delete docs/content/howtos/easy-setup-full.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Delete docs/content/howtos/easy-setup-embeddings.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Delete docs/content/howtos/easy-setup-docker.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Delete docs/content/howtos/easy-request.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Delete docs/content/howtos/easy-model.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Update _index.en.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Update README.md Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> * Delete docs/content/howtos directory Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com> --------- Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>	2024-01-11 09:25:18 +01:00
LocalAI [bot]	faf7c1c325	⬆️ Update ggerganov/llama.cpp (#1573 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-11 08:41:32 +01:00
LocalAI [bot]	58288494d6	⬆️ Update ggerganov/llama.cpp (#1568 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-10 10:18:57 +01:00
Dionysius	72283dc744	minor: replace shell pwd in Makefile with CURDIR for better windows compatibility (#1571 ) replace shell pwd in Makefile with CURDIR	2024-01-10 08:39:50 +00:00
LocalAI [bot]	b8240b4c18	⬆️ Update docs version mudler/LocalAI (#1567 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-09 21:56:12 +01:00
Ettore Di Giacinto	5309da40b7	Update Dockerfile Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-09 08:55:43 +01:00
Ettore Di Giacinto	08b90b4720	Update _index.en.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-09 08:50:19 +01:00
LocalAI [bot]	2e890b3838	⬆️ Update ggerganov/llama.cpp (#1563 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-09 08:48:40 +01:00
LocalAI [bot]	06656fc057	⬆️ Update docs version mudler/LocalAI (#1562 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-09 08:48:24 +01:00
LocalAI [bot]	574fa67bdc	⬆️ Update ggerganov/llama.cpp (#1558 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-08 00:38:03 +01:00
Ettore Di Giacinto	e19d7226f8	feat: more embedded models, coqui fixes, add model usage and description (#1556 ) * feat: add model descriptions and usage * remove default model gallery * models: add embeddings and tts * docs: update table * docs: updates * images: cleanup pip cache after install * images: always run apt-get clean * ux: improve gRPC connection errors * ux: improve some messages * fix: fix coqui when no AudioPath is passed by * embedded: add more models * Add usage * Reorder table	2024-01-08 00:37:02 +01:00
LocalAI [bot]	0843fe6c65	⬆️ Update docs version mudler/LocalAI (#1557 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-07 09:36:21 +01:00
Ettore Di Giacinto	62a02cd1fe	deps(conda): use transformers environment with autogptq (#1555 )	2024-01-06 15:30:53 +01:00
Ettore Di Giacinto	949da7792d	deps(conda): use transformers-env with vllm,exllama(2) (#1554 ) * deps(conda): use transformers with vllm * join vllm, exllama, exllama2, split petals	2024-01-06 13:32:28 +01:00
Ettore Di Giacinto	ce724a7e55	docs: improve getting started (#1553 ) * docs: improve getting started Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * cleanups * Use dockerhub links * Shrink command to minimum --------- Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-06 01:04:14 +01:00
LocalAI [bot]	0a06c80801	⬆️ Update ggerganov/llama.cpp (#1547 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-05 23:27:51 +01:00
LocalAI [bot]	edc55ade61	⬆️ Update docs version mudler/LocalAI (#1546 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com> Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>	2024-01-05 23:27:30 +01:00
Ettore Di Giacinto	09e5d9007b	feat: embedded model configurations, add popular model examples, refactoring (#1532 ) * move downloader out * separate startup functions for preloading configuration files * docs: add popular model examples Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * shorteners * Add llava * Add mistral-openorca * Better link to build section * docs: update * fixup * Drop code dups * Minor fixups * Apply suggestions from code review Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * ci: try to cache gRPC build during tests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: do not build all images for tests, just necessary * ci: cache gRPC also in release pipeline * fixes * Update model_preload_test.go Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-05 23:16:33 +01:00
Ettore Di Giacinto	db926896bd	Revert "[Refactor]: Core/API Split" (#1550 ) Revert "[Refactor]: Core/API Split (#1506)" This reverts commit `ab7b4d5ee9`.	2024-01-05 18:04:46 +01:00
Dave	ab7b4d5ee9	[Refactor]: Core/API Split (#1506 ) Refactors api folder to core, creates firm split between backend code and api frontend.	2024-01-05 15:34:56 +01:00
Ettore Di Giacinto	bcf02449b3	ci(dockerhub): push images also to dockerhub (#1542 ) Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-04 08:32:29 +01:00
LocalAI [bot]	d48faf35ab	⬆️ Update ggerganov/llama.cpp (#1544 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-04 00:08:03 +01:00
Ettore Di Giacinto	583bd28a5c	fix(diffusers): add omegaconf dependency (#1540 ) Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-04 00:06:41 +01:00
LocalAI [bot]	7e1d8c489b	⬆️ Update ggerganov/llama.cpp (#1533 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-03 08:43:35 +01:00
LocalAI [bot]	de28867374	⬆️ Update ggerganov/llama.cpp (#1531 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2024-01-02 00:28:22 +00:00
Ettore Di Giacinto	a1aa6cb7c2	fix(entrypoint): cd to backend dir before start (#1530 ) Certain backends, such as vall-e-x, are not meant to be used as a library, so we want to start the process in the same folder where the backend and all its assets are. Fixes #1394	2024-01-01 22:02:48 +01:00
Ettore Di Giacinto	85e2767dca	feat: add trimsuffix (#1528 )	2024-01-01 14:39:42 +01:00
Ettore Di Giacinto	fd48cb6506	deps(llama.cpp): update and sync grpc server (#1527 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2024-01-01 14:39:31 +01:00
Ettore Di Giacinto	522659eb59	feat(prepare): allow to specify additional files to download (#1526 )	2024-01-01 14:39:13 +01:00
Ettore Di Giacinto	f068efe509	docs(phi-2): add example (#1525 )	2024-01-01 10:51:47 +01:00
Ettore Di Giacinto	726fe416bb	docs: update hot topics Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2024-01-01 10:41:39 +01:00
Ettore Di Giacinto	66fa4f1767	feat: share models by url (#1522 ) * feat: allow to pass by models via args * expose it also as an env/arg * docs: enhancements to build/requirements * do not display status always * print download status * not all messages are debug	2024-01-01 10:31:03 +01:00
Ettore Di Giacinto	d6565f3b99	Update _index.en.md Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2023-12-31 10:58:22 +01:00
LocalAI [bot]	27686ff20b	⬆️ Update ggerganov/llama.cpp (#1518 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2023-12-31 00:19:08 +00:00
LocalAI [bot]	a8b865022f	⬆️ Update docs version mudler/LocalAI (#1517 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2023-12-30 23:50:24 +00:00
Ettore Di Giacinto	c1888a8062	feat(preload): prepare models in galleries (#1515 ) Previously, when applying models from the gallery API, we didn't actually allow remote URLs as models, since nothing was downloading the models referenced in the configuration file. Now we call Preload after we have all the models loaded in memory.	2023-12-30 18:55:18 +01:00
Ettore Di Giacinto	a95bb0521d	fix(download): correctly check for not found error (#1514 )	2023-12-30 15:36:46 +01:00
Chris Natale	e2311a145c	Fix: Set proper Homebrew install location for x86 Macs (#1510 ) * set proper Homebrew install location for x86 Macs * fix: remove prior conditional that my logic replaces	2023-12-30 12:37:26 +01:00
lunamidori5	d4e0bab6be	Update version.json (2.3.0) (#1511 ) Update version.json Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>	2023-12-30 10:19:46 +01:00
LocalAI [bot]	5b0dc20e4c	⬆️ Update ggerganov/llama.cpp (#1509 ) Signed-off-by: GitHub <noreply@github.com> Co-authored-by: mudler <mudler@users.noreply.github.com>	2023-12-30 09:19:07 +00:00