Compare commits


47 Commits

Author SHA1 Message Date
Ettore Di Giacinto
9ae47d37e9 pin go-rwkv
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-21 08:42:40 +01:00
Ettore Di Giacinto
2b3ad7f41c Revert "⬆️ Update donomii/go-rwkv.cpp" (#1474)
Revert "⬆️ Update donomii/go-rwkv.cpp (#1470)"

This reverts commit 51db10b18f.
2023-12-21 08:38:50 +01:00
LocalAI [bot]
51db10b18f ⬆️ Update donomii/go-rwkv.cpp (#1470)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-21 08:35:31 +01:00
Ettore Di Giacinto
b4b21a446b feat(conda): share envs with transformer-based backends (#1465)
* feat(conda): share env between diffusers and bark

* Detect if env already exists

* share diffusers and petals

* tests: add petals

* Use smaller model for tests with petals

* test only model load on petals

* tests(petals): run only load model tests

* Revert "test only model load on petals"

This reverts commit 111cfa97f1.

* move transformers and sentencetransformers to common env

* Share also transformers-musicgen
2023-12-21 08:35:15 +01:00
LocalAI [bot]
23eced1644 ⬆️ Update ggerganov/llama.cpp (#1461)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-20 18:02:52 +01:00
LocalAI [bot]
7741a6e75d ⬆️ Update ggerganov/whisper.cpp (#1462)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-20 00:21:49 +00:00
LocalAI [bot]
d4210db0c9 ⬆️ Update ggerganov/llama.cpp (#1457)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-19 00:42:19 +01:00
lunamidori5
17dde75107 How To (Updates and Fixes) (#1456)
* Update easy-setup-embeddings.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update and rename easy-setup-docker-cpu.md to easy-setup-docker.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update _index.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Delete docs/content/howtos/easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update _index.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-sd.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-sd.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

---------

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
2023-12-18 18:59:08 +01:00
Ettore Di Giacinto
1fc3a375df feat: inline templates and accept URLs in models (#1452)
* feat: Allow inline templates

* feat: Allow to specify url in model config files

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* feat: support 'huggingface://' format

* style: reuse-code from gallery

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-18 18:58:44 +01:00
LocalAI [bot]
64a8471dd5 ⬆️ Update ggerganov/llama.cpp (#1455)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-18 08:55:29 +01:00
LocalAI [bot]
86a8df1c8b ⬆️ Update ggerganov/llama.cpp (#1450)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-17 19:02:28 +01:00
Ettore Di Giacinto
2eeed2287b docs: automatically track latest versions (#1451) 2023-12-17 19:02:13 +01:00
Ettore Di Giacinto
3d83128f16 feat(alias): alias llama to llama-cpp, update docs (#1448)
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-16 18:22:45 +01:00
Ettore Di Giacinto
1c286c3c2f docs(mixtral): add mixtral example (#1449) 2023-12-16 17:44:43 +01:00
LocalAI [bot]
2f7beb6744 ⬆️ Update ggerganov/whisper.cpp (#1434)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-16 09:22:28 +01:00
LocalAI [bot]
ab0370a0b9 ⬆️ Update ggerganov/llama.cpp (#1429)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-16 09:22:13 +01:00
LocalAI [bot]
3f9a41684a ⬆️ Update mudler/go-piper (#1441)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-16 09:21:56 +01:00
Ettore Di Giacinto
dd982acf2c feat(img2vid,txt2vid): Initial support for img2vid,txt2vid (#1442)
* feat(img2vid): Initial support for img2vid

* doc(SD): fix SDXL Example

* Minor fixups for img2vid

* docs(img2img): fix example curl call

* feat(txt2vid): initial support

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* diffusers: be retro-compatible with CUDA settings

* docs(img2vid, txt2vid): examples

* Add notice on docs

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-15 18:06:20 -05:00
Ettore Di Giacinto
fb6a5bc620 update(llama.cpp): update server, correctly propagate LLAMA_VERSION (#1440)
* fix(Makefile): correctly propagate LLAMA_VERSION

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* update grpc-server.cpp

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-15 08:26:48 +01:00
Ettore Di Giacinto
7641f92cde feat(diffusers): update, add autopipeline, controlnet (#1432)
* feat(diffusers): update, add autopipeline, controlnet

* tests with AutoPipeline

* simplify logic
2023-12-13 19:20:22 +01:00
LocalAI [bot]
72325fd0a3 ⬆️ Update ggerganov/whisper.cpp (#1430)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-13 08:37:02 +01:00
Sertaç Özercan
1b7ed5e2e6 docs: add aikit to integrations (#1412)
* docs: add aikit to integrations

Signed-off-by: Sertac Ozercan <sozercan@gmail.com>

* docs: add to readme

Signed-off-by: Sertac Ozercan <sozercan@gmail.com>

---------

Signed-off-by: Sertac Ozercan <sozercan@gmail.com>
Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
2023-12-12 18:58:57 +01:00
LocalAI [bot]
86fac272d8 ⬆️ Update ggerganov/llama.cpp (#1391)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-12 18:22:48 +01:00
Samuel Walker
865e523ff1 Documentation for Hipblas (#1425)
hipblas arch
2023-12-12 15:05:01 +01:00
Ettore Di Giacinto
9aa2a7ca13 extras: add vllm,bark,vall-e-x tests, bump diffusers (#1422)
* tests: add vllm

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* tests: Add vall-e-x tests

* Add bark tests

* bump diffusers

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-12 00:39:26 +01:00
Ettore Di Giacinto
e80cbca6b0 Update _index.en.md
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-12 00:37:01 +01:00
Ettore Di Giacinto
718a5d4a9e fix(transformers*): add sentence-transformers and transformers-musicgen tests, fix musicgen wrapper (#1420)
* tests: add sentence-transformers and transformers-musicgen

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* fix: transformers-musicgen conda env

Initialize correctly the environment for the transformers-musicgen backend.

* fix(tests): transformer-musicgen tests fixups

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-11 19:26:02 +01:00
lunamidori5
9222bec8b1 How To Updates / Model Used Switched / Removed "docker-compose" (RIP) (#1417)
* Update _index.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-model.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update _index.en.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-model.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update _index.en.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

---------

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
2023-12-11 14:27:29 +00:00
LocalAI [bot]
4a965e1b0e ⬆️ Update ggerganov/whisper.cpp (#1418)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-11 08:24:48 +01:00
Ettore Di Giacinto
48e5380e45 tests: add diffusers tests (#1419) 2023-12-11 08:20:34 +01:00
LocalAI [bot]
831418612b ⬆️ Update mudler/go-piper (#1400)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-10 08:50:26 +01:00
LocalAI [bot]
89ff12309d ⬆️ Update ggerganov/whisper.cpp (#1390)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-09 09:23:40 +01:00
Ettore Di Giacinto
3a4fb6fa4b feat(entrypoint): optionally prepare extra endpoints (#1405)
entrypoint: optionally prepare extra endpoints

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2023-12-08 20:04:13 +01:00
Ettore Di Giacinto
b181503c30 docs: update v2.0.0 notes
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-08 16:06:24 +01:00
Ettore Di Giacinto
887b3dff04 feat: cuda transformers (#1401)
* Use cuda in transformers if available

tensorflow probably needs a different check.

Signed-off-by: Erich Schubert <kno10@users.noreply.github.com>

* feat: expose CUDA at top level

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* tests: add to tests and create workflow for py extra backends

* doc: update note on how to use core images

---------

Signed-off-by: Erich Schubert <kno10@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Erich Schubert <kno10@users.noreply.github.com>
2023-12-08 15:45:04 +01:00
Ettore Di Giacinto
3822bd2369 docs: updates
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-08 15:11:44 +01:00
Ettore Di Giacinto
4de2c6a421 docs: update news
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-08 14:59:25 +01:00
Ettore Di Giacinto
6c4231fd35 docs: 2.0 updates
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2023-12-08 14:58:53 +01:00
lunamidori5
adfa7aa1fa docs: site update fixing old image text / How To update updating GPU and CPU docker pages (#1399)
* Update _index.en.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-cpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

* Update easy-setup-docker-gpu.md

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>

---------

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
2023-12-08 10:27:21 +01:00
Dave
8b6e601405 Feat: new backend: transformers-musicgen (#1387)
Transformers-MusicGen
---------

Signed-off-by: Dave <dave@gray101.com>
2023-12-08 10:01:02 +01:00
Ettore Di Giacinto
6011911746 fix(piper): pin petals, phonemize and espeak (#1393)
* fix: pin phonemize and espeak

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: pin petals deps

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2023-12-07 22:58:41 +01:00
LocalAI [bot]
997119c27a ⬆️ Update ggerganov/llama.cpp (#1385)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-05 15:44:24 +01:00
Dave
2eb6865a27 Fix: API Key / JSON Fast Follow #1 (#1388)
fast follow fix #1 - imports, final loop, one last chance to skip

Co-authored-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
2023-12-05 10:35:27 +00:00
Ettore Di Giacinto
2b2d6673ff exllama(v2): fix exllamav1, add exllamav2 (#1384)
* fix(exllama): fix exllama deps with anaconda

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(exllamav2): add exllamav2 backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2023-12-05 08:15:37 +01:00
lunamidori5
563c5b7ea0 Added Check API KEYs file to API.go (#1381)
Added API KEYs file

Signed-off-by: lunamidori5 <118759930+lunamidori5@users.noreply.github.com>
2023-12-04 22:06:45 -05:00
LocalAI [bot]
67966b623c ⬆️ Update ggerganov/llama.cpp (#1379)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-04 18:36:34 +01:00
LocalAI [bot]
9fc3fd04be ⬆️ Update ggerganov/whisper.cpp (#1378)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: mudler <mudler@users.noreply.github.com>
2023-12-04 18:36:22 +01:00
111 changed files with 4423 additions and 1385 deletions

.github/bump_docs.sh vendored Executable file (7 changed lines)

@@ -0,0 +1,7 @@
#!/bin/bash
set -xe
REPO=$1
LATEST_TAG=$(curl -s "https://api.github.com/repos/$REPO/releases/latest" | jq -r '.name')
cat <<< $(jq ".version = \"$LATEST_TAG\"" docs/data/version.json) > docs/data/version.json
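The script takes a single argument, the GitHub repository to query, and rewrites docs/data/version.json with the name of that repository's latest release. A minimal usage sketch, assuming the latest release is named v2.0.0 (an illustrative value, not taken from this diff):

# run manually; the workflow below passes the matrix repository
bash .github/bump_docs.sh mudler/LocalAI

# docs/data/version.json afterwards
{ "version": "v2.0.0" }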

.github/workflows/bump_docs.yaml vendored Normal file (31 changed lines)

@@ -0,0 +1,31 @@
name: Bump dependencies
on:
schedule:
- cron: 0 20 * * *
workflow_dispatch:
jobs:
bump:
strategy:
fail-fast: false
matrix:
include:
- repository: "mudler/LocalAI"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Bump dependencies 🔧
run: |
bash .github/bump_docs.sh ${{ matrix.repository }}
- name: Create Pull Request
uses: peter-evans/create-pull-request@v5
with:
token: ${{ secrets.UPDATE_BOT_TOKEN }}
push-to-fork: ci-forks/LocalAI
commit-message: ':arrow_up: Update docs version ${{ matrix.repository }}'
title: ':arrow_up: Update docs version ${{ matrix.repository }}'
branch: "update/docs"
body: Bump of ${{ matrix.repository }} version inside docs
signoff: true

.github/workflows/test-extra.yml vendored Normal file (250 changed lines)

@@ -0,0 +1,250 @@
---
name: 'Tests extras backends'
on:
pull_request:
push:
branches:
- master
tags:
- '*'
concurrency:
group: ci-tests-extra-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
jobs:
tests-transformers:
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v4
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install build-essential ffmpeg
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
sudo apt-get update && \
sudo apt-get install -y conda
sudo apt-get install -y ca-certificates cmake curl patch
sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
sudo rm -rfv /usr/bin/conda || true
- name: Test transformers
run: |
export PATH=$PATH:/opt/conda/bin
make -C backend/python/transformers
make -C backend/python/transformers test
tests-sentencetransformers:
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v4
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install build-essential ffmpeg
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
sudo apt-get update && \
sudo apt-get install -y conda
sudo apt-get install -y ca-certificates cmake curl patch
sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
sudo rm -rfv /usr/bin/conda || true
- name: Test sentencetransformers
run: |
export PATH=$PATH:/opt/conda/bin
make -C backend/python/sentencetransformers
make -C backend/python/sentencetransformers test
tests-diffusers:
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v4
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install build-essential ffmpeg
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
sudo apt-get update && \
sudo apt-get install -y conda
sudo apt-get install -y ca-certificates cmake curl patch
sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
sudo rm -rfv /usr/bin/conda || true
- name: Test diffusers
run: |
export PATH=$PATH:/opt/conda/bin
make -C backend/python/diffusers
make -C backend/python/diffusers test
tests-transformers-musicgen:
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v4
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install build-essential ffmpeg
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
sudo apt-get update && \
sudo apt-get install -y conda
sudo apt-get install -y ca-certificates cmake curl patch
sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
sudo rm -rfv /usr/bin/conda || true
- name: Test transformers-musicgen
run: |
export PATH=$PATH:/opt/conda/bin
make -C backend/python/transformers-musicgen
make -C backend/python/transformers-musicgen test
tests-petals:
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v4
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install build-essential ffmpeg
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
sudo apt-get update && \
sudo apt-get install -y conda
sudo apt-get install -y ca-certificates cmake curl patch
sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
sudo rm -rfv /usr/bin/conda || true
- name: Test petals
run: |
export PATH=$PATH:/opt/conda/bin
make -C backend/python/petals
make -C backend/python/petals test
tests-bark:
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v4
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install build-essential ffmpeg
curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
sudo apt-get update && \
sudo apt-get install -y conda
sudo apt-get install -y ca-certificates cmake curl patch
sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
sudo rm -rfv /usr/bin/conda || true
- name: Test bark
run: |
export PATH=$PATH:/opt/conda/bin
make -C backend/python/bark
make -C backend/python/bark test
# Below tests needs GPU. Commented out for now
# TODO: Re-enable as soon as we have GPU nodes
# tests-vllm:
# runs-on: ubuntu-latest
# steps:
# - name: Clone
# uses: actions/checkout@v4
# with:
# submodules: true
# - name: Dependencies
# run: |
# sudo apt-get update
# sudo apt-get install build-essential ffmpeg
# curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
# sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
# gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
# sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
# sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
# sudo apt-get update && \
# sudo apt-get install -y conda
# sudo apt-get install -y ca-certificates cmake curl patch
# sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
# sudo rm -rfv /usr/bin/conda || true
# - name: Test vllm
# run: |
# export PATH=$PATH:/opt/conda/bin
# make -C backend/python/vllm
# make -C backend/python/vllm test
# tests-vallex:
# runs-on: ubuntu-latest
# steps:
# - name: Clone
# uses: actions/checkout@v4
# with:
# submodules: true
# - name: Dependencies
# run: |
# sudo apt-get update
# sudo apt-get install build-essential ffmpeg
# curl https://repo.anaconda.com/pkgs/misc/gpgkeys/anaconda.asc | gpg --dearmor > conda.gpg && \
# sudo install -o root -g root -m 644 conda.gpg /usr/share/keyrings/conda-archive-keyring.gpg && \
# gpg --keyring /usr/share/keyrings/conda-archive-keyring.gpg --no-default-keyring --fingerprint 34161F5BF5EB1D4BFBBB8F0A8AEB4F8B29D82806 && \
# sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" > /etc/apt/sources.list.d/conda.list' && \
# sudo /bin/bash -c 'echo "deb [arch=amd64 signed-by=/usr/share/keyrings/conda-archive-keyring.gpg] https://repo.anaconda.com/pkgs/misc/debrepo/conda stable main" | tee -a /etc/apt/sources.list.d/conda.list' && \
# sudo apt-get update && \
# sudo apt-get install -y conda
# sudo apt-get install -y ca-certificates cmake curl patch
# sudo apt-get install -y libopencv-dev && sudo ln -s /usr/include/opencv4/opencv2 /usr/include/opencv2
# sudo rm -rfv /usr/bin/conda || true
# - name: Test vall-e-x
# run: |
# export PATH=$PATH:/opt/conda/bin
# make -C backend/python/vall-e-x
# make -C backend/python/vall-e-x test
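Every job in this workflow follows the same pattern: install conda from Anaconda's apt repository, then build and test one Python backend through its Makefile. A sketch of reproducing a single job locally, assuming conda is already installed under /opt/conda as in the CI steps above:

# build the conda environment for one extra backend, then run its tests
export PATH=$PATH:/opt/conda/bin
make -C backend/python/transformers
make -C backend/python/transformers test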


@@ -83,7 +83,7 @@ jobs:
# Pre-build piper before we start tests in order to have shared libraries in place
make sources/go-piper && \
GO_TAGS="tts" make -C sources/go-piper piper.o && \
sudo cp -rfv sources/go-piper/piper/build/pi/lib/. /usr/lib/ && \
sudo cp -rfv sources/go-piper/piper-phonemize/pi/lib/. /usr/lib/ && \
# Pre-build stable diffusion before we install a newer version of abseil (not compatible with stablediffusion-ncn)
GO_TAGS="stablediffusion tts" GRPC_BACKENDS=backend-assets/grpc/stablediffusion make build

.gitignore vendored (1 changed line)

@@ -3,6 +3,7 @@
__pycache__/
*.a
get-sources
prepare-sources
/backend/cpp/llama/grpc-server
/backend/cpp/llama/llama.cpp


@@ -12,7 +12,9 @@ ARG TARGETARCH
ARG TARGETVARIANT
ENV BUILD_TYPE=${BUILD_TYPE}
ENV EXTERNAL_GRPC_BACKENDS="huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh,petals:/build/backend/python/petals/run.sh,transformers:/build/backend/python/transformers/run.sh,sentencetransformers:/build/backend/python/sentencetransformers/run.sh,autogptq:/build/backend/python/autogptq/run.sh,bark:/build/backend/python/bark/run.sh,diffusers:/build/backend/python/diffusers/run.sh,exllama:/build/backend/python/exllama/run.sh,vall-e-x:/build/backend/python/vall-e-x/run.sh,vllm:/build/backend/python/vllm/run.sh"
ENV EXTERNAL_GRPC_BACKENDS="huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh,petals:/build/backend/python/petals/run.sh,transformers:/build/backend/python/transformers/run.sh,sentencetransformers:/build/backend/python/sentencetransformers/run.sh,autogptq:/build/backend/python/autogptq/run.sh,bark:/build/backend/python/bark/run.sh,diffusers:/build/backend/python/diffusers/run.sh,exllama:/build/backend/python/exllama/run.sh,vall-e-x:/build/backend/python/vall-e-x/run.sh,vllm:/build/backend/python/vllm/run.sh,exllama2:/build/backend/python/exllama2/run.sh,transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh"
ENV GALLERIES='[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]'
ARG GO_TAGS="stablediffusion tts"
@@ -105,9 +107,9 @@ RUN if [ "${BUILD_GRPC}" = "true" ]; then \
# Rebuild with defaults backends
RUN make build
RUN if [ ! -d "/build/sources/go-piper/piper/build/pi/lib/" ]; then \
mkdir -p /build/sources/go-piper/piper/build/pi/lib/ \
touch /build/sources/go-piper/piper/build/pi/lib/keep \
RUN if [ ! -d "/build/sources/go-piper/piper-phonemize/pi/lib/" ]; then \
mkdir -p /build/sources/go-piper/piper-phonemize/pi/lib/ \
touch /build/sources/go-piper/piper-phonemize/pi/lib/keep \
; fi
###################################
@@ -151,7 +153,7 @@ RUN make prepare-sources && cd /build/grpc/cmake/build && make install && rm -rf
COPY --from=builder /build/local-ai ./
# Copy shared libraries for piper
COPY --from=builder /build/sources/go-piper/piper/build/pi/lib/* /usr/lib/
COPY --from=builder /build/sources/go-piper/piper-phonemize/pi/lib/* /usr/lib/
# do not let stablediffusion rebuild (requires an older version of absl)
COPY --from=builder /build/backend-assets/grpc/stablediffusion ./backend-assets/grpc/stablediffusion
@@ -181,15 +183,15 @@ RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
PATH=$PATH:/opt/conda/bin make -C backend/python/exllama \
; fi
RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
PATH=$PATH:/opt/conda/bin make -C backend/python/exllama2 \
; fi
RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
PATH=$PATH:/opt/conda/bin make -C backend/python/petals \
; fi
# we also copy exllama libs over to resolve exllama import error
# TODO: check if this is still needed
RUN if [ -d /usr/local/lib/python3.9/dist-packages/exllama ]; then \
cp -rfv /usr/local/lib/python3.9/dist-packages/exllama backend/python/exllama/;\
fi
RUN if [ "${IMAGE_TYPE}" = "extras" ]; then \
PATH=$PATH:/opt/conda/bin make -C backend/python/transformers-musicgen \
; fi
# Define the health check command
HEALTHCHECK --interval=1m --timeout=10m --retries=10 \
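EXTERNAL_GRPC_BACKENDS is a plain list of name:launcher-script pairs, so the default baked into the image (now including exllama2 and transformers-musicgen) can be overridden at run time. A hedged sketch; the image reference and port mapping are placeholders, not taken from this diff:

# start an extras image with only the two newly wired backends registered
docker run -p 8080:8080 \
  -e EXTERNAL_GRPC_BACKENDS="exllama2:/build/backend/python/exllama2/run.sh,transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh" \
  <localai-extras-image>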


@@ -8,7 +8,7 @@ GOLLAMA_VERSION?=aeba71ee842819da681ea537e78846dc75949ac0
GOLLAMA_STABLE_VERSION?=50cee7712066d9e38306eccadcfbb44ea87df4b7
CPPLLAMA_VERSION?=5a7d3125e7c24f223659b7f0b7aa7736986e92c0
CPPLLAMA_VERSION?=328b83de23b33240e28f4e74900d1d06726f5eb1
# gpt4all version
GPT4ALL_REPO?=https://github.com/nomic-ai/gpt4all
@@ -19,16 +19,16 @@ GOGGMLTRANSFORMERS_VERSION?=ffb09d7dd71e2cbc6c5d7d05357d230eea6f369a
# go-rwkv version
RWKV_REPO?=https://github.com/donomii/go-rwkv.cpp
RWKV_VERSION?=c898cd0f62df8f2a7830e53d1d513bef4f6f792b
RWKV_VERSION?=8f6d062fa80ed4ac4a00d1ac53aa4de54183fffe
# whisper.cpp version
WHISPER_CPP_VERSION?=e369243ebd24c8a14201f6b4280bccbb7b6a7df3
WHISPER_CPP_VERSION?=9286d3f584240ba58bd44a1bd1e85141579c78d4
# bert.cpp version
BERT_VERSION?=6abe312cded14042f6b7c3cd8edf082713334a4d
# go-piper version
PIPER_VERSION?=5a4c9e28c84bac09ab6baa9f88457d852cb46bb2
PIPER_VERSION?=d6b6275ba037dabdba4a8b65dfdf6b2a73a67f07
# stablediffusion version
STABLEDIFFUSION_VERSION?=902db5f066fd137697e3b69d0fa10d4782bd2c2f
@@ -36,6 +36,7 @@ STABLEDIFFUSION_VERSION?=902db5f066fd137697e3b69d0fa10d4782bd2c2f
export BUILD_TYPE?=
export STABLE_BUILD_TYPE?=$(BUILD_TYPE)
export CMAKE_ARGS?=
CGO_LDFLAGS?=
CUDA_LIBPATH?=/usr/local/cuda/lib64/
GO_TAGS?=
@@ -191,7 +192,7 @@ backend-assets/gpt4all: sources/gpt4all/gpt4all-bindings/golang/libgpt4all.a
backend-assets/espeak-ng-data: sources/go-piper
mkdir -p backend-assets/espeak-ng-data
$(MAKE) -C sources/go-piper piper.o
@cp -rf sources/go-piper/piper/build/pi/share/espeak-ng-data/. backend-assets/espeak-ng-data
@cp -rf sources/go-piper/piper-phonemize/pi/share/espeak-ng-data/. backend-assets/espeak-ng-data
sources/gpt4all/gpt4all-bindings/golang/libgpt4all.a: sources/gpt4all
$(MAKE) -C sources/gpt4all/gpt4all-bindings/golang/ libgpt4all.a
@@ -229,7 +230,7 @@ sources/go-piper/libpiper_binding.a: sources/go-piper
$(MAKE) -C sources/go-piper libpiper_binding.a example/main
backend/cpp/llama/llama.cpp:
$(MAKE) -C backend/cpp/llama llama.cpp
LLAMA_VERSION=$(CPPLLAMA_VERSION) $(MAKE) -C backend/cpp/llama llama.cpp
get-sources: backend/cpp/llama/llama.cpp sources/go-llama sources/go-llama-ggml sources/go-ggml-transformers sources/gpt4all sources/go-piper sources/go-rwkv sources/whisper.cpp sources/go-bert sources/go-stable-diffusion
touch $@
@@ -383,12 +384,13 @@ help: ## Show this help.
protogen: protogen-go protogen-python
protogen-go:
protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative \
protoc -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
backend/backend.proto
protogen-python:
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/sentencetransformers/ --grpc_python_out=backend/python/sentencetransformers/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/transformers/ --grpc_python_out=backend/python/transformers/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/transformers-musicgen/ --grpc_python_out=backend/python/transformers-musicgen/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/autogptq/ --grpc_python_out=backend/python/autogptq/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/exllama/ --grpc_python_out=backend/python/exllama/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/bark/ --grpc_python_out=backend/python/bark/ backend/backend.proto
@@ -396,6 +398,7 @@ protogen-python:
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/vall-e-x/ --grpc_python_out=backend/python/vall-e-x/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/vllm/ --grpc_python_out=backend/python/vllm/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/petals/ --grpc_python_out=backend/python/petals/ backend/backend.proto
python3 -m grpc_tools.protoc -Ibackend/ --python_out=backend/python/exllama2/ --grpc_python_out=backend/python/exllama2/ backend/backend.proto
## GRPC
# Note: it is duplicated in the Dockerfile
@@ -406,10 +409,19 @@ prepare-extra-conda-environments:
$(MAKE) -C backend/python/vllm
$(MAKE) -C backend/python/sentencetransformers
$(MAKE) -C backend/python/transformers
$(MAKE) -C backend/python/transformers-musicgen
$(MAKE) -C backend/python/vall-e-x
$(MAKE) -C backend/python/exllama
$(MAKE) -C backend/python/petals
$(MAKE) -C backend/python/exllama2
prepare-test-extra:
$(MAKE) -C backend/python/transformers
$(MAKE) -C backend/python/diffusers
test-extra: prepare-test-extra
$(MAKE) -C backend/python/transformers test
$(MAKE) -C backend/python/diffusers test
backend-assets/grpc:
mkdir -p backend-assets/grpc
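The new prepare-test-extra and test-extra targets drive the same per-backend Makefiles used by the workflow above, and protogen-python now also regenerates stubs for transformers-musicgen and exllama2. A short local invocation sketch, assuming conda lives under /opt/conda:

# regenerate the Python gRPC stubs, then run the extra-backend tests
export PATH=$PATH:/opt/conda/bin
make protogen-python
make test-extra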


@@ -21,7 +21,7 @@
</p>
> :bulb: Get help - [❓FAQ](https://localai.io/faq/) [💭Discussions](https://github.com/go-skynet/LocalAI/discussions) [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) [:book: Documentation website](https://localai.io/)
>
>
> [💻 Quickstart](https://localai.io/basics/getting_started/) [📣 News](https://localai.io/basics/news/) [ 🛫 Examples ](https://github.com/go-skynet/LocalAI/tree/master/examples/) [ 🖼️ Models ](https://localai.io/models/) [ 🚀 Roadmap ](https://github.com/mudler/LocalAI/issues?q=is%3Aissue+is%3Aopen+label%3Aroadmap)
[![tests](https://github.com/go-skynet/LocalAI/actions/workflows/test.yml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/test.yml)[![Build and Release](https://github.com/go-skynet/LocalAI/actions/workflows/release.yaml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/release.yaml)[![build container images](https://github.com/go-skynet/LocalAI/actions/workflows/image.yml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/image.yml)[![Bump dependencies](https://github.com/go-skynet/LocalAI/actions/workflows/bump_deps.yaml/badge.svg)](https://github.com/go-skynet/LocalAI/actions/workflows/bump_deps.yaml)[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/localai)](https://artifacthub.io/packages/search?repo=localai)
@@ -54,7 +54,7 @@
<p align="center">
<a href="https://twitter.com/intent/tweet?text=Check%20this%20GitHub%20repository%20out.%20LocalAI%20-%20Let%27s%20you%20easily%20run%20LLM%20locally.&url=https://github.com/go-skynet/LocalAI&hashtags=LocalAI,AI" target="blank">
<img src="https://img.shields.io/twitter/follow/_LocalAI?label=Share Repo on Twitter&style=social" alt="Follow _LocalAI"/></a>
<img src="https://img.shields.io/twitter/follow/_LocalAI?label=Share Repo on Twitter&style=social" alt="Follow _LocalAI"/></a>
<a href="https://t.me/share/url?text=Check%20this%20GitHub%20repository%20out.%20LocalAI%20-%20Let%27s%20you%20easily%20run%20LLM%20locally.&url=https://github.com/go-skynet/LocalAI" target="_blank"><img src="https://img.shields.io/twitter/url?label=Telegram&logo=Telegram&style=social&url=https://github.com/go-skynet/LocalAI" alt="Share on Telegram"/></a>
<a href="https://api.whatsapp.com/send?text=Check%20this%20GitHub%20repository%20out.%20LocalAI%20-%20Let%27s%20you%20easily%20run%20LLM%20locally.%20https://github.com/go-skynet/LocalAI"><img src="https://img.shields.io/twitter/url?label=whatsapp&logo=whatsapp&style=social&url=https://github.com/go-skynet/LocalAI" /></a> <a href="https://www.reddit.com/submit?url=https://github.com/go-skynet/LocalAI&title=Check%20this%20GitHub%20repository%20out.%20LocalAI%20-%20Let%27s%20you%20easily%20run%20LLM%20locally.
" target="blank">
@@ -85,12 +85,12 @@ In a nutshell:
- Local, OpenAI drop-in alternative REST API. You own your data.
- NO GPU required. NO Internet access is required either
- Optional, GPU Acceleration is available in `llama.cpp`-compatible LLMs. See also the [build section](https://localai.io/basics/build/index.html).
- Optional, GPU Acceleration is available in `llama.cpp`-compatible LLMs. See also the [build section](https://localai.io/basics/build/index.html).
- Supports multiple models
- 🏃 Once loaded the first time, it keep models loaded in memory for faster inference
- ⚡ Doesn't shell-out, but uses C++ bindings for a faster inference and better performance.
LocalAI was created by [Ettore Di Giacinto](https://github.com/mudler/) and is a community-driven project, focused on making the AI accessible to anyone. Any contribution, feedback and PR is welcome!
LocalAI was created by [Ettore Di Giacinto](https://github.com/mudler/) and is a community-driven project, focused on making the AI accessible to anyone. Any contribution, feedback and PR is welcome!
Note that this started just as a [fun weekend project](https://localai.io/#backstory) in order to try to create the necessary pieces for a full AI assistant like `ChatGPT`: the community is growing fast and we are working hard to make it better and more stable. If you want to help, please consider contributing (see below)!
@@ -112,6 +112,9 @@ Check out the [Getting started](https://localai.io/basics/getting_started/index.
### 🔗 Community and integrations
Build and deploy custom containers:
- https://github.com/sozercan/aikit
WebUIs:
- https://github.com/Jirubizu/localai-admin
- https://github.com/go-skynet/LocalAI-frontend
@@ -129,7 +132,7 @@ Other:
- [How to install in Kubernetes](https://localai.io/basics/getting_started/index.html#run-localai-in-kubernetes)
- [Projects integrating LocalAI](https://localai.io/integrations/)
- [How tos section](https://localai.io/howtos/) (curated by our community)
## :book: 🎥 [Media, Blogs, Social](https://localai.io/basics/news/#media-blogs-social)
- [Create a slackbot for teams and OSS projects that answer to documentation](https://mudler.pm/posts/smart-slackbot-for-teams/)
@@ -159,12 +162,12 @@ Support the project by becoming [a backer or sponsor](https://github.com/sponsor
A huge thank you to our generous sponsors who support this project:
| ![Spectro Cloud logo_600x600px_transparent bg](https://github.com/go-skynet/LocalAI/assets/2420543/68a6f3cb-8a65-4a4d-99b5-6417a8905512) |
| ![Spectro Cloud logo_600x600px_transparent bg](https://github.com/go-skynet/LocalAI/assets/2420543/68a6f3cb-8a65-4a4d-99b5-6417a8905512) |
|:-----------------------------------------------:|
| [Spectro Cloud](https://www.spectrocloud.com/) |
| [Spectro Cloud](https://www.spectrocloud.com/) |
| Spectro Cloud kindly supports LocalAI by providing GPU and computing resources to run tests on lamdalabs! |
And a huge shout-out to individuals sponsoring the project by donating hardware or backing the project.
And a huge shout-out to individuals sponsoring the project by donating hardware or backing the project.
- [Sponsor list](https://github.com/sponsors/mudler)
- JDAM00 (donating HW for the CI)


@@ -1,8 +1,10 @@
package api
import (
"encoding/json"
"errors"
"fmt"
"os"
"strings"
config "github.com/go-skynet/LocalAI/api/config"
@@ -45,6 +47,10 @@ func Startup(opts ...options.AppOption) (*options.Option, *config.ConfigLoader,
}
}
if err := cl.Preload(options.Loader.ModelPath); err != nil {
log.Error().Msgf("error downloading models: %s", err.Error())
}
if options.Debug {
for _, v := range cl.ListConfigs() {
cfg, _ := cl.GetConfig(v)
@@ -144,28 +150,46 @@ func App(opts ...options.AppOption) (*fiber.App, error) {
// Auth middleware checking if API key is valid. If no API key is set, no auth is required.
auth := func(c *fiber.Ctx) error {
if len(options.ApiKeys) > 0 {
authHeader := c.Get("Authorization")
if authHeader == "" {
return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"message": "Authorization header missing"})
}
authHeaderParts := strings.Split(authHeader, " ")
if len(authHeaderParts) != 2 || authHeaderParts[0] != "Bearer" {
return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"message": "Invalid Authorization header format"})
if len(options.ApiKeys) == 0 {
return c.Next()
}
// Check for api_keys.json file
fileContent, err := os.ReadFile("api_keys.json")
if err == nil {
// Parse JSON content from the file
var fileKeys []string
err := json.Unmarshal(fileContent, &fileKeys)
if err != nil {
return c.Status(fiber.StatusInternalServerError).JSON(fiber.Map{"message": "Error parsing api_keys.json"})
}
apiKey := authHeaderParts[1]
validApiKey := false
for _, key := range options.ApiKeys {
if apiKey == key {
validApiKey = true
}
}
if !validApiKey {
return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"message": "Invalid API key"})
// Add file keys to options.ApiKeys
options.ApiKeys = append(options.ApiKeys, fileKeys...)
}
if len(options.ApiKeys) == 0 {
return c.Next()
}
authHeader := c.Get("Authorization")
if authHeader == "" {
return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"message": "Authorization header missing"})
}
authHeaderParts := strings.Split(authHeader, " ")
if len(authHeaderParts) != 2 || authHeaderParts[0] != "Bearer" {
return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"message": "Invalid Authorization header format"})
}
apiKey := authHeaderParts[1]
for _, key := range options.ApiKeys {
if apiKey == key {
return c.Next()
}
}
return c.Next()
return c.Status(fiber.StatusUnauthorized).JSON(fiber.Map{"message": "Invalid API key"})
}
if options.CORS {


@@ -294,7 +294,7 @@ var _ = Describe("API test", func() {
Expect(content["backend"]).To(Equal("bert-embeddings"))
})
It("runs openllama", Label("llama"), func() {
It("runs openllama(llama-ggml backend)", Label("llama"), func() {
if runtime.GOOS != "linux" {
Skip("test supported only on linux")
}
@@ -362,9 +362,10 @@ var _ = Describe("API test", func() {
Expect(res["location"]).To(Equal("San Francisco, California, United States"), fmt.Sprint(res))
Expect(res["unit"]).To(Equal("celcius"), fmt.Sprint(res))
Expect(string(resp2.Choices[0].FinishReason)).To(Equal("function_call"), fmt.Sprint(resp2.Choices[0].FinishReason))
})
It("runs openllama gguf", Label("llama-gguf"), func() {
It("runs openllama gguf(llama-cpp)", Label("llama-gguf"), func() {
if runtime.GOOS != "linux" {
Skip("test supported only on linux")
}


@@ -16,7 +16,7 @@ func ImageGeneration(height, width, mode, step, seed int, positive_prompt, negat
model.WithContext(o.Context),
model.WithModel(c.Model),
model.WithLoadGRPCLoadModelOpts(&proto.ModelOptions{
CUDA: c.Diffusers.CUDA,
CUDA: c.CUDA || c.Diffusers.CUDA,
SchedulerType: c.Diffusers.SchedulerType,
PipelineType: c.Diffusers.PipelineType,
CFGScale: c.Diffusers.CFGScale,
@@ -27,6 +27,7 @@ func ImageGeneration(height, width, mode, step, seed int, positive_prompt, negat
CLIPModel: c.Diffusers.ClipModel,
CLIPSubfolder: c.Diffusers.ClipSubFolder,
CLIPSkip: int32(c.Diffusers.ClipSkip),
ControlNet: c.Diffusers.ControlNet,
}),
})


@@ -46,6 +46,7 @@ func gRPCModelOpts(c config.Config) *pb.ModelOptions {
Seed: int32(c.Seed),
NBatch: int32(b),
NoMulMatQ: c.NoMulMatQ,
CUDA: c.CUDA, // diffusers, transformers
DraftModel: c.DraftModel,
AudioPath: c.VallE.AudioPath,
Quantization: c.Quantization,


@@ -59,9 +59,13 @@ func ModelTTS(backend, text, modelFile string, loader *model.ModelLoader, o *opt
// If the model file is not empty, we pass it joined with the model path
modelPath := ""
if modelFile != "" {
modelPath = filepath.Join(o.Loader.ModelPath, modelFile)
if err := utils.VerifyPath(modelPath, o.Loader.ModelPath); err != nil {
return "", nil, err
if bb != model.TransformersMusicGen {
modelPath = filepath.Join(o.Loader.ModelPath, modelFile)
if err := utils.VerifyPath(modelPath, o.Loader.ModelPath); err != nil {
return "", nil, err
}
} else {
modelPath = modelFile
}
}


@@ -8,6 +8,8 @@ import (
"strings"
"sync"
"github.com/go-skynet/LocalAI/pkg/utils"
"github.com/rs/zerolog/log"
"gopkg.in/yaml.v3"
)
@@ -38,14 +40,17 @@ type Config struct {
// Diffusers
Diffusers Diffusers `yaml:"diffusers"`
Step int `yaml:"step"`
Step int `yaml:"step"`
// GRPC Options
GRPC GRPC `yaml:"grpc"`
// Vall-e-x
VallE VallE `yaml:"vall-e"`
// CUDA
// Explicitly enable CUDA or not (some backends might need it)
CUDA bool `yaml:"cuda"`
}
type VallE struct {
@@ -65,15 +70,16 @@ type GRPC struct {
}
type Diffusers struct {
CUDA bool `yaml:"cuda"`
PipelineType string `yaml:"pipeline_type"`
SchedulerType string `yaml:"scheduler_type"`
CUDA bool `yaml:"cuda"`
EnableParameters string `yaml:"enable_parameters"` // A list of comma separated parameters to specify
CFGScale float32 `yaml:"cfg_scale"` // Classifier-Free Guidance Scale
IMG2IMG bool `yaml:"img2img"` // Image to Image Diffuser
ClipSkip int `yaml:"clip_skip"` // Skip every N frames
ClipModel string `yaml:"clip_model"` // Clip model to use
ClipSubFolder string `yaml:"clip_subfolder"` // Subfolder to use for clip model
ControlNet string `yaml:"control_net"`
}
type LLMConfig struct {
@@ -260,6 +266,36 @@ func (cm *ConfigLoader) ListConfigs() []string {
return res
}
func (cm *ConfigLoader) Preload(modelPath string) error {
cm.Lock()
defer cm.Unlock()
for i, config := range cm.configs {
modelURL := config.PredictionOptions.Model
modelURL = utils.ConvertURL(modelURL)
if strings.HasPrefix(modelURL, "http://") || strings.HasPrefix(modelURL, "https://") {
// md5 of model name
md5Name := utils.MD5(modelURL)
// check if file exists
if _, err := os.Stat(filepath.Join(modelPath, md5Name)); err == os.ErrNotExist {
err := utils.DownloadFile(modelURL, filepath.Join(modelPath, md5Name), "", func(fileName, current, total string, percent float64) {
log.Info().Msgf("Downloading %s: %s/%s (%.2f%%)", fileName, current, total, percent)
})
if err != nil {
return err
}
}
cc := cm.configs[i]
c := &cc
c.PredictionOptions.Model = md5Name
cm.configs[i] = *c
}
}
return nil
}
func (cm *ConfigLoader) LoadConfigs(path string) error {
cm.Lock()
defer cm.Unlock()
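Preload downloads any model whose reference is (or expands to) an http(s) URL into the model path under the md5 of the URL, then swaps that local name back into the config, which is how the URL and huggingface:// support from #1452 takes effect at startup. A hypothetical model config sketch; the key layout is inferred from the yaml tags above and the usual name/parameters structure, and the URL is a placeholder, so the exact shape may differ:

name: my-model
# new top-level flag, forwarded to transformers/diffusers backends
cuda: true
parameters:
  # downloaded by Preload at startup and replaced with its md5-named local copy
  model: https://example.com/models/my-model.gguf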


@@ -219,7 +219,12 @@ func ChatEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fiber.Ctx)
c.Set("Transfer-Encoding", "chunked")
}
templateFile := config.Model
templateFile := ""
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
if o.Loader.ExistsInModelPath(fmt.Sprintf("%s.tmpl", config.Model)) {
templateFile = config.Model
}
if config.TemplateConfig.Chat != "" && !processFunctions {
templateFile = config.TemplateConfig.Chat
@@ -229,18 +234,19 @@ func ChatEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fiber.Ctx)
templateFile = config.TemplateConfig.Functions
}
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.ChatPromptTemplate, templateFile, model.PromptTemplateData{
SystemPrompt: config.SystemPrompt,
SuppressSystemPrompt: suppressConfigSystemPrompt,
Input: predInput,
Functions: funcs,
})
if err == nil {
predInput = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", predInput)
} else {
log.Debug().Msgf("Template failed loading: %s", err.Error())
if templateFile != "" {
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.ChatPromptTemplate, templateFile, model.PromptTemplateData{
SystemPrompt: config.SystemPrompt,
SuppressSystemPrompt: suppressConfigSystemPrompt,
Input: predInput,
Functions: funcs,
})
if err == nil {
predInput = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", predInput)
} else {
log.Debug().Msgf("Template failed loading: %s", err.Error())
}
}
log.Debug().Msgf("Prompt (after templating): %s", predInput)


@@ -81,7 +81,12 @@ func CompletionEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fibe
c.Set("Transfer-Encoding", "chunked")
}
templateFile := config.Model
templateFile := ""
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
if o.Loader.ExistsInModelPath(fmt.Sprintf("%s.tmpl", config.Model)) {
templateFile = config.Model
}
if config.TemplateConfig.Completion != "" {
templateFile = config.TemplateConfig.Completion
@@ -94,13 +99,14 @@ func CompletionEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fibe
predInput := config.PromptStrings[0]
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.CompletionPromptTemplate, templateFile, model.PromptTemplateData{
Input: predInput,
})
if err == nil {
predInput = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", predInput)
if templateFile != "" {
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.CompletionPromptTemplate, templateFile, model.PromptTemplateData{
Input: predInput,
})
if err == nil {
predInput = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", predInput)
}
}
responses := make(chan schema.OpenAIResponse)
@@ -145,14 +151,16 @@ func CompletionEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fibe
totalTokenUsage := backend.TokenUsage{}
for k, i := range config.PromptStrings {
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.CompletionPromptTemplate, templateFile, model.PromptTemplateData{
SystemPrompt: config.SystemPrompt,
Input: i,
})
if err == nil {
i = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", i)
if templateFile != "" {
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.CompletionPromptTemplate, templateFile, model.PromptTemplateData{
SystemPrompt: config.SystemPrompt,
Input: i,
})
if err == nil {
i = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", i)
}
}
r, tokenUsage, err := ComputeChoices(


@@ -30,7 +30,12 @@ func EditEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fiber.Ctx)
log.Debug().Msgf("Parameter Config: %+v", config)
templateFile := config.Model
templateFile := ""
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
if o.Loader.ExistsInModelPath(fmt.Sprintf("%s.tmpl", config.Model)) {
templateFile = config.Model
}
if config.TemplateConfig.Edit != "" {
templateFile = config.TemplateConfig.Edit
@@ -40,15 +45,16 @@ func EditEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fiber.Ctx)
totalTokenUsage := backend.TokenUsage{}
for _, i := range config.InputStrings {
// A model can have a "file.bin.tmpl" file associated with a prompt template prefix
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.EditPromptTemplate, templateFile, model.PromptTemplateData{
Input: i,
Instruction: input.Instruction,
SystemPrompt: config.SystemPrompt,
})
if err == nil {
i = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", i)
if templateFile != "" {
templatedInput, err := o.Loader.EvaluateTemplateForPrompt(model.EditPromptTemplate, templateFile, model.PromptTemplateData{
Input: i,
Instruction: input.Instruction,
SystemPrompt: config.SystemPrompt,
})
if err == nil {
i = templatedInput
log.Debug().Msgf("Template found, input modified to: %s", i)
}
}
r, tokenUsage, err := ComputeChoices(input, i, config, o, o.Loader, func(s string, c *[]schema.Choice) {


@@ -5,6 +5,8 @@ import (
"encoding/base64"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"strconv"
@@ -22,6 +24,26 @@ import (
"github.com/rs/zerolog/log"
)
func downloadFile(url string) (string, error) {
// Get the data
resp, err := http.Get(url)
if err != nil {
return "", err
}
defer resp.Body.Close()
// Create the file
out, err := os.CreateTemp("", "image")
if err != nil {
return "", err
}
defer out.Close()
// Write the body to file
_, err = io.Copy(out, resp.Body)
return out.Name(), err
}
// https://platform.openai.com/docs/api-reference/images/create
/*
@@ -56,12 +78,31 @@ func ImageEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fiber.Ctx
src := ""
if input.File != "" {
//base 64 decode the file and write it somewhere
// that we will cleanup
decoded, err := base64.StdEncoding.DecodeString(input.File)
if err != nil {
return err
fileData := []byte{}
// check if input.File is an URL, if so download it and save it
// to a temporary file
if strings.HasPrefix(input.File, "http://") || strings.HasPrefix(input.File, "https://") {
out, err := downloadFile(input.File)
if err != nil {
return fmt.Errorf("failed downloading file:%w", err)
}
defer os.RemoveAll(out)
fileData, err = os.ReadFile(out)
if err != nil {
return fmt.Errorf("failed reading file:%w", err)
}
} else {
// base 64 decode the file and write it somewhere
// that we will cleanup
fileData, err = base64.StdEncoding.DecodeString(input.File)
if err != nil {
return err
}
}
// Create a temporary file
outputFile, err := os.CreateTemp(o.ImageDir, "b64")
if err != nil {
@@ -69,7 +110,7 @@ func ImageEndpoint(cm *config.ConfigLoader, o *options.Option) func(c *fiber.Ctx
}
// write the base64 result
writer := bufio.NewWriter(outputFile)
_, err = writer.Write(decoded)
_, err = writer.Write(fileData)
if err != nil {
outputFile.Close()
return err
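With this change input.File may carry either base64 image data or a URL, which is downloaded to a temporary file before being handed to the backend. A hypothetical request sketch, assuming the field is exposed as "file" on the OpenAI-style images route; names and values are illustrative, not taken from this diff:

curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "stablediffusion", "prompt": "a cat", "size": "512x512", "file": "https://example.com/input.png"}'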


@@ -110,6 +110,7 @@ message ModelOptions {
string CLIPModel = 31;
string CLIPSubfolder = 32;
int32 CLIPSkip = 33;
string ControlNet = 48;
// RWKV
string Tokenizer = 34;


@@ -1,5 +1,5 @@
LLAMA_VERSION?=d9b33fe95bd257b36c84ee5769cc048230067d6f
LLAMA_VERSION?=
CMAKE_ARGS?=
BUILD_TYPE?=
@@ -21,6 +21,9 @@ endif
llama.cpp:
git clone --recurse-submodules https://github.com/ggerganov/llama.cpp llama.cpp
if [ -z "$(LLAMA_VERSION)" ]; then \
exit 1; \
fi
cd llama.cpp && git checkout -b build $(LLAMA_VERSION) && git submodule update --init --recursive --depth 1
llama.cpp/examples/grpc-server:
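The backend Makefile no longer hardcodes a llama.cpp revision; it exits early unless LLAMA_VERSION is provided, which the top-level Makefile now supplies through CPPLLAMA_VERSION. A standalone invocation sketch using the sha pinned above:

LLAMA_VERSION=328b83de23b33240e28f4e74900d1d06726f5eb1 make -C backend/cpp/llama llama.cpp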


@@ -40,9 +40,18 @@ using backend::HealthMessage;
///// LLAMA.CPP server code below
#define DEFAULT_OAICOMPAT_MODEL "gpt-3.5-turbo-0613"
using json = nlohmann::json;
struct server_params
{
std::string hostname = "127.0.0.1";
std::string public_path = "examples/server/public";
int32_t port = 8080;
int32_t read_timeout = 600;
int32_t write_timeout = 600;
};
static bool server_verbose = false;
#if SERVER_VERBOSE != 1
@@ -62,6 +71,10 @@ static bool server_verbose = false;
#define LOG_WARNING(MSG, ...) server_log("WARNING", __func__, __LINE__, MSG, __VA_ARGS__)
#define LOG_INFO( MSG, ...) server_log("INFO", __func__, __LINE__, MSG, __VA_ARGS__)
json oaicompat_completion_params_parse(const json &body);
std::string format_chatml(std::vector<json> messages);
//
// base64 utils (TODO: move to common in the future)
//
@@ -152,15 +165,23 @@ struct task_server {
json data;
bool infill_mode = false;
bool embedding_mode = false;
int multitask_id = -1;
};
struct task_result {
int id;
int multitask_id = -1;
bool stop;
bool error;
json result_json;
};
struct task_multi {
int id;
std::set<int> subtasks_remaining{};
std::vector<task_result> results{};
};
// TODO: can become bool if we can't find use of more states
enum slot_state
{
@@ -365,7 +386,6 @@ struct llama_client_slot
int32_t num_prompt_tokens = 0;
int32_t num_prompt_tokens_processed = 0;
int32_t multibyte_pending = 0;
json prompt;
std::string generated_text;
@@ -381,6 +401,9 @@ struct llama_client_slot
bool stopped_word = false;
bool stopped_limit = false;
bool oaicompat = false;
std::string oaicompat_model;
std::string stopping_word;
// sampling
@@ -400,6 +423,9 @@ struct llama_client_slot
double t_prompt_processing; // ms
double t_token_generation; // ms
// multitasks
int multitask_id = -1;
void reset() {
num_prompt_tokens = 0;
generated_text = "";
@@ -408,7 +434,6 @@ struct llama_client_slot
stopped_word = false;
stopped_limit = false;
stopping_word = "";
multibyte_pending = 0;
n_past = 0;
sent_count = 0;
sent_token_probs_index = 0;
@@ -480,7 +505,7 @@ struct llama_client_slot
};
}
void print_timings() {
void print_timings() const {
LOG_TEE("\n");
LOG_TEE("%s: prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)\n",
__func__, t_prompt_processing, num_prompt_tokens_processed, t_prompt_processing / num_prompt_tokens_processed, 1e3 / t_prompt_processing * num_prompt_tokens_processed);
@@ -504,6 +529,7 @@ struct llama_server_context
bool multimodal = false;
bool clean_kv_cache = true;
bool all_slots_are_idle = false;
bool add_bos_token = true;
int32_t id_gen;
int32_t n_ctx; // total context for all clients / slots
@@ -522,7 +548,8 @@ struct llama_server_context
std::vector<task_server> queue_tasks;
std::vector<task_result> queue_results;
std::mutex mutex_tasks;
std::vector<task_multi> queue_multitasks;
std::mutex mutex_tasks; // also guards id_gen, and queue_multitasks
std::mutex mutex_results;
~llama_server_context()
@@ -576,6 +603,8 @@ struct llama_server_context
n_ctx = llama_n_ctx(ctx);
add_bos_token = llama_should_add_bos_token(model);
return true;
}
@@ -609,6 +638,11 @@ struct llama_server_context
std::vector<llama_token> tokenize(const json & json_prompt, bool add_bos) const
{
// TODO: currently, we tokenize using special tokens by default
// this is not always correct (see https://github.com/ggerganov/llama.cpp/pull/4160#issuecomment-1824826216)
// but it's better compared to completely ignoring ChatML and other chat templates
const bool TMP_FORCE_SPECIAL = true;
// If `add_bos` is true, we only add BOS, when json_prompt is a string,
// or the first element of the json_prompt array is a string.
std::vector<llama_token> prompt_tokens;
@@ -624,12 +658,12 @@ struct llama_server_context
std::vector<llama_token> p;
if (first)
{
p = ::llama_tokenize(ctx, s, add_bos);
p = ::llama_tokenize(ctx, s, add_bos, TMP_FORCE_SPECIAL);
first = false;
}
else
{
p = ::llama_tokenize(ctx, s, false);
p = ::llama_tokenize(ctx, s, false, TMP_FORCE_SPECIAL);
}
prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
}
@@ -646,7 +680,7 @@ struct llama_server_context
else
{
auto s = json_prompt.template get<std::string>();
prompt_tokens = ::llama_tokenize(ctx, s, add_bos);
prompt_tokens = ::llama_tokenize(ctx, s, add_bos, TMP_FORCE_SPECIAL);
}
return prompt_tokens;
@@ -677,11 +711,20 @@ struct llama_server_context
slot_params default_params;
llama_sampling_params default_sparams;
if (data.count("__oaicompat") != 0) {
slot->oaicompat = true;
slot->oaicompat_model = json_value(data, "model", std::string(DEFAULT_OAICOMPAT_MODEL));
} else {
slot->oaicompat = false;
slot->oaicompat_model = "";
}
slot->params.stream = json_value(data, "stream", false);
slot->params.cache_prompt = json_value(data, "cache_prompt", false);
slot->params.n_predict = json_value(data, "n_predict", default_params.n_predict);
slot->sparams.top_k = json_value(data, "top_k", default_sparams.top_k);
slot->sparams.top_p = json_value(data, "top_p", default_sparams.top_p);
slot->sparams.min_p = json_value(data, "min_p", default_sparams.min_p);
slot->sparams.tfs_z = json_value(data, "tfs_z", default_sparams.tfs_z);
slot->sparams.typical_p = json_value(data, "typical_p", default_sparams.typical_p);
slot->sparams.temp = json_value(data, "temperature", default_sparams.temp);
@@ -866,7 +909,7 @@ struct llama_server_context
}
void update_system_prompt() {
system_tokens = ::llama_tokenize(ctx, system_prompt, true);
system_tokens = ::llama_tokenize(ctx, system_prompt, add_bos_token);
llama_batch_clear(batch);
@@ -957,35 +1000,36 @@ struct llama_server_context
slot.generated_text += token_str;
slot.has_next_token = true;
if (slot.multibyte_pending > 0)
// check if there is incomplete UTF-8 character at the end
bool incomplete = false;
for (unsigned i = 1; i < 5 && i <= slot.generated_text.size(); ++i)
{
slot.multibyte_pending -= token_str.size();
}
else if (token_str.size() == 1)
{
const char c = token_str[0];
// 2-byte characters: 110xxxxx 10xxxxxx
unsigned char c = slot.generated_text[slot.generated_text.size() - i];
if ((c & 0xC0) == 0x80)
{
// continuation byte: 10xxxxxx
continue;
}
if ((c & 0xE0) == 0xC0)
{
slot.multibyte_pending = 1;
// 3-byte characters: 1110xxxx 10xxxxxx 10xxxxxx
// 2-byte character: 110xxxxx ...
incomplete = i < 2;
}
else if ((c & 0xF0) == 0xE0)
{
slot.multibyte_pending = 2;
// 4-byte characters: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
// 3-byte character: 1110xxxx ...
incomplete = i < 3;
}
else if ((c & 0xF8) == 0xF0)
{
slot.multibyte_pending = 3;
}
else
{
slot.multibyte_pending = 0;
// 4-byte character: 11110xxx ...
incomplete = i < 4;
}
// else 1-byte character or invalid byte
break;
}
if (slot.multibyte_pending == 0)
if (!incomplete)
{
size_t pos = std::min(slot.sent_count, slot.generated_text.size());
const std::string str_test = slot.generated_text.substr(pos);
@@ -1020,7 +1064,7 @@ struct llama_server_context
}
}
if (slot.multibyte_pending > 0 && !slot.has_next_token)
if (incomplete)
{
slot.has_next_token = true;
}
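The rewrite above drops the old multibyte_pending counter in favor of scanning at most the last four bytes of the generated text: find the most recent UTF-8 lead byte and report the tail as incomplete if fewer bytes follow than that lead byte announces. A small Go sketch of the same check, with an illustrative helper name:

package main

import "fmt"

// endsWithIncompleteUTF8 reports whether b ends in a truncated
// multi-byte UTF-8 sequence, mirroring the scan in the C++ above.
func endsWithIncompleteUTF8(b []byte) bool {
	for i := 1; i <= 4 && i <= len(b); i++ {
		c := b[len(b)-i]
		if c&0xC0 == 0x80 {
			continue // continuation byte 10xxxxxx, keep looking for the lead byte
		}
		switch {
		case c&0xE0 == 0xC0: // 110xxxxx: lead byte of a 2-byte sequence
			return i < 2
		case c&0xF0 == 0xE0: // 1110xxxx: lead byte of a 3-byte sequence
			return i < 3
		case c&0xF8 == 0xF0: // 11110xxx: lead byte of a 4-byte sequence
			return i < 4
		}
		return false // ASCII or invalid byte: nothing pending
	}
	return false
}

func main() {
	full := []byte("héllo")
	fmt.Println(endsWithIncompleteUTF8(full))     // false: ends on a complete rune
	fmt.Println(endsWithIncompleteUTF8(full[:2])) // true: cut in the middle of 'é'
}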
@@ -1089,16 +1133,40 @@ struct llama_server_context
return slot.images.size() > 0;
}
void send_error(int id, std::string error)
void send_error(task_server& task, std::string error)
{
std::lock_guard<std::mutex> lock(mutex_results);
task_result res;
res.id = id;
res.id = task.id;
res.multitask_id = task.multitask_id;
res.stop = false;
res.error = true;
res.result_json = { { "content", error } };
queue_results.push_back(res);
}
void add_multi_task(int id, std::vector<int>& sub_ids)
{
std::lock_guard<std::mutex> lock(mutex_tasks);
task_multi multi;
multi.id = id;
std::copy(sub_ids.begin(), sub_ids.end(), std::inserter(multi.subtasks_remaining, multi.subtasks_remaining.end()));
queue_multitasks.push_back(multi);
}
void update_multi_task(int multitask_id, int subtask_id, task_result& result)
{
std::lock_guard<std::mutex> lock(mutex_tasks);
for (auto& multitask : queue_multitasks)
{
if (multitask.id == multitask_id)
{
multitask.subtasks_remaining.erase(subtask_id);
multitask.results.push_back(result);
}
}
}
json get_model_props()
{
return get_formated_generation(slots[0]);
@@ -1116,6 +1184,7 @@ struct llama_server_context
{"temp", slot.sparams.temp},
{"top_k", slot.sparams.top_k},
{"top_p", slot.sparams.top_p},
{"min_p", slot.sparams.min_p},
{"tfs_z", slot.sparams.tfs_z},
{"typical_p", slot.sparams.typical_p},
{"repeat_last_n", slot.sparams.penalty_last_n},
@@ -1142,6 +1211,7 @@ struct llama_server_context
std::lock_guard<std::mutex> lock(mutex_results);
task_result res;
res.id = slot.task_id;
res.multitask_id = slot.multitask_id;
res.error = false;
res.stop = false;
@@ -1167,6 +1237,12 @@ struct llama_server_context
res.result_json["completion_probabilities"] = probs_vector_to_json(ctx, probs_output);
}
if (slot.oaicompat)
{
res.result_json["oaicompat_token_ctr"] = slot.n_decoded;
res.result_json["model"] = slot.oaicompat_model;
}
queue_results.push_back(res);
}
@@ -1175,6 +1251,7 @@ struct llama_server_context
std::lock_guard<std::mutex> lock(mutex_results);
task_result res;
res.id = slot.task_id;
res.multitask_id = slot.multitask_id;
res.error = false;
res.stop = true;
@@ -1214,6 +1291,18 @@ struct llama_server_context
res.result_json["completion_probabilities"] = probs_vector_to_json(ctx, probs);
}
if (slot.oaicompat)
{
res.result_json["oaicompat_token_ctr"] = slot.n_decoded;
res.result_json["model"] = slot.oaicompat_model;
}
// parent multitask, if any, needs to be updated
if (slot.multitask_id != -1)
{
update_multi_task(slot.multitask_id, slot.task_id, res);
}
queue_results.push_back(res);
}
@@ -1222,6 +1311,7 @@ struct llama_server_context
std::lock_guard<std::mutex> lock(mutex_results);
task_result res;
res.id = slot.task_id;
res.multitask_id = slot.multitask_id;
res.error = false;
res.stop = true;
@@ -1248,15 +1338,26 @@ struct llama_server_context
queue_results.push_back(res);
}
int request_completion(json data, bool infill, bool embedding)
int request_completion(json data, bool infill, bool embedding, int multitask_id)
{
std::lock_guard<std::mutex> lock(mutex_tasks);
std::unique_lock<std::mutex> lock(mutex_tasks);
task_server task;
task.id = id_gen++;
task.data = data;
task.target_id = 0;
task.data = std::move(data);
task.infill_mode = infill;
task.embedding_mode = embedding;
task.type = COMPLETION_TASK;
task.multitask_id = multitask_id;
// when a completion task's prompt array is not a singleton, we split it into multiple requests
if (task.data.at("prompt").size() > 1)
{
lock.unlock(); // entering new func scope
return split_multiprompt_task(task);
}
// otherwise, it's a single-prompt task, we actually queue it
queue_tasks.push_back(task);
return task.id;
}
@@ -1275,8 +1376,17 @@ struct llama_server_context
for (int i = 0; i < (int) queue_results.size(); i++)
{
// for now, tasks that have associated parent multitasks just get erased once multitask picks up the result
if (queue_results[i].multitask_id == task_id)
{
update_multi_task(task_id, queue_results[i].id, queue_results[i]);
queue_results.erase(queue_results.begin() + i);
continue;
}
if (queue_results[i].id == task_id)
{
assert(queue_results[i].multitask_id == -1);
task_result res = queue_results[i];
queue_results.erase(queue_results.begin() + i);
return res;
@@ -1366,6 +1476,27 @@ struct llama_server_context
queue_tasks.push_back(task);
}
int split_multiprompt_task(task_server& multiprompt_task)
{
int prompt_count = multiprompt_task.data.at("prompt").size();
assert(prompt_count > 1);
int multitask_id = id_gen++;
std::vector<int> subtask_ids(prompt_count);
for (int i = 0; i < prompt_count; i++)
{
json subtask_data = multiprompt_task.data;
subtask_data["prompt"] = subtask_data["prompt"][i];
// subtasks inherit everything else (infill mode, embedding mode, etc.)
subtask_ids[i] = request_completion(subtask_data, multiprompt_task.infill_mode, multiprompt_task.embedding_mode, multitask_id);
}
// queue up the multitask so we can track its subtask progression
add_multi_task(multitask_id, subtask_ids);
return multitask_id;
}
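split_multiprompt_task gives every prompt in the array its own subtask id and registers a parent multitask that resolves once all subtasks have reported back (see the aggregation loop added to process_tasks just below). The server implements this with id counters and task queues; the Go sketch here only shows the fan-out/aggregate shape, and all names are illustrative rather than the server's API.

package main

import (
	"fmt"
	"sync"
)

// completeAll fans each prompt out as its own subtask and returns the
// per-prompt results as one aggregate, in the same spirit as the
// multitask bookkeeping above (but using goroutines instead of queues).
func completeAll(prompts []string, complete func(string) string) []string {
	results := make([]string, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		go func(i int, p string) { // one subtask per prompt
			defer wg.Done()
			results[i] = complete(p)
		}(i, p)
	}
	wg.Wait() // the parent task finishes once every subtask has reported back
	return results
}

func main() {
	out := completeAll([]string{"a", "b"}, func(p string) string { return "echo: " + p })
	fmt.Println(out)
}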
void process_tasks()
{
std::lock_guard<std::mutex> lock(mutex_tasks);
@@ -1381,7 +1512,7 @@ struct llama_server_context
{
LOG_TEE("slot unavailable\n");
// send error result
send_error(task.id, "slot unavailable");
send_error(task, "slot unavailable");
return;
}
@@ -1395,11 +1526,12 @@ struct llama_server_context
slot->infill = task.infill_mode;
slot->embedding = task.embedding_mode;
slot->task_id = task.id;
slot->multitask_id = task.multitask_id;
if (!launch_slot_with_data(slot, task.data))
{
// send error result
send_error(task.id, "internal_error");
send_error(task, "internal_error");
break;
}
} break;
@@ -1415,6 +1547,38 @@ struct llama_server_context
} break;
}
}
// remove finished multitasks from the queue of multitasks, and add the corresponding result to the result queue
auto queue_iterator = queue_multitasks.begin();
while (queue_iterator != queue_multitasks.end())
{
if (queue_iterator->subtasks_remaining.empty())
{
// all subtasks done == multitask is done
task_result aggregate_result;
aggregate_result.id = queue_iterator->id;
aggregate_result.stop = true;
aggregate_result.error = false;
// collect json results into one json result
std::vector<json> result_jsons;
for (auto& subres : queue_iterator->results)
{
result_jsons.push_back(subres.result_json);
aggregate_result.error = aggregate_result.error && subres.error;
}
aggregate_result.result_json = json{ "results", result_jsons };
std::lock_guard<std::mutex> lock(mutex_results);
queue_results.push_back(aggregate_result);
queue_iterator = queue_multitasks.erase(queue_iterator);
}
else
{
++queue_iterator;
}
}
}
bool update_slots() {
@@ -1553,11 +1717,40 @@ struct llama_server_context
}
else
{
prompt_tokens = tokenize(slot.prompt, system_prompt.empty()); // add BOS if there isn't system prompt
prompt_tokens = tokenize(slot.prompt, system_prompt.empty() && add_bos_token); // add BOS if there isn't system prompt
}
slot.num_prompt_tokens = prompt_tokens.size();
if (slot.params.n_keep < 0)
{
slot.params.n_keep = slot.num_prompt_tokens;
}
slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep);
// if input prompt is too big, truncate it
if (slot.num_prompt_tokens >= slot.n_ctx)
{
const int n_left = slot.n_ctx - slot.params.n_keep;
const int n_block_size = n_left / 2;
const int erased_blocks = (slot.num_prompt_tokens - slot.params.n_keep - n_block_size) / n_block_size;
std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + slot.params.n_keep);
new_tokens.insert(new_tokens.end(), prompt_tokens.begin() + slot.params.n_keep + erased_blocks * n_block_size, prompt_tokens.end());
LOG_VERBOSE("input truncated", {
{"n_ctx", slot.n_ctx},
{"n_keep", slot.params.n_keep},
{"n_left", n_left},
{"new_tokens", tokens_to_str(ctx, new_tokens.cbegin(), new_tokens.cend())},
});
slot.truncated = true;
prompt_tokens = new_tokens;
slot.num_prompt_tokens = prompt_tokens.size();
GGML_ASSERT(slot.num_prompt_tokens < slot.n_ctx);
}
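The truncation above keeps the first n_keep tokens and then drops whole blocks of size (n_ctx - n_keep) / 2 from the middle until the prompt fits in the context. A compact Go sketch of that arithmetic; it assumes nKeep has already been clamped below nCtx, as the code above does with n_ctx - 4, and the function name is illustrative.

package main

import "fmt"

// truncatePrompt keeps the first nKeep tokens, then skips whole blocks of
// size (nCtx-nKeep)/2 from the middle so the result fits in nCtx.
func truncatePrompt(tokens []int, nCtx, nKeep int) []int {
	if len(tokens) < nCtx {
		return tokens
	}
	nLeft := nCtx - nKeep
	blockSize := nLeft / 2
	erasedBlocks := (len(tokens) - nKeep - blockSize) / blockSize
	out := append([]int{}, tokens[:nKeep]...)
	return append(out, tokens[nKeep+erasedBlocks*blockSize:]...)
}

func main() {
	toks := make([]int, 20)
	for i := range toks {
		toks[i] = i
	}
	// keeps the 2-token prefix and drops middle blocks: [0 1 17 18 19]
	fmt.Println(truncatePrompt(toks, 8, 2))
}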
if (!slot.params.cache_prompt)
{
llama_sampling_reset(slot.ctx_sampling);
@@ -1567,35 +1760,6 @@ struct llama_server_context
}
else
{
if (slot.params.n_keep < 0)
{
slot.params.n_keep = slot.num_prompt_tokens;
}
slot.params.n_keep = std::min(slot.n_ctx - 4, slot.params.n_keep);
// if input prompt is too big, truncate it
if (slot.num_prompt_tokens >= slot.n_ctx)
{
const int n_left = slot.n_ctx - slot.params.n_keep;
const int n_block_size = n_left / 2;
const int erased_blocks = (slot.num_prompt_tokens - slot.params.n_keep - n_block_size) / n_block_size;
std::vector<llama_token> new_tokens(prompt_tokens.begin(), prompt_tokens.begin() + slot.params.n_keep);
new_tokens.insert(new_tokens.end(), prompt_tokens.begin() + slot.params.n_keep + erased_blocks * n_block_size, prompt_tokens.end());
LOG_VERBOSE("input truncated", {
{"n_ctx", slot.n_ctx},
{"n_keep", slot.params.n_keep},
{"n_left", n_left},
{"new_tokens", tokens_to_str(ctx, new_tokens.cbegin(), new_tokens.cend())},
});
slot.truncated = true;
prompt_tokens = new_tokens;
slot.num_prompt_tokens = prompt_tokens.size();
GGML_ASSERT(slot.num_prompt_tokens < slot.n_ctx);
}
// push the prompt into the sampling context (do not apply grammar)
for (auto &token : prompt_tokens)
{
@@ -1630,7 +1794,7 @@ struct llama_server_context
const bool has_images = process_images(slot);
// process the prefix of first image
std::vector<llama_token> prefix_tokens = has_images ? tokenize(slot.images[0].prefix_prompt, true) : prompt_tokens;
std::vector<llama_token> prefix_tokens = has_images ? tokenize(slot.images[0].prefix_prompt, add_bos_token) : prompt_tokens;
for (; slot.n_past < (int) prefix_tokens.size(); ++slot.n_past)
{
llama_batch_add(batch, prefix_tokens[slot.n_past], system_tokens.size() + slot.n_past, { slot.id }, false);
@@ -1750,6 +1914,231 @@ struct llama_server_context
};
static std::string random_string()
{
static const std::string str("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
std::random_device rd;
std::mt19937 generator(rd());
std::string result(32, ' ');
for (int i = 0; i < 32; ++i) {
result[i] = str[generator() % str.size()];
}
return result;
}
static std::string gen_chatcmplid()
{
std::stringstream chatcmplid;
chatcmplid << "chatcmpl-" << random_string();
return chatcmplid.str();
}
std::string format_chatml(std::vector<json> messages)
{
std::ostringstream chatml_msgs;
for (auto it = messages.begin(); it != messages.end(); ++it) {
chatml_msgs << "<|im_start|>"
<< json_value(*it, "role", std::string("user")) << '\n';
chatml_msgs << json_value(*it, "content", std::string(""))
<< "<|im_end|>\n";
}
chatml_msgs << "<|im_start|>assistant" << '\n';
return chatml_msgs.str();
}
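format_chatml wraps every message in an <|im_start|>role ... <|im_end|> block and leaves an open assistant turn for the model to continue. A Go sketch of the same layout, with an illustrative message type:

package main

import (
	"fmt"
	"strings"
)

type chatMessage struct {
	Role    string
	Content string
}

// formatChatML mirrors the prompt layout produced above: one
// <|im_start|>role\ncontent<|im_end|> block per message, followed by an
// open assistant turn.
func formatChatML(messages []chatMessage) string {
	var b strings.Builder
	for _, m := range messages {
		role := m.Role
		if role == "" {
			role = "user" // same default as json_value(..., "role", "user")
		}
		fmt.Fprintf(&b, "<|im_start|>%s\n%s<|im_end|>\n", role, m.Content)
	}
	b.WriteString("<|im_start|>assistant\n")
	return b.String()
}

func main() {
	fmt.Print(formatChatML([]chatMessage{
		{Role: "system", Content: "You are helpful."},
		{Role: "user", Content: "Hi!"},
	}))
}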
/* llama.cpp completion api semantics */
json oaicompat_completion_params_parse(
const json &body /* openai api json semantics */)
{
json llama_params;
llama_params["__oaicompat"] = true;
// Map OpenAI parameters to llama.cpp parameters
llama_params["model"] = json_value(body, "model", std::string("uknown"));
llama_params["prompt"] = format_chatml(body["messages"]); // OpenAI 'messages' to llama.cpp 'prompt'
llama_params["cache_prompt"] = json_value(body, "cache_prompt", false);
llama_params["temperature"] = json_value(body, "temperature", 0.8);
llama_params["top_k"] = json_value(body, "top_k", 40);
llama_params["top_p"] = json_value(body, "top_p", 0.95);
llama_params["n_predict"] = json_value(body, "max_tokens", -1);
llama_params["logit_bias"] = json_value(body, "logit_bias",json::object());
llama_params["frequency_penalty"] = json_value(body, "frequency_penalty", 0.0);
llama_params["presence_penalty"] = json_value(body, "presence_penalty", 0.0);
llama_params["seed"] = json_value(body, "seed", 0);
llama_params["stream"] = json_value(body, "stream", false);
llama_params["mirostat"] = json_value(body, "mirostat", false);
llama_params["mirostat_tau"] = json_value(body, "mirostat_tau", 0.0);
llama_params["mirostat_eta"] = json_value(body, "mirostat_eta", 0.0);
llama_params["penalize_nl"] = json_value(body, "penalize_nl", false);
llama_params["typical_p"] = json_value(body, "typical_p", 0.0);
llama_params["repeat_last_n"] = json_value(body, "repeat_last_n", 0);
llama_params["ignore_eos"] = json_value(body, "ignore_eos", false);
llama_params["tfs_z"] = json_value(body, "tfs_z", 0.0);
if (llama_params.count("grammar") != 0) {
llama_params["grammar"] = json_value(body, "grammar", json::object());
}
// Handle 'stop' field
if (body.contains("stop") && body["stop"].is_string()) {
llama_params["stop"] = json::array({body["stop"].get<std::string>()});
} else {
llama_params["stop"] = json_value(body, "stop", json::array());
}
// Ensure there is ChatML-specific end sequence among stop words
llama_params["stop"].push_back("<|im_end|>");
return llama_params;
}
static json format_final_response_oaicompat(const json &request, const task_result &response, bool streaming = false)
{
json result = response.result_json;
bool stopped_word = result.count("stopped_word") != 0;
bool stopped_eos = json_value(result, "stopped_eos", false);
int num_tokens_predicted = json_value(result, "tokens_predicted", 0);
int num_prompt_tokens = json_value(result, "tokens_evaluated", 0);
std::string content = json_value(result, "content", std::string(""));
std::string finish_reason = "length";
if (stopped_word || stopped_eos) {
finish_reason = "stop";
}
json choices =
streaming ? json::array({json{{"finish_reason", finish_reason},
{"index", 0},
{"delta", json::object()}}})
: json::array({json{{"finish_reason", finish_reason},
{"index", 0},
{"message", json{{"content", content},
{"role", "assistant"}}}}});
std::time_t t = std::time(0);
json res =
json{{"choices", choices},
{"created", t},
{"model",
json_value(request, "model", std::string(DEFAULT_OAICOMPAT_MODEL))},
{"object", streaming ? "chat.completion.chunk" : "chat.completion"},
{"usage",
json{{"completion_tokens", num_tokens_predicted},
{"prompt_tokens", num_prompt_tokens},
{"total_tokens", num_tokens_predicted + num_prompt_tokens}}},
{"id", gen_chatcmplid()}};
if (server_verbose) {
res["__verbose"] = result;
}
if (result.contains("completion_probabilities")) {
res["completion_probabilities"] = json_value(result, "completion_probabilities", json::array());
}
return res;
}
// return value is vector as there is one case where we might need to generate two responses
static std::vector<json> format_partial_response_oaicompat(const task_result &response) {
json result = response.result_json;
if (!result.contains("model") || !result.contains("oaicompat_token_ctr")) {
return std::vector<json>({response.result_json});
}
bool first = json_value(result, "oaicompat_token_ctr", 0) == 0;
std::string modelname = json_value(result, "model", std::string(DEFAULT_OAICOMPAT_MODEL));
bool stopped_word = json_value(result, "stopped_word", false);
bool stopped_eos = json_value(result, "stopped_eos", false);
bool stopped_limit = json_value(result, "stopped_limit", false);
std::string content = json_value(result, "content", std::string(""));
std::string finish_reason;
if (stopped_word || stopped_eos) {
finish_reason = "stop";
}
if (stopped_limit) {
finish_reason = "length";
}
std::time_t t = std::time(0);
json choices;
if (!finish_reason.empty()) {
choices = json::array({json{{"finish_reason", finish_reason},
{"index", 0},
{"delta", json::object()}}});
} else {
if (first) {
if (content.empty()) {
choices = json::array({json{{"finish_reason", nullptr},
{"index", 0},
{"delta", json{{"role", "assistant"}}}}});
} else {
// We have to send this as two updates to conform to openai behavior
json initial_ret = json{{"choices", json::array({json{
{"finish_reason", nullptr},
{"index", 0},
{"delta", json{
{"role", "assistant"}
}}}})},
{"created", t},
{"id", gen_chatcmplid()},
{"model", modelname},
{"object", "chat.completion.chunk"}};
json second_ret = json{
{"choices", json::array({json{{"finish_reason", nullptr},
{"index", 0},
{"delta", json{
{"content", content}}}
}})},
{"created", t},
{"id", gen_chatcmplid()},
{"model", modelname},
{"object", "chat.completion.chunk"}};
return std::vector<json>({initial_ret, second_ret});
}
} else {
// Some idiosyncrasy in task processing logic makes several trailing calls
// with empty content; we ignore these at the callee site.
if (content.empty()) {
return std::vector<json>({json::object()});
}
choices = json::array({json{
{"finish_reason", nullptr},
{"index", 0},
{"delta",
json{
{"content", content},
}},
}});
}
}
json ret = json{{"choices", choices},
{"created", t},
{"id", gen_chatcmplid()},
{"model", modelname},
{"object", "chat.completion.chunk"}};
return std::vector<json>({ret});
}
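When the very first streamed token already carries content, the code above emits two chat.completion.chunk payloads: a delta containing only the assistant role, then a delta with the content, to match OpenAI's streaming behavior. A hedged Go sketch of those two payloads; the id is a fixed placeholder here, whereas the server generates one per chunk.

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// firstStreamChunks builds the two chunks sent when the first token has
// content: a role-only delta followed by a content delta. Field names
// follow the OpenAI streaming format; the id is illustrative.
func firstStreamChunks(model, content string) []map[string]any {
	created := time.Now().Unix()
	chunk := func(delta map[string]any) map[string]any {
		return map[string]any{
			"choices": []map[string]any{{"index": 0, "finish_reason": nil, "delta": delta}},
			"created": created,
			"id":      "chatcmpl-example",
			"model":   model,
			"object":  "chat.completion.chunk",
		}
	}
	return []map[string]any{
		chunk(map[string]any{"role": "assistant"}),
		chunk(map[string]any{"content": content}),
	}
}

func main() {
	for _, c := range firstStreamChunks("gpt-3.5-turbo-0613", "Hello") {
		b, _ := json.Marshal(c)
		fmt.Println(string(b))
	}
}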
static json format_partial_response(
llama_server_context &llama, llama_client_slot *slot, const std::string &content, const std::vector<completion_token_output> &probs
@@ -1782,8 +2171,6 @@ static json format_detokenized_response(std::string content)
{"content", content}};
}
struct token_translator
{
llama_context * ctx;
@@ -1979,7 +2366,7 @@ static void params_parse(const backend::ModelOptions* request,
// params.model_alias ??
params.model_alias = request->modelfile();
params.n_ctx = request->contextsize();
params.memory_f16 = request->f16memory();
//params.memory_f16 = request->f16memory();
params.n_threads = request->threads();
params.n_gpu_layers = request->ngpulayers();
params.n_batch = request->nbatch();
@@ -2086,7 +2473,7 @@ public:
}
grpc::Status PredictStream(grpc::ServerContext* context, const backend::PredictOptions* request, grpc::ServerWriter<backend::Reply>* writer) override {
json data = parse_options(true, request, llama);
const int task_id = llama.request_completion(data, false, false);
const int task_id = llama.request_completion(data, false, false, -1);
while (true)
{
task_result result = llama.next_result(task_id);
@@ -2122,7 +2509,7 @@ public:
grpc::Status Predict(ServerContext* context, const backend::PredictOptions* request, backend::Reply* reply) {
json data = parse_options(false, request, llama);
const int task_id = llama.request_completion(data, false, false);
const int task_id = llama.request_completion(data, false, false, -1);
std::string completion_text;
task_result result = llama.next_result(task_id);
if (!result.error && result.stop) {

View File

File diff suppressed because one or more lines are too long

View File

@@ -1,5 +1,15 @@
.PHONY: ttsbark
ttsbark:
@echo "Creating virtual environment..."
@conda env create --name ttsbark --file ttsbark.yml
@echo "Virtual environment created."
$(MAKE) -C ../common-env/transformers
.PHONY: run
run:
@echo "Running bark..."
bash run.sh
@echo "bark run."
.PHONY: test
test:
@echo "Testing bark..."
bash test.sh
@echo "bark tested."

View File

File diff suppressed because one or more lines are too long

View File

@@ -6,7 +6,7 @@
export PATH=$PATH:/opt/conda/bin
# Activate conda environment
source activate ttsbark
source activate transformers
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

View File

@@ -0,0 +1,81 @@
"""
A test script to test the gRPC service
"""
import unittest
import subprocess
import time
import backend_pb2
import backend_pb2_grpc
import grpc
class TestBackendServicer(unittest.TestCase):
"""
TestBackendServicer is the class that tests the gRPC service
"""
def setUp(self):
"""
This method sets up the gRPC service by starting the server
"""
self.service = subprocess.Popen(["python3", "ttsbark.py", "--addr", "localhost:50051"])
time.sleep(10)
def tearDown(self) -> None:
"""
This method tears down the gRPC service by terminating the server
"""
self.service.terminate()
self.service.wait()
def test_server_startup(self):
"""
This method tests if the server starts up successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.Health(backend_pb2.HealthMessage())
self.assertEqual(response.message, b'OK')
except Exception as err:
print(err)
self.fail("Server failed to start")
finally:
self.tearDown()
def test_load_model(self):
"""
This method tests if the model is loaded successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="v2/en_speaker_4"))
self.assertTrue(response.success)
self.assertEqual(response.message, "Model loaded successfully")
except Exception as err:
print(err)
self.fail("LoadModel service failed")
finally:
self.tearDown()
def test_tts(self):
"""
This method tests if TTS audio is generated successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="v2/en_speaker_4"))
self.assertTrue(response.success)
tts_request = backend_pb2.TTSRequest(text="80s TV news production music hit for tonight's biggest story")
tts_response = stub.TTS(tts_request)
self.assertIsNotNone(tts_response)
except Exception as err:
print(err)
self.fail("TTS service failed")
finally:
self.tearDown()

View File

@@ -0,0 +1,11 @@
#!/bin/bash
##
## A bash script wrapper that runs the bark tests with conda
# Activate conda environment
source activate transformers
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python -m unittest $DIR/test.py

View File

@@ -1,32 +0,0 @@
import unittest
import subprocess
import time
import backend_pb2
import backend_pb2_grpc
import grpc
class TestBackendServicer(unittest.TestCase):
"""
TestBackendServicer is the class that tests the gRPC service
"""
def setUp(self):
self.service = subprocess.Popen(["python3", "ttsbark.py", "--addr", "localhost:50051"])
def tearDown(self) -> None:
self.service.terminate()
self.service.wait()
def test_server_startup(self):
time.sleep(2)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.Health(backend_pb2.HealthMessage())
self.assertEqual(response.message, b'OK')
except Exception as err:
print(err)
self.fail("Server failed to start")
finally:
self.tearDown()

View File

@@ -0,0 +1,10 @@
CONDA_ENV_PATH = "transformers.yml"
ifeq ($(BUILD_TYPE), cublas)
CONDA_ENV_PATH = "transformers-nvidia.yml"
endif
.PHONY: transformers
transformers:
@echo "Installing $(CONDA_ENV_PATH)..."
bash install.sh $(CONDA_ENV_PATH)

View File

@@ -0,0 +1,15 @@
#!/bin/bash
set -ex
# Check if environment exist
conda_env_exists(){
! conda list --name "${@}" >/dev/null 2>/dev/null
}
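# Note: the negation above means this function succeeds when the environment is
# missing, so the if-branch below creates it only on the first run.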
if conda_env_exists "transformers" ; then
echo "Creating virtual environment..."
conda env create --name transformers --file $1
echo "Virtual environment created."
else
echo "Virtual environment already exists."
fi

View File

@@ -1,4 +1,4 @@
name: bark
name: transformers
channels:
- defaults
dependencies:
@@ -35,6 +35,8 @@ dependencies:
- certifi==2023.7.22
- charset-normalizer==3.3.0
- datasets==2.14.5
- sentence-transformers==2.2.2
- sentencepiece==0.1.99
- dill==0.3.7
- einops==0.7.0
- encodec==0.1.1
@@ -68,6 +70,7 @@ dependencies:
- packaging==23.2
- pandas==2.1.1
- peft==0.5.0
- git+https://github.com/bigscience-workshop/petals
- protobuf==4.24.4
- psutil==5.9.5
- pyarrow==13.0.0
@@ -93,4 +96,4 @@ dependencies:
- urllib3==1.26.17
- xxhash==3.4.1
- yarl==1.9.2
prefix: /opt/conda/envs/bark
prefix: /opt/conda/envs/transformers

View File

@@ -1,4 +1,4 @@
name: sentencetransformers
name: transformers
channels:
- defaults
dependencies:
@@ -20,58 +20,68 @@ dependencies:
- setuptools=68.0.0=py311h06a4308_0
- sqlite=3.41.2=h5eee18b_0
- tk=8.6.12=h1ccaba5_0
- tzdata=2023c=h04d1e81_0
- wheel=0.41.2=py311h06a4308_0
- xz=5.4.2=h5eee18b_0
- zlib=1.2.13=h5eee18b_0
- pip:
- accelerate==0.23.0
- aiohttp==3.8.5
- aiosignal==1.3.1
- async-timeout==4.0.3
- attrs==23.1.0
- bark==0.1.5
- boto3==1.28.61
- botocore==1.31.61
- certifi==2023.7.22
- charset-normalizer==3.3.0
- click==8.1.7
- datasets==2.14.5
- sentence-transformers==2.2.2
- sentencepiece==0.1.99
- dill==0.3.7
- einops==0.7.0
- encodec==0.1.1
- filelock==3.12.4
- fsspec==2023.9.2
- frozenlist==1.4.0
- fsspec==2023.6.0
- funcy==2.0
- grpcio==1.59.0
- huggingface-hub==0.17.3
- huggingface-hub==0.16.4
- idna==3.4
- install==1.3.5
- jinja2==3.1.2
- joblib==1.3.2
- jmespath==1.0.1
- markupsafe==2.1.3
- mpmath==1.3.0
- multidict==6.0.4
- multiprocess==0.70.15
- networkx==3.1
- nltk==3.8.1
- numpy==1.26.0
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
- nvidia-cuda-runtime-cu12==12.1.105
- nvidia-cudnn-cu12==8.9.2.26
- nvidia-cufft-cu12==11.0.2.54
- nvidia-curand-cu12==10.3.2.106
- nvidia-cusolver-cu12==11.4.5.107
- nvidia-cusparse-cu12==12.1.0.106
- nvidia-nccl-cu12==2.18.1
- nvidia-nvjitlink-cu12==12.2.140
- nvidia-nvtx-cu12==12.1.105
- packaging==23.2
- pillow==10.0.1
- pandas==2.1.1
- peft==0.5.0
- git+https://github.com/bigscience-workshop/petals
- protobuf==4.24.4
- psutil==5.9.5
- pyarrow==13.0.0
- python-dateutil==2.8.2
- pytz==2023.3.post1
- pyyaml==6.0.1
- regex==2023.10.3
- requests==2.31.0
- safetensors==0.4.0
- scikit-learn==1.3.1
- rouge==1.0.1
- s3transfer==0.7.0
- safetensors==0.3.3
- scipy==1.11.3
- sentence-transformers==2.2.2
- sentencepiece==0.1.99
- six==1.16.0
- sympy==1.12
- threadpoolctl==3.2.0
- tokenizers==0.14.1
- tokenizers==0.14.0
- torch==2.1.0
- torchvision==0.16.0
- torchaudio==2.1.0
- tqdm==4.66.1
- transformers==4.34.0
- triton==2.1.0
- typing-extensions==4.8.0
- urllib3==2.0.6
prefix: /opt/conda/envs/sentencetransformers
- tzdata==2023.3
- urllib3==1.26.17
- xxhash==3.4.1
- yarl==1.9.2
prefix: /opt/conda/envs/transformers

View File

@@ -9,3 +9,6 @@ run:
@echo "Running diffusers..."
bash run.sh
@echo "Diffusers run."
test:
bash test.sh

View File

@@ -18,9 +18,9 @@ import backend_pb2_grpc
import grpc
from diffusers import StableDiffusionXLPipeline, StableDiffusionDepth2ImgPipeline, DPMSolverMultistepScheduler, StableDiffusionPipeline, DiffusionPipeline, EulerAncestralDiscreteScheduler
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers import StableDiffusionImg2ImgPipeline, AutoPipelineForText2Image, ControlNetModel, StableVideoDiffusionPipeline
from diffusers.pipelines.stable_diffusion import safety_checker
from diffusers.utils import load_image,export_to_video
from compel import Compel
from transformers import CLIPTextModel
@@ -30,6 +30,11 @@ from safetensors.torch import load_file
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
COMPEL=os.environ.get("COMPEL", "1") == "1"
CLIPSKIP=os.environ.get("CLIPSKIP", "1") == "1"
SAFETENSORS=os.environ.get("SAFETENSORS", "1") == "1"
CHUNK_SIZE=int(os.environ.get("CHUNK_SIZE", "8"))
FPS=int(os.environ.get("FPS", "7"))
DISABLE_CPU_OFFLOAD=os.environ.get("DISABLE_CPU_OFFLOAD", "0") == "1"
FRAMES=os.environ.get("FRAMES", "64")
# If MAX_WORKERS are specified in the environment use it, otherwise default to 1
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
@@ -135,8 +140,11 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
print(f"Loading model {request.Model}...", file=sys.stderr)
print(f"Request {request}", file=sys.stderr)
torchType = torch.float32
variant = None
if request.F16Memory:
torchType = torch.float16
variant="fp16"
local = False
modelFile = request.Model
@@ -159,15 +167,10 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
modelFile = request.ModelFile
fromSingleFile = request.Model.startswith("http") or request.Model.startswith("/") or local
if request.IMG2IMG and request.PipelineType == "":
request.PipelineType == "StableDiffusionImg2ImgPipeline"
if request.PipelineType == "":
request.PipelineType == "StableDiffusionPipeline"
self.img2vid=False
self.txt2vid=False
## img2img
if request.PipelineType == "StableDiffusionImg2ImgPipeline":
if (request.PipelineType == "StableDiffusionImg2ImgPipeline") or (request.IMG2IMG and request.PipelineType == ""):
if fromSingleFile:
self.pipe = StableDiffusionImg2ImgPipeline.from_single_file(modelFile,
torch_dtype=torchType,
@@ -177,12 +180,26 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
torch_dtype=torchType,
guidance_scale=cfg_scale)
if request.PipelineType == "StableDiffusionDepth2ImgPipeline":
elif request.PipelineType == "StableDiffusionDepth2ImgPipeline":
self.pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(request.Model,
torch_dtype=torchType,
guidance_scale=cfg_scale)
## img2vid
elif request.PipelineType == "StableVideoDiffusionPipeline":
self.img2vid=True
self.pipe = StableVideoDiffusionPipeline.from_pretrained(
request.Model, torch_dtype=torchType, variant=variant
)
if not DISABLE_CPU_OFFLOAD:
self.pipe.enable_model_cpu_offload()
## text2img
if request.PipelineType == "StableDiffusionPipeline":
elif request.PipelineType == "AutoPipelineForText2Image" or request.PipelineType == "":
self.pipe = AutoPipelineForText2Image.from_pretrained(request.Model,
torch_dtype=torchType,
use_safetensors=SAFETENSORS,
variant=variant,
guidance_scale=cfg_scale)
elif request.PipelineType == "StableDiffusionPipeline":
if fromSingleFile:
self.pipe = StableDiffusionPipeline.from_single_file(modelFile,
torch_dtype=torchType,
@@ -191,13 +208,16 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
self.pipe = StableDiffusionPipeline.from_pretrained(request.Model,
torch_dtype=torchType,
guidance_scale=cfg_scale)
if request.PipelineType == "DiffusionPipeline":
elif request.PipelineType == "DiffusionPipeline":
self.pipe = DiffusionPipeline.from_pretrained(request.Model,
torch_dtype=torchType,
guidance_scale=cfg_scale)
if request.PipelineType == "StableDiffusionXLPipeline":
elif request.PipelineType == "VideoDiffusionPipeline":
self.txt2vid=True
self.pipe = DiffusionPipeline.from_pretrained(request.Model,
torch_dtype=torchType,
guidance_scale=cfg_scale)
elif request.PipelineType == "StableDiffusionXLPipeline":
if fromSingleFile:
self.pipe = StableDiffusionXLPipeline.from_single_file(modelFile,
torch_dtype=torchType, use_safetensors=True,
@@ -207,21 +227,35 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
request.Model,
torch_dtype=torchType,
use_safetensors=True,
# variant="fp16"
variant=variant,
guidance_scale=cfg_scale)
# https://github.com/huggingface/diffusers/issues/4446
# do not use text_encoder in the constructor since then
# https://github.com/huggingface/diffusers/issues/3212#issuecomment-1521841481
if CLIPSKIP and request.CLIPSkip != 0:
text_encoder = CLIPTextModel.from_pretrained(clipmodel, num_hidden_layers=request.CLIPSkip, subfolder=clipsubfolder, torch_dtype=torchType)
self.pipe.text_encoder=text_encoder
self.clip_skip = request.CLIPSkip
else:
self.clip_skip = 0
# torch_dtype needs to be customized. float16 for GPU, float32 for CPU
# TODO: this needs to be customized
if request.SchedulerType != "":
self.pipe.scheduler = get_scheduler(request.SchedulerType, self.pipe.scheduler.config)
self.compel = Compel(tokenizer=self.pipe.tokenizer, text_encoder=self.pipe.text_encoder)
if not self.img2vid:
self.compel = Compel(tokenizer=self.pipe.tokenizer, text_encoder=self.pipe.text_encoder)
if request.ControlNet:
self.controlnet = ControlNetModel.from_pretrained(
request.ControlNet, torch_dtype=torchType, variant=variant
)
self.pipe.controlnet = self.controlnet
else:
self.controlnet = None
if request.CUDA:
self.pipe.to('cuda')
if self.controlnet:
self.controlnet.to('cuda')
# Assume directory from request.ModelFile.
# Only if request.LoraAdapter it's not an absolute path
if request.LoraAdapter and request.ModelFile != "" and not os.path.isabs(request.LoraAdapter) and request.LoraAdapter:
@@ -303,17 +337,28 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
prompt = request.positive_prompt
steps = 1
if request.step != 0:
steps = request.step
# create a dictionary of values for the parameters
options = {
"negative_prompt": request.negative_prompt,
"width": request.width,
"height": request.height,
"num_inference_steps": request.step,
"num_inference_steps": steps,
}
if request.src != "":
if request.src != "" and not self.controlnet and not self.img2vid:
image = Image.open(request.src)
options["image"] = image
elif self.controlnet and request.src:
pose_image = load_image(request.src)
options["image"] = pose_image
if CLIPSKIP and self.clip_skip != 0:
options["clip_skip"]=self.clip_skip
# Get the keys that we will build the args for our pipe for
keys = options.keys()
@@ -333,6 +378,21 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
request.seed
)
if self.img2vid:
# Load the conditioning image
image = load_image(request.src)
image = image.resize((1024, 576))
generator = torch.manual_seed(request.seed)
frames = self.pipe(image, decode_chunk_size=CHUNK_SIZE, generator=generator).frames[0]
export_to_video(frames, request.dst, fps=FPS)
return backend_pb2.Result(message="Media generated successfully", success=True)
if self.txt2vid:
video_frames = self.pipe(prompt, num_inference_steps=steps, num_frames=int(FRAMES)).frames
export_to_video(video_frames, request.dst)
return backend_pb2.Result(message="Media generated successfully", success=True)
image = {}
if COMPEL:
conditioning = self.compel.build_conditioning_tensor(prompt)
@@ -351,7 +411,7 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
# save the result
image.save(request.dst)
return backend_pb2.Result(message="Model loaded successfully", success=True)
return backend_pb2.Result(message="Media generated", success=True)
def serve(address):
server = grpc.server(futures.ThreadPoolExecutor(max_workers=MAX_WORKERS))

View File

File diff suppressed because one or more lines are too long

View File

@@ -25,15 +25,15 @@ dependencies:
- xz=5.4.2=h5eee18b_0
- zlib=1.2.13=h5eee18b_0
- pip:
- accelerate==0.23.0
- accelerate>=0.11.0
- certifi==2023.7.22
- charset-normalizer==3.3.0
- compel==2.0.2
- diffusers==0.21.4
- diffusers==0.24.0
- filelock==3.12.4
- fsspec==2023.9.2
- grpcio==1.59.0
- huggingface-hub==0.17.3
- huggingface-hub>=0.19.4
- idna==3.4
- importlib-metadata==6.8.0
- jinja2==3.1.2
@@ -63,12 +63,11 @@ dependencies:
- requests==2.31.0
- safetensors==0.4.0
- sympy==1.12
- tokenizers==0.14.1
- torch==2.1.0
- tqdm==4.66.1
- transformers==4.34.0
- transformers>=4.25.1
- triton==2.1.0
- typing-extensions==4.8.0
- urllib3==2.0.6
- zipp==3.17.0
prefix: /opt/conda/envs/diffusers
prefix: /opt/conda/envs/diffusers

View File

@@ -0,0 +1,84 @@
"""
A test script to test the gRPC service
"""
import unittest
import subprocess
import time
import backend_pb2
import backend_pb2_grpc
import grpc
class TestBackendServicer(unittest.TestCase):
"""
TestBackendServicer is the class that tests the gRPC service
"""
def setUp(self):
"""
This method sets up the gRPC service by starting the server
"""
self.service = subprocess.Popen(["python3", "backend_diffusers.py", "--addr", "localhost:50051"])
def tearDown(self) -> None:
"""
This method tears down the gRPC service by terminating the server
"""
self.service.kill()
self.service.wait()
def test_server_startup(self):
"""
This method tests if the server starts up successfully
"""
time.sleep(10)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.Health(backend_pb2.HealthMessage())
self.assertEqual(response.message, b'OK')
except Exception as err:
print(err)
self.fail("Server failed to start")
finally:
self.tearDown()
def test_load_model(self):
"""
This method tests if the model is loaded successfully
"""
time.sleep(10)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="runwayml/stable-diffusion-v1-5"))
self.assertTrue(response.success)
self.assertEqual(response.message, "Model loaded successfully")
except Exception as err:
print(err)
self.fail("LoadModel service failed")
finally:
self.tearDown()
def test(self):
"""
This method tests if the backend can generate images
"""
time.sleep(10)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="runwayml/stable-diffusion-v1-5"))
print(response.message)
self.assertTrue(response.success)
image_req = backend_pb2.GenerateImageRequest(positive_prompt="cat", width=16,height=16, dst="test.jpg")
re = stub.GenerateImage(image_req)
self.assertTrue(re.success)
except Exception as err:
print(err)
self.fail("Image gen service failed")
finally:
self.tearDown()

View File

@@ -0,0 +1,14 @@
#!/bin/bash
##
## A bash script wrapper that runs the diffusers tests with conda
export PATH=$PATH:/opt/conda/bin
# Activate conda environment
source activate diffusers
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python -m unittest $DIR/test.py

View File

@@ -3,6 +3,7 @@ exllama:
@echo "Creating virtual environment..."
@conda env create --name exllama --file exllama.yml
@echo "Virtual environment created."
bash install.sh
.PHONY: run
run:

View File

File diff suppressed because one or more lines are too long

View File

@@ -13,9 +13,10 @@ from pathlib import Path
import torch
import torch.nn.functional as F
from torch import version as torch_version
from exllama.generator import ExLlamaGenerator
from exllama.model import ExLlama, ExLlamaCache, ExLlamaConfig
from exllama.tokenizer import ExLlamaTokenizer
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
from model import ExLlama, ExLlamaCache, ExLlamaConfig
_ONE_DAY_IN_SECONDS = 60 * 60 * 24

View File

@@ -33,6 +33,7 @@ dependencies:
- mpmath==1.3.0
- networkx==3.1
- ninja==1.11.1
- protobuf==4.24.4
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
@@ -45,11 +46,11 @@ dependencies:
- nvidia-nccl-cu12==2.18.1
- nvidia-nvjitlink-cu12==12.2.140
- nvidia-nvtx-cu12==12.1.105
- protobuf==4.24.4
- safetensors==0.3.2
- sentencepiece==0.1.99
- sympy==1.12
- torch==2.1.0
- triton==2.1.0
- typing-extensions==4.8.0
- numpy
prefix: /opt/conda/envs/exllama

View File

@@ -0,0 +1,15 @@
#!/bin/bash
##
## A bash script that installs the required dependencies of exllama and prepares the environment
export PATH=$PATH:/opt/conda/bin
# Activate conda environment
source activate exllama
echo $CONDA_PREFIX
git clone https://github.com/turboderp/exllama $CONDA_PREFIX/exllama && pushd $CONDA_PREFIX/exllama && pip install -r requirements.txt && popd
cp -rfv $CONDA_PREFIX/exllama/* ./

View File

@@ -0,0 +1,12 @@
.PHONY: exllama2
exllama2:
@echo "Creating virtual environment..."
@conda env create --name exllama2 --file exllama2.yml
@echo "Virtual environment created."
bash install.sh
.PHONY: run
run:
@echo "Running exllama2..."
bash run.sh
@echo "exllama2 run."

View File

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,363 @@
# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
"""Client and server classes corresponding to protobuf-defined services."""
import grpc
import backend_pb2 as backend__pb2
class BackendStub(object):
"""Missing associated documentation comment in .proto file."""
def __init__(self, channel):
"""Constructor.
Args:
channel: A grpc.Channel.
"""
self.Health = channel.unary_unary(
'/backend.Backend/Health',
request_serializer=backend__pb2.HealthMessage.SerializeToString,
response_deserializer=backend__pb2.Reply.FromString,
)
self.Predict = channel.unary_unary(
'/backend.Backend/Predict',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.Reply.FromString,
)
self.LoadModel = channel.unary_unary(
'/backend.Backend/LoadModel',
request_serializer=backend__pb2.ModelOptions.SerializeToString,
response_deserializer=backend__pb2.Result.FromString,
)
self.PredictStream = channel.unary_stream(
'/backend.Backend/PredictStream',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.Reply.FromString,
)
self.Embedding = channel.unary_unary(
'/backend.Backend/Embedding',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.EmbeddingResult.FromString,
)
self.GenerateImage = channel.unary_unary(
'/backend.Backend/GenerateImage',
request_serializer=backend__pb2.GenerateImageRequest.SerializeToString,
response_deserializer=backend__pb2.Result.FromString,
)
self.AudioTranscription = channel.unary_unary(
'/backend.Backend/AudioTranscription',
request_serializer=backend__pb2.TranscriptRequest.SerializeToString,
response_deserializer=backend__pb2.TranscriptResult.FromString,
)
self.TTS = channel.unary_unary(
'/backend.Backend/TTS',
request_serializer=backend__pb2.TTSRequest.SerializeToString,
response_deserializer=backend__pb2.Result.FromString,
)
self.TokenizeString = channel.unary_unary(
'/backend.Backend/TokenizeString',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.TokenizationResponse.FromString,
)
self.Status = channel.unary_unary(
'/backend.Backend/Status',
request_serializer=backend__pb2.HealthMessage.SerializeToString,
response_deserializer=backend__pb2.StatusResponse.FromString,
)
class BackendServicer(object):
"""Missing associated documentation comment in .proto file."""
def Health(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def Predict(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def LoadModel(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def PredictStream(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def Embedding(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def GenerateImage(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def AudioTranscription(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def TTS(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def TokenizeString(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def Status(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def add_BackendServicer_to_server(servicer, server):
rpc_method_handlers = {
'Health': grpc.unary_unary_rpc_method_handler(
servicer.Health,
request_deserializer=backend__pb2.HealthMessage.FromString,
response_serializer=backend__pb2.Reply.SerializeToString,
),
'Predict': grpc.unary_unary_rpc_method_handler(
servicer.Predict,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.Reply.SerializeToString,
),
'LoadModel': grpc.unary_unary_rpc_method_handler(
servicer.LoadModel,
request_deserializer=backend__pb2.ModelOptions.FromString,
response_serializer=backend__pb2.Result.SerializeToString,
),
'PredictStream': grpc.unary_stream_rpc_method_handler(
servicer.PredictStream,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.Reply.SerializeToString,
),
'Embedding': grpc.unary_unary_rpc_method_handler(
servicer.Embedding,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.EmbeddingResult.SerializeToString,
),
'GenerateImage': grpc.unary_unary_rpc_method_handler(
servicer.GenerateImage,
request_deserializer=backend__pb2.GenerateImageRequest.FromString,
response_serializer=backend__pb2.Result.SerializeToString,
),
'AudioTranscription': grpc.unary_unary_rpc_method_handler(
servicer.AudioTranscription,
request_deserializer=backend__pb2.TranscriptRequest.FromString,
response_serializer=backend__pb2.TranscriptResult.SerializeToString,
),
'TTS': grpc.unary_unary_rpc_method_handler(
servicer.TTS,
request_deserializer=backend__pb2.TTSRequest.FromString,
response_serializer=backend__pb2.Result.SerializeToString,
),
'TokenizeString': grpc.unary_unary_rpc_method_handler(
servicer.TokenizeString,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.TokenizationResponse.SerializeToString,
),
'Status': grpc.unary_unary_rpc_method_handler(
servicer.Status,
request_deserializer=backend__pb2.HealthMessage.FromString,
response_serializer=backend__pb2.StatusResponse.SerializeToString,
),
}
generic_handler = grpc.method_handlers_generic_handler(
'backend.Backend', rpc_method_handlers)
server.add_generic_rpc_handlers((generic_handler,))
# This class is part of an EXPERIMENTAL API.
class Backend(object):
"""Missing associated documentation comment in .proto file."""
@staticmethod
def Health(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Health',
backend__pb2.HealthMessage.SerializeToString,
backend__pb2.Reply.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def Predict(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Predict',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.Reply.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def LoadModel(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/LoadModel',
backend__pb2.ModelOptions.SerializeToString,
backend__pb2.Result.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def PredictStream(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_stream(request, target, '/backend.Backend/PredictStream',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.Reply.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def Embedding(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Embedding',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.EmbeddingResult.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def GenerateImage(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/GenerateImage',
backend__pb2.GenerateImageRequest.SerializeToString,
backend__pb2.Result.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def AudioTranscription(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/AudioTranscription',
backend__pb2.TranscriptRequest.SerializeToString,
backend__pb2.TranscriptResult.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def TTS(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/TTS',
backend__pb2.TTSRequest.SerializeToString,
backend__pb2.Result.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def TokenizeString(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/TokenizeString',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.TokenizationResponse.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def Status(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Status',
backend__pb2.HealthMessage.SerializeToString,
backend__pb2.StatusResponse.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)

View File

@@ -1,4 +1,4 @@
name: transformers
name: exllama2
channels:
- defaults
dependencies:
@@ -25,22 +25,13 @@ dependencies:
- xz=5.4.2=h5eee18b_0
- zlib=1.2.13=h5eee18b_0
- pip:
- certifi==2023.7.22
- charset-normalizer==3.3.0
- click==8.1.7
- filelock==3.12.4
- fsspec==2023.9.2
- grpcio==1.59.0
- huggingface-hub==0.17.3
- idna==3.4
- install==1.3.5
- jinja2==3.1.2
- joblib==1.3.2
- markupsafe==2.1.3
- mpmath==1.3.0
- networkx==3.1
- nltk==3.8.1
- numpy==1.26.0
- protobuf==4.24.4
- nvidia-cublas-cu12==12.1.3.1
- nvidia-cuda-cupti-cu12==12.1.105
- nvidia-cuda-nvrtc-cu12==12.1.105
@@ -53,25 +44,14 @@ dependencies:
- nvidia-nccl-cu12==2.18.1
- nvidia-nvjitlink-cu12==12.2.140
- nvidia-nvtx-cu12==12.1.105
- packaging==23.2
- pillow==10.0.1
- protobuf==4.24.4
- pyyaml==6.0.1
- regex==2023.10.3
- requests==2.31.0
- safetensors==0.4.0
- scikit-learn==1.3.1
- scipy==1.11.3
- sentence-transformers==2.2.2
- sentencepiece==0.1.99
- sympy==1.12
- threadpoolctl==3.2.0
- tokenizers==0.14.1
- torch==2.1.0
- torchvision==0.16.0
- tqdm==4.66.1
- transformers==4.34.0
- triton==2.1.0
- typing-extensions==4.8.0
- urllib3==2.0.6
prefix: /opt/conda/envs/transformers
- pandas
- numpy
- ninja
- fastparquet
- torch>=2.1.0
- safetensors>=0.3.2
- sentencepiece>=0.1.97
- pygments
- websockets
- regex
prefix: /opt/conda/envs/exllama2

View File

@@ -0,0 +1,134 @@
#!/usr/bin/env python3
import grpc
from concurrent import futures
import time
import backend_pb2
import backend_pb2_grpc
import argparse
import signal
import sys
import os, glob
from pathlib import Path
import torch
import torch.nn.functional as F
from torch import version as torch_version
from exllamav2.generator import (
ExLlamaV2BaseGenerator,
ExLlamaV2Sampler
)
from exllamav2 import(
ExLlamaV2,
ExLlamaV2Config,
ExLlamaV2Cache,
ExLlamaV2Cache_8bit,
ExLlamaV2Tokenizer,
model_init,
)
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
# If MAX_WORKERS are specified in the environment use it, otherwise default to 1
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
# Implement the BackendServicer class with the service methods
class BackendServicer(backend_pb2_grpc.BackendServicer):
def Health(self, request, context):
return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
def LoadModel(self, request, context):
try:
model_directory = request.ModelFile
config = ExLlamaV2Config()
config.model_dir = model_directory
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
# Initialize generator
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
self.generator = generator
generator.warmup()
self.model = model
self.tokenizer = tokenizer
self.cache = cache
self.seed = None  # default: no fixed seed; Predict passes this to generate_simple()
except Exception as err:
return backend_pb2.Result(success=False, message=f"Unexpected {err=}, {type(err)=}")
return backend_pb2.Result(message="Model loaded successfully", success=True)
def Predict(self, request, context):
penalty = 1.15
if request.Penalty != 0.0:
penalty = request.Penalty
settings = ExLlamaV2Sampler.Settings()
settings.temperature = request.Temperature
settings.top_k = request.TopK
settings.top_p = request.TopP
settings.token_repetition_penalty = penalty
settings.disallow_tokens(self.tokenizer, [self.tokenizer.eos_token_id])
tokens = 512
if request.Tokens != 0:
tokens = request.Tokens
output = self.generator.generate_simple(request.Prompt, settings, tokens, seed = self.seed)
# Remove prompt from response if present
if request.Prompt in output:
output = output.replace(request.Prompt, "")
return backend_pb2.Reply(message=bytes(output, encoding='utf-8'))
def PredictStream(self, request, context):
# Implement PredictStream RPC
#for reply in some_data_generator():
# yield reply
# Not implemented yet
return self.Predict(request, context)
def serve(address):
server = grpc.server(futures.ThreadPoolExecutor(max_workers=MAX_WORKERS))
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()
print("Server started. Listening on: " + address, file=sys.stderr)
# Define the signal handler function
def signal_handler(sig, frame):
print("Received termination signal. Shutting down...")
server.stop(0)
sys.exit(0)
# Set the signal handlers for SIGINT and SIGTERM
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
try:
while True:
time.sleep(_ONE_DAY_IN_SECONDS)
except KeyboardInterrupt:
server.stop(0)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run the gRPC server.")
parser.add_argument(
"--addr", default="localhost:50051", help="The address to bind the server to."
)
args = parser.parse_args()
serve(args.addr)
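For reference, a minimal client sketch for this backend is shown below. It is illustrative only: it assumes the generated `backend_pb2`/`backend_pb2_grpc` modules are importable, that the server above is listening on `localhost:50051`, and the model path is a placeholder.
```python
# Hypothetical client for the exllama2 backend above (illustrative, not part of the repo).
import grpc
import backend_pb2
import backend_pb2_grpc

def main():
    with grpc.insecure_channel("localhost:50051") as channel:
        stub = backend_pb2_grpc.BackendStub(channel)
        # ModelFile should point at a local exllamav2 model directory (placeholder path)
        result = stub.LoadModel(backend_pb2.ModelOptions(ModelFile="/models/my-exl2-model"))
        print(result.success, result.message)
        # Prompt/Tokens/Temperature mirror the fields read by Predict() above
        reply = stub.Predict(backend_pb2.PredictOptions(Prompt="Once upon a time", Tokens=64, Temperature=0.7))
        print(reply.message.decode("utf-8"))

if __name__ == "__main__":
    main()
```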

View File

@@ -0,0 +1,14 @@
#!/bin/bash
##
## A bash script that installs the required dependencies of exllamav2 and prepares the environment
export PATH=$PATH:/opt/conda/bin
# Activate conda environment
source activate exllama2
echo $CONDA_PREFIX
git clone https://github.com/turboderp/exllamav2 $CONDA_PREFIX/exllamav2 && pushd $CONDA_PREFIX/exllamav2 && pip install -r requirements.txt && popd
cp -rfv $CONDA_PREFIX/exllamav2/* ./

14
backend/python/exllama2/run.sh Executable file
View File

@@ -0,0 +1,14 @@
#!/bin/bash
##
## A bash script wrapper that runs the exllama2 server with conda
export PATH=$PATH:/opt/conda/bin
# Activate conda environment
source activate exllama2
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python $DIR/exllama2_backend.py $@

View File

@@ -1,11 +1,15 @@
.PHONY: petals
petals:
@echo "Creating virtual environment..."
@conda env create --name petals --file petals.yml
@echo "Virtual environment created."
$(MAKE) -C ../common-env/transformers
.PHONY: run
run:
@echo "Running petals..."
bash run.sh
@echo "petals run."
.PHONY: test
test:
@echo "Testing petals..."
bash test.sh
@echo "petals tested."

View File

File diff suppressed because one or more lines are too long

View File

@@ -15,7 +15,7 @@ dependencies:
# - ncurses=6.4=h6a678d5_0
# - openssl=3.0.11=h7f8727e_2
# - pip=23.2.1=py311h06a4308_0
# - python=3.11.5=h955ad1f_0
- python=3.11.5=h955ad1f_0
# - readline=8.2=h5eee18b_0
# - setuptools=68.0.0=py311h06a4308_0
# - sqlite=3.41.2=h5eee18b_0
@@ -25,5 +25,6 @@ dependencies:
# - xz=5.4.2=h5eee18b_0
# - zlib=1.2.13=h5eee18b_0
- pip:
- torch==2.1.0
- git+https://github.com/bigscience-workshop/petals
prefix: /opt/conda/envs/petals

View File

@@ -9,10 +9,10 @@ export PATH=$PATH:/opt/conda/bin
# if source is available use it, or use conda
#
if [ -f /opt/conda/bin/activate ]; then
source activate petals
source activate transformers
else
eval "$(conda shell.bash hook)"
conda activate petals
conda activate transformers
fi
# get the directory where the bash script is located

View File

@@ -0,0 +1,11 @@
#!/bin/bash
##
## A bash script wrapper that runs the petals backend tests with conda
# Activate conda environment
source activate transformers
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python -m unittest $DIR/test_petals.py

View File

@@ -0,0 +1,58 @@
import unittest
import subprocess
import time
import grpc
import backend_pb2
import backend_pb2_grpc
class TestBackendServicer(unittest.TestCase):
"""
TestBackendServicer is the class that tests the gRPC service.
This class contains methods to test the startup and shutdown of the gRPC service.
"""
def setUp(self):
self.service = subprocess.Popen(["python", "backend_petals.py", "--addr", "localhost:50051"])
time.sleep(10)
def tearDown(self) -> None:
self.service.terminate()
self.service.wait()
def test_server_startup(self):
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.Health(backend_pb2.HealthMessage())
self.assertEqual(response.message, b'OK')
except Exception as err:
print(err)
self.fail("Server failed to start")
finally:
self.tearDown()
def test_load_model(self):
"""
This method tests if the model is loaded successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="bigscience/bloom-560m"))
print(response)
self.assertTrue(response.success)
self.assertEqual(response.message, "Model loaded successfully")
except Exception as err:
print(err)
self.fail("LoadModel service failed")
finally:
self.tearDown()

View File

@@ -1,8 +1,7 @@
.PHONY: sentencetransformers
sentencetransformers:
@echo "Creating virtual environment..."
@conda env create --name sentencetransformers --file sentencetransformers.yml
@echo "Virtual environment created."
$(MAKE) -C ../common-env/transformers
.PHONY: run
run:

View File

File diff suppressed because one or more lines are too long

View File

@@ -6,7 +6,7 @@
export PATH=$PATH:/opt/conda/bin
# Activate conda environment
source activate sentencetransformers
source activate transformers
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

View File

@@ -3,7 +3,7 @@
## A bash script wrapper that runs the sentencetransformers server with conda
# Activate conda environment
source activate sentencetransformers
source activate transformers
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

View File

@@ -19,19 +19,19 @@ class TestBackendServicer(unittest.TestCase):
This method sets up the gRPC service by starting the server
"""
self.service = subprocess.Popen(["python3", "sentencetransformers.py", "--addr", "localhost:50051"])
time.sleep(10)
def tearDown(self) -> None:
"""
This method tears down the gRPC service by terminating the server
"""
self.service.terminate()
self.service.kill()
self.service.wait()
def test_server_startup(self):
"""
This method tests if the server starts up successfully
"""
time.sleep(2)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:

View File

@@ -0,0 +1,16 @@
.PHONY: transformers-musicgen
transformers-musicgen:
$(MAKE) -C ../common-env/transformers
.PHONY: run
run:
@echo "Running transformers..."
bash run.sh
@echo "transformers run."
.PHONY: test
test:
@echo "Testing transformers..."
bash test.sh
@echo "transformers tested."

View File

@@ -0,0 +1,5 @@
# Creating a separate environment for the transformers-musicgen project
```
make transformers-musicgen
```

View File

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,363 @@
# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT!
"""Client and server classes corresponding to protobuf-defined services."""
import grpc
import backend_pb2 as backend__pb2
class BackendStub(object):
"""Missing associated documentation comment in .proto file."""
def __init__(self, channel):
"""Constructor.
Args:
channel: A grpc.Channel.
"""
self.Health = channel.unary_unary(
'/backend.Backend/Health',
request_serializer=backend__pb2.HealthMessage.SerializeToString,
response_deserializer=backend__pb2.Reply.FromString,
)
self.Predict = channel.unary_unary(
'/backend.Backend/Predict',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.Reply.FromString,
)
self.LoadModel = channel.unary_unary(
'/backend.Backend/LoadModel',
request_serializer=backend__pb2.ModelOptions.SerializeToString,
response_deserializer=backend__pb2.Result.FromString,
)
self.PredictStream = channel.unary_stream(
'/backend.Backend/PredictStream',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.Reply.FromString,
)
self.Embedding = channel.unary_unary(
'/backend.Backend/Embedding',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.EmbeddingResult.FromString,
)
self.GenerateImage = channel.unary_unary(
'/backend.Backend/GenerateImage',
request_serializer=backend__pb2.GenerateImageRequest.SerializeToString,
response_deserializer=backend__pb2.Result.FromString,
)
self.AudioTranscription = channel.unary_unary(
'/backend.Backend/AudioTranscription',
request_serializer=backend__pb2.TranscriptRequest.SerializeToString,
response_deserializer=backend__pb2.TranscriptResult.FromString,
)
self.TTS = channel.unary_unary(
'/backend.Backend/TTS',
request_serializer=backend__pb2.TTSRequest.SerializeToString,
response_deserializer=backend__pb2.Result.FromString,
)
self.TokenizeString = channel.unary_unary(
'/backend.Backend/TokenizeString',
request_serializer=backend__pb2.PredictOptions.SerializeToString,
response_deserializer=backend__pb2.TokenizationResponse.FromString,
)
self.Status = channel.unary_unary(
'/backend.Backend/Status',
request_serializer=backend__pb2.HealthMessage.SerializeToString,
response_deserializer=backend__pb2.StatusResponse.FromString,
)
class BackendServicer(object):
"""Missing associated documentation comment in .proto file."""
def Health(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def Predict(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def LoadModel(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def PredictStream(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def Embedding(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def GenerateImage(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def AudioTranscription(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def TTS(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def TokenizeString(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def Status(self, request, context):
"""Missing associated documentation comment in .proto file."""
context.set_code(grpc.StatusCode.UNIMPLEMENTED)
context.set_details('Method not implemented!')
raise NotImplementedError('Method not implemented!')
def add_BackendServicer_to_server(servicer, server):
rpc_method_handlers = {
'Health': grpc.unary_unary_rpc_method_handler(
servicer.Health,
request_deserializer=backend__pb2.HealthMessage.FromString,
response_serializer=backend__pb2.Reply.SerializeToString,
),
'Predict': grpc.unary_unary_rpc_method_handler(
servicer.Predict,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.Reply.SerializeToString,
),
'LoadModel': grpc.unary_unary_rpc_method_handler(
servicer.LoadModel,
request_deserializer=backend__pb2.ModelOptions.FromString,
response_serializer=backend__pb2.Result.SerializeToString,
),
'PredictStream': grpc.unary_stream_rpc_method_handler(
servicer.PredictStream,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.Reply.SerializeToString,
),
'Embedding': grpc.unary_unary_rpc_method_handler(
servicer.Embedding,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.EmbeddingResult.SerializeToString,
),
'GenerateImage': grpc.unary_unary_rpc_method_handler(
servicer.GenerateImage,
request_deserializer=backend__pb2.GenerateImageRequest.FromString,
response_serializer=backend__pb2.Result.SerializeToString,
),
'AudioTranscription': grpc.unary_unary_rpc_method_handler(
servicer.AudioTranscription,
request_deserializer=backend__pb2.TranscriptRequest.FromString,
response_serializer=backend__pb2.TranscriptResult.SerializeToString,
),
'TTS': grpc.unary_unary_rpc_method_handler(
servicer.TTS,
request_deserializer=backend__pb2.TTSRequest.FromString,
response_serializer=backend__pb2.Result.SerializeToString,
),
'TokenizeString': grpc.unary_unary_rpc_method_handler(
servicer.TokenizeString,
request_deserializer=backend__pb2.PredictOptions.FromString,
response_serializer=backend__pb2.TokenizationResponse.SerializeToString,
),
'Status': grpc.unary_unary_rpc_method_handler(
servicer.Status,
request_deserializer=backend__pb2.HealthMessage.FromString,
response_serializer=backend__pb2.StatusResponse.SerializeToString,
),
}
generic_handler = grpc.method_handlers_generic_handler(
'backend.Backend', rpc_method_handlers)
server.add_generic_rpc_handlers((generic_handler,))
# This class is part of an EXPERIMENTAL API.
class Backend(object):
"""Missing associated documentation comment in .proto file."""
@staticmethod
def Health(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Health',
backend__pb2.HealthMessage.SerializeToString,
backend__pb2.Reply.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def Predict(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Predict',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.Reply.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def LoadModel(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/LoadModel',
backend__pb2.ModelOptions.SerializeToString,
backend__pb2.Result.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def PredictStream(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_stream(request, target, '/backend.Backend/PredictStream',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.Reply.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def Embedding(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Embedding',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.EmbeddingResult.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def GenerateImage(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/GenerateImage',
backend__pb2.GenerateImageRequest.SerializeToString,
backend__pb2.Result.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def AudioTranscription(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/AudioTranscription',
backend__pb2.TranscriptRequest.SerializeToString,
backend__pb2.TranscriptResult.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def TTS(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/TTS',
backend__pb2.TTSRequest.SerializeToString,
backend__pb2.Result.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def TokenizeString(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/TokenizeString',
backend__pb2.PredictOptions.SerializeToString,
backend__pb2.TokenizationResponse.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)
@staticmethod
def Status(request,
target,
options=(),
channel_credentials=None,
call_credentials=None,
insecure=False,
compression=None,
wait_for_ready=None,
timeout=None,
metadata=None):
return grpc.experimental.unary_unary(request, target, '/backend.Backend/Status',
backend__pb2.HealthMessage.SerializeToString,
backend__pb2.StatusResponse.FromString,
options, channel_credentials,
insecure, call_credentials, compression, wait_for_ready, timeout, metadata)

View File

@@ -0,0 +1,16 @@
#!/bin/bash
##
## A bash script wrapper that runs the transformers-musicgen server with conda
echo "Launching gRPC server for transformers-musicgen"
export PATH=$PATH:/opt/conda/bin
# Activate conda environment
source activate transformers-musicgen
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python $DIR/transformers_server.py $@

View File

@@ -0,0 +1,11 @@
#!/bin/bash
##
## A bash script wrapper that runs the transformers tests with conda
# Activate conda environment
source activate transformers
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python -m unittest $DIR/test_transformers.py

View File

@@ -19,6 +19,7 @@ class TestBackendServicer(unittest.TestCase):
This method sets up the gRPC service by starting the server
"""
self.service = subprocess.Popen(["python3", "transformers_server.py", "--addr", "localhost:50051"])
time.sleep(10)
def tearDown(self) -> None:
"""
@@ -31,7 +32,6 @@ class TestBackendServicer(unittest.TestCase):
"""
This method tests if the server starts up successfully
"""
time.sleep(2)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
@@ -52,7 +52,7 @@ class TestBackendServicer(unittest.TestCase):
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="bert-base-nli-mean-tokens"))
response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/musicgen-small"))
self.assertTrue(response.success)
self.assertEqual(response.message, "Model loaded successfully")
except Exception as err:
@@ -61,7 +61,7 @@ class TestBackendServicer(unittest.TestCase):
finally:
self.tearDown()
def test_embedding(self):
def test_tts(self):
"""
This method tests if TTS audio is generated successfully
"""
@@ -69,13 +69,13 @@ class TestBackendServicer(unittest.TestCase):
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="bert-base-nli-mean-tokens"))
response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/musicgen-small"))
self.assertTrue(response.success)
embedding_request = backend_pb2.PredictOptions(Embeddings="This is a test sentence.")
embedding_response = stub.Embedding(embedding_request)
self.assertIsNotNone(embedding_response.embeddings)
tts_request = backend_pb2.TTSRequest(text="80s TV news production music hit for tonight's biggest story")
tts_response = stub.TTS(tts_request)
self.assertIsNotNone(tts_response)
except Exception as err:
print(err)
self.fail("Embedding service failed")
self.fail("TTS service failed")
finally:
self.tearDown()

View File

@@ -0,0 +1,122 @@
#!/usr/bin/env python3
"""
Extra gRPC server for MusicgenForConditionalGeneration models.
"""
from concurrent import futures
import argparse
import signal
import sys
import os
import time
import backend_pb2
import backend_pb2_grpc
import grpc
from scipy.io.wavfile import write as write_wav
from transformers import AutoProcessor, MusicgenForConditionalGeneration
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
# If MAX_WORKERS are specified in the environment use it, otherwise default to 1
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
# Implement the BackendServicer class with the service methods
class BackendServicer(backend_pb2_grpc.BackendServicer):
"""
A gRPC servicer for the backend service.
This class implements the gRPC methods for the backend service, including Health, LoadModel, and TTS.
"""
def Health(self, request, context):
"""
A gRPC method that returns the health status of the backend service.
Args:
request: A HealthRequest object that contains the request parameters.
context: A grpc.ServicerContext object that provides information about the RPC.
Returns:
A Reply object that contains the health status of the backend service.
"""
return backend_pb2.Reply(message=bytes("OK", 'utf-8'))
def LoadModel(self, request, context):
"""
A gRPC method that loads a model into memory.
Args:
request: A LoadModelRequest object that contains the request parameters.
context: A grpc.ServicerContext object that provides information about the RPC.
Returns:
A Result object that contains the result of the LoadModel operation.
"""
model_name = request.Model
try:
self.processor = AutoProcessor.from_pretrained(model_name)
self.model = MusicgenForConditionalGeneration.from_pretrained(model_name)
except Exception as err:
return backend_pb2.Result(success=False, message=f"Unexpected {err=}, {type(err)=}")
return backend_pb2.Result(message="Model loaded successfully", success=True)
def TTS(self, request, context):
model_name = request.model
if model_name == "":
return backend_pb2.Result(success=False, message="request.model is required")
try:
self.processor = AutoProcessor.from_pretrained(model_name)
self.model = MusicgenForConditionalGeneration.from_pretrained(model_name)
inputs = self.processor(
text=[request.text],
padding=True,
return_tensors="pt",
)
tokens = 256
# TODO get tokens from request?
audio_values = self.model.generate(**inputs, max_new_tokens=tokens)
print("[transformers-musicgen] TTS generated!", file=sys.stderr)
sampling_rate = self.model.config.audio_encoder.sampling_rate
write_wav(request.dst, rate=sampling_rate, data=audio_values[0, 0].numpy())
print("[transformers-musicgen] TTS saved to", request.dst, file=sys.stderr)
print("[transformers-musicgen] TTS for", file=sys.stderr)
print(request, file=sys.stderr)
except Exception as err:
return backend_pb2.Result(success=False, message=f"Unexpected {err=}, {type(err)=}")
return backend_pb2.Result(success=True)
def serve(address):
server = grpc.server(futures.ThreadPoolExecutor(max_workers=MAX_WORKERS))
backend_pb2_grpc.add_BackendServicer_to_server(BackendServicer(), server)
server.add_insecure_port(address)
server.start()
print("[transformers-musicgen] Server started. Listening on: " + address, file=sys.stderr)
# Define the signal handler function
def signal_handler(sig, frame):
print("[transformers-musicgen] Received termination signal. Shutting down...")
server.stop(0)
sys.exit(0)
# Set the signal handlers for SIGINT and SIGTERM
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
try:
while True:
time.sleep(_ONE_DAY_IN_SECONDS)
except KeyboardInterrupt:
server.stop(0)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run the gRPC server.")
parser.add_argument(
"--addr", default="localhost:50051", help="The address to bind the server to."
)
args = parser.parse_args()
print(f"[transformers-musicgen] startup: {args}", file=sys.stderr)
serve(args.addr)
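For reference, a minimal client sketch that exercises the `TTS` RPC above could look like this. It is illustrative only: it assumes the generated `backend_pb2`/`backend_pb2_grpc` modules are importable, the server is listening on `localhost:50051`, and the output path is a placeholder.
```python
# Hypothetical client for the transformers-musicgen backend above (illustrative, not part of the repo).
import grpc
import backend_pb2
import backend_pb2_grpc

with grpc.insecure_channel("localhost:50051") as channel:
    stub = backend_pb2_grpc.BackendStub(channel)
    result = stub.TTS(backend_pb2.TTSRequest(
        text="80s TV news production music hit for tonight's biggest story",
        model="facebook/musicgen-small",  # any MusicgenForConditionalGeneration checkpoint
        dst="/tmp/musicgen.wav",          # TTS() writes the generated audio here (placeholder path)
    ))
    print(result.success, result.message)
```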

View File

@@ -1,8 +1,6 @@
.PHONY: transformers
transformers:
@echo "Creating virtual environment..."
@conda env create --name transformers --file transformers.yml
@echo "Virtual environment created."
$(MAKE) -C ../common-env/transformers
.PHONY: run
run:

View File

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,84 @@
"""
A test script to test the gRPC service
"""
import unittest
import subprocess
import time
import backend_pb2
import backend_pb2_grpc
import grpc
class TestBackendServicer(unittest.TestCase):
"""
TestBackendServicer is the class that tests the gRPC service
"""
def setUp(self):
"""
This method sets up the gRPC service by starting the server
"""
self.service = subprocess.Popen(["python3", "transformers_server.py", "--addr", "localhost:50051"])
def tearDown(self) -> None:
"""
This method tears down the gRPC service by terminating the server
"""
self.service.kill()
self.service.wait()
def test_server_startup(self):
"""
This method tests if the server starts up successfully
"""
time.sleep(10)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.Health(backend_pb2.HealthMessage())
self.assertEqual(response.message, b'OK')
except Exception as err:
print(err)
self.fail("Server failed to start")
finally:
self.tearDown()
def test_load_model(self):
"""
This method tests if the model is loaded successfully
"""
time.sleep(10)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="bert-base-cased"))
self.assertTrue(response.success)
self.assertEqual(response.message, "Model loaded successfully")
except Exception as err:
print(err)
self.fail("LoadModel service failed")
finally:
self.tearDown()
def test_embedding(self):
"""
This method tests if the embeddings are generated successfully
"""
time.sleep(10)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="bert-base-cased"))
print(response.message)
self.assertTrue(response.success)
embedding_request = backend_pb2.PredictOptions(Embeddings="This is a test sentence.")
embedding_response = stub.Embedding(embedding_request)
self.assertIsNotNone(embedding_response.embeddings)
except Exception as err:
print(err)
self.fail("Embedding service failed")
finally:
self.tearDown()

View File

@@ -14,14 +14,27 @@ import backend_pb2
import backend_pb2_grpc
import grpc
import torch
from transformers import AutoModel
from transformers import AutoTokenizer, AutoModel
_ONE_DAY_IN_SECONDS = 60 * 60 * 24
# If MAX_WORKERS are specified in the environment use it, otherwise default to 1
MAX_WORKERS = int(os.environ.get('PYTHON_GRPC_MAX_WORKERS', '1'))
def mean_pooling(model_output, attention_mask):
"""
Mean pooling to get sentence embeddings. See:
https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1
"""
token_embeddings = model_output[0]
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1) # Sum columns
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
return sum_embeddings / sum_mask
# Implement the BackendServicer class with the service methods
class BackendServicer(backend_pb2_grpc.BackendServicer):
"""
@@ -56,9 +69,19 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
model_name = request.Model
try:
self.model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # trust_remote_code is needed to use the encode method with embeddings models like jinai-v2
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
if request.CUDA:
try:
# TODO: also tensorflow, make configurable
import torch.cuda
if torch.cuda.is_available():
print("Loading model", model_name, "to CUDA.", file=sys.stderr)
self.model = self.model.to("cuda")
except Exception as err:
print("Not using CUDA:", err, file=sys.stderr)
except Exception as err:
return backend_pb2.Result(success=False, message=f"Unexpected {err=}, {type(err)=}")
# Implement your logic here for the LoadModel service
# Replace this with your desired response
return backend_pb2.Result(message="Model loaded successfully", success=True)
@@ -74,10 +97,20 @@ class BackendServicer(backend_pb2_grpc.BackendServicer):
Returns:
An EmbeddingResult object that contains the calculated embeddings.
"""
# Implement your logic here for the Embedding service
# Replace this with your desired response
# Tokenize input
max_length = 512
if request.Tokens != 0:
max_length = request.Tokens
encoded_input = self.tokenizer(request.Embeddings, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
# Create word embeddings
model_output = self.model(**encoded_input)
# Pool to get sentence embeddings, i.e. a single fixed-size vector for the entire sentence
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask']).detach().numpy()
print("Calculated embeddings for: " + request.Embeddings, file=sys.stderr)
sentence_embeddings = self.model.encode(request.Embeddings)
print("Embeddings:", sentence_embeddings, file=sys.stderr)
return backend_pb2.EmbeddingResult(embeddings=sentence_embeddings)
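To make the pooling step concrete, here is a small standalone sketch (illustrative values only) of what `mean_pooling` computes: padding positions are masked out and the remaining token embeddings are averaged into one sentence vector.
```python
# Toy demonstration of the mean pooling used above (illustrative values only).
import torch

token_embeddings = torch.tensor([[[1.0, 2.0, 3.0, 4.0],
                                  [5.0, 6.0, 7.0, 8.0],
                                  [9.0, 9.0, 9.0, 9.0]]])  # batch=1, seq=3, dim=4
attention_mask = torch.tensor([[1, 1, 0]])                  # the third token is padding

mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
summed = torch.sum(token_embeddings * mask, 1)              # sum only the real tokens
counts = torch.clamp(mask.sum(1), min=1e-9)                 # number of real tokens per sentence
print(summed / counts)                                      # tensor([[3., 4., 5., 6.]])
```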

View File

@@ -10,3 +10,9 @@ run:
@echo "Running ttsvalle..."
bash run.sh
@echo "ttsvalle run."
.PHONY: test
test:
@echo "Testing valle..."
bash test.sh
@echo "valle tested."

View File

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,81 @@
"""
A test script to test the gRPC service
"""
import unittest
import subprocess
import time
import backend_pb2
import backend_pb2_grpc
import grpc
class TestBackendServicer(unittest.TestCase):
"""
TestBackendServicer is the class that tests the gRPC service
"""
def setUp(self):
"""
This method sets up the gRPC service by starting the server
"""
self.service = subprocess.Popen(["python3", "ttsvalle.py", "--addr", "localhost:50051"])
time.sleep(10)
def tearDown(self) -> None:
"""
This method tears down the gRPC service by terminating the server
"""
self.service.terminate()
self.service.wait()
def test_server_startup(self):
"""
This method tests if the server starts up successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.Health(backend_pb2.HealthMessage())
self.assertEqual(response.message, b'OK')
except Exception as err:
print(err)
self.fail("Server failed to start")
finally:
self.tearDown()
def test_load_model(self):
"""
This method tests if the model is loaded successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="dingzhen"))
self.assertTrue(response.success)
self.assertEqual(response.message, "Model loaded successfully")
except Exception as err:
print(err)
self.fail("LoadModel service failed")
finally:
self.tearDown()
def test_tts(self):
"""
This method tests if TTS audio is generated successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="dingzhen"))
self.assertTrue(response.success)
tts_request = backend_pb2.TTSRequest(text="80s TV news production music hit for tonight's biggest story")
tts_response = stub.TTS(tts_request)
self.assertIsNotNone(tts_response)
except Exception as err:
print(err)
self.fail("TTS service failed")
finally:
self.tearDown()

View File

@@ -0,0 +1,11 @@
#!/bin/bash
##
## A bash script wrapper that runs the ttsvalle server with conda
# Activate conda environment
source activate ttsvalle
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python -m unittest $DIR/test.py

View File

@@ -9,3 +9,9 @@ run:
@echo "Running vllm..."
bash run.sh
@echo "vllm run."
.PHONY: test
test:
@echo "Testing vllm..."
bash test.sh
@echo "vllm tested."

View File

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,11 @@
#!/bin/bash
##
## A bash script wrapper that runs the vllm backend tests with conda
# Activate conda environment
source activate vllm
# get the directory where the bash script is located
DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
python -m unittest $DIR/test_backend_vllm.py

View File

@@ -21,13 +21,13 @@ class TestBackendServicer(unittest.TestCase):
"""
def setUp(self):
self.service = subprocess.Popen(["python", "backend_vllm.py", "--addr", "localhost:50051"])
time.sleep(10)
def tearDown(self) -> None:
self.service.terminate()
self.service.wait()
def test_server_startup(self):
time.sleep(2)
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
@@ -39,3 +39,38 @@ class TestBackendServicer(unittest.TestCase):
self.fail("Server failed to start")
finally:
self.tearDown()
def test_load_model(self):
"""
This method tests if the model is loaded successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/opt-125m"))
self.assertTrue(response.success)
self.assertEqual(response.message, "Model loaded successfully")
except Exception as err:
print(err)
self.fail("LoadModel service failed")
finally:
self.tearDown()
def test_text(self):
"""
This method tests if text is generated successfully
"""
try:
self.setUp()
with grpc.insecure_channel("localhost:50051") as channel:
stub = backend_pb2_grpc.BackendStub(channel)
response = stub.LoadModel(backend_pb2.ModelOptions(Model="facebook/opt-125m"))
self.assertTrue(response.success)
req = backend_pb2.PredictOptions(Prompt="The capital of France is")
resp = stub.Predict(req)
self.assertIsNotNone(resp.message)
except Exception as err:
print(err)
self.fail("text service failed")
finally:
self.tearDown()

View File

@@ -207,6 +207,9 @@ lora_adapter: "/path/to/lora/adapter"
lora_base: "/path/to/lora/base"
# Disable mulmatq (CUDA)
no_mulmatq: true
# Diffusers/transformers
cuda: true
```
### Prompt templates
@@ -345,6 +348,7 @@ there are additional environment variables available that modify the behavior of
| `BUILD_TYPE` | | Build type. Available: `cublas`, `openblas`, `clblas` |
| `GO_TAGS` | | Go tags. Available: `stablediffusion` |
| `HUGGINGFACEHUB_API_TOKEN` | | Special token for interacting with HuggingFace Inference API, required only when using the `langchain-huggingface` backend |
| `EXTRA_BACKENDS` | | A space separated list of backends to prepare. For example `EXTRA_BACKENDS="backend/python/diffusers backend/python/transformers"` prepares the conda environment on start |
Here is how to configure these variables:
@@ -363,4 +367,40 @@ You can control the backends that are built by setting the `GRPC_BACKENDS` envir
make GRPC_BACKENDS=backend-assets/grpc/llama-cpp build
```
By default, all the backends are built.
By default, all the backends are built.
### Extra backends
LocalAI can be extended with extra backends. The backends are implemented as `gRPC` services and can be written in any language. The container images that are built and published on [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags) are split into core and extra variants. By default, images bring all the dependencies and backends supported by LocalAI (we call those `extra` images). The `-core` images instead bring only the strictly necessary dependencies to run LocalAI, with only a core set of backends.
If you wish to build a custom container image with extra backends, you can start from the core images and build only the backends you are interested in, or prepare the environment on startup by using the `EXTRA_BACKENDS` environment variable. For instance, to use the diffusers backend:
```Dockerfile
FROM quay.io/go-skynet/local-ai:master-ffmpeg-core
RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers
```
Remember also to set the `EXTERNAL_GRPC_BACKENDS` environment variable (or `--external-grpc-backends` as CLI flag) to point to the backends you are using (`EXTERNAL_GRPC_BACKENDS="backend_name:/path/to/backend"`), for example with diffusers:
```Dockerfile
FROM quay.io/go-skynet/local-ai:master-ffmpeg-core
RUN PATH=$PATH:/opt/conda/bin make -C backend/python/diffusers
ENV EXTERNAL_GRPC_BACKENDS="diffusers:/build/backend/python/diffusers/run.sh"
```
{{% notice note %}}
You can specify remote external backends or path to local files. The syntax is `backend-name:/path/to/backend` or `backend-name:host:port`.
{{% /notice %}}
#### In runtime
When using the `-core` container image it is possible to prepare the python backends you are interested in by using the `EXTRA_BACKENDS` variable, for instance:
```bash
docker run --env EXTRA_BACKENDS="backend/python/diffusers" quay.io/go-skynet/local-ai:master-ffmpeg-core
```

View File

@@ -147,20 +147,24 @@ make BUILD_TYPE=cublas build
More information is available in the upstream PR: https://github.com/ggerganov/llama.cpp/pull/1412
#### Hipblas (AMD GPU)
AMD GPU Acceleration
Requirement: ROCm
#### Hipblas (AMD GPU with ROCm on Arch Linux)
Packages:
```
make BUILD_TYPE=hipblas build
pacman -S base-devel git rocm-hip-sdk rocm-opencl-sdk opencv clblast grpc
```
Specific GPU targets can be specified with `GPU_TARGETS`:
Library links:
```
make BUILD_TYPE=hipblas GPU_TARGETS=gfx90a build
export CGO_CFLAGS="-I/usr/include/opencv4"
export CGO_CXXFLAGS="-I/usr/include/opencv4"
export CGO_LDFLAGS="-L/opt/rocm/hip/lib -lamdhip64 -L/opt/rocm/lib -lOpenCL -L/usr/lib -lclblast -lrocblas -lhipblas -lrocrand -lomp -O3 --rtlib=compiler-rt -unwindlib=libgcc -lhipblas -lrocblas --hip-link"
```
Build:
```
make BUILD_TYPE=hipblas GPU_TARGETS=gfx1030
```
#### ClBLAS

View File

@@ -147,7 +147,6 @@ backend: diffusers
# Force CPU usage - set to true for GPU
f16: false
diffusers:
pipeline_type: StableDiffusionXLPipeline
cuda: false # Enable for GPU usage (CUDA)
scheduler_type: euler_a
```

View File

@@ -52,6 +52,20 @@ Note:
- The model name is case sensitive.
- LocalAI must be compiled with the `GO_TAGS=tts` flag.
LocalAI also has experimental support for `transformers-musicgen` for the generation of short musical compositions. Currently, this is implemented via the same requests used for text to speech:
```
curl --request POST \
--url http://localhost:8080/tts \
--header 'Content-Type: application/json' \
--data '{
"backend": "transformers-musicgen",
"model": "facebook/musicgen-medium",
"input": "Cello Rave"
}' | aplay
```
Future versions of LocalAI will expose additional control over audio generation beyond the text prompt.
#### Configuration
Audio models can be configured via `YAML` files. This allows configuring specific settings for each backend; for instance, a backend might require a specific voice or support voice cloning, which must be specified in the configuration file.

View File

@@ -88,7 +88,7 @@ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d
}'
```
Note: If you are on Windows, please run ``docker-compose`` not ``docker compose`` and make sure the project is in the Linux Filesystem, otherwise loading models might be slow. For more Info: [Microsoft Docs](https://learn.microsoft.com/en-us/windows/wsl/filesystems)
Note: If you are on Windows, please make sure the project is on the Linux Filesystem, otherwise loading models might be slow. For more Info: [Microsoft Docs](https://learn.microsoft.com/en-us/windows/wsl/filesystems)
{{% /tab %}}
@@ -115,10 +115,66 @@ helm install local-ai go-skynet/local-ai -f values.yaml
{{< /tabs >}}
### Container images
LocalAI has a set of images to support CUDA, ffmpeg and 'vanilla' (CPU-only). The image list is on [quay](https://quay.io/repository/go-skynet/local-ai?tab=tags):
{{< tabs >}}
{{% tab name="Vanilla / CPU Images" %}}
- `master`
- `latest`
- `{{< version >}}`
- `{{< version >}}-ffmpeg`
- `{{< version >}}-ffmpeg-core`
Core Images - Smaller images without pre-downloaded Python dependencies
{{% /tab %}}
{{% tab name="GPU Images CUDA 11" %}}
- `master-cublas-cuda11`
- `master-cublas-cuda11-core`
- `{{< version >}}-cublas-cuda11`
- `{{< version >}}-cublas-cuda11-core`
- `{{< version >}}-cublas-cuda11-ffmpeg`
- `{{< version >}}-cublas-cuda11-ffmpeg-core`
Core Images - Smaller images without pre-downloaded Python dependencies
{{% /tab %}}
{{% tab name="GPU Images CUDA 12" %}}
- `master-cublas-cuda12`
- `master-cublas-cuda12-core`
- `{{< version >}}-cublas-cuda12`
- `{{< version >}}-cublas-cuda12-core`
- `{{< version >}}-cublas-cuda12-ffmpeg`
- `{{< version >}}-cublas-cuda12-ffmpeg-core`
Core Images - Smaller images without pre-downloaded Python dependencies
{{% /tab %}}
{{< /tabs >}}
Example:
- Standard (GPT + `stablediffusion`): `quay.io/go-skynet/local-ai:latest`
- FFmpeg: `quay.io/go-skynet/local-ai:{{< version >}}-ffmpeg`
- CUDA 11+FFmpeg: `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda11-ffmpeg`
- CUDA 12+FFmpeg: `quay.io/go-skynet/local-ai:{{< version >}}-cublas-cuda12-ffmpeg`
{{% notice note %}}
Note: the binary inside the image is pre-compiled, and might not suit all CPUs.
To enable CPU optimizations for the execution environment,
the default behavior is to rebuild when starting the container.
To disable this auto-rebuild behavior,
set the environment variable `REBUILD` to `false`.
See [docs on all environment variables]({{%relref "advanced#environment-variables" %}})
for more info.
{{% /notice %}}
### Example: Use luna-ai-llama2 model with `docker`
```bash
mkdir models
@@ -178,36 +234,7 @@ You can control LocalAI with command line arguments, to specify a binding addres
| --watchdog-busy-timeout value | $WATCHDOG_BUSY_TIMEOUT | 5m | Watchdog timeout. This will restart the backend if it crashes. |
| --watchdog-idle-timeout value | $WATCHDOG_IDLE_TIMEOUT | 15m | Watchdog idle timeout. This will restart the backend if it crashes. |
| --preload-backend-only | $PRELOAD_BACKEND_ONLY | false | If set, the api is NOT launched, and only the preloaded models / backends are started. This is intended for multi-node setups. |
### Container images
LocalAI has a set of images to support CUDA, ffmpeg and 'vanilla' (CPU-only). The image list is on [quay](https://quay.io/repository/go-skynet/local-ai?tab=tags):
- Vanilla images tags: `master`, `v1.40.0`, `latest`, ...
- FFmpeg images tags: `master-ffmpeg`, `v1.40.0-ffmpeg`, ...
- CUDA `11` tags: `master-cublas-cuda11`, `v1.40.0-cublas-cuda11`, ...
- CUDA `12` tags: `master-cublas-cuda12`, `v1.40.0-cublas-cuda12`, ...
- CUDA `11` + FFmpeg tags: `master-cublas-cuda11-ffmpeg`, `v1.40.0-cublas-cuda11-ffmpeg`, ...
- CUDA `12` + FFmpeg tags: `master-cublas-cuda12-ffmpeg`, `v1.40.0-cublas-cuda12-ffmpeg`, ...
- Core images (smaller images without python dependencies): `master-core`, `v1.40.0-core`, ...
Example:
- Standard (GPT + `stablediffusion`): `quay.io/go-skynet/local-ai:latest`
- FFmpeg: `quay.io/go-skynet/local-ai:v1.40.0-ffmpeg`
- CUDA 11+FFmpeg: `quay.io/go-skynet/local-ai:v1.40.0-cublas-cuda11-ffmpeg`
- CUDA 12+FFmpeg: `quay.io/go-skynet/local-ai:v1.40.0-cublas-cuda12-ffmpeg`
{{% notice note %}}
Note: the binary inside the image is pre-compiled, and might not suit all CPUs.
To enable CPU optimizations for the execution environment,
the default behavior is to rebuild when starting the container.
To disable this auto-rebuild behavior,
set the environment variable `REBUILD` to `false`.
See [docs on all environment variables]({{%relref "advanced#environment-variables" %}})
for more info.
{{% /notice %}}
| --external-grpc-backends | EXTERNAL_GRPC_BACKENDS | none | Comma separated list of external gRPC backends to use. Format: `name:host:port` or `name:/path/to/file` |
### Run LocalAI in Kubernetes
@@ -237,7 +264,7 @@ Deploy a single LocalAI pod with 6GB of persistent storage serving up a `ggml-gp
replicaCount: 1
deployment:
image: quay.io/go-skynet/local-ai:latest ##(This is for CPU only, to use GPU change it to a image that supports GPU IE "v1.40.0-cublas-cuda12")
image: quay.io/go-skynet/local-ai:latest ##(This is for CPU only, to use GPU change it to a image that supports GPU IE "v2.0.0-cublas-cuda12-core")
env:
threads: 4
context_size: 512

View File

@@ -8,10 +8,10 @@ weight = 9
This section includes LocalAI end-to-end examples, tutorial and how-tos curated by the community and maintained by [lunamidori5](https://github.com/lunamidori5).
- [Setup LocalAI with Docker on CPU]({{%relref "howtos/easy-setup-docker-cpu" %}})
- [Setup LocalAI with Docker With CUDA]({{%relref "howtos/easy-setup-docker-gpu" %}})
- [Setup LocalAI with Docker]({{%relref "howtos/easy-setup-docker" %}})
- [Setting up a Model]({{%relref "howtos/easy-model" %}})
- [Making requests to LocalAI]({{%relref "howtos/easy-request" %}})
- [Making Text / LLM requests to LocalAI]({{%relref "howtos/easy-request" %}})
- [Making Photo / SD requests to LocalAI]({{%relref "howtos/easy-setup-sd" %}})
## Programs and Demos

View File

@@ -5,43 +5,52 @@ title = "Easy Model Setup"
weight = 2
+++
Lets Learn how to setup a model, for this ``How To`` we are going to use the ``Luna-Ai`` model (Yes I know haha - ``Luna Midori`` making a how to using the ``luna-ai-llama2`` model - lol)
Let's learn how to set up a model; for this ``How To`` we are going to use the ``Dolphin 2.2.1 Mistral 7B`` model.
To download the model to your models folder, run this command in a command line of your choosing.
```bash
curl --location 'http://localhost:8080/models/apply' \
--header 'Content-Type: application/json' \
--data-raw '{
"id": "TheBloke/Luna-AI-Llama2-Uncensored-GGUF/luna-ai-llama2-uncensored.Q4_K_M.gguf"
"id": "TheBloke/dolphin-2.2.1-mistral-7B-GGUF/dolphin-2.2.1-mistral-7b.Q4_0.gguf"
}'
```
Each model needs at least ``4`` files, with out these files, the model will run raw, what that means is you can not change settings of the model.
Each model needs at least ``5`` files; without these files, the model will run raw, meaning you can not change the settings of the model.
```
File 1 - The model's GGUF file
File 2 - The model's .yaml file
File 3 - The Chat API .tmpl file
File 4 - The Completion API .tmpl file
File 4 - The Chat API helper .tmpl file
File 5 - The Completion API .tmpl file
```
So let's fix that! We are using the ``lunademo`` name for this ``How To`` but you can name the files whatever you want! Let's make blank files to start with
```bash
touch lunademo-chat.tmpl
touch lunademo-chat-block.tmpl
touch lunademo-completion.tmpl
touch lunademo.yaml
```
Now lets edit the `"lunademo-chat.tmpl"`, Looking at the huggingface repo, this model uses the ``ASSISTANT:`` tag for when the AI replys, so lets make sure to add that to this file. Do not add the user as we will be doing that in our yaml file!
Now lets edit the `"lunademo-chat.tmpl"`, This is the template that model "Chat" trained models use, but changed for LocalAI
```txt
<|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "user"}}user{{end}}
{{if .Content}}{{.Content}}{{end}}
<|im_end|>
```
For the `"lunademo-chat-block.tmpl"`, Looking at the huggingface repo, this model uses the ``<|im_start|>assistant`` tag for when the AI replys, so lets make sure to add that to this file. Do not add the user as we will be doing that in our yaml file!
```txt
{{.Input}}
ASSISTANT:
<|im_start|>assistant
```
Now in the `"lunademo-completion.tmpl"` file lets add this.
Now in the `"lunademo-completion.tmpl"` file lets add this. (This is a hold over from OpenAI V0)
```txt
Complete the following sentence: {{.Input}}
{{.Input}}
```
@@ -58,25 +67,18 @@ What this does is tell ``LocalAI`` how to load the model. Then we are going to *
```yaml
name: lunademo
parameters:
model: luna-ai-llama2-uncensored.Q4_K_M.gguf
model: dolphin-2.2.1-mistral-7b.Q4_0.gguf
```
Now that we have the model set up, there a few things we should add to the yaml file to make it run better, for this model it uses the following roles.
```yaml
roles:
assistant: 'ASSISTANT:'
system: 'SYSTEM:'
user: 'USER:'
```
What that did is made sure that ``LocalAI`` added the test to the users in the request, so if a message is from ``system`` it shows up in the template as ``SYSTEM:``, speaking of template files, lets add those to our models yaml file now.
Now that LocalAI knows what file to load with our request, let's add the template files to our model's yaml file now.
```yaml
template:
chat: lunademo-chat
chat: lunademo-chat-block
chat_message: lunademo-chat
completion: lunademo-completion
```
If you are running on ``GPU`` or want to tune the model, you can add settings like
If you are running on ``GPU`` or want to tune the model, you can add settings like the following (the higher the GPU Layers value, the more the GPU is used)
```yaml
f16: true
gpu_layers: 4
@@ -85,8 +87,7 @@ gpu_layers: 4
To fully tune the model to your liking. But be warned, you **must** restart ``LocalAI`` after changing a yaml file
```bash
docker-compose restart ##windows
docker compose restart ##linux / mac
docker compose restart
```
If you want to check your models yaml, here is a full copy!
@@ -96,19 +97,18 @@ context_size: 2000
##Put settings right here for tuning!! Before name but after Backend!
name: lunademo
parameters:
model: luna-ai-llama2-uncensored.Q4_K_M.gguf
roles:
assistant: 'ASSISTANT:'
system: 'SYSTEM:'
user: 'USER:'
model: dolphin-2.2.1-mistral-7b.Q4_0.gguf
template:
chat: lunademo-chat
chat: lunademo-chat-block
chat_message: lunademo-chat
completion: lunademo-completion
```
Now that we have that set up, let's test it out by sending a [request]({{%relref "easy-request" %}}) to LocalAI!
## Adv Stuff
## ----- Adv Stuff -----
**(Please do not run these steps if you have already done the setup)**
Alright, now that we have learned how to set up our own models, here is how to use the gallery to do a lot of this for us. This command will download and set up (mostly; we will **always** need to edit our yaml file to fit our computer / hardware)
```bash
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{

View File

@@ -1,132 +0,0 @@
+++
disableToc = false
title = "Easy Setup - CPU Docker"
weight = 2
+++
{{% notice Note %}}
- You will need about 10gb of RAM Free
- You will need about 15gb of space free on C drive for ``Docker-compose``
{{% /notice %}}
We are going to run `LocalAI` with `docker-compose` for this set up.
Lets clone `LocalAI` with git.
```bash
git clone https://github.com/go-skynet/LocalAI
```
Then we will cd into the ``LocalAI`` folder.
```bash
cd LocalAI
```
At this point we want to set up our `.env` file, here is a copy for you to use if you wish, please make sure to set it to the same as in the `docker-compose` file for later.
```bash
## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
THREADS=2
## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080
## Define galleries.
## models to install will be visible in `/models/available`
GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]
## Default path for models
MODELS_PATH=/models
## Enable debug mode
# DEBUG=true
## Disables COMPEL (lets Stable Diffusion work, uncomment if you plan on using it)
# COMPEL=0
## Enable/Disable single backend (useful if only one GPU is available)
# SINGLE_ACTIVE_BACKEND=true
## Specify a build type. Available: cublas, openblas, clblas.
BUILD_TYPE=cublas
## Uncomment and set to true to enable rebuilding from source
# REBUILD=true
## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper
## (requires REBUILD=true)
#
#GO_TAGS=tts
## Path where to store generated images
# IMAGE_PATH=/tmp
## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT
# HUGGINGFACEHUB_API_TOKEN=Token here
```
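As a side note on the `THREADS` value above: if you are unsure how many physical cores your machine has, one way to check on Linux is sketched below (an assumption about your environment; on macOS, `sysctl -n hw.physicalcpu` gives the same information):
```bash
# Show sockets and cores per socket; multiply the two for the physical core count
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket)'
```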
Now that we have the `.env` set, let's set up our `docker-compose` file.
It will use a container from [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags).
Also note this `docker-compose` file is for `CPU` only.
```docker
version: '3.6'
services:
api:
image: quay.io/go-skynet/local-ai:v1.40.0
tty: true # enable colorized logs
restart: always # should this be on-failure ?
ports:
- 8080:8080
env_file:
- .env
volumes:
- ./models:/models
- ./images/:/tmp/generated/images/
command: ["/usr/bin/local-ai" ]
```
Make sure to save that in the root of the `LocalAI` folder. Then let's spin up Docker; run this in a `CMD` or `BASH` prompt:
```bash
docker-compose up -d --pull always ##Windows
docker compose up -d --pull always ##Linux
```
Now we are going to let that finish setting up. Once it is done, let's check to make sure our huggingface / localai galleries are working (wait until you see this screen before doing this).
You should see:
```
┌───────────────────────────────────────────────────┐
│ Fiber v2.42.0 │
│ http://127.0.0.1:8080 │
│ (bound on host 0.0.0.0 and port 8080) │
│ │
│ Handlers ............. 1 Processes ........... 1 │
│ Prefork ....... Disabled PID ................. 1 │
└───────────────────────────────────────────────────┘
```
```bash
curl http://localhost:8080/models/available
```
Output will look like this:
![](https://cdn.discordapp.com/attachments/1116933141895053322/1134037542845566976/image.png)
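If you prefer checking from the terminal instead of reading the screenshot, a rough sanity check could be to count the gallery entries with `jq` (assuming the endpoint returns a JSON array and `jq` is installed):
```bash
# A non-zero count suggests the galleries were loaded correctly
curl -s http://localhost:8080/models/available | jq length
```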
Now that we have that set up, let's go set up a [model]({{%relref "easy-model" %}})

View File

@@ -1,33 +1,26 @@
+++
disableToc = false
title = "Easy Setup - GPU Docker"
title = "Easy Setup - Docker"
weight = 2
+++
{{% notice Note %}}
- You will need about 10GB of RAM free
- You will need about 15GB of space free on the C drive for ``Docker-compose``
- You will need about 15GB of space free on the C drive for ``Docker compose``
{{% /notice %}}
We are going to run `LocalAI` with `docker-compose` for this setup.
We are going to run `LocalAI` with `docker compose` for this setup.
Let's clone `LocalAI` with git.
```bash
git clone https://github.com/go-skynet/LocalAI
```
Then we will cd into the `LocalAI` folder.
```bash
Let's set up our folders for ``LocalAI`` (run these to make the folders for you if you wish)
```batch
mkdir "LocalAI"
cd LocalAI
mkdir "models"
mkdir "images"
```
At this point we want to set up our `.env` file. Here is a copy for you to use if you wish; please make sure it matches the settings in the `docker-compose` file we use later.
At this point we want to set up our `.env` file. Here is a copy for you to use if you wish; make sure this is in the ``LocalAI`` folder.
```bash
## Set number of threads.
@@ -45,7 +38,7 @@ GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.
MODELS_PATH=/models
## Enable debug mode
# DEBUG=true
DEBUG=true
## Disables COMPEL (lets Stable Diffusion work, uncomment if you plan on using it)
# COMPEL=0
@@ -78,15 +71,58 @@ BUILD_TYPE=cublas
Now that we have the `.env` set, let's set up our `docker-compose` file.
It will use a container from [quay.io](https://quay.io/repository/go-skynet/local-ai?tab=tags).
{{< tabs >}}
{{% tab name="CPU Only" %}}
Also note this `docker-compose` file is for `CPU` only.
```docker
version: '3.6'
services:
api:
image: quay.io/go-skynet/local-ai:{{< version >}}
tty: true # enable colorized logs
restart: always # should this be on-failure ?
ports:
- 8080:8080
env_file:
- .env
volumes:
- ./models:/models
- ./images/:/tmp/generated/images/
command: ["/usr/bin/local-ai" ]
```
{{% /tab %}}
{{% tab name="GPU and CPU" %}}
Also note this `docker-compose` file is for `CUDA` only.
Please change the image to what you need.
```
Cuda 11 - v1.40.0-cublas-cuda11
Cuda 12 - v1.40.0-cublas-cuda12
Cuda 11 with TTS - v1.40.0-cublas-cuda11-ffmpeg
Cuda 12 with TTS - v1.40.0-cublas-cuda12-ffmpeg
```
{{< tabs >}}
{{% tab name="GPU Images CUDA 11" %}}
- `master-cublas-cuda11`
- `master-cublas-cuda11-core`
- `{{< version >}}-cublas-cuda11`
- `{{< version >}}-cublas-cuda11-core`
- `{{< version >}}-cublas-cuda11-ffmpeg`
- `{{< version >}}-cublas-cuda11-ffmpeg-core`
Core images - smaller images without pre-downloaded Python dependencies
{{% /tab %}}
{{% tab name="GPU Images CUDA 12" %}}
- `master-cublas-cuda12`
- `master-cublas-cuda12-core`
- `{{< version >}}-cublas-cuda12`
- `{{< version >}}-cublas-cuda12-core`
- `{{< version >}}-cublas-cuda12-ffmpeg`
- `{{< version >}}-cublas-cuda12-ffmpeg-core`
Core images - smaller images without pre-downloaded Python dependencies
{{% /tab %}}
{{< /tabs >}}
```docker
version: '3.6'
@@ -112,13 +148,14 @@ services:
- ./images/:/tmp/generated/images/
command: ["/usr/bin/local-ai" ]
```
{{% /tab %}}
{{< /tabs >}}
Make sure to save that in the root of the `LocalAI` folder. Then let's spin up Docker; run this in a `CMD` or `BASH` prompt:
```bash
docker-compose up -d --pull always ##Windows
docker compose up -d --pull always ##Linux
docker compose up -d --pull always
```

View File

@@ -12,17 +12,6 @@ curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '
}'
```
Now we need to make a ``bert.yaml`` in the models folder
```yaml
backend: bert-embeddings
embeddings: true
name: text-embedding-ada-002
parameters:
model: bert
```
**Restart LocalAI after you change a yaml file**
When you would like to request the model from the CLI, you can do:
```bash
@@ -30,7 +19,7 @@ curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "The food was delicious and the waiter...",
"model": "text-embedding-ada-002"
"model": "bert-embeddings"
}'
```

View File

@@ -5,7 +5,7 @@ weight = 2
+++
Setting up a Stable Diffusion model is super easy.
In your models folder make a file called ``stablediffusion.yaml``, then edit that file with the following. (You can replace ``Linaqruf/animagine-xl`` with whatever ``sd-xl`` model you would like.)
In your ``models`` folder make a file called ``stablediffusion.yaml``, then edit that file with the following. (You can replace ``Linaqruf/animagine-xl`` with whatever ``sd-xl`` model you would like.)
```yaml
name: animagine-xl
parameters:
@@ -15,15 +15,13 @@ backend: diffusers
# Force CPU usage - set to true for GPU
f16: false
diffusers:
pipeline_type: StableDiffusionXLPipeline
cuda: false # Enable for GPU usage (CUDA)
scheduler_type: dpm_2_a
```
If you are using Docker, you will need to run this in the LocalAI folder (the one with the ``docker-compose.yaml`` file in it):
```bash
docker-compose down #windows
docker compose down #linux/mac
docker compose down
```
Then in your ``.env`` file uncomment this line.
@@ -33,14 +31,13 @@ COMPEL=0
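If you prefer doing that edit from a shell, a one-liner along these lines should work (assuming the line still reads `# COMPEL=0` in your `.env`):
```bash
# Uncomment the COMPEL line in .env (keeps a .env.bak backup)
sed -i.bak 's/^# COMPEL=0/COMPEL=0/' .env
```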
After that we can restart the LocalAI Docker container by running this in the LocalAI folder (the one with the ``docker-compose.yaml`` file in it):
```bash
docker-compose up #windows
docker compose up #linux/mac
docker compose up -d
```
Then to download and set up the model, just send in a normal ``OpenAI`` request! LocalAI will do the rest!
```bash
curl http://localhost:8080/v1/images/generations -H "Content-Type: application/json" -d '{
"prompt": "Two Boxes, 1blue, 1red",
"size": "256x256"
"size": "1024x1024"
}'
```

View File

@@ -0,0 +1,178 @@
+++
disableToc = false
title = "AIKit"
description="AI + BuildKit = AIKit: Build and deploy large language models easily"
weight = 2
+++
GitHub Link - https://github.com/sozercan/aikit
[AIKit](https://github.com/sozercan/aikit) is a quick, easy, and local or cloud-agnostic way to get started to host and deploy large language models (LLMs) for inference. No GPU, internet access or additional tools are needed to get started except for [Docker](https://docs.docker.com/desktop/install/linux-install/)!
AIKit uses [LocalAI](https://localai.io/) under-the-hood to run inference. LocalAI provides a drop-in replacement REST API that is OpenAI API compatible, so you can use any OpenAI API compatible client, such as [Kubectl AI](https://github.com/sozercan/kubectl-ai), [Chatbot-UI](https://github.com/sozercan/chatbot-ui) and many more, to send requests to open-source LLMs powered by AIKit!
> At this time, AIKit is tested with LocalAI `llama` backend. Other backends may work but are not tested. Please open an issue if you'd like to see support for other backends.
## Features
- 🐳 No GPU, Internet access or additional tools needed except for [Docker](https://docs.docker.com/desktop/install/linux-install/)!
- 🤏 Minimal image size, resulting in fewer vulnerabilities and a smaller attack surface with a custom [distroless](https://github.com/GoogleContainerTools/distroless)-based image
- 🚀 Easy to use declarative configuration
- ✨ OpenAI API compatible to use with any OpenAI API compatible client
- 🚢 Kubernetes deployment ready
- 📦 Supports multiple models with a single image
- 🖥️ Supports GPU-accelerated inferencing with NVIDIA GPUs
- 🔐 Signed images for `aikit` and pre-made models
## Pre-made Models
AIKit comes with pre-made models that you can use out-of-the-box!
### CPU
- 🦙 Llama 2 7B Chat: `ghcr.io/sozercan/llama2:7b`
- 🦙 Llama 2 13B Chat: `ghcr.io/sozercan/llama2:13b`
- 🐬 Orca 2 13B: `ghcr.io/sozercan/orca2:13b`
### NVIDIA CUDA
- 🦙 Llama 2 7B Chat (CUDA): `ghcr.io/sozercan/llama2:7b-cuda`
- 🦙 Llama 2 13B Chat (CUDA): `ghcr.io/sozercan/llama2:13b-cuda`
- 🐬 Orca 2 13B (CUDA): `ghcr.io/sozercan/orca2:13b-cuda`
> CUDA models include CUDA v12. They are used with [NVIDIA GPU acceleration](#gpu-acceleration-support).
## Quick Start
### Creating an image
> This section shows how to create a custom image with models of your choosing. If you want to use one of the pre-made models, skip to [running models](#running-models).
>
> Please see [models folder](./models/) for pre-made model definitions. You can find more model examples at [go-skynet/model-gallery](https://github.com/go-skynet/model-gallery).
Create an `aikitfile.yaml` with the following structure:
```yaml
#syntax=ghcr.io/sozercan/aikit:latest
apiVersion: v1alpha1
models:
- name: llama-2-7b-chat
source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```
> This is the simplest way to get started to build an image. For full `aikitfile` specification, see [specs](docs/specs.md).
First, create a buildx buildkit instance. Alternatively, if you are using Docker v24 with [containerd image store](https://docs.docker.com/storage/containerd/) enabled, you can skip this step.
```bash
docker buildx create --use --name aikit-builder
```
Then build your image with:
```bash
docker buildx build . -t my-model -f aikitfile.yaml --load
```
This will build a local container image with your model(s). You can see the image with:
```bash
docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
my-model latest e7b7c5a4a2cb About an hour ago 5.51GB
```
### Running models
You can start the inferencing server for your models with:
```bash
# for pre-made models, replace "my-model" with the image name
docker run -d --rm -p 8080:8080 my-model
```
You can then send requests to `localhost:8080` to run inference from your models. For example:
```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama-2-7b-chat",
"messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
}'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
```
## Kubernetes Deployment
It is easy to get started deploying your models to Kubernetes!
Make sure you have a Kubernetes cluster running and `kubectl` is configured to talk to it, and your model images are accessible from the cluster.
> You can use [kind](https://kind.sigs.k8s.io/) to create a local Kubernetes cluster for testing purposes.
```bash
# create a deployment
# for pre-made models, replace "my-model" with the image name
kubectl create deployment my-llm-deployment --image=my-model
# expose it as a service
kubectl expose deployment my-llm-deployment --port=8080 --target-port=8080 --name=my-llm-service
# easy to scale up and down as needed
kubectl scale deployment my-llm-deployment --replicas=3
# port-forward for testing locally
kubectl port-forward service/my-llm-service 8080:8080
# send requests to your model
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama-2-7b-chat",
"messages": [{"role": "user", "content": "explain kubernetes in a sentence"}]
}'
{"created":1701236489,"object":"chat.completion","id":"dd1ff40b-31a7-4418-9e32-42151ab6875a","model":"llama-2-7b-chat","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"\nKubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications in a microservices architecture."}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
```
> For an example Kubernetes deployment and service YAML, see [kubernetes folder](./kubernetes/). Please note that these are examples; you may need to customize them (such as with properly configured resource requests and limits) based on your needs.
## GPU Acceleration Support
> At this time, only NVIDIA GPU acceleration is supported. Please open an issue if you'd like to see support for other GPU vendors.
### NVIDIA
AIKit supports GPU accelerated inferencing with [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-container-toolkit). You must also have [NVIDIA Drivers](https://www.nvidia.com/en-us/drivers/unix/) installed on your host machine.
For Kubernetes, [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) provides a streamlined way to install the NVIDIA drivers and container toolkit to configure your cluster to use GPUs.
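As a sketch, a typical GPU Operator install with Helm looks roughly like this (check the NVIDIA documentation for the current chart name and options):
```bash
# Add the NVIDIA Helm repository and install the GPU Operator into its own namespace
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
```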
To get started with GPU-accelerated inferencing, make sure to set the following in your `aikitfile` and build your model.
```yaml
runtime: cuda # use NVIDIA CUDA runtime
f16: true # use float16 precision
gpu_layers: 35 # number of layers to offload to GPU
low_vram: true # for devices with low VRAM
```
> Make sure to customize these values based on your model and GPU specs.
After building the model, you can run it with [`--gpus all`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html#gpu-enumeration) flag to enable GPU support:
```bash
# for pre-made models, replace "my-model" with the image name
docker run --rm --gpus all -p 8080:8080 my-model
```
If GPU acceleration is working, you'll see output similar to the following in the debug logs:
```bash
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr ggml_init_cublas: found 1 CUDA devices:
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr Device 0: Tesla T4, compute capability 7.5
...
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: using CUDA for GPU acceleration
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: mem required = 70.41 MB (+ 2048.00 MB per state)
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading 32 repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading non-repeating layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading v cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloading k cache to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: offloaded 35/35 layers to GPU
5:32AM DBG GRPC(llama-2-7b-chat.Q4_K_M.gguf-127.0.0.1:43735): stderr llm_load_tensors: VRAM used: 5869 MB
```

View File

@@ -50,6 +50,8 @@ Besides llama based models, LocalAI is compatible also with other architectures.
| `diffusers` | SD,... | no | Image generation | no | no | N/A |
| `vall-e-x` | Vall-E | no | Audio generation and Voice cloning | no | no | CPU/CUDA |
| `vllm` | Various GPTs and quantization formats | yes | GPT | no | no | CPU/CUDA |
| `exllama2` | GPTQ | yes | GPT only | no | no | N/A |
| `transformers-musicgen` | | no | Audio generation | no | no | N/A |
Note: any backend name listed above can be used in the `backend` field of the model configuration file (See [the advanced section]({{%relref "advanced" %}})).
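For instance, a minimal model configuration pinning one of the backends above could be written like this (the file name, model name, and model path are placeholders, not values from this document):
```bash
# Hypothetical example: pin a model to the exllama2 backend
cat > models/my-exl2-model.yaml <<'EOF'
name: my-exl2-model
backend: exllama2
parameters:
  # Relative to the models path; replace with your actual model
  model: my-exl2-model
EOF
```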

View File

@@ -27,12 +27,9 @@ name: animagine-xl
parameters:
model: Linaqruf/animagine-xl
backend: diffusers
# Force CPU usage - set to true for GPU
f16: false
cuda: true
f16: true
diffusers:
pipeline_type: StableDiffusionXLPipeline
cuda: false # Enable for GPU usage (CUDA)
scheduler_type: euler_a
```
@@ -47,9 +44,9 @@ parameters:
backend: diffusers
step: 30
f16: true
cuda: true
diffusers:
pipeline_type: StableDiffusionPipeline
cuda: true
enable_parameters: "negative_prompt,num_inference_steps,clip_skip"
scheduler_type: "k_dpmpp_sde"
cfg_scale: 8
@@ -69,7 +66,7 @@ The following parameters are available in the configuration file:
| `scheduler_type` | Scheduler type | `k_dpp_sde` |
| `cfg_scale` | Configuration scale | `8` |
| `clip_skip` | Clip skip | None |
| `pipeline_type` | Pipeline type | `StableDiffusionPipeline` |
| `pipeline_type` | Pipeline type | `AutoPipelineForText2Image` |
Several types of schedulers are available:
@@ -131,17 +128,16 @@ parameters:
model: nitrosocke/Ghibli-Diffusion
backend: diffusers
step: 25
cuda: true
f16: true
diffusers:
pipeline_type: StableDiffusionImg2ImgPipeline
cuda: true
enable_parameters: "negative_prompt,num_inference_steps,image"
```
```bash
IMAGE_PATH=/path/to/your/image
(echo -n '{"image": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
(echo -n '{"file": "'; base64 $IMAGE_PATH; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-edit"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
```
@@ -157,14 +153,67 @@ backend: diffusers
step: 50
# Force CPU usage
f16: true
cuda: true
diffusers:
pipeline_type: StableDiffusionDepth2ImgPipeline
cuda: true
enable_parameters: "negative_prompt,num_inference_steps,image"
cfg_scale: 6
```
```bash
(echo -n '{"image": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
(echo -n '{"file": "'; base64 ~/path/to/image.jpeg; echo '", "prompt": "a sky background","size": "512x512","model":"stablediffusion-depth"}') |
curl -H "Content-Type: application/json" -d @- http://localhost:8080/v1/images/generations
```
## img2vid
{{% notice note %}}
Experimental and available only on master builds. See: https://github.com/mudler/LocalAI/pull/1442
{{% /notice %}}
```yaml
name: img2vid
parameters:
model: stabilityai/stable-video-diffusion-img2vid
backend: diffusers
step: 25
# Force CPU usage
f16: true
cuda: true
diffusers:
pipeline_type: StableVideoDiffusionPipeline
```
```bash
(echo -n '{"file": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true","size": "512x512","model":"img2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
```
## txt2vid
{{% notice note %}}
Experimental and available only on master builds. See: https://github.com/mudler/LocalAI/pull/1442
{{% /notice %}}
```yaml
name: txt2vid
parameters:
model: damo-vilab/text-to-video-ms-1.7b
backend: diffusers
step: 25
# Force CPU usage
f16: true
cuda: true
diffusers:
pipeline_type: VideoDiffusionPipeline
cuda: true
```
```bash
(echo -n '{"prompt": "spiderman surfing","size": "512x512","model":"txt2vid"}') |
curl -H "Content-Type: application/json" -X POST -d @- http://localhost:8080/v1/images/generations
```

View File

@@ -9,7 +9,7 @@ weight = 1
{{% notice note %}}
The `ggml` file format has been deprecated. If you are using `ggml` models and you are configuring your model with a YAML file, use the `llama-stable` backend instead. If you are relying on automatic detection of the model, you should be fine. For `gguf` models, use the `llama` backend.
The `ggml` file format has been deprecated. If you are using `ggml` models and you are configuring your model with a YAML file, use the `llama-ggml` backend instead. If you are relying on automatic detection of the model, you should be fine. For `gguf` models, use the `llama` backend. The Go backend is deprecated as well but still available as `go-llama`. The Go backend still supports features not available in the mainline: speculative sampling and embeddings.
{{% /notice %}}
@@ -65,11 +65,11 @@ parameters:
In the example above we specify `llama` as the backend to restrict loading to `gguf` models only.
For instance, to use the `llama-stable` backend for `ggml` models:
For instance, to use the `llama-ggml` backend for `ggml` models:
```yaml
name: llama
backend: llama-stable
backend: llama-ggml
parameters:
# Relative to the models path
model: file.ggml.bin

View File

@@ -6,6 +6,30 @@ url = '/basics/news/'
+++
## 04-12-2023: __v2.0.0__
This release brings a major overhaul in some backends.
Breaking/important changes:
- Backend rename: `llama-stable` renamed to `llama-ggml` {{< pr "1287" >}}
- Prompt template changes: {{< pr "1254" >}} (extra space in roles)
- Apple metal bugfixes: {{< pr "1365" >}}
New:
- Added support for LLaVa and OpenAI Vision API support ({{< pr "1254" >}})
- Python based backends are now using conda to track env dependencies ( {{< pr "1144" >}} )
- Support for parallel requests ( {{< pr "1290" >}} )
- Support for transformers-embeddings ( {{< pr "1308" >}})
- Watchdog for backends ( {{< pr "1341" >}}). As https://github.com/ggerganov/llama.cpp/issues/3969 is hitting LocalAI's llama-cpp implementation, we now have a watchdog that can be used to make sure backends are not stalling. This is a generic mechanism that can now be enabled for all backends.
- Whisper.cpp updates ( {{< pr "1302" >}} )
- Petals backend ( {{< pr "1350" >}} )
- Full LLM fine-tuning example to use with LocalAI: https://localai.io/advanced/fine-tuning/
Due to the Python dependencies, images grew in size.
If you still want to use smaller images without Python dependencies, you can use the corresponding image tags ending with `-core`.
Full changelog: https://github.com/mudler/LocalAI/releases/tag/v2.0.0
## 30-10-2023: __v1.40.0__
This release is a preparation before v2 - the efforts now will be to refactor, polish and add new backends. Follow up on: https://github.com/mudler/LocalAI/issues/1126
@@ -16,7 +40,7 @@ This release now brings the `llama-cpp` backend which is a c++ backend tied to l
### Support for ROCm/HIPBLAS
This release brings support for AMD, thanks to @65a. See more details in https://github.com/mudler/LocalAI/pull/1100
This release brings support for AMD, thanks to @65a. See more details in {{< pr "1100" >}}
### More CLI commands

3
docs/data/version.json Normal file
View File

@@ -0,0 +1,3 @@
{
"version": "v2.1.0"
}

View File

@@ -0,0 +1 @@
{{ $.Site.Data.version.version }}

View File

@@ -3,6 +3,16 @@ set -e
cd /build
# If we have set EXTRA_BACKENDS, then we need to prepare the backends
if [ -n "$EXTRA_BACKENDS" ]; then
echo "EXTRA_BACKENDS: $EXTRA_BACKENDS"
# Space separated list of backends
for backend in $EXTRA_BACKENDS; do
echo "Preparing backend: $backend"
make -C $backend
done
fi
if [ "$REBUILD" != "false" ]; then
rm -rf ./local-ai
make build -j${BUILD_PARALLELISM:-1}
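In other words, `EXTRA_BACKENDS` is a space-separated list of backend directories (relative to `/build`) that get prepared with `make -C` when the container starts. A hypothetical invocation could look like the following (the backend paths and image tag are only illustrative):
```bash
# Prepare two additional backends at container start-up (paths are examples)
docker run -p 8080:8080 \
  -e EXTRA_BACKENDS="backend/python/diffusers backend/python/vllm" \
  quay.io/go-skynet/local-ai:master
```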

Some files were not shown because too many files have changed in this diff.