From 5e97329bcb1ef1fbe254f01fbe3a070f3c04c0b6 Mon Sep 17 00:00:00 2001 From: Aaron Pham <29749331+aarnphm@users.noreply.github.com> Date: Thu, 23 May 2024 12:50:01 -0400 Subject: [PATCH] infra: prepare 0.5 releases (#996) * chore: prepare for 0.5 Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com> * chore: update changelogs Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com> * chore: fix to lowest python version supported Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com> * chore: update scripts Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com> --------- Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com> --- .python-version-default | 2 +- README.md | 125 ++---------------------------------- changelog.d/996.breaking.md | 7 ++ openllm-python/README.md | 125 ++---------------------------------- 4 files changed, 22 insertions(+), 237 deletions(-) create mode 100644 changelog.d/996.breaking.md diff --git a/.python-version-default b/.python-version-default index 2c073331..bd28b9c5 100644 --- a/.python-version-default +++ b/.python-version-default @@ -1 +1 @@ -3.11 +3.9 diff --git a/README.md b/README.md index 412c99c1..040f9ea2 100644 --- a/README.md +++ b/README.md @@ -23,13 +23,11 @@ OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment. - - πŸš‚ Support a wide range of open-source LLMs including LLMs fine-tuned with your own data - ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs - πŸ”₯ State-of-the-art serving and inference performance - 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com) - ![Gif showing OpenLLM Intro](/.github/assets/output.gif) @@ -46,29 +44,13 @@ For starter, we provide two ways to quickly try out OpenLLM: Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb). -### Docker - -We provide a docker container that helps you start running OpenLLM: - -```bash -docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt -``` - -> [!NOTE] -> Given you have access to GPUs and have setup [nvidia-docker](https://github.com/NVIDIA/nvidia-container-toolkit), you can additionally pass in `--gpus` -> to use GPU for faster inference and optimization ->```bash -> docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm -> ``` - - ## πŸƒ Get started The following provides instructions for how to get started with OpenLLM locally. ### Prerequisites -You have installed Python 3.8 (or later) andΒ `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts. +You have installed Python 3.9 (or later) andΒ `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts. ### Install OpenLLM @@ -82,65 +64,23 @@ To verify the installation, run: ```bash $ openllm -h - -Usage: openllm [OPTIONS] COMMAND [ARGS]... 
- - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•—β–ˆβ–ˆβ•— β–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ•— - β–ˆβ–ˆβ•”β•β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•”β•β•β•β•β•β–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ•‘ - β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•”β–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β–ˆβ–ˆβ–ˆβ–ˆβ•”β–ˆβ–ˆβ•‘ - β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β•β• β–ˆβ–ˆβ•”β•β•β• β–ˆβ–ˆβ•‘β•šβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘β•šβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘ - β•šβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘ β•šβ–ˆβ–ˆβ–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘ β•šβ•β• β–ˆβ–ˆβ•‘ - β•šβ•β•β•β•β•β• β•šβ•β• β•šβ•β•β•β•β•β•β•β•šβ•β• β•šβ•β•β•β•β•šβ•β•β•β•β•β•β•β•šβ•β•β•β•β•β•β•β•šβ•β• β•šβ•β•. - - An open platform for operating large language models in production. - Fine-tune, serve, deploy, and monitor any LLMs with ease. - -Options: - -v, --version Show the version and exit. - -h, --help Show this message and exit. - -Commands: - build Package a given models into a BentoLLM. - import Setup LLM interactively. - models List all supported models. - prune Remove all saved models, (and optionally bentos) built with OpenLLM locally. - query Query a LLM interactively, from a terminal. - start Start a LLMServer for any supported LLM. - -Extensions: - build-base-container Base image builder for BentoLLM. - dive-bentos Dive into a BentoLLM. - get-containerfile Return Containerfile of any given Bento. - get-prompt Get the default prompt used by OpenLLM. - list-bentos List available bentos built by OpenLLM. - list-models This is equivalent to openllm models... - playground OpenLLM Playground. ``` ### Start a LLM server -OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start aΒ [phi-2](https://huggingface.co/microsoft/phi-2)Β server, run the following: +OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start aΒ [Llama 3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)Β server, run the following: ```bash -TRUST_REMOTE_CODE=True openllm start microsoft/phi-2 +openllm start meta-llama/Meta-Llama-3-8B ``` -This starts the server atΒ [http://0.0.0.0:3000/](http://0.0.0.0:3000/). OpenLLM downloads the model to the BentoML local Model Store if it has not been registered before. To view your local models, run `bentoml models list`. - To interact with the server, you can visit the web UI atΒ [http://0.0.0.0:3000/](http://0.0.0.0:3000/) or send a request usingΒ `curl`. You can also use OpenLLM’s built-in Python client to interact with the server: ```python import openllm -client = openllm.client.HTTPClient('http://localhost:3000') -client.query('Explain to me the difference between "further" and "farther"') -``` - -Alternatively, use theΒ `openllm query`Β command to query the model: - -```bash -export OPENLLM_ENDPOINT=http://localhost:3000 -openllm query 'Explain to me the difference between "further" and "farther"' +client = openllm.HTTPClient('http://localhost:3000') +client.generate('Explain to me the difference between "further" and "farther"') ``` OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example: @@ -155,15 +95,6 @@ openllm start -- > architecture. 
Use theΒ `openllm models`Β command to see the complete list of supported > models, their architectures, and their variants. -> [!IMPORTANT] -> If you are testing OpenLLM on CPU, you might want to pass in `DTYPE=float32`. By default, -> OpenLLM will set model `dtype` to `bfloat16` for the best performance. -> ```bash -> DTYPE=float32 openllm start microsoft/phi-2 -> ``` -> This will also applies to older GPUs. If your GPUs doesn't support `bfloat16`, then you also -> want to set `DTYPE=float16`. - ## 🧩 Supported models OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below. @@ -1097,7 +1028,6 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx . > [!IMPORTANT] > Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon. - ## βš™οΈ Integrations OpenLLM is not just a standalone product; it's a building block designed to @@ -1115,11 +1045,9 @@ specify the base_url to `llm-endpoint/v1` and you are good to go: ```python import openai -client = openai.OpenAI( - base_url='http://localhost:3000/v1', api_key='na' -) # Here the server is running on localhost:3000 +client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na') # Here the server is running on 0.0.0.0:3000 -completions = client.completions.create( +completions = client.chat.completions.create( prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=stream ) ``` @@ -1130,7 +1058,6 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod > You can find out OpenAI example clients under the > [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder. - ### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/) To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`: @@ -1172,24 +1099,6 @@ llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http') llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?') ``` -### Transformers Agents - -OpenLLM seamlessly integrates with -[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents). - -> [!WARNING] -> The Transformers Agent is still at an experimental stage. It is -> recommended to install OpenLLM with `pip install -r nightly-requirements.txt` -> to get the latest API update for HuggingFace agent. - -```python -import transformers - -agent = transformers.HfAgent('http://localhost:3000/hf/agent') # URL that runs the OpenLLM server - -agent.run('Is the following `text` positive or negative?', text="I don't like how this models is generate inputs") -``` - ![Gif showing Agent integration](/.github/assets/agent.gif) @@ -1280,26 +1189,6 @@ Checkout our [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) if you wish to contribute to OpenLLM's codebase. -## πŸ‡ Telemetry - -OpenLLM collects usage data to enhance user experience and improve the product. -We only report OpenLLM's internal API calls and ensure maximum privacy by -excluding sensitive information. We will never collect user code, model data, or -stack traces. For usage tracking, check out the -[code](https://github.com/bentoml/OpenLLM/blob/main/openllm-core/src/openllm_core/utils/analytics.py). 
- -You can opt out of usage tracking by using the `--do-not-track` CLI option: - -```bash -openllm [command] --do-not-track -``` - -Or by setting the environment variable `OPENLLM_DO_NOT_TRACK=True`: - -```bash -export OPENLLM_DO_NOT_TRACK=True -``` - ## πŸ“” Citation If you use OpenLLM in your research, we provide a [citation](./CITATION.cff) to diff --git a/changelog.d/996.breaking.md b/changelog.d/996.breaking.md new file mode 100644 index 00000000..e091084f --- /dev/null +++ b/changelog.d/996.breaking.md @@ -0,0 +1,7 @@ +OpenLLM is now compatible with the BentoML 1.2 (and above) architecture. + +Additionally, the `openllm` CLI now only offers `start` and `build` to simplify the workflow. + +OpenLLM now also requires vLLM by default, and CPU support is currently turned off. We will look into supporting CPUs in a later version, as our main focus is on accelerators. + +The Python API is also considered deprecated and internal-only. If you are using it in an existing service, set the `IMPLEMENTATION=deprecated` environment variable to avoid breaking changes. We recommend users upgrade to BentoML 1.2. diff --git a/openllm-python/README.md b/openllm-python/README.md index 412c99c1..040f9ea2 100644 --- a/openllm-python/README.md +++ b/openllm-python/README.md @@ -23,13 +23,11 @@ OpenLLM helps developers **run any open-source LLMs**, such as Llama 2 and Mistral, as **OpenAI-compatible API endpoints**, locally and in the cloud, optimized for serving throughput and production deployment. - - πŸš‚ Support a wide range of open-source LLMs including LLMs fine-tuned with your own data - ⛓️ OpenAI compatible API endpoints for seamless transition from your LLM app to open-source LLMs - πŸ”₯ State-of-the-art serving and inference performance - 🎯 Simplified cloud deployment via [BentoML](https://www.bentoml.com) - ![Gif showing OpenLLM Intro](/.github/assets/output.gif) @@ -46,29 +44,13 @@ For starter, we provide two ways to quickly try out OpenLLM: Try this [OpenLLM tutorial in Google Colab: Serving Llama 2 with OpenLLM](https://colab.research.google.com/github/bentoml/OpenLLM/blob/main/examples/llama2.ipynb). -### Docker - -We provide a docker container that helps you start running OpenLLM: - -```bash -docker run --rm -it -p 3000:3000 ghcr.io/bentoml/openllm start facebook/opt-1.3b --backend pt -``` - -> [!NOTE] -> Given you have access to GPUs and have setup [nvidia-docker](https://github.com/NVIDIA/nvidia-container-toolkit), you can additionally pass in `--gpus` -> to use GPU for faster inference and optimization ->```bash -> docker run --rm --gpus all -p 3000:3000 -it ghcr.io/bentoml/openllm start HuggingFaceH4/zephyr-7b-beta --backend vllm -> ``` - - ## πŸƒ Get started The following provides instructions for how to get started with OpenLLM locally. ### Prerequisites -You have installed Python 3.8 (or later) andΒ `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts. +You have installed Python 3.9 (or later) andΒ `pip`. We highly recommend using a [Virtual Environment](https://docs.python.org/3/library/venv.html) to prevent package conflicts. ### Install OpenLLM @@ -82,65 +64,23 @@ To verify the installation, run: ```bash $ openllm -h - -Usage: openllm [OPTIONS] COMMAND [ARGS]...
- - β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•—β–ˆβ–ˆβ•— β–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ•— - β–ˆβ–ˆβ•”β•β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•”β•β•β–ˆβ–ˆβ•—β–ˆβ–ˆβ•”β•β•β•β•β•β–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ–ˆβ–ˆβ•‘ - β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•— β–ˆβ–ˆβ•”β–ˆβ–ˆβ•— β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•”β–ˆβ–ˆβ–ˆβ–ˆβ•”β–ˆβ–ˆβ•‘ - β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•”β•β•β•β• β–ˆβ–ˆβ•”β•β•β• β–ˆβ–ˆβ•‘β•šβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ•‘β•šβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘ - β•šβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•”β•β–ˆβ–ˆβ•‘ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘ β•šβ–ˆβ–ˆβ–ˆβ–ˆβ•‘β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ•—β–ˆβ–ˆβ•‘ β•šβ•β• β–ˆβ–ˆβ•‘ - β•šβ•β•β•β•β•β• β•šβ•β• β•šβ•β•β•β•β•β•β•β•šβ•β• β•šβ•β•β•β•β•šβ•β•β•β•β•β•β•β•šβ•β•β•β•β•β•β•β•šβ•β• β•šβ•β•. - - An open platform for operating large language models in production. - Fine-tune, serve, deploy, and monitor any LLMs with ease. - -Options: - -v, --version Show the version and exit. - -h, --help Show this message and exit. - -Commands: - build Package a given models into a BentoLLM. - import Setup LLM interactively. - models List all supported models. - prune Remove all saved models, (and optionally bentos) built with OpenLLM locally. - query Query a LLM interactively, from a terminal. - start Start a LLMServer for any supported LLM. - -Extensions: - build-base-container Base image builder for BentoLLM. - dive-bentos Dive into a BentoLLM. - get-containerfile Return Containerfile of any given Bento. - get-prompt Get the default prompt used by OpenLLM. - list-bentos List available bentos built by OpenLLM. - list-models This is equivalent to openllm models... - playground OpenLLM Playground. ``` ### Start a LLM server -OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start aΒ [phi-2](https://huggingface.co/microsoft/phi-2)Β server, run the following: +OpenLLM allows you to quickly spin up an LLM server using `openllm start`. For example, to start aΒ [Llama 3 8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)Β server, run the following: ```bash -TRUST_REMOTE_CODE=True openllm start microsoft/phi-2 +openllm start meta-llama/Meta-Llama-3-8B ``` -This starts the server atΒ [http://0.0.0.0:3000/](http://0.0.0.0:3000/). OpenLLM downloads the model to the BentoML local Model Store if it has not been registered before. To view your local models, run `bentoml models list`. - To interact with the server, you can visit the web UI atΒ [http://0.0.0.0:3000/](http://0.0.0.0:3000/) or send a request usingΒ `curl`. You can also use OpenLLM’s built-in Python client to interact with the server: ```python import openllm -client = openllm.client.HTTPClient('http://localhost:3000') -client.query('Explain to me the difference between "further" and "farther"') -``` - -Alternatively, use theΒ `openllm query`Β command to query the model: - -```bash -export OPENLLM_ENDPOINT=http://localhost:3000 -openllm query 'Explain to me the difference between "further" and "farther"' +client = openllm.HTTPClient('http://localhost:3000') +client.generate('Explain to me the difference between "further" and "farther"') ``` OpenLLM seamlessly supports many models and their variants. You can specify different variants of the model to be served. For example: @@ -155,15 +95,6 @@ openllm start -- > architecture. 
Use theΒ `openllm models`Β command to see the complete list of supported > models, their architectures, and their variants. -> [!IMPORTANT] -> If you are testing OpenLLM on CPU, you might want to pass in `DTYPE=float32`. By default, -> OpenLLM will set model `dtype` to `bfloat16` for the best performance. -> ```bash -> DTYPE=float32 openllm start microsoft/phi-2 -> ``` -> This will also applies to older GPUs. If your GPUs doesn't support `bfloat16`, then you also -> want to set `DTYPE=float16`. - ## 🧩 Supported models OpenLLM currently supports the following models. By default, OpenLLM doesn't include dependencies to run all models. The extra model-specific dependencies can be installed with the instructions below. @@ -1097,7 +1028,6 @@ openllm build facebook/opt-6.7b --adapter-id ./path/to/adapter_id --build-ctx . > [!IMPORTANT] > Fine-tuning support is still experimental and currently only works with PyTorch backend. vLLM support is coming soon. - ## βš™οΈ Integrations OpenLLM is not just a standalone product; it's a building block designed to @@ -1115,11 +1045,9 @@ specify the base_url to `llm-endpoint/v1` and you are good to go: ```python import openai -client = openai.OpenAI( - base_url='http://localhost:3000/v1', api_key='na' -) # Here the server is running on localhost:3000 +client = openai.OpenAI(base_url='http://localhost:3000/v1', api_key='na') # Here the server is running on 0.0.0.0:3000 -completions = client.completions.create( +completions = client.chat.completions.create( prompt='Write me a tag line for an ice cream shop.', model=model, max_tokens=64, stream=stream ) ``` @@ -1130,7 +1058,6 @@ The compatible endpoints supports `/completions`, `/chat/completions`, and `/mod > You can find out OpenAI example clients under the > [examples](https://github.com/bentoml/OpenLLM/tree/main/examples) folder. - ### [LlamaIndex](https://docs.llamaindex.ai/en/stable/examples/llm/openllm/) To start a local LLM with `llama_index`, simply use `llama_index.llms.openllm.OpenLLM`: @@ -1172,24 +1099,6 @@ llm = OpenLLM(server_url='http://44.23.123.1:3000', server_type='http') llm('What is the difference between a duck and a goose? And why there are so many Goose in Canada?') ``` -### Transformers Agents - -OpenLLM seamlessly integrates with -[Transformers Agents](https://huggingface.co/docs/transformers/transformers_agents). - -> [!WARNING] -> The Transformers Agent is still at an experimental stage. It is -> recommended to install OpenLLM with `pip install -r nightly-requirements.txt` -> to get the latest API update for HuggingFace agent. - -```python -import transformers - -agent = transformers.HfAgent('http://localhost:3000/hf/agent') # URL that runs the OpenLLM server - -agent.run('Is the following `text` positive or negative?', text="I don't like how this models is generate inputs") -``` - ![Gif showing Agent integration](/.github/assets/agent.gif) @@ -1280,26 +1189,6 @@ Checkout our [Developer Guide](https://github.com/bentoml/OpenLLM/blob/main/DEVELOPMENT.md) if you wish to contribute to OpenLLM's codebase. -## πŸ‡ Telemetry - -OpenLLM collects usage data to enhance user experience and improve the product. -We only report OpenLLM's internal API calls and ensure maximum privacy by -excluding sensitive information. We will never collect user code, model data, or -stack traces. For usage tracking, check out the -[code](https://github.com/bentoml/OpenLLM/blob/main/openllm-core/src/openllm_core/utils/analytics.py). 
- -You can opt out of usage tracking by using the `--do-not-track` CLI option: - -```bash -openllm [command] --do-not-track -``` - -Or by setting the environment variable `OPENLLM_DO_NOT_TRACK=True`: - -```bash -export OPENLLM_DO_NOT_TRACK=True -``` - ## πŸ“” Citation If you use OpenLLM in your research, we provide a [citation](./CITATION.cff) to